# Improved Active Multi-Task Representation Learning via Lasso

Yiping Wang¹, Yifang Chen², Kevin Jamieson², Simon S. Du²

¹College of Computer Science and Technology, Zhejiang University. ²Paul G. Allen School of Computer Science & Engineering, University of Washington. Correspondence to: Yiping Wang.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

To leverage the copious amount of data from source tasks and overcome the scarcity of target-task samples, representation learning based on multi-task pretraining has become a standard approach in many applications. However, up until now, most existing works design a source task selection strategy from a purely empirical perspective. Recently, Chen et al. (2022) gave the first active multi-task representation learning (A-MTRL) algorithm which adaptively samples from source tasks and can provably reduce the total sample complexity using the $\ell_2$-regularized target-source relevance parameter $\nu^2$. But their work is theoretically suboptimal in terms of total source sample complexity and is less practical in some real-world scenarios where sparse training-source selection is desired. In this paper, we address both issues. Specifically, we show the strict dominance of the $\ell_1$-regularized relevance-based ($\nu^1$-based) strategy by giving a lower bound for the $\nu^2$-based strategy. When $\nu^1$ is unknown, we propose a practical algorithm that uses the Lasso program to estimate $\nu^1$. Our algorithm successfully recovers the optimal result in the known case. In addition to our sample complexity results, we also characterize the potential of our $\nu^1$-based strategy in sample-cost-sensitive settings. Finally, we provide experiments on real-world computer vision datasets to illustrate the effectiveness of our proposed method.

1. Introduction

Deep learning has been successful because it can effectively learn a proper feature extractor that maps high-dimensional, highly structured inputs like natural images and natural language into a relatively low-dimensional representation. Recently, a big focus in deep learning has been on few-shot learning, where there is not enough data to learn a good representation and a prediction function from scratch. One solution is multi-task learning, which uses data from other sources to help the few-shot target. This approach is based on the idea that different tasks can share a common representation: the process starts by training on many source tasks to learn a shared representation and then uses that pre-trained representation to train on a limited amount of target data.

Accessing a large amount of source data for multi-task representation learning (MTRL) may be easy, but processing and training on all that data can be costly. Therefore, it is important to find ways to minimize the number of samples, and perhaps the number of sources, needed from the source tasks while still achieving the desired performance on the target task. Naturally, not all source tasks are equally important for learning the representation and maximizing performance on the target task. But to the best of our knowledge, most research in this area chooses which tasks to include in multi-task representation training in an ad hoc way (Asai et al., 2022; Fifty et al., 2021; Yao et al., 2022; Zaiem et al., 2021; Zamir et al., 2018; Zhang et al., 2022b).
Notable exceptions include (Chen et al., 2021; 2022), which study ways to improve training efficiency and reduce the cost of processing source data by prioritizing certain tasks during training, with theoretical guarantees. On the other hand, the significant empirical success of MTRL has motivated a number of theoretical studies (Du et al., 2021; Chen et al., 2022; Tripuraneni et al., 2021). In particular, (Du et al., 2021) and (Tripuraneni et al., 2021) provide generalization (excess risk) upper bounds on the estimation error of the target task for passive multi-task representation learning (P-MTRL). Here, passive means that samples are drawn from tasks according to some non-adaptive sampling strategy fixed before data is observed (e.g., an equal number of samples from each task). Tripuraneni et al. (2021) also prove a lower bound related to the quality of the learned feature representation in P-MTRL.

In this paper, our main focus is to guarantee a specified level of accuracy on a target task while provably using the least amount of data from other related tasks. This is achieved through task-level active learning. Chen et al. (2022) is the first work to propose an active multi-task representation learning (A-MTRL) algorithm that provably reduces the total number of samples from all tasks compared to the passive version (P-MTRL), by estimating the relevance of each source task to the target task and sampling accordingly. However, this previous work has several limitations and leaves some questions open, both in theory and in practical application. For example, they did not study lower bounds on the excess risk of the target task for multi-task transfer learning. Furthermore, Chen et al. (2022) proposed an $\ell_2$-regularized source-to-target task relevance quantity $\nu^2$, but it is unclear whether this relevance score is the best criterion for A-MTRL design compared to other possible relevance scores. As we show later, their A-MTRL algorithm may not be optimal.

In our work, we build on (Chen et al., 2022) by optimizing their upper bound on the excess risk and show that this yields an asymptotically optimal sampling strategy, which corresponds to an $\ell_1$-regularized relevance quantity $\nu^1$ and samples from the induced distribution accordingly. Moreover, we provide the first sampling-algorithm-dependent minimax lower bounds on the target-task excess risk for both the A-MTRL of (Chen et al., 2022) and P-MTRL, which show that our algorithm can strictly outperform these baselines even in the worst case.

In addition to the theoretical bounds, (Chen et al., 2022) also has practical limitations. When there exist multiple sampling strategies that are seemingly equivalent under their framework, their algorithm tends to place a small weight on every task, by the nature of the $\ell_2$-regularized solution $\nu^2$. This is sometimes undesirable in practice, as we illustrate with two examples. First, setting up a sample-generating source can be more expensive than actually generating the samples. For example, in robotics, each source task can be a specific real-world testing environment that takes weeks to set up, after which samples can be generated quickly and plentifully (Shi et al., 2021). Second, previous research assumes that the per-sample cost is the same regardless of how much data we need or for how long. In reality, subscribing to data from a single source over a long period can lower the average label cost.
Therefore, even with the same sample complexity from the sources, choosing fewer source tasks can be more beneficial. We propose a general-purpose cost-sensitive A-MTRL strategy that addresses these scenarios and demonstrates the potential of our proposed $\ell_1$-regularized strategy in various cost-effective situations.

1.1. Our Contributions

We summarize our contributions as follows.

- We begin by proving that the sampling distribution over tasks using our proposed L1 strategy defined in Def. 2.5 minimizes the target excess risk upper bound of (Chen et al., 2022). We then consider a class of strategies Lp-A-MTRL (A-MTRL with the Lp strategy) and show that, when $T \gtrsim k^2$, for $N_{tot}$ total source samples, L1-A-MTRL is strictly dominant over this class: its estimation error decreases at least as fast as $\tilde{O}\big(\frac{k}{\sigma_k^2 N_{tot}}\big)$, while the L2-A-MTRL/P-MTRL strategies suffer an algorithm-dependent minimax lower bound of at least $\tilde{\Omega}\big(\frac{T}{k\,\sigma_k^2 N_{tot}}\big)$. Here $T$ is the number of source tasks, $k$ is the dimension of the non-shared prediction function, and $\sigma_k$ characterizes the diversity of the source tasks, which will be specified later. These minimax lower bounds are novel in the MTRL literature. (Sections 3.1, 3.2)

- While the L1-A-MTRL strategy provably has sample complexity benefits over other sampling strategies, it is not directly implementable in practice since it requires prior knowledge of $\nu^1$ (i.e., those bounds only demonstrate the performance of the sampling distribution, not how to find it). Consequently, inspired by (Chen et al., 2022), we design a practical strategy that uses the Lasso and a low-order number of samples to estimate the relevance vector $\nu^1$, and then applies the L1 strategy to sample source data using the estimated $\nu^1$. We show that this practical algorithm achieves a sample complexity nearly as good as when $\nu^1$ is known. The key technical innovation is that when the regularization parameter is suitably lower bounded, the Lasso solution is provably close to the ground-truth value. (Section 3.3)

- Going beyond these main results, we demonstrate that our L1-A-MTRL strategy can be extended to support many sample-cost-sensitive scenarios by leveraging its sparse source-task-selection properties. We formulate this setting as an optimization problem and formally characterize the benign cost functions under which our L1-A-MTRL strategy is beneficial. (Section 4)

- Finally, we empirically show the effectiveness of our algorithms. Denoting the practical algorithm of (Chen et al., 2022) by L2-A-MTRL, our proposed L1-A-MTRL algorithm achieves 0.54% higher average accuracy on MNIST-C relative to L2-A-MTRL (92.6%), which confirms our theoretical results. We then restrict most of the data to be sampled from no more than 10 tasks, to mimic the sample-cost-sensitive setting with decreasing per-sample cost; here L1-A-MTRL achieves 2.2% higher average accuracy relative to uniform sampling (94.3%). (Section 5)

2. Preliminaries

In this section, we describe the relevant notation and the problem setup for the theoretical analysis.

2.1. Notation

Miscellaneous. Let $[T] := \{1, 2, \dots, T\}$ denote the index set of source tasks and $n_{[T]} := \{n_1, n_2, \dots, n_T\}$ the number of samples dedicated to each task. Likewise, $n_{[T],i} := \{n_{1,i}, \dots, n_{T,i}\}$ denotes the per-task sample counts at the $i$-th stage of a multi-stage learning procedure. If $S$ is an index set, $|S|$ denotes the number of elements in $S$.
We use $\|\cdot\|_p$ to denote the $\ell_p$ norm of vectors and use $|\cdot|$ or $\|\cdot\|$ to denote the $\ell_2$ norm for convenience. Let $\mathrm{subG}_d(\rho^2)$ denote the class of $d$-dimensional sub-Gaussian random vectors with variance proxy $\rho^2$.

Singular values. For $A \in \mathbb{R}^{m \times n}$, we denote by $\sigma_i(A)$ the $i$-th singular value of $A$, which satisfy $\sigma_1(A) \geq \dots \geq \sigma_r(A) > 0$ with $r = \mathrm{rank}(A)$. We write $\kappa(A)$ for the condition number of $A$, i.e., $\kappa(A) = \sigma_1(A)/\sigma_r(A)$ if $\sigma_r(A) > 0$.

Asymptotic comparison. We use the standard $O, \Omega, \Theta$ notation to hide universal constants, and further use $\tilde{O}, \tilde{\Omega}, \tilde{\Theta}$ to hide logarithmic factors. We use $a \lesssim b$ or $b \gtrsim a$ to denote $a = O(b)$ and $a \asymp b$ to denote $a = \Theta(b)$.

2.2. Problem Setup

Multi-Task. Let $t \in [T]$ index the $T$ source tasks, and let index $T+1$ denote the target task. Each task $t \in [T+1]$ is associated with a joint distribution $\mu_t$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. In this paper we assume $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$.

Data Generation. As in (Chen et al., 2022), we assume there exists an underlying representation function $\phi^*: \mathcal{X} \to \mathcal{R}$ which maps the input space $\mathcal{X}$ to a feature space $\mathcal{R} \subseteq \mathbb{R}^k$ where $k \ll d$, and the candidate representation functions are restricted to some function class $\Phi$, e.g., linear functions, convolutional networks, etc. We further assume that each task $t \in [T+1]$ follows a ground-truth linear head $w^*_t$ that maps features to the corresponding labels. To be more specific, we assume the i.i.d. sample $(x_t, y_t) \sim \mu_t$ satisfies
$$y_t = \phi^*(x_t)^\top w^*_t + z_t, \qquad z_t \sim \mathcal{N}(0, \sigma_z^2) \tag{1}$$
where $x_t \sim p_t$ and $x_t$ is independent of $z_t$. For convenience, we denote by $X_t = [x_{t,1}, \dots, x_{t,n_t}]^\top \in \mathbb{R}^{n_t \times d}$ the input data matrix containing the $n_t$ i.i.d. samples $(x_{t,1}, y_{t,1}), \dots, (x_{t,n_t}, y_{t,n_t}) \sim \mu_t$ from the $t$-th task, and by $Y_t = [y_{t,1}, \dots, y_{t,n_t}]^\top \in \mathbb{R}^{n_t}$, $Z_t = [z_{t,1}, \dots, z_{t,n_t}]^\top \in \mathbb{R}^{n_t}$ the labels and noise terms aligned with the inputs. We define $N_{tot} := \sum_{t=1}^T n_t$ to be the total number of samples from all source tasks.

Transfer Learning Process. As in (Du et al., 2021), we first learn the representation map on the source tasks by solving
$$\hat{\phi}, \hat{w}_1, \dots, \hat{w}_T = \mathop{\arg\min}_{\phi \in \Phi,\ w_1, \dots, w_T \in \mathbb{R}^k}\ \frac{1}{T}\sum_{t=1}^T \frac{1}{n_t}\,\|Y_t - \phi(X_t)w_t\|_2^2 \tag{2}$$
Here we allow $n_t$ to vary across tasks rather than requiring uniform sampling, and $\phi(X_t) := [\phi(x_{t,1}), \dots, \phi(x_{t,n_t})]^\top \in \mathbb{R}^{n_t \times k}$. We then retain the learned representation and apply it to the target task while training a task-specific linear head:
$$\hat{w}_{T+1} = \mathop{\arg\min}_{w_{T+1} \in \mathbb{R}^k}\ \frac{1}{n_{T+1}}\,\|Y_{T+1} - \hat{\phi}(X_{T+1})w_{T+1}\|_2^2 \tag{3}$$

Goal. Our main goal is to bound the excess risk (ER) of our model on the target task with parameters $(\hat{\phi}(x), \hat{w}_{T+1})$ while minimizing the total cost of sampling data from the source tasks. Here, as in (Du et al., 2021; Chen et al., 2022), we define the population loss as $L_{T+1}(\hat{\phi}, \hat{w}_{T+1}) = \mathbb{E}_{(x,y)\sim\mu_{T+1}}[(y_{T+1} - \hat{\phi}(x_{T+1})^\top\hat{w}_{T+1})^2]$. Then from (1) we can define the excess risk:
$$ER(\hat{\phi}, \hat{w}_{T+1}, \phi^*, w^*_{T+1}) = L_{T+1}(\hat{\phi}, \hat{w}_{T+1}) - L_{T+1}(\phi^*, w^*_{T+1}) = \mathbb{E}_{x\sim p_{T+1}}\big[(\hat{\phi}(x)^\top\hat{w}_{T+1} - \phi^*(x)^\top w^*_{T+1})^2\big] \tag{4}$$
It should be mentioned that in this paper we consider model performance in the worst case, so we treat the ground-truth parameters $\phi^*, w^*_{T+1}$ as arguments of the excess risk, which differs from the previous works (Du et al., 2021; Chen et al., 2022).

Linear Representation. Our theoretical study concentrates on the linear representation function class, which is widely used in many previous works (Du et al., 2021; Tripuraneni et al., 2020; 2021; Thekumparampil et al., 2021; Chen et al., 2022). Namely, we let $\Phi = \{x \mapsto B^\top x \mid B \in \mathbb{R}^{d\times k}\}$ and thus $\phi(X_t) = X_t B \in \mathbb{R}^{n_t \times k}$.
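To make the two-stage procedure in Eqns. (2)-(3) concrete, the following minimal sketch (our illustration, not the paper's code) fits the linear representation by alternating least squares. The helper names `fit_representation` and `linear_probe` are ours; any solver for (2) could be substituted, and we do not re-orthonormalize $B$ since (2) does not require it during training.

```python
# Minimal sketch of Eqns. (2)-(3) with a linear representation, via alternating
# least squares. Assumes Xs[t] is an (n_t, d) array and Ys[t] an (n_t,) array.
import numpy as np

def fit_representation(Xs, Ys, k, n_iters=100, seed=0):
    """Jointly fit B (d x k) and per-task heads w_t (Eqn. 2)."""
    rng = np.random.default_rng(seed)
    d = Xs[0].shape[1]
    B = np.linalg.qr(rng.normal(size=(d, k)))[0]  # random orthonormal init
    for _ in range(n_iters):
        # Given B, each head solves an ordinary least-squares problem.
        Ws = [np.linalg.lstsq(X @ B, y, rcond=None)[0] for X, y in zip(Xs, Ys)]
        # Given the heads, B solves a least-squares problem in vec(B):
        # row x contributes the design row kron(x, w_t), since x^T B w = <vec(B), kron(x, w)>.
        A = np.vstack([np.kron(X, w[None, :]) / np.sqrt(len(y))
                       for X, y, w in zip(Xs, Ys, Ws)])
        b = np.concatenate([y / np.sqrt(len(y)) for y in Ys])
        B = np.linalg.lstsq(A, b, rcond=None)[0].reshape(d, k)
    return B, Ws

def linear_probe(B, X_target, y_target):
    """Fit the target head on the frozen representation (Eqn. 3)."""
    return np.linalg.lstsq(X_target @ B, y_target, rcond=None)[0]
```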
Without loss of generality, we assume the ground-truth representation map $B^*$ is an orthonormal matrix, i.e., $B^* \in O_{d,k}$, as is also common in related work (Chen et al., 2022; Tripuraneni et al., 2021; Kumar et al., 2022).

Other assumptions. Assume $\mathbb{E}_{x_t\sim p_t}[x_t] = 0$, $\Sigma^*_t = \mathbb{E}_{x_t\sim p_t}[x_tx_t^\top]$, and $\hat{\Sigma}_t := \frac{1}{n_t}X_t^\top X_t$ for any $t \in [T+1]$. We make the following assumptions on the data distribution.

Assumption 2.1 (sub-Gaussian input). There exists $\rho \geq 1$ such that $x_t \sim p_t$ is $\mathrm{subG}_d(\rho^2)$ for all $t \in [T+1]$.

Assumption 2.2 (proper variance). For all $t \in [T+1]$, we have $\sigma_{\max}(\Sigma^*_t) = \Theta(1)$ and $\sigma_{\min}(\Sigma^*_t) = \Theta(1)$.

Variance conditions are common in related work (Tripuraneni et al., 2021; Du et al., 2021; Chen et al., 2022), and Assumption 2.2 is a generalization of the identical covariance assumption used in (Tripuraneni et al., 2021; Chen et al., 2022), which requires $\Sigma_1 = \dots = \Sigma_{T+1} = I_d$. We only use the identical covariance assumption in Section 3.3.

Assumption 2.3 (high-dimensional input and enough tasks). The parameters satisfy $d > T \geq k \geq 1$ and $d \gg k$.

Finally, we also need the diverse-task assumption from (Tripuraneni et al., 2021; Du et al., 2021; Chen et al., 2022). Denote $W^* := [w^*_1, \dots, w^*_T] \in \mathbb{R}^{k\times T}$; then we assume:

Assumption 2.4 (diverse tasks). The matrix $W^*$ satisfies $\sigma_{\min}(W^*) > 0$.

Assumption 2.4 states that $W^*$ has full row rank, so we can always find some $\nu \in \mathbb{R}^T$ such that $W^*\nu = w^*_{T+1}$, and thus (6) in Def. 2.5 is well-defined. It is a necessary assumption for learning reasonable features, as proven by (Tripuraneni et al., 2021).

2.3. Scope of A-MTRL algorithms in this paper

Here we state the scope of the A-MTRL algorithms considered in this paper. Recall that in (Chen et al., 2022), the learner draws a number of samples from task $t$ proportional to $\frac{\hat{\nu}(t)^2}{\|\hat{\nu}\|_2^2}$, where $\hat{\nu}$ is defined via the following solution:
$$\mathop{\arg\min}_\nu \|\nu\|_2 \quad \text{s.t.}\quad W^*\nu = w^*_{T+1} \tag{5}$$
Here, instead of focusing on this $\ell_2$ regularization, we study the whole candidate set of source-target relevance terms and the corresponding sampling strategies. Formally, we generalize Definition 3.1 of (Chen et al., 2022) to propose:

Definition 2.5 (Lp-Nq sampling strategy). Let $\nu(t)$ be the $t$-th element of the vector $\nu \in \mathbb{R}^T$ and let $N$ be the minimum number of samples drawn from every source task. The Lp-Nq strategy takes $n_t = \max\{c\,|\nu^p(t)|^q, N\}$ for some constant $c > 0$, where $n_t$ is the number of samples drawn from the $t$-th task and
$$\nu^p = \mathop{\arg\min}_\nu \|\nu\|_p \quad \text{s.t.}\quad W^*\nu = w^*_{T+1}. \tag{6}$$
If $p = q$, we write Lp as the abbreviation of Lp-Nq. For example, if $N = 0$, then the L1 strategy corresponds to $n_t = \frac{N_{tot}}{\|\nu^1\|_1}\,|\nu^1(t)|$ and the L2 strategy corresponds to $n_t = \frac{N_{tot}}{\|\nu^2\|_2^2}\,|\nu^2(t)|^2$, where $N_{tot}$ is the total source sampling budget. In the rest of the paper, we focus on this Lp-Nq family of sampling strategies.
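Both solutions in Def. 2.5 are easy to compute when $W^*$ and $w^*_{T+1}$ are known: $\nu^2$ has a closed form via the pseudoinverse, while $\nu^1$ is a standard linear program. The sketch below is our illustration (not the paper's code); the exact budget-matching constant $c$ is deferred to the bisection shown after Lemma 3.2.

```python
# A minimal sketch of the Lp-Nq strategies in Def. 2.5, assuming the task matrix
# W (k x T) and the target head w_target are known.
import numpy as np
from scipy.optimize import linprog

def nu_p(W, w_target, p):
    """min ||nu||_p s.t. W nu = w_target, for p in {1, 2} (Eqn. 6)."""
    k, T = W.shape
    if p == 2:  # closed form: minimum-norm solution via the pseudoinverse
        return np.linalg.pinv(W) @ w_target
    # p == 1: standard LP reformulation with nu = nu+ - nu-, nu+/- >= 0.
    res = linprog(c=np.ones(2 * T),
                  A_eq=np.hstack([W, -W]), b_eq=w_target,
                  bounds=[(0, None)] * (2 * T), method="highs")
    return res.x[:T] - res.x[T:]

def allocate(nu_vec, q, N_tot, N_min):
    """Lp-Nq allocation n_t = max{c |nu^p(t)|^q, N_min} (Def. 2.5). Here c is the
    first-order value N_tot / sum_t |nu^p(t)|^q; the exact c meeting the budget
    is found by bisection (see the sketch after Lemma 3.2)."""
    w = np.abs(nu_vec) ** q
    return np.maximum(N_tot * w / w.sum(), N_min)
```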
3. Main Results

3.1. Optimal Strategy: L1-A-MTRL with Known ν

In this section we aim to obtain the optimal sampling strategy that achieves the required performance on the target task with the smallest possible number of samples from the source tasks. First, with the linear representation assumption, we rewrite $ER(\hat{B}, \hat{w}_{T+1}, B^*, w^*_{T+1})$ in (4) as
$$\mathbb{E}_{x\sim p_{T+1}}\,\|x^\top(\hat{B}\hat{w}_{T+1} - B^*w^*_{T+1})\|_2^2. \tag{7}$$
Then, from the intermediate result of Theorem 3.2 in (Chen et al., 2022), we get the following upper bound on the excess risk for all A-MTRL methods.

Theorem 3.1 (Chen et al., 2022). Fix a failure probability $\delta \in (0,1)$. If Assumptions 2.1, 2.2, 2.3, 2.4 hold, and the sample sizes in the source and target tasks satisfy $n_t \gtrsim \rho^4(d + \ln(T/\delta))$ for all $t \in [T]$ and $n_{T+1} \gtrsim \rho^4(k + \ln(1/\delta))$, then with probability at least $1-\delta$ we have
$$ER(\hat{B}, \hat{w}_{T+1}, B^*, w^*_{T+1}) \lesssim \sigma^2\big(kd\ln(N_{tot}/T) + kT + \ln(\tfrac{1}{\delta})\big)\,\|\tilde{\nu}\|_2^2 + \frac{\sigma^2\big(k + \ln(\tfrac{1}{\delta})\big)}{n_{T+1}} \tag{8}$$
where $\nu \in \{\nu \in \mathbb{R}^T \mid W^*\nu = w^*_{T+1}\}$ and $\tilde{\nu}(t) = \nu(t)/\sqrt{n_t}$.

The key idea behind Theorem 3.1 is as follows. (Du et al., 2021) provides the first upper bound for the MTRL problem. They consider sampling data evenly from each source task and demonstrate that, following the transfer learning process (Eqns. 2, 3), the target-task error can be controlled by the source-task training error $O(\sigma^2(kd + kT)/N_{tot})$ and the target-task fine-tuning error $O(\sigma^2 k/n_{T+1})$. However, their proof requires the ground-truth linear head $w^*_{T+1}$ to follow a distribution $Q$ with $\|\mathbb{E}_{w\sim Q}[ww^\top]\| \leq O(\frac{1}{k})$. (Chen et al., 2022) go beyond this limitation by leveraging the equation $W^*\nu = \widetilde{W}\tilde{\nu} = w^*_{T+1}$, where $\tilde{w}_t = \sqrt{n_t}\,w^*_t$ and $\tilde{\nu}(t) = \nu(t)/\sqrt{n_t}$. This idea introduces the source-target relevance vector $\nu \in \mathbb{R}^T$ into the bound and yields Eqn. (8).

Inspired by Theorem 3.1, in order to minimize the excess-risk bound under a fixed sampling quota $N_{tot}$, we need to find the optimal sampling strategy $n_{[T]} = \{n_1, \dots, n_T\}$ by solving the following optimization problem:
$$\min_{\nu,\,n_{[T]}}\ \|\tilde{\nu}\|_2^2 \quad \text{s.t.}\quad W^*\nu = w^*_{T+1},\quad \sum_{t=1}^T n_t = N_{tot},\quad n_t \geq N,\ t\in[T] \tag{9}$$
Here $N\ (\gtrsim \rho^4(d + \ln(T/\delta)))$ is the minimum number of samples for every source task, as in Theorem 3.1. In this section we transform (9) into a bi-level optimization problem and obtain the asymptotically optimal solutions of (9).

3.1.1. OPTIMAL STRATEGY FOR ANY FIXED ν

We first fix $\nu$ in (9) and find the optimal sampling strategy accordingly:

Lemma 3.2. For any fixed $\nu$ such that $W^*\nu = w^*_{T+1}$, the optimal $n^*_{[T]}$ minimizing $\|\tilde{\nu}\|_2^2$ satisfies $n^*_t = \max\{c\,|\nu(t)|, N\}$ for every $t \in [T]$, where $c > 0$ is the constant such that $\sum_{t=1}^T n^*_t = N_{tot}$.

This lemma gives the optimal sampling strategy under a fixed, arbitrary $\nu \in \{\nu \mid W^*\nu = w^*_{T+1}\}$. We can then apply Lemma 3.2 to the previous bound (8) and deduce the theoretically optimal source-task sample complexity for any suitable $\nu$. For simplicity, we skip the trivial regime where the model already achieves sufficiently high accuracy with the uniformly allocated samples $N$, by requiring $\varepsilon^2 \lesssim \min\big(1,\ \sigma^2(kd + kT)\frac{\|\nu\|_1^2}{TN}\big)$. This condition guarantees $N_{tot} \gg TN$, and we get:

Corollary 3.3. Assume Assumptions 2.1, 2.2, 2.3, 2.4 hold and $\nu$ is fixed. Then the optimal sampling strategy $n_{[T]}$ satisfies $n_t = \max\{c\,|\nu(t)|, N\}$ for all $t \in [T]$, and with probability at least $1-\delta$, the optimal A-MTRL algorithm satisfies $ER \leq \varepsilon^2$ with $\varepsilon^2 \lesssim \min\big(1, \sigma^2(kd + kT)\frac{\|\nu\|_1^2}{TN}\big)$ whenever the total sampling budget from all source tasks $N_{tot}$ is at least
$$\tilde{O}\big(\sigma^2(kd + kT)\,\|\nu\|_1^2\,\varepsilon^{-2}\big) \tag{10}$$
and the number of target samples is at least $\tilde{O}(\sigma^2 k \varepsilon^{-2})$.

Discussion. To show the optimality of our bound, we compare it with the result in (Chen et al., 2022). Their known $\nu^2$ (denoted $\nu^*$ in their original paper) is the solution of $\arg\min_\nu \|\nu\|_2$ s.t. $W^*\nu = w^*_{T+1}$. Under the same setting but using this $\nu^2$, with probability at least $1-\delta$, the A-MTRL algorithm with sampling strategy $n_t = \max\{c\,(\nu^2(t))^2, N\}$, $t \in [T]$, satisfies $ER \leq \varepsilon^2$ with $\varepsilon \lesssim 1$ whenever $N_{tot}$ is at least
$$\tilde{O}\big(\sigma^2(kd + kT)\,s\,\|\nu^2\|_2^2\,\varepsilon^{-2}\big) \tag{11}$$
and the required number of target samples is also $\tilde{O}(\sigma^2 k \varepsilon^{-2})$. Here $s = \min_{\gamma\in[0,1]}\big\{(1-\gamma)\|\nu^2\|_{0,\gamma} + \gamma T\big\}$ and $\|\nu^2\|_{0,\gamma} := \big|\{t : |\nu^2(t)| > \sqrt{\gamma\,\|\nu^2\|_2^2\,N_{tot}^{-1}}\}\big|$.
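The constant $c$ in Lemma 3.2 is determined implicitly by the budget constraint $\sum_t \max\{c|\nu(t)|, N\} = N_{tot}$. Since the left side is monotone in $c$, a simple bisection finds it; the sketch below is our illustration of the allocation rule, not the paper's implementation. With $N = 0$ it reduces to the pure L1 allocation $n_t \propto |\nu(t)|$.

```python
# Budget-matching allocation from Lemma 3.2: n_t = max{c |nu(t)|, N}, with c
# chosen by bisection so that sum_t n_t = N_tot.
import numpy as np

def lemma_3_2_allocation(nu, N_tot, N_min, tol=1e-9):
    a = np.abs(nu)
    assert N_tot >= N_min * len(a), "budget must cover the per-task minimum"
    lo, hi = 0.0, N_tot / max(a.sum(), 1e-12)  # at hi, sum_t c*a_t already >= N_tot
    for _ in range(200):  # bisect the monotone map c -> sum_t max{c a_t, N}
        c = (lo + hi) / 2
        total = np.maximum(c * a, N_min).sum()
        lo, hi = (c, hi) if total < N_tot else (lo, c)
        if hi - lo < tol:
            break
    return np.maximum(((lo + hi) / 2) * a, N_min)
```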
From Lemma 3.2 we know our strategy improves on the previous one for any given $\nu$, and we always have $\|\nu\|_1 \leq \sqrt{T}\,\|\nu\|_2$ for all $\nu \in \{\nu \mid W^*\nu = w^*_{T+1}\}$. In particular, the gap between $\|\nu\|_1$ and $\sqrt{s}\,\|\nu\|_2$ can be very large in some special cases, as follows.

Example: Almost Sparse ν. Assume $T \gg 1$ and $N_{tot} \geq NT \gg T$; consider the extreme case
$$\nu(t) = \begin{cases}\sqrt{1 - \frac{1}{T-1}}, & t = 1\\ \frac{1}{T-1}, & t = 2, \dots, T\end{cases} \tag{12}$$
Then $\nu$ is approximately 1-sparse since $\frac{1}{T-1} \ll 1$, and we have $\|\nu\|_1 = \sqrt{1 - \frac{1}{T-1}} + 1 < 2$ and $\|\nu\|_2 = 1$. Let $\gamma_0 := \frac{N_{tot}}{(T-1)^2}$; it is easy to prove $s \gtrsim \min\{\gamma_0, 1\}\,(T-1)$. This results in $\sqrt{s}\,\|\nu\|_2 \gg \|\nu\|_1$, and the A-MTRL of (Chen et al., 2022) requires a sample complexity that is $\min\{\frac{N_{tot}}{2(T-1)}, \frac{T}{2}\}$ times larger than our optimal sampling strategy.

3.1.2. OPTIMAL ν IN THE CANDIDATE SET

Second, suppose we can access the whole set $\{\nu \mid W^*\nu = w^*_{T+1}\}$; we now aim to find the optimal $\nu$ in this candidate set for sampling. Once we find such a $\nu^*$, we can use the rule of Lemma 3.2 to obtain $n^*_{[T]}$ and apply all the results above. Here we focus on the regime of (Chen et al., 2022) where the ER bound $\varepsilon^2 \to 0$ and $N_{tot} \to +\infty$, and we deduce that the L1-minimization solution is the best choice.

Theorem 3.4. Let $(\nu^1, n^1_{[T]})$ denote the sampling parameters of the L1 strategy defined in Def. 2.5, i.e.,
$$\nu^1 = \mathop{\arg\min}_\nu \|\nu\|_1\ \ \text{s.t.}\ W^*\nu = w^*_{T+1};\qquad n^1_t = \max\{c\,|\nu^1(t)|, N\},\ t\in[T] \tag{13}$$
Let $(\nu^*, n^*_{[T]})$ denote the optimal solution of (9). Then as $N_{tot} \to +\infty$ we have $\nu^1 \to \nu^*$ and $n^1_{[T]} \to n^*_{[T]}$.

Theorem 3.4 shows that the L1 strategy corresponds to the asymptotically optimal solution of (9). As a reference, Alg. 1 of (Chen et al., 2022) is equivalent to the L2 strategy, and we call this class of methods Lp-A-MTRL (A-MTRL with the Lp strategy) with known $\nu^p$ in the discussion below.

3.2. How Good Is L1-A-MTRL with Known ν? Comparison on the Worst Target Task

To show the effectiveness of the L1 strategy with known $\nu^1$, we analyze the performance of MTRL algorithms on a worst-case target task $w^*_{T+1}$ that maximizes the excess risk. First, for a cleaner comparison, we define the sampling-algorithm-dependent minimax lower bound of the excess risk. Let $\Gamma(\sigma_k) = \{W \in \mathbb{R}^{k\times T} \mid \sigma_{\min}(W) \geq \sigma_k\}$ for any $\sigma_k > 0$; then we define:

Definition 3.5 (minimax ER lower bound). The minimax lower bound of the ER on the target task for the Lp-A-MTRL method, $ER_{L_p}(\sigma_k)$, is defined as
$$\inf_{(\hat{B}, \hat{w}_{T+1})}\ \sup_{(B^*, W^*, w^*_{T+1})}\ \mathbb{E}_{x\sim\mu_{T+1}}\|x^\top(\hat{B}\hat{w}_{T+1} - B^*w^*_{T+1})\|_2^2 = \inf_{(\hat{B}, \hat{W})}\ \sup_{(B^*, W^*, \nu^p)}\ \mathbb{E}_{x\sim\mu_{T+1}}\|x^\top(\hat{B}\hat{W}\nu^p - B^*W^*\nu^p)\|_2^2 \tag{14}$$
where $W^*$ varies over $\Gamma(\sigma_k)$ so that Assumption 2.4 holds and $\nu^p$ denotes the Lp-minimization solution of $W^*\nu = w^*_{T+1}$ as in (6). A similar definition holds for P-MTRL.

Remark 3.6. The left term of (14) is the average error of the best prediction model $(\hat{B}, \hat{w}_{T+1})$ on the target task against any possible ground-truth parameters $(B^*, W^*, w^*_{T+1})$. The equality in (14) holds because choosing $\hat{w}_{T+1}$ is equivalent to choosing any $\hat{W} \in \{W \mid W\nu^p = \hat{w}_{T+1}\}$, given the $(W^*, w^*_{T+1})$-dependent $\nu^p$. Note that the Lp strategy of Def. 2.5 is determined by $(\nu^p, n_t)$, so once we fix an Lp-A-MTRL algorithm, (14) depends only on the model parameters and $\sigma_k$.

With this definition, we show that with known $\nu^p$, to reach the same worst-case target ER, L1-A-MTRL can reduce the total number of samples from the source tasks by a factor of up to $T/k^2$ relative to L2-A-MTRL (Chen et al., 2022) and P-MTRL.
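Before stating the theorem, a quick numeric illustration of the mechanism behind it may help. The toy check below is ours (it reuses the `nu_p` helper sketched in Section 2.3): it shows that $\nu^1$ is at most $k$-sparse while $\nu^2$ is generically dense, and that $\|\nu^1\|_1 \leq \sqrt{T}\,\|\nu^2\|_2$.

```python
# Toy check of the sparsity claim behind Theorem 3.7. Reuses nu_p() from the
# earlier sketch; W is a random diverse task matrix, so full row rank a.s.
import numpy as np

rng = np.random.default_rng(0)
k, T = 5, 50
W = rng.normal(size=(k, T))
w_target = rng.normal(size=k)          # an arbitrary reachable target head

nu1, nu2 = nu_p(W, w_target, p=1), nu_p(W, w_target, p=2)
print((np.abs(nu1) > 1e-8).sum())      # <= k nonzeros (a basic feasible solution)
print((np.abs(nu2) > 1e-8).sum())      # typically all T entries are nonzero
print(np.linalg.norm(nu1, 1) <= np.sqrt(T) * np.linalg.norm(nu2))  # True
```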
Theorem 3.7. Assume the conditions of Theorem 3.1 hold, $\|w^*_{T+1}\| = \Theta(1)$, and $\nu^1, \nu^2$ defined in Def. 2.5 are known. Then for L1-A-MTRL, $\nu^1$ is at most $k$-sparse, i.e., $\|\nu^1\|_0 \leq k$. If $N_{tot} \gg TN$ and $W^* \in \Gamma(\sigma_k)$, then with probability at least $1-\delta$, for the ER defined in (7) we have¹
$$ER_{L_1} \lesssim \sigma^2\big(kd\ln(N_{tot}/T) + kT + \ln(\tfrac{1}{\delta})\big)\,\frac{k}{\sigma_k^2 N_{tot}}$$
but for P-MTRL and L2-A-MTRL, with probability at least $1-\delta$ we have
$$ER_{L_2}(\sigma_k),\ ER_{passive}(\sigma_k) \gtrsim \sigma^2\,\frac{dT}{\sigma_k^2 N_{tot}}$$
So when $T \gtrsim k^2$, L1-A-MTRL outperforms L2-A-MTRL and P-MTRL on the worst target task.

¹ For the previous upper bound in Theorem 3.1, one estimates the non-shared $w_{T+1}$ by linear probing on the target task, so (8) contains a target-related error term. Under the cheating case of Theorem 3.7, however, knowing $\nu^p$ means we already have this information once the $n_t$ are large enough, since $W^*\nu^p = w^*_{T+1}$. We emphasize that the known-$\nu^p$ assumption serves to illustrate why the L1 strategy is better; it is not meant for practical use.

Discussion. In essence, the sparsity of $\nu^p$ drives the difference in model performance on the worst-case target task. For the upper bound of L1-A-MTRL, we show $\|\tilde{\nu}^1\|_2^2 \lesssim k/(\sigma_k^2 N_{tot})$. For the lower bound of L2-A-MTRL and P-MTRL, we use the fact that $\inf_{\hat{B},\hat{W}}\sup_{B^*,W^*}\sum_{t=1}^T\|X_t(\hat{B}\hat{w}_t - B^*w^*_t)\|_2^2 \gtrsim \sigma^2 kd$ (up to logarithmic factors) together with the observation that when the rows of $\widetilde{W}$ are well aligned with $\tilde{\nu}^2$, then $\|(\hat{B} - B^*)\widetilde{W}\tilde{\nu}^2\| \gtrsim \|(\hat{B} - B^*)\widetilde{W}\|_F\,\|\tilde{\nu}^2\|$, where $\nu^2$ can be chosen so that $\|\tilde{\nu}^2\|^2 \gtrsim T/(k\,\sigma_k^2 N_{tot})$.

3.3. L1-A-MTRL Algorithm and Theory

In the previous sections we showed the advantage of A-MTRL with the L1 sampling strategy when $\nu^1$ is given. In practice, however, $\nu^1$ is unknown and needs to be estimated from $W^*$ and $w^*_{T+1}$, which themselves must be estimated simultaneously with the unknown representation $B^*$. In this section we design a practical L1-A-MTRL algorithm, shown in Algorithm 1, which estimates the model parameters $\hat{B}, \hat{W}, \hat{w}_{T+1}$ and the relevance vector $\hat{\nu}^1$. In our algorithm setting, we let
$$\beta_1 = 10^5\,Tk^3C_W^6\,\sigma_*^{-6}\big(d + \ln(\tfrac{4T}{\delta})\big),\qquad \beta_2 = k\big(d + T + \ln(\tfrac{1}{\delta})\big)\,\|\hat{\nu}^1\|_1^2\,\varepsilon^{-2} + \beta_1 \tag{15}$$
where $C_W$ is defined in Assumption 3.8 and $\sigma_* = \sigma_{\min}(W^*)$ as in Theorem 3.10. $\beta_1$ and $\beta_2$ characterize the sample complexity required for exploration at the first and second stages, respectively, and they are determined by $T\beta$ and $N_{tot}$ defined in Theorem 3.10. We want to highlight that, unlike the L2-minimization approach of (Chen et al., 2022), our L1-minimization solution has no closed form, which makes it more challenging to control the estimation error between $\hat{\nu}^1$ and $\nu^1$. To deal with this, we use the Lasso program (Wainwright, 2019; Tibshirani, 1996) to estimate $\hat{\nu}^1$:
$$\hat{\nu}^1 \in \mathop{\arg\min}_{\nu\in\mathbb{R}^T}\Big\{\tfrac{1}{2}\|\hat{w}_{T+1} - \hat{W}\nu\|_2^2 + \lambda_k\|\nu\|_1\Big\} \tag{16}$$
where the regularization parameter $\lambda_k$ is chosen by the user. We prove that with a proper $\lambda_k$, $\hat{\nu}^1$ is sufficiently close to $\nu^1$ in $\ell_1$ norm when the following assumptions hold.

Assumption 3.8 (bounded norm). There exist $C_W, R > 0$ such that $\sigma_{\max}(W^*) \leq C_W$ and $\|w^*_{T+1}\|_2 = \Theta(R)$.

Assumption 3.9 (identical covariance). We have $\Sigma^*_t = \Sigma = I_d$ for all $t \in [T+1]$.

Assumption 3.8 implies $\|w^*_t\|_2 = \|W^*e_t\|_2 \leq C_W$ for all $t \in [T]$, a very common condition in previous work (Du et al., 2021; Tripuraneni et al., 2021; Chen et al., 2022). Assumption 3.9 is a stronger covariance condition than Assumption 2.2, but it is also used in (Tripuraneni et al., 2021; Chen et al., 2022), and we only need it in this section.
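Assuming scikit-learn is available, (16) can be solved directly with its coordinate-descent Lasso; the only subtlety is matching scaling conventions, since sklearn minimizes $\frac{1}{2n_{\text{samples}}}\|y - Xb\|_2^2 + \alpha\|b\|_1$. The snippet below is our sketch, not the paper's implementation.

```python
# Estimating nu^1 via the Lasso program (16). With X = W_hat (k x T) and
# y = w_hat_target, sklearn's objective equals (16) up to a factor k, so we
# set alpha = lambda_k / k to match the 1/2-scaled loss in (16).
import numpy as np
from sklearn.linear_model import Lasso

def estimate_nu1(W_hat, w_hat_target, lambda_k):
    k = W_hat.shape[0]
    lasso = Lasso(alpha=lambda_k / k, fit_intercept=False, max_iter=50_000)
    lasso.fit(W_hat, w_hat_target)   # rows = k feature equations, cols = T tasks
    return lasso.coef_               # hat{nu}^1 in R^T
```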
With these assumptions in place, we can state the theoretical guarantee for our practical L1-A-MTRL algorithm.

Theorem 3.10. Let Assumptions 2.1, 2.3, 2.4, 3.8, 3.9 hold. Let $\gamma = \max\{2160\,k^{3/2}C_W^2/\sigma_*,\ \sqrt{2160\,k^{3/2}C_W^3/\sigma_*}\}$, where $\sigma_* = \sigma_{\min}(W^*) > 0$. For the L1-A-MTRL method, set the regularization parameter $\lambda_k$ as in (17), proportional to $\gamma\max\{1, C_W\}$. Then, to obtain $ER_{L_1} \leq \varepsilon^2$ with $\varepsilon^2 \lesssim \min\big(1, \sigma^2(kd + kT)\frac{\|\nu\|_1^2}{TN}\big)$ with probability $1-\delta$, the number of source samples $N_{tot}$ is at most
$$\tilde{O}\big(\sigma^2(kd + kT)\,\|\nu^1\|_1^2\,\varepsilon^{-2} + T\beta\big) \tag{18}$$
where $\beta = \max\Big\{\frac{\gamma^2\sigma_z^2}{\sigma_*^4},\ \frac{\gamma^2C_W^2\rho^4}{\sigma_*^4},\ \rho^4,\ \frac{\sigma_z^2}{\sigma_*^2}\Big\}\big(d + \ln(\tfrac{4T}{\delta})\big)$, and the target-task sample complexity $n_{T+1}$ is at most
$$\tilde{O}\big(\sigma^2 k\varepsilon^{-2} + \alpha\big) \tag{19}$$
where $\alpha = \max\Big\{\frac{\gamma^2\sigma_z^2C_W^2}{\sigma_*^4R^2},\ \frac{\gamma^2C_W^2\rho^4}{\sigma_*^4},\ \rho^4\Big\}\big(k + \ln(\tfrac{4}{\delta})\big)$.

Algorithm 1: L1-A-MTRL Method
1: Input: confidence $\delta$, representation function class $\Phi$, ER bound $\varepsilon \leq 1$, minimum singular value $\sigma_*$
2: Initialize $N = \beta_1/T$ with (15) and $\lambda_k$ with (17)
3: Phase 1: Exploration
4: Draw $N$ i.i.d. samples from every source-task dataset
5: Estimate $\hat{\phi}^1$, $\hat{W}^1$ and $\hat{w}^1_{T+1}$ with Eqns. (2), (3)
6: Estimate $\hat{\nu}^1$ by the Lasso program (16)
7: Set $\beta_2$ with Eqn. (15)
8: Phase 2: Sampling
9: Set $n^2_t = \max\{\beta_2\,|\hat{\nu}^1(t)|\,\|\hat{\nu}^1\|_1^{-1},\ N\}$
10: Draw $n^2_t$ i.i.d. samples from the $t$-th source-task dataset
11: Estimate $\hat{\phi}^2$, $\hat{W}^2$ and $\hat{w}^2_{T+1}$ with Eqns. (2), (3)

Discussion. Compared to the known-$\nu$ case in Corollary 3.3, in this unknown-$\nu$ setting our algorithm only requires an additional $\varepsilon$-independent number of samples ($T\beta$ from the source tasks and $\alpha$ from the target task) to achieve the same performance. (Chen et al., 2022) has similar results, but the additional term $\beta$ in their Theorem 4.1 is of order $\varepsilon^{-1}$. Technically, (Chen et al., 2022) directly uses the closed form of the least-squares solution and proves $|\hat{\nu}^2(t)| = \Theta(|\nu^2(t)|)$ for all $t \in [T]$ if $n_t \geq c'\varepsilon^{-1}$. For our Lasso-based L1-A-MTRL method, by contrast, we can choose a proper parameter $\lambda_k$ that upper-bounds not only the noise term but also the $\ell_1$ error between the Lasso solution and the true vector, so that $\|\hat{\nu}^1 - \nu^1\|_1 = O(\|\nu^1\|_1)$ once $n_t \geq c''\varepsilon^0$, i.e., for an $\varepsilon$-independent sample size (Lemma E.3). Here $c', c'' > 0$ are model-related constants. Moreover, we share a limitation with (Chen et al., 2022): we require some prior knowledge of $\sigma_*$. However, since these quantities only enter the additional constant terms, they are unlikely to dominate either of the sample complexities for reasonable values of $d, k, T$ and $\varepsilon \leq 1$. Lastly, it is worth mentioning that results similar to Theorem 3.10 also apply when our L1-A-MTRL algorithm incorporates multiple sampling stages, as presented in Algorithm 2 in the Appendix. The reason is that we only need to ensure that the minimum sampling budget exceeds $N$, which is independent of the stage, and the additional proof follows a similar approach to that of Theorem E.4 in (Chen et al., 2022).
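For concreteness, the following sketch (ours, not the paper's code) wires the earlier helper functions into the two-phase structure of Algorithm 1. The sampling oracle `sample_task(t, n)` is hypothetical, and the budgets `beta2` and `lambda_k` are passed in as plain inputs rather than computed from (15) and (17).

```python
# End-to-end sketch of Algorithm 1, reusing fit_representation / linear_probe /
# estimate_nu1 from the earlier snippets. target_data = (X_target, y_target).
import numpy as np

def l1_a_mtrl(sample_task, T, k, N, beta2, lambda_k, target_data):
    # Phase 1: exploration -- N samples from every source task.
    data = [sample_task(t, N) for t in range(T)]               # (X_t, Y_t) pairs
    B1, W1 = fit_representation([X for X, _ in data], [y for _, y in data], k)
    w1_target = linear_probe(B1, *target_data)
    nu1_hat = estimate_nu1(np.column_stack(W1), w1_target, lambda_k)
    # Phase 2: sample n_t = max{beta2 |nu1_hat(t)| / ||nu1_hat||_1, N} and refit.
    n2 = np.maximum(beta2 * np.abs(nu1_hat) / np.abs(nu1_hat).sum(), N)
    data2 = [sample_task(t, int(n)) for t, n in enumerate(n2)]
    B2, _ = fit_representation([X for X, _ in data2], [y for _, y in data2], k)
    return B2, linear_probe(B2, *target_data)
```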
4. Extension: Cost-Sensitive Task Selection

In Section 3 we proved that the L1 strategy minimizes the total number of samples drawn from the source tasks. Implicitly, this assumes that every task costs the same and that the first sample costs the same as the $n$-th sample. In contrast, we can consider a non-linear cost function for the $t$-th source task, $f_t: \mathbb{N} \to \mathbb{R}$, which takes the number of random label queries $n$ and outputs the total required cost. For example, this can encode the notion that a long-term data subscription from a single source yields a decreasing average cost over time. Here we show that, even in this task-cost-sensitive setting, our L1-A-MTRL method (Algorithm 1) can still be useful under many benign cost functions. Consider the following example.

Example: Saltus Cost Function. Assume $N_{tot}$ and $N$ are fixed, $n_{t,1} = N$, and $f_t \equiv f$ for all $t \in [T]$, where $f$ consists of a fixed cost plus a linear variable cost:
$$f(n) = \begin{cases}C_{fix} + C_{var}(n - N), & n > N\\ 0, & n \leq N\end{cases} \tag{20}$$
i.e., each source task $t$ provides $N$ free samples. As a reference, one practical instance of this setting is programmatic weak supervision, where setting up a source incurs a high one-time cost, after which the query cost remains low and linear (Zhang et al., 2022a). If we want to find some $\nu$ minimizing the total cost $\sum_{t=1}^T f_t(n_t)$, this is equivalent to finding the L0-minimization solution of $\hat{W}\nu = \hat{w}_{T+1}$, where $\hat{W}, \hat{w}_{T+1}$ are estimated from the free data. Of course, L0 minimization is known to be intractable, so with a proper $\lambda_f$ the L1-A-MTRL method can be a good approximation.

Now we formally define our goal and characterize when our L1-A-MTRL method is useful. Based on the excess-risk upper bound in Theorem 3.1, to achieve $ER \leq \varepsilon^2$ we aim to solve the following optimization problem:
$$\min_{n_{[T],2},\,\nu}\ \sum_{t=1}^T f_t(n_{t,1} + n_{t,2})\quad\text{s.t.}\quad \sigma^2 k(d+T)\sum_{t=1}^T\frac{\nu(t)^2}{n_{t,1} + n_{t,2}} \leq \varepsilon^2,\quad W^*\nu = w^*_{T+1},\quad n_{t,2} \geq 0,\ t\in[T] \tag{21}$$
Then we have the following guarantee whenever $f_t$ satisfies the stated properties.

Theorem 4.1 (informal). Assume $f_t$ is a piecewise second-order-differentiable function, and that on each piece it satisfies $f_t \geq 0$, $\partial f_t \geq 0$, $\partial^2 f_t \leq 0$ and $\partial f_t(n_{t,1} + n_{t,2}) = \Omega(n_{t,2}^{-2+q})$ for some $q \in (0,2]$. Denote the optimal solution of (21) by $(n^*_{[T],2}, \nu^*)$. Then, under a similar data-generation assumption as before, we have
$$n^*_{t,2} = h_t(|\nu^*(t)|) \tag{22}$$
where $h_t$ is a monotone increasing function satisfying $c_{t,1}\,x \leq h_t(x) \leq c_{t,2}\,x^{2/q}$ with $c_{t,1}, c_{t,2} > 0$. Moreover, the A-MTRL algorithm with the sampling strategy $n^*_{[T],2}$ is $k$-sparse, i.e., $\|n^*_{[T],2}\|_0 \leq k$.

Discussion. If $\partial f_t(n^*_{t,2}) \equiv c > 0$, then (21) is equivalent to the L1 strategy of the previous sections. In many other cases, however, it may be NP-hard to optimize (21), as with the saltus cost function above. Our L1-A-MTRL algorithm can therefore be widely applied in these task-cost-sensitive scenarios to approximate the optimal strategy.
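To see why sparsity matters under (20), the toy computation below (ours) evaluates the saltus cost of two sampling plans with identical total extra-sample budgets; the sparse plan pays the fixed setup cost $C_{fix}$ for far fewer tasks.

```python
# Total saltus cost (Eqn. 20) of a sampling plan n_t: fixed setup cost plus a
# linear per-sample cost beyond the free quota N_free.
import numpy as np

def saltus_cost(n, N_free, c_fix, c_var):
    n = np.asarray(n, dtype=float)
    active = n > N_free                     # tasks sampled beyond the free quota
    return np.sum(active * (c_fix + c_var * (n - N_free)))

# Example: 1000 extra samples on 5 tasks vs. spread over 100 tasks.
sparse = np.full(100, 10.0); sparse[:5] += 200
dense = np.full(100, 10.0) + 10
print(saltus_cost(sparse, 10, c_fix=50, c_var=1))   # 5*50 + 1000 = 1250
print(saltus_cost(dense, 10, c_fix=50, c_var=1))    # 100*50 + 1000 = 6000
```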
5. Experiments

Although our theoretical analysis holds only for linear representations, our experiments also show the effectiveness of our algorithm with neural-network representations in the task-selection case. In this section we follow the experimental settings of (Chen et al., 2022) and empirically evaluate L1-A-MTRL on the corrupted MNIST (MNIST-C) dataset proposed in (Mu & Gilmer, 2019). We demonstrate the advantage of our algorithm in the two scenarios mentioned above. The first is cost-agnostic: it aims to minimize the total number of samples from the source tasks and may access all of them. The second is task-cost-sensitive as in Section 4; here we concentrate on $k$-task-selection algorithms, which correspond to cost functions like the saltus cost function, where the learner is allowed to sample from only $k$ tasks after the initial exploration stage. For convenience we call the first case the full-task scenario and the second the $k$-task selection scenario. Please refer to Appendix G.1 for further illustration of our intuition for the $k$-task selection scenario.

5.1. Experimental Setup

Datasets. The MNIST-C dataset is a comprehensive suite of 16 corruptions applied to the MNIST test set. As in (Chen et al., 2022), we divide each corruption-related sub-dataset into 10 tasks according to their labels (0-9), obtaining 160 separate new tasks denoted {corruption type}_{label}. For instance, brightness_0 denotes the data corrupted by brightness noise and relabeled 1/0 according to whether the image shows the digit 0. Once a task {type A}_{label B} is chosen as the target task, the 150 other tasks that do not contain type-A corruption are chosen as source tasks.

Experimental Setups and Comparisons. As in (Chen et al., 2022), we replace the cross-entropy loss commonly used for MNIST with the regression $\ell_2$ loss, to align with the theoretical setting of this paper. For the full-task scenario we use the linear representation defined in our theorems, with $d = 28\times28$, $k = 50$, and $T = 150$ source tasks in total, and we compare the L1-A-MTRL and L2-A-MTRL (Chen et al., 2022) algorithms on the above datasets with 160 different target tasks. For the $k$-task selection scenario we use a 2-layer ReLU CNN followed by a fully connected linear layer as the representation map. Since neural networks capture features better, here we set a smaller representation dimension $k = 10$ to show the advantage of the sparse task-selection algorithm, while the other parameters follow the full-task setting. We compare L1-A-MTRL, which is provably $k$-sparse by Theorem 3.7, against a vanilla $k$-sparse baseline that randomly selects $k = 10$ source tasks for second-stage sampling. Please refer to Appendix G.2 for details of the algorithm implementation and Appendix G.3 for how the value of $\lambda_k$ is determined.

5.2. Results

Full-task scenario. In summary, L1-A-MTRL achieves 0.54% higher average accuracy across all target tasks than L2-A-MTRL and matches or exceeds its performance on 126 of the 160 tasks. Because the dataset is imbalanced, a baseline that blindly predicts the majority label attains a 10% error rate, while the average prediction error rate of L2-A-MTRL is 7.4%.

$k$-task selection scenario. Similarly, L1-A-MTRL achieves 2.2% higher average accuracy across all target tasks than the vanilla baseline, whose average prediction error rate is 5.7%, and matches or exceeds it on 149 of the 160 tasks. This shows the effectiveness of our method with neural-network representations. In Section G.4 of the appendix, we provide additional comparisons of the empirical sampling budgets of the different algorithms. The results demonstrate that L1-A-MTRL requires fewer samples than L2-A-MTRL and P-MTRL while achieving comparable performance, further underscoring the effectiveness of our L1-A-MTRL algorithm.

6. Conclusion

We introduced a novel active sampling strategy, L1-A-MTRL, which sparsely samples from target-related source tasks and learns a good representation that helps the target task.
From a theoretical perspective, we first showed that L1-A-MTRL is strictly better than the previous L2-A-MTRL by proving a novel sampling-strategy-dependent lower bound and then providing a correspondingly tighter upper bound. From the empirical perspective, we showed that our algorithm is not only effective under the standard setting but achieves even better results in the practical scenario where the number of source tasks is restricted.

[Figure 1 omitted: two heatmaps of per-task accuracy differences.] Figure 1. Performance Comparison. These plots show the prediction difference (in %) between our method and the baseline for all target tasks; larger is better. The y-axis denotes the corruption type and the x-axis the binarized label, so the grid cell at (x, y) corresponds to the target task {y}_{x}. Left: full-task scenario, comparing L1-A-MTRL and L2-A-MTRL with a linear representation. Right: k-task selection scenario, comparing two k-sparse task-selection algorithms, L1-A-MTRL and a passive-learning baseline that randomly selects k source tasks for second-stage sampling, with a ConvNet representation.

Acknowledgements

SSD acknowledges the support of NSF IIS 2110170, NSF DMS 2134106, NSF CCF 2212261, NSF IIS 2143493, NSF CCF 2019844, NSF IIS 2229881.

References

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems.

Asai, A., Salehi, M., Peters, M. E., and Hajishirzi, H. Attentional mixtures of soft prompt tuning for parameter-efficient multi-task knowledge sharing. arXiv preprint arXiv:2205.11961, 2022.

Bai, Z., Shen, Y., Shui, N., and Guo, X. Introduction to Riemann geometry, 1992.

Bertsimas, D. and Tsitsiklis, J. N. Introduction to Linear Optimization. 1997.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Chen, S., Crammer, K., He, H., Roth, D., and Su, W. J. Weighted training for cross-task learning. arXiv preprint arXiv:2105.14095, 2021.

Chen, Y., Jamieson, K., and Du, S. Active multi-task representation learning. In International Conference on Machine Learning, pp. 3271-3298. PMLR, 2022.

Chua, K., Lei, Q., and Lee, J. D. How fine-tuning allows for effective meta-learning. Advances in Neural Information Processing Systems, 34:8871-8884, 2021.

Cohen, M. B., Lee, Y. T., and Song, Z. Solving linear programs in the current matrix multiplication time. Journal of the ACM (JACM), 68(1):1-39, 2021.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pp. 2089-2099. PMLR, 2021.

Du, S. S., Hu, W., Kakade, S. M., Lee, J., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2021.

Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., and Finn, C. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503-27516, 2021.

Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022.

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28, 2000. doi: 10.1214/aos/1015957395.

Mu, N. and Gilmer, J. MNIST-C: A robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337, 2019.

Pajor, A. Metric entropy of the Grassmann manifold. 1998.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Shachaf, G., Brutzkus, A., and Globerson, A. A theoretical analysis of fine-tuning with linear teachers. Advances in Neural Information Processing Systems, 34:15382-15394, 2021.

Shi, G., Azizzadenesheli, K., O'Connell, M., Chung, S.-J., and Yue, Y. Meta-adaptive nonlinear control: Theory and algorithms. In Advances in Neural Information Processing Systems, volume 34, pp. 10013-10025. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/52fc2aee802efbad698503d28ebd3a1f-Paper.pdf.

Standley, T., Zamir, A., Chen, D., Guibas, L., Malik, J., and Savarese, S. Which tasks should be learned together in multi-task learning? In International Conference on Machine Learning, pp. 9120-9132. PMLR, 2020.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Sample efficient linear meta-learning by alternating minimization. arXiv preprint arXiv:2105.08306, 2021.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58:267-288, 1996.

Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33:7852-7862, 2020.

Tripuraneni, N., Jin, C., and Jordan, M. I. Provable meta-learning of linear representations. In ICML, 2021.
Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. doi: 10.1017/9781108627771.

Xu, Z. and Tewari, A. Representation learning beyond linear prediction functions. Advances in Neural Information Processing Systems, 34:4792-4804, 2021.

Yao, X., Zheng, Y., Yang, X., and Yang, Z. NLP from scratch without large-scale pretraining: A simple and efficient framework. In International Conference on Machine Learning, pp. 25438-25451. PMLR, 2022.

Zaiem, S., Parcollet, T., Essid, S., and Heba, A. Pretext tasks selection for multitask self-supervised speech representation learning. arXiv preprint arXiv:2107.00594, 2021.

Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712-3722, 2018.

Zhang, J., Hsieh, C.-Y., Yu, Y., Zhang, C., and Ratner, A. A survey on programmatic weak supervision. arXiv preprint arXiv:2202.05433, 2022a.

Zhang, Z., Wang, S., Xu, Y., Fang, Y., Yu, W., Liu, Y., Zhao, H., Zhu, C., and Zeng, M. Task compass: Scaling multi-task pre-training with task prefix. arXiv preprint arXiv:2210.06277, 2022b.

A. Related Work

Empirical works on P-MTRL and A-MTRL. Multi-task representation learning has been widely applied and has achieved great success in the natural-language domain with GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020), in the vision domain with CLIP (Radford et al., 2019), and in the multi-modal domain with Flamingo (Alayrac et al.). Nevertheless, such large models are costly both in data collection/cleaning and in training. Recently, many works have focused on efficiently selecting source tasks. In the natural-language domain, for example, (Yao et al., 2022) use a heuristic retriever method to select a subset of target-related NLP source tasks; more recently, works like (Asai et al., 2022; Zhang et al., 2022b) use prefixes/prompts to capture the relation between source and target tasks. Similar topics have been studied in the vision domain: for example, (Zamir et al., 2018) propose a transfer-learning algorithm based on learning the underlying structure among visual tasks, which they call Taskonomy, and many follow-up works propose different approaches on the Taskonomy dataset, including (Fifty et al., 2021; Standley et al., 2020).

Theoretical works on P-MTRL. There are also many existing works on provable P-MTRL. Tripuraneni et al. (2020; 2021); Du et al. (2021); Thekumparampil et al. (2021); Collins et al. (2021); Xu & Tewari (2021) assume there exists a ground-truth shared representation across all tasks. In particular, Tripuraneni et al. (2020; 2021) and Thekumparampil et al. (2021) assume a low-dimensional linear representation as we do, while Du et al. (2021) generalize to both high-dimensional representations and 2-layer ReLU networks. Tripuraneni et al. (2020) also further consider general representations with linear predictors. Both works obtain similar results. Besides, many recent works focus on fine-tuning in theoretical contexts (Shachaf et al., 2021; Chua et al., 2021; Chen et al., 2021; Kumar et al., 2022). On the lower-bound side, Tripuraneni et al. (2021) prove, for the first time, a minimax lower bound for the estimation error of the estimated representation layer, measured by a subspace angle distance.
However, we claim that this cannot directly yield a similar lower bound on the target-task test error, which relates to one of our main contributions. The reason is that even if the estimated representation is far from the ground-truth one, the learner can still estimate a proper target predictor achieving sufficiently small test error, as long as $B^*w^*_{T+1}$ (almost) lies in the column space of $\hat{B}$, with notation as defined in the preliminaries.

Theoretical works on A-MTRL. To overcome the problems of P-MTRL, some subsequent works give different priorities to the source tasks via methods like active learning (Chen et al., 2022) and weighted training (Chen et al., 2021). Representatively, Chen et al. (2022) is the first work to propose A-MTRL, which computes a proper number of samples for each source task. It iteratively estimates the relevance of each source task to the target task by estimating the relevance vector $\nu^*$, uses the L2 strategy defined in Def. 2.5 to decide the sampling allocation, and significantly outperforms passive MTRL (P-MTRL), which samples uniformly from the source tasks, both theoretically and empirically. Nevertheless, the optimal sampling strategy for A-MTRL remained underexplored, and the non-sparsity of $\nu^2$ can be inconvenient in task-cost-sensitive scenarios. We develop our work within the problem setting of (Chen et al., 2022) and propose a more efficient sampling strategy. As another approach, Chen et al. (2021) concentrate on learning a weighting over the tasks. The crucial difference between their work and ours is that they assume access to the whole dataset, whereas we do not and instead actively query new data from large pools (e.g., tasks represented by search terms issued to Wikipedia or Google). They also allow some tasks to be not only irrelevant but even harmful, requiring down-weighting.

B. Technical Notations

We summarize the technical notation used in the appendix as follows.

Grassmann Manifold. Assume $d \geq k$. We denote by $Gr_{d,k}$ the Grassmann manifold, which contains all subspaces spanned by $k$ linearly independent $d$-dimensional vectors. For $d \geq k$, let $O_{d,k}$ be the set of $d\times k$ matrices whose columns are $k$ orthonormal vectors in $\mathbb{R}^d$. Any $B \in O_{d,k}$ then corresponds to an element of $Gr_{d,k}$, namely the span of its columns. In fact, an element of $Gr_{d,k}$ corresponds to an equivalence class of $d\times k$ matrices under the relation
$$Y \sim X \iff Y = XA,\ A \in GL(k, \mathbb{R}) \tag{23}$$
where $GL(k, \mathbb{R})$ denotes the general linear group of degree $k$ over $\mathbb{R}$.

Subspace Distance. Finally, following (Tripuraneni et al., 2021) and (Pajor, 1998), we define the distance between subspaces in the Grassmann manifold. Let $s_p(T) = (\sum_{i\geq1}|\sigma_i(T)|^p)^{1/p}$ for any matrix $T$ and any $p \in [1, \infty]$; in particular, $s_\infty$ is the operator norm of $T$. For $E, F \in O_{d,k}$, from Proposition 6 of (Pajor, 1998) we define $s_q(E, F) = \big(2\sum_{i=1}^k|1 - \sigma_i^2(E^\top F)|^{q/2}\big)^{1/q}$ as the distance between the subspaces spanned by the columns of $E$ and $F$, respectively. In particular, $s_\infty(E, F) = \sqrt{1 - \sigma_k^2(E^\top F)}$.

C. Proof of Theorem 3.4

Proof of Lemma 3.2. We can use the following equivalent optimization problem to prove the lemma:
$$\min_{n_{[T]}}\ G(n_{[T]}) := \sum_{t=1}^T\frac{|\nu(t)|^2}{n_t}\quad\text{s.t.}\quad c_0(n_{[T]}) := N_{tot} - \sum_{t=1}^T n_t \geq 0,\quad c_t(n_t) := n_t - N \geq 0,\ t\in[T] \tag{24}$$
c0(n[T ]) := Ntot ct(nt) := nt N > 0, t [T] The corresponding Lagrangian function for (24) is L(n[T ]) := G(n[T ]) λ0c0(n[T ]) t=1 λtct(nt) (25) Then from the Karush-Kuhn-Tucker condition, for all t [T] we have the necessary condition L nt = |ν (t)|2 n2 t + λ0 λt = 0 λtct(nt) = λt(nt N) = 0 So we get λ0 > λt 0, t [T] and λ 0.5 0 |ν (t)| , λt = 0 nt N, N , λt > 0 nt = N. (27) thus we finish the proof. As a supplement, we give another proof for the special case in this Lemma where we assume nt > N for every t [T]. Let β(t) := ν (t) ν 2 , αt = nt Ntot and thus PT t=1 β2(t) = PT t=1 αt = 1. Therefore by Cauchy inequality, eν 2 2 = ν 2 2 Ntot = ν 2 2 Ntot ( ν 2 2 Ntot ( t=1 |β(t)|)2 = ν 2 1 Ntot The equality in (28) is achieved iff |β(t)| αt = c αt for evert t [T] with c > 0, which means that nt is proportional to |ν (t)|. Proof of Corollary 3.3. As stated in Lemma 3.2, n t = max{c |ν(t)|, N} and c > 0 is some constant such that PT t=1 n t = Ntot, so we have c |ν(t)| n t c |ν(t)| + N. Sum up both sides of the inequality for all t [T], then: c ν 1 Ntot c ν 1 + TN (29) Improved Active Multi-Task Representation Learning via Lasso Therefore, if we assume Ntot TN, then we get Ntot = (1 + o(1))c ν 1. In fact, we only need to ensure that Ntot > 2TN, which results in c > Ntot/(2|ν|1), since the coefficient in the error bound (8) is unconsidered. Let S1 = {t [T]|nt N, |ν(t)| > 0} and S2 = {t [T]|nt < N, |ν(t)| > 0}. Then for any fixed ν, from Lemma 3.2, we have the following inequality for the optimal strategy: c |ν(t)| = (1 + o(1)) X Ntot ν 1 = (1 + o(1)) ν 2 1 Ntot (30) Here the inequality holds if and only if S2 is empty, which means that for all t [T], ν(t) must satisfies ν(t) = 0 or c |ν(t)| N. Combining (30) and Theorem 3.1, we get the results. Proof of Theorem 3.4. From (30) we know that if c |ν(t)| N for all t [T] such that |ν(t)| > 0 , then n t = Ntot|ν(t)|/ ν 1, t {t [T]||ν(t )| > 0}, and eν 2 2 attain its minimum ν 2 1/Ntot. We prove that such a condition can be achieved when Ntot is sufficiently large. Assume that c |ν(t)| < N always holds for some t [T] where ν(t) = 0. Then if we choose Ntot = TN + N/|ν(t)| ν 1, we will have c |ν(t)| = c ν 1 |ν(t)|/ ν 1 (Ntot TN)|ν(t)|/ ν 1 N, where we use the fact that NT + c ν 1 Ntot from Eqn. 29. This is contradicted by the assumption, and thus we can always find some Ntot such that c |ν(t)| N if ν(t) = 0. So for any given ν, the optimal sampling strategy nt(ν)(Lemma 3.2) can let eν 2 2 achieves its minimum ν 2 1/Ntot. Then we vary ν among the solution candidate set of W ν = w T +1 and find L1-minimization solution ν1 can minimize ν 2 1/Ntot. Therefore, (ν1, n1 [T ]) is optimal for the original problem (9). D. Proof of Theorem 3.7 D.1. Preparations for minimax lower bound First, we reclaim some concentration inequalities commonly used in the previous work (Du et al., 2021; Chen et al., 2022). Lemma D.1. (A variant of Lemma A.6 in (Du et al., 2021)) Let a1, ..., an be i.i.d. d-dimensional random vectors such that E[ai] = 0, E[aia i ] = I, and ai is ρ2-subgaussian. For δ (0, 1), ϵ (0, 1 2), suppose n > 1 ϵ2 caρ4(d + ln( 1 δ )) for some universal constant ca. Then with probability at least 1 δ we have i=1 aia i (1 + 2ϵ)Id (31) Recall that Σ t = Ext pt[xtx t ] and ˆΣt := 1 nt (Xt) Xt for any t [T + 1], then we have: Lemma D.2. (A variant of Claim A.1, A.2 in (Du et al., 2021)) Suppose for δ (0, 1). 
D. Proof of Theorem 3.7

D.1. Preparations for the minimax lower bound

First, we recall some concentration inequalities commonly used in previous work (Du et al., 2021; Chen et al., 2022).

Lemma D.1 (a variant of Lemma A.6 in (Du et al., 2021)). Let $a_1, \dots, a_n$ be i.i.d. $d$-dimensional random vectors such that $\mathbb{E}[a_i] = 0$, $\mathbb{E}[a_ia_i^\top] = I$, and $a_i$ is $\rho^2$-sub-Gaussian. For $\delta \in (0,1)$ and $\epsilon \in (0, \frac{1}{2})$, suppose $n > \frac{1}{\epsilon^2}c_a\rho^4(d + \ln(\frac{1}{\delta}))$ for some universal constant $c_a$. Then with probability at least $1-\delta$ we have
$$(1 - 2\epsilon)\,I_d \preceq \frac{1}{n}\sum_{i=1}^n a_ia_i^\top \preceq (1 + 2\epsilon)\,I_d \tag{31}$$

Recall that $\Sigma^*_t = \mathbb{E}_{x_t\sim p_t}[x_tx_t^\top]$ and $\hat{\Sigma}_t := \frac{1}{n_t}X_t^\top X_t$ for any $t \in [T+1]$. Then we have:

Lemma D.2 (a variant of Claims A.1 and A.2 in (Du et al., 2021)). Fix $\delta \in (0,1)$ and let $n_t > \frac{1}{\epsilon^2}c_a\rho^4(d + \ln(\frac{2T}{\delta}))$ for all $t \in [T]$, where $c_a > 0$ is a universal constant. Then with probability at least $1-\frac{\delta}{2}$ over the inputs $X_1, \dots, X_T$ of the source tasks, we have
$$(1 - 2\epsilon)\,\Sigma^*_t \preceq \hat{\Sigma}_t \preceq (1 + 2\epsilon)\,\Sigma^*_t \tag{32}$$
Similarly, let $n_{T+1} > \frac{1}{\epsilon^2}c_a\rho^4(k + \ln(\frac{2}{\delta}))$. Then, for any given matrices $B_1, B_2 \in \mathbb{R}^{d\times k}$ independent of $X_{T+1}$, with probability at least $1-\frac{\delta}{2}$ over $X_{T+1}$ we have
$$(1 - 2\epsilon)\,B_1^\top\Sigma^*_{T+1}B_2 \preceq B_1^\top\hat{\Sigma}_{T+1}B_2 \preceq (1 + 2\epsilon)\,B_1^\top\Sigma^*_{T+1}B_2 \tag{33}$$

Next, we show that $\sum_{t=1}^T\|X_t(\hat{B}\hat{w}_t - B^*w^*_t)\|^2 \asymp \sigma^2(kT + k(d-k))$. The upper bound is Claim A.3 in (Du et al., 2021); the lower bound is established by the following theorem.

Theorem D.3. Under the conditions of Theorem 3.7, with probability $1-\delta$ we have
$$\inf_{(\hat{B},\hat{W})}\ \sup_{(B^*,W^*)}\ \sum_{t=1}^T\|X_t(\hat{B}\hat{w}_t - B^*w^*_t)\|_2^2 \gtrsim \sigma^2(kT + k(d-k)) \tag{34}$$

The key theorems and lemmas are as follows.

Theorem D.4. Let $G_0 := \{BW \mid B \in O_{d,k},\ W \in \mathbb{R}^{k\times T}\}$, and let $G_1(\delta_1) := \{BW \mid B \in O_{d,k},\ W \in \mathbb{R}^{k\times T},\ \|w_t\| \leq \delta_1\ \forall t\in[T]\}$ be a local packing of $G_0$, where $w_t$ is the $t$-th column vector of $W$. Then $G_1$'s packing number is lower bounded by
$$\ln M(G_1(\delta_1), \|\cdot\|_F, \epsilon_1) \gtrsim k(d-k) + kT \tag{35}$$
where $\epsilon_1$ will be determined soon.

Lemma D.5 (adapted from (Pajor, 1998)). For any $1 \leq k \leq d$ such that $k \leq d-k$, for every $\epsilon > 0$ we have
$$\Big(\frac{c_1}{\epsilon}\Big)^{k(d-k)} \leq N(Gr_{d,k}, s_\infty, \epsilon) \leq \Big(\frac{c_2}{\epsilon}\Big)^{k(d-k)} \tag{36}$$
with universal constants $c_1, c_2 > 0$. From the relation between packing and covering numbers (Wainwright, 2019), we have
$$M(Gr_{d,k}, s_\infty, \epsilon) \geq \Big(\frac{c_1}{\epsilon}\Big)^{k(d-k)} \tag{37}$$

Lemma D.6. Let $B_1, B_2 \in O_{d,k}$ and $w^1, w^2 \in \mathbb{R}^k$. By the SVD we get $(B_1)^\top B_2 = PDQ^\top$, where $P, Q \in O_{k,k}$ and $D = \mathrm{diag}(\sigma_1, \dots, \sigma_k)$; clearly $\sigma_i \in [0,1]$. Define $v^1 = P^\top w^1$ and $v^2 = Q^\top w^2$. With subscripts denoting vector components, we have
$$|B_1w^1 - B_2w^2|^2 = \sum_{i=1}^k\big[2|v^1_i||v^2_i|\,f(v^1_i, v^2_i) + (|v^1_i| - |v^2_i|)^2\big] \tag{38}$$
where
$$f(v^1_i, v^2_i) = \begin{cases}1 - \sigma_i, & \mathrm{sign}(v^1_iv^2_i) = 1\\ 1 + \sigma_i, & \mathrm{sign}(v^1_iv^2_i) = -1\end{cases} \tag{39}$$
and we get the lower bound
$$|B_1w^1 - B_2w^2|^2 \geq 2|v^1_k||v^2_k|(1 - \sigma_k) + \sum_{i=1}^k(|v^1_i| - |v^2_i|)^2 \geq 0 \tag{40}$$

Proof of Lemma D.6. By direct calculation,
$$|B_1w^1 - B_2w^2|^2 = (B_1w^1 - B_2w^2)^\top(B_1w^1 - B_2w^2) = |w^1|^2 + |w^2|^2 - 2(w^1)^\top(B_1)^\top B_2w^2 = |v^1|^2 + |v^2|^2 - 2(v^1)^\top Dv^2 = \sum_{i=1}^k\big((v^1_i)^2 + (v^2_i)^2 - 2v^1_iv^2_i\sigma_i\big) \tag{41}$$
To make each term of the sum non-negative, we use the sign function:
$$|B_1w^1 - B_2w^2|^2 = \sum_{i=1}^k\big[(v^1_i)^2 + (v^2_i)^2 - 2\,\mathrm{sign}(v^1_iv^2_i)\,v^1_iv^2_i + 2v^1_iv^2_i(\mathrm{sign}(v^1_iv^2_i) - \sigma_i)\big] = \sum_{i=1}^k\big[(|v^1_i| - |v^2_i|)^2 + 2|v^1_i||v^2_i|\,f(v^1_i, v^2_i)\big] \tag{42}$$

Besides, we begin to construct a separated set for $G_1$. First, let $G_B = \{B^1, \dots, B^{M_B}\}$ be an $\epsilon_B$-separated set for the metric $s_\infty$ on $Gr_{d,k}$, where $\epsilon_B \leq \min(\frac{c_1}{2}, 1)$ with $c_1$ as in Lemma D.5. Then denote $(B^m)^\top B^n = P(m,n)D(m,n)Q(m,n)^\top$, where $P(m,n), Q(m,n) \in O_{k,k}$, $D(m,n) = \mathrm{diag}(\sigma_1(m,n), \dots, \sigma_k(m,n))$, and $P(m,n) = Q(m,n) = D(m,n) = I_k$ iff $m = n$. On the other hand, for $t \in [T]$, we denote by $v^j_{t,i}(P(m,n))$ the $i$-th component of $v^j_t(P(m,n)) := P(m,n)^\top w^j_t$, and similarly for $v^j_t(Q(m,n)) := Q(m,n)^\top w^j_t$.

Lemma D.7. Suppose $G_V = \{V^j = (v^j_1, \dots, v^j_T) \mid j \in S,\ v^j_t \in \mathbb{R}^k,\ v^j_t$ satisfying (43), with $|S|$ largest$\}$, where the conditions in (43) are
$$|v^j_{t,k}| \geq \frac{\delta_V}{\sqrt{T}\,\epsilon_B}\ \forall j,\ t\in[T];\qquad \sqrt{\sum_{t=1}^T|v^j_t|^2} \leq \frac{C_V\,\delta_V}{\epsilon_B};\qquad \|V^i - V^j\|_F = \sqrt{\sum_{t=1}^T|v^i_t - v^j_t|^2} \geq \frac{\delta_V}{\epsilon_B}\ (i \neq j) \tag{43}$$
where $C_V$ is a universal constant with $4 < C_V < 5$. For $m, n \in [M_B]$, let $G_W(P(m,n)) := \{W^j = (w^j_1, \dots, w^j_T) \mid \exists V^j \in G_V\ \text{s.t.}\ W^j = P(m,n)V^j\}$ and similarly for $G_W(Q(m,n))$. Then let $G_{BW} = \{(B, W) \mid \exists m, n \in [M_B],\ W^m \in G_W(P(m,n)),\ W^n \in G_W(Q(m,n)),\ \text{s.t.}\ BW \in \{B^mW^m, B^nW^n\}\}$. We claim that $G_{BW}$ is a $\delta_V$-separated subset of $G_1$ in Frobenius norm.
Next, we construct a separated set for $G_1$. First, let $G_B = \{B_1, \dots, B_{M_B}\}$ be an $\epsilon_B$-separated set for the metric $d_s$ on $\mathrm{Gr}_{d,k}$, where $\epsilon_B \le \min(\frac{c_1}{2}, 1)$ with $c_1$ as in Lemma D.5. Then denote $(B_m)^\top B_n = P(m,n)\, D(m,n)\, Q(m,n)^\top$, where $P(m,n), Q(m,n) \in O_{k,k}$, $D(m,n) = \mathrm{diag}(\sigma_1(m,n), \dots, \sigma_k(m,n))$, and $P(m,n) = Q(m,n) = D(m,n) = I_k$ iff $m = n$. On the other hand, for $t \in [T]$, we denote by $v^j_{t,i}(P(m,n))$ the $i$-th component of $v^j_t(P(m,n)) := P(m,n)^\top w^j_t$, and similarly for $v^j_t(Q(m,n)) := Q(m,n)^\top w^j_t$.

Lemma D.7. Suppose $G_V = \{V^j = (v^j_1, \dots, v^j_T) \mid j \in S,\ v^j_t \in \mathbb{R}^k,\ v^j_t$ satisfy Equ. (43) and attain the largest $|S|\}$, where (43) consists of the conditions
$$|v^j_{t,k}| \ge \frac{\delta_V}{\sqrt{T}\,\epsilon_B}\ \ \forall j,\ t \in [T], \qquad \sqrt{\sum_{t=1}^{T}|v^j_t|^2} \le \frac{C_V\,\delta_V}{\epsilon_B}, \qquad \|V^i - V^j\|_F = \sqrt{\sum_{t=1}^{T}|v^i_t - v^j_t|^2} \ge \frac{\delta_V}{\epsilon_B}, \quad (43)$$
where $C_V$ is a universal constant with $4 < C_V < 5$. For $m, n \in [M_B]$, let $G_W(P(m,n)) := \{W^j = (w^j_1, \dots, w^j_T) \mid \exists\, V^j \in G_V$ s.t. $W^j = P(m,n)V^j\}$, and similarly for $G_W(Q(m,n))$. Then let $G_{BW} = \{(B, W) \mid \exists\, m, n \in [M_B],\ W^m \in G_W(P(m,n)),\ W^n \in G_W(Q(m,n))$, s.t. $BW \in \{B_m W^m, B_n W^n\}\}$, and we claim that $G_{BW}$ is a $\delta_V$-separated subset of $G_1$ in the Frobenius norm.

Proof of Lemma D.7. For each $t \in [T]$, we distinguish two cases.

Case 1. For the case $m \ne n$, we work out the lower bound of Equ. (40). Since for any $m \ne n$
$$d_s(m,n) = \sqrt{1 - \sigma_k^2\big((B_m)^\top B_n\big)} \ge \epsilon_B \ \Longrightarrow\ 1 - \sigma_k \ge \frac{\epsilon_B^2}{1 + \sigma_k} > \frac{\epsilon_B^2}{2},$$
combined with the first inequality of Equ. (43), we know by the definition of $G_{BW}$ that there exist some $i, j$ such that
$$\sum_{t=1}^{T}|B_m w^m_t - B_n w^n_t|^2 \ge 2\sum_{t=1}^{T}|v^i_{t,k}||v^j_{t,k}|(1 - \sigma_k) \ge 2\, T\cdot\frac{\delta_V^2}{T\epsilon_B^2}\cdot\frac{\epsilon_B^2}{2} = \delta_V^2. \quad (45)$$

Case 2. For the case $m = n$, note that $\sigma_i = 1$ for all $i \in [k]$. Combining Equ. (38), Equ. (39), Equ. (43) and the condition $\epsilon_B \le \min(\frac{c_1}{2}, 1)$, there exist some $i \ne j$ such that
$$\sum_{t=1}^{T}|B_m w^i_t - B_m w^j_t|^2 = \sum_{t=1}^{T}\sum_{l=1}^{k}(v^i_{t,l} - v^j_{t,l})^2 = \sum_{t=1}^{T}|v^i_t - v^j_t|^2 \ge \frac{\delta_V^2}{\epsilon_B^2} \ge \delta_V^2. \quad (46)$$

Combining the two cases, we see that for any $m, n \in [M_B]$ and any $W^m \in G_W(P(m,n))$, $W^n \in G_W(Q(m,n))$ such that $B_m = B_n$ and $W^m = W^n$ do not hold simultaneously, we have
$$\|B_m W^m - B_n W^n\|_F = \sqrt{\sum_{t=1}^{T}|B_m w^m_t - B_n w^n_t|^2} \ge \delta_V. \quad (47)$$

Proof of Theorem D.4. From the construction in Lemma D.7, we flatten each $V^j \in G_V$ into a $kT$-vector $\eta^j \in \mathbb{R}^{kT}$. The last two conditions in (43) show that $\{\eta^j\}$ is a $\frac{\delta_V}{\epsilon_B}$-separated set contained in a ball of radius $\frac{C_V\delta_V}{\epsilon_B}$ in the $\ell_2$-norm. The first condition amounts to removing a small central slab along every axis of $\eta^j$ from that ball, and it is clear that $G_V$ has the same order of cardinality if we drop the first inequality of (43). So, writing $\mathrm{card}$ for the cardinality of a set, we get
$$\ln(\mathrm{card}(G_V)) \gtrsim kT. \quad (48)$$
Then from the definitions of $G_W$ and $G_{BW}$ in Lemma D.7, we see that
$$\ln(\mathrm{card}(G_{BW})) = \ln\Big(\big(\tfrac{M_B(M_B-1)}{2}\cdot 2 + M_B\big)\,\mathrm{card}(G_V)\Big) \ge 2\ln(M_B) + \ln(\mathrm{card}(G_V)) \ge k(d-k)\ln(c_1/\epsilon_B) + kT \gtrsim k(d-k) + kT,$$
using (48) and $\epsilon_B \le \min(\frac{c_1}{2}, 1)$. Choosing $\Delta_1 = \delta_1\epsilon_B/C_V$ finishes the proof.

Proof of Theorem D.3. Let $\underline{\lambda} := \min_t \sigma_{\min}((\Sigma^*_t)^{1/2})$, $\bar{\lambda} := \max_t \sigma_{\max}((\Sigma^*_t)^{1/2})$ and $\kappa = \bar{\lambda}/\underline{\lambda}$. We construct the local packing following Lemma D.7, using $\widetilde{W}$ in place of $W$, where $\tilde{w}_t = \sqrt{n_t}\, w_t$. We choose $\delta'_1 = 0.9\,\delta_1$ where $\delta_1 = \delta_V/\epsilon_B$. Then we have
$$\sqrt{\sum_{t=1}^{T}\|X_t(B_i w^i_t - B_j w^j_t)\|_2^2} \le 1.1\,\bar{\lambda}\,\|B_i\widetilde{W}^i - B_j\widetilde{W}^j\|_F \le 1.1\,\bar{\lambda}\, C_V\,\delta_1 = 1.1\,\bar{\lambda}\, C_V\,\frac{\delta'_1}{0.9} < 6\,\bar{\lambda}\,\delta'_1, \quad (50)$$
$$\sqrt{\sum_{t=1}^{T}\|X_t(B_i w^i_t - B_j w^j_t)\|_2^2} \ge 0.9\,\underline{\lambda}\,\|B_i\widetilde{W}^i - B_j\widetilde{W}^j\|_F \ge \delta'_1\,\underline{\lambda}. \quad (51)$$
Here for convenience we choose $C_V = 4.5$; this only affects the universal constant, since $C_V = \Theta(1)$ as in Lemma D.7. Note that the sum of excess risks on the source tasks in (50), (51) is in fact a semi-metric between $(B_i, W^i)$ and $(B_j, W^j)$, and it is easy to construct the corresponding $\delta'_1\underline{\lambda}$-separated set $G'_{BW}$ from the set $G_{B\widetilde{W}}$ obtained in Lemma D.7. Recall that $Y_t = X_t B^* w^*_t + Z_t$; define $Y_t \sim P^j_t$ with $P^j_t = \mathcal{N}(X_t B_j w^j_t, \sigma^2 I_{n_t})$, and further let $P^j := \prod_{t=1}^{T} P^j_t$. Then, by the independence across tasks, the Kullback-Leibler divergence satisfies
$$D(P^i \| P^j) = \sum_{t=1}^{T} D(P^i_t \| P^j_t) = \frac{1}{2\sigma^2}\sum_{t=1}^{T}\|X_t(B_i w^i_t - B_j w^j_t)\|_2^2 \le \frac{18\,\bar{\lambda}^2(\delta'_1)^2}{\sigma^2}. \quad (52)$$
Since $G'_{BW}$ is a $\delta'_1\underline{\lambda}$-separated set over $G_1$, which is a local packing of $G_0$, we let $M = M(G_1, \|\cdot\|_F, \delta'_1)$ and obtain the following Fano lower bound (Wainwright, 2019):
$$\inf_{(\hat{B},\hat{W})} \sup_{(B^*,W^*)} \sum_{t=1}^{T}\|X_t(\hat{B}\hat{w}_t - B^* w^*_t)\|_2^2 \ge (0.9\,\underline{\lambda})^2 \inf_{(\hat{B},\hat{W})} \sup_{(B^*,W^*)} \|\hat{B}\hat{W} - B^* W^*\|_F^2 \gtrsim (\delta'_1\underline{\lambda})^2\Big(1 - \frac{\frac{1}{M^2}\sum_{i,j=1}^{M} D(P^i\|P^j) + \ln 2}{\ln M}\Big) =: (\delta'_1\underline{\lambda})^2\, C_{Fano}. \quad (53)$$
Besides, let $c'_2$ be the universal constant in Theorem D.4. Note that $d, T > k \ge 1$ and thus $\frac{c'_2(k(d-k)+kT)}{3} > \ln 2$; we set $(\delta'_1)^2 = \frac{c'_2\,\sigma^2(k(d-k)+kT)}{108\,\bar{\lambda}^2}$, which ensures $C_{Fano} \ge \frac{1}{2}$.
Then finally we have
$$\inf_{(\hat{B},\hat{W})} \sup_{(B^*,W^*)} \sum_{t=1}^{T}\|X_t(\hat{B}\hat{w}_t - B^* w^*_t)\|_2^2 \gtrsim \frac{\sigma^2(k(d-k) + kT)}{\kappa^2}. \quad (54)$$
From Assumption 2.2 and our notation above, $\kappa^2 = (\bar{\lambda}/\underline{\lambda})^2 = \Theta(1)$, so we finish the proof.

D.2. Main proof for the ER bound of P/A-MTRL

Lemma D.8. Denote, for any $p \in \mathbb{N}^+$,
$$\nu_p(w^*_{T+1}) = \arg\min_{\nu}\|\nu\|_p \quad \text{s.t.}\ W^*\nu = w^*_{T+1}, \quad (55)$$
and let
$$H(c_w) = \{w \in \mathbb{R}^k \mid \|w\|_2 = c_w\} \quad (56)$$
with constant $c_w > 0$. Then for any fixed $W^*$, we have
$$\sup_{w^*_{T+1}\in H(c_w)}\|\nu_p(w^*_{T+1})\|_2 = \frac{c_w}{\sigma_{\min}(W^*)}, \qquad \sup_{w^*_{T+1}\in H(c_w)}\|\nu_1(w^*_{T+1})\|_1 \le \frac{\sqrt{k}\, c_w}{\sigma_{\min}(W^*)}. \quad (57)$$

Proof of Lemma D.8. First equality of (57). First, by definition, we directly have for any $w^*_{T+1}$:
$$\sigma_{\min}(W^*)\,\|\nu_p(w^*_{T+1})\|_2 \le \|W^*\nu_p(w^*_{T+1})\|_2 = \|w^*_{T+1}\|_2. \quad (58)$$
Next, we prove a matching lower bound to establish the equality. Let $W^* = UDV^\top$, where $U \in O_{k\times k}$, $V \in O_{T\times k}$, $D = \mathrm{diag}(\sigma_1(W^*), \dots, \sigma_k(W^*))$ with $\sigma_1(W^*) \ge \dots \ge \sigma_k(W^*)$. There always exists a $w'$ satisfying $\|w'\|_2 = c_w$ and $w' \parallel U e_k$. (59) It is then easy to see that the corresponding $\nu_p(w')$ satisfies $V^\top\nu_p(w') = \|w'\|_2\,(\sigma_{\min}(W^*))^{-1} e_k$. After rearranging, we have
$$\frac{\|w'\|_2}{\sigma_{\min}(W^*)} = \Big\|\frac{\|w'\|_2}{\sigma_{\min}(W^*)}\, e_k\Big\|_2 = \|V^\top\nu_p(w')\|_2 \le \|\nu_p(w')\|_2 \le \sup_{w^*_{T+1}}\|\nu_p(w^*_{T+1})\|_2. \quad (60)$$
Combining (58) and (60) finishes the first part.

Second inequality of (57). It is easy to upper bound
$$\|\nu_1(w^*_{T+1})\|_1 \le \sqrt{\|\nu_1(w^*_{T+1})\|_0}\;\|\nu_1(w^*_{T+1})\|_2 \le \sqrt{\|\nu_1(w^*_{T+1})\|_0}\;\frac{\|w^*_{T+1}\|_2}{\sigma_{\min}(W^*)}, \quad (61)$$
where the last inequality again comes from (58) and the definition $W^*\nu_1(w^*_{T+1}) = w^*_{T+1}$. Now we can upper bound $\|\nu_1(w^*_{T+1})\|_0$ by $k$ from the following argument. Note that the original $\ell_1$-minimization for the underdetermined linear system $W^*\nu = w^*_{T+1}$ is equivalent to finding the solution of the linear programming problem
$$\min_{\tilde{\nu}\ge 0}\ \mathbf{1}^\top\tilde{\nu} \quad \text{s.t.}\ \widetilde{W}\tilde{\nu} = w^*_{T+1}, \quad (62)$$
where $\mathbf{1} := (1, \dots, 1)^\top \in \mathbb{R}^{2T}$, $\tilde{\nu} := (\nu_+^\top, \nu_-^\top)^\top$, $\nu_+ := \max(\nu, 0)$, $\nu_- := \max(-\nu, 0)$ and $\widetilde{W} := (W^*, -W^*) \in \mathbb{R}^{k\times 2T}$. Since $W^*\nu = w^*_{T+1}$ is feasible, there exists at least one optimal solution which is a basic feasible solution of the LP (62). From Def. 2.9 and Theorem 2.3 in (Bertsimas & Tsitsiklis, 1997), we know that the cardinality of the basis of a basic feasible solution is $\mathrm{rank}(\widetilde{W}) = k$, so $\nu_1$ is at most $k$-sparse, i.e., $\|\nu_1\|_0 \le k$.
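The LP reformulation (62) also suggests a direct way to compute $\nu_1$ and to observe the $k$-sparsity guaranteed by the basic-feasible-solution argument. A minimal sketch (ours), using scipy's LP solver on randomly generated data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
k, T = 5, 40
W = rng.standard_normal((k, T))       # plays the role of W* (k x T)
w_target = rng.standard_normal(k)     # plays the role of w*_{T+1}

# LP (62): min 1^T nu_tilde  s.t.  [W, -W] nu_tilde = w_target, nu_tilde >= 0,
# where nu = nu_plus - nu_minus recovers the l1-minimizing solution.
res = linprog(c=np.ones(2 * T),
              A_eq=np.hstack([W, -W]),
              b_eq=w_target,
              bounds=[(0, None)] * (2 * T),
              method="highs")
nu = res.x[:T] - res.x[T:]

print(np.allclose(W @ nu, w_target))  # feasibility: W nu = w_target
print(np.sum(np.abs(nu) > 1e-8))      # support size: at most k = rank(W)
```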
We next show the lemma that reflects our motivation for lower bounding $\|\tilde{\nu}_2\|_1^2$.

Lemma D.9. Assume the conditions of Theorem 3.7 hold, $N_{tot} \to \infty$, and $W^*$ can be any matrix in $\Gamma(\sigma_k) = \{W \in \mathbb{R}^{k\times T} \mid \sigma_{\min}(W) \ge \sigma_k\}$. Then for L2-A-MTRL and P-MTRL we have
$$\sup_{w^*_{T+1}\in H(c_w)}\|\tilde{\nu}_2(w^*_{T+1})\|_1^2 \ge \frac{T\, c_w^2}{N_{tot}\,\sigma^2_{\min}(W^*)}. \quad (63)$$

Proof of Lemma D.9. For passive learning, we can in fact choose any $\nu_p$ such that $W^*\nu_p(w^*_{T+1}) = w^*_{T+1}$; then, since $n_t = N_{tot}/T$ for every task, Lemma D.8 gives
$$\sup_{w^*_{T+1}\in H(c_w)}\|\tilde{\nu}_p(w^*_{T+1})\|_1^2 = \frac{T}{N_{tot}}\sup_{w^*_{T+1}\in H(c_w)}\|\nu_p(w^*_{T+1})\|_1^2 \ge \frac{T\, c_w^2}{N_{tot}\,\sigma^2_{\min}(W^*)}. \quad (64)$$
For the L2 strategy we have $n_t = \max\{c'\nu_2(t)^2, \bar{N}\}$. Referring to the SVD of $W^*$ in Lemma D.8 and the worst-case target vector $w'$ defined in (60), we have
$$\nu_2(w') = V D^{-1} U^\top w' = \|w'\|_2\, V D^{-1} U^\top U e_k = \|w'\|_2\,\sigma^{-1}_{\min}(W^*)\, V_{:,k}, \quad (65)$$
where $V_{:,k}$ is the $k$-th column vector of $V \in O_{T,k}$. Since $N_{tot} \gg T\bar{N}$ and $\|\nu_2\|_2 = \|w'\|_2\,\sigma^{-1}_{\min}(W^*)\|V_{:,k}\|_2 = \|w'\|_2\,\sigma^{-1}_{\min}(W^*)$, for any $t$ in the support we have
$$n_t \approx N_{tot}\,\frac{|\nu_2(t)|^2}{\|\nu_2\|_2^2} = N_{tot}\, V_{t,k}^2. \quad (66)$$
So as $N_{tot} \to +\infty$, $n_t \to \infty$ for every $t$ with $|V_{t,k}| > 0$. Note that the minimax lower bound used in Theorem 3.7 is proved by applying Fano's inequality to the $\delta_V$-separated subset from Lemma D.7, and the corresponding separated set $G_W$ for $W \in \mathbb{R}^{k\times T}$ is constructed from $G_V$. Clearly $G'_W := \{W \in G_W \mid W = UDV^\top,\ \exists\, t \in [T]\ \text{s.t.}\ V_{t,k} = 0\}$ occupies a zero-volume subset of $G_W$, and thus we can use $G_W \setminus G'_W$ in place of the original $G_W$ by excluding a corresponding zero-volume set in (43) of Lemma D.7, which makes no difference to the original results. So, setting $\|w'\|_2 = c_w$, with probability $1-o(1)$ we have $|V_{t,k}| > 0$ for all $t$, and thus
$$\sup_{w^*_{T+1}\in H(c_w)}\|\tilde{\nu}_2(w^*_{T+1})\|_1^2 \ge \Big(\sum_{t=1}^{T}\frac{|\nu_2(t)|}{\sqrt{c'|\nu_2(t)|^2}}\Big)^2 = \frac{T^2}{c'} \ge \frac{T\, c_w^2}{N_{tot}\,\sigma^2_{\min}(W^*)}, \quad (67)$$
where $c' = N_{tot}\,\sigma^2_{\min}(W^*)\, c_w^{-2}$.

We then prove a simple lemma showing that the alignment $\|Av\| \approx \|A\|_F\|v\|$ is stable under perturbation.

Lemma D.10. Assume $v \in \mathbb{R}^b$, $A, \Delta A \in \mathbb{R}^{a\times b}$ and $\|\Delta A\|_F = c\,\|A\|_F$ for some $a, b \in \mathbb{N}^+$ and $c \in (0,1)$. Further assume that $A$ satisfies $\|Av\| = \|A\|_F\,\|v\|$. Then
$$\|(A + \Delta A)v\| \ge \frac{1-c}{1+c}\,\|A + \Delta A\|_F\,\|v\|. \quad (68)$$

Proof of Lemma D.10. We prove it directly:
$$\|(A+\Delta A)v\| \ge \|Av\| - \|\Delta A\, v\| = \|A\|_F\|v\| - \|\Delta A\|\,\|v\| \ge \big(\|A\|_F - \|\Delta A\|_F\big)\|v\| = \frac{1-c}{1+c}\big(\|A\|_F + \|\Delta A\|_F\big)\|v\| \ge \frac{1-c}{1+c}\,\|A+\Delta A\|_F\,\|v\|. \quad (69)$$

With this lemma, we can prove an important lemma for the lower bound of L2-A-MTRL and P-MTRL.

Lemma D.11. Recall the definitions of $\nu_p(w^*_{T+1})$ and $H(c_w)$ in (55) and (56). We have the following result for the $L^2$-minimization solution:
$$\inf_{(\hat{B},\widehat{\widetilde{W}})}\ \sup_{(B^*,\widetilde{W}^*,\ w^*_{T+1}\in H(c_w))} \big\|(\hat{B}\widehat{\widetilde{W}} - B^*\widetilde{W}^*)\,\tilde{\nu}_2(w^*_{T+1})\big\|_2^2 \gtrsim \sigma^2 k(d-k)\cdot\frac{T\, c_w^2}{k\, N_{tot}\,\sigma^2_{\min}(W^*)}. \quad (70)$$

Proof of Lemma D.11. The key idea is to find some $\widetilde{W}^*$ such that $\|(\hat{B}\widehat{\widetilde{W}} - B^*\widetilde{W}^*)\tilde{\nu}_2\| \approx \|\hat{B}\widehat{\widetilde{W}} - B^*\widetilde{W}^*\|_F\,\|\tilde{\nu}_2\|$, which happens when all the row vectors of $\widetilde{W}^*$ are almost aligned with $\tilde{\nu}_2$. Without loss of generality, we assume $\nu_2(t) \ne 0$ for all $t \in [T]$; thus, when $N_{tot} \to \infty$, we have $n_t = c'|\nu_2(t)|^2$ for all $t \in [T]$, where $c' \gg 1$ is the constant satisfying $c' = N_{tot}/\|\nu_2\|_2^2$. We prove the lemma step by step.

First, we construct a specific $\widetilde{W}^*$, which is almost rank-1 and has rows aligned with $\tilde{\nu}_2$, to achieve the lower bound. For any given $\nu(t)$, we define
$$\widetilde{W}^* := \frac{1}{\sqrt{c'}}\, u\,\chi^\top + \Delta\widetilde{W}, \quad (71)$$
where $u \in \mathbb{R}^k$ is a vector to be determined later,
$$\chi(t) := \mathrm{sgn}(\nu_2(t)) = \mathbb{I}[\nu_2(t) > 0] - \mathbb{I}[\nu_2(t) < 0], \quad \chi \in \mathbb{R}^T, \quad (72)$$
$$\Delta\widetilde{W} := \sum_{i=2}^{k}\tilde{\sigma}_i\,\tilde{\alpha}_i\tilde{\beta}_i^\top. \quad (73)$$
Obviously $\|\chi\| = \sqrt{T}$. Here $\tilde{\alpha}_i \in \mathbb{R}^k$ and $\tilde{\beta}_i \in \mathbb{R}^T$ for $i \in \{2, \dots, k\}$, and we let $\{u/\|u\|_2, \tilde{\alpha}_2, \dots, \tilde{\alpha}_k\}$ and $\{\chi/\sqrt{T}, \tilde{\beta}_2, \dots, \tilde{\beta}_k\}$ be orthonormal bases of $k$-dimensional subspaces of $\mathbb{R}^k$ and $\mathbb{R}^T$, respectively. The reason for this definition of $\widetilde{W}^*$ is that Eqn. (71) is then naturally an SVD of $\widetilde{W}^*$. And for simplicity, we let $\tilde{\sigma}_i \equiv \tilde{\sigma}_k$ for $i \in \{2, \dots, k\}$. We then require
$$\Big\|\frac{1}{\sqrt{c'}}\, u\,\chi^\top\Big\|_F = 2\,\|\Delta\widetilde{W}\|_F \ \Longrightarrow\ \tilde{\sigma}_k = \frac{\|u\|\sqrt{T}}{2\sqrt{(k-1)\, c'}}. \quad (74)$$
Then, from $\tilde{\nu}_2(t) = \nu_2(t)/\sqrt{n_t} = \chi(t)/\sqrt{c'}$ and $\tilde{w}^*_t = \sqrt{n_t}\, w^*_t$, we have
$$W^*\nu_2 = \widetilde{W}^*\tilde{\nu}_2 = \frac{T}{c'}\, u + \sum_{i=2}^{k}\tilde{\sigma}_k\,\tilde{\alpha}_i\tilde{\beta}_i^\top\frac{\chi}{\sqrt{c'}} = \frac{T}{c'}\, u. \quad (75)$$
Since $W^*\nu_2 = w^*_{T+1} \in H(c_w)$, i.e., $c_w = \|W^*\nu_2\|$, we have the following conditions for $u$ and $\tilde{\sigma}_k$:
$$\|u\| = \frac{c_w\, c'}{T}, \qquad \tilde{\sigma}_k = \frac{c_w}{2}\sqrt{\frac{c'}{(k-1)\, T}}. \quad (76)$$
We then choose $\nu_2 = \nu' := \mathbf{1} = [1, \dots, 1]^\top \in \mathbb{R}^T$, so $\chi = \nu' = \mathbf{1}$, $\|\nu'\| = \sqrt{T}$, and for $\Delta\widetilde{W}$ we have
$$\Delta\widetilde{W} = \tilde{\sigma}_k\sum_{i=2}^{k}\tilde{\alpha}_i\tilde{\beta}_i^\top, \quad (77)$$
$$\sigma_{\min}(\widetilde{W}^*) = \tilde{\sigma}_k = \frac{c_w}{2}\sqrt{\frac{c'}{(k-1)\, T}}, \quad (78)$$
which results in $T = \|\nu'\|^2 > c_w^2\,[4(k-1)\sigma_k^2]^{-1}$. And we get
$$\|\tilde{\nu}'\|_2^2 = \frac{T}{c'} = \frac{T^2}{N_{tot}} \ge \frac{T\, c_w^2}{k\,\sigma_k^2\, N_{tot}}. \quad (79)$$
We check that $\nu'$ is a valid choice for the $L^2$-minimization solution. Let $\widetilde{W}^* = UDV^\top$ be the SVD of $\widetilde{W}^*$, where $U = (u/\|u\|, \tilde{\alpha}_2, \dots, \tilde{\alpha}_k) \in O_{k\times k}$, $V = (\chi/\sqrt{T}, \tilde{\beta}_2, \dots, \tilde{\beta}_k) \in O_{T\times k}$, $D = \mathrm{diag}(\sigma_1, \dots, \sigma_k)$, $\sigma_1 \ge \dots \ge \sigma_k \ge 0$. Note that
$$V D^{-1} U^\top\,\widetilde{W}^*\nu' = V V^\top\nu' = \Big(\frac{1}{T}\chi\chi^\top + \sum_{i=2}^{k}\tilde{\beta}_i\tilde{\beta}_i^\top\Big)\nu' = \chi + 0 = \nu'. \quad (80)$$
Therefore $\nu'$ is the minimum-$\ell_2$-norm solution of $\widetilde{W}^*\nu = \widetilde{W}^*\nu'$; here we use the fact that $\tilde{\beta}_i^\top\nu' = \tilde{\beta}_i^\top\chi = 0$ for $i \in \{2, \dots, k\}$. Finally we have:
$$\inf_{(\hat{B},\widehat{\widetilde{W}})}\ \sup_{(B^*,\widetilde{W}^*,\, w^*_{T+1}\in H(c_w))}\big\|(\hat{B}\widehat{\widetilde{W}} - B^*\widetilde{W}^*)\,\tilde{\nu}_2(w^*_{T+1})\big\|_2^2 \ge \inf_{\hat{B}}\sup_{B^*}\big\|(\hat{B} - B^*)\widetilde{W}^*\,\tilde{\nu}'\big\|_2^2 \gtrsim \inf_{\hat{B}}\sup_{B^*}\big\|(\hat{B} - B^*)\widetilde{W}^*\big\|_F^2\,\|\tilde{\nu}'\|_2^2 \quad \text{(Eqn. 74, Lemma D.10)}$$
$$\gtrsim \inf_{\hat{B}}\sup_{B^*}\sum_{t=1}^{T}\big\|X_t(\hat{B} - B^*)w^*_t\big\|_2^2\cdot\frac{T\, c_w^2}{k\,\sigma_k^2\, N_{tot}} \quad \text{(Eqn. 79)} \qquad \gtrsim \sigma^2 k(d-k)\cdot\frac{T\, c_w^2}{k\,\sigma_k^2\, N_{tot}}. \quad (81)$$
For the first and last inequalities, we restrict the local packing space on $W$ and obtain the results in a manner similar to Theorem D.3.
Specifically, we note that the orthonormal matrix $B$ can be viewed as an element of the Grassmann manifold, which is diffeomorphic to a $k(d-k)$-dimensional linear space (Bai et al., 1992), and the constraint $Bw = 0$ introduces at most $d$ additional linear equations on $B$, which does not influence its local packing number at this order. Therefore, it is straightforward to prove the last inequality using a methodology similar to the proof of Theorem D.3. And we finish the proof.

Finally, we turn to our main theorem in Sec. 3.2.

Proof of Theorem 3.7. From the conditions, we have $c_w = \Theta(1)$.

Upper bound of ER for L1-A-MTRL. From Eqn. (29) in the proof of Lemma 3.2, we get $\|\tilde{\nu}_1\|_1^2 \le (1+o(1))\,\|\nu_1\|_1^2/N_{tot}$ when $N_{tot} \gg T\bar{N}$. Then, using the second inequality of (57) in Lemma D.8, we have
$$\sup_{w^*_{T+1}\in H(c_w)}\|\tilde{\nu}_1(w^*_{T+1})\|_1^2 \lesssim \frac{1}{N_{tot}}\sup_{w^*_{T+1}\in H(c_w)}\|\nu_1(w^*_{T+1})\|_1^2 \le \frac{k\, c_w^2}{N_{tot}\,\sigma^2_{\min}(W^*)}. \quad (82)$$
For the upper bound, let $\tilde{w}_t = \sqrt{n_t}\,\hat{w}_t$, $\tilde{w}^*_t = \sqrt{n_t}\, w^*_t$ and $\tilde{\nu}_1(t) = \nu_1(t)/\sqrt{n_t}$ for all $t \in [T]$. Then we have
$$\mathbb{E}_{x\sim\mu_{T+1}}\big|x^\top(\hat{B}\hat{w}_{T+1} - B^* w^*_{T+1})\big|^2 = \big\|(\Sigma^*_{T+1})^{\frac12}(\hat{B}\hat{W} - B^* W^*)\,\nu_1\big\|_2^2 \le \big\|(\Sigma^*_{T+1})^{\frac12}(\hat{B}\widehat{\widetilde{W}} - B^*\widetilde{W}^*)\big\|_F^2\,\|\tilde{\nu}_1\|_2^2$$
$$= \sum_{t=1}^{T} n_t\,\big\|(\Sigma^*_{T+1})^{\frac12}(\hat{B}\hat{w}_t - B^* w^*_t)\big\|^2\,\|\tilde{\nu}_1\|_2^2 \lesssim \sum_{t=1}^{T} n_t\,\big\|(\Sigma^*_t)^{\frac12}(\hat{B}\hat{w}_t - B^* w^*_t)\big\|^2\,\|\tilde{\nu}_1\|_2^2 \quad \text{(Assumption 2.2)}$$
$$\lesssim \sum_{t=1}^{T}\big\|X_t(\hat{B}\hat{w}_t - B^* w^*_t)\big\|^2\,\|\tilde{\nu}_1\|_2^2 \quad \text{(Lemma D.2)} \qquad \lesssim \sigma^2\big(kd\ln(N_{tot}T) + kT + \ln(\tfrac{1}{\delta})\big)\,\|\tilde{\nu}_1\|_2^2 \quad \text{(Claim C.1 in (Chen et al., 2022))}. \quad (83)$$
Then, combining (83) with (82) (and $\|\tilde{\nu}_1\|_2 \le \|\tilde{\nu}_1\|_1$), we prove the result for L1-A-MTRL.

Lower bound of ER for P-MTRL/L2-A-MTRL. For L2-A-MTRL, we derive the result from Lemma D.11. It is easily verified that the same result holds for P-MTRL, since we set $\nu' = [1, \dots, 1]^\top \in \mathbb{R}^T$ in Lemma D.11.

E. Proof of Theorem 3.10

Before proving the theorem itself, we first state an assumption that is standard for sparse linear models and the Lasso program (Wainwright, 2019):

Assumption E.1. (RE condition) Let $\nu^*$ be supported on a subset $S \subseteq [T]$ with $|S| = s$ (from Theorem 3.7 we know $s \le k$). Then $W^*$ satisfies the Restricted Eigenvalue condition over $S$ with parameters $(\kappa', 3)$ if
$$\|W^*\Delta\|_2^2 \ge \kappa'\,\|\Delta\|_2^2, \quad \forall\,\Delta \in \mathcal{C}_3(S), \quad (84)$$
where $\mathcal{C}_\alpha(S) := \{\Delta \in \mathbb{R}^T \mid \|\Delta_{S^c}\|_1 \le \alpha\,\|\Delta_S\|_1\}$.

We should mention that in this section we consider only L1-A-MTRL, so we write $\hat{\nu}$ and $\nu^*$ for $\hat{\nu}_1$ and $\nu_1$, respectively. Since $\sigma^2_{\max}(W^*) \ge \kappa' \ge \sigma^2_{\min}(W^*)$, we restate Theorem 3.10 with the RE condition as follows. Once we prove the restated theorem, replacing $\kappa'$ with $\sigma^2_{\min}(W^*)$ and $\sigma^2_{\max}(W^*)$ correspondingly proves the original theorem immediately.

Theorem E.2. Let Assumptions 2.1, 2.3, 2.4, 3.8, 3.9 and E.1 hold. Let $\Lambda$ denote the lower bound of $\|\nu^*\|_1$, let $q = \sqrt{k}R/\underline{\sigma}$ (so $q \ge \|\nu^*\|_1$), let $\gamma \ge \max\{2160\, s\, q\, C_W\,\Lambda^{-1}, \sqrt{2160\, s\, q\,\kappa'\,\Lambda^{-1}}\}$, and let $\underline{\sigma} = \sigma_{\min}(W^*) > 0$. Then, in order to achieve $ER_{L1} \le \varepsilon^2$ with probability $1-\delta$, the number of source samples $N_{total}$ needs to be at least
$$\tilde{O}\big(\sigma^2(kd + kT)\,\|\nu^*\|_1^2\,\varepsilon^{-2} + T\beta\big), \quad (85)$$
where $\beta = \max\{\gamma^2\frac{\sigma_z^2}{\kappa'^2},\ \gamma^2\frac{C_W^2}{\kappa'^2}\rho^4,\ \rho^4,\ \frac{\sigma_z^2}{\kappa'}\}\,(d + \ln(\frac{4T}{\delta}))$, and the target-task sample complexity $n_{T+1}$ needs to be at least
$$\tilde{O}\big(\sigma^2 k\,\varepsilon^{-2} + \alpha\big), \quad (86)$$
where $\alpha = \max\{\gamma^2\frac{\sigma_z^2}{\kappa'^2\Lambda^2},\ \gamma^2\frac{C_W^2}{\kappa'^2}\rho^4,\ \rho^4\}\,(k + \ln(\frac{4}{\delta}))$.

Lemma E.3. (A variant of Theorem 7.13 in (Wainwright, 2019)) Assume Assumption E.1 holds. Any solution of the Lagrangian Lasso (16) with regularization parameter lower bounded as $\lambda_k \ge 2\,\|\hat{W}^\top z\|_\infty$ satisfies the bound
$$\|\hat{\nu} - \nu^*\|_1 \le 4\sqrt{s}\,\|\hat{\nu} - \nu^*\|_2. \quad (88)$$

Remark E.4. In Theorem E.5 we want $\epsilon \le \min(0.05, \frac{\kappa'}{4\gamma C_W})$ with high probability, so from Lemma D.2 we need $n_t > \max(400, \frac{16\gamma^2 C_W^2}{\kappa'^2})\, c_a\rho^4(d + \ln(\frac{2T}{\delta}))$ for all $t \in [T]$ and $n_{T+1} > \max(400, \frac{16\gamma^2 C_W^2}{\kappa'^2})\, c_a\rho^4(k + \ln(\frac{2}{\delta}))$ for the universal constant $c_a > 0$.
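As a computational aside, the Lagrangian Lasso program (16) analyzed in Lemma E.3 can be solved with any standard solver. Below is a minimal sketch (ours, on synthetic data), assuming the standard Lagrangian form $\min_\nu \frac12\|\hat{w}_{T+1} - \hat{W}\nu\|_2^2 + \lambda_k\|\nu\|_1$; note the rescaling of the regularization parameter, since scikit-learn normalizes the squared loss by the number of rows:

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_nu(W_hat, w_hat_target, lam):
    """Sketch of program (16): min_nu 0.5*||w_hat - W_hat nu||_2^2 + lam*||nu||_1.
    sklearn minimizes (1/(2m))*||y - X w||^2 + alpha*||w||_1 over m rows,
    so alpha = lam / m matches the Lagrangian form above."""
    m = W_hat.shape[0]  # = k, the representation dimension
    lasso = Lasso(alpha=lam / m, fit_intercept=False, max_iter=100_000)
    lasso.fit(W_hat, w_hat_target)
    return lasso.coef_

# Toy example with a 3-sparse ground-truth nu
rng = np.random.default_rng(2)
k, T = 20, 100
W_hat = rng.standard_normal((k, T))
nu_true = np.zeros(T); nu_true[:3] = [1.0, -0.5, 0.25]
w_hat = W_hat @ nu_true + 0.01 * rng.standard_normal(k)
nu_hat = estimate_nu(W_hat, w_hat, lam=1e-3)
print(np.sum(np.abs(nu_hat) > 1e-4))  # recovers a sparse nu
```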
To bound the regularization parameter $\lambda_k$, we turn to controlling the noise term $z$, since $\hat{W}$ and $\hat{w}_{T+1}$ are obtained by the ordinary least-squares step.

Theorem E.5. Suppose $n^i_t \ge \max\{3\gamma^2\frac{\sigma_z^2}{\kappa'^2},\ 16\gamma^2\frac{C_W^2}{\kappa'^2}c_a\rho^4,\ 400\, c_a\rho^4,\ \frac{12\sigma_z^2}{\kappa'}\}\,(d + \ln(\frac{4T}{\delta}))$ for all $t \in [T]$, $n^i_{T+1} \ge \max\{3\gamma^2\frac{\sigma_z^2}{\kappa'^2\|\nu^*\|_1^2},\ 16\gamma^2\frac{C_W^2}{\kappa'^2}c_a\rho^4,\ 400\, c_a\rho^4\}\,(k + \ln(\frac{4}{\delta}))$, and Assumptions E.1, 3.8, 3.9 hold. Then with probability $1-\delta$ we have
$$\|\hat{\nu} - \nu^*\|_1 \le 2160\,\gamma^{-1}\, s\,\max\{C_W, \kappa'/\gamma\}\,\frac{\sqrt{k}R}{\underline{\sigma}}. \quad (89)$$

Remark E.6. If (89) holds and $\frac{\sqrt{k}R}{\underline{\sigma}} = \Theta(\|\nu^*\|_1)$, then the active learning method with $L^1$-minimization merely multiplies the excess risk bound by an additional constant factor $\big(1 + 2160\,\gamma^{-1} s\,\max\{C_W, \kappa'/\gamma\}\big)^2$:
$$ER_{active} \lesssim \sigma^2\big(kd\ln(N_{tot}T) + kT\big)\,\frac{\|\nu^*\|_1^2}{N_{tot}}\,\big(1 + 2160\,\gamma^{-1} s\,\max\{C_W, \kappa'/\gamma\}\big)^2 + \sigma^2\,\frac{k + \ln(\frac{1}{\delta})}{n_{T+1}}. \quad (90)$$

Proof of Theorem E.5. Substep 1: Decompose $z$. As in the analysis of the ordinary least-squares step in (Chen et al., 2022), for every $t \in [T+1]$ we have
$$\hat{w}^i_t = \arg\min_{w}\|X^i_t\hat{B}^i w - Y_t\|_2 = \big((X^i_t\hat{B}^i)^\top X^i_t\hat{B}^i\big)^{-1}(X^i_t\hat{B}^i)^\top Y_t = \big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^* w^*_t + \frac{1}{n_t}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top(X^i_t)^\top Z_t. \quad (91)$$
Then we have
$$z^i = \hat{w}^i_{T+1} - \hat{W}^i\nu^* = \hat{w}^i_{T+1} - \sum_{t=1}^{T}\hat{w}^i_t\,\nu^*_t = E^i_1 - E^i_2 + E^i_3 - E^i_4, \quad (92)$$
where
$$E^i_1 := \big((\hat{B}^i)^\top\hat{\Sigma}^i_{T+1}\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_{T+1}B^* w^*_{T+1} - \big((\hat{B}^i)^\top\Sigma^*\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\Sigma^* B^* w^*_{T+1},$$
$$E^i_2 := \sum_{t=1}^{T}\Big[\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^* w^*_t - \big((\hat{B}^i)^\top\Sigma^*\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\Sigma^* B^* w^*_t\Big]\nu^*_t,$$
$$E^i_3 := \frac{1}{n_{T+1}}\big((\hat{B}^i)^\top\hat{\Sigma}^i_{T+1}\hat{B}^i\big)^{-1}(\hat{B}^i)^\top(X^i_{T+1})^\top Z_{T+1}, \qquad E^i_4 := \sum_{t=1}^{T}\frac{1}{n_t}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top(X^i_t)^\top Z_t\,\nu^*_t.$$
Here the third equality of (92) uses Equ. (91) and the decomposition uses $w^*_{T+1} = W^*\nu^*$. Each $E^i_k$, $k \in \{1,2,3,4\}$, has zero expectation, and to control the bound of $z$ we just need to bound these four terms in the $\ell_2$-norm for all $i$ and use the inequality
$$\|z^i\|_2 = \|E^i_1 - E^i_2 + E^i_3 - E^i_4\|_2 \le 2\big(\|E^i_1\|_2 + \|E^i_2\|_2 + \|E^i_3\|_2 + \|E^i_4\|_2\big). \quad (93)$$

Substep 2: Bound the error terms $E^i_\cdot$. For the first term, with Inequ. (33) and Assumption 3.9 we have
$$\|E^i_1\|_2 \le \big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_{T+1}\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_{T+1}B^* - \big((\hat{B}^i)^\top\Sigma^*\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\Sigma^* B^*\big\|_2\,\|w^*_{T+1}\|_2 \le \frac{4\epsilon}{1-2\epsilon}\,\big\|\big((\hat{B}^i)^\top\hat{B}^i\big)^{-1}(\hat{B}^i)^\top B^*\big\|_2\,\|w^*_{T+1}\|_2$$
$$\le \frac{4\epsilon}{1-2\epsilon}\,\|w^*_{T+1}\|_2 \quad \big(\sigma_{\max}((\hat{B}^i)^\top B^*) \le 1\big) \qquad \le \frac{4\epsilon}{1-2\epsilon}\, C_W\,\|\nu^*\|_1 \quad \Big(\|w^*_{T+1}\|_2 = \big\|\textstyle\sum_{t=1}^{T} w^*_t\nu^*_t\big\|_2 \le \max_t\|w^*_t\|_2\,\|\nu^*\|_1\Big). \quad (94)$$
The second inequality is related to the subspace angle distance between the estimated and true representations, where $\hat{B}^i$ and $B^*$ are orthonormal matrices whose columns form orthonormal bases of the respective subspaces, as in Section 2 of (Tripuraneni et al., 2021). The second term $E^i_2$ has an upper bound similar to that of $E^i_1$:
$$\|E^i_2\|_2 \le \sum_{t=1}^{T}\big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^* - \big((\hat{B}^i)^\top\Sigma^*\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\Sigma^* B^*\big\|_2\,\|w^*_t\nu^*_t\|_2 \le \frac{4\epsilon}{1-2\epsilon}\, C_W\,\|\nu^*\|_1. \quad (95)$$
For the third term, from Lemma E.8, with probability at least $1 - \frac{\delta}{4}$,
$$\|E^i_3\|_2 \le \frac{1}{n_{T+1}}\,\big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_{T+1}\hat{B}^i\big)^{-1}\big\|_2\,\big\|(\hat{B}^i)^\top(X^i_{T+1})^\top Z_{T+1}\big\|_2 \le \frac{\sigma_z}{1-2\epsilon}\,\sqrt{\frac{(1+2\epsilon)\big(2k + 3\ln(\frac{4}{\delta})\big)}{n_{T+1}}}. \quad (96)$$
Analogously, from Lemma E.8, with probability at least $1 - \frac{\delta}{4}$,
$$\|E^i_4\|_2 \le \sum_{t=1}^{T}\frac{1}{n_t}\,\big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\big\|_2\,\big\|(X^i_t)^\top Z_t\big\|_2\,|\nu^*_t| \le \frac{\sigma_z}{1-2\epsilon}\,\sqrt{\frac{(1+2\epsilon)\big(2d + 3\ln(\frac{4T}{\delta})\big)}{\min_t n_t}}\;\|\nu^*\|_1. \quad (97)$$

Substep 3: Final calculation.
Combining (94), (95), (96), (97) with (93), with probability at least $1-\delta$ we have
$$\|z^i\|_2 \le \frac{16\epsilon}{1-2\epsilon}\, C_W\|\nu^*\|_1 + 2\sqrt{1+2\epsilon}\,\sigma_z\Big(\sqrt{\tfrac{2k+3\ln(4/\delta)}{n_{T+1}}} + \sqrt{\tfrac{2d+3\ln(4T/\delta)}{\min_t n_t}}\,\|\nu^*\|_1\Big) \le \frac{16}{0.9}\cdot\frac{\kappa'}{4\gamma}\,\|\nu^*\|_1 + 2\sqrt{1.1}\cdot\frac{2\kappa'}{\gamma}\,\|\nu^*\|_1 \le \frac{9\kappa'}{\gamma}\,\|\nu^*\|_1 \quad \text{(conditions on $\epsilon$, $n_t$, $n_{T+1}$)}, \quad (98)$$
where we used $\epsilon \le \frac{\kappa'}{4\gamma C_W}$ and the sample-size conditions of the theorem. Choose
$$\lambda_k = 45\,\max\{C_W, \kappa'/\gamma\}\,\frac{\kappa'\sqrt{k}R}{\gamma\,\underline{\sigma}} \ \ge\ 45\,\max\{C_W, \kappa'/\gamma\}\,\frac{\kappa'}{\gamma}\,\|\nu^*\|_1 \ \ge\ 2\big(\max_t\|\hat{w}^i_t\|_2\big)\,\|z^i\|_2 \ \ge\ 2\max_t\big|(\hat{w}^i_t)^\top z^i\big| = 2\,\|\hat{W}^\top z^i\|_\infty, \quad \text{((98), (101))}$$
where (101) is the column-norm bound from Lemma E.7 below. Finally, from Lemma E.3 together with Lemma E.9 ($\hat{W}$ satisfies the RE condition with parameter $\kappa'/4$), any solution of (16) with this regularization parameter satisfies
$$\|\hat{\nu} - \nu^*\|_1 \le 4\sqrt{s}\,\|\hat{\nu} - \nu^*\|_2 \le 4\sqrt{s}\cdot\frac{3\sqrt{s}\,\lambda_k}{\kappa'/4} = \frac{48\, s\,\lambda_k}{\kappa'} = 2160\,\gamma^{-1}\, s\,\max\{C_W, \kappa'/\gamma\}\,\frac{\sqrt{k}R}{\underline{\sigma}} \quad \text{(Lemmas E.3, E.9)},$$
which is exactly (89).

Lemma E.7. Assume the conditions in Theorem E.5 hold. Then the norms of the column vectors of $\hat{W}$ have upper bounds similar to those of $W^*$:
$$\|\hat{w}^i_t\|_2 \le \frac{5}{2}\,\max\{C_W, \kappa'/\gamma\}. \quad (101)$$

Proof of Lemma E.7. This follows by direct calculation, as in (95) and (97):
$$\|\hat{w}^i_t\|_2 = \Big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^* w^*_t + \frac{1}{n_t}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top(X^i_t)^\top Z_t\Big\|_2 \le \big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^*\big\|_2\,\|w^*_t\|_2 + \frac{1}{n_t}\big\|\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\big\|_2\,\big\|(X^i_t)^\top Z_t\big\|_2$$
$$\le \frac{1+2\epsilon}{1-2\epsilon}\, C_W + \frac{\sqrt{1+2\epsilon}}{1-2\epsilon}\,\sigma_z\sqrt{\frac{2d+3\ln(4T/\delta)}{n_t}} \le \frac{5}{2}\,\max\{C_W, \kappa'/\gamma\}.$$

Lemma E.8. Assume Assumption 3.9 holds. For any $t \in [T]$, with probability $1 - \frac{\delta}{4}$,
$$\big\|(X^i_t)^\top Z_t\big\|_2 \le \sigma_z\sqrt{n_t(1+2\epsilon)\big(2d + 3\ln(\tfrac{4T}{\delta})\big)}. \quad (103)$$
As for the target task, for any $B \in \mathbb{R}^{d\times k}$ that is independent of $Z_{T+1}$, with probability $1 - \frac{\delta}{4}$,
$$\big\|B^\top(X^i_{T+1})^\top Z_{T+1}\big\|_2 \le \sigma_z\sqrt{n_{T+1}(1+2\epsilon)\big(2k + 3\ln(\tfrac{4}{\delta})\big)}. \quad (104)$$

Proof of Lemma E.8. We first prove (104). Using the SVD, we have $B^\top(X^i_{T+1})^\top = U_{BX} D_{BX} V_{BX}^\top$, where $U_{BX} \in O_{k\times k}$, $V_{BX} \in O_{n\times k}$, $D_{BX} = \mathrm{diag}\big(\sigma_1(B^\top(X^i_{T+1})^\top), \dots, \sigma_k(B^\top(X^i_{T+1})^\top)\big)$. Let $Q := V_{BX}^\top Z_{T+1}$; we know $Q \sim \mathcal{N}(0, \sigma_z^2 I_k)$, since $B$ and $X^i_{T+1}$ are independent of $Z_{T+1}$, and hence so is $V_{BX}$. Note that $\frac{1}{\sigma_z^2}\|Q\|_2^2 \sim \chi^2(k)$, and thus with probability at least $1 - \frac{\delta}{4}$ we have (Laurent & Massart, 2000)
$$\frac{1}{\sigma_z^2}\|Q\|_2^2 \le k + 2\sqrt{k\ln(\tfrac{4}{\delta})} + 2\ln(\tfrac{4}{\delta}) \le 2k + 3\ln(\tfrac{4}{\delta}). \quad (105)$$
Then, using (105), with probability at least $1 - \frac{\delta}{4}$,
$$\big\|B^\top(X^i_{T+1})^\top Z_{T+1}\big\|_2^2 = Z_{T+1}^\top X^i_{T+1} B B^\top(X^i_{T+1})^\top Z_{T+1} = Z_{T+1}^\top V_{BX} D_{BX}^2 V_{BX}^\top Z_{T+1} = \sum_{j=1}^{k}\sigma_j^2\big(B^\top(X^i_{T+1})^\top\big)\, Q_j^2 \le \sigma_{\max}\big((X^i_{T+1})^\top X^i_{T+1}\big)\,\|Q\|_2^2 \le n_{T+1}(1+2\epsilon)\,\sigma_z^2\big(2k + 3\ln(\tfrac{4}{\delta})\big) \quad \text{(Assumption 3.9, (105))}. \quad (106)$$
As for the source tasks, (104) does not apply directly, since $\hat{B}^i$ is not independent of $X^i_t$ and $Z_t$. To get (103), we just need to note that $\mathrm{rank}(X^i_t) = d$; the other steps are similar to the proof above.
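The chi-square tail bound (105) from (Laurent & Massart, 2000) that drives Lemma E.8 is easy to check by simulation. A small Monte Carlo sketch (ours, with arbitrary parameter values):

```python
import numpy as np

# Monte Carlo check of the chi-square tail bound (105) used in Lemma E.8:
# P( ||Q||^2 / sigma_z^2 >= k + 2*sqrt(k*ln(4/delta)) + 2*ln(4/delta) ) <= delta/4
# for Q ~ N(0, sigma_z^2 I_k) (Laurent & Massart, 2000).
rng = np.random.default_rng(3)
k, delta, sigma_z, trials = 30, 0.1, 2.0, 200_000

Q = sigma_z * rng.standard_normal((trials, k))
stat = (Q ** 2).sum(axis=1) / sigma_z ** 2   # ~ chi^2(k)
thresh = k + 2 * np.sqrt(k * np.log(4 / delta)) + 2 * np.log(4 / delta)

print((stat >= thresh).mean(), delta / 4)    # empirical tail <= delta/4
```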
Lemma E.9. If all the conditions of Theorem E.5 hold, then $\hat{W}$ satisfies the RE condition with parameters $(\kappa'/4, 3)$.

Proof of Lemma E.9. Apply the SVD $\frac{1}{\sqrt{n_t}}(X^i_t)^\top = U_t D_t V_t^\top$, where $U_t \in O_{d\times d}$, $V_t \in O_{n\times d}$, $D_t = \mathrm{diag}(\sigma_{1,t}, \dots, \sigma_{d,t})$. For a test vector $\Delta \in \mathbb{R}^T$, let $Q_t := V_t^\top Z_t\,\Delta_t$; we know $Q_t \sim \mathcal{N}(0, \sigma_z^2\Delta_t^2 I_d)$, since $X^i_t$ and $\Delta_t$ are independent of $Z_t$, and hence so is $V_t$. Furthermore, $\sum_{t=1}^{T}\frac{1}{\sqrt{n_t}}U_t D_t Q_t \sim \mathcal{N}\big(0, \sigma_z^2\sum_{t=1}^{T}\frac{1}{n_t}\Delta_t^2\, U_t D_t^2 U_t^\top\big) = \mathcal{N}\big(0, \sigma_z^2\sum_{t=1}^{T}\frac{1}{n_t}\Delta_t^2\,\hat{\Sigma}^i_t\big)$, by task independence. Notice that
$$(1-2\epsilon)\, I_d \preceq \hat{\Sigma}^i_t = \frac{1}{n_t}(X^i_t)^\top X^i_t = U_t D_t^2 U_t^\top \preceq (1+2\epsilon)\, I_d \quad \text{(Assumption 3.9, (32))}, \quad (107)$$
so we immediately have $\sigma(D_t) \subseteq [\sqrt{1-2\epsilon}, \sqrt{1+2\epsilon}]$. Let $\hat{\Gamma} := \sum_{t=1}^{T}\frac{1}{n_t}\Delta_t^2\,\hat{\Sigma}^i_t$ and $\tilde{e}_\Delta(t) := \Delta_t/\sqrt{n_t}$. From (107) we have $\hat{\Gamma} \preceq (1+2\epsilon)\,\|\tilde{e}_\Delta\|_2^2\, I_d$, so the tail of $\big\|\sum_{t=1}^{T}\frac{1}{\sqrt{n_t}}U_t D_t Q_t\big\|_2$ is dominated by that of a Gaussian with the larger covariance; thus, to bound this $\ell_2$-norm with high probability, it suffices to bound the $\ell_2$-norm of a random vector with distribution $\mathcal{N}\big(0, \sigma_z^2(1+2\epsilon)\,\|\tilde{e}_\Delta\|_2^2\, I_d\big)$. (108)

Let $\xi \sim \mathcal{N}\big(0, \sigma_z^2(1+2\epsilon)\,\|\tilde{e}_\Delta\|_2^2\, I_d\big)$; as in (105), with probability at least $1 - \frac{\delta}{4}$,
$$\|\xi\|_2^2 \le \sigma_z^2(1+2\epsilon)\,\|\tilde{e}_\Delta\|_2^2\,\big(2d + 3\ln(\tfrac{4}{\delta})\big). \quad (109)$$
Then, with probability at least $1 - \frac{\delta}{4}$, we have the following inequality for all $\Delta \in \mathbb{R}^T$:
$$\|\hat{W}\Delta\|_2 = \Big\|\sum_{t=1}^{T}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^* w^*_t\,\Delta_t + \frac{1}{n_t}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top(X^i_t)^\top Z_t\,\Delta_t\Big\|_2$$
$$\ge \Big\|\sum_{t=1}^{T}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top\hat{\Sigma}^i_t B^* w^*_t\,\Delta_t\Big\|_2 - \Big\|\sum_{t=1}^{T}\frac{1}{n_t}\big((\hat{B}^i)^\top\hat{\Sigma}^i_t\hat{B}^i\big)^{-1}(\hat{B}^i)^\top(X^i_t)^\top Z_t\,\Delta_t\Big\|_2$$
$$\ge \sqrt{\frac{1-2\epsilon}{1+2\epsilon}}\,\|W^*\Delta\|_2 - \frac{1}{1-2\epsilon}\,\Big\|(\hat{B}^i)^\top\sum_{t=1}^{T}\frac{1}{\sqrt{n_t}}U_t D_t Q_t\Big\|_2 \ge \sqrt{\frac{0.9}{1.1}}\,\sqrt{\kappa'}\,\|\Delta\|_2 - \frac{1.1}{0.9}\,\sigma_z\sqrt{\frac{2d + 3\ln(4/\delta)}{\min_t n_t}}\,\|\Delta\|_2 \ge \frac{\sqrt{\kappa'}}{2}\,\|\Delta\|_2 \quad \Big(n_t \ge \tfrac{12\sigma_z^2}{\kappa'}\big(d + \ln(\tfrac{4T}{\delta})\big),\ (109)\Big) \quad (110)$$
for all $\Delta \in \mathcal{C}_3(S)$. From the definition of the RE condition as in Assumption E.1, we complete the proof.

Lemma E.10. Let $q = \frac{\sqrt{k}R}{\underline{\sigma}}$ (so $q \ge \|\nu^*\|_1$). If $\gamma \ge \max\{2160\, s\, q\, C_W\,\Lambda^{-1}, \sqrt{2160\, s\, q\,\kappa'\,\Lambda^{-1}}\}$, then
$$2160\,\gamma^{-1}\, s\, q\,\max\{C_W, \kappa'/\gamma\} \le \|\nu^*\|_1. \quad (111)$$

Proof of Lemma E.10. Just note that if $\gamma \ge \max\{2160\, s\, q\, C_W\|\nu^*\|_1^{-1}, \sqrt{2160\, s\, q\,\kappa'\|\nu^*\|_1^{-1}}\}$, then (111) follows by direct calculation. Since $\|\nu^*\|_1 \ge \Lambda$ by definition, the stated condition suffices, and we get the result.

Proof of Theorem 3.10/E.2. Combining Theorem E.5 and Lemma E.10, we can work out the result as in (90). For the estimate of $\beta_1 = T\beta$, we use the facts that $s \le k$, $\|\nu^*\|_2 \le R/\underline{\sigma}$ and $\|\nu^*\|_1 \le \sqrt{k}\,\|\nu^*\|_2 \le \sqrt{k}R/\underline{\sigma}$; then, from the definition of $\gamma$, we can work out that $\beta_1$ should be at least $\Theta(Tk^3 C_W^6/\underline{\sigma}^6)$.

F. Proof of Theorem 4.1

First, we restate the assumption and theorem formally.

Assumption F.1. (decreasing gradient) Assume $f_t$ is a piecewise second-order differentiable function and, on each sub-function, it satisfies $f_t \ge 0$, $\nabla f_t \ge 0$, $\nabla^2 f_t \le 0$ and $\nabla f_t(n_{t,1} + n_{t,2}) = \Omega(n_{t,2}^{-2+q})$ for some $q \in (0, 2]$.

Remark F.2. Assumption F.1 covers a wide range of functions that may be used in practice, including the earlier example (20). The last constraint in Assumption F.1 says that the gradient of $f_t$ must not decrease too quickly; it is used for our main theorem in this section.

Our main result for Section 4 is:

Theorem F.3. Let $n_{t,1} \equiv n_1$ for all $t \in [T]$ and assume Assumptions 2.1, 2.3, 2.4, 3.8, F.1 hold. Without loss of generality, we also assume $R = \Theta(1)$ and $C_W = \Theta(1)$, where $C_W, R$ are defined in Assumption 3.8. Then, denoting the optimal solution of (21) by $(n^*_{[T],2}, \nu^*)$, we have
$$n^*_{t,2} = h_t(|\nu^*(t)|), \quad (112)$$
where $h_t$ is a monotone increasing function satisfying
$$c_{t,1}\, x \le h_t(x) \le c_{t,2}\, x^{2/q},$$
with $c_{t,1}, c_{t,2} > 0$ and $q$ defined in Assumption F.1. Moreover, we claim the A-MTRL algorithm with the $n^*_{[T],2}$ sampling strategy is at least a $k$-sparse task selection algorithm.

We also rewrite the optimization problem (21) formally:
$$\min_{n_{[T],2},\,\nu}\ g(n_{[T],2}) := \sum_{t=1}^{T} f_t(n_{t,1} + n_{t,2})$$
$$\text{s.t.}\quad c_0(n_{[T],2}, \nu) := \varepsilon^2 - C_{ER}\,\sigma^2 k(d+T)\sum_{t=1}^{T}\frac{\nu(t)^2}{n_{t,2} + n_{t,1}} \ge 0,$$
$$c_j(\nu) := \sum_{t=1}^{T} w^*_{j,t}\,\nu(t) - (w^*_{T+1})_j = 0,\ j \in [k], \qquad c_m(n_{[T],2}) := n_{m,2} \ge 0,\ m \in [T], \quad (113)$$
where $C_{ER} > 0$ is a constant.

Proof of Theorem F.3. The main insight behind this theorem is that the objective function is concave with respect to $\nu$. We therefore give the proof for globally second-order differentiable $f_t$; it generalizes easily to piecewise second-order differentiable $f_t$, since concavity is maintained on each piece.
Step 1: Use the KKT conditions to reduce the number of variables. First, we define the Lagrangian
$$\mathcal{L}(n_{[T],2}, \nu) = g(n_{[T],2}) - \lambda_0\, c_0(n_{[T],2}, \nu) - \sum_{j=1}^{k}\lambda_j\, c_j(\nu) - \sum_{m=1}^{T}\lambda_{m+k}\, c_m(n_{[T],2}). \quad (114)$$
Then, from the KKT conditions, we have (absorbing the constant $C_{ER}\sigma^2 k(d+T)$ into $\lambda_0$ for brevity)
$$\frac{\partial\mathcal{L}}{\partial n_{t,2}}\Big|_{(n^*_{t,2},\,\nu^*(t))} = \nabla f_t(n_{t,1} + n^*_{t,2}) - \lambda^*_0\,\frac{\nu^*(t)^2}{(n^*_{t,2} + n_{t,1})^2} - \lambda^*_{t+k} = 0, \quad t \in [T],$$
$$\frac{\partial\mathcal{L}}{\partial\nu(t)}\Big|_{(n^*_{t,2},\,\nu^*(t))} = 2\lambda^*_0\,\frac{\nu^*(t)}{n^*_{t,2} + n_{t,1}} - \sum_{j=1}^{k}\lambda^*_j\, w^*_{j,t} = 0, \quad t \in [T],$$
$$\lambda^*_0 \ge 0, \quad \lambda^*_0\, c_0(n^*_{[T],2}, \nu^*) = 0, \qquad \lambda^*_{m+k} \ge 0, \quad \lambda^*_{m+k}\, c_m(n^*_{[T],2}) = 0, \quad m \in [T]. \quad (115)$$
Note that when $n^*_{t,2} > 0$ we have $\lambda^*_{t+k} = 0$ and $\nabla f_t(n_{t,1} + n_{t,2}) = \Omega(n_{t,2}^{-2+q})$; then, from the first equation of (115), we deduce (112) and its properties immediately. Also, with (112) we can reduce the number of variables of the original problem from $2T$ to $T+1$. To avoid confusion, we write $\alpha = \sqrt{\lambda_0}$ and $\gamma(t) := \nu(t)$ for the new optimization problem (116). It is clear that if the optimal solution of the original optimization problem (113) is $(\nu^*, n^*_{[T],2})$ and the corresponding Lagrange multiplier for the first equality constraint of (113) is $\lambda^*_0$, then the optimal solution $(\gamma^*, \alpha^*)$ of the following problem (116) equals $(\nu^*, \sqrt{\lambda^*_0})$:
$$\min_{\gamma,\,\alpha}\ l(\gamma, \alpha) := \sum_{t=1}^{T} f_t\big(n_{t,1} + h_t(\alpha|\gamma(t)|)\big)$$
$$\text{s.t.}\quad d_0(\gamma, \alpha) := \varepsilon^2 - C_{ER}\,\sigma^2 k(d+T)\sum_{t=1}^{T}\frac{\gamma(t)^2}{h_t(\alpha|\gamma(t)|) + n_{t,1}} = 0, \qquad d_j(\gamma) := \sum_{t=1}^{T} w^*_{j,t}\,\gamma(t) - (w^*_{T+1})_j = 0,\ j \in [k]. \quad (116)$$

Step 2: The objective function of (116) is concave. From the KKT conditions above, we know that for any feasible solution $(\gamma, \alpha)$ and any $t$ in the support $S$ of $\gamma$, there exists a unique $x_t > 0$ such that $\alpha|\gamma(t)| = \sqrt{\nabla f_t(n_{t,1} + x_t)}\,(n_{t,1} + x_t)$. Then, from the key Lemma F.4 below, the objective function of (116) is concave with respect to $|\gamma(t)|$ for all $t \in S$.

Step 3: Analyze $\gamma$ via the sub-problem of (116). The first equality constraint of problem (116) is non-linear in $\gamma$ and $\alpha$, so the feasible region of (116) has a non-linear boundary. This makes it difficult to obtain a closed form of the optimal solution of (116). Fortunately, the other equality constraints, which are equivalent to $W^*\gamma = w^*_{T+1}$, are not only linear but also independent of $\alpha$. So we try to find the optimal solution of the sub-problem (117) and connect it to that of (116):
$$\min_{\xi}\ l(\xi, \alpha) := \sum_{t=1}^{T} f_t\big(n_{t,1} + h_t(\alpha|\xi(t)|)\big) \quad \text{s.t.}\ D(\xi) := W^*\xi - w^*_{T+1} = 0. \quad (117)$$
In (117), $\alpha$ is taken as a given value and $\xi$ plays the same role as $\gamma$ above. Define $\mathrm{opt}(\alpha): \mathbb{R} \to \Omega^*$, where $\Omega^*$ is the set of optimal solutions of (117) for the given $\alpha$.

First, we show that the optimal solution of (117) is $k$-sparse. From Step 2 we know $l(\xi)$ is concave in each $|\xi(t)|$, $t \in S$, which means that the region enclosed by any level set of the objective is concave in the coordinates $|\gamma(t)|$, $t \in S$. Consequently, minimizing such a concave function over a system of linear equations yields sparse solutions (Tibshirani, 1996).

Second, we show that the optimal solution of the original optimization problem (116) is $k$-sparse. In the non-trivial case, where the algorithm achieves the required performance and terminates after the first stage, we have $d_0(\gamma, 0) < 0$, while as $\alpha \to \infty$, $d_0(\gamma, \alpha) \to \varepsilon^2 > 0$. Then, from the continuity of $h_t$, we see that for any $\gamma^*(\alpha) \in \mathrm{opt}(\alpha)$ there exists a unique $\alpha_0$ such that $\gamma^*(\alpha_0)$ is a feasible solution of (116). On the other hand, every optimal solution $(\gamma^*, \alpha^*)$ of (116) must be an optimal solution of the sub-problem (117), i.e., it must satisfy $\gamma^* \in \mathrm{opt}(\alpha^*)$. Thus $\gamma^*$ is $k$-sparse, and so is $\nu^*$. Therefore, A-MTRL with the $n^*_{[T],2}$ strategy is a $k$-sparse task selection algorithm.
Lemma F.4. Assume $f_t$, $h_t$, $n_{t,1}$ follow the conditions and results in Theorem 4.1, $W^* \in \mathbb{R}^{k\times T}$, $w^*_{T+1} \in \mathbb{R}^k$. If, for any feasible solution $(\gamma, \alpha)$ of (116) and any $t \in S$, there exists a unique $x_t > 0$ such that $\alpha|\gamma(t)| = \sqrt{\nabla f_t(n_{t,1} + x_t)}\,(n_{t,1} + x_t)$, then the objective function of (116) is concave with respect to $|\gamma(t)|$ for all $t \in S$.

Proof of Lemma F.4. For convenience, we write $n$ for $n_{t,1}$. From the chain rule,
$$\frac{\partial l}{\partial|\gamma(t)|} = \nabla f_t\big(n + h_t(\alpha|\gamma(t)|)\big)\,\nabla h_t(\alpha|\gamma(t)|)\,\alpha. \quad (118)$$
Clearly $l(\gamma, \alpha)$ is monotone increasing in $|\gamma(t)|$. For the second order of $l(\gamma, \alpha)$ we have
$$\frac{\partial^2 l}{\partial|\gamma(t)|^2} = \Big\{\nabla^2 f_t\big(n + h_t(\alpha|\gamma(t)|)\big)\,\big(\nabla h_t(\alpha|\gamma(t)|)\big)^2 + \nabla f_t\big(n + h_t(\alpha|\gamma(t)|)\big)\,\nabla^2 h_t(\alpha|\gamma(t)|)\Big\}\,\alpha^2. \quad (119)$$
First we need to work out the relation between the derivatives of $h_t$ and $f_t$. From the first equation of (115) and the definition of $h_t$ we have
$$h_t\big(\sqrt{\nabla f_t(n+x)}\,(n+x)\big) = x. \quad (120)$$
Since $h_t$ is a monotone continuous function, the inverse function theorem gives
$$\nabla h_t\big(\sqrt{\nabla f_t(n+x)}\,(n+x)\big) = \frac{2\sqrt{\nabla f_t(n+x)}}{(n+x)\,\nabla^2 f_t(n+x) + 2\,\nabla f_t(n+x)}. \quad (121)$$
Let $g(x) := \sqrt{\nabla f_t(n+x)}\,(n+x)$; from Assumption F.1 we know $g$ is a continuous monotone increasing function with $g' \in (0, +\infty)$. Besides, from the conditions, for each $t \in S$ there is a unique $x := x_t > 0$ such that $g(x_t) = \alpha|\gamma(t)|$, with which we can simplify the second derivative:
$$\nabla^2 h_t\big(g(x)\big) = \frac{d}{dx}\Big[\frac{2\sqrt{\nabla f_t(n+x)}}{(n+x)\,\nabla^2 f_t(n+x) + 2\,\nabla f_t(n+x)}\Big]\Big/\, g'(x) = \frac{2\big(\nabla^2 f_t(n+x)\big)^2(n+x) - 4\,\nabla^2 f_t(n+x)\,\nabla f_t(n+x) - 2(n+x)\,\nabla^3 f_t(n+x)\,\nabla f_t(n+x)}{\big[(n+x)\,\nabla^2 f_t(n+x) + 2\,\nabla f_t(n+x)\big]^3}. \quad (122)$$
Denote $h^1_t := \nabla h_t(g(x))$ and $h^2_t := \nabla^2 h_t(g(x))$. Thus we have
$$\frac{1}{\alpha^2}\,\frac{\partial^2 l(\gamma,\alpha)}{\partial|\gamma(t)|^2} = \nabla^2 f_t(n+x)\,(h^1_t)^2 + \nabla f_t(n+x)\, h^2_t = 2\,\nabla f_t(n+x)\,(n+x)\; q(x), \qquad q(x) := \frac{3\big(\nabla^2 f_t(n+x)\big)^2 - 2\,\nabla^3 f_t(n+x)\,\nabla f_t(n+x)}{\big[(n+x)\,\nabla^2 f_t(n+x) + 2\,\nabla f_t(n+x)\big]^3}. \quad (123)$$
So if $q(x) < 0$ holds for all $x > 0$, we finish the proof, since $\nabla f_t \ge 0$ and $n + x > 0$. First we assume that $\nabla f_t(y) = \frac{A_t}{(B_t + y)^\delta}$, where $A_t > 0$, $B_t \ge 0$ and $\delta \in [0, 2-q)$. Then
$$q(x) = \frac{3\,\frac{\delta^2 A_t^2}{(n+x+B_t)^{2\delta+2}} - 2\,\frac{\delta(\delta+1)A_t^2}{(n+x+B_t)^{2\delta+2}}}{\Big[\frac{A_t}{(n+x+B_t)^{\delta+1}}\big(2B_t + (2-\delta)(n+x)\big)\Big]^3} = \frac{A_t^2\,(n+x+B_t)^{-2\delta-2}\;\delta(\delta-2)}{\Big[\frac{A_t}{(n+x+B_t)^{\delta+1}}\big(2B_t + (2-\delta)(n+x)\big)\Big]^3}. \quad (124)$$
Since $n + x > 0$ and $0 \le \delta < 2$, we have $q(x) < 0$ for all $x > 0$. Besides, due to the fact that $\nabla f_t$ is monotone decreasing and non-negative, together with Assumption F.1 and $n > 0$, we can find $\delta_i \in [0, 2-q)$, $A_{t,i} > 0$, $B_{t,i} \ge 0$ for $i = 1, 2$ such that
$$\frac{A_{t,1}}{(B_{t,1} + x + n)^{\delta_1}} \le \nabla f_t(x+n) \le \frac{A_{t,2}}{(B_{t,2} + x + n)^{\delta_2}}.$$
So $q(x) < 0$ holds for any $f_t$ that satisfies Assumption F.1.

Remark F.5. If $\delta$ in (124) is not in $[0, 2)$, then the optimization problem (113) is not computable in this manner.
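The concavity established in Lemma F.4 can also be observed numerically. The sketch below (ours, with hypothetical constants $A_t$, $B_t$, $\delta$) builds $h_t$ as the inverse of $g(x) = \sqrt{\nabla f_t(n+x)}\,(n+x)$ by bisection and checks that $l$ is concave in $|\gamma(t)|$ via discrete second differences:

```python
import numpy as np

# Numerical check of Lemma F.4 for a cost with gradient f_t'(y) = A/(B+y)^delta
# (the model family used in the proof; A, B, delta below are hypothetical,
# with delta in [0, 2-q)).
A, B, delta, n1, alpha = 2.0, 1.0, 0.5, 10.0, 1.0

df = lambda y: A / (B + y) ** delta                          # f_t' >= 0, decreasing
f  = lambda y: A * ((B + y) ** (1 - delta) - B ** (1 - delta)) / (1 - delta)
g  = lambda x: np.sqrt(df(n1 + x)) * (n1 + x)                # KKT: alpha*|gamma| = g(x)

def h(target):                                               # h_t = g^{-1}, by bisection
    lo, hi = 0.0, 1.0
    while g(hi) < target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

gammas = np.linspace(g(0.0) / alpha + 1e-6, 50.0, 400)       # equally spaced
ell = np.array([f(n1 + h(alpha * gm)) for gm in gammas])
second_diff = np.diff(ell, 2)                                # discrete 2nd derivative
print(np.all(second_diff <= 1e-9))                           # True: concave in |gamma(t)|
```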
G. Supplements to the Experiments Section

G.1. Explanation of the k-task selection scenario

We provide an illustration of our intuition for the k-task selection scenario in Section 5. We emphasize that the specific choice of the cost function is not critical in this scenario, since solving the exact optimization problem (Eqn. 21) can be computationally challenging. For instance, the cost functions could correspond to Lp-minimization (0 <= p < 1) solutions of the linear system W*ν = w*_{T+1}, which is known to be NP-hard. To address this challenge, as discussed in Theorem 4.1, we employ L1-A-MTRL as an approximation to the optimal solution of (21). This approach is justified by the fact that the time complexity for solving the approximate solution of (21) via L1-A-MTRL with relative accuracy δ is only poly(T) ln(T/δ) (Cohen et al., 2021), and the L1-minimization solution is also k-sparse. Therefore, in cost-sensitive scenarios, our main focus is on addressing the question: how well can active multi-task representation learning algorithms perform when no more than k tasks are available for further sampling? This leads us to the setting of the k-task selection scenario.

G.2. Details of Algorithm Implementation

In practice, Ŵ and ŵ_{T+1} may differ across epochs after the model converges, due to noise in the data points. So, to enhance the stability of ν̂, we calculate ν̂ at every epoch over the last 20 rounds and take the average as the final reference for calculating n_{[T]}, for both our algorithm and the baselines; the total number of epochs at each stage is no less than 2000.

Algorithm 2 Multi-Stage L1-A-MTRL Method
  Input: confidence δ, representation function class Φ, stage number S, scaling L > 1, minimum singular value σ̲
  Initialize N̄ = β₁/T from (15) and ν̂¹ = [1, 1, ...].
  for i = 1 to S do
    Set n^i_t = max{β_i |ν̂^i(t)| / ||ν̂^i||₁, N̄}.
    For each task t, draw n^i_t i.i.d. samples from the offline dataset, denoted {X^i_t, Y^i_t}, t = 1, ..., T
    Estimate φ̂^i, Ŵ^i on the source tasks with Eqn. (2)
    Estimate ŵ^i_{T+1} on the target task with Eqn. (3)
    Estimate ν̂^{i+1} by the Lasso program (16)
    Set β_{i+1} = β_i · L
  end for

For the full tasks scenario, note that L2-A-MTRL (Chen et al., 2022) uses an iterative algorithm with 4 stages to optimize the model, so we also run our algorithm iteratively with 4 stages for comparison; the detailed procedure for multi-stage learning is given in Algorithm 2. We mention that Chen et al.'s method requires multiple stages, whereas we allow both single-stage (Algorithm 1) and multi-stage (Algorithm 2) versions. Here we set N̄ = 100. We sample 500 data points from the target task, while at the final stage we sample around 30000 to 40000 data points from the source tasks. For the k-task selection scenario, we run the algorithm with 2 stages. Here we set N̄ = 40. We sample 200 data points from the target task and around 12000 data points from the source tasks.

G.3. How to choose λ_k

Determining the optimal value of λ_k requires additional knowledge of σ̲ = σ_min(W*), which is a dataset-dependent prior parameter. To address this, we explore two approaches to determining λ_k in our experiments:

Tuning way: We roughly tune λ_k exponentially for the 2-phase L1-A-MTRL approach (Algorithm 1), and to further refine λ_k we can use grid search. Once we identify a good λ_k, we can run the multi-phase L1-A-MTRL algorithm (Algorithm 2) with that λ_k and a larger N_tot to achieve improved results.

Lazy way: Alternatively, we can simply choose a very small value for λ_k, such as 10^-10, for our algorithm.

To provide a clearer illustration of the first approach, we apply 2-phase L1-A-MTRL to the identity 9 dataset in the full task scenario, where k = 50 and T = 150. In the first phase, each source task is assigned N̄ = 100 data points, and in the second phase the total budget for the source data is N_tot = 33k. The results are presented in Table 1.

Table 1. The relation between λ_k and the second-stage test error on the target task identity 9.

  λ_k:    1.0     10^-1   10^-2   10^-3   10^-4   2x10^-4  10^-5   10^-6   10^-8   10^-10  10^-16
  Error:  0.0691  0.0690  0.0703  0.0694  0.0561  0.0570   0.0655  0.0655  0.0633  0.0631  0.0625

The optimal value of λ_k is approximately 10^-4. Additionally, we observe that, apart from the entries 10^-4 and 2x10^-4, the target error decreases as λ_k decreases. For other target tasks, although we did not find an optimal λ_k analogous to that of identity 9, we consistently observe that smaller values of λ_k lead to better performance for L1-A-MTRL. We attribute this phenomenon to our Theorem 3.7 considering a worst-case scenario where the noise may be significant; in practice, smaller values of λ_k are often sufficient to control the noise. Furthermore, since smaller values of λ_k incur a smaller bias when solving the Lasso program, L1-A-MTRL with small λ_k consistently exhibits good performance. Therefore, to save time and resources, we adopt the lazy way instead of the tuning way in the experiments presented in this paper. We set λ_k = 10^-10, and the empirical results demonstrate that L1-A-MTRL with such a small value of λ_k still achieves excellent performance.

G.4. Additional Experiments on Sampling Budgets

To better illustrate the empirical effect of the sampling budget in the MNIST-C experiments, we consider the full task scenario (mentioned in Sec. 5) and evaluate model performance using the 5-stage L1-A-MTRL method (Algorithm 2), with the minimum per-task sample size fixed at N̄ = 100 and an increasing total sampling number N_tot. Due to limited time and resources, we randomly select two target tasks, shear 1 and identity 9, and obtain the results in Table 2.

From Table 2, we find that to achieve accuracy higher than 95% on the shear 1 target task, P-MTRL (passive sampling) requires more than 86k source samples, L2-A-MTRL (Chen et al., 2022) requires about 61k source samples, and L1-A-MTRL requires only about 33k source samples. Since at the later phase we can reuse the evenly sampled data (TN̄ = 15k in total) from the first phase, L1-A-MTRL needs to label only an additional 18k source samples at the later phase to achieve 95% accuracy, while L2-A-MTRL requires approximately 46k extra samples and P-MTRL no less than 71k extra samples. Similar results apply to identity 9: to achieve accuracy above 93.7% on the identity 9 target task, P-MTRL requires more than 95k source samples, L2-A-MTRL (Chen et al., 2022) requires about 61k source samples, while L1-A-MTRL requires only about 33k. These results illustrate the effectiveness of our L1-A-MTRL algorithm.

Table 2. Test error on the target tasks shear 1 and identity 9 with different N_tot.

  shear 1
  N_tot:                          15000   32850   44000   60700   86000
  P-MTRL:                         0.0544  0.0538  0.0536  0.0520  0.0518
  L2-A-MTRL (Chen et al., 2022):  0.0544  0.0511  0.0519  0.0494  0.0488
  L1-A-MTRL (Ours):               0.0544  0.0496  0.0478  0.0442  0.0428

  identity 9
  N_tot:                          15000   33000   43800   60900   95400
  P-MTRL:                         0.0932  0.0834  0.0778  0.0738  0.0652
  L2-A-MTRL (Chen et al., 2022):  0.0932  0.0909  0.0638  0.0627  0.0621
  L1-A-MTRL (Ours):               0.0932  0.0631  0.0620  0.0605  0.0551
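For completeness, the per-stage sampling rule of Algorithm 2 (Sec. G.2) is summarized in the following sketch (ours); `train_and_estimate_nu` is a placeholder for the model fitting and Lasso steps, not a function from our codebase:

```python
import numpy as np

def stage_allocation(nu_hat, beta_i, n_min):
    """Per-stage source-task sample sizes used by Algorithm 2 (sketch):
    n_t^i = max{beta_i * |nu_hat(t)| / ||nu_hat||_1, n_min}."""
    nu_abs = np.abs(np.asarray(nu_hat, dtype=float))
    return np.maximum(beta_i * nu_abs / max(nu_abs.sum(), 1e-12), n_min)

# Sketch of the multi-stage loop with the paper's full-task-scenario settings
# (T = 150 source tasks, S = 4 stages, N_bar = beta_1 / T = 100).
T, S, L = 150, 4, 2.0
beta = T * 100.0                  # beta_1
nu_hat = np.ones(T)               # initialize nu_hat^1 = [1, 1, ...]
for stage in range(S):
    n = stage_allocation(nu_hat, beta, n_min=100)
    # nu_hat = train_and_estimate_nu(n)  # fit (phi, W, w_{T+1}), then Lasso (16)
    beta *= L                     # beta_{i+1} = L * beta_i
print(n[:5])                      # sample sizes for the final stage
```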