# On the Statistical Benefits of Curriculum Learning

Ziping Xu¹, Ambuj Tewari¹

¹Department of Statistics, University of Michigan, Ann Arbor. Correspondence to: Ziping Xu.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022.*

**Abstract.** Curriculum learning (CL) is a commonly used machine learning training strategy. However, we still lack a clear theoretical understanding of CL's benefits. In this paper, we study the benefits of CL in the multitask linear regression problem under both structured and unstructured settings. For both settings, we derive the minimax rates for CL with an oracle that provides the optimal curriculum, and without the oracle, where the agent has to adaptively learn a good curriculum. Our results reveal that adaptive learning can be fundamentally harder than oracle learning in the unstructured setting, but it merely introduces a small extra term in the structured setting. To connect theory with practice, we provide justification for a popular empirical method that selects tasks with the highest local prediction gain by comparing its guarantees with the minimax rates mentioned above.

## 1. Introduction

It has long been realized that we can design more efficient learning algorithms if we can make them learn on multiple tasks. Transfer learning, multitask learning, and meta-learning are just a few of the sub-areas of machine learning where this idea has been pursued vigorously. Often the goal is to minimize the weighted average loss over a set of tasks that are expected to be similar. While previous literature often assumes a predetermined (and often equal) number of observations for all the tasks, in many applications we are allowed to decide the order in which the tasks are presented and the number of observations drawn from each task. Any strategy that tries to improve performance through better task scheduling is usually referred to as curriculum learning (CL) (Bengio et al., 2009). The agent that schedules tasks at each step is often referred to as the task scheduler.

Though curriculum learning has been extensively used in modern machine learning (Gong et al., 2016; Sachan & Xing, 2016; Tang et al., 2018; Narvekar et al., 2020), there is very little theoretical understanding of its actual benefits. We also do not know whether the heuristic methods used in many empirical studies can be theoretically justified. Even the problem itself has not been rigorously formulated. To address these challenges, we first formulate the curriculum learning problem in the context of linear regression. We analyze the minimax optimal rate of CL in two settings: an unstructured setting where the parameters of different tasks are arbitrary, and a structured setting where they have a low-rank structure. Finally, we discuss the theoretical justification of a popular heuristic task scheduler that greedily selects the task with the highest local prediction gain.

**Related literature.** Despite the lack of literature on curriculum learning theory, the following problems have been studied under similar settings. Active learning (Settles, 2009; 2011) addresses the problem of actively selecting unlabeled data points to maximize model accuracy. Active learning can be treated as curriculum learning in which each unlabeled point is a task with a point-mass distribution. Active learning has also been used in domain adaptation and transfer learning settings (Persello & Bruzzone, 2012).
Multi-source domain adaptation (Wang & Deng, 2018; Sun et al., 2015) also considers multiple candidate source domains for a given target task. Bhatt et al. (2016) proposed an iterative adaptation method to integrate the source data from each domain. However, such adaptive procedures have not been theoretically understood.

## 2. Background

We would like to point out previous work on three crucial aspects of CL: the two types of benefits one may expect from CL, task similarity assumptions, and the task schedulers used in empirical studies.

**Two types of benefits.** There are two distinct ways to understand the benefits of CL. From the perspective of optimization, some papers argue that the benefits of a curriculum can be interpreted as learning first from more convex and smoother objective functions, which provides a better initialization point for the non-convex target objective function (Bengio et al., 2009). The order of task scheduling is essential here. As an example, Figure 1 shows the objective functions of a problem with four source tasks and one target task of increasing difficulty (non-convexity).

*Figure 1. An example of tasks with increasing non-convexity. Solid lines of different colors represent the true objective functions of the four source tasks and the target task.*

Directly minimizing the target task using gradient descent can be hard due to the non-convexity. However, simple gradient descent can converge to the global optimum of the current task if it starts from the global optimum of the previous task. We refer to the benefit that involves faster convergence in optimization as the *optimization benefit*. Optimization benefits depend strongly on the order of scheduling. Generally speaking, if one directly considers the empirical risk minimizer (ERM), which requires global minimization of the empirical risk, there may not be any optimization benefit.

The second type of benefit is brought by carefully choosing the number of observations from each task, independently of their order; we call this the *statistical benefit*. For example, suppose we have two linear regression problems that are identical except for the standard deviation of the Gaussian noise on the response variables. If we consider the OLS estimator on the joint dataset of the two tasks, there is a reduction in noise level when more samples are allocated to the task with the lower noise level, and the benefit is independent of the order by the nature of OLS. A statistical benefit can be seen as any benefit other than a reduction in the difficulty of optimization. Weinshall & Amir (2020) focused on a special curriculum learning setting where each sample is considered a task, and analyzed the error rate on samples of different noise levels. They analyzed the benefits in the error rate of linear models, whose global minimizer can be easily found; their result should thus also count as a statistical benefit.

In general, the two types of benefits can coexist. A good curriculum should account for both the non-convexity and the noise levels. However, due to the significantly different underlying mechanisms of the two benefits, it is natural to study them separately. This paper focuses on the analysis of statistical benefits. Accordingly, we analyze algorithms that map datasets to an estimator for each task, which may involve finding global minima of the empirical errors of non-convex functions.
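To make the OLS example above concrete, here is a minimal simulation (ours, not from the paper; the sizes and noise levels are arbitrary choices) that pools samples from two tasks sharing the same true parameter and compares sample allocations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 200
theta_star = rng.normal(size=d)
sigma = {"low_noise": 0.1, "high_noise": 1.0}

def sample_task(n, noise):
    X = rng.normal(size=(n, d))
    y = X @ theta_star + noise * rng.normal(size=n)
    return X, y

def ols_error(n_low):
    # Allocate n_low samples to the low-noise task, the rest to the other,
    # then fit OLS on the pooled data and report the parameter error.
    X1, y1 = sample_task(n_low, sigma["low_noise"])
    X2, y2 = sample_task(N - n_low, sigma["high_noise"])
    X, y = np.vstack([X1, X2]), np.concatenate([y1, y2])
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.linalg.norm(theta_hat - theta_star)

for n_low in [0, 100, 200]:
    errs = [ols_error(n_low) for _ in range(100)]
    print(f"n_low={n_low:3d}  mean error={np.mean(errs):.4f}")
```

Allocating more of the budget to the low-noise task reduces the estimation error, and the improvement does not depend on the order in which the samples arrive.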
**Similarity assumptions.** We discussed a problem with two almost identical tasks above, where we can achieve perfect transfer and the trivial curriculum that allocates all the samples to the simpler task is optimal. However, tasks are generally not identical. Understanding how much benefit the target task can gain from learning on source tasks has been a central problem in the transfer learning and multitask learning literature. The key is to propose meaningful similarity assumptions.

Let $\mathcal{X}$ and $\mathcal{Y}$ be the input and output spaces. Assume we have $T$ tasks with data distributions $D_1, \ldots, D_T$ over $\mathcal{X} \times \mathcal{Y}$. Let $(X_t, Y_t)$ be a sample from task $t$, and let $f_t = \mathbb{E}[Y_t \mid X_t]$, a mapping $\mathcal{X} \to \mathcal{Y}$, be the mean function. In this paper, we adopt a simple parametric model for the mean function, with $f_t$ represented by a parameter $\theta^*_t \in \mathbb{R}^d$. We consider two scenarios: structured and unstructured. In Section 3, we adopt simple linear regression models and do not assume any further internal structure on the true parameters. Two tasks are similar if $\|\theta^*_{t_1} - \theta^*_{t_2}\|_2$ is small, and a learned parameter is directly transferred to the target task. This setting has been used in many previous studies (Yao et al., 2018; Bengio et al., 2009; Xu et al., 2021). In Section 4, we study the multitask representation learning setting (Maurer et al., 2016; Tripuraneni et al., 2020; Xu & Tewari, 2021), where a stronger internal structure is assumed. To be specific, we write $f_t(x) = x^T B^* \beta^*_t$, where $x \in \mathbb{R}^d$, $B^* \in \mathbb{R}^{d \times k}$ is a linear representation mapping, and $\beta^*_t \in \mathbb{R}^k$ is a task-specific parameter. Generally, the input dimension is much larger than the representation dimension ($d \gg k$). These two settings, while representative, do not exhaust all of the settings in the literature (the sketch at the end of this section contrasts the two concretely). We refer the reader to Teshima et al. (2020) for a brief summary of theoretical assumptions on task similarity.

**Task schedulers.** Many empirical methods have been developed to automatically schedule tasks. Liu et al. (2020) designed various heuristic strategies for task selection in computer vision. Cioba et al. (2021) discussed several meta-learning scenarios where the optimal data allocations differ, which interestingly aligns with our theoretical results. For more general use, one major family of task schedulers is based on the intuition that the scheduler should select the task that leads to the highest local gain on the target loss (Graves et al., 2017). Since the accurate prediction gain is not accessible, online decision-making algorithms (bandits and reinforcement learning) are frequently used to adaptively allocate samples (Narvekar et al., 2020). However, there is no theoretical guarantee that such greedy algorithms lead to the optimal curriculum.

**Notation.** For any positive integer $n$, we let $[n] = \{1, \ldots, n\}$. We use the standard $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$ notation to hide universal constant factors. We also use $a \lesssim b$ and $a \gtrsim b$ to indicate $a = O(b)$ and $a = \Omega(b)$, respectively.
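As an illustration of the two similarity regimes above, the following sketch generates task parameters for both settings; the dimensions and the use of a QR factorization to produce the shared representation are our own choices for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, T = 20, 3, 10

# Unstructured setting: arbitrary per-task parameters; similarity between
# tasks t1, t2 is measured by ||theta[t1] - theta[t2]||_2.
theta_unstructured = rng.normal(size=(T, d))

# Structured setting: every task parameter lies in the column span of a
# shared representation B (d x k), so theta_t = B @ beta_t is rank-k overall.
B = np.linalg.qr(rng.normal(size=(d, k)))[0]       # orthonormal columns
beta = rng.normal(size=(T, k))
theta_structured = beta @ B.T                      # row t equals B @ beta[t]

# The stacked parameter matrix has rank at most k in the structured case.
print(np.linalg.matrix_rank(theta_unstructured))   # typically min(T, d) = 10
print(np.linalg.matrix_rank(theta_structured))     # k = 3
```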
## 3. Unstructured Linear Regression

In this section, we study the problem of learning from $T$ tasks to generate an estimate for a single target task.

### 3.1. Formulations

We consider linear regression tasks. Let $1, \ldots, T$ denote the $T$ tasks and let $\theta^*_t \in \mathbb{R}^d$ denote the true parameter of task $t$. The response $Y_t$ of task $t$ is generated as
$$Y_t = X_t^T \theta^*_t + \epsilon_t,$$
where $\epsilon_t$ is Gaussian noise with $\mathbb{E}[\epsilon_t^2] = \sigma_t^2$ and $X_t \sim N(0, \Sigma_t)$, where $\Sigma_t \in \mathbb{R}^{d \times d}$ is a positive definite covariance matrix. Any task, therefore, can be fully represented by a triple $(\theta^*_t, \sigma_t, \Sigma_t)$. Throughout the paper, we are more interested in the unknown parameters than in the covariate distribution or the noise level. We simply denote by $\theta^* \in \mathbb{R}^{d \times T}$ the parameters of a problem ($T$ tasks) and let $\theta^*_t$ be the $t$-th column of the matrix.

We make a uniform assumption on the covariance matrices of the input variables. The same assumption is also used by Du et al. (2020).

**Assumption 1** (Coverage of covariate distribution). We assume that $C_1 I_d \preceq \Sigma_t \preceq C_0 I_d$ for some constants $C_0 \ge C_1 > 0$ and all $t \in [T]$.

**Goal.** Let $S_t^n$ be $n$ random samples from task $t$. Let $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a loss function and let $L_t(\theta) = \mathbb{E}[\ell(X^T\theta, Y)]$ be the expected loss of a hypothesis $\theta$ evaluated on task $t$. Moreover, we denote the excess risk by $G_t(\theta) = L_t(\theta) - \inf_{\theta'} L_t(\theta')$. Our goal in this section is to minimize the expected loss on the last task $T$, which we call the target task. Throughout the paper, we use the square loss.

**Transfer distance.** Algorithms tend to perform better when the tasks are similar to each other, so that observations collected from non-target tasks carry less transfer bias. We define the transfer distance between tasks $t_1, t_2$ as $\Delta_{t_1,t_2} = \|\theta^*_{t_1} - \theta^*_{t_2}\|_2$. It is not fair to compare performance between problems with different transfer distances, so to study a minimax rate we consider the worst performance over a set of problems with similar transfer distances. Let $Q \in \mathbb{R}^T$ be a distance vector encoding upper bounds on the distances between the target task and every task. We define the hypothesis set with known transfer distance as
$$\Theta(Q) = \{\theta \in \mathbb{R}^{d\times T} : \|\theta_t - \theta_T\|_2 \le Q_t \text{ for all } t \in [T]\}.$$
The hypothesis set with unknown transfer distance is defined as
$$\tilde\Theta(Q) = \bigcup_{p}\{\theta \in \mathbb{R}^{d\times T} : \|\theta_{p_t} - \theta_T\|_2 \le Q_{p_t} \text{ for all } t \in [T]\},$$
where $p \in [T]^T$ ranges over all permutations of $[T]$. We say this hypothesis set has unknown transfer distance because even if some $Q_t$ is small, the agent does not know which task has the low transfer distance.

**Curriculum learning and task scheduler.** This paper concerns only statistical benefits. Since the order of selecting tasks does not affect the outcome of the algorithms we study, we denote a curriculum by $c \in [N]^T$, where $c_t$ is the total number of observations from task $t$ and $\sum_t c_t = N$. Note that $c$ may consist of random variables depending on the task scheduler. The set of all curricula with a total number of observations $N$ is denoted by $\mathcal{C}_N = \{c \in [N]^T : \sum_t c_t = N\}$.

Any curriculum learning method involves a multitask learning algorithm, defined as a mapping $A$ from a set of datasets $(S_1^{n_1}, \ldots, S_T^{n_T})$ to a hypothesis $\theta$ for the target task. A task scheduler runs the following procedure. At the start of step $i \in [N]$, we have $n_{i,1}, \ldots, n_{i,T}$ observations from the respective tasks. The task scheduler $\mathcal{T}$ at step $i$ is a mapping from the past observations $(S_1^{n_{i,1}}, \ldots, S_T^{n_{i,T}})$ to a task index, and a new observation is then sampled from the selected task $\mathcal{T}(S_1^{n_{i,1}}, \ldots, S_T^{n_{i,T}}) \in [T]$.
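The interaction protocol can be summarized in a few lines of code. This is a minimal sketch under our own illustrative choices (a round-robin placeholder scheduler and Gaussian designs); the paper does not commit to any particular scheduler at this point:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, N = 5, 4, 40
theta = rng.normal(size=(T, d))
sigma = np.array([0.1, 0.5, 1.0, 1.0])

def sample(t):
    x = rng.normal(size=d)
    return x, x @ theta[t] + sigma[t] * rng.normal()

def round_robin_scheduler(datasets):
    # datasets[t] is the list of (x, y) pairs observed from task t so far;
    # a real scheduler would use their contents, not just their sizes.
    return min(range(T), key=lambda t: len(datasets[t]))

datasets = [[] for _ in range(T)]
for i in range(N):
    t = round_robin_scheduler(datasets)
    datasets[t].append(sample(t))

print("curriculum c =", [len(s) for s in datasets])  # the c_t sum to N
```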
**Minimax optimality and adaptivity.** One goal of this work is to understand the minimax rate of the excess risk on the target task over all possible combinations of multitask learning algorithms and task schedulers. We first attempt to understand a limit on that rate by considering an oracle scenario in which the optimal curriculum is provided for any problem. Rigorously, we denote the loss of a fixed curriculum $c \in \mathcal{C}_N$ with respect to a fixed algorithm $A$ and problem $\theta$ by
$$R_T^N(c, A \mid \theta) = \mathbb{E}_{S_1^{c_1}, \ldots, S_T^{c_T}}\, G_T\big(A(S_1^{c_1}, \ldots, S_T^{c_T})\big).$$
We define the following oracle rate, which takes the infimum over all fixed curriculum designs for each problem $\theta$ in a hypothesis set $\Theta$:
$$R_T^N(\Theta) := \inf_A \sup_{\theta \in \Theta} \inf_{c \in \mathcal{C}_N} R_T^N(c, A \mid \theta). \tag{1}$$
The oracle rate describes an ideal case, because the optimal curriculum depends on the unknown problem, and any learning algorithm has to adaptively learn the problem in order to approximate the optimal curriculum. We ask the following question: can an adaptively learned curriculum perform as well as the optimal one in Equ. (1)? To answer it, we define the minimax rate for adaptive learning:
$$\tilde R_T^N(\Theta) := \inf_A \inf_{\mathcal{T}} \sup_{\theta\in\Theta} \mathbb{E}\, G_T\big(A(S_1^{c_{\mathcal{T},1}}, \ldots, S_T^{c_{\mathcal{T},T}})\big), \tag{2}$$
where $c_{\mathcal{T}} \in \mathcal{C}_N$ is the curriculum adaptively selected by the task scheduler $\mathcal{T}$ and the expectation is taken over both the datasets $S_1, \ldots, S_T$ and $c_{\mathcal{T}}$.

In this section, we compare the oracle rate in Equ. (1) with naive strategies that allocate all the samples to a single task; this answers how much benefit one can achieve over naive learning schedules. We are also interested in the gap between Equ. (1) and Equ. (2).

### 3.2. Oracle rate

In this section, we analyze the oracle rate defined in Equ. (1). We first give an overview of our results. For any problem instance, there exists a single task $t$ such that the naive curriculum with $c_t = N$ matches a lower bound for the oracle rate in Equ. (1). For any task $t \in [T]$, the direct-transfer performance of its OLS estimator on the target task $T$ can be roughly bounded by $\Delta_{t,T}^2 + d\sigma_t^2/N$. Thus our result implies that, essentially, the goal of curriculum learning is to identify the best task balancing the transfer distance and the noise level.

**Theorem 1.** Let $Q$ be a fixed distance vector as defined above. The oracle rate over $\Theta(Q)$ in Equ. (1) is lower bounded by
$$R_T^N(\Theta(Q)) \gtrsim C_0 \min_t \Big\{Q_t^2 + \frac{d\sigma_t^2}{N}\Big\}. \tag{3}$$

**Proof highlights.** Kalan et al. (2020) showed a minimax rate for the transfer learning problem with a single source task. They considered three scenarios, which can be uniformly lower bounded by the right-hand side of Equ. (3). Our analysis can be seen as an extension of their results to multiple source tasks. In brief, let $\delta$ denote the rate in (3), let $C > 0$ be a constant, and let $t^*$ be the best task indicated by (3). Any task $t$ with a large distance ($Q_t > C\delta$) is not helpful for learning the target task, so samples from such tasks can be discarded without reducing performance. For any task $t$ with $Q_t \le C\delta$, we show that a sample from task $t$ gives almost as much information as a sample from the best task $t^*$; one can therefore replace it with a random sample from $t^*$ without increasing the loss. The problem is thus converted to a single-source-task problem, from where we follow the lower bound construction of Kalan et al. (2020).
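As a quick numeric illustration of Theorem 1 (with made-up distances and noise levels), the oracle bound can be evaluated per task; note that the best single source task need be neither the closest nor the least noisy one:

```python
import numpy as np

d, N = 50, 1000
Q = np.array([0.0, 0.05, 0.3, 1.0])        # transfer distances to the target
sigma2 = np.array([4.0, 0.5, 0.01, 0.01])  # noise levels sigma_t^2

bounds = Q**2 + d * sigma2 / N             # per-task Q_t^2 + d*sigma_t^2/N
t_star = int(np.argmin(bounds))
print("per-task bounds:", np.round(bounds, 4))
print("best single task:", t_star)  # task 1: neither closest nor least noisy
```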
### 3.3. Minimax rate for adaptive learning

The problem can be hard when the transfer distance is unknown. We introduce an intuitive example to help understand our theoretical result. Assume we have three tasks: one target task and two source tasks, one of which is identical to the target task. We have $n$ samples from each source task but no observations from the target task. In this example, even though one of the source tasks is identical to the target task, no algorithm can decide which source task should be adopted, since we have no information about the target task. In other words, any algorithm can be as bad as the worse of the two source tasks. This is not an issue when the transfer distance is known to the agent, as in the oracle scenario.

This example implies that, to adaptively gain information from source tasks, we need sufficient information from the target task; otherwise we risk including information from tasks that contaminate the target task. Similarly, David et al. (2010) showed that without any observations from the target task, domain adaptation is impossible. More generally, even if we have some data from the target task, we will show that one cannot avoid a $\sigma_T^2$ term, the learning difficulty of the target task. We now formally state our result.

**Theorem 2.** Assume $T \ge 4$. Let $Q_{\mathrm{sub}} > 0$ and let $Q$ be a fixed distance vector satisfying $Q_1 = 0$ and $Q_t = Q_{\mathrm{sub}}$ for all $t = 2, \ldots, T-1$. The minimax rate in Equ. (2) is lower bounded by
$$\tilde R_T^N(\tilde\Theta(Q)) \gtrsim \min\Big\{\frac{\sigma_T^2\log(T)}{N},\, Q_{\mathrm{sub}}^2\Big\} + \min_t \frac{d\sigma_t^2}{N}. \tag{4}$$

Theorem 2 implies that, without knowing the transfer distances, any adaptively learned curriculum of any multitask learning algorithm suffers an unavoidable loss of $\sigma_T^2\log(T)/N$ when $Q_{\mathrm{sub}}$ is large. Compared with the rate $\sigma_T^2 d/N$ without transfer learning, there is still a potential improvement by a factor of $d/\log(T)$ when $Q_{\mathrm{sub}}$ and the $\sigma_t^2$ are small.

**Upper bound.** As shown above, there is a potential improvement of $d/\log(T)$. This is because, given the prior information that one of the source tasks is identical to the target task, the problem reduces from estimating a $d$-dimensional vector to identifying the best task from a candidate set, whose complexity scales with $\log(T)$. In fact, a simple fixed curriculum achieves this minimax rate. Assume $\|\theta_t^*\|_2 \le C_2$ for some constant $C_2 > 0$. Let $c_T = N/2$ and $c_t = N/(2T-2)$ for all other tasks. For each $t = 1, \ldots, T-1$, let $\tilde\theta_t$ be the OLS estimator using only task $t$'s samples, and let $\hat\theta_t$ be the projection of $\tilde\theta_t$ onto $\{\theta : \|\theta\|_2 \le C_2\}$. We then choose the estimator among $t = 1, \ldots, T-1$ that minimizes the empirical loss on the target task:
$$\hat t = \arg\min_{t\in[T-1]} \sum_{i=1}^{N/2} \big(Y_{T,i} - X_{T,i}^T\hat\theta_t\big)^2. \tag{5}$$

**Theorem 3.** Assume there exists a task $t$ with $\Delta_{t,T} = 0$ and $\|\theta_t^*\|_2 \le C_2$. With probability at least $1-\delta$, $\hat\theta_{\hat t}$ satisfies
$$G_T(\hat\theta_{\hat t}) \lesssim C_0C_2^2\log(Td/\delta)\Big(\frac{C_2\sigma_T^2}{N} + \frac{dT\sigma_{\hat t}^2}{C_1 N}\Big).$$

Note that $\hat t$ is random. However, when $dT\sigma_t^2 \lesssim \sigma_T^2\log(T)$ for all $t$, the first term dominates and our bound matches the lower bound in (4). This can happen when $\sigma_T^2 \gg \sigma_t^2$ for all $t = 1, \ldots, T-1$.
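The following sketch implements the fixed-curriculum procedure behind Theorem 3 under illustrative parameters; for simplicity it omits the projection onto $\{\theta : \|\theta\|_2 \le C_2\}$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, N = 5, 6, 600
theta = rng.normal(size=(T, d)) * 0.1
theta[2] = theta[T - 1]                  # one source task matches the target
sigma = np.array([1.0, 1.0, 0.2, 1.0, 1.0, 0.5])

def sample(t, n):
    X = rng.normal(size=(n, d))
    return X, X @ theta[t] + sigma[t] * rng.normal(size=n)

n_src = N // (2 * (T - 1))               # c_t = N/(2T-2) per source task
n_tgt = N // 2                           # c_T = N/2 for the target
theta_hat = [np.linalg.lstsq(*sample(t, n_src), rcond=None)[0]
             for t in range(T - 1)]

X_T, y_T = sample(T - 1, n_tgt)
losses = [np.mean((y_T - X_T @ th) ** 2) for th in theta_hat]
t_hat = int(np.argmin(losses))           # the selection rule in Equ. (5)
print("selected source task:", t_hat)
print("parameter error:", np.linalg.norm(theta_hat[t_hat] - theta[T - 1]))
```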
**General function classes.** As mentioned above, though it is difficult to identify the good source tasks, the complexity of doing so is still lower than that of learning the parameters directly. We remark that this result generalizes to function classes beyond linear functions. Keeping all other setup unchanged, we assume that the mean function $f_t^* \in \mathcal{F}_t : \mathcal{X} \to \mathcal{Y}$ for some input space $\mathcal{X}$ and output space $\mathcal{Y}$ shared by all the tasks. For convenience, we assume there is no covariate shift, i.e., the input distributions are the same. We give an analogue of Theorem 3.

**Assumption 2** (Assumption B in Jin et al. (2021)). Assume $\ell(\cdot, y)$ is $L_2$-strongly convex and $L_1$-Lipschitz for any $y \in \mathcal{Y}$. Furthermore, for all $x \in \mathbb{R}^d$ and $t \in [T]$, $\mathbb{E}[\nabla\ell(f_t^*(X), Y) \mid X = x] = 0$.

Assume we have $N/(2T-2)$ observations for each task $t = 1, \ldots, T-1$ and $N/2$ observations for the target task. Let $\hat f_t$ be the empirical risk minimizer for task $t$. Similarly to (5), let $\hat t = \arg\min_{t\in[T-1]} \hat\ell_T^N(\hat f_t)$, where $\hat\ell_T^N$ is the empirical loss on task $T$. Let $L^* = \min_{t\in[T-1]} L_T(f_t^*)$ and $t^* = \arg\min_{t\in[T-1]} L_T(f_t^*)$. We use Rademacher complexity to measure the hardness of learning a function class; we refer readers to Bartlett & Mendelson (2002) for its definition.

**Proposition 1.** Under the above setting and Assumption 2, with probability at least $1-\delta$,
$$G_T(\hat f_{\hat t}) \lesssim L^* + L_1\,\mathfrak{R}_{N/T}(\mathcal{F}_{t^*}) + \sqrt{\frac{\log(T/\delta)}{N}},$$
where $\mathfrak{R}_N(\mathcal{F})$ is the Rademacher complexity of the function space $\mathcal{F}$ with $N$ samples.

This bound improves on the bound for learning the target task alone, which scales with $\mathfrak{R}_N(\mathcal{F}_T)$, whenever $\mathfrak{R}_N(\mathcal{F}_T) \gtrsim \mathfrak{R}_{N/T}(\mathcal{F}_{t^*})$. The underlying idea is still that identifying good tasks is easier than learning the model itself.

## 4. Structured Linear Regression

Now we consider a slightly different setting, where we want to learn a shared linear representation that generalizes to any target task within a set of interest. Several recent papers have shown that, to achieve good generalization of the learned representation, the algorithm has to choose diverse source tasks (Tripuraneni et al., 2020; Du et al., 2020; Xu & Tewari, 2021). They all study the performance of a given choice of source tasks, and it has been unclear whether an algorithm can adaptively select diverse tasks.

### 4.1. Problem setup

We adopt the setup of Du et al. (2020). Let $d, k > 0$ be the dimensions of the input and the representation, respectively ($k \ll d$). We also set $T \le d$. Let $B^* \in \mathbb{R}^{d\times k}$ be the shared representation and let $\beta_1^*, \ldots, \beta_T^* \in \mathbb{R}^k$ be the linear coefficients of the prediction functions. The model setup is essentially the same as in Section 3.1, except that the true parameters are $B^*\beta_t^*$. We call this setting structured because stacking the true parameters as a matrix yields a low-rank matrix. To be specific, the output of task $t$ is given by
$$Y_t = X_t^T B^*\beta_t^* + \epsilon_t.$$
We use the same setup for the covariates $X_t$ as in Section 3, and we consider $\sigma_1^2 = \cdots = \sigma_T^2 = \sigma^2$ for some $\sigma^2 > 0$.

**Diversity.** Let $t_i$ be the task selected by the scheduler at step $i$. It is well understood that, to learn a representation that generalizes to any target task with an arbitrary $\beta^*$, we need a lower bound on
$$\lambda_k\Big(\frac{1}{N}\sum_{i=1}^N \beta_{t_i}^*\beta_{t_i}^{*T}\Big) =: \lambda_{N,k}, \tag{7}$$
where $\lambda_k(\cdot)$ denotes the $k$-th largest eigenvalue of a matrix, i.e., here the smallest. Intuitively, we hope the source tasks cover all possible directions, so that any new task is similar to at least some of the source tasks. Equ. (7) serves as an assumption in Du et al. (2020). When the true $\beta_t^*$ are known, we can simply pick tasks diversely. When the $\beta_t^*$ are unknown, the trivial strategy that allocates samples equally can perform badly. For example, let $T \gg k$ and let all of $\beta_{k+1}^*, \ldots, \beta_T^*$ be identical. The trivial strategy will then cover only one direction sufficiently, which ruins the generalization ability. In this section, we show that it is possible to adaptively schedule tasks to achieve diversity even in this hard case.
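The hard case just described is easy to reproduce numerically. The sketch below (our own construction of the $\beta$ vectors) computes $\lambda_{N,k}$ from Equ. (7) under the even allocation and under an oracle allocation:

```python
import numpy as np

rng = np.random.default_rng(4)
k, T, N = 3, 30, 3000
# k diverse tasks (standard basis vectors), then T-k copies of the first one.
betas = np.vstack([np.eye(k), np.tile(np.eye(k)[0], (T - k, 1))])

def diversity(counts):
    # counts[t] = number of samples drawn from task t; sum(counts) == N.
    M = sum(c * np.outer(b, b) for c, b in zip(counts, betas)) / N
    return np.linalg.eigvalsh(M)[0]       # smallest = k-th largest eigenvalue

even = [N // T] * T
oracle = [0] * T
for j in range(k):
    oracle[j] = N // k                    # spread samples over diverse tasks

print("even allocation  :", diversity(even))    # about 1/T
print("oracle allocation:", diversity(oracle))  # about 1/k
```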
### 4.2. Lower bounding diversity

In this section, we introduce an OFU (optimism in the face of uncertainty) algorithm that adaptively selects diverse source tasks.

**Two-phase estimator.** We first introduce an estimator of the unknown parameters. Assume that up to step $i$ we have datasets $S_1^{n_{i,1}}, \ldots, S_T^{n_{i,T}}$. We evenly split each dataset $S_t^{n_{i,t}}$ into two datasets $S_{i,t}^{(1)}$ and $S_{i,t}^{(2)}$, both of sample size $n_{i,t}/2$, and solve
$$\hat B_i = \arg\min_{B\in\mathbb{R}^{d\times k}}\ \min_{\beta_t\in\mathbb{R}^k,\, t\in[T]} \sum_{t=1}^T \sum_{(x,y)\in S^{(1)}_{i,t}} \big(y - x^T B\beta_t\big)^2, \qquad \hat\beta_{i,t} = \arg\min_{\beta\in\mathbb{R}^k} \sum_{(x,y)\in S^{(2)}_{i,t}} \big(y - x^T\hat B_i\beta\big)^2.$$
Note that we split the dataset so that $\hat B_i$ and $\hat\beta_{i,t}$ are independent.

**Algorithm 1** CL by optimistic scheduling
1. **Input:** $T$ tasks, total number of observations $N$, and constant $\gamma > 0$.
2. Sample $\lceil\gamma(d + \log(N/\delta))\rceil$ observations from each task.
3. **for** $i = T\lceil\gamma(d + \log(N/\delta))\rceil + 1, \ldots, N$ **do**
4. Construct the confidence set $\mathcal{B}_{i,t}$ for each $t \in [T]$ according to Equ. (8).
5. Select $t_i$ according to Equ. (9).
6. **end for**
7. **return** curriculum $(t_1, \ldots, t_N)$.

**Optimistic task scheduler.** Our algorithm keeps a confidence set for $B^*\beta_t^*$ for each $t\in[T]$ at each step $i\in[N]$. Lemma 1, which holds under Assumption 1, provides a suitable construction.

**Lemma 1.** Let $\kappa = C_0/C_1$ and suppose Assumption 1 holds. There exist universal constants $\gamma > 0$ and $\alpha > 0$ such that, at every step $i > T\lceil\gamma(d + \log(N/\delta))\rceil$, with probability $1-\delta$, for all $t\in[T]$,
$$\|\hat B_i\hat\beta_{i,t} - B^*\beta_t^*\|_2^2 \le \frac{\alpha C_5\sigma^2 dk\log(\kappa N/(T\delta))}{n_{i,t}},$$
where $n_{i,t}$ is the number of observations from task $t$ up to step $i$.

Following the bound in Lemma 1, we define the confidence width
$$W_{i,t} := \frac{\alpha C_5\sigma^2 dk\log(\kappa N/(T\delta))}{n_{i,t}}.$$
At each step $i$, for each task $t$, we construct a confidence set around $\hat B_i\hat\beta_{i,t}$:
$$\mathcal{B}_{i,t} = \{\theta\in\mathbb{R}^d : \|\hat B_i\hat\beta_{i,t} - \theta\|_2^2 \le W_{i,t}\}. \tag{8}$$
Then, following the principle of optimism in the face of uncertainty, we select the task
$$t_i \in \arg\max_{t\in[T]}\ \max_{\theta\in\mathcal{B}_{i,t}} \lambda_k\Big(\sum_{j=1}^{i-1}\tilde\theta_j\tilde\theta_j^T + \theta\theta^T\Big), \tag{9}$$
and set $\tilde\theta_i = \arg\max_{\theta\in\mathcal{B}_{i,t_i}} \lambda_k\big(\sum_{j=1}^{i-1}\tilde\theta_j\tilde\theta_j^T + \theta\theta^T\big)$. Here $\tilde\theta_i$ is our optimistic belief for task $t_i$ at step $i$.

We are now ready to present our lower bound on diversity. Our results hold under two assumptions: the first requires that the representation matrix $B^*$ is not degenerate, and the second is boundedness of the $\beta_t^*$.

**Assumption 3.** The largest singular value of $B^*$ is at most $C_4$ for some $C_4 > 0$.

**Assumption 4** (Boundedness). We assume $\|\beta_t^*\|_2 \le C_5$ for all $t\in[T]$.

**Theorem 4.** Suppose Assumptions 3 and 4 hold. Assume that for every unit vector $\nu$ in the column span of $B^*$ there exists some task $t$ such that $\nu^T B^*\beta_t^*\beta_t^{*T}B^{*T}\nu \ge \lambda$ for some $\lambda > 0$. Let $t_i$, $i = 1, \ldots, N$, be the tasks selected by Algorithm 1. Then there exists some $\alpha > 0$ such that, with probability at least $1-\delta$,
$$\lambda_{N,k} \ge \frac{\lambda}{C_4^2 k} - \sqrt{\frac{\sigma^2 C_5^2\, dkT\log(\kappa N/(T\delta))}{C_1^2 C_4^2\,\lambda N}}.$$

If we are provided with the oracle, only the first term above remains. When $N$ is sufficiently large the second term is negligible, and we achieve diversity asymptotically as long as $dkT \ll N$. Our proof follows the standard framework for OFU algorithms. We first show the validity of the confidence sets, which is implied by Lemma 1. The key steps are then to show optimism, i.e., $\lambda_k(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T) = \Omega(\lambda N/k)$, and to bound the difference between the belief $\lambda_k(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T)$ and the actual value $\lambda_{N,k}$. We provide the proof in Appendix E.
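To convey the shape of the selection rule in Equ. (9), the following simplified sketch scores each task by an optimistic smallest eigenvalue. Replacing the exact maximization over the confidence ball with the center estimate plus a width-based bonus is our own relaxation for illustration, not the paper's algorithm:

```python
import numpy as np

def optimistic_task(centers, widths, M_prev, k):
    """centers[t]: current estimate of B beta_t; widths[t]: W_{i,t}."""
    scores = []
    for c, w in zip(centers, widths):
        M = M_prev + np.outer(c, c)
        # k-th largest eigenvalue of the updated Gram matrix, inflated by
        # the confidence width as a crude stand-in for the max over the ball.
        scores.append(np.sort(np.linalg.eigvalsh(M))[-k] + np.sqrt(w))
    return int(np.argmax(scores))

# Tiny usage example with d=4, k=2, T=3 (all values arbitrary).
rng = np.random.default_rng(5)
centers = [rng.normal(size=4) for _ in range(3)]
widths = [1.0, 0.5, 0.1]
M_prev = np.zeros((4, 4))
print("selected task:", optimistic_task(centers, widths, M_prev, k=2))
```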
### 4.3. Upper bound results

Though the lower bound in Theorem 4 is already satisfying, we still want to shed some light on whether the $\sqrt{1/N}$ dependency is avoidable, by showing an upper bound result in Theorem 5.

**Theorem 5.** For any curriculum learning algorithm, there exist $T$ tasks ($T > k$) satisfying the diversity condition (for every $\nu\in\mathbb{R}^k$ with $\|\nu\|_2 = 1$ there exists some $\beta_t^*$ with $\beta_t^{*T}\nu \gtrsim 1$) such that
$$\max_{t_1,\ldots,t_N\in[T]}\lambda_k\Big(\frac{1}{N}\sum_{i=1}^N\beta_{t_i}^*\beta_{t_i}^{*T}\Big) - \mathbb{E}[\lambda_{N,k}] \gtrsim \frac{\sigma}{k}\sqrt{\frac{T}{kN}}.$$

Theorem 5 states that the $\sqrt{1/N}$ dependency is unavoidable, while there is still a gap of $dk^4$ between the upper bound and the lower bound. Our hard-case construction is inspired by the case where the naive strategy allocates samples evenly. Specifically, we consider $T$ tasks such that $k$ of them are diversely specified and the other $T-k$ are identical. Naive strategies fail by attaining $\lambda_{N,k} \lesssim 1/T$. We divide the $T$ tasks into $T/k$ blocks and construct similar problem instances whose diverse tasks sit in different blocks. The difficulty then becomes identifying the block with the diverse tasks, which is analogous to a bandit problem in a general sense. From here we follow a proof similar to those for stochastic bandits (Lattimore & Szepesvári, 2020). The full proofs can be found in Appendix F.

## 5. Analysis of Prediction Gain

In this section, we give theoretical guarantees for prediction-gain driven task schedulers under the unstructured setting of Section 3. We do not consider the structured setting because it is not clear how to apply the prediction-gain driven method to multitask representation learning.

**Prediction gain and convergence rate.** We define prediction gain as follows. At step $i$, a multitask learning algorithm $A$ maps a trajectory $H_i = \{(x_{t_j,j}, y_{t_j,j})\}_{j=1}^{i}$ to a parameter $\theta_i\in\mathbb{R}^d$ for the target task. The prediction gain is defined as
$$G(A, H_{i+1}) := L_T(\theta_i) - L_T(\theta_{i+1}).$$
At the start of round $i$, the prediction-gain based task scheduler selects $t_i\in[T]$ such that $G(A, H_{i+1})$ is maximized. Note that, in general, the prediction gain is not observable to the algorithm before $x_{t_i,i}$ and $y_{t_i,i}$ are actually sampled. There are simple ways to estimate it, for example from several random samples from each task. In a linear model, the prediction gain is equivalent to a convergence rate:
$$L_T(\theta_i) - L_T(\theta_{i+1}) = \|\theta_i - \theta_T^*\|_{\Sigma_T}^2 - \|\theta_{i+1} - \theta_T^*\|_{\Sigma_T}^2.$$
Weinshall & Amir (2020) discussed various benefits of curriculum learning by showing that their strategy gives a higher local convergence rate. It is not clear a priori that the greedy strategy that selects the highest local prediction gain gives the best total prediction gain in the long run.

**Decomposing prediction gain.** Consider the identity covariance matrix $\Sigma_t = I$, so the loss of a parameter $\theta$ can be written as $\|\theta - \theta_T^*\|_2^2 + \sigma_T^2$. Assume the gradient is computed from a sample from task $t$. By the SGD update, at step $i$ we have
$$\theta_{i+1} - \theta_T^* = \big(I - \eta_i x_i^{(t)}x_i^{(t)T}\big)(\theta_i - \theta_T^*) + \eta_i x_i^{(t)}\big(\epsilon_i + x_i^{(t)T}\Delta\theta_{t,T}^*\big),$$
where $\Delta\theta_{t,T}^* = \theta_t^* - \theta_T^*$. The one-step prediction gain is
$$\|\theta_i - \theta_T^*\|^2 - \|\theta_{i+1} - \theta_T^*\|^2 = \eta_i\|\theta_i - \theta_T^*\|^2_{(2-\eta_i\|x_i^{(t)}\|_2^2)\,x_i^{(t)}x_i^{(t)T}} - \eta_i^2\big\|x_i^{(t)}\big(\epsilon_i + x_i^{(t)T}\Delta\theta_{t,T}^*\big)\big\|_2^2 - 2\eta_i(\theta_i - \theta_T^*)^T\big(I - \eta_i x_i^{(t)}x_i^{(t)T}\big)x_i^{(t)}\big(\epsilon_i + x_i^{(t)T}\Delta\theta_{t,T}^*\big).$$
The first term on the right-hand side is the absolute gain shared by all tasks. In expectation, the second term is
$$\mathbb{E}\,\eta_i^2\big\|x_i^{(t)}\big(\epsilon_i + x_i^{(t)T}\Delta\theta_{t,T}^*\big)\big\|_2^2 = \mathbb{E}\,\eta_i^2\|x_i^{(t)}\|_2^2\Big(\sigma_t^2 + \|\Delta\theta_{t,T}^*\|^2_{x_i^{(t)}x_i^{(t)T}}\Big). \tag{10}$$
In expectation, the third term is
$$\mathbb{E}\,\eta_i(\theta_i - \theta_T^*)^T\big(I - \eta_i x_i^{(t)}x_i^{(t)T}\big)x_i^{(t)}x_i^{(t)T}\Delta\theta_{t,T}^* = \mathbb{E}\,\big(1 - \eta_i\|x_i^{(t)}\|_2^2\big)\,\eta_i(\theta_i - \theta_T^*)^T x_i^{(t)}x_i^{(t)T}\Delta\theta_{t,T}^*. \tag{11}$$
We now discuss terms (10) and (11). Term (11) is independent of $\sigma_t^2$: it is a dynamic effect depending on the current estimate $\theta_i$, so it is independent of the task difficulty and constantly changes. When $(\theta_i - \theta_T^*)^T x_i^{(t)}x_i^{(t)T}(\theta_t^* - \theta_T^*) < 0$, task $t$ has a larger prediction gain; this is when the gradient descent directions for the target task and task $t$ are consistent. For term (10), we note that the task difficulty $\sigma_t^2$ and the transfer distance $\Delta_{t,T}$ play equally important roles in the prediction gain, regardless of the number of observations.
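The decomposition suggests a practical way to approximate the scheduler: estimate each task's one-step gain on held-out target data. The sketch below does this with choices (held-out batch, step size) that are ours, not the paper's; the analysis instead assumes the gain is known exactly:

```python
import numpy as np

rng = np.random.default_rng(6)
d, T = 5, 4
theta_star = rng.normal(size=(T, d)) * 0.1
sigma = np.array([0.1, 0.5, 1.0, 0.3])

def sample(t):
    x = rng.normal(size=d)
    return x, x @ theta_star[t] + sigma[t] * rng.normal()

X_hold = rng.normal(size=(200, d))                 # held-out target inputs
y_hold = X_hold @ theta_star[-1]                   # noiseless for clarity

def estimated_gain(theta, t, eta):
    loss = lambda th: np.mean((y_hold - X_hold @ th) ** 2)
    x, y = sample(t)
    theta_next = theta + eta * x * (y - x @ theta)  # one virtual SGD step
    return loss(theta) - loss(theta_next)

theta = np.zeros(d)
gains = [np.mean([estimated_gain(theta, t, 0.05) for _ in range(200)])
         for t in range(T)]
print("estimated one-step gains:", np.round(gains, 4))
```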
**Optimality of prediction gain.** Let $t^*$ be the optimal task defined by
$$t^* = \arg\min_t \Big\{\Delta_{t,T}^2 + \frac{d\sigma_t^2}{N}\Big\}.$$
We consider an averaging SGD algorithm with step size $\eta_i = 1/i$, and let $\bar\theta_N = \sum_{i=1}^N \theta_i/N$. The following theorem shows that the performance of averaging SGD with an accurate prediction-gain based task scheduler matches the minimax lower bound in Theorem 1.

**Theorem 6.** Assume $\Sigma_1 = \cdots = \Sigma_T = I$ and $\|\theta_t^*\|_2^2 \le C_5$ for all $t$. Given $T$ tasks with noise levels $\sigma_1^2, \ldots, \sigma_T^2$ and transfer distances $\Delta_{1,T}, \ldots, \Delta_{T,T}$, let $\bar\theta_N$ be the averaging SGD estimator with the accurate prediction-gain based task scheduler defined above. We have
$$G_T(\bar\theta_N) \lesssim \Delta_{t^*,T}^2 + \frac{(d\sigma_{t^*}^2 + C_5)\log(N)}{N}.$$

Theorem 6 gives an upper bound on $G_T(\bar\theta_N)$ that matches the lower bound in Theorem 1 up to a logarithmic factor.

### 5.1. Simulation Studies

To complement the theoretical analyses, we conduct simulation studies applying actual SGD with tasks chosen to maximize the local prediction gain. We consider two scenarios: 1) the algorithm has access to the accurate prediction gain, as in our analysis; 2) the algorithm has to estimate the prediction gain. For the second scenario, we follow Graves et al. (2017), which regards task scheduling as a sequential decision-making problem; a popular choice of agent is an adversarial bandit model. Specifically, we use the EXP3 algorithm (see Appendix H for details). Bandit algorithms run by maximizing rewards. In our experiments, let $\theta_i$ be the estimate at step $i$. We sample one observation $(x_i, y_i)$ from the target task after each gradient step, and the reward at step $i$ is given by
$$r_i = \big(y_i - \theta_{i-1}^T x_i\big)^2 - \big(y_i - \theta_i^T x_i\big)^2.$$
To evaluate the accurate prediction gain, we directly calculate the distance $\|\theta_i - \theta_T^*\|_2$.

Following the setup used throughout the paper, we consider a multitask linear regression problem. We set $T = 5$ and $\sigma_t^2 = 0.001, 0.01, 0.1, 1, 1$ for $t = 1, \ldots, 5$, where the 5th task is the target task. We test the effects of the total number of observations $n = 10, 50, 100, 500, 1000$ and of the dimension $d = 5, 10, 50, 100$; by default, $n = 1000$ and $d = 5$. The true parameters of all tasks are sampled from $N(0, 0.001 I_d)$, so in expectation the squared transfer distance $\Delta_{t,T}^2$ between task $t$ and the target is about $0.01d$. The inputs $x$ are sampled from the same distribution $N(0, I_d)$ for all tasks.

*Figure 2. $L_2$ distance between the final estimate $\theta_i$ and $\theta_T^*$ under different total numbers of observations $n$ and different dimensions $d$. The confidence intervals show the standard deviation over 1000 independent runs.*

Figure 2 shows the $L_2$ distance between the final estimate and the true parameter of the target task. Our simulation results suggest that 1) a prediction-gain based task scheduler can significantly improve performance on the target task when there exists some source task with low transfer distance and low noise, and 2) there are still benefits when the scheduler has to adaptively select tasks, which coincides with our Theorem 3. The results are robust across the different choices of $n$ and $d$.
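For reference, a compressed version of the simulation protocol is sketched below. The hyperparameters follow the text where given ($T = 5$, the noise levels, $N(0, 0.001 I_d)$ parameters, step size $\eta_i = 1/i$); the uniform stand-in scheduler and everything else are our own filling-in:

```python
import numpy as np

rng = np.random.default_rng(7)
d, T, n = 5, 5, 1000
sigma = np.array([0.001, 0.01, 0.1, 1.0, 1.0]) ** 0.5   # sigma_t from sigma_t^2
theta_star = rng.normal(scale=np.sqrt(0.001), size=(T, d))

theta = np.zeros(d)
for i in range(1, n + 1):
    t = rng.integers(T)                   # stand-in for the task scheduler
    x = rng.normal(size=d)
    y = x @ theta_star[t] + sigma[t] * rng.normal()
    theta_prev = theta.copy()
    theta = theta + (1.0 / i) * x * (y - x @ theta)      # SGD, eta_i = 1/i
    # Reward from one fresh target sample; a bandit scheduler (e.g. EXP3)
    # would consume r, while the uniform stand-in above ignores it.
    xg = rng.normal(size=d)
    yg = xg @ theta_star[-1] + sigma[-1] * rng.normal()
    r = (yg - xg @ theta_prev) ** 2 - (yg - xg @ theta) ** 2

print("final L2 error:", np.linalg.norm(theta - theta_star[-1]))
```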
## 6. Discussion

In this paper, we discussed the benefits of curriculum learning in two special settings: multitask linear regression and multitask representation learning. In the multitask linear regression setting, it is fundamentally hard to adaptively identify the optimal source task to transfer from. In the multitask representation learning setting, a good curriculum is one that diversifies the source tasks; we show that the extra error caused by adaptive learning is small and that a near-optimal curriculum is achievable. We also provided theoretical justification for the popular prediction-gain driven task scheduler used in empirical work.

Our results suggest natural directions for future work. We show a lower bound (Thm. 5) on the diversity in the multitask representation learning setting, leaving a gap of $d$ compared with our upper bound (Thm. 4). We believe this gap is due to a loose construction of the hard cases that ignores the difficulty of learning the shared representation. Another direction is to show whether prediction-gain methods without accurate gain estimation can still perform close to the lower bounds for adaptive learning. An important further direction is to consider how the order in which tasks are presented affects learning performance. Since the order of tasks is irrelevant for the analysis of empirical risk minimizers, one has to analyze the actual benefits in terms of optimization.

**Acknowledgement.** We acknowledge the support of NSF via grant IIS-2007055.

## References

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, 2009.

Bhatt, H. S., Rajkumar, A., and Roy, S. Multi-source iterative adaptation for cross-domain classification. In IJCAI, pp. 3691–3697, 2016.

Cioba, A., Bromberg, M., Wang, Q., Niyogi, R., Batzolis, G., Shiu, D.-s., and Bernacchia, A. How to distribute data across tasks for meta-learning? arXiv preprint arXiv:2103.08463, 2021.

David, S. B., Lu, T., Luu, T., and Pál, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129–136. JMLR Workshop and Conference Proceedings, 2010.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Gong, C., Tao, D., Maybank, S. J., Liu, W., Kang, G., and Yang, J. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.

Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320. PMLR, 2017.

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM (JACM), 68(2):1–29, 2021.

Kalan, S. M. M., Fabian, Z., Avestimehr, A. S., and Soltanolkotabi, M. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. Advances in Neural Information Processing Systems, 33:1959–1969, 2020.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Liu, C., Wang, Z., Sahoo, D., Fang, Y., Zhang, K., and Hoi, S. C. Adaptive task sampling for meta-learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII, pp. 752–769. Springer, 2020.
Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.

Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., and Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21:1–50, 2020.

Persello, C. and Bruzzone, L. Active learning for domain adaptation in the supervised classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 50(11):4468–4483, 2012.

Sachan, M. and Xing, E. Easy questions first? A case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 453–463, 2016.

Settles, B. Active learning literature survey. 2009.

Settles, B. From theories to queries: Active learning in practice. In Active Learning and Experimental Design Workshop in Conjunction with AISTATS 2010, pp. 1–18. JMLR Workshop and Conference Proceedings, 2011.

Sun, S., Shi, H., and Wu, Y. A survey of multi-source domain adaptation. Information Fusion, 24:84–92, 2015.

Tang, Y., Wang, X., Harrison, A. P., Lu, L., Xiao, J., and Summers, R. M. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, pp. 249–258. Springer, 2018.

Teshima, T., Sato, I., and Sugiyama, M. Few-shot domain adaptation by causal mechanism transfer. In International Conference on Machine Learning, pp. 9458–9469. PMLR, 2020.

Tripuraneni, N., Jordan, M. I., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33:7852–7862, 2020.

Wang, M. and Deng, W. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.

Weinshall, D. and Amir, D. Theory of curriculum learning, with convex loss functions. Journal of Machine Learning Research, 21(222):1–19, 2020.

Xu, Z. and Tewari, A. Representation learning beyond linear prediction functions. Advances in Neural Information Processing Systems, 34, 2021.

Xu, Z., Meisami, A., and Tewari, A. Decision making problems with funnel structure: A multi-task learning approach with application to email marketing campaigns. In International Conference on Artificial Intelligence and Statistics, pp. 127–135. PMLR, 2021.

Yao, J., Killian, T., Konidaris, G., and Doshi-Velez, F. Direct policy transfer via hidden parameter Markov decision processes. In LLARLA Workshop, FAIM, volume 2018, 2018.

## A. Proof of Theorem 1

*Proof.* Our proof is inspired by the proof of Kalan et al. (2020), which gives a lower bound construction for the two-task transfer learning problem. Our result can be seen as an extension of their construction to the multiple-source-task setting. Define the optimal task and the rate
$$t^* = \arg\min_t \Big\{Q_t^2 + \frac{d\sigma_t^2}{N}\Big\}, \qquad \delta^2 = \frac{1}{64}\Big(Q_{t^*}^2 + \frac{d\sigma_{t^*}^2}{N}\Big).$$
In general, we construct $T\times M$ parameters $\{\theta_{t,i}\}_{t\in[T],\,i\in[M]}$, with the $t$-th row corresponding to the hypothesis set of the $t$-th task. We start by constructing the hypothesis sets of the target task and of task $t^*$. Let $\delta' = Q_{t^*}/16 + \delta$; by definition, $\delta \le \delta' \le 1.5\delta$. Consider the set $\Theta' = \{\theta : \|\theta\|_2 \le 2\delta'\}$, and let $\{\theta_{t^*,1}, \ldots, \theta_{t^*,M}\}$ be a $\delta'$-packing of this set in the $L_2$ norm ($\|\theta_{t^*,i} - \theta_{t^*,j}\|_2 \ge \delta'$).
Such a packing exists with $\log(M) \ge d\log(2)$. Since $\theta_{t^*,i}, \theta_{t^*,j} \in \Theta'$, we also have $\|\theta_{t^*,i} - \theta_{t^*,j}\|_2 \le 4\delta'$ for any $i,j\in[M]$.

Now we construct the hypothesis set of the target task. For each $i\in[M]$, we choose $\theta_{T,i}$ such that $\|\theta_{T,i} - \theta_{t^*,i}\|_2 = Q_{t^*}/16$. The construction for the target task then satisfies $\|\theta_{T,i} - \theta_{T,j}\|_2 \ge \delta' - Q_{t^*}/8 \ge \delta/2$ and $\|\theta_{T,i} - \theta_{T,j}\|_2 \le 4\delta' + Q_{t^*}/8 \le 5\delta'$.

We distinguish two cases. For any task $t$ with $Q_t \ge 5\delta'$, we randomly pick a parameter from the hypothesis set of the target task, denote it by $\bar\theta_t$, and set $\theta_{t,i} = \bar\theta_t$ for all $i\in[M]$. This construction is valid since $\|\theta_{t,i} - \theta_{T,i}\|_2 \le 5\delta' \le Q_t$. For any task $t$ with $Q_t < 5\delta'$, we use the same construction as for $t^*$.

Let $J$ be a random variable uniform over $[M]$ representing the true hypothesis. The samples for each task $t$ are generated i.i.d. from the linear model of Section 3.1 with parameter $\theta_{t,J}$. Our goal is to show that, in expectation, any algorithm performs as badly as stated in Theorem 1. Let $E_t$ be a random sample from task $t$ given the true parameter $\theta_{t,J}$. Similarly to (5.2) in Kalan et al. (2020), Fano's inequality yields
$$R_T^N(\Theta(Q)) \ge \delta^2\Big(1 - \frac{\log(2) + \sum_{t=1}^T n_t\, I(J; E_t)}{\log(M)}\Big). \tag{13}$$
We proceed by giving a uniform bound on the mutual information terms.

**Lemma 2** (Lemma 1 in Kalan et al. (2020)). The mutual information between $J$ and any sample $E_t$ is bounded by $I(J; E_t) \le \frac{1}{M^2}\sum_{i,j} D_{\mathrm{KL}}\big(P_{\theta_{t,i}} \,\|\, P_{\theta_{t,j}}\big)$, where $P_{\theta_{t,i}}$ is the distribution induced by the parameter $\theta_{t,i}$. Furthermore,
$$D_{\mathrm{KL}}\big(P_{\theta_{t,i}} \,\|\, P_{\theta_{t,j}}\big) = \frac{\|\Sigma_t^{1/2}(\theta_{t,i} - \theta_{t,j})\|_2^2}{2\sigma_t^2} \le \frac{C_0\|\theta_{t,i} - \theta_{t,j}\|_2^2}{2\sigma_t^2}.$$

Using Lemma 2, we bound the mutual information for each task.

**Lemma 3.** Under the construction above, $I(J; E_t) \le \frac{512\,C_0\,\delta'^2}{7\sigma_{t^*}^2}$ for all $t\in[T]$.

*Proof.* For any task in the first case above ($Q_t \ge 5\delta'$), the parameter does not depend on $J$, so $I(J; E_t) = 0$ and the statement holds trivially. Now consider the second case. By the definition of $t^*$,
$$Q_t^2 + \frac{d\sigma_t^2}{N} \ge Q_{t^*}^2 + \frac{d\sigma_{t^*}^2}{N} = 64\delta^2. \tag{14}$$
Note that $Q_t < 5\delta' = 5(Q_{t^*}/16 + \delta) \le 7.5\delta$. Plugging this back into (14) gives $d\sigma_t^2/N \ge 7\delta^2$, while by definition $d\sigma_{t^*}^2/N \le 64\delta^2$. Therefore $\sigma_t^2 \ge 7\sigma_{t^*}^2/64$. Since the constructions coincide in the second case, the mutual information is uniformly bounded by
$$I(J; E_t) \le \frac{1}{M^2}\sum_{i,j}\frac{C_0\|\theta_{t^*,i} - \theta_{t^*,j}\|_2^2}{2\sigma_t^2} \le \frac{512\,C_0\,\delta'^2}{7\sigma_{t^*}^2}. \qquad\square$$

Finally, we follow the analysis in Section 7.4 of Kalan et al. (2020). Using Lemma 3 in Equation (13), we have
$$R_T^N(\Theta(Q)) \ge \delta^2\Big(1 - \frac{\log(2) + N\cdot\frac{512\,C_0\,\delta'^2}{7\sigma_{t^*}^2}}{d\log(2)}\Big).$$
Plugging in $\delta' = Q_{t^*}/16 + \delta$, we conclude $R_T^N(\Theta(Q)) \gtrsim Q_{t^*}^2 + \frac{d\sigma_{t^*}^2}{N}$. $\square$

## B. Proof of Theorem 2

*Proof.* We first show the lower bound for the first term in the minimum. We construct the following problem: we have $T-1$ tuples of parameters $\{(\theta_{1,i}, \ldots, \theta_{T,i})\}_{i=1}^{T-1}$, where $\theta_{t,i}$ corresponds to the parameter of the $t$-th task in the $i$-th instance. Let $\{\bar\theta_i\}_{i=1}^{T-1}$ be a set of parameters that are $2\delta$-separated for some $\delta > 0$. The parameters of our source and target tasks are chosen as follows:

1. $\theta_{t,i} = \bar\theta_t$ for all $t\in[T-1]$ (the source tasks do not depend on $i$);
2. $\theta_{T,i} = \bar\theta_i$ for all $i\in[T-1]$.

Two important properties of this construction are that 1) there is always exactly one source task identical to the target task, and 2) the information from the source tasks cannot help learn the target task. Let $J$ be uniform over $[T-1]$, and let $n_1, \ldots, n_{T-1}$ and $n_T$ be the numbers of observations from the $T-1$ source tasks and from the target task, respectively, under the instance $(\theta_{1,J}, \ldots, \theta_{T,J})$.

**Proposition 2.** Since $J$ is independent of $(\theta_{1,J}, \ldots, \theta_{T-1,J})$, the mutual information $I(J; \theta_{1,J}, \ldots, \theta_{T-1,J}) = 0$.

Let $\psi$ be any test statistic mapping our datasets to an index. By Fano's lemma,
$$\tilde R_T^N(\tilde\Theta(Q)) \ge \delta^2\Big(1 - \frac{I\big(J; \psi(S_1^{n_1}, \ldots, S_T^{n_T})\big) + \log(2)}{\log(T-1)}\Big).$$
To proceed, we analyze the mutual information:
$$I\big(J; \psi(S_1^{n_1}, \ldots, S_T^{n_T})\big) \le I(J; S_1^{n_1}, \ldots, S_T^{n_T}) \le I(J; S_1^{n_1}, \ldots, S_{T-1}^{n_{T-1}}) + I(J; S_T^{n_T}) = I(J; S_T^{n_T}),$$
where the second inequality uses the independence of $S_1^{n_1}, \ldots, S_{T-1}^{n_{T-1}}$ and $S_T^{n_T}$, and the equality uses Proposition 2. Let $E$ be a random sample from the target task. We follow the analysis of Kalan et al. (2020), constructing the $\bar\theta_i$ as a $2\delta$-packing of the set $\{\theta\in\mathbb{R}^d : \|\theta\|_2 \le 4\delta\}$; such a packing exists as long as $\log(T-1) \le d\log(2)$. The mutual information under this construction satisfies
$$I(J; S_T^{n_T}) \le n_T\, I(J; E) \le \frac{32\, n_T\,\delta^2}{\sigma_T^2}.$$
Choosing the optimal $\delta^2 = \log((T-1)/2)\,\sigma_T^2/(64\, n_T)$, we have, for some $c > 0$,
$$\tilde R_T^N \ge c\,\frac{\sigma_T^2\log(T-1)}{n_T}.$$
Note that the lower bound of Theorem 1 still applies here. In total, since $n_T \le N$, we have
$$\tilde R_T^N \gtrsim \frac{\sigma_T^2\log(T)}{N} + \min_t \frac{d\sigma_t^2}{N}.$$
The above analysis only works when $Q_{\mathrm{sub}}^2 \ge \delta^2 = \log((T-1)/2)\,\sigma_T^2/(64\, n_T)$; otherwise, one at least suffers $Q_{\mathrm{sub}}^2$ plus the learning difficulty term $\min_t \frac{d\sigma_t^2}{N}$. $\square$

## C. Proof of Theorem 3

Since the number of observations for each source task is $N/(2T-2)$, we note that
$$L_t(\hat\theta_t) - L_t(\theta_t^*) \lesssim \frac{C_0\, dT\sigma_t^2\log(T/\delta)}{N}.$$
Using Assumption 1 and the definition of the loss function, we have
$$\|\hat\theta_t - \theta_t^*\|_2^2 \lesssim \frac{C_0\, dT\sigma_t^2\log(T/\delta)}{C_1 N}.$$
The proof of Theorem 3 is similar to many proofs of generalization bounds. Let the empirical loss on the target task w.r.t. $\theta$ be
$$\hat L_T(\theta) = \frac{2}{N}\sum_{i=1}^{N/2}\big(Y_{T,i} - X_{T,i}^T\theta\big)^2.$$
Write $Y_{T,i} = X_{T,i}^T\theta_T^* + \epsilon_{T,i}$, and let $\hat\Sigma_T = \frac{1}{n_T}\sum_{i=1}^{n_T} X_{T,i}X_{T,i}^T$ be the sample covariance matrix. We start with
$$L_T(\hat\theta_{\hat t}) - \min_{t\in[T-1]} L_T(\hat\theta_t) \le \hat L_T(\hat\theta_{\hat t}) - \min_{t\in[T-1]}\hat L_T(\hat\theta_t) + 2\max_{t\in[T-1]}\big|L_T(\hat\theta_t) - \hat L_T(\hat\theta_t)\big| = 2\max_{t\in[T-1]}\big|L_T(\hat\theta_t) - \hat L_T(\hat\theta_t)\big|,$$
where the last equality follows from the definition of $\hat t$. Now we bound the deviation term. Let $n_T = N/2$.
Since J is independent of (θ1,J, . . . , θT 1,J), we have the mutual information I(J; θ1,J, . . . , θT 1,J) = 0. Let ψ be any test statistics that maps our dataset to an index. For all ψ, by Fano s Lemma, we can conclude that RN T ( Θ(Q)) δ2 1 i=1 P{ψ(Sn1 1 , . . . , Sn M M , Sn T T ) = j} δ2 1 I(J; ψ(Sn1 1 , . . . , Sn T T )) + log(2)) log(T 1) Submission and Formatting Instructions for ICML 2022 To proceed, we analyze the mutual information I(J; ψ(Sn1 1 , . . . , Sn T T )) I(J; Sn1 1 , . . . , Sn T T ) (By the independence of Sn1 1 , . . . , Sn T 1 T 1 and Sn T T ) I(J; Sn1 1 , . . . , Sn T 1 T 1 ) + I(J; Sn T T ) = I(J; Sn T T ). Let E be a random sample from the target task. We follow the analysis from (Kalan et al., 2020), which construct θ by the 2δ-packing of the set {θ : θ Rd, θ 2 4δ}. Then we can find such packing as long as T 1 d log(2). The mutual information by the above construction gives I(J; Sn T T ) n T I(J; E) n T 32δ2 By choosing the optimal δ = log((T 1)/2)σ2 T /(64n T ), we have for some c > 0, RN T σ2 T log(T 1) Note that the lower bound by Theorem 1 still applies here. In total, since n T N, we have RN T σ2 T log(T) N + min t dσ2 t N . The above analysis only works when Q2 sub δ = log((T 1)/2)σ2 T /(64n T ). Otherwise, one will at least suffer Q2 sub plus the learning difficulty term mint dσ2 t N . C. Proof of Theorem 3 Since the number of observations for each source task is N/(2T 2), we notice that Lt(ˆθt ) Lt(θ t ) C0d Tσ2 t log(T/δ) Using Assumption 1 and the definition of the loss function, we have ˆθt θ t 2 2 C0d Tσ2 t log(T/δ) C1N . The proof of Theorem 3 is similar to many proofs of generalization bound. We let the empirical loss on the target task w.r.t θ be ˆLT (θ) = 2 YT,i XT T,iθ 2 . Write YT,i = XT T,iθ T + ϵT,i. Let ˆΣT = 1 n T Pn T i=1 XT,i XT T,i be the sample covariance matrix. We start with LT (ˆθt ) min t [T 1] LT (ˆθt) ˆLT (ˆθt ) min t [T 1] ˆLT (ˆθt) + 2 max t [T 1] |LT (ˆθt) ˆLT (ˆθt)| = 2 max t [T 1] |LT (ˆθt) ˆLT (ˆθt)|, where the last equality is based on the definition of t . Now we bound the difference term. Let n T = N/2. Submission and Formatting Instructions for ICML 2022 Lemma 4. With a probability at least 1 δ, we have |LT (ˆθt) ˆLT (ˆθt)| C1C2 2d log(T/δ)σ2 T n T + n T for all t [T 1]. Proof. We make the following decomposition. |LT (ˆθt) ˆLT (ˆθt)| = | ˆθt θ T 2 ΣT + σ2 T 1 i=1 [XT T,i(ˆθt θ T ) ϵT,i]2| ˆθt θ T 2 ΣT ˆΣT + |σ2 T 1 i=1 ϵ2 T,i| + 1 i=1 XT T,i(ˆθt θ T )ϵT,i|. Now we bound the three terms above separately. The second term is the concentration for χ2(n T ) distribution. We have with a probability at least 1 δ/(3T), |σ2 T 1 n T Pn T i=1 ϵ2 T,i| σ2 T ( p log(3T/δ)/n T + log(3T/δ)/n T ). To proceed, we consider the concentration of sample covariance matrix. Lemma 5 (Matrix Hoeffding s inequality). Let X1, . . . , Xn be centered, independent, symmetric, d d random matrices that are sub-Gaussian with parameters V1, . . . , Vn. Then for all δ > 0 with a probability 1 δ, log(2d/δ)2σ2 n , where σ2 = 1 Using the boundedness of both ˆθt and θ T , ˆθt θ T 2 C2. Then applying Lemma 5, we have with a probability at least 1 δ/(3T), ˆθt θ T 2 ΣT ˆΣT C2 2C0 log(Td/δ) For the third term, we apply the martingale concentration inequality on the sum Pn T i=1 XT T,i(ˆθt θ T )ϵT,i, we have with a probability at least 1 δ/(3T), we have for all t [T 1] i=1 XT T,i(ˆθt θ T )ϵT,i| C0C2σ2 T log(T/δ) D. Proof of Proposition 1 Proof. The target task is basically minimizing the empirical loss over T 1 estimators. 
We first apply the standard generalization bound with Rademacher complexity:
$$L_T(\hat f_{\hat t}) \le \min_{t\in[T-1]} L_T(\hat f_t) + c\sqrt{\frac{\log(T/\delta)}{N}},$$
where $c$ is a universal constant. To proceed, we bound $\min_{t\in[T-1]} L_T(\hat f_t)$. By assumption, $L^* = \min_t L_T(f_t^*)$, attained at $t^*$. Using Assumption 2,
$$\min_{t\in[T-1]} L_T(\hat f_t) \le L_T(\hat f_{t^*}) = \mathbb{E}_{(X_T,Y_T)}\,\ell\big(\hat f_{t^*}(X_T), Y_T\big) \le L^* + L_1\,\mathbb{E}_{X_T}\big\|\hat f_{t^*}(X_T) - f_{t^*}^*(X_T)\big\|_2 \le L^* + L_1\sqrt{\frac{2}{L_2}\big(L_{t^*}(\hat f_{t^*}) - L_{t^*}^*\big)},$$
where the last step uses $L_2$-strong convexity. Applying the generalization bound to $L_{t^*}(\hat f_{t^*}) - L_{t^*}^*$ gives the result. $\square$

## E. Proof of Theorem 4

### E.1. Proof of Lemma 1

We borrow some techniques from Du et al. (2020) for the proof of Theorem 4. We start with the proof of Lemma 1, which provides a valid confidence set for the unknown parameters $B^*\beta_t^*$. First, let $X^{(1)}_{i,t}$ denote the design matrix of the first split $S^{(1)}_{i,t}$.

**Claim 1** (Covariance concentration on the first split). For $\delta\in(0,1)$, there exists a constant $\gamma_1 > 0$ such that, with probability at least $1-\delta/10$,
$$0.9\,\Sigma_t \preceq \frac{2}{n_{i,t}}\, X^{(1)T}_{i,t}X^{(1)}_{i,t} \preceq 1.1\,\Sigma_t$$
for all $i$ with $n_{i,t} \ge \gamma_1(d + \log(N/\delta))$.

**Claim 2** (Covariance concentration on the second split). For $\delta\in(0,1)$, there exists $\gamma_2 > 0$ such that, for any fixed $\bar B\in\mathbb{R}^{d\times 2k}$ independent of $X^{(2)}_{i,t}$, with probability at least $1-\delta/10$,
$$0.9\,\bar B^T\Sigma_t\bar B \preceq \frac{2}{n_{i,t}}\,\bar B^T X^{(2)T}_{i,t}X^{(2)}_{i,t}\bar B \preceq 1.1\,\bar B^T\Sigma_t\bar B$$
for all $i$ with $n_{i,t} \ge \gamma_2(d + \log(N/\delta))$.

Let $\gamma = \max\{\gamma_1, \gamma_2\}$ and recall $\kappa = C_0/C_1$. The good events of Claims 1 and 2 hold for all $i$ with $n_{i,t} \ge \gamma(d + \log(N/\delta))$. We first apply Claim A.3 of Du et al. (2020), which controls the loss on the source training data; we rephrase it here as Lemma 6. The only difference is that we require the good events to hold at every step at which the $n_{i,t}$ are sufficiently large.

**Lemma 6** (Claim A.3 in Du et al. (2020)). With probability at least $1-\delta/5$, for all $i$ with $n_{i,t} \ge \gamma(d + \log(N/\delta))$,
$$\sum_{t=1}^T\big\|X^{(1)}_{i,t}\big(\hat B_i\hat\beta_{i,t} - B^*\beta_t^*\big)\big\|_2^2 \lesssim \sigma^2\big(kT + dk\log(\kappa i/T) + \log(N/\delta)\big). \tag{16}$$

Note that $X^{(1)}_{i,t}\hat B_i\hat\beta_{i,t} = P_{X^{(1)}_{i,t}\hat B_i} Y_t = P_{X^{(1)}_{i,t}\hat B_i}\big(X^{(1)}_{i,t}B^*\beta_t^* + z_t\big)$, where $P_A$ denotes the orthogonal projection onto the column space of $A$ and $z_t$ is the noise vector. Combining Lemma 6 with the covariance concentration of Claims 1 and 2 and using Assumption 1, for each fixed $t\in[T]$ we have
$$\big\|\hat B_i\hat\beta_{i,t} - B^*\beta_t^*\big\|_2^2 \le \frac{1}{C_1}\big\|\Sigma_t^{1/2}\big(\hat B_i\hat\beta_{i,t} - B^*\beta_t^*\big)\big\|_2^2 \le \frac{2}{0.9\,C_1 n_{i,t}}\big\|X^{(2)}_{i,t}\big(\hat B_i\hat\beta_{i,t} - B^*\beta_t^*\big)\big\|_2^2 \lesssim \frac{1}{C_1 n_{i,t}}\Big(\big\|P_{X^{(2)}_{i,t}\hat B_i}X^{(2)}_{i,t}B^*\beta_t^* - X^{(2)}_{i,t}B^*\beta_t^*\big\|_2^2 + \big\|P_{X^{(2)}_{i,t}\hat B_i} z_t\big\|_2^2\Big),$$
where the signal term is controlled via (16). For the noise term, $\|P_{X^{(2)}_{i,t}\hat B_i} z_t\|_2^2/\sigma^2$ follows a $\chi^2(k)$ distribution (the projection has rank $k$); thus with probability at least $1-\delta$,
$$\big\|P_{X^{(2)}_{i,t}\hat B_i} z_t\big\|_2^2 \lesssim \sigma^2\big(k + \log(NT/\delta)\big)$$
for all $t\in[T]$ and $i > T\lceil\gamma(d + \log(N/\delta))\rceil$. Therefore we obtain the bound: for all such $i$ and all $t\in[T]$,
$$\big\|\hat B_i\hat\beta_{i,t} - B^*\beta_t^*\big\|_2^2 \lesssim \sigma^2\,\frac{kT + dk\log(\kappa i/T) + \log(N/\delta)}{C_1 n_{i,t}/C_5} + \sigma^2\,\frac{k + \log(NT/\delta)}{C_1 n_{i,t}} \le \frac{\alpha C_5\sigma^2 dk\log(\kappa N/(T\delta))}{n_{i,t}}$$
for a suitable universal constant $\alpha$, where the last step uses $T \le d$.

### E.2. Proof of the full theorem

Now we prove the full theorem. By Assumption 3, we relate our target $\lambda_k(\sum_{i=1}^N\beta_{t_i}^*\beta_{t_i}^{*T})$ to $\lambda_k(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T})$:
$$\lambda_{N,k} = \lambda_k\Big(\frac{1}{N}\sum_{i=1}^N\beta_{t_i}^*\beta_{t_i}^{*T}\Big) \ge \frac{1}{NC_4^2}\,\lambda_k\Big(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T}\Big).$$
Then we follow the standard decomposition framework of UCB-style analyses:
$$\lambda_k\Big(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T}\Big) = \lambda_k\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big) + \Big[\lambda_k\Big(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T}\Big) - \lambda_k\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big)\Big]. \tag{17}$$
Our proof proceeds by first showing
$$\frac{1}{N}\,\lambda_k\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big) \gtrsim \frac{\lambda}{k},$$
which is usually interpreted as optimism.
We then bound the difference term
$$\lambda_k\Big(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T}\Big) - \lambda_k\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big),$$
which is expected to vanish when $N$ becomes large.

**Proof of optimism.** We apply Lemma 1, which implies $B^*\beta_t^* \in \mathcal{B}_{i,t}$ for all $t\in[T]$ and $i\in[N]$ on the good event. It is then easy to show that the greedy selection over the confidence sets leads to
$$\lambda_k\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big) \ge \lambda\Big(\frac{N}{k} - 1\Big).$$
We prove this by induction. Assume that at step $n$ we have $\nu^T\big(\sum_{i=1}^n\tilde\theta_i\tilde\theta_i^T\big)\nu \ge \lambda(n/k - 1)$ for all unit vectors $\nu$. We show that by step $n+k$ we have at least $\nu^T\big(\sum_{i}\tilde\theta_i\tilde\theta_i^T\big)\nu \ge \lambda(n/k)$ for all unit $\nu$. The argument is simple: whenever there exists a unit direction $\nu$ along which the quadratic form is below $\lambda(n/k)$, the diversity condition provides a task whose true parameter (and hence whose optimistic parameter) raises that direction by at least $\lambda$, and the optimistic selection picks such a task. This can be needed at most $k$ times.

**Upper bounding the difference.** We first express the difference of eigenvalues in terms of the difference of the matrices, using a standard trick:
$$\lambda_k\Big(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T}\Big) - \lambda_k\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big) = \min_{\|\nu\|_2=1}\nu^T\Big(\sum_{i=1}^N B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T}\Big)\nu - \min_{\|\nu\|_2=1}\nu^T\Big(\sum_{i=1}^N\tilde\theta_i\tilde\theta_i^T\Big)\nu \ge -\max_{\|\nu\|_2=1}\Big|\nu^T\sum_{i=1}^N\big(B^*\beta_{t_i}^*\beta_{t_i}^{*T}B^{*T} - \tilde\theta_i\tilde\theta_i^T\big)\nu\Big| \ge -\sum_{i=1}^N\big\|B^*\beta_{t_i}^* - \tilde\theta_i\big\|_2\big(\|B^*\beta_{t_i}^*\|_2 + \|\tilde\theta_i\|_2\big) \gtrsim -\sum_{i=1}^N\big\|B^*\beta_{t_i}^* - \tilde\theta_i\big\|_2,$$
where the last step uses Assumptions 3 and 4 to bound $\|B^*\beta_{t_i}^*\|_2 + \|\tilde\theta_i\|_2$ by a constant. Applying Lemma 1 and the construction of the confidence sets $\mathcal{B}_{i,t}$,
$$\big\|B^*\beta_{t_i}^* - \tilde\theta_i\big\|_2 \le 2\sqrt{\frac{\alpha C_5\sigma^2 dk\log(\kappa N/(T\delta))}{n_{i,t_i}}},$$
and summing over $i$ (each task contributes $\sum_{j\le n_{N,t}} j^{-1/2} \lesssim \sqrt{n_{N,t}}$, and $\sum_t\sqrt{n_{N,t}} \le \sqrt{TN}$),
$$\sum_{i=1}^N\big\|B^*\beta_{t_i}^* - \tilde\theta_i\big\|_2 \lesssim \sqrt{C_5^2\sigma^2 dk\, TN\log(\kappa N/(T\delta))}.$$
Plugging this back into the decomposition (17), we arrive at the final bound. $\square$

## F. Proof of Theorem 5

Assume we have $T$ tasks in total. Pick a set of orthogonal vectors $\{\beta_1, \ldots, \beta_k\}\subset\mathbb{R}^k$ with $\|\beta_i\|_2 = \lambda$. We first construct a simple instance $v$ as follows: the first $k$ tasks are diverse, $(\beta_1^*, \ldots, \beta_k^*) = (\beta_1, \ldots, \beta_k)$, and all the other tasks share the same parameter $\beta_1$. This construction is hard for a naive task scheduler that evenly allocates samples to all tasks, since the direction of $\beta_1$ will be over-exploited.

We evenly divide the $T$ tasks into $M = T/k$ blocks. Let $T_m$ be the total number of visits to the $m$-th block. For any task scheduler, by the pigeonhole principle there exists $m\in[M]$ such that $\mathbb{E}[T_m] \le N/M$.

We then construct another instance $v'$ that is the same as $v$ except that the $m$-th block has the parameters $(2\beta_1, \ldots, 2\beta_k)$ for the $k$ tasks in that block. Let $P_v$ and $P_{v'}$ be the probability measures of the linear regression model of Section 3.1 under $v$ and $v'$, respectively. Define $\Delta_{N,k}(\mathcal{T}, v)$ as the expected diversity gap of task scheduler $\mathcal{T}$ on instance $v$:
$$\Delta_{N,k}(\mathcal{T}, v) := \max_{t_1,\ldots,t_N}\lambda_k\Big(\frac{1}{N}\sum_{i=1}^N\beta_{t_i}^*\beta_{t_i}^{*T}\Big) - \mathbb{E}_v[\lambda_{N,k}].$$
Applying the Bretagnolle–Huber inequality (Theorem 14.2 in Lattimore & Szepesvári (2020)), we have
$$\Delta_{N,k}(\mathcal{T}, v) + \Delta_{N,k}(\mathcal{T}, v') \gtrsim \frac{\lambda}{2k}\Big(P_v(T_m \le N/M) + P_{v'}(T_m > N/M)\Big) \ge \frac{\lambda}{4k}\exp\big(-D(P_v, P_{v'})\big),$$
where $D(P,Q)$ is the relative entropy between distributions $P$ and $Q$. We then apply Lemma 15.1 of Lattimore & Szepesvári (2020), which we rephrase here.

**Lemma 7.** Let $P_t$ and $P_t'$ be the distributions of a sample from the $t$-th task under the true parameters of $v$ and $v'$, respectively, and let $T_t$ be the number of observations of the $t$-th task. Then
$$D(P_v, P_{v'}) = \sum_{t=1}^T\mathbb{E}_v[T_t]\, D(P_t, P_t') \le \mathbb{E}_v[T_m]\max_{t=(m-1)k+1,\ldots,mk} D(P_t, P_t').$$

Since $D(P_t, P_t') = \|\beta_t - \beta_t'\|_2^2/(2\sigma^2) = \lambda^2/(2\sigma^2)$, we have
$$\Delta_{N,k}(\mathcal{T}, v) + \Delta_{N,k}(\mathcal{T}, v') \gtrsim \frac{\lambda}{4k}\exp\Big(-\frac{N\lambda^2}{2M\sigma^2}\Big).$$
Choosing $\lambda = \sqrt{2M\sigma^2/N}$, we obtain
$$\Delta_{N,k}(\mathcal{T}, v) + \Delta_{N,k}(\mathcal{T}, v') \gtrsim \frac{\sigma}{k}\sqrt{\frac{T}{kN}}. \qquad\square$$

## G. Proof of Theorem 6

*Proof.* We follow the standard convergence analysis of SGD. Let $t_i$ be the task the task scheduler chooses at step $i$.
Let $v_i^{(t)}$ be the virtual update direction at step $i$ if task $t$ is scheduled, i.e.,
$$v_i^{(t)} = x_i^{(t)}x_i^{(t)T}\big(\theta_t^* - \theta_i\big) + \epsilon_i^{(t)}x_i^{(t)},$$
where $\epsilon_i^{(t)}$ and $x_i^{(t)}$ are the noise and the input sampled at step $i$ from task $t$. To start, let $\theta_{i+1}^{(t)}$ be the virtual next iterate if task $t$ is scheduled, so that $\theta_{i+1}^{(t)} - \theta_T^* = \theta_i - \theta_T^* + \eta_i v_i^{(t)}$. By algebra, we derive
$$\big\|\theta_{i+1}^{(t)} - \theta_T^*\big\|^2 - \big\|\theta_i - \theta_T^*\big\|^2 = -2\eta_i(\theta_i - \theta_T^*)^T x_i^{(t)}x_i^{(t)T}(\theta_i - \theta_T^*) + 2\eta_i(\theta_i - \theta_T^*)^T x_i^{(t)}x_i^{(t)T}(\theta_t^* - \theta_T^*) + 2\eta_i\epsilon_i^{(t)}x_i^{(t)T}(\theta_i - \theta_T^*) + \eta_i^2\|v_i^{(t)}\|^2.$$
Taking expectations over $x_i^{(t)}$ and $\epsilon_i^{(t)}$ (with $\mathbb{E}[x_i^{(t)}x_i^{(t)T}] = I$) and rearranging the equation, we have
$$\big\|\theta_i - \theta_T^*\big\|^2 = \frac{\|\theta_i - \theta_T^*\|^2 - \mathbb{E}_{t,i}\|\theta_{i+1}^{(t)} - \theta_T^*\|^2}{2\eta_i} + (\theta_i - \theta_T^*)^T(\theta_t^* - \theta_T^*) + \frac{\eta_i}{2}\,\mathbb{E}_{t,i}\|v_i^{(t)}\|^2, \tag{18}$$
where $\mathbb{E}_{t,i}$ takes the marginal expectation over the randomness of $x_i^{(t)}$ and $\epsilon_i^{(t)}$. Note that $L_T(\theta_i) - \sigma_T^2 = \|\theta_i - \theta_T^*\|^2$, so Equ. (18) also holds with $L_T(\theta_i) - \sigma_T^2$ on the left-hand side. Since this holds for any $t$, we take $t = t^*$ and note that, by the definition of the task scheduler with accurate prediction-gain estimates, $\mathbb{E}\|\theta_{i+1} - \theta_T^*\|^2 \le \mathbb{E}\|\theta_{i+1}^{(t^*)} - \theta_T^*\|^2$. Hence
$$\mathbb{E}[L_T(\theta_i)] - \sigma_T^2 \le \frac{\mathbb{E}\|\theta_i - \theta_T^*\|^2 - \mathbb{E}\|\theta_{i+1}^{(t^*)} - \theta_T^*\|^2}{2\eta_i} + \mathbb{E}(\theta_i - \theta_T^*)^T(\theta_{t^*}^* - \theta_T^*) + \frac{\eta_i}{2}\,\mathbb{E}\|v_i^{(t^*)}\|^2.$$
Summing over $i = 1, \ldots, N$ with $\eta_i = 1/i$, the first sum telescopes (by Abel summation) to $\frac{1}{2}\sum_{i=1}^N\mathbb{E}\|\theta_i - \theta_T^*\|^2 - \frac{N}{2}\mathbb{E}\|\theta_{N+1}^{(t^*)} - \theta_T^*\|^2$, where the second of these terms is nonpositive. Using $\sum_{i=1}^N\eta_i \lesssim \log(N)$ and $\mathbb{E}\|v_i^{(t^*)}\|^2 \lesssim \sigma_{t^*}^2 d + C_5$, and moving the $\frac{1}{2}\sum_i\mathbb{E}\|\theta_i - \theta_T^*\|^2$ term to the left-hand side, we get
$$\sum_{i=1}^N\mathbb{E}\|\theta_i - \theta_T^*\|^2 \lesssim \sum_{i=1}^N\mathbb{E}(\theta_i - \theta_T^*)^T(\theta_{t^*}^* - \theta_T^*) + \log(N)\big(\sigma_{t^*}^2 d + C_5\big) \le \sum_{i=1}^N\mathbb{E}\|\theta_i - \theta_T^*\|\,\Delta_{t^*,T} + \log(N)\big(\sigma_{t^*}^2 d + C_5\big),$$
by Cauchy–Schwarz. Solving this quadratic inequality yields
$$\sum_{i=1}^N\mathbb{E}\|\theta_i - \theta_T^*\|^2 \lesssim \sum_{i=1}^N\Delta_{t^*,T}^2 + \log(N)\big(\sigma_{t^*}^2 d + C_5\big) = N\Delta_{t^*,T}^2 + \log(N)\big(\sigma_{t^*}^2 d + C_5\big).$$
Dividing both sides by $N$ and using the convexity of $L_T$ (so that $G_T(\bar\theta_N) \le \frac{1}{N}\sum_i G_T(\theta_i)$), we reach Theorem 6. $\square$

## H. Additional details for simulations

We provide the details of the EXP3 algorithm. The EXP3 task scheduler picks tasks from $\{1, \ldots, T\}$, which is the action set of the bandit problem. In our experiments, $\eta = 0.85$.

**Algorithm 2** Exponential-weight Algorithm for Exploration and Exploitation (EXP3)
1. **Input:** $T$, $K$, $\eta$.
2. Set $\hat S_{0,t} = 0$ for all $t\in[T]$.
3. **for** $i = 1, \ldots, n$ **do**
4. Compute $P_{i,t} = \exp(\eta\hat S_{i-1,t})\big/\sum_{t'\in[T]}\exp(\eta\hat S_{i-1,t'})$ for all $t\in[T]$.
5. Sample $t_i$ from $P_i$ and receive reward $r_i$.
6. Update $\hat S_{i,t} = \hat S_{i-1,t} + 1 - \mathbb{1}\{t_i = t\}(1 - r_i)/P_{i,t}$.
7. **end for**
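A runnable rendering of Algorithm 2 is given below, with the importance-weighted update from line 6. The stub reward function is ours; in the experiments the reward is the one-step drop in target loss defined in Section 5.1:

```python
import numpy as np

def exp3(n, T, eta, reward_fn, rng):
    S = np.zeros(T)                 # cumulative importance-weighted gains
    picks = []
    for i in range(n):
        P = np.exp(eta * (S - S.max()))   # shift for numerical stability
        P /= P.sum()                      # line 4: softmax over scores
        t = rng.choice(T, p=P)
        r = reward_fn(t)                  # assumed to lie in [0, 1]
        S += 1.0                          # line 6, applied to every task:
        S[t] -= (1.0 - r) / P[t]          # S_t += 1 - 1{t_i=t}(1-r_i)/P_{i,t}
        picks.append(t)
    return picks

rng = np.random.default_rng(8)
means = np.array([0.2, 0.4, 0.8, 0.3, 0.1])   # stub: task 2 is best on average
picks = exp3(1000, 5, eta=0.85,
             reward_fn=lambda t: min(1.0, rng.uniform(0, 2 * means[t])),
             rng=rng)
print("pick counts:", np.bincount(picks, minlength=5))
```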