# optimistic_rates_for_multitask_representation_learning__5108b780.pdf

Optimistic Rates for Multi-Task Representation Learning

Austin Watkins Johns Hopkins University Baltimore, MD 21218 awatki29@jhu.edu

Enayat Ullah Johns Hopkins University Baltimore, MD 21218 enayat@jhu.edu

Thanh Nguyen-Tang Johns Hopkins University Baltimore, MD 21218 nguyent@cs.jhu.edu

Raman Arora Johns Hopkins University Baltimore, MD 21218 arora@cs.jhu.edu

We study the problem of transfer learning via Multi-Task Representation Learning (MTRL), wherein multiple source tasks are used to learn a good common representation, and a predictor is trained on top of it for the target task. Under standard regularity assumptions on the loss function and task diversity, we provide new statistical rates on the excess risk of the target task, which demonstrate the benefit of representation learning. Importantly, our rates are optimistic, i.e., they interpolate between the standard O(m 1/2) rate and the fast O(m 1) rate, depending on the difficulty of the learning task, where m is the number of samples for the target task. Besides the main result, we make several new contributions, including giving optimistic rates for excess risk of source tasks (Multi-Task Learning (MTL)), a local Rademacher complexity theorem for MTRL and MTL, as well as a chain rule for local Rademacher complexity for composite predictor classes.

1 Introduction

Transfer learning has emerged as a powerful tool in modern machine learning. The goal is to find a good predictor on a given target task with little data, by extracting knowledge from source task(s). The source tasks can either be explicitly given (supervised learning), or constructed from domain knowledge (unsupervised/self-supervised learning). An example of the supervised learning approach is a widely successful practice in deep learning, wherein a complex model is first trained on a source task with plentiful data, then applied to a target task by extracting representations of data by freezing the last layer of the complex model, and training a simple (linear) classifier on top of it [Bengio et al., 2013, Donahue et al., 2014]. Similarly, the unsupervised approaches of training a (large) neural network on a general objective has seen tremendous success, especially in natural language processing [Brown et al., 2020, Devlin et al., 2019] and computer vision. As before, a simple linear classifier trained on the output of the penultimate layer works well on a target task.

Representation learning, where the goal is to learn representations of data that are useful across multiple domains and tasks, is a common theme in the examples above. We study a popular formalization of this type of machine learning called Multi-Task Representation Learning (MTRL). As the name suggests, we are given t source tasks each with a small number of data points, say n. We seek to effectively pool the nt data points to extract a common good representation useful for the target task. Therefore, if our tasks are sufficiently diverse , we hope to effectively circumvent our data scarcity by sharing knowledge between tasks. As a by-product, since the shared representation

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

can also be used for the source tasks themselves, this procedure is also effective for the related problem of multi-task learning (MTL).

To study the transfer of knowledge via representations within the MTRL framework, we consider a composite-learning model. This model consists of F and H, a class of hypotheses and representations respectively, where the final end-to-end predictors are given via the composition f h, for all f F, h H. Since we must decide between compositions, we use a two-stage empirical risk minimization procedure. First, the multi-task representation learning stage, the procedure selects one common representation and t predictors, one for each task, by minimizing the average empirical loss on the source tasks. Second, the transfer learning stage, the procedure selects a predictor for the target task that minimizes the empirical loss, w.r.t. m samples, on top of the fixed representation from stage one.

We desire to learn a good predictor i.e., one that has small excess risk from F H for the target task. In this work, we consider the class H to be complex (e.g., deep neural networks), whereas the class F consists of simple functions (e.g., linear predictors). Therefore, the class F H is complex, so directly learning it would require a large number of samples. Yet, if we can effectively pool all our data from each task, the total number of examples could be sufficiently plentiful to successfully learn a complex representation. Further, a good representation can significantly reduce the statistical burden of learning a good predictor for the target task. This phenomenon is formalized by the prior work of Tripuraneni et al. [2020], which is the work most related to ours. Specifically, they show that if the loss function is Lipschitz and the tasks are sufficiently diverse (see Definition 2), the excess (transfer) risk for the target task is bounded as,

Excess transfer risk = O

C(H) + t C(F)

where C(H) and C(F) denote the complexity measures associated with learning H and F respectively (precise definition to follow in a later section).

While the above rate has the desirable properties described above, it is insufficient on many fronts, which our work seeks to address. First, it is akin to the standard rate of n 1/2, albeit optimal in agnostic settings, for single-task PAC learning. However, it is well-known that a fast rate is achievable under realizability in the standard supervised learning. But, the above guarantee does not capture this possibility. Importantly, the work of Tripuraneni et al. [2020] operates under the assumption that the loss function is Lipschitz in such settings, it is generally not possible to get fast rates, despite realizability, even for the special case of learning half-spaces [Srebro et al., 2010]. In the single task setting, the literature contains a rich theory of guarantees that interpolate between the slow and fast rate depending on the level of realizability, such guarantees are called optimistic rates . We derive such optimistic rates for MTRL. Consequently, as an important application of our general theory, we show that the (transfer) risk of the MTRL approach on the target task is bounded as,

L source (C(H) + t C(F))

L target C(F)

m + C(H) + t C(F)

when using non-negative smooth losses (such as squared loss in linear regression or smoothed ramp loss for classification [Cortes et al., 2021, Srebro et al., 2010]). The values L source and L target represent the minimum average risk over the source tasks and target task respectively, hence (1) yields a fast rate under mild realizability, i.e. when they are small.

In the course of proving the above, we extend a number of foundational results in the single-task setting to the multi-task setting. We list them in our contributions below.

1. We give a general local Rademacher complexity result for transfer learning via multitask representation learning see Theorem 1. 2. We also provide a local Rademacher complexity result for multitask learning see Theorem 3. Our result has two benefits compared to prior work [Yousefi et al., 2018]. First, we perform MTL via MTRL that leads to improved bounds which show the benefit of pooling data as opposed to learning each task separately. Second, we operate under the special, yet fairly natural setting, of hypothesis classes with product-space structure this simplifies the proof significantly as well as recovers the result for single-task setting with the exact same constants (unlike rates in Yousefi et al. [2018]).

3. For non-negative smooth losses, we derive optimistic rates (such as Eqn. (1)) for transfer learning and MTL via MTRL. Notably, when we restrict our rates to the single task setting, our proof is significantly simpler and our obtained bound on local Rademacher complexity is smaller compared to the prior work of Srebro et al. [2010]. 4. Finally, we present a chain rule for local Rademacher complexity for composite settings, which allows us to decouple the (local) complexities of representation and predictor classes.

1.1 Our techniques

In this section, we expand on our contributions and provide an overview of techniques and challenges overcome in the context of prior art. Local Rademacher complexity for transfer learning via MTRL. We build on the work of Tripuraneni et al. [2020], which introduced the notion of task diversity. They showed that task diversity leads to improved rates on excess transfer risk via MTRL, compared to training only on the target task. Their work however yields the standard O(m 1/2) rate on excess risk, akin to what is obtained in agnostic settings. We provide a local Rademacher complexity result for transfer learning via MTRL, yielding optimistic rates depending on distributional properties. Our methods are based on tools and techniques from the seminal local Rademacher complexity paper of Bartlett et al. [2005]. A crucial component of our results is relative deviation Bennett concentration inequalities established via log-Sobolev inequalities together with the entropy method [Bousquet, 2002]. We make the crucial observation that the above concentration inequalities, though established for the single-task setting, are general enough to be used to extend the local Rademacher complexity framework to the multi-task and transfer learning setting. Local Rademacher complexity for multi-task learning via MTRL. Intermediate to our above contribution, we provide a local Rademacher complexity result for MTL via MTRL, yielding optimistic rates. We note that the prior work of Yousefi et al. [2018] already provided a local Rademacher complexity result for MTL, though not via MTRL; importantly, our work establishes the benefits of pooling data compared to learning each task separately. However, in contrast to Yousefi et al. [2018]1, our setting is limited to hypothesis classes with a product space structure rather than a general vector-valued hypothesis. This simply means that the tasks are permutation-invariant in our setting, which we argue is a fairly reasonable assumption in the multi-task context. This assumption makes the analysis much simpler as we can reuse concentration inequalities for single task setting whereas Yousefi et al. [2018] had to extend log-Sobolev concentration inequalities to the multi-task setting. Additionally, this yields bounds with better constants than Yousefi et al. [2018] in the special case of a single task, this recovers the known result with the exact same constants. Optimistic rates for non-negative smooth losses. An important application of our theorems is providing optimistic rates when using non-negative smooth losses, which is known to enable distribution-free optimistic rates in a single task setting [Cortes et al., 2021, Srebro et al., 2010]. This result shows the provable benefits of representation learning. Further, it can be combined with the chain rule for Gaussian complexity of Tripuraneni et al. [2020]. This allows us to separate the composite class F H into the complexities of representation class H and predictor class F and also reuse existing complexity bounds in the literature. While we borrow some of the tools for optimistic rates theory in the single task setting from Srebro et al. [2010], the prior proof technique does not directly extend to the multi-task setting. Consequently, we modify the proof, simplifying it significantly as well as getting an improved bound, even in the single task setting. In particular, Srebro et al. [2010] bounds the local Rademacher complexity of the loss class by an ℓ2-covering number via Dudley s entropy integral formula, which is then bound by ℓ -covering number of hypothesis class using smoothness, which in turn is bound by the fat-shattering dimension, which eventually is bound by the (global) Rademacher complexity of the hypothesis class. Some of the steps seemingly do not generalize in the multi-task setting because there are no (well-studied) multi-task analogues, e.g. the fat-shattering dimension. Our proof starts with the above bound on the local Rademacher complexity of the loss class by an ℓ2-covering number via Dudley s entropy integral formula but then applies the well-known Sudakov minoration inequality that yields a bound in terms of Gaussian width; this is in turn bound by Rademacher complexity by standard known relations between the two quantities. Chain rule for local Rademacher complexities. In the application to non-negative smooth losses, the local Rademacher complexity of the loss class is bounded by a non-trivial function of the

1Yousefi et al. [2018] additionally considers Bernstein classes, which we do not. However, we think that our results can be extended to that setting.

(global) Gaussian complexity of the hypothesis class which is further bounded using the chain rule of Tripuraneni et al. [2020]. For general loss functions, there may not be any such non-trivial function. To this end, for Lipschitz losses, we develop a chain rule of local Rademacher complexity, Theorem 5, that similarly allows separating the local complexities of the representation and predictor class.

1.2 Related work

As mentioned before, our work is most related to, and builds on, Tripuraneni et al. [2020]. Other related work includes an early work of Baxter [2000] that gives guarantees on excess risk of transfer learning and multi-task learning via MTRL under a certain generative model for tasks. This was subsequently improved by Maurer et al. [2016], Pontil and Maurer [2013]. These early works give rates of the form O t 1/2 + m 1/2 and do not capture the advantage of having a large number of source samples per task. Prior work has extensively studied the special case of linear representations and/or linear predictors, partly in the context of meta learning, for instance see Du et al. [2021], Tripuraneni et al. [2021], Xu and Tewari [2021]. Further, multitask learning has been explored under various notions of task relatedness [Ben-David and Borbely, 2008, Cavallanti et al., 2010].

Regarding optimistic rates, the seminal work of Vapnik and Cervonenkis [1971] provided the first optimistic rates, via a normalized uniform convergence analysis, but was limited to the PAC learning setting. Much later, a line of work of Bousquet et al. [2002], Koltchinskii and Panchenko [2000] derived optimistic rates in more general settings using the technique of localization. In this area, an important result is the local Rademacher complexity theorem of Bartlett et al. [2005], which many works have leveraged to yield improved rates for various problems [Blanchard et al., 2007, Cortes et al., 2013, Ullah et al., 2018]. Importantly, this theorem applies to learning with non-negative smooth losses [Srebro et al., 2010] where the local Rademacher complexity-based technique can be used to give distribution-free optimistic rates. Further, optimistic rates have seen renewed interest owing to the interpolation/benign overfitting phenomenon in deep learning. In certain problems, predictors may interpolate the data (zero training error), yet have small risk. A line of work [Zhou et al., 2020, 2021] shows that in certain linear regression problems, optimistic rates can be used to derive risk bounds that explain the phenomenon, where the usual tool of uniform convergence provably fails.

There has been some work on optimistic rates in the multi-task setting. The work of Yousefi et al. [2018] established a local Rademacher complexity theorem for multi-task learning (see Section Our techniques for detailed comparison). Besides that, the work of Reeve and Kaban [2020] proved optimistic rates for vector-valued hypothesis classes for self-bounding loss functions (which includes non-negative smooth losses).

Notation. Denote 2 be the Euclidean norm and be the infinity norm. We use the convention that [m] = [1, . . . , m] is the list of contiguous integers starting at 1 and ending at m. For f = (f1, . . . , ft) let f 2 = f 2 1 , . . . , f 2 t and denote

t Pt j=1 Pjfj = 1

t Pt j=1 E (fj (Xj)) , ˆP nf := 1

t Pt j=1 ˆP n j fj = 1 nt Pt j=1 Pn i=1 f Xi j .

2 Problem setup and preliminaries

Let X Rd and Y R denote the input feature space and the output label space respectively. The source tasks and the target task are represented using probability distributions {Pj}t j=1 and P0, respectively. Let F and F0 be a class of predictor functions from Rk to Y for the source tasks and target task respectively, and H a class of representation functions from Rd to Rk.

Following Tripuraneni et al. [2020], we assume that marginal distribution of Pj over X is the same for all the j from 0 to t, and that there exists a common representation h H and task-specific predictors f j F for j [t] and f0 F0 such that Pj can be decomposed as Pj(x, y) = Px(x)Py|x(y|f j h (x)). Note that this decomposition does not assume that f j h is the optimal in-class predictor; however, we will assume it is to make our optimistic rates more meaningful (although this is not strictly necessary). Also the decomposition implicitly assumes that y depends on x only via f j h (x), and thus any additional noise in y is independent of x. Note that like Tripuraneni et al. [2020], we allow the predictor class for the target task F0 to be different than on the source tasks F. For example, when performing logistic regression with a linear representation

B Rk d and linear predictor w Rk, we have Py|x(y = 1|f j h (x)) = σ(w T BT x) , where σ is the sigmoid function.

Given a loss function ℓ: R Y R, the goal is to find a good predictor ˆf ˆh from the composite-class F0 H with small transfer risk,

Rtarget( ˆf0, ˆh) = E(x,y) P0[ℓ( ˆf0 ˆh(x), y)].

Yet, as standard, we do not have access to the joint distribution for all tasks, but instead n independent and identically distributed (i.i.d.) samples from each source task and m i.i.d. samples for the target task. Let (xi j, yi j) be the ith sample for the jth task. To accomplish the above, we use the source data to find a representation, and the target data to find a predictor for the target task. Specifically, in this MTRL procedure, first, we train on all source task data, then freeze the resulting representation. Second, we train over this frozen representation to find a good predictor on the target samples.

We formalize the above as the following two-stage Empirical Risk Minimization (ERM) procedure,

(ˆf, ˆh) arg min f F t,h H

i=1 ℓ(fj h(xi j), yi j), (Multi-task (representation) learning) (2)

ˆf0 arg min f F0

i=1 ℓ(f ˆh(xi 0), yi 0). (Transfer learning) (3)

Besides bounding the transfer risk, we seek to understand if the above procedure also yields improved bounds for the source tasks (the problem of MTL).

Rsource(ˆf, ˆh) = 1

j=1 E(x,y) Pj[ℓ( ˆfj ˆh(x), y)]

We make the following regularity assumptions. Assumption 1.

A: The loss function ℓis nonnegative and b-bounded, i.e., 0 ℓ(y , y) b < for all y , y Y.

B: All functions in F are L-Lipschitz for some 0 < L < w.r.t. 2, i.e., f(z1) f(z2) 2 L z1 z2 2 for all z1, z2 Dom(f) and f F.

C: Any f h F H is D-bounded over X w.r.t. , i.e., supx X |f h(x)| D < for all f F and h H.

Unlike Tripuraneni et al. [2020], we assume for some of our theorems that our function is smooth instead of Lipschitz, as leveraging smoothness is key to achieving fast rates [Srebro et al., 2010].

Definition 1 (H-smoothness). The loss function ℓ : R Y R is H-smooth if ℓ y(y1, y) ℓ y(y2, y) H |y1 y2| for all y1, y2 R, y Y.

Task diversity [Tripuraneni et al., 2020]. We benefit from learning a representation for a new target task, so long as there must be information encoded for that task by the source tasks. A typical way to quantify this encoded information is to control the difference between the learned representation and the underlying representation. To this end, we use the framework introduced by Tripuraneni et al. [2020] that quantifies a measure of task diversity. As mentioned in their work, this framework recovers task diversity assumptions in Du et al. [2021] exactly. In our setting, the definitions given in Tripuraneni et al. [2020] simplify considerably.

Definition 2. The tasks {Pj}t i=1 are (ν, ϵ)-diverse over P0, if for the corresponding f F t, f 0 F0 and representation h H, we have that for all h H

inf f F0 Rtarget(f , h ) Rtarget(f 0 , h ) 1

inf f F t Rsource(f , h ) Rsource(f , h ) + ε.

Parameters ν and ε quantify the similarity between learning the source tasks and the target task. For a detailed analysis of these parameters and the framework introduced by Tripuraneni et al. [2020], see Appendix B.

Local Rademacher Complexity. We now define the measures of complexity that are used in our work. Let u, p, n N, an input space Z, a class of vector-valued functions Q : Z Ru, and a dataset Z = (zi j)j [p],i [n] where zi j Z. Define the data-dependent Rademacher width, RZ( ), as

RZ(Q p)= Eσi,j,k

sup q Q p 1 np

i,j,k=1 σi,j,k qj(zi j)

where σi,j,k are i.i.d. Rademacher random variables. Analogously, define the data-dependent Gaussian width, GZ( ), as

GZ(Q p)= Egi,j,k

sup q Q p 1 np

i,j,k=1 gi,j,k qj(zi j)

where gi,j,k are i.i.d. N(0, 1) random variables. We define the worst-case Rademacher width as Rn(Q p) = sup Z Zpn RZ(Q p) and, analogously, the worst-case Gaussian width as Gn(Q p) = sup Z Zpn GZ(Q p). The above definitions generalize the standard Rademacher and Gaussian width for real-valued functions and can be derived as a special case of the more general set-based definitions [Wainwright, 2019] (see Appendix A).

We now define local Rademacher width as RZ(Q p, r) = RZ ({q Q p : V (q) r}), where V : Qp R. That is, the local Rademacher width is simply the Rademacher width restricted by a functional. In our applications, we consider any V satisfying V (q) b 1

i,j,k=1(qj(zi j))k, where b is the uniform bound on the range of q in ℓ2-norm. Note this recovers the classical local Rademacher width2 for real-valued functions and non-product spaces (p = u = 1). We are mostly interested in the local Rademacher width of the loss applied to the pairwise composition between our vector-valued hypothesis class and our representation class. Specifically, if we define such a class as Lℓ(F t(H)) = {(ℓ f 1 h, . . . , ℓ f t h) | h H, f F t}, then we seek to bound Rn(Lℓ(F t(H)), r). Crucially, our bound will be a sub-root function in r, where we define this type of function below.

Definition 3 (Sub-root Function). A function ψ : [0, ) [0, ) is sub-root if it is nonnegative, nondecreasing, and if r 7 ψ(r)/ r is nonincreasing for r > 0.

Sub-root functions have the desirable properties of always being continuous and having a unique fixed point, i.e., r = ψ(r ) is only satisfied by some r ; see Lemma 8 in Appendix A.

3 Main Results

In this section, we detail our local Rademacher complexity results for transfer learning (TL) and MTL via MTRL, a bound on the local Rademacher complexity of smooth bounded non-negative losses, and the application of this bound to both TL and MTRL.

Our first result is a local Rademacher complexity result for TL via MTRL.

Theorem 1. Let ˆh and ˆf0 be the learned representation and target predictor, as described in Eqns. (2) and (3). Under Assumption 1.A, if f is (ν, ϵ)-diverse over F0 w.r.t. h and let ψ1 and ψ2 be sub-root functions such that ψ1(r) Rn(ℓ F t H, r) and ψ2(r) Rm(ℓ F0, r), then with probability at least 1 4e δ, the transfer learning risk is upper-bounded by

Rtarget( ˆf0, ˆh) Rtarget(f 0 , h ) + c1

Rtarget(f 0 , h )

Rsource(f , h )

where r 1 and r 2 are the fixed points of ψ1(r) and ψ2(r), respectively, and c1 and c2 are absolute constants.3

2Bartlett et al. [2005] uses complexity , but we use width , as common in stochastic processes. 3For exact constants for each term, please refer to Appendix D.

Inequality (4) bounds the transfer risk in terms of task diversity parameters ν and ϵ; local Rademacher complexity parameters, fixed points r 1 and r 1; the minimum risk for the target task; and the minimum average risk for the source tasks.

As discussed in Xu and Tewari [2021], there exist settings where the task diversity parameters ν and ϵ are favorable, i.e., ϵ = 0 and ν = Θ(1), so we will disregard them in the subsequent discussion. The bound is an optimistic rate because it interpolates between p

r 2 and r 1 + r 2, depending on both Rtarget(f 0 , h ) and Rsource(f , h ).

The fixed point r 1 is a function of F t(H) and encodes its complexity measured with respect to the number of samples for the source tasks (among other class-specific parameters). The same reasoning holds for r 2 w.r.t. F0.

We consider H to be complex and both F and F0 to be simple. So, if we instead learned the target task directly, then we would pay the complexity of F0(H) against m samples. In contrast, in our bound, we only pay the complexity of F0 for m samples and the complexity of F t(H) for nt samples. Since in many settings nt is much larger than m, we are successfully leveraging our representation to learn the target task.

The dependence on minimum risk is useful in various settings. An instructive regime is when nt m with Rtarget(f 0 , h ) = 0 and Rsource(f , h ) > 0; in this case, we achieve a bound of p

r 1 + r 2. This demonstrates that data-abundant environments can be leveraged to learn representations that work well in small sample regimes.

The next result uses a decoupling of function classes, the Gaussian chain rule of Tripuraneni et al. [2020], and smoothness to bound r 1 and r 2. We see that under this additional assumption that ℓis H-smooth, r 1 and r 2 decay quickly. Theorem 2. Under the setting of Theorem 1 along with ℓbeing H-smooth and Assumptions 1.B and 1.C, with probability at least 1 6e δ, the fixed points in Inequality (4),

r 1 c3b H G2 m(F0 ˆh) log2(m) + (1 + δ) b

r 2 c4b L2 G2 nt(H) + G2 n(F) H log(nt)4 + D2H log2(nt) + b(1 + δ)

where c3 and c4 are absolute constants. Remark 1. Note that in Theorem 2 it is not necessary to apply the chain rule. If the chain rule is not used, Assumptions 1.B and 1.C is not needed. The fixed point r 2 would be bounded as a function of F t(H). In the following, we use the above decomposed form for interpretability.

Note that for constant L, D, H, b, and δ, the terms on the right are always fast, so the rate of decay depends crucially on the behavior of the Gaussian widths for each class. For many function classes

used in machine learning, it is common that Gnt(H) q

nt and Gn(F) q

n , where C( ) is some notion of complexity of the function class, which could be, for instance, the VC dimension, pseudo-dimension, fat-shattering dimension, etc. Therefore, it is common that

L2 G2 nt(H) + G2 n(F) L2 C(H)

n and G2 n(F0 ˆh) C(F0)

Recall, from our discussion following Theorem 1, that these fixed points control the order of the rate. Therefore, this theorem interpolates between a rate of 1/ m + 1/

nt and 1/m + 1/nt, where the fast rate is achieved in a realizable setting. Linear classes. As an example let us consider a linear projection onto a lower dimensional space along with linear regressors and squared loss. Let X = Rd, Y = R, and F = f | f(z) = α z, α Rk, α c ,

H = h | h(x) = B x, B Rd k, B is a matrix with orthonormal columns .

If we assume that Px is sub-Gaussian, then standard arguments (see Tripuraneni et al. [2020]) show that G2 m(F0 ˆh) O( k

m), G2 n(F) O( k

n), and G2 nt(H) O( k2d

nt ) . To simplify the bounds let L = 1, H = 1, D = 1, b = 1, and δ = 0.05. Thus, under the assumptions of Theorem 2, the fixed points are bounded by

, and r 2 k2d

which gives

Rtarget( ˆf0, ˆh) Rtarget(f 0 , h )

Rtarget(f 0 , h )

Rsource(f , h )

The following result is such a bound on the local Rademacher complexity for a non-negative Hsmooth loss and any function class H. Proposition 1 (Smooth non-negative local Rademacher complexity bound). Under the setting of Theorem 1 along with ℓbeing H-smooth and Assumptions 1.B and 1.C, there exists an absolute constants c3 and c4 such that

Rnt(Lℓ(F t(H)), r) c3 r G F t(H)

H log(n)2 + D

H log(n) n +

Rnt(Lℓ(F t(H)), r) c4 r Π(F t(H)) log e

b Π(F t(H))

where G(F t(H)) = L Gnt(H) + Gn(F) and Π(F t(H)) =

HG (F t(H)) log e D G(F t(H)) .

When we specialize the above bound to the standard single-task non-compositional setting and compare with Srebro et al. [2010], then

Rn(Lℓ(F, r)) c Gn(F)

log (n) log

log (n) log eb H log (n) n B2

The second inequality uses the relation Gn(F) 2 p

log(n) Rn(F) [Wainwright, 2019, p. 155] and that x 7 x log e C/x is increasing in x until x = C. The third inequality uses Khintchine s inequality, Rn(F)

2B n , where B is the ℓ2-bound on the range of F. Thus, the upper bound above is always better than the bound in Srebro et al. [2010]4 for large enough n. Furthermore, under the additional though well-studied assumption that ℓ(0) HB2 [Arora et al., 2022, Shamir, 2015], ours is better for n = Ω(1). Finally, the Gaussian and Rademacher widths are of the same order for the class of linear predictors, bounded in norm, by a strongly convex regularizer, in general non-Euclidean settings. Hence, in this setting, our bound is smaller by a factor of log n in Kakade et al. [2008] see Appendix E for a detailed comparison.

3.1 Multi-Task Learning Next, we detail some theorems foundational to the results above. For those results, we were able to link MTL with MTRL by using the task diversity assumption. Therefore, we naturally have the following local Rademacher result for MTL.

Theorem 3. Let (ˆf, ˆh) be an empirical risk minimizer of ˆRsource( , ) as given in Eqn. (2). Under Assumption 1.A, let ψ be a sub-root function such that ψ(r) Rn(F t(H), r) with r the fixed point of ψ(r), then with probability 1 2e δ,

Rsource(ˆf, ˆh) Rsource(f , h ) + c q

Rsource(f , h ) r

where c is an absolute constant.

4We note that Srebro et al. [2010] incur an additional p

rb/n term when converting from complexity to width

As above, we can use our bound on the local Rademacher complexity of a H-smooth loss class to get the following result. This result is similar to Srebro et al. [2010], Theorem 1, yet ours is in an MTL via MTRL setting. Theorem 4. Under the setting of Theorem 3 along with ℓbeing H-smooth and Assumptions 1.B and 1.C, with probability at least 1 3e δ, r of ψ(r) is bounded by

cb L2 G2 nt(H) + G2 n(F) H log(nt)4 + D2H log(nt)2

nt + b(1 + δ)

where c is an absolute constant.

3.2 Local Rademacher complexity chain rule Using the chain rule of Tripuraneni et al. [2021], as we did above, to decouple the complexities of the representation and prediction classes is desirable. Yet, our general local Rademacher complexity Theorems 1 and 3, as stated, do not seem to have this property. Consequently, we develop a local chain rule, which aims to separate the local complexities of learning the representation and predictor. The main result is the following. Theorem 5. Suppose the loss function ℓis Lℓ-Lipschitz. Define the restricted representation and predictor classes as follows,

ℓ FX(r) := ℓ f ℓ F t : h H : V (ℓ f h) r

HX(r) := h H : f F t : V (ℓ f h) r ,

where V is the functional in the local Rademacher complexity description. Under Assumptions 1.B and 1.C and that the worst-case Gaussian width of the above is bounded by the sub-root functions ψF and ψH, respectively, there exists an absolute constant c such that

Gn(Lℓ(F t(H), r)) c (LLℓψF(r) + ψH(r)) log (nt) + D

4 Proof Techniques

In this section, we give details on some of the tools we use by giving proof sketches for the results in Section 3. In the next section, we cover some theorems at the root of all our results.

Proof sketch for Theorem 1. Let f0 = arg minf F Rtarget(f, ˆh). By a risk decomposition, we can bound the excess transfer risk by Rtarget( ˆf0, ˆh) Rtarget( f0, ˆh) + supf0 F0 inff F{Rtarget(f , ˆh) Rtarget(f0, h )}. We can now use Theorem 3, with t = 1, to bound Rtarget( ˆf0, ˆh). Note that Rtarget(f 0 , ˆh) Rtarget( f0, ˆh) 0. Now inff F Rtarget(f , ˆh) Rtarget(f 0 , h ) is less than 1 ν (inff F t Rsource(f , ˆh) Rsource(f , h )+ε, by the task-diversity assumption. As shown in Tripuraneni et al. [2020], we can bound inff F t{Rsource(f , ˆh) Rsource(f , h )} with Rsource(ˆf, ˆh). We can then apply Theorem 3 to the remaining Rsource(ˆf, ˆh) term. By leveraging task diversity again we can bound the remaining Rtarget(f 0 , ˆh) as a function of Rtarget(f 0 , h ).

Proof sketch for Proposition 1. We achieve the following bound on the local Rademacher complexity of the loss class by using a truncated version of Dudley s integral [Srebro et al., 2010, see. Lemma A.3].

Rnt(Lℓ(F t(H)), r) inf 0 α

n 4α + 10(nt) 1/2 Z

log N2(Lℓ(F t(H)), r), ε, nt) o

We observe that we can use smoothness to move the dependence on the loss to an appropriate scaling, which gives that N2(Lℓ(F t(H)), r), ε, nt) N2(F t(H), ε/

12Hrnt, nt) (Lemma 18 in the appendix). Next, we apply Sudakov minoration to bound the log covering number by the Gaussian

width of our function class: q

log N2(F t(H), ε/

12Hrnt, nt)

ϵ Gn(F t(H)). With these simplifications, the parameter α from Dudley s integral can either be solved for exactly (as in Inequality 6), so long as α

br, or set in such a way that it always holds (like in Inequality 5). Finally, we can optionally apply the Gaussian chain rule from Tripuraneni et al. [2020] to bound Gn(F t(H)) with L Gnt(H) + Gn(F).

Proof sketch for Theorem 2. The right hand side of Inequality 5 is a sub-root function in r, because it is an affine transformation of r. Therefore, we just solve for the fixed point of this sub-root function, i.e., solve for r w.r.t. the equation below.

H log(n)2 + D

H log(n) n +

All that remains of the proof is basic algebraic simplifications. The proof for Inequality 6 is similar.

5 Discussion

The proofs of the above results hinge on the following local Rademacher complexity result. Theorem 6. Let F be a class of non-negative b bounded functions. Assume that X1 j , . . . , Xn j are n draws from Pj for j [t] and all nt samples are independent. Suppose that Var f V (f) BPf, where V is a functional from F t to R and B > 0.

Let ψ be a sub-root function with fixed point r such that BR(F t, r) ψ(r) for all r r , then for K > 1 and δ > 1, with probability at least 1 e δ,

f F t Pf K K 1 ˆP nf + 200(1 + α)2 Kr

Yousefi et al. [2018], Theorem 9, showed a result of this form in a more general setting of Bernstein classes and a function class that is not necessarily a product space. In contrast, for simplicity, we do not consider Bernstein classes, though we believe such a generalization is possible, and our function classes are product spaces. By doing so, we achieve better constants than Yousefi et al. [2018] by using the next theorem. Theorem 7 (Concentration with t functions). Let F be a class of functions from X to R and assume that all functions f in F are Pj-measurable for all j [t], square-integrable, and satisfy Ef = 0. Let supf F ess sup f 1 and Z = supf F t Pt j=1 Pn i=1 fj(Xi j). Let σ > 0 such that σ2 supf F t 1

t Pt j=1 Var fj(X1 j ) almost surely. Then, for all δ 0, we have that

2δ(tnσ2 + 2EZ) + δ

This result also holds for supf F t Pt j=1 Pn i=1 fj(Xi j) under the condition that sup f F f 1.

Yousefi et al. [2018] also proves something similar to the above from scratch using Logarithmic Sobolev inequalities. We make the observation that within our setting a proof similar to the original proof given by Bousquet [2002] still holds. To clarify this further, Yousefi et al. [2018] state:

Note that the difference between the constants in (2) [Bousquet, 2002, Theorem 2.3] and (3) [Yousefi et al., 2018, Theorem 1] is due to the fact that we were unable to directly apply Bousquet s version of Talagrand s inequality (like it was done in Bartlett et al. [2005] for scalar-valued functions) to the class of vector-valued functions.

We show that within the product space structure, which we consider to be the most natural for MTL, it is possible to achieve the same constants. Concretely, when t = 1, we realize the single function version of Bousquet [2002] exactly.

The proof of Theorem 6 is similar to the one given in Bartlett et al. [2005], Sec. 3, and generalized to MTL by Yousefi et al. [2018], Appendix B. Indeed, we follow the same steps within these proofs but use our inequality above.

6 Conclusion

We provide the first optimistic rates for transfer learning and multi-task learning via multi-task representation learning. This is achieved by establishing a local Rademacher complexity result for multi-task representation learning. We provide distribution-free optimistic rates for smooth nonnegative losses by bounding the local Rademacher complexity for multi-task loss classes. Besides our contributions to multi-task representation learning and multi-task learning, our work provides several foundational results and improved rates in the standard single-task setting. Directions for future work include: exploring adversarial robustness, active learning, and fine-tuning of the representation within multi-task representation learning.

Acknowledgements This research was supported, in part, by DARPA GARD award HR00112020004, NSF CAREER award IIS-1943251, funding from the Institute for Assured Autonomy (IAA) at JHU, and the Spring 22 workshop on Learning and Games at the Simons Institute for the Theory of Computing.

Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. Differentially private generalized linear models revisited. In Advances in Neural Information Processing Systems, volume 35, pages 22505 22517, 2022.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. Annals of Statistics, 33:1497 1537, 2005.

Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12: 149 198, 2000.

Witold Bednorz and Rafał Latała. On the boundedness of bernoulli processes. Annals of Mathematics, 180(3):1167 1203, 2014.

Shai Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multipletask learning guarantees. Machine learning, 73:273 287, 2008.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013.

Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66:259 294, 2007.

Olivier Bousquet. A bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6):495 500, 2002.

Olivier Bousquet. Concentration inequalities for sub-additive functions using the entropy method. In Stochastic inequalities and applications, pages 213 247. Springer, 2003.

Olivier Bousquet, Vladimir Koltchinskii, and Dmitriy Panchenko. Some local measures of complexity of convex hulls and generalization bounds. In Computational Learning Theory: 15th Annual Conference on Computational Learning Theory, COLT 2002, pages 59 73. Springer, 2002.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877 1901, 2020.

Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. The Journal of Machine Learning Research, 11:2901 2934, 2010.

Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local rademacher complexity. Advances in neural information processing systems, 26, 2013.

Corinna Cortes, Mehryar Mohri, and Ananda Theertha Suresh. Relative deviation margin bounds. In International Conference on Machine Learning, pages 2122 2131. PMLR, 2021.

Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, pages 13089 13099, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pages 4171 4186, 2019.

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647 655. PMLR, 2014.

Simon Shaolei Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably. In 9th International Conference on Learning Representations, ICLR 2021, 2021.

Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. Advances in neural information processing systems, 21, 2008.

Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, pages 5915 5926, 2019.

Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High dimensional probability II, pages 443 457. Springer, 2000.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.

Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1 32, 2016.

Massimiliano Pontil and Andreas Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pages 55 76. PMLR, 2013.

Henry Reeve and Ata Kaban. Optimistic bounds for multi-output learning. In International Conference on Machine Learning, pages 8030 8040. PMLR, 2020.

Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. J. Mach. Learn. Res., 16:3475 3486, 2015.

Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Optimistic rates for learning with a smooth loss. ar Xiv: Learning, 2010.

Nilesh Tripuraneni, Michael I. Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, 2020.

Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434 10443. PMLR, 2021.

Enayat Ullah, Poorya Mianjy, Teodor Vanislavov Marinov, and Raman Arora. Streaming kernel pca with o( n) random features. Advances in Neural Information Processing Systems, 31, 2018.

VN Vapnik and A Ya Cervonenkis. On uniform convergence of the frequencies of events to their probabilities. Teor. Veroyatnost. i Primenen, 16(2):264 279, 1971.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

Ziping Xu and Ambuj Tewari. Representation learning beyond linear prediction functions. Advances in Neural Information Processing Systems, 34:4792 4804, 2021.

Niloofar Yousefi, Yunwen Lei, Marius Kloft, Mansooreh Mollaghasemi, and Georgios C. Anagnostopoulos. Local rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38):1 47, 2018.

Lijia Zhou, Danica J Sutherland, and Nati Srebro. On uniform convergence and low-norm interpolation learning. Advances in Neural Information Processing Systems, 33:6867 6877, 2020.

Lijia Zhou, Frederic Koehler, Danica J. Sutherland, and Nathan Srebro. Optimistic rates: A unifying theory for interpolation learning and regularization in linear regression. Co RR, abs/2112.04470, 2021.

1 Introduction 1

1.1 Our techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Problem setup and preliminaries 4

3 Main Results 6

3.1 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.2 Local Rademacher complexity chain rule . . . . . . . . . . . . . . . . . . . . . . 9

4 Proof Techniques 9

5 Discussion 10

6 Conclusion 10

Appendices 14

A Additional preliminaries on Gaussian processes and complexities 16

A.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.2 Sub-root functions and local Rademacher Complexity . . . . . . . . . . . . . . . . 19

B Task diversity digression 19

C Technical Miscellanea 21

C.1 Martingale Concentration Inequality . . . . . . . . . . . . . . . . . . . . . . . . . 21

C.2 Empirical Process Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

C.3 Analytic Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

D Proofs of main results 25

D.1 Concentration Inequalities and Risk Bounds . . . . . . . . . . . . . . . . . . . . . 26

D.2 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

D.3 Multi-Task Learning via Representation Learning . . . . . . . . . . . . . . . . . . 34

D.4 Empirical Local Rademacher Complexity Bound . . . . . . . . . . . . . . . . . . 36

D.5 Theorems for transition from empirical to population . . . . . . . . . . . . . . . . 39

D.6 Smooth Learning Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

D.7 Smooth MTL and MTL via MTRL . . . . . . . . . . . . . . . . . . . . . . . . . . 42

D.8 Local Rademacher complexity chain rule . . . . . . . . . . . . . . . . . . . . . . 42

E Detailed comparison with prior works 43

E.1 Comparison with Srebro et al. [2010] . . . . . . . . . . . . . . . . . . . . . . . . 43

E.2 Comparison with Denevi et al. [2019], Khodak et al. [2019] . . . . . . . . . . . . . 45

A Additional preliminaries on Gaussian processes and complexities

A.1 Preliminaries

We recall basic preliminaries on (sub) Gaussian processes [Vershynin, 2018, Wainwright, 2019]. Given a set Θ Rk, a random process is a collection of random variables {Zθ}θ Θ defined on the same probability space, indexed by elements of set Θ. The random process induces the following canonical psuedometric on the (abstract) set Θ, defined as d(θ, θ ) = Zθ Zθ L2 = q

E (Zθ Zθ )2, (pseudo)metrizing the set T. This allows us to define the covering and packing number of the set Θ with respect to the metric d, denoted as N(Θ, d, ϵ) and M(Θ, d, ϵ) respectively.

Definition 4 (Covering number). An ϵ-cover of a set Θ with respect to a metric d, is a set C Θ such that for any θ Θ, there exists a c C, such that d(θ, c) ϵ. The ϵ-covering number, N(Θ, d, ϵ) is the cardinality of the smallest ϵ-cover.

Definition 5 (Packing number). An ϵ-packing of a set Θ with respect to a metric d, is a set P Θ such that for all p, q P, p = q, d(p, q) > ϵ. The ϵ-packing number, M(Θ, d, ϵ) is the cardinality of the largest ϵ-packing.

These are related as follows.

Lemma 1 (Wainwright [2019]). For any totally-bounded set Θ and ϵ > 0,

M(Θ, d, 2ϵ) N(Θ, d, ϵ) M(Θ, d, ϵ)

A well-studied problem is to control the process uniformly, E[supθ Θ Zθ] in terms of the geometric properties of the set Θ.

A random process is a Gaussian process, if for all finite sets Θ0 Θ, the distribution of the random vector {Zθ}θ Θ0 is Gaussian. A well-known result regarding controlling E[supθ Θ Zθ] is the Sudakov minoration inequality.

Lemma 2 (Sudakov minoration [Wainwright, 2019]). Let {Zθ, θ Θ} be a zero-mean Gaussian process. Then

E sup θ Θ Zθ

log M(Θ, d, α),

An important example is the so-called canonical Gaussian process, Zθ = θ, g , where g N(0, Ik) is the standard Normal vector. In this case, the canonical metric simply becomes the ℓ2-norm, d(θ, θ ) = θ θ 2. The uniform control of the random process becomes a standard object, known as Gaussian width of the set Θ.

G(Θ) = E sup θ Θ Zθ

sup z Θ g, θ .

The celebrated Fernique-Talagrand theory of majorizing measures establishes a characterization of the Gaussian width of a set in terms of its metric geometry see Vershynin [2018] for more details.

A related concept is that of Gaussian complexity, where the goal is to control the process absolutely and uniformly, defined as,

G(Θ) = E sup z Θ |Zθ| = Eg

sup z T | g, θ | .

These two are related as follows.

Lemma 3 (Vershynin [2018]). The following are true about the relationship between Gaussian width and complexity.

1. G(Θ) G(Θ).

2. For any θ Θ, we have 1

3 G(Θ) + θ G(Θ) 2 G(Θ) + θ .

An important category, generalizing Gaussian process, is that of sub-Gaussian processes, in which for all θ, θ Θ, the random variable Zθ Zθ is d(θ, θ )-sub-Gaussian,

E exp λ(Zθ Zθ )2 exp λ2d(θ, θ )2

The supremum of sub-Gaussian process is upper bound by Dudley s entropy integral formula, as follows. Lemma 4 (Refined Dudley s entropy integral formula Wainwright [2019]). Let {Zθ}θ Θ be a zero-mean sub-Gaussian process indexed on set Θ with diameter B, then

E[sup θ Θ Zθ] E[ sup θ,θ Θ Zθ Zθ ] inf 0 α B

4α + 10 Z B

log N (Θ, d, α))dϵ

An important example of sub-Gaussian process is the so-called Bernoulli/Rademacher process, where Zθ = σ, θ , where σ { 1, 1}k, whose co-ordinates are i.i.d. Rademacher random variables. This leads to the notions of Rademacher width R(Θ) and Rademacher complexity, R(Θ), defined as follows.

sup θ Θ σ, θ R(Θ) = Eσ

sup θ Θ | σ, θ | .

The Rademacher width and complexity are related in the same way as Gaussian width and complexity (akin to Lemma 3). Lemma 5 (Vershynin [2018]). The following is true about the relationship between Rademacher width and complexity.

1. R(Θ) R(Θ)

2. For any θ Θ, we have 1

3 R(Θ) + θ R(Θ) 2 R(Θ) + θ .

Further, the Gaussian and Rademacher widths are related as follows. Lemma 6 (Wainwright [2019]). For a set Θ Rk, r

2 π R(Θ) G(Θ) 2 R(Θ) p

Akin to the Fernique-Talagrand majorizing measures theory for Gaussian processes, a complete characterization of the expected suprema for Bernoulli/Rademacher processes in terms of metric and co-ordinate geometry is obtained via (the now proved) Talagrand s Bernoulli conjecture Bednorz and Latała [2014], Ledoux and Talagrand [1991].

Complexity of real-valued function classes. Relatedly, another well-studied concept is Gaussian/Rademacher width/complexity of a set of real-valued functions on a fixed set of inputs. We first define it and then argue how this is simply a special case of the above "set"-view of these. Consider a class of functions Q which map an abstract set Z to R. Given input Z consisting of n points in the domain of q Q, the Gaussian/Rademacher complexity and width of Q, on Z are defined as follows.

i=1 σiq(zi)

, RZ(Q) = E

i=1 σiq(zi)

i=1 giq(zi)

, GZ(Q) = E

i=1 giq(zi)

where σi and gi are i.i.d. Rademacher and standard Normal random variables, respectively. The metric induced on the set Q is the the L2 distance with respect to the empirical measure on input Z,

d Z(q, q ) = q q L2(Z) =

i=1 (q(zi) q(zi))2.

The corresponding covering and packing numbers of Q are denoted as N(Q, ϵ, Z) and M(Q, ϵ, Z).

To reduce the above from the set-definition, consider the set

n(q(z1), q(z2), . . . , q(zn)) : q Q .

In that case, RZ(Q) = R(Q(Z)). Further, d Z(q, q ) = n d(θq, θq ) where θq = 1 n (q(z1), q(z2), . . . , q(zn)) and θq = 1

n (q (z1), q (z2), . . . , q (zn)) and thus lie in Q(Z).

Complexity of function classes in the multi-task learning setting. We now generalize the above to functions encountered in multi-task representation learning. For input space Z, p, q, n N, consider a class of vector-valued functions Q : Z Rq, and a dataset Z = (zi j)j [p],i [n], where zi j Z. For Q p, the p-fold Cartesian product of Q, we define its data-dependent Rademacher width, RZ( ), data-dependent Gaussian width, GZ, ( ), data-dependent Rademacher complexity, RZ( ), data-dependent Gaussian complexity, GZ, with respect to input Z as,

RZ(Q p)= Eσi,j,k

sup q Q p 1 np

i,jk=1 σijk qj(zi j)

, (Rademacher width)

GZ(Q p)= Egi,j,k

sup q Q p 1 np

i,j,k=1 gi,j,k qj(zi j)

, (Gaussian width)

RZ(Q p)= Eσi,j,k

i,jk=1 σijk qj(zi j)

, (Rademacher complexity)

GZ(Q p)= Egi,j,k

i,j,k=1 gi,j,k qj(zi j)

, (Gaussian complexity)

where σi,j,k and gi,j,k are i.i.d. Rademacher and standard Normal random variables, respectively.

To reduce the above from the set-based definition, consider the set

np(((qj(zi j))k)j)i : q Q p .

In the above, the notation 1 np(((qj(zi j))k)j)i denotes a vector in Rnpq obtained via enumerating i, j and k.

The corresponding metric analogously is,

d Z(q, q ) = q q L2(Z) = s

(qj(zi j))k (qj(zi j))k 2 = npd Z(θq, θq ),

where θq = 1 np(((qj(zi j))k)j)i and similarly θq , which lie in Q p(Z).

The above can be used to define covering and packing numbers of Q p, denoted as N(Q p, ϵ, Z) and M(Q p, ϵ, Z) respectively.

The concepts of Dudley s entropy integral formula and Sudakov minoration thus generalize accordingly. A result of the following form, for real-valued functions, appears in Srebro et al. [2010]. Lemma 7 (Refined Dudley s entropy integral). Consider a class of vector-valued functions Q and input set Z = zi j

j [p],i [n]. Define B := supq Q p q L2(Z) = np supq Q p 1

np(((qj(zi j))k)j)i . Then,

RZ(F) inf 0 α B/ np

4α + 10 Z B/ np

log N (Q p, ϵ, Z)dϵ

= inf 0 α B

4α + 10 Z B

log N (Q p, ϵ, Z)

The equality above follows from the change of variables.

Worst-case complexities of functions. Often, we take the sup over datasets of a specified size, in the above definitions of Rademacher/ Gaussian width/complexity, yielding their worst-case counterparts, denoted as,

Gnp(Q p) = sup Z Znp GZ(Q p), Gnp(Q p) = sup Z Znp GZ(Q p),

Rnp(Q p) = sup Z Znp RZ(Q p), Rnp(Q p) = sup Z Znp RZ(Q p).

The corresponding metric similarly is

dnp(q, q ) = sup Z Znp d Z(q, q ),

which can then be used to define covering and packing numbers of Q p, denoted as N(Q , ϵ, np) and M(Q , ϵ, np), respectively.

As before, Dudley s entropy integral formula and Sudakov minoration generalize analogously.

A.2 Sub-root functions and local Rademacher Complexity

We present a key property about sub-root functions from [Bartlett et al., 2005] below. Lemma 8 (Lemma 3.2. [Bartlett et al., 2005]). If ψ : [0, ) [0, ) is a nontrivial sub-root function, then it is continuous on [0, ) and the equation ψ(r) = r has a unique positive solution. Moreover, if we denote the solution by r , then for all r > 0, r ψ(r) if and only if r r.

B Task diversity digression

In this section we make a series of observations regarding the task-diversity definition introduced by Tripuraneni et al. [2020]. Within this work, they introduce two definitions that are used to relate average source tasks performance to target task performance: task-averaged representation difference and worst-case representation difference. Definition 6 (The task-averaged representation difference w.r.t h H).

d F,f (h ; h) = 1

j=1 inf f F Exj,yj Pj {ℓ(f h (xj) , yj) ℓ(fj h (xj) , yj)}

Importantly, like the work which introduces these concepts, it is possible to statistically control d F,f ˆh; h , where ˆh results from the first step of the two tage ERM procedure, see the variational definition Eqn. (2).

The other quantity they introduce is the worst-case representation difference. Definition 7 (The worst-case representation difference w.r.t h, h H).

d F,F0(h ; h) = sup f0 F inf f F Ex,y P0

ℓ(f h (x), y) ℓ(f0 h(x), y) ,

Finally, they introduce the following assumption on the two quantities defined above. It has a multiplicative parameter ν and additive parameter ε. Definition 8 ((ν, ε)-task diversity w.r.t h H [Tripuraneni et al., 2020]). For a function class F, we say f F t is (ν, ϵ)-diverse over F0 for a representation h, if uniformly for all h H,

d F,F0 (h ; h) d F,f (h ; h) /ν + ϵ

We will make a series of observations about these quantities which provide substantial simplification in our setting.

First, observe in the setting of product space, d F,f(h ; h) = inf f F t Rsource(f , h ) Rsource(f, h)

d F,F0(h ; h) = sup f0 F0 inf f F {Rtarget(f , h ) Rtarget(f0, h)}

Note, here we are again leveraging product space structure to simplify prior work. Recall in Section 5 we observed that this structure allows us to recover existing single function theorems exactly, unlike Yousefi et al. [2018]. Specifically in the proof of Theorem 9 we commute suprema and summations.

Here we are commuting infima and summations to simplify the framework specified in Tripuraneni et al. [2020] within our setting. Note they work with product spaces but do not make the simplifying observation above. Nonetheless, the definitions they provide are suitable for the more general function classes considered in Yousefi et al. [2018], which are not necessarily product spaces.

Now consider the setting detailed in the preliminaries applied to our approximation ˆh, h H and f F t with the following ratio

d F,F0(ˆh; h )

d F,f (ˆh; h ) = supf0 F0 inff F n Rtarget(f , ˆh) Rtarget(f0, h ) o

inff F t n Rsource(f , ˆh) Rsource(f , h ) o

= inff F Rtarget(f , ˆh) inff0 F0 Rtarget(f0, h )

inff F t n Rsource(f , ˆh) Rsource(f , h ) o (9)

= inff F Rtarget(f , ˆh) Rtarget(f 0 , h )

inff F t n Rsource(f , ˆh) Rsource(f , h ) o.

Note every sup and inf will realize a function that minimizes their respective risk.

This ratio is a measure of how similar the performance we can expect when we use our representation learned from the source tasks to and apply it the target task. That is if we seek to control this quantity we do not care per se if the risks are small nor do we care that we are getting good performance in comparison to the ground truth, but how comparable is our performance between source and target.

We seek to control this ratio with some ν

inff F Rtarget(f , ˆh) Rtarget(f 0 , h )

inff F t n Rsource(f , ˆh) Rsource(f , h ) o 1

If the risk on our source tasks and target task is similar w.r.t. to the ground truth for all tasks then ν 1. That is low (high) average risk on the source tasks leads to low (high) risk on the target task.

This is even more apparent in the realizable setting.

inff F Rtarget(f , ˆh)

inff F t Rsource(f , ˆh) 1

First, imagine a setting in which there are several tasks and on all but a few source tasks, we perform well.

inff F Rtarget(f , ˆh)

inff F t Rsource(f , ˆh) relatively small

relatively big .

ν will be small, which is okay if we do not care about performance across all t tasks, as long as the target task performance is good.

Alternatively, imagine that the task does relatively poorly on the target task but is still small. This could lead to a very large 1

ν which is not desirable.

inff F Rtarget(f , ˆh)

inff F t Rsource(f , ˆh) relatively big relatively small.

Therefore we allow for an additive ε to capture this quantity when we get risk on the source task is very small, but we are okay with relatively poor performance on the target task.

That is we care about the following inequality

inf f F Rtarget(f , ˆh) Rtarget(f 0 , h ) 1

ν inf f F t

n Rsource(f , ˆh) Rsource(f , h ) o + ε, (10)

where ε encodes how much tolerance we have between the performance on the source tasks and the target task. Hence, it is Inequality (10) which we care to control over all functions in H. Therefore, we will assume uniformly for all h H that:

inf f F Rtarget(f , ˆh) Rtarget(f 0 , h ) 1

ν inf f F t

n Rsource(f , ˆh) Rsource(f , h ) o + ε.

Although we observe this simplification, to be succinct we will still use that

d F,F0(ˆh; h ) = inf f F Rtarget(f , ˆh) Rtarget(f 0 , h ) (11)

and d F,f (ˆh; h ) = inf f F t

n Rsource(f , ˆh) Rsource(f , h ) o (12)

and use this notation in the proofs of Appendix D.3.

C Technical Miscellanea

In this section, we present technical miscellanea that we shall later refer to in obtaining our proofs of the main results in Appendix D.

C.1 Martingale Concentration Inequality

For the sake of notational succinctness, we forgo the compositional model by overloading the notation to just consider a generic function class F : X R. Lemma 9 (Theorem 2.1 in Bousquet [2002]). Let W1, . . . , Wn be independent random variables in a Polish space W.

Let A = σ(W1, . . . , Wn) k [n] Ak = σ(W1, , . . . , Wk 1, Wk+1, . . . , Wn) be σ-algebras generated by these random variables.

We denote by Ek[ ] the expectation taken conditionally on Ak. Let φ(x) = (1 + x) log(1 + x) x and ψ(x) = e x 1 + x.

Let (Z, Z 1, . . . , Z n) be a sequence of A-measurable random variables and let (Zk)k [n] be a sequence of random variables that are Ak measurable, respectively. Assume that there exists u > 0 such that for all k [n] the following inequalities are satisfied Z k Z Zk 1 a.s., Ek [Z k] 0 and Z k u a.s.

Let σ R be a real number satisfying σ2 1 n Pn k=1 Ek h (Z k)2i almost surely and let v =

(1 + u)E[Z] + nσ2. If the following condition holds

k=1 Z Zk Z a.s.,

we obtain, for all λ 0, log E h eλ(Z E[Z])i ψ( λ)v,

which gives the following bounds for all δ > 0,

P{Z E [Z] + δ} exp vφ δ

and P{Z E [Z] +

The following corollary is immediate as independence is sufficient Bousquet [2003] and we can express the summation over nt random variables as double summation where we first sum over q [n] and then over j [t].

Corollary 1 (Lemma 9 in block structure). Assume that W 1 j , . . . , W n j are n independent draws from Pj for j [t] with random variables in some polish space, where all nt samples are independent.

Let A be the sigma algebra generated by (W i j)j [t],i [n] and Aq k be the sigma algebras generated by (W i j)j [t],i [n] (j,i) =(k,q) . We denote by Eq k[ ] the expectation taken conditionally on Aq k.

Let Z be a A-measurable r.v., ( Zq k)k [t],q [n] be a sequence of A-measurable random variables, and (Zq k)k [t],q [n] be a sequence of random variables that are Aq k measurable, respectively. Assume that there exists u > 0 such that for all k [t] and q [n] the following inequalities are satisfied

Zq k Z Zq k 1 a.s., (13)

Eq k h Zq k i 0, (14)

and Zq k u a.s. (15)

Let σ R be a real number satisfying

Zq k 2 (16)

almost surely and let v = (1 + u)E[Z] + nσ2. If the following condition holds

q=1 Z Zq k Z a.s., (17)

we obtain, for all λ 0,

log E h eλ(Z E[Z])i ψ( λ)v,

which gives the following bounds for all δ > 0,

P{Z E [Z] + δ} exp vφ δ

and P{Z E [Z] +

C.2 Empirical Process Lemmas

The following result is standard symmetrization.

Lemma 10. Let F be a class of functions from X to R and let X denote the set of inputs Xi j n,t i,j=1. Assume that all functions f in F are Pj-measurable for all j [t]. Then,

sup f F t 1 nt

i=1 fj(Xi j) E 1

i=1 fj(Xi j)

where {σi j} is a sequence of nt independent Rademacher variables.

The result follows by standard symmetrization proof extended to the multi-function setting.

Proof. Let X1 j , . . . , Xn j be independent copies of X1 j , . . . , Xn j on probability space Pj for all j [t]. Let E X be the expectation with respect to X1 j , . . . , Xn j for all j [t] and EX be the expectation with respect to X1 j , . . . , Xn j for all j [t]. Let σi j be i.i.d. Rademacher random variables for all j [t]

and i [n]. Fixing X1 j , . . . , Xn j for all j [t], and using the copied variables we have that

E sup f F t

i=1 fj(Xi j) E 1

i=1 fj(Xi j)

= EX sup f F t

i=1 fj(Xi j) E X 1 nt

i=1 fj( Xi j)

By Jensen s inequality, we have

E sup f F t

i=1 fj(Xi j) E 1

i=1 fj(Xi j)

EX, X sup f F t

i=1 fj(Xi j) 1

i=1 fj( Xi j)

nt EX, X sup f F t

fj( Xi j) fj(Xi j)

If Xi j and Xi j have the same distribution, we could exchange fj( Xi j) fj(Xi j) for fj(Xi j) fj( Xi j) and the equation would be equivalent. We can do this with Rademacher random variables, by randomly flipping the sign with probability 1/2. This yields the following

E sup f F t

i=1 fj(Xi j) E 1

i=1 fj(Xi j)

nt Eσ,X, X sup f F t

σi jfj( Xi j) σi jfj(Xi j)

nt Eσ, X sup f F t

σi jfj( Xi j)

nt Eσ,X sup f F t

σi jfj(Xi j)

= 2E RX(F t).

where the last inequality follows from triangle inequality.

The next result, a Gaussian chain rule given in Tripuraneni et al. [2020], is crucial to decoupling the complexity of the representation function class and the hypothesis class.

Theorem 8 (Theorem 7 from Tripuraneni et al. [2020]). Let the function class F consist of functions that satisfy Assumption 1. Then the (empirical) Gaussian width of the function class F t(H) satisfies,

GZ F t(H) inf D α>0

4α + 64G F t(H) log D

where G(F t(H)) = L Gnt(H) + Gn(F). Further, if G (F t(H)) D then by computing the exact infima of the expression,

GX F t(H) 64G F t(H) log e D G (F t(H))

Lemma 11 (Lemma A.4 in Bartlett et al. [2005]). Let F be a class of real-valued functions with range in [a, b]. Given a set Z of nt i.i.d. points, with probability at least 1 e δ,

E RZ(F) inf α (0,1)

1 α + (b a)δ 4nα(1 α)

C.3 Analytic Inequalities

Lemma 12. If x x A B = 0 for A, B > 0 then x A2 + 2B.

Proof. As the discriminant of the quadratic in x is positive we are guaranteed two real roots x1, x2.

Taking the larger of the two solutions we have x2 = A

2 . So x2 = A2

2 . By AM-GM, x2 A2

4 = A2 + 2B

Lemma 13 (Lemma B.1. from Srebro et al. [2010]). For any H-smooth non-negative function ℓ: R 7 R and any t, r R we have that

(ℓ(t) ℓ(r))2 6H(ℓ(t) + ℓ(r))(t r)2.

Lemma 14. For k [t], q [n],

j,i=1 (j,i) =(k,q)

Proof. First note that for any k [n],

This follows from applying reverse triangle inequality

Finally, we square both sides and rearrange the single summation into a double summation, which completes the proof.

Lemma 15. Consider the function f(x) = x + a ln (b/x) for a, b 0.

1. The minimizer of the function over 0 < x b is x = min (a, b).

2. The corresponding minimum value is a ln eb min(a,b)

3. For any c such that a c b, a ln (eb/a) c ln (eb/c).

Proof. The first and second claims follow from the fact that the function f decreases from x = 0 to a, and then increases. The third claim follows since the function x 7 x ln (eb/x) increases till x = b.

Lemma 16. Let Z be a random variable taking values in R and let A R be such that P [Z A] > 0, then sup |Z| ess sup |Z| = Z A.

Proof. The first inequality is standard. The second simply follows from the definition of ess sup:

ess sup |Z| = inf {a R : P [Z a] = 0} .

D Proofs of main results

Theorem 2.1 in Bousquet [2002], Lemma 9

Theorem 9 Multi-Function Bennett inequality

Theorem 10 Multi-Function Analog of Theorem 2.1 in Bartlett et al. [2005]

Theorem 11 LRC Bound

Lemma 19 Rademacher width to empirical Rademacher width

Loss class empirical LRC bound

Theorem 14 Target task transfer risk bound

Theorem 13 task-averaged representation difference bound

Theorem 17 Smooth MTL

Lemma 20 Loss class LRC bound

Theorem 15 MTL via MTRL

Theorem 18 Smooth Loss MTL via MTRL

Figure 1: A graph demonstrating the dependency structure of our results. The top blue node is the concentration inequality given in Bousquet [2002]. We abbreviate local Rademacher complexity with LRC .

In this section, we restate the main theorems in our main paper (with precise constants) and give a detailed proof of our results. A graph in Appendix D shows the dependency between our theorems.

Note the relationship between the theorems in the main paper and the theorems in the appendix are as follows:

Theorem 1 becomes Theorem 15 Theorem 2 becomes Theorem 18 Proposition 1 becomes Theorem 16 Theorem 3 becomes Theorem 12 Theorem 4 becomes Theorem 17 Theorem 5 becomes Theorem 19 Theorem 6 becomes Theorem 11 Theorem 7 becomes Theorem 9

In Appendix D.1, we present Theorems 9 to 11 which are generalizations of prior work and also recover the single function setting. In particular, we give a concentration inequality and two risk bounds. In Section D.2, we present Theorem 12 our result for multitask learning with bounded non-negative loss classes. Next, Appendix D.3 contains Theorems 13 to 15 the MTL via MTRL learning setting.

The remaining theorems add the additional assumption of smoothness. We start, in section Appendix D.4, by giving Theorem 16, our bound on the empirical local Rademacher complexity of smooth bounded non-negative loss classes. However, in order to use the above result with our prior theorems we need to bound the local Rademacher complexity constrained on the risk to one

constrained by the empirical risk. So, in Appendix D.5, we introduce Lemmas 19 and 20 to make this conversion. In Appendix D.6 we conclude by invoking our bound on the local Rademacher complexity of smooth bounded non-negative loss classes within our MTL and MTRL theorems with Theorems 17 and 18.

D.1 Concentration Inequalities and Risk Bounds

In this section we forgo the compositional model and will consider a generic function class F.

Theorem 9 (Concentration with t functions). Let F be a class of functions from X to R and assume that all functions f in F are Pj-measurable for all j [t], square-integrable, and satisfy Ef = 0. Let sup f F f 1 and

Z = sup f F t

i=1 fj(Xi j)

or let sup f F ess sup f 1 and

Z = sup f F t

i=1 fj(Xi j). (20)

Let σ > 0 such that σ2 sup f F t

j=1 Var fj(X1 j ) almost surely, then for all δ 0, we have

2δ(tnσ2 + 2EZ) + δ

Proof of Theorem 9. We prove the result for Z as defined in Equation (19); the proof for the definition of X in Equation (20) is similar. Let

Z = sup f F t

j,i=1 fj(Xi j)

and let f (0) 1 , . . . , f (0) t be the functions that achieve the supremum w.r.t. Z.

j,i=1 f (0) j (Xi j)

Next, for each k [t] and q [n] let

Zq k = sup f F t

j,i=1 (j,i) =(k,q)

= sup f F t

j,i=1 fj(Xi j)

The random variable Zq k is essentially removing a fk(Xq k) term from Z. We define f (k,q) 1 , . . . , f (k,q) t as the functions which achieve the supremum w.r.t Zq k. That is,

j,i=1 (j,i) =(k,q)

f (k,q) j (Xi j)

Finally define,

j,i=1 f (k,q) j (Xi j)

We now prove that the conditions in Corollary 1 are satisfied.

j,i=1 f (k,q) j (Xi j)

j,i=1 f (0) j (Xi j)

j,i=1 (j,i) =(k,q)

f (k,q) j (Xi j)

j,i=1 f (0) j (Xi j)

j,i=1 (j,i) =(k,q)

f (0) j (Xi j)

j,i=1 (j,i) =(k,q)

f (0) j (Xi j)

+ f (0) k (Xq k)

j,i=1 (j,i) =(k,q)

f (0) j (Xi j)

(by triangle inequality)

= f (0) k (Xq k)

1 a.s (by assumption)

where Inequality 21 follows from f (0) 1 , . . . , f (0) t being the functions which achieve the supremum w.r.t Z and Inequality 22 follows from f (k,q) 1 , . . . , f (k,q) t achieving the supremum w.r.t Zq k.

In addition, we show that Eq k h Zq k i is non-negative. Recall that Eq k is conditioned on Aq k.

Eq k h Zq k i = Eq k

j,i=1 f (k,q) j (Xi j)

j,i=1 f (k,q) j (Xi j)

Zq k (since Zq k is a constant w.r.t Eq k)

j,i=1 f (k,q) j (Xi j)

Zq k (using Jensen s inequality).

Finally, each f (k,q) j is a constant w.r.t Eq k besides when j = k and i = q so the above reduces to:

Eq k h Zq k i

j,i=1 (j,i) =(k,q)

f (k,q) j (Xi j)

Also, we have

i=1 f (0) j (Xi j)

j,i=1 (j,i) =(k,q)

f (0) j (Xi j)

j,i=1 (j,i) =(k,q)

f (0) j (Xi j)

j,i=1 (j,i) =(k,q)

f (k,q) j (Xi j)

k,q=1 Zq k.

Therefore, Pt j=1 Pn i=1(Z Zq k) Z.

j,i=1 f (k,q) j (Xi j)

j,i=1 (j,i) =(k,q)

f (k,q) j (Xi j)

f (k,q) k (Xq k) 2 (by Corollary 14)

As the samples for the same task are being drawn from the same distribution,

f (k,1) k (X1 k) 2

f (k,1) k (X1 k) 2 .

Note that f (k,1) k is a fixed function w.r.t. X1 k, therefore f (k,1) k is a fixed function as we integrate

over X1 k. So , E1 k

f (k,1) k (X1 k) 2 supfj F E1 k h fj(X1 k) 2i . Applying this reasoning to the t

functions gives the desired bound.

f (k,1) k (X1 k) 2 n

k=1 sup fk F E1 k h fk(X1 k) 2i

= n sup f F t

k=1 E1 k h fk(X1 k) 2i .

Note, crucially, that the product space structure of our function class F t allows us to move the suprema outside the summation. Dividing by nt gives the desired bound.

Zq k 2 sup f F t 1

k=1 Eq k h fk(X1 k) 2i = sup f F t 1

k=1 Var fk(X1 k) .

The right-hand side is bounded by σ2 by hypothesis.

Finally, now we apply Corollary 1 as all the conditions are satisfied.

Theorem 10 (Analog of Theorem 2.1 in Bartlett et al. [2005]). Let F be a class of functions that map X into [a, b]. Given an input X = Xi j t,n i,j=1, assume that there is some r > 0 such that for all

f1, . . . , ft that 1

t Pt j=1 Var fj(X1 j ) r. Then, for every δ > 0, with probability 1 e δ,

sup f F t(Pf ˆP nf) inf α

2(1 + α)ERX(F t) +

tn + (b a) 1

Proof of Theorem 10. Let V + = supf F t(Pf ˆP nf) where ˆP nf = 1 nt Pt j=1 Pn i=1 fj(Xi j)

and Pf = E ˆP nf. Let F = n f Ef (b a) | f F o be a set of scaled functions. For each f F we

have that E f = 0 and f 1. In addition, for all f F t we have that 1

t Pt j=1 Var fj(X1 j ) =

t Pt j=1 Var fj(X1 j ) r (b a)2 . Therefore, the conditions for Theorem 9 are satisfied.

Letting Z = sup f F t Pt j=1 Pn i=1 fj(Xi j) = nt (b a)V + as in Equation (20), we thus have with probability 1 e δ:

tn + 4δEV +(b a)

nt + (b a)δ

4δEV +(b a)

nt + (b a)δ

tn + αEV + + δ(b a)

αnt + (b a)δ

2(1 + α)E RX(F t) +

tn + (b a) 1

where Inequality 23 is by subadditivity of the square root, Inequality 24 is by application of the AM-GM inequality, and Inequality 25 is by Lemma 10.

Theorem 11. Let F be a class non-negative b bounded functions. Assume that X1 j , . . . , Xn j are n draws from Pj for j [t] and all nt samples are independent. Suppose that

Var f V (f) BPf

where V is a functional from F t to R and B > 0. Given an input X = Xi j t,n i,j=1, let ψ be a

sub-root function with fixed point r such that BE RX(F t, r) ψ(r) for all r r . Then, the following holds: For K > 1, α > 0, and x > 1, with probability at least 1 e x.

f F t Pf K K 1 ˆP nf + 200(1 + α)2 Kr

Remark 2 (Comparison to Yousefi et al. [2018]). A Theorem of this form was shown by Yousefi et al. [2018], which generalized results from Bartlett et al. [2005] to vector valued functions. The proof of Theorem 11 follows the same steps as the proof of Theorem Yousefi et al. [2018]. Yet, we will use our Theorem 10 which has better constants than Theorem 1 within Yousefi et al. [2018] for when the class that the suprema is indexing over is has product space structure.

The following lemma, and its proof, is a modified version of work in [Yousefi et al., 2018, Lemma B.2 ], which is a generalized version of work within [Bartlett et al., 2005, Lemma 3.8.]. We remove the Bernstein class structure, make B > 0, and only consider non-negative bounded functions 5.

5 The reason we restrict to non-negative functions instead of functions that map to [a, b] as done in prior work is because we failed to reproduce a step within Bartlett et al. [2005] and Yousefi et al. [2018]. Consider the last sentence within the first paragraph of the proof of Lemma 3.8. in Bartlett et al. [2005]. It is shown that Pf ˆP nf + r/(λBK). Now ˆP nf may be negative for f which maps to [a, b] with a < 0. Therefore, we cannot multiply this quantity by a value greater than one. In this case K/(K 1) for K > 1.

The theorem can be generalized to Bernstien classes like Theorem B.3 from Yousefi et al. [2018] with worse dependence on B.

Lemma 17. Let K > 1, r > 0 and B > 0. Suppose that for all f F t we have V (f) BPf. Define the re-scaled version of F t as

F t r = f = f 1, . . . , f t | f j = rfj max(r, V (f)), f = (f 1, . . . , f t) F t . (26)

If V + r = supf F t r

h Pf ˆP nf i r BK , then

f F t, Pf K K 1 ˆP nf + r BK .

Proof of Lemma 17. Let f be any element in F t.

Case 1. If V (f) r, then f = f and the inequality V + r r BK leads to

P(f) ˆP n(f) + r BK .

If all component functions of f are positive then ˆP n(f) is positive and as K/(K 1) > 1 for all K > 1 we have

P(f) K K 1 ˆP n(f) + r BK .

Case 2. If V (f) r, then f = rf/V (f) and the inequality V + r r BK yields

P(f) ˆP n(f) + V (f)

BK ˆP n(f) + P(f)

Solving for P(f) gives

P(f) K K 1 ˆP n(f)+ K K 1 ˆP n(f) + r BK .

which completes the proof.

Proof of Theorem 11. We modify the proof within Yousefi et al. [2018], which generalises the proof within Bartlett et al. [2005] by extending it to multi-function and Bernstein classes. At a high level we apply the same steps as Yousefi et al. [2018], yet we use our concentration inequality and do not need to consider the more complicated inequality [Yousefi et al., 2018, Lemma B.1] as we do not have the additional Bernstein structure. The proof closely follows from Yousefi et al. [2018] with minor modifications as described and presented for completeness.

Let r r be a fixed real number.

Verifying Conditions of Theorem 10. We seek to use Theorem 10 on F t r , see Equation (26), therefore we will validate the conditions of the theorem.

Fix f F t r . Thus, f = rf/ max(r, V (f)) for some f F. If max(r, V (f)) = r, then f = rf/ max(r, V (f)) = f. Thus, Var f = Var f V (f) r. If max(r, V (f)) = V (f), then Var f = r2 V 2(f)Var f r2 V (f) r. Therefore, 1

t supf F t r Pt j=1 E f j (Xj) 2 r. Finally, recall, functions in F are positive and b-bounded.

Thus, by Theorem 10, with probability at least 1 e δ and δ > 0

sup f F t(Pf ˆP nf) inf α

2(1 + α)E RX(F t r ) +

Bounding RX(F t r ). Now we must control the Rademacher complexity within the above inequality. This part of the proof is the same for Bartlett et al. [2005], Yousefi et al. [2018].

Let F t(u, v) := {f F t : u V (f) v} for all 0 u v, and let

i=1 σi jf j Xi j ,

RX F t r = sup f F t r

Observe that RX (F t r ) = ERX (F t r ). By assumption we have that V (f) B(Pf) Bb, f F t. Fix some λ > 1 and define k as the smallest integer such that rλk+1 Bb. By leveraging the principle that combining function classes on a uniform set of inputs is equivalent to the Minkowski sum of the created image, and utilizing the trait of Rademacher width breakdown under Minkowski sum as detailed by Vershynin [2018], we have, RX (G1 G2) RX (G1) + RX (G2) .

We thus the following inequalities

RX F t r = E

sup f F t r RX(f )

sup f F t 1 nt

r max(r, V (f))σi jfj Xi j

sup f F t(0,r)

i=1 σi jfj Xi j

sup f F t(r,Bb)

r V (f)σi jfj Xi j

sup f F t(0,r)

i=1 σi jfj Xi j

sup f F t(rλj,rλj+1) RX(f)

RX(F t, r) +

j=0 λ j RX F t, rλj+1

As r r , we can bound the local Rademacher width with the sub-root function.

j=0 λ jψ rλj+1 .

Now the sub-root property of ψ gives us that that ψ(ξr) ξ 1 2 ψ(r) for any ξ 1 and, hence,

RX F t r ψ(r)

Pick λ = 4 in the above inequality, which gives RX (F t r ) 5ψ(r)/B. Also, r is the fixed point of ψ, we have for all r r that ψ(r) p

r/r ψ (r ) =

rr . Taken together, we have

Applying Lemma 17. The remainder of the proof is simple manipulations and applying Lemma 17. Here our proof deviates from [Yousefi et al., 2018]. Using our above bound on the Rademacher complexity gives for any r r and δ > 0, we have with probability at least 1 e δ,

sup f F t(Pf ˆP nf) inf α

Now, letting A = 10(1 + α)

2δ/nt and C = b 1

sup f F t r

h Pf ˆP nf i A r + C.

Setting A r + C = r BK gives a quadratic, which has both roots bounded by (ABK)2 + 2BKC by Lemma 12 now applying Lemma 17 we have that

Pf K K 1 ˆP nf + BKA2 + 2C.

Plugging in our values for A and C gives

Pf K K 1 ˆP nf +100 (1 + α)2 Kr

2 (1 + α) K

Applying AM-GM inequality,

Pf K K 1 ˆP nf + 101 (1 + α)2 Kr

which completes the proof.

D.2 Multi-Task Learning

The following theorem is a vector-valued generalization of Theorem 5.2 within Bartlett et al. [2005] with several differences: we do not require δ

n r , b = 1, and we solve for constants.

Theorem 12. Let (ˆf, ˆh) be an empirical risk minimizer as given in Eqn. (2). Let ψ(r) b E RX(F t(H), r) with r the fixed point of ψ(r). Then, under Assumption 1.A with probability 1 2e δ,

Rsource(ˆf, ˆh) Rsource(f , h ) + q

Rsource(f , h )

bδ nt + 146

Proof of Theorem 12. To simplify the notation within the proof let L = Rsource(f , h ).

We follow steps similar to the proof of Theorem 5.2 within Bartlett et al. [2005]. Assume f j = arg minf F Pjℓf exists. (As mentioned in Bartlett et al. [2005] if it does not exist we can consider a sequence converging to the infimum.) Note that {f j (Xi j)}i [n],j [t] are nt independent r.v.. Note that

j=1 Var f j (Xi j) 1

j=1 E f j (Xi j) 2 b

j=1 Ef j (Xi j) = b L .

By Bernstein inequality, with probability 1 e δ

ˆP n ˆ f ˆP nf L +

By Theorem 11, setting α = 1 11 and B = b, and union bound, with probability 1 2e δ,

Pℓˆ f K K 1

nt + L + 2bδ

882nt + 68bδ

3nt + 144Kr

By change of variables we consider K > 0 by adding 1 to K:

1789bδ (K + 1)

882nt + 68bδ

3nt + 144r (K + 1)

b + (K + 1)

nt + L + 2bδ

882nt + 144Kr

nt + L + 22369bδ

882nt + 144r

K + 2bδ 3Knt.

Applying AM-GM to

L + 1789Kbδ

882nt + 144Kr

nt + 22369bδ

882nt + 144r

K + 5bδ 3Knt.

These are the terms without K:

nt + 22369bδ

882nt + 144r

Considering only those terms a function of K:

882nt + 144Kr

K + 5bδ 3Knt.

We set K = max

bδ nt . Terms with K in the numerator are upper-bounded by

882nt max q

Terms with K in the denominator are upper-bounded by

Expanding both expressions gives eight terms where we will choose an appropriate value to maximize the quantity:

5b 3 2 δ 3 2

3n 3 2 t 3 2 max

1789b 3 2 δ 3 2

882n 3 2 t 3 2 max q

882nt max q

Adding all of the right-hand side expressions together gives:

882nt + 437

Adding the terms without K, Equation (28) gives:

b + L + 12814bδ

441nt + 437

Applying AM-GM again

bδ nt + 3553

bδ nt + 146

+ L + 89867bδ

882nt + 1301r

Rounding up to the nearest integers:

bδ nt + 146

+ L + 102bδ

D.3 Multi-Task Learning via Representation Learning

Theorem 13. Let ˆh be an empirical risk minimizer as in (2). Let ψ(r) b E RZ(ℓ F t H, r) with r the fixed point of ψ(r). Then, if Assumption 1.A holds. Then, with probability at least 1 e δ,

d F,f (ˆh, h ) q

Rsource(f , h )

bδ nt + 146

where recall that d F,f is defined in Equation (12).

Proof of Theorem 13.

d F,f (ˆh, h ) = inf f F t

n Rsource(f , ˆh) Rsource(f , h ) o

Rsource(ˆf, ˆh) Rsource(f , h )

Rsource(f , h )

bδ nt + 146

where the last inequality is due to Theorem 12.

By using our hypothesis which rates the task-averaged representation distance to the worst case representation difference we can bound the excess transfer risk.

Theorem 14. Let ˆf0 be an empirical risk minimizer of ˆRtarget( , ˆh) in Eqn. (3) for any representation ˆh. Let ψ(r) b E RZ(ℓ F0, r) with r the fixed point of ψ(r). Then if Assumption 1.A holds, with probability at least 1 δ :

Rtarget( ˆf0, ˆh) Rtarget(f 0 , h ) + q

Rtarget(f 0 , h )

bδ m + 219 rr

2b + d F,F0(ˆh; h ).

Proof of Theorem 14. Let f0 = arg minf F Rtarget(f, ˆh). By adding and subtracting d F,F0(ˆh; h ) then applying Theorem 12 to Rtarget( ˆf0, ˆh), we have convenient cancellation of Rtarget( f0, ˆh). For notational succinctness let

b and Bt = 102bδ

Therefore we have:

Rtarget( ˆf0, ˆh) = Rtarget( ˆf0, ˆh) d F,F0(ˆh; h ) + d F,F0(ˆh; h )

= Rtarget( ˆf0, ˆh) Rtarget( f0, ˆh) Rtarget(f 0 , h ) + d F,F0(ˆh; h )

Rtarget( f0, ˆh) + q

Rtarget( f0, ˆh)At + Bt

Rtarget( f0, ˆh) Rtarget(f 0 , h ) + d F,F0(ˆh; h )

Rtarget( f0, ˆh)At + Bt + Rtarget(f 0 , h ) + d F,F0(ˆh; h ).

Now note that Rtarget( f0, ˆh) = Rtarget(f 0 , h ) + d F0,f 0 (ˆh, h ). After applying subadditivity of the square root and AM-GM we can bound d F0,f 0 (ˆh, h ) with Theorem 13. This is concretely

demonstrated with the following chain of inequalities:

Rtarget( ˆf0, ˆh) q

Rtarget(f 0 , h ) + d F0,f 0 (ˆh, h )At + Bt + Rtarget(f 0 , h ) + d F,F0(ˆh; h )

Rtarget(f 0 , h )At + q

d F0,f 0 (ˆh, h )At + Bt + Rtarget(f 0 , h ) + d F,F0(ˆh; h )

Rtarget(f 0 , h )At + d F0,f 0 (ˆh, h )

2 + A2 t 2 + Bt + Rtarget(f 0 , h ) + d F,F0(ˆh; h )

Rtarget(f 0 , h )At +

Rtarget(f 0 , h )At + Bt

+ A2 t 2 + Bt + Rtarget(f 0 , h ) + d F,F0(ˆh; h )

Rtarget(f 0 , h )3

2At + A2 t 2 + 3

2Bt + Rtarget(f 0 , h ) + d F,F0(ˆh; h ).

We can apply AM-GM to A2 t = 6 q

bδ m + 146 q

to get A2 72bδ

Finally, by collecting like terms, we have that Rtarget( ˆf0, ˆh) is bounded by

Rtarget(f 0 , h )

bδ m + 219 rr

2b + Rtarget(f 0 , h ) + d F,F0(ˆh; h ).

Theorem 15. Let ˆh and ˆf0 be the learned representation and target predictor, as described in Eqns. (2) and (3). Let ψ1(r) b E RZ(ℓ F t H, r) and ψ2(r) b E RZ(ℓ F0, r) with r 1 and r 2 the fixed points of ψ1(r) and ψ2(r), respectively. Then, under Assumption 1.A and that f is (ν, ϵ)-diverse over F0 w.r.t. h , with probability at least 1 4e δ, the transfer learning risk is upper-bounded by,

Rtarget( ˆf0, ˆh) Rtarget(f 0 , h ) + q

Rtarget(f 0 , h )

m + 21967r 1 2b

Rsource(f , h )

bδ nt + 146

nt + 217r 2 b .

Proof of Theorem 15. The proof is by first applying Theorem 14, then (ν, ε)-diversity assumption, and finally Theorem 13. This is demonstrated with the following series of inequalities.

Rtarget( ˆf0, ˆh) q

Rtarget(f 0 , h )

m + 21967r 1 2b + Rtarget(f 0 , h )

+ d F,F0(ˆh; h )

Rtarget(f 0 , h )

m + 21967r 1 2b + Rtarget(f 0 , h )

Rsource(f , h )

bδ nt + 146

nt + 217r 2 b .

D.4 Empirical Local Rademacher Complexity Bound

The following lemma allows us to discard the composition of a loss so long as we appropriately scale the cover.

Lemma 18. For a non-negative H-smooth loss and any function class F, we have:

N Lℓ(F t, r), ϵ, nt N F t, ϵ

12Hrnt , nt .

We follow the same first few steps as in Srebro et al. [2010] but then bound the maximum by summation to recover an empirical L2 cover.

Proof of Lemma 18. By Lemma 13 we see that for a non-negative H-smooth function f, we have that (f(t) f(r))2 6H(f(t)+f(r))(t r)2. Using this inequality, for any sample X = {(xi j, yi j)}j [t] i [n] ,

of nt points, for an fϵ, we have,

ℓ f xi j , yi j ℓ fε xi j , yi j 2

ℓ f xi j , yi j + ℓ fε xi j , yi j f xi j fε xi j 2

ℓ f xi j , yi j + ℓ(fε (xi) , yi) r

max i [n] (f (xi) fε (xi))2

max i [n],j [t]

f xi j fε xi j 2

f xi j fε xi j 2

f xi j fε xi j 2.

That is, a cover of {f F t : Pf r} at radius ϵ/

12Hrnt is also a cover of Lℓ(F , r) at radius ϵ, and we can conclude that,

N Lℓ(F t, r), ϵ, nt N F t, ϵ

12Htrn , nt ,

which completes the proof.

Theorem 16 (Smooth non-negative local Rademacher complexity bound). Under the setting of Theorem 15 along with ℓbeing H-smooth and Assumptions 1.B and 1.C, we have:

Rnt(Lℓ(F t(H)), r) 640

Hr log (nt)2 + 80

Hr log (nt) + 4

and if 16G (F t(H)) D and

3Π(F t(H)), which holds for large enough samples,

Rnt(Lℓ(F t(H)), r) 2560

3 rΠ(F t(H)) log

b 1920Π(F t(H))

where G(F t(H)) = L Gnt(H) + Gn(F) and Π(F t(H)) =

HG (F t(H)) log e D 16G(F t(H)) .

Proof of Theorem 16. Let Z denote the dataset of nt points. We first compute supg F t(H) sup Z f L2(Z). Note that,

sup g ℓ(F t(H)) sup Z g L2(Z) sup f F t,h H sup Z ℓ f h L2(Z)

= sup f F t,h H

ℓ(fj(h(xi j))) 2

sup f F t,h H b 1

i=n ℓ(fj(h(xi j))) br.

Therefore, by refined Dudley s entropy integral formula, we have

Rnt(Lℓ(F t(H)), r) inf

log N(Lℓ(r), ε, nt)

(by Lemma 7)

log N(F t(H), ε

12Hrnt, nt)

(by Lemma 18)

12Hrnt Gnt(F t(H))

(by Lemma 2)

3Hr Gnt(F t(H)) log

Now we apply the Gaussian chain rule Theorem 8 to Gnt(F t(H)), giving us:

Rnt(Lℓ(F t(H)), r) inf

3Hr inf D δ 0

4δ + 64G F t(H) log D

Setting δ = min(16G (F t(H)) , D) and simplifying by bounding the minimum with 16G (F t(H)) we have:

Rnt(Lℓ(F t(H)), r) 2560

Hr log e D min (16G (F t(H)) , D)

Doing the same for δ with min

3G (F t(H))

Hr log e D min(16G(F t(H)),D) we have Rnt(Lℓ(F t(H)), r) 2560

log e D min (16G (F t(H)) , D)

3G (F t(H))

H log e D min(16G(F t(H)),D)

The above expression is optimal with respect to the parameters α and δ and clearly gives four different possibilities based on the two minima. Since, under a bounded setting, G(F t(H) would typically go down with nt, for large enough nt we have that min(16G F t(H) , D) = 16G F t(H)

Hr log e D min (16G (F t(H)) , D)

Hr log e D 12G (F t(H))

Under these conditions, we have

Rnt( Lℓ(F t(H)), r) 2560

3 rΠ(F t(H)) log

b 1920Π(F t(H))

where Π(F t(H)) = G (F t(H))

H log e D 16G(F t(H)) .

Under the other three possibilities for the minima some or all the logarithmic factors are identically one and lead to a simplified bound.

Alternatively, for a more simple bound which always holds set α = q

br nt and δ = D

Rnt( Lℓ(F t(H)), r) 640

Hr log (nt)2 + 80

Hr log (nt) + 4

Remark 3 (Standard single function setting.). Everything in the proof of Theorem 16 up to Equation (31) holds for a non-compositional model and where t = 1 which corresponds to the standard setting. In this context, there is no need to apply the Gaussian chain rule, Theorem 8. Let Q : X R be a hypothesis class and note that we have n i.i.d. samples.

If we set α = min

Hr then we have

Rn(Lℓ(Q), r) 40

and if we set α = q

Rn(Lℓ(Q), r) 20

Hr log (n) + 3

Using Lemma 5, we also get the following bounds for local Rademacher complexity (as opposed to width),

Rn(Lℓ(Q), r) 40

Rn(Lℓ(Q), r) 20

Hr log (n) + 5

D.5 Theorems for transition from empirical to population

Lemma 19. Let F be a class of functions that map X into [0, b] with b > 0. Fix α > 0. For every δ > 0 and r that satisfy

r 4 (α + 1) ERZ{f F t | Pf r} + bδ

we have with probability at least 1 e δ that f F t | Pf r n f F t | ˆP nf 2r o .

The proof is similar to the proof of Corollary 2.2 within Bartlett et al. [2005].

Proof of Lemma 19. For f {f F t | Pf r} we have Var f Pf 2 b Pf br. By Theorem 10, with probability 1 e δ, every f {f F t | Pf r} satisfies, with α =

ˆP nf Pf + 2(1 + α)ERZF t r +

r + 2(1 + α)ERZF t r +

r + 2(1 + α)ERZF t r + bδ

D.6 Smooth Learning Bounds

Lemma 20. Under the setting of Theorem 16 for any δ > 0, the fixed point r is bounded by,

c1G F t(H) 2 Hb log (nt)4 + 2332800D2Hb log (nt)2

nt + 112b2δ

3nt + 1944b2

where c1 < 1.5 108 and G(F t(H)) = L Gnt(H) + Gn(F).

Further, if 16G F t(H) D and

3Π(F t(H)),

where Π(F t(H)) =

HG (F t(H)) log e D 16G(F t(H)) , then for any δ > 0, the fixed the fixed point r is bounded by,

c2G F t(H) 2 Hb log De 16G (F t(H))

1920G (F t(H))

H log e D 16G(F t(H))

with c2 < 8 108.

Proof of Lemma 20. Let

γ(r) = 4(α + 1)ERZ(bf F t | Pf r

Now we have

b ERZ n f F t | Pf r

o 4(1 + α)ERZ n bf F t | Pf r

= γ(r). (32)

Note that from Lemma 19, we have that, for any δ > 0, with probability at least 1 e δ,

RZ n bf F t | Pf r

bf F t | ˆP nf 2r

Further, we can apply the concentration of Rademacher complexity, Lemma 11, to bound the expected value of the left-hand side as follows. With probability at least 1 e δ,

ERZ n bf F t | Pf r

o 2RZ n bf F t | Pf r

Hence, with probability 1 2e δ, we get,

E h RZ n bf F t | Pf r

bf F t | ˆP nf 2r

We now apply Lemma 16 and upper bound the right-hand side by taking supremum over Z. This gives us,

E h RZ n bf F t | Pf r

bf F t | ˆP nf 2r

Plugging the above in Eqn. (32), we get,

b ERZ n f F t | Pf r

o 8(1 + α)ERnt

bf F t | ˆP nf 2r

3 + 4(1 + α) + 2

Suppose that the local Rademacher complexity is bounded by a multiple of r, say r E. Now bounding the local Rademacher complexity with Theorem 16, we have that,

b ERZ n f F t | Pf r

3 + 4(1 + α) + 2

A = 8(1 + α)

3 + 4(1 + α) + 2

Note that A r + D is sub-root and larger than γ(r). Thus if we solve for a fixed point r of this expression we have A

r + D = r . By Lemma 12 we have r A2 + 2D.

Therefore, with α = 1/8,

b Rn n f F t | Pf r

is bounded by a sub-root function of r with the following fixed point:

b72 (E)2 b + 139

We have seen from Theorem 16 that if 16G (F t(H)) D and

3Π(F t(H)) then we can set

3Π(F t(H)) log

b 1920Π(F t(H))

In this setting, using Equation (33), the fixed point is bounded by:

796262400G F t(H) 2 Hb log De 16G (F t(H))

1920G (F t(H))

H log e D 16G(F t(H))

Alternatively, we can always set E as follows:

H log (nt)2 + 80

H log (nt) + 5

therefore we also have that the fixed point is bounded as:

149299200G F t(H) 2 Hb log (nt)4 + 2332800D2Hb log (nt)2

nt + 112b2δ

3nt + 1944b2

D.7 Smooth MTL and MTL via MTRL

When we add the additional assumption that the loss function is smooth we have the following extensions to Theorems 12 and 15 Theorem 17. Let (ˆf, ˆh) be an empirical risk minimizer as given in Eqn. (2). Let ψ(r) b Rn(F t(H), r) with r the fixed point of ψ(r). Then, under Assumption 1 along with ℓbeing H-smooth, with probability 1 2e δ,

Rsource(ˆf, ˆh) Rsource(f , h ) + q

Rsource(f , h )

bδ nt + 146

(34) where r

G F t(H) 2 H log (nt)4 + D2Hb log (nt)2

with c < 1.5 108 and G(F t(H)) = L Gnt(H) + Gn(F).

Proof of Theorem 17. Use the fixed point bound from Lemma 20 that always holds and substitute it into Theorem 12.

Theorem 18. Let ˆh and ˆf0 be the learned representation and target predictor, as described in Eqns. (2) and (3). Let ψ1(r) b Rn(ℓ F t H, r) and ψ2(r) b Rn(ℓ F0, r) with r 1 and r 2 the fixed points of ψ1(r) and ψ2(r), respectively. Then, under Assumption 1 along with ℓbeing H-smooth. and that f is (ν, ϵ)-diverse over F0 w.r.t. h , with probability at least 1 2e δ, the transfer learning risk is upper-bounded by,

Rtarget( ˆf0, ˆh) Rtarget(f 0 , h ) + q

Rtarget(f 0 , h )

m + 21967r 1 2b (35)

Rsource(f , h )

bδ nt + 146

nt + 217r 2 b .

where r 2 b c

G F t(H) 2 H log (nt)4 + D2Hb log (nt)2

where c < 1.5 108 and r 1 b 97200 G2 m(F0 ˆh)H log (m)2 + 112bδ

Proof of Theorem 18. Use the the fixed point bound from Lemma 20 that always holds and substitute it into Theorem 15.

D.8 Local Rademacher complexity chain rule

Theorem 19. Suppose the loss function ℓis Lℓ-Lipschitz. Define the restricted representation and predictor classes as follows,

ℓ FX(r) := ℓ f ℓ F t : h H : V (ℓ f h) r

HX(r) := h H : f F t : V (ℓ f h) r ,

where V is the functional in the local Rademacher complexity description. Under Assumptions 1.B and 1.C and that the worst-case Gaussian width of the above is bounded by the sub-root functions ψF and ψH, respectively, there exists an absolute constant c such that

Gn(Lℓ(F t(H), r)) c (LLℓψF(r) + ψH(r)) log (nt) + D

Proof of Theorem 19. Define

ℓ F t X (H)(r) = {(ℓ f, h) : V (ℓ f h) r} .

Observe that for any (ℓ f, h) F t X (H), we have f FX(r) and h HX(r). Hence,

ℓ F t X (H)(r) FX(r) HX(r)

Further, by assumption, the composed function ℓ f is LℓL-Lipschitz. We can now apply the (standard) chain rule of Tripuraneni et al. [2021] which gives us,

Gn(Lℓ(F t X (H)(r))) Gnt(ℓ FX(r) HX(r))

(LLℓψF(r) + ψH(r)) log (nt) + D

which completes the proof.

E Detailed comparison with prior works

In this section, we provide a more detailed comparison with prior works.

E.1 Comparison with Srebro et al. [2010]

Firstly, we identify some erroneous or missing though fixable steps in the proof of Srebro et al. [2010].

1. Dudley s integral formula. The work of Srebro et al. [2010] seems to interchange the concepts of Rademacher width and complexity with the same notation Rn(F). The original Dudley s formula and its truncated version proved by Srebro et al. [2010] is for Rademacher (or Gaussian) width , yet in their work, it is used to bound Rademacher complexity. It is possible to translate from width to complexity, but this will lead to an additional term of q

n as detailed below. This additional term only changes their result by constant factors.

Lemma 21. Consider a class of functions F with range in [0, b]. Then given input Z of n

points, RZ n f F : ˆP nf r o 2 RZ n f F : ˆP nf r o + q

Proof. The proof follows by an application of Lemma 5. In particular, given input Z of n points, for any f n f F : ˆP nf r o , we have that

1 n f L2(Z) =

Plugging this in Lemma 5 gives the claimed bound.

2. Centering. Within Srebro et al. [2010] they consider loss with the bounded difference property: ˆy, ˆy, y we have that |ℓ(ˆy, y) ℓ(ˆy , y)| b. To bound the local Rademacher complexity of their loss class which is empirically constrained with ˆP n(ℓ f) r, they use

Dudley s integral. Note the upper limit of integration in Dudley s integral is q

and it is claimed that q

br because ˆP n(ℓ f)2 b P(ℓ f) br. Yet, this reasoning only follows for b-bounded losses not under this weaker condition of bounded difference. Yet, it is possible to center the process and perform a comparable analysis. To resolve this issue let f = arg inff F ˆP n(ℓ f) and consider (ℓ f) (ℓ f). This process is now b-bounded. Centering the process in this way does not affect the downstream results because Gaussian/Rademacher width is shift agnostic due to symmetry.

3. Missing condition on n. This omission is related to the subsequent discussion about comparison with Srebro et al. [2010]. In the proof of Lemma 2.2, after Eqn. 23, the limits of integration are required to satisfy

br. Using the fact that in the setup, in general, Rn(F) = Θ B n , the above is equivalent to n = Ω HB2

b , This condition on n is missing from their main statement. Further, as we detail in Appendix E.1.1, this term, in general, can be unbounded. 4. Fat-Shattering inequality. Using the notation from Srebro et al. [2010]. Note that fatx is a decreasing function in x. Also x log( n

x) is a decreasing function in x for x 1. Therefore fatx log( n fatx ) is an increasing function in x. So therefore, for ε [γ, θ] we have fatγ log( n fatγ ) fatε log( n

fatε ) fatθ log( n

fatθ ). This in inequality is in the wrong direction on page 27 of Srebro et al. [2010].

Disregarding the above issues, the bound obtained on Srebro et al. [2010] on the local Rademacher complexity is, Rn(Lℓ(F, r)) = O Rn(F)

Hr (log (n))3/2 . (37)

In contrast ours is,

Rn(Lℓ(F, r)) c Gn(F)

Worst-case improvement for large enough samples. We first perform a general comparison. Accordingly, we bound the above as,

Rn(Lℓ(F, r)) c Gn(F)

log (n) log

log (n) log eb H log (n) n B2

The above is of the same form as the bound of Srebro et al. [2010] in Eqn. (37), and it is easy to see that ours is better whenever n = eΩ( b HB2 ).

Worst-case improvement for constant samples. Unfortunately, the term b HB2 can be unbounded, from above, in general, as detailed in Appendix E.1.1. However, a reasonable and standard assumption removes this issue. If we assume that loss at zero is bounded as follows, ℓ(0) HB2, then we show in Lemma 22, that b HB2 = Ω(1), thereby, making the requirement to be a constant number of samples. A uniform bound on ℓ(0) features in the characterization of sample complexity of learning with non-negative smooth losses as well as the regime described by the above bound on ℓ(0) has been considered in prior works [Arora et al., 2022, Shamir, 2015].

Improvement for certain hypothesis class. In the above chain of inequalities, we use the bound that Gn(F) Rn(F) log n which is tight in the worst case. However, there are natural situations where the two are of the same order. A prominent example is linear predictors with data bounded in (any) norm and hypothesis class bounded in the via a (regularization) function R which is strongly convex with respect to the dual norm [Kakade et al., 2008].

E.1.1 Understanding the HB2

No non-trivial lower bound. Firstly, we see that the term is not non-trivially lower bounded, in general. Consider the function ℓ(f(x)) = 1

2 (q f(x))2. This function is 1-smooth and considers B = 1 (which bounds the size of f(x)), however, b here is controlled by the size of q and thus could be unbounded. This means that the is a non-trivial lower bound on HB2

b in general. The above example also works if relax the definition of b to be "bounded difference", supf(x),f (x ) F (ℓ(f(x)) ℓ(f (x ))) as opposed to a uniform absolute bound.

A lower bound under ℓ(0) bound. We give a lower bound under the assumption of bound on ℓ(0). Lemma 22. Let F be a class of real-valued functions and let the loss function ℓ: R R be a non-negative H-smooth function with ℓ(0) L0, and supz Range(F) ℓ(z) b. Then,

b 1 1 + 2L0

Proof. Note that

b = sup z ℓ(z) = ℓ(z) ℓ(0) + ℓ(0)

ℓ (0), z z + H

2 |z|2 + L0

2H + H |z|2

2 + HB2 + L0

= 2L0 + HB2

where the first inequality uses smoothness, the second AM-GM inequality, and the third uses the self-bounding property of non-negative smooth losses (Lemma 2.1 in Srebro et al. [2010]).

No non-trivial upper bound. Consider the function ℓ(z) = H(1 + sin(z)); this is non-negative, H-smooth, and b = 2H. However, we can define the size of the domain B as arbitrary, without affecting the smoothness and range boundedness. Thus, HB2

b is unbounded from above, in general. Note that even assuming an upper bound on ℓ(0) doesn t help here.

E.2 Comparison with Denevi et al. [2019], Khodak et al. [2019]

MTRL can be seen as a more specific case of meta-learning in Denevi et al. [2019], Khodak et al. [2019], which has the type of representation learning we study frequently called as feature learning. It is possible to give guarantees for excess transfer risk in this more general setting. For example, the work of Denevi et al. [2019], under a generative model assumption, is restricted to linear predictors and convex losses. In their Thm. 5, they get a rate of O 1 n + 1

, on excess transfer risk. However, for the feature learning setting, as they point out, this rate contains terms with hidden dependence on n and t. Their guarantee for feature learning, with linear representations (which is more restrictive than ours) Corollary 7 , gets a rate of O 1 n + 1 t1/4 .

Another work of Khodak et al. [2019] mainly considers two settings of convex and strongly convex losses, respectively. Interestingly, fast rates in terms of the number of tasks can be obtained. Thm. 5.1 in the work, obtains the following rates, O 1 n + 1

for convex Lipschitz losses and

O 1 n + 1 nt for strongly-convex Lipschitz losses.