# Representation Learning Beyond Linear Prediction Functions

Ziping Xu (Department of Statistics, University of Michigan, zipingxu@umich.edu)
Ambuj Tewari (Department of Statistics, University of Michigan, tewaria@umich.edu)

## Abstract

Recent papers on the theory of representation learning have shown the importance of a quantity called diversity when generalizing from a set of source tasks to a target task. Most of these papers assume that the function mapping shared representations to predictions is linear, for both source and target tasks. In practice, researchers in deep learning use different numbers of extra layers following the pretrained model based on the difficulty of the new task. This motivates us to ask whether diversity can be achieved when source tasks and the target task use different prediction function spaces beyond linear functions. We show that diversity holds even if the target task uses a neural network with multiple layers, as long as the source tasks use linear functions. If source tasks use nonlinear prediction functions, we provide a negative result by showing that depth-1 neural networks with ReLU activation need exponentially many source tasks to achieve diversity. For a general function class, we find that the eluder dimension gives a lower bound on the number of tasks required for diversity. Our theoretical results imply that simpler tasks generalize better. Though our theoretical results are stated for the global minimizer of empirical risk, their qualitative predictions still hold for gradient-based optimization algorithms, as verified by our simulations on deep neural networks.

## 1 Introduction

It has become a common practice (Tan et al., 2018) to use a pretrained network as the representation for a new task with small sample size in various areas, including computer vision (Marmanis et al., 2015), speech recognition (Dahl et al., 2011; Jaitly et al., 2012; Howard and Ruder, 2018), and machine translation (Weng et al., 2020). Most representation learning is based on the assumption that the source tasks and the target task share the same low-dimensional representation. In this paper, following the work of Tripuraneni et al. (2020), we assume that each task, indexed by $t$, generates observations noisily from the mean function $f_t^* \circ h^*$, where $h^*$ is the true representation shared by all the tasks and $f_t^*$ is called the prediction function. We assume $h^* \in \mathcal{H}$, the representation function space, and $f_t^* \in \mathcal{F}_t$, the prediction function space.

Tripuraneni et al. (2020) proposed a diversity condition which guarantees that the learned representation will generalize to target tasks with any prediction function. Assume we have $T$ source tasks labeled $1, \dots, T$ and $\mathcal{F}_1 = \cdots = \mathcal{F}_T = \mathcal{F}_{so}$. We denote the target task by $ta$ and let $\mathcal{E}_t(f, h) \in \mathbb{R}$ be the excess error of $(f, h) \in \mathcal{F}_t \times \mathcal{H}$ for task $t$. The diversity condition can be stated as follows: there exists some $\nu > 0$ such that for any $f^*_{ta} \in \mathcal{F}_{ta}$ and any $h \in \mathcal{H}$,

$$\underbrace{\inf_{f_{ta} \in \mathcal{F}_{ta}} \mathcal{E}_{ta}(f_{ta}, h)}_{\text{excess error given } h \text{ for the target task}} \;\le\; \nu \cdot \underbrace{\inf_{f_t \in \mathcal{F}_{so},\, t=1,\dots,T} \frac{1}{T}\sum_{t=1}^{T} \mathcal{E}_t(f_t, h)}_{\text{excess error given } h \text{ for the source tasks}}.$$

The diversity condition relates the excess error for the source tasks and the target task with respect to any fixed representation $h$. A smaller $\nu$ indicates a better transfer from source tasks to the target task. Generally, we say $T$ source tasks achieve diversity over $\mathcal{F}_{ta}$ when $\nu$ is finite and relatively small.
The number of source tasks plays an important role here. To see this, assume $\mathcal{F}_{so} = \mathcal{F}_{ta} = \mathcal{F}$ is a discrete function space whose elements can be arbitrarily different. In this case, the target task must coincide with at least one source task, which in turn requires $T \ge |\mathcal{F}|$ and $\nu \ge |\mathcal{F}|$. So far, it has only been understood that when both the source tasks and the target task use linear prediction functions, it takes at least $d$ source tasks to be diverse, where $d$ is the dimension of the linear mappings. This paper answers the following two open questions with a focus on deep neural network (DNN) models:

- How does representation learning work when $\mathcal{F}_{so} \neq \mathcal{F}_{ta}$?
- How many source tasks do we need to achieve diversity with nonlinear prediction functions?

There are strong practical motivations to answer these two questions. In practice, researchers use representation learning despite the difference in difficulty levels between source and target tasks. We use a more complex function class for a substantially harder target task, which means $\mathcal{F}_{so} \neq \mathcal{F}_{ta}$. This is reflected as extra layers when a deep neural network model is used as the prediction function. On the other hand, the source task and the target task may have different objectives: a representation pretrained on a classification problem, say ImageNet, may be applied to object detection or instance segmentation problems. For instance, Oquab et al. (2014) trained a DNN on ImageNet and kept all the layers as the representation except for the last linear mapping, while two fully connected layers were used for the target task of object detection.

Another motivation for our work is the mismatch between recently developed theories of representation learning and common empirical practice. Recent papers on the theory of representation learning all require multiple source tasks to achieve diversity so as to generalize to any target task in $\mathcal{F}$ (Maurer et al., 2016; Du et al., 2020; Tripuraneni et al., 2020). However, most pretrained networks are trained on a single task, for example, the ImageNet pretrained network. To this end, we will show that a single multi-class classification problem can be diverse. Lastly, while it is common to simply use a linear mapping as the source prediction function, there is no clear theoretical analysis showing whether or not diversity can be achieved with nonlinear prediction function spaces.

**Main contributions.** We summarize the main contributions made by this paper.

1. We show that diversity over $\mathcal{F}_{so}$ implies diversity over $\mathcal{F}_{ta}$ when both $\mathcal{F}_{so}$ and $\mathcal{F}_{ta}$ are DNNs and $\mathcal{F}_{ta}$ has more layers. More generally, the same statement holds when $\mathcal{F}_{ta}$ is more complicated than $\mathcal{F}_{so}$, in the sense that $\mathcal{F}_{ta} = \mathcal{F}'_{ta} \circ (\mathcal{F}_{so}^{m})$ for some positive integer $m$ and function class $\mathcal{F}'_{ta}$, where $\mathcal{F}_{so}^{m}$ denotes the $m$-fold Cartesian product of $\mathcal{F}_{so}$.
2. Turning our attention to the analysis of diversity for non-linear prediction function spaces, we show that for a depth-1 NN it requires $\Omega(2^d)$ source tasks to establish diversity, with $d$ being the representation dimension. For general $\mathcal{F}_{so}$, we provide a lower bound on the number of source tasks required to achieve diversity using the eluder dimension (Russo and Van Roy, 2013) and an upper bound using the generalized rank (Li et al., 2021).
3. We show that, from the perspective of achieving diversity, a single source task with multiple outputs can be equivalent to multiple source tasks.
While our theories are built on empirical risk minimization, our simulations on DNNs for a multivariate regression problem show that the qualitative predictions our theory makes still hold when stochastic gradient descent is used for optimization.

## 2 Preliminaries

We first introduce the mathematical setup of the problem studied in this paper, along with the two-phase learning method that we will focus on.

**Problem setup.** Let $\mathcal{X}$ denote the input space. We assume the same input distribution $P_X$ for all tasks, as covariate shift is not the focus of this work. In our representation learning setting, there exists a generic feature representation function $h^* \in \mathcal{H}: \mathcal{X} \mapsto \mathcal{Z}$ that is shared across different tasks, where $\mathcal{Z}$ is the feature space and $\mathcal{H}$ is the representation function space. Since we only consider differences in the prediction functions, each task, indexed by $t$, is defined by its prediction function $f_t^* \in \mathcal{F}_t: \mathcal{Z} \mapsto \mathcal{Y}_t \subseteq [0, 1]$, where $\mathcal{F}_t$ is the prediction function space of task $t$ and $\mathcal{Y}_t$ is the corresponding output space. The observations $Y_t = f_t^* \circ h^*(X) + \epsilon$ are generated noisily with mean function $f_t^* \circ h^*$, where $X \sim P_X$ and $\epsilon$ is zero-mean noise independent of $X$. Our representation is learned on source tasks $\mathbf{f}^*_{so} := (f_1^*, \dots, f_T^*) \in \mathcal{F}_1 \times \cdots \times \mathcal{F}_T$ for some positive integer $T$. We assume that all the prediction function spaces over the source tasks are the same, $\mathcal{F}_t = \mathcal{F}_{so}$. We denote the target task by $f_{ta}^* \in \mathcal{F}_{ta}: \mathcal{Z} \mapsto \mathcal{Y}_{ta} \subseteq [0, 1]$, where $\mathcal{F}_{ta}$ is the target prediction function space. Unlike previous papers (Du et al., 2020; Tripuraneni et al., 2020; Maurer et al., 2016), which all assume that the same prediction function space is used for all tasks, we generally allow for the possibility that $\mathcal{F}_{so} \neq \mathcal{F}_{ta}$.

**Learning algorithm.** We consider the same two-phase learning method as in Tripuraneni et al. (2020). In the first phase (the training phase), $\mathbf{n} = (n_1, \dots, n_T)$ samples from the source tasks are available to learn a good representation. In the second phase (the test phase), we are presented with $n_{ta}$ samples from the target task to learn its prediction function using the representation learned in the training phase. We denote a dataset of size $n$ from task $f_t^*$ by $S_t^n = \{(x_{ti}, y_{ti})\}_{i=1}^{n}$. We use empirical risk minimization (ERM) for both phases. In the training phase, we minimize the average risk over $\{S_t^{n_t}\}_{t=1}^{T}$:

$$\hat{R}(\mathbf{f}, h \mid \mathbf{f}^*_{so}) := \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t} l_{so}\big(f_t \circ h(x_{ti}),\, y_{ti}\big),$$

where $l_{so}: \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$ is the loss function for the source tasks and $\mathbf{f} = (f_1, \dots, f_T) \in \mathcal{F}_{so}^{T}$, the $T$-fold Cartesian product of $\mathcal{F}_{so}$. The estimates are given by $(\hat{\mathbf{f}}_{so}, \hat{h}) \in \arg\min_{\mathbf{f}, h} \hat{R}(\mathbf{f}, h \mid \mathbf{f}^*_{so})$. In the second phase, we obtain the dataset $\{(x_{ta,i}, y_{ta,i})\}_{i=1}^{n_{ta}}$ from the target task and our predictor $\hat{f}_{ta}$ is given by

$$\hat{f}_{ta} \in \arg\min_{f \in \mathcal{F}_{ta}} \hat{R}(f, \hat{h} \mid f^*_{ta}) := \frac{1}{n_{ta}} \sum_{i=1}^{n_{ta}} l_{ta}\big(f \circ \hat{h}(x_{ta,i}),\, y_{ta,i}\big),$$

for some loss function $l_{ta}$ on the target task. We also write $R(\cdot, \cdot \mid \cdot)$ for the expectation of the above empirical risks. We denote the generalization error of estimates $f, h$ by $\mathcal{E}(f, h \mid \cdot) := R(f, h \mid \cdot) - \min_{f' \in \mathcal{F}, h' \in \mathcal{H}} R(f', h' \mid \cdot)$, where $\mathcal{F}$ can be either $\mathcal{F}_{so}^{T}$ or $\mathcal{F}_{ta}$ depending on the task. Our goal is to bound the generalization error of the target task. For simplicity, our results are presented under the square loss; we show that they generalize to other loss functions in Appendix G. We also assume $n_1 = \cdots = n_T = n_{so}$ for some positive integer $n_{so}$.
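To make the two-phase procedure concrete, here is a minimal PyTorch-style sketch. It is an illustration rather than the paper's exact method: the architecture sizes, the synthetic data, the Adam optimizer, and the training lengths are assumptions, and gradient-based training only approximates the exact ERM assumed by the theory. Phase one fits a shared representation jointly with one head per source task; phase two freezes the representation and fits only the target head.

```python
import torch
import torch.nn as nn

# --- assumed problem sizes (illustrative only) ---
d_in, d_rep, T, n_so, n_ta = 20, 8, 5, 1000, 100

# Shared representation h: X -> Z and one prediction head f_t per source task.
h = nn.Sequential(nn.Linear(d_in, d_rep), nn.ReLU(), nn.Linear(d_rep, d_rep))
source_heads = nn.ModuleList([nn.Linear(d_rep, 1) for _ in range(T)])

# Synthetic placeholders; in practice these are the datasets S_t^{n_so}.
Xs = [torch.randn(n_so, d_in) for _ in range(T)]
Ys = [torch.randn(n_so, 1) for _ in range(T)]
loss_fn = nn.MSELoss()

# Phase 1 (training phase): minimize the average empirical risk over the T source tasks.
opt = torch.optim.Adam(list(h.parameters()) + list(source_heads.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    risk = sum(loss_fn(f_t(h(X_t)), Y_t)
               for f_t, X_t, Y_t in zip(source_heads, Xs, Ys)) / T
    risk.backward()
    opt.step()

# Phase 2 (test phase): freeze the learned representation h_hat and fit only the target head.
for p in h.parameters():
    p.requires_grad_(False)
target_head = nn.Linear(d_rep, 1)   # F_ta may differ from F_so, e.g. it can be deeper
X_ta, Y_ta = torch.randn(n_ta, d_in), torch.randn(n_ta, 1)
opt_ta = torch.optim.Adam(target_head.parameters(), lr=1e-3)
for _ in range(200):
    opt_ta.zero_grad()
    loss_fn(target_head(h(X_ta)), Y_ta).backward()
    opt_ta.step()
```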
### 2.1 Model complexity

As this paper considers general function classes, our results are presented in terms of complexity measures of classes of functions. We follow the previous literature (Maurer et al., 2016; Tripuraneni et al., 2020) and use Gaussian complexity. Note that we do not use the more common Rademacher complexity because our proofs require a decomposition theorem that only holds for Gaussian complexity. For a generic vector-valued function class $\mathcal{Q}$ containing functions $q: \mathbb{R}^d \mapsto \mathbb{R}^r$ and $N$ data points $\mathbf{X}_N = (x_1, \dots, x_N)^\top$, the empirical Gaussian complexity is defined as

$$\hat{G}_N(\mathcal{Q}) = \mathbb{E}_{\mathbf{g}}\left[ \sup_{q \in \mathcal{Q}} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} g_i^\top q(x_i) \right], \qquad g_i \sim N(0, I_r) \text{ i.i.d.}$$

Note that the standard definition has a $1/N$ factor instead of $1/\sqrt{N}$; we use this variant so that $\hat{G}_N$ does not scale with $N$ in most of the cases we consider and only reflects the complexity of the class. The corresponding population Gaussian complexity is defined as $G_N(\mathcal{Q}) = \mathbb{E}_{\mathbf{X}_N}[\hat{G}_N(\mathcal{Q})]$, where the expectation is taken over the distribution of $\mathbf{X}_N$.

### 2.2 Diversity

Source tasks have to be diverse enough to guarantee that the representation learned from them generalizes to any target task in $\mathcal{F}_{ta}$. To measure when transfer can happen, we introduce the following two definitions, namely transferability and diversity.

**Definition 1.** For some $\nu, \mu > 0$, we say the source tasks $f_1^*, \dots, f_T^* \in \mathcal{F}_{so}$ are $(\nu, \mu)$-transferable to task $f_{ta}^*$ if, for any $h \in \mathcal{H}$,

$$\frac{\inf_{f \in \mathcal{F}_{ta}} \mathcal{E}(f, h \mid f^*_{ta})}{\inf_{\mathbf{f} \in \mathcal{F}_{so}^{T}} \mathcal{E}(\mathbf{f}, h \mid \mathbf{f}^*_{so}) + \mu/\nu} \le \nu.$$

Furthermore, we say they are $(\nu, \mu)$-diverse over $\mathcal{F}_{ta}$ if the above ratio is bounded for any true target prediction function $f^*_{ta} \in \mathcal{F}_{ta}$, i.e.,

$$\sup_{f^*_{ta} \in \mathcal{F}_{ta}} \sup_{h \in \mathcal{H}} \frac{\inf_{f \in \mathcal{F}_{ta}} \mathcal{E}(f, h \mid f^*_{ta})}{\inf_{\mathbf{f} \in \mathcal{F}_{so}^{T}} \mathcal{E}(\mathbf{f}, h \mid \mathbf{f}^*_{so}) + \mu/\nu} \le \nu.$$

When it is clear from the context, we denote $\mathcal{E}(f, h \mid f^*_{ta})$ and $\mathcal{E}(\mathbf{f}, h \mid \mathbf{f}^*_{so})$ by $\mathcal{E}_{ta}(f, h)$ and $\mathcal{E}_{so}(\mathbf{f}, h)$, respectively. We call $\nu$ the transfer component and $\mu$ the bias introduced by transfer. The definition of transferability links the generalization errors of the source tasks and the target task, as shown in Theorem 1. The proof can be found in Appendix A.

**Theorem 1.** If the source tasks $f_1^*, \dots, f_T^*$ are $(\nu, \mu)$-diverse over $\mathcal{F}_{ta}$, then for any $f^*_{ta} \in \mathcal{F}_{ta}$, we have

$$\mathcal{E}_{ta}(\hat{f}_{ta}, \hat{h}) \;\le\; \nu\, \mathcal{E}_{so}(\hat{\mathbf{f}}_{so}, \hat{h}) + \mu + 2\pi\, \frac{\hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \hat{h})}{\sqrt{n_{ta}}}.$$

The first term in Theorem 1 can be upper bounded using the standard Gaussian-complexity excess-error bound. The benefit of representation learning comes from the decrease in the third term, from $\hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \mathcal{H})/\sqrt{n_{ta}}$ without representation learning to $\hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \hat{h})/\sqrt{n_{ta}}$ in our case. For problems with complicated representations, the former can be far larger than the latter. In the rest of the paper, we discuss when we can bound $(\nu, \mu)$ for nonlinear and non-identical $\mathcal{F}_{so}$ and $\mathcal{F}_{ta}$.
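To make the empirical Gaussian complexity above concrete, the following sketch estimates $\hat{G}_N(\mathcal{Q})$ by Monte Carlo for a small finite class of linear functions. The finite class, the data, and the sample counts are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# N data points in R^d and a small finite class Q of linear maps q_W(x) = W x in R^r.
N, d, r, num_q, num_mc = 50, 5, 3, 40, 2000
X = rng.normal(size=(N, d))
Ws = rng.normal(size=(num_q, r, d)) / np.sqrt(d)    # each W defines one q in Q

# Pre-compute q(x_i) for every q in Q: shape (num_q, N, r).
QX = np.einsum('qrd,nd->qnr', Ws, X)

# hat{G}_N(Q) = E_g [ sup_q (1/sqrt(N)) * sum_i g_i^T q(x_i) ],  g_i ~ N(0, I_r) i.i.d.
vals = np.empty(num_mc)
for m in range(num_mc):
    g = rng.normal(size=(N, r))
    inner = np.einsum('qnr,nr->q', QX, g) / np.sqrt(N)   # one value per q in Q
    vals[m] = inner.max()                                # sup over the finite class
print("Monte Carlo estimate of hat{G}_N(Q):", vals.mean())
```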
## 3 Negative transfer when source tasks are more complex

Before introducing the cases that allow transfer, we first look at a case where transfer is impossible. Let the source tasks use a linear mapping following the shared representation, while the target task directly learns the representation; in other words, $f^*_{ta}$ is the identity mapping and known to the learner. We further consider $\mathcal{F}_{so} = \{z \mapsto w^\top z : w \in \mathbb{R}^p\}$ and $\mathcal{H} = \{x \mapsto Hx : H \in \mathbb{R}^{p \times d}\}$. The interesting case is when $p < d$. Let the optimal representation be $H^*$ and the true prediction functions of the source tasks be $w_1^*, \dots, w_T^*$. In the best scenario, assume there is no noise in the source tasks and each source task collects as many samples as needed, so that we obtain an accurate estimate of each $w_t^{*\top} H^* \in \mathbb{R}^{1 \times d}$, which we denote by $W_t^*$. However, the hypothesis class consistent with the information from the source tasks is $\{H \in \mathbb{R}^{p \times d} : \exists\, w_t \in \mathbb{R}^p,\ w_t^\top H = W_t^* \text{ for all } t = 1, \dots, T\}$.

As $H^*$ is in the above class, any $QH^*$ for a rotation matrix $Q \in \mathbb{R}^{p \times p}$ is also in the class. In other words, a non-reducible error of the learned representation is $\max_Q \|H^* - QH^*\|_2^2$ in the worst case.

More generally, we give a condition for negative transfer. Consider a single source task $f^*_{so}$. Let $\mathcal{H}^*_{so} := \arg\min_{h \in \mathcal{H}} \min_{f \in \mathcal{F}_{so}} \mathcal{E}_{so}(f, h)$ and $\mathcal{H}^*_{ta} := \arg\min_{h \in \mathcal{H}} \min_{f \in \mathcal{F}_{ta}} \mathcal{E}_{ta}(f, h)$. If one hopes to get a representation learning benefit with zero bias ($\mu = 0$), every $h \in \mathcal{H}^*_{so}$ must satisfy $\inf_{f \in \mathcal{F}_{ta}} \mathcal{E}_{ta}(f, h) = 0$, i.e., any representation that is optimal for the source task must also be optimal for the target task. Equivalently, we need $\mathcal{H}^*_{so} \subseteq \mathcal{H}^*_{ta}$. As shown in Proposition 1, the definition of transferability captures this case well.

**Proposition 1.** If there exists an $h' \in \mathcal{H}^*_{so}$ such that $h' \notin \mathcal{H}^*_{ta}$, then $\min_{f_{ta}} \mathcal{E}_{ta}(f_{ta}, h') > 0$. Furthermore, there is no $\nu < \infty$ such that $f^*_{so}$ is $(\nu, 0)$-transferable to $f^*_{ta}$.

*Proof.* The first statement holds by definition. Plugging $h'$ into Definition 1, the denominator $\inf_{f \in \mathcal{F}_{so}} \mathcal{E}_{so}(f, h')$ equals zero since $h' \in \mathcal{H}^*_{so}$, while the numerator is strictly positive; hence the ratio is unbounded and no finite $\nu$ suffices. ∎

Negative transfer happens when the optimal representation for the source tasks need not be optimal for the target task. As in our linear example, this is a consequence of a more complex $\mathcal{F}_{so}$, which allows more flexibility to reduce errors. This inspires us to consider the opposite case, where $\mathcal{F}_{ta}$ is more complex than $\mathcal{F}_{so}$.

## 4 Source task as a representation

Before discussing more general settings, we first consider a single source task, which we refer to as $so$. Assume that the source task itself is a representation of the target task; equivalently, the source task has a known prediction function $f_{so}(x) = x$. This is a commonly used framework when we decompose a complex task into several simple tasks and use the outputs of the simple tasks as the input of a higher-level task. A widely used transfer learning method, called offset learning, which assumes $f^*_{ta}(x) = f^*_{so}(x) + w^*_{ta}(x)$ for some offset function $w^*_{ta}$, falls within this framework. The offset method is beneficial when $w^*_{ta}$ has low complexity and can be learned with few samples. It is worth mentioning that our setting covers the more general setting of Du et al. (2017), which assumes $f^*_{ta}(x) = G(f^*_{so}(x), w^*_{ta}(x))$ for some known transformation function $G$ and unknown $w^*_{ta}$. We show that a simple Lipschitz condition on $\mathcal{F}_{ta}$ gives a bounded transfer component.

**Assumption 1** (Lipschitz assumption). Any $f_{ta} \in \mathcal{F}_{ta}$ is $L$-Lipschitz with respect to the $L_2$ distance.

**Theorem 2.** If Assumption 1 holds, task $f^*_{so}$ is $(L, 0)$-transferable to task $f^*_{ta}$, and with high probability,

$$\mathcal{E}_{ta}(\hat{f}_{ta}, \hat{h}) = O\!\left( \frac{L\, \hat{G}_{n_{so}}(\mathcal{H})}{\sqrt{n_{so}}} + \frac{\hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \hat{h})}{\sqrt{n_{ta}}} \right).$$

Theorem 2 bounds the generalization error by two terms. The first term, which scales with $1/\sqrt{n_{so}}$, only depends on the complexity of $\mathcal{H}$. Though the second term scales with $1/\sqrt{n_{ta}}$, it is easy to see that $\hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \hat{h}) = \hat{G}_{n_{ta}}(\mathcal{F}_{ta}) \le \hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \mathcal{H})$, where the middle quantity is evaluated on the dataset $\{\hat{h}(x_{ta,i})\}_{i=1}^{n_{ta}}$.
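As a toy illustration of the offset-learning scheme discussed above, the sketch below treats a given source predictor as the representation and fits only a low-complexity linear offset on a handful of target samples. The synthetic data, the linear offset class, and the least-squares fit are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_ta = 10, 40                       # few target samples: the offset must be simple

def f_so(X):                           # known / pretrained source predictor, used as-is
    return np.sin(X).sum(axis=1)

w_star = rng.normal(size=d) * 0.1      # true low-complexity offset (illustrative)
X = rng.normal(size=(n_ta, d))
y = f_so(X) + X @ w_star + 0.01 * rng.normal(size=n_ta)

# Offset learning: fit w_ta on the residuals y - f_so(x) by least squares.
residual = y - f_so(X)
w_hat, *_ = np.linalg.lstsq(X, residual, rcond=None)

X_test = rng.normal(size=(1000, d))
y_test = f_so(X_test) + X_test @ w_star
mse = np.mean((f_so(X_test) + X_test @ w_hat - y_test) ** 2)
print("target MSE with offset learning:", mse)
```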
### 4.1 General case

Previously, we assumed that a single source task is itself a representation of the target task. Now we consider a more general case: there exist functions in the source prediction space that can be used as representations for the target task. Formally, we consider $\mathcal{F}_{ta} = \mathcal{F}'_{ta} \circ (\mathcal{F}_{so}^{m})$ for some $m > 0$ and some target-specific function space $\mathcal{F}'_{ta}: \mathcal{Y}_{so}^{m} \mapsto \mathcal{Y}_{ta}$. Note that $\mathcal{F}_{ta}$ is strictly larger than $\mathcal{F}_{so}$ when $m = 1$ and the identity mapping $x \mapsto x$ belongs to $\mathcal{F}'_{ta}$. In practice, ResNet (Tai et al., 2017) satisfies this property.

**Assumption 2.** Any $f'_{ta} \in \mathcal{F}'_{ta}$ is $L'$-Lipschitz with respect to the $L_2$ distance.

**Theorem 3.** If Assumption 2 holds and the source tasks are $(\nu, \mu)$-diverse over their own space $\mathcal{F}_{so}$, then

$$\mathcal{E}_{ta}(\hat{f}_{ta}, \hat{h}) = O\!\left( \nu\, \frac{\hat{G}_{T n_{so}}(\mathcal{F}_{so}^{T} \circ \mathcal{H})}{\sqrt{T n_{so}}} + \mu + \frac{\hat{G}_{n_{ta}}(\mathcal{F}_{ta} \circ \hat{h})}{\sqrt{n_{ta}}} \right).$$

It is shown in Tripuraneni et al. (2020) that $\hat{G}_{T n_{so}}(\mathcal{F}_{so}^{T} \circ \mathcal{H})$ can be bounded by $O(\hat{G}_{T n_{so}}(\mathcal{H}) + \sqrt{T}\, \hat{G}_{n_{so}}(\mathcal{F}_{so}))$. Thus, the first term scales as $\hat{G}_{T n_{so}}(\mathcal{H})/\sqrt{T n_{so}} + \hat{G}_{n_{so}}(\mathcal{F}_{so})/\sqrt{n_{so}}$. Again, the part shared by all tasks decreases with $T n_{so}$, while the task-specific parts scale with $n_{so}$ or $n_{ta}$.

### 4.2 Applications to deep neural networks

Theorem 3 has a broad range of applications. In the rest of this section, we discuss its application to deep neural networks, where the source tasks are transferred to a target task with a deeper network. We first introduce the setting for deep neural network prediction functions. We consider the regression problem with $\mathcal{Y}_{so} = \mathcal{Y}_{ta} = \mathbb{R}$ and representation space $\mathcal{Z} \subseteq \mathbb{R}^p$. A depth-$K$ vector-valued neural network is denoted by

$$f(x) = \sigma\big(W_K\, \sigma(\cdots \sigma(W_1 x))\big),$$

where each $W_k$ is the parameter matrix of layer $k$ and $\sigma$ is the activation function. For simplicity, we let $W_k \in \mathbb{R}^{p \times p}$ for $k = 1, \dots, K$. The class of all depth-$K$ neural networks is denoted by $\mathcal{M}_K$. We denote the linear class by $\mathcal{L} = \{x \mapsto \alpha^\top x + \beta : \alpha \in \mathbb{R}^p,\ \|\alpha\|_2 \le M(\alpha)\}$ for some $M(\alpha) > 0$. We also assume $\max\{\|W_k\|_\infty, \|W_k\|_2\} \le M(k)$, where $\|\cdot\|_\infty$ and $\|\cdot\|_2$ are the infinity norm and the spectral norm, and that $\|z\|_2 \le D_{\mathcal{Z}}$ for any $z \in \mathcal{Z}$.

**Deeper network for the target task.** We now consider source tasks whose prediction function is a depth-$K_{so}$ neural network followed by a linear mapping, and a target task with a depth-$K_{ta}$ neural network, where $K_{ta} > K_{so}$. Then $\mathcal{F}_{ta} = \mathcal{L} \circ \mathcal{M}_{K_{ta} - K_{so}} \circ \mathcal{M}_{K_{so}}$ and $\mathcal{F}_{so} = \mathcal{L} \circ \mathcal{M}_{K_{so}}$. Using the fact that $\mathcal{M}_1 = \sigma(\mathcal{L}^p)$, we can write $\mathcal{F}_{ta}$ as $\mathcal{L} \circ \mathcal{M}_{K_{ta} - K_{so} - 1} \circ \sigma(\mathcal{F}_{so}^{p})$. Thus, we can apply Theorem 3 together with the standard Gaussian complexity bound for DNN models, which gives Corollary 1.

**Corollary 1.** Let $\mathcal{F}_{so}$ be the depth-$K_{so}$ class and $\mathcal{F}_{ta}$ the depth-$K_{ta}$ class defined above. If the source tasks are $(\nu, 0)$-diverse over $\mathcal{F}_{so}$, we have

$$\mathcal{E}_{ta}(\hat{f}_{ta}, \hat{h}) = O\!\left( p\,\nu\, M(\alpha) \prod_{k=1}^{K_{ta}} M(k) \left( \frac{\hat{G}_{T n_{so}}(\mathcal{H})}{\sqrt{T n_{so}}} + \frac{D_{\mathcal{Z}} \sqrt{K_{so}}}{\sqrt{n_{so}}} \right) + \frac{D_{\mathcal{Z}} \sqrt{K_{ta}}\, M(\alpha) \prod_{k=1}^{K_{ta}} M(k)}{\sqrt{n_{ta}}} \right).$$

Note that the terms scaling with $1/\sqrt{n_{so}}$ and $1/\sqrt{n_{ta}}$ have similar coefficients $M(\alpha)\prod_{k=1}^{K_{ta}} M(k)$, which do not depend on the complexity of $\mathcal{H}$. The term that depends on $\hat{G}_{T n_{so}}(\mathcal{H})$ scales with $1/\sqrt{T n_{so}}$, as expected.
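The "deeper network for the target task" construction can be mirrored directly in code: reuse a representation pretrained with a shallow source head and stack extra layers for the harder target task. The sketch below is an illustrative PyTorch assembly; the layer widths, depths, and module names are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

p, K, K_so, K_ta = 16, 3, 1, 3      # representation depth K, source/target head depths

def mlp(depth, width):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.ReLU()]
    return nn.Sequential(*layers)

# Shared representation h in H (depth K), pretrained on the source tasks.
h = mlp(K, p)

# Source prediction function: depth-K_so network followed by a linear map (F_so = L o M_Kso).
source_head = nn.Sequential(mlp(K_so, p), nn.Linear(p, 1))

# Target prediction function: deeper head, F_ta = L o M_{K_ta - K_so} o M_{K_so}.
target_head = nn.Sequential(mlp(K_ta, p), nn.Linear(p, 1))

# At test time only target_head is trained on top of the frozen representation.
for param in h.parameters():
    param.requires_grad_(False)
target_model = nn.Sequential(h, target_head)
```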
## 5 Diversity of non-linear function classes

While Corollary 1 considers a nonlinear DNN target prediction function space under a diversity condition, it is not clearly understood how diversity can be achieved for nonlinear source spaces. In this section, we first discuss a specific non-linear prediction function space and show a fundamental barrier to achieving diversity. We then extend our result to general function classes by connecting the eluder dimension to diversity. We end the section with positive results for achieving diversity under a generalized-rank condition.

We consider a subset of depth-1 neural networks with ReLU activation,

$$\mathcal{F} = \big\{x \mapsto [\langle x, w\rangle - (1 - \epsilon/2)]_+ : \|w\|_2 \le 1\big\},$$

defined on $\{x : \|x\|_2 \le 1\}$, for some $\epsilon > 0$. Our lower bound construction is inspired by a similar construction in Theorem 5.1 of Dong et al. (2021).

**Theorem 4.** Let $\mathcal{T} = \{f_1, \dots, f_T\}$ be any set of depth-1 neural networks with ReLU activation in $\mathcal{F}$. For any $\epsilon > 0$, if $T \le 2^{d \log(1/\epsilon)} - 1$, there exist representations $h^*, h' \in \mathcal{H}$, a distribution $P_X$, and a target function $f^*_{ta} \in \mathcal{F}$ such that

$$\inf_{\mathbf{f} \in \mathcal{F}^{T}} \mathcal{E}_{so}(\mathbf{f}, h') = 0, \quad \text{while} \quad \inf_{f \in \mathcal{F}} \mathcal{E}_{ta}(f, h') \ge \epsilon^2/32.$$

Theorem 4 implies that we need at least $\Omega(2^{d \log(1/\epsilon)})$ source tasks to achieve diversity. Otherwise, we can always find a set of source tasks and a target task such that the generalization error on the source tasks is minimized to $0$ while that on the target task is at least $\epsilon^2/32$. Though ReLU gives the most intuitive result, similar lower bounds can be shown for other popular activation functions, for example, the sigmoid function (see Appendix E).

### 5.1 Lower bound using eluder dimension

We extend our result to a general function space $\mathcal{F}$ and build an interesting connection between diversity and the eluder dimension (Russo and Van Roy, 2013). To the best of our knowledge, we are the first to notice a connection between the eluder dimension and transfer learning. The eluder dimension has been used to measure sample complexity in reinforcement learning. It considers the minimum number of inputs such that any two functions that evaluate similarly at these inputs will also be similar at any other input. Diversity, in contrast, considers the minimum number of functions such that any two representations with similar outputs under these functions will also have similar outputs under any other function. Thus, there is a kind of duality between the eluder dimension and diversity. We first formally define the eluder dimension. Let $\mathcal{F}$ be a function space with support $\mathcal{X}$ and let $\epsilon > 0$.

**Definition 2** ($(\mathcal{F}, \epsilon)$-dependence and eluder dimension; Osband and Van Roy, 2014). We say that $x \in \mathcal{X}$ is $(\mathcal{F}, \epsilon)$-dependent on $\{x_1, \dots, x_n\} \subseteq \mathcal{X}$ iff for any $f, \tilde{f} \in \mathcal{F}$,

$$\sum_{i=1}^{n} \|f(x_i) - \tilde{f}(x_i)\|_2^2 \le \epsilon^2 \;\Longrightarrow\; \|f(x) - \tilde{f}(x)\|_2 \le \epsilon.$$

We say $x \in \mathcal{X}$ is $(\mathcal{F}, \epsilon)$-independent of $\{x_1, \dots, x_n\}$ iff it does not satisfy the definition of dependence. The eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is the length of the longest possible sequence of elements in $\mathcal{X}$ such that, for some $\epsilon' \ge \epsilon$, every element is $(\mathcal{F}, \epsilon')$-independent of its predecessors. We slightly modify the definition and let $\dim^s_E(\mathcal{F}, \epsilon)$ be the length of the shortest sequence such that any $x \in \mathcal{X}$ is $(\mathcal{F}, \epsilon)$-dependent on the sequence. Although $\dim^s_E(\mathcal{F}, \epsilon) \le \dim_E(\mathcal{F}, \epsilon)$, it can be shown that in many regular cases, e.g., the linear and generalized linear cases, $\dim_E(\mathcal{F}, \epsilon)$ is only larger than $\dim^s_E(\mathcal{F}, \epsilon)$ up to a constant factor.

**Definition 3** (Dual class). For any $\mathcal{F}: \mathcal{X} \mapsto \mathcal{Y}$, we call $\mathcal{F}': \mathcal{F} \mapsto \mathcal{Y}$ its dual class iff $\mathcal{F}' = \{g_x : g_x(f) = f(x),\ x \in \mathcal{X}\}$.

**Theorem 5.** For any function class $\mathcal{F}: \mathcal{X} \mapsto \mathbb{R}$ and some $\epsilon > 0$, let $\mathcal{F}'$ be the dual class of $\mathcal{F}$ and let $d_E = \dim^s_E(\mathcal{F}', \epsilon)$. Then for any sequence of tasks $f_1, \dots, f_t$ with $t \le d_E - 1$, there exists a task $f_{t+1} \in \mathcal{F}$ such that for some data distribution $P_X$ and two representations $h, h'$,

$$\frac{\inf_{f'_{t+1} \in \mathcal{F}} \mathbb{E}_X \big\|f'_{t+1}(h(X)) - f_{t+1}(h'(X))\big\|_2^2}{\frac{1}{t} \inf_{f'_1, \dots, f'_t \in \mathcal{F}} \sum_{i=1}^{t} \mathbb{E}_X \big\|f'_i(h(X)) - f_i(h'(X))\big\|_2^2} \;\ge\; t/2.$$

Theorem 5 formally describes the connection between the eluder dimension and diversity. To interpret the theorem, we first discuss what good source tasks are. Any $T$ diverse source tasks transfer well to a target task if the parameter $\nu$ can be bounded by some fixed value that does not increase with $T$. For instance, in the linear case, $\nu = O(d)$ no matter how large $T$ is, while for a finite function class, in the worst case, $\nu$ increases with $T$ until $T$ reaches $|\mathcal{F}|$. Theorem 5 states that if the eluder dimension of the dual class is at least $d_E$, then $\nu$ scales with $T$ until $T$ reaches $d_E$, as if the function space $\mathcal{F}$ were discrete with $d_E$ elements. Note that this result is consistent with what is shown in Theorem 9, since the eluder dimension of the ReLU class discussed above is also lower bounded by $\Omega(2^d)$ (Li et al., 2021).
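To make the dependence and independence notions above concrete, the following sketch greedily builds an independent sequence for a small finite function class on a finite domain, yielding a lower-bound estimate of the eluder dimension. The toy class, the greedy strategy, and the tolerance are illustrative assumptions; an exact computation would search over all orderings and all $\epsilon' \ge \epsilon$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Toy finite class: each function is a row of F_vals; columns index the finite domain X.
num_f, num_x, eps = 30, 12, 0.5
F_vals = rng.normal(size=(num_f, num_x))

def is_dependent(x, prefix, F_vals, eps):
    """x is (F, eps)-dependent on prefix iff every pair f, f~ that is eps-close
    on the prefix (in squared L2) is also eps-close at x."""
    for i, j in combinations(range(len(F_vals)), 2):
        diff = F_vals[i] - F_vals[j]
        if np.sum(diff[prefix] ** 2) <= eps ** 2 and abs(diff[x]) > eps:
            return False
    return True

# Greedily extend a sequence with points that are independent of their predecessors.
sequence = []
extended = True
while extended:
    extended = False
    for x in range(num_x):
        if x not in sequence and not is_dependent(x, sequence, F_vals, eps):
            sequence.append(x)
            extended = True
            break

print("greedy independent-sequence length (lower bound on eluder dim):", len(sequence))
```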
### 5.2 Upper bound using approximate generalized rank

Though we showed that diversity is hard to achieve for some nonlinear function classes, diversity can be easy to achieve if we restrict the target prediction functions so that they can be realized by linear combinations of some known basis. Tripuraneni et al. (2020) showed that any source tasks $f_1, \dots, f_T \in \mathcal{F}$ are $(1/T, \mu)$-diverse over the space $\{f \in \mathcal{F} : \exists \tilde{f} \in \mathrm{conv}(f_1, \dots, f_T) \text{ such that } \sup_z \|f(z) - \tilde{f}(z)\| \le \mu\}$, where $\mathrm{conv}(f_1, \dots, f_T)$ is the convex hull of $\{f_1, \dots, f_T\}$. This can be characterized by a complexity measure called generalized rank: the smallest dimension required to embed the input space such that all hypotheses in a function class can be realized as halfspaces. Generalized rank has close connections to the eluder dimension; as shown in Li et al. (2021), the eluder dimension can be upper bounded by the generalized rank.

**Definition 4** (Approximate generalized rank with identity activation). Let $\mathcal{B}_d(R) := \{x \in \mathbb{R}^d : \|x\|_2 \le R\}$. The $\mu$-approximate $\mathrm{id\text{-}rk}(\mathcal{F}, R)$ of a function class $\mathcal{F}: \mathcal{X} \mapsto \mathbb{R}$ at scale $R$ is the smallest dimension $d$ for which there exist mappings $\varphi: \mathcal{X} \mapsto \mathcal{B}_d(1)$ and $w: \mathcal{F} \mapsto \mathcal{B}_d(R)$ such that for all $(x, f) \in \mathcal{X} \times \mathcal{F}$: $|f(x) - \langle w(f), \varphi(x)\rangle| \le \mu$.

**Proposition 2.** For any $\mathcal{F}$ with $\mu$-approximate $\mathrm{id\text{-}rk}(\mathcal{F}, R) \le d_i$ for some $R > 0$, there exist no more than $d_i$ functions $f_1, \dots, f_{d_i}$ such that $w(f_1), \dots, w(f_{d_i})$ span $\mathbb{R}^{d_i}$. Then $f_1, \dots, f_{d_i}$ are $(d_i, \mu)$-diverse over $\mathcal{F}$.

Proposition 2 is a direct application of Lemma 7 in Tripuraneni et al. (2020). Upper bounding the id-rank is hard for a general function class. However, this notion is useful for function spaces with a known set of basis functions. For example, any square-integrable function space $\mathcal{F}$ has a Fourier basis. Though the basis is an infinite set, we can choose a truncation level $d$ such that the truncation errors of all functions are at most $\mu$; the hypothesis space $\mathcal{F}$ then has $\mu$-approximate id-rank at most $d$.
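The Fourier-basis remark can be illustrated numerically: for a class of periodic functions whose coefficients decay at a known rate, the sketch below picks a truncation level $d$ so that every function in the class is within $\mu$ of the span of the first $d$ basis functions, which is exactly the $\mu$-approximate id-rank argument. The decay rate, the domain, and the tail bound used are illustrative assumptions.

```python
import numpy as np

# Class (illustrative): f_theta(x) = sum_k theta_k * sin(k x) with |theta_k| <= k**(-2).
# Truncating after d terms leaves sup-norm error at most sum_{k>d} k**(-2) < 1/d.
mu = 0.05
d = int(np.ceil(1.0 / mu))             # smallest d with tail bound 1/d <= mu

def phi(x, d):
    """Feature map into R^d: the first d (unnormalized) Fourier sine basis functions."""
    return np.array([np.sin(k * x) for k in range(1, d + 1)])

# Any f in the class is mu-close to <w(f), phi(x)> with w(f) = (theta_1, ..., theta_d),
# so the mu-approximate id-rank is at most d, and the d basis tasks f_k(x) = sin(k x)
# are (d, mu)-diverse by Proposition 2.
rng = np.random.default_rng(3)
theta = rng.uniform(-1, 1, size=200) * np.arange(1, 201, dtype=float) ** (-2)
xs = np.linspace(0, 2 * np.pi, 400)
full = sum(theta[k - 1] * np.sin(k * xs) for k in range(1, 201))   # stand-in for the full series
trunc = theta[:d] @ phi(xs, d)                                     # truncated expansion
print("d =", d, "empirical truncation error:", np.max(np.abs(full - trunc)), "<= mu =", mu)
```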
## 6 Experiments

In this section, we use simulated environments to evaluate the actual performance of representation learning on DNNs trained with gradient-based optimization methods, and we test the impact of various hyperparameters. Our results indicate that even though our theory is proved for ERM estimators, its qualitative predictions still hold when Adam is applied and the global minimum might not be found. Before introducing our experimental setup, we note a difference between our theory and our experiments: the experiments use a single regression task with multiple outputs as the source task, while our analyses are built on multiple source tasks.

### 6.1 Diversity of problems with multiple outputs

Though we have been discussing achieving diversity from multiple source tasks, in many real applications we may only have access to a single source task (for example, ImageNet), which is also the case in our simulations. In fact, we can show that diversity can be achieved by a single multiclass classification or multivariate regression source task. Diversity for a single multivariate regression problem with $L_2$ loss is immediate because the loss function decomposes across outputs. For multiclass classification problems or regression problems with other loss functions, we need some boundedness and continuity assumptions; we refer the reader to Appendix H for details.

### 6.2 Experiments setup

Our four experiments are designed with the following goals. The first two experiments (Figure 1, a and b) target the actual dependence on $n_{so}$, $n_{ta}$, $K_{so}$, and $K_{ta}$ in the bound of Corollary 1 for DNN prediction functions. Though the importance of diversity has been emphasized by various theories, its benefits over random task selection have not been demonstrated empirically, which we explore in our third experiment (Figure 1, c). Our fourth experiment (Figure 1, d) verifies the theoretical negative results on nonlinearity of source prediction functions shown in Theorems 4 and 5.

Though the hyperparameters vary across experiments, the main setting can be summarized as follows. We consider DNN models of different depths for both the source and target tasks. The first $K$ layers are the shared representation. The source task is a multivariate regression problem with output dimension $p$ and $K_{so}$ layers following the representation. The target task is a single-output regression problem with $K_{ta}$ layers following the representation. We use the same number of units, denoted $n_u$, for all layers. A representation is first trained on the source task using $n_{so}$ random samples and is then fixed for the target task, which is trained on $n_{ta}$ random samples. In contrast, the baseline method trains the target task directly on the same $n_{ta}$ samples without the pretrained network. We use Adam with default parameters for all training and evaluate performance with the MSE (mean squared error) under different settings.

### 6.3 Results

Figure 1 summarizes our four experiments; all y-axes report the average MSE over 100 independent runs, with error bars showing the standard deviation. The x-axis varies depending on the goal of each experiment. In subfigure (a), we test the effects of the numbers of observations for both the source and target tasks, while setting the other hyperparameters to their default values; the x-axis represents $n_{so}$ and the colors represent $n_{ta}$. In subfigure (b), we test the effect of the number of shared representation layers $K$; to keep the MSEs comparable, we fix the sum $K + K_{ta} = 6$ and run $K = 1, \dots, 5$ (the x-axis) while keeping $K_{so} = 1$. In subfigure (c), we test the effect of diversity: the larger $p$ is, the more diverse the source task; we keep the total number of observations $n_{so} \cdot p = 4000$ for a fair comparison. Lastly, in subfigure (d), we test whether diversity is hard to achieve when the source prediction function is nonlinear. The x-axis is the number of layers in the source prediction function, $K_{so}$; the nonlinearity increases with $K_{so}$. We run $K_{so} = 1, 2, 3$ and add an activation function right before the output so that the function is nonlinear even when $K_{so} = 1$.

[Figure 1: Mean squared error (y-axis in all panels). (a) Effects of the numbers of observations for the source ($n_{so}$) and target ($n_{ta}$) tasks. (b) Effects of the number of shared representation layers $K$. (c) Effects of diversity, determined by the output dimension $p$; we keep the total number of observations $n_{so} \cdot p = 4000$. (d) Effects of nonlinearity of the source prediction function; higher $K_{so}$ indicates higher nonlinearity.]
In Figure 1 (a), MSEs decrease with larger numbers of observations in both the source and target tasks, while there is no significant difference between $n_{so} = 1000$ and $10000$. The baseline method without representation learning performs worst, and it performs almost the same as the others once $n_{ta}$ reaches 1000. In (b), there are positive benefits for all numbers of shared layers, and the MSE is lowest at 5 shared layers. As shown in Figure 1 (c), the MSEs and their variances decrease as the number of outputs increases, i.e., with higher diversity. Figure 1 (d) shows that there is no significant difference between the baseline and $K_{so} = 1, 2, 3$; when $K_{so} = 2, 3$ there is even a negative effect.

## 7 Discussion

In this paper, we studied representation learning beyond linear prediction functions. We showed that the learned representation can generalize to target tasks with multi-layer neural networks as prediction functions, as long as the source tasks use linear prediction functions. We showed the hardness of being diverse when the source tasks use nonlinear prediction functions by giving a lower bound on the number of source tasks in terms of the eluder dimension, and we further gave an upper bound depending on the generalized rank.

Turning to future work, we need better tools to understand the recently proposed complexity measure generalized rank. Our analyses rely on ERM, while in practice, as well as in our simulations, gradient-based optimization algorithms are used; further analyses of the benefits of representation learning that align with the practical choice of optimization method should be studied. Our analyses assume arbitrary source task selection, while in real applications we may have limited data from under-represented groups, which may lead to potential unfairness; algorithms that calibrate this potential unfairness should be studied. Another direction of future work is to propose reasonable assumptions that guarantee generalization even when the source prediction functions are nonlinear; possibly the interesting functions lie within a subset of the target prediction function class, which makes generalization easier. In this work, we consider provided and fixed source tasks. In practice, many empirical methods (Graves et al., 2017) have been proposed to adaptively select unknown source tasks. An interesting direction is to ask whether adaptive task selection can achieve the optimal level of diversity and to design algorithms for that purpose.

## 8 Funding Disclosure & Acknowledgements

This work was directly supported by NSF grant IIS-2007055.

## References

Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.

Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2011). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42.

Dong, K., Yang, J., and Ma, T. (2021). Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168.

Dragomir, S. S. and Gluscevic, V. (2000). Some inequalities for the Kullback-Leibler and χ²-distances in information theory and applications. RGMIA Research Report Collection, 3(2):199–210.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. (2020). Few-shot learning via learning the representation, provably. In International Conference on Learning Representations.
Du, S. S., Koushik, J., Singh, A., and Póczos, B. (2017). Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584.

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. (2019). Efficient and accurate estimation of Lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32:11427–11438.

Golowich, N., Rakhlin, A., and Shamir, O. (2018). Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR.

Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. In International Conference on Machine Learning, pages 1311–1320. PMLR.

Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Jaitly, N., Nguyen, P., Senior, A., and Vanhoucke, V. (2012). Application of pretrained deep neural networks to large vocabulary speech recognition. In Proceedings of Interspeech.

Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. (2021). On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM (JACM), 68(2):1–29.

Li, G., Kamath, P., Foster, D. J., and Srebro, N. (2021). Eluder dimension and generalized rank. arXiv preprint arXiv:2104.06970.

Marmanis, D., Datcu, M., Esch, T., and Stilla, U. (2015). Deep learning earth observation classification using ImageNet pretrained networks. IEEE Geoscience and Remote Sensing Letters, 13(1):105–109.

Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884.

Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724.

Osband, I. and Van Roy, B. (2014). Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466–1474.

Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264.

Scaman, K. and Virmaux, A. (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. arXiv preprint arXiv:1805.10965.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Tai, Y., Yang, J., and Liu, X. (2017). Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3147–3155.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279.

Tripuraneni, N., Jordan, M., and Jin, C. (2020). On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33.

Weng, R., Yu, H., Huang, S., Cheng, S., and Luo, W. (2020). Acquiring knowledge from pre-trained model to neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9266–9273.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] The limitations are discussed in Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] The potential societal impacts are discussed in Section 7.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] The proofs are all included in the supplementary material.

3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code is in the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The training details are in the supplemental material.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Reported in Figure 1.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] In the supplementary material.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A] We did not use any assets.
   (b) Did you mention the license of the assets?
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] We did not conduct any research with human subjects.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]