The Benefit of Multitask Representation Learning

Journal of Machine Learning Research 17 (2016) 1-32. Submitted 5/15; Revised 10/15; Published 4/16.

Andreas Maurer (am@andreas-maurer.eu), Adalbertstrasse 55, D-80799 München, Germany
Massimiliano Pontil (massimiliano.pontil@iit.it), Istituto Italiano di Tecnologia, 16163 Genoa, Italy; Department of Computer Science, University College London, WC1E 6BT, UK
Bernardino Romera-Paredes (bernard@robots.ox.ac.uk), Department of Engineering Science, University of Oxford, OX1 3PJ, UK

Editors: Urun Dogan, Marius Kloft, Francesco Orabona, and Tatiana Tommasi

Abstract. We discuss a general method to learn data representations from multiple tasks. We provide a justification for this method in both settings of multitask learning and learning-to-learn. The method is illustrated in detail in the special case of linear feature learning. Conditions on the theoretical advantage offered by multitask representation learning over independent task learning are established. In particular, focusing on the important example of half-space learning, we derive the regime in which multitask representation learning is beneficial over independent task learning, as a function of the sample size, the number of tasks and the intrinsic data dimensionality. Other potential applications of our results include multitask feature learning in reproducing kernel Hilbert spaces and multilayer, deep networks.

Keywords: learning-to-learn, multitask learning, representation learning, statistical learning theory, transfer learning

1. Introduction

Multitask learning (MTL) can be characterized as the problem of learning multiple tasks jointly, as opposed to learning each task in isolation.
This problem is becoming increasingly important due to its relevance in many applications, ranging from modelling users' preferences for products, to multiple object classification in computer vision, to patient healthcare data analysis in health informatics, to mention but a few. Multitask learning algorithms which exploit structure and similarities across different learning problems have been studied by the machine learning community since the mid-1990s, initially in connection with neural network models (see Baxter, 2000; Caruana, 1998; Thrun and Pratt, 1998, and references therein). More recent approaches have been based on kernel methods (Evgeniou et al., 2005), structured sparsity and convex optimization (Argyriou et al., 2008), among others.

Closely related to multitask learning, but more challenging, is the problem of learning-to-learn (LTL), namely learning to perform a new task by exploiting knowledge acquired when solving previous tasks. Arguably, a solution to this problem would have major impact in Artificial Intelligence, as we could build machines which learn from experience to perform new tasks, similar to what we observe in human behavior.

© 2016 Andreas Maurer, Massimiliano Pontil and Bernardino Romera-Paredes.

An influential line of research on multitask and transfer learning is based on the idea that the tasks are related by means of a common low-dimensional representation, which is learned jointly with the task parameters. This approach was first advocated in (Baxter, 2000; Caruana, 1998; Thrun and Pratt, 1998) and more recently reconsidered in (Argyriou et al., 2008) from the perspective of convex optimization and sparsity regularization. Representation learning is also a key problem in AI, and in the past years there has been much renewed interest in learning nonlinear hierarchical representations from multiple tasks using multilayer, deep networks.
Researchers have shown improved results in a number of empirical domains; the case of computer vision is perhaps the most remarkable (see, e.g., Girshick et al., 2014, and references therein). This success has increased interest in multitask representation learning (MTRL), as it is a core component of deep networks. Still, the understanding of why this methodology works remains limited.

In this paper we analyze a general method for MTRL and discuss its potential advantage both in the MTL setting, where the learned representation is applied to the same tasks used during training, and in the domain of LTL, where the representation is applied to new tasks. We derive upper bounds on the error of these methods and quantify their advantage over independent task learning. When the original data representation is high dimensional and the number of examples provided to solve a regression or classification problem is limited, any learning algorithm which does not use some sort of prior knowledge will perform poorly, because there is not enough data to reliably estimate the model parameters. We make this statement precise by considering the example of half-space learning.

1.1 Previous Work

Many papers have proposed multitask learning methods and studied their applications to specific problems (see Ando and Zhang, 2005; Argyriou et al., 2008; Baxter, 2000; Ben-David and Schuller, 2003; Caruana, 1998; Cavallanti et al., 2010; Kuzborskij and Orabona, 2013; Maurer et al., 2013; Pentina and Lampert, 2014; Widmer et al., 2013, and references therein). There is a vast literature on these subjects, and the list of papers provided here is necessarily incomplete. Despite the considerable success of multitask learning, and in particular multitask representation learning, there are only a few theoretical investigations (Ando and Zhang, 2005; Baxter, 2000; Ben-David and Schuller, 2003).
Other statistical learning bounds are restricted to linear multitask learning, such as (Cavallanti et al., 2010; Lounici et al., 2011; Maurer, 2006a,b). Learning-to-learn (also called inductive bias learning or transfer learning) was proposed by Thrun and Pratt (1998) and theoretically studied by Baxter (2000), where an error analysis is provided, showing that a common representation which performs well on the training tasks will also generalize to new tasks obtained from the same environment. More recent papers which present dimension-independent bounds appear in Maurer (2006a,b); Maurer and Pontil (2013); Maurer et al. (2013); Pentina and Lampert (2014).

1.2 Our Contributions

There are two main contributions of this work. First, we present bounds for both the MTL and LTL settings, which apply to a very general MTRL method. Our analysis goes well beyond the linear representation learning considered in most previous works. It improves over the analysis by Baxter (2000) based on covering numbers. We use more recent techniques of empirical process theory to achieve bounds which are independent of the input dimension (hence also valid in reproducing kernel Hilbert spaces) and to avoid logarithmic factors. Furthermore, our analysis can be made fully data-dependent. When specialized to subspace learning (i.e. linear feature learning) we obtain bounds valid for infinite-dimensional input spaces. As the second main contribution of this paper, we explain the advantage of MTRL in terms of the specificity of feature maps and expose conditions under which MTRL is beneficial and when it is not worth the effort. We further specialize our upper bounds to half-space learning (noiseless binary classification) and compare them to a general lower bound for learning isolated tasks.
We observe that if the number of tasks grows then the performance of the method (both in the MTL and the LTL setting) matches the performance of square-norm regularization with the best a priori known representation. This analysis highlights the advantage of multitask learning over learning the tasks independently. We also present numerical experiments for half-space learning, which indicate good agreement between theory and experiments.

1.3 Organization

The paper is organized as follows. In Section 2, we introduce the problem and present our main results. In Section 3, we specialize these results to subspace learning and illustrate the role played by the data covariance matrices in our bounds. In Section 3.1 we further illustrate our results in the case of half-space learning, rigorously comparing our upper bounds to a general lower bound for orthogonally equivariant algorithms. In Section 4, we present the proof of our main results, developing in particular uniform bounds on the estimation error. Finally, in Section 5 we summarize our findings and suggest directions for future research.

2. Multitask Representation Learning

The set of possible observations is denoted by Z = X × ℝ, where the members of X are interpreted as inputs and the members of ℝ are interpreted as outputs, or labels. A learning task is modelled by a probability measure µ on Z, where µ(x, y) is the probability to encounter the input-output pair (x, y) ∈ Z in the context of the task µ. We want to learn how to predict outputs. If we predict y′ while the true output is y, we suffer a loss ℓ(y′, y), where the loss function ℓ: ℝ × ℝ → [0, 1] is assumed to be 1-Lipschitz in the first argument for every value of the second argument. Different Lipschitz constants can be absorbed in the scaling of the predictors, and ranges different from [0, 1] can be handled by a simple scaling of our results.
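A concrete example of a loss satisfying these requirements is the truncated hinge loss with unit margin used again in Section 3.1, written here for labels y in {-1, +1}. The sketch below checks the [0, 1] range and the 1-Lipschitz property numerically; it is an illustrative check, not part of the analysis.

```python
import numpy as np

def trunc_hinge(y_pred, y):
    # xi(t) = 1 for t <= 0, 1 - t for 0 < t <= 1, 0 for t > 1,
    # applied to the margin t = y_pred * y with labels y in {-1, +1}.
    return float(np.clip(1.0 - y_pred * y, 0.0, 1.0))

rng = np.random.default_rng(7)
max_ratio = 0.0
for _ in range(1000):
    y = float(rng.choice([-1.0, 1.0]))
    p1, p2 = 3.0 * rng.standard_normal(2)
    if p1 != p2:
        # Empirical Lipschitz ratio in the first (prediction) argument.
        ratio = abs(trunc_hinge(p1, y) - trunc_hinge(p2, y)) / abs(p1 - p2)
        max_ratio = max(max_ratio, ratio)
```

Since |y| = 1, the clipped map has slope at most one in the prediction, so the empirical ratio never exceeds 1.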
If g is a real function defined on X, then the values g(x) can be interpreted as predictions, and the expectation E_{(X,Y)∼µ} [ℓ(g(X), Y)] is the risk associated with hypothesis g on the task µ. Multitask learning simultaneously considers many tasks µ_1, ..., µ_T and hopes to exploit some suspected common property of these tasks. For the purpose of this paper this property is the existence of a representation, or common feature-map, which simultaneously simplifies the learning problem for most, or all, of the tasks at hand.

We consider predictors g which factorize as g = f ∘ h, where ∘ stands for functional composition, that is, (f ∘ h)(x) = f(h(x)) for every x ∈ X. The function h: X → ℝ^K is called the representation, or feature-map, and is used across the different tasks, while f is a function defined on ℝ^K, a predictor specialized to the task at hand. In the sequel K will always be the dimension of the representation space. As usual in learning theory the functions h: X → ℝ^K and f: ℝ^K → ℝ are chosen from respective hypothesis classes H and F, which we refer to as the class of representations and the class of specialized predictors, respectively. These classes can be quite general, but we require that the functions in F have Lipschitz constant at most L, for some positive real number L.

The choice of representation and specialized predictors is based on the data observed for all the tasks. This data takes the form of a multi-sample Z̄ = (Z_1, ..., Z_T), with Z_t = (Z_t1, ..., Z_tn) ∼ µ_t^n. Here and in the sequel an exponent on a measure indicates a product measure, so that µ_t^n is a measure on Z^n and Z_t is an i.i.d. sample of n random variables distributed as µ_t. We also write Z_ti = (X_ti, Y_ti), Z_t = (X_t, Y_t) and Z̄ = (X̄, Ȳ). Multitask representation learning (MTRL) solves the optimization problem

  min { (1/(Tn)) Σ_{t=1}^T Σ_{i=1}^n ℓ(f_t(h(X_ti)), Y_ti) : h ∈ H, (f_1, ..., f_T) ∈ F^T }.   (1)

In this paper, we are not concerned with the algorithmics of this problem, but rather with the statistical properties of its solutions ĥ and f̂_1, ..., f̂_T. Note that these are functional random variables in their dependence on Z̄. We consider two possible applications of these solutions. One application, which we will refer to as multitask learning (MTL), retains both the representation ĥ and the specializations f̂_1, ..., f̂_T, to be applied to the tasks at hand. The other, perhaps more important, application assumes that the tasks µ_t are related by a probabilistic law, called an environment, and keeps only the representation ĥ, to be used when specializing to new tasks obeying the same law. In this way the parametrization of a learning algorithm is learned, hence the name learning-to-learn (LTL). We will give general statistical guarantees in both cases.

Our bounds consist of three terms. The first term can be interpreted as the cost of estimating the representation h and decreases with the number T of tasks available for training. The second term corresponds to the cost of estimating task-specific predictors and decreases with the number n of training examples available for each task. The last term contains the confidence parameter and typically makes only a very small contribution.

It is not surprising that the complexity of the representation class H (first term in the bounds) plays a central role. We measure this complexity on the observed input data X̄ ∈ X^{Tn}. Define a random set H(X̄) ⊆ ℝ^{KTn} by

  H(X̄) = {(h_k(X_ti)) : h ∈ H}.

The complexity measure relevant to the estimation of the representation is the Gaussian average

  G(H(X̄)) = E [ sup_{h∈H} Σ_{k,t,i} γ_kti h_k(X_ti) | X̄ ],   (2)

where the γ_kti are independent standard normal variables. The Gaussian average is of order √(nT) in T and n for many classes of interest. These include kernel machines with Lipschitz kernels (e.g.
Gaussian RBF) and arbitrarily deep compositions thereof, see Maurer (2014) for a discussion. As we shall see, this growth of order √(nT) is compensated in our bounds, and the cost of learning the representation vanishes in the multi-task limit T → ∞.

The second term in the bounds is governed by the quantity

  sup_{h∈H} ‖h(X̄)‖ / (n√T) = (1/(n√T)) sup_{h∈H} ( Σ_{k,t,i} h_k(X_ti)² )^{1/2},   (3)

or an equivalent distribution-dependent expression. If the feature-maps in H are very specific, in the sense that their components are appreciably different from zero only for very special data, the quantity in (3) can become much smaller than 1/√n, a phenomenon which can give a considerable competitive edge to MTRL, in particular if the per-task sample size n is small. We will demonstrate this in Section 3, where we apply Theorems 1 and 2 to subspace learning and show that the above quantity is related to the operator norm of the data covariance.

2.1 Bounding the Excess Task-averaged Risk (MTL)

If we make no further assumptions on the generation of the task-measures µ_1, ..., µ_T, a conceptually simple performance measure for a representation h and specialized predictors f_1, ..., f_T is the task-averaged risk

  E_avg(h, f_1, ..., f_T) = (1/T) Σ_{t=1}^T E_{(X,Y)∼µ_t} ℓ(f_t(h(X)), Y).

We want to compare this to the very best we can do using the classes H and F, given complete knowledge of the distributions µ_1, ..., µ_T. The minimal risk is clearly

  E*_avg = min { E_avg(h, f_1, ..., f_T) : h ∈ H, (f_1, ..., f_T) ∈ F^T }.

It is a fundamental hope underlying our approach that the classes H and F are large enough for this quantity to be sufficiently small for practical purposes. We use the words hope and belief, because an assumption would imply a statement to be used in analytical reasoning. Instead our approach is agnostic, and our results are valid independent of the size of the minimal risk above.
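Problem (1) is a joint empirical risk minimization over the representation and the task-specific predictors. A minimal numerical sketch for the linear case, with h(x) = Dx, f_t(y) = ⟨w_t, y⟩ and squared loss, solved by alternating minimization; all design choices here are illustrative and this is not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, d, K = 10, 20, 30, 3

# Synthetic tasks whose parameter vectors share a K-dimensional subspace.
U = np.linalg.qr(rng.standard_normal((d, K)))[0]
W_true = U @ rng.standard_normal((K, T))
X = rng.standard_normal((T, n, d))
Y = np.einsum('tnd,dt->tn', X, W_true)

D = rng.standard_normal((K, d))          # shared representation h(x) = D x
W = np.zeros((K, T))                     # task-specific heads f_t

def objective(D, W):
    # Empirical average over tasks and samples, as in problem (1).
    preds = np.einsum('tnd,kd,kt->tn', X, D, W)
    return float(np.mean((preds - Y) ** 2))

for _ in range(50):
    # f-step: ridge regression per task on the current features D x.
    for t in range(T):
        Phi = X[t] @ D.T
        W[:, t] = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(K), Phi.T @ Y[t])
    # h-step: least squares over D with the heads held fixed.
    A = np.einsum('kt,tnd->tnkd', W, X).reshape(T * n, K * d)
    D = np.linalg.lstsq(A, Y.reshape(-1), rcond=None)[0].reshape(K, d)

final = objective(D, W)
```

Since the synthetic problem is realizable, the alternating scheme drives the empirical objective close to zero; the statistical question treated in the paper is how the risk of such a minimizer behaves.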
Our first result bounds the excess average risk, which measures the difference between the task-averaged true risk of the solutions to (1) and the theoretical optimum above.

Theorem 1 Let µ_1, ..., µ_T, H and F be as above, and assume 0 ∈ H and f(0) = 0 for all f ∈ F. Then for δ > 0, with probability at least 1 - δ in the draw of Z̄ ∼ Π_{t=1}^T µ_t^n, we have

  E_avg(ĥ, f̂_1, ..., f̂_T) - E*_avg ≤ c_1 L G(H(X̄)) / (nT) + c_2 Q′ sup_{h∈H} ‖h(X̄)‖ / (n√T) + c_3 √( ln(1/δ) / (nT) ),

where c_1, c_2 and c_3 are universal constants, G(H(X̄)) is the Gaussian average in Equation (2), and Q′ is the quantity

  Q′ = Q′(F) = sup { (1/‖y - y′‖) E sup_{f∈F} Σ_i γ_i (f(y_i) - f(y′_i)) : y ≠ y′ ∈ ℝ^{Kn} }.   (4)

We make some remarks on this result.

1. The assumptions 0 ∈ H and f(0) = 0 for all f ∈ F are made to give the result a simpler appearance. They are not essential, as the reader can verify from the proof.

2. If G(H(X̄)) is of order √(nT), then the first term on the right-hand side above is of order 1/√(Tn) and vanishes in the multi-task limit T → ∞ even for small values of n.

3. For reasonable classes F one can find a bound on Q′ which is independent of n, because the ‖y - y′‖ in the denominator balances the Gaussian average depending on the class F.

4. The quantity sup_h ‖h(X̄)‖ is of order √(nT) whenever H is uniformly bounded, a crude bound being √(nT) sup_{h∈H} max_{ti} ‖h(x_ti)‖. The second term is thus typically of order 1/√n. As explained in the discussion of Equation (3) above, it can be very small if the representation components in H are very data-specific.

2.2 Bounding the Excess Risk for Learning-to-learn (LTL)

Now we consider the case where we only retain the representation ĥ obtained from (1) and specialize it to future, hitherto unknown tasks. This is of course only possible if there is some common law underlying the generation of tasks. Following Baxter (2000) we suppose that the tasks originate in a common environment η, which is by definition a probability measure on the set of probability measures on Z. The draw of µ ∼ η models the encounter of a learning task µ in the environment η.
The environment η induces a measure µ_η on Z by µ_η(A) = E_{µ∼η}[µ(A)] for A ⊆ Z. This simple mixture plays an important role in the interpretation of our results. The measure η also induces a measure ρ_η on Z^n, which corresponds to the draw of an n-sample from a random task in the environment: to draw a sample Z ∼ ρ_η we first draw a task µ from η and then generate the sample Z = (Z_1, ..., Z_n) in n independent draws from µ. Formally, ρ_η(A) = E_{µ∼η}[µ^n(A)] for A ⊆ Z^n. We assume that the tasks µ_1, ..., µ_T are drawn independently from η and, consequently, that the multi-sample Z̄ = (Z_1, ..., Z_T) is obtained in T independent draws from ρ_η, that is, Z̄ ∼ ρ_η^T.

The way we plan to use a representation h ∈ H on a new task µ ∼ η is as follows: we draw a training sample Z = (Z_1, ..., Z_n) from µ^n and solve the optimization problem

  min_{f∈F} (1/n) Σ_{i=1}^n ℓ(f(h(X_i)), Y_i).

Let f̂_{h,Z} denote the minimizer and m_{h,Z} the corresponding minimum. We will then use the hypothesis a(h)_Z = f̂_{h,Z} ∘ h = f̂_{h,Z}(h(·)) for the new task. In this way any representation h ∈ H parametrizes a learning algorithm, which is a function a(h): Z^n → F ∘ h, defined, for every Z ∈ Z^n, as a(h)_Z = f̂_{h,Z} ∘ h. In this sense the problem of optimizing such a representation can properly be called learning-to-learn. It can also be interpreted as learning a hypothesis space as in (Baxter, 2000), namely selecting a hypothesis space F ∘ h from the collection of hypothesis spaces {F ∘ h : h ∈ H}.

We can test the algorithm a(h) on the environment η in the following way: we draw a task µ ∼ η, we draw a sample Z from µ^n, we run the algorithm to obtain a(h)_Z = f̂_{h,Z} ∘ h and, finally, we measure the loss of a(h)_Z on a random data-point Z = (X, Y) ∼ µ. To define the risk E_η(h) associated with the algorithm a(h) parametrized by h, we just replace all random draws with corresponding expectations, so

  E_η(h) = E_{µ∼η} E_{Z∼µ^n} E_{(X,Y)∼µ} [ℓ(a(h)_Z(X), Y)].
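The transfer risk E_η(h) of the algorithm a(h) can be estimated by straightforward Monte Carlo: repeatedly draw a task from the environment, draw an n-sample, run the inner ERM to obtain the specialized predictor, and test on a fresh point. A sketch under an illustrative linear environment, with all names hypothetical; the inner algorithm is a lightly ridge-regularized stand-in for exact ERM:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, n = 30, 3, 15
M = np.linalg.qr(rng.standard_normal((d, K)))[0]     # true shared subspace

def draw_task():
    # A task is a unit vector u in the subspace M; inputs are standard
    # normal and the output of (x, y) is y = <u, x> (absolute loss below).
    u = M @ rng.standard_normal(K)
    return u / np.linalg.norm(u)

def transfer_risk(h, n_tasks=300):
    # h is a K x d linear representation; a(h) fits a linear head on h(x).
    loss = 0.0
    for _ in range(n_tasks):
        u = draw_task()
        X = rng.standard_normal((n, d))
        y = X @ u
        Phi = X @ h.T                                # n x K features
        w = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(K), Phi.T @ y)
        x_test = rng.standard_normal(d)
        loss += abs(x_test @ h.T @ w - x_test @ u)   # loss on a fresh point
    return loss / n_tasks

risk_true = transfer_risk(M.T)                       # representation spanning M
risk_rand = transfer_risk(np.linalg.qr(rng.standard_normal((d, K)))[0].T)
```

A representation spanning the true subspace makes every task learnable from n points, while a random K-dimensional representation leaves most of the task vector unexplained, which is the gap that min_h E_η(h) measures.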
The best value for any representation h in a(h), given complete knowledge of the environment, is then min_{h∈H} E_η(h). But, given complete knowledge of the environment, this is still not the best we can do using the classes F and H, because for given µ and h we still use the expected performance E_{Z∼µ^n} E_{(X,Y)∼µ} ℓ(a(h)_Z(X), Y) of the empirical risk minimization algorithm a(h), instead of using knowledge of µ to replace it by min_{f∈F} E_{(X,Y)∼µ} ℓ(f(h(X)), Y). The very best we can do is thus

  E*_η = min_{h∈H} E_{µ∼η} min_{f∈F} E_{(X,Y)∼µ} ℓ(f(h(X)), Y).

The excess risk associated with any representation h is thus E_η(h) - E*_η. We give the following bound for the excess risk associated with the representation ĥ found as solution to the optimization problem (1).

Theorem 2 Let η be an environment on Z and let H and F be as above. Then: (i) with probability at least 1 - δ in the draw of Z̄ ∼ ρ_η^T,

  E_η(ĥ) - E*_η ≤ √(2π) L G(H(X̄)) / (T√n) + √(2π) Q sup_{h∈H} ( E_{(X,Y)∼µ_η} ‖h(X)‖² )^{1/2} / √n + 5 √( ln(4/δ) / T ),

and (ii) with the same probability

  E_η(ĥ) - E*_η ≤ √(2π) L G(H(X̄)) / (T√n) + √(2π) Q (1/T) Σ_t sup_{h∈H} ‖h(X̄_t)‖ / n + 5 √( ln(4/δ) / T ),

where ĥ is the solution to problem (1), G(H(X̄)) is the Gaussian average introduced in (2), and Q is the quantity

  Q = Q(F) = sup { (1/‖y‖) E sup_{f∈F} Σ_i γ_i f(y_i) : y ∈ ℝ^{Kn} \ {0} }.   (5)

We make some remarks and comparisons to the previous result.

1. The constants are now explicit and small. For Theorem 1, uniform estimation had to be controlled simultaneously in H and F, while for LTL the problem can be more easily decoupled.

2. The first term is equivalent to the first term in Theorem 1, except for √n replacing n in the denominator. It is therefore typically of order 1/√T instead of 1/√(nT). The different order is due to the estimation of a hitherto unknown task, for which the sample sizes are irrelevant. To understand this point, assume that η has the property that every µ ∼ η is deterministic, that is, supported on a single point z_µ ∈ Z. Then clearly the sample size n is irrelevant, and the problem becomes equivalent to learning a single task with a sample of size T.
3. The quantity Q is very much like the quantity Q′ in Equation (4), and it is uniformly bounded in n for the classes we consider. For linear classes Q = Q′.

4. The bound in part (i) is not fully data-dependent, but it is more convenient for our applications below. The quantity

  sup_{h∈H} E_{(X,Y)∼µ_η} ‖h(X)‖² = sup_{h∈H} Σ_k E_{(X,Y)∼µ_η} h_k(X)²

plays a similar role to (3), which is its empirical counterpart. Again, if the features h_k are very specific, as the dictionary atoms of the next section or the atoms in a radial basis function network, then the above quantity can become very small.

2.3 Comparison to Previous Bounds

The first and most important theoretical study of MTL and LTL was carried out by Baxter (2000), where sample complexity bounds are given for both settings. Instead of a feature map, a hypothesis space is selected from a class of hypothesis spaces. Clearly every feature map with values in ℝ^K defines a hypothesis space, while the reverse is not true in general, so Baxter's setting is certainly more general than ours. On the other hand, the practical applications discussed in (Baxter, 2000) can be cast in the language of feature learning.

To prove his sample complexity bounds Baxter uses covering numbers. This classical method requires covering a (meta-)hypothesis space (or its evaluation on a sample) with a set of balls in an appropriately chosen metric. The uniform bound is then obtained as a union bound over the cover and bounds valid on the individual balls. The latter bounds follow from Lipschitz properties of the loss function, with constant L, relative to the chosen metric. For a bound of order ϵ the radius of the balls has to be of order ϵ/L. This leads to covering numbers of order ϵ^(-d), where d is some exponent (see the last inequalities in the proof in Baxter, 2000), and has the consequence that the dominant term in the bound has an additional factor of ln(1/ϵ).
This is manifest in Theorem 8, Theorem 12 and Corollary 13 in (Baxter, 2000) and constitutes an essential weakness of the method of covering numbers. For bounds on the excess risk it implies that the orders √(1/T) and √(1/n) obtained from Rademacher or Gaussian complexities have to be replaced by √(ln(T)/T) and √(ln(n)/n).

Rademacher and Gaussian complexities make it easy to handle infinite-dimensional input spaces (see our Theorems 4 and 5 below). They also lead to data-dependent bounds, which allow us to explain the benefits of multi-task learning in terms of the spectrum of the data covariance operator and the effective input dimension. Bounding Gaussian complexities for linear classes is comparatively simple, see the proof of our Lemma 3. There is a wealth of recent literature on the Rademacher complexity of matrices with spectral regularizers (see e.g. Kakade et al., 2012; Maurer and Pontil, 2013, and references therein), while it is unclear to us how Baxter's method could be applied if the feature map is constrained by a bound on, say, the trace norm of the associated matrix. In the case of LTL, our approach also leads to explicit and small constant factors. On the other hand it must be admitted that it is relatively easy to obtain bounds (also provided by Baxter) of order ln(n)/n or ln(T)/T with covering numbers in the realizable case. Such bounds would be more difficult to obtain with our techniques.

The work of Ando and Zhang (2005) proposes the use of MTL as a method of semi-supervised learning through the creation of artificial tasks from unlabelled data, for example predicting concealed components of vectors. They analyze a specific algorithm where the class of feature maps can be seen as a linear mixture of a fixed feature map with subspace projections, as discussed in our paper. The bounds given apply to the task-averaged risk and not to LTL.
The analysis is based on Rademacher averages and is independent of the input dimension. The bound itself is expressed as an entropy integral as given by Dudley (see e.g. Van Der Vaart and Wellner, 1996), but it is not very explicit. In particular the role of the spectrum of the data covariance is not apparent.

3. Multi-task Subspace Learning

We illustrate the general results of the previous section with an important special case. We assume that the input space X is a bounded subset of a Hilbert space H, which could for example be a reproducing kernel Hilbert space. We denote by ⟨·, ·⟩ the inner product in H and by ‖·‖ the induced norm. We hope that sufficiently good results can be obtained by predictors of the form x ↦ ⟨g, x⟩, where g ∈ H has bounded norm. We also suspect that only few linear features in H suffice for most tasks, so that the vectors g defining the hypotheses can all be chosen from one and the same, albeit unknown, K-dimensional subspace M of H. Consequently we will factorize predictors as f ∘ h, where h is a partial isometry h: H → ℝ^K and f is a linear functional on ℝ^K chosen from some ball of bounded radius. Specifically, we introduce the classes

  H = { x ∈ H ↦ (⟨d_1, x⟩, ..., ⟨d_K, x⟩) ∈ ℝ^K : D = (d_1, ..., d_K) ∈ H^K orthonormal },
  F = { y ∈ ℝ^K ↦ Σ_k w_k y_k ∈ ℝ : Σ_k w_k² ≤ B² }.

The D's appearing in the definition of H are also called dictionaries, and the individual d_k are called atoms (see Maurer et al., 2013). It does no harm to our analysis if we immediately generalize the class H so as to include certain two-layer neural networks, by allowing a nonlinear activation function φ, with Lipschitz constant L_φ and satisfying φ(0) = 0, to be applied with each atom. We can also drop the condition of orthonormality and allow the atoms to trade some of their norms when needed. The enlarged class of representations is

  H = { x ∈ H ↦ (φ(⟨d_1, x⟩), ..., φ(⟨d_K, x⟩)) ∈ ℝ^K : d_1, ..., d_K ∈ H, Σ_k ‖d_k‖² ≤ K }.

The results can then be re-specialized to subspace learning by setting φ to the identity and L_φ to one.
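The classes just described are easy to instantiate concretely. The sketch below builds a representation from an orthonormal dictionary with an atom-wise 1-Lipschitz activation, and a specialized predictor from the B-ball, then checks numerically that the representation never increases norms and that the composite predictor is (L_φ B)-Lipschitz. All parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, B = 40, 5, 2.0

# Orthonormal dictionary D (K atoms) -> partial isometry h(x) = D x.
D = np.linalg.qr(rng.standard_normal((d, K)))[0].T
# Specialized predictor f(y) = <w, y> with ||w|| <= B.
w = rng.standard_normal(K)
w = B * w / np.linalg.norm(w)
phi = np.tanh                        # 1-Lipschitz activation with phi(0) = 0

def predict(x):
    # The factorized predictor f(phi applied atom-wise to h(x)).
    return float(w @ phi(D @ x))

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
# h never increases norms, since D has orthonormal rows ...
contraction = np.linalg.norm(D @ x1) <= np.linalg.norm(x1) + 1e-12
# ... and with L_phi = 1 the composite predictor is B-Lipschitz.
lip_ratio = abs(predict(x1) - predict(x2)) / np.linalg.norm(x1 - x2)
```

The two checked properties are exactly the ingredients the bounds use: the contraction controls ‖h(X̄)‖, and the Lipschitz constant of the composite is at most L_φ B.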
When applied to subspace learning, our bounds are expressed in terms of covariances. If ν is a probability measure on H, the corresponding covariance operator C_ν is defined by ⟨C_ν v, w⟩ = E_{X∼ν} ⟨v, X⟩⟨X, w⟩ for v, w ∈ H. For an environment η we denote the covariance operator corresponding to the data-marginal of the mixture measure µ_η simply by C. If x̄ = (x_1, ..., x_m) ∈ H^m we define the empirical covariance operator Ĉ(x̄) by ⟨Ĉ(x̄)v, w⟩ = (1/m) Σ_i ⟨v, x_i⟩⟨x_i, w⟩ for v, w ∈ H; in particular ⟨Ĉ(X̄)v, w⟩ = (1/(nT)) Σ_{ti} ⟨v, X_ti⟩⟨X_ti, w⟩.

The following lemma establishes the necessary ingredients for the application of Theorems 1 and 2 to the case of subspace learning. Recall that if A is a selfadjoint positive linear operator on H, we denote by ‖A‖ and ‖A‖_1 its spectral and trace norms, respectively. They are defined as ‖A‖ = sup_{‖z‖≤1} ‖Az‖ and ‖A‖_1 = Σ_{i∈ℕ} ⟨e_i, A e_i⟩, where {e_i}_{i∈ℕ} is an orthonormal basis of H. Recall also the definitions of Q′(F) and Q(F) given in Equations (4) and (5), respectively.

Lemma 3 Let x̄ = (x_ti) be a T × n matrix with values in a Hilbert space, and let φ, H and F be defined as above. Then
(i) G(H(x̄)) ≤ L_φ K √( nT ‖Ĉ(x̄)‖_1 ).
(ii) For every h ∈ H, ‖h(x̄)‖ ≤ L_φ √( K nT ‖Ĉ(x̄)‖ ).
(iii) For an environment η and every h ∈ H, E_{(X,Y)∼µ_η} ‖h(X)‖² ≤ L_φ² K ‖C‖.
(iv) L(F) ≤ B.
(v) Q′(F) ≤ B and Q(F) ≤ B.

Proof (i) Using the contraction lemma, Corollary 11, in the first inequality and Cauchy-Schwarz and Jensen's inequality in the second, we get

  G(H(x̄)) ≤ L_φ E sup_D Σ_k ⟨d_k, Σ_ti γ_kti x_ti⟩ ≤ L_φ √K ( E Σ_k ‖Σ_ti γ_kti x_ti‖² )^{1/2} = L_φ √K ( K Σ_ti ‖x_ti‖² )^{1/2} = L_φ K √( nT ‖Ĉ(x̄)‖_1 ).

(ii) For any D,

  Σ_kti φ(⟨d_k, x_ti⟩)² ≤ L_φ² Σ_kti ⟨d_k, x_ti⟩² ≤ L_φ² ( Σ_k ‖d_k‖² ) sup_{‖v‖≤1} Σ_ti ⟨v, x_ti⟩² = L_φ² K nT ‖Ĉ(x̄)‖,

where we used φ(0) = 0 in the first step.

(iii) Similarly, we have that

  E_{(X,Y)∼µ_η} Σ_k φ(⟨d_k, X⟩)² ≤ L_φ² ( Σ_k ‖d_k‖² ) sup_{‖v‖≤1} E_{(X,Y)∼µ_η} ⟨v, X⟩² = L_φ² K ‖C‖.

(iv) Let y, y′ ∈ ℝ^K. Then |Σ_k w_k (y_k - y′_k)| ≤ ( Σ_k w_k² )^{1/2} ‖y - y′‖ ≤ B ‖y - y′‖, so L(F) ≤ B.
(v) Similarly, we have that, for y ≠ y′,

  E sup_{f∈F} Σ_i γ_i (f(y_i) - f(y′_i)) = E sup_{‖w‖≤B} ⟨w, Σ_i γ_i (y_i - y′_i)⟩ ≤ B ( Σ_i ‖y_i - y′_i‖² )^{1/2} = B ‖y - y′‖,

so Q′(F) ≤ B. The same proof works for Q(F).

Substitution in Theorem 1 immediately gives

Theorem 4 (subspace MTL) With probability at least 1 - δ in X̄ the excess risk is bounded by

  E_avg(ĥ, f̂_1, ..., f̂_T) - E*_avg ≤ c_1 L_φ B K √( ‖Ĉ(X̄)‖_1 / (nT) ) + c_2 L_φ B √( K ‖Ĉ(X̄)‖ / n ) + c_3 √( ln(1/δ) / (nT) ).   (6)

We remark that in the linear case the best competing bound for MTL, obtained by Maurer and Pontil (2013) from noncommutative Bernstein inequalities, is, up to constants and the confidence term, of the form

  B √( K ‖Ĉ(X̄)‖_1 ln(Tn) / (nT) ) + B √( 8 K ‖Ĉ(X̄)‖ / n ).   (7)

If we disregard the constants, this is worse than the bound (6) whenever K < ln(Tn). Its approach to the multitask limit is slower (√(ln(T)/T) as opposed to √(1/T)), but of course it has the advantage of smaller constants. The methods used to obtain (7), however, break down for nonlinear dictionaries. For the LTL setting, we use the distribution-dependent bound, Theorem 2 (i), and obtain

Theorem 5 (subspace LTL) With probability at least 1 - δ in X̄, the excess risk is bounded by

  E_η(ĥ) - E*_η ≤ √(2π) L_φ B K √( ‖Ĉ(X̄)‖_1 / T ) + √(2π) L_φ B √( K ‖C‖ / n ) + 5 √( ln(4/δ) / T ).

The two most important common features of Theorems 4 and 5 are the decay to zero of the first term, as T → ∞, and the occurrence of the operator norm of the empirical or true covariance in the second term. The first implies that for very large numbers of tasks the bounds are dominated by the second term. To understand the second term we must first realize that the ratio of trace and operator norms of the true covariance can be interpreted as an effective dimension of the distribution. This is easily seen if the mixture of task-marginals is concentrated and uniform on a d-dimensional unit sphere. In this case ‖C‖_1 = 1 and by isotropy all eigenvalues are equal, so ‖C‖ = 1/d, whence ‖C‖_1/‖C‖ = d. In such a case the second term in Theorem 5 above becomes

  √(2π) L_φ B √( K / (dn) ).   (8)

The appropriate standard bound for learning the tasks independently would be B √(1/n) (see Bartlett and Mendelson, 2002).
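The ratio of trace and operator norms acting as an effective dimension is easy to verify empirically: for data uniform on the unit sphere of ℝ^d the empirical ratio should be close to d. An illustrative check:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 20, 50000
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # uniform on the sphere

C_hat = X.T @ X / m                                  # empirical covariance
trace_norm = float(np.trace(C_hat))                  # trace norm (psd case)
op_norm = float(np.linalg.eigvalsh(C_hat)[-1])       # spectral norm
eff_dim = trace_norm / op_norm                       # should be close to d
```

Since each row has unit norm, the trace norm is exactly 1, while the largest eigenvalue concentrates around 1/d, so the ratio approaches the ambient dimension d.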
The ratio √(K/d) of the two bounds in the multitask limit is the quotient of utilized information (the dimension of the representation space) to available information (the dimension of the data). This highlights the potential advantages of MTRL: if the data is already low-dimensional, of the order of K, then multi-task learning is not worth the extra computational labour. If the data is high-dimensional, however, then multi-task learning may be superior. The expression (8) above might suggest that there really is a benefit of high dimensions for learning-to-learn. This is of course not the case, because the regularization parameter B has to be chosen large, in fact proportional to d, to allow a small empirical error. The correct interpretation of (8) is that the burden of high dimensions vanishes in the limit T → ∞. In the next section we will explain this point in more detail.

3.1 Learning to Learn Half-spaces

In this section, we illustrate the benefit of MTRL over independent task learning (ITL) in the case of noiseless linear binary classification (or half-space learning). We compare our upper bounds for LTL to a general lower bound on the performance of ITL algorithms and quantify the parameter regimes where LTL is superior to ITL. We assume that all the input marginals are given by the uniform distribution σ on the unit sphere S_d in ℝ^d, and the objective is for each task µ to classify membership in the half-space {x : ⟨x, u_µ⟩ > 0} defined by a task-specific (unknown) unit vector u_µ. In the given environment all the vectors u_µ are assumed to lie in some (unknown) K-dimensional subspace M of ℝ^d.

We are interested in the regime where n ≪ d and T grows. This is the safe regime in which our upper bounds for MTL or LTL (cf. Theorems 4 and 5) are smaller than a uniform lower bound for independent task learning, which we discuss below. We need n ≪ d for the lower bound to be large and K ≪ n for the middle term in our upper bounds to be small.
If $T$ is large enough, the second term in our upper bounds dominates the first (task dependent) term; a safe choice is $T \ge K^2 d$, see Equation (9) below. The 0-1-loss is unsuited for our bounds because it is not Lipschitz. Instead we will use the truncated hinge loss with unit margin, given by $\ell(y', y) = \xi(y' y)$, where $\xi$ is the real function
$$ \xi(t) = \begin{cases} 1 & \text{if } t \le 0, \\ 1 - t & \text{if } 0 < t \le 1, \\ 0 & \text{if } 1 < t. \end{cases} $$
This loss is an upper bound of the 0-1-loss, so upper bounds for this loss function are also upper bounds for the classification error. Let $H$ and $F$ be as given at the beginning of Section 3 in its linear variant, where $H$ is defined by orthonormal dictionaries without activation functions. Thus $H$ can be viewed as the set of partial isometries $D : \mathbb{R}^d \to \mathbb{R}^K$. Recall the definition of the minimal risk for LTL,
$$ \mathcal{E}^*_\eta = \min_{h \in H} \mathbb{E}_{\mu \sim \eta} \min_{f \in F} \mathbb{E}_{Z \sim \mu}\, \ell(f(h(X)), Y) = \min_{D \in H} \mathbb{E}_{\mu \sim \eta} \min_{\|w\| \le B} \mathbb{E}_{Z \sim \mu}\, \xi\big( \langle w, DX \rangle\, \mathrm{sgn}(\langle u_\mu, X \rangle) \big). $$
Let $D_M$ be the partial isometry mapping $M$ onto $\mathbb{R}^K$. Then $D_M \in H$, and for every unit vector $u \in M$ we have $D_M(Bu) \in F$. Thus
$$ \mathcal{E}^*_\eta \le \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{Z \sim \mu}\, \xi\big( \langle D_M(B u_\mu), D_M X \rangle\, \mathrm{sgn}(\langle u_\mu, X \rangle) \big) = \mathbb{E}_{\mu \sim \eta} \big[ \mathbb{E}_{X \sim \sigma}\, \xi(B\, |\langle u_\mu, X \rangle|) \big] \le \sup_{\|u\| = 1} \mathbb{E}_{X \sim \sigma}\, \xi(B\, |\langle u, X \rangle|). $$
For any unit vector $u$ the density of the distribution of $|\langle u, X \rangle|$ under $\sigma$ has maximum $A_{d-1}/A_d$, where $A_d$ is the volume of $S^d$ in the metric inherited from $\mathbb{R}^d$. The right hand side above can therefore be bounded by
$$ \frac{A_{d-1}}{A_d} \int \xi(B\, |s|)\, ds = \frac{A_{d-1}}{A_d\, B} \le \frac{d}{2B} = \epsilon $$
if we set $B = d/(2\epsilon)$. With this choice the Lipschitz loss upper bounds the 0-1-loss while the minimal risk satisfies $\mathcal{E}^*_\eta \le \epsilon$. Now let $\mathbf{Z}$ be a multi-sample generated from the environment $\eta$, and assume that we have solved the optimization problem (1) to obtain the representation (or feature map) $\hat D \in H$. Using the excess risk bound, Theorem 5, and the fact that $\|C\|_\infty = 1/d$ and $\|C\|_1 = 1$, we get with probability at least $1 - \delta$ in the draw of $\mathbf{Z}$ that
$$ \mathcal{E}_\eta(\hat D) \le \epsilon + c_1 \frac{d K}{2 \epsilon} \sqrt{\frac{1}{T}} + \frac{c_2}{2 \epsilon} \sqrt{\frac{d K}{n}} + \sqrt{\frac{8 \ln(4/\delta)}{T}}, \qquad (9) $$
which we then minimize in $\epsilon$. This guarantees the expected performance of future uses of the representation $\hat D$.
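The truncated hinge $\xi$ is a one-liner; the following sketch (illustrative, assuming NumPy; not the paper's code) evaluates it on a few sample margins:

```python
import numpy as np

def truncated_hinge(margin):
    """xi(t): 1 for t <= 0, linear 1 - t on (0, 1], and 0 for t > 1."""
    return np.clip(1.0 - margin, 0.0, 1.0)

# The loss of a prediction y' against a label y in {-1, +1} is xi(y * y')
y_true = np.array([1.0, 1.0, -1.0, -1.0])
y_pred = np.array([2.0, 0.3, -0.5, 0.4])
loss = truncated_hinge(y_true * y_pred)
print(loss)  # [0.  0.7 0.5 1. ]
```

Note that the loss dominates the 0-1-loss: whenever the margin $y\,y'$ is non-positive (a misclassification) the loss is exactly 1, and $\xi$ is 1-Lipschitz everywhere.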
The high dimension is still a hindrance to the estimation of the representation but, as announced, its effect vanishes in the limit $T \to \infty$. The individual sample sizes need only be large relative to $K$, roughly the number of shared features. We compare this upper bound to a lower bound for a large class of algorithms which learn the tasks independently.

Definition 6 An algorithm $f : (S^d)^n \times \{-1, 1\}^n \to S^d$ is called orthogonally equivariant if
$$ f(V \mathbf{x}, \mathbf{y}) = V f(\mathbf{x}, \mathbf{y}) \quad \text{for every orthogonal matrix } V \in \mathbb{R}^{d \times d}. \qquad (10) $$

For data transformed by an orthogonal transformation, an orthogonally equivariant algorithm produces a correspondingly transformed hypothesis. Any algorithm which does not depend on a specific coordinate system is orthogonally equivariant. This class of algorithms includes all kernel methods, but it excludes the Lasso ($\ell_1$-norm regularization). If the known properties of the problem possess a rotation symmetry, only equivariant algorithms make sense. Below we denote by $\mathrm{err}(u, v)$ the misclassification error between the half-spaces associated with unit vectors $u$ and $v$, that is $\mathrm{err}(u, v) = \Pr_{x \sim \sigma}\{ \langle u, x \rangle \langle v, x \rangle < 0 \}$. The following lower error bound is given in (Maurer and Pontil, 2008).

Theorem 7 Let $n < d$ and suppose that $f : (S^d)^n \times \{-1, 1\}^n \to S^d$ is an orthogonally equivariant algorithm. Then for $\delta > 0$, with probability at least $1 - \delta$ in the draw of $\mathbf{X} \sim \sigma^n$, we have for every $u \in S^d$ that
$$ \mathrm{err}\big( u, f(\mathbf{X}, u(\mathbf{X})) \big) \ge \cdots $$
where $u(\mathbf{X}) = (\mathrm{sgn}\langle u, X_1 \rangle, \dots, \mathrm{sgn}\langle u, X_n \rangle)$.

If we use a union bound to subtract the upper bound (9) from this lower bound, we obtain high probability guarantees for the advantage of representation learning over any such independent task learning algorithm. In the following section we plot the phase diagram derived here, namely the difference between the uniform lower bound and our upper bound, and compare it with empirical results (see Figure 4).
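Orthogonal equivariance is easy to check numerically for a coordinate-free learner. The toy learner below (a hypothetical example, not an algorithm from the paper; assumes NumPy) returns the normalized label-weighted mean of the inputs and satisfies (10) exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5

def centroid_learner(X, y):
    """A coordinate-free toy learner: normalized label-weighted mean direction."""
    w = X.T @ y
    return w / np.linalg.norm(w)

X = rng.standard_normal((n, d))           # rows are the inputs x_1, ..., x_n
y = rng.choice([-1.0, 1.0], size=n)

# Random orthogonal matrix V, via QR of a Gaussian matrix
V, _ = np.linalg.qr(rng.standard_normal((d, d)))

w_plain = centroid_learner(X, y)
w_rotated = centroid_learner(X @ V.T, y)  # inputs transformed x_i -> V x_i
assert np.allclose(w_rotated, V @ w_plain)  # f(Vx, y) = V f(x, y)
```

A Lasso-style learner fails this check because the $\ell_1$-ball singles out the coordinate axes and is not rotation invariant.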
3.2 Numerical Experiments

The purpose of the experiments is to compare MTL and LTL to independent task learning (ITL) in the simple setting of linear feature learning (or subspace learning).¹ We wish to study the regime in which MTL/LTL is beneficial over ITL as a function of the number of tasks $T$ and the sample size per task $n$. We consider noiseless linear binary classification tasks, namely half-space learning. We generated the data in the following way. The ground truth weight vectors $u_1, \dots, u_T$ are obtained by the equation $u_t = D c_t$, where $c_t \in \mathbb{R}^K$ is sampled from the uniform distribution on the unit sphere in $\mathbb{R}^K$, and the dictionary $D \in \mathbb{R}^{d \times K}$ is created by first sampling a $d$-dimensional orthogonal matrix from the Haar measure, and then selecting its first $K$ columns (atoms). We create all input marginals by sampling from the uniform distribution on the radius-$d$ sphere in $\mathbb{R}^d$. For each task we sample $n$ instances to build the training set, and 1000 instances for the test set. We train the methods with the hinge loss function $h(z) := \max\{0, 1 - z/c\}$, where $c$ is the margin. We choose $c = 2/\epsilon$, so that the true error relative to the best hypothesis is of order $\epsilon$, and we fix the value of $\epsilon$ to be $(K/n)^{1/2}$. For ITL we optimize this loss function constraining the $\ell_2$-norm of the weights; for MTL and LTL we constrain $D$ to have Frobenius norm at most 1 and each $c_t$ to have $\ell_2$-norm at most 1. During testing we use the 0-1 loss. For example, the task-average error is evaluated as
$$ \frac{1}{T} \sum_{t=1}^{T} \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\big\{ \mathrm{sign}(\langle u_t, x_{ti} \rangle) \ne \mathrm{sign}(\langle \hat u_t, x_{ti} \rangle) \big\}, \qquad (11) $$
where $m$ is the number of test instances and $\hat u_t$ are the weight vectors learned by the assessed method.

3.3 MTL Experiment

We first discuss the MTL experiment. We let $d = 50$, and vary $T \in \{5, 10, \dots, 150\}$ and $n \in \{5, 10, \dots, 150\}$, considering the cases $K = 2$ and $K = 5$. In Figure 1 we report the difference between the classification errors of the two methods. These results are obtained by repeating the experiment 10 times and reporting the average difference.
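The data-generating scheme can be sketched in a few lines (an illustrative reconstruction assuming NumPy, not the authors' published code; for simplicity the inputs here are drawn from the unit sphere rather than the rescaled one):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, n = 50, 2, 10, 20      # ambient dim, subspace dim, tasks, samples per task

# Dictionary: first K columns of a Haar-distributed orthogonal matrix (via QR)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
D = Q[:, :K]                                   # d x K with orthonormal atoms

# Task weight vectors u_t = D c_t, with c_t uniform on the unit sphere of R^K
C = rng.standard_normal((T, K))
C /= np.linalg.norm(C, axis=1, keepdims=True)
U = C @ D.T                                    # T x d; each row is a unit vector

# Inputs uniform on the sphere; labels are the noiseless half-space signs
X = rng.standard_normal((T, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True)
Y = np.sign(np.einsum('tnd,td->tn', X, U))
```

Since `D` has orthonormal columns, each `u_t = D c_t` is again a unit vector, and all the `u_t` lie in the same $K$-dimensional subspace, as the theory requires.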
In each trial a different set of input points and underlying weight vectors is generated for each task. In the MTL case the training error was always below 0.1, and on average it was smaller than 0.04. This suggests that, despite the problem being non-convex, the gradient optimization algorithm finds a good suboptimal solution.

1. The code used for the experiments presented in this section is available at http://romera-paredes.com/multitask-representation.

Figure 1: Difference of test classification error, computed according to eq. (11), between ITL and MTL. The vertical axis represents the number of training tasks, and the horizontal axis the number of training instances per task. In the left column K = 2, and in the right column K = 5.

We have made further experiments to assess the influence of other data settings on the difference between ITL and MTL. In the first of these experiments we explored the cases in which the dictionary size is overestimated and underestimated. The results are shown in Figure 2. In the left plot the dictionary size is overestimated: the ground truth number of atoms is 2, while the number of atoms used in the MTL method is 5. We observe a similar pattern to the one in Figure 1, although the differences between ITL and MTL are not as large. The performance is slightly hampered, as expected, due to the overestimation of the number of atoms. In Figure 2 (right), on the other hand, we show the results when the number of atoms in the ground truth dictionary is 5, whereas the number of atoms used in the MTL approach is 2. In this case the performance is severely affected by the underestimation of the size of the dictionary, yet MTL still performs better than ITL in the same regime as in the previous experiments. In the second of these experiments we study how the results are affected when the data are noisy.
To do so, we generated the data so that the ground truth label of instance $x_i$ for task $t$ is given by $\mathrm{sign}(\langle u_t, x_i \rangle + \varepsilon_{ti})$, where $\varepsilon_{ti} \sim N(0, 1)$. The dictionary size, for both the ground truth and the MTL approach, is K = 2. The results are shown in Figure 3, where we see a similar behaviour to that in Figure 1, with somewhat smaller differences between ITL and MTL.

3.4 LTL Experiment

In this experiment we test how the dictionary learned at the training stage helps in learning new tasks, and we assess how similar the resulting figure is to the phase diagram derived in the previous section. The data are generated according to the settings given in the MTL experiment. Furthermore, 50 new tasks are sampled following the same scheme previously described, for the purpose of computing the LTL test error. We present the results in Figure 4 (Top).

Figure 2: Difference of test classification error, computed according to eq. (11), between ITL and MTL, when the number of atoms of the ground truth dictionary does not match the number of atoms of the MTL model. The left plot shows the experiment in which the ground truth number of atoms is 2, whereas the number of atoms used in the MTL approach is 5. The right plot shows the opposite scenario: 5 atoms as ground truth, and 2 atoms in the MTL model. The vertical axis represents the number of training tasks, and the horizontal axis the number of training instances per task.

Figure 3: Difference of test classification error, computed according to eq. (11), between ITL and MTL, when adding Gaussian noise to the ground truth labels. The vertical axis represents the number of training tasks, and the horizontal axis the number of training instances per task.

Similar to the previous experiment, we report the average difference between the test errors of ITL and LTL over 10 trials.
In Figure 4 (Bottom) we present the theoretical phase diagram, which was generated using $1 \le T \le 10^{11}$, $1 \le n \le 10^5$, $d = 10^5$, and $\delta = 0.0001$. We also plot, as a dark line, the points at which there is no difference between the performances of ITL and LTL.

Figure 4: The vertical axis represents the number of training tasks, and the horizontal axis the number of training instances per task. Plots in the top row show the difference of test classification error, computed on 50 new tasks, between ITL and LTL. Plots in the bottom row show the region where the upper bound for LTL is smaller than the lower bound for any equivariant algorithm for ITL (see the discussion in Section 3.1, in particular Equation 9), using $1 \le T \le 10^{11}$, $1 \le n \le 10^5$, $d = 10^5$, and $\delta = 0.0001$. In the left column K = 2, and in the right column K = 5.

The reader may object to the much larger parameter values used to generate the plots of theoretical differences, in comparison to the experimental settings. These large parameters are partly a consequence of an accumulation of somewhat loose estimates in the derivation of both the upper and lower bounds. Another reason is that, in applying our results to a noiseless, finite-dimensional problem (for clarity), we have sacrificed two of their strong points: independence of the input dimension and their agnostic nature. Apart from the large parameter values, the theoretical prediction shown in Figure 4 (Bottom) is in very good agreement with the experimental results in Figure 4 (Top). We have also performed experiments to evaluate the influence of noise and of under/overestimation of the dictionary size on the difference between ITL and LTL, obtaining results similar to those reported for MTL in Figures 2 and 3.

Figure 5: Similarity between the learned dictionary $\hat D$ and the ground truth dictionary $D$, according to the similarity measure $s(\hat D, D)$ in Equation (12).
The vertical axis represents the number of training tasks, and the horizontal axis the number of training instances per task. Left plot: K = 2. Right plot: K = 5.

Finally, we compared the learned dictionary $\hat D$ with the ground truth $D$ in the same regime of parameters used for the previous experiments. Note that a dictionary can be correct only up to permutations and changes of sign of its atoms. To overcome this issue we use the similarity measure
$$ s(\hat D, D) = \frac{1}{K}\, \big\| D^\top \hat D \big\|_{\mathrm{tr}}, \qquad (12) $$
where $\|\cdot\|_{\mathrm{tr}}$ is the sum of the singular values of a matrix. Note that $s(\hat D, D) = 1$ if $\hat D$ and $D$ are the same matrix up to permutations of columns and changes of sign, as required. The results are shown in Figure 5, which indicates that the learned dictionary is close to the true dictionary even for small sample sizes, provided $T$ is large. This supports the results in Figure 1 and the top plots in Figure 4, where MTL and LTL, respectively, are found to be superior to ITL in this regime.

4. Proofs of the Main Theorems

In this section we prove our principal results, Theorem 1 and Theorem 2. In preparation for the proofs we first present some important auxiliary results. We denote by $\gamma$ a generic vector of independent standard normal variables, whose dimension will be clear from context. A central role in this paper is played by the Gaussian average $G(Y)$ of a set $Y \subseteq \mathbb{R}^n$, defined as
$$ G(Y) = \mathbb{E} \sup_{y \in Y}\, \langle \gamma, y \rangle = \mathbb{E} \sup_{y \in Y} \sum_i \gamma_i\, y_i. $$
The reader who is concerned about the measurability of the random variable on the right hand side should replace $Y$ by a countable dense subset of $Y$, with similar adjustments wherever Gaussian averages occur. Rademacher averages, where the $\gamma_i$ are replaced by uniform $\{-1, 1\}$-distributed variables, are somewhat more popular in the literature. We use Gaussian averages instead because in most cases they are just as easy to bound, and they possess special properties (Theorem 10 and Theorem 12 below) which we need in our analysis.
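The invariance of the similarity measure in Equation (12) under column permutations and sign flips is easy to verify (a small sketch assuming NumPy and orthonormal dictionaries, not the authors' code):

```python
import numpy as np

def dictionary_similarity(D_hat, D):
    """s(D_hat, D): nuclear norm of the K x K overlap D^T D_hat, normalized by K."""
    K = D.shape[1]
    return np.linalg.norm(D.T @ D_hat, ord='nuc') / K

rng = np.random.default_rng(0)
d, K = 10, 3
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
D = Q[:, :K]                                  # an orthonormal ground truth dictionary

# Permuting columns and flipping a sign leaves the similarity at 1
D_equiv = D[:, [2, 0, 1]] * np.array([1.0, -1.0, 1.0])
print(dictionary_similarity(D_equiv, D))      # 1.0 (up to rounding)

# Atoms taken from the orthogonal complement score 0
print(dictionary_similarity(Q[:, K:2 * K], D))
```

The overlap $D^\top \hat D$ of two equivalent orthonormal dictionaries is a signed permutation matrix, whose $K$ singular values are all 1, which is why the normalization by $K$ yields exactly 1.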
The first result is a standard tool to prove uniform bounds on the estimation error in terms of Gaussian averages (Bartlett and Mendelson, 2002).

Theorem 8 Let $F$ be a real-valued function class on a space $\mathcal{X}$, let $X = (X_1, \dots, X_n)$ be a vector of independent random variables, and let $X'$ be iid to $X$. Then
(i) $\displaystyle \mathbb{E}_X \sup_{f \in F} \Big( \mathbb{E}\, f(X') - \frac{1}{n} \sum_i f(X_i) \Big) \le \frac{\sqrt{2\pi}}{n}\, \mathbb{E}_X\, G(F(X))$;
(ii) if the members of $F$ have values in $[0, 1]$, then with probability greater than $1 - \delta$ in $X$, for all $f \in F$,
$$ \mathbb{E}\, f(X') - \frac{1}{n} \sum_i f(X_i) \le \frac{\sqrt{2\pi}}{n}\, G(F(X)) + \sqrt{\frac{9 \ln(2/\delta)}{2n}}. $$

The following theorem is a vector-valued version of the above, useful for bounds on the task-averaged estimation error (Ando and Zhang, 2005; Maurer, 2006b).

Theorem 9 Let $F$ be a class of functions $f : \mathcal{X} \to [0, 1]^T$, let $\mu_1, \dots, \mu_T$ be probability measures on $\mathcal{X}$, and let $\mathbf{X} = (X_1, \dots, X_T) \sim \prod_{t=1}^{T} (\mu_t)^n$, where $X_t = (X_{t1}, \dots, X_{tn})$. Then with probability greater than $1 - \delta$ in $\mathbf{X}$, for all $f \in F$,
$$ \frac{1}{T} \sum_t \mathbb{E}_{X \sim \mu_t} [f_t(X)] - \frac{1}{nT} \sum_{ti} f_t(X_{ti}) \le \frac{\sqrt{2\pi}}{nT}\, G(Y) + \sqrt{\frac{9 \ln(2/\delta)}{2nT}}, $$
where $Y \subseteq \mathbb{R}^{Tn}$ is the random set defined by $Y = \{ (f_t(X_{ti})) : f \in F \}$.

The previous two theorems replace the problem of proving uniform bounds by the problem of bounding Gaussian averages. One key result in the latter direction is known as Slepian's Lemma (Slepian, 1962; Ledoux and Talagrand, 1991).

Theorem 10 Let $\Omega$ and $\Xi$ be mean zero, separable Gaussian processes indexed by a common set $S$, such that $\mathbb{E}(\Omega_{s_1} - \Omega_{s_2})^2 \le \mathbb{E}(\Xi_{s_1} - \Xi_{s_2})^2$ for all $s_1, s_2 \in S$. Then $\mathbb{E} \sup_{s \in S} \Omega_s \le \mathbb{E} \sup_{s \in S} \Xi_s$.

The following corollary is the key to our bound for LTL.

Corollary 11 Let $Y \subseteq \mathbb{R}^n$ and let $\phi : Y \to \mathbb{R}^m$ be (Euclidean) Lipschitz with Lipschitz constant $L$. Then $G(\phi(Y)) \le L\, G(Y)$.

Proof Define two Gaussian processes indexed by $Y$ by
$$ \Omega_y = \sum_{k=1}^{m} \gamma_k\, \phi(y)_k \quad \text{and} \quad \Xi_y = L \sum_{i=1}^{n} \gamma_i'\, y_i, $$
with independent $\gamma_k$ and $\gamma_i'$. Then for any $y, y' \in Y$,
$$ \mathbb{E}(\Omega_y - \Omega_{y'})^2 = \| \phi(y) - \phi(y') \|^2 \le L^2\, \| y - y' \|^2 = \mathbb{E}(\Xi_y - \Xi_{y'})^2, $$
so that, by Slepian's Lemma, $G(\phi(Y)) = \mathbb{E} \sup_{y \in Y} \Omega_y \le \mathbb{E} \sup_{y \in Y} \Xi_y = L\, G(Y)$.

In many applications this is used with $n = m$ and $\phi$ defined by $\phi(y_1, \dots, y_n) = (\phi_1(y_1), \dots, \phi_n(y_n))$, where the real functions $\phi_1, \dots, \phi_n$ have Lipschitz constant $L$.
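For finite sets, the Gaussian average $G(Y)$ can be estimated directly by Monte Carlo (a sanity-check sketch assuming NumPy, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_average(Y, draws=20_000):
    """Monte Carlo estimate of G(Y) = E sup_{y in Y} <gamma, y> for a finite set Y (rows)."""
    gamma = rng.standard_normal((draws, Y.shape[1]))
    return (gamma @ Y.T).max(axis=1).mean()

# Sanity check: for Y = {+-e_1, ..., +-e_n}, G(Y) = E max_i |gamma_i|,
# which is of order sqrt(2 ln n)
n = 100
Y = np.vstack([np.eye(n), -np.eye(n)])
estimate = gaussian_average(Y)
print(estimate, np.sqrt(2 * np.log(n)))
```

The same estimator also illustrates the contraction lemma: applying a 1-Lipschitz map coordinatewise to the rows of `Y` can only decrease the estimate, up to Monte Carlo error.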
At one point we will need a generalization of the above corollary which allows $\phi$ to be selected from an entire class of Lipschitz functions. We will use the following result, taken from Maurer (2014). It plays an important role in the proof of Theorem 13 below.

Theorem 12 Let $Y \subseteq \mathbb{R}^n$ have (Euclidean) diameter $D(Y)$, and let $\mathcal{F}$ be a class of functions $f : Y \to \mathbb{R}^m$, all of which have Lipschitz constant at most $L(\mathcal{F})$. Then for any $y_0 \in Y$,
$$ G(\mathcal{F}(Y)) \le c_1\, L(\mathcal{F})\, G(Y) + c_2\, D(Y)\, Q(\mathcal{F}) + G(\mathcal{F}(y_0)), $$
where $c_1$ and $c_2$ are universal constants and
$$ Q(\mathcal{F}) = \sup_{y, y' \in Y,\; y \ne y'} \frac{1}{\| y - y' \|}\, \mathbb{E} \sup_{f \in \mathcal{F}}\, \langle \gamma, f(y) - f(y') \rangle. $$

Note that the result allows us to minimize the right hand side in $y_0$. Analogs of Theorem 10 and Theorem 12 are not available for Rademacher averages; this is the reason why we use the slightly more exotic Gaussian averages.

4.2 Proof of the Excess Risk Bound for the Average Risk

We first establish the following uniform bound. It is of some interest in its own right, in particular since the problem (1) is often non-convex, so that the excess risk bound may not be meaningful in practice. Recall the definition of $Q$ given in Equation (4).

Theorem 13 Let $\mu_1, \dots, \mu_T$ be probability measures on $\mathcal{Z}$, let $Z_{t1}, \dots, Z_{tn}$ be i.i.d. from $\mu_t$ for $t = 1, \dots, T$, and let $\delta \in (0, 1)$. With probability at least $1 - \delta$ in the draw of the multisample $\bar{\mathbf{Z}}$, it holds for every $h \in H$ and every $f_1, \dots, f_T \in F$ that
$$ \mathcal{E}_{avg}(h, f_1, \dots, f_T) - \frac{1}{nT} \sum_{ti} \ell(f_t(h(X_{ti})), Y_{ti}) \le c_1 \frac{L\, G(H(\bar{\mathbf{X}}))}{nT} + c_2 \frac{Q\, \sup_{h \in H} \| h(\bar{\mathbf{X}}) \|}{n \sqrt{T}} + \sqrt{\frac{9 \ln(2/\delta)}{2nT}}, $$
where $c_1$ and $c_2$ are universal constants.

Proof By Theorem 9, with probability at least $1 - \delta$ in $\bar{\mathbf{Z}}$, for all $h \in H$ and all $f_1, \dots, f_T \in F$, we have that
$$ \mathcal{E}_{avg}(h, f_1, \dots, f_T) - \frac{1}{nT} \sum_{ti} \ell(f_t(h(X_{ti})), Y_{ti}) \le \frac{\sqrt{2\pi}}{nT}\, G(S) + \sqrt{\frac{9 \ln(2/\delta)}{2nT}}, \qquad (13) $$
where $S = \{ (\ell(f_t(h(X_{ti})), Y_{ti})) : f \in F^T,\, h \in H \} \subseteq \mathbb{R}^{Tn}$. By the Lipschitz property of the loss function $\ell$ and the contraction lemma, Corollary 11 (recall the remark which follows its proof), we have $G(S) \le G(S')$, where $S' = \{ (f_t(h(X_{ti}))) : f \in F^T,\, h \in H \} \subseteq \mathbb{R}^{Tn}$.
Recall that $H(\bar{\mathbf{X}}) \subseteq \mathbb{R}^{KTn}$ is defined by $H(\bar{\mathbf{X}}) = \{ (h_k(X_{ti})) : h \in H \}$, and define a class of functions $\mathcal{F} : \mathbb{R}^{KTn} \to \mathbb{R}^{Tn}$ by
$$ \mathcal{F} = \big\{ y \in \mathbb{R}^{KTn} \mapsto (f_t(y_{ti})) : (f_1, \dots, f_T) \in F^T \big\}. $$
Then $S' = \mathcal{F}(H(\bar{\mathbf{X}}))$, and by Theorem 12, for universal constants $c_1'$ and $c_2'$,
$$ G(S') \le c_1'\, L(\mathcal{F})\, G(H(\bar{\mathbf{X}})) + c_2'\, D(H(\bar{\mathbf{X}}))\, Q(\mathcal{F}) + \min_{y_0} G(\mathcal{F}(y_0)). \qquad (14) $$
We now proceed by bounding the individual terms on the right hand side. Let $y, y' \in \mathbb{R}^{KTn}$, where $y = (y_{ti})$ with $y_{ti} \in \mathbb{R}^K$, and $y' = (y'_{ti})$ with $y'_{ti} \in \mathbb{R}^K$. Then for $f = (f_1, \dots, f_T) \in F^T$,
$$ \| f(y) - f(y') \|^2 = \sum_{ti} \big( f_t(y_{ti}) - f_t(y'_{ti}) \big)^2 \le L^2 \sum_{ti} \| y_{ti} - y'_{ti} \|^2 = L^2\, \| y - y' \|^2, $$
so that $L(\mathcal{F}) \le L$. Also
$$ \mathbb{E} \sup_{g \in \mathcal{F}}\, \langle \gamma, g(y) - g(y') \rangle = \mathbb{E} \sup_{(f_1, \dots, f_T) \in F^T} \sum_{ti} \gamma_{ti} \big( f_t(y_{ti}) - f_t(y'_{ti}) \big) = \sum_t \mathbb{E} \sup_{f \in F} \sum_i \gamma_i \big( f(y_{ti}) - f(y'_{ti}) \big) \le \sum_t Q \Big( \sum_i \| y_{ti} - y'_{ti} \|^2 \Big)^{1/2} \le \sqrt{T}\, Q\, \| y - y' \|, $$
whence $Q(\mathcal{F}) \le \sqrt{T}\, Q$. Finally we take $y_0 = 0$, and the last term in (14) vanishes, since $f(0) = 0$ for all $f \in F$. Substituting in (14) and using $G(S) \le G(S')$ we arrive at
$$ G(S) \le c_1'\, L\, G(H(\bar{\mathbf{X}})) + c_2'\, \sqrt{T}\, Q\, D(H(\bar{\mathbf{X}})). $$
Bounding $D(H(\bar{\mathbf{X}})) \le 2 \sup_h \| h(\bar{\mathbf{X}}) \|$ and substituting in (13) gives the result.

Proof of Theorem 1 Let $h^*$ and $f_1^*, \dots, f_T^*$ be the minimizers in the definition of $\mathcal{E}^*_{avg}$. Then
$$ \mathcal{E}_{avg}(\hat h, \hat f_1, \dots, \hat f_T) - \mathcal{E}^*_{avg} = \Big[ \mathcal{E}_{avg}(\hat h, \hat f_1, \dots, \hat f_T) - \frac{1}{nT} \sum_{ti} \ell(\hat f_t(\hat h(X_{ti})), Y_{ti}) \Big] + \Big[ \frac{1}{nT} \sum_{ti} \ell(\hat f_t(\hat h(X_{ti})), Y_{ti}) - \frac{1}{nT} \sum_{ti} \ell(f_t^*(h^*(X_{ti})), Y_{ti}) \Big] + \Big[ \frac{1}{nT} \sum_{ti} \ell(f_t^*(h^*(X_{ti})), Y_{ti}) - \frac{1}{T} \sum_t \mathbb{E}_{(X,Y) \sim \mu_t}\, \ell(f_t^*(h^*(X)), Y) \Big]. $$
The last term involves only the $nT$ random variables $\ell(f_t^*(h^*(X_{ti})), Y_{ti})$ with values in $[0, 1]$; it can be bounded with probability $1 - \delta/2$ by $\sqrt{\ln(2/\delta)/(2Tn)}$ using Hoeffding's inequality. The middle term is non-positive, since $\hat h, \hat f_1, \dots, \hat f_T$ are the corresponding empirical minimizers. There remains the first term, which we bound by
$$ \sup_{h \in H,\; f_1, \dots, f_T \in F} \Big( \mathcal{E}_{avg}(h, f_1, \dots, f_T) - \frac{1}{nT} \sum_{ti} \ell(f_t(h(X_{ti})), Y_{ti}) \Big), $$
and appeal to Theorem 13 to bound the supremum. A union bound then completes the proof.
4.3 Proof of the Excess Risk Bound for Learning-to-learn

Recall the definition of the algorithm parametrized by $h \in H$,
$$ a(h)_Z = \arg\min_{f \in F} \frac{1}{n} \sum_i \ell(f(h(X_i)), Y_i) \quad \text{for } Z \in \mathcal{Z}^n, $$
and the associated minimum $m(h)_Z$. Also recall that
$$ \mathcal{E}_\eta(h) = \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{Z \sim \mu^n}\, \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_Z(X), Y), $$
and the two measures $\mu_\eta$ and $\rho_\eta$ induced by the environment $\eta$ and defined by $\mu_\eta(A) = \mathbb{E}_{\mu \sim \eta}\, \mu(A)$ for $A \subseteq \mathcal{Z}$ and $\rho_\eta(A) = \mathbb{E}_{\mu \sim \eta}\, \mu^n(A)$ for $A \subseteq \mathcal{Z}^n$. Also recall the definition of $Q'$ given in Equation (5). Again we begin with a uniform bound.

Theorem 14 Let $\delta \in (0, 1)$.
(i) With probability at least $1 - \delta$ in $\mathbf{Z} \sim \rho_\eta^T$ it holds for every $h \in H$ that
$$ \mathcal{E}_\eta(h) - \frac{1}{T} \sum_t m(h)_{Z_t} \le \frac{\sqrt{2\pi}\, L\, G(H(\bar{\mathbf{x}}))}{\sqrt{n}\, T} + \sqrt{2\pi}\, Q' \sup_{h \in H} \sqrt{\frac{\mathbb{E}_{(X,Y) \sim \mu_\eta} \| h(X) \|^2}{n}} + \sqrt{\frac{8 \ln(2/\delta)}{T}}. $$
(ii) With probability at least $1 - \delta$ in $\mathbf{Z} \sim \rho_\eta^T$ it holds for every $h \in H$ that
$$ \mathcal{E}_\eta(h) - \frac{1}{T} \sum_t m(h)_{Z_t} \le \frac{\sqrt{2\pi}\, L\, G(H(\bar{\mathbf{x}}))}{\sqrt{n}\, T} + \frac{\sqrt{2\pi}\, Q' \sum_t \sup_{h \in H} \| h(\mathbf{X}_t) \|}{nT} + \sqrt{\frac{16 \ln(4/\delta)}{T}}. $$

Proof The key to the proof is the decomposition
$$ \sup_{h \in H} \Big( \mathcal{E}_\eta(h) - \frac{1}{T} \sum_t m(h)_{Z_t} \Big) \le \sup_{h \in H}\, \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{Z \sim \mu^n} \big( \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_Z(X), Y) - m(h)_Z \big) + \sup_{h \in H} \Big( \mathbb{E}_{Z \sim \rho_\eta} [m(h)_Z] - \frac{1}{T} \sum_t m(h)_{Z_t} \Big). \qquad (15) $$
In turn we bound both terms on the right hand side. A bound on the second term means that we can predict the empirical risk on the data of a future task, uniformly in $h$; a bound on the first term means that we can predict the true risk from the empirical risk on the future task. We first bound the second term in the right hand side of (15), and use Theorem 8 (ii) on the class of functions $\{ z \in \mathcal{Z}^n \mapsto m(h)_z : h \in H \}$ to get, with probability at least $1 - \delta$ in $\mathbf{Z} \sim \rho_\eta^T$, that
$$ \mathbb{E}_{Z \sim \rho_\eta} [m(h)_Z] - \frac{1}{T} \sum_{t=1}^{T} m(h)_{Z_t} \le \frac{\sqrt{2\pi}}{T}\, G(S) + \sqrt{\frac{9 \ln(2/\delta)}{2T}}, $$
where $S$ is the subset of $\mathbb{R}^T$ defined by $S = \{ (m(h)_{Z_1}, \dots, m(h)_{Z_T}) : h \in H \}$. We bound the Gaussian average of $S$ using Slepian's inequality (Theorem 10). Define two Gaussian processes indexed by $H$,
$$ \Omega_h = \sum_t \gamma_t\, m(h)_{z_t} \quad \text{and} \quad \Xi_h = \frac{L}{\sqrt{n}} \sum_{kti} \gamma_{kti}\, h_k(x_{ti}). $$
Now for any $z \in \mathcal{Z}^n$ and representations $h, h' \in H$,
$$ \Big( \min_{f \in F} \frac{1}{n} \sum_i \ell(f(h(x_i)), y_i) - \min_{f \in F} \frac{1}{n} \sum_i \ell(f(h'(x_i)), y_i) \Big)^2 \le \sup_{f \in F} \Big( \frac{1}{n} \sum_i \big( \ell(f(h(x_i)), y_i) - \ell(f(h'(x_i)), y_i) \big) \Big)^2 \le \frac{1}{n} \sum_i \sup_{f \in F} \big( \ell(f(h(x_i)), y_i) - \ell(f(h'(x_i)), y_i) \big)^2 \le \frac{L^2}{n} \sum_i \sum_k \big( h_k(x_i) - h'_k(x_i) \big)^2, $$
where in the last step we used the Lipschitz properties of the loss function $\ell$ and of the members of the class $F$. It follows that
$$ \mathbb{E}(\Omega_h - \Omega_{h'})^2 = \sum_t \big( m(h)_{z_t} - m(h')_{z_t} \big)^2 \le \frac{L^2}{n} \sum_{kti} \big( h_k(x_{ti}) - h'_k(x_{ti}) \big)^2 = \mathbb{E}(\Xi_h - \Xi_{h'})^2, $$
so by Theorem 10,
$$ G(S) = \mathbb{E} \sup_h \Omega_h \le \mathbb{E} \sup_h \Xi_h = \frac{L}{\sqrt{n}}\, G(H(\bar{\mathbf{x}})). $$
The second term in the right hand side of (15) is thus bounded, with probability $1 - \delta$, by
$$ \frac{\sqrt{2\pi}\, L\, G(H(\bar{\mathbf{x}}))}{\sqrt{n}\, T} + \sqrt{\frac{9 \ln(2/\delta)}{2T}}. \qquad (16) $$
We now bound the first term on the right hand side of (15) by
$$ \sup_h \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{Z \sim \mu^n} \big( \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_Z(X), Y) - m(h)_Z \big) \le \sup_h \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{Z \sim \mu^n} \sup_{f \in F} \Big( \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h(X)), Y) - \frac{1}{n} \sum_i \ell(f(h(X_i)), Y_i) \Big). $$
For $Z = (\mathbf{X}, \mathbf{Y}) \in \mathcal{Z}^n$ and $h \in H$, denote by $\ell(F(h(\mathbf{X})), \mathbf{Y})$ the subset of $\mathbb{R}^n$ defined by $\ell(F(h(\mathbf{X})), \mathbf{Y}) = \{ (\ell(f(h(X_i)), Y_i)) : f \in F \}$. Using Theorem 8 (i) and the contraction lemma, Corollary 11, we can bound the last expression above by
$$ \frac{\sqrt{2\pi}}{n} \sup_h \mathbb{E}_{Z \sim \rho_\eta}\, G(\ell(F(h(\mathbf{X})), \mathbf{Y})) \le \frac{\sqrt{2\pi}}{n} \sup_h \mathbb{E}_{Z \sim \rho_\eta}\, G(F(h(\mathbf{X}))) \le \frac{\sqrt{2\pi}}{n}\, Q' \sup_h \mathbb{E}_{Z \sim \rho_\eta}\, \| h(\mathbf{X}) \|, $$
using Hölder's inequality and the definition of $Q'$ in the last step. But, using Jensen's inequality,
$$ \mathbb{E}_{Z \sim \rho_\eta}\, \| h(\mathbf{X}) \| \le \sqrt{\mathbb{E} \sum_i \| h(X_i) \|^2} = \sqrt{n\, \mathbb{E}_{(X,Y) \sim \mu_\eta}\, \| h(X) \|^2}, $$
since $Z \sim \rho_\eta$ is iid. Inserting this in the previous chain of inequalities and combining with (16) gives the first part of the theorem.
To obtain the data dependent bound we use the fact that, with probability at least $1 - \delta/4$,
$$ \frac{1}{n} \sup_h \mathbb{E}_{Z \sim \rho_\eta}\, G(\ell(F(h(\mathbf{X})), \mathbf{Y})) \le \frac{1}{n}\, \mathbb{E}_{Z \sim \rho_\eta} \sup_h G(\ell(F(h(\mathbf{X})), \mathbf{Y})) \le \frac{1}{nT} \sum_t \sup_h G(\ell(F(h(\mathbf{X}_t)), \mathbf{Y}_t)) + \sqrt{\frac{\ln(4/\delta)}{2T}}. \qquad (18) $$
The last inequality follows from Hoeffding's inequality, since for any $h \in H$ and any sample $Z \in \mathcal{Z}^n$,
$$ 0 \le \frac{1}{n}\, G(\ell(F(h(\mathbf{X})), \mathbf{Y})) = \frac{1}{n}\, \mathbb{E} \sup_f \sum_i \gamma_i\, \ell(f(h(X_i)), Y_i) \le \frac{1}{n}\, \mathbb{E}\|\gamma\|\, \sup_f \Big( \sum_i \ell(f(h(X_i)), Y_i)^2 \Big)^{1/2} \le 1, $$
where we have also used the fact that the loss function $\ell$ has range in $[0, 1]$. Bounding $G(\ell(F(h(\mathbf{X}_t)), \mathbf{Y}_t)) \le Q'\, \| h(\mathbf{X}_t) \|$ as above, and combining (18) and (16) in (15) with a union bound, gives the second inequality of the theorem.

Remark: In the proof of the fully data-dependent part, the bound on $\sup_h \mathbb{E}_{Z \sim \rho_\eta}\, G(\ell(F(h(\mathbf{X})), \mathbf{Y}))$ is very crude. Instead we could have again invoked Theorem 8 to get a better bound, at the price of a more complicated expression involving nested Gaussian averages. We have chosen the simpler path for greater clarity.

Proof of Theorem 2 Recall that
$$ \mathcal{E}^*_\eta = \min_{h \in H} \mathbb{E}_{\mu \sim \eta} \min_{f \in F} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h(X)), Y). $$
We denote by $h^*$ the minimizer in $H$ occurring in the definition of $\mathcal{E}^*_\eta$. We have the following decomposition:
$$ \mathcal{E}_\eta(\hat h) - \mathcal{E}^*_\eta = \Big[ \mathcal{E}_\eta(\hat h) - \frac{1}{T} \sum_t m(\hat h)_{Z_t} \Big] \qquad (19) $$
$$ + \Big[ \frac{1}{T} \sum_t m(\hat h)_{Z_t} - \frac{1}{T} \sum_t m(h^*)_{Z_t} \Big] \qquad (20) $$
$$ + \Big[ \frac{1}{T} \sum_t m(h^*)_{Z_t} - \mathbb{E}_{Z \sim \rho_\eta} [m(h^*)_Z] \Big] \qquad (21) $$
$$ + \Big[ \mathbb{E}_{Z \sim \rho_\eta} [m(h^*)_Z] - \mathbb{E}_{\mu \sim \eta} \min_{f \in F} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h^*(X)), Y) \Big]. \qquad (22) $$
For a fixed distribution $\mu$, let $f^*_\mu$ be the minimizer in $\min_{f \in F} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h^*(X)), Y)$. By the definition of $m(h^*)_Z$ we have, for every $\mu$ drawn from $\eta$,
$$ \mathbb{E}_{Z \sim \mu^n} [m(h^*)_Z] = \mathbb{E}_{Z \sim \mu^n} \min_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \ell(f(h^*(X_i)), Y_i) \le \mathbb{E}_{Z \sim \mu^n} \frac{1}{n} \sum_{i=1}^{n} \ell(f^*_\mu(h^*(X_i)), Y_i) = \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f^*_\mu(h^*(X)), Y), $$
since $Z$ is iid. The term in (22) is therefore non-positive. The term in (21) involves the deviation of the empirical and true averages of the $T$ iid $[0, 1]$-valued random variables $m(h^*)_{Z_t}$; with Hoeffding's inequality it can be bounded, with probability at least $1 - \delta/8$, by $\sqrt{\ln(8/\delta)/(2T)}$. The term (20) is non-positive by the definition of $\hat h$. There remains the term (19), which we bound by Theorem 14.
The result now follows by combining this bound with the bound on (21) in a union bound, together with some numerical simplifications.

5. Conclusion

Several works have advocated the sharing of features among tasks as a means of learning representations which capture properties invariant across the tasks, and which can therefore be highly beneficial. In this paper we studied the statistical properties of a general MTRL method, presenting bounds on its learning performance in both settings of MTL and LTL. Our work provides a rigorous justification of the benefit offered by MTRL over learning the tasks independently. To give the paper a clear focus we have illustrated this advantage in the case of linear feature learning. Our results, however, apply to fairly general classes of representations H and specifications F, and similar conclusions may be derived for other nonlinear MTRL methods. We conclude by sketching specific cases which deserve a separate study.

Deep networks. As we noted, our bounds directly apply to multilayer, deep architectures obtained by iteratively composing linear transformations with nonlinear activation functions, such as the rectified linear unit or the sigmoid function. The representations learned by such methods tend to be specific, in that only a subset of components is active on each given input, which makes our bounds particularly attractive for further analysis.

Sparse coding. Another interesting case of our framework is obtained when the specialized class F consists of sparse linear predictors. This case has been considered in Maurer et al. (2013) and Ruvolo and Eaton (2014) when the representation class consists of linear functions. Different choices of sparse classes F could lead to interesting learning methods.

Representations in RKHS. As we already noted, the feature maps forming the class H could be vector-valued functions in a reproducing kernel Hilbert space.
Although kernel methods are more difficult to apply to the large datasets required for MTRL, and need additional approximation steps, the representations learned using, for example, Gaussian kernels would be very specific and well suited to our bounds.

Acknowledgments

We thank the reviewers for their helpful comments.

References

R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.

M. Anthony and P. Bartlett. Learning in Neural Networks: Theoretical Foundations. Cambridge University Press, 1999.

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

S. Ben-David, N. Eiron, and H. U. Simon. Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3:441–461, 2002.

S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. Proc. 16th Annual Conference on Computational Learning Theory (COLT), pages 567–580, 2003.

R. Bhatia. Matrix Analysis. Springer, 1997.

R. Caruana. Multi-task learning. Machine Learning, 28(1):41–75, 1997.

G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2597–2630, 2010.

S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22:60–65, 2003.

A. Ehrenfeucht, D. Haussler, M. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–251, 1989.

T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), 2014.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1–50, 2002.

I. Kuzborskij and F. Orabona. Stability and hypothesis transfer learning. Proc. International Conference on Machine Learning (ICML), 2013.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin, 1991.

P. M. Long. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.

K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Oracle inequalities and optimal inference under group sparsity. Annals of Statistics, 39(4):2164–2204, 2011.

A. Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7:117–139, 2006.

A. Maurer. The Rademacher complexity of linear transformation classes. Proc. 19th Annual Conference on Learning Theory (COLT), pages 65–78, 2006.

A. Maurer. A chain rule for the expected suprema of Gaussian processes. Proc. 25th International Conference on Algorithmic Learning Theory, 2014.

A. Maurer and M. Pontil. A uniform lower error bound for half-space learning. Proc. 19th International Conference on Algorithmic Learning Theory, pages 70–78, 2008.

A. Maurer and M. Pontil. Excess risk bounds for multitask learning with trace norm regularization. Proc. 26th Annual Conference on Computational Learning Theory (COLT), 2013.

A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. Proc. 30th International Conference on Machine Learning, JMLR W&CP, 28(2):343–351, 2013.

A. Maurer, M. Pontil, and B. Romera-Paredes. An inequality with applications to structured sparsity and multitask dictionary learning. Proc. 27th Annual Conference on Computational Learning Theory (COLT), 2014.

S. Mendelson and A. Pajor. On singular values of matrices with independent rows. Bernoulli, 12(5):761–773, 2006.

A. Pentina and C. H. Lampert. A PAC-Bayesian bound for lifelong learning. Proc. International Conference on Machine Learning (ICML), 2014.

P. Ruvolo and E. Eaton. Online multi-task learning via sparse dictionary optimization. Proc. 28th AAAI Conference on Artificial Intelligence, 2014.

D. Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41:463–501, 1962.

S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.

C. Widmer, M. Kloft, X. Lou, and G. Rätsch. Regularization-based multitask learning: With applications to genome biology and biomedical imaging. German Journal on Artificial Intelligence, 2013.

A. W. Van Der Vaart and J. A. Wellner. Weak Convergence of Empirical Processes. Springer, 1996.