# A PAC-Bayesian Bound for Lifelong Learning

Anastasia Pentina (APENTINA@IST.AC.AT), IST Austria (Institute of Science and Technology Austria), Am Campus 1, 3400 Klosterneuburg, Austria
Christoph H. Lampert (CHL@IST.AC.AT), IST Austria (Institute of Science and Technology Austria), Am Campus 1, 3400 Klosterneuburg, Austria

Abstract

Transfer learning has received a lot of attention in the machine learning community over the last years, and several effective algorithms have been developed. However, relatively little is known about their theoretical properties, especially in the setting of lifelong learning, where the goal is to transfer information to tasks for which no data have been observed so far. In this work we study lifelong learning from a theoretical perspective. Our main result is a PAC-Bayesian generalization bound that offers a unified view on existing paradigms for transfer learning, such as the transfer of parameters or the transfer of low-dimensional representations. We also use the bound to derive two principled lifelong learning algorithms, and we show that these yield results comparable with existing methods.

1. Introduction

Today, many problems can be solved equally well or better by machine learning algorithms than by humans. However, these algorithms typically require large amounts of training data to achieve acceptable results, whereas humans are able to learn new concepts from just a few examples. Presumably this difference comes from the fact that most machine learning systems are trained from scratch for each task at hand, whereas humans exploit context and knowledge they acquired previously while solving other tasks. This observation motivates research on transfer learning: how can information from previously learned tasks be used for solving new tasks?

Several scenarios of how this question can be formalized have been identified. Here, we discuss those that are most relevant in the context of supervised learning. As a general setup, one assumes that one or more learning tasks have been observed, typically in the form of labeled training sets. The methods then differ in how this information is meant to be used. In the multitask setting (Caruana, 1997), the goal is simply to perform well on all of the tasks. In domain adaptation (Bridle & Cox, 1990), the goal is to perform well on a new task for which only unlabeled or very few labeled data samples are observed. Finally, in lifelong learning or learning to learn (Thrun & Mitchell, 1995), the goal of the learner is to perform well on future tasks, for which so far no data has been observed. In this work we focus on the third setting.

Lifelong Learning. For lifelong learning to make sense, one must assume a relation between the observed tasks and the future tasks. To formalize this, Baxter (2000) introduced the notion of a task environment as a set of possible tasks that might need to be solved at some time. The observed tasks are sampled randomly from the environment according to an unknown task distribution. In this setting, Baxter also provided the first theoretical guarantees by proving generalization bounds in the framework of VC theory (Vapnik, 1998). After this work, however, progress on the theoretical understanding of lifelong machine learning slowed down.
Many algorithms for transfer learning have been developed and found empirically to work well in many cases. However, except for a few cases, such as (Maurer, 2009; Maurer et al., 2013), their theoretical justifications are so far not well understood. In this work, we aim at making progress on the theoretical justification of lifelong learning. In Section 2 we prove a general PAC-Bayesian generalization bound for lifelong learning that allows quantifying the relation between the expected loss on a future learning task and the average loss on the observed tasks. In contrast to Baxter's results, our bound has the advantage that its value depends on the representation of the data and on the learning algorithm used to solve the tasks. This makes it possible to interpret the bound as a quality measure of the transferred information. Therefore, by optimizing this measure we obtain principled algorithms for lifelong learning.

In Sections 2.1 and 2.2 we demonstrate this process in two cases. Assuming that the solutions to all tasks can be represented by a single parameter vector plus small task-specific perturbations (Evgeniou & Pontil, 2004), we obtain an algorithm that resembles previously proposed methods for regularizing the weight vectors of future tasks using linear combinations of weight vectors of previous tasks, such as (Yang et al., 2007; Aytar & Zisserman, 2011). An alternative assumption is that the solution vectors of the tasks can differ significantly, but that they all lie in a common low-dimensional feature subspace. In this setting, our bound provides an algorithm in which the observed tasks are used to identify the most promising subspace of features, such that learning for future tasks needs to take place only within the reduced feature space. This procedure is related to existing methods for representation and dictionary learning, e.g. (Argyriou et al., 2008; Kumar & Daumé III, 2012), which have also been applied successfully in the lifelong learning setting (Ruvolo & Eaton, 2013). In the case of linear regression, Maurer (2009) used this assumption to prove a generalization bound in the PAC framework, relying on the concept of an environment of tasks from Baxter (2000) and on Rademacher complexity. Similar results were obtained in (Maurer et al., 2013) for the case of sparsity constraints.

The PAC-Bayesian framework. For the convenience of readers who are not familiar with the PAC-Bayesian framework, we introduce the most relevant concepts from the literature here. For more details, see (Langford, 2005; Seeger, 2003; Catoni, 2007). PAC-Bayesian theory studies the properties of randomized predictors, called Gibbs predictors. Formally, let $\mathcal{X}$ be an input set, $\mathcal{Y}$ an output set and $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}\}$ a set of prediction functions (hypotheses). For any probability distribution $P$ over $\mathcal{H}$, the Gibbs predictor associated with $P$ is the stochastic predictor that, for any $x \in \mathcal{X}$, randomly samples a hypothesis $h \sim P$ and then returns $h(x)$. Assume that we are given a set $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ of i.i.d. samples from an unknown probability distribution $D$ over $\mathcal{X} \times \mathcal{Y}$. For any loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$, let $\mathrm{er}(Q)$ denote the expected loss of the Gibbs classifier associated with $Q$, i.e. $\mathrm{er}(Q) = \mathbb{E}_{h \sim Q}\,\mathbb{E}_{(x,y) \sim D}\, \ell(h(x), y)$, and let $\widehat{\mathrm{er}}(Q)$ denote its expected empirical loss, i.e. $\widehat{\mathrm{er}}(Q) = \mathbb{E}_{h \sim Q}\, \frac{1}{m}\sum_{i=1}^{m} \ell(h(x_i), y_i)$.
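For intuition, the following is a minimal sketch (not code from the paper) of a Gibbs predictor and a Monte-Carlo estimate of its expected empirical loss, for a Gaussian posterior over linear classifiers. All function names and the toy data are illustrative assumptions.

```python
# Sketch of a Gibbs predictor: sample h ~ Q = N(mean_w, I), then predict with h.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_predict(x, mean_w, rng):
    """One stochastic prediction: draw a hypothesis and return sign(<h, x>)."""
    h = rng.normal(mean_w, 1.0)
    return np.sign(h @ x)

def empirical_gibbs_risk(X, y, mean_w, rng, n_samples=1000):
    """Monte-Carlo estimate of E_{h~Q} (1/m) sum_i [sign(<h, x_i>) != y_i]."""
    losses = []
    for _ in range(n_samples):
        h = rng.normal(mean_w, 1.0)              # one hypothesis per draw
        losses.append(np.mean(np.sign(X @ h) != y))
    return float(np.mean(losses))

# toy usage: 20 points in R^3 labeled by a ground-truth vector
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ w_true)
print(empirical_gibbs_risk(X, y, mean_w=0.5 * w_true, rng=rng))
```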
It is then possible to prove generalization bounds such as the following: with probability at least $1 - \delta$ (over the sampling of $S$) we have for all distributions $Q$ over $\mathcal{H}$ (McAllester, 1999)
$$\mathrm{er}(Q) \;\le\; \widehat{\mathrm{er}}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \log\frac{1}{\delta} + \log m + 2}{2m - 1}}, \qquad (1)$$
where $P$ is a reference or prior distribution over $\mathcal{H}$ that must be chosen before observing the samples $S$, and $\mathrm{KL}(Q\|P)$ is the Kullback-Leibler divergence, i.e. a measure of how different $Q$ is from $P$. As such, the bound resembles the typical trade-off in regularized risk minimization between the training loss and a regularizer (Vapnik, 1998).

Inequality (1) is uniform with respect to $Q$, so it holds regardless of which $Q$ we choose. In particular, we can choose it after seeing $S$, and for this reason $Q$ is typically referred to as the posterior distribution in this context. Choosing $Q$ such that it minimizes the right hand side of the bound, we obtain a Gibbs predictor that can be expected to be a good choice for the learning task at hand, since its expected loss is controlled by a hopefully small quantity. While inequality (1) holds regardless of the agreement between the data distribution and the prior distribution $P$, the value of its right hand side strongly depends on the choice of $P$. Therefore, one would prefer a prior that allows learning a posterior that is at the same time close to the prior ($\mathrm{KL}(Q\|P)$ is small) and shows good performance on the training set ($\widehat{\mathrm{er}}(Q)$ is small).
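As a small numerical illustration of this trade-off, the following sketch evaluates the right-hand side of (1) for the common case where both prior and posterior are unit-variance Gaussians over linear predictors, so that the KL term has the closed form $\frac{1}{2}\|w_Q - w_P\|^2$. The function name and the toy numbers are assumptions made for this sketch only.

```python
# Evaluate the right-hand side of bound (1) for Gaussian P = N(w_P, I), Q = N(w_Q, I).
import numpy as np

def pac_bayes_bound(emp_gibbs_risk, w_Q, w_P, m, delta=0.05):
    """emp. Gibbs risk + sqrt((KL(Q||P) + log(1/delta) + log m + 2) / (2m - 1))."""
    kl = 0.5 * float(np.sum((w_Q - w_P) ** 2))
    complexity = (kl + np.log(1.0 / delta) + np.log(m) + 2.0) / (2.0 * m - 1.0)
    return emp_gibbs_risk + np.sqrt(complexity)

# usage: a posterior far from the prior pays through the KL term,
# even if its empirical Gibbs risk is slightly lower
w_P = np.zeros(5)
print(pac_bayes_bound(0.10, w_Q=0.3 * np.ones(5), w_P=w_P, m=500))
print(pac_bayes_bound(0.08, w_Q=3.0 * np.ones(5), w_P=w_P, m=500))
```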
2. PAC-Bayesian Lifelong Learning

To develop a PAC-Bayesian theory of lifelong learning we adopt the concept of a task environment from Baxter (2000). We assume an unknown set of possible tasks $\mathbb{T}$, all of which share the same input space $\mathcal{X}$, output space $\mathcal{Y}$, hypothesis set $\mathcal{H}$ and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. The lifelong learning system (which we will call an agent) observes $n$ tasks $t_1, \dots, t_n$ that are sampled i.i.d. from $\mathbb{T}$ according to some unknown distribution over tasks. For each task $t_i$ the agent observes a training set $S_i = \{(x_{i1}, y_{i1}), \dots, (x_{im_i}, y_{im_i})\}$ that is sampled i.i.d. according to the task's unknown data distribution $D_i$. To solve individual tasks, the agent makes use of an arbitrary but fixed learning algorithm, i.e. a deterministic procedure that, given a training set $S$ and a form of prior knowledge $P$, outputs a posterior distribution $Q = Q(S, P)$ over $\mathcal{H}$. The agent makes predictions using the Gibbs predictor associated with $Q$. Staying within the PAC-Bayesian setting, we assume that the prior knowledge, $P$, is encoded in a probability distribution over $\mathcal{H}$. For concrete examples of the above setting see Sections 2.1 and 2.2.

The goal of the agent is to use the information contained in the observed tasks to identify prior knowledge that leads to as good performance as possible on new (so far unobserved) tasks from the same environment. This setting is strictly harder than multi-task learning or domain adaptation, since no data for the future tasks to be solved is available at the time the agent makes its decision. In particular, previously developed techniques for learning priors are not directly applicable: first, note that we cannot use ordinary generalization bounds, such as (1), to identify optimal priors, since they only hold uniformly in $Q$ if the prior is chosen independently from the training set. Catoni (2007) derived an expression for the overall best prior, i.e. the distribution resulting in the smallest possible bound value. However, it is generally of a non-parametric form and cannot be computed without full information about the data distribution. Parrado-Hernández et al. (2012) showed that priors can be learned by splitting the available training data into two parts, one for learning a prior and one for learning the predictor. This, however, requires training data for the task at hand, which is not available in the lifelong setting.

Our first contribution in this work is the insight that one should treat the prior $P$ itself as a random variable. Let $\mathcal{P}$ be an initial distribution over all possible priors, which we call the hyperprior in concordance with the Bayesian nomenclature. For learning a prior, the agent uses the observed tasks to adjust its original hyperprior into a hyperposterior distribution $\mathcal{Q}$ over the set of priors. This randomized setting allows us to follow a PAC-Bayesian path analogous to the classical results. We will obtain a bound that requires a fixed hyperprior $\mathcal{P}$ but that holds uniformly with respect to the hyperposterior $\mathcal{Q}$. The hyperposterior that minimizes the bound will provide us with the most promising distribution from which to obtain priors for future tasks.

Formally, the goal of the agent is to find $\mathcal{Q}$ that minimizes the expected loss $\mathrm{er}(Q_t)$ on a randomly sampled new task $t$ with training set $S_t$ and prior $P$ sampled from $\mathcal{Q}$. We write
$$\mathrm{er}(\mathcal{Q}) = \mathbb{E}_{(t, S_t)}\,\mathbb{E}_{P \sim \mathcal{Q}}\; \mathrm{er}\big(Q_t(S_t, P)\big), \qquad (2)$$
where $Q_t(S_t, P)$ is the posterior obtained by training the learning algorithm with prior $P$ and training sample $S_t$. We call this quantity the transfer risk. We cannot compute $\mathrm{er}(\mathcal{Q})$ because the distribution over the tasks and the tasks' data distributions are both unknown. However, we can approximate it by its empirical counterpart, based on the $n$ observed tasks,
$$\widehat{\mathrm{er}}(\mathcal{Q}) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{P \sim \mathcal{Q}}\; \widehat{\mathrm{er}}\big(Q_i(S_i, P)\big), \qquad (3)$$
which we call the empirical multi-task risk. Our main result is a theorem that bounds the difference between the two quantities defined above.

Theorem 1. For any $\delta > 0$ the following inequality holds with probability at least $1 - \delta$ (over the training samples $\{S_1, \dots, S_n\}$) for all hyperposterior distributions $\mathcal{Q}$:
$$\mathrm{er}(\mathcal{Q}) \;\le\; \widehat{\mathrm{er}}(\mathcal{Q}) + \frac{1}{\sqrt{n}}\,\mathrm{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{n\sqrt{\bar m}}\,\mathrm{KL}\big((\mathcal{Q}, Q^n)\,\|\,(\mathcal{P}, P^n)\big) + \frac{1}{8}\Big(\frac{1}{\sqrt{n}} + \frac{1}{\sqrt{\bar m}}\Big) - \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\bar m}}\Big)\log\frac{\delta}{2}, \qquad (4)$$
where $(\mathcal{Q}, Q^n) = \mathcal{Q}\,\prod_{i=1}^{n} Q_i$ denotes the distribution in which we first sample $P$ according to $\mathcal{Q}$ and then use it and the data $S_i$ to produce a posterior $Q_i$ for each task $t_i$; $(\mathcal{P}, P^n) = \mathcal{P}\,\prod_{i=1}^{n} P$ denotes the distribution in which we sample $P$ according to $\mathcal{P}$ and use it as a posterior for all tasks; and $\bar m = \big(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{m_i}\big)^{-1}$ is the harmonic mean of the sample sizes.

Proof. To prove Theorem 1 we introduce an intermediate quantity that can be seen as an expected multi-task risk,
$$\widetilde{\mathrm{er}}(\mathcal{Q}) = \mathbb{E}_{P \sim \mathcal{Q}}\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{h \sim Q_i}\,\mathbb{E}_{(x,y) \sim D_i}\, \ell(h(x), y). \qquad (5)$$
First we bound the uncertainty on the task-environment level by bounding the difference between the transfer risk, $\mathrm{er}(\mathcal{Q})$, and the expected multi-task risk, $\widetilde{\mathrm{er}}(\mathcal{Q})$. Then we bound the uncertainty within the observed tasks by bounding the difference between the expected multi-task risk, $\widetilde{\mathrm{er}}(\mathcal{Q})$, and its empirical approximation, $\widehat{\mathrm{er}}(\mathcal{Q})$. Our main tool in both cases is the following lemma.

Lemma 2. Let $f$ be a random variable taking values in $A$ and let $X_1, \dots, X_l$ be $l$ independent random variables with each $X_k$ distributed according to $\mu_k$ over the set $A_k$. For functions $g_k : A \times A_k \to [a_k, b_k]$, $k = 1, \dots, l$, let $\xi_k(f) = \mathbb{E}_{X_k \sim \mu_k}\, g_k(f, X_k)$ for any fixed value of $f$. Then for any fixed distribution $\pi$ on $A$ and any $\lambda, \delta > 0$ the following inequality holds with probability at least $1 - \delta$ (over sampling $X_1, \dots, X_l$) for all distributions $\rho$ over $A$:
$$\mathbb{E}_{f \sim \rho} \sum_{k=1}^{l} \xi_k(f) \;\le\; \mathbb{E}_{f \sim \rho} \sum_{k=1}^{l} g_k(f, X_k) + \frac{1}{\lambda}\Big(\mathrm{KL}(\rho\|\pi) + \frac{\lambda^2}{8}\sum_{k=1}^{l} (b_k - a_k)^2 - \log\delta\Big). \qquad (6)$$
For the proof of this lemma, see Appendix A.

In order to bound the difference between $\mathrm{er}(\mathcal{Q})$ and $\widetilde{\mathrm{er}}(\mathcal{Q})$ we treat each task $t$ with its corresponding training sample $S_t$ as a random variable and apply Lemma 2. Formally, we set $\rho = \mathcal{Q}$, $\pi = \mathcal{P}$, $X_k = (t_k, S_k)$, $l = n$, $f = P$ and $g_k(f, X_k) = \frac{1}{n}\,\mathbb{E}_{h \sim Q_k}\,\mathbb{E}_{(x,y) \sim D_k}\,\ell(h(x), y)$, and apply Lemma 2 with $\lambda = \sqrt{n}$. Since $a_k = 0$ and $b_k = \frac{1}{n}$, we obtain with probability at least $1 - \delta/2$ that for all $\mathcal{Q}$
$$\mathrm{er}(\mathcal{Q}) \;\le\; \widetilde{\mathrm{er}}(\mathcal{Q}) + \frac{1}{\sqrt{n}}\,\mathrm{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{8\sqrt{n}} - \frac{1}{\sqrt{n}}\log\frac{\delta}{2}. \qquad (7)$$
To bound the difference between $\widetilde{\mathrm{er}}(\mathcal{Q})$ and $\widehat{\mathrm{er}}(\mathcal{Q})$ we apply Lemma 2 to the union of all training samples $S = \bigcup_{i=1}^{n} S_i$. We set $\rho = (\mathcal{Q}, Q^n)$, $\pi = (\mathcal{P}, P^n)$, $X_k = (x_{ij}, y_{ij})$, $l = \sum_i m_i$, $f = (P, h_1, \dots, h_n)$ and $g_k(f, X_k) = \frac{1}{n m_i}\,\ell(h_i(x_{ij}), y_{ij})$. In this setting $a_k = 0$ and $b_k = \frac{1}{n m_i}$, and Lemma 2 with $\lambda = n\sqrt{\bar m}$ yields that with probability at least $1 - \delta/2$ for all $\mathcal{Q}$
$$\widetilde{\mathrm{er}}(\mathcal{Q}) \;\le\; \widehat{\mathrm{er}}(\mathcal{Q}) + \frac{1}{n\sqrt{\bar m}}\,\mathrm{KL}\big((\mathcal{Q}, Q^n)\,\|\,(\mathcal{P}, P^n)\big) + \frac{1}{8\sqrt{\bar m}} - \frac{1}{n\sqrt{\bar m}}\log\frac{\delta}{2}. \qquad (8)$$
Now (4) follows from (7) and (8) by a union bound.

To get a better understanding of Theorem 1, we rewrite (4) in the following way:
$$\mathrm{er}(\mathcal{Q}) \;\le\; \widehat{\mathrm{er}}(\mathcal{Q}) + \Big(\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{\bar m}}\Big)\mathrm{KL}(\mathcal{Q}\|\mathcal{P}) + \frac{1}{n\sqrt{\bar m}}\sum_{i=1}^{n}\mathbb{E}_{P \sim \mathcal{Q}}\,\mathrm{KL}\big(Q_i(S_i, P)\,\|\,P\big) + \mathrm{const}(n, \bar m, \delta). \qquad (9)$$
We see that the bound contains two types of complexity terms that correspond to the two levels of our model: $\mathrm{KL}(\mathcal{Q}\|\mathcal{P})$ belongs to the level of the task environment in general, while each $\mathrm{KL}(Q_i(S_i, P)\|P)$ corresponds specifically to the $i$-th task. To better understand their roles, we look at the following limit cases. When the agent has access to sufficiently many tasks ($n \to \infty$) but the tasks come with a finite amount of data ($\bar m$ is finite), the first complexity term converges to 0 as $1/\sqrt{n}$. The second complexity term converges to an average KL-divergence over tasks and may therefore remain non-zero. This means that observing many tasks gives the agent full knowledge about the task environment, but it cannot overcome the uncertainty within each task. In the opposite case, if the agent observes unlimited data for each task, but only for a finite number of tasks ($\bar m \to \infty$, $n$ finite), the second complexity term converges to 0 as $1/\sqrt{\bar m}$, while the first one does not, so there is still uncertainty on the task-environment level. Only when both come together, sufficiently many tasks and sufficient amounts of data per task, is it guaranteed that the empirical multi-task risk $\widehat{\mathrm{er}}(\mathcal{Q})$ converges to the transfer risk $\mathrm{er}(\mathcal{Q})$.

A second important aspect of Theorem 1 is that the bound (4) consists only of observable quantities. Therefore, we can treat it as a quality measure for hyperposteriors $\mathcal{Q}$. By minimizing it, we obtain a hyperposterior distribution over priors that is adjusted to the particular environment of learning tasks. Since the bound holds uniformly with respect to $\mathcal{Q}$, the guarantees of Theorem 1 also hold for the resulting learned hyperposterior, so we can expect priors sampled according to the learned hyperposterior to work well even for future tasks.
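Since (9) consists only of observable quantities, it can be evaluated and minimized directly once the KL terms are computable. The following is a minimal sketch of evaluating its right-hand side, assuming the per-task and environment-level KL divergences are available in closed form (as in the Gaussian instantiations below); the function name and calling convention are illustrative, not the paper's implementation.

```python
# Evaluate the two-level bound objective (9), up to const(n, m_bar, delta).
import numpy as np

def transfer_bound_objective(emp_multitask_risk, kl_hyper, kl_tasks, n, m_sizes):
    """kl_hyper : KL between hyperposterior and hyperprior,
       kl_tasks : list of E_{P~Q} KL(Q_i || P), one value per observed task,
       m_sizes  : per-task sample sizes m_i (used for the harmonic mean)."""
    m_bar = n / np.sum(1.0 / np.asarray(m_sizes, dtype=float))   # harmonic mean
    env_term  = (1.0 / np.sqrt(n) + 1.0 / (n * np.sqrt(m_bar))) * kl_hyper
    task_term = np.sum(kl_tasks) / (n * np.sqrt(m_bar))
    return emp_multitask_risk + env_term + task_term

# toy usage: more observed tasks shrinks the environment-level term,
# but the per-task average term persists
print(transfer_bound_objective(0.2, kl_hyper=4.0,
                               kl_tasks=[1.0] * 5, n=5, m_sizes=[50] * 5))
print(transfer_bound_objective(0.2, kl_hyper=4.0,
                               kl_tasks=[1.0] * 500, n=500, m_sizes=[50] * 500))
```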
In the following sections, we discuss two instantiations of this procedure and show how they relate to previous work on transfer learning.

2.1. Parameter Transfer

Let $\mathcal{X} \subset \mathbb{R}^d$ and let $\mathcal{H}$ be a set of linear predictors: $h(x) = \langle w, x\rangle$ if $\mathcal{Y} = \mathbb{R}$, or $h(x) = \operatorname{sign}\langle w, x\rangle$ if $\mathcal{Y} = \{-1, 1\}$, where $w \in \mathbb{R}^d$ is a weight vector. One of the common assumptions in multitask or lifelong learning is that the weight vectors for different tasks are only minor variations of an unknown prototypical vector (Evgeniou & Pontil, 2004). This can be captured by regularizing the distance to this vector (Aytar & Zisserman, 2011; Yang et al., 2007):
$$\hat w = \operatorname*{argmin}_{w}\; \|w - w_{\mathrm{pr}}\|^2 + \frac{C}{m}\sum_{j=1}^{m} \big(y_j - \langle w, x_j\rangle\big)^2, \qquad (10)$$
where $w_{\mathrm{pr}}$ is some function of the weight vectors of previously observed tasks, e.g. just their average, $w_{\mathrm{pr}} = \frac{1}{n}\sum_{i=1}^{n} w_i$.

Theorem 1 allows us to learn an optimal $w_{\mathrm{pr}}$ from the data, instead of fixing the rule for computing it. For this, we choose $P = \mathcal{N}(w_P, I_d)$ and $Q = \mathcal{N}(w_Q, I_d)$, i.e. unit-variance normal distributions with means $w_P$ and $w_Q$, respectively. The mean $w_P$ is a random variable distributed first according to the hyperprior distribution, $\mathcal{P}$, which we set to $\mathcal{N}(0, \sigma I_d)$, and later according to the hyperposterior, $\mathcal{Q}$, which we model as $\mathcal{Q} = \mathcal{N}(w_{\mathcal{Q}}, I_d)$. The learning task consists of identifying the best $w_{\mathcal{Q}}$. As the underlying learning algorithm we use Equation (10) with the regularizer centered at the prior vector. For any $w_P$ and training set $S = \{(x_j, y_j)\}_{j=1,\dots,m}$ the posterior, $Q(S, P) = \mathcal{N}(w_Q, I_d)$, is given by
$$w_Q = \operatorname*{argmin}_{w}\; \|w - w_P\|^2 + \frac{C}{m}\sum_{j=1}^{m} \big(y_j - \langle w, x_j\rangle\big)^2. \qquad (11)$$
This has the closed form solution
$$w_Q = \Big(\tfrac{m}{C}\, I_d + XX^\top\Big)^{-1}\Big(\tfrac{m}{C}\, w_P + XY\Big) = A\, w_P + b, \qquad (12)$$
where $X$ is the matrix with columns $x_1, \dots, x_m$, $Y$ is the column vector of labels $(y_1, \dots, y_m)^\top$, $A = \big(I_d + \tfrac{C}{m} XX^\top\big)^{-1}$ and $b = \big(I_d + \tfrac{C}{m} XX^\top\big)^{-1}\tfrac{C}{m} XY$.

Computing the complexity terms from (9) we obtain
$$\mathrm{KL}(\mathcal{Q}\|\mathcal{P}) = \frac{\|w_{\mathcal{Q}}\|^2}{2\sigma} + \mathrm{const}, \qquad \mathbb{E}_{P \sim \mathcal{Q}}\,\mathrm{KL}\big(Q_i(S_i, P)\,\|\,P\big) = \mathbb{E}_{w_P \sim \mathcal{Q}}\; \tfrac{1}{2}\big\|(A_i - I_d)\, w_P + b_i\big\|^2 = \tfrac{1}{2}\Big(\big\|(A_i - I_d)\, w_{\mathcal{Q}} + b_i\big\|^2 + \operatorname{tr}\big((A_i - I_d)^2\big)\Big). \qquad (13)$$
We insert Equations (13) into inequality (9) and obtain, for all $w_{\mathcal{Q}}$,
$$\mathrm{er}(w_{\mathcal{Q}}) \;\le\; \widehat{\mathrm{er}}(w_{\mathcal{Q}}) + \frac{\sqrt{n\bar m} + 1}{2\sigma n\sqrt{\bar m}}\,\|w_{\mathcal{Q}}\|^2 + \frac{1}{2 n\sqrt{\bar m}}\sum_{i=1}^{n} \big\|(A_i - I_d)\, w_{\mathcal{Q}} + b_i\big\|^2 + \mathrm{const}. \qquad (14)$$

The last thing we have to specify is the loss function $\ell$. We consider two options. First, the binary classification setting with 0/1 loss, $\ell(y_1, y_2) = [\![\, y_1 \neq y_2 \,]\!]$. In this case the expected empirical error of the Gibbs classifier is given by the following expression (Germain et al., 2009; Langford & Shawe-Taylor, 2002):
$$\widehat{\mathrm{er}}(w_{\mathcal{Q}}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i} \Phi\left(\frac{y_{ij}\, x_{ij}^\top (A_i w_{\mathcal{Q}} + b_i)}{\sqrt{x_{ij}^\top (I_d + A_i A_i^\top)\, x_{ij}}}\right), \qquad (15)$$
where $\Phi(z) = \frac{1}{2}\big(1 - \operatorname{erf}(z/\sqrt{2})\big)$ and $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt$ is the Gauss error function. For a practical algorithm, one typically prefers a bound on the loss of a deterministic classifier rather than of the stochastic Gibbs classifier. For 0/1 loss Theorem 1 provides this, since the expected error of the deterministic classifier defined by $A_i w_{\mathcal{Q}} + b_i$ is at most twice the Gibbs error (McAllester, 2003; Laviolette & Marchand, 2007). Inserting (15) into (14) and multiplying the left hand side by $\frac{1}{2}$ we obtain the following inequality for all $w_{\mathcal{Q}}$:
$$\frac{1}{2}\,\mathbb{E}_{(t, S_t)}\,\mathbb{E}_{(x,y) \sim D_t}\,\big[\![\, y \neq \operatorname{sign}\langle A_t w_{\mathcal{Q}} + b_t, x\rangle \,]\!\big] \;\le\; \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i}\Phi\left(\frac{y_{ij}\, x_{ij}^\top (A_i w_{\mathcal{Q}} + b_i)}{\sqrt{x_{ij}^\top (I_d + A_i A_i^\top)\, x_{ij}}}\right) + \frac{\sqrt{n\bar m} + 1}{2\sigma n\sqrt{\bar m}}\,\|w_{\mathcal{Q}}\|^2 + \frac{1}{2 n\sqrt{\bar m}}\sum_{i=1}^{n}\big\|(A_i - I_d)\, w_{\mathcal{Q}} + b_i\big\|^2 + \mathrm{const}. \qquad (16)$$

For regression tasks, we consider the case of truncated squared loss, $\ell(y_1, y_2) = \min\{(y_1 - y_2)^2, 1\}$ (the truncation is necessary to fulfill the condition of a bounded loss function). Since $\ell(y_1, y_2) \le (y_1 - y_2)^2$, we can substitute $\widehat{\mathrm{er}}(w_{\mathcal{Q}})$ in (14) by the empirical error of the Gibbs predictor with squared loss without violating the inequality. This error differs from the error of the predictor defined by $A_i w_{\mathcal{Q}} + b_i$ only by a constant that does not depend on $w_{\mathcal{Q}}$. An elementary calculation shows that for the truncated squared loss $\ell$, as in the case of 0/1 loss, the error of the Gibbs predictor is at least one half of the expected error of the predictor defined by $A_i w_{\mathcal{Q}} + b_i$. Therefore, in this case we obtain a result similar to inequality (16), valid for all $w_{\mathcal{Q}}$:
$$\frac{1}{2}\,\mathbb{E}_{(t, S_t)}\,\mathbb{E}_{(x,y) \sim D_t}\,\min\big\{\big(y - \langle A_t w_{\mathcal{Q}} + b_t, x\rangle\big)^2,\, 1\big\} \;\le\; \frac{1}{n}\sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i}\big(y_{ij} - \langle A_i w_{\mathcal{Q}} + b_i, x_{ij}\rangle\big)^2 + \frac{\sqrt{n\bar m} + 1}{2\sigma n\sqrt{\bar m}}\,\|w_{\mathcal{Q}}\|^2 + \frac{1}{2 n\sqrt{\bar m}}\sum_{i=1}^{n}\big\|(A_i - I_d)\, w_{\mathcal{Q}} + b_i\big\|^2 + \mathrm{const}. \qquad (17)$$
Minimizing the right hand side of (16) or (17) with respect to $w_{\mathcal{Q}}$, we obtain a data-dependent hyperposterior that induces prior distributions that are adjusted optimally (in the sense of the bound) to the task environment.
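To make the structure of this objective concrete, the following is a minimal sketch of evaluating the right-hand side of (16) for a candidate hyperposterior mean: the per-task matrices $A_i$ and vectors $b_i$ come from the closed form (12), and $\Phi$ is the Gaussian tail function. The function names, the fixed hyperparameter values and the use of SciPy are assumptions made for this sketch, not the authors' implementation.

```python
# Evaluate the classification objective (16), up to its additive constant.
import numpy as np
from scipy.stats import norm

def task_affine_map(X, Y, C):
    """Closed form (12): ridge regression with prior center w_P gives w = A w_P + b.
       X: d x m matrix with examples as columns, Y: length-m label vector."""
    d, m = X.shape
    M = np.linalg.inv(np.eye(d) + (C / m) * X @ X.T)
    return M, M @ ((C / m) * X @ Y)                       # A, b

def pl_g_objective(w_hyper, tasks, C=1.0, sigma=1.0):
    """tasks is a list of (X_i, Y_i) with labels in {-1, +1}."""
    n = len(tasks)
    m_bar = n / sum(1.0 / X.shape[1] for X, _ in tasks)   # harmonic mean of sample sizes
    gibbs_risk, task_kl = 0.0, 0.0
    for X, Y in tasks:
        A, b = task_affine_map(X, Y, C)
        w_i = A @ w_hyper + b                             # mean of the task posterior
        Sigma = np.eye(X.shape[0]) + A @ A.T
        scale = np.sqrt(np.sum((X.T @ Sigma) * X.T, axis=1))
        gibbs_risk += np.mean(norm.sf(Y * (X.T @ w_i) / scale)) / n   # Phi = Gaussian tail
        task_kl += np.sum(((A - np.eye(A.shape[0])) @ w_hyper + b) ** 2)
    hyper_kl = (np.sqrt(n * m_bar) + 1) / (2 * sigma * n * np.sqrt(m_bar)) * np.sum(w_hyper ** 2)
    return gibbs_risk + hyper_kl + task_kl / (2 * n * np.sqrt(m_bar))
```

A generic minimizer (or, as in Section 3.1, conjugate gradient applied after convexifying $\Phi$) can then be run on `pl_g_objective` to obtain the learned hyperposterior mean.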
2.2. Representation Transfer

A second assumption commonly made in multitask or lifelong learning is that the weight vectors of all tasks lie in a low-dimensional subspace. Theorem 1 also allows us to learn such a subspace in a principled way. We again assume that $\mathcal{X} \subset \mathbb{R}^d$ and that $\mathcal{H}$ is a set of linear predictors. We represent $k$-dimensional subspaces of $\mathbb{R}^d$ by $d \times k$ matrices with orthogonal columns, i.e. elements of the Stiefel manifold $V_{d,k}$. As hyperprior we want all subspaces to be equally likely, so we set $\mathcal{P}$ to the uniform distribution over $V_{d,k}$ (Downs, 1972),
$$p(B) = \frac{1}{C_0} \quad\text{for any } B \in V_{d,k}, \qquad (18)$$
where $C_0 = {}_0F_1\big(\tfrac{1}{2}d;\, 0\big)$. As hyperposterior, $\mathcal{Q}$, we want a distribution that concentrates its probability mass around a specific subspace, $M$. We choose a special case of the Langevin distribution, $D(I_k, M)$,
$$q(B) = \frac{1}{C_1}\exp\big(\operatorname{tr}(M^\top B)\big) \quad\text{for any } B \in V_{d,k}, \qquad (19)$$
where $C_1 = {}_0F_1\big(\tfrac{1}{2}d;\, \tfrac{1}{4}M^\top M\big)$. The only free parameter is $M \in V_{d,k}$, i.e. a $d \times k$ matrix with $M^\top M = I_k$ that represents the most promising subspace. Equation (19) can be interpreted as an analog of the Gaussian distribution on $V_{d,k}$, with mode $M$ and unit variance. In the special case of $k = 1$, it reduces to the better known von Mises distribution on the unit circle (Downs, 1972).

As in the previous section we use Gaussian distributions for prior and posterior, but defined only within the subspaces sampled from $\mathcal{P}$ or $\mathcal{Q}$. For the prior, $P$, we choose a Gaussian with zero mean and variance $\sigma I_k$. The posterior, $Q$, is a shifted Gaussian with variance $\sigma I_k$ and mean $w_Q$ in the same subspace. As in the previous section we use ridge regression as the learning algorithm, but again only within the subspace determined by the prior,
$$w_Q = \operatorname*{argmin}_{w}\; \|w\|^2 + \frac{C}{m}\sum_{j=1}^{m}\big(y_j - \langle w, B^\top x_j\rangle\big)^2, \qquad (20)$$
where $B$ is the matrix representing the subspace, such that $B^\top x$ is the projected representation of the training data in this subspace.

To obtain an objective function for learning $M$, we first compute the complexity terms in the bound (9). $\mathrm{KL}(\mathcal{Q}\|\mathcal{P})$ is a constant independent of $M$: $\mathcal{P}$ is uniform, so $\mathrm{KL}(\mathcal{Q}\|\mathcal{P})$ depends only on the differential entropy of $\mathcal{Q}$, which itself is a constant independent of the parameter matrix (for any $M \in V_{d,k}$ there exists an orthogonal matrix $L \in \mathbb{R}^{d \times d}$ such that $LM = J = \{\delta_{ij}\} \in \mathbb{R}^{d \times k}$; therefore, if $B \sim D(I_k, M)$, then $LB \sim D(I_k, LM) = D(I_k, J)$, so the entropy of $D(I_k, M)$ equals the entropy of $D(I_k, J)$ for any $M$). Furthermore, we have $\mathrm{KL}\big(Q_i(S_i, P)\,\|\,P\big) = \frac{1}{2\sigma}\|w_i(B)\|^2$, where $B$ is the representation of the selected subspace. In combination, we get the following bound:
$$\mathrm{er}(M) \;\le\; \widehat{\mathrm{er}}(M) + \frac{1}{2\sigma n\sqrt{\bar m}}\sum_{i=1}^{n}\mathbb{E}_{B \sim D(I_k, M)}\,\big\|w_i(B)\big\|^2 + \mathrm{const} \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{B \sim D(I_k, M)}\Big(\widehat{\mathrm{er}}\big(w_i(B)\big) + \frac{1}{2\sigma\sqrt{\bar m}}\,\big\|w_i(B)\big\|^2\Big) + \mathrm{const}, \qquad (21)$$
where $w_i(B) = \frac{C}{m_i}\big(\frac{C}{m_i}\,B^\top X_i X_i^\top B + I_k\big)^{-1} B^\top X_i Y_i$. We see that a representation $M$ can be considered promising for future tasks if it, as well as the subspaces close to it, allows classification with small loss and small weight vector norm (i.e. large margin) for all observed tasks.
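For concreteness, the following is a minimal sketch of the representation-transfer objective (21), approximating the expectation over $B \sim D(I_k, M)$ by the mode $B = M$ and using squared loss, as is done for the experiments in Section 3.2. The function names and fixed hyperparameter values are illustrative assumptions, not the paper's code.

```python
# Mode approximation of the subspace objective (21).
import numpy as np

def subspace_predictor(B, X, Y, C):
    """Closed form of (20): ridge regression on the projected data B^T x."""
    k, m = B.shape[1], X.shape[1]
    Z = B.T @ X                                            # k x m projected inputs
    return (C / m) * np.linalg.solve((C / m) * Z @ Z.T + np.eye(k), Z @ Y)

def pl_l_objective(M, tasks, C=1.0, sigma=1.0):
    """tasks is a list of (X_i, Y_i) with X_i of shape d x m_i; M has orthonormal columns."""
    n = len(tasks)
    m_bar = n / sum(1.0 / X.shape[1] for X, _ in tasks)    # harmonic mean of sample sizes
    value = 0.0
    for X, Y in tasks:
        w = subspace_predictor(M, X, Y, C)
        emp_risk = np.mean((Y - X.T @ M @ w) ** 2)         # loss of the mode predictor
        value += (emp_risk + np.sum(w ** 2) / (2 * sigma * np.sqrt(m_bar))) / n
    return value
```

Minimizing this quantity over $M$ is a constrained problem on the Stiefel manifold; the optimization strategy used in the experiments is described in Section 3.2.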
3. Experiments

In this section, we demonstrate how learning prior distributions by minimizing the bounds (16), (17) and (21) can improve prediction performance in real prediction tasks. To position our results with respect to previous work on parameter and representation transfer, we compare to adaptive ridge regression (ARR), i.e. Equation (10) with the prior $w_{\mathrm{pr}}$ set to the average of the weight vectors from the observed tasks, and to the ELLA algorithm (Ruvolo & Eaton, 2013), which learns a subspace representation using structured sparsity constraints, also with squared loss. We also report results for ordinary ridge regression without any knowledge transfer.

We perform experiments on three public datasets:

Land Mine Detection (Xue et al., 2007). This dataset consists of 14820 data points. For each data point there are 9 features extracted from radar images and a binary label, 0 or 1, corresponding to landmine or clutter. We also add a bias term, resulting in d = 10 features. Data points are collected from 29 geographical regions and we treat each region as a binary classification task.

London School Data. This is a regression dataset, containing exam scores of 15362 students from 139 schools. Each student is described by 4 school-specific features, 3 student-specific features and a year of examination. We use the same procedure as in (Argyriou et al., 2008; Kumar & Daumé III, 2012; Ruvolo & Eaton, 2013) to encode them in a set of binary features. We also add a bias term, so the final data dimensionality is d = 28. Each school constitutes a task.

Animals with Attributes Dataset (Lampert et al., 2013). This dataset contains 30475 images from 50 classes. Each image comes with a 2000-dimensional feature vector, which we reduce to 100 dimensions using PCA. We l2-normalize the resulting feature vectors and add a bias term. We select the largest class, collie, and form 49 binary classification tasks, each of them being the classification of collie versus one of the remaining classes. For each task we use 2% of the data (approximately 20 images) available for the collie class and the same number of images from the other class, such that the data of different tasks do not overlap.

3.1. Parameter Transfer

We first perform experiments on prior learning in the setup of parameter transfer, as described in Section 2.1, calling the resulting algorithm Prior Learning with Gaussian hyperprior (PL-G). For the classification tasks (Landmine and Animals), we optimize the bound (16). To do so we replace $\Phi$ by its convex relaxation, $\Phi_{\mathrm{cvx}}(z) = \frac{1}{2} - \frac{z}{\sqrt{2\pi}}$ if $z \le 0$ and $\Phi_{\mathrm{cvx}}(z) = \Phi(z)$ otherwise, and use the conjugate gradient method for finding the minimum. For the regression tasks (Schools) we first divide the labels (examination scores) by their maximum value. This allows us to assume that the squared loss does not exceed 1. We optimize (17), and due to the squared loss the problem has a closed form solution:
$$w_{\mathcal{Q}} = \Big(D + \frac{\sqrt{n\bar m} + 1}{\sigma n\sqrt{\bar m}}\, I_d + \frac{1}{n\sqrt{\bar m}}\sum_{i=1}^{n} A_i'^\top A_i'\Big)^{-1}\Big(c - \frac{1}{n\sqrt{\bar m}}\sum_{i=1}^{n} A_i'^\top b_i\Big), \qquad (22)$$
where
$$A_i' = A_i - I_d, \qquad D = \frac{2}{n}\sum_{i=1}^{n}\frac{1}{m_i}\, A_i^\top X_i X_i^\top A_i, \qquad c = \frac{2}{n}\sum_{i=1}^{n}\frac{1}{m_i}\, A_i^\top X_i\big(Y_i - X_i^\top b_i\big). \qquad (23)$$
To make the results comparable with the baseline algorithms, we report the squared error multiplied by the squared value of the maximum examination score.
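As an illustration, the following sketch solves the quadratic problem in (17) via the normal equations (22)-(23), reusing the per-task quantities $A_i$, $b_i$ from (12). The names and the simple accumulation loop are assumptions for this sketch, not the paper's implementation.

```python
# Closed-form minimizer of the right-hand side of (17) for the regression case.
import numpy as np

def pl_g_regression_solution(tasks, C=1.0, sigma=1.0):
    """tasks: list of (X_i, Y_i) with X_i of shape d x m_i; returns w_Q of (22)."""
    n, d = len(tasks), tasks[0][0].shape[0]
    m_bar = n / sum(1.0 / X.shape[1] for X, _ in tasks)
    lhs = (np.sqrt(n * m_bar) + 1) / (sigma * n * np.sqrt(m_bar)) * np.eye(d)
    rhs = np.zeros(d)
    for X, Y in tasks:
        m = X.shape[1]
        A = np.linalg.inv(np.eye(d) + (C / m) * X @ X.T)        # from (12)
        b = A @ ((C / m) * X @ Y)
        Ap = A - np.eye(d)                                      # A_i' = A_i - I_d
        lhs += (2.0 / (n * m)) * A.T @ X @ X.T @ A + Ap.T @ Ap / (n * np.sqrt(m_bar))
        rhs += (2.0 / (n * m)) * A.T @ X @ (Y - X.T @ b) - Ap.T @ b / (n * np.sqrt(m_bar))
    return np.linalg.solve(lhs, rhs)
```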
3.2. Representation Transfer

In a second set of experiments, we implement the idea of representation transfer from Section 2.2, calling the resulting algorithm Prior Learning with Langevin hyperprior (PL-L). As in the case of parameter transfer, we use the 0/1 loss to measure quality in the classification tasks. For the regression task we apply the same scaling procedure as discussed in Section 3.1 and use the truncated squared loss. Both of these loss functions can be upper-bounded by the standard squared loss, which we do to obtain tractable expressions for the right hand side of inequality (21). To be able to optimize the expression (21) numerically, we approximate it by replacing all expectations over $\mathcal{Q}$ by their values at its mode, $M$. Furthermore, we replace the error of any Gibbs predictor by the error of the deterministic predictor defined by the mode of the posterior distribution, $w_i(M)$. The result is a quadratic optimization problem over the Stiefel manifold, which we solve using gradient descent with curvilinear search (Wen & Yin, 2013).
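The curvilinear search of Wen & Yin (2013) follows a curve defined by a Cayley transform, which keeps every iterate on the Stiefel manifold. A minimal sketch of such an update, with a simple backtracking rule standing in for the full curvilinear search, might look as follows; the toy objective and all names are illustrative assumptions.

```python
# Orthogonality-preserving gradient descent via a Cayley-transform curve.
import numpy as np

def cayley_step(M, grad, tau):
    """One curvilinear step: M(tau) = (I + tau/2 W)^{-1} (I - tau/2 W) M,
       with W = grad M^T - M grad^T skew-symmetric, so M(tau)^T M(tau) = I_k."""
    d = M.shape[0]
    W = grad @ M.T - M @ grad.T
    return np.linalg.solve(np.eye(d) + (tau / 2) * W,
                           (np.eye(d) - (tau / 2) * W) @ M)

def minimize_on_stiefel(objective, gradient, M0, tau=1e-2, iters=200):
    """Plain monotone backtracking instead of the full search of Wen & Yin (2013)."""
    M = M0
    for _ in range(iters):
        step, f_old, g = tau, objective(M), gradient(M)
        while objective(cayley_step(M, g, step)) > f_old and step > 1e-8:
            step *= 0.5                                    # backtrack along the curve
        M = cayley_step(M, g, step)
    return M

# toy usage: recover the dominant 2-dimensional invariant subspace of a PSD matrix
rng = np.random.default_rng(0)
S = rng.normal(size=(6, 6)); S = S @ S.T
f = lambda M: -np.trace(M.T @ S @ M)                       # minimized by top eigenvectors
g = lambda M: -2 * S @ M
M0, _ = np.linalg.qr(rng.normal(size=(6, 2)))
M = minimize_on_stiefel(f, g, M0)
print(np.allclose(M.T @ M, np.eye(2)))                     # orthogonality preserved
```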
3.3. Evaluation procedure

To get reliable estimates of the transfer risk, we repeat the following experimental procedure 100 times for each dataset and calculate the mean prediction errors and standard errors of the mean. In each experiment, we set aside a subset of tasks as unobserved (9 in Landmines, 39 in Schools, 9 in Animals). These are not used during any part of training, but only to evaluate the methods on future tasks. Of the remaining tasks we use different fractions to measure the effect of a different number of observed tasks. The algorithms described in Sections 3.1 and 3.2 and ARR have one free parameter, the regularization strength $C \in \{10^{-3}, \dots, 10^{3}\}$. We choose it using 3-fold cross-validation in the following way. We split the data of each task into three parts: we use the first third of all tasks jointly to learn a prior. To evaluate this prior, we then train individual predictors using the second part of the data, and test their quality on the third part. For the ELLA algorithm, we use the same procedure to set the regularization strength $\mu$; the remaining parameters we leave at their default values. For the baseline, we set the regularization strength using ordinary 3-fold cross-validation.

Figure 1. Results of the experiments on the three datasets: (a) Landmines, parameter transfer; (b) Schools, parameter transfer; (c) Animals, parameter transfer; (d) Landmines, representation transfer; (e) Schools, representation transfer. For details, see Sections 3.1 and 3.2.

3.4. Results

The results of the experiments on all three datasets are shown in Figure 1. Since the classes in the Landmine dataset are unbalanced, for this problem we report the area under the ROC curve (AUC; a bigger value means better prediction). Tasks in the Animals dataset are balanced, so for them we report the standard mean 0/1 error. Since this dataset was too large for the subspace methods, we only report results for the parameter transfer techniques. For the experiments on the Schools dataset we report the mean squared error (MSE; smaller values mean better prediction).

As a first observation, Figure 1 confirms the findings of previous work that better prediction can be achieved by transferring information from related tasks. Overall, it shows that PL-G and PL-L are comparable to the existing, manually designed, techniques. Given sufficiently many tasks, they are able to improve the prediction accuracy over the baseline. As an illustration of the hyperprior concept, we show results for PL-G with two different values of the Gaussian hyperprior variance (Figures 1(a), 1(b), 1(c)). For $\sigma = 1$, the adaptation proceeds in a very conservative way and many tasks are needed to find a reliable hyperposterior. With $\sigma = 10$, convergence is faster, and PL-G achieves results comparable with ARR or even slightly better. For practical tasks, the hyperprior should preferably be chosen by model selection. The results for representation transfer (Figures 1(d) and 1(e)) show that the improvements achieved by PL-L are comparable to those of the ELLA algorithm. As in the case of parameter transfer, we show results for two different values of the Gaussian prior variance, $\sigma = 1$ and $\sigma = 10$. While for the Landmine dataset (Figure 1(d)) there is no significant difference in performance for different values of the parameters $k$ and $\sigma$, for the Schools dataset (Figure 1(e)) the choice of these parameters plays a bigger role. We see that the improvements of PL-L with $\sigma = 10$ are almost the same as those achieved by ELLA, while for $\sigma = 1$ they are smaller. This might be the effect of too strict hyperparameters that cause the method to be more conservative than necessary. Another possible reason for the difference in accuracy is that ELLA makes an additional sparsity assumption, which PL-L does not.

4. Conclusion

In this work we studied lifelong learning from a theoretical perspective. Our main result is a generalization bound in a PAC-Bayesian framework (Theorem 1). On the one hand, the bound is very general, allowing us to recover two existing principles for transfer learning as special cases: the transfer of classifier parameters, and the transfer of subspaces/representations. On the other hand, the bound consists only of observable quantities, such that it can be used to derive principled algorithms for lifelong learning that achieve results comparable with existing manually designed methods. A further use we see for the bound is to study the implicit assumptions of possible learning methods. For example, a method obtained by means of a unimodal hyperposterior will require all tasks to be related to each other. In future work, we plan to explore the potential of integrating more realistic assumptions, such as hierarchical or multimodal hyperposteriors. A second interesting direction will be to relax the condition that tasks are sampled i.i.d. from an environment, e.g. in the direction of learning tasks of continuously increasing difficulty (Bengio et al., 2009).

Acknowledgements. We thank Shai Ben-David, Olivier Catoni and Emilie Morvant for helpful discussions. This work was in parts funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement no. 308036.

A. Proof of Lemma 2

In the proof we will make use of Hoeffding's Lemma:

Lemma 3 (Hoeffding, 1963). Let $X$ be a real-valued random variable such that $\Pr(X \in [a, b]) = 1$ and let $\xi = \mathbb{E}[X]$. Then
$$\mathbb{E}\big[e^{\lambda(\xi - X)}\big] \;\le\; e^{\lambda^2 (b - a)^2 / 8}. \qquad (24)$$
We will also require the following property of the Kullback-Leibler divergence, which holds for any $\lambda > 0$ and can be proved by convex duality (Seeger, 2003):
$$\mathbb{E}_{f \sim Q}\, g(f) \;\le\; \frac{1}{\lambda}\Big(\mathrm{KL}(Q\|P) + \log \mathbb{E}_{f \sim P}\, e^{\lambda g(f)}\Big). \qquad (25)$$
We now prove Lemma 2. First, we apply (25) to $g(f) = \sum_{k=1}^{l}\xi_k(f) - \sum_{k=1}^{l} g_k(f, X_k)$, obtaining
$$\mathbb{E}_{f \sim \rho}\Big[\sum_{k=1}^{l}\xi_k(f) - \sum_{k=1}^{l} g_k(f, X_k)\Big] \;\le\; \frac{1}{\lambda}\Big(\mathrm{KL}(\rho\|\pi) + \log \mathbb{E}_{f \sim \pi}\, e^{\lambda g(f)}\Big). \qquad (26)$$
Furthermore,
$$\mathbb{E}_{X_1 \sim \mu_1}\cdots\mathbb{E}_{X_l \sim \mu_l}\, e^{\lambda g(f)} = \prod_{k=1}^{l}\mathbb{E}_{X_k \sim \mu_k}\exp\big(\lambda(\xi_k(f) - g_k(f, X_k))\big), \qquad (27)$$
since for any fixed $f$ the factors are independent. This allows us to apply Hoeffding's Lemma 3 to each factor:
$$\mathbb{E}_{X_1 \sim \mu_1}\cdots\mathbb{E}_{X_l \sim \mu_l}\, e^{\lambda g(f)} \;\le\; \exp\Big(\frac{\lambda^2}{8}\sum_{k=1}^{l}(b_k - a_k)^2\Big). \qquad (28)$$
By taking the expectation over $f \sim \pi$ we obtain
$$\mathbb{E}_{f \sim \pi}\,\mathbb{E}_{X_1 \sim \mu_1}\cdots\mathbb{E}_{X_l \sim \mu_l}\, e^{\lambda g(f)} \;\le\; \exp\Big(\frac{\lambda^2}{8}\sum_{k=1}^{l}(b_k - a_k)^2\Big). \qquad (29)$$
Since $\pi$ is fixed and does not depend on $X_1, \dots, X_l$, we can exchange the order of expectations. By applying Markov's inequality with respect to the expectations over $X_1, \dots, X_l$ we obtain that with probability at least $1 - \delta$:
$$\log \mathbb{E}_{f \sim \pi}\, e^{\lambda g(f)} \;\le\; \frac{\lambda^2}{8}\sum_{k=1}^{l}(b_k - a_k)^2 - \log\delta. \qquad (30)$$
We obtain (6) by combining (30) and (26).
References

Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
Aytar, Yusuf and Zisserman, Andrew. Tabula rasa: Model transfer for object category detection. In International Conference on Computer Vision (ICCV), 2011.
Baxter, Jonathan. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
Bengio, Yoshua, Louradour, Jérôme, Collobert, Ronan, and Weston, Jason. Curriculum learning. In International Conference on Machine Learning (ICML), 2009.
Bridle, John S. and Cox, Stephen J. RecNorm: Simultaneous normalisation and classification applied to speech recognition. In Conference on Neural Information Processing Systems (NIPS), pp. 234-240, 1990.
Caruana, Rich. Multitask learning. Machine Learning, 28(1):41-75, 1997.
Catoni, Olivier. PAC-Bayesian Supervised Classification (The Thermodynamics of Statistical Learning), volume 56 of Monograph Series of the Institute of Mathematical Statistics. IMS, 2007.
Downs, Thomas D. Orientation statistics. Biometrika, 59(3):665-676, 1972.
Evgeniou, Theodoros and Pontil, Massimiliano. Regularized multi-task learning. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2004.
Germain, Pascal, Lacasse, Alexandre, Laviolette, François, and Marchand, Mario. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), 2009.
Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, March 1963.
Kumar, Abhishek and Daumé III, Hal. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning (ICML), 2012.
Lampert, Christoph H., Nickisch, Hannes, and Harmeling, Stefan. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013.
Langford, John. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research (JMLR), 6:273-306, 2005.
Langford, John and Shawe-Taylor, John. PAC-Bayes and margins. In Conference on Neural Information Processing Systems (NIPS), 2002.
Laviolette, François and Marchand, Mario. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. Journal of Machine Learning Research (JMLR), 8:1461-1487, 2007.
Maurer, Andreas. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.
Maurer, Andreas, Pontil, Massimiliano, and Romera-Paredes, Bernardino. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning (ICML), 2013.
McAllester, David. PAC-Bayesian model averaging. In Workshop on Computational Learning Theory (COLT), pp. 164-170. ACM, 1999.
McAllester, David. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines, pp. 203-215. Springer, 2003.
Parrado-Hernández, Emilio, Ambroladze, Amiran, Shawe-Taylor, John, and Sun, Shiliang. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research (JMLR), 13:3507-3531, 2012.
Ruvolo, Paul and Eaton, Eric. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning (ICML), 2013.
Seeger, Matthias W. Bayesian Gaussian Process Models: PAC-Bayesian Generalization Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.
Thrun, Sebastian and Mitchell, Tom M. Lifelong robot learning. Robotics and Autonomous Systems, 15(1):25-46, 1995.
Vapnik, Vladimir N. Statistical Learning Theory. Wiley, 1998.
Wen, Zaiwen and Yin, Wotao. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397-434, 2013.
Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research (JMLR), 8:35-63, 2007.
Yang, Jun, Yan, Rong, and Hauptmann, Alexander G. Cross-domain video concept detection using adaptive SVMs. In International Conference on Multimedia, pp. 188-197, 2007.