# Lifelong Learning with Weighted Majority Votes

Anastasia Pentina, IST Austria, apentina@ist.ac.at
Ruth Urner, Max Planck Institute for Intelligent Systems, rurner@tuebingen.mpg.de

Better understanding of the potential benefits of information transfer and representation learning is an important step towards the goal of building intelligent systems that are able to persist in the world and learn over time. In this work, we consider a setting where the learner encounters a stream of tasks but is able to retain only limited information from each encountered task, such as a learned predictor. In contrast to most previous works analyzing this scenario, we do not make any distributional assumptions on the task generating process. Instead, we formulate a complexity measure that captures the diversity of the observed tasks. We provide a lifelong learning algorithm with error guarantees for every observed task (rather than on average). We show sample complexity reductions, in terms of our task complexity measure, in comparison to solving every task in isolation. Further, our algorithmic framework can naturally be viewed as learning a representation from the encountered tasks with a neural network.

## 1 Introduction

Machine learning has made significant progress in understanding both theoretical and practical aspects of solving a single prediction problem from a set of annotated examples. However, if we aim at building autonomous agents capable of persisting in the world, we need to establish methods for continuously learning various tasks over time [25, 26]. There is no hope of initially providing, for example, an autonomous robot with sufficiently rich prior knowledge to solve any problem that it may encounter during the course of its life. Therefore, an important goal of machine learning research is to replicate humans' ability to learn from experience and to reuse knowledge from previously encountered tasks for solving new ones more efficiently. This is the aim of lifelong learning, or learning to learn, where a learning algorithm encounters a stream of tasks and tries to exploit commonalities between them by transferring information from earlier tasks to later ones.

The first theoretical formulation of this framework was proposed by Baxter [4]. In that model, tasks are generated by a probability distribution and the goal, given a sample of tasks from this distribution, is to perform well in expectation over tasks. Under certain assumptions, such as a shared good hypothesis set, this model allows for sample complexity savings [4]. However, good performance in expectation is often too weak a requirement. To stay with the robot example, failure on a single task may cause severe malfunction and the end of the robot's life. Moreover, the theoretical analysis of this model relies on the assumption that the learner maintains access to the training data of all previously observed tasks, which allows formulating a joint optimization problem. However, it is unlikely that an autonomous robot is able to keep all this data. Thus, we instead focus on a streaming setting for lifelong learning, where the learner can only retain learned models from previously encountered tasks. These models have a much more compact description than the joint training data. Specifically, we are interested in analysis and performance guarantees in the scenario where 1) tasks arrive one at a time, without distributional or i.i.d. assumptions, 2) the learner can only keep the learned hypotheses from previously observed tasks, and 3) error bounds are required for every single task, rather than on average.
The first analysis of this challenging setting was recently provided by Balcan et al. [3]. That work demonstrates sample complexity improvements for learning linear halfspaces (and some Boolean function classes) in the lifelong learning setting, in comparison to solving each task in isolation, under the assumption that the tasks share a common low-dimensional representation. However, the analysis relies on the marginal distributions of all tasks being isotropic log-concave. It was stated as an open challenge in that work whether similar guarantees (error bounds for every task, while keeping only limited information from earlier tasks) are possible under less restrictive distributional assumptions.

In this work, we (partially) answer this question in the positive. We do so by proposing to learn with weighted majority votes, rather than linear combinations, over linear predictors. We show that the shift from linear combinations to majority votes introduces a stability to the learned ensemble that allows exploiting it for later tasks. Additionally, we show that this stability is achieved for any ground hypothesis class. We formulate a relatedness assumption on the sequence of tasks (similar to the one used in [3]) that captures how suitable a sequence of tasks is for lifelong learning. With this, we prove that sample complexity savings through lifelong learning are obtained for arbitrary marginal distributions (provided that these marginal distributions are related in terms of their discrepancy [5, 17]). This is a significant generalization towards more practically relevant scenarios.

**Summary of our work.** We employ a natural algorithmic paradigm, similar to the one in [3]. The algorithm maintains a set of base hypotheses from some fixed ground hypothesis class $H$. These base hypotheses are predictors learned on previous tasks. For each new task, the algorithm first attempts to achieve good prediction performance with a weighted majority vote over the current base hypotheses, and uses this predictor for the task if successful. Otherwise (if no majority vote classifier achieves high accuracy), the algorithm resorts to learning a classifier from the ground class for this task. This classifier is then added to the set of base hypotheses, to be used for subsequent tasks. We describe this algorithm in Section 4.1.

If the ground class is the class of linear predictors, this algorithm is actually learning a neural network. Each base classifier becomes a node in a hidden middle layer, which represents a learned feature representation of the neural net. A new task is then either solved by employing the representation learned from previous tasks (the current middle layer) and just learning task-specific weights for the last layer, or, in case this is not possible, it extends the current representation. See also Section 4.2.

This paradigm yields sample complexity savings if the encountered tasks are related in the sense that, for many tasks, good classification accuracy can be achieved with a weighted majority vote over previously learned models. We formally capture this property as an effective dimension of the sequence of tasks.
We prove in Section 4.3 that if this effective dimension is bounded by $k$, then the total sample complexity for learning $n$ tasks with this paradigm is upper bounded by $O\big((nk + \mathrm{VC}(H)k^2)/\epsilon\big)$, a reduction from $O\big(n\,\mathrm{VC}(H)/\epsilon\big)$, the sample complexity of learning the $n$ tasks individually without retaining information from previous tasks.

The main technical difficulty is to control the propagation of errors. Since every task is represented by a finite training set, the learner has access only to approximations of the true labeling functions, which may degrade the quality of this collection of functions as a basis for new tasks. Balcan et al. [3] control this error propagation using elegant geometric arguments for linear predictors under isotropic log-concave distributions. We show that moving from linear combinations to majority votes yields the required stability of the quality of the representation under arbitrary distributions.

Finally, while we first present our algorithm and results for known upper bounds of $k$ base tasks and $n$ tasks in total, we also provide a variation of the algorithm that does not need to know the number of tasks and the complexity parameter of the task sequence in advance. We show in Section 5 that similar sample complexity improvements are achievable in this setting.

## 2 Related Work

**Lifelong learning.** While there are many ways in which prediction tasks may be related [20], most existing approaches to transfer or lifelong learning exploit possible similarities between the optimal predictors for the considered tasks. In particular, one widely used relatedness assumption is that these predictors can be described as linear or sparse combinations of some common meta-features, and the corresponding methods aim at learning this representation [10, 2, 15, 3]. Though this idea was originally used in the multi-task setting, it was later extended to lifelong learning by Eaton et al. [9], who proposed a method for sequentially updating the underlying representation as new tasks arrive. These settings were theoretically analyzed in a series of works [18, 19, 23, 21] that demonstrated that information transfer can lead to provable sample complexity reductions compared to solving each task independently. However, all these results rely on Baxter's model of lifelong learning and therefore assume access to the training data of all (observed) tasks, and they provide guarantees only on the average performance over all tasks. An exception is [6], where the authors provide error guarantees for every task in the multi-task scenario. However, these guarantees are due to the relatedness assumption used, which implies that all tasks have the same expected error.

The task relatedness assumption that we employ is related to the one used in [1] for multi-task learning with expert advice. There, the authors consider a setting in which there exists a small subset of experts that perform well on all tasks. Similarly, we assume that there is a small subset of base tasks such that the remaining ones can be solved well using majority votes over the corresponding base hypotheses.

**Majority votes.** Weighted majority votes are a type of ensemble predictor that is theoretically well understood and widely used in practice. In particular, they are employed in boosting [11]. They are also often considered in works that utilize PAC-Bayesian techniques [22, 12]. Majority votes are also used in the concept drift setting [14].
The corresponding method, conceptually similar to the one proposed here, dynamically updates a set of experts and uses their weighted majority votes for making predictions.

## 3 Formal Setting

### 3.1 General notation and background

We let $X \subseteq \mathbb{R}^d$ denote a domain set and $Y$ a label set. A hypothesis is a function $h : X \to Y$, and a hypothesis class $H$ is a set of hypotheses. We model learning tasks as pairs $\langle D, h^*\rangle$ of a distribution $D$ over $X$ and a labeling function $h^* : X \to Y$. The quality of a hypothesis is measured by a loss function $\ell : Y \times Y \to \mathbb{R}_+$. We deal with binary classification tasks, that is, $Y = \{-1, 1\}$, under the 0/1-loss function, $\ell(y, y') = [\![y \ne y']\!]$ (we let $[\![\cdot]\!]$ denote the indicator function). The risk of a hypothesis $h$ with respect to a task $\langle D, h^*\rangle$ is defined as its expected loss, $L_{D,h^*}(h) := \mathbb{E}_{x\sim D}[\ell(h(x), h^*(x))]$. Given a sample $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the empirical risk of $h$ with respect to $S$ is $L_S(h) := \frac{1}{n}\sum_{i=1}^{n}\ell(h(x_i), y_i)$.

For binary classification, the sample complexity of learning a hypothesis class is characterized (that is, upper and lower bounded) by the VC-dimension of the class [27]. We will employ the following generalization bounds for classes of finite VC-dimension:

**Theorem 1** (Corollaries 5.2 and 5.3 in [7]). Let $H$ be a class of binary functions with finite VC-dimension. There exists a constant $C$ such that for any $\delta \in (0, 1)$ and any task $\langle D, h^*\rangle$, with probability at least $1-\delta$ over a training set $S$ of size $n$ sampled i.i.d. from $\langle D, h^*\rangle$:
$$L_{D,h^*}(\hat h) \le L_S(\hat h) + \sqrt{L_S(\hat h)\,\epsilon_n} + \epsilon_n, \quad (1)$$
$$L_S(\hat h) \le L_{D,h^*}(\hat h) + \sqrt{L_{D,h^*}(\hat h)\,\epsilon_n} + \epsilon_n, \quad (2)$$
$$L_{D,h^*}(\hat h) \le \inf_{h\in H} L_{D,h^*}(h) + \sqrt{\inf_{h\in H} L_{D,h^*}(h)\,\epsilon_n} + \epsilon_n, \quad (3)$$
where $\hat h \in \arg\min_{h\in H} L_S(h)$ is an empirical risk minimizer and
$$\epsilon_n = \epsilon(\mathrm{VC}(H), \delta, n) := C\,\frac{\mathrm{VC}(H)\log(n) + \log(1/\delta)}{n}. \quad (4)$$

In the realizable case ($h^* \in H$), the above bounds imply that the sample complexity is upper bounded by $O\big(\frac{\mathrm{VC}(H)+\log(1/\delta)}{\epsilon}\big)$.

**Weighted majority votes.** Given a hypothesis class $H$, we define the class of $k$-majority votes as
$$\mathrm{MV}(H, k) = \Big\{\, g : X \to Y \;\Big|\; \exists\, h_1, \ldots, h_k \in H,\ w_1, \ldots, w_k \in \mathbb{R} :\ g(x) = \mathrm{sign}\Big(\sum_{i=1}^{k} w_i h_i(x)\Big) \,\Big\}.$$
We will omit $k$ in this notation when it is clear from the context. The VC-dimension of $\mathrm{MV}(H, k)$ is upper bounded as
$$\mathrm{VC}(\mathrm{MV}(H, k)) \le k\,(\mathrm{VC}(H) + 1)\,\big(3\log(\mathrm{VC}(H) + 1) + 2\big) \quad (5)$$
(see Theorem 10.3 in [24]). This implies, in particular, that the VC-dimension of majority votes over a fixed set of $k$ functions is upper bounded by $O(k\log(k))$.
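As an illustration of the class $\mathrm{MV}(H, k)$ defined above, the following is a minimal sketch of a weighted majority-vote predictor, i.e. the sign of a weighted sum of the $\pm 1$ predictions of $k$ base hypotheses. The sketch is ours (the paper contains no code); representing hypotheses as Python callables is an assumption made purely for illustration.

```python
import numpy as np

def majority_vote(base_hypotheses, weights):
    """Return g(x) = sign(sum_i w_i * h_i(x)), a member of MV(H, k).

    base_hypotheses: list of callables mapping an input x to a label in {-1, +1}
    weights:         k real weights w_1, ..., w_k
    """
    weights = np.asarray(weights, dtype=float)

    def g(x):
        votes = np.array([h(x) for h in base_hypotheses], dtype=float)
        s = float(np.dot(weights, votes))
        # Break ties towards +1 so that g always outputs a label in {-1, +1}.
        return 1 if s >= 0 else -1

    return g

# Example: three decision stumps on scalar inputs and task-specific weights.
if __name__ == "__main__":
    h1 = lambda x: 1 if x > 0.0 else -1
    h2 = lambda x: 1 if x > 0.5 else -1
    h3 = lambda x: 1 if x < 1.0 else -1
    g = majority_vote([h1, h2, h3], weights=[0.2, 1.0, 0.4])
    print([g(x) for x in (-1.0, 0.25, 0.75, 2.0)])
```

For a new task, only the weights are fit while the base hypotheses stay fixed, which is why the per-task capacity scales with the number of base hypotheses (of the order $k\log k$ by the remark after (5)) rather than with $\mathrm{VC}(H)$.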
### 3.2 Lifelong learning

In the lifelong learning setting, the learner encounters a stream of prediction problems $\langle D_1, h^*_1\rangle, \ldots, \langle D_n, h^*_n\rangle$, one at a time. In this work we focus on the realizable case, i.e. $h^*_i \in H$ for every $i$ and some fixed $H$. In contrast to most works on lifelong learning, we assume that the only information the learner is able to store about the already observed tasks is the obtained predictors; it does not have access to the training data of already solved tasks. The need to keep little information from previous tasks has also been argued for in the context of domain adaptation [16].

Possible benefits of information transfer depend on how related or, in other words, how diverse the observed tasks are. Moreover, since we do not make any assumptions on the task generation process (in contrast to Baxter's i.i.d. model [4]), we formulate our relatedness assumption in terms of a sequence of tasks. Intuitively, one would expect information transfer to be beneficial if, only a few times throughout the course of learning, the information obtained from the already solved tasks is not sufficient to solve the current one.

In order to formalize this intuition, we use the following (pseudo-)metric over the hypothesis class with respect to a marginal distribution $D$:
$$d_D(h, h') = \mathbb{E}_{x\sim D}[\![h(x) \ne h'(x)]\!]. \quad (6)$$
Further, we define the distance of a hypothesis to a hypothesis space as
$$d_D(h, H') = \min_{h'\in H'} d_D(h, h') \quad (7)$$
and the distance between two sets of hypotheses as
$$d_D(H, H') = \max_{h\in H} d_D(h, H') = \max_{h\in H}\,\min_{h'\in H'} d_D(h, h'). \quad (8)$$
Note that the latter is not necessarily a metric over subsets of the hypothesis space. However, it does satisfy the triangle inequality (see Section 1 in the supplementary material).

Now we can formulate the diversity measure for a sequence of learning tasks that we will employ. Note that the concepts below are closely related to the ones used in [3] for the case of linear predictors and linear combinations over these.

**Definition 1.** A sequence of learning tasks $\langle D_1, h^*_1\rangle, \ldots, \langle D_n, h^*_n\rangle$ is $\gamma$-separated if for every $i$
$$d_{D_i}\big(h^*_i,\ \mathrm{MV}(h^*_1, \ldots, h^*_{i-1})\big) > \gamma.$$

**Definition 2.** A sequence of learning tasks $\langle D_1, h^*_1\rangle, \ldots, \langle D_n, h^*_n\rangle$ has $\gamma$-effective dimension $k$ if the largest $\gamma$-separated subsequence of these tasks has length $k$.

Formally, we will assume that the $\gamma$-effective dimension $k$ of the observed sequence of tasks is relatively small for a sufficiently small $\gamma$. Note that this assumption can also be seen as a relaxation of the one used in [8]. There, the authors assumed that there exists a set of $k$ hypotheses such that every task can be solved well by one of them. This would correspond to substituting the sets of weighted majority votes $\mathrm{MV}(h^*_1, \ldots, h^*_{i-1})$ by just the collections $\{h^*_1, \ldots, h^*_{i-1}\}$ in the above definitions.

Moreover, we will assume that the marginal distributions have small discrepancy with respect to the hypothesis class $H$:
$$\mathrm{disc}_H(D_i, D_j) = \max_{h,h'\in H} \big|\,d_{D_i}(h, h') - d_{D_j}(h, h')\,\big|. \quad (9)$$
This is a measure of task relatedness that was introduced in [13] and shown to be beneficial in the context of domain adaptation [5, 17]. Note, however, that we do not make any assumptions on the marginal distributions $D_1, \ldots, D_n$ themselves.

## 4 Algorithm and complexity guarantees

### 4.1 The algorithm

We employ a natural algorithmic paradigm, which is similar to the one in [3]. Algorithm 1 below provides pseudocode for our procedure. The algorithm takes as parameters a class $H$, which we call the ground class, accuracy and confidence parameters $\epsilon$ and $\delta$, as well as a task horizon $n$ (the number of tasks to be solved) and a parameter $k$ (a guessed upper bound on the number of tasks that will not be solvable as majority votes over earlier tasks). In Section 5 we present a version that does not need to know $n$ and $k$ in advance.

**Algorithm 1** Lifelong learning of majority votes
1: Input parameters $H, n, k, \epsilon, \delta$
2: set $\delta' = \delta/(2n)$, $\epsilon' = \epsilon/(8k)$
3: draw a training set $S_1$ from $\langle D_1, h^*_1\rangle$, such that $\epsilon_1 := \epsilon(\mathrm{VC}(H), \delta', |S_1|) \le \epsilon'$
4: $g_1 = \arg\min_{h\in H} L_{S_1}(h)$
5: set $\bar k = 1$, $i_1 = 1$, $\hat h_1 = g_1$
6: for $i = 2$ to $n$ do
7: &nbsp;&nbsp;draw a training set $S_i$ from $\langle D_i, h^*_i\rangle$, such that $\epsilon_i := \epsilon(\mathrm{VC}(\mathrm{MV}(\hat h_1, \ldots, \hat h_{\bar k})), \delta', |S_i|) \le \epsilon/40$
8: &nbsp;&nbsp;$g_i = \arg\min_{h\in \mathrm{MV}(\hat h_1, \ldots, \hat h_{\bar k})} L_{S_i}(h)$
9: &nbsp;&nbsp;if $L_{S_i}(g_i) + \sqrt{L_{S_i}(g_i)\,\epsilon_i} + \epsilon_i > \epsilon$ then
10: &nbsp;&nbsp;&nbsp;&nbsp;draw a training set $S_i$ from $\langle D_i, h^*_i\rangle$, such that $\epsilon_i := \epsilon(\mathrm{VC}(H), \delta', |S_i|) \le \epsilon'$
11: &nbsp;&nbsp;&nbsp;&nbsp;$g_i = \arg\min_{h\in H} L_{S_i}(h)$
12: &nbsp;&nbsp;&nbsp;&nbsp;set $\bar k = \bar k + 1$, $\hat h_{\bar k} = g_i$, $i_{\bar k} = i$
13: &nbsp;&nbsp;end if
14: end for
15: return $g_1, \ldots, g_n$
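The following is a minimal executable sketch of Algorithm 1's control flow, ours rather than the authors'. The ERM steps and the data source are abstracted behind placeholder callables (`draw_training_set`, `erm_ground`, `erm_majority_vote`, `vc_mv`), and the constant $C$ from (4) is not specified in the paper, so the sample sizes below are only illustrative.

```python
import math

C = 1.0  # unspecified constant from bound (4); its value is not given in the paper

def epsilon_bound(vc_dim, delta, m):
    """Estimation-error level from (4): eps(d, delta, m) = C (d log m + log(1/delta)) / m."""
    return C * (vc_dim * math.log(m) + math.log(1.0 / delta)) / m

def sample_size(vc_dim, delta, target_eps):
    """Smallest m (found by doubling) with epsilon_bound(vc_dim, delta, m) <= target_eps."""
    m = 2
    while epsilon_bound(vc_dim, delta, m) > target_eps:
        m *= 2
    return m

def lifelong_majority_votes(tasks, vc_ground, n, k, eps, delta,
                            draw_training_set, erm_ground, erm_majority_vote, vc_mv):
    """Sketch of Algorithm 1. `tasks` is the stream of task identifiers; the remaining
    arguments are oracles: drawing labeled samples, ERM over H, ERM over MV(base), and
    an upper bound vc_mv(k_bar) on VC(MV(h_1, ..., h_{k_bar}))."""
    delta_p, eps_p = delta / (2 * n), eps / (8 * k)
    predictors, base = [], []

    for i, task in enumerate(tasks, start=1):
        if i == 1:
            S = draw_training_set(task, sample_size(vc_ground, delta_p, eps_p))
            g = erm_ground(S)
            base.append(g)
        else:
            S = draw_training_set(task, sample_size(vc_mv(len(base)), delta_p, eps / 40.0))
            g, train_err = erm_majority_vote(base, S)
            eps_i = epsilon_bound(vc_mv(len(base)), delta_p, len(S))
            if train_err + math.sqrt(train_err * eps_i) + eps_i > eps:
                # Majority votes over the current base are not accurate enough:
                # fall back to the ground class and extend the base.
                S = draw_training_set(task, sample_size(vc_ground, delta_p, eps_p))
                g = erm_ground(S)
                base.append(g)
        predictors.append(g)
    return predictors
```

Only the fallback branch touches the (potentially large) ground class; in the common case a task is solved by choosing weights over the stored base predictors, which is what drives the sample complexity savings in Theorem 2 below.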
During the course of its life, the algorithm maintains a set of base hypotheses $(\hat h_1, \ldots, \hat h_{\bar k})$ from the ground class, which are predictors learned on previous tasks. In order to solve the first task, it uses the hypothesis class $H$ and a large enough training set $S_1$ to ensure the error guarantee $\epsilon' = \epsilon/(8k)$ with probability at least $1-\delta'$, where $\delta' = \delta/(2n)$. The learned hypothesis $\hat h_1 \in H$ is the first member of the set of base hypotheses. For each new task $i$, the algorithm first attempts to achieve good prediction performance (up to error $\epsilon$) with a weighted majority vote over the base hypotheses, i.e. it attempts to learn this task using the class $\mathrm{MV}(\hat h_1, \ldots, \hat h_{\bar k})$, and uses the obtained predictor for the task if successful. Otherwise (if no majority vote classifier achieves high accuracy), the algorithm resorts to learning a classifier from the ground class for this task, which is then added to the set of base hypotheses, to be used for subsequent tasks. The error guarantees are ensured via Theorem 1 by choosing the training sets $S_i$ large enough so that
$$\epsilon_i := \epsilon(\mathrm{VC}(H_i), \delta', |S_i|) = C\,\frac{\mathrm{VC}(H_i)\log(|S_i|) + \log(1/\delta')}{|S_i|}$$
is small enough, where $H_i$ is either the ground class $H$ or the set of weighted majority votes over the current set of base hypotheses, $\mathrm{MV}(\hat h_1, \ldots, \hat h_{\bar k})$, and the required accuracy level is set according to the case (see the pseudocode).

While this approach is very natural, the challenge is to analyze it and to specify the parameters. In particular, we need to ensure that the algorithm will not have to search over the (potentially large) hypothesis class $H$ too often and that, consequently, it leads to provable sample complexity reductions over solving each task independently. The following theorem summarizes the performance guarantees for Algorithm 1 (the proof is in Section 4.3).

**Theorem 2.** Consider running Algorithm 1 on a sequence of tasks with $\gamma$-effective dimension at most $k$ and $\mathrm{disc}_H(D_i, D_j) \le \xi$ for all $i, j$. Then, if $\gamma \le \epsilon/4$ and $k\xi < \epsilon/8$, with probability at least $1-\delta$:
- The error on every task is bounded: $L_{D_i,h^*_i}(g_i) \le \epsilon$ for every $i = 1, \ldots, n$.
- The total number of labeled examples used is $O\big(\frac{nk + \mathrm{VC}(H)k^2}{\epsilon}\big)$.

**Discussion.** Note that if we assume that all tasks are realizable by $H$, learning them independently up to error $\epsilon$ would have sample complexity $O\big(\frac{\mathrm{VC}(H)\,n}{\epsilon}\big)$. The sample complexity of learning $n$ tasks in the lifelong learning regime with our paradigm is, in contrast, $O\big(\frac{nk + \mathrm{VC}(H)k^2}{\epsilon}\big)$. This is a significant reduction if the effective dimension $k$ of the task sequence is small in comparison to the total number $n$ of tasks, as well as to the complexity measure $\mathrm{VC}(H)$ of the ground class. That is, if most tasks are learnable as combinations of previously stored base predictors, much less data is required overall.

Note that for all those tasks that are solved as majority votes, our algorithm and analysis actually require realizability only by the class of $k$-majority votes over $H$, and not by the ground class $H$ itself. Learning the $n$ tasks independently under this assumption has sample complexity $O\big(\frac{\mathrm{VC}(H)k}{\epsilon} + \frac{(n-k)\,\mathrm{VC}(H)k}{\epsilon}\big)$. In contrast, the lifelong learning method gradually identifies the relevant set of base predictors and thereby reduces the number of required examples.

### 4.2 Neural networks

If the ground class is the class of linear predictors, our algorithm is actually learning a neural network (with $\mathrm{sign}(\cdot)$ as the activation function). Each base classifier becomes a new node in a hidden middle layer. Thus, the maintained set of base classifiers can be viewed as a feature representation in the neural net, which was learned based on the encountered tasks. A new task is then either solved by employing the representation learned from previous tasks (the current middle layer), just learning task-specific weights for the last layer, or, in case this is not possible, a fresh linear classifier is learned and added as a node to the middle layer. Thus, in this case, the feature representation is extended.
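To make the neural-network view concrete: with a linear ground class, the stored base classifiers form a shared hidden layer of sign units and each task only owns its output weights. The sketch below is our own illustration of this correspondence; the class and method names are not from the paper.

```python
import numpy as np

def sign(z):
    """Sign activation mapping to {-1, +1} (ties broken towards +1)."""
    return np.where(z >= 0, 1, -1)

class LifelongSignNetwork:
    """Hidden layer = stored linear base classifiers (shared across tasks);
    output layer = task-specific majority-vote weights."""

    def __init__(self, input_dim):
        self.input_dim = input_dim
        self.hidden = []          # list of (w, b): one linear base classifier per node
        self.output_weights = {}  # task_id -> weight vector over hidden nodes

    def add_base_classifier(self, w, b=0.0):
        """Extend the representation: a task not solvable by the current hidden
        layer contributes a new hidden node."""
        self.hidden.append((np.asarray(w, dtype=float), float(b)))

    def hidden_features(self, x):
        """Outputs of the hidden layer: the +/-1 predictions of all base classifiers."""
        x = np.asarray(x, dtype=float)
        return np.array([sign(np.dot(w, x) + b) for (w, b) in self.hidden])

    def set_task_weights(self, task_id, weights):
        """Reuse the representation: solving a task with a majority vote only
        sets the last-layer weights for that task."""
        self.output_weights[task_id] = np.asarray(weights, dtype=float)

    def predict(self, task_id, x):
        return int(sign(np.dot(self.output_weights[task_id], self.hidden_features(x))))

# Usage: two stored base classifiers in R^2, one task reusing them as a majority vote.
net = LifelongSignNetwork(input_dim=2)
net.add_base_classifier([1.0, 0.0])
net.add_base_classifier([0.0, 1.0])
net.set_task_weights("task_3", [0.7, 0.3])
print(net.predict("task_3", [0.2, -1.0]))
```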
### 4.3 Analysis

We start by presenting two lemmas that show how to control the error propagation of the learned representations (sets of base classifiers). We then proceed to the proof of Theorem 2.

**Lemma 1.** Let $V = \mathrm{MV}(h_1, \ldots, h_k, g)$ and $\tilde V = \mathrm{MV}(h_1, \ldots, h_k, \tilde g)$. Then, for any distribution $D$:
$$d_D(V, \tilde V) \le d_D(g, \tilde g). \quad (10)$$

*Proof.* By the definition of $d_D(V, \tilde V)$ there exists $u \in V$ such that
$$d_D(V, \tilde V) = d_D(u, \tilde V). \quad (11)$$
We can represent $u$ as $u = \mathrm{sign}\big(\sum_{i=1}^{k}\alpha_i h_i + \alpha g\big)$ and let $u_1 = \sum_{i=1}^{k}\alpha_i h_i$. Note that, while all the $h_i$, $g$ and $\tilde g$ are assumed to take values in $\{-1, 1\}$, $u_1$ can take values in $\mathbb{R}$. Then:
$$d_D(u, \tilde V) = \min_{\tilde h\in \tilde V} d_D(u, \tilde h) \le \min_{\tilde h\in \mathrm{MV}(u_1, \tilde g)} d_D(u, \tilde h) \le \max_{h\in \mathrm{MV}(u_1, g)}\ \min_{\tilde h\in \mathrm{MV}(u_1, \tilde g)} d_D(h, \tilde h) = d_D\big(\mathrm{MV}(u_1, g),\ \mathrm{MV}(u_1, \tilde g)\big).$$
Now we show that for any $\mathrm{sign}(\alpha_1 u_1 + \alpha_2 g) \in \mathrm{MV}(u_1, g)$ there exists a close hypothesis in $\mathrm{MV}(u_1, \tilde g)$; in particular, this hypothesis is $\mathrm{sign}(\alpha_1 u_1 + \alpha_2 \tilde g)$:
$$d_D\big(\mathrm{sign}(\alpha_1 u_1 + \alpha_2 g),\ \mathrm{sign}(\alpha_1 u_1 + \alpha_2 \tilde g)\big) = \mathbb{E}_{x\sim D}\big[\!\big[\,\mathrm{sign}(\alpha_1 u_1(x) + \alpha_2 g(x)) \ne \mathrm{sign}(\alpha_1 u_1(x) + \alpha_2 \tilde g(x))\,\big]\!\big]$$
$$= \mathbb{E}_{x\sim D}\big[\!\big[\,\alpha_1^2 u_1^2(x) + \alpha_1\alpha_2 u_1(x)g(x) + \alpha_1\alpha_2 u_1(x)\tilde g(x) + \alpha_2^2 g(x)\tilde g(x) < 0\,\big]\!\big].$$
Note that for every $x$ on which $g$ and $\tilde g$ agree, i.e. $g(x)\tilde g(x) = 1$, we obtain
$$\alpha_1^2 u_1^2(x) + \alpha_1\alpha_2 u_1(x)g(x) + \alpha_1\alpha_2 u_1(x)\tilde g(x) + \alpha_2^2 g(x)\tilde g(x) = \big(\alpha_1 u_1(x) + \alpha_2 g(x)\big)^2 \ge 0.$$
Therefore
$$d_D\big(\mathrm{sign}(\alpha_1 u_1 + \alpha_2 g),\ \mathrm{sign}(\alpha_1 u_1 + \alpha_2 \tilde g)\big) \le \mathbb{E}_{x\sim D}[\![g(x) \ne \tilde g(x)]\!] = d_D(g, \tilde g), \quad (12)$$
and combining this with the chain of inequalities above proves the claim.

**Lemma 2.** Let $V_k = \mathrm{MV}(h_1, \ldots, h_k)$ and $\tilde V_k = \mathrm{MV}(\tilde h_1, \ldots, \tilde h_k)$. For any distribution $D$, if $d_D(h_i, \tilde h_i) \le \epsilon_i$ for every $i = 1, \ldots, k$, then $d_D(V_k, \tilde V_k) \le \sum_{i=1}^{k}\epsilon_i$.

For the proof see Section 2 in the supplementary material.
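The heart of Lemma 1 is the observation that, for fixed weights, swapping $g$ for $\tilde g$ can change the resulting majority vote only on points where $g$ and $\tilde g$ disagree. This is easy to check numerically; the following small simulation (ours, with the empirical distribution over a finite sample standing in for $D$) does so for random weight vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, k = 10_000, 5

# Base hypotheses, g, and a perturbed g_tilde, all as +/-1 vectors over n_points
# sample points (the empirical distribution plays the role of D).
H = rng.choice([-1, 1], size=(k, n_points))
g = rng.choice([-1, 1], size=n_points)
flip = rng.random(n_points) < 0.03          # g_tilde disagrees with g on ~3% of points
g_tilde = np.where(flip, -g, g)
d_g = np.mean(g != g_tilde)                  # d_D(g, g_tilde)

worst = 0.0
for _ in range(1000):
    alpha = rng.normal(size=k)               # weights on h_1, ..., h_k
    a = rng.normal()                          # weight on g (resp. g_tilde)
    u = np.sign(alpha @ H + a * g)
    u_tilde = np.sign(alpha @ H + a * g_tilde)
    worst = max(worst, np.mean(u != u_tilde))

print(f"d_D(g, g_tilde) = {d_g:.4f},  max disagreement of matched votes = {worst:.4f}")
# The printed maximum never exceeds d_D(g, g_tilde), as the proof of Lemma 1 shows:
# where g and g_tilde agree, the two weighted sums are identical.
```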
*Proof of Theorem 2.*

1. First, note that for every task Algorithm 1 solves at most two estimation problems, each with probability of failure $\delta'$. Therefore, by a union bound, the probability of any of these estimations being wrong is at most $2n\delta' = \delta$. From now on we assume that all the estimations were correct, that is, that the high-probability events of Theorem 1 hold.

2. To see that the error on every encountered task is bounded by $\epsilon$, note that there are two cases. For tasks $i$ that are solved by a majority vote over previous tasks, we have $L_{S_i}(g_i) + \sqrt{L_{S_i}(g_i)\,\epsilon_i} + \epsilon_i \le \epsilon$. In this case, Equation (1) of Theorem 1 implies $L_{D_i,h^*_i}(g_i) \le \epsilon$. For tasks $i$ that are not solved as a majority vote over previous tasks, we have $\epsilon_i = \epsilon(\mathrm{VC}(H), \delta', |S_i|) \le \epsilon/(8k)$. Since task $i$ is realizable by the ground class $H$, we have $\inf_{h\in H} L_{D_i,h^*_i}(h) = 0$, and thus Equation (3) of Theorem 1 implies $L_{D_i,h^*_i}(g_i) \le \epsilon/(8k) < \epsilon$.

3. To upper bound the sample complexity, we first prove that the number $\bar k$ of tasks that are not learned as majority votes over previous tasks is at most $k$. For this we use induction, showing that for every $\hat k \le \bar k$, when we create a new $\hat h_{\hat k}$ from the $i_{\hat k}$-th task, we have
$$d_{D_{i_{\hat k}}}\big(h^*_{i_{\hat k}},\ \mathrm{MV}(h^*_{i_1}, \ldots, h^*_{i_{\hat k - 1}})\big) > \gamma. \quad (13)$$
This implies $\bar k \le k$, since the $\gamma$-effective dimension of the sequence of encountered tasks is at most $k$. To proceed with the induction, note that for $\hat k = 1$ the claim follows immediately. Consider $\hat k > 1$. If we create a new $\hat h_{\hat k}$, it means that the condition in line 9 is true, i.e.
$$L_{S_{i_{\hat k}}}(g_{i_{\hat k}}) + \sqrt{L_{S_{i_{\hat k}}}(g_{i_{\hat k}})\,\epsilon_{i_{\hat k}}} + \epsilon_{i_{\hat k}} > \epsilon. \quad (14)$$
Therefore $L_{S_{i_{\hat k}}}(g_{i_{\hat k}}) > 0.83\,\epsilon$. Consequently, due to (2), $L_{D_{i_{\hat k}},h^*_{i_{\hat k}}}(g_{i_{\hat k}}) > 0.67\,\epsilon$. Finally, by (3), $\inf_{g} L_{D_{i_{\hat k}},h^*_{i_{\hat k}}}(g) > 0.5\,\epsilon$, where the infimum is over $\mathrm{MV}(\hat h_1, \ldots, \hat h_{\hat k - 1})$. Therefore there is no majority vote predictor based on $\hat h_1, \ldots, \hat h_{\hat k - 1}$ that leads to error less than $\epsilon/2$ on problem $i_{\hat k}$. In other words,
$$d_{D_{i_{\hat k}}}\big(h^*_{i_{\hat k}},\ \mathrm{MV}(\hat h_1, \ldots, \hat h_{\hat k - 1})\big) > \epsilon/2. \quad (15)$$
Now, by way of contradiction, suppose that $d_{D_{i_{\hat k}}}\big(h^*_{i_{\hat k}}, \mathrm{MV}(h^*_{i_1}, \ldots, h^*_{i_{\hat k - 1}})\big) \le \gamma$. By construction, for every $j = 1, \ldots, \hat k - 1$ we have $d_{D_{i_j}}(h^*_{i_j}, \hat h_j) \le \epsilon' = \epsilon/(8k)$. By the definition of the discrepancy and the assumption on the marginal distributions, it follows that for all $j$
$$d_{D_{i_{\hat k}}}(h^*_{i_j}, \hat h_j) \le d_{D_{i_j}}(h^*_{i_j}, \hat h_j) + \mathrm{disc}_H(D_{i_j}, D_{i_{\hat k}}) \le \epsilon' + \xi. \quad (16)$$
Therefore, by Lemma 2,
$$d_{D_{i_{\hat k}}}\big(\mathrm{MV}(h^*_{i_1}, \ldots, h^*_{i_{\hat k - 1}}),\ \mathrm{MV}(\hat h_1, \ldots, \hat h_{\hat k - 1})\big) \le k(\epsilon' + \xi). \quad (17)$$
Consequently, by using the triangle inequality,
$$d_{D_{i_{\hat k}}}\big(h^*_{i_{\hat k}},\ \mathrm{MV}(\hat h_1, \ldots, \hat h_{\hat k - 1})\big) \le \gamma + k(\epsilon' + \xi) \le \epsilon/4 + \epsilon/8 + \epsilon/8 = \epsilon/2, \quad (18)$$
which contradicts (15).

4. The total sample complexity of Algorithm 1 consists of two parts. First, for every task, Algorithm 1 checks whether it can be solved by a majority vote over the base (at most $k$) predictors. For this it employs Theorem 1 and therefore needs the following number of samples:
$$O\!\left(n\,\frac{k\log(k)\log(k\log k) + \log(2n/\delta)}{\epsilon}\right).$$
Second, there are at most $k$ tasks that satisfy the condition in line 9 and are learned using the hypothesis class $H$ with estimation error $\epsilon' = \epsilon/(8k)$. Therefore the corresponding sample complexity is
$$O\!\left(k\,\frac{\mathrm{VC}(H) + \log(2n/\delta)}{\epsilon/(8k)}\right) = O\!\left(\frac{\mathrm{VC}(H)\,k^2}{\epsilon}\right).$$

## 5 Lifelong learning with unknown horizon

In this section we present a modification of Algorithm 1 for the case when the total number of tasks $n$ and the complexity $k$ of the task sequence are not known in advance. The main difference between Algorithm 2 and Algorithm 1 is that, with unknown $n$ and $k$, the learner has to adapt the parameters $\delta'$ and $\epsilon'$ on the fly. We show that this can be done by the doubling trick that is often used in online learning. Theorem 3 summarizes the resulting guarantees (the proof can be found in the supplementary material, Section 3).

**Algorithm 2** Lifelong learning of majority votes with unknown horizon
1: Input parameters $H, \epsilon, \delta$
2: set $\delta_1 = \delta/2$, $\epsilon'_1 = \epsilon/16$
3: draw a training set $S_1$ from $\langle D_1, h^*_1\rangle$ of size $m$, such that $\epsilon(\mathrm{VC}(H), \delta_1, m) \le \epsilon'_1$ (see (4))
4: $g_1 = \arg\min_{h\in H} L_{S_1}(h)$
5: set $\bar k = 1$, $i_1 = 1$, $\hat h_1 = g_1$
6: for $i = 2$ to $n$ do
7: &nbsp;&nbsp;set $l = \lceil\log i\rceil$, $m' = \lceil\log \bar k\rceil + 1$
8: &nbsp;&nbsp;set $\delta_i = \delta/2^{2l+2}$, $\epsilon'_i = \epsilon/2^{2m'+4}$
9: &nbsp;&nbsp;draw a training set $S_i$ from $\langle D_i, h^*_i\rangle$ of size $m$, such that $\epsilon_i := \epsilon(\mathrm{VC}(\mathrm{MV}(\hat h_1, \ldots, \hat h_{\bar k})), \delta_i, m) \le \epsilon/40$ (see (4))
10: &nbsp;&nbsp;$g_i = \arg\min_{h\in \mathrm{MV}(\hat h_1, \ldots, \hat h_{\bar k})} L_{S_i}(h)$
11: &nbsp;&nbsp;if $L_{S_i}(g_i) + \sqrt{L_{S_i}(g_i)\,\epsilon_i} + \epsilon_i > \epsilon$ then
12: &nbsp;&nbsp;&nbsp;&nbsp;draw a training set $S_i$ from $\langle D_i, h^*_i\rangle$ of size $m$, such that $\epsilon(\mathrm{VC}(H), \delta_i, m) \le \epsilon'_i$ (see (4))
13: &nbsp;&nbsp;&nbsp;&nbsp;$g_i = \arg\min_{h\in H} L_{S_i}(h)$
14: &nbsp;&nbsp;&nbsp;&nbsp;set $\bar k = \bar k + 1$, $\hat h_{\bar k} = g_i$, $i_{\bar k} = i$
15: &nbsp;&nbsp;end if
16: end for
17: return $g_1, \ldots, g_n$

**Theorem 3.** Consider running Algorithm 2 on a sequence of tasks with $\gamma$-effective dimension at most $k$ and $\mathrm{disc}_H(D_i, D_j) \le \xi$ for all $i, j$. Then, if $\gamma \le \epsilon/4$ and $k\xi < \epsilon/8$, with probability at least $1-\delta$:
- The error on every task is bounded: $L_{D_i,h^*_i}(g_i) \le \epsilon$ for every $i = 1, \ldots, n$.
- The total number of labeled examples used is $O\big(\frac{nk + \mathrm{VC}(H)k^3}{\epsilon}\big)$.
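Algorithm 2 replaces the fixed per-task parameters $\delta'$ and $\epsilon'$ of Algorithm 1 with schedules that shrink as the task index $i$ and the number of stored predictors $\bar k$ grow, so that a union bound over an unknown number of estimation events still sums to at most $\delta$. The snippet below is our own illustration of this budget-splitting idea with a block-halving schedule in the spirit of the $\delta_i$ above; it is not the exact bookkeeping of the paper's analysis.

```python
import math

def confidence_schedule(i, delta):
    """A doubling-trick split of the confidence budget: tasks are grouped into
    blocks (2^(l-1), 2^l], and every task in block l receives delta / 2^(2l+2)."""
    if i == 1:
        return delta / 2.0
    l = math.ceil(math.log2(i))
    return delta / 2 ** (2 * l + 2)

# Sanity check: however many tasks arrive, the allocated budget stays below delta.
delta = 0.05
for n in (10, 1_000, 100_000):
    total = sum(confidence_schedule(i, delta) for i in range(1, n + 1))
    print(n, round(total, 6), total <= delta)
```

Because each block of $2^{l-1}$ tasks is charged $\delta/2^{l+3}$ in total, the allocations form a geometric series and never exhaust the overall budget, no matter how late the stream ends.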
## 6 Conclusion

In this work, we have shown sample complexity improvements with lifelong learning in the challenging, yet, as argued, important setting where tasks arrive in a stream (without assumptions on the task-generating process), where the learner is only allowed to maintain limited amounts of information from previously encountered tasks, and where high performance is required for every single task, rather than on average. While such improvements have previously been established in very specific settings [3], our work shows they are possible in much more general and realistic scenarios. We hope that this will open the door for more work in this area of lifelong machine learning and lead to a better understanding of how and when learning machines can benefit from past experience. An intriguing direction is to investigate whether there exists a more general characterization of ensemble methods and/or data distributions that would lead to benefits with lifelong learning. Another is to better understand lifelong learning with neural networks, analyzing cases of more complex network structures and activation functions, an area where current machine learning practice yields exciting successes, but little is understood.

**Acknowledgments.** This work was in parts funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no 308036.

## References

[1] J. Abernethy, P. Bartlett, and A. Rakhlin. Multitask learning with expert advice. In Workshop on Computational Learning Theory (COLT), 2007.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning (ML), 2008.
[3] M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for lifelong learning and autoencoding. In Workshop on Computational Learning Theory (COLT), 2015.
[4] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12, 2000.
[5] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning (ML), 2010.
[6] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. 2003.
[7] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
[8] K. Crammer and Y. Mansour. Learning multiple tasks using shared hypotheses. In Conference on Neural Information Processing Systems (NIPS), 2012.
[9] E. Eaton and P. L. Ruvolo. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning (ICML), 2013.
[10] A. Evgeniou and M. Pontil. Multi-task feature learning. In Conference on Neural Information Processing Systems (NIPS), 2007.
[11] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning (ICML), 1996.
[12] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A new PAC-Bayesian perspective on domain adaptation. In International Conference on Machine Learning (ICML), 2016.
[13] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In International Conference on Very Large Data Bases (VLDB), 2004.
[14] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research (JMLR), 8:2755-2790, 2007.
[15] A. Kumar and H. Daumé III. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning (ICML), 2012.
[16] I. Kuzborskij and F. Orabona. Stability and hypothesis transfer learning. In International Conference on Machine Learning (ICML), 2013.
[17] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Workshop on Computational Learning Theory (COLT), 2009.
[18] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.
[19] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning (ICML), 2013.
[20] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[21] A. Pentina and S. Ben-David. Multi-task and lifelong learning of kernels. In Algorithmic Learning Theory (ALT), 2015.
[22] A. Pentina and C. H. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), 2014.
[23] M. Pontil and A. Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Workshop on Computational Learning Theory (COLT), 2013.
[24] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
[25] S. Thrun and T. M. Mitchell. Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25-46, 1995. The Biology and Technology of Intelligent Autonomous Agents.
[26] S. Thrun and L. Pratt. Learning to Learn. Kluwer Academic Publishers, 1998.
[27] V. N. Vapnik and A. J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264-280, 1971.