# A Combinatorial Perspective on Transfer Learning

Jianan Wang, Eren Sezener, David Budden, Marcus Hutter, Joel Veness
DeepMind
aixi@google.com

**Abstract.** Human intelligence is characterized not only by the capacity to learn complex skills, but the ability to rapidly adapt and acquire new skills within an ever-changing environment. In this work we study how the learning of modular solutions can allow for effective generalization to both unseen and potentially differently distributed data. Our main postulate is that the combination of task segmentation, modular learning and memory-based ensembling can give rise to generalization on an exponentially growing number of unseen tasks. We provide a concrete instantiation of this idea using a combination of: (1) the Forget-Me-Not Process, for task segmentation and memory-based ensembling; and (2) Gated Linear Networks, which in contrast to contemporary deep learning techniques use a modular and local learning mechanism. We demonstrate that this system exhibits a number of desirable continual learning properties: robustness to catastrophic forgetting, no negative transfer and increasing levels of positive transfer as more tasks are seen. We show competitive performance against both offline and online methods on standard continual learning benchmarks.

## 1 Introduction

Humans learn new tasks from a single temporal stream (online learning) by efficiently transferring experience of previously encountered tasks (continual learning). Contemporary machine learning algorithms struggle in both of these settings, and few attempts have been made to solve challenges at their intersection. Despite obvious computational inefficiencies, the dominant machine learning paradigm involves i.i.d. sampling of data at massive scale to reduce gradient variance and stabilize training via back-propagation. In the case of continual learning, the batch i.i.d. paradigm is often further extended to sample from a memory of experiences from all previous tasks. This is a popular method of overcoming "catastrophic forgetting" [CG88, MC89, Rob95], whereby a neural network trained on a new target task rapidly loses its ability to solve previous source tasks.

Instead of considering "online" and "continual" learning as inconvenient constraints to avoid, in this paper we describe a framework that leverages them as desirable properties to enable effective, data-efficient transfer of previously acquired skills. Core to this framework is the ability to ensemble task-specific neural networks at the level of individual nodes. This leads naturally to a desirable property we call combinatorial transfer, where a network of $m$ nodes trained on $h$ tasks can generalize to $h^m$ "pseudo-tasks". Although the distribution of tasks and pseudo-tasks may differ, we show that this method works very well in practice across a range of online continual learning benchmarks.

The ability to meaningfully ensemble individual neurons is not a property of contemporary deep neural networks, owing to a lack of explicit modularity in the distributed and tightly coupled feature representations learnt via backpropagation [BDR+19, CFB+19, PKRCS17]. To concretely instantiate our learning framework we instead borrow two complementary algorithms from recent literature: the Gated Linear Network (GLN) and the Forget-Me-Not Process (FMN). We demonstrate that the properties of both models are complementary and give rise to a system suitable for online continual learning.
*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

## 2 Background

Transfer learning can be defined as the improvement of learning a new (target) task by the transfer of knowledge obtained from learning about one or more related (source) tasks [TS09]. We focus on a generalization of transfer learning which has been studied under the names continual learning [PKP+19] or lifelong learning [CL18]. Rather than considering just a single source and target task, continual learning considers a sequence of tasks drawn from a potentially non-stationary distribution. The aim is to learn future tasks while both (1) not forgetting how to solve past tasks, and ideally (2) leveraging similarities in previous tasks to improve learning efficiency. Despite the similarity with online learning, this sequence of tasks is rarely exposed as a temporal stream of data in the literal sense. We refer to the intersection of these challenges as online continual learning.

A long-standing challenge for neural networks and continual learning is that the performance of a neural network on a previously encountered source task typically degrades rapidly when it is trained on a new target task. This phenomenon is widely known as catastrophic forgetting [CG88, MC89, Rob95], but we will sometimes refer to it as negative backward transfer to distinguish it from positive forward transfer (source task training improving target task performance) and positive backward transfer (target task training improving source task performance).

Previous approaches for overcoming catastrophic forgetting typically fall into two categories. The first category involves replaying source task data during target task training [Rob95, RKSL17, LPR17, CRE+19, ABT+19]. The second category involves maintaining task-specific network weights that can be leveraged in various ways [DJV+13, GDDM14], typically involving regularization of the training loss [KPR+17, ZPG17, SLC+18]. These methods share some combination of the following limitations: (1) requiring knowledge or accurate estimation of task boundaries; (2) additional compute and memory overhead for storing task-specific experience and/or weights; and (3) reliance on batch i.i.d. training, which makes them unsuitable for online training. A desirable algorithm for "human-like" continual learning is one that addresses these limitations while simultaneously encouraging positive forward and backward transfer.

## 3 Combinatorial Transfer Learning

We start with a short thought experiment. Consider two neural networks trained on two different tasks, as shown in Figure 1. If the networks are modular in the sense that we could locally (per node/neuron) ensemble the two networks, one could reasonably expect a global ensemble to work beyond the range of tasks it was trained on by leveraging its previously learnt building blocks across the two tasks. Now consider a network with $m$ nodes, independently trained on $h$ tasks, resulting in $h$ sets of parameters. If we locally ensemble these $h$ networks, such that for each position in the ensemble network we can pick a node of the same position from any of the networks, we would have $h^m$ possible instantiations. We refer to the task solved by each instantiation as a pseudo-task and claim that such a system exhibits combinatorial transfer.

There are two issues with this thought experiment. First, contemporary neural networks do not exhibit the modularity necessary for local ensembling.
Many recent studies aim at identifying reusable modular mechanisms to support causal and transfer learning [BDR+19, CFB+19, PKRCS17], but this is made fundamentally difficult by the nature of the distributed and tightly coupled representations learnt via backpropagation. Second, it is not obvious whether the distribution of pseudo-tasks (ensembles of partial solutions to source tasks) meaningfully captures the complexities of the unseen target tasks in a useful fashion. To address these issues, we seek a solution that satisfies the following desiderata: (1) associated to every node is a learning algorithm with its own local loss function; and (2) the nodes coordinate, in the sense that they work together to achieve a common goal. In Section 4 we propose an algorithm that fulfills these requirements. In Section 5 we evaluate its performance on a number of diagnostic and benchmark tasks, demonstrating that the combinatorial transfer property is present and leads to improved continual learning performance compared with previous state-of-the-art online and offline methods.

Figure 1: Neural Combinatorial Transfer Learning (NCTL). (Left) Consider two 6-node networks trained separately on tasks A and B. These networks can be locally ensembled (keeping node positions fixed) to form $2^6$ new networks. One particular instantiation is shown with small nodes placed within each (large) node of the local ensemble. Rather than ensembling by selecting models/nodes, we perform Bayesian Model Averaging, which is indicated using the colour gradients. (Center) Each NCTL node uses the FMN Process as its ensembling technique. Best solutions identified by the GGM base model are stored in a model pool and combined via Bayesian Model Averaging. Further (hierarchical) Bayesian Model Averaging is performed over postulated task segmentations, each of which is a temporal partition of the input stream into task segments. (Right) A candidate task segmentation can be any pruning of the tree structure shown, with the segmentation described by the set of leaves.

## 4 Algorithm

In this section we describe the Neural Combinatorial Transfer Learning (NCTL) algorithm, a concrete instantiation (see Figure 1) of the above ideas using (1) the Gated Geometric Mixer (GGM) as the node-level modular learning algorithm; and (2) the Forget-Me-Not Process (FMN) for automatic task segmentation and local ensembling of the learnt GGM solutions.

**Gated Geometric Mixer.** Geometric Mixing is a well studied ensemble technique for combining probabilistic forecasts [Mat12, Mat13]. Given input probabilities $p := (p_1, \ldots, p_K)$ predicting the occurrence of a single binary event, geometric mixing predicts probability $\sigma(w^\top \sigma^{-1}(p))$, where $\sigma(x) := 1/(1 + e^{-x})$ denotes the sigmoid function, $\sigma^{-1}$ denotes the logit function (applied componentwise), and $w \in \mathbb{R}^K$ is the weight vector which controls the relative importance of the input forecasts. A Gated Geometric Mixer (GGM) [VLB+19] is the combination of a gating procedure and geometric mixing. Gating has the intuitive meaning of mapping particular input examples to a particular choice of weight vector for use with geometric mixing.
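As a concrete illustration of the geometric mixing formula above, the following is a minimal numpy sketch (not the authors' implementation); the clipping constant, function names, and example values are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)   # guard against p exactly 0 or 1
    return np.log(p / (1.0 - p))

def geometric_mix(w, p):
    """Geometric mixing of K probabilistic forecasts: sigma(w . logit(p))."""
    return sigmoid(np.dot(w, logit(p)))

# Example: three forecasters predict the probability of the same binary event.
p = np.array([0.9, 0.6, 0.7])
w = np.ones(3)                        # uniform weights
print(geometric_mix(w, p))            # a single mixed probability
```

With uniform weights, geometric mixing reduces to a normalized product of the input forecasts (a product-of-experts style combination), which is one reason it is a natural building block for combining expert predictions.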
The key change compared with normal geometric mixing is that now our neuron will also take in an additional type of input, side information $z \in \mathcal{Z}$, which will be used by the contextual gating procedure to determine an active subset of the neuron weights to use for a given example. Associated with each GGM is a context function $c : \mathcal{Z} \to \mathcal{C}$, where $\mathcal{C} = \{1, \ldots, C\}$ for some $C \in \mathbb{N}$ is the context space. Each GGM is parameterized by a matrix $W = [w_1, \ldots, w_C]^\top$ with row vectors $w_i \in \mathbb{R}^K$. The context function $c$ is responsible for mapping a given piece of side information $z \in \mathcal{Z}$ to a particular row $w_{c(z)}$ of $W$, which we then use with standard geometric mixing. Formally, $\mathrm{GGM}_{c,W}(p, z) := \sigma\bigl(w_{c(z)} \cdot \sigma^{-1}(p)\bigr)$. One can efficiently adapt the weights $W$ online using Online Gradient Descent [Zin03] by exploiting convexity under the logarithmic loss, as described in [VLB+19].

**Forget-Me-Not Process.** Generalizing stationary algorithms to non-stationary environments is a key challenge in continual learning. In this work we choose to apply the Forget-Me-Not (FMN) process [MVK+16] as our node-level ensembling technique. The FMN process is a probabilistic meta-algorithm tailored towards non-i.i.d., piecewise stationary, repeating sources. This meta-algorithm takes as input a single base measure $\rho$ on target strings and extends the Partition Tree Weighting algorithm [VWBG13] to incorporate a memory of up to $k$ previous model states in a data structure known as a model pool. It efficiently applies Bayesian model averaging over a set of postulated segmentations of time (task boundaries) and a growing set $\mathcal{M}$ of stored base model states $\rho(\cdot \mid s)$ for some subsequence $s$ of $x_{1:n}$, while providing a mechanism to either learn a new local solution or adapt/recall previously learnt solutions. Here our base measure $\rho$ will be (implicitly) defined from a sequential application of GGM, and formally defined later.

The FMN algorithm is derived and described in [MVK+16]. It computes the probability $p = \mathrm{FMN}_d(x_{1:n}) \in [0, 1]$ of a string of binary targets $x_{1:n} \in \{0, 1\}^n$ of length $n$; e.g. $x_t$ could be a binary class label (of feature $z_t$ introduced later). For this, it hierarchically Bayes-mixes an exponentially large self-generated class of models from a base measure $\rho$ in $O(kn \log n)$ time and $O(k \log n)$ space, roughly as follows. For $n = 2^d$ and for $j = 0, \ldots, d$, it breaks up the string $x_{1:n}$ into $2^j$ strings, each of length $2^{d-j}$, which conceptually can be thought of in terms of a complete binary tree of depth $d$ (see Figure 1, right). For each substring $x_{a:b}$, associated to each node of the tree is a probability $\xi(x_{a:b})$ obtained from a Bayesian mixture of all models in the model pool $\mathcal{M}_a$ at time $a$. Taking any (variable depth) subtree (which induces a particular segmentation of time) and concatenating the strings at its leaves gives back $x_{1:n}$; therefore the product of their associated mixture probabilities gives a probability for $x_{1:n}$. Doing and averaging this (see [MVK+16, Eq. 7]) over all possible $O(2^n)$ subtrees, which can be done incrementally in time $O(k \log n)$ per string element, gives $\mathrm{FMN}_d(x_{1:n} \mid \rho)$. The models in the model pool are generated from an arbitrary adaptive base measure $\rho$ by conditioning it on past substrings $x_{a:b}$. For example, $\rho$ could be a Beta-Bernoulli model whose weights are updated using Bayesian inference, or something more sophisticated such as a GGM. At time $t$, $\mathcal{M}_t$ contains at most $k$ versions of $\rho$, with $\mathcal{M}_1 := \{\rho\}$.
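To make the contextual gating and the online log-loss update concrete, here is a minimal sketch of a single GGM neuron, assuming random half-space gating (the gating choice used later for NCTL). The hyperparameters, initialization, clipping constant, and class name are our own choices, and any weight projection used in [VLB+19] is omitted.

```python
import numpy as np

class GatedGeometricMixer:
    """A single GGM neuron: geometric mixing with half-space gated weights."""

    def __init__(self, n_inputs, side_info_dim, n_halfspaces=4, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Random half-space gating: the context id is the bit pattern of sign tests.
        self.hyperplanes = rng.normal(size=(n_halfspaces, side_info_dim))
        self.n_contexts = 2 ** n_halfspaces
        # One weight row per context, each mixing n_inputs input forecasts.
        self.W = np.full((self.n_contexts, n_inputs), 1.0 / n_inputs)
        self.lr = lr

    def _context(self, z):
        bits = (self.hyperplanes @ z > 0).astype(int)
        return int(bits @ (2 ** np.arange(len(bits))))

    def predict(self, p, z, eps=1e-4):
        p = np.clip(p, eps, 1.0 - eps)
        logits = np.log(p / (1.0 - p))                 # sigma^{-1}(p), componentwise
        c = self._context(z)
        q = 1.0 / (1.0 + np.exp(-self.W[c] @ logits))  # sigma(w_{c(z)} . logits)
        return q, c, logits

    def update(self, p, z, x):
        """One online gradient descent step on the log loss, target x in {0, 1}."""
        q, c, logits = self.predict(p, z)
        self.W[c] -= self.lr * (q - x) * logits        # grad of -log P(x) w.r.t. w_{c(z)}
        return q
```

Because only the active row $w_{c(z)}$ is read and updated for a given example, examples falling into different contexts train effectively disjoint sets of weights, which is what makes the neuron modular.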
For $t = 2, \ldots, n$, the path leading to the $t$-th leaf of the tree in Figure 1 (right) is traversed; whenever a string $x_{a:b}$ with $b = t$ is encountered, the model $\rho \in \mathcal{M}_a$ assigning the highest probability to the node's string $x_{a:b}$ is either added to the model pool, i.e. $\mathcal{M}_{t+1} = \mathcal{M}_t \cup \{\rho(\cdot \mid x_{a:b})\}$, or ignored, based on a Bayesian hypothesis test criterion given in [MVK+16].

**Neural Combinatorial Transfer Learning (NCTL).** We now instantiate the Combinatorial Transfer Learning framework. This involves defining a feed-forward network structure exactly the same as for a GLN, but replacing the basic notion of neuron, a GGM, with the more intricate FMN process applied to a GGM. Informally, the output of a GGM is first piped through the FMN process before being used in the NCTL neuron, so that each node is now well suited to non-stationary continual learning. We use random half-space gating [VLB+17, SHB+20] to generate the gates for each neuron.

A GLN is a network of GGMs with two types of input to each node: the first is side information $z \in \mathcal{Z}$, which can be thought of as the input features in a standard supervised learning setup; the second is the input $p \in [0, 1]^K$ to the node, which are the predictions output by the previous layer, or in the case of layer 0, $p_0 := f(z)$ for some function $f$ of the side information $z$. Upon receiving an input $p$ and side information $z$, each node attempts to directly predict the target probability $\mathbb{P}[x_t = 1 \mid z_t]$. Formally, neuron $j$ in layer $i$ outputs $q_{ij}^t := \mathrm{GGM}_{c_{ij}(\cdot), W_{ij}}(p_{i-1}^t, z_t)$. A GLN would now feed $p_i^t := q_i^t := (q_{i1}^t, \ldots, q_{iK_i}^t)$, where $K_i$ is the number of neurons in layer $i$, as input to the next layer. NCTL works in the same way but instantiates $p_i^t$ differently. Formally, $q_{ij}$ determines base measures $\rho_{ij}$ on the target string $x_{1:n}$ by ρij(xt = 1|x
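The feed-forward wiring just described can be sketched as follows, reusing the `GatedGeometricMixer` class from the earlier sketch. The choice of $f$ (an elementwise squashing of $z$), the layer sizes, and all names are our own; the per-node FMN ensembling that turns this plain GLN data flow into NCTL is deliberately not shown.

```python
import numpy as np

class GLNStyleNetwork:
    """Feed-forward wiring of GGM neurons: layer i's outputs become the
    probability inputs p_i of layer i+1. Reuses GatedGeometricMixer above;
    the per-node FMN ensembling used by NCTL is omitted."""

    def __init__(self, layer_sizes, side_info_dim, seed=0):
        self.layers = []
        n_inputs = side_info_dim          # layer 0 mixes p_0 := f(z)
        for i, k in enumerate(layer_sizes):
            self.layers.append([
                GatedGeometricMixer(n_inputs, side_info_dim, seed=seed + 100 * i + j)
                for j in range(k)
            ])
            n_inputs = k

    @staticmethod
    def base_predictions(z, eps=1e-2):
        # A simple choice of f: squash each feature of z into (0, 1).
        return np.clip(1.0 / (1.0 + np.exp(-z)), eps, 1.0 - eps)

    def step(self, z, x):
        """One online pass: every neuron predicts P[x_t = 1 | z_t] and updates locally."""
        p = self.base_predictions(z)
        for layer in self.layers:
            p = np.array([neuron.update(p, z, x) for neuron in layer])
        return float(p[-1])   # the final layer typically has a single output neuron
```

For example, `GLNStyleNetwork([8, 4, 1], side_info_dim=16).step(z, x)` performs one prediction-and-update step on a single $(z_t, x_t)$ pair; every neuron trains against the same target with its own local log loss, which is exactly the node-level coordination required by the desiderata of Section 3.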