# Gated Linear Networks

Joel Veness, Tor Lattimore, David Budden, Avishkar Bhoopchand, Christopher Mattern, Agnieszka Grabska-Barwinska, Eren Sezener, Jianan Wang, Peter Toth, Simon Schmitt and Marcus Hutter
DeepMind
aixi@google.com, lattimore@google.com, budden@google.com, avishkar@google.com

**Abstract.** This paper presents a new family of backpropagation-free neural architectures, Gated Linear Networks (GLNs). What distinguishes GLNs from contemporary neural networks is the distributed and local nature of their credit assignment mechanism; each neuron directly predicts the target, forgoing the ability to learn feature representations in favor of rapid online learning. Individual neurons are able to model nonlinear functions via the use of data-dependent gating in conjunction with online convex optimization. We show that this architecture gives rise to universal learning capabilities in the limit, with effective model capacity increasing as a function of network size in a manner comparable with deep ReLU networks. Furthermore, we demonstrate that the GLN learning mechanism possesses extraordinary resilience to catastrophic forgetting, performing almost on par with an MLP with dropout and Elastic Weight Consolidation on standard benchmarks.

## Introduction

Backpropagation has long been the de facto credit assignment technique underlying the successful training of popular neural network architectures such as convolutional neural networks and multilayer perceptrons (MLPs). It is well known that backpropagation enables these networks to learn highly relevant task-specific features. However, this method is not without its limitations. Contemporary neural networks trained via backpropagation require many epochs of training over massive datasets, limiting their effectiveness for data-efficient online learning. This has motivated many recent studies of alternative credit assignment mechanisms such as layer-wise training and/or single-pass learning (Zhang, Liang, and Wainwright 2017; Belilovsky, Eickenberg, and Oyallon 2019, 2020; Jaderberg et al. 2017; Löwe, O'Connor, and Veeling 2019; Nøkland and Eidnes 2019). Interpretability limitations can also prevent their application in domains where a human-understandable solution is a mandatory requirement. Their effectiveness is further limited in the continual learning setting by their tendency to catastrophically forget previously learnt tasks. Although various meta-learning (Ortega et al. 2019) algorithms such as Elastic Weight Consolidation (Kirkpatrick et al. 2017, EWC) have been effective in compensating for these limitations, it is interesting to explore whether alternative methods of credit assignment can give rise to complementary neural models with different strengths and weaknesses.

This paper introduces one such family of neural models, Gated Linear Networks (GLNs), and studies their contrasting empirical properties. The distinguishing feature of a GLN is its distributed and local credit assignment mechanism. This technique is a generalization of the PAQ family (Mahoney 2000, 2005, 2013) of online neural network models, which are well known in the data compression community for their excellent sample efficiency (Mahoney 2013).
By interpreting these systems within an online convex programming (Zinkevich 2003) framework as a sequence of data-dependent linear networks coupled with a choice of gating function, we are able to provide a new algorithm and gating mechanism that opens up their usage to the wider machine learning community.

GLNs have a number of desirable properties. Their local credit assignment mechanism is derived by associating a separate convex loss function with each neuron, which greatly simplifies parameter initialization and optimization, and provides significant sample efficiency benefits when learning online. Importantly, we show that these benefits do not come at the expense of capacity in practice, which adds further weight to previously obtained asymptotic universality results (Veness et al. 2017). GLNs possess excellent online learning capabilities, which we demonstrate by showing performance competitive with batch-trained MLPs on a variety of standard classification, regression and density modeling tasks, using only a single online pass through the data. In terms of interpretability, we show how the data-dependent linearity of the predictions can be exploited to trivialise the process of constructing meaningful saliency maps, which can give practitioners reassurance that the model is predicting well for the right reasons. Perhaps most interestingly, we demonstrate that our credit assignment mechanism is extraordinarily resilient to catastrophic forgetting, achieving performance competitive with EWC on a standard continual learning benchmark with no knowledge of the task boundaries.

Figure 1: A graphical depiction of a Gated Linear Network. Each neuron receives inputs from the previous layer as well as the broadcasted side information z. The side information is passed through all the gating functions, whose outputs $s_{ij} = c_{ij}(z)$ determine the active (blue) weight vectors.

In this section we review some necessary background on geometric mixing, a parametrised way of combining probabilistic forecasts, and show how to adapt its parameters using online convex programming. Later we will combine this method with a gating mechanism to define a single neuron within a GLN.

**Geometric Mixing.** Geometric mixing is a simple and well-studied ensemble technique for combining probabilistic forecasts. It has seen extensive application in statistical data compression (Mattern 2012, 2013). Given input probabilities $p_1, p_2, \dots, p_d$ predicting the occurrence of a single binary event, geometric mixing predicts $\sigma(w \cdot \sigma^{-1}(p))$, where $\sigma(x) := 1/(1 + e^{-x})$ denotes the sigmoid function, $\sigma^{-1}$ is the logit function, $p := (p_1, \dots, p_d)$ and $w \in \mathbb{R}^d$ is the weight vector which controls the relative importance of the input forecasts. One can easily show the identity
$$\sigma\big(w \cdot \sigma^{-1}(p)\big) = \frac{\prod_{i=1}^d p_i^{w_i}}{\prod_{i=1}^d p_i^{w_i} + \prod_{i=1}^d (1 - p_i)^{w_i}},$$
which makes it clear that geometric mixing implements a type of product-of-experts (Hinton 2002) operation.
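To make this identity concrete, here is a minimal NumPy sketch (ours, not the authors' code) checking that the sigmoid-logit form and the product-of-experts form of geometric mixing agree; the function names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

def geometric_mix(w, p):
    """Geometric mixing: sigma(w . logit(p)) for input probabilities p."""
    return sigmoid(np.dot(w, logit(p)))

def geometric_mix_poe(w, p):
    """Equivalent product-of-experts form of the same mixture."""
    num = np.prod(p ** w)
    return num / (num + np.prod((1.0 - p) ** w))

p = np.array([0.9, 0.7, 0.6])
w = np.full(3, 1.0 / 3.0)          # w_i = 1/d recovers the geometric mean of the inputs
assert np.isclose(geometric_mix(w, p), geometric_mix_poe(w, p))
```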
This leads to the following interesting properties: setting $w_i = 1/d$ is equivalent to taking the geometric mean of the $d$ input probabilities; if the $j$th component $w_j$ of $w$ is 0 then the contribution of $p_j$ is ignored, and if $w = 0$ then the geometric mixture predicts 1/2; and finally, due to the product formulation, every forecaster has the "right of veto", in the sense that a single $p_i$ close to 0 coupled with a $w_i > 0$ drives the geometric mixture prediction close to zero.

**Online Convex Programming Formulation.** We now describe how to adapt the geometric mixing parameters using online convex programming (Zinkevich 2003; Hazan 2016). Let $\mathcal{B} := \{0, 1\}$. As we are interested in probabilistic prediction, we assume a standard online learning framework for the logarithmic loss, where at each round $t \in \mathbb{N}$ a predictor outputs a binary distribution $q_t : \mathcal{B} \to [0, 1]$, with the environment responding with an observation $x_t \in \mathcal{B}$, causing the predictor to suffer a loss $\ell_t(q_t, x_t) = -\log q_t(x_t)$ before moving on to round $t + 1$. In the case of geometric mixing, we first define our parameter space to be a non-empty convex set $\mathcal{W} \subset \mathbb{R}^d$. As the prediction depends on both the $d$-dimensional input predictions $p_t$ and the parameter vector $w \in \mathcal{W}$, we abbreviate the loss at time $t$, given target $x_t$, using parameters $w$ by
$$\ell^{\text{GEO}}_t(w) := -\log\big(\text{GEO}_w(x_t ; p_t)\big), \tag{1}$$
with $\text{GEO}_w(1; p_t) := \sigma(w \cdot \sigma^{-1}(p_t))$ and $\text{GEO}_w(0; p_t) := 1 - \text{GEO}_w(1; p_t)$. One can show that $\ell^{\text{GEO}}_t(w)$ is a convex function of $w$ (Mattern 2013) and that the gradient of the loss with respect to $w$ is given by
$$\nabla \ell^{\text{GEO}}_t(w) = \big(\text{GEO}_w(1; p_t) - x_t\big)\, \sigma^{-1}(p_t). \tag{2}$$
Furthermore, the 2-norm of the gradient of the loss can be bounded in terms of $d$ and $\varepsilon$ provided that $p_t \in [\varepsilon, 1 - \varepsilon]^d$ for some $\varepsilon \in (0, 1/2)$ for every time $t$. These properties of the sequence of loss functions make it possible to apply one of the many different online convex programming techniques to adapt $w$ at the end of each round. In this paper we restrict our attention to Online Gradient Descent (Zinkevich 2003), with $\mathcal{W}$ equal to some choice of hypercube, for reasons of computational efficiency. This gives an $O(\sqrt{T})$ regret bound with respect to the best $w \in \mathcal{W}$ chosen in hindsight, provided an appropriate schedule of decaying learning rates is used.

## Gated Geometric Mixing

We define the GLN neuron as a gated geometric mixer, which we obtain by adding a contextual gating procedure to geometric mixing. Here, contextual gating has the intuitive meaning of mapping particular input examples to particular sets of weights. The key change compared with normal geometric mixing is that our neuron now also takes in an additional type of input, side information, which is used by the contextual gating procedure to determine an active subset of the neuron's weights for a given example. In typical applications the side information will simply be the input features associated with a given example. More formally, associated with each neuron is a context function $c : \mathcal{Z} \to \mathcal{C}$, where $\mathcal{Z}$ is the set of possible side information and $\mathcal{C} = \{0, \dots, k - 1\}$ for some $k \in \mathbb{N}$ is the context space. Given a convex set $\mathcal{W} \subseteq \mathbb{R}^d$, each neuron is parametrized by a matrix $W = [w_0 \dots w_{k-1}]^\top$ with each row vector $w_i \in \mathcal{W}$ for $0 \le i < k$. The context function $c$ is responsible for mapping a given piece of side information $z_t \in \mathcal{Z}$ to a particular row $w_{c(z_t)}$ of $W$, which we then use with standard geometric mixing.
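The gated mixer just described can be sketched in a few lines of NumPy. The code below is our own illustrative sketch, not the authors' implementation: it selects the active weight row via a supplied context function and applies the local Online Gradient Descent update from Equation 2; the clipping of weights into a hypercube anticipates the projection step discussed under "Local learning in GLNs" below, and the class name, default weight bound and geometric-average initialization are our own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

class GatedGeometricMixer:
    """One GLN neuron: a geometric mixer whose weight row is selected by a context function."""

    def __init__(self, context_fn, num_contexts, input_dim, weight_bound=10.0):
        self.context_fn = context_fn               # maps side info z -> {0, ..., num_contexts - 1}
        self.W = np.full((num_contexts, input_dim), 1.0 / input_dim)  # geometric-average init
        self.b = weight_bound                      # hypercube half-width for the projection step

    def predict(self, p, z):
        w = self.W[self.context_fn(z)]             # active weight row for this side information
        return sigmoid(np.dot(w, logit(p)))

    def update(self, p, z, x, lr):
        c = self.context_fn(z)
        grad = (self.predict(p, z) - x) * logit(p) # gradient from Equation 2
        self.W[c] = np.clip(self.W[c] - lr * grad, -self.b, self.b)  # OGD step + projection
```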
In other words, a Gated Geometric Mixer can be defined in terms of geometric mixing as
$$\text{GEO}^c_W(x_t ; p_t, z_t) := \text{GEO}_{w_{c(z_t)}}(x_t ; p_t), \tag{3}$$
with the associated loss function $-\log\big(\text{GEO}^c_W(x_t ; p_t, z_t)\big)$ inheriting all the properties needed to apply online convex programming directly from Equation 1. The key intuition behind gating is that it allows each neuron to specialize its weighting of input predictions based on some particular property of the side information.

**Universal context functions.** We now introduce a halfspace gating mechanism that is tailored towards machine learning applications whose input features lie in $\mathbb{R}^d$. Although not the focus of this work, it is worth noting that this choice gives rise to universal approximation capabilities for sufficiently large GLNs (Veness et al. 2017). Once we are in a position to describe the learning dynamics of multiple interacting neurons, the rationale for this class of context functions will become clearer. Exploring alternative gating mechanisms is an exciting area for future work.

**Halfspace gating.** Given a normal $v \in \mathbb{R}^d$ and offset $b \in \mathbb{R}$, consider the associated affine hyperplane $\{x \in \mathbb{R}^d : x \cdot v = b\}$. This divides $\mathbb{R}^d$ in two, giving rise to two halfspaces, one of which we denote $H_{v,b} = \{x \in \mathbb{R}^d : x \cdot v \ge b\}$. The associated halfspace context function is then given by $\mathbb{1}_{H_{v,b}}(z)$, where $\mathbb{1}_S(s) := 1$ if $s \in S$ and 0 otherwise.

**Context composition.** Richer notions of context can be created by composition. In particular, any finite set of $m$ context functions $\{c_i : \mathcal{Z} \to \mathcal{C}_i\}_{i=1}^m$ with associated context spaces $\mathcal{C}_1, \dots, \mathcal{C}_m$ can be composed into a single higher-order context function $c : \mathcal{Z} \to \mathcal{C}$, where $\mathcal{C} = \mathcal{C}_1 \times \dots \times \mathcal{C}_m$ is identified with $\{0, \dots, |\mathcal{C}| - 1\}$, by defining $c(z) = (c_1(z), \dots, c_m(z))$. For example, we could combine $m = 4$ different halfspace context functions into a single context function with a context space containing $|\mathcal{C}| = 16$ elements. From here onwards, whenever this technique is used, we will refer to the choice of $m$ as the context dimension.

## Gated Linear Networks

We now introduce Gated Linear Networks, which are feedforward networks composed of many layers of gated geometric mixing neurons, as shown in Figure 1. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron. In a supervised learning setting, a GLN is trained on (side information, base predictions, label) triplets $(z_t, p_t, x_t)_{t=1,2,3,\dots}$ derived from input-label pairs $(z_t, x_t)$. There are two types of input to neurons in the network: the first is the side information $z_t$, which can be thought of as the input features; the second is the input to the neuron, which will be the predictions output by the previous layer, or in the case of layer 0, some (optionally) provided base predictions $p_t$ that will typically be a function of $z_t$. Each neuron also takes in a constant bias prediction, which helps empirically and is essential for universality guarantees (Veness et al. 2017).

**GLN architecture.** A GLN is a network of gated geometric mixers organized in $L + 1$ layers indexed by $i \in \{0, \dots, L\}$, with $K_i$ models in each layer. Neurons are indexed by their position in the network when laid out on a grid; for example, neuron $(i, k)$ will refer to the $k$th neuron of the $i$th layer, and $p_{ik}$ will refer to the output of neuron $(i, k)$. The output of layer $i$ will be denoted by $p_i$.
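Before describing the layers in detail, the following minimal sketch (ours, not the authors' code) illustrates how composed halfspace context functions produce the integer context index each neuron uses to select its active weight row. The sampling of the normal vector and offset follows the "Random halfspace sampling" paragraph later in the paper; all function names are our own.

```python
import numpy as np

def sample_halfspace(dim, rng):
    """Sample one halfspace context function: v uniform on the unit sphere, b standard normal."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    b = rng.standard_normal()
    return lambda z: int(np.dot(z, v) >= b)

def composed_context(halfspaces):
    """Compose m binary halfspace contexts into one index in {0, ..., 2^m - 1}."""
    def c(z):
        bits = [h(z) for h in halfspaces]
        return int(np.dot(bits, 2 ** np.arange(len(bits))))
    return c

rng = np.random.default_rng(0)
ctx = composed_context([sample_halfspace(dim=784, rng=rng) for _ in range(4)])
# ctx(z) selects one of 2^4 = 16 weight rows for a neuron with context dimension 4.
```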
The zeroth layer of the network is called the base layer, whose output $p_0$ will typically be instantiated by scaling or squashing each component of the current side information $z$ to lie within $[\varepsilon, 1 - \varepsilon]$. The nonzero layers are composed of gated geometric mixing neurons. Associated with each of these is a fixed context function $c_{ik} : \mathcal{Z} \to \mathcal{C}$ that determines the behavior of the gating at neuron $(i, k)$. In addition to the context function, for each context $c \in \mathcal{C}$ and each neuron $(i, k)$ there is an associated weight vector $w_{ikc} \in \mathbb{R}^{K_{i-1}}$ which is used to geometrically mix the inputs whenever active. The bias outputs $p_{i0}$ for $0 \le i \le L$ can be set to any constant $\beta \in [\varepsilon, 1 - \varepsilon] \setminus \{0.5\}$. Given a $z \in \mathcal{Z}$, a weight vector for each neuron is determined by evaluating its associated context function. For layers $i \ge 1$, the $k$th node in the $i$th layer receives as input the vector $p_{i-1}$ of dimension $K_{i-1}$ of predictions of the preceding layer.

**GLNs are data-dependent linear networks.** Without loss of generality, here we assume that the network is estimating the probability of the target being positive. The output of a single neuron is the geometric mixture of the inputs with respect to a set of weights that depend on its context, namely $p_{ik}(z) = \sigma\big(w_{ik c_{ik}(z)} \cdot \sigma^{-1}(p_{i-1}(z))\big)$. The output of layer $i$ can be written in matrix form as
$$p_i(z) = \sigma\big(W_i(z)\, \sigma^{-1}(p_{i-1}(z))\big), \tag{4}$$
where $W_i(z) \in \mathbb{R}^{K_i \times K_{i-1}}$ is the matrix with $k$th row equal to $w_{ik}(z) = w_{ik c_{ik}(z)}$. Iterating Equation 4 gives
$$p_i(z) = \sigma\big(W_i(z) W_{i-1}(z) \cdots W_1(z)\, \sigma^{-1}(p_0(z))\big), \tag{5}$$
which shows that the network behaves like a linear network (Baldi and Hornik 1989; Saxe, McClelland, and Ganguli 2013), but with weight matrices that are data-dependent. Without the data-dependent gating, the product of matrices would collapse to a single linear mapping and provide no additional modeling power over a single neuron (Minsky and Papert 1969).

Algorithm 1: GLN(Θ, z, p, x, η, update). Perform a forward pass and optionally update weights. Each layer performs clipped geometric mixing over the outputs of the previous layer, where the mixing weights are side-information-dependent via the gating function (Line 10). Lines 12-13 apply (optionally) the weight update, which is obtained from Equation 2.

```
 1: Input: GLN weights Θ = {w_ijc}
 2: Input: side info z, base predictions p ∈ [ε, 1 − ε]^(K_0 − 1)
 3: Input: binary target x, learning rate η ∈ (0, 1)
 4: Input: boolean update (controls whether we learn or not)
 5: Output: estimate of P[x = 1 | z, p]
 6: p_0 ← (β, p_1, p_2, ..., p_{K_0 − 1})
 7: for i ∈ {1, ..., L} do
 8:   p_i0 ← β
 9:   for j ∈ {1, ..., K_i} do
10:     p_ij ← CLIP_ε^{1−ε}[ σ( w_{ij c_ij(z)} · σ^{−1}(p_{i−1}) ) ]
11:     if update then
12:       ∆_ij ← −η (p_ij − x) σ^{−1}(p_{i−1})
13:       w_{ij c_ij(z)} ← CLIP_{−b}^{b}[ w_{ij c_ij(z)} + ∆_ij ]
14:     end if
15:   end for
16: end for
17: return p_{L1}
```

**Local learning in GLNs.** We now describe how the weights are learnt in a Gated Linear Network using Online Gradient Descent (OGD) (Zinkevich 2003) locally at each neuron. The key observation is that, as each neuron $(i, k)$ in layers $i > 0$ is itself a gated geometric mixture, all of these neurons can be thought of as individually predicting the target. Thus, given side information $z$ and from Equations 1 and 3, each neuron $(i, k)$ suffers a loss, convex in its active weights $u := w_{ik c_{ik}(z)}$, of $\ell_t(u) := -\log\big(\text{GEO}_u(x_t ; p_{i-1})\big)$. Algorithmically, a single step of OGD consists of two parts: a gradient step, and then a projection back into some convex weight space $\mathcal{W}$. The gradient step can be trivially obtained from Equation 2. It is well known that the projection step can be implemented via clipping if the convex set $\mathcal{W}$ is a scaled hypercube.
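As a concrete illustration of Algorithm 1 and this local update (a gradient step from Equation 2 followed by projection via clipping), here is a minimal NumPy sketch. It is our own illustrative code, not the authors' implementation: the class and argument names are ours, and the default constants (ε = 0.01, β = 0.2, b = 10) are assumptions; the paper only requires β ∈ [ε, 1 − ε] \ {0.5} and b > 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log(1.0 - p)

class GLN:
    """Minimal sketch of Algorithm 1: layers of gated geometric mixers with local OGD updates."""

    def __init__(self, base_dim, layer_sizes, context_fns, num_contexts,
                 eps=0.01, beta=0.2, weight_bound=10.0):
        # context_fns[i][k](z) -> int in {0, ..., num_contexts - 1}: gate for neuron k of layer i + 1.
        # beta and weight_bound are assumed values; see the lead-in above.
        self.eps, self.beta, self.b = eps, beta, weight_bound
        self.context_fns = context_fns
        self.W = []
        fan_in = base_dim + 1                                   # +1 for the constant bias prediction
        for k_i in layer_sizes:
            self.W.append(np.full((k_i, num_contexts, fan_in), 1.0 / fan_in))  # geometric-average-style init
            fan_in = k_i + 1

    def forward(self, z, p_base, x=None, lr=None):
        """Single forward pass; if a target x and learning rate lr are given, apply the local updates."""
        p = np.concatenate(([self.beta], np.clip(p_base, self.eps, 1.0 - self.eps)))
        for i, W_i in enumerate(self.W):
            preact = logit(p)                                   # sigma^{-1}(p_{i-1}), bias included
            ctx = [self.context_fns[i][k](z) for k in range(W_i.shape[0])]
            active = np.stack([W_i[k, c] for k, c in enumerate(ctx)])
            out = np.clip(sigmoid(active @ preact), self.eps, 1.0 - self.eps)
            if x is not None:                                   # local OGD step per neuron (Equation 2)
                grad = np.outer(out - x, preact)
                for k, c in enumerate(ctx):
                    W_i[k, c] = np.clip(W_i[k, c] - lr * grad[k], -self.b, self.b)
            p = np.concatenate(([self.beta], out))
        return p[1]                                             # output of the single final neuron
```

Note that each layer is handled with a single matrix-vector product over the active rows, mirroring the data-dependent matrices $W_i(z)$ of Equation 4.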
In our case this can be achieved if we force every component of each weight vector, for each neuron, to lie within $[-b, b]$ for some constant $b > 1$.

**Weight initialisation.** One benefit of a convex loss is that weight initialization is less important in determining overall model performance, and one can safely recommend deterministic initialization schemes that favor reproducibility of results. While other choices are possible, we found the initialization $w_{ikc} = 1/K_{i-1}$ for all $i, k, c$ to be a good choice empirically, which causes geometric mixing to initially compute a geometric average of its inputs.

**Algorithm.** A single prediction step, as well as a single step of learning using Online Gradient Descent, can be implemented via a single forward pass of the network, as shown in Algorithm 1. Here we make use of a subroutine $\text{CLIP}^{1-\varepsilon}_{\varepsilon}[x] := \min\{\max(x, \varepsilon), 1 - \varepsilon\}$. Generating a prediction requires computing the active contexts from the given side information for each neuron, and then performing $L$ matrix-vector products. Under the assumption that multiplying an $m \times n$ matrix by an $n \times 1$ vector takes $O(mn)$ work, the total time complexity to generate a single prediction is $O\big(\sum_{i=1}^L K_i K_{i-1}\big)$ for the matrix-vector products, which in typical cases will dominate the overall runtime. Note too that updating the weights does not affect this complexity.

**Random halfspace sampling.** Here we describe how we generate a diverse set of halfspace context functions in practice. As we are interested in higher-dimensional applications, it is necessary to sample hyperplanes in a manner that addresses the curse of dimensionality. Consider a halfspace context function $c(z; v, b) = 1$ if $z \cdot v \ge b$, and 0 otherwise. To sample $v$, we first generate an i.i.d. random vector $x = (x_1, \dots, x_d)$ of dimension $d$, with each component of $x$ distributed according to the unit normal $N(0, 1)$, and then divide by its 2-norm, giving us a vector $v = x / \|x\|_2$. This scheme uniformly samples points from the surface of a unit sphere. The scalar $b$ is sampled directly from a standard normal distribution. The motivation for this approach is two-fold. First, with large $d$ it is well known that the hyperplanes defining each halfspace are orthogonal with high probability; in other words, this choice should help to chop the data up in complementary ways given a limited number of gates. Second, suppose we have a set of $m$ different gating functions $c_i(z; v_i, b_i)$ for $i = 1$ to $m$, and consider the binary vector $g = (c_1(z; v_1, b_1), \dots, c_m(z; v_m, b_m))$. This signature vector $g$ of input $z$ has the property (Charikar 2002) that different $z$'s which are close in terms of cosine similarity will map to similar signatures. For a GLN, this gives rise to the desirable property that inputs close in cosine distance will map to similar products of data-dependent matrices, i.e. they will predict similarly.

**On convergence properties and rates for GLNs.** Various asymptotic convergence results for GLNs on i.i.d. data have been proven in (Veness et al. 2017). Theorem 1 roughly says that, on average within each context cell, the prediction converges to the true expected output/probability. Theorem 14 roughly says that a sufficiently large GLN can represent the true target probabilities arbitrarily well. Of course, while asymptotic convergence is a useful sanity check for any model, it tells us little about practical finite-time performance. Below we outline how (good) convergence rates may be obtained for fixed finite-sized GLNs to the best locally learnable approximation.
Precisely obtaining such bounds is outside the scope of this paper. For a single gated neuron, one can show (Veness et al. 2017, Section 3.2) that Online Gradient Descent (OGD) (Zinkevich 2003) with a learning rate proportional to $1/\sqrt{t}$ has total regret of $O(\sqrt{T})$ with respect to the best $w \in \mathcal{W}$ chosen in hindsight. The loss function $\ell_t$ is exp-concave, so Online Newton Step (Hazan, Agarwal, and Kale 2007) can improve the regret to $O(\log T)$, but is computationally more expensive. If the expected loss $\bar{\ell}(w) := \mathbb{E}[\ell_t(w)]$ were strongly convex, then Stochastic Gradient Descent (SGD) with i.i.d. sampling and a learning rate proportional to $1/t$ would also achieve a regret of $O(\log T)$. Unfortunately $\ell_t$ is flat in all directions orthogonal to $\operatorname{logit}(p_t)$, hence not strongly convex. But since $\ell_t$ is exp-concave (strongly convex in the gradient direction), $\bar{\ell}(w)$ is strongly convex on the linear subspace of $\mathcal{W} \subseteq \mathbb{R}^d$ given by $S := \operatorname{Span}(\operatorname{logit}(p_1), \dots, \operatorname{logit}(p_n))$. For sample size $n$ larger than $d$ it is plausible that $S = \mathbb{R}^d$. Even if not, all $\ell_t$ are exactly constant in directions orthogonal to $S$, hence the gradient lies in $S$. Since $\bar{\ell}$ is strongly convex on $S$, SGD will still achieve a regret of $O(\log T)$. The above implies that the time-averaged weights $\bar{w}_T$ have an instantaneous regret of $O(\log T / T)$, and even $O(1/T)$ can be achieved (Bubeck et al. 2015, Thm. 6.2). In general, OGD algorithms can be converted to achieve these rates even for the current weight $w_t$ (Cutkosky 2019), and, indeed, SGD achieves the former even unmodified (Shamir and Zhang 2013). By strong convexity on $S$, this implies that the (time-averaged) weights (or at least the outputs) converge at a rate of $O(t^{-1/2})$. Hence after time $O(1/\varepsilon^2)$ the output of the first layer has converged to within $O(\varepsilon)$, after which the input to the next layer becomes approximately i.i.d. A similar analysis should then be possible for the second layer, and so on. With an appropriately delayed learning rate decay, this should lead to an overall time bound of $O(L/\varepsilon^2)$ to achieve $\varepsilon$-approximation. For these reasons we use gradient descent with a learning rate proportional to $1/t$ in our later experiments.

Figure 2: Empirical capacity of GLNs (context dimension shown in parentheses) versus an MLP baseline. Top row: the MNIST dataset with shuffled labels. Bottom row: a dataset of uniform noise of the same size and shape.

## Empirical Capacity of GLNs

Contemporary neural networks have the desirable capability to approximate arbitrary continuous functions given almost any reasonable activation function (Hornik 1991, and others). GLNs share this property so long as the context capacity is sufficiently expressive. Moreover, Veness et al. (2017) prove that this capacity is effective in the sense that gradient descent will eventually find the best feasible approximation. This property is not shared by neural networks trained by backpropagation; it is possible to demonstrate the existence of such weights, but not to guarantee that gradient descent (or any other practical algorithm) will find them. Here we demonstrate the capacity of GLNs in practice by measuring their ability to fit randomly labelled data. We ran two sets of experiments: first, using the standard MNIST dataset with shuffled labels; and second, replacing the MNIST images with uniform noise of the same shape and dataset length.
These results are presented in Figure 2, compared to an MLP baseline in an equivalent one-vs-all configuration. For GLNs, we select a fixed layer width of 128 and vary both the context dimension and the number of layers. For the MLP, we select the number of neurons such that the total number of weights in the network is equivalent to a GLN with context dimension 4 (the largest considered). The GLN was trained with learning rate $10^{-4}$ and the MLP using the Adam optimizer (Kingma and Ba 2014) with learning rate $10^{-5}$, both selected by conducting a sweep over learning rates from $10^{-1}$ to $10^{-6}$. It is evident from Figure 2 that GLNs have comparable capacity to an equivalently sized MLP in practice, with their ability to memorize training data scaling in both the number of neurons and the context dimension of each neuron.

## Interpretability of GLNs

Figure 3: The effect of a single noisy XOR update (circled) on the decision boundaries of a halfspace-gated GLN. Sampled hyperplanes for each gate are shown in white.

A visual example of the change in decision boundaries resulting from a single halfspace-gated GLN update is shown in Figure 3 for the noisy XOR problem. The magnitude of the change is largest within the convex polytope containing the training point, and decays across the remaining convex polytopes according to how many halfspaces they share with the containing polytope. This makes intuitive sense: since the weight update is local, each row of $W_i(z)$ is independently pushed in a direction that better explains the data. Thus one should think of a halfspace-gated GLN as a kind of smoothing technique: input points which cause similar gating activation patterns must have similar outputs.

Aside from reasoning about the inductive bias, the above formulation provides a convenient mechanism for interpreting the learnt weights of a trained GLN. Contemporary neural networks have been criticized by some as black boxes that are notoriously difficult to interpret (Yosinski et al. 2015; Zhang and Zhu 2018). Despite their high discriminative power, this can prove problematic for learning and debugging efficiently at the semantic level, as well as for deployment in safety-critical real-world applications. This has led to the development of gradient-based methods for post-hoc network analysis (Simonyan, Vedaldi, and Zisserman 2013). Such methods are not necessary for GLNs; for a given input, the collapsed multilinear polynomial of degree $L$ is a weight vector of the same dimension as the inputs (since $W_L(z)$ has one row and $W_1(z)$ has $K_0$ columns) and provides a natural formulation for intuitive saliency maps without any further computational expense. An example of the obtained saliency maps is provided in Figure 4 for a one-versus-all GLN trained as an MNIST classifier. One can clearly see that the characteristic shape of each hand-written character is preserved.

Figure 4: Saliency maps for the constituent GLN binary classifiers of a one-vs-all MNIST classifier after a single training epoch.

## Resilience to Catastrophic Forgetting

Humans are able to acquire new skills throughout life, seemingly without compromising their ability to solve previously learnt tasks. Contemporary neural networks do not share this ability; if a network is trained on a task A and these weights are used to initialize training for a new task B, the ability to solve A rapidly degrades as training progresses on B. This phenomenon of catastrophic forgetting has been well studied for decades (Carpenter and Grossberg 1988; McCloskey and Cohen 1989; Robins 1995) but continues to limit the applicability of neural networks in continual or lifelong learning scenarios.
Similar to the problem of model interpretability, many algorithms have been developed that augment standard training by backpropagation to address catastrophic forgetting. These methods typically fall into two main categories. The first approach involves replaying previously seen tasks during training using one of many heuristics (Robins 1995; Caruana 1997; Rebuffi et al. 2017). The other common category involves explicitly maintaining additional sets of model parameters related to previously learnt tasks. Examples include freezing a subset of weights (Donahue et al. 2013; Razavian et al. 2014), dynamically adjusting learning rates (Girshick et al. 2014) or augmenting the loss with regularization terms with respect to past parameters (Kirkpatrick et al. 2017; Zenke, Poole, and Ganguli 2017; Schwarz et al. 2018). A limitation of these approaches (aside from additional algorithmic and computational complexity) is that they require task boundaries to be provided or accurately inferred.

Unlike contemporary neural networks, we demonstrate that the halfspace-gated GLN architecture and learning rule is naturally robust to catastrophic forgetting without any modifications or knowledge of task boundaries. We focus on the pixel-permuted MNIST continual learning benchmark of (Goodfellow et al. 2013; Kirkpatrick et al. 2017), which involves training on a sequence of different tasks where each task is obtained from a different random permutation of the input pixels. We compare the learning and retention characteristics of a GLN against an MLP baseline (with an equal number of neurons, using dropout as per the original paper) with and without Elastic Weight Consolidation (EWC) (Kirkpatrick et al. 2017), a highly effective method explicitly designed to prevent catastrophic forgetting by storing parameters of previously seen tasks.

Our results are presented in Figure 5. As we train our models on a growing number of sequential tasks (rows), the performance on all previously learnt tasks (columns) is evaluated. Note that the plotted task indices are not contiguous. It is evident that the GLN outperforms EWC in terms of both initial single-task learning (diagonal) and retention when both are trained for a single pass. Only when EWC is trained for multiple (ten) passes over the data does it exhibit superior performance to a vanilla GLN. In all tests, GLNs substantially outperform the standard MLP without EWC.

Figure 5: Retention results for permuted MNIST. Models are trained sequentially on 8 tasks (rows) and evaluated on all previously encountered tasks (columns). For example, the top-right plot indicates performance on Task 1 after being trained sequentially on Tasks 1 to 8 inclusive (not all tasks shown). Each model trains for only one epoch per task, with the exception of EWC 10 pass and MLP 10 pass (shrunken 10-fold on the x axis). Error bars denote 95% confidence levels over 10 trials.

To gain some intuition as to why GLNs are resilient to catastrophic interference, recall that the random halfspace gating causes inputs close in terms of cosine similarity to give rise to similar data-dependent weight matrices. Since each task-specific cluster of examples is far from the others in signature space, the amount of interference between tasks is significantly reduced, with the gating essentially acting as an implicit weight hashing mechanism. If this were not the case, one would expect catastrophic forgetting to occur.

## Online Benchmarking

Our final series of results validates the impressive sample efficiency of GLNs across a variety of settings.

**MNIST Classification.** First we explore the use of GLNs for online (single-pass) classification of the deskewed MNIST dataset (LeCun et al. 1998). We use 10 GLNs to construct a one-vs-all classifier, each consisting of 128 neurons per layer with context dimension 4. The learning rate at each step $t$ was set to $\min\{100/t, 0.01\}$.
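The one-vs-all construction and learning-rate schedule described above can be sketched as follows, reusing the GLN, sample_halfspace and composed_context helpers from the earlier sketches. This is our own illustrative code, not the authors' implementation; the number of hidden layers and the argmax decision rule are assumptions, while the layer width (128), context dimension (4) and the schedule min{100/t, 0.01} are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, side_dim = 10, 784
layer_sizes = [128, 128, 1]                      # width 128 per layer as in the text; depth is assumed

def make_context_fns():
    # One composed halfspace context (context dimension 4) per neuron, sampled independently.
    return [[composed_context([sample_halfspace(side_dim, rng) for _ in range(4)])
             for _ in range(k)] for k in layer_sizes]

models = [GLN(base_dim=side_dim, layer_sizes=layer_sizes,
              context_fns=make_context_fns(), num_contexts=16) for _ in range(num_classes)]

def learning_rate(t):
    return min(100.0 / t, 0.01)                  # schedule stated in the text, t >= 1

def online_step(t, z, label):
    # z: deskewed pixels scaled to [eps, 1 - eps]; used both as side info and as base predictions.
    preds = [m.forward(z, z, x=float(label == c), lr=learning_rate(t))
             for c, m in enumerate(models)]
    return int(np.argmax(preds))                 # one-vs-all decision
```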
We find that the GLN is capable of impressive online performance, achieving 98% accuracy in a single pass over the training data.

**UCI Dataset Classification.** We next compare GLNs to a variety of general-purpose batch learning techniques (SVMs, Gradient Boosting for Classification, MLPs) in small-data regimes on a selection of standard UCI datasets. A 1000-500 neuron GLN with context dimension 8 was trained with a single pass over 80% of instances and evaluated with frozen weights on the remainder. The comparison MLP used ReLU activations and the same number of weights, and was trained for 100 epochs using the Adam optimizer (Kingma and Ba 2014) with learning rate 0.001 and batch size 32. The SVM classifier used a radial basis function kernel $K(x, x') = \exp\{-\gamma \|x - x'\|^2\}$ with $\gamma = 1/d$, where $d$ is the input dimension. The GBC classifier was an ensemble of 100 trees of maximum depth 3 with a learning rate of 0.1. The mean and standard error over 100 random train/test splits are shown in the leftmost graph of Figure 6. Here we see that the single-pass GLN is competitive with the best of the batch learning results on each domain.

Figure 6: Online GLN classification accuracy on a selection of UCI datasets, compared to three contemporary batch methods (Support Vector Machine, Gradient Boosting for Classification, Multi-Layer Perceptron) trained for 100 epochs.

**MNIST Density Modelling.** Our final result is to use GLNs and image-specific gating to construct an online image density model for the binarized MNIST dataset (Larochelle and Murray 2011), a standard benchmark for image density modeling. By exploiting the chain rule $P(X_{1:d}) = \prod_{i=1}^{d} P(X_i \mid X_{<i})$