# Globally Gated Deep Linear Networks

Qianyi Li¹, Haim Sompolinsky²,³
¹Biophysics Graduate Program, Harvard University; ²Center for Brain Science, Harvard University; ³Edmond and Lily Safra Center for Brain Sciences, Hebrew University
qianyi_li@g.harvard.edu, hsompolinsky@mcb.harvard.edu, haim@fiz.huji.ac.il

Recently proposed Gated Linear Networks (GLNs) present a tractable nonlinear network architecture, and exhibit interesting capabilities such as learning with local error signals and reduced forgetting in sequential learning. In this work, we introduce a novel gating architecture, named Globally Gated Deep Linear Networks (GGDLNs), where gating units are shared among all processing units in each layer, thereby decoupling the architectures of the nonlinear but unlearned gating and the learned linear processing motifs. We derive exact equations for the generalization properties of Bayesian learning in these networks in the finite-width thermodynamic limit, defined by $N, P \to \infty$ with $P/N = O(1)$, where $N$ and $P$ are the hidden-layer width and the size of the training data set, respectively. We find that the statistics of the network predictor can be expressed in terms of kernels that undergo shape renormalization through a data-dependent order-parameter matrix, compared to the infinite-width Gaussian Process (GP) kernels. Our theory accurately captures the behavior of finite-width GGDLNs trained with gradient descent (GD) dynamics. We show that kernel shape renormalization gives rise to rich generalization properties w.r.t. network width, depth and L2 regularization amplitude. Interestingly, networks with a large number of gating units behave similarly to standard ReLU architectures. Although gating units in the model do not participate in supervised learning, we show the utility of unsupervised learning of the gating parameters. Additionally, our theory allows the evaluation of the network's ability to learn multiple tasks by incorporating task-relevant information into the gating units. In summary, our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width. The rich and diverse behavior of the GGDLNs suggests that they are helpful, analytically tractable models of learning single and multiple tasks in finite-width nonlinear deep networks.

## 1 Introduction

Despite the recent advances in machine learning, theoretical understanding of how machine learning algorithms work is very limited. Many current theoretical approaches study infinitely wide networks [1, 2, 3], where the input-output relation is equivalent to a Gaussian Process (GP) in function space with a covariance matrix defined by a GP kernel. However, this GP limit holds when the network width approaches infinity while the size of the training data remains finite, severely limiting its applicability to realistic conditions. Another line of work focuses on finite-width deep linear neural networks (DLNNs) [4, 5, 6]; while applicable in a wider regime, the generalization behavior of linear networks is very limited, and the bias contribution always remains constant with network parameters [4], which fails to capture the behavior of generalization performance in general nonlinear networks. A tractable nonlinear network architecture is therefore needed for theoretically probing the diverse generalization behavior of general nonlinear networks.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Recently proposed Gated Linear Networks (GLNs) present a tractable nonlinear network architecture [7, 8, 9], with capabilities such as learning with local error signals and mitigating catastrophic forgetting in sequential learning. Inspired by these recent advances in GLNs, we propose Globally Gated Deep Linear Networks (GGDLNs) as a simplified GLN structure that preserves the nonlinear property of general GLNs, the decoupling of fixed nonlinear gating from learned linear processing units, and the ability to separate the processing of multiple tasks using the gating units. Our GGDLN structure differs from previous GLNs in several ways. First, the gating units are shared across hidden-layer units and across layers, while in previous work each unit has its own set of gatings [10, 8, 9]. Second, we define a global learning objective instead of local errors [8, 9]. These simplifications allow us to obtain direct analytical expressions for the memory capacity and the exact generalization error of these networks for arbitrary training and testing data, providing quantitative insight into the effect of learning in nonlinear networks, as opposed to studies of generalization bounds [10], expressivity estimates [9, 11], and indirect quantities relevant to generalization such as the implicit bias of the network [12]. Furthermore, the kernel expression of the predictor statistics we propose in this work also allows us to give qualitative explanations of the generalization and how it is related to data structure and network representation for single and multiple tasks.

First, we introduce the architecture of our GGDLNs and analyze its memory capacity. We then derive our theory for the generalization properties of GGDLNs, and make qualitative connections between the generalization behavior and the relation between the renormalization matrix and task structure. Second, we apply our theory to GGDLNs performing multiple tasks, focusing on two scenarios where tasks are defined either by different input statistics or by different output labels on the same inputs. While the effect of kernel renormalization is different in the two cases, we find that for fixed gating functions, de-correlation between tasks always improves generalization.

## 2 Globally gated deep linear networks

Figure 1: Globally gated deep linear networks. (a) Structure of GGDLNs. Each neuron in the hidden layer has M dendrites, each with a different input-dependent gating $g_m(x)$ which is fixed during training; the M gatings are shared across neurons in the hidden layer. The m-th dendritic branch of the i-th neuron in layer l connects to neuron j in the previous layer with weight $W^m_{l,ij}$ (shown in orange). (b) Training error of networks with 1-3 hidden layers in the GP limit as a function of M, evaluated on a noisy ReLU teacher task; the training error goes to zero at the network capacity (black dashed lines). (c-e) Bias, variance and generalization error of the same network and task as (b). Bias and generalization error diverge, and the variance contribution to the generalization becomes nonzero, at the network capacity (black dashed line). See Appendix C.1 for detailed parameters.

In GGDLNs, the network input-output relation is defined as follows,
$$f(x) = \frac{1}{\sqrt{NM}}\sum_{m=1}^{M}\sum_{i=1}^{N} a_{m,i}\, x_{L,i}\, g_m(x), \qquad
x_{l,i} = \begin{cases} \dfrac{1}{\sqrt{NM}}\displaystyle\sum_{m=1}^{M}\sum_{j=1}^{N} W^m_{l,ij}\, g_m(x)\, x_{l-1,j} & l > 1 \\[6pt] \dfrac{1}{\sqrt{N_0}}\displaystyle\sum_{j=1}^{N_0} W_{1,ij}\, x_{0,j} & l = 1 \end{cases} \tag{1}$$
where $x_0 = x$ is the input, N is the hidden layer width, M is the number of gating units in each layer, and $N_0$ is the input dimension.
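To make the architecture concrete, here is a minimal NumPy sketch of the forward pass in Eq. 1 for random weights. The layer normalizations follow Eq. 1 as written above (whose prefactors were reconstructed from the kernel definitions below), and the Heaviside random gatings are only one illustrative choice of $g_m(x)$; none of the parameter values are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N0, N, M, L, sigma2 = 20, 100, 8, 2, 1.0

# Random Gaussian weights with prior variance sigma^2.
W1 = rng.normal(0, np.sqrt(sigma2), (N, N0))            # first layer, no gating
W = rng.normal(0, np.sqrt(sigma2), (L - 1, M, N, N))    # W^m_{l,ij} for layers l = 2..L
a = rng.normal(0, np.sqrt(sigma2), (M, N))              # readout weights a_{m,i}
V = rng.standard_normal((M, N0))                        # defines the fixed gatings

def gatings(x):
    """Global input-dependent gatings g_m(x); a Heaviside choice used only for illustration."""
    return (V @ x / np.sqrt(N0) > 0).astype(float)      # shape (M,)

def forward(x):
    """GGDLN input-output relation of Eq. 1."""
    g = gatings(x)                                      # shared across all layers and neurons
    h = W1 @ x / np.sqrt(N0)                            # x_{1,i}
    for l in range(L - 1):                              # layers l = 2..L
        # x_{l,i} = (1/sqrt(NM)) sum_{m,j} W^m_{l,ij} g_m(x) x_{l-1,j}
        h = np.einsum('m,mij,j->i', g, W[l], h) / np.sqrt(N * M)
    # f(x) = (1/sqrt(NM)) sum_{m,i} a_{m,i} x_{L,i} g_m(x)
    return np.einsum('m,mi,i->', g, a, h) / np.sqrt(N * M)

print(forward(rng.standard_normal(N0)))
```

Note that the gatings are computed once from the input and reused at every layer, which is what distinguishes the "global" gating from per-unit gatings in earlier GLNs.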
Each neuron in every layer has M dendrites, each with an input-dependent global gating $g_m(x)$ shared across all neurons. The m-th dendritic branch of neuron i in the L-th hidden layer connects to neurons in the previous layer with a dendrite-specific weight vector $W^m_{L,i}$ (or with readout weight vector $a_m$ for the output neuron), as shown in Fig. 1(a). Note that although the gatings are fixed during learning, changes in the weights affect how these gatings act on the hidden layer activations, and it is interesting to understand how the learned task interacts with these gating operations. Since adding gatings at the input layer is equivalent to expanding the input dimension and replacing $x_j$ by $x_j g_m(x)$, and learning does not affect how the gatings interact with the input, we do not add gatings at the input layer, for simplicity.

Memory Capacity: Memory capacity refers to the maximum number of random (or generic) input-output examples for which there exists a set of parameters such that the network achieves zero training error (here we consider the mean squared error, MSE). By definition, it is irrespective of the learning algorithm. The capacity bounds of deep nonlinear networks have been extensively studied in many recent works [10, 13, 14]. To calculate the capacity of GGDLNs, note that the input-output relation given by Eq. 1 can be alternatively expressed as $f(x) = \sum_{m_1,\dots,m_L,j} W^e_{m_1,\dots,m_L,j}\, x^e_{m_1,\dots,m_L,j}$, which is a linear combination of the effective inputs $x^e_{m_1,\dots,m_L,j} = g_{m_1}(x)\, g_{m_2}(x)\cdots g_{m_L}(x)\, x_j$ ($m_l = 1,\dots,M$; $j = 1,\dots,N_0$), with effective weights $W^e$ which are a complicated function of $a$ and $W$. Here $m_l$ is the index of the gatings in the l-th layer. As the gating units are shared across layers, the effective input $x^e$ has $N_0\binom{M+L-1}{L}$ independent dimensions; this combinatorial term counts the number of possible combinations of L gatings selected from the M gatings. Assuming N is large enough that the effective weights $W^e_{m_1,\dots,m_L,j}$ can take any desired value in the $N_0 M^L$-dimensional space, the problem of finding $W^e$ with zero training error is equivalent to a linear regression problem with input $x^e$ and the target outputs. Therefore, the capacity equals the number of independent input dimensions, $P_c = N_0\binom{M+L-1}{L}$.

The above capacity is verified in Fig. 1(b), where the training error becomes nonzero above the memory capacity. The generalization behavior also changes drastically at the network capacity (Fig. 1(c-e)), where the generalization error and its bias contribution diverge, and the variance contribution shrinks to 0 (see the detailed calculation in the next paragraph and Appendix A.3). This double-descent behavior of the generalization error is similar to that previously studied in linear and nonlinear networks. Furthermore, although the output of the network is a linear function of the effective input $x^e$, due to the multiplicative nature of the network weights and the gatings, learning in GGDLNs is highly nonlinear, the space of solutions for W and a is highly nontrivial, and the network exhibits properties unique to nonlinear networks, as we will show in the following sections.
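The capacity count can be checked numerically. The sketch below builds the effective inputs $x^e$ for random data and compares the rank of the resulting feature matrix to $N_0\binom{M+L-1}{L}$; it assumes "combinations" means multisets of gating indices (combinations with repetition), and the smooth sigmoid gatings used here are an arbitrary illustrative choice.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

rng = np.random.default_rng(0)
N0, M, L, P = 5, 4, 2, 400                  # input dim, # gatings, depth, # examples

X = rng.standard_normal((P, N0))            # random generic inputs
V = rng.standard_normal((M, N0))            # hypothetical gating projections
G = 1.0 / (1.0 + np.exp(-(X @ V.T)))        # smooth g_m(x), chosen only for this check

# Effective inputs x^e_{m1..mL, j} = g_{m1}(x) ... g_{mL}(x) x_j.
# Because the gatings are shared across layers, only multisets of (m1..mL) are distinct.
feats = []
for ms in combinations_with_replacement(range(M), L):
    gate_prod = np.prod(G[:, list(ms)], axis=1)     # product of L gatings per example
    feats.append(gate_prod[:, None] * X)            # multiply every input dimension
Xe = np.concatenate(feats, axis=1)                  # shape (P, N0 * C(M+L-1, L))

predicted_capacity = N0 * comb(M + L - 1, L)
print("predicted capacity:", predicted_capacity)
print("rank of effective inputs:", np.linalg.matrix_rank(Xe))
# For generic data the rank saturates at the predicted capacity, so zero training error
# is achievable by linear regression on x^e whenever P <= predicted capacity.
```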
Posterior distribution of network weights: We consider a Bayesian network setup, where the network weights are random variables whose statistics are determined by the training data and network parameters, rather than deterministic variables. This probabilistic approach enables us to study the properties of the entire solution space instead of a single solution, which may be heavily initialization dependent. We consider the posterior distribution of the network weights induced by learning with a Gaussian prior [15, 16, 17, 18], given by
$$P(\Theta) = Z^{-1}\exp\Big(-\frac{1}{2T}\sum_{\mu=1}^{P}\big(f(x^\mu,\Theta) - Y^\mu\big)^2 - \frac{1}{2\sigma^2}\,\Theta^\top\Theta\Big) \tag{2}$$
where Z is the partition function, $Z = \int d\Theta\, \exp(\cdot)$. The first term in the exponent is the MSE of the network outputs on the set of P training data points $x^\mu$ relative to their target outputs $Y^\mu$, and the second term is a Gaussian prior on the network parameters $\Theta = \{W, a\}$ with amplitude $\sigma^2$. In this work we focus on the $T \to 0$ limit, where the first term dominates. Below the network capacity, the distribution of $\Theta$ concentrates onto the solution space that yields zero training error, and the Gaussian prior then biases the solution space towards weights with smaller L2 norms.

The fundamental properties of the system can be derived from the partition function. As the distribution is quadratic in the readout weights $a_{m,i}$, it is straightforward to integrate them out, which yields
$$Z = \int dW \exp\Big[-\frac{1}{2\sigma^2}\mathrm{Tr}(W^\top W) - \frac{1}{2} Y^\top K_L(W)^{-1} Y - \frac{1}{2}\log\det\big(K_L(W)\big)\Big] \tag{3}$$
where W denotes all the remaining weights in the network, and $K_L(W)$ is the W-dependent $P\times P$ kernel on the training data, defined as $K^{\mu\nu}_L(W) = \big(\frac{\sigma^2}{M} g(x^\mu)^\top g(x^\nu)\big)\big(\frac{1}{N} x_L(x^\mu)^\top x_L(x^\nu)\big)$.

Generalization in infinitely wide GGDLNs: It is well known that in infinitely wide networks, where $N \to \infty$ while P remains finite (also referred to as the GP limit), $K_L(W)$ is self-averaging and does not depend on the specific realization of W. It can therefore be replaced by the GP kernel defined as $\langle K_L(W)\rangle_W$ where $W \sim \mathcal{N}(0, \sigma^2)$ [2]. For GGDLNs, the GP kernel for a pair of arbitrary inputs x and y is given by $K_{GP}(x, y) = \big(\frac{\sigma^2}{M} g(x)^\top g(y)\big)^L K_0(x, y)$, where $K_0(x, y) = \frac{\sigma^2}{N_0} x^\top y$. We denote the $P\times P$ kernel matrix on the training data as $K_{GP}$, with $K^{\mu\nu}_{GP} = K_{GP}(x^\mu, x^\nu)$, and the input kernel matrix on the training data as $K_0$, with $K^{\mu\nu}_0 = K_0(x^\mu, x^\nu)$. Generalization error is measured by the MSE, including the bias and the variance contributions,
$$\epsilon_g = \underbrace{\big(\langle f(x)\rangle - y(x)\big)^2}_{\text{bias}} + \underbrace{\langle \delta f(x)^2\rangle}_{\text{variance}},$$
which depends on the first- and second-order statistics of the predictor. In the GP limit, we have
$$\langle f(x)\rangle = k_{GP}(x)^\top K_{GP}^{-1} Y, \qquad \langle\delta f(x)^2\rangle = K_{GP}(x, x) - k_{GP}(x)^\top K_{GP}^{-1} k_{GP}(x) \tag{4}$$
where $k^\mu_{GP}(x) = K_{GP}(x, x^\mu)$. Note that the rank of $K_{GP}$ is the same as the capacity of the network, and the kernel matrix becomes singular as P approaches the capacity (the interpolation threshold), which results in nonzero training error, diverging bias and vanishing variance contribution to the generalization error (Fig. 1(b-e)). The singularity of the kernel at the interpolation threshold holds also for finite-width networks, and similar diverging bias and vanishing variance are seen in our finite-width theory below (Section 3) and are confirmed by simulations of networks trained with GD (see Appendix B.1, [19]).

## 3 Kernel shape renormalization theory in finite-width GGDLNs

We now address the finite-width thermodynamic limit, where $P, N \to \infty$ while $P/N \sim O(1)$ and $M, L \sim O(1)$. In this limit, calculating the statistics of the network predictor requires integration over W in Eq. 3. To do so, we apply the previous method of Back-propagating Kernel Renormalization [4] (see Appendix A) to GGDLNs. The partition function for a single-hidden-layer network is given by $Z = \exp(-H_1)$, where the Hamiltonian $H_1$ is given by
$$H_1 = \frac{1}{2} Y^\top \tilde K_1^{-1} Y + \frac{1}{2}\log\det(\tilde K_1) - \frac{N}{2}\log\det U_1 + \frac{N}{2\sigma^2}\mathrm{Tr}(U_1), \qquad \tilde K^{\mu\nu}_1 = \Big(\frac{1}{M}\, g(x^\mu)^\top U_1\, g(x^\nu)\Big) K^{\mu\nu}_0 \tag{5}$$
Comparing the matrix $\tilde K_1$ to $K_{GP}$, we note that the GP kernel is renormalized by an $M\times M$ matrix order parameter $U_1$. This order parameter satisfies the self-consistent equation
$$U_1 = \sigma^2\Big(I - \frac{1}{NM}\, U_1^{1/2}\big[\tilde K_1^{-1}\circ K_0\big]_g\, U_1^{1/2} + \frac{1}{NM}\, U_1^{1/2}\big[\tilde K_1^{-1} Y Y^\top \tilde K_1^{-1}\circ K_0\big]_g\, U_1^{1/2}\Big) \tag{6}$$
where $\circ$ denotes element-wise multiplication and $[A]_g$ denotes the $M\times M$ matrix with elements $[A]_g^{mn} = \sum_{\mu\nu} g_m(x^\mu)\, A^{\mu\nu}\, g_n(x^\nu)$.
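Both the GP-limit formulas and the finite-width equations above are straightforward to evaluate numerically. The sketch below builds the kernels, solves the self-consistent equation for $U_1$ (in the form reconstructed above, which should be treated as an approximation of the paper's Eq. 6) by a damped fixed-point iteration, and compares the GP and renormalized mean predictors. The toy data, random Heaviside gatings, damping factor and tolerances are all illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N0, M, N, P, sigma2 = 200, 20, 200, 100, 1.0      # single hidden layer (L = 1)

# Toy data: noisy linear teacher with O(1) labels; random Heaviside gatings.
X = rng.standard_normal((P, N0))
Y = X @ rng.standard_normal(N0) / np.sqrt(N0) + 0.1 * rng.standard_normal(P)
V = rng.standard_normal((M, N0))
def gatings(Xa):
    return (Xa @ V.T / np.sqrt(N0) > 0).astype(float)

G = gatings(X)                                    # g_m(x^mu), shape (P, M)
K0 = (sigma2 / N0) * (X @ X.T)                    # input kernel K_0 on training data

def renorm_K(U, Ga, Gb, K0ab):
    """K~_1 = (1/M) g(x)^T U g(y) * K_0(x,y); U = sigma^2 I recovers the GP kernel."""
    return ((Ga @ U @ Gb.T) / M) * K0ab

def bracket_g(A):
    """[A]_g^{mn} = sum_{mu,nu} g_m(x^mu) A^{mu nu} g_n(x^nu)."""
    return G.T @ A @ G

def sym_sqrt(U):
    w, Q = np.linalg.eigh(U)
    return Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T

# Damped fixed-point iteration of the self-consistent equation for U_1.
U = sigma2 * np.eye(M)                            # initialize at the GP value
for _ in range(1000):
    Kt = renorm_K(U, G, G, K0) + 1e-8 * np.eye(P)
    Kt_inv = np.linalg.inv(Kt)
    alpha = Kt_inv @ Y
    Us = sym_sqrt(U)
    rhs = sigma2 * (np.eye(M)
                    - Us @ bracket_g(Kt_inv * K0) @ Us / (N * M)
                    + Us @ bracket_g(np.outer(alpha, alpha) * K0) @ Us / (N * M))
    U_new = 0.9 * U + 0.1 * 0.5 * (rhs + rhs.T)   # damping and symmetrization
    if np.max(np.abs(U_new - U)) < 1e-8:
        U = U_new
        break
    U = U_new

# Mean predictor on a test point: GP limit vs. renormalized kernel.
x_test = rng.standard_normal((1, N0))
g_test = gatings(x_test)
k0_test = (sigma2 / N0) * (x_test @ X.T)
for label, Umat in [("GP", sigma2 * np.eye(M)), ("renormalized", U)]:
    Kt = renorm_K(Umat, G, G, K0) + 1e-8 * np.eye(P)
    k_vec = renorm_K(Umat, g_test, G, k0_test)    # k(x*), shape (1, P)
    print(label, (k_vec @ np.linalg.solve(Kt, Y)).item())
```

The predictor lines implement the same mean-predictor formula for both kernels, which is exactly how the renormalized theory differs from the GP limit: only the kernel changes, from $K_{GP}$ to $\tilde K_1$.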
In the linear case (which corresponds to M = 1), the GP kernel is renormalized by a scalar factor. In the M > 1 case, the effect of renormalization is more drastic, as it changes not only the amplitude but also the shape of the kernel. The renormalization matrix has an interesting physical interpretation that relates it to the readout weights a of the GGDLN,
$$U_1^{mn} = \frac{1}{N}\sum_{i=1}^{N}\langle a_{m,i}\, a_{n,i}\rangle \tag{7}$$
The calculation can be extended to multiple layers, with a new order parameter introduced for each layer (see Appendix A). The predictor statistics for an input x can be expressed in terms of the renormalized kernels; for a network with L = 1,
$$\langle f(x)\rangle = \tilde k_1(x)^\top \tilde K_1^{-1} Y, \qquad \langle\delta f(x)^2\rangle = \tilde K_1(x, x) - \tilde k_1(x)^\top \tilde K_1^{-1}\,\tilde k_1(x) \tag{8}$$
where $\tilde K_1(x, y) = \big(\frac{1}{M} g(x)^\top U_1\, g(y)\big) K_0(x, y)$ denotes the renormalized kernel function for two arbitrary inputs x and y, $\tilde K_1$ denotes the $P\times P$ renormalized kernel matrix on the training data, and $\tilde k_1(x)$ is a P-dimensional vector, $\tilde k^\mu_1(x) = \tilde K_1(x, x^\mu)$. The kernel renormalization in GGDLNs changes the shape of the kernel through the data-dependent $U_1$, reflecting the nonlinear property of the network and resulting in more complex behavior of the predictor statistics relative to linear networks, as shown in Section 4.

Our theory describes the properties of the posterior distribution of the network weights induced at equilibrium by Langevin dynamics with the MSE cost function and the Gaussian prior [4, 20, 21, 22]. Simulations of this dynamics agree remarkably well with the theory (see Appendix B.2). Although our theoretical results do not directly describe the solutions obtained by running gradient descent (GD) dynamics on the training error, it is interesting to ask to what extent the behaviors predicted by our theory are also exhibited by GD dynamics of the same network architectures, as GD-based learning is more widely used. We will compare our theoretical results with numerics of GD dynamics throughout the paper. We consider the case where the network is initialized with Gaussian i.i.d. weights with variance $\sigma^2$, and the mean and variance of the predictors are evaluated across multiple initializations (see Appendix C.5 for details). As we will show, our theory makes accurate qualitative predictions for GD dynamics in all examples in this paper, in the sense that while the exact values may not match, the general trend of how generalization or representation varies with different parameters in different regimes is very similar.

## 4 Generalization

For linear networks the generalization error depends on N, $\sigma^2$ and L through the variance only, while the mean predictor always assumes the same value as in the GP limit [4]. This is because the scalar kernel renormalization of $\tilde k_1(x)$ is cancelled out in the mean predictor by the renormalization of the inverse kernel $\tilde K_1^{-1}$. In contrast, for GGDLNs the mean predictor, and hence the error bias, also changes with these network parameters, due to the matrix nature of the kernel renormalization (Eq. 8). Below we investigate in detail how matrix renormalization of the kernel affects the generalization behavior (especially the bias term) of the network.

### 4.1 Networks with a single hidden layer

Figure 2: Dependence of generalization error on network width for a ReLU teacher task.
(a) Top: The ReLU teacher network. The input x is divided into 5 subsets of input dimensions; the input-layer weights either have the same order of magnitude across different input dimensions (left, (b)), or have larger amplitudes for one subset of input dimensions, the preferred inputs (right, bold connections to a subset of input neurons, (c-e)). Bottom: The student network is a GGDLN with one hidden layer and gatings with localized receptive fields: each gating is connected to only a subset of input dimensions. (b) Bias, variance and generalization error decrease as a function of N for a regular ReLU teacher; theory agrees qualitatively well with GD dynamics. (c) Bias and generalization error increase as a function of N for the ReLU teacher with preferred inputs. (d) The renormalization matrix $U_1$ for different network widths for the teacher with preferred inputs. The first 10×10 block corresponds to the gatings with the same receptive field as the teacher's preferred inputs, and is amplified for small N. (e) The ratio of the average amplitude of the first 10×10 block relative to the average amplitude of the other four 10×10 diagonal blocks decreases as a function of N.

Feature selection in finite-width networks: Unlike in DLNNs, the bias term in GGDLNs depends on N, exhibiting different dependence in different parameter regimes. This dependence also varies with the choice of the gating functions. In Fig. 2 we consider a student-teacher learning task, commonly used for evaluating and understanding neural network performance [23, 24, 25, 26]. We present results of learning a ReLU teacher task in GGDLNs with gatings that have localized receptive fields (i.e., the activation of each gating unit depends on only a subset of the input dimensions, and the receptive fields of all gating units tile the $N_0$ input dimensions, as shown in Fig. 2(a) bottom), where the student GGDLN is required to learn the input-output relation of a given ReLU teacher. For a ReLU teacher with a single fully connected hidden layer (Fig. 2(a) top left), gatings with different receptive fields are of equal importance, hence the renormalization does not play a beneficial functional role, and the infinitely wide network performs better than finite N. As shown in Fig. 2(b), bias, variance and generalization error all decrease with N. For a local ReLU teacher with larger input weights for one subset of input components (the preferred inputs, Fig. 2(a) top right), renormalization improves task performance through the selective increase of the elements of $U_1$ that correspond to gating units whose receptive fields overlap the teacher's preferred inputs (Fig. 2(d & e)). Hence, narrower networks (with a stronger renormalization) generalize better, and both the bias and the generalization error increase with N (Fig. 2(c)). More generally, the input can represent a set of fixed features of the data, and the local teacher generates labels depending on a subset of the features. Networks with finite width are therefore able to select the relevant set of features by adjusting the amplitudes in the renormalization matrix $U_1$, assigning the gating units different importance for the task, whereas in the GP limit the network always assigns equal importance to all gating units. To summarize, our theory not only captures the more complex behavior of generalization (especially the bias) as a function of network width, but also provides a qualitative explanation of how generalization is affected by the structure of the renormalization matrix in different tasks.
Effect of regularization strength on generalization performance: Similar to the dependence on N, generalization also exhibits different behavior as a function of the regularization parameter σ in different parameter regimes, with contributions from both the bias and the variance. The dependence of the error bias on σ also arises from the matrix nature of the renormalization. In Fig. 3, we show parameter regimes where the bias can increase (Fig. 3(a-c)) or decrease (Fig. 3(d-f)) with σ on the MNIST dataset [19] (Appendix C.3). Although the dependence on σ is complicated and diverse, and there is no general rule for when the qualitative behavior changes, we find that our theory accurately captures the qualitative behavior of results obtained from GD (Appendix B.3 Fig. 3). In both regimes the variance increases with σ, as the solution space expands for a weaker regularization. Specifically, in Fig. 3(d-f), due to the increasing variance (e) and decreasing bias (d), there is a minimum error rate at intermediate σ ((f); see Appendix A.3 Eq. 50 for how the error rate is calculated from the mean and variance of the predictor), indicating an optimal level of regularization strength, as opposed to linear networks [4], where strong regularization (σ → 0) always results in optimal generalization.

Figure 3: Generalization as a function of σ for GGDLNs trained on the MNIST dataset, predicted by our theory. (a-c) Bias (a), variance (b) and error rate (c) increase as a function of σ. (d-f) Bias decreases as a function of σ while variance increases, leading to an optimal σ with minimum error rate.

GGDLNs with different choices of gatings achieve comparable performance to ReLU networks: The nonlinear operation of the gatings enables the network to learn nonlinear tasks. In Fig. 4, we show that although the gatings are fixed during training, the network achieves performance comparable to a fully trained nonlinear (ReLU) network with the same hidden-layer width for classifying even and odd digits in MNIST data when M is sufficiently large (over-parameterization does not lead to over-fitting here, as shown also in other nonlinear networks [27, 28], possibly due to the explicit L2 prior). Furthermore, although the gatings are fixed during the supervised training of the GGDLN, they can be cleverly chosen to improve generalization performance. To demonstrate this strategy, we compared two different choices of gatings. Random gatings take the form $g_m(x) = \Theta\big(\frac{1}{\sqrt{N_0}} V_m^\top x - b\big)$, where $V_m$ is an $N_0$-dimensional random vector with standard Gaussian i.i.d. elements, b is a scalar threshold, and $\Theta(x)$ is the Heaviside step function. The pretrained gatings are trained on the unlabelled training dataset with unsupervised soft k-means clustering, such that the m-th gating $g_m(x)$ outputs the probability of assigning data x to the m-th cluster (Appendix C.3). As shown in Fig. 4, for pretrained gatings, generalization performance improves with M much faster compared to random gatings, and approaches the performance of the ReLU network at a smaller M. Our theory (Fig. 4) and numerical results of GD dynamics (Appendix B.3 Fig. 4) agree qualitatively well. These results show that GGDLNs can achieve competitive performance on nonlinear tasks while remaining theoretically tractable.
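The two gating choices compared here are easy to sketch. Below is a minimal NumPy construction of both: random Heaviside gatings as defined above, and "pretrained" gatings obtained by a plain Lloyd-style k-means followed by a softmax soft assignment. The k-means details (initialization, number of iterations, the softmax temperature `beta`) are illustrative assumptions; the paper's exact pretraining procedure is described in its Appendix C.3.

```python
import numpy as np

def random_gatings(X, M, b=0.0, rng=None):
    """g_m(x) = Theta(V_m^T x / sqrt(N0) - b) with Gaussian i.i.d. V_m."""
    rng = rng or np.random.default_rng(0)
    N0 = X.shape[1]
    V = rng.standard_normal((M, N0))
    return (X @ V.T / np.sqrt(N0) - b > 0).astype(float)    # shape (num points, M)

def pretrained_gatings(X, M, n_iter=50, beta=1.0, rng=None):
    """Soft k-means gatings: g_m(x) = soft assignment probability of x to cluster m."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=M, replace=False)]  # initialize at random points
    for _ in range(n_iter):                                 # plain Lloyd iterations
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for m in range(M):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(axis=0)
    # Soft assignment: softmax of negative squared distance (temperature beta is a free choice).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)                 # rows sum to 1, shape (num points, M)

# Either matrix of gating values can be plugged into the kernels above as g_m(x^mu).
```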
Figure 4: Dependence of generalization on M for GGDLNs trained on the MNIST dataset, predicted by our theory. Bias (a), variance (b) and error rate (c) as a function of M for random (red lines) and pretrained gatings (blue lines), and a ReLU network with the same width (black dashed lines).

### 4.2 Kernel shape renormalization in deeper networks

We now consider the effect of the matrix renormalization on GGDLNs with more layers. We begin by analyzing the renormalization effect on the shape of the kernel in deep architectures. It is well known that the GP kernel of many nonlinear networks flattens (the kernel function goes to a constant) as the network depth increases [2], ultimately losing information about the input and degrading generalization performance. Here we show that kernel shape renormalization slows down the flattening of kernels by incorporating data-relevant information into the learned weights. To study the shape of the kernel independent of the kernel magnitude, we define the normalized kernel
$$\bar K_L(x, y) = \frac{K_L(x, y)}{K_L(x, x)^{1/2}\, K_L(y, y)^{1/2}},$$
where $K_L(x, y)$ denotes the renormalized kernel for a GGDLN with L hidden layers. This normalized kernel measures the cosine of the vectors x and y with the generalized inner product defined by the kernel $K_L(x, y)$, and therefore $\bar K_L(x, y) \in [-1, 1]$. For the GP kernel of GGDLNs, we have $\bar K_L(x, y) = \cos(g(x), g(y))^L \cos(x, y)$. While $\bar K_L$ depends on the specific choice of gatings in general, in the special case of random gatings with zero threshold, $g_m(x) = \Theta\big(\frac{1}{\sqrt{N_0}} V_m^\top x\big)$, and in the limit of the number of gatings $M \to \infty$, we can write $\bar K_L$ analytically as a function of the angle θ between the input vectors x and y,
$$\bar K_L(\theta) = \Big(1 - \frac{|\theta|}{\pi}\Big)^L \cos(\theta), \qquad \theta \in [-\pi, \pi].$$
Thus, as $L \to \infty$, $\bar K_L(\theta)$ shrinks to zero except at θ = 0. This flattening effect reflects the loss of information in deep networks, as pairs of inputs with different similarities all acquire hidden representations that are nearly orthogonal. The effect also empirically holds for networks with finite M (see Appendix B.4).

In Fig. 5, we study the effect of kernel renormalization on the flattening effect of deep GGDLNs. As shown in Fig. 5(a)-(c), the elements of the renormalized kernel shrink to zero at a much slower rate compared to the GP kernel. (Note that unlike the variance, the bias is affected only by shape changes, not by changes in the amplitude of the kernel; in Fig. 5(d) we therefore plot only the bias contribution to the generalization.) While mitigating the flattening of the GP kernel is a general feature of our renormalized kernel for different parameters, its effect on the generalization performance (especially the bias) may differ for different network parameters. In the specific example in Fig. 5, finite-width networks with a less flattened renormalized kernel achieve better performance than the GP limit. Both the GP limit and the finite-width networks have optimal performance at L = 2 in this example.
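The flattening of the normalized GP kernel with depth is easy to visualize numerically. The snippet below evaluates $\bar K_L(\theta) = (1 - |\theta|/\pi)^L\cos\theta$ (the $M \to \infty$, zero-threshold expression above) together with a finite-M Monte Carlo estimate; the angle, M, and sample sizes are arbitrary choices made for illustration.

```python
import numpy as np

def normalized_gp_kernel(theta, L):
    """Analytic M -> infinity normalized GP kernel for zero-threshold random gatings."""
    return (1.0 - np.abs(theta) / np.pi) ** L * np.cos(theta)

def normalized_gp_kernel_finite_M(theta, L, M=50, N0=200, n_trials=200, rng=None):
    """Monte Carlo estimate of cos(g(x), g(y))^L cos(theta) with M random Heaviside gatings."""
    rng = rng or np.random.default_rng(0)
    vals = []
    for _ in range(n_trials):
        x = np.zeros(N0); x[0] = 1.0
        y = np.array([np.cos(theta), np.sin(theta)] + [0.0] * (N0 - 2))  # angle theta to x
        V = rng.standard_normal((M, N0))
        gx, gy = (V @ x > 0).astype(float), (V @ y > 0).astype(float)
        cos_g = gx @ gy / (np.linalg.norm(gx) * np.linalg.norm(gy) + 1e-12)
        vals.append(cos_g ** L * np.cos(theta))
    return float(np.mean(vals))

# The normalized kernel at a fixed angle shrinks towards zero as depth L grows.
for L in range(1, 5):
    analytic = normalized_gp_kernel(np.pi / 3, L)
    empirical = normalized_gp_kernel_finite_M(np.pi / 3, L)
    print(L, round(analytic, 3), round(empirical, 3))
```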
Figure 5: Shape renormalization slows down flattening of kernels in deep networks. (a-b) Distribution of kernel elements $\bar K_L(x, y)$ for the renormalized kernel (a) and the GP kernel (b) for different network depths L. (c) The ratio of kernel elements smaller than or equal to 0.05 increases faster for the GP kernel (blue line) compared to the renormalized kernel (black line); the renormalization slows down the rate at which elements of the GP kernel shrink to zero as a function of L. (d) The bias contribution to the generalization first decreases and then increases as a function of L due to the flattening of the kernel (blue line). The finite-width network with the renormalized kernel performs better for L > 1 in this parameter regime (black line). See Appendix C.3 for detailed parameters.

## 5 GGDLNs for multiple tasks

In this section, we apply our theory to investigate the ability of GGDLNs to perform multiple tasks. We consider two different scenarios below. In the first, different tasks require the network to learn input-output mappings on input data with different statistics. This scenario corresponds to real-life situations where the training data distribution is non-stationary. The tasks can be separated without any additional top-down information; in this case, the gatings are bottom-up, and are functions of the input data only. In the second, different tasks give conflicting labels for the same inputs, corresponding to the situation where performing the two tasks requires additional top-down contextual information, and this information can be incorporated into the gating units of GGDLNs. In both scenarios, when the gatings are fixed and we modulate the de-correlation by changing the network width, and thus the strength of the kernel renormalization, we find that de-correlation between tasks leads to better generalization performance.

### 5.1 Bottom-up gating units

Figure 6: GGDLNs with bottom-up gating units learning multiple tasks, trained on permuted MNIST. (a-b) Task-task correlation matrix C for N = 50 and N = 1000; different permutations are more decorrelated for larger N. (c) Error rate decreases as a function of N due to the decorrelation. (d) Ratio of the average amplitude of diagonal versus off-diagonal elements in C increases as a function of N. (e) Error rate first decreases then increases as a function of the gating threshold. (f) Decorrelation increases as a function of the gating threshold.

First we consider learning different tasks defined by vastly different input statistics with bottom-up gatings, using permuted MNIST as an example. Previous works have shown that GLNs mitigate catastrophic forgetting when sequentially trained on permuted MNIST [8]. While our theory does not directly address the dynamics of sequential learning, we aim to shed light on this question by asking how the two tasks interfere with each other when they are learned simultaneously. We introduce a measure of inter-task interference by noting that after learning, the mean predictor on a new data point $x^*$ (Eq. 8) is a linear combination of the output labels $Y^\mu$ of all the training data, and the coefficient of $Y^\mu$ in this linear combination is given by the µ-th component of $\tilde k(x^*)^\top \tilde K^{-1}$. Thus, we define a task-task correlation matrix via
$$C_{pq} = \sum_{\gamma=1}^{P_t}\sum_{\mu=1}^{P}\big|\tilde k^\top \tilde K^{-1}\big|_{p\gamma,\, q\mu}, \qquad p, q = 1, \dots, n,$$
where we assume there are P training examples and $P_t$ test data points for each task, with a total of n tasks. The amplitude of each element $C_{pq}$ measures how much the training data of task q contribute to the prediction on the test data of task p. Stronger diagonal elements indicate that the network separates the processing of data of different tasks (Fig. 6(a)-(b)). As we show in Fig. 6, we can tune the relative strength of the diagonal elements of C smoothly by changing the network width (Fig. 6(a)-(d)) or by changing the threshold of the gating (Fig. 6(e)-(f)).
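A small sketch of this interference measure: given the train-train kernel over all tasks' training data and the test-train kernel over all tasks' test data (GP or renormalized), the matrix C simply sums absolute predictor coefficients within task blocks. The task-contiguous ordering of rows and columns is an assumption made for illustration.

```python
import numpy as np

def task_correlation_matrix(k_test_train, K_train, n_tasks, P, Pt):
    """C_pq = sum of |k^T K^{-1}| over the (test-of-task-p, train-of-task-q) block.

    k_test_train: (n_tasks * Pt, n_tasks * P) kernel between test and training points.
    K_train:      (n_tasks * P, n_tasks * P) kernel on the training points.
    Rows/columns are assumed ordered task by task (task 0 first, then task 1, ...).
    """
    coeffs = np.abs(k_test_train @ np.linalg.inv(K_train))   # predictor coefficients
    C = np.zeros((n_tasks, n_tasks))
    for p in range(n_tasks):
        for q in range(n_tasks):
            block = coeffs[p * Pt:(p + 1) * Pt, q * P:(q + 1) * P]
            C[p, q] = block.sum()
    return C

def diag_offdiag_ratio(C):
    """Diagnostic of the kind plotted in Fig. 6(d): average diagonal vs. off-diagonal amplitude."""
    n = C.shape[0]
    diag = np.trace(C) / n
    off = (C.sum() - np.trace(C)) / (n * (n - 1))
    return diag / off
```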
In the case where the gatings are fixed and the network width is changed, an increase in the strength of the diagonal elements (Fig. 6(d)) results in better generalization (Fig. 6(c)), indicating that the network generalizes better by processing data of different tasks separately through the gating units. In contrast, when we change the activation of the gatings by adjusting the threshold, although different tasks are more de-correlated when the threshold is large, because a set of less-overlapping gatings is activated for each task, the generalization error first decreases and then increases again. This is because for a large threshold, the sparsity of the gatings activated for each task limits the nonlinearity of the network, and therefore the generalization performance on this nonlinear task.

### 5.2 Combined top-down and bottom-up gating units

Figure 7: Kernel renormalization de-correlates different tasks defined by different labels on the same inputs. (a) GGDLNs performing two tasks using a combined top-down and bottom-up task signal. (b) Top: Renormalized kernel calculated with Eq. 7 from GD dynamics. Bottom: Renormalized kernel from theory. (c) Ratio of the magnitude of the diagonal blocks (white dashed lines in (b)) versus the off-diagonal blocks decreases as a function of N. (d) Generalization error increases with N.

We now consider learning two tasks that provide conflicting labels on the same input data. The gating units combine a top-down task signal, which informs the system of which task to perform for a given input, with bottom-up signals which, as before, depend on the input. In different tasks, different sets of gatings are permitted or forbidden depending on the top-down signal; the states of the permitted gatings are then determined as a function of the input x, while the forbidden gatings are set to 0, so that the corresponding dendritic branches do not connect to the previous-layer neurons in that task (Fig. 7(a)). For a single-hidden-layer network, by a similar argument as in Section 2, it is straightforward to show that the number of different tasks that can be memorized satisfies $n \leq M$, and the number of training examples for each task needs to satisfy $P \leq N_0 M_p$, where $M_p$ is the number of permitted gating units in each task. In the limiting case where a set of non-overlapping gating units is permitted in each of the n tasks, the network is equivalent to n sub-networks, each independently performing one task. In this case $M_p$ is limited by M/n, which in turn limits the capacity and the effective input-output nonlinearity of each independent task. We consider the case where the permitted gatings are chosen randomly for each task and are therefore, in general, overlapping across tasks. We then investigate how learning modifies the correlation induced by the overlapping gatings through the renormalization matrix. As an example we consider training on permuted and un-permuted MNIST digits 0 and 1. One task is to classify the two digits in both permuted and un-permuted data, and the second task is to separate the permuted digits (both 0 and 1) from the un-permuted digits. The labels of the two tasks are uncorrelated, while the permitted gatings of the two tasks are partially overlapping. In this case the renormalized kernel $\tilde K_1$ can be written as
$$\tilde K_1^{p\mu,\, q\nu} = \Big(\frac{1}{M}\, g^p(x^\mu)^\top U_1\, g^q(x^\nu)\Big)\frac{\sigma^2}{N_0}\, x^{\mu\top} x^{\nu},$$
where $p, q \in \{1, 2\}$ are the task indices, $\mu, \nu = 1, \dots, P$ are the input indices, and $g^p(x)$ denotes the gating vector with the forbidden gatings of task p set to zero.
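A sketch of this combined top-down/bottom-up gating and of the resulting multi-task kernel is given below. The random permitted-gating masks, Heaviside bottom-up gatings and the toy data are illustrative assumptions; the GP-limit choice $U_1 = \sigma^2 I$ is used only to show the block structure, and in the finite-width theory it would be replaced by the learned $U_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
N0, M, P, sigma2 = 50, 20, 100, 1.0
n_tasks, Mp = 2, 12                          # Mp permitted gatings per task (illustrative)

# Top-down signal: a random binary mask of permitted gatings per task (overlapping in general).
permit = np.zeros((n_tasks, M))
for p in range(n_tasks):
    permit[p, rng.choice(M, size=Mp, replace=False)] = 1.0

V = rng.standard_normal((M, N0))             # bottom-up random gating projections
def gated(X, task):
    """g^p(x): bottom-up Heaviside gatings with the forbidden gatings of task p set to zero."""
    g = (X @ V.T / np.sqrt(N0) > 0).astype(float)
    return g * permit[task]

# The same inputs presented under both task contexts give a 2P x 2P kernel.
X = rng.standard_normal((P, N0))
G = [gated(X, p) for p in range(n_tasks)]
K0 = (sigma2 / N0) * (X @ X.T)
blocks = [[(sigma2 / M) * (G[p] @ G[q].T) * K0 for q in range(n_tasks)]
          for p in range(n_tasks)]
K_multi = np.block(blocks)                   # block (p, q) couples task p data with task q data
print(K_multi.shape)                         # (2P, 2P); off-diagonal blocks are the cross kernels
```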
The kernel is therefore $2P \times 2P$, as shown in Fig. 7(b) (P = 600); the diagonal blocks (white dashed lines) correspond to the kernels of task 1 and task 2, while the off-diagonal blocks correspond to the cross kernels. In Fig. 7(b) bottom, we show the renormalized kernel with the renormalization matrix $U_1$ calculated by solving Eq. 6. Similar results are obtained by numerically estimating Eq. 7 with readout weights obtained from GD dynamics (Fig. 7(b) top). The results demonstrate that the stronger kernel renormalization achieved in narrower networks suppresses more strongly the correlation between tasks, reflected by the weaker off-diagonal blocks in Fig. 7(b). A decreasing ratio between the average amplitudes of the diagonal and off-diagonal blocks shows that the de-correlation effect diminishes for large N, leading to increasing generalization error with N (Fig. 7(c & d)).

## 6 Discussion

In this work, we proposed a novel gating network architecture, the GGDLN, amenable to theoretical analysis of the network expressivity and generalization performance. The predictor statistics of GGDLNs can be expressed in terms of kernels that undergo shape renormalization, resulting in diverse behavior of the bias as a function of various network parameters. This renormalization slows down the flattening of the GP kernel in deep networks, suggesting that the loss of input information as L increases may be prevented in finite-width nonlinear networks. We also investigate the capability of GGDLNs to perform multiple tasks. While our theory is an exact description of the posterior weight distribution induced by Langevin dynamics in Bayesian learning, it shows surprisingly good qualitative agreement with results obtained with GD dynamics, not only for the generalization but also for the kernel representation with matrix renormalization, largely extending its applicability.

There are several limitations of our work. Our mean-field analysis is accurate in the finite-width thermodynamic limit where both P and N go to infinity, but M and L remain finite. In practice, the size of the renormalization matrix increases as $M^L$, hence for some moderate M, as L increases, any large but finite N might eventually take the network outside the above thermodynamic regime. The theory also focuses on the equilibrium distribution induced by learning and does not address important questions related to the learning dynamics. Finally, although we have shown qualitative correspondence between GGDLN properties and standard DNNs with local nonlinearity, such as ReLU, a full theory of the thermodynamic limit of DNNs with local nonlinearity is still an open challenge. While our theory currently addresses learning in GGDLNs using a global cost function, extending the formalism of the equilibrium distribution to characterize local learning dynamics is ongoing work. Recent works have shown that multilayer perceptrons (MLPs) with learned gatings that implement spatial attention have surprisingly good performance in Natural Language Processing (NLP) and computer vision [29]. Extension of our theory to learnable gatings that implement attention mechanisms remains to be explored.
Furthermore, incorporating convolutional architectures [30, 31, 32, 33, 34] into our GGDLNs and using the gating units to encode context-dependent modifications of different feature maps is an interesting direction related to the fast-developing research topic of visual question answering (VQA) [35, 36, 37], where answering different questions about the same image is similar to performing multiple tasks in different contexts with different labels on the same dataset, as we discussed in Section 5.2. We leave these exciting research directions for future work.

Acknowledgement: We thank the anonymous reviewers for their helpful comments. This research is supported by the Swartz Foundation, the NIH grant from the NINDS (No. 1U19NS104653), and the Gatsby Charitable Foundation. We acknowledge the support of a generous gift from Amazon. This paper is dedicated to the memory of Mrs. Lily Safra, a great supporter of brain research.

## References

[1] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. Advances in Neural Information Processing Systems, 22, 2009.
[2] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[3] Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
[4] Qianyi Li and Haim Sompolinsky. Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Physical Review X, 11(3):031059, 2021.
[5] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537-11546, 2019.
[6] Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374, 2018.
[7] David Budden, Adam Marblestone, Eren Sezener, Tor Lattimore, Gregory Wayne, and Joel Veness. Gaussian gated linear networks. Advances in Neural Information Processing Systems, 33:16508-16519, 2020.
[8] Eren Sezener, Agnieszka Grabska-Barwińska, Dimitar Kostadinov, Maxime Beau, Sanjukta Krishnagopal, David Budden, Marcus Hutter, Joel Veness, Matthew Botvinick, Claudia Clopath, et al. A rapid and efficient learning rule for biological neural circuits. bioRxiv, 2021.
[9] Joel Veness, Tor Lattimore, David Budden, Avishkar Bhoopchand, Christopher Mattern, Agnieszka Grabska-Barwinska, Eren Sezener, Jianan Wang, Peter Toth, Simon Schmitt, et al. Gated linear networks. arXiv preprint arXiv:1910.01526, 2019.
[10] Jonathan Fiat, Eran Malach, and Shai Shalev-Shwartz. Decoupling gating from linearity. arXiv preprint arXiv:1906.05032, 2019.
[11] Joel Veness, Tor Lattimore, Avishkar Bhoopchand, Agnieszka Grabska-Barwinska, Christopher Mattern, and Peter Toth. Online learning with gated linear networks. arXiv preprint arXiv:1712.01897, 2017.
[12] Samuel Lippl, L. F. Abbott, and SueYeon Chung. The implicit bias of gradient descent on generalized gated linear networks. arXiv preprint arXiv:2202.02649, 2022.
[13] Roman Vershynin. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science, 2(4):1004-1033, 2020.
[14] Masami Yamasaki. The lower bound of the capacity for a neural network with multiple hidden layers. In International Conference on Artificial Neural Networks, pages 546-549. Springer, 1993.
[15] Daniel J. Amit, Hanoch Gutfreund, and Haim Sompolinsky. Statistical mechanics of neural networks near saturation. Annals of Physics, 173(1):30-67, 1987.
[16] Madhu Advani, Subhaneil Lahiri, and Surya Ganguli. Statistical mechanics of complex neural systems and high dimensional data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03014, 2013.
[17] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and Surya Ganguli. Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics, 11:501-528, 2020.
[18] Andreas Engel and Christian Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.
[19] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[20] Gadi Naveh, Oded Ben David, Haim Sompolinsky, and Zohar Ringel. Predicting the outputs of finite deep neural networks trained with noisy gradients. Physical Review E, 104(6):064301, 2021.
[21] Jean Zinn-Justin. Quantum Field Theory and Critical Phenomena, volume 171. Oxford University Press, 2021.
[22] H. Risken and H. Haken. The Fokker-Planck Equation: Methods of Solution and Applications, second edition. Springer, 1989.
[23] Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, and John R. Hershey. Student-teacher network learning with enhanced features. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5275-5279. IEEE, 2017.
[24] Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[25] Madhu S. Advani, Andrew M. Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428-446, 2020.
[26] Hyunjune Sebastian Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.
[27] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854, 2019.
[28] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167-1180, 2020.
[29] Hanxiao Liu, Zihang Dai, David So, and Quoc V. Le. Pay attention to MLPs. Advances in Neural Information Processing Systems, 34:9204-9215, 2021.
[30] Mark Van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. Advances in Neural Information Processing Systems, 30, 2017.
[31] Gadi Naveh and Zohar Ringel. A self consistent theory of Gaussian processes captures feature learning effects in finite CNNs. Advances in Neural Information Processing Systems, 34, 2021.
[32] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 77:354-377, 2018.
[33] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
[34] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET), pages 1-6. IEEE, 2017.
[35] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425-2433, 2015.
[36] Kushal Kafle and Christopher Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 163:3-20, 2017.
[37] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. Advances in Neural Information Processing Systems, 29, 2016.
[38] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Sections 3 and 6.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix A.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix C.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Fig. 6.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix C.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes] See Appendix C.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]