# Implicit Bias of Linear RNNs

Melikasadat Emami (1), Mojtaba Sahraee-Ardakan (1, 2), Parthe Pandit (1, 2), Sundeep Rangan (3), Alyson K. Fletcher (1, 2)

(1) Department of Electrical and Computer Engineering, University of California, Los Angeles, USA. (2) Department of Statistics, University of California, Los Angeles, USA. (3) Department of Electrical and Computer Engineering, New York University, Brooklyn, New York, USA. Correspondence to: Melikasadat Emami.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

## Abstract

Contemporary wisdom based on empirical studies suggests that standard recurrent neural networks (RNNs) do not perform well on tasks requiring long-term memory. However, RNNs' poor ability to capture long-term dependencies has not been fully understood. This paper provides a rigorous explanation of this property in the special case of linear RNNs. Although this work is limited to linear RNNs, even these systems have traditionally been difficult to analyze due to their non-linear parameterization. Using recently developed kernel regime analysis, our main result shows that as the number of hidden units goes to infinity, linear RNNs learned from random initializations are functionally equivalent to a certain weighted 1D convolutional network. Importantly, the weightings in the equivalent model cause an implicit bias toward elements with smaller time lags in the convolution, and hence shorter memory. The degree of this bias depends on the variance of the transition matrix at initialization and is related to the classic exploding and vanishing gradients problem. The theory is validated with both synthetic and real data experiments.

## 1. Introduction

Over the past decade, models based on neural networks have become commonplace in virtually all machine learning applications. A key feature of this class of models is that, in principle, they do not require an explicit design of features but instead rely on an implicit learning of meaningful representations of the data. This makes neural networks attractive from the point of view of machine perception, where raw signals are inputs to the model. While neural networks have become ubiquitous in practical applications, several theoretical aspects of them largely remain a mystery. Of significant importance is the lack of understanding of the potential implicit biases that these representations bring to the predictions of the model. This paper aims to understand the implicit bias behaviour of recurrent neural networks (RNNs).

Several machine learning tasks require dealing with sequential data with possibly varying lengths of input sequences. Example tasks include automatic speech recognition, language translation, and image captioning, among others. A well-known shortcoming of standard RNNs trained using gradient descent is their poor performance at tasks requiring long-term dependence (Bengio et al., 1994). Traditional RNNs suffer from the well-known vanishing/exploding gradient problem, where gradients from input time samples at long delays either decay or explode, making the learning of long-term effects difficult. In practice, there have been a large number of enhancements to RNNs to overcome these issues, most notably LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Cho et al., 2014).
However, while these methods have had tremendous practical success, the ability of different RNN architectures to capture or not capture long-term dependencies has been difficult to rigorously analyze. As a simple starting point, this paper seeks to precisely understand the training and memory of linear RNNs. Even such linear models have been difficult to analyze completely due to their non-linear parameterization. Our main result shows rigorously that these models have an implicit bias toward short-term memory, confirming the qualitative, empirically observed behavior of these networks. To our knowledge, this paper is the first to provide such a characterization of an implicit bias in RNNs.

Our results are based on the key observation that linear RNNs are functionally equivalent to a 1D convolutional model which is feed-forward in nature. Further, using the Neural Tangent Kernel (NTK) regime analysis (Jacot et al., 2018), we are able to show that RNNs trained using gradient descent are implicitly biased towards learning tasks with short-term contexts. Our result holds in a certain wide-limit regime where the number of hidden units in the RNN goes to infinity. The contributions are as follows:

- We explicitly compute the NTK for a linear RNN. This is challenging due to the weight sharing in RNNs, which leads to statistical dependencies across time. We calculate this NTK using a conditioning technique as in (Bayati & Montanari, 2011) to deal with the dependencies. This NTK is also calculated in (Alemohammad et al., 2020) using the Tensor Programs results of (Yang, 2019a).
- We show that the linear RNN NTK is equivalent to the NTK of a scaled convolutional model with certain scaling coefficients. This means that in the wide-limit regime (the number of hidden units in the RNN tending to infinity), gradient descent training of a linear RNN with its non-linear parameterization is identical to the training of an appropriately scaled convolutional model.
- The above results rigorously show that there is an implicit bias in using the non-linear parameterization associated with a linear RNN. In particular, training linear RNNs with the non-linear parameterization using gradient descent is implicitly biased towards short memory.
- We demonstrate the bias-variance trade-off of linear RNNs in experiments on synthetic and real data.

### 1.1. Prior Work

The connection between kernel methods and infinite-width neural networks was first introduced in (Neal, 1996). Neural networks in the infinite-width limit are equivalent to Gaussian processes at initialization, and several papers have investigated the correspondence to kernel methods for a variety of architectures (Lee et al., 2018; Novak et al., 2019; Garriga-Alonso et al., 2019; Yang, 2019a; Daniely et al., 2016; Matthews et al., 2018). In particular, (Daniely et al., 2016) introduced a framework to link a reproducing kernel to the neural network, and stochastic gradient descent was shown to learn any function in the corresponding RKHS if the network is sufficiently wide (Daniely, 2017). A recent line of work has shown that gradient descent on over-parameterized networks can achieve zero training error with parameters very close to their initialization (Allen-Zhu et al., 2018; Du et al., 2018a;b; Li & Liang, 2018; Zou et al., 2018).
The analysis of the generalization error in this high-dimensional regime led to exact characterizations of the test error for different architectures (Montanari et al., 2019; Hastie et al., 2019; Belkin et al., 2019; Emami et al., 2020; Barbier et al., 2019; Gerbelot et al., 2020). In addition to convergence to a global minimum for an over-parameterized two-layer neural network, (E et al., 2020) also showed that the resulting functions are uniformly close to the ones found with the kernel regime. It was shown by (Jacot et al., 2018) that the behavior of an infinitely wide fully-connected neural network trained by gradient descent is characterized by the so-called Neural Tangent Kernel (NTK), which is essentially the linearization of the network around its initialization. The NTK was later extended to different architectures (Arora et al., 2019; Yang, 2019a;b; Alemohammad et al., 2020). A different line of papers investigated over-parameterized neural networks from the mean-field viewpoint (Mei et al., 2018; Wei et al., 2019; Ma et al., 2019; Dou & Liang, 2020; Sirignano & Spiliopoulos, 2020; Rotskoff & Vanden-Eijnden, 2018).

For recurrent neural networks in particular, (Chen et al., 2018) provided a theory for signal propagation in these networks which can predict their trainability. The authors also give a closed-form initialization to improve the conditioning of the input-output Jacobian. Trainability of RNNs has also been improved by using orthogonal/unitary weight matrices (Arjovsky et al., 2016; Wisdom et al., 2016; Jing et al., 2017; Emami et al., 2019). It has been shown in (Emami et al., 2019) that for RNNs with ReLU activations, there is no loss in the expressiveness of the model when imposing the unitary constraint.

Note that in an RNN the states are correlated due to weight sharing. Previous work such as (Chen et al., 2018) has simplified the setting by assuming independence over the RNN weights (ignoring the correlation) to show that the pre-activations are Gaussian distributed. In this work, taking these dependencies into account, we use techniques from (Bayati & Montanari, 2011; Rangan et al., 2019) to characterize the behaviour of RNNs at initialization. Similar techniques have also been explored in (Yang, 2019a).

We should mention that learning the weight matrices of a linear RNN from data is essentially a system identification task. There is a large body of literature in control theory that considers the system identification problem and proposes many different methods to find a system that matches the input-output behavior of a given system. These methods include the prediction error method (PEM), subspace methods, the empirical transfer function estimate (ETFE), the correlation method, the spectral analysis method, and the sequential Monte Carlo method, to name a few. For a more comprehensive list of system identification methods and details see (Ljung, 1999; Ljung & Glad, 1994; Lennart, 1999; Katayama, 2006; Viberg, 1995; Söderström & Stoica, 1989; Pintelon & Schoukens, 2012; Sjöberg et al., 1995). Even though it would be interesting to see how different system identification methods could be incorporated into the neural network training pipeline, the vast majority of works currently learn the weights directly by optimizing a loss function via gradient descent or its variants. As such, in this work we focus solely on training of linear RNNs using gradient descent.
## 2. Linear RNN and Convolutional Models

**Linear RNNs.** We fix a time period T and consider a linear RNN mapping an input sequence x = (x_0, ..., x_{T-1}) to an output sequence y = (y_0, ..., y_{T-1}) via the updates

$$h_t = \frac{1}{\sqrt{n}} W h_{t-1} + F x_t, \qquad y_t = \frac{1}{\sqrt{n}} C h_t, \qquad t = 0, \dots, T-1, \tag{1}$$

with the initial condition h_{-1} = 0. We let n_x, n_y, and n be the dimensions at each time of the input x_t, output y_t, and hidden state h_t, respectively. Note that a bias term can be added for h_t by extending x_t and F. We will let

$$y = f_{\mathrm{RNN}}(x, \theta_{\mathrm{RNN}}) \tag{2}$$

denote the mapping (1), where θ_RNN are the parameters

$$\theta_{\mathrm{RNN}} = (W, F, C). \tag{3}$$

The goal is to learn parameters θ_RNN for the system from N training data samples (x_i, y_i), i = 1, ..., N. In this case, each sample (x_i, y_i) is a T-length input-output pair.

**Wide System Limit.** We wish to understand learning of this system in the wide-system limit where the number of hidden units n → ∞ while the dimensions n_x, n_y and the number of time steps T are fixed. Since the parameterization of the RNN is non-linear, the initialization is critical. For each n, we will assume that the parameters W, F, C are initialized with i.i.d. components,

$$W_{ij} \sim \mathcal{N}(0, \nu_W), \qquad F_{ij} \sim \mathcal{N}(0, \nu_F), \qquad C_{ki} \sim \mathcal{N}(0, \nu_C), \tag{4}$$

for constants ν_W, ν_F, ν_C.

**Stability.** In the initialization (4), ν_W is the variance of the components of the kernel matrix W. One critical aspect in selecting ν_W is the stability of the system. A standard result in linear systems theory (see, e.g., (Kailath, 1980)) is that the system (1) is stable if and only if $\frac{1}{\sqrt{n}}\lambda_{\max}(W) < 1$, where λ_max(W) is the maximum absolute eigenvalue of W (i.e., the spectral radius). Stable W are generally necessary for linear RNNs: otherwise, bounded inputs x_t can result in outputs y_t that grow unbounded with time t. Hence, training will be numerically unstable. Now, a classic result in random matrix theory (Bai & Yin, 1986) is that, since the entries of W are i.i.d. Gaussian N(0, ν_W),

$$\lim_{n \to \infty} \frac{1}{\sqrt{n}}\, \lambda_{\max}(W) = \sqrt{\nu_W}$$

almost surely. Hence, for stability we need to select ν_W < 1. As we will see below, it is this constraint that limits the ability of the linear RNN to learn long-term memory.

**Scaled 1D Convolutional Equivalent Systems.** Our main result will draw an equivalence between the learning of linear RNNs and certain types of linear convolutional models. Specifically, consider a linear convolutional model of the form

$$y_t = \sum_{j=0}^{t} L_j x_{t-j}, \tag{5}$$

where L_j ∈ R^{n_y × n_x} are the filter coefficient matrices. In neural network terminology, the model (5) is simply a linear 1D convolutional network with n_x input channels, n_y output channels, and T-wide kernels. Both the linear RNN model (1) and the 1D convolutional model (5) define linear mappings of T-length input sequences x to T-length output sequences y.

To state our equivalence result between these models, we need to introduce a certain scaled parametrization. Fix a set of scaling factors ρ = (ρ_0, ..., ρ_{T-1}) where ρ_j > 0 for all j. Define the parameters

$$\theta_{\mathrm{conv}} = (\theta_0, \dots, \theta_{T-1}), \tag{6}$$

where θ_j ∈ R^{n_y × n_x}. Given any θ_conv, let the impulse response coefficients be

$$L_j = \sqrt{\rho_j}\, \theta_j. \tag{7}$$

As we will see below, the effect of the weighting is to favor certain coefficients L_j over others during training: for coefficients j where ρ_j is large, the fitting will tend to select L_j large if needed. This scaling will be fundamental in understanding implicit bias. Now, given a set of weights ρ = (ρ_0, ..., ρ_{T-1}), let

$$y = f_{\mathrm{conv}}(x, \theta_{\mathrm{conv}}), \qquad y_t := \sum_{j=0}^{t} \sqrt{\rho_j}\, \theta_j\, x_{t-j}, \tag{8}$$

denote the mapping of the input x = (x_0, ..., x_{T-1}) through the convolutional filter θ_conv with scaling coefficients ρ to produce the output sequence y = (y_0, ..., y_{T-1}).
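To make the two parameterizations concrete, here is a minimal sketch (in JAX; the variable names, sizes, and tolerance comments are our own illustrative choices, not taken from the paper's code) of the recursion (1), the scaled convolutional map (8), and a numerical check that feeding the RNN's impulse response coefficients L_j = n^{-(j+1)/2} C W^j F (eq. (9) below) into the unscaled convolution reproduces the RNN's input-output map.

```python
import jax
import jax.numpy as jnp

def rnn_forward(W, F, C, x):
    """Linear RNN (1): h_t = W h_{t-1} / sqrt(n) + F x_t,  y_t = C h_t / sqrt(n)."""
    n, T = W.shape[0], x.shape[0]
    h, ys = jnp.zeros(n), []
    for t in range(T):
        h = W @ h / jnp.sqrt(n) + F @ x[t]
        ys.append(C @ h / jnp.sqrt(n))
    return jnp.stack(ys)                                   # shape (T, n_y)

def scaled_conv_forward(theta, rho, x):
    """Scaled 1D convolution (8): y_t = sum_{j<=t} sqrt(rho_j) * theta_j @ x_{t-j}."""
    T = x.shape[0]
    return jnp.stack([sum(jnp.sqrt(rho[j]) * theta[j] @ x[t - j] for j in range(t + 1))
                      for t in range(T)])

# i.i.d. Gaussian initialization as in (4); nu_W < 1 keeps the RNN stable.
n, nx, ny, T = 200, 3, 2, 10
nu_W, nu_F, nu_C = 0.3, 1.0, 1.0
kW, kF, kC, kx = jax.random.split(jax.random.PRNGKey(0), 4)
W = jnp.sqrt(nu_W) * jax.random.normal(kW, (n, n))
F = jnp.sqrt(nu_F) * jax.random.normal(kF, (n, nx))
C = jnp.sqrt(nu_C) * jax.random.normal(kC, (ny, n))
x = jax.random.normal(kx, (T, nx))

# Impulse response L_j = n^{-(j+1)/2} C W^j F; with rho_j = 1 the scaled model
# reduces to the plain convolution (5), so it reproduces the RNN output exactly.
L, Wj = [], jnp.eye(n)
for j in range(T):
    L.append(C @ Wj @ F / n ** ((j + 1) / 2))
    Wj = Wj @ W
y_rnn = rnn_forward(W, F, C, x)
y_conv = scaled_conv_forward(L, jnp.ones(T), x)
print(jnp.max(jnp.abs(y_rnn - y_conv)))                    # small (float32 round-off)
```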
It is well known that the RNN and convolutional models define the same total set of input-output mappings, as given by the following standard result.

**Proposition 2.1.** Consider the linear RNN model (2) and the 1D convolutional model (8).

(a) Given linear RNN parameters θ_RNN = (W, F, C), a 1D convolutional filter with coefficient matrices

$$L_j = \frac{1}{n^{(j+1)/2}}\, C W^j F, \qquad j = 0, \dots, T-1, \tag{9}$$

will have an identical input-output mapping. That is, there exist parameters θ_conv such that

$$f_{\mathrm{RNN}}(x, \theta_{\mathrm{RNN}}) = f_{\mathrm{conv}}(x, \theta_{\mathrm{conv}}) \tag{10}$$

for all inputs x.

(b) Conversely, given any T filter coefficients {L_j}, j = 0, ..., T-1, there exists an RNN model with n ≥ T n_x n_y hidden states such that the RNN and 1D convolutional model have identical input-output mappings over T-length sequences.

*Proof.* These are standard results from linear systems theory (Kailath, 1980). In linear systems theory, the coefficients L_j are together called the matrix impulse response. Part (b) follows by finding C, W, and F to match the equations (9). Details are given in Appendix B.1.

**Linear and Non-Linear Parametrizations.** Proposition 2.1 shows that linear RNNs with sufficient width can represent the same input-output mappings as any linear convolutional system. The difference between the models is in the parameterizations. The output of the convolutional model is linear in the parameters θ_conv, whereas it is non-linear in θ_RNN. As we will see below, the non-linear parameterization of the RNN results in certain implicit biases.

## 3. NTKs of Linear RNNs and Scaled Convolutional Models

### 3.1. Neural Tangent Kernel Background

To state our first set of results, we briefly review the neural tangent kernel (NTK) theory from (Jacot et al., 2018; Arora et al., 2019). The main definitions and results we need are as follows. Consider the problem of learning a (possibly non-linear) model of the form

$$\hat{y} = f(x, \theta), \tag{11}$$

where x ∈ R^{m_x} is an input, f(·) is a model function differentiable with respect to parameters θ, and ŷ is some prediction of an output y ∈ R^{m_y}. The problem is to learn the parameters θ from training data {x_i, y_i}, i = 1, ..., N. For sequence problems, we use the convention that each x_i and y_i represents one entire input-output sequence. Hence, the dimensions will be m_x = n_x T and m_y = n_y T. Now, given the training data and an initial parameter estimate θ_0, the neural tangent kernel (NTK) model is the linear model

$$\hat{y} = f^{\mathrm{lin}}(x, \alpha) := f(x, \theta_0) + \sum_{j=1}^{N} K(x_j, x)\, \alpha_j, \tag{12}$$

where K(x, x') is the so-called NTK,

$$[K(x, x')]_{ij} := \left\langle \frac{\partial f_i(x, \theta_0)}{\partial \theta},\; \frac{\partial f_j(x', \theta_0)}{\partial \theta} \right\rangle, \tag{13}$$

where f_i denotes the i-th output dimension and α is a vector of dual coefficients,

$$\alpha = (\alpha_1, \dots, \alpha_N), \qquad \alpha_j \in \mathbb{R}^{m_y}. \tag{14}$$

Note that K(x, x') depends implicitly on θ_0. Also, for a fixed initial condition θ_0, the model (12) is linear in the parameters α, and hence potentially easier to analyze than the original non-linear model (11).
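Definition (13) can also be evaluated directly at finite width for any differentiable model by stacking Jacobians. The sketch below is a generic empirical-NTK helper plus a toy application to the scaled convolutional model (8); the function names and sizes are ours, and it is meant only as an illustration of the definition, not as the paper's computation.

```python
import jax
import jax.numpy as jnp

def empirical_ntk(f, theta0, x1, x2):
    """Empirical NTK (13): [K(x,x')]_{ij} = <df_i(x,theta0)/dtheta, df_j(x',theta0)/dtheta>."""
    def flat_jacobian(x):
        J = jax.jacobian(lambda th: f(th, x))(theta0)   # pytree of (out_dim, ...) arrays
        m = f(theta0, x).shape[0]
        return jnp.concatenate(
            [leaf.reshape(m, -1) for leaf in jax.tree_util.tree_leaves(J)], axis=1)
    return flat_jacobian(x1) @ flat_jacobian(x2).T      # (out_dim, out_dim) kernel matrix

# Toy application: the scaled convolutional model (8), flattened to a vector output.
# Because that model is linear in theta, its kernel does not depend on theta0.
T, nx, ny = 5, 2, 1
rho = 0.5 ** jnp.arange(T)                              # any positive scale factors
def f_conv(theta, x):                                   # theta: (T, ny, nx), x: (T, nx)
    y = jnp.stack([sum(jnp.sqrt(rho[j]) * theta[j] @ x[t - j] for j in range(t + 1))
                   for t in range(T)])
    return y.reshape(-1)                                # shape (T * ny,)

k0, k1, k2 = jax.random.split(jax.random.PRNGKey(1), 3)
theta0 = jax.random.normal(k0, (T, ny, nx))
x1 = jax.random.normal(k1, (T, nx))
x2 = jax.random.normal(k2, (T, nx))
K = empirical_ntk(f_conv, theta0, x1, x2)
print(K.shape)                                          # (T * ny, T * ny) = (5, 5) here
```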
The key result in NTK theory is that, for certain wide neural networks with random initializations, (full-batch) gradient descent training of the non-linear and linear models is asymptotically identical. For example, the results in (Lee et al., 2019) and (Alemohammad et al., 2020) provide the following proposition.

**Proposition 3.1.** Suppose that f_n(x, θ) is a sequence of recurrent neural networks with n hidden states and non-linear activation function σ(·). Let {(x_i, y_i)}, i = 1, ..., N, be some fixed training data contained in a compact set. Let θ̂_n^0 denote a random initial condition generated as in (4), and let θ̂_n^ℓ denote the parameter estimate after ℓ steps of (full-batch) gradient descent with some learning rate η. Let K_n(x, x') denote the NTK of the RNN and f_n^lin(x, α) denote the corresponding linear NTK model (12). Let α̂_n^ℓ denote the parameter estimate obtained with gradient descent with the same learning rate. We further assume that the non-linear activation σ satisfies

$$|\sigma(0)|, \quad \|\sigma'\|_\infty, \quad \sup_{x \neq x'} \frac{|\sigma'(x) - \sigma'(x')|}{|x - x'|} \;<\; \infty.$$

Then, for all x and x',

$$\lim_{n \to \infty} K_n(x, x') = K(x, x') \quad \text{a.s.} \tag{15}$$

for some deterministic positive semi-definite matrix K(x, x'). Moreover, if λ_min(K) > 0, then for a sufficiently small learning rate η and any new sample x,

$$\lim_{n \to \infty}\; \sup_{\ell \geq 0}\; \left\| f_n(x, \hat\theta_n^{\ell}) - f_n^{\mathrm{lin}}(x, \hat\alpha_n^{\ell}) \right\| = 0, \tag{16}$$

where the convergence is in probability.

The consequence of this result is that the behavior of certain infinitely wide neural networks on new samples x is identical to the behavior of the linearized network around its initialization. This essentially means that as n → ∞, the learning dynamics of the original and the linearized networks match during training.

### 3.2. NTK for the Convolutional Model

Having defined the NTK, we first compute the NTK of the scaled convolutional model (8).

**Theorem 3.2.** Fix a time period T and consider the convolutional model (8) for a given set of scale factors ρ = (ρ_0, ..., ρ_{T-1}). Then, for any initial condition and any two input sequences x and x', the NTK for this model is

$$K(x, x') = \big( \mathcal{T}(x)^{\mathsf{T}} D(\rho)\, \mathcal{T}(x') \big) \otimes I_{n_y}, \tag{17}$$

where ⊗ denotes the tensor product, T(x) is the Toeplitz matrix

$$\mathcal{T}(x) = \begin{bmatrix} x_0 & x_1 & \cdots & x_{T-1} \\ 0 & x_0 & \cdots & x_{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_0 \end{bmatrix} \in \mathbb{R}^{T n_x \times T}, \tag{18}$$

and D(ρ) is the diagonal matrix

$$D(\rho) := \mathrm{diag}\big( \rho_0 I_{n_x}, \dots, \rho_{T-1} I_{n_x} \big). \tag{19}$$

*Proof Sketch.* Given that $y_t = \sum_{j=0}^{t} \sqrt{\rho_j}\,\theta_j x_{t-j}$, we consider a perturbation in the parameters θ. We show that the NTK of this model can be written as an expectation over a certain Gaussian random variable. We further derive the full kernel as in (17). Details of the proof are provided in Appendix B.2.

### 3.3. NTK for the RNN Parametrization

We now compare the NTK of the scaled convolutional model to the NTK for the RNN parameterization.

**Theorem 3.3.** Fix a time period T and consider the RNN model (1) mapping an input sequence x = (x_0, ..., x_{T-1}) to an output sequence y = (y_0, ..., y_{T-1}) with the parameters (W, F, C). Assume the parameters are initialized as in (4) for some constants ν_W, ν_F, ν_C > 0. In the limit as the number of hidden states n → ∞:

(a) The impulse response coefficients L_j in (9) converge in distribution to independent Gaussians, where the components of L_j are i.i.d. N(0, ν_C ν_F ν_W^j).

(b) Given any input sequences x and x', the NTK converges almost surely to the deterministic limit

$$K(x, x') = \big( \mathcal{T}(x)^{\mathsf{T}} D(\rho)\, \mathcal{T}(x') \big) \otimes I_{n_y}, \tag{20}$$

where T(x) and D(ρ) are given in (18) and (19), and

$$\rho_j = \nu_C \big( j\, \nu_F\, \nu_W^{\,j-1} + \nu_W^{\,j} \big) + \nu_F\, \nu_W^{\,j}. \tag{21}$$

*Proof Sketch.* Part (a) of the theorem is a special case of a more general lemma, Lemma C.1, which we present in the Appendix. Given (1) and (9), let q_t be the impulse response coefficients from x_t to h_t. We show that (q_0, ..., q_t) converges to a Gaussian vector (Q_0, ..., Q_t) where the Q_i are independent with zero mean. We then calculate the variances using the equations of Lemma C.1. In part (b), we consider perturbations of the RNN parameters W, F, and C. We observe that the mapping from x_t to the pair of the hidden state h_t and its perturbation is a linear time-invariant system, and its impulse response coefficients can be analyzed in the large-system limit using Lemma C.1. We can then calculate the NTK as in (20). Details of the proof are provided in Appendix B.3.
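The limiting kernel of Theorem 3.3 is cheap to evaluate. The following sketch computes the scaling factors (21), checks the geometric decay bound stated in (22) below, and assembles the kernel (20) from the Toeplitz matrix (18) and D(ρ) in (19); the dimensions and variance values are our own illustrative choices.

```python
import jax
import jax.numpy as jnp

def rnn_ntk_scalings(T, nu_W, nu_F, nu_C):
    """Scaling factors of eq. (21)."""
    j = jnp.arange(T)
    return nu_C * (j * nu_F * nu_W ** (j - 1) + nu_W ** j) + nu_F * nu_W ** j

def toeplitz(x):
    """T(x) of eq. (18): column t stacks x_t, x_{t-1}, ..., x_0, 0, ..., 0."""
    T, nx = x.shape
    cols = [jnp.concatenate([jnp.flip(x[: t + 1], axis=0).reshape(-1),
                             jnp.zeros((T - t - 1) * nx)]) for t in range(T)]
    return jnp.stack(cols, axis=1)                      # shape (T * nx, T)

def limit_ntk(x1, x2, rho, ny):
    """Limiting NTK of eqs. (17)/(20): (T(x)^T D(rho) T(x')) kron I_{ny}."""
    nx = x1.shape[1]
    D = jnp.diag(jnp.repeat(rho, nx))                   # D(rho) of eq. (19)
    return jnp.kron(toeplitz(x1).T @ D @ toeplitz(x2), jnp.eye(ny))

T, nx, ny = 8, 3, 2
nu_W, nu_F, nu_C = 0.3, 1.0, 1.0
rho = rnn_ntk_scalings(T, nu_W, nu_F, nu_C)

# Geometric decay (cf. the bound (22) below): rho_j <= rho_max * nu_W^(j-1) for j >= 1.
rho_max = nu_C * (T * nu_F + 1) + nu_F
print(bool(jnp.all(rho[1:] <= rho_max * nu_W ** jnp.arange(T - 1))))   # True

key = jax.random.PRNGKey(0)
x1, x2 = jax.random.normal(key, (2, T, nx))
print(limit_ntk(x1, x2, rho, ny).shape)                 # (T * ny, T * ny) = (16, 16)
```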
Comparing Theorems 3.2 and 3.3, we see that the NTK of a linear RNN is identical to that of a scaled convolutional model when the scalings are chosen as in (21). From Proposition 3.1, we see that, in the wide-limit regime where n → ∞, gradient descent training of the linear RNN with the non-linear parametrization θ_RNN = (W, F, C) in (3) is identical to the training of the convolutional model (5) where the linear parameters L_j are initialized as i.i.d. Gaussians and then trained with the scaling factors (21).

Moreover, the scaling factors have a geometric decay. Recall from Section 2 that, for stability of the linear RNN, we require ν_W < 1. When ν_W < 1 and j ≥ 1, the scaling factors (21) can be bounded as

$$\rho_j \;\leq\; \rho_{\max}\, \nu_W^{\,j-1}, \qquad \rho_{\max} := \nu_C (T \nu_F + 1) + \nu_F. \tag{22}$$

Consequently, the scale factors decay geometrically as ν_W^{j-1}. This implies that, in the training of the scaled convolutional model, the coefficients with higher delay j will be given lower weight.

## 4. Implicit Bias of Linear RNNs

An important consequence of the geometric decay of the scaling factors ρ_j in (22) is the implicit bias of gradient descent training of linear RNNs towards networks with short-term memory. To state this precisely, fix input and output training data (x_i, y_i), i = 1, ..., N. For each n, consider the RNN model (2) with n hidden states and parameters θ_RNN in (3). Assume the parameters are initialized as θ_RNN^0 = (W^0, F^0, C^0) in (4) for some ν_W, ν_F, ν_C > 0. Let θ_RNN^ℓ = (W^ℓ, F^ℓ, C^ℓ) denote the parameters after ℓ steps of (full-batch) gradient descent with some learning rate η. Let L_RNN^ℓ be the resulting impulse response coefficients (9),

$$L^{\ell}_{\mathrm{RNN},j} = n^{-(j+1)/2}\, C^{\ell} (W^{\ell})^{j} F^{\ell}, \qquad j = 0, \dots, T-1. \tag{23}$$

We then have the following bound.

**Theorem 4.1.** Under the above assumptions, the norm of the impulse response coefficients of the RNN at the initial iteration ℓ = 0 satisfies

$$\lim_{n \to \infty} \mathbb{E}\, \big\| L^{0}_{\mathrm{RNN},j} \big\|_F^2 = n_x n_y\, \nu_C \nu_F \nu_W^{\,j}. \tag{24}$$

Also, there exist constants B_1 and B_2 such that if the learning rate satisfies η < B_1, then for all iterations ℓ,

$$\limsup_{n \to \infty}\; \big\| L^{\ell}_{\mathrm{RNN},j} - L^{0}_{\mathrm{RNN},j} \big\|_F \;\leq\; B_2\, \rho_j\, \eta\, \ell, \tag{25}$$

where the convergence is in probability. Moreover, the constants B_1 and B_2 can be selected independent of ν_W.

*Proof Sketch.* We first bound the initial impulse response using the result of Theorem 3.3. Next, we use Theorems 3.2 and 3.3 to construct a scaled convolutional model that has the same NTK and initial conditions as the RNN. Finally, we analyze the convolutional model to obtain the bound in (25). For details of the proof see Appendix B.4.

To understand the significance of the theorem, observe that in the convolutional model (5), L_j relates the input time samples x_{t-j} to the output y_t. The coefficient L_j thus describes the influence of the input samples on the output samples j time steps later. Combining (24), (25), and (22), we see that these coefficients decay as

$$\big\| L^{\ell}_{\mathrm{RNN},j} \big\|_F = O\big( \nu_W^{\,j/2} + \ell\, \nu_W^{\,j} \big).$$

Also, as discussed in Section 2, we need ν_W < 1 for stability. Hence, the magnitude of these coefficients decays geometrically in j. Therefore, for a fixed number of training steps, the effect of the input on the output at a lag of j is exponentially small in j. In this sense, training linear RNNs with the non-linear parameterization θ_RNN = (W, F, C) is implicitly biased towards short memory.
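As a quick sanity check on (24) (ours, not one of the paper's experiments), one can sample a few initializations at a moderately large width and compare the average squared Frobenius norm of the initial impulse response coefficients with n_x n_y ν_C ν_F ν_W^j:

```python
import jax
import jax.numpy as jnp

n, nx, ny, T, trials = 400, 3, 2, 8, 20
nu_W, nu_F, nu_C = 0.3, 1.0, 1.0

def initial_impulse_response(key):
    """L^0_j = n^{-(j+1)/2} C W^j F for one random initialization (4)."""
    kW, kF, kC = jax.random.split(key, 3)
    W = jnp.sqrt(nu_W) * jax.random.normal(kW, (n, n))
    F = jnp.sqrt(nu_F) * jax.random.normal(kF, (n, nx))
    C = jnp.sqrt(nu_C) * jax.random.normal(kC, (ny, n))
    L, Wj = [], jnp.eye(n)
    for j in range(T):
        L.append(C @ Wj @ F / n ** ((j + 1) / 2))
        Wj = Wj @ W
    return jnp.stack(L)                                       # shape (T, ny, nx)

keys = jax.random.split(jax.random.PRNGKey(0), trials)
sq_norms = jnp.stack([jnp.sum(initial_impulse_response(k) ** 2, axis=(1, 2)) for k in keys])
empirical = sq_norms.mean(axis=0)                             # average ||L^0_j||_F^2
theory = nx * ny * nu_C * nu_F * nu_W ** jnp.arange(T)        # right-hand side of (24)
print(empirical)
print(theory)   # the two arrays should agree up to Monte-Carlo and finite-n error
```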
It is useful to compare the performance of an unscaled convolutional model with the linear RNN. The convolutional model can fit any linear time-invariant system with an arbitrary delay. We have seen in Proposition 2.1 that, in principle, the linear RNN can also fit any such system with a sufficient number of hidden states. However, the above theorem shows that unless the number of gradient steps grows exponentially with the desired delay, the parameterization of the linear RNN will strongly bias the solutions towards systems with short memory. This restriction will create bias error on systems that have long-term memory. On the other hand, due to the implicit constraint of the linear RNN, the parameterization will reduce the variance error.

## 5. Numerical Experiments

We validate our theoretical results on a number of synthetic and real data experiments.

### 5.1. Synthetic data

In Section 3, we showed that the NTK of a linear RNN given in (1) with parameters θ_RNN in (3) is equivalent to the NTK of its convolutional parameterization with parameters θ_conv. In order to validate this, we compared the training dynamics of a linear RNN (1) with a large number of hidden states n and a scaled convolutional model (8) with scale coefficients ρ_j defined in (21).

Figure 1. Dynamics of an RNN and its equivalent scaled Conv-1D in learning a synthetic task. The data is generated from a synthetic RNN with n_x = n_y = 1 and n_h = 4. Noise is added to the output with SNR = 20 dB. The sequence length is T = 10. Training and test samples are N_tr = N_ts = 50. Full-batch gradient descent is used with lr = 10^{-4}. The dynamics of these models perfectly match.

We generated data from a synthetic (teacher) RNN with random parameters. For the data generation system we used a linear RNN with 4 hidden units and n_x = n_y = 1. The matrices W, F, and C are generated as i.i.d. random Gaussians with ν_W = 0.3 and ν_F = ν_C = 1. We added noise to the output of this system such that the signal-to-noise ratio (SNR) is 20 dB. We generated 50 training sequences and 50 test sequences in total, and each sequence has T = 10 time steps. Given the training and test data, we train (i) a (student) linear RNN with n = 1000 hidden units, ν_W = 0.3, and ν_F = ν_C = 1; and (ii) a 1D convolutional model with scale coefficients ρ_j calculated from (21) using ν_W = 0.3, ν_F = ν_C = 1. We used mean-squared error as the loss function for both models and applied full-batch gradient descent with learning rate lr = 10^{-4}.

Fig. 1 shows the identical training dynamics of both models. Fig. 2 shows a zoomed-in version of the dynamics of the training error. We see, as the theory predicts, that the RNN and the scaled 1D convolutional model have identical performance.

Figure 2. The first 10 epochs in Fig. 1. Note that to be theoretically accurate, we need lr → 0 to be in the kernel regime.

Figure 3. Test performance with respect to delay. For this task we have n_h = 1000, N_tr = N_ts = 10, n_x = 15, n_y = 1, and T = 20. The delay is added manually by shifting, i.e., y_t = x_{t-delay}, and the output SNR = 20 dB.

To evaluate the performance of these models on a task with long-term dependencies, we created a dataset where we manually added different delay steps τ to the output of a true linear RNN system, i.e., y_t = x_{t-τ}. We have chosen longer (T = 20) sequences for this task. We then learned this data using the aforementioned linear RNN and scaled 1D convolutional models. We also trained an unscaled 1D convolutional model on this data to compare performances.
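A self-contained toy version of this delay comparison, restricted to the scaled and unscaled convolutional models (the target construction, sizes, learning rate, and step count below are our own choices and differ from the paper's setup), can be sketched as follows.

```python
import jax
import jax.numpy as jnp

T, nx, ny, N, steps, lr = 20, 15, 1, 50, 500, 5e-2
nu_W, nu_F, nu_C = 0.3, 1.0, 1.0
jj = jnp.arange(T)
rho = nu_C * (jj * nu_F * nu_W ** (jj - 1) + nu_W ** jj) + nu_F * nu_W ** jj   # eq. (21)

def conv_forward(theta, scale, x):
    # y_t = sum_{k<=t} scale_k * theta_k @ x_{t-k}: scale = sqrt(rho) gives the scaled
    # model (8), scale = 1 gives the unscaled model (5).
    return jnp.stack([sum(scale[k] * theta[k] @ x[t - k] for k in range(t + 1))
                      for t in range(T)])

def delayed_targets(X, w, delay):
    # Our simplified long-memory task: y_t = w x_{t-delay}, zero for t < delay.
    Y = jnp.einsum('ntx,yx->nty', X, w)
    return jnp.roll(Y, delay, axis=1).at[:, :delay].set(0.0)

def fit_and_test(scale, Xtr, Ytr, Xts, Yts, key):
    theta = 0.01 * jax.random.normal(key, (T, ny, nx))
    def mse(th, X, Y):
        pred = jax.vmap(lambda x: conv_forward(th, scale, x))(X)
        return jnp.mean((pred - Y) ** 2)
    grad = jax.jit(jax.grad(mse))
    for _ in range(steps):                                  # full-batch gradient descent
        theta = theta - lr * grad(theta, Xtr, Ytr)
    return mse(theta, Xts, Yts)

kw, ktr, kts, k1, k2 = jax.random.split(jax.random.PRNGKey(0), 5)
w = jax.random.normal(kw, (ny, nx))                         # fixed readout of the delayed input
Xtr = jax.random.normal(ktr, (N, T, nx))
Xts = jax.random.normal(kts, (N, T, nx))
for delay in (1, 5, 10, 15):
    Ytr, Yts = delayed_targets(Xtr, w, delay), delayed_targets(Xts, w, delay)
    err_scaled = fit_and_test(jnp.sqrt(rho), Xtr, Ytr, Xts, Yts, k1)
    err_unscaled = fit_and_test(jnp.ones(T), Xtr, Ytr, Xts, Yts, k2)
    print(delay, float(err_scaled), float(err_unscaled))
# With a fixed step budget, the scaled model's test error grows with the delay
# (rho_delay is geometrically small, so the corresponding coefficient barely moves),
# while the unscaled model keeps fitting the task at every delay.
```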
With the unscaled convolutional model, we learn the impulse response coefficients L_j defined in (5) directly during training. Fig. 3 shows the test error with respect to the delay steps for all three models. Observe that the performance of the scaled convolutional and the linear RNN models match during training. Due to the bias of these models against the delay, the test error increases as we increase the delay steps in our system. On the other hand, the performance of the unscaled convolutional model stays almost the same with increasing delay, changing slightly at larger delays as there is less data to track. We thus see the effect of implicit bias: when the true system does not have high delay, the implicit bias of the RNN and the scaled convolutional model against delay helps. As the true delay increases, the implicit bias causes bias error not present in the unscaled convolutional model.

Our theoretical results hold in the asymptotic regime where the number of hidden units of the RNN goes to infinity. To test the applicability of our results to real-world RNNs that have a finite number of hidden units, we run an experiment where we change the number of hidden units and test how closely the RNN training follows the kernel training, i.e., the equivalent scaled convolutional model. In this experiment, the true RNN that generates the synthetic data has 20 hidden units, and we train different RNNs with 10, 40, and 200 hidden units to learn the synthetic data. The results are shown in Figure 4. For each n_h, the RNN (solid lines) along with the equivalent scaled convolutional model (dashed lines) are trained. Our theory claims that when the learning rate of gradient descent goes to zero and the number of hidden units goes to infinity, the training errors of these two models should be exactly the same over the course of the optimization, i.e., the solid curves and dashed curves should match. We see that this is indeed the case. As we increase n_h, the two curves become closer, and for n_h = 200 they almost perfectly match. This suggests that our results should be applicable in practice to RNNs of moderate size with more than 200 hidden units.

Figure 4. Training error over the course of training for RNNs with different numbers of hidden units. Solid curves show the error of the RNN, whereas the dashed curves show the error of the equivalent scaled convolutional model at each epoch. Our theoretical results show that in the asymptotic regime of n_h → ∞ with infinitesimal learning rate, the solid curves and dashed curves should exactly match. Here, we see that as we increase n_h the two curves get closer, and even for n_h = 200 our theoretical prediction almost perfectly matches what we observe in practice.

### 5.2. Real data

We also validated our theory using spike rate data from the macaque primary somatosensory cortex (S1) (Benjamin et al., 2018). The somatosensory cortex is a part of the brain responsible for receiving sensations of touch, pain, etc. from the entire body. The data is recorded during a two-dimensional reaching task. In this task, a macaque was rewarded for positioning a cursor on a series of randomly generated targets on the screen using a handle. The data is from a single recording of 51 minutes and includes 52 neurons. The mean and median firing rates are 9.3 and 6.3 spikes/sec. Similar to the previous experiments, we also trained an unscaled 1D convolutional model on this data and compared its performance with the linear RNN and the scaled convolutional models.

We compared the performances in two sets of experiments. We first used only 4.5 minutes of the total recorded data. The purpose of this experiment is to compare the performances in limited-data circumstances. With this limited data, we expect the scaled convolutional model (and thus the RNN) to perform better than the unscaled model, due to the implicit bias of these models towards short-term memory and the fact that the effective number of parameters is smaller in the scaled model, which leads to a smaller variance.
In the second experiment, we trained our models using all of the available data (≈ 51 mins). In this case, the scaled model (and the RNN) performs worse because of the increased bias error.

In our experimental setup, the linear RNN has n = 1000 hidden states and the sequence length is T = 15. Also, ν_W = 0.3 and ν_F = ν_C = 1. Fig. 5 shows the R² scores for the x and y directions of all three models for this task. Observe that the dynamics of the linear RNN and the scaled convolutional model are identical during training, using either the entire recording or a part of it. For the case of limited data, as discussed earlier, we observe the implicit bias of the RNN and the scaled convolutional model in panels (a). This bias leads to better performance of these two models compared to the unscaled model. Using all of the available data, the unscaled convolutional model performs better because of the increased bias error in the other two models (panels (b)). Table 1 shows the test R²-scores of the final trained models for all three cases.

Figure 5. R² score for the two-dimensional reaching task described in Section 5.2. The data is recorded from the primary somatosensory cortex of macaques. (a) Limited data: the models are trained on 4.5 minutes of recorded data. (b) Entire data: the whole recording (≈ 51 mins) is used to compare the performances. For both cases we used mini-batch (batch size = 128) gradient descent with lr = 10^{-4}.

Table 1. R²-score on test data for the x and y directions in the two-dimensional reaching task described in Section 5.2.

|                     | RNN    | Scaled Conv-1D | Conv-1D |
|---------------------|--------|----------------|---------|
| R²_x                | 0.6462 | 0.6442         | 0.6565  |
| R²_y                | 0.5911 | 0.5860         | 0.6027  |
| R²_x (limited data) | 0.6043 | 0.6046         | 0.5856  |
| R²_y (limited data) | 0.4257 | 0.4234         | 0.3918  |

## 6. Conclusion

In this work, we focus on the special class of linear RNNs and observe a functional equivalence between linear RNNs and 1D convolutional models. Using the kernel regime framework, we show that the training of a linear RNN is identical to the training of a certain scaled convolutional model. We further provide an analysis of an inductive bias in linear RNNs towards short-term memory. We show that this bias is driven by the variances of the RNN parameters at random initialization. Our theory is validated by both synthetic and real data experiments.

## Acknowledgements

The work of M. Emami, M. Sahraee-Ardakan, P. Pandit, and A. K. Fletcher was supported in part by the National Science Foundation under Grants 1254204 and 1738286, and the Office of Naval Research under Grant N00014-15-1-2677. S. Rangan was supported in part by the National Science Foundation under Grants 1116589, 1302336, and 1547332, NIST, the industrial affiliates of NYU WIRELESS, and the SRC.

## References

Alemohammad, S., Wang, Z., Balestriero, R., and Baraniuk, R. The recurrent neural tangent kernel. arXiv preprint arXiv:2006.10246, 2020.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120-1128, 2016.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8139-8148, 2019.

Bai, Z. D. and Yin, Y. Q. Limiting behavior of the norm of products of random matrices and two problems of Geman-Hwang. Probability Theory and Related Fields, 73(4):555-569, 1986.

Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proc. National Academy of Sciences, 116(12):5451-5460, March 2019. ISSN 1091-6490. doi: 10.1073/pnas.1802705116. URL http://dx.doi.org/10.1073/pnas.1802705116.

Bayati, M. and Montanari, A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764-785, 2011.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. National Academy of Sciences, 116(32):15849-15854, 2019.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

Benjamin, A. S., Fernandes, H. L., Tomlinson, T., Ramkumar, P., VerSteeg, C., Chowdhury, R. H., Miller, L. E., and Kording, K. P. Modern machine learning as a benchmark for fitting neural responses. Frontiers in Computational Neuroscience, 12:56, 2018.

Chen, M., Pennington, J., and Schoenholz, S. S. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394, 2018.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014. doi: 10.3115/v1/W14-4012. URL http://dx.doi.org/10.3115/v1/W14-4012.

Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422-2430, 2017.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pp. 2253-2261, 2016.

Dou, X. and Liang, T. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. Journal of the American Statistical Association, pp. 1-14, 2020.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018a.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.

E, W., Ma, C., and Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Science China Mathematics, Jan 2020. ISSN 1869-1862. doi: 10.1007/s11425-019-1628-5. URL http://dx.doi.org/10.1007/s11425-019-1628-5.

Emami, M., Ardakan, M. S., Rangan, S., and Fletcher, A. K. Input-output equivalence of unitary and contractive RNNs. In Advances in Neural Information Processing Systems, pp. 15342-15352, 2019.
Emami, M., Sahraee-Ardakan, M., Pandit, P., Rangan, S., and Fletcher, A. K. Generalization error of generalized linear models in high dimensions. arXiv preprint arXiv:2005.00180, 2020.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bklfsi0cKm.

Gerbelot, C., Abbara, A., and Krzakala, F. Asymptotic errors for convex penalized linear regression beyond Gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.

Givens, C. R., Shortt, R. M., et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231-240, 1984.

Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571-8580, 2018.

Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., and Soljačić, M. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1733-1741. JMLR.org, 2017.

Kailath, T. Linear Systems, volume 156. Prentice-Hall, Englewood Cliffs, NJ, 1980.

Katayama, T. Subspace Methods for System Identification. Springer Science & Business Media, 2006.

Lee, J., Sohl-Dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.

Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pp. 8570-8581, 2019.

Lennart, L. System Identification: Theory for the User. PTR Prentice Hall, Upper Saddle River, NJ, pp. 1-14, 1999.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157-8166, 2018.

Ljung, L. System identification. Wiley Encyclopedia of Electrical and Electronics Engineering, pp. 1-19, 1999.

Ljung, L. and Glad, T. Modeling of Dynamic Systems. Prentice-Hall Information and System Sciences Series. PTR Prentice Hall, 1994. ISBN 9780135970973. URL https://books.google.com/books?id=zO9qQgAACAAJ.

Ma, C., Wu, L., et al. A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics. arXiv preprint arXiv:1904.04326, 2019.

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671, 2018.
Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.

Neal, R. M. Bayesian Learning for Neural Networks. Springer New York, 1996. doi: 10.1007/978-1-4612-0745-0. URL https://doi.org/10.1007%2F978-1-4612-0745-0.

Novak, R., Xiao, L., Bahri, Y., Lee, J., Yang, G., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.

Pintelon, R. and Schoukens, J. System Identification: A Frequency Domain Approach. John Wiley & Sons, 2012.

Rangan, S., Schniter, P., and Fletcher, A. K. Vector approximate message passing. IEEE Transactions on Information Theory, 65(10):6664-6684, 2019.

Rotskoff, G. M. and Vanden-Eijnden, E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.

Sirignano, J. and Spiliopoulos, K. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820-1852, 2020.

Sjöberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P.-Y., Hjalmarsson, H., and Juditsky, A. Nonlinear black-box modeling in system identification: a unified overview. Linköping University, 1995.

Söderström, T. and Stoica, P. System Identification. Prentice Hall International, 1989.

Viberg, M. Subspace-based methods for the identification of linear time-invariant systems. Automatica, 31(12):1835-1851, 1995.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wei, C., Lee, J. D., Liu, Q., and Ma, T. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pp. 9709-9721, 2019.

Wisdom, S., Powers, T., Hershey, J., Le Roux, J., and Atlas, L. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880-4888, 2016.

Yang, G. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019a.

Yang, G. Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. In Advances in Neural Information Processing Systems, pp. 9951-9960, 2019b.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.