Learning and Generalization in RNNs

Abhishek Panigrahi
Department of Computer Science, Princeton University
ap34@princeton.edu
(Work done as a research fellow at Microsoft Research India.)

Navin Goyal
Microsoft Research India
navingo@microsoft.com

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Simple recurrent neural networks (RNNs) and their more advanced cousins, LSTMs etc., have been very successful in sequence modeling. Their theoretical understanding, however, is lacking and has not kept pace with the progress for feedforward networks, where a reasonably complete understanding has emerged in the special case of highly overparameterized one-hidden-layer networks. In this paper, we make progress towards remedying this situation by proving that RNNs can learn functions of sequences. In contrast to previous work, which could only deal with functions of sequences that are sums of functions of individual tokens in the sequence, we allow general functions. Conceptually and technically, we introduce new ideas which enable us to extract information from the hidden state of the RNN in our proofs, addressing a crucial weakness in previous work. We illustrate our results on some regular language recognition problems.

1 Introduction

Simple recurrent neural networks [1], also known as Elman RNNs or vanilla RNNs (just RNNs henceforth), along with their more advanced versions such as LSTMs [2] and GRUs [3], are among the most successful models for processing sequential data, finding wide-ranging applications including natural language processing, audio processing [4] and time series classification [5]. Feedforward networks (FFNs) model functions on inputs of fixed length, such as vectors in R^d. In contrast, RNNs model functions whose input consists of sequences of tokens x^(1), x^(2), ..., where x^(i) ∈ R^d for each i. RNNs have a notion of memory; formally, it is given by the hidden state vector, denoted h^(t) after processing the t-th token. RNNs apply a fixed function to h^(t) and x^(t+1) to compute h^(t+1) and the output. This fixed function is modeled by a neural network with one hidden layer. Compared to FFNs, new challenges arise in the analysis of RNNs: for example, the use of memory and of the same function at each step introduces dependencies across time, and RNN training suffers from vanishing and exploding gradients [6].

Studies aimed at understanding the effectiveness of RNNs have been conducted since their introduction; for some of the early work see, e.g., [7, 8]. These works take the form of experimental probing of the inner workings of these models as well as theoretical studies. The theoretical studies often focus on expressibility, training and generalization questions in isolation rather than all together; the latter needs to be addressed to approach a full understanding of RNNs and appears to be far more challenging. While experimental probing has continued apace, e.g., [9, 10], progress on the theoretical front has been slow. It is only recently that training and generalization are starting to be addressed, in the wake of progress on the relatively easier case of FFNs, as discussed next.

RNNs are closely related to deterministic finite automata [11, 9] as well as to dynamical systems. With finite precision and ReLU activation, they are equivalent in computational power to finite automata [11]. In the last few years, progress was made on the theoretical analysis of overparameterized FFNs with one hidden layer, e.g., [12, 13, 14, 15, 16, 17, 18].
Building upon these techniques, [19] proved that RNNs trained with SGD (stochastic gradient descent) achieve small training loss if the number of neurons is a sufficiently large polynomial in the number of training datapoints and the maximum sequence length. But the gap between our understanding of RNNs and FFNs remains large. [20, 21] provide generalization bounds for RNNs in terms of certain norms of the parameters. While interesting, these bounds shed light on only a part of the picture: they do not consider the training of the networks, nor do they preclude the possibility that the norms of the parameters of the trained networks are large, leading to poor generalization guarantees.

RNNs can be viewed as dynamical systems, and many works have used this viewpoint to study RNNs, e.g., [22, 23, 24, 25]. Other related work includes relations to kernel methods, e.g., [26, 27, 28], linear RNNs [29], saturated RNNs [30, 31, 32], and echo state networks [33, 34]. Several other works study the expressive power of Transformers, the novel sequence-to-sequence models [35, 36]. Due to the large number of works in this area, it is not possible to be exhaustive: apart from the references directly relevant to our work, we have only been able to include a small subset.

[37] gave the first end-to-end result for RNNs. Very informally, their result is: if the concept class consists of functions that are sums of functions of tokens, then overparameterized RNNs trained using SGD with sufficiently small learning rate can learn such a concept class. They introduce new technical ideas, most notably what they call re-randomization, which allows one to tackle the dependencies that arise because the same weights are used in the RNN across time. However, an important shortcoming of their result is the limited expressive power of their concept class: while this class can be surprisingly useful, as noted there, it cannot capture problems where the RNN needs to make use of the information in the past tokens when processing a token (in their terminology, their concept class can adapt to time but not to tokens). Indeed, a key step in their proof shows that RNNs can learn to ignore the hidden state h^(t). (The above concept class comes up because it can be learnt even if h^(t) is ignored.) But the hidden state h^(t) is the hallmark of RNNs and is, in general, the source of information about the past tokens, not something to be ignored. Thus, it is an important question to theoretically analyze the performance of RNNs on general concept classes, and it was also raised in [37]. This question is addressed in the present paper.

As in previous work, we work with sequences of bounded length L. Without loss of generality, we work with token sequences x^(1), ..., x^(L) of fixed length, as opposed to sequences of length up to L. Informally, our result is: Overparameterized RNNs can efficiently learn concept classes consisting of one-hidden-layer neural networks that take the entire sequence of tokens as input. The training algorithm used is SGD with sufficiently small step size. By the universality theorem for one-hidden-layer networks, such RNNs can approximate all continuous functions of x^(1), ..., x^(L), though naturally, the more complex the functions in the class, the larger the network size required. We note that the above result applies to all three aspects mentioned above: expressive power, training and generalization.
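To make the distinction between the two concept classes concrete, the following sketch contrasts a sum of per-token functions with a smooth function of the whole sequence; the encodings below are our own illustrations, not constructions from the paper, and PARITY (discussed next and in Section 5) serves as the example.

```python
import numpy as np

# Illustrative contrast between the concept class of [37] (sums of per-token
# functions) and the class studied here (smooth functions of the whole
# sequence). The encodings are hypothetical choices for this demo.

rng = np.random.default_rng(0)
L = 6
x = rng.choice([-1.0, 1.0], size=L)                  # a +/-1 token sequence

def sum_of_token_functions(x, phis):
    # Class of [37]: F(x) = sum_t phi_t(x_t); each summand sees one token only.
    return sum(phi(xt) for phi, xt in zip(phis, x))

def whole_sequence_function(x, w, Phi):
    # This paper's class: F(x) = Phi(<w, x>); Phi sees the entire sequence.
    return Phi(np.dot(w, x))

# PARITY of +/-1 tokens equals prod(x_i), a genuinely cross-token quantity:
# prod(x_i) = cos(pi * #{i : x_i = -1}), and the count of -1 tokens is linear
# in the (re-encoded) sequence, so one smooth function of one inner product
# suffices.
parity = whole_sequence_function((1.0 - x) / 2.0, np.ones(L),
                                 lambda u: np.cos(np.pi * u))
print(np.prod(x), parity)                            # agree on every sequence
```

A short case analysis over the four sequences of length 2 already shows that no choice of per-token functions φ_t reproduces the product of ±1 tokens for all sequences, which is the sense in which the earlier class cannot adapt to tokens.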
We illustrate the power of our result by showing that some regular languages, such as PARITY, can be recognized efficiently by RNNs.

In a concurrent work [38], the authors also study related problems. The results there are not directly comparable: on the one hand, they do not normalize the input, unlike our work; on the other hand, the concept classes treated in their work appear to be weak. The authors show that RNNs can learn concept classes that include functions of the form f(x^(ℓ_1), x^(ℓ_2), ..., x^(ℓ_N)), where ℓ_1, ..., ℓ_N ∈ [L]. However, the number of SGD iterations and training samples necessary depends exponentially on the minimum of the two parameters N and ℓ_0 = max(ℓ_1, ..., ℓ_N) − min(ℓ_1, ..., ℓ_N).

2 Preliminaries

Let S^{d−1} := {x ∈ R^d : ‖x‖_2 = 1} be the unit sphere in R^d. For a positive integer n, define [n] := {1, 2, ..., n}. Given a vector v, by v_i we denote its i-th component. Given two vectors a ∈ R^{d_1} and b ∈ R^{d_2}, [a, b] ∈ R^{d_1+d_2} denotes the concatenation of the two vectors. ⟨·, ·⟩ denotes the standard dot product. Given a matrix M, we denote its i-th row by m_i and the element in row i and column j by m_{ij}. Given two matrices A ∈ R^{a_1×a_2} and B ∈ R^{b_1×b_2} with a_1 = b_1, let [A, B]_r ∈ R^{a_1×(a_2+b_2)} denote the matrix whose rows are obtained by concatenating the respective rows of A and B. Similarly, [A, B]_c ∈ R^{(a_1+b_1)×a_2} (assuming a_2 = b_2) denotes the matrix whose columns are obtained by concatenating the columns of A and B. O(·) and Ω(·) hide absolute constants. Similarly, poly(·) denotes a polynomial in its arguments with degree and coefficients bounded by absolute constants; different instances of poly(·) may refer to different polynomials. Writing out explicit constants would lead to unwieldy formulas without any new insights.

Let σ : R → R, given by σ(x) := max{x, 0} = x · I[x ≥ 0], be the ReLU activation function. ReLU extends to vectors by coordinate-wise application: σ((x_1, ..., x_d)) := (σ(x_1), ..., σ(x_d)). Note that ReLU is a positively homogeneous function of degree 1, that is, σ(λx) = λσ(x) for all x and all λ ≥ 0.

To be learnable efficiently, the functions in the concept class need to be not too complex. We quantify this with the following two complexity measures, which are weighted norms of the Taylor expansion and intuitively can be thought of as quantifying the network size and sample complexity, resp., needed to learn φ up to error ε.

Definition 2.1 (Function complexity [15]). Suppose that the function φ : R → R has Taylor expansion φ(z) = Σ_{i=0}^∞ c_i z^i. For R, ε > 0, define

C_ε(φ, R) := Σ_{i=0}^∞ ((C·R)^i + (√(log(1/ε))/√i · C·R)^i) |c_i|,
C_s(φ, R) := C · Σ_{i=0}^∞ (i + 1)^{1.75} R^i |c_i|,

where C = 10^4.

As an example, if φ(z) = z^d for a positive integer d, then C_s(φ, R) = O(R^d) and C_ε(φ, R) = O(R^d log^{d/2}(1/ε)). For φ(z) = sin z, cos z, e^z, we have C_s(φ, R) = O(1) and C_ε(φ, R) = poly(1/ε). We have C_s(φ, R) ≤ C_ε(φ, R) ≤ C_s(φ, O(R)) · poly(1/ε) for all φ, and for φ(z) = sin z, e^z or constant-degree polynomials, they only differ by o(1/ε). See [15] for details. Note that φ itself is not a member of our concept class, but functions like it will be used to construct members of our concept class.
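As a quick numerical illustration of Definition 2.1, the sketch below evaluates a truncated version of C_s for two of the examples above. The truncation length and the choice C = 1 instead of C = 10^4 are simplifications for readability; only the scaling with R matters here.

```python
import numpy as np
from math import factorial

# Truncated version of the complexity measure C_s of Definition 2.1,
# with C = 1 and the infinite Taylor series cut off at `terms`.

def C_s(taylor_coeff, R, terms=50):
    # C_s(phi, R) ~ sum_i (i+1)^{1.75} R^i |c_i|
    return sum((i + 1) ** 1.75 * R ** i * abs(taylor_coeff(i))
               for i in range(terms))

# phi(z) = exp(z): c_i = 1/i!  ->  C_s stays O(1) for fixed R.
print(C_s(lambda i: 1.0 / factorial(i), R=1.0))

# phi(z) = z^5: c_5 = 1, all other c_i = 0  ->  C_s = O(R^5).
print(C_s(lambda i: 1.0 if i == 5 else 0.0, R=2.0))   # 6^1.75 * 2^5
```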
3 Problem Formulation

In our set-up, RNNs output a label after processing the whole input sequence. (While our set-up is similar to that of previous work [37], there are also important differences.) The data are generated from an unknown distribution D over pairs (x̄, y*) ∈ (S^{d−2})^{L−2} × Y, where x̄ = (x̄^(2), ..., x̄^(L−1)) and the label set Y ⊂ R^{d_out} for some positive integer d_out. We call x̄ the true sequence and y* the true label. Denote by Z the training dataset containing N i.i.d. samples from D.

We preprocess the true sequence to normalize it:

Definition 3.1 (Normalized input sequence). Let x̄ = (x̄^(2), ..., x̄^(L−1)) be a given true input sequence of length L − 2, s.t. x̄^(i) ∈ S^{d−2} and x̄^(i)_{d−1} = 1/2, for all i ∈ [2, L−1]. The normalized input sequence x := (x^(1), ..., x^(L)) is given by

x^(1) := (0_{d−1}, 1), x^(ℓ) := (ε_x x̄^(ℓ), 0) for ℓ ∈ [2, L−1], x^(L) := (0_{d−1}, 1),

where we set ε_x > 0 later in Theorem 3.1.

We use the normalized sequence in place of the true sequence as input to RNNs, as it helps in the proofs, e.g., with bounds on the changes in the activation patterns at each RNN cell when the input sequences change, and also with inversion of RNNs (defined later). Our method can be applied without normalization too, but in that case our error bound has exponential dependence on the length of the input sequence. The extra dimension in the normalized sequence serves as a bias, which we do not use explicitly, to simplify notation.

Definition 3.2 (Recurrent neural networks). We assume that the input sequences are of length L for some given L > 0 and are of the form x^(1), ..., x^(L) with x^(ℓ) ∈ R^d for all ℓ ∈ [L]. An RNN is specified by three matrices W_rnn ∈ R^{m×m}, A_rnn ∈ R^{m×d} and B_rnn ∈ R^{d_out×m}, where m is the dimension of the hidden state and d_out is the dimension of the output. The hidden states of the RNN are given by h^(0)_rnn = 0 ∈ R^m and

h^(ℓ)_rnn := σ(A_rnn x^(ℓ) + W_rnn h^(ℓ−1)_rnn) for ℓ ∈ [L]. (1)

The output at each step ℓ ∈ [L] is given by y^(ℓ)_rnn = B_rnn h^(ℓ)_rnn. By RNN cell we mean the underlying FFN in (1). The m rows of W_rnn and A_rnn correspond to the m neurons of the RNN. Pick the matrices W ∈ R^{m×m} and A ∈ R^{m×d} by sampling entries i.i.d. from N(0, 2/m), and pick B by sampling entries i.i.d. from N(0, 2/d_out). When W_rnn = W and A_rnn = A, the RNN is said to be at random initialization. We denote the parameters of an RNN at initialization by dropping the subscript rnn; thus, the hidden states at initialization are {h^(ℓ)}_{ℓ∈[L]}. In the following theorems, we keep B_rnn at its initialization B and train only A_rnn and W_rnn. We write F^(ℓ)_rnn(x; W_rnn, A_rnn) = y^(ℓ)_rnn = B h^(ℓ)_rnn for the output of the ℓ-th step.

Our goal is to use y^(L)_rnn ∈ R^{d_out} to fit the true label y* ∈ Y using some loss function G : R^{d_out} × Y → R. We assume that for every y* ∈ Y, G(0_{d_out}, y*) ∈ [−1, 1] is bounded, and G(·, y*) is convex and 1-Lipschitz continuous in its first variable. This includes, for instance, the cross-entropy loss and the ℓ2-regression loss (for bounded arguments).
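The following sketch puts Definitions 3.1 and 3.2 together: it normalizes a true sequence and runs the RNN cell recurrence at random initialization. The dimensions and ε_x below are illustrative choices, not the values required by Theorem 3.1.

```python
import numpy as np

# A minimal sketch of Definitions 3.1 and 3.2. All sizes and eps_x are
# illustrative choices for this demo.

rng = np.random.default_rng(0)
d, m, d_out, L, eps_x = 8, 512, 2, 10, 0.1

def random_true_token():
    # True tokens lie on S^{d-2} with last coordinate fixed to 1/2 (Def. 3.1).
    v = rng.standard_normal(d - 2)
    v *= np.sqrt(1.0 - 0.25) / np.linalg.norm(v)
    return np.append(v, 0.5)

def normalize(xbar):
    # Def. 3.1: map the true sequence in R^{d-1} to x^(1), ..., x^(L) in R^d.
    first = np.append(np.zeros(d - 1), 1.0)               # x^(1) = (0_{d-1}, 1)
    last = np.append(np.zeros(d - 1), 1.0)                # x^(L) = (0_{d-1}, 1)
    middle = [np.append(eps_x * t, 0.0) for t in xbar]    # x^(l) = (eps_x xbar^(l), 0)
    return [first] + middle + [last]

def rnn_forward(x, W, A, B):
    # Def. 3.2: h^(l) = ReLU(A x^(l) + W h^(l-1)); read out B h^(L) at the end.
    h = np.zeros(m)
    for xt in x:
        h = np.maximum(A @ xt + W @ h, 0.0)
    return B @ h

# Random initialization: W, A ~ N(0, 2/m) and B ~ N(0, 2/d_out), entrywise.
W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
B = rng.normal(0.0, np.sqrt(2.0 / d_out), (d_out, m))

xbar = [random_true_token() for _ in range(L - 2)]
print(rnn_forward(normalize(xbar), W, A, B))
```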
3.2 Concept Class

We now define our target concept class, which we will show to be learnable by RNNs using SGD.

Definition 3.3 (Concept class). Our concept class consists of functions F : R^{(L−2)×(d−1)} → R^{d_out} defined as follows. Let Φ denote a set of smooth functions with Taylor expansions of finite complexity as in Def. 2.1. To define a function F, we choose a subset {Φ_{r,s} : R → R}_{r∈[p], s∈[d_out]} from Φ, a set of weight vectors {w*_{r,s} ∈ S^{(L−2)(d−1)−1}}_{r∈[p], s∈[d_out]}, and a set of output coefficients {b*_{r,s} ∈ R}_{r∈[p], s∈[d_out]} with |b*_{r,s}| ≤ 1. Then we define F : R^{(L−2)×(d−1)} → R^{d_out}, where for each output dimension s ∈ [d_out] the s-th coordinate F_s of F = (F_1, ..., F_{d_out}) is given by

F_s(x̄) = Σ_{r∈[p]} b*_{r,s} Φ_{r,s}(⟨w*_{r,s}, [x̄^(2), ..., x̄^(L−1)]⟩). (2)

To simplify formulas, we assume Φ_{r,s}(0) = 0 for all r and s. We denote the complexity of the concept class by C_ε(Φ, R) := max_{φ∈Φ} C_ε(φ, R) and C_s(Φ, R) := max_{φ∈Φ} C_s(φ, R).

Let F* be a function in the concept class with the smallest possible population loss, which we denote by OPT. Hence, we are in an agnostic learning setting where our aim is to learn a function with population objective OPT + ε. As one can observe, functions in the concept class are given by a one-hidden-layer network with p neurons and smooth activations. We will show that the complexity of the functions Φ_{r,s} determines the number of neurons and the number of training samples necessary to train a recurrent neural network to OPT + ε population loss. While we have defined F as a function of x̄, since there is a one-to-one correspondence between x̄ and x, it will occasionally be convenient to talk of F as a function of x, and this should cause no confusion; similarly for other functions like F^(ℓ)_rnn(x; W, A).

3.3 Objective Function and the Learning Algorithm

We assume that there exists a function F* in the concept class that achieves population loss OPT, i.e., E_{(x̄,y*)∼D} G(F*(x̄), y*) ≤ OPT. The following loss function is used for gradient descent:

Obj(W′, A′) = E_{(x̄,y*)∼Z} Obj(x̄, y*; W′, A′), where Obj(x̄, y*; W′, A′) = G(λ F^(L)_rnn(x; W + W′, A + A′), y*).

The parameter λ, whose value is set in the main Theorem 3.1, is a scaling factor needed for technical reasons discussed later. We consider vanilla stochastic gradient updates, with W_t, A_t denoting the matrices after t steps of SGD. W_t and A_t are given by

W_t = W_{t−1} − η ∇_{W′} Obj(x̄, y*; W_{t−1}, A_{t−1}), A_t = A_{t−1} − η ∇_{A′} Obj(x̄, y*; W_{t−1}, A_{t−1}),

where (x̄, y*) is a random sample from Z and x is its normalized form. It should be noted that [37] train only W.

Remark. We made two assumptions in our set-up: (1) input sequences are of fixed length, and (2) the output is only considered at the last step. These assumptions are without loss of generality and allow us to keep the already quite complex formulas manageable without affecting the essential ideas. The main change needed to drop these assumptions is a change in the objective function, which will now include terms not just for how well the output fits the target at step L but also at the earlier steps. The objective function for each step behaves in the same way as that for step L, and so the sum can be analyzed similarly. Intuitively speaking, considering the output only at the end is the hardest training regime for RNNs, as it uses the minimal amount of label information.

3.4 RNNs learn the concept class

We are now ready to state our main theorem. We use ρ := 100 L d_out log m in the following. Recall that a set Φ of smooth functions induces a concept class as in Def. 3.3.

Theorem 3.1 (Main; restated in the appendix as Theorem D.5). Let Φ be a set of smooth functions. For ε_x := 1/poly(ρ) and ε ∈ (0, 1/(p · poly(ρ) · C_s(Φ, O(ε_x^{−1})))), define the complexity C := C_ε(Φ, O(ε_x^{−1})) and λ := ε/(10Lρ). Assume that the number of neurons m ≥ poly(C, p, L, d_out, ε^{−1}) and the number of samples N ≥ poly(C, p, L, d_out, ε^{−1}). Then, with parameter choices η := Θ(1/(ε ρ² m)) and T := Θ(p² C² poly(ρ) ε^{−2}), with probability at least 1 − e^{−Ω(ρ²)} over the random initialization, SGD satisfies

E_sgd [ (1/T) Σ_{t=0}^{T−1} E_{(x̄,y*)∼D} Obj(x̄, y*; W_t, A_t) ] ≤ OPT + ε + 1/poly(ρ). (3)

Informally, the above theorem states that by SGD training of overparameterized RNNs with a sufficiently small learning rate and appropriate preprocessing of the input sequence, we can efficiently find an RNN whose population objective is nearly as small as OPT, since ε + 1/poly(ρ) is small. The required number of neurons and the number of training samples have polynomial dependence on the function complexity of the concept class, the length of the input sequence, the output dimension, and the inverse of the additional prediction error ε.
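The sketch below spells out one plausible implementation of the training procedure analyzed in Theorem 3.1: the recurrence of Definition 3.2, the scaled objective G(λF^(L)_rnn, y*), and SGD on W_rnn and A_rnn with B frozen, with the backpropagation-through-time gradients written out by hand. The data, labels and hyperparameters are placeholders, not the values prescribed by the theorem.

```python
import numpy as np

# Schematic of the SGD training of Sec. 3.3. The loss G is the l2-regression
# loss ||y - y*|| (convex, 1-Lipschitz); data, labels and lam, eta, T are
# illustrative placeholders.

rng = np.random.default_rng(1)
d, m, d_out, L = 8, 256, 2, 10
lam, eta, T = 0.1, 1e-2, 100

W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))           # trained
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))           # trained
B = rng.normal(0.0, np.sqrt(2.0 / d_out), (d_out, m))   # frozen at init

def sgd_step(Wr, Ar, x, y_true):
    # Forward pass, storing hidden states and ReLU masks (the D^(l) of Sec. 4).
    hs, masks, h = [np.zeros(m)], [], np.zeros(m)
    for xt in x:
        pre = Ar @ xt + Wr @ h
        masks.append((pre >= 0).astype(float))
        h = pre * masks[-1]
        hs.append(h)
    y = lam * (B @ h)
    g_y = (y - y_true) / max(np.linalg.norm(y - y_true), 1e-12)  # dG/dy
    g_h = lam * (B.T @ g_y)                    # gradient w.r.t. h^(L)
    gW, gA = np.zeros_like(Wr), np.zeros_like(Ar)
    for t in range(len(x) - 1, -1, -1):        # backpropagation through time
        g_pre = g_h * masks[t]
        gW += np.outer(g_pre, hs[t])           # hs[t] is h^(t-1) for step t
        gA += np.outer(g_pre, x[t])
        g_h = Wr.T @ g_pre
    return Wr - eta * gW, Ar - eta * gA

for _ in range(T):
    x = [rng.standard_normal(d) / np.sqrt(d) for _ in range(L)]  # placeholder
    y_true = np.ones(d_out)                                      # placeholder
    W, A = sgd_step(W, A, x, y_true)
```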
4 Proof Sketch

While the full proof is highly technical, in this section we sketch it, focusing on the conceptual aspects and keeping the technical aspects to the essentials; full proofs are in the appendix. The high-level outline of our proof is as follows.

1. Overparameterization simplifies the network's behavior. The function F^(L)_rnn(x; W + W′, A + A′) computed by the RNN is a function of the parameters W′, A′ as well as of the input x. It is a highly non-linear and non-convex function of both the parameters and the input. The objective function inherits these properties, and its direct analysis is difficult. However, it has been realized in the last few years, predominantly in the FFN setting, that when the network is overparameterized (i.e., the number of neurons m becomes large compared to the other parameters of the problem, such as the complexity of the concept class), the network's behavior simplifies in a certain sense. The general idea carries over to RNNs as well: in (4) below we write the first-order Taylor approximation of F^(L)_rnn(x; W + W′, A + A′) at W and A as a linear function of W′ and A′; it is still a non-linear function of the input sequence. As in [37], we call this function the pseudo-network, though our notion is more general as we vary both the parameters W′ and A′. The pseudo-network is a good approximation of the target network as a function of x, for all x.

2. Existence of a good RNN. To show that RNN training learns successfully, we first show that there are parameter values for the RNN such that, as a function of x, it is a good approximation of F*. Instead of doing this directly, we show that the pseudo-network can approximate F*; this suffices, as we know that the RNN and the pseudo-network remain close. This is done by constructing parameters W′ and A′ so that the resulting pseudo-network approximates the target function in the concept class (Section 4.2) for all x.

3. Optimization. SGD makes progress because the loss function is convex in terms of the pseudo-network, which stays close to the RNN as a function of x. Thus, SGD finds parameters with training loss close to that achieved by W′, A′.

4. Generalization. A Rademacher complexity-based argument shows that SGD attains low population loss.

Step 2 is the main novel contribution of our paper, and we give more details on it in the rest of this section. (The above outline is similar to prior work, e.g., [37]. The details can be quite different though: e.g., they only train W and keep A fixed at its initial value. Their contribution was also mainly in Step 2, and the other steps were similar to prior work.)

4.1 Pseudo-network

We define the pseudo-network here. Suppose W, A, B are at random initialization. The linear term in the first-order Taylor approximation is given by the pseudo-network

F^(L)(x; W′, A′) := Σ_{i=1}^{L} Back_{i→L} D^(i) (W′ h^(i−1) + A′ x^(i)) (4)
 ≈ F^(L)_rnn(x; W + W′, A + A′) − F^(L)_rnn(x; W, A) (using Lemma G.3).

This function approximates the change in the output of the RNN when (W, A) changes to (W + W′, A + A′). The parameter λ that we defined in the objective function will be used to make the contribution of F^(L)_rnn at initialization small, thus making the pseudo-network a good approximation of the RNN. Hence, the pseudo-network is a good approximation of the RNN, provided the weights stay close to the initialization.
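The claim that the RNN behaves almost linearly in (W′, A′) near initialization is easy to probe numerically: if the first-order Taylor term dominates, doubling a small perturbation should roughly double the change in the output. The check below is a sketch under arbitrary choices of m, L and perturbation scale.

```python
import numpy as np

# Numerical sanity check of the linearization claim: near random
# initialization, (W', A') -> F_rnn(x; W + W', A + A') is approximately
# linear, so doubling a small perturbation about doubles the output change.

rng = np.random.default_rng(2)
d, m, L = 8, 2048, 10

def rnn_out(W, A, x):
    h = np.zeros(m)
    for xt in x:
        h = np.maximum(A @ xt + W @ h, 0.0)
    return h.sum() / np.sqrt(m)      # scalar readout standing in for B h^(L)

W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
dW = rng.normal(0.0, 1e-4 / np.sqrt(m), (m, m))   # small perturbation of W
x = [rng.standard_normal(d) / np.sqrt(d) for _ in range(L)]

f0 = rnn_out(W, A, x)
d1 = rnn_out(W + dW, A, x) - f0
d2 = rnn_out(W + 2 * dW, A, x) - f0
print(d2 / d1)   # close to 2 when the first-order Taylor term dominates
```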
To complete the above definition of the pseudo-network, we define the two new notations appearing in (4). For each ℓ ∈ [L], define D^(ℓ) ∈ R^{m×m} as a diagonal matrix with diagonal entries

d^(ℓ)_{rr} := I[⟨w_r, h^(ℓ−1)⟩ + ⟨a_r, x^(ℓ)⟩ ≥ 0], for r ∈ [m]. (5)

In words, the diagonal of the matrix D^(ℓ) represents the activation pattern of the RNN cell at step ℓ at initialization. Define Back_{i→j} ∈ R^{d_out×m} for each 1 ≤ i ≤ j ≤ L by

Back_{i→j} := B D^(j) W ⋯ D^(i+1) W, with Back_{i→i} := B for each i ∈ [L].

The matrices Back_{i→j} in Eq. (4) arise naturally in the computation of the first-order Taylor approximation (equivalently, of the gradients w.r.t. the parameters) using standard matrix calculus. Very roughly, one can think of Back_{i→j} as the backpropagation signal from the output at step j to the parameters at step i.
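Putting the pieces together, the sketch below computes the masks D^(ℓ), the matrices Back_{i→L} via the backward recursion Back_{i→L} = Back_{i+1→L} D^(i+1) W, and the pseudo-network (4), and compares it against the actual change in the RNN output. It takes d_out = 1, and all sizes are illustrative.

```python
import numpy as np

# Direct implementation of the pseudo-network (4) with d_out = 1.

rng = np.random.default_rng(3)
d, m, L = 6, 512, 8

W = rng.normal(0.0, np.sqrt(2.0 / m), (m, m))
A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
B = rng.normal(0.0, np.sqrt(2.0), (1, m))          # N(0, 2/d_out), d_out = 1
x = [rng.standard_normal(d) / np.sqrt(d) for _ in range(L)]

# Forward pass at initialization: record h^(l) and the masks D^(l) of (5).
hs, Ds, h = [np.zeros(m)], [], np.zeros(m)
for xt in x:
    pre = A @ xt + W @ h
    Ds.append((pre >= 0).astype(float))            # diagonal of D^(l)
    h = pre * Ds[-1]
    hs.append(h)

# Back_{i->L} = B D^(L) W ... D^(i+1) W, by a backward recursion.
Back = [None] * (L + 1)
Back[L] = B.copy()
for i in range(L - 1, 0, -1):
    Back[i] = (Back[i + 1] * Ds[i]) @ W            # Ds[i] is D^(i+1)

def pseudo_network(dW, dA):
    # F^(L)(x; W', A') = sum_i Back_{i->L} D^(i) (W' h^(i-1) + A' x^(i))
    return sum(Back[i] @ (Ds[i - 1] * (dW @ hs[i - 1] + dA @ x[i - 1]))
               for i in range(1, L + 1))

def rnn_out(Wr, Ar):
    h = np.zeros(m)
    for xt in x:
        h = np.maximum(Ar @ xt + Wr @ h, 0.0)
    return B @ h

dW = rng.normal(0.0, 1e-3 / np.sqrt(m), (m, m))
dA = rng.normal(0.0, 1e-3 / np.sqrt(m), (m, d))
print(rnn_out(W + dW, A + dA) - rnn_out(W, A), pseudo_network(dW, dA))
```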
4.2 Existence of a good pseudo-network

Our goal is to construct W′ and A′ such that for any true input sequence x̄ = (x̄^(2), ..., x̄^(L−1)), if we define the normalized sequence x = (x^(1), ..., x^(L)), then with high probability we have

F^(L)(x; W′, A′) ≈ F*(x̄). (6)

To simplify the presentation, in this sketch we assume that p, the number of neurons in the concept class, and the output dimension d_out are both equal to 1. Also, let the output weight b* := 1. These assumptions retain the main proof ideas while simplifying the equations. Overall, we assume that the target function F* : R^{(L−2)×(d−1)} → R on a given sequence is given by

F*(x̄) = Φ(⟨w*, [x̄^(2), ..., x̄^(L−1)]⟩), (7)

where Φ : R → R is a smooth function and w* ∈ S^{(L−2)(d−1)−1}.

First, we state Lemma 6.2 of [15], which is useful for our construction of the matrices W′ and A′. Consider a smooth function φ : [−1, 1] → R. It can be approximated as a linear combination of step functions (derivatives of ReLU) for all u ∈ (−1, 1): i.e., there exists a weight function H : R² → R such that

φ(u) ≈ E_{α_1, β_1, b_0}[H(α_1, b_0) I[α_1 u + β_1 √(1−u²) + b_0 ≥ 0]],

where α_1, β_1 ∼ N(0, 1) and b_0 ∼ N(0, 1) are independent random variables (we have omitted some technical details). The above statement can be straightforwardly extended to the following slightly more general version:

Lemma 4.1. For every smooth function φ, any w* ∈ S^{d−1}, and any ε ∈ (0, 1/C_s(φ, 1)), there exists a function H : R² → [−C_ε(φ, 1), C_ε(φ, 1)], which is C_ε(φ, 1)-Lipschitz continuous, such that for all u ∈ S^{d−1} we have

|φ(⟨w*, u⟩) − E_{w∼N(0,I), b_0∼N(0,1)}[H(⟨w, w*⟩, b_0) I[⟨w, u⟩ + b_0 ≥ 0]]| ≤ ε.

Very informally, this lemma states that the activation pattern of a one-layer ReLU network at initialization (given by I[⟨w, u⟩ ≥ 0]) can be used to express a smooth function of the dot product of the input vector with a fixed vector. While the above statement involves an expectation, one can easily replace it by an empirical average at the cost of a slight increase in error. This statement formed the basis of the FFN and RNN results in [15, 37]. Can we use it for RNNs for our general concept class? An attempt to do so is the following lemma, showing that the pseudo-network can express any smooth function of the hidden state h^(L−1) and the token x^(L).

Lemma 4.2 (Informal). For a given smooth function φ, a vector w* ∈ S^{m+d−2}, and any ε ∈ (0, 1/C_s(φ, 1)), there exist matrices W′ and A′ such that for every normalized input sequence x = (x^(1), ..., x^(L)) formed from a true sequence x̄, we have with high probability

|F^(L)(x; W′, A′) − φ(⟨w*, [h^(L−1), x^(L)_{:d−1}]⟩)| ≤ ε,

provided m ≥ poly(1/ε, L, C_ε(φ, O(1))). The vector x^(L)_{:d−1} is x^(L) without its last coordinate, the bias term appended to each input.

The reason h^(L−1) and x^(L) come up is that they serve as inputs to the RNN cell when it processes the L-th token. The proof sketched below uses the fact that RNNs are one-layer FFNs unrolled over time. Hence, we could try to apply Lemma 4.1 to the RNN cell at step L. However, a difficulty arises in carrying out this plan: the contributions of the previous time steps also come up (as seen in the equations below), and it can be difficult to disentangle the contribution of step L. This is addressed in the proof:

Proof. Recall that the entries of W and A are i.i.d. N(0, 2/m). Also, recall that we have assumed d_out = 1 for simplicity; hence B and Back_{i→L} are row vectors. For typographical simplicity, denote by b_r and Back_{i→L,r} the respective r-th components of these vectors. We set W′ := 0 and, for every r ∈ [m],

a′_r := (1/m) b_r H(√(m/2) ⟨[w_r, a_{r,:d−1}], w*⟩, √(m/2) a_{r,d}) e_d,

for a function H that we choose below. With these choices we have

F^(L)(x; W′, A′) = Σ_{i=1}^{L} Back_{i→L} D^(i) (W′ h^(i−1) + A′ x^(i))
 = Σ_{i=1}^{L} Σ_{r∈[m]} (1/m) x^(i)_d b_r Back_{i→L,r} H(√(m/2) ⟨[w_r, a_{r,:d−1}], w*⟩, √(m/2) a_{r,d}) I[⟨w_r, h^(i−1)⟩ + ⟨a_r, x^(i)⟩ ≥ 0].

In the last step, we have expanded the formula as a sum over neurons. The first L − 1 summands in the outer sum above nearly vanish, due to the small correlation between B and Back_{i→L} for i < L (see Lemma F.11). Recall that Back_{L→L} = B, and thus the correlation is not small for i = L. This gives

F^(L)(x; W′, A′) ≈ (1/m) Σ_{r∈[m]} b_r² H(√(m/2) ⟨[w_r, a_{r,:d−1}], w*⟩, √(m/2) a_{r,d}) I[⟨w_r, h^(L−1)⟩ + ⟨a_r, x^(L)⟩ ≥ 0].

Now, this resembles a discretized version of Lemma 4.1. We can substitute [h^(L−1), x^(L)] for u in Lemma 4.1 (choosing H accordingly) and use concentration bounds with respect to the randomness of the weights W and A to complete the proof. ∎

More generally, with much more technical work, it might be possible to prove an extension of the above lemma asserting the existence of a pseudo-network approximating a sum of functions of the type Σ_{i∈[L]} φ_i(⟨w*_i, [h^(i−1), x^(i)]⟩). However, even so, it is not at all clear what class of functions of x this represents, because of the presence of the hidden state vectors. Thus, the major challenge in constructing W′ and A′ to express the functions from the desired concept class is to use the information contained in h^(ℓ). The construction of W′ in [37] is not able to use this information and ignores it by treating it as noise (which is also non-trivial). The idea underlying our construction is that h^(ℓ) in fact contains information about all the inputs x^(1), ..., x^(ℓ) up until step ℓ. Furthermore, and crucially, this information can be recovered approximately by a linear transformation (Theorem 4.5 below). This enables us to show:

Theorem 4.3 (Existence of a pseudo-network approximating the target function; abridged statement of Theorem D.2 in the appendix). For every target function F* of the form Eq. (7), there exist matrices W′ and A′ such that with probability at least 1 − e^{−Ω(ρ²)} over W, A, B, we have for every normalized input sequence x = (x^(1), ..., x^(L)) formed from a true sequence x̄,

|F^(L)(x; W′, A′) − Φ(⟨w*, [x̄^(2), ..., x̄^(L−1)]⟩)| ≤ ε + 1/poly(ρ),

provided m ≥ poly(ρ, L, ε^{−1}, C_ε(Φ, O(ε_x^{−1}))) and ε_x ≤ 1/poly(ρ).

Proof sketch. By Theorem 4.5 there exists a matrix W^[L−1] such that (W^[L−1])⊤ h^(L−1) ≈ [x^(1), ..., x^(L−1)] for all input sequences. We can modify W^[L−1] so as to recover [x̄^(2), ..., x̄^(L−1)] instead. Hence, by using [W^[L−1] w*, 0] as the w* and Φ as the φ in Lemma 4.2, we can have F^(L)(x; W′, A′) ≈ Φ(⟨w*, (W^[L−1])⊤ h^(L−1)⟩) ≈ F*(x̄), implying that F^(L) and F* are close. Accounting for all the errors in the inversion and in the approximation of the function, we get the final bound. ∎
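As a numerical aside before turning to a technical caveat in the argument above, the mechanism behind Lemma 4.1 is easy to observe empirically. Instead of constructing the function H analytically, the sketch below simply fits per-feature coefficients by least squares, which already shows that indicators of random halfspaces can express a smooth function of a fixed inner product; all sizes and the target φ are illustrative.

```python
import numpy as np

# Empirical illustration of Lemma 4.1: fit coefficients on the random
# indicator features I[<w_r, u> + b_r >= 0] to approximate phi(<w*, u>).
# We fit by least squares rather than constructing H explicitly.

rng = np.random.default_rng(4)
dim, m, n_train, n_test = 10, 1000, 3000, 500

w_star = rng.standard_normal(dim); w_star /= np.linalg.norm(w_star)
phi = lambda t: np.sin(3 * t)                     # target smooth function

Wf = rng.standard_normal((m, dim))                # random weights w_r ~ N(0, I)
bf = rng.standard_normal(m)                       # random biases b_r ~ N(0, 1)

def features(U):
    return (U @ Wf.T + bf >= 0).astype(float)     # I[<w_r, u> + b_r >= 0]

U = rng.standard_normal((n_train, dim)); U /= np.linalg.norm(U, axis=1, keepdims=True)
V = rng.standard_normal((n_test, dim)); V /= np.linalg.norm(V, axis=1, keepdims=True)

c, *_ = np.linalg.lstsq(features(U), phi(U @ w_star), rcond=None)
err = np.abs(features(V) @ c - phi(V @ w_star))
print(err.mean())                                 # shrinks as m grows
```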
Re-randomization. In the proof sketches of Lemma 4.2 and Theorem 4.3 above, we swept a technical but critical consideration under the rug: the random variables {w_r, a_r}_{r∈[m]}, W^[L], {Back_{i→L}}_{i∈[L]} and {h^(i)}_{i∈[L]} are not independent. This invalidates the application of standard concentration inequalities w.r.t. the randomness of W and A, and this application is required in the proofs. Here, our new variation of the re-randomization technique from [37] comes in handy. The basic idea is the following: whenever we want to apply concentration bounds w.r.t. the randomness of W and A, we divide the set of rows into disjoint sets of equal sizes. For each set, we re-randomize the rows of the matrix [W, A]_r, show that the matrices W^[L], {Back_{i→L}}_{i∈[L]} and {h^(i)}_{i∈[L]} do not change much, and then apply concentration bounds w.r.t. the new weights in the set. Finally, we account for the error from each set.

4.3 The rest of the proof

Having shown that there exists a pseudo-network that approximates the RNN and can also approximate the concept class, we complete the proof by showing that SGD finds matrices whose performance on the population objective Obj(·) is similar to that of W′ and A′. Lemma D.3 shows that the training loss decreases with time. The basic idea is that within a small radius of perturbation, overparameterized RNNs behave as linear networks, and hence the training can be analyzed via convex optimization. Then we show, using Lemma D.4, that the Rademacher complexity of overparameterized RNNs is bounded. Again, the main idea here is that RNNs behave as pseudo-networks in our overparameterized regime, and hence their Rademacher complexity can be approximated by the Rademacher complexity of pseudo-networks. Finally, using generalization bounds based on Rademacher complexity, we get the final population-level objective bound in Theorem D.5.

4.4 Invertibility of RNNs at initialization

In this section, we describe how to recover x^(1), ..., x^(L) from the hidden state h^(L). The following lemma states that any linear function can be represented by a one-hidden-layer FFN with ReLU activation, with a small approximation error of the order 1/√m:

Lemma 4.4 (a simpler continuous version can be found in Lemma C.1 in the appendix). For any v ∈ R^d, the linear function taking x to ⟨v, x⟩ for x ∈ R^d can be represented as

|⟨v, x⟩ − ⟨p, σ(Tx)⟩| ≤ O(‖v‖ ‖x‖ / √m), (8)

with p = (2/m) T v, where T ∈ R^{m×d} is a matrix with elements sampled i.i.d. from N(0, 1).
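Lemma 4.4 is easy to verify numerically: in the sketch below (with arbitrary d, v and x), the error of ⟨p, σ(Tx)⟩ with p = (2/m)Tv decays roughly like 1/√m.

```python
import numpy as np

# Numerical check of Lemma 4.4: <v, x> is recovered from the random-ReLU
# features sigma(Tx) by the fixed vector p = (2/m) T v, with error ~ 1/sqrt(m).

rng = np.random.default_rng(5)
d = 20
v = rng.standard_normal(d)
x = rng.standard_normal(d)

for m in [100, 1000, 10000, 100000]:
    T = rng.standard_normal((m, d))                 # T_ij ~ N(0, 1)
    p = (2.0 / m) * (T @ v)
    approx = p @ np.maximum(T @ x, 0.0)             # <p, sigma(Tx)>
    print(m, abs(v @ x - approx))
```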
Using the above lemma, we show that the hidden state h^(L) can be inverted using a matrix W^[L] to recover the input sequence x^(1), ..., x^(L). (Lemma 4.4 is from a companion paper (forthcoming), where it is used to invert feedforward networks; apart from this lemma, that work is very different from the present paper. We have reproduced the proof in full in the appendix.)

Theorem 4.5 (informal version of Theorem D.1). There exists a set of matrices {W^[ℓ]}_{ℓ∈[L]}, which may depend on W and A, such that for any ε_x < 1/L and any given normalized sequence x^(1), ..., x^(L), with probability at least 1 − e^{−Ω(ρ²)} we have

‖[x^(1), ..., x^(L)] − (W^[L])⊤ h^(L)‖ ≤ poly(L, ρ, m^{−1}, ε_x).

Very roughly, the above result is obtained by repeated application of Lemma 4.4 to go from h^(ℓ) to (h^(ℓ−1), x^(ℓ)), starting with ℓ = L. This uses the fact that the RNN cell is a one-hidden-layer neural network, and hence Lemma 4.4 is applicable. Several difficulties need to be overcome to carry out this plan. One difficulty is that a naive application of Lemma 4.4 results in an exponential blowup of the error with L. The reason is as follows: in Lemma 4.4, the factor 2 in the linear transformation p appears because of the ReLU activation. A naive application will result in defining W^[ℓ] inductively as 2[W W^[ℓ−1], A]_r for 2 ≤ ℓ ≤ L, with W^[1] = A. This will lead to the approximation error exploding exponentially with L in the induction, since the approximation error at each step depends on the norm of W^[ℓ−1]. However, one can observe from the proof of Lemma 4.4 that it suffices to use the following linear transformation: p(w) = ⟨w, v⟩ I[⟨w, x⟩ ≥ 0], which removes the factor of 2. However, this requires knowing the indicator I[⟨w, x⟩ ≥ 0]. For RNNs, this means we need to know the indicator matrix D^(ℓ) at each step. To do so, we define a base sequence x_(0), whose indicator matrices {D^(ℓ)_(0)}_{ℓ∈[L]} are used to approximate the indicator matrix D^(ℓ) at each step ℓ for any sequence x. That is, W^[ℓ] is defined inductively as [D^(ℓ)_(0) W W^[ℓ−1], D^(ℓ)_(0) A]_r for 2 ≤ ℓ ≤ L, with W^[1] = D^(1)_(0) A. We then show that the approximation error of the indicator matrices builds up to give an error with polynomial dependence on ε_x and L. We defer the technical details of this resolution to the full proof in the appendix. Secondly, we apply re-randomization to tackle the dependence between W, A, {h^(ℓ)}_{ℓ∈[L]} and {W^[ℓ]}_{ℓ∈[L]}.

We performed a few toy experiments on the invertibility of RNNs at initialization (Sec. I). We observed, as predicted by our theorem above, that the error involved in inversion decreases with the number of neurons and increases with the length of the sequence (Fig. 4).

5 On concept classes

It is apparent that our concept class is very general, as it allows arbitrary dependencies across tokens. To concretely illustrate the generality of our concept class, and to compare with previous work, we show that our result implies that RNNs can recognize a simple formal language DL1. Here we work in the discrete setting where each input token comes from {0, 1}, possibly represented as a vector when fed to the RNN. For a sequence z ∈ {0, 1}^L, we define DL1(z) to be 1 if the number of 1s in z is exactly 1, and 0 otherwise. We can show that DL1 is not representable in the concept class of [37] (see Theorem H.1 in the appendix). However, we can show that the language DL1 can be recognized by a one-layer FFN with one neuron and quadratic activation: the idea is that the number of 1s in the input string can be computed by a single neuron. This implies that our concept class can represent the language DL1 with low complexity. More generally, we can show that our concept class can efficiently represent pattern-matching problems, where strings belong to a language only if they contain given strings as substrings.

In general, we can show that our concept class can express general regular languages. However, the complexity of the concept class may depend super-polynomially on the length of the input sequence, depending on the regular language (more discussion in Sec. H). Some regular languages allow special treatment though. For example, consider the language PARITY: the language over the alphabet {0, 1} with a string w = (w_1, ..., w_j) ∈ PARITY iff w_1 + ... + w_j ≡ 1 (mod 2), for j ≥ 1. We can show in Sec. H that PARITY is easily expressible by our concept class with small complexity.
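The following sketch makes the two constructions above concrete: both DL1 and PARITY are smooth functions of the single inner product of the bit string with the all-ones direction. The encodings and thresholds here are our own illustrations, not the exact constructions of Sec. H.

```python
import numpy as np

# DL1 and PARITY as smooth functions of one inner product <1, z>, where
# z in {0,1}^L is the bit string. Encodings/thresholds are illustrative.

L = 8
ones = np.ones(L)

def dl1(z):
    # One "neuron" with quadratic activation: (count - 1)^2 vanishes iff
    # the number of 1s is exactly 1.
    count = ones @ z
    return 1.0 if (count - 1.0) ** 2 < 0.5 else 0.0

def parity(z):
    # cos(pi * count) = +1 for even count, -1 for odd; rescale to {0, 1}.
    count = ones @ z
    return 0.5 * (1.0 - np.cos(np.pi * count))

for z in [np.zeros(L), np.eye(L)[0], np.ones(L)]:
    print(dl1(z), round(parity(z)))
```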
RNNs perform well on regular language recognition tasks in our experiments in Sec. I. Figuring out which regular languages can be efficiently expressed by our concept class remains an interesting open problem.

6 Limitations and Conclusions

We proved the first result on the training and generalization of RNNs when the functions in the concept class are allowed to be essentially arbitrary continuous functions of the token sequence. Conceptually, the main new idea was to show that the hidden state of the RNN contains information about the whole input sequence, and that this information can be recovered via a linear transformation. We believe our techniques can be used to prove similar results for echo state networks. The main limitations of the present work are: (1) Our overparameterized setting requires the number of neurons to be large in terms of the problem parameters, including the sequence length, and it is often qualitatively different from the practical setting. Theoretical analysis of practical parameter settings remains an outstanding challenge even for one-hidden-layer FFNs. (2) We did not consider generalization to sequences longer than those in the training data. Such a result would be very interesting, but it appears that it would require stronger assumptions than our very general assumptions on the data distribution. Our techniques might be a useful starting point to that end: for example, if we knew that the distributions of the hidden states are similar at different time steps and the output is the same as the hidden state (i.e., B is the identity), then our results might easily generalize to longer lengths. We note that, to our knowledge, the limitation noted here holds for all works dealing with generalization for RNNs. (3) Understanding LSTMs remains open.

Our work addresses the theoretical analysis of RNNs and is not directly concerned with applications. We do not anticipate any immediate societal impact of our work. In the long run, it may help improve RNNs and translate to impact on the applications.

7 Acknowledgements

We thank Sanjeev Arora, Raghav Somani and Abhishek Shetty for their valuable suggestions on improving the presentation of the paper. We also thank the reviewers, the area chair and the senior area chair for their valuable suggestions regarding this work.

References

[1] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[2] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[3] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.

[4] Dan Jurafsky and James H. Martin. Speech and Language Processing. 3rd draft edition, 2020.

[5] Bryan Lim and Stefan Zohren. Time series forecasting with deep learning: A survey, 2020.

[6] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
[7] Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. J. Comput. Syst. Sci., 50(1):132–150, 1995.

[8] John F. Kolen and Stefan C. Kremer. A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons, 2001.

[9] Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 2: Short Papers, pages 740–745. Association for Computational Linguistics, 2018.

[10] Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. On the ability and limitations of transformers to recognize formal languages. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 7096–7116. Association for Computational Linguistics, 2020.

[11] Samuel A. Korsky and Robert C. Berwick. On the computational power of RNNs. arXiv preprint arXiv:1906.06349, 2019.

[12] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 8580–8589, 2018.

[13] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems 31, pages 8157–8166, 2018.

[14] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

[15] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pages 6155–6166, 2019.

[16] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109:1–26, 2020.

[17] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 322–332. PMLR, 2019.

[18] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.

[19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In Advances in Neural Information Processing Systems, pages 6676–6688, 2019.

[20] Zhuozhuo Tu, Fengxiang He, and Dacheng Tao. Understanding generalization in recurrent neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020.
[21] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1233–1243. PMLR, 26–28 Aug 2020.

[22] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. CoRR, abs/1609.05191, 2016.

[23] John Miller and Moritz Hardt. Stable recurrent models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019.

[24] Samet Oymak. Stochastic gradient descent learns state equations with nonlinear activations. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2551–2579, Phoenix, USA, 25–28 Jun 2019. PMLR.

[25] Niru Maheswaranathan, Alex H. Williams, Matthew D. Golub, Surya Ganguli, and David Sussillo. Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics. Advances in Neural Information Processing Systems, 32:15696, 2019.

[26] Greg Yang. Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, pages 9947–9960, 2019.

[27] Sina Alemohammad, Zichao Wang, Randall Balestriero, and Richard Baraniuk. The recurrent neural tangent kernel. In 9th International Conference on Learning Representations, ICLR 2021, 2021.

[28] Sina Alemohammad, Randall Balestriero, Zichao Wang, and Richard Baraniuk. Scalable neural tangent kernel of recurrent architectures. arXiv preprint arXiv:2012.04859, 2020.

[29] Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, and Alyson K. Fletcher. Implicit bias of linear RNNs, 2021.

[30] William Merrill. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615, 2019.

[31] William Merrill, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, and Eran Yahav. A formal hierarchy of RNN architectures. arXiv preprint arXiv:2004.08500, 2020.

[32] William Merrill. Formal language theory meets modern NLP. arXiv preprint arXiv:2102.10094, 2021.

[33] Lyudmila Grigoryeva and Juan-Pablo Ortega. Echo state networks are universal. Neural Networks, 108:495–508, 2018.

[34] Mustafa C. Ozturk, Dongming Xu, and Jose C. Principe. Analysis and design of echo state networks. Neural Computation, 19(1):111–138, 2007.

[35] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.

[36] Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. arXiv preprint arXiv:2006.04862, 2020.

[37] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? In Advances in Neural Information Processing Systems, pages 10331–10341, 2019.

[38] Lifu Wang, Bo Shen, Bo Hu, and Xing Cao. On the provable generalization of recurrent neural networks. arXiv preprint arXiv:2109.14142, 2021.
[39] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[40] Roman Vershynin. Spectral norm of products of random and deterministic matrices. Probability Theory and Related Fields, 150(3–4):471–509, 2011.

[41] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[42] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[43] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[44] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[45] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[46] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.

[47] G. G. Lorentz. Approximation of Functions. Holt, Rinehart and Winston, New York, 1966.

[48] Ryan O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.

[49] M. Tomita. Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society, pages 105–108, Ann Arbor, Michigan, 1982.

[50] G. Hinton, N. Srivastava, and K. Swersky. Coursera: Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent, 2014.