# UNDERSTANDING GENERALIZATION IN RECURRENT NEURAL NETWORKS

Published as a conference paper at ICLR 2020

Zhuozhuo Tu, Fengxiang He, Dacheng Tao
UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering
The University of Sydney, Darlington, NSW 2008, Australia
zhtu3055@uni.sydney.edu.au, {fengxiang.he,dacheng.tao}@sydney.edu.au

ABSTRACT

In this work, we develop the theory for analyzing the generalization performance of recurrent neural networks. We first present a new generalization bound for recurrent neural networks based on the matrix 1-norm and the Fisher-Rao norm. The definition of the Fisher-Rao norm relies on a structural lemma about the gradient of RNNs. This new generalization bound assumes that the covariance matrix of the input data is positive definite, which might limit its use in practice. To address this issue, we propose to add random noise to the input data and prove a generalization bound for training with random noise, which is an extension of the former one. Compared with existing results, our generalization bounds have no explicit dependency on the size of networks. We also discover that the Fisher-Rao norm for RNNs can be interpreted as a measure of gradient, and incorporating this gradient measure not only tightens the bound but also allows us to build a relationship between generalization and trainability. Based on the bound, we theoretically analyze the effect of the covariance of features on the generalization of RNNs and discuss how weight decay and gradient clipping in training can help improve generalization.

1 INTRODUCTION

The recurrent neural network (RNN) is a neural sequence model that has achieved state-of-the-art performance on numerous tasks, including natural language processing (Yang et al., 2018; Mikolov & Zweig, 2012), speech recognition (Chiu et al., 2018; Graves, 2013) and machine translation (Wu et al., 2016; Kalchbrenner & Blunsom, 2013). Unlike feedforward neural networks, RNNs allow connections among hidden units associated with a time delay. Through these connections, RNNs can maintain a memory that summarizes the past sequence of inputs, enabling them to capture correlations between temporally distant events in the data. RNNs are very powerful, and empirical studies have shown that they have very good generalization properties. For example, Graves (2013) showed that deep LSTM RNNs achieved a test error of 17.7% on the TIMIT phoneme recognition benchmark after training with only 462 speech samples.

Despite the popularity of RNNs in practice, their theory is still not well understood. A number of recent works have sought to shed light on the effective representational properties of recurrent networks trained in practice. For example, Oymak (2018) studied the state equation of recurrent neural networks and showed that SGD can efficiently learn the unknown dynamics from few observations under proper assumptions. Miller & Hardt (2019) tried to explain why feed-forward neural networks are competitive with recurrent networks in practice. They identified stability as a necessary condition and proved that stable recurrent neural networks are well approximated by feed-forward networks for the purpose of both inference and training by gradient descent. Despite this impressive progress in understanding the training behavior of RNNs, these works provide no generalization guarantee.
Understanding generalization performance in machine learning has been a central problem for many years and has been revived in recent years with the advent of deep learning. One classical approach to proving generalization bounds is via notions of complexity. For deep neural networks, numerous complexity measures have been proposed to capture the generalization behavior, such as the VC dimension (Harvey et al., 2017) and norm-based capacity including the spectral norm (Bartlett et al., 2017; Neyshabur et al., 2019), the Frobenius norm (Neyshabur et al., 2015b;a; 2018) and the $\ell_p$-path norm (Neyshabur et al., 2015b; Bartlett & Mendelson, 2002; Golowich et al., 2018). These existing norm-based complexity measures depend explicitly on the number of hidden units of the network and thus cannot explain why neural networks generalize so well in practice despite operating in an overparametrized setting (Zhang et al., 2017). Neyshabur et al. (2019) proved generalization bounds for two-layer ReLU feedforward networks, which decrease with an increasing number of hidden units in the network. However, their results only apply to two-layer ReLU networks and some specific experiments. More recently, a new generalization bound based on the Fisher-Rao norm was proposed (Liang et al., 2017). This notion of Fisher-Rao norm is motivated by information geometry and has good invariance properties, but the bound was proved only for linear deep neural networks. There are also some works on the generalization of recurrent neural networks (Zhang et al., 2018; Chen et al., 2019; Allen-Zhu & Li, 2019). However, these bounds also depend on the size of networks, which makes them vacuous for very large neural networks.

Our main contributions are summarized as follows.

- We define the Fisher-Rao norm for RNNs based on their gradient structure and derive a new Rademacher complexity bound and generalization bound for recurrent neural networks based on the Fisher-Rao norm and the matrix 1-norm. In contrast to existing results such as spectral-norm-based bounds, our bound has no explicit dependence on the size of networks.
- We prove a generalization bound for RNNs when training with random noise. Our bound applies to general types of noise and can potentially explain the effect of noise training on the generalization of recurrent neural networks, as demonstrated by our empirical results.
- We propose a new technique to decompose RNNs with ReLU activation into a sum of a linear network and difference terms. As a result, each term in the decomposition can be treated independently and easily when estimating the Rademacher complexity. This decomposition technique can potentially be applied to other neural network architectures such as convolutional neural networks, which might be of independent interest.

The remainder of this paper is structured as follows. We define the problem and notations in Section 2. The notion of Fisher-Rao norm for RNNs is introduced in Section 3.1. We prove the generalization bound for RNNs in Section 3.2, and the generalization bound for training with random noise is derived in Section 3.3. Section 3.4 gives a detailed analysis of our generalization bound. Finally, we conclude and discuss future directions.

2 PRELIMINARIES

We focus on vanilla RNNs with ReLU activation. Let $U \in \mathbb{R}^{m \times d}$, $V \in \mathbb{R}^{k \times m}$ and $W \in \mathbb{R}^{m \times m}$ be the weight matrices. Given an input sequence $x = (x_1, x_2, \dots, x_L) \in \mathbb{R}^{Ld}$, where each $x_i \in \mathbb{R}^d$ and $L$ is the input sequence length, the vanilla RNN can be described as follows:
$$g_t = Ux_t + Wh_{t-1}, \quad h_t = \rho(g_t), \quad y_t = Vh_t, \qquad (1)$$

where $g_t, h_t \in \mathbb{R}^m$ represent the input and output of the hidden layer at step $t$, $\rho(\cdot)$ is the ReLU function, and $y_t \in \mathbb{R}^k$ denotes the output value at step $t$. For simplicity, in this paper we only consider the final output $y_L$.

We assume that the data $(x, y)$ is drawn i.i.d. from some unknown distribution $D$ over $\mathbb{R}^{Ld} \times \mathcal{Y}$, where $\mathcal{Y}$ represents the label space $\{1, 2, \dots, k\}$. The RNN above defines a mapping $y_L(x)$ from $\mathbb{R}^{Ld}$ to $\mathbb{R}^k$, where $k$ is the number of classes. We convert $y_L(x)$ to a classifier by selecting the output coordinate with the largest magnitude, i.e., $x \mapsto \operatorname{argmax}_i [y_L(x)]_i$, where $[\cdot]_i$ represents the $i$-th element of a vector. This naturally leads to the definition of the margin $M_{y_L}(x, y)$ of the output $y_L$ at a labeled example $(x, y)$:
$$M_{y_L}(x, y) = [y_L(x)]_y - \max_{y' \neq y} [y_L(x)]_{y'}.$$
Thus, $y_L$ misclassifies $(x, y)$ if and only if $M_{y_L}(x, y) \le 0$. The quality of the prediction made by $y_L$ is measured by the expected risk, defined as $\mathbb{E}_{(x,y)\sim D}[\mathbf{1}_{M_{y_L}(x,y) \le 0}]$. Since the underlying distribution $D$ is unknown to us, we instead consider the empirical error on the sample data, given by $\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{M_{y_L}(x_i,y_i) \le \alpha}$. The generalization error is then the difference between the expected risk and the empirical risk, defined as
$$\mathbb{E}_{(x,y)\sim D}[\mathbf{1}_{M_{y_L}(x,y) \le 0}] - \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{M_{y_L}(x_i,y_i) \le \alpha}.$$
Our goal in this paper is to study the generalization error of RNNs theoretically.

To establish the generalization bound, some notation is necessary. For a vector, we denote the $l_p$ norm by $\|v\|_p = (\sum_i |v_i|^p)^{1/p}$ and the $l_\infty$ norm by $\|v\|_\infty = \max_i |v_i|$. For a matrix, we denote the induced matrix $p$-norm by $\|A\|_p = \max_{\|x\|_p=1} \|Ax\|_p$, the matrix 1-norm by $\|A\|_1 = \max_j \sum_i |a_{ij}|$ and the Frobenius norm by $\|A\|_F^2 = \operatorname{trace}(AA^T)$. The smallest eigenvalue of a matrix $A$ is denoted by $\lambda_{\min}(A)$. The activation function $\rho$ and its derivative $\rho'$ are applied entrywise, i.e., $\rho(A) = (\rho(a_{ij}))_{ij}$ and $\rho'(v) = (\rho'(v_i))_i$. We denote $c = (L+1, L, \dots, 2)^T$,
$$\eta(\theta) = \big[V\operatorname{diag}(\rho'(g_L))W\cdots W\operatorname{diag}(\rho'(g_1))Ux_1,\; V\operatorname{diag}(\rho'(g_L))W\cdots W\operatorname{diag}(\rho'(g_2))Ux_2,\; \dots,\; V\operatorname{diag}(\rho'(g_L))Ux_L\big] \in \mathbb{R}^{k \times L}$$
and $\tau(\theta) = (VW^{L-1}Ux_1, VW^{L-2}Ux_2, \dots, VUx_L)$, where $\theta = (U, W, V)$ and $\operatorname{diag}(\cdot)$ converts a vector into a diagonal matrix.
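To make these objects concrete, the following minimal numpy sketch runs the forward pass in (1), evaluates the margin $M_{y_L}(x,y)$, and computes $\tau(\theta)c$, whose squared maximum coordinate appears in the Fisher-Rao norm introduced in Section 3.1. The dimensions, random weights and the zero initial hidden state $h_0 = 0$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k, L = 5, 8, 3, 4          # input dim, hidden dim, number of classes, sequence length
U = rng.normal(size=(m, d)) / np.sqrt(d)
W = 0.5 * rng.normal(size=(m, m)) / np.sqrt(m)
V = rng.normal(size=(k, m)) / np.sqrt(m)
x = [rng.normal(size=d) for _ in range(L)]
y = 1                             # an arbitrary label in {0, ..., k-1}

relu = lambda z: np.maximum(z, 0.0)

# Forward pass of the vanilla RNN in (1); h_0 = 0 is assumed.
h = np.zeros(m)
for t in range(L):
    g = U @ x[t] + W @ h
    h = relu(g)
y_L = V @ h

# Margin M_{y_L}(x, y): score of the correct class minus the best competing score.
margin = y_L[y] - np.max(np.delete(y_L, y))
print("prediction:", np.argmax(y_L), "margin:", margin)

# tau(theta) c, whose squared maximum coordinate defines ||theta||_fs in (2);
# in practice the outer expectation would be estimated by averaging over data.
c = np.arange(L + 1, 1, -1, dtype=float)              # c = (L+1, L, ..., 2)
tau = np.stack([V @ np.linalg.matrix_power(W, L - 1 - i) @ U @ x[i]
                for i in range(L)], axis=1)           # k x L matrix
print("max_i [(tau(theta) c)_i]^2 =", np.max((tau @ c) ** 2))
```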
3 MAIN RESULT

In this section, we prove a generalization bound for RNNs with ReLU activation. Our new bound is based on the Fisher-Rao norm and the matrix 1-norm. We first define the Fisher-Rao norm for RNNs.

3.1 FISHER-RAO NORM FOR RNNS

We adapt the notion of Fisher-Rao norm to recurrent neural networks. To begin with, we establish the following structural result for RNNs.

Lemma 1. Given an input $x = (x_1, x_2, \dots, x_L)$, consider the recurrent neural network in (1). We have the identity
$$\sum_{a,b} \frac{\partial y_L}{\partial v_{ab}} v_{ab} + \sum_{i,j} \frac{\partial y_L}{\partial w_{ij}} w_{ij} + \sum_{p,q} \frac{\partial y_L}{\partial u_{pq}} u_{pq} = \eta(\theta)c.$$

The notion of Fisher-Rao norm is motivated by the Fisher-Rao metric of information geometry and is defined as follows.

Definition 1 ((Liang et al., 2017), Definition 2). The Fisher-Rao norm of a parameter $\theta$ is defined as
$$\|\theta\|_{fr}^2 := \langle \theta, I(\theta)\theta \rangle, \quad \text{where } I(\theta) = \mathbb{E}\big(\nabla_\theta l(y_{L\theta}(x), y)\, \nabla_\theta l(y_{L\theta}(x), y)^T\big)$$
and $l(\cdot, \cdot)$ is the loss function.

The following lemma gives the explicit formula of the Fisher-Rao norm for RNNs. We can see that the notion of Fisher-Rao norm relies mainly on the gradient structure of RNNs.

Lemma 2. Assume that the loss function $l(\cdot, \cdot)$ is smooth in the first argument. Then the following identity holds for the RNN in (1):
$$\|\theta\|_{fr}^2 = \mathbb{E}\,\big\langle \eta(\theta)c,\; \nabla l(y_{L\theta}(x), y) \big\rangle^2.$$

Remark 1. We observe that each term $V\operatorname{diag}(\rho'(g_L))W\cdots W\operatorname{diag}(\rho'(g_i))Ux_i$ in $\eta(\theta)$ is actually the gradient component in backpropagation through time (BPTT). Therefore, the Fisher-Rao norm can be regarded as a measure of gradient. As will be shown later, we can build a relationship between generalization and trainability in RNNs via the Fisher-Rao norm.

For the linear activation function and the margin loss $l(y_{L\theta}(x), y) = \Phi_\alpha(M_{y_L}(x, y))$, where $\alpha > 0$ is the margin parameter, one might upper bound the Fisher-Rao norm in Lemma 2 by
$$\|\theta\|_{fr}^2 \le \frac{4}{\alpha^2}\,\mathbb{E}\Big[\max_i \big[(\tau(\theta)c)_i\big]^2\Big],$$
since $\langle \tau(\theta)c, \nabla l(y_{L\theta}(x), y)\rangle^2 \le \frac{4}{\alpha^2}\max_i[(\tau(\theta)c)_i]^2$ by the definition of $M_{y_L}(x,y)$ and the Lipschitz property of $\Phi_\alpha(\cdot)$. We define this upper bound as
$$\|\theta\|_{fs}^2 := \mathbb{E}\Big[\max_i \big[(\tau(\theta)c)_i\big]^2\Big], \qquad (2)$$
and still call it the Fisher-Rao norm in this paper, slightly abusing the terminology, as the two are equivalent for $k = 1$. In the rest of the paper, we use this Fisher-Rao norm $\|\cdot\|_{fs}$ to derive the generalization bound for RNNs.

3.2 GENERALIZATION BOUND FOR RNNS

We use the matrix 1-norm and the Fisher-Rao norm together to derive a generalization bound for RNNs. Since it is very challenging to bound the Rademacher complexity of ReLU networks directly in terms of the Fisher-Rao norm, we consider decomposing the ReLU network into the sum of a linear network and a difference term, i.e., $y_L = \psi(\theta)x + (y_L - \psi(\theta)x)$. For the linear network part $\psi(\theta)x$, the Rademacher complexity can be bounded directly by the Fisher-Rao norm. For the difference term $(y_L - \psi(\theta)x)$, we further decompose it into a sum of simpler terms and then upper bound the Rademacher complexity of these simpler terms by the matrix 1-norm.

We first give the result for the linear network part.

Lemma 3. Define $F_r := \{x \mapsto [\psi(\theta)x]_y : \|\theta\|_{fs} \le r,\ y \in \mathcal{Y}\}$, where $x \in \mathbb{R}^{Ld}$ and $\psi(\theta) := (VW^{L-1}U, VW^{L-2}U, \dots, VU)$. For any data $x_1, x_2, \dots, x_n$ drawn i.i.d. from the distribution $D$, collect them as columns of a matrix $X \in \mathbb{R}^{Ld \times n}$. Then we have
$$\hat{R}_n(F_r) \le \frac{r\|X\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}},$$
assuming that $\mathbb{E}(xx^T)$ is positive definite.

Remark 2. If $\mathbb{E}(x) = 0$, then $\mathbb{E}(xx^T)$ is the covariance matrix of the random variable $x$.

Remark 3. We should mention that our assumption that $\mathbb{E}(xx^T)$ is positive definite is not so restrictive and usually holds in practice. For example, in the case that $x$ is a continuous random variable, we can prove that $\mathbb{E}(xx^T)$ is positive definite as follows. Suppose that $x$ is a continuous random variable in the $n$-dimensional space $\mathcal{X} \subseteq \mathbb{R}^n$. If there exists $u \in \mathbb{R}^n$ such that $u^T \mathbb{E}(xx^T)u = 0$, then for any $x \in \mathcal{X}$ we have $u^T x = 0$, i.e., $u \perp \mathcal{X}$. Since $\mathcal{X}$ is $n$-dimensional, the only $u$ that satisfies this is $u = 0$. Therefore, by definition, $\mathbb{E}(xx^T)$ is positive definite. As we will show in Section 3.3, this assumption can be removed, and a more general generalization bound will be presented.

Now we bound the Rademacher complexity of the difference term $y_L - \psi(\theta)x$. With a slight abuse of notation, given input data $x_1, x_2, \dots, x_n \in \mathbb{R}^{Ld}$, the corresponding $g_1, g_2, \dots, g_n \in \mathbb{R}^{Lm}$ and $h_1, h_2, \dots, h_n \in \mathbb{R}^{Lm}$ are calculated by (1). We collect all input data as a matrix denoted by $X$, all input data at time $t$ as a matrix denoted by $X_t$, all inputs of the hidden layer at time $t$ as a matrix denoted by $G_t$ and all outputs of the hidden layer at time $t$ as a matrix denoted by $H_t$, where $X \in \mathbb{R}^{Ld \times n}$, $X_t \in \mathbb{R}^{d \times n}$, $G_t \in \mathbb{R}^{m \times n}$, $H_t \in \mathbb{R}^{m \times n}$ and $t = 1, \dots, L$. The difference term can be decomposed by the following lemma.

Lemma 4. Define $H_t' := H_t - G_t$. Then the following equality holds:
$$VH_L - \psi(\theta)X = \sum_{i=1}^{L} VW^{L-i}H_i'.$$
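As a quick numerical sanity check of Lemma 4, the following sketch (with made-up dimensions and the assumption $h_0 = 0$; the linear part $\psi(\theta)X$ is obtained by running the same recursion with the ReLU removed) compares the two sides of the decomposition directly.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k, L, n = 4, 6, 3, 5, 10            # illustrative sizes
U = rng.normal(size=(m, d))
W = 0.5 * rng.normal(size=(m, m))
V = rng.normal(size=(k, m))
X_t = [rng.normal(size=(d, n)) for _ in range(L)]   # X_t: inputs at step t, n examples as columns

relu = lambda z: np.maximum(z, 0.0)

# Run the RNN of (1) on all n examples at once (h_0 = 0 assumed), keeping G_t and H_t.
H = np.zeros((m, n))
G_list, H_list = [], []
for t in range(L):
    G = U @ X_t[t] + W @ H
    H = relu(G)
    G_list.append(G)
    H_list.append(H)

# psi(theta)X is the output of the same recursion without the ReLU (the linear network).
H_bar = np.zeros((m, n))
for t in range(L):
    H_bar = U @ X_t[t] + W @ H_bar
lhs = V @ H_list[-1] - V @ H_bar          # V H_L - psi(theta) X

# Right-hand side of Lemma 4: sum_i V W^{L-i} H'_i with H'_i = H_i - G_i.
rhs = sum(V @ np.linalg.matrix_power(W, L - 1 - i) @ (H_list[i] - G_list[i])
          for i in range(L))
print(np.allclose(lhs, rhs))              # expected: True
```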
To bound the Rademacher complexity of each term in the above decomposition, we need the following technical lemma.

Lemma 5. For any $p \ge 1$,
$$\|H_t'\|_p \le m^{\frac{1}{p}(1-\frac{1}{p})} n^{\frac{1}{p}(1-\frac{1}{p})} \|G_t\|_p.$$

As we will see, the operator norm in Lemma 5 will be instantiated for the case $p = 1$. The use of $\|\cdot\|_1$ helps avoid the appearance of the dimension $m$ when upper bounding the Rademacher complexity. It also guarantees that the Rademacher complexity has a convergence rate of $O(1/n)$. The upper bound for the Rademacher complexity of these individual terms is given by the following lemma.

Lemma 6. Let $\Omega := \{\theta = (U, W, V) : \|V^T\|_1 \le \beta_V,\ \|W^T\|_1 \le \beta_W,\ \|U^T\|_1 \le \beta_U\}$. Then for any $i = 1, \dots, L$, we have
$$\mathbb{E}_\sigma \sup_{\theta \in \Omega, y \in \mathcal{Y}} \frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma \le \frac{1}{n}\beta_V\beta_U \sum_{j=1}^{i}\beta_W^{L-j}\|X_j^T\|_1,$$
where $\sigma = (\sigma_1, \sigma_2, \dots, \sigma_n)^T$ is a vector of Rademacher random variables and $[\cdot]_{y,\cdot}$ represents the $y$-th row of the matrix.

We are now ready to put the ingredients together to prove our first theorem.

Theorem 1 (Rademacher complexity of RNNs). Let $\Omega' := \{\theta = (U, W, V) : \|V^T\|_1 \le \beta_V,\ \|W^T\|_1 \le \beta_W,\ \|U^T\|_1 \le \beta_U,\ \|\theta\|_{fs} \le r\}$. Then the empirical Rademacher complexity of RNNs with ReLU activation can be bounded as follows:
$$\mathbb{E}_\sigma \sup_{\theta \in \Omega', y \in \mathcal{Y}} \frac{1}{n}\sum_{i=1}^{n}[y_{L\theta}(x_i)]_y\sigma_i \le \frac{r\|X\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}} + \frac{1}{n}\beta_V\beta_U\|X^T\|_1\Lambda,$$
where $\Lambda := \frac{1}{1-\beta_W}\Big(\frac{1-\beta_W^L}{1-\beta_W} - L\beta_W^L\Big)$ if $\beta_W \ne 1$ and $\Lambda := \frac{L+L^2}{2}$ if $\beta_W = 1$.

To establish the generalization bound for RNNs, we need the following classical result on multi-class margin bounds.

Lemma 7 ((Kuznetsov et al., 2015), Theorem 2). Let $H \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set with $\mathcal{Y} = \{1, 2, \dots, k\}$. Fix $\alpha > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification generalization bound holds for all $h \in H$:
$$\mathbb{E}[\mathbf{1}_{M_h(x,y) \le 0}] \le \frac{1}{n}\sum_{i=1}^{n}\Phi_\alpha(M_h(x_i, y_i)) + \frac{4k}{\alpha}\hat{R}_n(\Pi_1(H)) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$
where $\Pi_1(H) = \{x \mapsto h(x, y) : y \in \mathcal{Y}, h \in H\}$.

The generalization bound for RNNs follows from combining Theorem 1 and Lemma 7.

Theorem 2. Fix a margin parameter $\alpha$. Then for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for every RNN whose weight matrices $\theta = (U, W, V)$ satisfy $\|V^T\|_1 \le \beta_V$, $\|W^T\|_1 \le \beta_W$, $\|U^T\|_1 \le \beta_U$ and $\|\theta\|_{fs} \le r$:
$$\mathbb{E}[\mathbf{1}_{M_{y_L}(x,y) \le 0}] \le \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_{M_{y_L}(x_i,y_i) \le \alpha} + \frac{4k}{\alpha}\left(\frac{r\|X\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}} + \frac{1}{n}\beta_V\beta_U\|X^T\|_1\Lambda\right) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \qquad (3)$$

Comparison with existing results. We compare our result with the existing generalization bounds (Zhang et al., 2018; Chen et al., 2019). In comparison with the bound in Zhang et al. (2018), which is of the order $O\big(\max\{d, m, k\}\,L^2\,\|U\|_2\|V\|_2\max\{1, \|W\|_2^L\}/(n\alpha)\big)$, there is no explicit appearance of the network size parameters $d$ and $m$ in our bound. As we have mentioned before, the reason we can avoid these dimensional factors is that we use the matrix 1-norm instead of the spectral norm to upper bound the Rademacher complexity of the network. Moreover, there is always an $L^2$ factor in their bound, whereas the $L^2$ term only occurs in our bound when $\|W^T\|_1 = 1$. In the case $\|W^T\|_1 > 1$, our bound has only a linear dependence on $L$, and in the case $\|W^T\|_1 < 1$, a simple calculation shows that $\Lambda \le \frac{1}{(1-\beta_W)^2}$ and the dependence on $L$ vanishes. Both of our bounds have an exponential term $\|W\|^L$, which would make them vacuous for $\|W\| > 1$. It should also be pointed out that our bound scales linearly with the number of classes, since we handle the multi-class case on each coordinate of a $k$-tuple of functions and pay a factor of $k$. Chen et al. (2019) also derived a generalization bound for RNNs in terms of the spectral norm and the total number of parameters of the network by using a covering number analysis.
Since their work assumed that the activation function in the hidden layers is bounded, rather than the ReLU activation considered in our paper, their bound is not directly comparable to ours, and we do not make a comparison here due to the page limit. We should emphasize that our proof technique is totally different from the PAC-Bayes approach (Zhang et al., 2018) and the covering number analysis (Chen et al., 2019). In particular, we work on the Rademacher complexity of RNNs directly with no invocation of complicated tools such as covering numbers, which makes our analysis conceptually much simpler. There is also an additional bonus of our proof technique: in the next section, we use it to derive a generalization bound for RNNs when training with random noise.

3.3 GENERALIZATION BOUND FOR TRAINING WITH RANDOM NOISE

The generalization bound in Theorem 2 requires the input covariance matrix $\mathbb{E}(xx^T)$ to be positive definite and would become very poor when the smallest eigenvalue is close to 0, which greatly limits the power of our bound. To address this issue, we consider adding random noise to the input data. We notice that after adding random noise with mean 0 and variance $\sigma_\epsilon^2$, the term $\mathbb{E}(xx^T)$ in the bound becomes $\mathbb{E}((x+\epsilon)(x+\epsilon)^T)$, and the smallest eigenvalue of $\mathbb{E}((x+\epsilon)(x+\epsilon)^T)$ is $\lambda_{\min}(\mathbb{E}(xx^T)) + \sigma_\epsilon^2$, which is greater than $\sigma_\epsilon^2$. Therefore, our bound is still applicable even when the covariance matrix of the original input data is rank-deficient.

Involving noise variables has been widely used in recurrent neural networks as a regularization technique (Bayer et al., 2013; Zaremba et al., 2014; Dieng et al., 2018; Gal & Ghahramani, 2016). For example, Bayer et al. (2013) claimed that conventional dropout does not work well with RNNs because the recurrence amplifies noise, which in turn hurts learning. To fix this problem, Zaremba et al. (2014) proposed to inject noise only at the input and output of RNNs. Although their method greatly reduced overfitting on a variety of tasks, no generalization guarantee was provided. In this section, we present a generalization bound for RNNs with noise training.

For simplicity, we assume that the noise is drawn i.i.d. from a Gaussian distribution with zero mean and variance $\sigma_\epsilon^2$. Let $\epsilon_i$ denote the $d$-dimensional Gaussian noise generated at step $i$ and $\epsilon = (\epsilon_1, \epsilon_2, \dots, \epsilon_L) \in \mathbb{R}^{Ld}$. We collect all noise data as a matrix denoted by $X_\epsilon$. To prove the generalization bound, we need the Lipschitz property of RNNs given by the following lemma.

Lemma 8. For every RNN in (1) with weight matrices $\theta = (U, W, V)$, $y_L$ is Lipschitz with respect to $\|\cdot\|_\infty$, i.e.,
$$\|y_L(x) - y_L(x')\|_\infty \le \sum_i \|V^T\|_1\|U^T\|_1\|W^T\|_1^{L-i}\,\|x_i - x_i'\|_\infty$$
for any $x = (x_1, x_2, \dots, x_L)$, $x' = (x_1', x_2', \dots, x_L') \in \mathbb{R}^{Ld}$.

The generalization bound for training with random noise is described as follows.

Theorem 3. Fix a margin parameter $\alpha$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over a sample $((x_1, \epsilon_1, y_1), (x_2, \epsilon_2, y_2), \dots, (x_n, \epsilon_n, y_n))$, the following holds for every RNN whose weight matrices $\theta = (U, W, V)$ satisfy $\|V^T\|_1 \le \beta_V$, $\|W^T\|_1 \le \beta_W$, $\|U^T\|_1 \le \beta_U$ and $\|\theta\|_{fs} \le r$:
$$\mathbb{E}[\mathbf{1}_{M_{y_L}(x,y) \le 0}] \le \frac{1}{n}\sum_i \Phi_\alpha(M_{y_L}(x_i + \epsilon_i, y_i)) + \frac{2}{\alpha}\sum_i \beta_V\beta_U\beta_W^{L-i}\sigma_\epsilon\sqrt{2\log(2d)} + \frac{4k}{\alpha}\left(\frac{r\|X + X_\epsilon\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T)) + \sigma_\epsilon^2}} + \frac{1}{n}\beta_V\beta_U\|X^T + X_\epsilon^T\|_1\Lambda\right) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

Remark 4. The above bound can be easily extended to other kinds of noise by replacing $\sigma_\epsilon\sqrt{2\log(2d)}$ with $\mathbb{E}_\epsilon\|\epsilon_i\|_\infty$.
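The effect of additive input noise on the smallest eigenvalue can be checked numerically. The sketch below uses an arbitrary rank-deficient input distribution, chosen purely for illustration, and estimates both second-moment matrices by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, sigma_eps = 20, 200_000, 0.1       # flattened input dimension L*d, samples, noise std

# An arbitrary rank-deficient input distribution: x lives in a 10-dimensional subspace
# of R^dim, so E[x x^T] is singular and Theorem 2 does not apply directly.
A = rng.normal(size=(dim, 10))
x = A @ rng.normal(size=(10, n))
eps = sigma_eps * rng.normal(size=(dim, n))

lam_clean = np.linalg.eigvalsh(x @ x.T / n).min()
lam_noisy = np.linalg.eigvalsh((x + eps) @ (x + eps).T / n).min()
print(lam_clean)    # ~ 0
print(lam_noisy)    # ~ sigma_eps**2 = 0.01, as exploited in Theorem 3
```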
Remark 5. The bound in Theorem 3 is an extension of that in Theorem 2 and can be applied even when the smallest eigenvalue of $\mathbb{E}(xx^T)$ is very close to 0. For example, when $\lambda_{\min}(\mathbb{E}(xx^T)) = 1 \times 10^{-6}$, applying Theorem 2 directly might lead to a vacuous bound. But if we use Theorem 3 and choose a small noise with mean 0 and variance 0.01, we might obtain a better bound, since the term $\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T)) + \sigma_\epsilon^2}} \le 10$. Notice that adding noise cannot always guarantee improved generalization, especially when $\lambda_{\min}(\mathbb{E}(xx^T))$ is not so small, as it incurs the additional linear term $\frac{2}{\alpha}\sum_i \beta_V\beta_U\beta_W^{L-i}\sigma_\epsilon\sqrt{2\log(2d)}$ and might also increase other parameters in the bound, such as $\|X + X_\epsilon\|_F$. Therefore we suggest adding noise only when the smallest eigenvalue of $\mathbb{E}(xx^T)$ is very small. In this case, a small noise such as $\sigma_\epsilon = 0.1$ not only greatly improves the term $\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}}$ but also ensures that the extra costs $\sigma_\epsilon\sqrt{2\log(2d)}$ and $\|X + X_\epsilon\|_F/n$ are small enough, since $\|X + X_\epsilon\|_F/n \le \|X\|_F/n + \|X_\epsilon\|_F/n$ and $\|X_\epsilon\|_F/n$ is small when $n$ is large.

Remark 6. If we remove the constraint $\|\theta\|_{fs} \le r$, which means that we do not have any knowledge about the gradients, the generalization bounds in Theorem 2 and Theorem 3 still hold after substituting $r$ with $\beta_V\beta_U B\big(\frac{1}{(1-\beta_W)^2} + \frac{1}{1-\beta_W}\big)$ for $\beta_W < 1$. But with this extra gradient measure, the bound can become much tighter, especially when $\lambda_{\min}(\mathbb{E}(xx^T))$ is small. Please refer to the detailed analysis in the next section.

Experiments. We now study the effect of random noise on the generalization of RNNs empirically. For simplicity, we consider the IMDB dataset, a collection of 50K movie reviews for binary sentiment classification. We use GloVe word embeddings to map each word to a 50-dimensional vector. We train vanilla RNNs with ReLU activation for sequence length $L = 100$. The corresponding smallest eigenvalue of $\mathbb{E}(xx^T)$, approximated using the full training data, is $4 \times 10^{-4}$. We add Gaussian noise to the input data in the training process with $\sigma_\epsilon = 0.1, 0.2, 0.3$ and 0.4. The generalization error, i.e., the gap between the test error without noise and the training error with noise, for $L = 100$ and different values of $\sigma_\epsilon$ is shown in Figure 1 (results for other values of $L$ are in Appendix D). We observe that as we start injecting noise, the generalization error improves. But when the standard deviation of the noise keeps growing, the generalization error shows an increasing tendency. This behavior is consistent with the prediction made by our bound.

Figure 1: Generalization error for training with noise (mean ± standard error averaged over 5 runs).
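A minimal sketch of how such input-noise injection can be implemented during training is given below. The architecture, the cross-entropy loss (used here as a convenient stand-in for the margin loss $\Phi_\alpha$), and all hyperparameters are illustrative assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn as nn

class VanillaReLURNN(nn.Module):
    # Vanilla RNN with ReLU recurrence and a linear readout of the final output y_L,
    # mirroring equation (1); bias terms are disabled to match the paper's formulation.
    def __init__(self, d=50, m=128, k=2):
        super().__init__()
        self.rnn = nn.RNN(d, m, nonlinearity="relu", bias=False, batch_first=True)
        self.V = nn.Linear(m, k, bias=False)

    def forward(self, x):                  # x: (batch, L, d) embedded sequences
        h, _ = self.rnn(x)
        return self.V(h[:, -1, :])         # only the final output y_L is used

def train_step(model, opt, x, y, sigma_eps=0.1):
    x_noisy = x + sigma_eps * torch.randn_like(x)        # Gaussian input noise, train time only
    loss = nn.functional.cross_entropy(model(x_noisy), y)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cf. Section 3.4.2
    opt.step()
    return loss.item()
```

Weight decay can be added through the optimizer, e.g. `torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)`, which Section 3.4.2 relates to keeping $\beta_U$, $\beta_V$ and $\beta_W$ small.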
3.4 ANALYSIS OF GENERALIZATION BOUND

Our theoretical results have a number of implications for the generalization performance of RNNs, some of which have been observed in empirical studies. We summarize these implications as follows.

3.4.1 GENERALIZATION AND THE SMALLEST EIGENVALUE OF $\mathbb{E}(xx^T)$

According to our results, the generalization performance of RNNs is influenced by the smallest eigenvalue of $\mathbb{E}(xx^T)$. Since the smaller eigenvalues may correspond to high-frequency components of the input signal, our bound suggests that high-frequency information is potentially more difficult to generalize, which is consistent with intuition. There are many factors that impact the smallest eigenvalue and therefore the generalization performance of RNNs. In particular, we study the effect of the correlation between features on the generalization of RNNs. The exact answer to this problem may be complicated. Here we only make an initial attempt and claim that weaker correlation helps improve generalization; a non-rigorous argument is given as follows. Denote the covariance matrix $\mathbb{E}(xx^T)$ by $\Xi$, where each element $\xi_{ij}$ of $\Xi$ represents the covariance between features $i$ and $j$. Suppose that $\|\Xi - I\|_1 \le \zeta$ with $\zeta < 1$. By the definition of the matrix 1-norm, we immediately get $|\xi_{ii} - 1| + \sum_{j \ne i}|\xi_{ij}| \le \zeta$ for any $i$. Then, by a simple derivation, we obtain $\xi_{ii} - \sum_{j \ne i}|\xi_{ij}| \ge 1 - \zeta$ for any $i$. Applying the Gershgorin circle theorem, the smallest eigenvalue must be greater than or equal to $1 - \zeta$. Since the element $\xi_{ij}$ with $i \ne j$ represents the covariance between features $i$ and $j$, a weaker correlation between features $i$ and $j$ means a smaller value of $|\xi_{ij}|$, so a smaller $\zeta$ suffices to upper bound $\|\Xi - I\|_1$, which gives a larger lower bound on the smallest eigenvalue. Therefore the generalization bound becomes better.

3.4.2 GENERALIZATION AND TRAINABILITY

The generalization of RNNs also depends on the parameters $\beta_U$, $\beta_V$, $\beta_W$ and $r$, where $\beta_U$, $\beta_V$ and $\beta_W$ control the weight matrices and $r$ represents the gradient measure. This has a natural relationship with the training process. The normal procedure for training RNNs is to use weight decay for regularization and gradient clipping to avoid the exploding gradient problem (Bengio et al., 1994; Pascanu et al., 2013). From the perspective of generalization, these strategies decrease the values of the parameters $\beta_U$, $\beta_V$, $\beta_W$ and $r$ and thus improve generalization. For example, if $\beta_W < 1$, we have $\Lambda \le \frac{1}{(1-\beta_W)^2}$, and the second term $\frac{1}{n}\beta_V\beta_U\|X^T\|_1\Lambda$ in the generalization bound is small when $\beta_W$ is not too close to 1. Similarly, if $\lambda_{\min}(\mathbb{E}(xx^T))$ is very small, by setting the gradient clipping value in the training procedure, we can achieve a smaller value of $r$ and thus good generalization. Therefore our bound partially explains why training RNNs in this way achieves good performance in practice.

3.4.3 GENERALIZATION AND GRADIENT MEASURE

We are interested in how the gradient measure contributes to generalization. Suppose now that we only have the weights, i.e., the parameters $\beta_U$, $\beta_W$ and $\beta_V$, and the gradient measure parameterized by $r$ is unknown to us. To apply our bound, a natural idea is to infer the gradient measure parameter $r$ from the known weight parameters. An upper bound on $r$ in terms of $\beta_U$, $\beta_W$ and $\beta_V$ is given as follows. Under the same conditions as Theorem 2, if we further assume that the data $x$ satisfies $\|x^T\|_1 \le B$, then by the definition of $\|\cdot\|_{fs}$ in (2), for any $y \in \mathcal{Y}$, we have
$$\begin{aligned}
\big((\tau(\theta)c)_y\big)^2 &= \big((L+1)[V]_{y,\cdot}W^{L-1}Ux_1 + L[V]_{y,\cdot}W^{L-2}Ux_2 + \cdots + 2[V]_{y,\cdot}Ux_L\big)^2 \\
&\le \big(|(L+1)[V]_{y,\cdot}W^{L-1}Ux_1| + |L[V]_{y,\cdot}W^{L-2}Ux_2| + \cdots + |2[V]_{y,\cdot}Ux_L|\big)^2 \\
&\le \big((L+1)\beta_V\beta_U B\beta_W^{L-1} + L\beta_V\beta_U B\beta_W^{L-2} + \cdots + 2\beta_V\beta_U B\big)^2 \\
&= \Big(\beta_V\beta_U B\Big(\frac{\beta_W - \beta_W^L}{(1-\beta_W)^2} + \frac{2 - (L+1)\beta_W^L}{1-\beta_W}\Big)\Big)^2 \\
&\le \Big(\beta_V\beta_U B\Big(\frac{1}{(1-\beta_W)^2} + \frac{1}{1-\beta_W}\Big)\Big)^2
\end{aligned}$$
for $\beta_W < 1$, and $\big((\tau(\theta)c)_y\big)^2 \le \big(\beta_V\beta_U B\,\frac{3L+L^2}{2}\big)^2$ for $\beta_W = 1$. The above inequality holds for any $x$ and $y$, so we get
$$\|\theta\|_{fs} = \Big(\mathbb{E}\max_i[(\tau(\theta)c)_i]^2\Big)^{1/2} \le \beta_V\beta_U B\Big(\frac{1}{(1-\beta_W)^2} + \frac{1}{1-\beta_W}\Big) \quad \text{for } \beta_W < 1.$$
By replacing $r$ with $\beta_V\beta_U B\big(\frac{1}{(1-\beta_W)^2} + \frac{1}{1-\beta_W}\big)$, the generalization bound (3) still holds. But notice that this bound is obtained without any knowledge about the gradients. If we happen to know that the parameter $r$ is much smaller than $\beta_V\beta_U B\big(\frac{1}{(1-\beta_W)^2} + \frac{1}{1-\beta_W}\big)$, for example by setting the gradient clipping value to be small in the training process, this extra gradient measure provides us with a better generalization bound, especially when the smallest eigenvalue of $\mathbb{E}(xx^T)$ is small. Therefore the introduction of the Fisher-Rao norm can help eliminate the negative effect of $\lambda_{\min}(\mathbb{E}(xx^T))$ and thus improve the generalization bound.
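To close this section, the quantities entering Theorem 1/Theorem 2 and the fallback bound on $r$ above can be evaluated directly from the weights and the data. The sketch below is illustrative only: the constants (in particular the $2n$ denominator) follow the reconstruction of Theorem 1 given earlier, the empirical second-moment matrix is used as a stand-in for $\mathbb{E}(xx^T)$, and all names are hypothetical.

```python
import numpy as np

def norm_1T(A):
    # ||A^T||_1 under the paper's convention ||A||_1 = max_j sum_i |a_ij|,
    # i.e. the maximum absolute row sum of A.
    return np.abs(A).sum(axis=1).max()

def rademacher_terms(U, W, V, X, L, r=None, B=1.0):
    """Evaluate the two terms bounding the Rademacher complexity in Theorem 1 for given
    weights. X has shape (L*d, n) with examples as columns. If r is None, the fallback
    r <= beta_V beta_U B (1/(1-beta_W)^2 + 1/(1-beta_W)) from Section 3.4.3 is used
    (valid only for beta_W < 1); B is a user-supplied bound on ||x^T||_1."""
    n = X.shape[1]
    bU, bW, bV = norm_1T(U), norm_1T(W), norm_1T(V)
    if np.isclose(bW, 1.0):
        Lam = (L + L**2) / 2
    else:
        Lam = ((1 - bW**L) / (1 - bW) - L * bW**L) / (1 - bW)
    if r is None:
        assert bW < 1, "the fallback bound on r requires beta_W < 1"
        r = bV * bU * B * (1 / (1 - bW)**2 + 1 / (1 - bW))
    # Empirical proxy for lambda_min(E[x x^T]); if it is ~0 the first term blows up,
    # mirroring the positive-definiteness assumption of Theorem 2.
    lam_min = np.linalg.eigvalsh(X @ X.T / n).min()
    term_fisher = r * np.linalg.norm(X, "fro") / (2 * n) * np.sqrt(1.0 / lam_min)
    term_matrix = bV * bU * norm_1T(X) / n * Lam
    return term_fisher, term_matrix
```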
4 CONCLUSION

In this paper, we propose a new generalization bound for RNNs in terms of the matrix 1-norm and the Fisher-Rao norm, which has no explicit dependence on the size of networks. Based on the bound, we analyze the influence of the covariance of features on the generalization of RNNs and discuss how weight decay and gradient clipping in training can help improve generalization. While our bound is useful for analyzing the generalization performance of RNNs, it becomes vacuous because of an exponential term when $\|W^T\|_1 > 1$. It is of interest to obtain a tighter bound that avoids this exponential dependence. Moreover, our bound only applies to vanilla RNNs with ReLU activation, and extending the results to tanh and sigmoid activation functions or other variants of RNNs such as LSTM and MGU might be an interesting topic for future research.

ACKNOWLEDGMENTS

This work was supported by Australian Research Council Project FL-170100117. We would like to thank Gemeng Zhang from Tulane University for helpful discussions and all the reviewers for their constructive comments.

REFERENCES

Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD learn recurrent neural networks with provable generalization? arXiv preprint arXiv:1902.01028, 2019.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240-6249, 2017.

Justin Bayer, Christian Osendorfer, Daniela Korhammer, Nutan Chen, Sebastian Urban, and Patrick van der Smagt. On fast dropout and its applicability to recurrent networks. arXiv preprint arXiv:1311.0701, 2013.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks, 2019. URL https://openreview.net/forum?id=Skf-oo0qt7.

Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774-4778. IEEE, 2018.

Adji Bousso Dieng, Rajesh Ranganath, Jaan Altosaar, and David Blei. Noisin: Unbiased regularization for recurrent neural networks. In International Conference on Machine Learning, pp. 1251-1260, 2018.

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019-1027, 2016.

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 297-299. PMLR, 2018.

Alex Graves. Generating sequences with recurrent neural networks.
arXiv preprint arXiv:1308.0850, 2013.

Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Conference on Learning Theory, pp. 1064-1068, 2017.

Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700-1709, 2013.

Vitaly Kuznetsov, Mehryar Mohri, and U Syed. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, 2015.

Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.

Vladimir Alexandrovich Marchenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4):507-536, 1967.

Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 234-239. IEEE, 2012.

John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hygxb2CqKm.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422-2430, 2015a.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376-1401, 2015b.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Skz_WfbCZ.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BygfghAcYX.

Samet Oymak. Stochastic gradient descent learns state equations with nonlinear activations. arXiv preprint arXiv:1809.03019, 2018.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310-1318, 2013.

J Saniuk and I Rhodes. A matrix inequality associated with bounds on solutions of algebraic Riccati and Lyapunov equations. IEEE Transactions on Automatic Control, 32(8):739-740, 1987.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkwZSG-CZ.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017.
URL https://arxiv.org/abs/1611.03530.

Jiong Zhang, Qi Lei, and Inderjit Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In International Conference on Machine Learning, pp. 5801-5809, 2018.

A PROOFS IN SECTION 3.1

A.1 PROOF OF LEMMA 1

Proof. To begin with, by equation (1), we have $y_L = Vh_L$. The derivative of $y_L$ with respect to $v_{ab}$ is
$$\frac{\partial y_L}{\partial v_{ab}} = (0, \dots, [h_L]_b, \dots, 0)^T,$$
i.e., a $k$-dimensional vector with $a$-th element $[h_L]_b$ and all other elements zero. Multiplying both sides by $v_{ab}$ and summing up, we get
$$\sum_{a,b} \frac{\partial y_L}{\partial v_{ab}} v_{ab} = Vh_L = y_L.$$
The derivatives of $y_L$ with respect to $W$ and $U$ can be derived by the chain rule in a similar way:
$$\frac{\partial y_L}{\partial w_{ij}} = V\frac{\partial h_L}{\partial w_{ij}}, \qquad \frac{\partial h_L}{\partial w_{ij}} = \operatorname{diag}(\rho'(g_L))(0, \dots, [h_{L-1}]_j, \dots, 0)^T + \operatorname{diag}(\rho'(g_L))W\frac{\partial h_{L-1}}{\partial w_{ij}},$$
and
$$\frac{\partial y_L}{\partial u_{pq}} = V\frac{\partial h_L}{\partial u_{pq}}, \qquad \frac{\partial h_L}{\partial u_{pq}} = \operatorname{diag}(\rho'(g_L))(0, \dots, [x_L]_q, \dots, 0)^T + \operatorname{diag}(\rho'(g_L))W\frac{\partial h_{L-1}}{\partial u_{pq}},$$
where we use the property of the ReLU activation function that $\rho(z) = \rho'(z)z$. Summing up these terms immediately gives the following equality:
$$\begin{aligned}
\sum_{i,j}\frac{\partial h_L}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial h_L}{\partial u_{pq}}u_{pq}
&= \operatorname{diag}(\rho'(g_L))\Big(\sum_{i,j}(0, \dots, [h_{L-1}]_j, \dots, 0)^T w_{ij} + \sum_{p,q}(0, \dots, [x_L]_q, \dots, 0)^T u_{pq}\Big) + \operatorname{diag}(\rho'(g_L))W\Big(\sum_{i,j}\frac{\partial h_{L-1}}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial h_{L-1}}{\partial u_{pq}}u_{pq}\Big) \\
&= \operatorname{diag}(\rho'(g_L))(Wh_{L-1} + Ux_L) + \operatorname{diag}(\rho'(g_L))W\Big(\sum_{i,j}\frac{\partial h_{L-1}}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial h_{L-1}}{\partial u_{pq}}u_{pq}\Big) \\
&= h_L + \operatorname{diag}(\rho'(g_L))W\Big(\sum_{i,j}\frac{\partial h_{L-1}}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial h_{L-1}}{\partial u_{pq}}u_{pq}\Big).
\end{aligned}$$
For ease of exposition, define $f_L := \sum_{i,j}\frac{\partial h_L}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial h_L}{\partial u_{pq}}u_{pq}$. Then the above equality can be rewritten as $f_L = h_L + \operatorname{diag}(\rho'(g_L))Wf_{L-1}$. By induction, we have
$$f_L = h_L + \operatorname{diag}(\rho'(g_L))Wh_{L-1} + \operatorname{diag}(\rho'(g_L))W\operatorname{diag}(\rho'(g_{L-1}))Wh_{L-2} + \cdots + \operatorname{diag}(\rho'(g_L))W\operatorname{diag}(\rho'(g_{L-1}))\cdots W\operatorname{diag}(\rho'(g_2))Wh_1.$$
Multiplying both sides by $V$ and using some basic calculation, we can show that $Vf_L = Ly_L - \eta(\theta)(0, 1, \dots, L-1)^T$. Therefore,
$$\sum_{a,b}\frac{\partial y_L}{\partial v_{ab}}v_{ab} + \sum_{i,j}\frac{\partial y_L}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial y_L}{\partial u_{pq}}u_{pq} = y_L + Vf_L = (L+1)y_L - \eta(\theta)(0, 1, \dots, L-1)^T.$$
Substituting $y_L = \eta(\theta)(1, 1, \dots, 1)^T$ into the above equality leads to the desired result:
$$\sum_{a,b}\frac{\partial y_L}{\partial v_{ab}}v_{ab} + \sum_{i,j}\frac{\partial y_L}{\partial w_{ij}}w_{ij} + \sum_{p,q}\frac{\partial y_L}{\partial u_{pq}}u_{pq} = \eta(\theta)c.$$

A.2 PROOF OF LEMMA 2

Proof. Using the definition of the Fisher-Rao norm,
$$\|\theta\|_{fr}^2 = \mathbb{E}\big(\langle \theta, \nabla_\theta l(y_{L\theta}, y)\rangle^2\big) = \mathbb{E}\Big(\big(\theta^T \nabla_\theta y_{L\theta}(x)\, \nabla_{y_{L\theta}} l(y_{L\theta}, y)\big)^2\Big) = \mathbb{E}\big(\langle \nabla_\theta y_{L\theta}(x)^T\theta,\ \nabla_{y_{L\theta}} l(y_{L\theta}, y)\rangle^2\big).$$
By Lemma 1, we have $\nabla_\theta y_{L\theta}(x)^T\theta = \eta(\theta)c$. Substituting this into the above equality gives
$$\|\theta\|_{fr}^2 = \mathbb{E}\,\big\langle \eta(\theta)c, \nabla l(y_{L\theta}(x), y)\big\rangle^2.$$
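The identity of Lemma 1 can also be verified numerically with automatic differentiation. The following sketch uses small random weights and assumes $h_0 = 0$; it is an illustration, not part of the proof.

```python
import torch

torch.manual_seed(0)
d, m, k, L = 5, 8, 3, 4
# Small random leaf tensors for autograd.
U = (torch.randn(m, d) / d**0.5).requires_grad_(True)
W = (0.3 * torch.randn(m, m) / m**0.5).requires_grad_(True)
V = (torch.randn(k, m) / m**0.5).requires_grad_(True)
xs = [torch.randn(d) for _ in range(L)]

# Forward pass of (1) with h_0 = 0 (assumed), keeping g_t for eta(theta).
h, gs = torch.zeros(m), []
for t in range(L):
    g = U @ xs[t] + W @ h
    h = torch.relu(g)
    gs.append(g)
y_L = V @ h

# Left-hand side of Lemma 1: sum over all parameters of (d[y_L]_a / d theta_j) * theta_j,
# computed coordinate by coordinate with autograd.
params = (U, W, V)
lhs = torch.stack([
    sum((grad * p).sum() for grad, p in
        zip(torch.autograd.grad(y_L[a], params, retain_graph=True), params))
    for a in range(k)
])

# Right-hand side: eta(theta) c with c = (L+1, L, ..., 2).
c = torch.arange(L + 1, 1, -1, dtype=torch.float32)
eta_cols = []
for i in range(L):
    M = torch.diag((gs[L - 1] > 0).float())
    for t in range(L - 2, i - 1, -1):
        M = M @ W @ torch.diag((gs[t] > 0).float())
    eta_cols.append(V @ M @ U @ xs[i])
rhs = torch.stack(eta_cols, dim=1) @ c

print(torch.allclose(lhs, rhs, atol=1e-4))   # expected: True (ReLU is differentiable a.e.)
```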
B PROOFS IN SECTION 3.2

B.1 PROOF OF LEMMA 3

The proof of Lemma 3 relies on the following result from Saniuk & Rhodes (1987).

Proposition 1. Let $X, Y \in \mathbb{R}^{n \times n}$ with $Y$ symmetric and nonnegative definite. Then $\operatorname{trace}(XY) \le \|X\|_2\operatorname{trace}(Y)$.

Now we are ready to prove Lemma 3.

Proof. Denote $A := \mathbb{E}\big(((L+1)x_1^T, Lx_2^T, \dots, 2x_L^T)^T((L+1)x_1^T, Lx_2^T, \dots, 2x_L^T)\big)$. By the definition of $\|\theta\|_{fs}$, for any $y \in \mathcal{Y}$, we have
$$\|[\psi(\theta)]_{y,\cdot}^T\|_A^2 = [\psi(\theta)]_{y,\cdot}\,A\,[\psi(\theta)]_{y,\cdot}^T = \mathbb{E}\big([\psi(\theta)]_{y,\cdot}((L+1)x_1^T, \dots, 2x_L^T)^T((L+1)x_1^T, \dots, 2x_L^T)[\psi(\theta)]_{y,\cdot}^T\big) = \mathbb{E}\big([(\tau(\theta)c)_y]^2\big) \le \|\theta\|_{fs}^2,$$
where $[\psi(\theta)]_{y,\cdot}$ represents the $y$-th row of $\psi(\theta)$. On the other hand, from the definition of Rademacher complexity,
$$\begin{aligned}
\hat{R}_n(F_r) &= \mathbb{E}_\sigma\Big(\sup_{f \in F_r}\frac{1}{n}\sum_{i=1}^{n}f(x_i)\sigma_i\Big) = \mathbb{E}_\sigma\Big(\sup_{\theta, y}\frac{1}{n}\sum_{i=1}^{n}[\psi(\theta)x_i]_y\sigma_i\Big) = \mathbb{E}_\sigma\Big(\sup_{\theta, y}\Big\langle \frac{1}{n}\sum_{i=1}^{n}x_i\sigma_i,\ [\psi(\theta)]_{y,\cdot}^T\Big\rangle\Big) \\
&\le \mathbb{E}_\sigma\Big(\sup_{\theta, y}\Big\|\frac{1}{n}\sum_{i=1}^{n}x_i\sigma_i\Big\|_{A^{-1}}\big\|[\psi(\theta)]_{y,\cdot}^T\big\|_A\Big) \le r\,\mathbb{E}_\sigma\Big\|\frac{1}{n}\sum_{i=1}^{n}x_i\sigma_i\Big\|_{A^{-1}} \le r\sqrt{\mathbb{E}_\sigma\Big\langle\Big(\frac{1}{n}\sum_{i=1}^{n}x_i\sigma_i\Big)\Big(\frac{1}{n}\sum_{i=1}^{n}x_i^T\sigma_i\Big),\ A^{-1}\Big\rangle} \\
&= r\sqrt{\frac{1}{n^2}\Big\langle\sum_{i=1}^{n}x_ix_i^T,\ A^{-1}\Big\rangle} = \frac{r}{n}\sqrt{\langle XX^T, A^{-1}\rangle} = \frac{r}{n}\sqrt{\langle XX^T, (C\,\mathbb{E}(xx^T)\,C)^{-1}\rangle} = \frac{r}{n}\sqrt{\operatorname{trace}\big(XX^TC^{-1}(\mathbb{E}(xx^T))^{-1}C^{-1}\big)},
\end{aligned}$$
where the first inequality uses the Cauchy-Schwarz inequality and $C := \operatorname{diag}((L+1)I_d, LI_d, \dots, 2I_d)$. Since $C^{-1}$ and $\mathbb{E}(xx^T)$ are positive definite, we have
$$\operatorname{trace}\big(XX^TC^{-1}(\mathbb{E}(xx^T))^{-1}C^{-1}\big) = \operatorname{trace}\big(C^{-1}(\mathbb{E}(xx^T))^{-1}C^{-1}XX^T\big) \le \|C^{-1}(\mathbb{E}(xx^T))^{-1}C^{-1}\|_2\operatorname{trace}(XX^T) \le \|C^{-1}\|_2^2\,\|(\mathbb{E}(xx^T))^{-1}\|_2\operatorname{trace}(XX^T) \le \frac{1}{4}\,\frac{\|X\|_F^2}{\lambda_{\min}(\mathbb{E}(xx^T))},$$
where the first inequality is by Proposition 1 and the last step uses $\|C^{-1}\|_2 = \frac{1}{2}$ and $\operatorname{trace}(XX^T) = \|X\|_F^2$. Therefore,
$$\hat{R}_n(F_r) \le \frac{r\|X\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}}.$$

B.2 PROOF OF LEMMA 4

Proof. Denote $\bar{H}_t := UX_t + W\bar{H}_{t-1}$ with $\bar{H}_1 := UX_1$. By the definition of $H_L$, we have
$$H_L - \bar{H}_L = \rho(UX_L + WH_{L-1}) - UX_L - W\bar{H}_{L-1} = \rho(UX_L + WH_{L-1}) - UX_L - WH_{L-1} + W(H_{L-1} - \bar{H}_{L-1}) = H_L' + W(H_{L-1} - \bar{H}_{L-1}),$$
which by induction gives
$$H_L - \bar{H}_L = H_L' + WH_{L-1}' + \cdots + W^{L-1}(H_1 - \bar{H}_1) = \sum_{i=1}^{L}W^{L-i}H_i'.$$
So the difference term can be rewritten as
$$VH_L - \psi(\theta)X = VH_L - V\bar{H}_L = \sum_{i=1}^{L}VW^{L-i}H_i',$$
where the second equality uses $\psi(\theta)X = V\bar{H}_L$.

B.3 PROOF OF LEMMA 5

Proof. Using the Riesz-Thorin theorem, we have $\|H_t'\|_p \le \|H_t'\|_1^{1/p}\|H_t'\|_\infty^{1-1/p}$. Since $H_t' = \rho(G_t) - G_t$, by the definition of the induced $L_1$ and $L_\infty$ matrix norms we know $\|H_t'\|_1 \le \|G_t\|_1$ and $\|H_t'\|_\infty \le \|G_t\|_\infty$. Therefore,
$$\|H_t'\|_p \le \|H_t'\|_1^{1/p}\|H_t'\|_\infty^{1-1/p} \le \|G_t\|_1^{1/p}\|G_t\|_\infty^{1-1/p} \le m^{\frac{1}{p}(1-\frac{1}{p})}n^{\frac{1}{p}(1-\frac{1}{p})}\|G_t\|_p,$$
where the last inequality uses the basic facts about matrix norms that $\|G_t\|_1 \le m^{1-1/p}\|G_t\|_p$ and $\|G_t\|_\infty \le n^{1/p}\|G_t\|_p$.

B.4 PROOF OF LEMMA 6

Proof. For any $y \in \mathcal{Y}$, by Hölder's inequality, for any $p, q \ge 1$ with $\frac{1}{p} + \frac{1}{q} = 1$,
$$\frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma = \frac{1}{n}[V]_{y,\cdot}W^{L-i}H_i'\sigma \le \frac{1}{n}\big\|H_i'^T(W^T)^{L-i}[V]_{y,\cdot}^T\big\|_p\|\sigma\|_q \le \|[V]_{y,\cdot}^T\|_p\,\|W^T\|_p^{L-i}\,\|H_i'^T\|_p\,n^{1/q-1} \le \|[V]_{y,\cdot}^T\|_p\,\|W^T\|_p^{L-i}\,m^{\frac{1}{p}(1-\frac{1}{p})}\,n^{\frac{1}{p}(1-\frac{1}{p})+\frac{1}{q}-1}\,\|G_i^T\|_p.$$
In order to eliminate the dimension dependency on $m$ and simultaneously enjoy a faster convergence rate with respect to $n$, we choose $p = 1$. Then the above inequality reduces to
$$\frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma \le \|[V]_{y,\cdot}^T\|_1\|W^T\|_1^{L-i}n^{-1}\|G_i^T\|_1 \le \beta_V\beta_W^{L-i}n^{-1}\|G_i^T\|_1 \le \beta_V\beta_W^{L-i}n^{-1}\big(\beta_U\|X_i^T\|_1 + \beta_W\|H_{i-1}^T\|_1\big).$$
For $\|H_{i-1}^T\|_1$, we have
$$\|H_{i-1}^T\|_1 = \|\rho(X_{i-1}^TU^T + H_{i-2}^TW^T)\|_1 \le \|X_{i-1}^TU^T + H_{i-2}^TW^T\|_1 \le \beta_U\|X_{i-1}^T\|_1 + \beta_W\|H_{i-2}^T\|_1,$$
which by induction gives $\|H_{i-1}^T\|_1 \le \beta_U\sum_{j=1}^{i-1}\beta_W^{i-1-j}\|X_j^T\|_1$. Hence
$$\mathbb{E}_\sigma\sup_{\theta \in \Omega, y \in \mathcal{Y}}\frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma \le \frac{1}{n}\beta_V\beta_U\sum_{j=1}^{i}\beta_W^{L-j}\|X_j^T\|_1.$$
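Before moving on, Lemma 5 (used in the proof above) can be checked numerically for the common cases $p \in \{1, 2, \infty\}$; the sketch below uses arbitrary sizes for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 7, 11
G = rng.normal(size=(m, n))
H_prime = np.maximum(G, 0.0) - G                  # H'_t = rho(G_t) - G_t

def induced_norm(A, p):
    # Induced matrix p-norm for p in {1, 2, inf}.
    if p == 1:
        return np.abs(A).sum(axis=0).max()        # max absolute column sum
    if p == 2:
        return np.linalg.svd(A, compute_uv=False)[0]
    return np.abs(A).sum(axis=1).max()            # p = inf: max absolute row sum

for p in (1.0, 2.0, np.inf):
    factor = (m * n) ** ((1 / p) * (1 - 1 / p))   # m^{(1/p)(1-1/p)} n^{(1/p)(1-1/p)}
    ok = induced_norm(H_prime, p) <= factor * induced_norm(G, p)
    print(p, ok)                                  # expected: True for each p
```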
B.5 PROOF OF THEOREM 1

Proof. Using the notation introduced earlier, the empirical Rademacher complexity can be rewritten as
$$\mathbb{E}_\sigma\sup_{\theta \in \Omega', y \in \mathcal{Y}}\frac{1}{n}\sum_{i=1}^{n}[y_{L\theta}(x_i)]_y\sigma_i = \mathbb{E}_\sigma\sup_{\theta \in \Omega', y \in \mathcal{Y}}\frac{1}{n}[VH_L]_{y,\cdot}\,\sigma = \mathbb{E}_\sigma\sup_{\theta \in \Omega', y \in \mathcal{Y}}\frac{1}{n}\Big[\sum_{i=1}^{L}VW^{L-i}H_i' + \psi(\theta)X\Big]_{y,\cdot}\sigma \le \sum_{i=1}^{L}\mathbb{E}_\sigma\sup_{\theta \in \Omega, y \in \mathcal{Y}}\frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma + \mathbb{E}_\sigma\sup_{\|\theta\|_{fs} \le r, y \in \mathcal{Y}}\frac{1}{n}[\psi(\theta)X]_{y,\cdot}\,\sigma,$$
where the second equality uses Lemma 4 and the last inequality is due to the fact that $\Omega' \subseteq \Omega$ and $\Omega' \subseteq \{\theta : \|\theta\|_{fs} \le r\}$. For the last term, by Lemma 3, we have
$$\mathbb{E}_\sigma\sup_{\|\theta\|_{fs} \le r, y \in \mathcal{Y}}\frac{1}{n}[\psi(\theta)X]_{y,\cdot}\,\sigma \le \frac{r\|X\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}}.$$
The other terms can be handled by Lemma 6 in the following way:
$$\sum_{i=1}^{L}\mathbb{E}_\sigma\sup_{\theta \in \Omega, y \in \mathcal{Y}}\frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma \le \sum_{i=1}^{L}\frac{1}{n}\beta_V\beta_U\sum_{j=1}^{i}\beta_W^{L-j}\|X_j^T\|_1 \le \sum_{i=1}^{L}\frac{1}{n}\beta_V\beta_U\sum_{j=1}^{i}\beta_W^{L-j}\|X^T\|_1 = \frac{1}{n}\beta_V\beta_U\|X^T\|_1\,\frac{1}{1-\beta_W}\Big(\frac{1-\beta_W^L}{1-\beta_W} - L\beta_W^L\Big)$$
for $\beta_W \ne 1$, and
$$\sum_{i=1}^{L}\mathbb{E}_\sigma\sup_{\theta \in \Omega, y \in \mathcal{Y}}\frac{1}{n}[VW^{L-i}H_i']_{y,\cdot}\,\sigma \le \frac{1}{n}\beta_V\beta_U\|X^T\|_1\,\frac{L+L^2}{2}$$
for $\beta_W = 1$, where the second inequality uses the definition of the matrix 1-norm $\|\cdot\|_1$. Collecting all terms, we establish
$$\mathbb{E}_\sigma\sup_{\theta \in \Omega', y \in \mathcal{Y}}\frac{1}{n}\sum_{i=1}^{n}[y_{L\theta}(x_i)]_y\sigma_i \le \frac{r\|X\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T))}} + \frac{1}{n}\beta_V\beta_U\|X^T\|_1\Lambda.$$

C PROOFS IN SECTION 3.3

This section includes the full proofs of the generalization bound for training with random noise.

C.1 LIPSCHITZ PROPERTIES OF RELU NONLINEARITIES AND THE MARGIN OPERATOR

We first establish the Lipschitz properties of the ReLU activation function and the margin operator $M(y_L(x), y) := M_{y_L}(x, y)$.

Lemma 9. Let $\rho : \mathbb{R}^n \to \mathbb{R}^n$ be the coordinate-wise ReLU function. Then it is 1-Lipschitz with respect to $\|\cdot\|_p$ for any $p \ge 1$.

Proof. For any $x, x' \in \mathbb{R}^n$,
$$\|\rho(x) - \rho(x')\|_p = \Big(\sum_i|\rho(x)_i - \rho(x')_i|^p\Big)^{1/p} \le \Big(\sum_i|x_i - x_i'|^p\Big)^{1/p} = \|x - x'\|_p.$$

Lemma 10. For every $j$ and every $p \ge 1$, $M(\cdot, j)$ is 2-Lipschitz with respect to $\|\cdot\|_p$.

Proof. Let $y, y', j$ be given, and suppose that $M(y, j) \le M(y', j)$ without loss of generality. Select a coordinate $i \ne j$ so that $M(y, j) = y_j - y_i$. Then
$$M(y', j) - M(y, j) = y_j' - \max_{l \ne j}y_l' - y_j + y_i \le (y_j' - y_j) + (y_i - y_i') \le 2\|y - y'\|_\infty \le 2\|y - y'\|_p.$$

C.2 PROOF OF LEMMA 8

Proof. We prove this lemma by induction. Let $x = (x_1, x_2, \dots, x_L)$ and $x' = (x_1', x_2', \dots, x_L')$. Denote $g_t' := Ux_t' + Wh_{t-1}'$, $h_t' := \rho(g_t')$ and $y_t' := Vh_t'$. Then we have
$$\|h_t - h_t'\|_\infty = \|\rho(g_t) - \rho(g_t')\|_\infty \le \|g_t - g_t'\|_\infty = \|Ux_t + Wh_{t-1} - Ux_t' - Wh_{t-1}'\|_\infty \le \|U^T\|_1\|x_t - x_t'\|_\infty + \|W^T\|_1\|h_{t-1} - h_{t-1}'\|_\infty,$$
where the first inequality uses Lemma 9 and the second inequality uses basic properties of $\|\cdot\|_\infty$. By induction, we get
$$\|h_L - h_L'\|_\infty \le \sum_i\|U^T\|_1\|W^T\|_1^{L-i}\|x_i - x_i'\|_\infty.$$
Therefore,
$$\|y_L - y_L'\|_\infty = \|Vh_L - Vh_L'\|_\infty \le \|V^T\|_1\|h_L - h_L'\|_\infty \le \sum_i\|V^T\|_1\|U^T\|_1\|W^T\|_1^{L-i}\|x_i - x_i'\|_\infty.$$

C.3 PROOF OF THEOREM 3

We begin by establishing two auxiliary lemmas that we will need for the subsequent theorem.

Lemma 11. For every RNN in (1) with weight matrices $\theta = (U, W, V)$, the following inequality holds for any $x = (x_1, x_2, \dots, x_L)$ and $y$:
$$\big|\mathbb{E}_\epsilon[\Phi_\alpha(M_{y_L}(x, y)) - \Phi_\alpha(M_{y_L}(x + \epsilon, y))]\big| \le \frac{2}{\alpha}\sum_i\|V^T\|_1\|U^T\|_1\|W^T\|_1^{L-i}\,(\mathbb{E}_\epsilon\|\epsilon_i\|_\infty).$$

Proof. For any fixed $x$ and $y$,
$$\big|\mathbb{E}_\epsilon[\Phi_\alpha(M_{y_L}(x, y)) - \Phi_\alpha(M_{y_L}(x + \epsilon, y))]\big| \le \mathbb{E}_\epsilon\big|\Phi_\alpha(M_{y_L}(x, y)) - \Phi_\alpha(M_{y_L}(x + \epsilon, y))\big| \le \frac{2}{\alpha}\,\mathbb{E}_\epsilon\|y_L(x) - y_L(x + \epsilon)\|_\infty \le \frac{2}{\alpha}\sum_i\|V^T\|_1\|U^T\|_1\|W^T\|_1^{L-i}\,(\mathbb{E}_\epsilon\|\epsilon_i\|_\infty),$$
where the first inequality uses Jensen's inequality, the second inequality follows from the $\frac{1}{\alpha}$-Lipschitz property of $\Phi_\alpha(\cdot)$ and Lemma 10, and the last inequality is by Lemma 8.

Lemma 12. Let $\{\epsilon_i\}_{i=1}^{d}$ be an i.i.d. sequence of $N(0, \sigma^2)$ variables. Then $\mathbb{E}[\max_i|\epsilon_i|] \le \sigma\sqrt{2\log(2d)}$.

Proof. Define $Z := \max_i|\epsilon_i|$. For any $t > 0$, by Jensen's inequality, we have
$$e^{t\mathbb{E}(Z)} \le \mathbb{E}(e^{tZ}) = \mathbb{E}\big(\max_i e^{t|\epsilon_i|}\big) \le \sum_i\mathbb{E}(e^{t|\epsilon_i|}) = 2d\,\Phi(\sigma t)e^{\sigma^2t^2/2} \le 2de^{\sigma^2t^2/2},$$
where the second equality uses the definition of the normal distribution and $\Phi$ is the cumulative distribution function of the standard normal distribution. Taking logarithms on both sides and dividing by $t$, we get
$$\mathbb{E}(Z) \le \frac{\log(2d)}{t} + \frac{\sigma^2 t}{2}.$$
Choosing $t = \frac{\sqrt{2\log(2d)}}{\sigma}$ leads to the desired result.
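A quick Monte-Carlo check of Lemma 12 (with arbitrarily chosen dimension and noise level) looks as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 50, 0.1, 100_000               # dimension, noise std, Monte-Carlo samples
eps = sigma * rng.normal(size=(trials, d))
empirical = np.abs(eps).max(axis=1).mean()        # estimate of E[max_i |eps_i|]
bound = sigma * np.sqrt(2 * np.log(2 * d))        # the bound of Lemma 12
print(empirical, bound, empirical <= bound)       # expected: empirical <= bound
```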
We now return to the proof of Theorem 3.

Proof. For any RNN with weight matrices $\theta = (U, W, V)$ satisfying $\|V^T\|_1 \le \beta_V$, $\|W^T\|_1 \le \beta_W$, $\|U^T\|_1 \le \beta_U$, we have
$$\big|\mathbb{E}_{x,y}[\Phi_\alpha(M_{y_L}(x, y))] - \mathbb{E}_{x,\epsilon,y}[\Phi_\alpha(M_{y_L}(x + \epsilon, y))]\big| = \big|\mathbb{E}_{x,y}\big(\Phi_\alpha(M_{y_L}(x, y)) - \mathbb{E}_\epsilon[\Phi_\alpha(M_{y_L}(x + \epsilon, y))]\big)\big| \le \mathbb{E}_{x,y}\big|\Phi_\alpha(M_{y_L}(x, y)) - \mathbb{E}_\epsilon[\Phi_\alpha(M_{y_L}(x + \epsilon, y))]\big| \le \frac{2}{\alpha}\sum_i\beta_V\beta_U\beta_W^{L-i}(\mathbb{E}_\epsilon\|\epsilon_i\|_\infty),$$
where the first equality is due to the fact that the input $x$ and the noise $\epsilon$ are independent, the first inequality uses Jensen's inequality, and the last inequality follows from Lemma 11. The inequality above can be rewritten as
$$\mathbb{E}_{x,y}[\Phi_\alpha(M_{y_L}(x, y))] \le \mathbb{E}_{x,\epsilon,y}[\Phi_\alpha(M_{y_L}(x + \epsilon, y))] + \frac{2}{\alpha}\sum_i\beta_V\beta_U\beta_W^{L-i}(\mathbb{E}_\epsilon\|\epsilon_i\|_\infty).$$
For the first term on the right-hand side, by Theorem 2 applied to the noisy inputs $x + \epsilon$, with probability at least $1 - \delta$ the following holds:
$$\mathbb{E}_{x,\epsilon,y}[\Phi_\alpha(M_{y_L}(x + \epsilon, y))] \le \frac{1}{n}\sum_i\Phi_\alpha(M_{y_L}(x_i + \epsilon_i, y_i)) + \frac{4k}{\alpha}\left(\frac{r\|X + X_\epsilon\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}((x+\epsilon)(x+\epsilon)^T))}} + \frac{1}{n}\beta_V\beta_U\|X^T + X_\epsilon^T\|_1\Lambda\right) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$
Since $x$ and $\epsilon$ are independent and $\epsilon$ has zero mean, $\mathbb{E}((x+\epsilon)(x+\epsilon)^T) = \mathbb{E}(xx^T) + \mathbb{E}(\epsilon\epsilon^T) = \mathbb{E}(xx^T) + \sigma_\epsilon^2I$, so $\lambda_{\min}(\mathbb{E}((x+\epsilon)(x+\epsilon)^T)) = \lambda_{\min}(\mathbb{E}(xx^T)) + \sigma_\epsilon^2$, and the bound above becomes
$$\mathbb{E}_{x,\epsilon,y}[\Phi_\alpha(M_{y_L}(x + \epsilon, y))] \le \frac{1}{n}\sum_i\Phi_\alpha(M_{y_L}(x_i + \epsilon_i, y_i)) + \frac{4k}{\alpha}\left(\frac{r\|X + X_\epsilon\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T)) + \sigma_\epsilon^2}} + \frac{1}{n}\beta_V\beta_U\|X^T + X_\epsilon^T\|_1\Lambda\right) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$
Combining the above two inequalities leads to
$$\mathbb{E}[\mathbf{1}_{M_{y_L}(x,y) \le 0}] \le \mathbb{E}_{x,y}[\Phi_\alpha(M_{y_L}(x, y))] \le \frac{1}{n}\sum_i\Phi_\alpha(M_{y_L}(x_i + \epsilon_i, y_i)) + \frac{2}{\alpha}\sum_i\beta_V\beta_U\beta_W^{L-i}(\mathbb{E}_\epsilon\|\epsilon_i\|_\infty) + \frac{4k}{\alpha}\left(\frac{r\|X + X_\epsilon\|_F}{2n}\sqrt{\frac{1}{\lambda_{\min}(\mathbb{E}(xx^T)) + \sigma_\epsilon^2}} + \frac{1}{n}\beta_V\beta_U\|X^T + X_\epsilon^T\|_1\Lambda\right) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$
where the first inequality makes use of the fact that $\mathbf{1}_{u \le 0} \le \Phi_\alpha(u)$. The desired result is then obtained by substituting $\mathbb{E}_\epsilon\|\epsilon_i\|_\infty$ with $\sigma_\epsilon\sqrt{2\log(2d)}$ according to Lemma 12.

D SUPPLEMENTARY FIGURES

Figure 2 shows the behavior of the generalization error of RNNs as the standard deviation of the noise $\sigma_\epsilon$ varies, for sequence lengths $L = 200$ and $300$ trained on the IMDB dataset. As in the body of the paper, increasing $\sigma_\epsilon$ first improves the generalization error and then, after a certain point, harms the performance of RNNs.

Figure 2: Generalization error for training with noise (mean ± standard error averaged over 5 runs). The left and right panels are for $L = 200$ (smallest eigenvalue: $1 \times 10^{-4}$) and $L = 300$ (smallest eigenvalue: $2.5 \times 10^{-5}$), respectively.