# Stable Recurrent Models

Published as a conference paper at ICLR 2019

John Miller & Moritz Hardt
University of California, Berkeley
{miller_john,hardt}@berkeley.edu

ABSTRACT

Stability is a fundamental property of dynamical systems, yet to this date it has had little bearing on the practice of recurrent neural networks. In this work, we conduct a thorough investigation of stable recurrent models. Theoretically, we prove stable recurrent neural networks are well approximated by feed-forward networks for the purpose of both inference and training by gradient descent. Empirically, we demonstrate stable recurrent models often perform as well as their unstable counterparts on benchmark sequence tasks. Taken together, these findings shed light on the effective power of recurrent networks and suggest much of sequence learning happens, or can be made to happen, in the stable regime. Moreover, our results help to explain why in many cases practitioners succeed in replacing recurrent models by feed-forward models.

1 INTRODUCTION

Recurrent neural networks are a popular modeling choice for solving sequence learning problems arising in domains such as speech recognition and natural language processing. At the outset, recurrent neural networks are non-linear dynamical systems commonly trained to fit sequence data via some variant of gradient descent.

Stability is of fundamental importance in the study of dynamical systems. Surprisingly, however, stability has had little impact on the practice of recurrent neural networks. Recurrent models trained in practice do not satisfy stability in an obvious manner, suggesting that perhaps training happens in a chaotic regime. Moreover, the difficulty of training recurrent models has compelled practitioners to replace recurrent models with non-recurrent, feed-forward architectures, often successfully.

This state of affairs raises important unresolved questions. Is sequence modeling in practice inherently unstable? When and why are recurrent models really needed?

In this work, we shed light on both of these questions through a theoretical and empirical investigation of stability in recurrent models. We first prove stable recurrent models can be approximated by feed-forward networks. In particular, not only are the models equivalent for inference, they are also equivalent for training via gradient descent. While it is easy to contrive non-linear recurrent models that on some input sequence cannot be approximated by feed-forward models, our result implies such models are inevitably unstable. This means in particular they must have exploding gradients, which is in general an impediment to learnability via gradient descent.

Second, across a variety of different sequence tasks, we show how recurrent models can often be made stable without loss in performance. We also show models that are nominally unstable often operate in the stable regime on the data distribution. Combined with our first result, these observations help to explain why an increasingly large body of empirical research succeeds in replacing recurrent models with feed-forward models in important applications, including translation (Vaswani et al., 2017; Gehring et al., 2017), speech synthesis (Van Den Oord et al., 2016), and language modeling (Dauphin et al., 2017). While stability does not always hold in practice to begin with, it is often possible to obtain a high-performing stable model by imposing stability during training.
Our results also shed light on the effective representational properties of recurrent networks trained in practice. In particular, stable models cannot have long-term memory. Therefore, when stable and unstable models achieve similar results, either the task does not require long-term memory, or the unstable model does not have it.

1.1 CONTRIBUTIONS

In this work, we make the following contributions.

1. We present a generic definition of stable recurrent models in terms of non-linear dynamical systems and show how to ensure stability of several commonly used models. Previous work establishes stability for vanilla recurrent neural networks. We give new sufficient conditions for stability of long short-term memory (LSTM) networks. These sufficient conditions come with an efficient projection operator that can be used at training time to enforce stability.

2. We prove, under the stability assumption, feed-forward networks can approximate recurrent networks for purposes of both inference and training by gradient descent. While simple in the case of inference, the training result relies on non-trivial stability properties of gradient descent.

3. We conduct extensive experimentation on a variety of sequence benchmarks, show stable models often have comparable performance with their unstable counterparts, and discuss when, if ever, there is an intrinsic performance price to using stable models.

2 STABLE RECURRENT MODELS

In this section, we define stable recurrent models and illustrate the concept for various popular model classes. From a pragmatic perspective, stability roughly corresponds to the criterion that the gradients of the training objective do not explode over time. Common recurrent models can operate in both the stable and unstable regimes, depending on their parameters. To study stable variants of common architectures, we give sufficient conditions to ensure stability and describe how to efficiently enforce these conditions during training.

2.1 DEFINING STABLE RECURRENT MODELS

A recurrent model is a non-linear dynamical system given by a differentiable state-transition map φ_w : ℝ^n × ℝ^d → ℝ^n, parameterized by w ∈ ℝ^m. The hidden state h_t ∈ ℝ^n evolves in discrete time steps according to the update rule

    h_t = φ_w(h_{t-1}, x_t),    (1)

where the vector x_t ∈ ℝ^d is an arbitrary input provided to the system at time t. This general formulation allows us to unify many examples of interest. For instance, for a recurrent neural network, given weight matrices W and U, the state evolves according to

    h_t = φ_{W,U}(h_{t-1}, x_t) = tanh(W h_{t-1} + U x_t).

Recurrent models are typically trained using some variant of gradient descent. One natural, even if not strictly necessary, requirement for gradient descent to work is that the gradients of the training objective do not explode over time. Stable recurrent models are precisely the class of models where the gradients cannot explode. They thus constitute a natural class of models where gradient descent can be expected to work. In general, we define a stable recurrent model as follows.

Definition 1. A recurrent model φ_w is stable if there exists some λ < 1 such that, for any weights w ∈ ℝ^m, states h, h′ ∈ ℝ^n, and input x ∈ ℝ^d,

    ‖φ_w(h, x) − φ_w(h′, x)‖ ≤ λ ‖h − h′‖.    (2)

Equivalently, a recurrent model is stable if the map φ_w is λ-contractive in h. If φ_w is λ-stable, then ‖∇_h φ_w(h, x)‖ ≤ λ, and for a Lipschitz loss p, ‖∇_w p‖ is always bounded (Pascanu et al., 2013). Stable models are particularly well-behaved and well-justified from a theoretical perspective.
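As a concrete illustration of Definition 1, the following minimal check verifies the contraction property numerically for a tanh RNN cell whose recurrent matrix has spectral norm below one. The dimensions, random seed, and the 0.9 rescaling are illustrative choices of ours, not values taken from the paper.

```python
# Numerical check of Definition 1 for a tanh RNN: h_t = tanh(W h_{t-1} + U x_t).
# Since tanh is 1-Lipschitz, the map is lambda-contractive with lambda = ||W||_2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 32

W = rng.normal(size=(n, n))
W *= 0.9 / np.linalg.norm(W, 2)        # rescale so that ||W||_2 = 0.9 < 1
U = rng.normal(size=(n, d))

def phi(h, x):
    return np.tanh(W @ h + U @ x)

lam = np.linalg.norm(W, 2)             # candidate contraction constant
for _ in range(1000):
    h, h_prime = rng.normal(size=n), rng.normal(size=n)
    x = rng.normal(size=d)
    lhs = np.linalg.norm(phi(h, x) - phi(h_prime, x))
    assert lhs <= lam * np.linalg.norm(h - h_prime) + 1e-9   # inequality (2)
print("contraction constant lambda =", lam)
```

With ‖W‖_2 ≥ 1, the same argument no longer certifies any λ < 1, which is the unstable regime discussed next.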
For instance, at present, only stable linear dynamical systems are known to be learnable via gradient descent (Hardt et al., 2018). In unstable models, the gradients of the objective can explode, and it is a delicate matter to even show that gradient descent converges to a stationary point. The following proposition offers one such example. The proof is provided in the appendix.

Proposition 1. There exists an unstable system φ_w where gradient descent does not converge to a stationary point, and ‖∇_w p‖ → ∞ as the number of iterations N → ∞.

2.2 EXAMPLES OF STABLE RECURRENT MODELS

In this section, we provide sufficient conditions to ensure stability for several common recurrent models. These conditions offer a way to ensure learning happens in the stable regime: after each iteration of gradient descent, one imposes the corresponding stability condition via projection.

Linear dynamical systems and recurrent neural networks. Given a Lipschitz, point-wise non-linearity ρ and matrices W ∈ ℝ^{n×n} and U ∈ ℝ^{n×d}, the state-transition map for a recurrent neural network (RNN) is

    h_t = ρ(W h_{t-1} + U x_t).

If ρ is the identity, then the system is a linear dynamical system. Jin et al. (1994) show if ρ is L_ρ-Lipschitz, then the model is stable provided ‖W‖ < 1/L_ρ. Indeed, for any states h, h′, and any x,

    ‖ρ(Wh + Ux) − ρ(Wh′ + Ux)‖ ≤ L_ρ ‖Wh + Ux − Wh′ − Ux‖ ≤ L_ρ ‖W‖ ‖h − h′‖.

In the case of a linear dynamical system, the model is stable provided ‖W‖ < 1. Similarly, for the 1-Lipschitz tanh non-linearity, stability obtains provided ‖W‖ < 1. In the appendix, we verify the assumptions required by the theorems given in the next section for this example. Imposing this condition during training corresponds to projecting onto the spectral norm ball.

Long short-term memory networks. Long Short-Term Memory (LSTM) networks are another commonly used class of sequence models (Hochreiter & Schmidhuber, 1997). The state is a pair of vectors s = (c, h) ∈ ℝ^{2d}, and the model is parameterized by eight matrices, W_∗ ∈ ℝ^{d×d} and U_∗ ∈ ℝ^{d×n}, for ∗ ∈ {i, f, o, z}. The state-transition map φ_LSTM is given by

    f_t = σ(W_f h_{t-1} + U_f x_t)
    i_t = σ(W_i h_{t-1} + U_i x_t)
    o_t = σ(W_o h_{t-1} + U_o x_t)
    z_t = tanh(W_z h_{t-1} + U_z x_t)
    c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}
    h_t = o_t ⊙ tanh(c_t),

where ⊙ denotes elementwise multiplication, and σ is the logistic function.

We provide conditions under which the iterated system φ^r_LSTM = φ_LSTM ∘ ⋯ ∘ φ_LSTM is stable. Let f_∞ = sup_t ‖f_t‖_∞. If the weights W_f, U_f and inputs x_t are bounded, then f_∞ < 1 since σ(x) < 1 for any finite input. This means the next state c_t must forget a non-trivial portion of c_{t-1}. We leverage this phenomenon to give sufficient conditions for φ_LSTM to be contractive in the ℓ_∞ norm, which in turn implies the iterated system φ^r_LSTM is contractive in the ℓ_2 norm for r = O(log(d)). Let ‖W‖_∞ denote the induced ℓ_∞ matrix norm, which corresponds to the maximum absolute row sum max_i Σ_j |W_ij|.

Proposition 2. If ‖W_i‖_∞, ‖W_o‖_∞ < (1 − f_∞), ‖W_z‖_∞ ≤ (1/4)(1 − f_∞), ‖W_f‖_∞ < (1 − f_∞)^2, and r = O(log(d)), then the iterated system φ^r_LSTM is stable.

The proof is given in the appendix. The conditions given in Proposition 2 are fairly restrictive. Somewhat surprisingly, we show in the experiments that models satisfying these stability conditions still achieve good performance on a number of tasks. We leave it as an open problem to find different parameter regimes where the system is stable, as well as to resolve whether the original system φ_LSTM is stable. Imposing these conditions during training corresponds to simple row-wise normalization of the weight matrices and inputs; a sketch of this projection is given below.
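The row-wise normalization just mentioned can be implemented as a simple projection after each gradient step. The sketch below is ours and only illustrative: the function names, the dictionary layout of the weights, and the bound `f_sup` on sup_t ‖f_t‖_∞ are assumptions, and the concrete numerical bounds used in the experiments are listed in the appendix.

```python
# Hedged sketch: project the LSTM hidden-to-hidden matrices so they satisfy the
# sufficient conditions of Proposition 2. The induced l-infinity norm of a
# matrix is its maximum absolute row sum, so it suffices to rescale rows.
import torch

def project_linf(W: torch.Tensor, bound: float) -> torch.Tensor:
    """Rescale each row of W so its l1 norm is at most `bound`."""
    row_sums = W.abs().sum(dim=1, keepdim=True)
    scale = torch.clamp(bound / (row_sums + 1e-12), max=1.0)
    return W * scale

@torch.no_grad()
def enforce_lstm_stability(weights: dict, f_sup: float) -> None:
    """weights: hidden-to-hidden matrices keyed 'W_i', 'W_f', 'W_o', 'W_z'.
    f_sup: assumed upper bound on sup_t ||f_t||_inf, with f_sup < 1."""
    slack = 1.0 - f_sup
    weights["W_i"].copy_(project_linf(weights["W_i"], slack))
    weights["W_o"].copy_(project_linf(weights["W_o"], slack))
    weights["W_z"].copy_(project_linf(weights["W_z"], 0.25 * slack))
    weights["W_f"].copy_(project_linf(weights["W_f"], slack ** 2))
```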
More details are provided in Section 4 and the appendix.

3 STABLE RECURRENT MODELS HAVE FEED-FORWARD APPROXIMATIONS

In this section, we prove stable recurrent models can be well-approximated by feed-forward networks for the purposes of both inference and training by gradient descent. From a memory perspective, stable recurrent models are equivalent to feed-forward networks: both models use the same amount of context to make predictions. This equivalence has important consequences for sequence modeling in practice. When a stable recurrent model achieves satisfactory performance on some task, a feed-forward network can achieve similar performance. Consequently, if sequence learning in practice is inherently stable, then recurrent models may not be necessary. Conversely, if feed-forward models cannot match the performance of recurrent models, then sequence learning in practice is in the unstable regime.

3.1 TRUNCATED RECURRENT MODELS

For our purposes, the salient distinction between a recurrent and feed-forward model is that the latter has finite context. Therefore, we say a model is feed-forward if the prediction made by the model at step t is a function only of the inputs x_{t-k}, ..., x_t for some finite k. While there are many choices for a feed-forward approximation, we consider the simplest one: truncation of the system to some finite context k. In other words, the feed-forward approximation moves over the input sequence with a sliding window of length k, producing an output every time the sliding window advances by one step. Formally, for a context length k chosen in advance, we define the truncated model via the update rule

    h^k_t = φ_w(h^k_{t-1}, x_t),    h^k_{t-k} = 0.    (3)

Note that h^k_t is a function only of the previous k inputs x_{t-k}, ..., x_t. While this definition is perhaps an abuse of the term "feed-forward", the truncated model can be implemented as a standard autoregressive, depth-k feed-forward network, albeit with significant weight sharing.

Let f denote a prediction function that maps a state h_t to outputs f(h_t) = y_t. Let y^k_t denote the predictions from the truncated model. To simplify the presentation, the prediction function f is not parameterized. This is without loss of generality because it is always possible to fold the parameters into the system φ_w itself. In the sequel, we study ‖y_t − y^k_t‖ both during and after training.

3.2 APPROXIMATION DURING INFERENCE

Suppose we train a full recurrent model φ_w and obtain a prediction y_t. For an appropriate choice of context k, the truncated model makes essentially the same prediction y^k_t as the full recurrent model. To show this result, we first control the difference between the hidden states of both models.

Lemma 1. Assume φ_w is λ-contractive in h and L_x-Lipschitz in x. Assume the input sequence satisfies ‖x_t‖ ≤ B_x for all t. If the truncation length k ≥ log_{1/λ}(L_x B_x / ((1 − λ)ε)), then the difference in hidden states satisfies ‖h_t − h^k_t‖ ≤ ε.

Lemma 1 effectively says stable models do not have long-term memory: distant inputs do not change the states of the system. A proof is given in the appendix. If the prediction function is Lipschitz, Lemma 1 immediately implies the recurrent and truncated model make nearly identical predictions.

Proposition 3. If φ_w is an L_x-Lipschitz and λ-contractive map, f is L_f-Lipschitz, and the truncation length k ≥ log_{1/λ}(L_f L_x B_x / ((1 − λ)ε)), then ‖y_t − y^k_t‖ ≤ ε.
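To make the truncation argument concrete, the following small experiment, with illustrative constants of our choosing, compares the hidden state of a stable tanh RNN against its k-truncated version; the gap shrinks geometrically in k, as Lemma 1 predicts.

```python
# Full recurrent model vs. k-truncated model (equation (3)) for a stable
# tanh RNN. The difference in hidden states decays roughly like lambda**k.
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 32, 16, 200
W = rng.normal(size=(n, n))
W *= 0.8 / np.linalg.norm(W, 2)          # lambda = 0.8
U = rng.normal(size=(n, d))
xs = rng.normal(size=(T, d))

def run(start):
    """Run h_t = tanh(W h_{t-1} + U x_t) over xs[start:], from a zero state."""
    h = np.zeros(n)
    for t in range(start, T):
        h = np.tanh(W @ h + U @ xs[t])
    return h

h_full = run(0)                           # full recurrent model
for k in (5, 20, 80):
    h_trunc = run(T - k)                  # truncated model sees only the last k inputs
    print(k, np.linalg.norm(h_full - h_trunc))
```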
3.3 APPROXIMATION DURING TRAINING VIA GRADIENT DESCENT

Equipped with our inference result, we turn towards optimization. We show gradient descent for stable recurrent models finds essentially the same solutions as gradient descent for truncated models. Consequently, both the recurrent and truncated models found by gradient descent make essentially the same predictions.

Our proof technique is to initialize both the recurrent and truncated models at the same point and track the divergence in weights throughout the course of gradient descent. Roughly, we show if k is on the order of log(N/ε), then after N steps of gradient descent, the difference in the weights between the recurrent and truncated models is at most ε. Even if the gradients are similar for both models at the same point, it is a priori possible that slight differences in the gradients accumulate over time and lead to divergent weights where no meaningful comparison is possible. Building on similar techniques as Hardt et al. (2016), we show that gradient descent itself is stable, and this type of divergence cannot occur.

Our gradient descent result requires two essential lemmas. The first bounds the difference in gradient between the full and the truncated model. The second establishes the gradient map of both the full and truncated models is Lipschitz. We defer proofs of both lemmas to the appendix. Let p_T denote the loss function evaluated on the recurrent model after T time steps, and define p^k_T similarly for the truncated model. Assume there is some compact, convex domain Θ ⊂ ℝ^m so that the map φ_w is stable for all choices of parameters w ∈ Θ.

Lemma 2. Assume p (and therefore p^k) is Lipschitz and smooth. Assume φ_w is smooth, λ-contractive, and Lipschitz in x and w. Assume the inputs satisfy ‖x_t‖ ≤ B_x. Then

    ‖∇_w p_T − ∇_w p^k_T‖ ≤ γ k λ^k,

where γ = O(B_x (1 − λ)^{-2}), suppressing dependence on the Lipschitz and smoothness parameters.

Lemma 3. For any w, w′ ∈ Θ, suppose φ_w is smooth, λ-contractive, and Lipschitz in w. If p is Lipschitz and smooth, then

    ‖∇_w p_T(w) − ∇_w p_T(w′)‖ ≤ β ‖w − w′‖,

where β = O((1 − λ)^{-3}), suppressing dependence on the Lipschitz and smoothness parameters.

Let w^i_recurr be the weights of the recurrent model on step i and define w^i_trunc similarly for the truncated model. At initialization, w^0_recurr = w^0_trunc. For k sufficiently large, Lemma 2 guarantees the difference between the gradient of the recurrent and truncated models is negligible. Therefore, after a gradient update, ‖w^1_recurr − w^1_trunc‖ is small. Lemma 3 then guarantees that this small difference in weights does not lead to large differences in the gradient on the subsequent time step. For an appropriate choice of learning rate, formalizing this argument leads to the following proposition.

Proposition 4. Under the assumptions of Lemmas 2 and 3, for compact, convex Θ, after N steps of projected gradient descent with step size α_t = α/t,

    ‖w^N_recurr − w^N_trunc‖ ≤ α γ k λ^k N^{αβ+1}.

The decaying step size in our theorem is consistent with the regime in which gradient descent is known to be stable for non-convex training objectives (Hardt et al., 2016). While the decay is faster than many learning rates encountered in practice, classical results nonetheless show that with this learning rate gradient descent still converges to a stationary point; see p. 119 in Bertsekas (1999) and references therein. In the appendix, we give empirical evidence the O(1/t) rate is necessary for our theorem and show examples of stable systems trained with constant or O(1/√t) rates that do not satisfy our bound. Critically, the bound in Proposition 4 goes to 0 as k → ∞.
In particular, if we take α = 1 and k = Ω(log(γNβ/ε)), then after N steps of projected gradient descent, ‖w^N_recurr − w^N_trunc‖ ≤ ε. For this choice of k, we obtain the main theorem. The proof is left to the appendix.

Theorem 1. Let p be Lipschitz and smooth. Assume φ_w is smooth, λ-contractive, and Lipschitz in x and w. Assume the inputs are bounded, and the prediction function f is L_f-Lipschitz. If k = Ω(log(γNβ/ε)), then after N steps of projected gradient descent with step size α_t = 1/t, ‖y_T − y^k_T‖ ≤ ε.

4 EXPERIMENTS

In the experiments, we show stable recurrent models can achieve solid performance on several benchmark sequence tasks. Namely, we show unstable recurrent models can often be made stable without a loss in performance. In some cases, there is a small gap between the performance of unstable and stable models. We analyze whether this gap is indicative of a price of stability and show the unstable models involved are stable in a data-dependent sense.

We consider four benchmark sequence problems: word-level language modeling, character-level language modeling, polyphonic music modeling, and slot-filling.

Language modeling. In language modeling, given a sequence of words or characters, the model must predict the next word or character. For character-level language modeling, we train and evaluate models on Penn Treebank (Marcus et al., 1993). To increase the coverage of our experiments, we train and evaluate the word-level language models on the Wikitext-2 dataset, which is twice as large as Penn Treebank and features a larger vocabulary (Merity et al., 2017). Performance is reported using bits-per-character for character-level models and perplexity for word-level models.

Polyphonic music modeling. In polyphonic music modeling, a piece is represented as a sequence of 88-bit binary codes corresponding to the 88 keys on a piano, with a 1 indicating a key that is pressed at a given time. Given a sequence of codes, the task is to predict the next code. We evaluate our models on JSB Chorales, a polyphonic music dataset consisting of 382 harmonized chorales by J.S. Bach (Allan & Williams, 2005). Performance is measured using negative log-likelihood.

Slot-filling. In slot filling, the model takes as input a query like "I want to Boston on Monday" and outputs a class label for each word in the input, e.g. "Boston" maps to Departure City and "Monday" maps to Departure Time. We use the Airline Travel Information Systems (ATIS) benchmark and report the F1 score for each model (Price, 1990).

4.2 COMPARING STABLE AND UNSTABLE MODELS

For each task, we first train an unconstrained RNN and an unconstrained LSTM. All the hyperparameters are chosen via grid-search to maximize the performance of the unconstrained model. For consistency with our theoretical results in Section 3 and the stability conditions in Section 2.2, both models have a single recurrent layer and are trained using plain SGD. In each case, the resulting model is unstable. We then retrain the best models using projected gradient descent to enforce stability, without retuning the hyperparameters. In the RNN case, we constrain ‖W‖ < 1: after each gradient update, we project W onto the spectral norm ball by computing the SVD and thresholding the singular values to lie in [0, 1). In the LSTM case, after each gradient update, we normalize each row of the weight matrices to satisfy the sufficient conditions for stability given in Section 2.2. A sketch of the RNN projection step appears below; further details are given in the appendix.
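The projection step for the RNN can be written in a few lines; the sketch below is ours, assuming `model.W` holds the hidden-to-hidden matrix and using a 0.999 threshold to keep the singular values strictly below one.

```python
# Hedged sketch of projected SGD for a stable RNN: after each parameter update,
# project the recurrent matrix onto the spectral-norm ball via its SVD.
import torch

@torch.no_grad()
def project_spectral(W: torch.Tensor, max_sv: float = 0.999) -> None:
    """Threshold the singular values of W in place so that ||W||_2 <= max_sv < 1."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W.copy_(U @ torch.diag(torch.clamp(S, max=max_sv)) @ Vh)

# Inside the training loop:
#   loss.backward()
#   optimizer.step()
#   project_spectral(model.W)   # keeps the model in the stable regime
```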
Stable and unstable models achieve similar performance. Table 1 gives a comparison of the performance between stable and unstable RNNs and LSTMs on each of the different tasks. Each of the reported metrics is computed on the held-out test set. We also show a representative comparison of learning curves for word-level language modeling and polyphonic music modeling in Figures 1(a) and 1(b).

Figure 1: Stable and unstable variants of common recurrent architectures achieve similar performance across a range of different sequence tasks. (a) Training perplexity versus number of iterations for word-level language modeling on Wikitext-2. (b) Training loss versus number of iterations for polyphonic music modeling on the Bach chorales. Each panel shows curves for the stable RNN, unstable RNN, stable LSTM, and unstable LSTM.

Table 1: Comparison of stable and unstable models on a variety of sequence modeling tasks. For all the tasks, stable and unstable RNNs achieve the same performance. For polyphonic music and slot-filling, stable and unstable LSTMs achieve the same results. On language modeling, there is a small gap between stable and unstable LSTMs. We discuss this in Section 4.3. Performance is evaluated on the held-out test set. For negative log-likelihood (nll), bits per character (bpc), and perplexity, lower is better. For F1 score, higher is better.

| Sequence Task | Dataset (measure) | RNN Unstable | RNN Stable | LSTM Unstable | LSTM Stable |
|---|---|---|---|---|---|
| Polyphonic Music | JSB Chorales (nll) | 8.9 | 8.9 | 8.5 | 8.5 |
| Slot-Filling | ATIS (F1 score) | 94.7 | 94.7 | 95.1 | 94.6 |
| Word-level LM | Wikitext-2 (perplexity) | 146.7 | 143.5 | 95.7 | 113.2 |
| Character-level LM | Penn Treebank (bpc) | 1.8 | 1.9 | 1.4 | 1.9 |

Across all the tasks we considered, stable and unstable RNNs have roughly the same performance. Stable RNNs and LSTMs achieve results comparable to published baselines on slot-filling (Mesnil et al., 2015) and polyphonic music modeling (Bai et al., 2018). On word- and character-level language modeling, both stable and unstable RNNs achieve results comparable to Bai et al. (2018). On the language modeling tasks, however, there is a gap between stable and unstable LSTM models. Given the restrictive conditions we place on the LSTM to ensure stability, it is surprising they work as well as they do. Weaker conditions ensuring stability of the LSTM could reduce this gap. It is also possible imposing stability comes at a cost in representational capacity required for some tasks.

4.3 WHAT IS THE PRICE OF STABILITY IN SEQUENCE MODELING?

The gap between stable and unstable LSTMs on language modeling raises the question of whether there is an intrinsic performance cost for using stable models on some tasks. If we measure stability in a data-dependent fashion, then the unstable LSTM language models are stable, indicating this gap is illusory. However, in some cases with short sequences, instability can offer modeling benefits.

LSTM language models are stable in a data-dependent way. Our notion of stability is conservative and requires stability to hold for every input and pair of hidden states. If we instead consider a weaker, data-dependent notion of stability, the word and character-level LSTM models are stable (in the iterated sense of Proposition 2). In particular, we compute the stability parameter only using input sequences from the data. Furthermore, we only evaluate stability on hidden states reachable via gradient descent.
More precisely, to estimate λ, we run gradient ascent to find worst-case hidden states h, h′ that maximize ‖φ_w(h, x) − φ_w(h′, x)‖ / ‖h − h′‖. More details are provided in the appendix. The data-dependent definition given above is a useful diagnostic: when the sufficient stability conditions fail to hold, the data-dependent condition addresses whether the model is still operating in the stable regime. Moreover, when the input representation is fixed during training, our theoretical results go through without modification when using the data-dependent definition.

Using the data-dependent measure, in Figure 2(a), we show the iterated character-level LSTM, φ^r_LSTM, is stable for r ≥ 80 iterations. A similar result holds for the word-level language model for r ≥ 100. These findings are consistent with experiments in Laurent & von Brecht (2017), which find LSTM trajectories converge after approximately 70 steps only when evaluated on sequences from the data. For language models, the price of stability is therefore much smaller than the gap in Table 1 suggests: even the unstable models are operating in the stable regime on the data distribution.

Unstable systems can offer performance improvements for short time horizons. When sequences are short, training unstable models is less difficult because exploding gradients are less of an issue. In these cases, unstable models can offer performance gains. To demonstrate this, we train truncated unstable models on the polyphonic music task for various values of the truncation parameter k. In Figure 2(b), we simultaneously plot the performance of the unstable model and the stability parameter λ of the converged model for each k. For short sequences, the final model is more unstable (λ ≈ 3.5) and achieves a better test log-likelihood. For longer sequence lengths, λ decreases closer to the stable regime (λ ≈ 1.5), and this improved test log-likelihood disappears.

Figure 2: What is the intrinsic price of stability? For language modeling, we show the unstable LSTMs are actually stable in a weaker, data-dependent sense. On the other hand, for polyphonic music modeling with short sequences, instability can improve model performance. (a) Data-dependent stability of character-level language models; the iterated LSTM refers to the iterated system φ^r_LSTM = φ_LSTM ∘ ⋯ ∘ φ_LSTM. (b) Stability parameter versus truncation length for polyphonic music: unstable models can boost performance for short sequences.

4.4 UNSTABLE MODELS OPERATE IN THE STABLE REGIME

In the previous section, we showed nominally unstable models often satisfy a data-dependent notion of stability. In this section, we offer further evidence unstable models are operating in the stable regime. These results further help explain why stable and unstable models perform comparably in experiments.

Vanishing gradients. Stable models necessarily have vanishing gradients, and indeed this is a key ingredient in the proof of our training-time approximation result. For both word and character-level language models, we find both unstable RNNs and LSTMs also exhibit vanishing gradients. In Figures 3(a) and 3(b), we plot the average gradient of the loss at time t + i with respect to the input at time t, ‖∇_{x_t} p_{t+i}‖, as t ranges over the training set.
For either language modeling task, the LSTM and the RNN suffer from limited sensitivity to distant inputs at initialization and throughout training. The gradients of the LSTM vanish more slowly than those of the RNN, but both models exhibit the same qualitative behavior.

Figure 3: Unstable word and character-level language models exhibit vanishing gradients. We plot the norm of the gradient with respect to the inputs, ‖∇_{x_t} p_{t+i}‖, as the distance i between the input and the loss grows, averaged over the entire training set, for (a) word-level language modeling on Wikitext-2 and (b) character-level language modeling on Penn Treebank. The gradient vanishes for moderate values of i for both RNNs and LSTMs, though the decay is slower for LSTMs.

Truncating unstable models. The results in Section 3 show stable models can be truncated without loss of performance. In practice, unstable models can also be truncated without performance loss. In Figures 4(a) and 4(b), we show the performance of both LSTMs and RNNs for various values of the truncation parameter k on word-level language modeling and polyphonic music modeling. Initially, increasing k increases performance because the model can use more context to make predictions. However, in both cases, there are diminishing returns to larger values of the truncation parameter k. LSTMs are unaffected by longer truncation lengths, whereas the performance of RNNs slightly degrades as k becomes very large, possibly due to training instability. In either case, diminishing returns to performance for large values of k means truncation, and therefore feed-forward approximation, is possible even for these unstable models.

Figure 4: Effect of truncating unstable models on (a) word-level language modeling and (b) polyphonic music modeling. On both language and music modeling, RNNs and LSTMs exhibit diminishing returns for large values of the truncation parameter k. For LSTMs, larger k does not affect performance, whereas for unstable RNNs, large k slightly decreases performance.

Proposition 4 holds for unstable models. In stable models, Proposition 4 in Section 3 ensures the distance between the weight matrices ‖w_recurr − w_trunc‖ grows slowly as training progresses, and this rate decreases as k becomes large. In Figures 5(a) and 5(b), we show a similar result holds empirically for unstable word-level language models. All the models are initialized at the same point, and we track the distance between the hidden-to-hidden matrices W as training progresses. Training the full recurrent model is impractical, so we assume k = 65 well captures the full recurrent model. In Figures 5(a) and 5(b), we plot ‖W_k − W_65‖ for k ∈ {5, 10, 15, 25, 35, 50, 64} throughout training. As suggested by Proposition 4, after an initial rapid increase in distance, ‖W_k − W_65‖ grows slowly.
Moreover, there is a diminishing return to choosing larger values of the truncation parameter k in terms of the accuracy of the approximation.

Figure 5: Qualitative version of Proposition 4 for unstable, word-level language models, shown for (a) an unstable RNN and (b) an unstable LSTM on Wikitext-2. We assume k = 65 well captures the full recurrent model and plot ‖w_trunc − w_recurr‖ = ‖W_k − W_65‖ as training proceeds, for k ∈ {5, 10, 15, 25, 35, 50, 64}, where W denotes the recurrent weights. As Proposition 4 suggests, this quantity grows slowly as training proceeds, and the rate of growth decreases as k increases.

5 ARE RECURRENT MODELS TRULY NECESSARY?

Our experiments show recurrent models trained in practice operate in the stable regime, and our theoretical results show stable recurrent models are approximable by feed-forward networks. As a consequence, we conjecture recurrent networks trained in practice are always approximable by feed-forward networks.

Even with this conjecture, we cannot yet conclude recurrent models as commonly conceived are unnecessary. First, our present proof techniques rely on truncated versions of recurrent models, and truncated recurrent architectures like LSTMs may provide useful inductive bias on some problems. Moreover, implementing the truncated approximation as a feed-forward network increases the number of weights by a factor of k over the original recurrent model. Declaring recurrent models truly superfluous would require both finding more parsimonious feed-forward approximations and proving natural feed-forward models, e.g. fully connected networks or CNNs, can approximate stable recurrent models during training. This remains an important question for future work.

6 RELATED WORK

Learning dynamical systems with gradient descent has been a recent topic of interest in the machine learning community. Hardt et al. (2018) show gradient descent can efficiently learn a class of stable, linear dynamical systems, and Oymak (2018) shows gradient descent learns a class of stable, non-linear dynamical systems. Work by Sedghi & Anandkumar (2016) gives a moment-based approach for learning some classes of stable non-linear recurrent neural networks. Our work explores the theoretical and empirical consequences of the stability assumption made in these works. In particular, our empirical results show models trained in practice can be made closer to those currently being analyzed theoretically without large performance penalties.

For linear dynamical systems, Tu et al. (2017) exploit the connection between stability and truncation to learn a truncated approximation to the full stable system. Their approximation result is the same as our inference result for linear dynamical systems, and we extend this result to the non-linear setting. We also analyze the impact of truncation on training with gradient descent. Our training-time analysis builds on the stability analysis of gradient descent in Hardt et al. (2016), but interestingly uses it for an entirely different purpose. Results of this kind are completely new to our knowledge.
For RNNs, the link between vanishing and exploding gradients and ‖W‖ was identified in Pascanu et al. (2013). For 1-layer RNNs, Jin et al. (1994) give sufficient conditions for stability in terms of the norm ‖W‖ and the Lipschitz constant of the non-linearity. Our work additionally considers LSTMs and provides new sufficient conditions for stability. Moreover, we study the consequences of stability in terms of feed-forward approximation.

A number of recent works have sought to avoid vanishing and exploding gradients by ensuring the system is an isometry, i.e. λ = 1. In the RNN case, this amounts to constraining ‖W‖ = 1 (Arjovsky et al., 2016; Wisdom et al., 2016; Jing et al., 2017; Mhammedi et al., 2017; Jose et al., 2018). Vorontsov et al. (2017) observe strictly requiring ‖W‖ = 1 reduces performance on several tasks, and instead propose maintaining ‖W‖ ∈ [1 − ε, 1 + ε]. Zhang et al. (2018) maintain this soft-isometry constraint using a parameterization based on the SVD that obviates the need for the projection step used in our stable-RNN experiments. Kusupati et al. (2018) sidestep these issues and stabilize training using a residual parameterization of the model. At present, these unitary models have not yet seen widespread use, and our work shows much of the sequence learning in practice, even with nominally unstable models, actually occurs in the stable regime.

From an empirical perspective, Laurent & von Brecht (2017) introduce a non-chaotic recurrent architecture and demonstrate it can perform as well as more complex models like LSTMs. Bai et al. (2018) conduct a detailed evaluation of recurrent and convolutional, feed-forward models on a variety of sequence modeling tasks. In diverse settings, they find feed-forward models outperform their recurrent counterparts. Their experiments are complementary to ours; we find recurrent models can often be replaced with stable recurrent models, which we show are equivalent to feed-forward networks.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814 and a generous grant from the AWS Cloud Credits for Research program.

REFERENCES

Moray Allan and Christopher Williams. Harmonising chorales by probabilistic inference. In Advances in Neural Information Processing Systems (NeurIPS), pp. 25-32, 2005.

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning (ICML), pp. 1120-1128, 2016.

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International Conference on Machine Learning (ICML), pp. 933-941, 2017.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), pp. 1243-1252, 2017.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), pp. 1225-1234, 2016.

Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems.
The Journal of Machine Learning Research, 19(1):1025-1068, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Liang Jin, Peter N. Nikiforuk, and Madan M. Gupta. Absolute stability conditions for discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, 5(6):954-964, 1994.

Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In International Conference on Machine Learning (ICML), pp. 1733-1741, 2017.

Cijo Jose, Moustapha Cisse, and Francois Fleuret. Kronecker recurrent units. In International Conference on Machine Learning (ICML), pp. 2380-2389, 2018.

Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain, and Manik Varma. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 9031-9042, 2018.

Thomas Laurent and James von Brecht. A recurrent neural network without chaos. In International Conference on Learning Representations (ICLR), 2017.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR), 2018.

Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530-539, 2015.

Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In International Conference on Machine Learning (ICML), pp. 2401-2409, 2017.

Samet Oymak. Stochastic gradient descent learns state equations with nonlinear activations. arXiv preprint arXiv:1809.03019, 2018.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), pp. 1310-1318, 2013.

Patti J. Price. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, 1990.

Hanie Sedghi and Anima Anandkumar. Training input-output recurrent neural networks through spectral methods. CoRR, abs/1603.00954, 2016.

Stephen Tu, Ross Boczar, Andrew Packard, and Benjamin Recht. Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.

Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in Neural Information Processing Systems (NeurIPS), pp. 6000-6010, 2017.

Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning (ICML), pp. 3570-3578, 2017.

Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4880-4888, 2016.

Jiong Zhang, Qi Lei, and Inderjit Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In International Conference on Machine Learning (ICML), pp. 5806-5814, 2018.

A PROOFS FROM SECTION 2

A.1 GRADIENT DESCENT ON UNSTABLE SYSTEMS NEED NOT CONVERGE

Proof of Proposition 1. Consider a scalar linear dynamical system

    h_t = a h_{t-1} + b x_t,    (4)
    ŷ_t = h_t,    (5)

where h_0 = 0, a, b ∈ ℝ are parameters, x_t, y_t ∈ ℝ are elements of the input-output sequence {(x_t, y_t)}_{t=1}^T, T is the sequence length, and ŷ_t is the prediction at time t. Stability of the above system corresponds to |a| < 1.

Suppose (x_t, y_t) = (1, 1) for t = 1, ..., T. Then the desired system (4) simply computes the identity mapping. Suppose we use the squared loss ℓ(y_t, ŷ_t) = (1/2)(y_t − ŷ_t)², and suppose further b = 1, so the problem reduces to learning a = 0. We first compute the gradient. Compactly write

    h_t = Σ_{i=0}^{t-1} a^i b = (1 − a^t)/(1 − a).

Let δ_T = (ŷ_T − y_T). The gradient for step T is then

    (d/da) ℓ(y_T, ŷ_T) = δ_T (d/da) h_T = δ_T Σ_{t=0}^{T-1} a^{T-1-t} h_t = δ_T Σ_{t=0}^{T-1} a^{T-1-t} (1 − a^t)/(1 − a) = δ_T [ (1 − a^T)/(1 − a)² − T a^{T-1}/(1 − a) ].

Plugging in y_T = 1, this becomes

    (d/da) ℓ(y_T, ŷ_T) = [ (1 − a^T)/(1 − a) − 1 ] [ (1 − a^T)/(1 − a)² − T a^{T-1}/(1 − a) ].    (6)

For large T, if |a| > 1, then a^T grows exponentially with T, and the gradient is approximately

    (d/da) ℓ(y_T, ŷ_T) ≈ (a^{T-1})(T a^{T-2}) = T a^{2T-3}.

Therefore, if a_0 is initialized outside of [−1, 1], the iterates a_i from gradient descent with step size α_i = (1/i) diverge, i.e. |a_i| → ∞, and from equation (6), it is clear that such a_i are not stationary points.

A.2 PROOFS FROM SECTION 2.2

A.2.1 RECURRENT NEURAL NETWORKS

Assume ‖W‖ ≤ λ < 1 and ‖U‖ ≤ B_U. Notice tanh′(x) = 1 − tanh(x)², so since tanh(x) ∈ [−1, 1], tanh is 1-Lipschitz and 2-smooth. We previously showed the system is stable since, for any states h, h′,

    ‖tanh(Wh + Ux) − tanh(Wh′ + Ux)‖ ≤ ‖Wh + Ux − Wh′ − Ux‖ ≤ λ ‖h − h′‖.

Using Lemma 1 with k = 0, ‖h_t‖ ≤ B_U B_x / (1 − λ) for all t. Therefore, for any W, W′, U, U′,

    ‖tanh(Wh_t + Ux) − tanh(W′h_t + U′x)‖ ≤ ‖Wh_t + Ux − W′h_t − U′x‖ ≤ sup_t ‖h_t‖ ‖W − W′‖ + B_x ‖U − U′‖ ≤ (B_U B_x / (1 − λ)) ‖W − W′‖ + B_x ‖U − U′‖,

so the model is Lipschitz in U, W. We can similarly argue the model is B_U-Lipschitz in x. For smoothness, the partial derivative with respect to h is

    ∂φ_w(h, x)/∂h = diag(tanh′(Wh + Ux)) W,

so for any h, h′, bounding the ℓ_∞ norm with the ℓ_2 norm,

    ‖∂φ_w(h, x)/∂h − ∂φ_w(h′, x)/∂h‖ = ‖diag(tanh′(Wh + Ux))W − diag(tanh′(Wh′ + Ux))W‖ ≤ ‖W‖ ‖diag(tanh′(Wh + Ux) − tanh′(Wh′ + Ux))‖ ≤ 2 ‖W‖ ‖Wh + Ux − Wh′ − Ux‖ ≤ 2λ² ‖h − h′‖.

For any W, W′, U, U′ satisfying our assumptions,

    ‖∂φ_w(h, x)/∂h − ∂φ_{w′}(h, x)/∂h‖ = ‖diag(tanh′(Wh + Ux))W − diag(tanh′(W′h + U′x))W′‖
    ≤ ‖diag(tanh′(Wh + Ux) − tanh′(W′h + U′x))‖ ‖W‖ + ‖diag(tanh′(W′h + U′x))‖ ‖W − W′‖
    ≤ 2λ ‖(W − W′)h + (U − U′)x‖ + ‖W − W′‖
    ≤ 2λ ‖W − W′‖ ‖h‖ + 2λ ‖U − U′‖ ‖x‖ + ‖W − W′‖
    ≤ ((2λ B_U B_x + (1 − λ)) / (1 − λ)) ‖W − W′‖ + 2λ B_x ‖U − U′‖.

Similar manipulations establish ∂φ_w(h, x)/∂w is Lipschitz in h and w.

A.2.2 LSTMS

Similar to the previous sections, we assume s_0 = 0. The state-transition map is not Lipschitz in s, much less stable, unless c is bounded. However, assuming the weights are bounded, we first prove this is always the case.

Lemma 4.
Let f_∞ = sup_t ‖f_t‖_∞. If ‖W_f‖_∞ < ∞, ‖U_f‖_∞ < ∞, and ‖x_t‖_∞ ≤ B_x, then f_∞ < 1 and ‖c_t‖_∞ ≤ 1/(1 − f_∞) for all t.

Proof of Lemma 4. Note |tanh(x)|, |σ(x)| ≤ 1 for all x. Therefore, for any t, ‖h_t‖_∞ = ‖o_t ⊙ tanh(c_t)‖_∞ ≤ 1. Since σ(x) < 1 for x < ∞ and σ is monotonically increasing,

    ‖f_t‖_∞ ≤ σ(‖W_f h_{t-1} + U_f x_t‖_∞) ≤ σ(‖W_f‖_∞ ‖h_{t-1}‖_∞ + ‖U_f‖_∞ ‖x_t‖_∞) ≤ σ(B_W + B_U B_x) < 1.

Using the trivial bounds ‖i_t‖_∞ ≤ 1 and ‖z_t‖_∞ ≤ 1,

    ‖c_{t+1}‖_∞ = ‖i_{t+1} ⊙ z_{t+1} + f_{t+1} ⊙ c_t‖_∞ ≤ 1 + f_∞ ‖c_t‖_∞.

Unrolling this recursion, we obtain a geometric series,

    ‖c_t‖_∞ ≤ Σ_{i≥0} f_∞^i ≤ 1/(1 − f_∞).

Proof of Proposition 2. We show φ_LSTM is λ-contractive in the ℓ_∞ norm for some λ < 1. For r ≥ log_{1/λ}(√d), this in turn implies the iterated system φ^r_LSTM is contractive in the ℓ_2 norm. Consider a pair of reachable hidden states s = (c, h), s′ = (c′, h′). By Lemma 4, c, c′ are bounded. Analogous to the recurrent network case above, since σ is (1/4)-Lipschitz and tanh is 1-Lipschitz,

    ‖z − z′‖_∞ ≤ ‖W_z‖_∞ ‖h − h′‖_∞,

and similarly for the gates i, f, o, with an additional factor of 1/4 from the Lipschitz constant of σ. Both ‖z′‖_∞ and ‖i‖_∞ are at most 1 since they are outputs of saturating non-linearities. Letting c_+ and c′_+ denote the cell state on the next time step and applying the triangle inequality,

    ‖c_+ − c′_+‖_∞ = ‖i ⊙ z − i′ ⊙ z′ + f ⊙ c − f′ ⊙ c′‖_∞
    ≤ ‖(i − i′) ⊙ z‖_∞ + ‖i′ ⊙ (z − z′)‖_∞ + ‖f ⊙ (c − c′)‖_∞ + ‖c′ ⊙ (f − f′)‖_∞
    ≤ ‖i − i′‖_∞ ‖z‖_∞ + ‖z − z′‖_∞ ‖i′‖_∞ + ‖c − c′‖_∞ ‖f‖_∞ + ‖f − f′‖_∞ ‖c′‖_∞.

A similar argument shows

    ‖h_+ − h′_+‖_∞ ≤ ‖o − o′‖_∞ + ‖c_+ − c′_+‖_∞ ≤ (‖W_o‖_∞/4) ‖h − h′‖_∞ + ‖c_+ − c′_+‖_∞.

By the assumed bounds on ‖W_i‖_∞, ‖W_z‖_∞, ‖W_f‖_∞, ‖W_o‖_∞ and the bound on ‖c′‖_∞ from Lemma 4,

    ‖c_+ − c′_+‖_∞ < (1 − f_∞) ‖h − h′‖_∞ + f_∞ ‖c − c′‖_∞ ≤ ‖s − s′‖_∞,

as well as

    ‖h_+ − h′_+‖_∞ < (1 − f_∞) ‖h − h′‖_∞ + f_∞ ‖c − c′‖_∞ ≤ ‖s − s′‖_∞,

which together imply ‖s_+ − s′_+‖_∞ < ‖s − s′‖_∞, establishing φ_LSTM is contractive in the ℓ_∞ norm.

B PROOFS FROM SECTION 3

Throughout this section, we assume the initial state h_0 = 0. Without loss of generality, we also assume φ_w(0, 0) = 0 for all w. Otherwise, we can reparameterize φ_w(h, x) ↦ φ_w(h, x) − φ_w(0, 0) without affecting the expressivity of φ_w. For stable models, we also assume there is some compact, convex domain Θ ⊂ ℝ^m so that the map φ_w is stable for all choices of parameters w ∈ Θ.

Proof of Lemma 1. For any t ≥ 1, by the triangle inequality,

    ‖h_t‖ = ‖φ_w(h_{t-1}, x_t) − φ_w(0, 0)‖ ≤ ‖φ_w(h_{t-1}, x_t) − φ_w(0, x_t)‖ + ‖φ_w(0, x_t) − φ_w(0, 0)‖.

Applying the stability and Lipschitz assumptions and then summing a geometric series,

    ‖h_t‖ ≤ λ ‖h_{t-1}‖ + L_x ‖x_t‖ ≤ Σ_{i≥0} λ^i L_x B_x ≤ L_x B_x / (1 − λ).

Now, consider the difference between hidden states at time step t. Unrolling the iterates k steps and then using the previous display yields

    ‖h_t − h^k_t‖ = ‖φ_w(h_{t-1}, x_t) − φ_w(h^k_{t-1}, x_t)‖ ≤ λ ‖h_{t-1} − h^k_{t-1}‖ ≤ λ^k ‖h_{t-k}‖ ≤ λ^k L_x B_x / (1 − λ),

and solving for the k that makes the right-hand side at most ε gives the result.

B.1 PROOFS FROM SECTION 3.3

Before proceeding, we introduce notation for our smoothness assumption. We assume the map φ_w satisfies four smoothness conditions: for any reachable states h, h′ and any weights w, w′ ∈ Θ, there are some scalars β_ww, β_wh, β_hw, β_hh such that

    ‖∂φ_w(h, x)/∂w − ∂φ_{w′}(h, x)/∂w‖ ≤ β_ww ‖w − w′‖,
    ‖∂φ_w(h, x)/∂w − ∂φ_w(h′, x)/∂w‖ ≤ β_wh ‖h − h′‖,
    ‖∂φ_w(h, x)/∂h − ∂φ_{w′}(h, x)/∂h‖ ≤ β_hw ‖w − w′‖,
    ‖∂φ_w(h, x)/∂h − ∂φ_w(h′, x)/∂h‖ ≤ β_hh ‖h − h′‖.

B.1.1 GRADIENT DIFFERENCE DUE TO TRUNCATION IS NEGLIGIBLE

In this section, we argue the difference in gradient with respect to the weights between the recurrent and truncated models is O(kλ^k). For sufficiently large k (independent of the sequence length), the impact of truncation is therefore negligible. The proof leverages the vanishing-gradient phenomenon: the long-term components of the gradient of the full recurrent model quickly vanish. The remaining challenge is to show the short-term components of the gradient are similar for the full and truncated models.

Proof of Lemma 2. The Jacobian of the loss with respect to the weights is

    ∇_w p_T = Σ_{t=1}^{T} (∂p_T/∂h_T) (∂h_T/∂h_t) (∂h_t/∂w),

where ∂h_t/∂w is the partial derivative of h_t with respect to w, assuming h_{t-1} is constant with respect to w.
Expanding the expression for the gradient, we wish to bound

    ‖∇_w p_T(w) − ∇_w p^k_T(w)‖ ≤ ‖Σ_{t=1}^{T-k} (∂p_T/∂h_T)(∂h_T/∂h_t)(∂h_t/∂w)‖ + ‖Σ_{t=T-k+1}^{T} (∂p_T/∂h_T)(∂h_T/∂h_t)(∂h_t/∂w) − (∂p^k_T/∂h^k_T)(∂h^k_T/∂h^k_t)(∂h^k_t/∂w)‖.

The first term consists of the long-term components of the gradient for the recurrent model. The second term is the difference in the short-term components of the gradients between the recurrent and truncated models. We bound each of these terms separately.

For the first term, by the Lipschitz assumptions, ‖∂p_T/∂h_T‖ ≤ L_p and ‖∂h_t/∂w‖ ≤ L_w. Since φ_w is λ-contractive, ‖∂h_t/∂h_{t-1}‖ ≤ λ. Using submultiplicativity of the spectral norm,

    ‖Σ_{t=1}^{T-k} (∂p_T/∂h_T)(∂h_T/∂h_t)(∂h_t/∂w)‖ ≤ L_p L_w Σ_{t=1}^{T-k} λ^{T-t} ≤ λ^k L_p L_w / (1 − λ).

Focusing on the second term, by the triangle inequality and smoothness,

    ‖Σ_{t=T-k+1}^{T} (∂p_T/∂h_T)(∂h_T/∂h_t)(∂h_t/∂w) − (∂p^k_T/∂h^k_T)(∂h^k_T/∂h^k_t)(∂h^k_t/∂w)‖
    ≤ Σ_{t=T-k+1}^{T} ‖∂p_T/∂h_T − ∂p^k_T/∂h^k_T‖ ‖∂h_T/∂h_t‖ ‖∂h_t/∂w‖    (a)
      + ‖∂p_T/∂h_T‖ Σ_{t=T-k+1}^{T} ‖(∂h_T/∂h_t)(∂h_t/∂w) − (∂h^k_T/∂h^k_t)(∂h^k_t/∂w)‖.    (b)

Using Lemma 1 to upper bound ‖h_T − h^k_T‖, term (a) satisfies

    (a) ≤ Σ_{t=T-k+1}^{T} β_p ‖h_T − h^k_T‖ λ^{T-t} L_w ≤ λ^k β_p L_w L_x B_x / (1 − λ)².

Using the triangle inequality, Lipschitzness, and smoothness, (b) is bounded by

    (b) ≤ L_p Σ_{t=T-k+1}^{T} λ^{T-t} β_wh ‖h_t − h^k_t‖ + L_p L_w Σ_{t=T-k+1}^{T} ‖∂h_T/∂h_t − ∂h^k_T/∂h^k_t‖
    ≤ kλ^k L_p β_wh L_x B_x / (1 − λ) + L_p L_w Σ_{t=T-k+1}^{T} ‖∂h_T/∂h_t − ∂h^k_T/∂h^k_t‖,    (c)

where the last line used ‖h_t − h^k_t‖ ≤ λ^{t-(T-k)} L_x B_x / (1 − λ) for t ≥ T − k. It remains to bound (c), the difference of the hidden-to-hidden Jacobians. Peeling off one term at a time and applying the triangle inequality, for any t ≥ T − k + 1,

    ‖∂h_T/∂h_t − ∂h^k_T/∂h^k_t‖ ≤ ‖∂h_T/∂h_{T-1} − ∂h^k_T/∂h^k_{T-1}‖ ‖∂h_{T-1}/∂h_t‖ + ‖∂h^k_T/∂h^k_{T-1}‖ ‖∂h_{T-1}/∂h_t − ∂h^k_{T-1}/∂h^k_t‖
    ≤ β_hh ‖h_{T-1} − h^k_{T-1}‖ λ^{T-t-1} + λ ‖∂h_{T-1}/∂h_t − ∂h^k_{T-1}/∂h^k_t‖
    ≤ Σ_{i=t}^{T-1} β_hh λ^{T-t-1} ‖h_i − h^k_i‖ ≤ λ^k β_hh L_x B_x / (1 − λ),

so (c) is bounded by kλ^k L_p L_w β_hh L_x B_x / (1 − λ)². Ignoring Lipschitz and smoothness constants, we have shown the entire sum is O(kλ^k / (1 − λ)²).

B.1.2 STABLE RECURRENT MODELS ARE SMOOTH

In this section, we prove that the gradient map ∇_w p_T is Lipschitz. First, we show that on the forward pass, the difference between hidden states h_t(w) and h_t(w′) obtained by running the model with weights w and w′, respectively, is bounded in terms of ‖w − w′‖. Using smoothness of φ, the difference in gradients can be written in terms of ‖h_t(w) − h_t(w′)‖, which in turn can be bounded in terms of ‖w − w′‖. We repeatedly leverage this fact to conclude the total difference in gradients must be similarly bounded.

We first show small differences in weights do not significantly change the trajectory of the recurrent model.

Lemma 5. For some w, w′, suppose φ_w, φ_{w′} are λ-contractive and L_w-Lipschitz in w. Let h_t(w), h_t(w′) be the hidden state at time t obtained from running the model with weights w and w′ on common inputs {x_t}. If h_0(w) = h_0(w′), then

    ‖h_t(w) − h_t(w′)‖ ≤ L_w ‖w − w′‖ / (1 − λ).

Proof. By the triangle inequality, followed by the Lipschitz and contractivity assumptions,

    ‖h_t(w) − h_t(w′)‖ = ‖φ_w(h_{t-1}(w), x_t) − φ_{w′}(h_{t-1}(w′), x_t)‖
    ≤ ‖φ_w(h_{t-1}(w), x_t) − φ_{w′}(h_{t-1}(w), x_t)‖ + ‖φ_{w′}(h_{t-1}(w), x_t) − φ_{w′}(h_{t-1}(w′), x_t)‖
    ≤ L_w ‖w − w′‖ + λ ‖h_{t-1}(w) − h_{t-1}(w′)‖.

Iterating this argument and then using h_0(w) = h_0(w′), we obtain a geometric series in λ,

    ‖h_t(w) − h_t(w′)‖ ≤ Σ_{i≥0} L_w ‖w − w′‖ λ^i ≤ L_w ‖w − w′‖ / (1 − λ).

The proof of Lemma 3 is similar in structure to that of Lemma 2, and follows from repeatedly using smoothness of φ and Lemma 5.

Proof of Lemma 3. Let h′_t = h_t(w′). Expanding the gradients and using ‖h_t(w) − h_t(w′)‖ ≤ L_w ‖w − w′‖ / (1 − λ) from Lemma 5,

    ‖∇_w p_T(w) − ∇_w p_T(w′)‖ ≤ Σ_{t=1}^{T} ‖∂p_T/∂h_T − ∂p_T/∂h′_T‖ ‖∂h_T/∂h_t‖ ‖∂h_t/∂w‖ + ‖∂p_T/∂h′_T‖ Σ_{t=1}^{T} ‖(∂h_T/∂h_t)(∂h_t/∂w) − (∂h′_T/∂h′_t)(∂h′_t/∂w)‖
    ≤ Σ_{t=1}^{T} β_p ‖h_T − h′_T‖ λ^{T-t} L_w + L_p [ (a) + (b) ]
    ≤ β_p L_w² ‖w − w′‖ / (1 − λ)² + L_p [ (a) + (b) ],

where (a) collects the terms involving differences of ∂h_t/∂w and (b) collects the terms involving differences of the hidden-to-hidden Jacobians ∂h_T/∂h_t. Focusing on term (a),

    (a) ≤ Σ_{t=1}^{T} λ^{T-t} (β_wh ‖h_t − h′_t‖ + β_ww ‖w − w′‖) ≤ β_wh L_w ‖w − w′‖ / (1 − λ)² + β_ww ‖w − w′‖ / (1 − λ),

where the first inequality used

    ‖∂φ_w(h_{t-1}, x_t)/∂w − ∂φ_{w′}(h′_{t-1}, x_t)/∂w‖ ≤ ‖∂φ_w(h_{t-1}, x_t)/∂w − ∂φ_w(h′_{t-1}, x_t)/∂w‖ + ‖∂φ_w(h′_{t-1}, x_t)/∂w − ∂φ_{w′}(h′_{t-1}, x_t)/∂w‖ ≤ β_wh ‖h − h′‖ + β_ww ‖w − w′‖.
To bound (b), we peel off terms one by one using the triangle inequality. For any t,

    ‖∂h_T/∂h_t − ∂h′_T/∂h′_t‖ ≤ ‖∂h_T/∂h_{T-1} − ∂h′_T/∂h′_{T-1}‖ ‖∂h_{T-1}/∂h_t‖ + ‖∂h′_T/∂h′_{T-1}‖ ‖∂h_{T-1}/∂h_t − ∂h′_{T-1}/∂h′_t‖
    ≤ (β_hh ‖h_{T-1} − h′_{T-1}‖ + β_hw ‖w − w′‖) λ^{T-t-1} + λ ‖∂h_{T-1}/∂h_t − ∂h′_{T-1}/∂h′_t‖
    ≤ β_hw (T − t) λ^{T-t-1} ‖w − w′‖ + β_hh λ^{T-t-1} Σ_{i=t}^{T-1} ‖h_i − h′_i‖
    ≤ β_hw (T − t) λ^{T-t-1} ‖w − w′‖ + β_hh L_w ‖w − w′‖ (T − t) λ^{T-t-1} / (1 − λ).

Summing over t and multiplying by L_p L_w, term (b) contributes at most

    L_p L_w β_hw ‖w − w′‖ / (1 − λ)² + L_p L_w² β_hh ‖w − w′‖ / (1 − λ)³.

Suppressing Lipschitz and smoothness constants, we have shown the entire sum is O(1 / (1 − λ)³), as required.

B.1.3 GRADIENT DESCENT ANALYSIS

Equipped with the smoothness and truncation lemmas (Lemmas 2 and 3), we turn towards proving the main gradient descent result.

Proof of Proposition 4. Let Π_Θ denote the Euclidean projection onto Θ, and let δ_i = ‖w^i_recurr − w^i_trunc‖. Initially δ_0 = 0, and on step i + 1, we have the following recurrence relation for δ_{i+1},

    δ_{i+1} = ‖w^{i+1}_recurr − w^{i+1}_trunc‖
    = ‖Π_Θ(w^i_recurr − α_i ∇p_T(w^i_recurr)) − Π_Θ(w^i_trunc − α_i ∇p^k_T(w^i_trunc))‖
    ≤ ‖w^i_recurr − α_i ∇p_T(w^i_recurr) − w^i_trunc + α_i ∇p^k_T(w^i_trunc)‖
    ≤ ‖w^i_recurr − w^i_trunc‖ + α_i ‖∇p_T(w^i_recurr) − ∇p^k_T(w^i_trunc)‖
    ≤ δ_i + α_i ‖∇p_T(w^i_recurr) − ∇p_T(w^i_trunc)‖ + α_i ‖∇p_T(w^i_trunc) − ∇p^k_T(w^i_trunc)‖
    ≤ δ_i + α_i (β δ_i + γ k λ^k)
    ≤ exp(α_i β) δ_i + α_i γ k λ^k,

where the penultimate line applied Lemmas 2 and 3, and the last line used 1 + x ≤ e^x for all x. Unwinding the recurrence relation at step N,

    δ_N ≤ Σ_{i=1}^{N} ( Π_{j=i+1}^{N} exp(α_j β) ) α_i γ k λ^k = Σ_{i=1}^{N} exp( αβ Σ_{j=i+1}^{N} 1/j ) α_i γ k λ^k.

Bounding the inner summation via an integral, Σ_{j=i+1}^{N} 1/j ≤ log(N/i), and simplifying the resulting expression,

    δ_N ≤ Σ_{i=1}^{N} exp(αβ log(N/i)) (α/i) γ k λ^k = α γ k λ^k N^{αβ} Σ_{i=1}^{N} i^{-(αβ+1)} ≤ α γ k λ^k N^{αβ+1}.

B.1.4 PROOF OF THEOREM 1

Proof of Theorem 1. Using that f is L_f-Lipschitz and the triangle inequality,

    ‖y_T − y^k_T‖ ≤ L_f ‖h_T(w^N_recurr) − h^k_T(w^N_trunc)‖ ≤ L_f ‖h_T(w^N_recurr) − h_T(w^N_trunc)‖ + L_f ‖h_T(w^N_trunc) − h^k_T(w^N_trunc)‖.

By Lemma 5, the first term is bounded by L_f L_w ‖w^N_recurr − w^N_trunc‖ / (1 − λ), and by Lemma 1, the second term is bounded by λ^k L_f L_x B_x / (1 − λ). Using Proposition 4, after N steps of gradient descent, we have

    ‖y_T − y^k_T‖ ≤ L_f L_w ‖w^N_recurr − w^N_trunc‖ / (1 − λ) + λ^k L_f L_x B_x / (1 − λ) ≤ k λ^k α γ L_f L_w N^{αβ+1} / (1 − λ) + λ^k L_f L_x B_x / (1 − λ),

and solving for k such that both terms are less than ε/2 gives the result.

Figure 6: Empirical validation of Proposition 4 on random Gaussian instances (λ = 0.75, k = 35). We plot the difference in recurrent weight matrices ‖W_recurr − W_trunc‖_2 during gradient descent for a linear dynamical system and a tanh-RNN trained with step sizes 1/t and 1/√t, together with the bound from Proposition 4. Without the 1/t rate, the gradient descent bound no longer appears qualitatively correct, suggesting the O(1/t) rate is necessary.

C EXPERIMENTS

The O(1/t) rate may be necessary. The key result underlying Theorem 1 is the bound on the parameter difference ‖w_trunc − w_recurr‖ while running gradient descent obtained in Proposition 4. We show this bound has the correct qualitative scaling using random instances, training randomly initialized, stable linear dynamical systems and tanh-RNNs. In Figure 6, we plot the parameter error ‖w^t_trunc − w^t_recurr‖ as training progresses for both models (averaged over 10 runs). The error scales comparably with the bound given in Proposition 4. We also find that for larger step sizes like α/√t or constant α, the bound fails to hold, suggesting the O(1/t) condition is necessary.

Concretely, we generate random problem instances by fixing a sequence length T = 200, sampling input data x_t i.i.d. from N(0, 4·I_32), and sampling y_T ∼ Unif[−2, 2]. Next, we set λ = 0.75 and randomly initialize a stable linear dynamical system or RNN with tanh non-linearity by sampling U_ij, W_ij i.i.d.
from N(0, 0.5) and thresholding the singular values of W so that ‖W‖ ≤ λ. We use the squared loss and prediction function f(h_t, x_t) = C h_t + D x_t, where C, D are sampled i.i.d. from N(0, I_32). We fix the truncation length to k = 35, set the learning rate to α_t = α/t for α = 0.01, and take N = 200 gradient steps. These parameters are chosen so that the γ k λ^k N^{αβ+1} bound from Proposition 4 does not become vacuous: by the triangle inequality, we always have ‖w_trunc − w_recurr‖ ≤ 2λ.

Stable vs. unstable models. The word and character-level language modeling experiments are based on publicly available code from Merity et al. (2018). The polyphonic music modeling code is based on the code in Bai et al. (2018), and the slot-filling model is a reimplementation of Mesnil et al. (2015).¹ Since the sufficient conditions for stability derived in Section 2.2 only apply for networks with a single layer, we use a single-layer RNN or LSTM for all experiments. Further, our theoretical results are only applicable for vanilla SGD, and not adaptive gradient methods, so all models are trained with SGD. Table 2 contains a summary of all the hyperparameters for each experiment. All hyperparameters are shared between the stable and unstable variants of both models.

¹ The word-level language modeling code is based on https://github.com/pytorch/examples/tree/master/word_language_model, the character-level code is based on https://github.com/salesforce/awd-lstm-lm, and the polyphonic music modeling code is based on https://github.com/locuslab/TCN.

Table 2: Hyperparameters for all experiments.

| Task | Hyperparameter | RNN | LSTM |
|---|---|---|---|
| Word LM | Number layers | 1 | 1 |
| | Hidden units | 256 | 1024 |
| | Embedding size | 1024 | 512 |
| | Dropout | 0.25 | 0.65 |
| | Batch size | 20 | 20 |
| | Learning rate | 2.0 | 20.0 |
| | BPTT | 35 | 35 |
| | Gradient clipping | 0.25 | 1.0 |
| | Epochs | 40 | 40 |
| Char LM | Number layers | 1 | 1 |
| | Hidden units | 768 | 1024 |
| | Embedding size | 400 | 400 |
| | Dropout | 0.1 | 0.1 |
| | Weight decay | 1e-6 | 1e-6 |
| | Batch size | 80 | 80 |
| | Learning rate | 2.0 | 20.0 |
| | BPTT | 150 | 150 |
| | Gradient clipping | 1.0 | 1.0 |
| | Epochs | 300 | 300 |
| Polyphonic Music | Number layers | 1 | 1 |
| | Hidden units | 1024 | 1024 |
| | Dropout | 0.1 | 0.1 |
| | Batch size | 1 | 1 |
| | Learning rate | 0.05 | 2.0 |
| | Gradient clipping | 5.0 | 5.0 |
| | Epochs | 100 | 100 |
| Slot-Filling | Number layers | 1 | 1 |
| | Hidden units | 128 | 128 |
| | Embedding size | 64 | 64 |
| | Dropout | 0.5 | 0.5 |
| | Weight decay | 1e-4 | 1e-4 |
| | Batch size | 128 | 128 |
| | Learning rate | 10.0 | 10.0 |
| | Gradient clipping | 1.0 | 1.0 |
| | Epochs | 100 | 100 |

In the RNN case, enforcing stability is conceptually simple, though computationally expensive. Since tanh is 1-Lipschitz, the RNN is stable as long as ‖W‖ < 1. Therefore, after each gradient update, we project W onto the spectral norm ball by taking the SVD and thresholding the singular values to lie in [0, 1). In the LSTM case, enforcing stability is conceptually more difficult, but computationally simple. To ensure the LSTM is stable, we appeal to Proposition 2. We enforce the following inequalities after each gradient update:
Consequently, we enforce Wi , Wo 0.36, Wz 0.091, and Wf < min 0.128, (1 0.64)2 = 0.128. After 1-5 are enforced, by Proposition 2, the resulting (iterated)-LSTM is stable. Although the above description is somewhat complicated, the implementation boils down to normalizing the rows of the LSTM weight matrices, which can be done very efficiently in a few lines of Py Torch. Data-dependent stability. Unlike the RNN, in an LSTM, it is not clear how to analytically compute the stability parameter λ. Instead, we rely on a heuristic method to estimate λ. Recall a model is stable if for all x, h, h , we have S(h, h , x) := φw(h, x) φw(h , x) h h λ < 1. (7) To estimate suph,h ,x S(h, h , x), we do the following. First, we take x to be point in the training set. In the language modeling case, x is one of the learned word-vectors. We randomly sample and fix x, and then we perform gradient ascent on S(h, h , x) to find worst-case h, h . In our experiments, we initialize h, h N (0, 0.1 I) and run gradient ascent with learning rate 0.9 for 1000 steps. This procedure is repeated 20 times, and we estimate λ as the maximum value of S(h, h , x) encounted during any iteration from any of the 20 random starting points.