# cells_in_multidimensional_recurrent_neural_networks__76ca680b.pdf Journal of Machine Learning Research 17 (2016) 1-37 Submitted 5/14; Revised 12/15; Published 7/16 Cells in Multidimensional Recurrent Neural Networks Gundram Leifert gundram.leifert@uni-rostock.de Tobias Strauß tobias.strauss@uni-rostock.de Tobias Grüning tobias.gruening@uni-rostock.de Welf Wustlich welf.wustlich@uni-rostock.de Roger Labahn roger.labahn@uni-rostock.de University of Rostock Institute of Mathematics 18051 Rostock, Germany Editor: Yoshua Bengio Keywords: LSTM, MDRNN, CTC, handwriting recognition, neural network Abstract The transcription of handwritten text on images is one task in machine learning and one solution to solve it is using multi-dimensional recurrent neural networks (MDRNN) with connectionist temporal classification (CTC). The RNNs can contain special units, the long short-term memory (LSTM) cells. They are able to learn long term dependencies but they get unstable when the dimension is chosen greater than one. We defined some useful and necessary properties for the one-dimensional LSTM cell and extend them in the multidimensional case. Thereby we introduce several new cells with better stability. We present a method to design cells using the theory of linear shift invariant systems. The new cells are compared to the LSTM cell on the IFN/ENIT and Rimes database, where we can improve the recognition rate compared to the LSTM cell. So each application where the LSTM cells in MDRNNs are used could be improved by substituting them by the new developed cells. 1. Introduction Since the last decade, artificial neural networks (NN) became state-of-the-art in many fields of machine learning, for example they can be applied to pattern recognition. Typical NN are feedforward NN (FFNN) or recurrent NN (RNN), whereas the latter contain recurrent connections. When nearby inputs depend on each other, providing these inputs as additional information to the NN can improve its recognition result. FFNNs obtain these dependencies by making this nearby inputs accessible. If RNNs are used, the recurrent connections can be used to learn if the surrounding input is relevant, but these connections result in a vanishing dependency over time. In S. Hochreiter, J. Schmidhuber (1997) the authors develop the long short-term memory (LSTM) which is able to have a long term dependency. This LSTM is extended in A. Graves, S. Fernandez and J. Schmidhuber (2007) to the multidimensional (MD) case and is used in a hierarchical multi-dimensional RNN (MDRNN) which performed best in three competitions at the International Conference on Document Analysis and Recognition (ICDAR) in 2009 without any feature extraction and knowledge of the recognized language model. In this paper we analyse these MD LSTM regarding the ability to provide long term depen- c 2016 Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, Roger Labahn. Gundram Leifert et al. dencies in MDRNNs and show that it can easily have an unwanted growing dependency for higher dimensions. We define a more general description of an LSTM a cell and change the LSTM architecture which leads to new MD cell types, which also can provide long term dependencies. In two experiments we show that substituting the LSTM in MDRNNs by these cells works well. Due to this we assume that substituting the LSTM cell by the best performing cell, the Leaky LP cell, will improve the performance of an MDRNN also in other scenarios. Furthermore the new cell types could also be used for the one-dimensional (1D) case, so using them in a bidirectional RNN with LSTMs (BLSTM) could lead to better recognition rates. In Section 2 we introduce the reader to the development of the LSTM cells (S. Hochreiter, J. Schmidhuber, 1997) and its extension (F. A. Gers, J. Schmidhuber and F. Cummins, 1999). Based on that in Section 3 we define two properties that probably lead to the good performance of the 1D LSTM cells. Both together guarantee that the cell can have a long term dependency. A third property ensures that gradient cannot explode over time. In Section 4 we show that the MD version of the LSTM is still able to provide long term dependency whereas the gradient can explode easily for dimension greater than 1. In Section 5 we change the architecture of the MD LSTM cell and reduce it to the 1D LSTM cell so that the cell fulfills the two properties for any dimension. Nevertheless the internal cell state can linearly grow over time. This problem is solved in Section 6 using a trainable convex combination of the input and the previous internal cell states. The new cell type can provide long term dependencies and does not suffer from exploding gradients. Motivated by the last sections we introduce a more general way to define MD cells in Section 7. Using the theory of linear shift-invariant systems and their frequency analysis we are able to get a new interpretation of the cells and we create 5 new cell types. To test the performance of the cells in Section 8 we take two data sets from the ICDAR 2009 competitions, where the MDRNNs with LSTM cell won. On these data sets we compare the recognition results of the MDRNNs when we substitute the LSTM cells by the new developed cells. On both data sets, the IFN/ENIT data set and the RIMES data set we can improve the recognition rate using the new developed cells. 2. Previous Work In this section we briefly want to introduce a recurrent neural network (RNN) and the development of the LSTM cell. In previous literature there are various notation to describe the update equations of RNNs an LSTMs. To unify the notations we will refer to their notation using (F. A. Gers, J. Schmidhuber and F. Cummins, 1999; S. Hochreiter, J. Schmidhuber, 1997; A. Graves and J. Schmidhuber, 2008). Therefore we concentrate on a simple hierarchical RNN with one input layer with the set of neurons I, one recurrent hidden layer with the set of neurons H and one output layer with the set of neurons O. For each time step t N the layers are updated asynchronously in the order I, H, O. In one specific layer all neurons can be updated synchronously. In the hidden layer for one neuron c H Cells in MDRNNs P netγ(t) fγ yγ(t) γ y H(t 1) Figure 1: Schematic diagram of a unit: The unit γ H is a simple neuron with the network s feed forward input y I(t) = yi(t) i I and recurrent input y H(t 1) = yh(t 1) h H and an output activation yγ(t). Right: A unit has an input activation netγ(t), which is a linear combination of the source activations y I(t), y H(t 1). The output activation yγ(t) is computed by applying the activation function fγ to the input activation. Left: The short notation of a unit. at time t N we calculate the neuron s input activation netc by ac(t) netc(t) = X i I wc,iyi(t) + X h H wc,hyh(t 1). (1) with weights w[target neuron],[source neuron]. A bias in (1) can be added by extending the set I := I {bias} with ybias(t) = 1 t N and hence we will not write the bias in the equations, but we use them in our RNNs in Section 8. The neuron s output activation is calculated by yc(t), bc(t) yc(t) = fc (netc(t)) with a differentiable sigmoid activation function fc. To make (1) suitable for t 0 we define h H, t Z \ N : yh(t) = 0. This simple neuron with a linear function of activations as input and one activation function we call unit (compare to Figure 1). In (1) the activation of the unit is dependent on the current activations of the layer below and the previous activations of the units from the same layer. When there are no recurrent connections ( c, h H : wc,h = 0), the layer is called feed-forward layer, otherwise recurrent layer. 2.1 The Long Short-Term Memory A standard LSTM cell c has one input with an input activation ycin(t) a set of gates, one internal state sc and one output(-activation) yc ( yc). The gates are also units and their task is to learn whether a signal should pass the gate or not. They almost always have the logistic activation function flog(x) := 1 1+exp( x) ( f1(x)). The input of the standard LSTM cell is calculated from a unit with an odd activation function with a slope of 1 at x = 0. We use fc(x) = tanh (x) in this paper, another solution could be fc(x) = 2 tanh x 2 (see S. Hochreiter, J. Schmidhuber, 1997). The standard LSTM has two gates: The input gate (IG or ι) and the output gate (OG or ω). These both gates are calculated like a unit, so that netι(t) = X i I wι,iyi(t) + X h H wι,hyh(t 1) yinc(t), bι(t) yι(t) = flog (netι(t)) Gundram Leifert et al. netω(t) = X i I wω,iyi(t) + X h H wω,hyh(t 1) youtc(t), bω(t) yω(t) = flog (netω(t)) . The input of an LSTM is defined like in (1) by netc(t) netcin(t) = X i I wc,iyi(t) + X h H wc,hyh(t 1), g (netc(t)) , f2 (netc(t)) ycin(t) = fc (netcin(t)) . The internal state sc(t) is calculated by sc(t) = ycin(t) yι(t) + sc(t 1), (2) the output activation yc(t) of the LSTM is calculated from yc(t), bc(t) yc(t) = hc (sc(t)) yω(t) (3) with hc (x) := tanh(x) ( f3(x)). The LSTM can be interpreted as a kind of memory module where the internal state stores the information. For a given input ycin(t) ( 1, 1) the IG decides if the new input is relevant for the internal state. If so, the input is added to the internal state. The information of the input is now saved in the activation of the internal state. The OG determines whether or not the internal activation should be displayed to the rest of the network. So the information, stored in the LSTM is just readable when the OG is active. To sum up, an open IG can be seen as a write -operation into the memory and an open OG as a read -operation of the memory. Another way to understand the LSTM is to take a look at the gradient propagated through it. To analyse the LSTM properly, we have to ignore gradients comming from recurrent weights. We define the truncated gradient similar to S. Hochreiter, J. Schmidhuber (1997) and F. A. Gers, J. Schmidhuber and F. Cummins (1999). Definition 1 (truncated gradient) Let γ {cin, ι, ω} be any input or gate unit and yc(t 1) any previous output activation. The truncated gradient differs from the exact gradient only by setting recurrent weighted gradient propagation netγ(t) yc(t 1) to zero. We write netγ(t) yc(t 1) = wγ,c = tr 0. Now, let E be an arbitrary error which is used to train the RNN and E(t) yc(t) the resulting derivative at the output of the LSTM. The OG can eliminate the gradient coming from the output, because yc(t) sc(t) = h c (sc(t)) | {z } (0,1] yω(t) | {z } (0,1) Cells in MDRNNs so the OG decides when the gradient should go into the internal state. Especially for |sc(t)| 1 we get yc(t) sc(t) yω(t). The key idea of the LSTMs is that an error that occurs at the internal state neither explode nor vanish over time. Therefore, we take a look at the partial derivative sc(t) sc(t 1), which is also known as error carousel (for more details see S. Hochreiter, J. Schmidhuber, 1997). Using the truncated gradient of Definition 1 for this derivative, we get sc(t) sc(t 1) =ycin(t) yι(t) sc(t 1) + yι(t) ycin(t) sc(t 1) + 1 =ycin(t) yι(t) yc(t 1) yc(t 1) sc(t 1) + yι(t) ycin(t) yc(t 1) yc(t 1) sc(t 1) + 1 =ycin(t) yι(t) netι(t) netι(t) yc(t 1) | {z } = tr0 yc(t 1) sc(t 1) + yι(t) ycin(t) netcin(t) netcin(t) yc(t 1) | {z } = tr0 yc(t 1) sc(t 1) + 1 sc(t) sc(t 1) = tr1. (4) So, once having a gradient at the internal state we can use the chain rule and get τ N : sc(t) sc(t τ) = tr 1. This is called constant error carousel. Like the OG can eliminate the gradient coming from the LSTM output, the IG can do the same with the gradient coming from the internal state, that means it decides when the gradient should be injected to the source activations. This can be seen by taking a look at the partial derivative sc(t) netcin(t) = sc(t) ycin(t) ycin(t) netcin(t) = yι(t)f c (netcin(t)) . If there is a small input |netcin(t)| 1, we get f c (netcin(t)) 1 and can estimate sc(t) netcin(t) yι(t). All in all, this LSTM is able to store information and learn long-term dependencies, but it has one drawback which will be discussed in 2.2. 2.2 Learning to Forget For long time series the internal state is unbounded (compare with F. A. Gers, J. Schmidhuber and F. Cummins, 1999, 2.1). Assuming a positive or negative input and a non zero Gundram Leifert et al. activation of the IG, the absolute activation of the internal state grows over time. Using the weight-space symmetries in a network with at least one hidden layer (Bishop, 2006, 5.1.1) we assume without loss of generality ycin(t) 0, so sc(t) t . Hence, the activation function hc saturates and (3) can be simplified to yc(t) = hc (sc(t)) | {z } 1 yω(t) yω(t). Thus, for great activations of sc(t) the whole LSTM works like a unit with a logistic activation function. A similar problem can be observed for the gradient. The gradient coming from the output is multiplied by the activation of the OG and the derivative of hc. For great values of sc(t) we get h c (sc(t)) 0 and we can estimate the partial derivative yc(t) sc(t) = h c ((sc(t)) yω(t) 0, which can be interpreted that the OG is not able to propagate back the gradient into the LSTM. Some solutions to solve the linear growing state problem are introduced in F. A. Gers, J. Schmidhuber and F. Cummins (1999). They tried to stabilize the LSTM with a state decay by multiplying the internal state in each time step with a value (0, 1), which did not improve the performance. Another solution was to add an additional gate, the forget gate (FG or φ). The last state sc(t 1) is multiplied by the activation of the FG before it is added to the current state sc(t). So we can substitute (2) by sc(t) = ycin(t) yι(t) + sc(t 1) yφ(t), so that the truncated gradient in (4) is changed to sc(t) sc(t 1) = ycin(t) yι(t) sc(t 1) + yι(t) ycin(t) sc(t 1) + yφ(t) and for longer time series we get τ N sc(t) sc(t τ) = tr t =0 yφ(t t ). Now, the Extended LSTM is able to learn to forget its previous state. However, an Extended LSTM is still able to work like an standard LSTM without FG by having an activation yφ(t) 1. In this paper we denote the Extended LSTM as LSTM Another point of view was introduce in Bengio et al. (1994): To learn long-term dependencies a system must have an architecture to that an input can be saved over long time and does not suffer from the vanishing gradient problem. On the other hand the system should avoid an exploding gradient , which means that a small disturbance has a growing influence over time. In this paper we do not want to solve the problem of vanishing and exploding gradient for a whole system, we want to solve this problem only for one single cell. But we think that it is an necessary condition to provide long time dependencies of a system. Cells in MDRNNs yΓ(t) Γ c γ1 y I(t) y H(t 1) y I(t) y H(t 1) gint ( ) gout ( ) memory sc(t 1),...,sc(t k) ycin(t) sc(t) yc(t) Figure 2: Schematic diagram of a cell: The function gint calculates the internal state sc(t) from the previous internal states sc(t 1), . . . , sc(t k) and the cell input ycin(t) using the gate activations yΓ(t). The function gout calculates the output yc(t) of the cell from the actual and previous internal states sc(t), . . . , sc(t k), the cell input ycin(t) also using the gate activations yΓ(t). Gundram Leifert et al. 3. Cells and Their Properties In this section we want to introduce a general cell and figure out properties for these cells which probably lead to the good performance observed by LSTM cells. Definition 2 (Cell, cf. Fig. 2) A cell, c, of order k consists of one designated input unit, cin, with sigmoid activation function fc (typically fc = tanh unless specified otherwise); a set Γ (not containing cin) of units called gates γ1, γ2, . . . with sigmoid activation functions fγi, i = 1, . . . (typically logistic fγi = flog unless specified otherwise); an arbitrary function, gint, and a cell activation function, gout, mapping into [ 1, 1]. Each unit of Γ {cin} receives the same set of input activations. The cell update in time step t N is performed in three subsequent phases: 1. Following the classical update scheme of neurons (see Section 2), all units in Γ {cin} calculate synchronously their activations, which will be denoted by yΓ(t) := yγ(t) γ Γ and ycin(t). Furthermore, we call ycin(t) the input activation of the cell. 2. Then, the cell computes it s so-called internal state sc(t) := gint (yΓ(t), ycin(t), sc(t 1), . . . , sc(t k)) . 3. Finally, the cell computes it s so-called output activation yc(t) := gout (yΓ(t), ycin(t), sc(t), sc(t 1), . . . , sc(t k)) . In this paper we concentrate on first order cells (k = 1). Now, we use Definition 2 to re-introduce the (Extended) LSTM cell. Remark 3 (LSTM cell) An LSTM cell is a cell of order 1 where hc = tanh and Γ = {ι, φ, ω} sc(t) := gint (yΓ(t), ycin(t), sc(t 1)) := ycin(t)yι(t) + sc(t 1)yφ(t) yc(t) := gout (yΓ(t), sc(t)) := hc (sc(t)) yω(t) Properties of cells. Developing the 1D LSTM cells, the main idea is to save exactly one piece of information over a long time series and to propagate the gradient back over this long time, so that the system can learn precise storage of this piece of information. In instance a given input ycin (which represent the information) at time tin should be stored into the cell state sc until the information is required at time tout. To be able to prove the following properties, we will assume the truncated gradient defined in Definition 1. Nevertheless we will use the full gradient in our Experiments, because it turned out that it works much better. The next two properties of a cell ensure the ability to work as such a memory. Cells in MDRNNs y I(t) y H(t 1) y I(t) y H(t 1) y I(t) y H(t 1) + sc(t) yc(t) memory sc(t 1) Figure 3: Schematic diagram of a one-dimensional LSTM cell: The input (cin) is multiplied by the IG (ι). The previous state sc(t 1) is gated by the FG (φ) and added to the activation coming from the IG and input. The output of the cell is the squashed internal state (squashed by hc (x) = tanh(x)) and gated by the OG (ω). Gundram Leifert et al. The first property should ensure that an input ycin at time tin can be memorized (the cell input is open) in the internal activation sc until tout (the cell memorizes) and has a negligibly influence on the internal activation for t > tout (the cell forgets). In addition, the cell is able to prevent influence of other inputs at time steps t = tin (the cell input is closed). Definition 4 (Not vanishing gradient (NVG)) A cell c allows an NVG : For arbitrary tin, tout N, tin tout, δ > 0 there exist gate activations yΓ(t) such that for any t1, t2 N sc (t2) ycin (t1) tr [1 δ, 1] for t1 = tin and tin t2 tout [0, δ] otherwise (5) The next definition guaranties that at any time t N the gate activations can (the cell output is open) or not (the cell output is closed) distribute the piece of information saved in sc to the network. This is an important property because the piece of information can be memorized in the cell without presenting it to the network. Note that the decision is just dependent on gate activations at time t and there are no constraints to previous gate activations. In Definition 2 we require yc(t) [ 1, 1] whereas sc(t) R. So we cannot have arbitrarily small intervals of the derivative as in (5), but we can ensure two distinct intervals for open and closed cell output. When we take Definition 4 and 5 together, a cell is able to save an input over long term series, can decide at each time step whether or not it is presented to the network and can forget the saved input. Definition 5 (Controllable output dependency (COD)) A cell c of order k allows an COD : There exist δ1, δ2 (0, 1), δ2 < δ1 so that for any time t N there exists a gate vector yΓ(t) leading to open output dependency yc (t) sc (t) [δ1, 1] (6) and there exists another gate vector yΓ(t) leading to a closed output dependency yc (t) sc (t) [0, δ2] . (7) The third property is a kind of stability criterion. An unwanted case is that a small change (caused by any noisy signal) at time step tin has a growing influence at later time steps. This is equivalent to an exploding gradient over time. Controlling the gradient of the whole system and avoiding him not to explode is a hard problem. But we can at least avoid the exploding gradient in one cell. This should be prohibited for any gate activations. Definition 6 (Not exploding gradient (NEG)) A cell c has an NEG : For any time steps tin, t N, tin < t and any gate activations yΓ(t) the truncated gradient in bounded by sc (t) sc (tin) tr [0, 1] . Cells in MDRNNs We think that a cell fulfilling these three properties can work as stable memory. To be able to prove these properties for the LSTM cell we have to considerate the gate activations. In general, the activation function of the gates does not have to be the logistic activation function flog, whereas for this paper we set γ Γ : fγ := flog. So the activation of gates can never be exactly 0 or 1, because of a finite input activation netγ(t) to the gate activation function. But a gate can have an activation yγ(t) [1 ε, 1) if it is opened or yγ(t) (0, ε] if it is closed, because for a realistic large input activation netγ(t) 7 (low input activation netγ(t) < 7) we get an activation within the interval yγ(t) [1 ε, 1) (yγ(t) (0, ε]) with ε < 1 1000. Handling with these activation intervals we can prove the definitions for the LSTM cell. Now we can prove whether or not the LSTM cell has these properties. Theorem 7 (Properties of the LSTM cell) The 1D LSTM cell allows NVG and has an NEG, but does not allow COD. Proof see A.1 in appendix. 4. Expanding to More Dimensions In A. Graves, S. Fernandez and J. Schmidhuber (2007) the 1D LSTM cell is extended to an arbitrary number of dimensions; this is solved by using one FG for each dimension. In many publications using the MD LSTM cell in MDRNNs outperform state-of-the-art recognition systems (for example see A. Graves and J. Schmidhuber, 2008). But by expanding the cell to the MD case, the absolute value of the internal state |sc| can grow faster than linear over time. When sp c and there are peephole connections (for peephole connection details see F. A. Gers, N. Schraudolph and J. Schmidhuber, 2002), the cells have an output activation of yp c { 1, 0, 1}: The internal state multiplied by the peephole weight overlays the other activation-weight-products and this leads to an activation of the OG yp ω {0, 1} and a squashed internal state hc sp c { 1, 1}. So the output of the cell is yp ωhc sp c = yp c { 1, 0, 1}. But also without peephole connections the internal state can grow, which leads to hc sp c { 1, 1} and the cell works like a conventional unit with a logistic activation function yc(t) yω(t). Our goal is to transfer the Definitions 4, 5 and 6 defined in Section 3 into the MD case and we will see that the MD LSTM cell has an exploding gradient. In the next sections we will provide alternative cell types, that fulfill two or all of these definitions. In the 1D case it is clear, that there is just one way to come from date t1 to date t2, when t1 < t2, by incrementing t1 as long as t2 is reached. For the MD case the number of paths depends on the number of dimensions and the distance between these two dates. An MD path is defined as follows. Definition 8 (MD path) Let p,q ND be two dates. A p-q-path π of length k 0 is a sequence π := {p = p0,p1, . . . ,pk = q} with i {1, . . . , k} !d {1, . . . , D} : (pi) d = pi 1. Further, let πi := pi. Gundram Leifert et al. We can define the distance vector pq := q p = q1 p1 ... q D p D pq1 ... pq D between the dates p and q. When pq has at least one negative component, there exists no p-q-path. Otherwise there exist exactly pq1, . . . , pq D PD i=1 pqi ! QD i=1 pqi! p-q-paths (compare with the multinomial coefficient). We write p < q when #{ pq} 1 and p q when p = q p < q. Now we can extend the definitions of the 1D case to the MD case, whereas we concentrate on the MD cells of order 1. Definition 9 (MD cell) An MD cell, c, of order 1 and dimension D consists of the same parts as a 1D cell of order 1. The cell update in date p ND is performed in three subsequent phases: 1. Following the classical update scheme of neurons (see Section 2), all units in Γ {cin} synchronously calculate their activations, which will be denoted by yp Γ = yp γ γ Γ. Furthermore, we call yp cin the input activation of the cell. 2. Then, the cell computes it s so-called internal state sp c := gint yp Γ, yp cin, sp 1 c , . . . , s p D c 3. Finally, the cell computes it s so-called output activation yp c := gout yp Γ, yp cin, sp c, sp 1 c , . . . , s p D c Using this, we can reintroduce the LSTM cell as well as Definition 4, 5 and 6 for the MD case: Definition 10 (MD LSTM cell) An MD LSTM cell is a cell of dimension D and order 1 where hc = tanh and Γ = {ι, (φ, 1) , . . . , (φ, D) , ω} sp c = gint yp Γ, yp cin, sp 1 c , . . . , s p D c = yp ι yp cin + D P d=1 s p d c yp φ,d yp c = gout yp Γ, sp c = hc sp c yp ω Cells in MDRNNs Definition 11 (MD Not vanishing gradient (NVG)) An MD cell c allows an NVG : For arbitrary pin,pout ND,pin pout, δ > 0 there exist p ND gate activations yp Γ such that for any p1,p2 ND sp2 c yp1 cin tr [1 δ, 1] for p1 = pin and pin p2 pout [0, δ] otherwise (8) Definition 12 (MD Controllable output dependency (COD)) An MD cell c allows an COD : There exist δ1, δ2 (0, 1), δ2 < δ1 so that for any time t N there exists a gate vector yp Γ leading to open output dependency yp c sp c [δ1, 1] (9) and there exists another gate vector yp Γ leading to a closed output dependency yp c sp c [0, δ2] . (10) Definition 13 (MD Not exploding gradient (NEG)) An MD cell c has an NEG : For any time steps pin,p ND,pin < p and any gate activations yp Γ the truncated gradient in bounded by sp c spin c tr [0, 1] We can now consider these definitions for the MD LSTM cell. Theorem 14 (NVG of MD LSTM cells) An MD LSTM cell allows an NVG. Proof see A.2 in appendix For arbitrary activations of FGs the partial derivative sp c spin c can grow over time: Theorem 15 (NEG of MD LSTM cells) An MD LSTM cell can have an exploding gradient, when D 2. Proof see A.3 in appendix. The MD LSTM cell does not allow the COD, because the 1D case is a special case of the MD case. Our idea for the next section is to change the MD LSTM layout, so that it has an NEG. Gundram Leifert et al. 5. Reducing the MD LSTM Cell to One Dimension In the last section, we showed that the MD LSTM cell can have an exploding gradient. We tried different ways to solve this problem. For example we divided the activation of the FG by the number of dimensions. Then the gradient cannot explode over time, but the gradient vanishes along some paths rapidly. Another approach was to give the cells the opportunity to learn to stabilize itself, when the internal state starts diverging. Therefore we add an additional peephole connection between the square value of the previous internal states s p d c 2 and the FGs so that the cell is able to learn that it has to close the FG for large internal states. This also does not make a significant difference. Also forcing the cell to learn to stabilize itself by adding an error Lossstate = ε sp c p with p = {1, 2, 3, 4} and different learning rates ε does not work. So we tried to change the layout of the MD LSTM cell. 5.1 MD LSTM Stable Cell In Section 3 we realized that 1D LSTM cells work good and the gradient does not explode, but in the MD case it does. Our idea is to combine the previous states s p d c at date p to one previous state sp c and take the 1D form of the LSTM cell. For this reason we call this cell LSTM Stable cell. Therefore, a function sp c = f sp 1 c , . . . , s p D c is needed, so that the following two benefits of the 1D LSTM cell remain: 1. The MD LSTM Stable cell has an NEG 2. The MD LSTM Stable cell allows NVG. The convex combination sp c = f sp 1 c , . . . , s p D c d=1 λp ds p d c , d = 1, . . . , D : λp d 0, d=1 λp d = 1 (11) of all states satisfies these both points (see Theorems 17 and 18). To calculate these D coefficients we want to use the activation of D gates and we call them lambda gates (LG or λ). Definition 16 (MD LSTM Stable cell) An MD LSTM Stable cell is a cell of dimension D and order 1 where hc = tanh and Γ = {ι, (λ, 1) , . . . , (λ, D) , φ, ω} Cells in MDRNNs sp c = gconv yp Γ, sp 1 c , . . . , s p D c d=1 s p d c yp λ,d D P d =1 yp λ,d sp c = gint yp Γ, yp cin, sp c = yp ι yp cin + sp c yp φ yp c = gout yp Γ, sp c = yp ωhc sp c Using these equations we can test the cell for its properties. The MD LSTM Stable cell does not have the COD, because the 1D LSTM cell also does not have this property. For the other propertiese we get: Theorem 17 (LTD of MD LSTM Stable cells) An MD LSTM Stable cell allows NVG. Proof See A.4 in appendix. Theorem 18 (NEG of MD LSTM Stable cells) An MD LSTM Stable cell has an NEG. Proof See A.5 in appendix. Reducing the number of gates by one. When D 2 an MD LSTM Stable cell has one more gate than a classical MD LSTM (for D = 1 the both cells are equivalent). But it is possible to reduce the number of LGs by one. One solution is to choose one dimension d {1, . . . , D} which does not get an LG. Its activation is calculated by d {1,...,D}\{d } In the special case of D = 2 we can choose d = 2 and we get P2 d =1 yλ,d = yλ,1+(1 yλ,1) = 1 and the update equation of the internal state can be simplified to sp c = gint yp ι , yp λ,1, yp φ, sp 1 c , sp 2 c = yp ι yp cin + yp λ,1sp 1 c + 1 yp λ,1 sp 2 c . 6. Bounding the Internal State In the last sections we discussed the growing of the EC over time and we found a solution to have a NGEC for higher dimensions. Nevertheless it is possible that the internal state grows linearly over time. When we take a look at Definition 10, we see that the partial derivative for p = pout depends on h c sp c . So having the inequality yp c sp c h c (sp c) with h c (sp c) |sp c| 0 Gundram Leifert et al. the cell allows NVG defined in Definition 11, but actually we have ypout c ypin cin |spout c | 0 for arbitrary gate activations. Again, ideas like state decay, additional peephole connections or additional loss functions like mentioned in Section 4 either do not work or destroy the NVG of the LSTM and LSTM Stable cell. So, our solution is to change the architecture of the MD LSTM Stable cell, so that it fulfills has an NEG and allows NVG and COD. The key idea is to bound the internal state, so that for all inputs yp cin 1, p ND the internal state is bounded by sp c 1. Note that this is comparable with the well-known Bounded-Input-Bounded-Output-Stability (BIBO-Stability). To create an MD cell that has an NEG, allows NVG and has a bounded internal state, we take the MD LSTM Stable cell proposed in the last section and change its layout. Therefore we calculate the activation of the IG as function of the FG, so that we achieve sp c 1 by choosing yp ι := 1 yp φ. So the activation of the FG controls how much leaks from the previous states. The activation of the FG can also be interpreted as switch, if the internal activation, the new activation or a convex combination of these both activations should be stored in the cell. So the sc can be seen as time-dependent exponential moving average of ycin. Definition 19 (MD Leaky cell) An MD Leaky cell is a cell of dimension D and order 1 where hc = tanh and Γ = {(λ, 1) , . . . , (λ, D) , φ, ω} sp c = gconv yp Γ, sp 1 c , . . . , s p D c d=1 s p d c yp λ,d D P d =1 yp λ,d sp c = gint yp Γ, yp cin, sp c = 1 yp φ yp cin + sp c yp φ yp c = gout yp Γ, sp c = yp ωhc sp c Now we can prove that the resulting cell has all benefits. Theorem 20 The MD Leaky cell has an NEG and allows NVG and COD. Proof See A.6 in appendix. The MD Leaky cell can have one gate less than the MD LSTM cell and the MD LSTM Stable cell and because of this, the update path requires less computations. 7. General Derivation of Leaky Cells So far we proposed cells for the MD case, which are able to provide long term memory. But especially in MDRNNs with more than one MD layer it is hard to measure if and how much long term dependencies are used and even if it is useful. Another way to interpret the cell is to consider them as kind of MD feature extractor like feature maps in Convolutional Neural Networks (Bengio and Le Cun, 1995). Then the aim is to construct an MD cell which is able to generate useful features. Having a hierarchical Neural Network like in Bengio and Le Cun (1995) and A. Graves and J. Schmidhuber (2008) over the hierarchies the number of Cells in MDRNNs features increases with a simultaneously decreasing feature resolution. Features in a layer with low resolution can be seen as low frequency features in comparison to features in a layer with high resolution. So it would be useful to construct a cell as feature extractor which produces a low frequency output in comparison to its input. In appendix B we take a closer look at the theory of linear shift invariant (LSI)-systems and their frequency analysis and analyse a first order LSI-system regarding its free selectable parameters using the Fand Z-transform. There, we derive the MD Leaky LP cell (see Definition 21) and 5 additional first order MD cells, which we do not test in Section 8. Definition 21 (MD Leaky LP cell) An MD Leaky LP cell is a cell of dimension D and order 1 where hc = tanh and Γ = {(λ, 1) , . . . , (λ, D) , φ, ω0, ω1} sp c = gconv yp Γ, sp 1 c , . . . , s p D c d=1 s p d c yp λ,d D P d =1 yp λ,d sp c = gint yp Γ, yp cin, sp c = 1 yp φ yp cin + sp c yp φ yp c = gout yp Γ, sp c, sp c = hc sp cyp ω0 + sp c yp ω1 Setting the second OG (yp ω1) to zero, the Leaky LP cell corresponds to the Leaky cell, hence it fulfills all three properties, but has one more gate, which is as much gates as the LSTM cell. 8. Experiments RNNs with 1D LSTM cells are well studied. In some experiments the activations of the gates and the internal state are observed and one can see that the cell can really learn, when to forget information and when the internal state should be accessible for the network (see F. A. Gers, N. Schraudolph and J. Schmidhuber, 2002). However, we did not find experiments like these for the MD case and we do not want to transfer these experiments into the MD case. Instead we compare the different cell types with each other in two scenarios where the MD RNNs with LSTM cells perform very well. In both benchmarks the task is to transcribe a handwritten text on an image, so we have a 2D RNN. In this case we compare the cells on the IFN/ENIT (Pechwitz, M. and Maddouri, S. and Märgner, V. and Ellouze, N. and Amiri, H. and others, 2002) and the Rimes database (Augustin, E. and Brodin, J.-M and Carré, M. and Geoffrois, E. and Grosicki, E. and Prêteux, F., 2006). Both tasks are solved with the MD RNN layout described in A. Graves and J. Schmidhuber (2008) and shown in Figure 4. All networks are trained with Backpropagation through time (BPTT) . To compare the different cell types in RNNs with each other we take 10 RNNs with different weight initializations of each cell type and calculate the minimum, the maximum and the median of the best label error rate (LER) on a validation set of these 10 RNNs. In all tables we present these three LERs to compare the cell types. We think it is more important to have stable cells in the lower MD layers because of two reasons: First, when we have just a few cells in a layer, the saturation of one cell has a Gundram Leifert et al. INPUT LAYER OUTPUT LAYER Subsample 1 Subsample 2 Subsample 3 Figure 4: Architecture of the hierarchical MDRNN used for the experiments: It is equivalent to A. Graves and J. Schmidhuber (2008, Figure 2). A 2D layer contains 22 distinct layers (for each combination of scanning direction left/right and up/down one layer). To reduce the number of weights between two 2D layers, a 0D layer is inserted, which contains units with tanh as activation function. They have dimension 0 because they have no recurrent connections. These layers can be seen as feed-forward or convolutional layer. Each 2D layer (or its allocated 0D layer) reduces its size in in x and y dimension using a two-dimensional subsampling. Simultaneously the number of feature maps (z-dimension) increases to have no bottleneck between input and output layer. greater effect on the performance of the network. Second, in lower layers there are longer time series so having an unstable cell in such a layer, it has time to saturate. So our first experiment compares the recognition results when we substitute the LSTM cells in the lowest layer (which is 2D layer 1 in Figure 4) by the newly developed cells. In the second experiment we compare the LSTM cell and the Leaky LP cell also in the higher MD layers ( 2D layer 2 and 3 in Figure 4), to evaluate if the Leaky LP cell work better also Cells in MDRNNs Label-Error-Rate in Percent Celltype min max median LSTM 8,58% 14,73% 10,58% Stable 8,78% 11,75% 9,55% Leaky 8,87% 10,47% 9,10% Leaky LP 8,24% 9,40% 8,93% Table 1: Different cell types in the lowest MD layer in long time series. In Bengio (2012, 3.1.1) it is mentioned, that an important hyper parameter for a training is the learning rate, so another experiment is to train all networks with stochastic gradient decent with different learning rates δ 1 10 3, 5 10 4, 2 10 4, 1 10 4 and compare the best LER according a fixed learning rate. 8.1 The IFN/ENIT Database This database contains handwritten names of towns and villages of Tunisia. The set is divided into 7 (a-f,s) sets, where 5 (a-e) are available for training and validation (for details see Pechwitz, M. and Maddouri, S. and Märgner, V. and Ellouze, N. and Amiri, H. and others, 2002). With all information we got from A. Graves, we were able to get comparable results to A. Graves and J. Schmidhuber (2008). Therefor we divide the sets a-e into 30000 training samples and 2493 validation samples. All network are trained 100 epochs with a fixed learning rate δ = 1 10 4. The LER is calculated on the validation set. 8.1.1 Different Cells in the Lowest MD Layer In our first experiment we substitute the LSTM cell in the lowest MD layer. We take some of the cells described in this paper. In Table 1 the results are shown. The first row is the same RNN layout used in A. Graves and J. Schmidhuber (2008). We can see, that the Leaky LP cell performs the best. Nevertheless the worst RNN with Leaky LP cells in the lowest MD layer performs worth than the best RNN with LSTM cells. So we cannot say, that Leaky LP is always better. But it can be observed that the variance of the RNN performance is very high with LSTM cells in the lowest MD layer. Our interpretation is that LSTM cells have a comparable performance like the Leaky LP cells in the lowest layer, when they do not saturate. Note, that the Leaky cell has one gate less, so they are faster and have less trainable weights. 8.1.2 Different Cells in Other MD Layers Now we want to compare the best new developed cell the Leaky LP cell with the LSTM cell in the other MD layers. So we also substitute the LSTM cell in the upper MD layers. We enumerate the 2D layers like shown in Figure 4. In Table 2 we can see that substituting the LSTM cells only in the lowest or in the both lowest layer perform slightly better. The best results can be achieved when we use Leaky LP cells in 2D layer 1 and LSTM cells in 2D layer 3. Using LSTM in the middle layer seems to work slightly better than using the Leaky LP Gundram Leifert et al. Celltype in 2D layer Label-Error-Rate in Percent 1 2 3 min max median LSTM LSTM LSTM 8,58% 14,73% 10,58% Leaky LP LSTM LSTM 8,24% 9,40% 8,93% Leaky LP Leaky LP LSTM 8,35% 11,27% 8,91% Leaky LP Leaky LP Leaky LP 8,92% 11,69% 9,74% Table 2: Different cells in other layers Label-Error-Rate in Percent Celltype BP-delta min max median LSTM 1 10 4 8,58% 14,73% 10,58% LSTM 2 10 4 9,15% 16,86% 10,51% LSTM 5 10 4 9,03% 21,77% 11,44% LSTM 1 10 3 10,21% 30,20% 11,44% Leaky LP 1 10 4 8,92% 11,69% 9,74% Leaky LP 2 10 4 8,38% 9,09% 8,81% Leaky LP 5 10 4 8,25% 8,95% 8,78% Leaky LP 1 10 3 8,29% 9,20% 8,88% Leaky LP 2 10 3 8,95% 12,81% 9,55% Table 3: Performance of cells regarding learning-rate cells instead. This fits to our intuition mentioned before that the LSTM cells perform better when they do not have a too long time series and when there are enough cells in one layer which do not saturate. 8.1.3 Performance of Cells Regarding Learning-Rate When we take a look at the update equations and the proofs of the NEG it can be assumed, that the gradient going through the cells is lower for Leaky LP cells in contrast to LSTM cells. So we think the learning rate have to be larger for Leaky LP cells. In Table 3 we compare the networks with either only LSTM or Leaky LP cells. There we can see that the learning rate have to be much higher for the Leaky LP cells. In addition, the RNNs with Leaky LP cells are more robust to the choice of the learning rate. 8.2 The Rimes Database One task of the Rimes database is the handwritten word recognition (for more details see E. Grosicki, M. Carré, J.-M. Brodin and E. Geoffrois, 2008; E. Grosicki and H. El-Abed, 2011). It contains 59292 images of french single words. It is divided into distinct subsets; a training set of 44196 samples, a validation set of 7542 samples and a test set of 7464 samples. We train the MD RNNs by using the training set for training and calculate the LER over the validation set, so the network is trained on 44196 training samples each epoch. The network used in this section differs only in the subsampling rate between two layers from the network used in A. Graves and J. Schmidhuber (2008). When there is a subsampling between layers, Cells in MDRNNs Label-Error-Rate in Percent Celltype min max median LSTM 14,96% 17,63% 16,50% Stable 14,45% 16,02% 15,11% Leaky 14,77% 16,39% 15,85% Leaky LP 14,63% 15,78% 15,30% Table 4: Different cell types in the lowest MD layer Celltype in 2D layer Label-Error-Rate in Percent 1 2 3 min max median LSTM LSTM LSTM 14,96% 17,63% 16,50% Leaky LP LSTM LSTM 14,63% 15,78% 15,30% Leaky LP Leaky LP LSTM 14,21% 15,57% 14,92% Leaky LP Leaky LP Leaky LP 14,94% 16,18% 15,52% Table 5: Different cells in other layers the factors are 3 2 instead of 4 3 or 4 2. The rest of the experiment is the same like described in Section 8.1. 8.2.1 Different Cells in the Lowest MD Layer In Table 4 we can see that substituting the LSTM in the lowest layer by one of the three cells improves the performance of the network, even the Leaky cell with one gate less. 8.2.2 Different Cells in Other MD Layers We want to see the effect of the substitution of the LSTM cell by the Leaky LP cell in the upper MD layers. In Table 5 we can see that using Leaky LP cells in both lowest layers perform very well. So we also take this setup to try different learning rates. Performance of Cells Regarding Learning-Rate. Using different learning rates we can see that the RNN with Leaky LP cells in the both lowest layers and the LSTM cells in the top layer can significantly improve the performance . Even the maximal LER of this RNN works better than the best network with LSTM cells in each layer. 9. Conclusion In this paper we took a look at the one-dimensional LSTM cell and discussed the benefits of this cell. We found two properties, that probably make these cells so powerful in the one dimensional case. Expanding these properties to the multi dimensional case, we saw that the LSTM does not fulfill one of these properties any more. We solved this problem by changing the architecture of the cell. In addition we presented a more general idea how to create one dimensional or multi dimensional cells. We compare some newly developed cells with the LSTM cell on two data sets and we can improve the performance using the Gundram Leifert et al. Label-Error-Rate in Percent Celltype BP-delta min max median LSTM 1 10 4 14,96% 17,63% 16,50% LSTM 2 10 4 14,41% 16,88% 15,61% LSTM 5 10 4 15,05% 16,27% 15,47% Leaky LP 1 10 4 14,94% 16,18% 15,52% Leaky LP 5 10 4 12,68% 13,95% 13,57% Leaky LP in 2D layer 1 & 2 2 10 4 13,26% 14,04% 13,65% Leaky LP in 2D layer 1 & 2 5 10 4 12,08% 13,42% 12,87% Table 6: Performance of cells regarding learning-rate new cell types. Due to this we think that substituting the multi-dimensional LSTM cells by the multi-dimensional Leaky LP cell could improve the performance of many system working with a multi-dimensional space. Appendix A. Proofs A.1 Proof of 7 Proof Let c be a 1D LSTM cell. To get the derivative sc(t2) sc(t1) according the truncated gradient between two time steps t1, t2 N we have to take a look at gint. sc (t2) sc (t1) = gint (yΓ(t2), ycin(t2), sc(t2 1)) sc (t1) (12) = (ycin(t2)yι(t2)) sc (t1) | {z } = tr0 sc (t1) yφ(t2) + sc(t2 1) yφ(t2) sc (t1) | {z } = tr0 = tr sc(t2 1) sc (t1) yφ(t2) t=t1+1 yφ(t) (13) In addition, t N we have sc (t) ycin (t) = yι(t) and yc (t) sc (t) = h c (sc(t)) yω(t). (14) We will prove the properties successively. NEG: For the LSTM cell the FG fφ = flog ensures yφ(t) (0, 1), so using these bounds in (13) with sc (t) sc (tin) = tr t =tin+1 yφ(t ) (0, 1) Cells in MDRNNs the LSTM cell has an NEG. NVG: Therefore, we choose yι (t) [1 ε, 1) if t = tin (0, ε] otherwise , yφ (t) [1 ε, 1) if tin < t tout (0, ε] otherwise , with a later chosen ε > 0. Let t1, t2 N, t1 t2 be two arbitrary dates, where we want to calculate the gradient sc(t2) ycin(t1). First, we want to show that the LSTM cell allows NVG for t1 = tin and tin t2 tout: We have yι(t1) [1 ε, 1) and t = tin + 1, . . . , tout : yφ(t) [1 ε, 1). Then, we can estimate the derivative from (12) and (14) by sc (t2) ycin (t1) = sc (t2) sc (t1) sc (t1) ycin (t1) = tr yι (t1) t=t1+1 yφ (t) t=t1+1 (1 ε) , 1 (1 ε)tout tin+1 , 1 To fulfill the equation for NVG we choose ε depending on δ such that 1 δ (1 ε)tout tin+1 1 tout tin+1 holds. Second, we have to show, that the derivative is in [0, δ], when t1 = tin and tin t2 tout is not fulfilled. In the case of t1 = tin when ε δ we can use the NEG which leads to sc (t2) ycin (t1) = sc (t2) sc (t1) | {z } tr[0,1] sc (t1) ycin (t1) | {z } (0,ε] [0, ε] [0, δ]. When t1 = tin we have two cases: t2 < tin or t2 > tout. For the case t2 < tin the derivative is zero ( [0, δ]), because the cell is causal. For t2 > tout we can split the derivative at tout and get sc (t2) ycin (t1) = tr yι(t1) t=t1+1 yφ(t) | {z } (0,1) t=tout+1 yφ(t) | {z } (0,ε] tr 0, εt2 tout [0, ε] [0, δ]. For ε min n δ, 1 (1 δ) 1 tout tin+1 o the LSTM cell allows NVG. COD: To prove that the LSTM cell has no COD, we show that there are gate activations Gundram Leifert et al. such that in Definition 5 we get δ2 > δ1. Therefore, we assume that all gate activations are arbitrary (yγ(t) (0, 1)), closed (yγ(t) (0, ε]) or opened (yγ(t) [1 ε, 1)) with a later chosen ε > 0. We take a look at the right side of (14). For sc(t) = 0 we get h c (sc(t)) = 1. In Definition 5 we have to satisfy yΓ(t) : yc(t) sc(t) [0, δ2] an choose the OG yω(t) (0, ε] with But then for t = 1, . . . , t 1 we can choose the IG and FG open with the same ε so that yφ(t ), yι(t ) [1 ε, 1) . When for all time steps t = 1, . . . , t there is a positive input ycin(t ) [c, 1), c (0, 1) R and an internal state sc(t 1) < c(1 ε) ε , the internal state is growing over time, because sc(t ) = ycin(t )yι(t ) + sc(t 1)yφ(t ) c(1 ε) + sc(t 1)(1 ε) sc(t 1) + c(1 ε) sc(t 1)ε > sc(t 1) + c(1 ε) c(1 ε) For large sc(t) c(1 ε) ε 1 we can estimate tanh(sc(t)) | {z } exp( 2sc(t)) exp ( sc(t)) exp c(1 ε) This yields in (14) to the bound yc (t) sc (t) = h c (sc(t)) yω(t) (16) so in Definition 5 we get δ1 exp c(1 ε) But when we combine (15), (18) and the restriction in Definition 5, we have ε δ2 < δ1 exp c(1 ε) but there exist ε, c, such that the inequality is not fulfilled, which is a contradiction. Summarized, the 1D LSTM cell allows an NVG and has an NEG, but does not allow COD. Cells in MDRNNs A.2 Proof of 14 Proof Let c be an MD LSTM cell of dimension D, p,p1,p2,pin,pout ND,pin pout arbitrary dates and hc = tanh the sigmoid function. Besides ε > 0 is a later chosen value. In the first step we want to show that there are activations of the forget gates, so that sp c spin c tr (1 ε) p pin 1, 1 for pin p pout [0, Dε] otherwise (19) is fulfilled. The prove is done using induction over k = p pin 1 with p pin. The base k = 0 is clear. Let be k 1. We define Pp := d {1, . . . , D} | p d pin the set of dimensions d, in which are pin-p d -paths. Note, that this set cannot be empty, because p > pin for k 1. When we have a dimension d Pp then p d pin 1 = k 1 and we assume s p d c spin c tr h (1 ε) p d pin 1 , 1 i = h (1 ε)k 1 , 1 i . (20) Then we choose the activations of the FG to be ( h 1 ε |Pp|, 1 |Pp| for d Pp and pin < p pout [0, ε] otherwise . (21) Then we can estimate the derivative for pin p pout using (20) and (21) to sp c spin c = tr s p d c spin c yp φ,d s p d c spin c s p d c spin c sp c spin c tr |Pp| (1 ε)k 1 1 ε |Pp| , |Pp| 1 |Pp| sp c spin c tr h (1 ε) p pin 1 , 1 , (22) so (19) is fulfilled for pin p pout. If we have p < pin in (19), the derivative is 0, because we have a causal system. For p > pout in (19), we choose ε 1 D 1 |Pp| in (21) to ensure p ND: sp c spin c 1 (see (22)) and we get sp c spin c = tr s p d c spin c yp φ,d 0, Dε max d=1,...,D s p d c spin c (0, Dε] , (23) and (19) is fulfilled. In the second step let p1 p2 be the date, for which we want to calculate the truncated gradient sp2 c yp1 cin . We choose the IG activation as yp ι [1 ε, 1) if p = pin (0, ε] otherwise (24) Gundram Leifert et al. and we get sp c yp cin = yp ι . Using (22), (23) and (24), we can estimate the partial derivative by sp2 c yp1 cin = sp2 c sp1 c sp1 c yp1 cin sp2 c sp1 c tr (1 ε)(1 ε) p2 pin 1, 1 for p1 = pin and pin p2 pout [0, Dε] otherwise . and setting 1 pin pout 1+1 the conditions of Definition 11 are fulfilled. A.3 Proof of 15 Proof Let c be an MD cell of dimension D with the internal state sc and pin,pk ND,pin pk two dates. Let pk be a date k steps further in each dimension than a fixed date pin. So the distance between them is pin pk 1 = Dk. Let Π be the set of all pin-pk-paths, then there exist |Π| = #{ pinpk} paths (see Definition 8). We assume yp φ,d [ε, 1 ε] with ε (0, 0.5) and we can estimate the partial derivative, using the truncated gradient, with spk c spin c = tr i=1 yπi φ,d h εk#{ pinpk}, (1 ε)k #{ pinpk} i . For D = 1 we get |Π| = 1 and the cell has a NGEC. When D 2 we can count the number of paths using the Stirling s approximation and we can estimate the number of paths with #{ pinpk} = When we combine it with the FG activations we can estimate the derivative for great k with the Stirling s approximation and get spk c spin c tr h εDk#{ pinpk}, (1 ε)Dk #{ pinpk} i (25) 2πk D 1 (Dε)Dk , 2πk D 1 (D (1 ε))Dk # Cells in MDRNNs The upper bound of this interval can grow for great k, if [D (1 ε)] > 1 and this is the case for D 2. So the MD LSTM cell can have an exploding gradient for D 2. When the weights to the FGs are initialized with small values, we have yp φ,d 0.5. Then we have an exploding gradient when D 3, when the training is starting. In the worst case we have yp φ,d 1 and the derivative in (25) goes for great k to spk c spin c 2π D 1 k 1 D A.4 Proof of 17 Proof Let c be an MD LSTM Stable cell of dimension D 2 (for D = 1 the proof is equivalent to the 1D case of the LSTM cell), p,p1,p2,pin,pout ND,pin pout arbitrary dates and hc = tanh the sigmoid function. Besides ε > 0 is a later chosen value. In the first step we want to show that there are activations of the forget gates, so that sp c spin c tr (1 (D 1)ε)2 p pin 1, 1 for pin p pout [0, ε] otherwise (26) is fulfilled. The prove is done using induction over k = p pin 1. The base k = 0 is clear. Let be k 1. We define Pp := d {1, . . . , D} | p d pin the set of dimensions d, in which are pin-p d -paths. Note, that this set cannot be empty, because p > pin for k 1. When we have a dimension d Pp then p d pin 1 = k 1 and we assume s p d c spin c tr h (1 (D 1)ε)2 p d pin 1 , 1 i = h (1 (D 1)ε)2(k 1) , 1 i . (27) When we choose the activations of the LGs to be yp λ,d [1 ε, 1) for d Pp and pin < p pout (0, ε] otherwise , Gundram Leifert et al. we can estimate P d Pp yp λ,d PD d =1 yp λ,d (1 (D 1)ε, 1], because P d Pp yp λ,d PD d =1 yp λ,d = P d Pp yp λ,d P d Pp yp λ,d + P d {1,...,D}\Pp yp λ,d (28) |Pp| (1 ε) |Pp| (1 ε) + (D |Pp|) | {z } D 1 |Pp| (1 (D 1)ε) |Pp| (1 (D 1)ε) + (D 1) ε (1 (D 1)ε) |Pp| |Pp| ε(D 1) (|Pp| 1) (1 (D 1)ε). Setting the FG to yp φ [1 ε, 1) for pin < p pout (0, ε] otherwise (29) we can estimate the derivative for pin p pout using (27),(28) and (29) to sp c spin c = tr yp φ s p d c spin c yp λ,d PD d =1 yp λ,d + X d {1,...,D}\Pp s p d c spin c yp λ,d PD d =1 yp λ,d | {z } =0 (1 ε) (1 (D 1)ε)2(k 1) (1 (D 1)ε), 1 sp c spin c tr (1 (D 1)ε)2k , 1 (30) so (26) is fulfilled for pin p pout. If we have p < pin in (26), the derivative is 0, because we have a causal system. For p > pout the FG is closed (see (29)), and using the upper bounds of (27) and (28) we get sp c spin c = tr yp φ s p d c spin c yp λ,d PD d =1 yp λ,d and (26) is fulfilled. In the second step let p1 p2 be the date, for which we want to calculate the truncated gradient sp2 c yp1 cin . We choose the IG activation as yp ι [1 ε, 1) if p = pin (0, ε] otherwise (32) Cells in MDRNNs and we get sp c yp cin = yp ι . Using (30), (31) and (32), we can estimate the partial derivative by sp2 c yp1 cin = sp2 c sp1 c sp1 c yp1 cin sp2 c sp1 c tr (1 ε)(1 (D 1)ε)2 p2 pin 1, 1 for p1 = pin and pin p2 pout [0, ε] otherwise . and setting ε := min δ, 1 (1 δ) 1 2 pin pout 1+1 1 D 1 the conditions of Definition 11 are fulfilled. A.5 Proof of 18 Proof Let c be a MD LSTM Stable cell of dimension D with the internal state sc and pin,p ND,pin p two arbitrary dates and pin p 1 = k. Let all gate activations be arbitrary in [0, 1]. We show that sp c spin c tr [0, 1] (33) is fulfilled k N using induction over k. For the base case k = 0 we get sp c spin c = spin c spin c = 1. Let (33) be fulfilled for k 1. That means if p d pin we have p d pin 1 = k 1 and this leads to s p d c spin c tr [0, 1]. If p d pin then there is no pin-p d -path and we have s p d c spin c = 0 for this dimension. Then we can calculate the derivative 0 sp c spin c = tr yp φ s p d c spin c d =1 yp λ,d 0, max d Pp ( s p d c spin c d =1 yp λ,d | {z } 1 tr [0, 1] , which gives us the desired interval. A.6 Proof of 20 Proof NEG: The cell has an NEG, because all gates have the same bounds as the MD Stable cell. NVG: To prove the NVG, we use the proof of Theorem 17. The difference between the MD Stable cell and the MD Leaky cell is that the activations of the FG and IG are dependent Gundram Leifert et al. on each other for the Leaky cell. Let pin,p ND,pin p be two arbitrary dates like in Theorem 17. The IG has just the a restriction that for p = pin it has to hold yp ι [1 ε, 1) . Here, the FG can have an arbitrary activation, so we chose yp φ = 1 yp ι . For all p > pin the FG have to be in the ranges, shown in (29), while the IG has no restriction and we choose yp ι = 1 yp φ, so the MD Leaky cell has the NVG. COD: The proof that the MD Leaky cell allows COD can be done by estimating the bounds of sp c. From the update equations of the cell we get sp c max i=1,...,D Now we can estimate the internal state using the ranges yp cin [ 1, 1], recursion over p |sp c| = 1 yp φ yp cin + yp φsp c max yp cin , sp 1 c , . . . , s p D c max q

1 ε δ2 < δ1 h c (1) (1 ε) 4 < h c (1) h c (1) + 1 the COD is proven. Appendix B. Theory to Create First Order MD Cells If one wants to take a closer look at the theory of linear shift invariant (LSI)-systems and their frequency analysis and analyse a first order LSI-system regarding its free selectable parameters using the Fand Z-transform, it is highly recommended to be familar with these theories (for a good overview and more details see Poularikas, 2000; Schlichthärle, 2000). Adding the knowledge of reducing the MD case to the 1D case (see Section 5) we create new cell types for the MD case. Cells in MDRNNs B.1 Analysing a First Order LSI-System The update equations of a first order LSI-system with one input u, one internal state x and one output y can be written as x[n] = h1(u[n], x[n 1]) = α0u[n] + α1x[n 1], (36) y[n] = h2(x[n], x[n 1]) = b0x[n] + b1x[n 1] (37) with the free selectable coefficients α0, α1, b0, b1 R. Let U(z) = Z {u[n]} be the Ztransformed signal of u[n] and X(z), Y (z) respectively. Then we can write the so called transfer functions H1(z) := X(z) U(z) = α0 1 α1z 1 , H2(z) := Y (z) X(z) = b0 + b1z 1, H(z) := Y (z) U(z) = H2(z)H1(z). To analyse (36) and (37) according their frequency response we use the relationship between the F-transform and the Z-transform: Remark 22 Let u[n] = ejωn be a harmonic input sequence with the imaginary number j2 = 1 and H(z) = Y (z) U(z) be a transfer functions of an LSI-system. When the poles of H(z) are inside the circle |z| = 1, we can change from Zto F-transform using the substitution H(ω) = H(z) z=ejω with the harmonic sequence y[n] = H(ω)u[n] = H(ω)ejωn with the same frequency ω but with a different amplitude and a different phase dependent on the frequency ω. We only want to analyse the amplitude of this harmonic sequence |y[n]| = H(z)ejωn = |H(z)| = |H2(z)| |H1(z)| and do that by analysing both transfer functions H1(z) and H2(z) separately. The amplitude of H1(ω) = H1(z)|z=ejω is calculated by |H1(ω)| = |α0| q (1 α1 cos(ω))2 + α2 1 sin2(ω) . Like mentioned before, in many tasks, the information signal has a low frequency. To have the largest amplitude at ω = 0 we have to choose α1 0. As mentioned in Remark 22 the poles of H1(z) = α0 1 α1z 1 = zα0 z α1 have to be in the circle |z| = 1, so we have the additional constraint |α1| < 1. This leads to the bounds α1 [0, 1). But for α1 1 we Gundram Leifert et al. y I(t) y H(t 1) y I(t) y H(t 1) y I(t) y H(t 1) λ sc(t) ycin(t) memory sc(t 1) Figure 5: Schematic diagram of a one-dimensional Leaky LP cell: The internal state is a convex combination of the new input cin and the previous state sc(t 1). The previous state sc(t 1) and the current state sc(t) are gated (ω0 and ω1) and accumulated afterwards. The output is squashed by tanh into the interval [ 1, 1]. have H1 (0) , so we have to choose α0 dependent on α1. We set a maximum gain of max ω |H1(ω)| = |H1(0)| = 1, so we get the constraint |α0| 1 α1. (38) In the same way we analyse H2(z): |H2(ω)| = b0 + b1e jω = q (b0 + b1 cos(ω))2 + b2 1 sin2(ω) To get the maximal gain at low frequency the parameters b0 and b1 must have the same sign. B.2 Creating a First Order Cell With these constraints for the parameters we now can define a new cell type. The parameters α0, α1, b0, b1 should be activations of gates like in LSTM cells. We have to find the right activation functions to fulfill the inequalities above. Using the weight-space symmetries in a network with at least one hidden layer (Bishop, 2006, 5.1.1), without loss of generality we set α0, α1, b0, b1 0. To fulfill the bounds for H1, we set α1 as activation of a gate with activation function flog. So we have α1 (0, 1). This is comparable with the FG in the previous sections. To select the α0 we choose α0 := 1 α1 (see (38)). So the value of α0 is comparable with the activation of the IG. For H2 we set both values b0, b1 as activations of Cells in MDRNNs a gate with activation function flog which leads to max ω |H2(ω)| = max {b0 + b1} = 2, so the amplitude response is bounded by 2. With these bounds we can define a cell with a cell input yp cin = u[n], a previous internal state sp c = x[n 1], an internal state sp c = x[n] and a cell output yp c = y[n]. We substitute the coefficients by time dependent gate activations α0 := 1 yp φ = yp ι IG α1 := yp φ FG b0 := yp ω0 OG b1 := yp ω1 OG of the previous internal state which leads to the transfer functions H yp φ 1 (z) = α0 1 α1z 1 = 1 yp φ 1 yp φz 1 , H yp ω0;yp ω1 2 (z) = b0 + b1z 1 = yp ω0 + yp ω1z 1, H(z) = Hyp φ;yp ω0;yp ω1(z) = Y (z) U(z) = α0 b0 + b1z 1 1 α1z 1 = 1 yp φ 1 yp φz 1 yp ω0 + yp ω1z 1 . (39) and the update equations x[n] =α0u[n] + α1x[n 1] sp c = (1 yp φ)yp cin + yp φsp c , y[n] =b0x[n] + b1x[n 1] yp c = yp ω0sp c + yp ω1sp c . (40) The output of the cell is already bounded in [ 2, 2], but to fulfill Definition 9 we change (40) to yp c = hc yp ω0sp c + yp ω1sp c (41) with hc = tanh to ensure yp c [ 1, 1]. This additional non-linearity is not necessary but leads to a better performance. This new cell type called MD Leaky lowpass (Leaky LP) cell is defined in Definition 21. A block diagram of a 1D Leaky LP cell is shown in Figure 5 and different frequency responses in Figure 6. B.3 General First Order MD Cells With the theory of this section we can easily create new cell types. In general, a cell has a number of gates γ1, γ2, . . . Γc. For D = 1 a previous state sp c is given directly. Otherwise the previous state is calculated as trainable convex combination of D previous states, like described in Section 5. In Table 7 cell layouts are depicted whereby type A is the cell developed in Section 7 (compare to (39)). For the other types we briefly want to describe the main ideas. Gundram Leifert et al. 0 0.1 0.2 0.3 0.4 0.5 0 frequency f frequency response H0.1 1 H0.5 1 H0.9 1 0 0.1 0.2 0.3 0.4 0.5 0 frequency f frequency response H0.9;0.1 2 H0.8;0.2 2 H0.5;0.5 2 0 0.1 0.2 0.3 0.4 0.5 0 frequency f frequency response H0.5 1 H0.5;0.5 2 H0.5;0.5;0.5 0 0.1 0.2 0.3 0.4 0.5 0 frequency f frequency response H0.1 1 H0.8;0.2 2 H0.1;0.8;0.2 Figure 6: Frequency response of H1 (dashed), H2 (dotted) and H (solid) for special parameters. Top-left: The frequency response of an IIR filter is able to block even low frequency signals, but it cannot be zero at f = 0.5. Top-right: The frequency response of an FIR filter cannot be lower than the lightgray dotted line, but for f = 0.5 it can be zero. Bottom: When these both filters are concatenated, the resulting frequency response can combine the benefits of each filter. B.3.1 The MD Butterworth Lowpass Filter The cell of type B is a special case of the Leaky LP cell. When we set yp ω0 = yp ω1 = 0.5 there is a direct relation between the cutofffrequency of a discrete Butterworth lowpass filter and the activation of yp φ: Let fcutoffbe the frequency, where amplitude response is reduced to Cells in MDRNNs 2 of the maximal gain. We can calculate fcutoffby 1 yp φ 1 + yp φ yp φ = 1 + tan(πfcutoff) 1 tan(πfcutoff) with the bounds fcutoff (0, 0.5) and yp φ ( 1, 1) (for more details see Schlichthärle, 2000, 2.2;6.4.2). For yp φ (0, 1) we get fcutoff (0, 0.25). In Figure 7 (left) we can see, that even for a negative value of yp φ and a highpass characteristic of H1(z) the impulse response H(z) has a lowpass characteristic. B.3.2 Adding an Additional State Gate In B.2 we fulfilled (38) for the MD Leaky LP cell by setting α0 := 1 α1, so α0 is directly connected with α1. Another solution would be to add an additional value γ (0, 1) and choose α0 := γ (1 α1). So we can extend the MD Leaky LP cell by adding an additional gate γ4 for the previous state (see type C). Unfortunately this does not lead to a better performance and one more gate has to be calculated. B.3.3 Another Solution for the Output The cell of type D is another solution to choose b0 and b1 in Section B.2. For the Leaky LP cell we calculate the output as described in (41). Now we set b0 = γp 2γp 3 and b1 = 1 γp 2 γp 3, and get yp c = γp 3 γp 2sp c + 1 γp 2 sp c . This cell actually works as well as the MD Leaky LP cell and has the same number of gates. In this case we do not need a squashing function hc, because we already have yp c [ 1, 1]. B.3.4 An MD Cell as MD PID-Controller Type E has a completely different interpretation: In controlling engineering a PID-controller gets an error as input. In our case the gate activations have to decide, if the proportional (P), the integral (I) or the derivative (D) term of the error is important for the output. When γp 1 0 we have yp cin sp c so the internal state is proportional to the input. Then γp 2 gates the proportional part (P) of the input. The second gate γp 3 gates the difference between the last and the current input, which can be seen as a discrete derivative (D). If γp 1 1 the internal state is an exponential moving average of yp cin which is an integral term. So γp 2 gates a mainly integral part of the input (I), whereas γp 3 gates a mainly proportional part of the input (P). Dependent on γp 1 type E can be a PD-controller, a PI-controller or a mix of these both. In Figure 7(right) can be seen the frequency response of this cell for different gate activations. Gundram Leifert et al. Type gint ( ) gout ( ) H(z) for hc (x) = x A 1 γp 1 yp cin + γp 1sp c hc γp 2sp c + γp 3sp c (1 γp 1) 1 γp 1z 1 γp 2 + γp 3z 1 B 1 γp 1 yp cin + γp 1sp c sp c+sp c 2 (1 γp 1) 1 γp 1z 1 1+z 1 2 C 1 γp 1 yp cin + γp 1γp 4sp c hc γp 2sp c + γp 4sp c (1 γp 1) 1 γp 1γp 4z 1 γp 2 + γp 3z 1 D 1 γp 1 yp cin + γp 1sp c γp 3 γp 2sp c + 1 γp 2 sp c (1 γp 1) 1 γp 1z 1 γp 3 γp 2 + 1 γp 2 z 1 E 1 γp 1 yp cin + γp 1sp c hc γp 2sp c + γp 3 sp c sp c (1 γp 1) 1 γp 1z 1 γp 2 + γp 3 1 z 1 Table 7: Update equations and transfer function for different cell layouts. The column sp c contains the update equations to calculate the internal state and column yp c contains the update equation for the output. These equations lead to the transfer function H(z) = Hγp 1,γp 2,...(z). 0 0.1 0.2 0.3 0.4 0.5 0 frequency f frequency response H 0.5 1 H0.5;0.5 2 H 0.5;0.5;0.5 0 0.1 0.2 0.3 0.4 0.5 0 frequency f frequency response H0.5 1 H0.5;0.9;0.1 H0.5;0.5;0.5 H0.5;0.1;0.9 Figure 7: Frequency response of H1 (dashed), H2 (dotted) and H (solid) for special layouts and parameters of Table 7. Left (type B): A butterworth lowpass filter with a negative gate activation γp 0 = 0.5 leads to the cutofffrequency π arctan 1+0.5 1 0.5 0.3976. Right (type E): Different frequency responses of a PID controller. Having a fixed γp 0 = 0.5 the frequency response is dependent on the activations of γp 1 and γp 2 and can have lowpass (black), allpass (gray) and highpass (lightgray) characteristic. A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545 552, 2008. Cells in MDRNNs A. Graves, S. Fernandez and J. Schmidhuber. Multi-dimensional recurrent neural networks. Technical report, IDSIA-04-07, 2007. Augustin, E. and Brodin, J.-M and Carré, M. and Geoffrois, E. and Grosicki, E. and Prêteux, F. RIMES evaluation campaign for handwritten mail processing. In Proc. of the Workshop on Frontiers in Handwriting Recognition, number 1, 2006. Y. Bengio. Practical recommendations for gradient-based training of deep architectures. Co RR, abs/1206.5533, 2012. Y. Bengio and Y. Le Cun. Convolutional networks for images, speech andtime-series. 1995. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157 166, 1994. URL http://www.iro.umontreal.ca/~lisa/pointeurs/ieeetrnn94.pdf. C. M. Bishop. Pattern Recognition and Mashine Learning. Springer, 2006. E. Grosicki and H. El-Abed. ICDAR 2011: French handwriting recognition competition. In Proc. of the Int. Conf. on Document Analysis and Recognition, pages 1459 1463, 2011. E. Grosicki, M. Carré, J.-M. Brodin and E. Geoffrois. RIMES evaluation campaign for handwritten mail processing. In Proc. of the Int. Conf. on Frontiers in Handwriting Recognition, 2008. F. A. Gers, J. Schmidhuber and F. Cummins. Learning to forget: Continual prediction with lstm. Technical report, IDSIA-01-99, 1999. F. A. Gers, N. Schraudolph and J. Schmidhuber. Learning precise timing with lstm recurrent networks. Journal of Machine Learning Research, 3:115 143, 2002. Pechwitz, M. and Maddouri, S. and Märgner, V. and Ellouze, N. and Amiri, H. and others. Ifn/enit-database of handwritten arabic words. In Proc. of CIFED, volume 2, pages 127 136. Citeseer, 2002. A. D. Poularikas. The Transforms and Applications Handbook. CRC Press, 2. edition edition, 2000. S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735 1780, 1997. D. Schlichthärle. Digital Filters. Springer, 2000.