# cells_in_multidimensional_recurrent_neural_networks__76ca680b.pdf

Journal of Machine Learning Research 17 (2016) 1-37 Submitted 5/14; Revised 12/15; Published 7/16

Cells in Multidimensional Recurrent Neural Networks

Gundram Leifert gundram.leifert@uni-rostock.de Tobias Strauß tobias.strauss@uni-rostock.de Tobias Grüning tobias.gruening@uni-rostock.de Welf Wustlich welf.wustlich@uni-rostock.de Roger Labahn roger.labahn@uni-rostock.de University of Rostock Institute of Mathematics 18051 Rostock, Germany

Editor: Yoshua Bengio

Keywords: LSTM, MDRNN, CTC, handwriting recognition, neural network Abstract

The transcription of handwritten text on images is one task in machine learning and one solution to solve it is using multi-dimensional recurrent neural networks (MDRNN) with connectionist temporal classiﬁcation (CTC). The RNNs can contain special units, the long short-term memory (LSTM) cells. They are able to learn long term dependencies but they get unstable when the dimension is chosen greater than one. We deﬁned some useful and necessary properties for the one-dimensional LSTM cell and extend them in the multidimensional case. Thereby we introduce several new cells with better stability. We present a method to design cells using the theory of linear shift invariant systems. The new cells are compared to the LSTM cell on the IFN/ENIT and Rimes database, where we can improve the recognition rate compared to the LSTM cell. So each application where the LSTM cells in MDRNNs are used could be improved by substituting them by the new developed cells.

1. Introduction

Since the last decade, artiﬁcial neural networks (NN) became state-of-the-art in many ﬁelds of machine learning, for example they can be applied to pattern recognition. Typical NN are feedforward NN (FFNN) or recurrent NN (RNN), whereas the latter contain recurrent connections. When nearby inputs depend on each other, providing these inputs as additional information to the NN can improve its recognition result. FFNNs obtain these dependencies by making this nearby inputs accessible. If RNNs are used, the recurrent connections can be used to learn if the surrounding input is relevant, but these connections result in a vanishing dependency over time. In S. Hochreiter, J. Schmidhuber (1997) the authors develop the long short-term memory (LSTM) which is able to have a long term dependency. This LSTM is extended in A. Graves, S. Fernandez and J. Schmidhuber (2007) to the multidimensional (MD) case and is used in a hierarchical multi-dimensional RNN (MDRNN) which performed best in three competitions at the International Conference on Document Analysis and Recognition (ICDAR) in 2009 without any feature extraction and knowledge of the recognized language model. In this paper we analyse these MD LSTM regarding the ability to provide long term depen-

c 2016 Gundram Leifert, Tobias Strauß, Tobias Grüning, Welf Wustlich, Roger Labahn.

Gundram Leifert et al.

dencies in MDRNNs and show that it can easily have an unwanted growing dependency for higher dimensions. We deﬁne a more general description of an LSTM a cell and change the LSTM architecture which leads to new MD cell types, which also can provide long term dependencies. In two experiments we show that substituting the LSTM in MDRNNs by these cells works well. Due to this we assume that substituting the LSTM cell by the best performing cell, the Leaky LP cell, will improve the performance of an MDRNN also in other scenarios. Furthermore the new cell types could also be used for the one-dimensional (1D) case, so using them in a bidirectional RNN with LSTMs (BLSTM) could lead to better recognition rates. In Section 2 we introduce the reader to the development of the LSTM cells (S. Hochreiter, J. Schmidhuber, 1997) and its extension (F. A. Gers, J. Schmidhuber and F. Cummins, 1999). Based on that in Section 3 we deﬁne two properties that probably lead to the good performance of the 1D LSTM cells. Both together guarantee that the cell can have a long term dependency. A third property ensures that gradient cannot explode over time. In Section 4 we show that the MD version of the LSTM is still able to provide long term dependency whereas the gradient can explode easily for dimension greater than 1. In Section 5 we change the architecture of the MD LSTM cell and reduce it to the 1D LSTM cell so that the cell fulﬁlls the two properties for any dimension. Nevertheless the internal cell state can linearly grow over time. This problem is solved in Section 6 using a trainable convex combination of the input and the previous internal cell states. The new cell type can provide long term dependencies and does not suﬀer from exploding gradients. Motivated by the last sections we introduce a more general way to deﬁne MD cells in Section 7. Using the theory of linear shift-invariant systems and their frequency analysis we are able to get a new interpretation of the cells and we create 5 new cell types. To test the performance of the cells in Section 8 we take two data sets from the ICDAR 2009 competitions, where the MDRNNs with LSTM cell won. On these data sets we compare the recognition results of the MDRNNs when we substitute the LSTM cells by the new developed cells. On both data sets, the IFN/ENIT data set and the RIMES data set we can improve the recognition rate using the new developed cells.

2. Previous Work

In this section we brieﬂy want to introduce a recurrent neural network (RNN) and the development of the LSTM cell. In previous literature there are various notation to describe the update equations of RNNs an LSTMs. To unify the notations we will refer to their notation using (F. A. Gers, J. Schmidhuber and F. Cummins, 1999; S. Hochreiter, J. Schmidhuber, 1997; A. Graves and J. Schmidhuber, 2008). Therefore we concentrate on a simple hierarchical RNN with one input layer with the set of neurons I, one recurrent hidden layer with the set of neurons H and one output layer with the set of neurons O. For each time step t N the layers are updated asynchronously in the order I, H, O. In one speciﬁc layer all neurons can be updated synchronously. In the hidden layer for one neuron c H

Cells in MDRNNs

P netγ(t) fγ yγ(t) γ y H(t 1)

Figure 1: Schematic diagram of a unit: The unit γ H is a simple neuron with the network s feed forward input y I(t) = yi(t)

i I and recurrent input y H(t 1) = yh(t 1)

h H and an output activation yγ(t). Right: A unit has an input activation netγ(t), which is a linear combination of the source activations y I(t), y H(t 1). The output activation yγ(t) is computed by applying the activation function fγ to the input activation. Left: The short notation of a unit.

at time t N we calculate the neuron s input activation netc by ac(t) netc(t) = X

i I wc,iyi(t) + X

h H wc,hyh(t 1). (1)

with weights w[target neuron],[source neuron]. A bias in (1) can be added by extending the set I := I {bias} with ybias(t) = 1 t N and hence we will not write the bias in the equations, but we use them in our RNNs in Section 8. The neuron s output activation is calculated by yc(t), bc(t) yc(t) = fc (netc(t))

with a diﬀerentiable sigmoid activation function fc. To make (1) suitable for t 0 we deﬁne h H, t Z \ N : yh(t) = 0. This simple neuron with a linear function of activations as input and one activation function we call unit (compare to Figure 1). In (1) the activation of the unit is dependent on the current activations of the layer below and the previous activations of the units from the same layer. When there are no recurrent connections ( c, h H : wc,h = 0), the layer is called feed-forward layer, otherwise recurrent layer.

2.1 The Long Short-Term Memory

A standard LSTM cell c has one input with an input activation ycin(t) a set of gates, one internal state sc and one output(-activation) yc ( yc). The gates are also units and their task is to learn whether a signal should pass the gate or not. They almost always have the logistic activation function flog(x) := 1 1+exp( x) ( f1(x)). The input of the standard LSTM cell is calculated from a unit with an odd activation function with a slope of 1 at x = 0. We use fc(x) = tanh (x) in this paper, another solution could be fc(x) = 2 tanh x

2 (see S. Hochreiter, J. Schmidhuber, 1997). The standard LSTM has two gates: The input gate (IG or ι) and the output gate (OG or ω). These both gates are calculated like a unit, so that

netι(t) = X

i I wι,iyi(t) + X

h H wι,hyh(t 1)

yinc(t), bι(t) yι(t) = flog (netι(t))

Gundram Leifert et al.

netω(t) = X

i I wω,iyi(t) + X

h H wω,hyh(t 1)

youtc(t), bω(t) yω(t) = flog (netω(t)) .

The input of an LSTM is deﬁned like in (1) by netc(t) netcin(t) = X

i I wc,iyi(t) + X

h H wc,hyh(t 1),

g (netc(t)) , f2 (netc(t)) ycin(t) = fc (netcin(t)) .

The internal state sc(t) is calculated by

sc(t) = ycin(t) yι(t) + sc(t 1), (2)

the output activation yc(t) of the LSTM is calculated from yc(t), bc(t) yc(t) = hc (sc(t)) yω(t) (3)

with hc (x) := tanh(x) ( f3(x)). The LSTM can be interpreted as a kind of memory module where the internal state stores the information. For a given input ycin(t) ( 1, 1) the IG decides if the new input is relevant for the internal state. If so, the input is added to the internal state. The information of the input is now saved in the activation of the internal state. The OG determines whether or not the internal activation should be displayed to the rest of the network. So the information, stored in the LSTM is just readable when the OG is active. To sum up, an open IG can be seen as a write -operation into the memory and an open OG as a read -operation of the memory.

Another way to understand the LSTM is to take a look at the gradient propagated through it. To analyse the LSTM properly, we have to ignore gradients comming from recurrent weights. We deﬁne the truncated gradient similar to S. Hochreiter, J. Schmidhuber (1997) and F. A. Gers, J. Schmidhuber and F. Cummins (1999).

Deﬁnition 1 (truncated gradient) Let γ {cin, ι, ω} be any input or gate unit and yc(t 1) any previous output activation. The truncated gradient diﬀers from the exact gradient only by setting recurrent weighted gradient propagation netγ(t)

yc(t 1) to zero. We write

netγ(t) yc(t 1)

= wγ,c = tr 0.

Now, let E be an arbitrary error which is used to train the RNN and E(t)

yc(t) the resulting derivative at the output of the LSTM. The OG can eliminate the gradient coming from the output, because

yc(t) sc(t) = h c (sc(t)) | {z } (0,1]

yω(t) | {z } (0,1)

Cells in MDRNNs

so the OG decides when the gradient should go into the internal state. Especially for |sc(t)| 1 we get

yc(t) sc(t) yω(t).

The key idea of the LSTMs is that an error that occurs at the internal state neither explode nor vanish over time. Therefore, we take a look at the partial derivative sc(t) sc(t 1), which is also known as error carousel (for more details see S. Hochreiter, J. Schmidhuber, 1997). Using the truncated gradient of Deﬁnition 1 for this derivative, we get

sc(t) sc(t 1) =ycin(t) yι(t) sc(t 1) + yι(t) ycin(t) sc(t 1) + 1

=ycin(t) yι(t) yc(t 1) yc(t 1) sc(t 1) + yι(t) ycin(t) yc(t 1) yc(t 1) sc(t 1) + 1

=ycin(t) yι(t)

netι(t) netι(t) yc(t 1) | {z } = tr0

yc(t 1) sc(t 1)

+ yι(t) ycin(t)

netcin(t) netcin(t) yc(t 1) | {z } = tr0

yc(t 1) sc(t 1) + 1

sc(t) sc(t 1) = tr1. (4)

So, once having a gradient at the internal state we can use the chain rule and get τ N : sc(t) sc(t τ) = tr 1. This is called constant error carousel. Like the OG can eliminate the gradient coming from the LSTM output, the IG can do the same with the gradient coming from the internal state, that means it decides when the gradient should be injected to the source activations. This can be seen by taking a look at the partial derivative

sc(t) netcin(t) = sc(t)

ycin(t) ycin(t) netcin(t) = yι(t)f c (netcin(t)) .

If there is a small input |netcin(t)| 1, we get f c (netcin(t)) 1 and can estimate

sc(t) netcin(t) yι(t).

All in all, this LSTM is able to store information and learn long-term dependencies, but it has one drawback which will be discussed in 2.2.

2.2 Learning to Forget

For long time series the internal state is unbounded (compare with F. A. Gers, J. Schmidhuber and F. Cummins, 1999, 2.1). Assuming a positive or negative input and a non zero

Gundram Leifert et al.

activation of the IG, the absolute activation of the internal state grows over time. Using the weight-space symmetries in a network with at least one hidden layer (Bishop, 2006, 5.1.1) we assume without loss of generality ycin(t) 0, so sc(t) t . Hence, the activation function hc saturates and (3) can be simpliﬁed to

yc(t) = hc (sc(t)) | {z } 1

yω(t) yω(t).

Thus, for great activations of sc(t) the whole LSTM works like a unit with a logistic activation function. A similar problem can be observed for the gradient. The gradient coming from the output is multiplied by the activation of the OG and the derivative of hc. For great values of sc(t) we get h c (sc(t)) 0 and we can estimate the partial derivative

yc(t) sc(t) = h c ((sc(t)) yω(t) 0,

which can be interpreted that the OG is not able to propagate back the gradient into the LSTM. Some solutions to solve the linear growing state problem are introduced in F. A. Gers, J. Schmidhuber and F. Cummins (1999). They tried to stabilize the LSTM with a state decay by multiplying the internal state in each time step with a value (0, 1), which did not improve the performance. Another solution was to add an additional gate, the forget gate (FG or φ). The last state sc(t 1) is multiplied by the activation of the FG before it is added to the current state sc(t). So we can substitute (2) by

sc(t) = ycin(t) yι(t) + sc(t 1) yφ(t),

so that the truncated gradient in (4) is changed to

sc(t) sc(t 1) = ycin(t) yι(t) sc(t 1) + yι(t) ycin(t) sc(t 1) + yφ(t)

and for longer time series we get τ N

sc(t) sc(t τ) = tr

t =0 yφ(t t ).

Now, the Extended LSTM is able to learn to forget its previous state. However, an Extended LSTM is still able to work like an standard LSTM without FG by having an activation yφ(t) 1. In this paper we denote the Extended LSTM as LSTM Another point of view was introduce in Bengio et al. (1994): To learn long-term dependencies a system must have an architecture to that an input can be saved over long time and does not suﬀer from the vanishing gradient problem. On the other hand the system should avoid an exploding gradient , which means that a small disturbance has a growing inﬂuence over time. In this paper we do not want to solve the problem of vanishing and exploding gradient for a whole system, we want to solve this problem only for one single cell. But we think that it is an necessary condition to provide long time dependencies of a system.

Cells in MDRNNs

yΓ(t) Γ c γ1

y I(t) y H(t 1)

y I(t) y H(t 1)

gint ( ) gout ( )

memory sc(t 1),...,sc(t k)

ycin(t) sc(t) yc(t)

Figure 2: Schematic diagram of a cell: The function gint calculates the internal state sc(t) from the previous internal states sc(t 1), . . . , sc(t k) and the cell input ycin(t) using the gate activations yΓ(t). The function gout calculates the output yc(t) of the cell from the actual and previous internal states sc(t), . . . , sc(t k), the cell input ycin(t) also using the gate activations yΓ(t).

Gundram Leifert et al.

3. Cells and Their Properties

In this section we want to introduce a general cell and ﬁgure out properties for these cells which probably lead to the good performance observed by LSTM cells.

Deﬁnition 2 (Cell, cf. Fig. 2) A cell, c, of order k consists of

one designated input unit, cin, with sigmoid activation function fc (typically fc = tanh unless speciﬁed otherwise);

a set Γ (not containing cin) of units called gates γ1, γ2, . . . with sigmoid activation functions fγi, i = 1, . . . (typically logistic fγi = flog unless speciﬁed otherwise);

an arbitrary function, gint, and a cell activation function, gout, mapping into [ 1, 1].

Each unit of Γ {cin} receives the same set of input activations. The cell update in time step t N is performed in three subsequent phases:

1. Following the classical update scheme of neurons (see Section 2), all units in Γ {cin} calculate synchronously their activations, which will be denoted by yΓ(t) := yγ(t)

γ Γ and ycin(t). Furthermore, we call ycin(t) the input activation of the cell.

2. Then, the cell computes it s so-called internal state

sc(t) := gint (yΓ(t), ycin(t), sc(t 1), . . . , sc(t k)) .

3. Finally, the cell computes it s so-called output activation

yc(t) := gout (yΓ(t), ycin(t), sc(t), sc(t 1), . . . , sc(t k)) .

In this paper we concentrate on ﬁrst order cells (k = 1). Now, we use Deﬁnition 2 to re-introduce the (Extended) LSTM cell.

Remark 3 (LSTM cell) An LSTM cell is a cell of order 1 where hc = tanh and

Γ = {ι, φ, ω}

sc(t) := gint (yΓ(t), ycin(t), sc(t 1)) := ycin(t)yι(t) + sc(t 1)yφ(t)

yc(t) := gout (yΓ(t), sc(t)) := hc (sc(t)) yω(t)

Properties of cells. Developing the 1D LSTM cells, the main idea is to save exactly one piece of information over a long time series and to propagate the gradient back over this long time, so that the system can learn precise storage of this piece of information. In instance a given input ycin (which represent the information) at time tin should be stored into the cell state sc until the information is required at time tout. To be able to prove the following properties, we will assume the truncated gradient deﬁned in Deﬁnition 1. Nevertheless we will use the full gradient in our Experiments, because it turned out that it works much better. The next two properties of a cell ensure the ability to work as such a memory.

Cells in MDRNNs

y I(t) y H(t 1)

y I(t) y H(t 1)

y I(t) y H(t 1)

+ sc(t) yc(t)

memory sc(t 1)

Figure 3: Schematic diagram of a one-dimensional LSTM cell: The input (cin) is multiplied by the IG (ι). The previous state sc(t 1) is gated by the FG (φ) and added to the activation coming from the IG and input. The output of the cell is the squashed internal state (squashed by hc (x) = tanh(x)) and gated by the OG (ω).

Gundram Leifert et al.

The ﬁrst property should ensure that an input ycin at time tin can be memorized (the cell input is open) in the internal activation sc until tout (the cell memorizes) and has a negligibly inﬂuence on the internal activation for t > tout (the cell forgets). In addition, the cell is able to prevent inﬂuence of other inputs at time steps t = tin (the cell input is closed).

Deﬁnition 4 (Not vanishing gradient (NVG)) A cell c allows an NVG : For arbitrary tin, tout N, tin tout, δ > 0 there exist gate activations yΓ(t) such that for any t1, t2 N

sc (t2) ycin (t1) tr

[1 δ, 1] for t1 = tin and tin t2 tout [0, δ] otherwise (5)

The next deﬁnition guaranties that at any time t N the gate activations can (the cell output is open) or not (the cell output is closed) distribute the piece of information saved in sc to the network. This is an important property because the piece of information can be memorized in the cell without presenting it to the network. Note that the decision is just dependent on gate activations at time t and there are no constraints to previous gate activations. In Deﬁnition 2 we require yc(t) [ 1, 1] whereas sc(t) R. So we cannot have arbitrarily small intervals of the derivative as in (5), but we can ensure two distinct intervals for open and closed cell output. When we take Deﬁnition 4 and 5 together, a cell is able to save an input over long term series, can decide at each time step whether or not it is presented to the network and can forget the saved input.

Deﬁnition 5 (Controllable output dependency (COD)) A cell c of order k allows an COD : There exist δ1, δ2 (0, 1), δ2 < δ1 so that for any time t N there exists a gate vector yΓ(t) leading to open output dependency

yc (t) sc (t) [δ1, 1] (6)

and there exists another gate vector yΓ(t) leading to a closed output dependency

yc (t) sc (t) [0, δ2] . (7)

The third property is a kind of stability criterion. An unwanted case is that a small change (caused by any noisy signal) at time step tin has a growing inﬂuence at later time steps. This is equivalent to an exploding gradient over time. Controlling the gradient of the whole system and avoiding him not to explode is a hard problem. But we can at least avoid the exploding gradient in one cell. This should be prohibited for any gate activations.

Deﬁnition 6 (Not exploding gradient (NEG)) A cell c has an NEG : For any time steps tin, t N, tin < t and any gate activations yΓ(t) the truncated gradient in bounded by

sc (t) sc (tin) tr [0, 1] .

Cells in MDRNNs

We think that a cell fulﬁlling these three properties can work as stable memory. To be able to prove these properties for the LSTM cell we have to considerate the gate activations. In general, the activation function of the gates does not have to be the logistic activation function flog, whereas for this paper we set γ Γ : fγ := flog. So the activation of gates can never be exactly 0 or 1, because of a ﬁnite input activation netγ(t) to the gate activation function. But a gate can have an activation yγ(t) [1 ε, 1) if it is opened or yγ(t) (0, ε] if it is closed, because for a realistic large input activation netγ(t) 7 (low input activation netγ(t) < 7) we get an activation within the interval yγ(t) [1 ε, 1) (yγ(t) (0, ε]) with ε < 1 1000. Handling with these activation intervals we can prove the deﬁnitions for the LSTM cell. Now we can prove whether or not the LSTM cell has these properties.

Theorem 7 (Properties of the LSTM cell) The 1D LSTM cell allows NVG and has an NEG, but does not allow COD.

Proof see A.1 in appendix.

4. Expanding to More Dimensions

In A. Graves, S. Fernandez and J. Schmidhuber (2007) the 1D LSTM cell is extended to an arbitrary number of dimensions; this is solved by using one FG for each dimension. In many publications using the MD LSTM cell in MDRNNs outperform state-of-the-art recognition systems (for example see A. Graves and J. Schmidhuber, 2008). But by expanding the cell to the MD case, the absolute value of the internal state |sc| can grow faster than linear over time. When sp c and there are peephole connections (for peephole connection details see F. A. Gers, N. Schraudolph and J. Schmidhuber, 2002), the cells have an output activation of yp c { 1, 0, 1}: The internal state multiplied by the peephole weight overlays the other activation-weight-products and this leads to an activation of the OG yp ω {0, 1} and a squashed internal state hc sp c { 1, 1}. So the output of the cell is yp ωhc sp c = yp c { 1, 0, 1}. But also without peephole connections the internal state can grow, which leads to hc sp c { 1, 1} and the cell works like a conventional unit with a logistic activation function yc(t) yω(t). Our goal is to transfer the Deﬁnitions 4, 5 and 6 deﬁned in Section 3 into the MD case and we will see that the MD LSTM cell has an exploding gradient. In the next sections we will provide alternative cell types, that fulﬁll two or all of these deﬁnitions. In the 1D case it is clear, that there is just one way to come from date t1 to date t2, when t1 < t2, by incrementing t1 as long as t2 is reached. For the MD case the number of paths depends on the number of dimensions and the distance between these two dates. An MD path is deﬁned as follows.

Deﬁnition 8 (MD path) Let p,q ND be two dates. A p-q-path π of length k 0 is a sequence

π := {p = p0,p1, . . . ,pk = q}

with i {1, . . . , k} !d {1, . . . , D} : (pi) d = pi 1. Further, let πi := pi.

Gundram Leifert et al.

We can deﬁne the distance vector

pq := q p =

q1 p1 ... q D p D

pq1 ... pq D

between the dates p and q. When pq has at least one negative component, there exists no p-q-path. Otherwise there exist exactly

pq1, . . . , pq D

PD i=1 pqi ! QD i=1 pqi!

p-q-paths (compare with the multinomial coeﬃcient). We write p < q when #{ pq} 1 and p q when p = q p < q. Now we can extend the deﬁnitions of the 1D case to the MD case, whereas we concentrate on the MD cells of order 1.

Deﬁnition 9 (MD cell) An MD cell, c, of order 1 and dimension D consists of the same parts as a 1D cell of order 1. The cell update in date p ND is performed in three subsequent phases:

1. Following the classical update scheme of neurons (see Section 2), all units in Γ {cin} synchronously calculate their activations, which will be denoted by yp Γ = yp γ

γ Γ. Furthermore, we call yp cin the input activation of the cell.

2. Then, the cell computes it s so-called internal state

sp c := gint

yp Γ, yp cin, sp 1 c , . . . , s p D c

3. Finally, the cell computes it s so-called output activation

yp c := gout

yp Γ, yp cin, sp c, sp 1 c , . . . , s p D c

Using this, we can reintroduce the LSTM cell as well as Deﬁnition 4, 5 and 6 for the MD case:

Deﬁnition 10 (MD LSTM cell) An MD LSTM cell is a cell of dimension D and order 1 where hc = tanh and

Γ = {ι, (φ, 1) , . . . , (φ, D) , ω}

sp c = gint

yp Γ, yp cin, sp 1 c , . . . , s p D c

= yp ι yp cin + D P

d=1 s p d c yp φ,d

yp c = gout yp Γ, sp c = hc sp c yp ω

Cells in MDRNNs

Deﬁnition 11 (MD Not vanishing gradient (NVG)) An MD cell c allows an NVG : For arbitrary pin,pout ND,pin pout, δ > 0 there exist p ND gate activations yp Γ such that for any p1,p2 ND

sp2 c yp1 cin tr

[1 δ, 1] for p1 = pin and pin p2 pout [0, δ] otherwise (8)

Deﬁnition 12 (MD Controllable output dependency (COD)) An MD cell c allows an COD : There exist δ1, δ2 (0, 1), δ2 < δ1 so that for any time t N there exists a gate vector yp Γ leading to open output dependency

yp c sp c [δ1, 1] (9)

and there exists another gate vector yp Γ leading to a closed output dependency

yp c sp c [0, δ2] . (10)

Deﬁnition 13 (MD Not exploding gradient (NEG)) An MD cell c has an NEG : For any time steps pin,p ND,pin < p and any gate activations yp Γ the truncated gradient in bounded by

sp c spin c tr [0, 1]

We can now consider these deﬁnitions for the MD LSTM cell.

Theorem 14 (NVG of MD LSTM cells) An MD LSTM cell allows an NVG.

Proof see A.2 in appendix

For arbitrary activations of FGs the partial derivative sp c spin c can grow over time:

Theorem 15 (NEG of MD LSTM cells) An MD LSTM cell can have an exploding gradient, when D 2.

Proof see A.3 in appendix.

The MD LSTM cell does not allow the COD, because the 1D case is a special case of the MD case. Our idea for the next section is to change the MD LSTM layout, so that it has an NEG.

Gundram Leifert et al.

5. Reducing the MD LSTM Cell to One Dimension

In the last section, we showed that the MD LSTM cell can have an exploding gradient. We tried diﬀerent ways to solve this problem. For example we divided the activation of the FG by the number of dimensions. Then the gradient cannot explode over time, but the gradient vanishes along some paths rapidly. Another approach was to give the cells the opportunity to learn to stabilize itself, when the internal state starts diverging. Therefore we add an additional peephole connection between the square value of the previous internal

states s p d c

2 and the FGs so that the cell is able to learn that it has to close the FG for

large internal states. This also does not make a signiﬁcant diﬀerence. Also forcing the cell to learn to stabilize itself by adding an error

Lossstate = ε sp c p

with p = {1, 2, 3, 4} and diﬀerent learning rates ε does not work. So we tried to change the layout of the MD LSTM cell.

5.1 MD LSTM Stable Cell

In Section 3 we realized that 1D LSTM cells work good and the gradient does not explode,

but in the MD case it does. Our idea is to combine the previous states s p d c at date p to one previous state sp c and take the 1D form of the LSTM cell. For this reason we call this cell LSTM Stable cell. Therefore, a function

sp c = f sp 1 c , . . . , s p D c

is needed, so that the following two beneﬁts of the 1D LSTM cell remain:

1. The MD LSTM Stable cell has an NEG

2. The MD LSTM Stable cell allows NVG.

The convex combination

sp c = f sp 1 c , . . . , s p D c

d=1 λp ds p d c , d = 1, . . . , D : λp d 0,

d=1 λp d = 1 (11)

of all states satisﬁes these both points (see Theorems 17 and 18). To calculate these D coeﬃcients we want to use the activation of D gates and we call them lambda gates (LG or λ).

Deﬁnition 16 (MD LSTM Stable cell) An MD LSTM Stable cell is a cell of dimension D and order 1 where hc = tanh and

Γ = {ι, (λ, 1) , . . . , (λ, D) , φ, ω}

Cells in MDRNNs

sp c = gconv

yp Γ, sp 1 c , . . . , s p D c

d=1 s p d c yp λ,d D P

d =1 yp λ,d

sp c = gint yp Γ, yp cin, sp c = yp ι yp cin + sp c yp φ

yp c = gout yp Γ, sp c = yp ωhc sp c

Using these equations we can test the cell for its properties. The MD LSTM Stable cell does not have the COD, because the 1D LSTM cell also does not have this property. For the other propertiese we get:

Theorem 17 (LTD of MD LSTM Stable cells) An MD LSTM Stable cell allows NVG.

Proof See A.4 in appendix.

Theorem 18 (NEG of MD LSTM Stable cells) An MD LSTM Stable cell has an NEG.

Proof See A.5 in appendix.

Reducing the number of gates by one. When D 2 an MD LSTM Stable cell has one more gate than a classical MD LSTM (for D = 1 the both cells are equivalent). But it is possible to reduce the number of LGs by one. One solution is to choose one dimension d {1, . . . , D} which does not get an LG. Its activation is calculated by

d {1,...,D}\{d }

In the special case of D = 2 we can choose d = 2 and we get P2 d =1 yλ,d = yλ,1+(1 yλ,1) = 1 and the update equation of the internal state can be simpliﬁed to

sp c = gint

yp ι , yp λ,1, yp φ, sp 1 c , sp 2 c

= yp ι yp cin + yp λ,1sp 1 c + 1 yp λ,1 sp 2 c .

6. Bounding the Internal State

In the last sections we discussed the growing of the EC over time and we found a solution to have a NGEC for higher dimensions. Nevertheless it is possible that the internal state grows linearly over time. When we take a look at Deﬁnition 10, we see that the partial derivative for p = pout depends on h c sp c . So having the inequality

yp c sp c h c (sp c) with h c (sp c) |sp c| 0

Gundram Leifert et al.

the cell allows NVG deﬁned in Deﬁnition 11, but actually we have ypout c ypin cin

|spout c | 0 for

arbitrary gate activations. Again, ideas like state decay, additional peephole connections or additional loss functions like mentioned in Section 4 either do not work or destroy the NVG of the LSTM and LSTM Stable cell. So, our solution is to change the architecture of the MD LSTM Stable cell, so that it fulﬁlls has an NEG and allows NVG and COD. The key idea is to bound the internal state, so that for all inputs yp cin 1, p ND the internal state is bounded by sp c 1. Note that this is comparable with the well-known Bounded-Input-Bounded-Output-Stability (BIBO-Stability). To create an MD cell that has an NEG, allows NVG and has a bounded internal state, we take the MD LSTM Stable cell proposed in the last section and change its layout. Therefore we calculate the activation of the IG as function of the FG, so that we achieve sp c 1 by choosing yp ι := 1 yp φ. So the activation of the FG controls how much leaks from the previous states. The activation of the FG can also be interpreted as switch, if the internal activation, the new activation or a convex combination of these both activations should be stored in the cell. So the sc can be seen as time-dependent exponential moving average of ycin.

Deﬁnition 19 (MD Leaky cell) An MD Leaky cell is a cell of dimension D and order 1 where hc = tanh and

Γ = {(λ, 1) , . . . , (λ, D) , φ, ω}

sp c = gconv

yp Γ, sp 1 c , . . . , s p D c

d=1 s p d c yp λ,d D P

d =1 yp λ,d

sp c = gint yp Γ, yp cin, sp c = 1 yp φ yp cin + sp c yp φ

yp c = gout yp Γ, sp c = yp ωhc sp c

Now we can prove that the resulting cell has all beneﬁts.

Theorem 20 The MD Leaky cell has an NEG and allows NVG and COD.

Proof See A.6 in appendix.

The MD Leaky cell can have one gate less than the MD LSTM cell and the MD LSTM Stable cell and because of this, the update path requires less computations.

7. General Derivation of Leaky Cells

So far we proposed cells for the MD case, which are able to provide long term memory. But especially in MDRNNs with more than one MD layer it is hard to measure if and how much long term dependencies are used and even if it is useful. Another way to interpret the cell is to consider them as kind of MD feature extractor like feature maps in Convolutional Neural Networks (Bengio and Le Cun, 1995). Then the aim is to construct an MD cell which is able to generate useful features. Having a hierarchical Neural Network like in Bengio and Le Cun (1995) and A. Graves and J. Schmidhuber (2008) over the hierarchies the number of

Cells in MDRNNs

features increases with a simultaneously decreasing feature resolution. Features in a layer with low resolution can be seen as low frequency features in comparison to features in a layer with high resolution. So it would be useful to construct a cell as feature extractor which produces a low frequency output in comparison to its input. In appendix B we take a closer look at the theory of linear shift invariant (LSI)-systems and their frequency analysis and analyse a ﬁrst order LSI-system regarding its free selectable parameters using the Fand Z-transform. There, we derive the MD Leaky LP cell (see Deﬁnition 21) and 5 additional ﬁrst order MD cells, which we do not test in Section 8.

Deﬁnition 21 (MD Leaky LP cell) An MD Leaky LP cell is a cell of dimension D and order 1 where hc = tanh and

Γ = {(λ, 1) , . . . , (λ, D) , φ, ω0, ω1}

sp c = gconv

yp Γ, sp 1 c , . . . , s p D c

d=1 s p d c yp λ,d D P

d =1 yp λ,d

sp c = gint yp Γ, yp cin, sp c = 1 yp φ yp cin + sp c yp φ

yp c = gout yp Γ, sp c, sp c = hc sp cyp ω0 + sp c yp ω1

Setting the second OG (yp ω1) to zero, the Leaky LP cell corresponds to the Leaky cell, hence it fulﬁlls all three properties, but has one more gate, which is as much gates as the LSTM cell.

8. Experiments

RNNs with 1D LSTM cells are well studied. In some experiments the activations of the gates and the internal state are observed and one can see that the cell can really learn, when to forget information and when the internal state should be accessible for the network (see F. A. Gers, N. Schraudolph and J. Schmidhuber, 2002). However, we did not ﬁnd experiments like these for the MD case and we do not want to transfer these experiments into the MD case. Instead we compare the diﬀerent cell types with each other in two scenarios where the MD RNNs with LSTM cells perform very well. In both benchmarks the task is to transcribe a handwritten text on an image, so we have a 2D RNN. In this case we compare the cells on the IFN/ENIT (Pechwitz, M. and Maddouri, S. and Märgner, V. and Ellouze, N. and Amiri, H. and others, 2002) and the Rimes database (Augustin, E. and Brodin, J.-M and Carré, M. and Geoﬀrois, E. and Grosicki, E. and Prêteux, F., 2006). Both tasks are solved with the MD RNN layout described in A. Graves and J. Schmidhuber (2008) and shown in Figure 4. All networks are trained with Backpropagation through time (BPTT) . To compare the diﬀerent cell types in RNNs with each other we take 10 RNNs with diﬀerent weight initializations of each cell type and calculate the minimum, the maximum and the median of the best label error rate (LER) on a validation set of these 10 RNNs. In all tables we present these three LERs to compare the cell types. We think it is more important to have stable cells in the lower MD layers because of two reasons: First, when we have just a few cells in a layer, the saturation of one cell has a

Gundram Leifert et al.

INPUT LAYER

OUTPUT LAYER

Subsample 1

Subsample 2

Subsample 3

Figure 4: Architecture of the hierarchical MDRNN used for the experiments: It is equivalent to A. Graves and J. Schmidhuber (2008, Figure 2). A 2D layer contains 22 distinct layers (for each combination of scanning direction left/right and up/down one layer). To reduce the number of weights between two 2D layers, a 0D layer is inserted, which contains units with tanh as activation function. They have dimension 0 because they have no recurrent connections. These layers can be seen as feed-forward or convolutional layer. Each 2D layer (or its allocated 0D layer) reduces its size in in x and y dimension using a two-dimensional subsampling. Simultaneously the number of feature maps (z-dimension) increases to have no bottleneck between input and output layer.

greater eﬀect on the performance of the network. Second, in lower layers there are longer time series so having an unstable cell in such a layer, it has time to saturate. So our ﬁrst experiment compares the recognition results when we substitute the LSTM cells in the lowest layer (which is 2D layer 1 in Figure 4) by the newly developed cells. In the second experiment we compare the LSTM cell and the Leaky LP cell also in the higher MD layers ( 2D layer 2 and 3 in Figure 4), to evaluate if the Leaky LP cell work better also

Cells in MDRNNs

Label-Error-Rate in Percent Celltype min max median LSTM 8,58% 14,73% 10,58% Stable 8,78% 11,75% 9,55% Leaky 8,87% 10,47% 9,10% Leaky LP 8,24% 9,40% 8,93%

Table 1: Diﬀerent cell types in the lowest MD layer

in long time series. In Bengio (2012, 3.1.1) it is mentioned, that an important hyper parameter for a training is the learning rate, so another experiment is to train all networks with stochastic gradient decent with diﬀerent learning rates δ 1 10 3, 5 10 4, 2 10 4, 1 10 4 and compare the best LER according a ﬁxed learning rate.

8.1 The IFN/ENIT Database

This database contains handwritten names of towns and villages of Tunisia. The set is divided into 7 (a-f,s) sets, where 5 (a-e) are available for training and validation (for details see Pechwitz, M. and Maddouri, S. and Märgner, V. and Ellouze, N. and Amiri, H. and others, 2002). With all information we got from A. Graves, we were able to get comparable results to A. Graves and J. Schmidhuber (2008). Therefor we divide the sets a-e into 30000 training samples and 2493 validation samples. All network are trained 100 epochs with a ﬁxed learning rate δ = 1 10 4. The LER is calculated on the validation set.

8.1.1 Different Cells in the Lowest MD Layer

In our ﬁrst experiment we substitute the LSTM cell in the lowest MD layer. We take some of the cells described in this paper. In Table 1 the results are shown. The ﬁrst row is the same RNN layout used in A. Graves and J. Schmidhuber (2008). We can see, that the Leaky LP cell performs the best. Nevertheless the worst RNN with Leaky LP cells in the lowest MD layer performs worth than the best RNN with LSTM cells. So we cannot say, that Leaky LP is always better. But it can be observed that the variance of the RNN performance is very high with LSTM cells in the lowest MD layer. Our interpretation is that LSTM cells have a comparable performance like the Leaky LP cells in the lowest layer, when they do not saturate. Note, that the Leaky cell has one gate less, so they are faster and have less trainable weights.

8.1.2 Different Cells in Other MD Layers

Now we want to compare the best new developed cell the Leaky LP cell with the LSTM cell in the other MD layers. So we also substitute the LSTM cell in the upper MD layers. We enumerate the 2D layers like shown in Figure 4. In Table 2 we can see that substituting the LSTM cells only in the lowest or in the both lowest layer perform slightly better. The best results can be achieved when we use Leaky LP cells in 2D layer 1 and LSTM cells in 2D layer 3. Using LSTM in the middle layer seems to work slightly better than using the Leaky LP

Gundram Leifert et al.

Celltype in 2D layer Label-Error-Rate in Percent 1 2 3 min max median LSTM LSTM LSTM 8,58% 14,73% 10,58% Leaky LP LSTM LSTM 8,24% 9,40% 8,93% Leaky LP Leaky LP LSTM 8,35% 11,27% 8,91% Leaky LP Leaky LP Leaky LP 8,92% 11,69% 9,74%

Table 2: Diﬀerent cells in other layers

Label-Error-Rate in Percent Celltype BP-delta min max median LSTM 1 10 4 8,58% 14,73% 10,58% LSTM 2 10 4 9,15% 16,86% 10,51% LSTM 5 10 4 9,03% 21,77% 11,44% LSTM 1 10 3 10,21% 30,20% 11,44% Leaky LP 1 10 4 8,92% 11,69% 9,74% Leaky LP 2 10 4 8,38% 9,09% 8,81% Leaky LP 5 10 4 8,25% 8,95% 8,78% Leaky LP 1 10 3 8,29% 9,20% 8,88% Leaky LP 2 10 3 8,95% 12,81% 9,55%

Table 3: Performance of cells regarding learning-rate

cells instead. This ﬁts to our intuition mentioned before that the LSTM cells perform better when they do not have a too long time series and when there are enough cells in one layer which do not saturate.

8.1.3 Performance of Cells Regarding Learning-Rate

When we take a look at the update equations and the proofs of the NEG it can be assumed, that the gradient going through the cells is lower for Leaky LP cells in contrast to LSTM cells. So we think the learning rate have to be larger for Leaky LP cells. In Table 3 we compare the networks with either only LSTM or Leaky LP cells. There we can see that the learning rate have to be much higher for the Leaky LP cells. In addition, the RNNs with Leaky LP cells are more robust to the choice of the learning rate.

8.2 The Rimes Database

One task of the Rimes database is the handwritten word recognition (for more details see E. Grosicki, M. Carré, J.-M. Brodin and E. Geoﬀrois, 2008; E. Grosicki and H. El-Abed, 2011). It contains 59292 images of french single words. It is divided into distinct subsets; a training set of 44196 samples, a validation set of 7542 samples and a test set of 7464 samples. We train the MD RNNs by using the training set for training and calculate the LER over the validation set, so the network is trained on 44196 training samples each epoch. The network used in this section diﬀers only in the subsampling rate between two layers from the network used in A. Graves and J. Schmidhuber (2008). When there is a subsampling between layers,

Cells in MDRNNs

Label-Error-Rate in Percent Celltype min max median LSTM 14,96% 17,63% 16,50% Stable 14,45% 16,02% 15,11% Leaky 14,77% 16,39% 15,85% Leaky LP 14,63% 15,78% 15,30%

Table 4: Diﬀerent cell types in the lowest MD layer

Celltype in 2D layer Label-Error-Rate in Percent 1 2 3 min max median LSTM LSTM LSTM 14,96% 17,63% 16,50% Leaky LP LSTM LSTM 14,63% 15,78% 15,30% Leaky LP Leaky LP LSTM 14,21% 15,57% 14,92% Leaky LP Leaky LP Leaky LP 14,94% 16,18% 15,52%

Table 5: Diﬀerent cells in other layers

the factors are 3 2 instead of 4 3 or 4 2. The rest of the experiment is the same like described in Section 8.1.

8.2.1 Different Cells in the Lowest MD Layer

In Table 4 we can see that substituting the LSTM in the lowest layer by one of the three cells improves the performance of the network, even the Leaky cell with one gate less.

8.2.2 Different Cells in Other MD Layers

We want to see the eﬀect of the substitution of the LSTM cell by the Leaky LP cell in the upper MD layers. In Table 5 we can see that using Leaky LP cells in both lowest layers perform very well. So we also take this setup to try diﬀerent learning rates.

Performance of Cells Regarding Learning-Rate. Using diﬀerent learning rates we can see that the RNN with Leaky LP cells in the both lowest layers and the LSTM cells in the top layer can signiﬁcantly improve the performance . Even the maximal LER of this RNN works better than the best network with LSTM cells in each layer.

9. Conclusion

In this paper we took a look at the one-dimensional LSTM cell and discussed the beneﬁts of this cell. We found two properties, that probably make these cells so powerful in the one dimensional case. Expanding these properties to the multi dimensional case, we saw that the LSTM does not fulﬁll one of these properties any more. We solved this problem by changing the architecture of the cell. In addition we presented a more general idea how to create one dimensional or multi dimensional cells. We compare some newly developed cells with the LSTM cell on two data sets and we can improve the performance using the

Gundram Leifert et al.

Label-Error-Rate in Percent Celltype BP-delta min max median LSTM 1 10 4 14,96% 17,63% 16,50% LSTM 2 10 4 14,41% 16,88% 15,61% LSTM 5 10 4 15,05% 16,27% 15,47% Leaky LP 1 10 4 14,94% 16,18% 15,52% Leaky LP 5 10 4 12,68% 13,95% 13,57% Leaky LP in 2D layer 1 & 2 2 10 4 13,26% 14,04% 13,65% Leaky LP in 2D layer 1 & 2 5 10 4 12,08% 13,42% 12,87%

Table 6: Performance of cells regarding learning-rate

new cell types. Due to this we think that substituting the multi-dimensional LSTM cells by the multi-dimensional Leaky LP cell could improve the performance of many system working with a multi-dimensional space.

Appendix A. Proofs

A.1 Proof of 7

Proof Let c be a 1D LSTM cell. To get the derivative sc(t2)

sc(t1) according the truncated gradient between two time steps t1, t2 N we have to take a look at gint.

sc (t2) sc (t1) = gint (yΓ(t2), ycin(t2), sc(t2 1))

sc (t1) (12)

= (ycin(t2)yι(t2))

sc (t1) | {z } = tr0

sc (t1) yφ(t2) + sc(t2 1) yφ(t2)

sc (t1) | {z } = tr0

= tr sc(t2 1)

sc (t1) yφ(t2)

t=t1+1 yφ(t) (13)

In addition, t N we have

sc (t) ycin (t) = yι(t) and yc (t) sc (t) = h c (sc(t)) yω(t). (14)

We will prove the properties successively. NEG: For the LSTM cell the FG fφ = flog ensures yφ(t) (0, 1), so using these bounds in (13) with

sc (t) sc (tin) = tr

t =tin+1 yφ(t ) (0, 1)

Cells in MDRNNs

the LSTM cell has an NEG. NVG: Therefore, we choose

yι (t) [1 ε, 1) if t = tin (0, ε] otherwise ,

yφ (t) [1 ε, 1) if tin < t tout (0, ε] otherwise ,

with a later chosen ε > 0. Let t1, t2 N, t1 t2 be two arbitrary dates, where we want to calculate the gradient sc(t2) ycin(t1). First, we want to show that the LSTM cell allows NVG for t1 = tin and tin t2 tout: We have yι(t1) [1 ε, 1) and t = tin + 1, . . . , tout : yφ(t) [1 ε, 1). Then, we can estimate the derivative from (12) and (14) by

sc (t2) ycin (t1) = sc (t2)

sc (t1) sc (t1) ycin (t1) = tr yι (t1)

t=t1+1 yφ (t)

t=t1+1 (1 ε) , 1

(1 ε)tout tin+1 , 1

To fulﬁll the equation for NVG we choose ε depending on δ such that

1 δ (1 ε)tout tin+1

1 tout tin+1

holds. Second, we have to show, that the derivative is in [0, δ], when t1 = tin and tin t2 tout is not fulﬁlled. In the case of t1 = tin when ε δ we can use the NEG which leads to

sc (t2) ycin (t1) = sc (t2)

sc (t1) | {z } tr[0,1]

sc (t1) ycin (t1) | {z } (0,ε]

[0, ε] [0, δ].

When t1 = tin we have two cases: t2 < tin or t2 > tout. For the case t2 < tin the derivative is zero ( [0, δ]), because the cell is causal. For t2 > tout we can split the derivative at tout and get

sc (t2) ycin (t1) = tr yι(t1)

t=t1+1 yφ(t)

| {z } (0,1)

t=tout+1 yφ(t) | {z } (0,ε]

tr 0, εt2 tout [0, ε] [0, δ].

For ε min n δ, 1 (1 δ)

1 tout tin+1 o the LSTM cell allows NVG. COD: To prove that the LSTM cell has no COD, we show that there are gate activations

Gundram Leifert et al.

such that in Deﬁnition 5 we get δ2 > δ1. Therefore, we assume that all gate activations are arbitrary (yγ(t) (0, 1)), closed (yγ(t) (0, ε]) or opened (yγ(t) [1 ε, 1)) with a later chosen ε > 0. We take a look at the right side of (14). For sc(t) = 0 we get h c (sc(t)) = 1. In Deﬁnition 5 we have to satisfy yΓ(t) : yc(t)

sc(t) [0, δ2] an choose the OG yω(t) (0, ε] with

But then for t = 1, . . . , t 1 we can choose the IG and FG open with the same ε so that

yφ(t ), yι(t ) [1 ε, 1) .

When for all time steps t = 1, . . . , t there is a positive input ycin(t ) [c, 1), c (0, 1) R and an internal state sc(t 1) < c(1 ε)

ε , the internal state is growing over time, because

sc(t ) = ycin(t )yι(t ) + sc(t 1)yφ(t )

c(1 ε) + sc(t 1)(1 ε)

sc(t 1) + c(1 ε) sc(t 1)ε

> sc(t 1) + c(1 ε) c(1 ε)

For large sc(t) c(1 ε)

ε 1 we can estimate

tanh(sc(t)) | {z } exp( 2sc(t))

exp ( sc(t)) exp c(1 ε)

This yields in (14) to the bound yc (t) sc (t)

= h c (sc(t)) yω(t) (16)

so in Deﬁnition 5 we get

δ1 exp c(1 ε)

But when we combine (15), (18) and the restriction in Deﬁnition 5, we have

ε δ2 < δ1 exp c(1 ε)

but there exist ε, c, such that the inequality is not fulﬁlled, which is a contradiction. Summarized, the 1D LSTM cell allows an NVG and has an NEG, but does not allow COD.

Cells in MDRNNs

A.2 Proof of 14

Proof Let c be an MD LSTM cell of dimension D, p,p1,p2,pin,pout ND,pin pout arbitrary dates and hc = tanh the sigmoid function. Besides ε > 0 is a later chosen value. In the ﬁrst step we want to show that there are activations of the forget gates, so that

sp c spin c tr

(1 ε) p pin 1, 1 for pin p pout [0, Dε] otherwise (19)

is fulﬁlled. The prove is done using induction over k = p pin 1 with p pin. The base k = 0 is clear. Let be k 1. We deﬁne

Pp := d {1, . . . , D} | p d pin

the set of dimensions d, in which are pin-p d -paths. Note, that this set cannot be empty, because p > pin for k 1. When we have a dimension d Pp then p d pin 1 = k 1 and we assume

s p d c spin c tr

h (1 ε) p d pin 1 , 1 i = h (1 ε)k 1 , 1 i . (20)

Then we choose the activations of the FG to be

( h 1 ε |Pp|, 1 |Pp| for d Pp and pin < p pout [0, ε] otherwise . (21)

Then we can estimate the derivative for pin p pout using (20) and (21) to

sp c spin c = tr

s p d c spin c yp φ,d

s p d c spin c

s p d c spin c

sp c spin c tr

|Pp| (1 ε)k 1 1 ε

|Pp| , |Pp| 1 |Pp|

sp c spin c tr

h (1 ε) p pin 1 , 1 , (22)

so (19) is fulﬁlled for pin p pout. If we have p < pin in (19), the derivative is 0, because we have a causal system. For p > pout in (19), we choose ε 1 D 1 |Pp| in (21) to ensure p ND: sp c spin c

1 (see (22)) and we get

sp c spin c = tr

s p d c spin c yp φ,d

0, Dε max d=1,...,D s p d c spin c

(0, Dε] , (23)

and (19) is fulﬁlled. In the second step let p1 p2 be the date, for which we want to calculate the truncated gradient sp2 c yp1 cin . We choose the IG activation as

yp ι [1 ε, 1) if p = pin (0, ε] otherwise (24)

Gundram Leifert et al.

and we get sp c yp cin = yp ι . Using (22), (23) and (24), we can estimate the partial derivative by

sp2 c yp1 cin = sp2 c sp1 c

sp1 c yp1 cin

sp2 c sp1 c tr

(1 ε)(1 ε) p2 pin 1, 1 for p1 = pin and pin p2 pout [0, Dε] otherwise .

and setting

1 pin pout 1+1

the conditions of Deﬁnition 11 are fulﬁlled.

A.3 Proof of 15

Proof Let c be an MD cell of dimension D with the internal state sc and pin,pk ND,pin pk two dates. Let pk be a date k steps further in each dimension than a ﬁxed date pin. So the distance between them is pin pk 1 = Dk. Let Π be the set of all pin-pk-paths, then there exist |Π| = #{ pinpk} paths (see Deﬁnition 8). We assume

yp φ,d [ε, 1 ε]

with ε (0, 0.5) and we can estimate the partial derivative, using the truncated gradient, with

spk c spin c = tr

i=1 yπi φ,d

h εk#{ pinpk}, (1 ε)k #{ pinpk} i .

For D = 1 we get |Π| = 1 and the cell has a NGEC. When D 2 we can count the number of paths using the Stirling s approximation and we can estimate the number of paths with

#{ pinpk} =

When we combine it with the FG activations we can estimate the derivative for great k with the Stirling s approximation and get

spk c spin c tr

h εDk#{ pinpk}, (1 ε)Dk #{ pinpk} i (25)

2πk D 1 (Dε)Dk ,

2πk D 1 (D (1 ε))Dk #

Cells in MDRNNs

The upper bound of this interval can grow for great k, if [D (1 ε)] > 1 and this is the case for D 2. So the MD LSTM cell can have an exploding gradient for D 2. When the weights to the FGs are initialized with small values, we have yp φ,d 0.5. Then we have an exploding gradient when D 3, when the training is starting. In the worst case we have yp φ,d 1 and the derivative in (25) goes for great k to

spk c spin c

2π D 1 k 1 D

A.4 Proof of 17

Proof Let c be an MD LSTM Stable cell of dimension D 2 (for D = 1 the proof is equivalent to the 1D case of the LSTM cell), p,p1,p2,pin,pout ND,pin pout arbitrary dates and hc = tanh the sigmoid function. Besides ε > 0 is a later chosen value. In the ﬁrst step we want to show that there are activations of the forget gates, so that

sp c spin c tr

(1 (D 1)ε)2 p pin 1, 1 for pin p pout [0, ε] otherwise (26)

is fulﬁlled. The prove is done using induction over k = p pin 1. The base k = 0 is clear. Let be k 1. We deﬁne

Pp := d {1, . . . , D} | p d pin

the set of dimensions d, in which are pin-p d -paths. Note, that this set cannot be empty, because p > pin for k 1. When we have a dimension d Pp then p d pin 1 = k 1 and we assume

s p d c spin c tr

h (1 (D 1)ε)2 p d pin 1 , 1 i = h (1 (D 1)ε)2(k 1) , 1 i . (27)

When we choose the activations of the LGs to be

yp λ,d [1 ε, 1) for d Pp and pin < p pout (0, ε] otherwise ,

Gundram Leifert et al.

we can estimate

P d Pp yp λ,d PD d =1 yp λ,d (1 (D 1)ε, 1], because

P d Pp yp λ,d PD d =1 yp λ,d =

P d Pp yp λ,d P d Pp yp λ,d + P d {1,...,D}\Pp yp λ,d (28)

|Pp| (1 ε) |Pp| (1 ε) + (D |Pp|) | {z } D 1

|Pp| (1 (D 1)ε) |Pp| (1 (D 1)ε) + (D 1) ε

(1 (D 1)ε) |Pp| |Pp| ε(D 1) (|Pp| 1)

(1 (D 1)ε).

Setting the FG to

yp φ [1 ε, 1) for pin < p pout (0, ε] otherwise (29)

we can estimate the derivative for pin p pout using (27),(28) and (29) to

sp c spin c = tr yp φ

s p d c spin c

yp λ,d PD d =1 yp λ,d + X

d {1,...,D}\Pp

s p d c spin c

yp λ,d PD d =1 yp λ,d | {z } =0

(1 ε) (1 (D 1)ε)2(k 1) (1 (D 1)ε), 1

sp c spin c tr

(1 (D 1)ε)2k , 1 (30)

so (26) is fulﬁlled for pin p pout. If we have p < pin in (26), the derivative is 0, because we have a causal system. For p > pout the FG is closed (see (29)), and using the upper bounds of (27) and (28) we get

sp c spin c = tr yp φ

s p d c spin c

yp λ,d PD d =1 yp λ,d

and (26) is fulﬁlled. In the second step let p1 p2 be the date, for which we want to calculate the truncated gradient sp2 c yp1 cin . We choose the IG activation as

yp ι [1 ε, 1) if p = pin (0, ε] otherwise (32)

Cells in MDRNNs

and we get sp c yp cin = yp ι . Using (30), (31) and (32), we can estimate the partial derivative by

sp2 c yp1 cin = sp2 c sp1 c

sp1 c yp1 cin

sp2 c sp1 c tr

(1 ε)(1 (D 1)ε)2 p2 pin 1, 1 for p1 = pin and pin p2 pout [0, ε] otherwise .

and setting

ε := min δ, 1 (1 δ)

1 2 pin pout 1+1 1 D 1

the conditions of Deﬁnition 11 are fulﬁlled.

A.5 Proof of 18

Proof Let c be a MD LSTM Stable cell of dimension D with the internal state sc and pin,p ND,pin p two arbitrary dates and pin p 1 = k. Let all gate activations be arbitrary in [0, 1]. We show that

sp c spin c tr [0, 1] (33)

is fulﬁlled k N using induction over k. For the base case k = 0 we get sp c spin c = spin c spin c = 1.

Let (33) be fulﬁlled for k 1. That means if p d pin we have p d pin 1 = k 1 and

this leads to s p d c spin c tr [0, 1]. If p d pin then there is no pin-p d -path and we have s p d c spin c = 0

for this dimension. Then we can calculate the derivative

0 sp c spin c = tr yp φ

s p d c spin c

d =1 yp λ,d

0, max d Pp

( s p d c spin c

d =1 yp λ,d | {z } 1

tr [0, 1] ,

which gives us the desired interval.

A.6 Proof of 20

Proof NEG: The cell has an NEG, because all gates have the same bounds as the MD Stable cell. NVG: To prove the NVG, we use the proof of Theorem 17. The diﬀerence between the MD Stable cell and the MD Leaky cell is that the activations of the FG and IG are dependent

Gundram Leifert et al.

on each other for the Leaky cell. Let pin,p ND,pin p be two arbitrary dates like in Theorem 17. The IG has just the a restriction that for p = pin it has to hold yp ι [1 ε, 1) . Here, the FG can have an arbitrary activation, so we chose yp φ = 1 yp ι . For all p > pin the FG have to be in the ranges, shown in (29), while the IG has no restriction and we choose yp ι = 1 yp φ, so the MD Leaky cell has the NVG. COD: The proof that the MD Leaky cell allows COD can be done by estimating the bounds of sp c. From the update equations of the cell we get

sp c max i=1,...,D

Now we can estimate the internal state using the ranges yp cin [ 1, 1], recursion over p

|sp c| = 1 yp φ yp cin + yp φsp c max yp cin , sp 1 c

, . . . , s p D c

max q<p yq cin 1

and get sp c [ 1, 1]. To fulﬁll the derivatives in Deﬁnition 12, for δ1 we choose yp ω [1 ε, 1) and get

δ1 min sp c

h c (sp c) (1 ε) = h c (1) (1 ε). (34)

For δ2 we choose yp ω (0, ε] and get

δ2 max sp c

h c (sp c) ε = ε. (35)

To fulﬁll the derivatives in Deﬁnition 12 we use (34), (35) and h c (1) > 1

ε δ2 < δ1 h c (1) (1 ε)

4 < h c (1) h c (1) + 1

the COD is proven.

Appendix B. Theory to Create First Order MD Cells

If one wants to take a closer look at the theory of linear shift invariant (LSI)-systems and their frequency analysis and analyse a ﬁrst order LSI-system regarding its free selectable parameters using the Fand Z-transform, it is highly recommended to be familar with these theories (for a good overview and more details see Poularikas, 2000; Schlichthärle, 2000). Adding the knowledge of reducing the MD case to the 1D case (see Section 5) we create new cell types for the MD case.

Cells in MDRNNs

B.1 Analysing a First Order LSI-System

The update equations of a ﬁrst order LSI-system with one input u, one internal state x and one output y can be written as

x[n] = h1(u[n], x[n 1]) = α0u[n] + α1x[n 1], (36)

y[n] = h2(x[n], x[n 1]) = b0x[n] + b1x[n 1] (37)

with the free selectable coeﬃcients α0, α1, b0, b1 R. Let U(z) = Z {u[n]} be the Ztransformed signal of u[n] and X(z), Y (z) respectively. Then we can write the so called transfer functions

H1(z) := X(z)

U(z) = α0 1 α1z 1 ,

H2(z) := Y (z)

X(z) = b0 + b1z 1,

H(z) := Y (z)

U(z) = H2(z)H1(z).

To analyse (36) and (37) according their frequency response we use the relationship between the F-transform and the Z-transform:

Remark 22 Let u[n] = ejωn be a harmonic input sequence with the imaginary number j2 = 1 and H(z) = Y (z)

U(z) be a transfer functions of an LSI-system. When the poles of H(z) are inside the circle |z| = 1, we can change from Zto F-transform using the substitution

H(ω) = H(z) z=ejω

with the harmonic sequence y[n] = H(ω)u[n] = H(ω)ejωn with the same frequency ω but with a diﬀerent amplitude and a diﬀerent phase dependent on the frequency ω.

We only want to analyse the amplitude of this harmonic sequence

|y[n]| = H(z)ejωn = |H(z)| = |H2(z)| |H1(z)|

and do that by analysing both transfer functions H1(z) and H2(z) separately. The amplitude of H1(ω) = H1(z)|z=ejω is calculated by

|H1(ω)| = |α0| q

(1 α1 cos(ω))2 + α2 1 sin2(ω) .

Like mentioned before, in many tasks, the information signal has a low frequency. To have the largest amplitude at ω = 0 we have to choose α1 0. As mentioned in Remark 22 the poles of H1(z) = α0 1 α1z 1 = zα0 z α1 have to be in the circle |z| = 1, so we have the additional constraint |α1| < 1. This leads to the bounds α1 [0, 1). But for α1 1 we

Gundram Leifert et al.

y I(t) y H(t 1)

y I(t) y H(t 1)

y I(t) y H(t 1)

λ sc(t) ycin(t)

memory sc(t 1)

Figure 5: Schematic diagram of a one-dimensional Leaky LP cell: The internal state is a convex combination of the new input cin and the previous state sc(t 1). The previous state sc(t 1) and the current state sc(t) are gated (ω0 and ω1) and accumulated afterwards. The output is squashed by tanh into the interval [ 1, 1].

have H1 (0) , so we have to choose α0 dependent on α1. We set a maximum gain of max ω |H1(ω)| = |H1(0)| = 1, so we get the constraint

|α0| 1 α1. (38)

In the same way we analyse H2(z):

|H2(ω)| = b0 + b1e jω = q

(b0 + b1 cos(ω))2 + b2 1 sin2(ω)

To get the maximal gain at low frequency the parameters b0 and b1 must have the same sign.

B.2 Creating a First Order Cell

With these constraints for the parameters we now can deﬁne a new cell type. The parameters α0, α1, b0, b1 should be activations of gates like in LSTM cells. We have to ﬁnd the right activation functions to fulﬁll the inequalities above. Using the weight-space symmetries in a network with at least one hidden layer (Bishop, 2006, 5.1.1), without loss of generality we set α0, α1, b0, b1 0. To fulﬁll the bounds for H1, we set α1 as activation of a gate with activation function flog. So we have α1 (0, 1). This is comparable with the FG in the previous sections. To select the α0 we choose α0 := 1 α1 (see (38)). So the value of α0 is comparable with the activation of the IG. For H2 we set both values b0, b1 as activations of

Cells in MDRNNs

a gate with activation function flog which leads to max ω |H2(ω)| = max {b0 + b1} = 2, so the amplitude response is bounded by 2.

With these bounds we can deﬁne a cell with a cell input yp cin = u[n], a previous internal state sp c = x[n 1], an internal state sp c = x[n] and a cell output yp c = y[n]. We substitute the coeﬃcients by time dependent gate activations

α0 := 1 yp φ = yp ι IG α1 := yp φ FG b0 := yp ω0 OG b1 := yp ω1 OG of the previous internal state

which leads to the transfer functions

H yp φ 1 (z) = α0 1 α1z 1 = 1 yp φ 1 yp φz 1 ,

H yp ω0;yp ω1 2 (z) = b0 + b1z 1 = yp ω0 + yp ω1z 1,

H(z) = Hyp φ;yp ω0;yp ω1(z) = Y (z)

U(z) = α0 b0 + b1z 1

1 α1z 1 = 1 yp φ 1 yp φz 1 yp ω0 + yp ω1z 1 . (39)

and the update equations

x[n] =α0u[n] + α1x[n 1] sp c = (1 yp φ)yp cin + yp φsp c ,

y[n] =b0x[n] + b1x[n 1] yp c = yp ω0sp c + yp ω1sp c . (40)

The output of the cell is already bounded in [ 2, 2], but to fulﬁll Deﬁnition 9 we change (40) to

yp c = hc yp ω0sp c + yp ω1sp c (41)

with hc = tanh to ensure yp c [ 1, 1]. This additional non-linearity is not necessary but leads to a better performance. This new cell type called MD Leaky lowpass (Leaky LP) cell is deﬁned in Deﬁnition 21. A block diagram of a 1D Leaky LP cell is shown in Figure 5 and diﬀerent frequency responses in Figure 6.

B.3 General First Order MD Cells

With the theory of this section we can easily create new cell types. In general, a cell has a number of gates γ1, γ2, . . . Γc. For D = 1 a previous state sp c is given directly. Otherwise the previous state is calculated as trainable convex combination of D previous states, like described in Section 5. In Table 7 cell layouts are depicted whereby type A is the cell developed in Section 7 (compare to (39)). For the other types we brieﬂy want to describe the main ideas.

Gundram Leifert et al.

0 0.1 0.2 0.3 0.4 0.5 0

frequency f

frequency response

H0.1 1 H0.5 1 H0.9 1

0 0.1 0.2 0.3 0.4 0.5 0

frequency f

frequency response

H0.9;0.1 2 H0.8;0.2 2 H0.5;0.5 2

0 0.1 0.2 0.3 0.4 0.5 0

frequency f

frequency response

H0.5 1 H0.5;0.5 2 H0.5;0.5;0.5

0 0.1 0.2 0.3 0.4 0.5 0

frequency f

frequency response

H0.1 1 H0.8;0.2 2 H0.1;0.8;0.2

Figure 6: Frequency response of H1 (dashed), H2 (dotted) and H (solid) for special parameters. Top-left: The frequency response of an IIR ﬁlter is able to block even low frequency signals, but it cannot be zero at f = 0.5. Top-right: The frequency response of an FIR ﬁlter cannot be lower than the lightgray dotted line, but for f = 0.5 it can be zero. Bottom: When these both ﬁlters are concatenated, the resulting frequency response can combine the beneﬁts of each ﬁlter.

B.3.1 The MD Butterworth Lowpass Filter

The cell of type B is a special case of the Leaky LP cell. When we set yp ω0 = yp ω1 = 0.5 there is a direct relation between the cutoﬀfrequency of a discrete Butterworth lowpass ﬁlter and the activation of yp φ: Let fcutoﬀbe the frequency, where amplitude response is reduced to

Cells in MDRNNs

2 of the maximal gain. We can calculate fcutoﬀby

1 yp φ 1 + yp φ

yp φ = 1 + tan(πfcutoﬀ)

1 tan(πfcutoﬀ)

with the bounds fcutoﬀ (0, 0.5) and yp φ ( 1, 1) (for more details see Schlichthärle, 2000, 2.2;6.4.2). For yp φ (0, 1) we get fcutoﬀ (0, 0.25). In Figure 7 (left) we can see, that even for a negative value of yp φ and a highpass characteristic of H1(z) the impulse response H(z) has a lowpass characteristic.

B.3.2 Adding an Additional State Gate

In B.2 we fulﬁlled (38) for the MD Leaky LP cell by setting α0 := 1 α1, so α0 is directly connected with α1. Another solution would be to add an additional value γ (0, 1) and choose α0 := γ (1 α1). So we can extend the MD Leaky LP cell by adding an additional gate γ4 for the previous state (see type C). Unfortunately this does not lead to a better performance and one more gate has to be calculated.

B.3.3 Another Solution for the Output

The cell of type D is another solution to choose b0 and b1 in Section B.2. For the Leaky LP cell we calculate the output as described in (41). Now we set b0 = γp 2γp 3 and b1 = 1 γp 2 γp 3, and get

yp c = γp 3 γp 2sp c + 1 γp 2 sp c .

This cell actually works as well as the MD Leaky LP cell and has the same number of gates. In this case we do not need a squashing function hc, because we already have yp c [ 1, 1].

B.3.4 An MD Cell as MD PID-Controller

Type E has a completely diﬀerent interpretation: In controlling engineering a PID-controller gets an error as input. In our case the gate activations have to decide, if the proportional (P), the integral (I) or the derivative (D) term of the error is important for the output. When γp 1 0 we have yp cin sp c so the internal state is proportional to the input. Then γp 2 gates the proportional part (P) of the input. The second gate γp 3 gates the diﬀerence between the last and the current input, which can be seen as a discrete derivative (D). If γp 1 1 the internal state is an exponential moving average of yp cin which is an integral term. So γp 2 gates a mainly integral part of the input (I), whereas γp 3 gates a mainly proportional part of the input (P). Dependent on γp 1 type E can be a PD-controller, a PI-controller or a mix of these both. In Figure 7(right) can be seen the frequency response of this cell for diﬀerent gate activations.

Gundram Leifert et al.

Type gint ( ) gout ( ) H(z) for hc (x) = x

A 1 γp 1 yp cin + γp 1sp c hc γp 2sp c + γp 3sp c (1 γp 1) 1 γp 1z 1 γp 2 + γp 3z 1

B 1 γp 1 yp cin + γp 1sp c sp c+sp c 2 (1 γp 1) 1 γp 1z 1 1+z 1

2 C 1 γp 1 yp cin + γp 1γp 4sp c hc γp 2sp c + γp 4sp c (1 γp 1) 1 γp 1γp 4z 1 γp 2 + γp 3z 1

D 1 γp 1 yp cin + γp 1sp c γp 3 γp 2sp c + 1 γp 2 sp c (1 γp 1) 1 γp 1z 1 γp 3 γp 2 + 1 γp 2 z 1

E 1 γp 1 yp cin + γp 1sp c hc γp 2sp c + γp 3 sp c sp c (1 γp 1) 1 γp 1z 1 γp 2 + γp 3 1 z 1

Table 7: Update equations and transfer function for diﬀerent cell layouts. The column sp c contains the update equations to calculate the internal state and column yp c contains the update equation for the output. These equations lead to the transfer function H(z) = Hγp 1,γp 2,...(z).

0 0.1 0.2 0.3 0.4 0.5 0

frequency f

frequency response

H 0.5 1 H0.5;0.5 2 H 0.5;0.5;0.5

0 0.1 0.2 0.3 0.4 0.5 0

frequency f

frequency response

H0.5 1 H0.5;0.9;0.1

H0.5;0.5;0.5 H0.5;0.1;0.9

Figure 7: Frequency response of H1 (dashed), H2 (dotted) and H (solid) for special layouts and parameters of Table 7. Left (type B): A butterworth lowpass ﬁlter with a negative gate activation γp 0 = 0.5 leads to the cutoﬀfrequency

π arctan 1+0.5 1 0.5 0.3976. Right (type E): Diﬀerent frequency responses

of a PID controller. Having a ﬁxed γp 0 = 0.5 the frequency response is dependent on the activations of γp 1 and γp 2 and can have lowpass (black), allpass (gray) and highpass (lightgray) characteristic.

A. Graves and J. Schmidhuber. Oﬄine handwriting recognition with multidimensional recurrent neural networks. In NIPS, pages 545 552, 2008.

Cells in MDRNNs

A. Graves, S. Fernandez and J. Schmidhuber. Multi-dimensional recurrent neural networks. Technical report, IDSIA-04-07, 2007.

Augustin, E. and Brodin, J.-M and Carré, M. and Geoﬀrois, E. and Grosicki, E. and Prêteux, F. RIMES evaluation campaign for handwritten mail processing. In Proc. of the Workshop on Frontiers in Handwriting Recognition, number 1, 2006.

Y. Bengio. Practical recommendations for gradient-based training of deep architectures. Co RR, abs/1206.5533, 2012.

Y. Bengio and Y. Le Cun. Convolutional networks for images, speech andtime-series. 1995.

Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is diﬃcult. IEEE Transactions on Neural Networks, 5(2):157 166, 1994. URL http://www.iro.umontreal.ca/~lisa/pointeurs/ieeetrnn94.pdf.

C. M. Bishop. Pattern Recognition and Mashine Learning. Springer, 2006.

E. Grosicki and H. El-Abed. ICDAR 2011: French handwriting recognition competition. In Proc. of the Int. Conf. on Document Analysis and Recognition, pages 1459 1463, 2011.

E. Grosicki, M. Carré, J.-M. Brodin and E. Geoﬀrois. RIMES evaluation campaign for handwritten mail processing. In Proc. of the Int. Conf. on Frontiers in Handwriting Recognition, 2008.

F. A. Gers, J. Schmidhuber and F. Cummins. Learning to forget: Continual prediction with lstm. Technical report, IDSIA-01-99, 1999.

F. A. Gers, N. Schraudolph and J. Schmidhuber. Learning precise timing with lstm recurrent networks. Journal of Machine Learning Research, 3:115 143, 2002.

Pechwitz, M. and Maddouri, S. and Märgner, V. and Ellouze, N. and Amiri, H. and others. Ifn/enit-database of handwritten arabic words. In Proc. of CIFED, volume 2, pages 127 136. Citeseer, 2002.

A. D. Poularikas. The Transforms and Applications Handbook. CRC Press, 2. edition edition, 2000.

S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735 1780, 1997.

D. Schlichthärle. Digital Filters. Springer, 2000.