# Associative Long Short-Term Memory
Ivo Danihelka (DANIHELKA@GOOGLE.COM), Greg Wayne (GREGWAYNE@GOOGLE.COM), Benigno Uria (BURIA@GOOGLE.COM), Nal Kalchbrenner (NALK@GOOGLE.COM), Alex Graves (GRAVESA@GOOGLE.COM)
Google DeepMind

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
Abstract

We investigate a new method to augment recurrent neural networks with extra memory without increasing the number of network parameters. The system has an associative memory based on complex-valued vectors and is closely related to Holographic Reduced Representations and Long Short-Term Memory networks. Holographic Reduced Representations have limited capacity: as they store more information, each retrieval becomes noisier due to interference. Our system in contrast creates redundant copies of stored information, which enables retrieval with reduced noise. Experiments demonstrate faster learning on multiple memorization tasks.
1. Introduction
We aim to enhance LSTM (Hochreiter & Schmidhuber, 1997), which in recent years has become widely used in sequence prediction, speech recognition and machine translation (Graves, 2013; Graves et al., 2013; Sutskever et al., 2014). We address two limitations of LSTM. The first limitation is that the number of memory cells is linked to the size of the recurrent weight matrices. An LSTM with $N_h$ memory cells requires a recurrent weight matrix with $O(N_h^2)$ weights. The second limitation is that LSTM is a poor candidate for learning to represent common data structures like arrays because it lacks a mechanism to index its memory while writing and reading.
To overcome these limitations, recurrent neural networks have been previously augmented with soft or hard attention mechanisms to external memories (Graves et al., 2014; Sukhbaatar et al., 2015; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Zaremba & Sutskever, 2015). The attention acts as an addressing system that selects memory locations. The content of the selected memory locations can be read or modified by the network.
We provide a different addressing mechanism in Associative LSTM, where, like LSTM, an item is stored in a distributed vector representation without locations. Our system implements an associative array that stores key-value pairs based on two contributions:
1. We combine LSTM with ideas from Holographic Reduced Representations (HRRs) (Plate, 2003) to enable key-value storage of data.
2. A direct application of the HRR idea leads to very lossy storage. We use redundant storage to increase memory capacity and to reduce noise in memory retrieval.
HRRs use a binding operator to implement key-value binding between two vectors (the key and its associated content). They natively implement associative arrays; as a byproduct, they can also easily implement stacks, queues, or lists. Since Holographic Reduced Representations may be unfamiliar to many readers, Section 2 provides a short introduction to them and to related vector-symbolic architectures (Kanerva, 2009).
In computing, Redundant Arrays of Inexpensive Disks (RAID) provided a means to build reliable storage from unreliable components. We similarly reduce retrieval error inside a holographic representation by using redundant storage, a construction described in Section 3. We then combine the redundant associative memory with LSTM in Section 5. The system can be equipped with a large memory without increasing the number of network parameters. Our experiments in Section 6 show the benefits of the memory system for learning speed and accuracy.
2. Background
Holographic Reduced Representations are a simple mechanism to represent an associative array of key-value pairs in a fixed-size vector. Each individual key-value pair is the same size as the entire associative array; the array is represented by the sum of the pairs. Concretely, consider a complex vector key $r = (a_r[1]e^{i\phi_r[1]}, a_r[2]e^{i\phi_r[2]}, \dots)$, which is the same size as the complex vector value $x$. The pair is bound together by element-wise complex multiplication, which multiplies the moduli and adds the phases of the elements:
$$y = r \circledast x \tag{1}$$
$$= (a_r[1]a_x[1]e^{i(\phi_r[1]+\phi_x[1])}, a_r[2]a_x[2]e^{i(\phi_r[2]+\phi_x[2])}, \dots) \tag{2}$$
Given keys $r_1, r_2, r_3$ and input vectors $x_1, x_2, x_3$, the associative array is

$$c = r_1 \circledast x_1 + r_2 \circledast x_2 + r_3 \circledast x_3 \tag{3}$$
where we call c a memory trace.
Define the key inverse:
$$r^{-1} = (a_r[1]^{-1}e^{-i\phi_r[1]}, a_r[2]^{-1}e^{-i\phi_r[2]}, \dots) \tag{4}$$
To retrieve the item associated with key $r_k$, we multiply the memory trace element-wise by the vector $r_k^{-1}$. For example:
$$r_2^{-1} \circledast c = r_2^{-1} \circledast (r_1 \circledast x_1 + r_2 \circledast x_2 + r_3 \circledast x_3)$$
$$= x_2 + r_2^{-1} \circledast (r_1 \circledast x_1 + r_3 \circledast x_3)$$
$$= x_2 + \text{noise} \tag{5}$$

The product is exactly $x_2$ together with a noise term. If the phases of the elements of the key vector are randomly distributed, the noise term has zero mean.
Instead of using the key inverse, Plate recommends using the complex conjugate of the key, $\overline{r_k} = (a_k[1]e^{-i\phi_k[1]}, a_k[2]e^{-i\phi_k[2]}, \dots)$, for retrieval: the elements of the exact inverse have moduli $(a_k[1]^{-1}, a_k[2]^{-1}, \dots)$, which can magnify the noise term. Plate presents two different variants of Holographic Reduced Representations. The first operates in real space and uses circular convolution for binding; the second operates in complex space and uses element-wise complex multiplication. The two are related by Fourier transformation. See Plate (2003) or Kanerva (2009) for a more comprehensive overview.
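To make the binding and retrieval operations concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) that stores three key-value pairs in a single trace and retrieves one of them with the conjugate key:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # number of complex components per vector

def random_key(n):
    """Key with unit moduli and uniformly random phases."""
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))

# Three random complex values and their keys.
values = [rng.standard_normal(n) + 1j * rng.standard_normal(n) for _ in range(3)]
keys = [random_key(n) for _ in range(3)]

# Binding is element-wise complex multiplication; the trace is the sum of the bound pairs.
trace = sum(k * v for k, v in zip(keys, values))

# Retrieving with the conjugate key returns the stored value plus a zero-mean noise term
# whose variance grows with the number of other items in the trace.
retrieved = np.conj(keys[1]) * trace
print(np.mean(np.abs(retrieved - values[1]) ** 2))
```

With only a single trace, the printed retrieval error is already comparable to the signal once a few items are stored, which is exactly the capacity limitation that the redundant storage of the next section addresses.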
3. Redundant Associative Memory
As the number of items in the associative memory grows, the noise incurred during retrieval grows as well. Here,
we propose a way to reduce the noise by storing multiple transformed copies of each input vector. When retrieving an item, we compute the average of the restored copies. Unlike other memory architectures (Graves et al., 2014; Sukhbaatar et al., 2015), the memory does not have a discrete number of memory locations; instead, the memory has a discrete number of copies.
Formally, let $c_s$ be the memory trace for the $s$-th copy:

$$c_s = \sum_{k=1}^{N_{items}} (P_s r_k) \circledast x_k \tag{6}$$

where $x_k \in \mathbb{C}^{N_h/2}$ is the $k$-th input and $r_k \in \mathbb{C}^{N_h/2}$ is the key. Each complex number contributes a real and imaginary part, so the input and key are represented with $N_h$ real values. $P_s \in \mathbb{R}^{N_h/2 \times N_h/2}$ is a constant random permutation, specific to each copy. Permuting the key decorrelates the retrieval noise from each copy of the memory trace.
When retrieving the $k$-th item, we compute the average over all copies:

$$\tilde{x}_k = \frac{1}{N_{copies}} \sum_{s=1}^{N_{copies}} \overline{(P_s r_k)} \circledast c_s \tag{7}$$

where $\overline{(P_s r_k)}$ is the complex conjugate of $P_s r_k$.
Let us examine how the redundant copies reduce retrieval noise by inspecting a retrieved item. If each complex number in $r_k$ has modulus equal to 1, its complex conjugate acts as an inverse, and the retrieved item is:

$$\tilde{x}_k = x_k + \frac{1}{N_{copies}} \sum_{s=1}^{N_{copies}} \sum_{k' \neq k} \overline{(P_s r_k)} \circledast (P_s r_{k'}) \circledast x_{k'}$$
$$= x_k + \sum_{k' \neq k} x_{k'} \circledast \frac{1}{N_{copies}} \sum_{s=1}^{N_{copies}} P_s(\overline{r_k} \circledast r_{k'})$$
$$= x_k + \text{noise} \tag{8}$$
where the $\sum_{k' \neq k}$ sum is over all other stored items. If the terms $P_s(\overline{r_k} \circledast r_{k'})$ are independent, they add incoherently when summed over the copies. Furthermore, if the noise due to one item, $\xi_{k'} = x_{k'} \circledast \frac{1}{N_{copies}} \sum_{s=1}^{N_{copies}} P_s(\overline{r_k} \circledast r_{k'})$, is independent of the noise due to another item, then the variance of the total noise is $O\left(\frac{N_{items}}{N_{copies}}\right)$. Thus, we expect that the retrieval error will be roughly constant if the number of copies scales with the number of items.
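The following sketch (again our own, with assumed names such as `n_copies` and `retrieve`) implements Equations 6 and 7 with unit-modulus keys; increasing `n_copies` reduces the measured retrieval error roughly in proportion, matching the $O(N_{items}/N_{copies})$ analysis above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512            # complex dimension (N_h / 2)
n_items = 50
n_copies = 8

keys = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (n_items, n)))   # unit-modulus keys r_k
values = rng.standard_normal((n_items, n)) + 1j * rng.standard_normal((n_items, n))
perms = [rng.permutation(n) for _ in range(n_copies)]              # one permutation P_s per copy

# Equation 6: each copy stores the sum of permuted-key / value bindings.
traces = [np.sum(keys[:, p] * values, axis=0) for p in perms]

def retrieve(k):
    """Equation 7: average the conjugate-key readouts over all copies."""
    return np.mean([np.conj(keys[k, p]) * t for p, t in zip(perms, traces)], axis=0)

print(np.mean(np.abs(retrieve(0) - values[0]) ** 2))  # shrinks roughly as n_items / n_copies
```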
Practically, we demonstrate that the redundant copies with random permutations are effective at reducing retrieval noise in Figure 1. We take a sequence of ImageNet images (Russakovsky et al., 2015), each of dimension $3 \times 110 \times 110$, where the first dimension is a colour channel. We vectorise each image and consider the first half of the vector
Figure 1. From top to bottom: 20 original images and the image sequence retrieved from 1 copy, 4 copies, and 20 copies of the memory trace. Using more copies reduces the noise.
Figure 2. The mean squared error per pixel when retrieving an ImageNet image from a memory trace. Left: 50 images are stored in the memory trace. The number of copies ranges from 1 to 100. Middle: 50 copies are used, and the number of stored images goes from 1 to 100. The mean squared error grows linearly. Right: The number of copies is increased together with the number of stored images. After reaching 50 copies, the mean squared error is almost constant.
to be the real part and the second half the imaginary part of a complex vector. The sequence of images is encoded by using random keys with moduli equal to 1. We see that retrieval is less corrupted by noise as the number of copies grows.
The mean squared error of retrieved ImageNet images is analysed in Figure 2 with varying numbers of copies and images. The simulation agrees accurately with our prediction: the mean squared error is proportional to the number of stored items and inversely proportional to the number of copies.
The redundant associative memory has several nice properties:
- The number of copies can be modified at any time: reducing their number increases retrieval noise, while increasing the number of copies enlarges capacity.
- It is possible to query using a partially known key by setting some key elements to zero. Each copy of the permuted key $P_s r_k$ routes the zeroed key elements to different dimensions. We need to know $O\left(\frac{N_h}{N_{copies}}\right)$ elements of the key to recover the whole value.
- Unlike Neural Turing Machines (Graves et al., 2014), it is not necessary to search for free locations when writing.
- It is possible to store more items than the number of copies at the cost of increased retrieval noise.
4. Long Short-Term Memory
We briefly introduce the LSTM with forget gates (Gers et al., 2000), a recurrent neural network whose hidden state is described by two parts $h_t, c_t \in \mathbb{R}^{N_h}$. At each time step, the network is presented with an input $x_t$ and updates its state to
$$\hat{g}_f, \hat{g}_i, \hat{g}_o, \hat{u} = W_{xh} x_t + W_{hh} h_{t-1} + b_h \tag{9}$$
$$g_f = \sigma(\hat{g}_f) \tag{10}$$
$$g_i = \sigma(\hat{g}_i) \tag{11}$$
$$g_o = \sigma(\hat{g}_o) \tag{12}$$
$$u = \tanh(\hat{u}) \tag{13}$$
$$c_t = g_f \odot c_{t-1} + g_i \odot u \tag{14}$$
$$h_t = g_o \odot \tanh(c_t) \tag{15}$$
where $\sigma(x) \in (0, 1)$ is the logistic sigmoid function, and $g_f, g_i, g_o$ are the forget gate, input gate, and output gate, respectively. The $u$ vector is a proposed update to the cell state $c$. $W_{xh}, W_{hh}$ are weight matrices, and $b_h$ is a bias vector. $\odot$ denotes element-wise multiplication of two vectors.
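For reference, here is a direct NumPy transcription of Equations 9-15 (the weight shapes and the `lstm_step` name are our choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_xh, W_hh, b_h):
    """One LSTM step with forget gates.

    Shapes: W_xh is (4*N_h, N_x), W_hh is (4*N_h, N_h), b_h is (4*N_h,).
    """
    pre = W_xh @ x_t + W_hh @ h_prev + b_h                              # Eq. 9
    gf_hat, gi_hat, go_hat, u_hat = np.split(pre, 4)
    g_f, g_i, g_o = sigmoid(gf_hat), sigmoid(gi_hat), sigmoid(go_hat)   # Eqs. 10-12
    u = np.tanh(u_hat)                                                  # Eq. 13
    c_t = g_f * c_prev + g_i * u                                        # Eq. 14
    h_t = g_o * np.tanh(c_t)                                            # Eq. 15
    return h_t, c_t
```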
5. Associative Long Short-Term Memory
When we combine Holographic Reduced Representations with the LSTM, we need to implement complex vector multiplications. For a complex vector $z = h_{real} + i h_{imaginary}$, we use the form

$$h = \begin{bmatrix} h_{real} \\ h_{imaginary} \end{bmatrix} \tag{16}$$

where $h \in \mathbb{R}^{N_h}$ and $h_{real}, h_{imaginary} \in \mathbb{R}^{N_h/2}$. In the network description below, the reader can assume that all vectors and matrices are strictly real-valued.
As in LSTM, we first compute gate variables, but we also produce parameters that will be used to define associative keys $\hat{r}_i, \hat{r}_o$. The same gates are applied to the real and imaginary parts:

$$\hat{g}_f, \hat{g}_i, \hat{g}_o, \hat{r}_i, \hat{r}_o = W_{xh} x_t + W_{hh} h_{t-1} + b_h \tag{17}$$
$$\hat{u} = W_{xu} x_t + W_{hu} h_{t-1} + b_u \tag{18}$$
$$g_f = \begin{bmatrix} \sigma(\hat{g}_f) \\ \sigma(\hat{g}_f) \end{bmatrix} \tag{19}$$
$$g_i = \begin{bmatrix} \sigma(\hat{g}_i) \\ \sigma(\hat{g}_i) \end{bmatrix} \tag{20}$$
$$g_o = \begin{bmatrix} \sigma(\hat{g}_o) \\ \sigma(\hat{g}_o) \end{bmatrix} \tag{21}$$
Unlike LSTM, we use an activation function that operates only on the modulus of a complex number. The following function restricts the modulus of a (real, imaginary) pair to be between 0 and 1:
$$\text{bound}(h) = \begin{bmatrix} h_{real} \oslash d \\ h_{imaginary} \oslash d \end{bmatrix} \tag{22}$$

where $\oslash$ is element-wise division, and $d \in \mathbb{R}^{N_h/2}$, $d = \max\left(1, \sqrt{h_{real} \odot h_{real} + h_{imaginary} \odot h_{imaginary}}\right)$, corresponds to element-wise normalisation by the modulus of each complex number when the modulus is greater than one. This hard bounding worked slightly better than applying tanh to the modulus. bound is then used to construct the update and two keys:
$$u = \text{bound}(\hat{u}) \tag{23}$$
$$r_i = \text{bound}(\hat{r}_i) \tag{24}$$
$$r_o = \text{bound}(\hat{r}_o) \tag{25}$$
where $r_i \in \mathbb{R}^{N_h}$ is an input key, acting as a storage key in the associative array, and $r_o \in \mathbb{R}^{N_h}$ is an output key, corresponding to a lookup key. The update $u$ is multiplied with the input gate $g_i$ to produce the value to be stored.
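A short sketch of the bound nonlinearity of Equation 22, operating on a real vector whose first and second halves hold the real and imaginary parts (our own code, not the paper's):

```python
import numpy as np

def bound(h):
    """Clip the modulus of each (real, imaginary) pair to at most 1 (Equation 22)."""
    h_real, h_imag = np.split(h, 2)
    d = np.maximum(1.0, np.sqrt(h_real ** 2 + h_imag ** 2))
    return np.concatenate([h_real / d, h_imag / d])
```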
Now, we introduce redundant storage and provide the procedure for memory retrieval. For each copy, indexed by $s \in \{1, \dots, N_{copies}\}$, we add the same key-value pair to the cell state:

$$r_{i,s} = \begin{bmatrix} P_s & 0 \\ 0 & P_s \end{bmatrix} r_i \tag{26}$$
$$c_{s,t} = g_f \odot c_{s,t-1} + r_{i,s} \circledast (g_i \odot u) \tag{27}$$
where $r_{i,s}$ is the permuted input key and $P_s \in \mathbb{R}^{N_h/2 \times N_h/2}$ is a constant random permutation matrix, specific to the $s$-th copy. $\circledast$ is element-wise complex multiplication and is computed using
$$r \circledast u = \begin{bmatrix} r_{real} \odot u_{real} - r_{imaginary} \odot u_{imaginary} \\ r_{real} \odot u_{imaginary} + r_{imaginary} \odot u_{real} \end{bmatrix} \tag{28}$$
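In the stacked real representation, Equation 28 can be written as a small helper (our sketch):

```python
import numpy as np

def complex_mul(r, u):
    """Element-wise complex product of vectors stored as [real; imaginary] halves (Equation 28)."""
    r_real, r_imag = np.split(r, 2)
    u_real, u_imag = np.split(u, 2)
    return np.concatenate([r_real * u_real - r_imag * u_imag,
                           r_real * u_imag + r_imag * u_real])
```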
The output key for each copy, $r_{o,s}$, is permuted by the same matrix as the copy's input key:

$$r_{o,s} = \begin{bmatrix} P_s & 0 \\ 0 & P_s \end{bmatrix} r_o \tag{29}$$
Finally, the cells (memory trace) are read out by averaging the copies:
$$h_t = g_o \odot \text{bound}\left(\frac{1}{N_{copies}} \sum_{s=1}^{N_{copies}} r_{o,s} \circledast c_{s,t}\right) \tag{30}$$
Note that permutation can be performed in $O(N_h)$ computations. Additionally, all copies can be updated in parallel by operating on tensors of size $N_{copies} \times N_h$.
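Putting the pieces together, the sketch below implements one Associative LSTM step (Equations 17-30). It is a non-authoritative illustration: the class name, weight shapes, and initialisation are our assumptions, and the `bound` and `complex_mul` helpers from the snippets above are repeated so that it runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bound(h):                                  # Eq. 22
    h_re, h_im = np.split(h, 2)
    d = np.maximum(1.0, np.sqrt(h_re ** 2 + h_im ** 2))
    return np.concatenate([h_re / d, h_im / d])

def complex_mul(r, u):                         # Eq. 28
    r_re, r_im = np.split(r, 2)
    u_re, u_im = np.split(u, 2)
    return np.concatenate([r_re * u_re - r_im * u_im,
                           r_re * u_im + r_im * u_re])

class AssociativeLSTMCell:
    def __init__(self, n_in, n_h, n_copies, seed=0):
        rng = np.random.default_rng(seed)
        self.n_h, self.n_copies = n_h, n_copies
        half = n_h // 2
        # Pre-activations of Eq. 17: three gates of size N_h/2 and two keys of size N_h.
        out = 3 * half + 2 * n_h
        self.W_xh = 0.1 * rng.standard_normal((out, n_in))
        self.W_hh = 0.1 * rng.standard_normal((out, n_h))
        self.b_h = np.zeros(out)
        # Update pre-activation of Eq. 18.
        self.W_xu = 0.1 * rng.standard_normal((n_h, n_in))
        self.W_hu = 0.1 * rng.standard_normal((n_h, n_h))
        self.b_u = np.zeros(n_h)
        # One fixed random permutation P_s per copy; these add no learned parameters.
        self.perms = [rng.permutation(half) for _ in range(n_copies)]

    def permute(self, r, s):
        """Apply the block-diagonal permutation [P_s 0; 0 P_s] to a stacked vector."""
        r_re, r_im = np.split(r, 2)
        p = self.perms[s]
        return np.concatenate([r_re[p], r_im[p]])

    def step(self, x_t, h_prev, cs_prev):
        """One time step; cs_prev is a list of N_copies cell traces of size N_h each."""
        half = self.n_h // 2
        pre = self.W_xh @ x_t + self.W_hh @ h_prev + self.b_h             # Eq. 17
        gf_hat, gi_hat, go_hat = pre[:half], pre[half:2 * half], pre[2 * half:3 * half]
        ri_hat, ro_hat = pre[3 * half:3 * half + self.n_h], pre[3 * half + self.n_h:]
        g_f = np.tile(sigmoid(gf_hat), 2)                                 # Eqs. 19-21: gates are
        g_i = np.tile(sigmoid(gi_hat), 2)                                 # shared by the real and
        g_o = np.tile(sigmoid(go_hat), 2)                                 # imaginary halves
        u = bound(self.W_xu @ x_t + self.W_hu @ h_prev + self.b_u)        # Eqs. 18, 23
        r_i, r_o = bound(ri_hat), bound(ro_hat)                           # Eqs. 24-25
        cs_t, readout = [], np.zeros(self.n_h)
        for s in range(self.n_copies):
            r_i_s = self.permute(r_i, s)                                  # Eq. 26
            c_s = g_f * cs_prev[s] + complex_mul(r_i_s, g_i * u)          # Eq. 27
            cs_t.append(c_s)
            readout += complex_mul(self.permute(r_o, s), c_s)             # Eqs. 29-30
        h_t = g_o * bound(readout / self.n_copies)                        # Eq. 30
        return h_t, cs_t

# Example usage with arbitrary sizes:
cell = AssociativeLSTMCell(n_in=8, n_h=128, n_copies=4)
h, cs = np.zeros(128), [np.zeros(128) for _ in range(4)]
h, cs = cell.step(np.ones(8), h, cs)
```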
On some tasks, we found that learning speed was improved by not feeding $h_{t-1}$ to the update $u$: namely, $W_{hu}$ is set to zero in Equation 18, which causes $u$ to serve as an embedding of $x_t$. This modification was made for the episodic copy, XML modeling, and variable assignment tasks below.
6. Experiments
All experiments used the Adam optimisation algorithm (Kingma & Ba, 2014) with no gradient clipping. For experiments with synthetic data, we generate new data for each training minibatch, obviating the need for a separate test data set. Minibatches of size 2 were used in all tasks except the Wikipedia task below, where the minibatch size was 10.
We compared Associative LSTM to multiple baselines:
LSTM. We use LSTM with forget gates and without peepholes (Gers et al., 2000).
Permutation RNN. Each sequence is encoded by using powers of a constant random permutation matrix as keys:
$$h_t = P h_{t-1} + W x_t \tag{31}$$
Table 1. Network sizes on the episodic copy task.

| Network | $N_h$ | Relative speed | #parameters |
|---|---|---|---|
| LSTM | 128 | 1 | 72,993 |
| LSTM ($N_h$ = 512) | 512 | 0.18 | 1,078,305 |
| Associative LSTM | 128 (= 64 complex numbers) | 0.22 ($N_{copies}$ = 1), 0.16 ($N_{copies}$ = 4), 0.12 ($N_{copies}$ = 8) | 65,505 |
| Permutation RNN | 256 | 2.05 | 5,642 |
| Unitary RNN | 256 (= 128 complex numbers) | 0.24 | 6,666 |
| Multiplicative uRNN | 256 | 0.23 | 10,506 |
Only the input and output weights are learned. Representing sequences by permuting sums is described in Kanerva (2009).
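As a sketch of this baseline (Equation 31), with the fixed permutation implemented as index shuffling and only W (plus an output layer, not shown) to be learned; the sizes below are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_h, n_in = 256, 8
perm = rng.permutation(n_h)                  # constant random permutation P
W = 0.1 * rng.standard_normal((n_h, n_in))   # learned input weights

def permutation_rnn_encode(xs):
    """Encode a sequence as a sum of permuted inputs (Equation 31)."""
    h = np.zeros(n_h)
    for x_t in xs:
        h = h[perm] + W @ x_t                # h_t = P h_{t-1} + W x_t
    return h
```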
Unitary RNN. Arjovsky et al. (2015) recently introduced recurrent neural networks with unitary weight matrices.¹
They consider dynamics of the form
$$h_t = f(W h_{t-1} + V x_t) \tag{32}$$
where $W$ is a unitary matrix ($W^\dagger W = I$). The product of unitary matrices is a unitary matrix, so $W$ can be parameterised as the product of simpler unitary matrices. In particular,
$$h_t = f(D_3 R_2 F^{-1} D_2 P R_1 F D_1 h_{t-1} + V x_t) \tag{33}$$
where $D_3, D_2, D_1$ are learned diagonal complex matrices, and $R_2, R_1$ are learned reflection matrices. Matrices $F$ and $F^{-1}$ are the discrete Fourier transformation and its inverse. $P$ is any constant random permutation. The activation function $f(h)$ applies a rectified linear unit with a learned bias to the modulus of each complex number. Only the diagonal and reflection matrices, $D$ and $R$, are learned, so Unitary RNNs have fewer parameters than LSTM with comparable numbers of hidden units.
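A hedged NumPy sketch of Equation 33 with a modReLU-style activation applied to the modulus; the reflection and FFT conventions, names, and random initialisation are our assumptions, and training is omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_in = 128, 8   # complex hidden units and input size

thetas = [rng.uniform(-np.pi, np.pi, n) for _ in range(3)]   # phases of D1, D2, D3
refl = [rng.standard_normal(n) + 1j * rng.standard_normal(n) for _ in range(2)]  # R1, R2 vectors
perm = rng.permutation(n)                                    # fixed permutation P
bias = np.zeros(n)                                           # modReLU bias
V = 0.01 * (rng.standard_normal((n, n_in)) + 1j * rng.standard_normal((n, n_in)))

def reflect(v, h):
    """Householder-style reflection (I - 2 v v^H / ||v||^2) applied to h."""
    return h - 2.0 * v * (np.vdot(v, h) / np.vdot(v, v))

def mod_relu(z, b):
    """ReLU with a learned bias applied to the modulus of each complex unit."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) / (mag + 1e-8) * z

def unitary_rnn_step(h_prev, x_t):
    h = np.exp(1j * thetas[0]) * h_prev      # D1
    h = np.fft.fft(h) / np.sqrt(n)           # F (scaled to be unitary)
    h = reflect(refl[0], h)                  # R1
    h = h[perm]                              # P
    h = np.exp(1j * thetas[1]) * h           # D2
    h = np.fft.ifft(h) * np.sqrt(n)          # F^{-1}
    h = reflect(refl[1], h)                  # R2
    h = np.exp(1j * thetas[2]) * h           # D3
    return mod_relu(h + V @ x_t, bias)       # Eq. 33
```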
Multiplicative Unitary RNN. To obtain a stronger baseline, we enhanced the Unitary RNNs with multiplicative interactions (Sutskever et al., 2011) by conditioning all complex diagonal matrices on the input $x_t$:

$$\hat{r} = W_{xr} x_t \tag{34}$$
$$D = \begin{bmatrix} \text{diag}(\cos(\hat{r})) & -\text{diag}(\sin(\hat{r})) \\ \text{diag}(\sin(\hat{r})) & \text{diag}(\cos(\hat{r})) \end{bmatrix} \tag{35}$$
6.1. Episodic Tasks
6.1.1. EPISODIC COPY
The copy task is a simple benchmark that tests the ability of the architectures to store a sequence of random characters and repeat the sequence after a time lag. Each input sequence is composed of 10 random characters, followed by 100 blanks, and a delimiter symbol. After the delimiter symbol is presented, networks must reproduce the first 10 characters, matching the task description in Arjovsky et al. (2015). Although copy is not interesting per se, failure on copy indicates an extreme limitation of a system's capacity to memorise.

¹ We were excited to see that other researchers are also studying the benefits of complex-valued recurrent networks.

Figure 3. Training cost per sequence (bits) against minibatch number (thousands) on the fixed-length episodic copy task, comparing LSTM, LSTM with forget gate bias = 1, Associative LSTM with $N_{copies}$ = 1, 4, 8, the Permutation RNN, and the Unitary RNN. LSTM learns faster if the forget gate bias is set to 1. Associative LSTM was able to solve the task quickly without biasing the forget gate.
Arjovsky et al. (2015) presented very good results on the copy task using Unitary RNNs. We wanted to determine whether Associative LSTM can learn the task with a similarly small number of data samples. The results are displayed in Figure 3. The Permutation RNN and Unitary RNN solve the task quickly. Associative LSTM solved the task a little bit slower, but still much faster than LSTM. All considered, this task requires the network to store only a small number of symbols; consequently, adding redundancy to the Associative LSTM, though not harmful, did not bestow any benefits.
We considered this variant of the copy task too easy since it posed no difficulty to the very simple Permutation RNN. The Permutation RNN can find a solution by building a hash of the input sequence (a sum of many permuted inputs). The output weights then only need to learn to classify the hash codes. A more powerful Permutation RNN could use a deep network at the output.

Figure 4. Training cost per sequence (bits) against minibatch number (thousands) on the episodic copy task with variable-length sequences (1 to 10 characters), comparing LSTM with forget gate bias = 1, Associative LSTM with $N_{copies}$ = 1, 4, 8, the Permutation RNN, the Unitary RNN, and the Multiplicative uRNN. Associative LSTM learns quickly and almost as fast as in the fixed-length episodic copy. Unitary RNN converges slowly relative to the fixed-length task.
To present a more complex challenge, we constructed one other variant of the task in which the number of random characters in the sequence is not fixed at 10 but is itself a variable drawn uniformly from 1 to 10. Surprisingly, this minor variation compromised the performance of the Unitary RNN, while the Associative LSTM still solved the task quickly. We display these results in Figure 4. We suspect that the Permutation RNN and Unitary RNN would improve on the task if they were outfitted with a mechanism to control the speed of the dynamics: for example, one could define a pause gate whose activation freezes the hidden state of the system after the first 10 symbols, including possible blanks. This would render the variable-length task exactly equivalent to the original.
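For concreteness, here is a sketch of how the copy-task sequences described above could be generated; the alphabet size and the integer encoding of blanks and the delimiter are our assumptions, not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(4)
N_CHARS = 8                      # assumed alphabet of random characters
BLANK, DELIM = N_CHARS, N_CHARS + 1

def copy_example(min_len=1, max_len=10, lag=100):
    """Input/target pair for the episodic copy task.

    The fixed-length variant corresponds to min_len = max_len = 10;
    the variable-length variant draws the length uniformly from 1 to 10.
    """
    length = rng.integers(min_len, max_len + 1)
    chars = rng.integers(0, N_CHARS, length)
    inputs = np.concatenate([chars, np.full(lag, BLANK), [DELIM], np.full(length, BLANK)])
    targets = np.concatenate([np.full(length + lag + 1, BLANK), chars])
    return inputs, targets
```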
Table 1 shows the number of parameters for each network. Associative LSTM has fewer parameters than LSTM if the $W_{hu}$ matrix in Equation 18 is set to zero and the gates are duplicated for the real and the imaginary parts. Additionally, the number of parameters in Associative LSTM is not affected by the number of copies used; the permutation matrices do not add parameters since they are randomly initialised and left unchanged.
6.2. Online Tasks
As important as remembering is forgetting. The following tasks consist of continuous streams of characters, and the networks must discover opportune moments to reset their internal states.
6.2.1. XML MODELING
Figure 5. Training cost per character (bits) against minibatch number (thousands) on the XML task, including Unitary RNNs: LSTM, LSTM with $N_h$ = 512, Associative LSTM with $N_{copies}$ = 1, 4, 8, Unitary RNN with $N_h$ = 1024, and Multiplicative uRNN with $N_h$ = 1024.

Figure 6. Training cost per character (bits) against minibatch number (thousands) on the XML task. LSTM and Associative LSTM with 128 hidden units are also compared to a larger LSTM with 512 units.

The XML modeling task was defined in Jozefowicz et al. (2015). The XML language consists of nested (context-free) tags of the form <name> ... </name>. The input is a sequence of tags with names of 1 to 10 random lowercase characters. The tag name is only predictable when it is closed by its matching closing tag, so the prediction cost is confined to the closing tags in the sequence. Each symbol must be predicted one time step before it is presented. An example sequence looks like:
svgquspn>
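The example sequence above is truncated in this extraction. Purely as an illustration of the format described (nested tags whose names are 1 to 10 random lowercase characters, with cost measured on the predictable closing tags), here is a hedged generator sketch; the nesting policy and maximum depth are our assumptions:

```python
import random
import string

rng = random.Random(5)

def xml_stream(n_events=20, max_depth=4):
    """Generate an XML-like character stream of nested tags with random names."""
    out, stack = [], []
    for _ in range(n_events):
        # Randomly open a new tag or close the most recently opened one.
        if stack and (len(stack) >= max_depth or rng.random() < 0.5):
            out.append("</" + stack.pop() + ">")
        else:
            name = "".join(rng.choice(string.ascii_lowercase)
                           for _ in range(rng.randint(1, 10)))
            stack.append(name)
            out.append("<" + name + ">")
    while stack:                               # close any tags left open
        out.append("</" + stack.pop() + ">")
    return "".join(out)

# Only the closing tag names are predictable from the preceding text, so prediction
# cost would be measured on the characters inside the closing tags.
print(xml_stream())
```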