# Generative Image Modeling Using Spatial LSTMs

Lucas Theis
University of Tübingen
72076 Tübingen, Germany
lucas@bethgelab.org

Matthias Bethge
University of Tübingen
72076 Tübingen, Germany
matthias@bethgelab.org

## Abstract

Modeling the distribution of natural images is challenging, partly because of strong statistical dependencies which can extend over hundreds of pixels. Recurrent neural networks have been successful in capturing long-range dependencies in a number of problems but have only recently found their way into generative image models. We here introduce a recurrent image model based on multi-dimensional long short-term memory units which are particularly suited for image modeling due to their spatial structure. Our model scales to images of arbitrary size and its likelihood is computationally tractable. We find that it outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting.

## 1 Introduction

The last few years have seen tremendous progress in learning useful image representations [6]. While early successes were often achieved through the use of generative models [e.g., 13, 23, 30], recent breakthroughs were mainly driven by improvements in supervised techniques [e.g., 20, 34]. Yet unsupervised learning has the potential to tap into the much larger source of unlabeled data, which may be important for training bigger systems capable of a more general scene understanding. For example, multimodal data is abundant but often unlabeled, yet can still greatly benefit unsupervised approaches [36].

Generative models provide a principled approach to unsupervised learning. A perfect model of natural images would be able to optimally predict parts of an image given other parts of an image and thereby clearly demonstrate a form of scene understanding. When extended by labels, the Bayesian framework can be used to perform semi-supervised learning in the generative model [19, 28], while it is less clear how to combine other unsupervised approaches with discriminative learning. Generative image models are also useful in more traditional applications such as image reconstruction [33, 35, 49] or compression [47].

Recently there has been renewed strong interest in the development of generative image models [e.g., 4, 8, 10, 11, 18, 24, 31, 35, 45, 47]. Most of this work has tried to bring the flexibility of deep neural networks to bear on the problem of modeling the distribution of natural images. One challenge in this endeavor is to find the right balance between tractability and flexibility. The present article contributes to this line of research by introducing a fully tractable yet highly flexible image model.

Our model combines multi-dimensional recurrent neural networks [9] with mixtures of experts. More specifically, the backbone of our model is formed by a spatial variant of long short-term memory (LSTM) [14]. One-dimensional LSTMs have been particularly successful in modeling text and speech [e.g., 38, 39], but have also been used to model the progression of frames in video [36] and, very recently, single images [11]. In contrast to this earlier work on modeling images, here we use multi-dimensional LSTMs [9], which naturally lend themselves to the task of generative image modeling due to their spatial structure and ability to capture long-range correlations.
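To make the spatial structure concrete, the following is a minimal NumPy sketch of a single spatial LSTM update: the unit at pixel $(i, j)$ receives hidden and memory states from its left and upper neighbors, with one forget gate per direction. The weight packing, gate layout, and names here are our illustrative assumptions, not the exact parametrization of the multi-dimensional LSTM of [9].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x, h_left, h_up, c_left, c_up, W, b):
    """One spatial LSTM update at pixel (i, j) -- illustrative sketch.

    x       -- features of the pixel's causal neighborhood
    h_left  -- hidden state of the unit to the left,  h_{i,j-1}
    h_up    -- hidden state of the unit above,        h_{i-1,j}
    c_left  -- memory state of the unit to the left,  c_{i,j-1}
    c_up    -- memory state of the unit above,        c_{i-1,j}
    W, b    -- shared weights and biases (hypothetical packing:
               gate blocks g, o, i, f_left, f_up stacked row-wise)
    """
    z = W @ np.concatenate([x, h_left, h_up]) + b
    g, o, i, f_left, f_up = np.split(z, 5)
    g = np.tanh(g)                                  # proposed memory content
    o, i = sigmoid(o), sigmoid(i)                   # output and input gates
    f_left, f_up = sigmoid(f_left), sigmoid(f_up)   # one forget gate per direction
    c = g * i + c_left * f_left + c_up * f_up       # two-dimensional memory update
    h = np.tanh(c) * o                              # hidden state passed right and down
    return h, c

# Toy usage: 2 hidden units, 3 neighborhood features, zero-initialized states.
rng = np.random.default_rng(0)
n, d = 2, 3
W, b = rng.normal(size=(5 * n, d + 2 * n)), np.zeros(5 * n)
h, c = slstm_step(rng.normal(size=d), *([np.zeros(n)] * 4), W, b)
```

Because the same weights are shared across all pixel locations, sweeping this update over an image in raster-scan order lets information propagate arbitrarily far to the right and downward, which is what gives the model access to long-range correlations.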
Figure 1: (A) We factorize the distribution of images such that the prediction of a pixel (black) may depend on any pixel in the upper-left green region. (B) A graphical-model representation of an MCGSM with a causal neighborhood limited to a small region. (C) A visualization of our recurrent image model with two layers of spatial LSTMs (SLSTM units). The pixels of the image are represented twice and some arrows are omitted for clarity. Through feedforward connections, the prediction of a pixel depends directly on its neighborhood (green), but through recurrent connections it has access to the information in a much larger region (red).

To model the distribution of pixels conditioned on the hidden states of the neural network, we use mixtures of conditional Gaussian scale mixtures (MCGSMs) [41]. This class of models can be viewed as a generalization of Gaussian mixture models, but its parametrization makes it much more suitable for natural images. By treating images as instances of a stationary stochastic process, this model allows us to sample and capture the correlations of arbitrarily large images (see the sampling sketch at the end of this section).

## 2 A recurrent model of natural images

In the following, we first review and extend the MCGSM [41] and multi-dimensional LSTMs [9] before explaining how to combine them into a recurrent image model. Section 3 will demonstrate the validity of our approach by evaluating and comparing the model on a number of image datasets.

### 2.1 Factorized mixtures of conditional Gaussian scale mixtures

One successful approach to building flexible yet tractable generative models has been to use fully-visible belief networks [21, 27]. To apply such a model to images, we have to give the pixels an ordering and specify the distribution of each pixel conditioned on its parent pixels. Several parametrizations have been suggested for the conditional distributions in the context of natural images [5, 15, 41, 44, 45]. We here review and extend the work of Theis et al. [41], who proposed to use mixtures of conditional Gaussian scale mixtures (MCGSMs). Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ the intensity of the pixel at location $ij$. Further, let $\mathbf{x}_{<ij}$ denote the set of pixels that precede $x_{ij}$ in the ordering, that is, the pixels above it and those to its left in the same row (the green region in Figure 1A).
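Written out, the chain-rule factorization underlying this construction (restated here from the definitions above; notation ours) is

```latex
p(\mathbf{x}) \;=\; \prod_{i,j} p\!\left(x_{ij} \mid \mathbf{x}_{<ij}\right),
```

so that evaluating the likelihood of an image reduces to a product of tractable one-dimensional conditional densities, here modeled by MCGSMs.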
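The same factorization makes ancestral sampling straightforward: pixels are drawn one at a time in the same raster-scan order used for the conditionals. Below is a minimal sketch, assuming a hypothetical `sample_pixel` callable that wraps the model's conditional distribution (e.g., the MCGSM given the recurrent state); the function name and interface are our own, not part of the paper.

```python
import numpy as np

def sample_image(sample_pixel, height, width):
    """Ancestral sampling from a factorized image model.

    sample_pixel(x, i, j) is a hypothetical callable drawing x_ij
    from p(x_ij | x_<ij), reading only the already-sampled pixels
    of x (those above position (i, j) and to its left).
    """
    x = np.zeros((height, width))
    for i in range(height):        # top to bottom
        for j in range(width):     # left to right
            x[i, j] = sample_pixel(x, i, j)
    return x
```

Because the conditional distributions share parameters across locations (stationarity), the same procedure applies unchanged to arbitrarily large values of `height` and `width`.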