# One-Shot Generalization in Deep Generative Models

Danilo J. Rezende* DANILOR@GOOGLE.COM
Shakir Mohamed* SHAKIR@GOOGLE.COM
Ivo Danihelka DANIHELKA@GOOGLE.COM
Karol Gregor KAROLG@GOOGLE.COM
Daan Wierstra WIERSTRA@GOOGLE.COM

Google DeepMind, London

*Equal contributions. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Humans have an impressive ability to reason about new concepts and experiences from just a single example. In particular, humans have an ability for one-shot generalization: an ability to encounter a new concept, understand its structure, and then be able to generate compelling alternative variations of the concept. We develop machine learning systems with this important capacity by developing new deep generative models, models that combine the representational power of deep learning with the inferential power of Bayesian reasoning. We develop a class of sequential generative models that are built on the principles of feedback and attention. These two characteristics lead to generative models that are among the state of the art in density estimation and image generation. We demonstrate the one-shot generalization ability of our models using three tasks: unconditional sampling, generating new exemplars of a given concept, and generating new exemplars of a family of concepts. In all cases our models are able to generate compelling and diverse samples having seen new examples just once, providing an important class of general-purpose models for one-shot machine learning.

## 1. Introduction

Figure 1. Given the first row, our model generates new exemplars.

Consider the images in the red box in figure 1. We see each of these new concepts just once, understand their structure, and are then able to imagine and generate compelling alternative variations of each concept, similar to those drawn in the rows beneath the red box. This is an ability that humans have for one-shot generalization: an ability to generalize to new concepts given just one or a few examples. In this paper, we develop new models that possess this capacity for one-shot generalization: models that allow for one-shot reasoning from the data streams we are likely to encounter in practice, that use only limited forms of domain-specific knowledge, and that can be applied to diverse sets of problems.

There are two notable approaches that incorporate one-shot generalization. Salakhutdinov et al. (2013) developed a probabilistic model that combines a deep Boltzmann machine with a hierarchical Dirichlet process to learn hierarchies of concept categories as well as provide a powerful generative model. Recently, Lake et al. (2015) presented a compelling demonstration of the ability of probabilistic models to perform one-shot generalization, using Bayesian program learning, which is able to learn a hierarchical, non-parametric generative model of handwritten characters. Their approach incorporates specific knowledge of how strokes are formed and the ways in which they are combined to produce characters of different types, exploiting similar strategies used by humans. Lake et al. (2015) see the capacity for one-shot generalization demonstrated by Bayesian program learning as a challenge for neural models.
By combining the representational power of deep neural networks embedded within hierarchical latent variable models with the inferential power of approximate Bayesian reasoning, we show that this is a challenge that can be overcome. The resulting deep generative models are general-purpose image models that are accurate and scalable, among the state of the art, and possess the important capacity for one-shot generalization.

Deep generative models are a rich class of models for density estimation that specify a generative process for observed data using a hierarchy of latent variables. Models that are directed graphical models have risen in popularity and include discrete latent variable models such as sigmoid belief networks and deep auto-regressive networks (Saul et al., 1996; Gregor et al., 2014), or continuous latent variable models such as non-linear Gaussian belief networks and deep latent Gaussian models (Rezende et al., 2014; Kingma & Welling, 2014). These models use deep networks in the specification of their conditional probability distributions to allow rich non-linear structure to be learned. Such models have been shown to have a number of desirable properties: inference of the latent variables allows us to provide a causal explanation for the data that can be used to explore its underlying factors of variation and for exploratory analysis; analogical reasoning between two related concepts, e.g., styles and identities of images, is naturally possible; any missing data can be imputed by treating them as additional latent variables, capturing the full range of correlation between missing entries under any missingness pattern; these models embody minimum description length principles and can be used for compression; and these models can be used to learn environment simulators, enabling a wide range of approaches for simulation-based planning.

Two principles are central to our approach: feedback and attention. These principles allow the models we develop to reflect the process of analysis-by-synthesis, in which the analysis of observed information is continually integrated with constructed interpretations of it (Yuille & Kersten, 2006; Erdogan et al., 2015; Nair et al., 2008). Analysis is realized by attentional mechanisms that allow us to selectively process and route information from the observed data into the model. Interpretations of the data are then obtained by sets of latent variables that are inferred sequentially to evaluate the probability of the data. The aim of such a construction is to introduce internal feedback into the model that allows for a 'thinking time' during which information can be extracted from each data point more effectively, leading to improved inference, generation and generalization. We shall refer to such models as sequential generative models. Models such as DRAW (Gregor et al., 2015), composited variational auto-encoders (Huang & Murphy, 2015) and AIR (Eslami et al., 2016) are existing models in this class, and we will develop a general class of sequential generative models that incorporates these and other latent variable models and variational auto-encoders.

Our contributions are:

- We develop sequential generative models that provide a generalization of existing approaches, allowing for sequential generation and inference, multi-modal posterior approximations, and a rich new class of deep generative models.
- We demonstrate the clear improvement that attentional mechanisms, used in both more powerful models and inference, provide in advancing the state of the art in deep generative models.
- Importantly, we show that our generative models have the ability to perform one-shot generalization. We explore three generalization tasks and show that our models can imagine and generate compelling alternative variations of images after having seen them just once.

## 2. Varieties of Attention

Attending to parts of a scene, ignoring others, analyzing the parts that we focus on, and sequentially building up an interpretation and understanding of a scene: these are natural parts of human cognition. This is so successful a strategy for reasoning that it is now also an important part of many machine learning systems. This repeated process of attention and interpretation, analysis and synthesis, is an important component of the generative models we develop.

In its most general form, any mechanism that allows us to selectively route information from one part of our model to another can be regarded as an attentional mechanism. Attention allows for a wide range of invariances to be incorporated, with few additional parameters and low computational cost. Attention has been most widely used for classification tasks, having been shown to improve both scalability and generalization (Larochelle & Hinton, 2010; Chikkerur et al., 2010; Xu et al., 2015; Jaderberg et al., 2015; Mnih et al., 2014; Ba et al., 2015). The attention used in discriminative tasks is a reading attention that transforms an image into a representation in a canonical coordinate space (that is typically lower dimensional), with the parameters controlling the attention learned by gradient descent.

Attention in unsupervised learning is much more recent (Tang et al., 2014; Gregor et al., 2015). In latent variable models, we have two processes, inference and generation, that can both use attention, though in slightly different ways. The generative process makes use of a writing or generative attention, which implements a selective updating of the output variables, e.g., updating only a small part of the generated image. The inference process makes use of reading attention, like that used in classification. Although conceptually different, both these forms of attention can be implemented with the same computational tools. We focus on image modelling and make use of spatial attention. Two other types of attention, randomized and error-based, are discussed in appendix B.

Spatially-transformed attention. Rather than selecting a patch of an image (taking glimpses) as other methods do, a more powerful approach is to use a mechanism that provides invariance to shape and size of objects in the images (general affine transformations). Tang et al. (2014) take such an approach and use 2D similarity transforms to provide basic affine invariance. Spatial transformers (Jaderberg et al., 2015) are a more general method for providing such invariance, and are our preferred attentional mechanism. Spatial transformers (ST) process an input image $\mathbf{x}$ using parameters $\boldsymbol{\lambda}$ to generate an output:

$$\mathrm{ST}(\mathbf{x}, \boldsymbol{\lambda}) = [\kappa_h(\boldsymbol{\lambda}) \otimes \kappa_w(\boldsymbol{\lambda})] \ast \mathbf{x},$$

where $\kappa_h$ and $\kappa_w$ are 1-dimensional kernels, $\otimes$ indicates the tensor outer-product of the two kernels and $\ast$ indicates a convolution. Huang & Murphy (2015) develop occlusion-aware generative models that make use of spatial transformers in this way.
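To make the separable read above concrete, the sketch below constructs the two 1-dimensional kernels as Gaussian filterbanks and applies them along the height and width of an image, in the spirit of the DRAW-style attention cited earlier. It is a minimal illustration only: the function names, the Gaussian parameterization of $\kappa_h$ and $\kappa_w$, and the parameter vector λ = (center_y, center_x, stride, σ) are our own assumptions for this sketch, not the paper's implementation (which uses the spatial transformer of Jaderberg et al., 2015).

```python
import numpy as np

def gaussian_filterbank(center, stride, sigma, n_out, n_in):
    """Build an (n_out, n_in) bank of 1-D Gaussian kernels.

    Each row is a Gaussian centred at evenly spaced positions around
    `center`; together the rows play the role of the 1-D kernel kappa(lambda).
    (Illustrative parameterization, not the paper's.)
    """
    offsets = (np.arange(n_out) - n_out / 2.0 + 0.5) * stride
    mu = center + offsets                        # filter centres, shape (n_out,)
    positions = np.arange(n_in)                  # input coordinates, shape (n_in,)
    F = np.exp(-((positions[None, :] - mu[:, None]) ** 2) / (2.0 * sigma ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)   # normalise each kernel row

def st_read(x, lam, glimpse_shape=(12, 12)):
    """Separable attention read: glimpse = [kappa_h(lam) (outer) kappa_w(lam)] * x."""
    cy, cx, stride, sigma = lam
    H, W = x.shape
    gh, gw = glimpse_shape
    F_h = gaussian_filterbank(cy, stride, sigma, gh, H)   # (gh, H)
    F_w = gaussian_filterbank(cx, stride, sigma, gw, W)   # (gw, W)
    # Apply the two 1-D kernels along the height and width of the image.
    return F_h @ x @ F_w.T                                # (gh, gw)

# Usage: read a 12x12 glimpse centred on the middle of a 28x28 image.
image = np.random.rand(28, 28)
glimpse = st_read(image, lam=(14.0, 14.0, 1.0, 1.0))
print(glimpse.shape)  # (12, 12)
```

Because the read is separable, its cost grows with the sum rather than the product of the image dimensions, which is one reason such attention adds few parameters and little computation.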
When used for reading attention, spatial transformers allow the model to observe the input image in a canonical form, providing the desired invariance. When used for writing attention, they allow the generative model to independently handle the position, scale and rotation of parts of the generated image, as well as their content. A direct extension is to use multiple attention windows simultaneously (see appendix).

## 3. Iterative and Attentive Generative Models

### 3.1. Latent Variable Models and Variational Inference

Generative models with latent variables describe the probabilistic process by which an observed data point can be generated. The simplest formulations such as PCA and factor analysis use Gaussian latent variables z that are combined linearly to generate Gaussian distributed data points x. In more complex models, the probabilistic description consists of a hierarchy of L layers of latent variables, where each layer depends on the layer above in a non-linear way (Rezende et al., 2014). For deep generative models, we specify this non-linear dependency using deep neural networks. To compute the marginal probability of the data, we must integrate over any unobserved variables:

$$p(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z} \qquad (1)$$

In deep latent Gaussian models, the prior distribution p(z) is a Gaussian distribution and the likelihood function pθ(x|z) is any distribution that is appropriate for the observed data, such as a Gaussian, Bernoulli, categorical or other distribution, and that is dependent in a non-linear way on the latent variables. For most models, the marginal likelihood (1) is intractable and we must instead approximate it. One popular approximation technique is based on variational inference (Jordan et al., 1999), which transforms the difficult integration into an optimization problem that is typically more scalable and easier to solve. Using variational inference we can approximate the marginal likelihood by a lower bound, which is the objective function we use for optimization:

$$\mathcal{F} = \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\!\left[q_\phi(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\right] \qquad (2)$$

The objective function (2) is the negative free energy, which allows us to trade off the reconstruction ability of the model (first term) against the complexity of the posterior distribution (second term). Variational inference approximates the true posterior distribution by a known family of approximating posteriors qφ(z|x) with variational parameters φ. Learning now involves optimization of the variational parameters φ and model parameters θ. Instead of optimization by the variational EM algorithm, we take an amortized inference approach and represent the distribution q(z|x) as a recognition or inference model, which we also parameterize using a deep neural network. Inference models amortize the cost of posterior inference and make it more efficient by allowing for generalization across the inference computations using a set of global variational parameters φ. In this framework, we can think of the generative model as a decoder of the latent variables, and the inference model as its inverse, an encoder of the observed data into the latent description. As a result, this specific combination of deep latent variable model (typically latent Gaussian) with variational inference that is implemented using an inference model is referred to as a variational auto-encoder (VAE).
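As a concrete illustration of the bound in equation (2), the sketch below computes a single-sample Monte Carlo estimate of F for a latent Gaussian model with a Bernoulli likelihood, using the pathwise (reparameterised) sampling of the posterior. The toy linear encoder and decoder are hypothetical stand-ins for the deep networks used in practice; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def bernoulli_log_lik(x, logits):
    """log p_theta(x|z) for a Bernoulli likelihood parameterised by logits."""
    return np.sum(x * logits - np.logaddexp(0.0, logits))

def gaussian_kl(mu, log_var):
    """Closed-form KL[ N(mu, diag(exp(log_var))) || N(0, I) ]."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def free_energy(x, encoder, decoder, rng):
    """Single-sample Monte Carlo estimate of the bound in equation (2)."""
    mu, log_var = encoder(x)                   # q_phi(z|x) = N(mu, diag(sigma^2))
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps       # pathwise (reparameterised) sample
    logits = decoder(z)                        # parameters of p_theta(x|z)
    return bernoulli_log_lik(x, logits) - gaussian_kl(mu, log_var)

# Toy linear encoder/decoder, purely for illustration (hypothetical stand-ins).
rng = np.random.default_rng(0)
D, K = 784, 20
W_enc = rng.standard_normal((2 * K, D)) * 0.01
W_dec = rng.standard_normal((D, K)) * 0.01
encoder = lambda x: (W_enc[:K] @ x, W_enc[K:] @ x)   # mean and log-variance
decoder = lambda z: W_dec @ z                        # Bernoulli logits
x = (rng.random(D) > 0.5).astype(float)
print(free_energy(x, encoder, decoder, rng))
```

Maximising this estimate with respect to both φ (encoder) and θ (decoder) is the learning procedure described above; the two terms make the reconstruction-versus-posterior-complexity trade-off explicit.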
VAEs allow for a single computational graph to be constructed and straightforward gradient computations: when the latent variables are continuous, gradient estimators based on pathwise derivative estimators are used (Rezende et al., 2014; Kingma & Welling, 2014; Burda et al., 2016) and when they are discrete, score function estimators are used (Mnih & Gregor, 2014; Ranganath et al., 2014; Mansimov et al., 2016).

### 3.2. Sequential Generative Models

The generative models as we have described them thus far can be characterized as single-step models, since they are models of i.i.d. data that evaluate their likelihood functions by transforming the latent variables using a non-linear, feed-forward transformation. A sequential generative model is a natural extension of the latent variable models used in VAEs. Instead of generating the K latent variables of the model in one step, these models sequentially generate T groups of k latent variables (K = kT), i.e., using T computational steps to allow later groups of latent variables to depend on previously generated latent variables in a non-linear way.

#### 3.2.1. Generative Model

In their most general form, sequential generative models describe the observed data over T time steps using a set of latent variables zt at each step. The generative model is shown in the stochastic computational graph of figure 2(a), and described by:

$$\begin{aligned}
\text{Latent variables} \quad & \mathbf{z}_t \sim \mathcal{N}(\mathbf{z}_t \mid \mathbf{0}, \mathbf{I}), \quad t = 1, \dots, T && (3)\\
\text{Context} \quad & \mathbf{v}_t = f_v(\mathbf{h}_{t-1}, \mathbf{x}'; \theta_v) && (4)\\
\text{Hidden state} \quad & \mathbf{h}_t = f_h(\mathbf{h}_{t-1}, \mathbf{z}_t, \mathbf{v}_t; \theta_h) && (5)\\
\text{Hidden canvas} \quad & \mathbf{c}_t = f_c(\mathbf{c}_{t-1}, \mathbf{h}_t; \theta_c) && (6)\\
\text{Observation} \quad & \mathbf{x} \sim p(\mathbf{x} \mid f_o(\mathbf{c}_T; \theta_o)) && (7)
\end{aligned}$$

Figure 2. Stochastic computational graph showing conditional probabilities and computational steps for sequential generative models: (a) unconditional generative model; (b) one step of the conditional generative model. A represents an attentional mechanism that uses the function fw for writing and fr for reading.

Each step generates an independent set of K-dimensional latent variables zt (equation (3)). If we wish to condition the model on an external context or piece of side-information x′, then a deterministic function fv (equation (4)) is used to read the context images using an attentional mechanism. A deterministic transition function fh introduces the sequential dependency between each of the latent variables, incorporating the context if it exists (equation (5)). This allows any transition mechanism to be used, and our transition is specified as a long short-term memory network (LSTM; Hochreiter & Schmidhuber, 1997). We explicitly represent the creation of a set of hidden variables ct that is a hidden canvas of the model (equation (6)). The canvas function fc allows for many different transformations, and it is here where generative (writing) attention is used; we describe a number of choices for this function in section 3.2.3. The generated image (equation (7)) is sampled using an observation function fo(c; θo) that maps the last hidden canvas cT to the parameters of the observation model. The set of all parameters of the generative model is θ = {θh, θc, θo}.

#### 3.2.2. Free Energy Objective

Given the probabilistic model (3)-(7) we can obtain an objective function for inference and parameter learning using variational inference. By applying the variational principle, we obtain the free energy objective:

$$\log p(\mathbf{x}) = \log \int p(\mathbf{x} \mid \mathbf{z}_{1:T})\, p(\mathbf{z}_{1:T})\, d\mathbf{z}_{1:T} \geq \mathcal{F}$$

$$\mathcal{F} = \mathbb{E}_{q(\mathbf{z}_{1:T})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z}_{1:T})\right] - \sum_{t=1}^{T} \mathrm{KL}\!\left[q_\phi(\mathbf{z}_t \mid \mathbf{z}_{<t}, \mathbf{x}) \,\|\, p(\mathbf{z}_t)\right]$$
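To summarise the generative process of section 3.2.1 in code, the recursion of equations (3)-(7) amounts to a short loop: sample zt, read the context, update the hidden state, write to the hidden canvas, and finally map the final canvas to observation parameters. The sketch below is a minimal, hypothetical rendering of that loop; the stand-in functions (a tanh recurrence instead of the LSTM transition, an additive canvas write, a sigmoid observation map, and no attentive context read) are our assumptions for illustration, not the authors' architecture.

```python
import numpy as np

def sequential_generate(T, K, canvas_shape, f_v, f_h, f_c, f_o, rng, context=None):
    """Generative loop of equations (3)-(7): latents -> hidden state -> hidden canvas."""
    h = np.zeros(K)                       # hidden state h_0
    c = np.zeros(canvas_shape)            # hidden canvas c_0
    for t in range(T):
        z_t = rng.standard_normal(K)      # (3) z_t ~ N(0, I)
        v_t = f_v(h, context)             # (4) read of the context x', if any
        h = f_h(h, z_t, v_t)              # (5) transition (an LSTM in the paper)
        c = f_c(c, h)                     # (6) write onto the hidden canvas
    return f_o(c)                         # (7) map the final canvas c_T to observation params

# Illustrative stand-ins for f_v, f_h, f_c, f_o (hypothetical, for this sketch only).
rng = np.random.default_rng(0)
K, H, W = 32, 28, 28
W_h = rng.standard_normal((K, 3 * K)) * 0.1
W_c = rng.standard_normal((H * W, K)) * 0.1
f_v = lambda h, ctx: np.zeros(K)                                 # unconditional model: no context read
f_h = lambda h, z, v: np.tanh(W_h @ np.concatenate([h, z, v]))   # simple recurrence instead of an LSTM
f_c = lambda c, h: c + (W_c @ h).reshape(H, W)                   # additive canvas write
f_o = lambda c: 1.0 / (1.0 + np.exp(-c))                         # Bernoulli means for p(x | f_o(c_T))

probs = sequential_generate(T=8, K=K, canvas_shape=(H, W),
                            f_v=f_v, f_h=f_h, f_c=f_c, f_o=f_o, rng=rng)
print(probs.shape)  # (28, 28)
```

Training with the free energy above replaces the prior samples of zt by samples from the per-step posterior q(zt | z<t, x) and adds the corresponding KL terms, one per step, mirroring the single-step bound in equation (2).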