# Sequential Neural Processes

Gautam Singh (Rutgers University, singh.gautam@rutgers.edu) · Jaesik Yoon (SAP, jaesik.yoon01@sap.com) · Youngsung Son (ETRI, ysson@etri.re.kr) · Sungjin Ahn (Rutgers University, sungjin.ahn@rutgers.edu)

## Abstract

Neural Processes combine the strengths of neural networks and Gaussian processes to achieve both flexible learning and fast prediction in stochastic processes. However, a large class of problems involves an underlying temporal dependency structure across a sequence of stochastic processes that Neural Processes (NP) do not explicitly consider. In this paper, we propose Sequential Neural Processes (SNP), which incorporate a temporal state-transition model over stochastic processes and thus extend the modeling capabilities of NP to dynamic stochastic processes. In applying SNP to dynamic 3D scene modeling, we introduce the Temporal Generative Query Networks. To our knowledge, this is the first 4D model that can deal with the temporal dynamics of 3D scenes. In experiments, we evaluate the proposed methods on dynamic (non-stationary) regression and 4D scene inference and rendering.

## 1 Introduction

Neural networks consume all of their training data and computation in a costly training phase that engraves a single function into their weights. While this yields fast prediction with the learned function, under this rigid regime changing the target function means costly retraining of the network. This lack of flexibility is a major obstacle in tasks such as meta-learning and continual learning, where the function needs to change over time or on demand.

Gaussian processes (GP) do not suffer from this problem. Conditioned on observations, a GP performs inference on the target stochastic process directly. Consequently, Gaussian processes show the opposite properties to neural networks: they are flexible in making predictions because of their non-parametric nature, but this flexibility comes at the cost of slow prediction. GPs can also capture the uncertainty about the estimated function.

Neural Processes (NP) (Garnelo et al., 2018b) are a new class of methods that combine the strengths of both worlds. By adopting the meta-learning framework, Neural Processes learn to learn a stochastic process quickly from observations while experiencing multiple tasks of stochastic process modeling. Thus, in Neural Processes, unlike in typical neural networks, learning a function is fast and uncertainty-aware, while, unlike in Gaussian processes, prediction at test time remains efficient.

An important direction in which Neural Processes can be extended arises from the fact that, in many cases, temporal dynamics underlie a sequence of stochastic processes. This covers a broad range of problems, from training RL agents exposed to increasingly challenging tasks to modeling dynamic 3D scenes. For instance, Eslami et al. (2018) proposed a variant of Neural Processes, called the Generative Query Networks (GQN), to learn representation and rendering of 3D scenes. Although this was successful in modeling static scenes, such as fixed objects in a room, we argue that to handle more general cases, such as dynamic scenes where objects can move or interact over time, we need to explicitly incorporate a temporal transition model into Neural Processes.

In this paper, we introduce Sequential Neural Processes (SNP) to incorporate the temporal state-transition model into Neural Processes.
The proposed model extends the potential of Neural Processes from modeling a stochastic process to modeling a dynamically changing sequence of stochastic processes. That is, SNP can model a (sequential) stochastic process of stochastic processes. We also propose to apply SNP to dynamic 3D scene modeling by developing the Temporal Generative Query Networks (TGQN). In experiments, we show that TGQN outperforms GQN in capturing transition stochasticity, generation quality, and generalization to time horizons longer than those used during training. Our main contributions are:

- We introduce Sequential Neural Processes (SNP), a meta-transfer learning framework for a sequence of stochastic processes.
- We realize SNP for dynamic 3D scene inference by introducing Temporal Generative Query Networks (TGQN). To our knowledge, this is the first 4D generative model that models dynamic 3D scenes.
- We describe the training challenge of transition-collapse, unique to SNP modeling, and resolve it by introducing the posterior-dropout ELBO.
- We demonstrate the generalization capability of TGQN beyond the sequence lengths used during training. We also demonstrate meta-transfer learning and improved generation quality in contrast to Consistent Generative Query Networks (Kumar et al., 2018), gained from decoupling the temporal dynamics from the scene representations.

## 2 Background

In this section, we introduce notation and the foundational concepts that underlie the design of our proposed model, as well as its motivating applications.

**Neural Processes.** Neural Processes (NP) model a stochastic process mapping an input $x \in \mathbb{R}^{d_x}$ to a random variable $Y \in \mathbb{R}^{d_y}$. In particular, an NP is defined as a conditional latent variable model where a set of context observations $C = (X_C, Y_C) = \{(x_i, y_i)\}_{i \in I(C)}$ is given to model a conditional prior on the latent variable, $P(z \mid C)$, and the target observations $D = (X, Y) = \{(x_i, y_i)\}_{i \in I(D)}$ are modeled by the observation model $P(y_i \mid x_i, z)$. Here, $I(S)$ stands for the set of data-point indices in a dataset $S$. This generative process can be written as follows:

$$P(Y \mid X, C) = \int P(Y \mid X, z)\, P(z \mid C)\, dz \tag{1}$$

where $P(Y \mid X, z) = \prod_{i \in I(D)} P(y_i \mid x_i, z)$. The dataset $\{(C_i, D_i)\}_{i \in I_{\text{dataset}}}$ as a whole contains multiple pairs of context and target sets. Each such pair $(C, D)$ is associated with its own stochastic process from which its observations are drawn. Therefore, NP flexibly models multiple tasks, i.e., stochastic processes, and this results in a meta-learning framework.

It is sometimes useful to condition the observation model directly on the context $C$ as well, i.e., $P(y_i \mid x_i, s_C, z)$, where $s_C = f_s(C)$ with $f_s$ a deterministic context encoder invariant to the ordering of the contexts. A similar encoder is also used for the conditional prior, giving $P(z \mid C) = P(z \mid r_C)$ with $r_C = f_r(C)$. In this case, the observation model uses the context in two ways: a noisy latent path via $z$ and a deterministic path via $s_C$. The design principle underlying this modeling is to infer the target stochastic process from contexts in such a way that sampling $z$ from $P(z \mid C)$ corresponds to a function which is a realization of a stochastic process.

Because the true posterior is intractable, the model is trained via variational approximation, which gives the following evidence lower bound (ELBO) objective:

$$\log P_\theta(Y \mid X, C) \geq \mathbb{E}_{Q_\phi(z \mid C, D)}\left[\log P_\theta(Y \mid X, z)\right] - \mathrm{KL}\big(Q_\phi(z \mid C, D) \,\|\, P_\theta(z \mid C)\big) \tag{2}$$

The ELBO is optimized using the reparameterization trick (Kingma & Welling, 2013).
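To make the two context paths and the ELBO of Eq. (2) concrete, the following is a minimal PyTorch sketch of an NP-style model. The module names, layer sizes, and diagonal-Gaussian parameterizations here are our own illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.distributions as td

class SetEncoder(nn.Module):
    """Order-invariant context encoder: encode each (x, y) pair, then mean-pool."""
    def __init__(self, dx, dy, dh):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dx + dy, dh), nn.ReLU(), nn.Linear(dh, dh))

    def forward(self, x, y):  # x: [N, dx], y: [N, dy]
        return self.net(torch.cat([x, y], dim=-1)).mean(dim=0)  # [dh]

class NeuralProcess(nn.Module):
    def __init__(self, dx=1, dy=1, dh=128, dz=64):
        super().__init__()
        self.f_s = SetEncoder(dx, dy, dh)            # deterministic path s_C
        self.f_r = SetEncoder(dx, dy, dh)            # latent path r_C
        self.prior_head = nn.Linear(dh, 2 * dz)      # P(z | r_C)
        self.posterior_head = nn.Linear(dh, 2 * dz)  # Q(z | C, D)
        self.decoder = nn.Sequential(
            nn.Linear(dx + dh + dz, dh), nn.ReLU(), nn.Linear(dh, 2 * dy))

    def _gaussian(self, head, r):
        mu, logvar = head(r).chunk(2, dim=-1)
        return td.Normal(mu, torch.exp(0.5 * logvar))

    def elbo(self, xc, yc, x, y):
        s_c = self.f_s(xc, yc)
        prior = self._gaussian(self.prior_head, self.f_r(xc, yc))
        # The posterior encodes context and targets together (C and D).
        post = self._gaussian(self.posterior_head,
                              self.f_r(torch.cat([xc, x]), torch.cat([yc, y])))
        z = post.rsample()  # reparameterization trick
        n = x.size(0)
        inp = torch.cat([x, s_c.expand(n, -1), z.expand(n, -1)], dim=-1)
        mu_y, logvar_y = self.decoder(inp).chunk(2, dim=-1)
        log_lik = td.Normal(mu_y, torch.exp(0.5 * logvar_y)).log_prob(y).sum()
        kl = td.kl_divergence(post, prior).sum()
        return log_lik - kl  # Eq. (2); maximize this, or minimize its negative
```

Training would then draw a (context, target) split for each task and take gradient steps on `-model.elbo(xc, yc, x, y)`, so that each task's latent $z$ captures the task-specific function.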
**Generative Query Networks.** The Generative Query Network (GQN) can be seen as an application of Neural Processes specifically geared toward 3D scene inference and rendering. In GQN, the query $x$ corresponds to a camera viewpoint in a 3D space, and the output $y$ is an image taken from that camera viewpoint. Thus, the problem in GQN is cast as follows: given a context set of viewpoint-image pairs, (i) infer the representation of the full 3D space and then (ii) generate an observation image corresponding to a given query viewpoint.

In the original GQN, the prior is conditioned also on the query viewpoint in addition to the context, i.e., $P(z \mid x, r_C)$, and thus yields inconsistent samples across different viewpoints when modeling uncertainty in the scene. The Consistent GQN (CGQN) (Kumar et al., 2018) resolved this by removing the dependency on the query viewpoint from the prior. As a result, $z$ is a summary of the full 3D scene independent of the query viewpoint. Hence, it is consistent across viewpoints and more similar to the original Neural Processes. For the remainder of the paper, we use the abbreviation GQN for CGQN unless stated otherwise.

Inferring representations of 3D scenes requires a more complex model of the latents. For this, GQN uses ConvDRAW (Gregor et al., 2016), an auto-regressive density estimator that computes $P(z \mid C) = \prod_{l=1}^{L} P(z_l \mid z_{<l}, C)$.
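Below is a minimal sketch of such an auto-regressive latent prior. We stand in for ConvDRAW's convolutional recurrent core with a plain GRU cell; the layer count `L`, sizes, and module names are illustrative assumptions rather than the ConvDRAW architecture itself.

```python
import torch
import torch.nn as nn
import torch.distributions as td

class AutoregressiveLatentPrior(nn.Module):
    """P(z | C) = prod_{l=1}^{L} P(z_l | z_{<l}, C): each layer's Gaussian is
    conditioned on the context summary r_C and on the previously sampled
    layers, carried through a recurrent state."""
    def __init__(self, d_context, d_z, L=4, d_h=128):
        super().__init__()
        self.L, self.d_z, self.d_h = L, d_z, d_h
        self.rnn = nn.GRUCell(d_context + d_z, d_h)  # stand-in for ConvDRAW's conv recurrent core
        self.head = nn.Linear(d_h, 2 * d_z)

    def forward(self, r_c):  # r_c: [B, d_context]
        B = r_c.size(0)
        h = r_c.new_zeros(B, self.d_h)
        z = r_c.new_zeros(B, self.d_z)  # placeholder "z_0" before the first layer
        log_p, layers = 0.0, []
        for _ in range(self.L):
            h = self.rnn(torch.cat([r_c, z], dim=-1), h)  # summarizes (z_{<l}, C)
            mu, logvar = self.head(h).chunk(2, dim=-1)
            p_l = td.Normal(mu, torch.exp(0.5 * logvar))
            z = p_l.rsample()                             # z_l ~ P(z_l | z_{<l}, C)
            log_p = log_p + p_l.log_prob(z).sum(dim=-1)
            layers.append(z)
        return torch.cat(layers, dim=-1), log_p  # full scene latent and its log-density
```

The concatenated layers play the role of the scene summary $z$ that the renderer decodes for a query viewpoint; stacking the refinements auto-regressively lets the prior express richer, multimodal distributions than a single Gaussian.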