# Residual Neural Processes

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Byung-Jun Lee,¹ Seunghoon Hong,¹ Kee-Eung Kim¹,²

¹School of Computing, KAIST, Republic of Korea

²Graduate School of AI, KAIST, Republic of Korea

bjlee@ai.kaist.ac.kr, {seunghoon.hong, kekim}@kaist.ac.kr

## Abstract

A Neural Process (NP) is a map from a set of observed input-output pairs to a predictive distribution over functions, designed to mimic the inference mechanisms of other stochastic processes. NPs have been shown to work effectively in tasks that require complex distributions, where traditional stochastic processes struggle, e.g. image completion tasks. This paper concerns the practical capacity of set function approximators despite their universality. By delving deeper into the relationship between an NP and a Bayesian last layer (BLL), it is possible to see that NPs may struggle on simple examples that other stochastic processes can easily solve. In this paper, we propose a simple yet effective remedy: the Residual Neural Process (RNP), which leverages a traditional BLL for faster training and better prediction. We demonstrate that the RNP shows faster convergence and better performance, both qualitatively and quantitatively.

## Introduction

Inference with stochastic processes, such as Gaussian Processes (GPs), provides a powerful probabilistic learning framework. Despite its computational cost, it is still widely used because of unique strengths that usual function approximators lack. One important strength is that stochastic processes do not require a costly training phase for their parameters: after tuning a small set of hyper-parameters, a GP can be directly applied to any set of observations to infer the posterior distribution over functions. Neural Processes (NPs) (Garnelo et al. 2018a; 2018b) are a novel attempt to achieve this strength using a combination of neural network function approximators.
An NP is defined by two components: a permutation invariant encoder that processes query-observation pairs (called contexts) and yields an approximate posterior over function embeddings, and a decoder that takes the function embedding and a query point (called a target) as inputs and yields the prediction at that query point. Training is done by feeding the model with random contexts and targets from random functions and maximizing a lower bound of the log probability of the predictive distribution. Attentive Neural Processes (ANPs) (Kim et al. 2019) improve on NPs by adopting the attention mechanism to be more flexible and accurate, at a certain computational cost. After training, an NP makes predictions at a computational cost linear in the number of contexts (quadratic in the case of an ANP), which compares favorably with the cubic cost of a GP.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The success of NPs is partially due to a recently blooming line of research on set function approximators based on permutation invariant neural networks such as Deep Sets (Zaheer et al. 2017; Bloem-Reddy and Teh 2019). Nevertheless, given finite capacity, it has been reported that not every permutation invariant continuous function is reasonably well approximated. Wagstaff et al. (2019) argue that even if the function approximators before and after the sum pooling are flexible enough, the dimension of the embedding to be summed should be at least the cardinality of the input set in order to represent any permutation invariant function with neural networks. In other words, when the embedding dimension is not large enough to match the input cardinality, Deep Set based architectures can approximate only some of the permutation invariant functions (e.g. the averaging function).
In practice, the function approximators before and after the sum pooling are typically not flexible enough, and Deep Sets fail on some tasks even when the embedding dimension is larger than the input set cardinality (Murphy et al. 2019). These results imply the importance of building task-specific set function approximators, as illustrated by the large margin by which ANPs improve on NPs.

In this paper, we delve further into the structural similarity between ANPs and a traditional stochastic process, namely the Bayesian last layer (BLL) (Calandra et al. 2016; Weber et al. 2018; Harrison, Sharma, and Pavone 2018), as ANPs are designed to mimic the BLL's behavior efficiently. It turns out that the self-attention layers in an ANP are not expressive enough: simple cases where the underlying functions lie in the space spanned by a feature extractor, which can be modeled exactly with a BLL, might not be learned efficiently. This motivates us to extend the NP to explain the residual part of the prediction that the BLL cannot model, which we call Residual Neural Processes. This also allows us to improve the variety of function samples by borrowing the idea of kernel learning, i.e. learning the approximate posterior of feature functions with an implicit distribution. We show that this extension significantly improves convergence speed and asymptotic performance.

## Background

The task we handle throughout this paper is regression in a meta-learning setting, although the method is not limited to regression tasks. A regression problem can be defined as approximating a mapping $f$ from input variables $x \in \mathbb{R}^{d_x}$ to continuous output variables $y \in \mathbb{R}^{d_y}$, i.e. $y = f(x)$. Given finite training samples of input and output variables, the contexts $X_C, Y_C := \{x_c\}_{c \in C}, \{y_c\}_{c \in C}$, one should predict the target outputs $Y_T := \{y_t\}_{t \in T}$ from $X_T := \{x_t\}_{t \in T}$. With a slight abuse of notation, we will write $f(X_C) = \{f(x_c)\}_{c \in C}$ and $f(X_T) = \{f(x_t)\}_{t \in T}$.
Using a stochastic process, a prior distribution over function values $p_\theta(f(X_C), f(X_T) \mid X_C, X_T)$ is defined. With a likelihood $p_\theta(Y_C, Y_T \mid f(X_C), f(X_T))$, a posterior predictive distribution $p(Y_T \mid Y_C, X_T, X_C)$ can be computed. Meta-learning with a stochastic process is then the task of finding the best $\theta$ in the prior and the likelihood, given a predefined task generator, i.e. $(X_C, Y_C, X_T, Y_T) \sim G$. On the other hand, if we only have to learn how to map the input of the inference to its output, the training task can alternatively be defined by a black box that gives the conditional predictive distribution $p_\theta(Y_T \mid Y_C, X_T, X_C)$. Similarly, the best $\theta$ should be found given the task generator $G$. The subscript $\theta$ will be omitted afterward for the sake of brevity.

### Bayesian Last Layer (BLL)

A classical way of meta-learning with a stochastic process is the Bayesian last layer (BLL) method (Weber et al. 2018; Harrison, Sharma, and Pavone 2018), which fits the former way of defining the task. Using the fact that the last layers of many neural networks are (generalized) linear models, a straightforward way of building a flexible stochastic process is to place a Gaussian prior on the weights of the last layer. After the training phase with random functions (or after maximizing the marginal likelihood of a fixed data set), the previous layers are treated as a fixed feature function $\phi(x) \in \mathbb{R}^{d_h}$, and the closed-form predictive distribution can be inferred conditioned on an arbitrary context set, since the inference is equivalent to the Bayesian linear regression $y = w^\top \phi(x) + \epsilon$ (with $d_y = 1$ for simplicity), where $w$ denotes the weights of the last linear layer. It is also equivalent to a GP with a specific kernel choice, which is called a manifold GP (Calandra et al. 2016).
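The Bayesian linear regression view described above can be made concrete with a small NumPy sketch. It is an illustration only, not the authors' implementation: it assumes a unit Gaussian prior on $w$, the standard Bayesian linear regression posterior, and a hypothetical fixed feature map `features` (polynomial features chosen purely for the example); the names `features`, `bll_predict` and the value of `sigma2` are assumptions.

```python
import numpy as np

def features(x):
    """Hypothetical fixed feature map phi(x); returns an array of shape (d_h, n)."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x ** 2], axis=0)

def bll_predict(phi_c, y_c, phi_t, sigma2=1e-4):
    """Closed-form predictive of y = w^T phi(x) + eps with prior w ~ N(0, I).

    phi_c: (d_h, |C|) context features, y_c: (|C|,), phi_t: (d_h, |T|).
    """
    n_c = phi_c.shape[1]
    gram = sigma2 * np.eye(n_c) + phi_c.T @ phi_c          # (|C|, |C|)
    m_w = phi_c @ np.linalg.solve(gram, y_c)               # posterior mean of w
    s_w = np.eye(phi_c.shape[0]) - phi_c @ np.linalg.solve(gram, phi_c.T)  # posterior covariance of w
    mean = phi_t.T @ m_w
    cov = phi_t.T @ s_w @ phi_t + sigma2 * np.eye(phi_t.shape[1])
    return mean, cov

# A function that lies in the span of the features is recovered almost exactly.
x_c = np.linspace(-1.0, 1.0, 10)
y_c = 2.0 * x_c - 1.0
x_t = np.array([-0.5, 0.25, 0.9])
mean, cov = bll_predict(features(x_c), y_c, features(x_t))
```

Here the $|C| \times |C|$ system is solved directly; when $d_h < |C|$, the equivalent $d_h \times d_h$ form obtained through the Woodbury identity would be the cheaper choice.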
The posterior inference of the BLL can be specified as:

$$
\begin{aligned}
p(w) &= \mathcal{N}(w; 0, I), \qquad p(\epsilon) = \mathcal{N}(\epsilon; 0, \sigma_N^2), \\
p(w \mid X_C, Y_C) &= \mathcal{N}(w; m_w, S_w), \\
m_w &= \Phi_C (\sigma_N^2 I + \Phi_C^\top \Phi_C)^{-1} Y_C, \\
S_w &= I - \Phi_C (\sigma_N^2 I + \Phi_C^\top \Phi_C)^{-1} \Phi_C^\top, \\
p(Y_T \mid X_T, Y_C, X_C) &= \mathcal{N}(Y_T; m_w^\top \Phi_T, \Phi_T^\top S_w \Phi_T + \sigma_N^2 I),
\end{aligned}
\tag{1}
$$

where we denote the feature matrices of the context and the target input sets by $\Phi_C = [\phi(x_{c_1}), \phi(x_{c_2}), \ldots]_{c_1, c_2, \ldots \in C}$ and $\Phi_T = [\phi(x_{t_1}), \phi(x_{t_2}), \ldots]_{t_1, t_2, \ldots \in T}$. The main difference between the BLL and Bayesian linear regression is that the BLL trains the feature function $\phi(\cdot)$ as well, maximizing the model evidence $\mathbb{E}_{Y_C, X_C}[\log p(Y_C \mid X_C, \phi(\cdot))]$, whereas the features are usually fixed in linear regression. In a meta-learning context, we can instead maximize the log probability of the predictive distribution in Eq. (1), with random context and target sets sampled from the task generator $G$. Due to the matrix inversion, the computation of the predictive distribution takes $O(|C|^2 d_h + |C| d_h^2 + \min(d_h^3, |C|^3))$, where the minimum depends on whether we invert the $|C| \times |C|$ matrix or the $d_h \times d_h$ matrix obtained through the Woodbury matrix identity.

### Neural Processes (NPs)

Recently, Garnelo et al. (2018a) proposed a novel approach to the abovementioned problem: learning the whole inference procedure using neural networks. To mimic stochastic processes, two main characteristics of general probability distributions are encoded into the neural architecture for $p(Y_T \mid Y_C, X_T, X_C)$:

- Exchangeability: $p(x_1, x_2) = p(x_2, x_1)$
- Consistency: $p(x_1) = \int p(x_1, x_2) \, dx_2$

To achieve exchangeability, a permutation invariant neural network such as Deep Sets (Zaheer et al. 2017) is adopted, so that the model is invariant to the ordering of the contexts and the targets. The contexts are summarized into a finite-dimensional vector $r_C := r(X_C, Y_C) \in \mathbb{R}^{d_h}$. To make the summarizing function $r(\cdot)$ permutation invariant, each context $(x_c, y_c)$ is fed as input to a neural network $g(\cdot)$, and its outputs are aggregated by taking an average to form the context embedding $r_C$:

$$
r_C = \frac{1}{|C|} \sum_{c \in C} g(x_c, y_c).
$$
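The mean aggregation above is what makes the context embedding independent of the ordering of the contexts. The following NumPy sketch illustrates this under stated assumptions: the per-context network `g` is a hypothetical two-layer map with random, untrained weights `W1` and `W2`, and one-dimensional inputs and outputs are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 16))   # hypothetical, untrained weights of g
W2 = rng.normal(size=(16, 8))

def g(xy):
    """Per-context network: maps each (x_c, y_c) pair, shape (n, 2), to shape (n, 8)."""
    return np.tanh(xy @ W1) @ W2

def context_embedding(x_c, y_c):
    """r_C: average of g(x_c, y_c) over the context set."""
    return g(np.stack([x_c, y_c], axis=1)).mean(axis=0)

x_c = rng.uniform(-2.0, 2.0, size=7)
y_c = np.sin(x_c)
perm = rng.permutation(x_c.size)
r_c = context_embedding(x_c, y_c)
r_c_shuffled = context_embedding(x_c[perm], y_c[perm])   # same set, different order
```

Both calls return the same vector up to floating-point error, because averaging does not depend on the order of its arguments.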
In the deterministic version of NPs (Garnelo et al. 2018a), the conditional predictive distribution is directly modeled as $p(Y_T \mid X_T, X_C, Y_C) = p(Y_T \mid X_T, r_C)$. It is then modeled as a Gaussian factorized across the targets $(x_t, y_t)_{t \in T}$ for consistency:

$$
p(Y_T \mid X_T, r_C) = \prod_{t \in T} \mathcal{N}\big(y_t \mid \mu(x_t, r_C), \mathrm{diag}[\sigma(x_t, r_C)]^2\big),
\tag{2}
$$

where both $\mu(\cdot)$ and $\sigma(\cdot)$ are functions of $r_C$ and $x_t$ modeled by neural networks. The NP is then optimized by maximizing the log predictive probability of the targets given the contexts,

$$
\mathcal{L} = \mathbb{E}_{X_C, Y_C, X_T, Y_T}\big[\log p(Y_T \mid X_T, r_C)\big].
$$

Garnelo et al. (2018b) later proposed to extend the deterministic NP with latent variables to enable global sampling of functions. This introduces a stochastic context embedding $z \in \mathbb{R}^{d_h}$ that provides implicit stochasticity to the posterior over functions. Following the common choice for latent variable generative models in amortized variational inference, $z$ is modeled by a factorized Gaussian whose statistics (mean and variance, computed from $s_C$) are made permutation invariant using the function $r(\cdot)$:

$$
\begin{aligned}
p(Y_T \mid X_T, X_C, Y_C) &\approx \int p(Y_T \mid X_T, z) \, q(z \mid s_C) \, dz, \\
s_C &= \frac{1}{|C|} \sum_{c \in C} g(x_c, y_c), \\
q(z \mid s_C) &= \mathcal{N}\big(z \mid \mu(s_C), \mathrm{diag}[\sigma(s_C)]^2\big).
\end{aligned}
$$

The observation likelihood $p(Y_T \mid X_T, z)$ is parameterized as in Eq. (2), with $z$ replacing $r_C$. The model is learned by maximizing the following approximate ELBO,

$$
\begin{aligned}
\log p(Y_T \mid X_T, X_C, Y_C) &\geq \mathbb{E}_{q(z \mid s_{T \cup C})}\big[\log p(Y_T \mid X_T, z)\big] - D_{\mathrm{KL}}\big(q(z \mid s_{T \cup C}) \,\|\, p(z \mid s_C)\big) \\
&\approx \mathbb{E}_{q(z \mid s_{T \cup C})}\big[\log p(Y_T \mid X_T, z)\big] - D_{\mathrm{KL}}\big(q(z \mid s_{T \cup C}) \,\|\, q(z \mid s_C)\big),
\end{aligned}
$$

via the reparametrization trick (Kingma and Welling 2014). The main strength of NPs compared to other stochastic processes is their linear time complexity in the number of contexts when performing prediction, i.e. $O(|C| d_h^2)$.

### Attentive Neural Processes

It turns out that NPs tend to underfit severely, i.e. they are unable to predict accurately even at context points. Kim et al. (2019) hypothesize that this is due to the bottleneck of the fixed-length global summary $r_C$ (or $z$), whereas there are infinitely many potential targets $x_t$ to be predicted.
They draw inspiration from the locality of GPs, where the kernel forces the prediction $y_t$ to be close to $y_c$ when $x_t$ is close to $x_c$, and propose Attentive Neural Processes (ANPs), which implement this locality via the attention mechanism. In an ANP, the summary of the contexts $r_C$ is no longer global; instead there is a local summary $r_{t|C}$ for each target. This is made possible by the attention mechanism, which takes keys, queries, and values as its input. In this case, the attention mechanism computes similarities between the target input embeddings $\Phi_T$ (queries) and the context input embeddings $\Phi_C$ (keys), and aggregates the context embeddings $\{g(x_c, y_c)\}_{c \in C}$ (values) with these similarities to predict the embedding corresponding to the target. Denoting $N$ key-value pairs arranged as matrices by $K \in \mathbb{R}^{N \times d_h}$, $V \in \mathbb{R}^{N \times d_v}$, and $M$ queries by $Q \in \mathbb{R}^{M \times d_h}$, the (scaled) dot-product attention (Graves 2012), one of the most widely used attention mechanisms, can be written as:

$$
\mathrm{DotProduct}(Q, K, V) = \mathrm{softmax}\big(Q K^\top / \sqrt{d_k}\big) V,
$$

where $d_k$ is the key dimension (here $d_h$). Multi-head attention (Vaswani et al. 2017) is an extension that linearly transforms keys, values, and queries, and then applies dot-product attention in each head. It tends to yield smoother predictions than dot-product attention, since it forms an ensemble over multiple dot-product attentions:

$$
\mathrm{MultiHead}(Q, K, V) := \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H) W^O, \quad \text{where } \mathrm{head}_i := \mathrm{DotProduct}(Q W_i^Q, K W_i^K, V W_i^V).
$$

Table 1: Comparison of the conditional predictive distributions of a Bayesian Last Layer (top) and a single-headed Attentive Neural Process (bottom).

| | Context summary $R_C$ | Target-dependent summary $r_{t \mid C}$ | Predictive distribution $p(y_t \mid x_t, Y_C, X_C)$ |
|---|---|---|---|
| BLL | $(\sigma_N^2 I + \Phi_C^\top \Phi_C)^{-1} [Y_C, \Phi_C^\top]$ | $\phi(x_t)^\top \Phi_C R_C$ | Gaussian with mean and variance computed from $\phi(x_t)$ and $r_{t \mid C}$, as in Eq. (1) |
| ANP | $\mathrm{SA}([X_C, Y_C])$ | $\mathrm{softmax}\big(\phi(x_t)^\top \Phi_C / \sqrt{d_k}\big) R_C$ | $\mathcal{N}\big(\mu(r_{t \mid C}, x_t), [\sigma(r_{t \mid C}, x_t)]^2\big)$ |

Furthermore, ANPs adopt self-attention instead of the simple neural network $g(\cdot)$, i.e. an attention with $Q = K = V$, which is particularly effective for obtaining a richer representation by modeling interactions among the context points.
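As a small illustration of the attention mechanism described above, the following NumPy sketch implements single-head scaled dot-product attention. It is a generic sketch rather than the ANP code: the function names are chosen for this example, and the scale is taken to be the square root of the key dimension.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    """q: (M, d) queries, k: (N, d) keys, v: (N, d_v) values.

    Returns the (M, d_v) attended values and the (M, N) attention weights.
    """
    d_k = k.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
k = rng.normal(size=(4, 3))
v = rng.normal(size=(4, 5))
q = np.zeros((2, 3))            # queries equally similar to every key
out, weights = dot_product_attention(q, k, v)
```

With all-zero queries every key receives the same weight, so each output row reduces to the plain average of the values, i.e. the mean aggregation used by an NP.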
Higher-order interactions can also be modeled by simply stacking multiple alternating self-attention layers and dense layers, as in the transformer (Vaswani et al. 2017).

ANPs use both the deterministic path $r_{t|C}$ and the stochastic path $z$. An ANP computes one target-dependent summary of the contexts $r_{t|C}$ and one global summary of the contexts $s_C$ by

$$
\begin{aligned}
r_{t|C} &= \mathrm{MultiHead}\big(\phi(x_t), \Phi_C, \mathrm{SA}([X_C, Y_C])\big), \\
s_C &= \frac{1}{|C|} \sum_{c \in C} \big[\mathrm{SA}([X_C, Y_C])\big]_c,
\end{aligned}
\tag{3}
$$

where $\mathrm{SA}([X_C, Y_C]) \in \mathbb{R}^{|C| \times d_h}$ are trainable self-attention networks. Note that the feature extractor $\phi(x_c) \in \mathbb{R}^{d_h}$ is adopted to perform attention over diverse features of the inputs. To summarize, the conditional predictive distribution of an ANP is given by:

$$
\begin{aligned}
p(Y_T \mid X_T, X_C, Y_C) &\approx \int p(Y_T \mid X_T, z, r_{t|C}) \, q(z \mid s_C) \, dz, \\
p(Y_T \mid X_T, z, r_{t|C}) &= \prod_{t \in T} \mathcal{N}\big(y_t \mid \mu(x_t, r_{t|C}, z), \mathrm{diag}[\sigma(x_t, r_{t|C}, z)]^2\big), \\
q(z \mid s_C) &= \mathcal{N}\big(z \mid \mu(s_C), \mathrm{diag}[\sigma(s_C)]^2\big).
\end{aligned}
\tag{4}
$$

By adopting the attention mechanism, ANPs are shown to be less affected by the finite-dimension bottleneck, yielding much more accurate predictions than BLLs or NPs. However, using the self-attention network increases the prediction time complexity to $O(|C|^2 d_h + |C| d_h^2)$, which is asymptotically equivalent to that of BLLs.

## Residual Neural Processes

Despite the asymptotically equivalent time complexity, we find in practice that ANPs outperform BLLs with the same bottleneck width $d_h$, and the performance gap gets larger as the problem gets more complex. Where does the additional modeling power come from?

Figure 1: Model architectures of an Attentive Neural Process (top) and a Residual Neural Process (bottom).

By considering the most basic form of the ANP (deterministic path only, $d_y = 1$ and $H = 1$), it can easily be observed that ANPs and BLLs exhibit a structural similarity (see Table 1). Both summarize the contexts into a matrix $R_C$ of $|C|$ rows, and compute the similarities $\phi(x_t)^\top \Phi_C$ between target and context input embeddings to build target-dependent context embeddings $r_{t|C}$.
This embedding is then fed into a function along with the target input to yield the predictive mean and variance. One major difference is that the context summary of a BLL is linear in $Y_C$, whereas that of an ANP is not. Although, by the representer theorem, the conditional mean of a BLL can be the best loss minimizer among functions in the feature space, the capacity of the feature space becomes a bottleneck due to the cubic complexity $O(d_h^3)$ of the inversion operation. Consequently, having a conditional mean that is non-linear in $Y_C$ helps to increase performance without increasing $d_h$.

Nevertheless, given that Deep Set based architectures, including the self-attention network, have turned out not to be omnipotent (Murphy et al. 2019; Wagstaff et al. 2019), there are also drawbacks to learning a flexible nonlinear prediction with respect to $Y_C$, since the optimal summary $R_C$ can be very difficult to learn. Even in cases where the functions lie in the space spanned by the feature function, so that BLLs can predict perfectly, the self-attention network is required to learn how to solve a linear system. While the scaling $1/\sqrt{d_k}$ and the softmax transformation over the similarities help to fix the output scale of $R_C$ independently of $|C|$ or $d_h$, it can easily be shown empirically that the self-attention network cannot yield a reasonably approximate solution to an arbitrary linear system. We tried to train a self-attention network such that $\mathrm{SA}(\Phi_C) = \hat{\Phi}_C^{-1} \approx (\Phi_C^\top \Phi_C)^{-1} \Phi_C^\top$. It ended up with $\frac{1}{|C|^2} \| I - \hat{\Phi}_C^{-1} \Phi_C \|_F^2 \approx 0.6$ when $[\Phi_C]_{ij} \sim \mathcal{N}(0, 1)$, whereas an off-the-shelf pseudo-inverse library achieved an error of $10^{-12}$. Stacking multiple self-attention layers or adjusting the parameters $\{d_h, |C|\}$ did not help to reduce the error.

Motivated by the above, for a better summarization of the contexts $R_C$, it is natural to train the ANP to predict a nonlinear residual of the context summary with respect to $Y_C$. Instead of explicitly learning the residual as in ResNet (He et al.
2016), we compute the exact linear context summary, i.e. the BLL statistics, and additionally feed it to the conditional predictive distribution to make the network more flexible. The model is also expected to learn a good feature function faster, since the gradients through the exact linear path avoid backpropagation through the large number of parameters of the self-attention network. We hence name our model a Residual Neural Process (RNP), defined by

$$
\begin{aligned}
R_C &= \mathrm{SA}([X_C, Y_C]), \\
r^{nl}_{t|C} &= \mathrm{MultiHead}\big(\phi(x_t), \Phi_C, \mathrm{SA}([X_C, Y_C])\big), \\
r^{l}_{t|C} &= \begin{bmatrix} \phi(x_t) \\ \Phi_C (\sigma_N^2 I + \Phi_C^\top \Phi_C)^{-1} Y_C \\ \phi(x_t) - \Phi_C (\sigma_N^2 I + \Phi_C^\top \Phi_C)^{-1} \Phi_C^\top \phi(x_t) \end{bmatrix}, \\
p(Y_T \mid X_T, z, r^{nl}_{t|C}, r^{l}_{t|C}) &= \prod_{t \in T} \mathcal{N}\big(y_t; \mu(x_t, z, r^{nl}_{t|C}, r^{l}_{t|C}), \mathrm{diag}[\sigma(x_t, z, r^{nl}_{t|C}, r^{l}_{t|C})]^2\big),
\end{aligned}
\tag{5}
$$

where the $r^{l}_{t|C}$ term is carefully designed to contain the features used for the prediction of a BLL: we can recover Eq. (1) by learning the dot product between the first and second blocks (predictive mean), and between the first and third blocks (predictive variance). While the number of parameters does not increase significantly over the original ANPs, RNPs take a few more computation steps for prediction. However, since a BLL takes $O(|C|^2 d_h + |C| d_h^2 + \min(d_h^3, |C|^3))$, the asymptotic time complexity does not increase compared to an ANP in either of the cases $|C| > d_h$ or $d_h > |C|$.

### Stochastic Feature Extractor

A natural extension of a BLL for more accurate uncertainty prediction is to define a prior and infer a posterior not only on the last layer but on the feature function as well. The exact predictive distribution is then no longer tractable, but it has been shown that with approximate inference it is possible to express a far richer family of distributions over functions, e.g. Bayesian Neural Networks (Graves 2011; Kingma, Salimans, and Welling 2015). This has not been applied in a meta-learning context, as it is challenging to avoid costly training phases with the approximate inference algorithms proposed so far.
Inheriting the spirit of NPs, however, it is possible to approximately infer the posterior of the feature function in a meta-learning context. We introduce an additional stochastic context embedding $z_f$ for the feature extractor in the BLL, such that

$$
p(Y_T \mid X_T, X_C, Y_C) \approx \int p(Y_T \mid X_T, X_C, Y_C, \phi_{z_f}) \, q(z_f \mid s_C) \, dz_f,
$$

where $\phi_{z_f}(\cdot)$ is implemented with a neural network on a concatenated input, $\phi_{z_f}(\cdot) = \phi([\cdot, z_f])$, and $q(z_f \mid s_C)$ follows Eq. (3) and Eq. (4). Such an extension can also be applied to ANPs and RNPs.

| Model | Target NLL | MSE |
|---|---|---|
| NP | 1.030 (±0.028) | 0.557 (±0.020) |
| BLL | 0.068 (±0.011) | 0.344 (±0.012) |
| ANP | 0.106 (±0.044) | 0.316 (±0.005) |
| RNP (var) | 0.051 (±0.009) | 0.310 (±0.003) |
| RNP (full) | 0.021 (±0.010) | 0.309 (±0.004) |

Figure 2: Learning curves, converged results and example predictions of deterministic models. (top left) Wall-clock time vs. unseen target negative log-likelihood / mean squared error, averaged over 5 random seeds. 95% confidence bounds are shown as shades. (top right) Converged results after $2 \times 10^6$ iterations, reproduced in the table above. 95% confidence bounds are also reported. (bottom) Predictive mean and variance of different models given the same context.

A problem of having only a separate stochastic path is that the stochasticity induced by $z$ does not affect how the deterministic target-dependent context summary $r_{t|C}$ is computed. In cases where the correlations between inputs $x$ vary significantly across tasks (e.g. a variable length-scale), the difference between having a different context summarizing mechanism per task and having a fixed mechanism across tasks is largest. With both the stochastic path and the stochastic feature extractor, we optimize the modified objective

$$
\begin{aligned}
\mathcal{L} = {} & \mathbb{E}_{q(z_f \mid s_{T \cup C}), \, q(z \mid s_{T \cup C})}\big[\log p(Y_T \mid X_T, z, \phi_{z_f}, r^{nl}_{t|C}, r^{l}_{t|C})\big] \\
& - D_{\mathrm{KL}}\big(q(z \mid s_{T \cup C}) \,\|\, q(z \mid s_C)\big) - D_{\mathrm{KL}}\big(q(z_f \mid s_{T \cup C}) \,\|\, q(z_f \mid s_C)\big).
\end{aligned}
$$

## Experimental Results

Following the training method of an NP, we train on multiple realizations of the underlying data generating process.
We sample one function per batch and select random points to be the targets and the contexts. Note that we did not force the context points to be included among the target points, as was done in previous NP benchmarks (Garnelo et al. 2018b; Kim et al. 2019), since such a training method introduces an additional bias. The convergence speed is hence a little slower than previously reported. We used the architecture reported in (Kim et al. 2019) for the ANP structure, except that we used a feature extractor $\phi(x)$ of 2 hidden layers with $d_h$ units each ($d_h$ depends on the task) and with skip connections. The Adam optimizer with a learning rate of 5e-5 is used throughout all experiments. As a quantitative result we mainly report the unseen target NLL or an upper bound of it; note, however, that it is significantly affected by the lower bound on the predictive variance. We used a lower bound of $10^{-2}$ on the predictive variance, the value used in previous NP research. For BLLs, we used a lower bound of $10^{-4}$ on the observation variance, but augmented the predictive distribution to have a $10^{-2}$ lower bound on the predictive variance for a fair comparison of NLL. For the stochastic feature extractor, we used $z_f \in \mathbb{R}^5$ for all experiments.¹

### 1D Function Regression with Deterministic NPs

First, we demonstrate the idea of the RNP with models that have a deterministic path only. The training functions are generated from a Gaussian Process with a squared exponential kernel and small likelihood noise, with fixed hyper-parameters. The number of contexts and the number of targets are chosen randomly ($|C|, |T| \sim U[3, 100]$). Both $X_C$ and $X_T$ are drawn uniformly from $[-20, 20]$. In this experiment, we used $d_h = 150$. This is just an illustrative example, and there is no need to use a known stochastic process (e.g. a GP) for training NPs.
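A task generator of the kind described above can be sketched in NumPy as follows. The kernel hyper-parameters are not stated in the text, so `length_scale`, `sigma_f` and `noise_std` below are placeholder assumptions, and `sample_task` is a hypothetical name; the sketch draws contexts and targets as disjoint points, in line with the remark that contexts are not forced into the target set.

```python
import numpy as np

def sample_task(rng, length_scale=1.0, sigma_f=1.0, noise_std=0.02, x_range=20.0):
    """Draw one function from a GP with a squared exponential kernel and split
    its evaluations into context and target sets.

    The hyper-parameter defaults are assumptions made for illustration only.
    """
    n_c = int(rng.integers(3, 101))     # |C| ~ U[3, 100] (upper bound exclusive)
    n_t = int(rng.integers(3, 101))     # |T| ~ U[3, 100]
    x = rng.uniform(-x_range, x_range, size=n_c + n_t)
    sq_dist = (x[:, None] - x[None, :]) ** 2
    cov = sigma_f ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)
    cov += noise_std ** 2 * np.eye(x.size)          # small likelihood noise
    y = np.linalg.cholesky(cov) @ rng.standard_normal(x.size)
    return x[:n_c], y[:n_c], x[n_c:], y[n_c:]

rng = np.random.default_rng(0)
x_c, y_c, x_t, y_t = sample_task(rng)
```

Each call yields a fresh function, so a training loop would call `sample_task` once per batch.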
In Figure 2 we show the running average of the unseen target negative log-likelihood (NLL) and mean squared error (MSE). Deterministic NP, deterministic ANP, and BLL methods are included in the plot as baselines, and two different deterministic RNP models are demonstrated. The one denoted Residual NP (var) uses

$$
r^{var}_{t|C} = \begin{bmatrix} \phi(x_t) \\ \phi(x_t) - \Phi_C (\sigma_N^2 I + \Phi_C^\top \Phi_C)^{-1} \Phi_C^\top \phi(x_t) \end{bmatrix}
$$

instead of the full $r^{l}_{t|C}$ from Eq. (5).

¹ Code used for the experiments can be found at: https://github.com/dlqudwns/Residual-Neural-Process

| Model | Target NLL |
|---|---|
| BLL | 0.212 (±0.024) |
| ANP (stochastic path) | 0.079 (±0.026) |
| ANP (stochastic feature) | 0.082 (±0.024) |
| ANP (both) | 0.059 (±0.025) |
| RNP (stochastic path) | 0.048 (±0.025) |
| RNP (stochastic feature) | 0.047 (±0.025) |
| RNP (both) | 0.038 (±0.023) |

Figure 3: Learning curves, converged results, and example predictions of stochastic models. (top left) Iterations / wall-clock time vs. an upper bound of the unseen target negative log-likelihood, averaged over 10 random seeds. 95% confidence bounds are shown as shades. (top right) Converged results after $1 \times 10^6$ iterations, reproduced in the table above. 95% confidence bounds are also reported. (bottom) 20 samples of the predictive mean and variance of different models given the same context.

It can be observed in the left panel of Figure 2 that the RNPs (full) outperform the ANPs and the BLLs, showing a much more rapid decrease and a better convergence point in terms of both NLL and MSE. The two learning curves are plotted against wall-clock time and show that the improvement in performance persists even with the increased running time of the RNPs. The RNPs with variance statistics only, RNPs (var), improve on the ANPs but do not show as dramatic an improvement as the RNPs (full). Even though the performance of the BLLs alone is not always better than that of the ANPs, the results show that adopting BLL statistics and letting the attention-based context summary predict the residuals does lead to better prediction overall.
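The two variants of linear statistics compared above can be sketched in NumPy as follows. This is an illustration under an explicit assumption about the layout of $r^{l}_{t|C}$: the full statistics are taken to stack $\phi(x_t)$, the BLL posterior mean weights, and $\phi(x_t)$ minus its regularized projection onto the context features, while the variance-only version drops the middle block. The function name `linear_statistics` and the random feature matrices are hypothetical.

```python
import numpy as np

def linear_statistics(phi_c, y_c, phi_t, sigma2=1e-2, full=True):
    """BLL statistics per target, returned with shape (|T|, 3*d_h) or (|T|, 2*d_h).

    phi_c: (d_h, |C|), y_c: (|C|,), phi_t: (d_h, |T|).
    """
    n_c = phi_c.shape[1]
    gram = sigma2 * np.eye(n_c) + phi_c.T @ phi_c
    m_w = phi_c @ np.linalg.solve(gram, y_c)                          # posterior mean weights
    s_phi = phi_t - phi_c @ np.linalg.solve(gram, phi_c.T @ phi_t)    # S_w phi(x_t), column per target
    blocks = [phi_t, s_phi]
    if full:
        blocks.insert(1, np.tile(m_w[:, None], (1, phi_t.shape[1])))
    return np.concatenate(blocks, axis=0).T

rng = np.random.default_rng(0)
d_h, n_c, n_t = 4, 6, 5
phi_c = rng.normal(size=(d_h, n_c))
phi_t = rng.normal(size=(d_h, n_t))
y_c = rng.normal(size=n_c)

r_l = linear_statistics(phi_c, y_c, phi_t)                 # full statistics
r_var = linear_statistics(phi_c, y_c, phi_t, full=False)   # variance statistics only
first, second, third = r_l[:, :d_h], r_l[:, d_h:2 * d_h], r_l[:, 2 * d_h:]
bll_mean = np.sum(first * second, axis=1)   # dot product of first and second blocks
bll_var = np.sum(first * third, axis=1)     # dot product of first and third blocks (noise excluded)
```

The two dot products at the end reproduce the BLL predictive mean and the noise-free predictive variance, which is the sense in which a decoder fed with these statistics can recover the BLL prediction.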
The performance at convergence is shown on the right; to reach these results the RNPs required 15% more time than the ANPs. At the bottom of Figure 2, the predictions of the models are shown. With the help of the BLL statistics, the RNP seems to alleviate the discontinuities in the mean prediction that occur in the ANP plot due to the attention mechanism.

### 1D Function Regression with Stochastic NPs

We then demonstrate the idea of incorporating stochasticity in the feature extractor. Unlike in the previous experiment, the kernel hyper-parameters are also sampled randomly, $l \sim U[0.1, 0.6]$, $\sigma_f \sim U[0.1, 1]$. In the experiments with stochastic models, we resort to evaluating approximate lower bounds:

$$
\begin{aligned}
\log p(y_t \mid x_t, X_C, Y_C) &\geq \mathbb{E}_{q(z_i \mid X_C, Y_C)}\left[\log \frac{1}{K} \sum_{i=1}^{K} \frac{p(y_t \mid x_t, z_i) \, p(z_i \mid X_C, Y_C)}{q(z_i \mid X_C, Y_C)}\right] \\
&\approx \mathbb{E}_{q(z_i \mid X_C, Y_C)}\left[\log \frac{1}{K} \sum_{i=1}^{K} p(y_t \mid x_t, z_i)\right],
\end{aligned}
$$

as suggested in (Burda, Grosse, and Salakhutdinov 2016; Le et al. 2018). We used $K = 50$ and report a running average over 2000 estimates.

The results are shown in Figure 3. $d_h = 150$ is used in this experiment. We compared the ANPs and the RNPs with different kinds of stochasticity: the stochastic path model is the one used in the ANPs, where the decoder is fed with $z$. The stochastic feature model is the one we suggest, where the feature extractor is fed with the latent variable $z_f$. The two methods can also be used at the same time, which is denoted as both. The BLL method here is implemented with the stochastic feature and is shown as a baseline. It can be observed that the performance of the BLL is relatively worse: apparently, the inclusion of the stochastic feature slows down the learning of the feature extractor. Nevertheless, the RNPs benefit from the BLL part of the model and show increased learning speed over the ANPs. Having either the stochastic path or the stochastic feature alone gives similar performance, whereas having both gives a better NLL. On the right of the figure, we pick 4 models and present their predictive distributions. Unlike the ANPs, the RNPs did not miss any of the contexts. With both the stochastic path and the stochastic feature, the RNPs show smoother and more stable predictions than with the stochastic path only, which we suspect is due to the variable attention that can learn varying length-scales through $z_f$.

### 2D Function Regression on Image Data

If we treat an image as a function that maps a pixel location $x \in \mathbb{R}^2$ to its pixel intensity $y$ ($y \in \mathbb{R}^3$ for RGB, $y \in \mathbb{R}$ for gray-scale), the whole dataset becomes a group of functions sampled from some stochastic process. We trained and compared BLL, ANP, and RNP models on MNIST (LeCun et al. 1998) and sub-sampled $32 \times 32$ CelebA (Liu et al. 2015). We used random sizes of contexts and targets ($|C|, |T| \sim U[3, 200]$). The inputs $x$ and outputs $y$ are re-scaled to $[-1, 1]$ and $[-0.5, 0.5]$ respectively. $d_h = 250$ is used in this experiment.

| Model | MNIST NLL | CelebA NLL |
|---|---|---|
| BLL | -0.985 (±0.006) | -2.776 (±0.028) |
| ANP | -1.106 (±0.006) | -3.252 (±0.036) |
| RNP | -1.108 (±0.004) | -3.289 (±0.017) |
| RNP (both) | -1.135 (±0.006) | -3.277 (±0.013) |

Figure 4: Learning curves, converged results and example predictions on image completion tasks. (top left) Iterations vs. an upper bound of the unseen target negative log-likelihood, averaged over 5 random seeds. 95% confidence bounds are shown as shades. (top right) Converged results after $2 \times 10^6$ iterations, reproduced in the table above. 95% confidence bounds are also reported. (bottom) Given four different contexts (10 points, 30 points, 100 points and upper half), 10 samples of the predictive mean of each process are shown. Predicted variances are not presented in this figure.

The quantitative results, reporting the upper bound of the target NLL, are shown in Figure 4. $K = 50$ samples are used to evaluate the bound, and a running average over 2000 bounds is reported. Compared to the previous experiments, the performance gap between the BLL and the other methods is the largest, meaning that the nonlinear mapping from $Y_C$ is crucial in image completion tasks.
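Treating an image as a function from pixel locations to intensities, as described in this section, can be sketched in NumPy as follows. The sketch assumes 8-bit images, and the function name `image_to_task` as well as the way context and target pixels are drawn (disjoint random pixels) are assumptions made for this example.

```python
import numpy as np

def image_to_task(image, n_context, n_target, rng):
    """Turn an 8-bit image of shape (H, W) or (H, W, 3) into a regression task.

    Inputs x are pixel coordinates re-scaled to [-1, 1]; outputs y are pixel
    intensities re-scaled to [-0.5, 0.5].
    """
    h, w = image.shape[:2]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    x = 2.0 * coords / np.array([h - 1, w - 1]) - 1.0
    y = image.reshape(h * w, -1) / 255.0 - 0.5
    idx = rng.permutation(h * w)[: n_context + n_target]
    c, t = idx[:n_context], idx[n_context:]
    return x[c], y[c], x[t], y[t]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)   # stand-in for an MNIST digit
x_c, y_c, x_t, y_t = image_to_task(image, n_context=30, n_target=50, rng=rng)
```

For an RGB image of shape `(32, 32, 3)` the same function returns three-dimensional outputs per pixel, matching the $y \in \mathbb{R}^3$ case.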
While the RNPs with both kinds of stochasticity show improved results in both domains, the improvements come from different sources: MNIST benefits from the additional stochasticity, while leveraging BLL features helps in CelebA. To reach the converged results, the RNPs (both) required 50% more wall-clock time than the ANPs. The bottom figure shows the predictive means for ten different samples of $z$ and $z_f$ given different contexts. The RNPs tend to show more diverse and realistic samples than the ANPs, especially when only a few context points were provided.

## Conclusion

In this paper, we presented the RNP, which leverages a BLL for efficient residual learning. A stochastic feature extractor is also presented to approximately infer a posterior of the feature extractor, complementing the stochasticity that was previously independent of the context summary. As a result, the RNPs with both types of stochasticity improve on the ANPs, both quantitatively (convergence speed and asymptotic performance in target NLL) and qualitatively (sample quality and diversity).

One direction for future work is to investigate other architectures of set function approximators that can better represent inference procedures. While we explicitly included BLL statistics in the decoder input in this work, a self-attention network that inherently captures them would be more desirable for both approximation accuracy and model simplicity.

## Acknowledgments

This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019M3F2A1072238 and NRF-2019R1A2C1087634), the Information Technology Research Center (ITRC) program of Korea (IITP-20192016-0-00464), and the Ministry of Science and Information communication Technology (MSIT) of Korea (IITP No. 2019-0-00075-001 and IITP No. 2017-0-01779 XAI).

## References

Bloem-Reddy, B., and Teh, Y. W. 2019. Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082. To appear in Journal of Machine Learning Research.

Burda, Y.; Grosse, R.; and Salakhutdinov, R. 2016. Importance weighted autoencoders. In International Conference on Learning Representations.

Calandra, R.; Peters, J.; Rasmussen, C. E.; and Deisenroth, M. P. 2016. Manifold Gaussian processes for regression. In International Joint Conference on Neural Networks.

Garnelo, M.; Rosenbaum, D.; Maddison, C.; Ramalho, T.; Saxton, D.; Shanahan, M.; Teh, Y. W.; Rezende, D.; and Eslami, S. A. 2018a. Conditional neural processes. In International Conference on Machine Learning.

Garnelo, M.; Schwarz, J.; Rosenbaum, D.; Viola, F.; Rezende, D. J.; Eslami, S.; and Teh, Y. W. 2018b. Neural processes. In ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models.

Graves, A. 2011. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems.

Graves, A. 2012. Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks. Springer. 5–13.

Harrison, J.; Sharma, A.; and Pavone, M. 2018. Meta-learning priors for efficient online Bayesian regression. In Workshop on the Algorithmic Foundations of Robotics.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Kim, H.; Mnih, A.; Schwarz, J.; Garnelo, M.; Eslami, A.; Rosenbaum, D.; Vinyals, O.; and Teh, Y. W. 2019. Attentive neural processes. In International Conference on Learning Representations.

Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.

Kingma, D. P.; Salimans, T.; and Welling, M. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems.

Le, T. A.; Kim, H.; Garnelo, M.; Rosenbaum, D.; Schwarz, J.; and Teh, Y. W. 2018. Empirical evaluation of neural process objectives. In NeurIPS Workshop on Bayesian Deep Learning.

LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE.

Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision.

Murphy, R. L.; Srinivasan, B.; Rao, V.; and Ribeiro, B. 2019. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In International Conference on Learning Representations.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Wagstaff, E.; Fuchs, F. B.; Engelcke, M.; Posner, I.; and Osborne, M. 2019. On the limitations of representing functions on sets. In International Conference on Machine Learning.

Weber, N.; Starc, J.; Mittal, A.; Blanco, R.; and Màrquez, L. 2018. Optimizing over a Bayesian last layer. In NeurIPS Workshop on Bayesian Deep Learning.

Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R. R.; and Smola, A. J. 2017. Deep sets. In Advances in Neural Information Processing Systems.