Robustifying Sequential Neural Processes

Jaesik Yoon 1  Gautam Singh 2  Sungjin Ahn 2 3

When tasks change over time, meta-transfer learning seeks to improve the efficiency of learning a new task via both meta-learning and transfer learning. While standard attention has been effective in a variety of settings, we question its effectiveness for meta-transfer learning, since the tasks being learned are dynamic and the amount of context can be substantially smaller. In this paper, using a recently proposed meta-transfer learning model, Sequential Neural Processes (SNP), we first empirically show that it suffers from an underfitting problem similar to that observed in the functions inferred by Neural Processes. However, we further demonstrate that, unlike in the meta-learning setting, standard attention mechanisms are not effective in the meta-transfer setting. To resolve this, we propose a new attention mechanism, Recurrent Memory Reconstruction (RMR), and demonstrate that providing an imaginary context that is recurrently updated and reconstructed with interaction is crucial in achieving effective attention for meta-transfer learning. Furthermore, incorporating RMR into SNP, we propose Attentive Sequential Neural Processes RMR (ASNP-RMR) and demonstrate in various tasks that ASNP-RMR significantly outperforms the baselines.

1. Introduction

A central challenge in machine learning is to improve learning efficiency. Among the approaches to this end are meta-learning (Schmidhuber, 1987; Bengio et al., 1990) and transfer learning (Pratt, 1993; Pan & Yang, 2009). Meta-learning aims to learn the learning process itself and thus enables efficient learning (e.g., from a small amount of data), while

1 SAP  2 Department of Computer Science, Rutgers University  3 Rutgers Center for Cognitive Science. Correspondence to: Jaesik Yoon, Sungjin Ahn.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.
Copyright 2020 by the author(s).

transfer learning allows an efficient warm start of a new task by transferring knowledge from previously learned tasks. In many scenarios, these two problems are not separate but appear in combination. For example, to build a customer preference model each month, we would like to build it by transferring knowledge from the models of the previous months instead of starting from scratch, because the general preferences of a customer do not change much across months. However, due to the monthly preference shift, e.g., caused by a seasonal change, we also need to learn efficiently from the new observations of the new month.

Sequential Neural Processes (SNP) (Singh et al., 2019) is a new probabilistic model class that addresses the above meta-transfer learning problem. In SNP, meta-transfer learning is modeled as a stochastic process that changes with time (thus, a stochastic process of stochastic processes). At each time-step, the stochastic process, or equivalently a task, is meta-learned from the context. SNP represents each task by a latent state of the Neural Process (NP) and models the temporal dynamics of such latent states using a recurrent state-space model (Hafner et al., 2018). This enables SNP to transfer the knowledge of the previous tasks when learning a new task.

It is well known that the Neural Process (NP) suffers from an underfitting problem because all context observations are encoded with limited expressiveness (e.g., by sum or mean encoding) to satisfy the order-invariance property (Wagstaff et al., 2019). Kim et al. (2019) observe that query-dependent attention can substantially mitigate this problem and propose Attentive Neural Processes (ANP). Therefore, a crucial next question is whether SNP, which is partly based on NP for task-level meta-learning but is also equipped with temporal transfer, also suffers from underfitting, and if so, how we can resolve the problem.
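To make the conditioning structure of SNP's temporal transfer concrete, the sketch below samples a task latent z_t given the previous latent z_{t-1} and the current context C_t. This is only an illustration of the dependency structure: the transition matrix A, context encoder B, and fixed noise scale are hypothetical stand-ins, whereas the actual SNP learns a recurrent state-space model end to end.

```python
import numpy as np

rng = np.random.default_rng(1)
d_z, dx, dy = 4, 1, 1
A = rng.normal(size=(d_z, d_z)) * 0.3       # stand-in transition weights (learned in SNP)
B = rng.normal(size=(d_z, dx + dy)) * 0.3   # stand-in context encoder weights

def snp_transition(z_prev, context):
    """Sample z_t ~ p(z_t | z_{t-1}, C_t): the task latent is carried
    forward from the previous task-step (transfer) and corrected by
    whatever context the new task provides (meta-learning)."""
    # Permutation-invariant mean encoding of the current context C_t.
    enc = (np.mean([B @ np.concatenate([x, y]) for x, y in context], axis=0)
           if context else np.zeros(d_z))
    mu = np.tanh(A @ z_prev + enc)
    return mu + 0.1 * rng.normal(size=d_z)  # stochastic task representation

z = np.zeros(d_z)
z = snp_transition(z, [(np.array([0.2]), np.array([0.7]))])  # task 1: some context
z = snp_transition(z, [])  # task 2: empty context, pure temporal transfer
```

Note that the second step receives an empty context and still yields a task representation, which is exactly the regime where temporal transfer must do the work.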
In this paper, we argue not only that there is underfitting in SNP but also that it affects robustness more severely than in NP. We observe that this is because of two novel problems, sparse context and obsolete context, that occur in the novel setting of meta-transfer learning. In Singh et al. (2019), it is shown that, in comparison to meta-learning, meta-transfer learning is expected to learn a task more efficiently by using a much smaller amount of context, or even an empty context, due to the availability of temporal transfer. However, this sparsity becomes an issue because, when the context is small or empty, attention, which is a remedy for underfitting, becomes highly ineffective or even inapplicable. In the case of sparse context, it is possible to also include the past contexts for attention. However, this raises the second problem, the problem of obsolete context, because the past contexts come from tasks that are different from the current task. Thus, we argue that if the past contexts are not properly transformed for the current task, attention on them may hurt the performance. An illustration is shown in Fig. 1.

Figure 1: Task shift. A context observation c_1 = (x_1, y_1) is provided for a task T_1 (black line). Then, the task is changed to T_2 (blue line). After the task is shifted to T_2, the value f(x_2) is queried, whose true value is y_2. The standard attention will use the obsolete context c_1 = (x_1, y_1) to infer f(x_2), and a high attention weight will be given to it because x_1 and x_2 are close. As a result, the attention will suggest a high value for f(x_2) even though the true value y_2 is small. Our proposed model reconstructs c_1 so that its value can be properly adapted to the new task (T_2).

To this end, we propose a novel attention mechanism for meta-transfer learning, called Recurrent Memory Reconstruction and Attention (RMRA).
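The idea just introduced can be sketched loosely as follows: a memory vector is recurrently updated with each task's context (possibly empty) and then decoded into a small set of imaginary context pairs for the current task. Everything below — the weight matrices, the update rule, and the number M of imagined pairs — is a hypothetical stand-in, not the model's actual parameterization, which this chunk of the paper does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, M, dx, dy = 8, 4, 1, 1

# Hypothetical fixed parameters; in the real model these would be learned.
W_h = rng.normal(size=(d_h, d_h)) * 0.1            # recurrent update weights
W_c = rng.normal(size=(d_h, dx + dy)) * 0.1        # context encoder weights
W_out = rng.normal(size=(M * (dx + dy), d_h)) * 0.1  # reconstruction readout

def rmr_step(h, context):
    """One task-step of a recurrent-memory-reconstruction sketch.

    h: (d_h,) memory summarizing all past contexts. `context` is a list of
    (x, y) pairs for the current task (possibly empty -- the sparse case).
    Returns the updated memory and M reconstructed imaginary context
    pairs adapted to the current task-step."""
    if context:  # encode the observed context, order-invariantly
        enc = np.mean([W_c @ np.concatenate([x, y]) for x, y in context], axis=0)
    else:        # empty context: the memory alone drives the reconstruction
        enc = np.zeros(d_h)
    h = np.tanh(W_h @ h + enc)                    # recurrent memory update
    imagined = (W_out @ h).reshape(M, dx + dy)    # reconstruct imaginary pairs
    return h, [(p[:dx], p[dx:]) for p in imagined]

h = np.zeros(d_h)
h, imag = rmr_step(h, [(np.array([0.5]), np.array([1.0]))])
h, imag = rmr_step(h, [])  # later task with empty context: attention can
                           # still operate on the imagined pairs
```

The point of the sketch is structural: the imagined pairs are regenerated at every task-step from the updated memory, so attention never reads a stale (obsolete) observation directly.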
In RMRA, to resolve the sparsity problem, we augment each task's original context with a generated imaginary context. Thus, even if a task provides a small or empty context, we can still apply attention effectively on the generated imaginary context. To resolve the obsolete context problem, we generate this imaginary context using a novel Recurrent Memory Reconstruction (RMR) mechanism. RMR temporally encodes all past context observations using recurrent updates and then reconstructs a reformed imaginary context for each task. In this way, the past contexts are properly transformed into a representation useful for the current task. In addition, we do not need to limit the attention to a finite-size window but can use the entire past without storing it explicitly. By augmenting SNP with RMRA, we propose a novel and robust SNP model, called Attentive SNP-RMR.

Our main contributions are: (i) we identify why SNP should also suffer from underfitting and empirically show that this is indeed the case; (ii) we explain why the existing attention mechanisms for NP are sub-optimal in the meta-transfer learning setting due to sparse and obsolete context, and provide an empirical analysis; (iii) we propose a novel Recurrent Memory Reconstruction and Attention (RMRA) mechanism to resolve the problem, combine RMRA with SNP, and thereby propose a novel model, Attentive SNP-RMR; and (iv) in our experiments, we empirically show that the RMRA mechanism resolves underfitting efficiently and effectively and yields superior performance in comparison to the baselines in various meta-transfer learning settings.

2. Background

Neural Process (NP) (Garnelo et al., 2018) learns to learn a task τ that maps an input x ∈ ℝ^{d_x} to an output y ∈ ℝ^{d_y}, given a context dataset C = (X_C, Y_C) = {(x^(n), y^(n))}_{n ∈ [N_C]}. Here, N_C is the number of data points in C and [N_C] := {1, . . . , N_C}.
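To make this notation concrete: a context set C is a bag of (x, y) pairs, and NP encodes it with an order-invariant aggregation such as a sum of per-point features. The tiny one-layer encoder below is a stand-in with fixed random weights (in NP it would be a learned MLP); the sketch only demonstrates the permutation-invariance property.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy, d_r = 1, 1, 8
W1 = rng.normal(size=(d_r, dx + dy))  # stand-in encoder weights (would be learned)

def encode(context):
    """Permutation-invariant sum encoding: sum_n MLP(x^(n), y^(n))."""
    return sum(np.tanh(W1 @ np.concatenate([x, y])) for x, y in context)

C = [(np.array([0.1]), np.array([0.5])),
     (np.array([0.9]), np.array([-0.2]))]
r1 = encode(C)
r2 = encode(C[::-1])  # reversed order gives the same encoding
```

Because the aggregation is a plain sum, any reordering of the context points yields the identical vector, which is exactly the order-invariance requirement and also the source of the limited expressiveness discussed next.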
To learn a task distribution from this context, NP uses a distribution P(z|C) to sample a task representation z. This makes NP a probabilistic meta-learning framework. Next, an observation model P(y|x, z) takes an input x and returns an output y. The generative process for NP conditioned on the context is given by:

P(Y, z | X, C) = P(Y | X, z) P(z | C)    (1)

where P(Y | X, z) = ∏_{n ∈ [N_D]} P(y^(n) | x^(n), z) and D = (X, Y) = {(x^(n), y^(n))}_{n ∈ [N_D]} is the target dataset. To obtain the training data for this meta-learning setting, we draw multiple tasks from the true task distribution and sample (C, D) for each task. Note that to implement NP, C is encoded via a permutation-invariant function, such as ∑_n MLP(x^(n), y^(n)). Kim et al. (2019) argue that such a sum-aggregation produces an encoding that is not expressive enough and consequently hurts the observation model P(Y | X, z). This is a key limitation of NP and is addressed by Attentive Neural Processes (ANP).

Attentive Neural Process (ANP) Kim et al. (2019) identify the problem of underfitting in NP: tasks learned from the context set fail to accurately predict the target points, including those in the context set itself, and the learned task distribution shows high uncertainty. The authors show that using a larger latent z is not sufficient either. To address this, ANP combines the observation model P(y|x, z) with attention on the context points. This allows the model to take a query x and attend to the most relevant data points to predict the target output y. To achieve this, an attention function Attend(C; x_q) is implemented, which takes the context C and a query x_q and returns a read value r_{x_q}:

r_{x_q} = Attend(C; x_q) = ∑_{n ∈ [N_C]} w_n y^(n)    (2)

Here, for each y^(n) ∈ Y_C, there is an associated weight w_n computed using a similarity function sim(X_C, x_q). Using r_x, the observation model becomes P(y | x, z, r_x), resulting in the following generative process.
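The attention read of Eq. (2) can be sketched in a few lines. The softmax over negative squared distances (with a temperature) is just one simple choice of similarity function, used here for illustration; ANP itself admits a variety of attention mechanisms.

```python
import numpy as np

def attend(xc, yc, xq, tau=0.1):
    """Query-dependent attention read over a context set (Eq. 2).

    xc: (N, dx) context inputs, yc: (N, dy) context outputs,
    xq: (dx,) query. The weights w_n come from a softmax over negative
    squared distances scaled by temperature tau (an illustrative sim())."""
    sim = -np.sum((xc - xq) ** 2, axis=1) / tau  # (N,) similarities
    w = np.exp(sim - sim.max())
    w = w / w.sum()                              # softmax weights, sum to 1
    return w @ yc                                # r_xq = sum_n w_n y^(n)

# A query near x = 1.0 reads almost exactly the output at that context point.
xc = np.array([[0.0], [1.0], [5.0]])
yc = np.array([[0.0], [10.0], [50.0]])
r = attend(xc, yc, np.array([1.0]))
```

This is precisely the mechanism that misfires under obsolete context: the weights depend only on input proximity, so a stale y^(n) near the query is read with high weight regardless of whether the task has changed.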
P(y, z | x, C) = P(y | x, z, r_x) P(z | C)    (3)

where (x, y) ∈ D is a target point. The ANP framework allows for the use of a variety of attention mechanisms (Vaswani et al., 2017).

Sequential Neural Process (SNP) (Singh et al., 2019) While the goal of meta-learning is to learn a single task distribution from a given context C, many situations consist of a sequence of tasks {τ_t}_{t ∈ [T]} that change gradually with time. Here, t is a task-step. Thus, a framework that learns a task τ_t from a context C_t must also utilize its similarity with the previous tasks to be able to learn with less context. Let z_t denote a representation for a task τ_t. Then SNP uses a distribution p(z_t | z