Robustifying Sequential Neural Processes

Jaesik Yoon 1  Gautam Singh 2  Sungjin Ahn 2 3

When tasks change over time, meta-transfer learning seeks to improve the efficiency of learning a new task via both meta-learning and transfer learning. While standard attention has been effective in a variety of settings, we question its effectiveness for meta-transfer learning, since the tasks being learned are dynamic and the amount of context can be substantially smaller. In this paper, using a recently proposed meta-transfer learning model, Sequential Neural Processes (SNP), we first empirically show that it suffers from an underfitting problem similar to that observed in the functions inferred by Neural Processes. However, we further demonstrate that, unlike in the meta-learning setting, standard attention mechanisms are not effective in the meta-transfer setting. To resolve this, we propose a new attention mechanism, Recurrent Memory Reconstruction (RMR), and demonstrate that providing an imaginary context that is recurrently updated and reconstructed with interaction is crucial in achieving effective attention for meta-transfer learning. Furthermore, incorporating RMR into SNP, we propose Attentive Sequential Neural Processes RMR (ASNP-RMR) and demonstrate in various tasks that ASNP-RMR significantly outperforms the baselines.

1. Introduction

A central challenge in machine learning is to improve learning efficiency. Among the approaches to this end are meta-learning (Schmidhuber, 1987; Bengio et al., 1990) and transfer learning (Pratt, 1993; Pan & Yang, 2009). Meta-learning aims to learn the learning process itself and thus enables efficient learning (e.g., from a small amount of data), while

1 SAP  2 Department of Computer Science, Rutgers University  3 Rutgers Center for Cognitive Science. Correspondence to: Jaesik Yoon, Sungjin Ahn.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.
Copyright 2020 by the author(s).

transfer learning allows an efficient warm start of a new task by transferring knowledge from previously learned tasks. In many scenarios, these two problems are not separate but appear in combination. For example, to build a customer preference model each month, we would like to build it by transferring knowledge from the models of the previous months instead of starting from scratch, because the general preferences of a customer do not change much across months. However, due to the monthly preference shift, e.g., caused by a seasonal change, we also need to learn efficiently from the new observations of the new month.

Sequential Neural Processes (SNP) (Singh et al., 2019) is a new probabilistic model class that addresses the above meta-transfer learning problem. In SNP, meta-transfer learning is modeled as a stochastic process that changes with time (thus, a stochastic process of stochastic processes). At each time-step, the stochastic process, or equivalently a task, is meta-learned from the context. SNP represents each task by a latent state of the Neural Process (NP) and models the temporal dynamics of such latent states using a recurrent state-space model (Hafner et al., 2018). This enables SNP to transfer the knowledge of the previous tasks when learning a new task.

It is well known that the Neural Process (NP) suffers from an underfitting problem because all context observations are encoded with limited expressiveness (e.g., by sum or mean encoding) to satisfy the order-invariance property (Wagstaff et al., 2019). Kim et al. (2019) observe that query-dependent attention can substantially mitigate this problem and propose Attentive Neural Processes (ANP). Therefore, a crucial next question is whether SNP, which is partly based on NP for task-level meta-learning but is also equipped with temporal transfer, also suffers from underfitting, and if so, how we can resolve the problem.
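To make the conditioning structure of SNP's temporal transfer concrete, the sketch below samples a task latent z_t given the previous latent z_{t-1} and the current context C_t. This is only an illustration of the dependency structure: the transition matrix A, context encoder B, and fixed noise scale are hypothetical stand-ins, whereas the actual SNP learns a recurrent state-space model end to end.

```python
import numpy as np

rng = np.random.default_rng(1)
d_z, dx, dy = 4, 1, 1
A = rng.normal(size=(d_z, d_z)) * 0.3       # stand-in transition weights (learned in SNP)
B = rng.normal(size=(d_z, dx + dy)) * 0.3   # stand-in context encoder weights

def snp_transition(z_prev, context):
    """Sample z_t ~ p(z_t | z_{t-1}, C_t): the task latent is carried
    forward from the previous task-step (transfer) and corrected by
    whatever context the new task provides (meta-learning)."""
    # Permutation-invariant mean encoding of the current context C_t.
    enc = (np.mean([B @ np.concatenate([x, y]) for x, y in context], axis=0)
           if context else np.zeros(d_z))
    mu = np.tanh(A @ z_prev + enc)
    return mu + 0.1 * rng.normal(size=d_z)  # stochastic task representation

z = np.zeros(d_z)
z = snp_transition(z, [(np.array([0.2]), np.array([0.7]))])  # task 1: some context
z = snp_transition(z, [])  # task 2: empty context, pure temporal transfer
```

Note that the second step receives an empty context and still yields a task representation, which is exactly the regime where temporal transfer must do the work.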
In this paper, we argue not only that there is underfitting in SNP but also that it affects robustness more severely than in NP. We observe that this is because of two novel problems, sparse context and obsolete context, that occur in the novel setting of meta-transfer learning. In Singh et al. (2019), it is shown that, in comparison to meta-learning, meta-transfer learning is expected to learn a task more efficiently by using a much smaller amount of context, or even an empty context, due to the availability of temporal transfer. However, this sparsity becomes an issue because, when the context is small or empty, attention, which is a remedy for underfitting, becomes highly ineffective or even inapplicable. In the case of sparse context, it is possible to also include the past contexts for attention. However, this raises the second problem, the problem of obsolete context, because the past contexts come from tasks that are different from the current task. Thus, we argue that if the past contexts are not properly transformed for the current task, attention on them may hurt the performance. An illustration is shown in Fig. 1.

Figure 1: Task shift. A context observation c_1 = (x_1, y_1) is provided for a task T_1 (black line). Then, the task is changed to T_2 (blue line). After the task is shifted to T_2, the value f(x_2) is queried, whose true value is y_2. The standard attention will use the obsolete context c_1 = (x_1, y_1) to infer f(x_2), and a high attention weight will be given to it because x_1 and x_2 are close. As a result, the attention will suggest a high value for f(x_2) even though the true value y_2 is small. Our proposed model reconstructs c_1 so that its value can be properly adapted to the new task (T_2).

To this end, we propose a novel attention mechanism for meta-transfer learning, called Recurrent Memory Reconstruction and Attention (RMRA).
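The idea just introduced can be sketched loosely as follows: a memory vector is recurrently updated with each task's context (possibly empty) and then decoded into a small set of imaginary context pairs for the current task. Everything below — the weight matrices, the update rule, and the number M of imagined pairs — is a hypothetical stand-in, not the model's actual parameterization, which this chunk of the paper does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, M, dx, dy = 8, 4, 1, 1

# Hypothetical fixed parameters; in the real model these would be learned.
W_h = rng.normal(size=(d_h, d_h)) * 0.1            # recurrent update weights
W_c = rng.normal(size=(d_h, dx + dy)) * 0.1        # context encoder weights
W_out = rng.normal(size=(M * (dx + dy), d_h)) * 0.1  # reconstruction readout

def rmr_step(h, context):
    """One task-step of a recurrent-memory-reconstruction sketch.

    h: (d_h,) memory summarizing all past contexts. `context` is a list of
    (x, y) pairs for the current task (possibly empty -- the sparse case).
    Returns the updated memory and M reconstructed imaginary context
    pairs adapted to the current task-step."""
    if context:  # encode the observed context, order-invariantly
        enc = np.mean([W_c @ np.concatenate([x, y]) for x, y in context], axis=0)
    else:        # empty context: the memory alone drives the reconstruction
        enc = np.zeros(d_h)
    h = np.tanh(W_h @ h + enc)                    # recurrent memory update
    imagined = (W_out @ h).reshape(M, dx + dy)    # reconstruct imaginary pairs
    return h, [(p[:dx], p[dx:]) for p in imagined]

h = np.zeros(d_h)
h, imag = rmr_step(h, [(np.array([0.5]), np.array([1.0]))])
h, imag = rmr_step(h, [])  # later task with empty context: attention can
                           # still operate on the imagined pairs
```

The point of the sketch is structural: the imagined pairs are regenerated at every task-step from the updated memory, so attention never reads a stale (obsolete) observation directly.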
In RMRA, to resolve the sparsity problem, we augment each task's original context with a generated imaginary context. Thus, even if a task provides a small or empty context, we can still apply attention effectively on the generated imaginary context. To resolve the obsolete context problem, we generate this imaginary context using a novel Recurrent Memory Reconstruction (RMR) mechanism. RMR temporally encodes all past context observations using recurrent updates and then reconstructs a reformed imaginary context for each task. In this way, the past contexts are properly transformed into a representation useful for the current task. In addition, we do not need to limit the attention to a finite-size window but can use the entire past without storing it explicitly. By augmenting SNP with RMRA, we propose a novel and robust SNP model, called Attentive SNP-RMR.

Our main contributions are: (i) we identify why SNP should also suffer from underfitting and empirically show that this is indeed the case; (ii) we explain why the existing attention mechanisms for NP are sub-optimal in the meta-transfer learning setting due to sparse and obsolete context, and provide an empirical analysis; (iii) we propose a novel Recurrent Memory Reconstruction and Attention (RMRA) mechanism to resolve the problem, combine RMRA with SNP, and thereby propose a novel model, Attentive SNP-RMR; and (iv) in our experiments, we empirically show that the RMRA mechanism resolves underfitting efficiently and effectively and yields superior performance in comparison to the baselines in various meta-transfer learning settings.

2. Background

Neural Process (NP) (Garnelo et al., 2018) learns to learn a task τ that maps an input x ∈ ℝ^{d_x} to an output y ∈ ℝ^{d_y}, given a context dataset C = (X_C, Y_C) = {(x^(n), y^(n))}_{n ∈ [N_C]}. Here, N_C is the number of data points in C and [N_C] := {1, . . . , N_C}.
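To make this notation concrete: a context set C is a bag of (x, y) pairs, and NP encodes it with an order-invariant aggregation such as a sum of per-point features. The tiny one-layer encoder below is a stand-in with fixed random weights (in NP it would be a learned MLP); the sketch only demonstrates the permutation-invariance property.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy, d_r = 1, 1, 8
W1 = rng.normal(size=(d_r, dx + dy))  # stand-in encoder weights (would be learned)

def encode(context):
    """Permutation-invariant sum encoding: sum_n MLP(x^(n), y^(n))."""
    return sum(np.tanh(W1 @ np.concatenate([x, y])) for x, y in context)

C = [(np.array([0.1]), np.array([0.5])),
     (np.array([0.9]), np.array([-0.2]))]
r1 = encode(C)
r2 = encode(C[::-1])  # reversed order gives the same encoding
```

Because the aggregation is a plain sum, any reordering of the context points yields the identical vector, which is exactly the order-invariance requirement and also the source of the limited expressiveness discussed next.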
To learn a task distribution from this context, NP uses a distribution P(z|C) to sample a task representation z. This makes NP a probabilistic meta-learning framework. Next, an observation model P(y|x, z) takes an input x and returns an output y. The generative process for NP conditioned on the context is given by:

P(Y, z | X, C) = P(Y | X, z) P(z | C)    (1)

where P(Y | X, z) = ∏_{n ∈ [N_D]} P(y^(n) | x^(n), z) and D = (X, Y) = {(x^(n), y^(n))}_{n ∈ [N_D]} is the target dataset. To obtain the training data for this meta-learning setting, we draw multiple tasks from the true task distribution and sample (C, D) for each task. Note that to implement NP, C is encoded via a permutation-invariant function, such as ∑_n MLP(x^(n), y^(n)). Kim et al. (2019) argue that such a sum-aggregation produces an encoding that is not expressive enough and consequently hurts the observation model P(Y | X, z). This is a key limitation of NP and is addressed by Attentive Neural Processes (ANP).

Attentive Neural Process (ANP) Kim et al. (2019) identify the problem of underfitting in NP: tasks learned from the context set fail to accurately predict the target points, including those in the context set itself, and the learned task distribution shows high uncertainty. The authors show that using a larger latent z is not sufficient either. To address this, ANP combines the observation model P(y|x, z) with attention on the context points. This allows the model to take a query x and attend to the most relevant data points to predict the target output y. To achieve this, an attention function Attend(C; x_q) is implemented, which takes the context C and a query x_q and returns a read value r_{x_q}:

r_{x_q} = Attend(C; x_q) = ∑_{n ∈ [N_C]} w_n y^(n)    (2)

Here, for each y^(n) ∈ Y_C, there is an associated weight w_n computed using a similarity function sim(X_C, x_q). Using r_x, the observation model becomes P(y | x, z, r_x), resulting in the following generative process.
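The attention read of Eq. (2) can be sketched in a few lines. The softmax over negative squared distances (with a temperature) is just one simple choice of similarity function, used here for illustration; ANP itself admits a variety of attention mechanisms.

```python
import numpy as np

def attend(xc, yc, xq, tau=0.1):
    """Query-dependent attention read over a context set (Eq. 2).

    xc: (N, dx) context inputs, yc: (N, dy) context outputs,
    xq: (dx,) query. The weights w_n come from a softmax over negative
    squared distances scaled by temperature tau (an illustrative sim())."""
    sim = -np.sum((xc - xq) ** 2, axis=1) / tau  # (N,) similarities
    w = np.exp(sim - sim.max())
    w = w / w.sum()                              # softmax weights, sum to 1
    return w @ yc                                # r_xq = sum_n w_n y^(n)

# A query near x = 1.0 reads almost exactly the output at that context point.
xc = np.array([[0.0], [1.0], [5.0]])
yc = np.array([[0.0], [10.0], [50.0]])
r = attend(xc, yc, np.array([1.0]))
```

This is precisely the mechanism that misfires under obsolete context: the weights depend only on input proximity, so a stale y^(n) near the query is read with high weight regardless of whether the task has changed.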
P(y, z | x, C) = P(y | x, z, r_x) P(z | C)    (3)

where (x, y) ∈ D is a target point. The ANP framework allows for the use of a variety of attention mechanisms (Vaswani et al., 2017).

Sequential Neural Process (SNP) (Singh et al., 2019) While the goal of meta-learning is to learn a single task distribution from a given context C, many situations consist of a sequence of tasks {τ_t}_{t ∈ [T]} that change gradually with time. Here, t is a task-step. Thus, a framework that learns a task τ_t from a context C_t must also utilize its similarity with the previous tasks to be able to learn with less context. Let z_t denote a representation for a task τ_t. Then SNP uses a distribution p(z_t | z