# MetaFun: Meta-Learning with Iterative Functional Updates

Jin Xu 1, Jean-Francois Ton 1, Hyunjik Kim 2, Adam R. Kosiorek 1 3, Yee Whye Teh 1

Abstract

We develop a functional encoder-decoder approach to supervised meta-learning, where labelled data is encoded into an infinite-dimensional functional representation rather than a finite-dimensional one. Furthermore, rather than directly producing the representation, we learn a neural update rule resembling functional gradient descent which iteratively improves the representation. The final representation is used to condition the decoder to make predictions on unlabelled data. Our approach is the first to demonstrate the success of encoder-decoder style meta-learning methods like conditional neural processes on large-scale few-shot classification benchmarks such as miniImageNet and tieredImageNet, where it achieves state-of-the-art performance.

1. Introduction

The goal of meta-learning is to be able to generalise to new tasks from the same task distribution as the training tasks. In supervised meta-learning, a task can be described as making predictions on a set of unlabelled data points (the target) by effectively learning from a set of labelled data points (the context). Various ideas have been proposed to tackle supervised meta-learning from different perspectives (Andrychowicz et al., 2016; Ravi & Larochelle, 2016; Finn et al., 2017; Koch, 2015; Snell et al., 2017; Vinyals et al., 2016; Santoro et al., 2016; Rusu et al., 2019). In this work, we are particularly interested in a family of meta-learning models that use an encoder-decoder pipeline, such as Neural Processes (Garnelo et al., 2018a;b). The encoder is a permutation-invariant function on the context that summarises the context into a task representation, while the decoder produces a predictive model for the targets, conditioned on the task representation.
1Department of Statistics, University of Oxford, Oxford, United Kingdom 2DeepMind, London, United Kingdom 3Applied AI Lab, Oxford Robotics Institute, University of Oxford, Oxford, United Kingdom. Correspondence to: Jin Xu. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

The objective of meta-learning is then to learn the encoder and the decoder such that the produced predictive model generalises well to the targets of new tasks. Previous works in this category, such as the Conditional Neural Process (CNP) and the Neural Process (NP) (Garnelo et al., 2018a;b), use sum-pooling operations to produce finite-dimensional, vectorial task representations. In this work, we investigate the idea of summarising tasks with infinite-dimensional functional representations. Although there is a theoretical guarantee that sum-pooling of instance-wise representations can express any set function (universality) (Zaheer et al., 2017; Bloem-Reddy & Teh, 2020), in practice CNP and NP tend to underfit the context (Kim et al., 2019). This observation is in line with the theoretical finding that, for universality, the dimension of the task representation should be at least as large as the cardinality of the context set if the encoder is a smooth function (Wagstaff et al., 2019). We develop a method that explicitly uses functional representations. Here the effective dimensionality of the task representation grows with the number of context points, which addresses the aforementioned issues of fixed-dimensional representations. Moreover, in practice it is difficult to model interactions between data points with only sum-pooling operations. This issue can be partially addressed by inserting modules such as relation networks (Sung et al., 2018; Rusu et al., 2019) or set transformers (Lee et al., 2019a) before sum-pooling.
However, only within-context interactions, and not context-target interactions, can be modelled by these modules. The construction of our functional representation involves measuring similarities between all data points, which naturally contains information about interactions between elements of either the context or the target. Furthermore, rather than producing the functional representation in a single pass, we develop an approach that learns iterative updates to encode the context into the task representation. In general, learning via iterative updates is often easier than directly learning the final representation, because of the error-correcting opportunity at each iteration. For example, an iterative parameterisation of the encoder in Variational Autoencoders (VAEs) has been demonstrated to be effective in reducing the amortisation gap (Marino et al., 2018), while in meta-learning, both learning-to-learn methods (Andrychowicz et al., 2016; Ravi & Larochelle, 2016) and Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) use iterative updating procedures to adapt to new tasks, although these update rules operate in parameter space rather than function space.

Figure 1. To illustrate the iterative procedure in MetaFun, we consider a simpler case where our functional representation is just a predictor for the task. (A) The figure depicts a 1D regression task with the current predictor. (B) Local updates are computed by evaluating the functional representation (the current predictor) on the context inputs and comparing it to the corresponding context outputs. Here we simply measure differences between evaluations (predictions) and outputs. (C) We apply functional pooling to aggregate local updates into a global functional update, which generalises the local updates to the whole input domain. (D) The functional update is applied to the current functional representation with a learning rate α.
Therefore, it is reasonable to conjecture that iterative structures are a favourable inductive bias for the task encoding process. In summary, the primary contribution of this work is a meta-learning approach that learns to summarise a task using a functional representation constructed via iterative updates. We apply our approach to meta-learning problems on both regression and classification tasks, and achieve state-of-the-art performance on heavily benchmarked datasets such as miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018), which has never before been demonstrated with encoder-decoder meta-learning methods without MAML-style gradient updates. We also conduct an ablation study to understand the effects of the different model components.

2. MetaFun

Meta-learning, or learning to learn, leverages past experience to quickly adapt to tasks $\mathcal{T} \sim p(\mathcal{T})$ drawn i.i.d. from some task distribution. In supervised meta-learning, a task $\mathcal{T}$ takes the form $\mathcal{T} = \{\ell, \{(x_i, y_i)\}_{i \in C}, \{(x'_j, y'_j)\}_{j \in T}\}$, where $x_i, x'_j \in \mathcal{X}$ are inputs, $y_i, y'_j \in \mathcal{Y}$ are outputs, $\ell$ is the loss function to be minimised, $\{(x_i, y_i)\}_{i \in C}$ is the context, and $\{(x'_j, y'_j)\}_{j \in T}$ is the target. We consider the process of learning as constructing a predictive model using the task context, and refer to the mapping from context $\{(x_i, y_i)\}_{i \in C}$ to a predictive model $f = \Phi(\{(x_i, y_i)\}_{i \in C}; \phi)$ as the learning model, parameterised by $\phi$. In our formulation, the objective of meta-learning is to optimise the learning model such that the expected loss on the target under $f$ is minimised, formally written as:

$$f = \Phi(\{(x_i, y_i)\}_{i \in C}; \phi), \qquad \phi^* = \arg\min_{\phi} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \sum_{j \in T} \ell(f(x'_j), y'_j) \qquad (1)$$

where both $\{(x_i, y_i)\}_{i \in C}$ and $\{(x'_j, y'_j)\}_{j \in T}$ come from task $\mathcal{T}$.

2.1. Learning Functional Task Representations

Like previous works such as CNP and NP, we construct the learning model using an encoder-decoder pipeline, where the encoder $\Phi_e(\{(x_i, y_i)\}_{i \in C}; \phi_e)$ is a permutation-invariant function of the context producing a task representation.
In past works, pooling operations are usually used to enforce permutation-invariance. CNP and NP use sum-pooling: $r = \sum_{i \in C} r_i$, where $r_i = h(x_i, y_i; \phi_e)$ is a representation of the context pair $(x_i, y_i)$, and $r$ is a fixed-dimensional task representation. Instead, we introduce functional-pooling operations, which also enforce permutation-invariance but output a function that can loosely be interpreted as an infinite-dimensional representation.

Definition 2.1 (Functional pooling). Let $k(\cdot, \cdot)$ be a real-valued similarity measure, and $\{(x_i, r_i)\}_{i \in C}$ be a set of key-value pairs with $x_i \in \mathcal{X}$, $r_i \in \mathbb{R}$. Functional pooling is a mapping $\mathrm{FunPooling} : (\mathcal{X} \times \mathbb{R})^{|C|} \to \mathcal{H}$ defined as

$$r(\cdot) = \mathrm{FunPooling}(\{(x_i, r_i)\}_{i \in C}) = \sum_{i \in C} k(\cdot, x_i)\, r_i, \qquad (2)$$

where the output is a function $r : \mathcal{X} \to \mathbb{R}$ and $\mathcal{H}$ is a space of such functions.

In practice, we only need to evaluate this function on a finite query set $\{(x'_j, y'_j)\}_{j \in Q}$ (consisting of both contexts and targets; see below). That is, we only need to compute $R = [r(x'_1), \ldots, r(x'_{|Q|})]^\top$, which can easily be implemented using matrix operations. We consider two types of FunPooling here, though others are possible. The kernel-based FunPooling reads

$$R = \mathrm{KFP}(Q, K, V) := k_{\mathrm{rbf}}(Q, K)\, V, \qquad (3)$$

where $k_{\mathrm{rbf}}$ is the RBF kernel, $Q = [a(x'_1), \ldots, a(x'_{|Q|})]^\top$ is a matrix whose rows are queries, $a(\cdot)$ is a transformation mapping inputs into features, $K = [a(x_1), \ldots, a(x_{|C|})]^\top$ is a matrix whose rows are keys, and $V = [r_1, \ldots, r_{|C|}]^\top$ is a matrix whose rows are values (using terminology from the attention literature). Parameterising the input transformation $a$ with a deep neural network can be seen as using deep kernels (Wilson et al., 2016) as the similarity measure. The second type of FunPooling is given by dot-product attention,

$$R = \mathrm{DFP}(Q, K, V) := \mathrm{softmax}(Q K^\top / \sqrt{d_k})\, V, \qquad (4)$$

where $d_k$ is the dimension of the query/key vectors.

Figure 2. This figure illustrates the iterative computation of the functional representation in MetaFun. At each iteration, we first evaluate the current functional representation at both context and target points. Then the shared local update function u takes each context point and the corresponding evaluation as inputs, and produces a local update ui. Next, we apply (kernel-based or attention-based) functional pooling to aggregate the local updates ui into a functional update ∆r(·), which for each query is a linear combination of local updates ui, weighted by the similarities between this query and all keys. Finally, the functional updates are evaluated for both the context and the target, and are applied to the corresponding evaluations of the functional representation with a learning rate α.

Our second core idea is that rather than producing the task representation in a single pass (like previous encoder-decoder meta-learning approaches), we start from an initial representation $r^{(0)}(\cdot)$ and iteratively produce improved representations $r^{(1)}(\cdot), \ldots, r^{(T)}(\cdot)$. At each step, a parameterised local update rule $u$ compares $r^{(t)}(\cdot)$ to the context input/output pairs, producing local update values $u_i = u(x_i, y_i, r^{(t)}(x_i))$ for each $i \in C$. These can then be aggregated into a global update to the task representation using functional pooling:

$$u_i = u(x_i, y_i, r^{(t)}(x_i)), \quad \Delta r^{(t)}(\cdot) = \mathrm{FunPooling}(\{(x_i, u_i)\}_{i \in C}), \quad r^{(t+1)}(\cdot) = r^{(t)}(\cdot) - \alpha\, \Delta r^{(t)}(\cdot), \qquad (5)$$

where $\alpha$ is the step size. Once the local update function $u$ and the functional pooling operations are parameterised by neural networks, Equation (5) defines a neural update rule operating directly in function space. The functional update $\Delta r^{(t)}(\cdot)$ depends on the current representation $r^{(t)}(\cdot)$ and the context $\{(x_i, y_i)\}_{i \in C}$. Figure 1 illustrates our iterative procedure in a simplified setting. The final task representation can then be decoded into a predictor $f(\cdot) = \Phi_d(r^{(T)}(\cdot); \phi_d)$. The specific parametric forms of the decoder take different forms for regression and classification, and are described in Section 2.2.
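The two FunPooling operations and the iterative update of Equation (5) can be sketched directly with NumPy. This is a minimal illustration, not the paper's implementation: the function names `kfp`, `dfp` and `metafun_iterate` are ours, the feature transformation $a(\cdot)$ is taken to be the identity, the RBF lengthscale is fixed, and the local update function `u` is passed in as a plain callable.

```python
import numpy as np

def kfp(Q, K, V, lengthscale=1.0):
    """Kernel-based functional pooling (Eq. 3): R = k_rbf(Q, K) V.
    Q: (|Q|, d) queries, K: (|C|, d) keys, V: (|C|, r) values."""
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2)) @ V

def dfp(Q, K, V):
    """Dot-product-attention functional pooling (Eq. 4)."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V   # row-wise softmax, then combine values

def metafun_iterate(x_c, y_c, x_t, u, T=5, alpha=0.1, pooling=kfp):
    """Sketch of Eq. (5): the representation r is kept only as its evaluations
    on the context and target inputs (context rows first)."""
    x_all = np.concatenate([x_c, x_t])
    r = np.zeros((len(x_all), y_c.shape[1]))      # initial representation r^(0)
    for _ in range(T):
        r_c = r[: len(x_c)]                       # evaluations at context inputs
        U = u(x_c, y_c, r_c)                      # local updates u_i, one per context point
        delta = pooling(x_all, x_c, U)            # functional pooling -> global update
        r = r - alpha * delta                     # r^(t+1) = r^(t) - alpha * delta_r^(t)
    return r
```

Both pooling operations are permutation-invariant in the context, since reordering key-value pairs leaves the sum (or softmax-weighted sum) unchanged; with the gradient-style local update `u = r_c - y_c`, the loop behaves like kernel smoothing that pulls the context evaluations towards the observed outputs.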
The decoder requires the evaluation of the functional representation $r^{(T)}(x)$ at $x$ only for predicting $f(x)$. Therefore, it is unnecessary to compute the functional representations $r(\cdot)$ (including their functional updates) on all input points. Instead, we compute them only on the context inputs $\{x_i\}_{i \in C}$ and target inputs $\{x'_j\}_{j \in T}$. We use $r^{(t)} = [r^{(t)}(x_1), \ldots, r^{(t)}(x_{|C|}), r^{(t)}(x'_1), \ldots, r^{(t)}(x'_{|T|})]^\top$ to denote a matrix where each row is $r^{(t)}(x)$ evaluated on either a context or a target input, and let $Q = [a(x_1), \ldots, a(x_{|C|}), a(x'_1), \ldots, a(x'_{|T|})]^\top$. Equation (5) can be implemented using matrix computations as follows:

$$u^{(t)}_i = u(x_i, y_i, r^{(t)}_i), \qquad (6)$$
$$U^{(t)} = [u^{(t)}_1, \ldots, u^{(t)}_{|C|}]^\top, \qquad (7)$$
$$\Delta r^{(t)} = \mathrm{KFP}\ \text{or}\ \mathrm{DFP}\,(Q, K, U^{(t)}), \qquad (8)$$
$$r^{(t+1)} = r^{(t)} - \alpha\, \Delta r^{(t)}, \qquad (9)$$

where $r^{(t)}_i$ denotes the $i$-th row of $r^{(t)}$. To obtain a prediction $f_j$ for the target $(x'_j, y'_j)$, we decode the final representation for this target point: $f_j = \Phi_d(r^{(T)}_{|C|+j}; \phi_d)$, and the overall training loss can be written as:

$$\mathcal{L}(u, a, \phi_d) = \frac{1}{|T|} \sum_{j \in T} \ell(f_j, y'_j), \qquad (10)$$

where the predictions $f_j$ depend on $u$, $a$ and $\phi_d$. Assuming the width and depth of all our neural network components are bounded by $W$ and $D$ respectively, and that the output dimension of $u$ is also less than $W$, the time complexity of our approach is $O(W|C|(|C|+|T|) + W^2 D(|C|T + |T|))$, and the space complexity is $O((|C| + WT)(|C| + |T|) + W^2 D)$. For few-shot problems, $|C|$ and $|T|$ are typically small, and $T \le 6$ in all our experiments.

2.2. MetaFun for Regression and Classification

While the proposed framework can be applied to any supervised learning task, the specific parameterisation of the learnable components can affect model performance. In this section, we specify the parametric forms of our model that work well on regression and classification tasks.
Regression. For regression tasks, we parameterise the local update function $u(\cdot)$ using a multi-layer perceptron: $u([x_i, y_i, r(x_i)]) = \mathrm{MLP}([x_i, y_i, r(x_i)])$, $i \in C$, where $[\cdot]$ denotes concatenation. We also use an MLP to parameterise the input transformation $a(\cdot)$ in the functional pooling. The decoder in this case is given by $w = \mathrm{MLP}(r(x))$, another MLP1 that outputs $w$, which in turn parameterises the predictive model $f = \mathrm{MLP}(x; w)$. Note that our model can easily be modified to incorporate Gaussian uncertainty by adding an extra output vector for the predictive standard deviation: $P(y|x) = \mathcal{N}(\mu_w(x), \sigma_w(x))$, $w = \mathrm{MLP}(r(x))$. For further architecture details, see the Appendix.

Classification. For K-way classification, we divide the latent functional representation $r(x)$ into K parts $[r^1(x), \ldots, r^K(x)]$, where $r^k(x)$ corresponds to class k. Consequently, the local update function $u(\cdot)$ also has K parts, that is, $u([x_i, y_i, r(x_i)]) = [u^1(\cdot), \ldots, u^K(\cdot)]$. In this case, $y_i = [y^1_i, \ldots, y^K_i]$ is the class label expressed as a one-hot vector, and $u^k$ is defined as follows:

$$u^k([x_i, y_i, r(x_i)]) = y^k_i\, u^+(m(r^k(x_i)), \bar{m}_i) + (1 - y^k_i)\, u^-(m(r^k(x_i)), \bar{m}_i), \qquad (11)$$

where $\bar{m}_i = \sum_{k=1}^K m(r^k(x_i))$ summarises the representations of all classes, and $m$, $u^+$, $u^-$ are parameterised by separate MLPs. With this formulation, we update the class representations using either $u^+$ (when the label matches k) or $u^-$ (when the label differs from k), so that labels are not concatenated to the inputs but directly used to activate different model components, which is crucial for model performance. Furthermore, interactions between data points in classification problems include both within-class and between-class interactions.

1It might be desirable to use other parameterisations of the input transformation $a(\cdot)$ and the decoder $f(\cdot)$, e.g., $f(x) = \mathrm{MLP}([x, r(x)])$, or feeding $r(x)$ to each layer of the MLP.
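The class-wise local update of Equation (11) can be sketched as follows. The MLPs $m$, $u^+$, $u^-$ are replaced here by plain callables, and all names are illustrative; the particular fixed choice used in the test below (with $m = \exp$) anticipates the functional-gradient-descent special case discussed in Section 3, where this update recovers the cross-entropy gradient.

```python
import numpy as np

def class_local_update(y_onehot, r_x, m, u_pos, u_neg):
    """Per-class local update of Eq. (11) at a single data point.
    y_onehot: one-hot label (K,); r_x: the K class representations r^k(x);
    m, u_pos, u_neg: stand-ins for the paper's separate MLPs."""
    ms = np.array([m(rk) for rk in r_x])
    m_bar = ms.sum()                        # summary over all K classes
    return np.array([yk * u_pos(mk, m_bar) + (1 - yk) * u_neg(mk, m_bar)
                     for yk, mk in zip(y_onehot, ms)])
```

Note how the one-hot label is not concatenated to the inputs: it selects which of the two update components, `u_pos` or `u_neg`, is active for each class.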
Our approach is able to integrate the two types of interactions by having a separate functional representation for each class and computing local updates for each class differently, based on the class membership of each data point. In fact, this formulation resembles the structure of the local update rule in functional gradient descent for classification tasks, which is a special case of our approach (see Section 3). As in the regression tasks, the input transformation $a(\cdot)$ in the functional pooling is still an MLP. The parametric form of the decoder is the same as in Latent Embedding Optimisation (LEO) (Rusu et al., 2019). The class representation $r^k(x)$ generates weights $w_k \sim \mathcal{N}(\mu(r^k(x)), \sigma(r^k(x)))$, where $\mu$ and $\sigma$ are MLPs or just linear functions, and the final prediction is given by

$$P(y = k \mid x) = \mathrm{softmax}(x^\top w)_k, \qquad (12)$$

where $w = [w_1, \ldots, w_K]$, $k = 1, \ldots, K$. Hyperparameters of all components are described in the Appendix.

3. Related Work

Functional Gradient Descent. Functional gradient descent (Mason et al., 1999; Y. Guo & Williamson, 2001) is an optimisation algorithm used to minimise an objective function by moving in the direction of the negative gradient in function space. To ensure smoothness, we may work with functions in a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950; Berlinet & Thomas-Agnan, 2011) defined by a kernel $k(x, x')$. Given a function $f$ in the RKHS, we are interested in minimising the supervised loss $L(f) = \sum_{i \in C} \ell(f(x_i), y_i)$ with respect to $f$. We can do so by computing the functional derivative and using it to iteratively update $f$ (see the Appendix for more details):

$$f^{(t+1)}(x) = f^{(t)}(x) - \alpha \sum_{i \in C} k(x, x_i)\, \nabla\ell(f^{(t)}(x_i), y_i), \qquad (13)$$

with step size $\alpha$, where $\nabla\ell(f^{(t)}(x_i), y_i)$ denotes the gradient of the loss $\ell$ with respect to the predictions.
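Equation (13) is straightforward to run directly for a regression loss. The sketch below (names and the fixed RBF kernel are ours, not the paper's code) performs functional gradient descent for the squared loss, tracking $f$ only through its evaluations at the context and query inputs, exactly as MetaFun tracks its representation.

```python
import numpy as np

def functional_gd(x_c, y_c, x_q, T=50, alpha=0.5, lengthscale=1.0):
    """Functional gradient descent (Eq. 13) in an RBF-kernel RKHS, for the
    squared loss l(f(x), y) = (f(x) - y)^2 / 2, whose gradient is f(x) - y."""
    k = lambda A, B: np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * lengthscale ** 2))
    K_cc, K_qc = k(x_c, x_c), k(x_q, x_c)
    f_c, f_q = np.zeros_like(y_c), np.zeros(len(x_q))
    for _ in range(T):
        grad = f_c - y_c                 # pointwise loss gradients at the context
        f_c = f_c - alpha * K_cc @ grad  # Eq. (13) evaluated at context inputs
        f_q = f_q - alpha * K_qc @ grad  # ... and at query inputs
    return f_q
```

For a small step size the context evaluations converge towards the observed outputs, and the kernel propagates those corrections smoothly to the query inputs.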
The update rule in Equation (5) becomes that of functional gradient descent in Equation (13) when: (i) a trivial decoder $f(x) = \Phi_d(r(\cdot); \phi_d)(x) = r(x)$ is used, so that the functional representation $r(x)$ is the same as the predictive model $f(x)$; (ii) kernel functional pooling KFP is used and the kernel function is fixed; and (iii) the gradient-based local update function $u(x, y, f(x)) = \nabla\ell(f(x), y)$ is used.

Furthermore, for a K-way classification problem, we predict K-dimensional logits $f(x) = [f^1(x), \ldots, f^K(x)]^\top$, and use the cross-entropy loss:

$$\ell(f(x), y) = -\sum_{k=1}^K y^k \log \frac{e^{f^k(x)}}{\sum_{k'=1}^K e^{f^{k'}(x)}}, \qquad (14)$$

where $y = [y^1, \ldots, y^K]$ is the one-hot label for $x$. The gradient-based local update function is now $\nabla\ell(f(x), y) = [\partial_1\ell(f(x), y), \ldots, \partial_K\ell(f(x), y)]^\top$, where $\partial_k\ell(f(x), y)$ is the partial derivative with respect to each predictive logit:

$$\partial_k\ell(f(x), y) = \frac{e^{f^k(x)}}{\sum_{k'=1}^K e^{f^{k'}(x)}} - y^k. \qquad (15)$$

Here $\partial_k\ell(f(x), y)$ is analogous to $u^k(\cdot)$ in Equation (11), which is the local update function for class k. If $m$, $u^+$, $u^-$ in Equation (11) are specified rather than learned, more specifically

$$m(f^k(x)) = e^{f^k(x)}, \quad u^+(m(f^k(x)), \bar{m}) = \frac{m(f^k(x))}{\bar{m}} - 1, \quad u^-(m(f^k(x)), \bar{m}) = \frac{m(f^k(x))}{\bar{m}}, \quad \bar{m} = \sum_{k'=1}^K m(f^{k'}(x)), \qquad (16)$$

then Equation (15) can be rewritten as

$$\partial_k\ell(f(x), y) = y^k\, u^+(m(f^k(x)), \bar{m}) + (1 - y^k)\, u^-(m(f^k(x)), \bar{m}), \qquad (17)$$

which has the same form as Equation (11). Therefore, our approach can be seen as an extension of functional gradient descent, with additional learning capacity due to the learnable neural modules, which afford more flexibility. From this perspective, our approach tackles supervised meta-learning problems by learning an optimiser in function space.

Supervised Meta-Learning. Various ideas have been proposed to solve the problem of supervised meta-learning. Andrychowicz et al. (2016) and Ravi & Larochelle (2016) learn neural optimisers from previous tasks, which can then be used to optimise models for new tasks.
However, these learned optimisers operate in parameter space rather than in function space as ours does. MAML (Finn et al., 2017) learns an initialisation from which models are further adapted to a new task by a few gradient descent steps. Koch (2015), Snell et al. (2017) and Vinyals et al. (2016) explore the idea of learning, from previous tasks, a metric space in which data points are compared to each other to make predictions at test time. Santoro et al. (2016) demonstrate that Memory-Augmented Neural Networks (MANN) can rapidly integrate the data for a new task into memory, and utilise this stored information to make predictions. Our approach, in line with previous works such as CNP and NP, adopts an encoder-decoder pipeline to tackle supervised meta-learning. The encoder in CNP corresponds to a summation of instance-level representations produced by a shared instance encoder. NPs, on the other hand, use a probabilistic encoder with the same parametric form as CNP, but producing a distribution over stochastic representations. The Attentive Neural Process (ANP) (Kim et al., 2019) adds a deterministic path in addition to the stochastic path in NP. The deterministic path produces a target-specific representation, which can be interpreted as applying functional pooling (implemented with multihead attention (Vaswani et al., 2017)) to instance-wise representations. However, the representation is directly produced in a single pass rather than iteratively improved as in our approach, and only regression applications are explored, as opposed to few-shot image classification. In fact, to achieve high performance on classification tasks, it is crucial for CNP to apply sum-pooling only within each class (Garnelo et al., 2018a), and it is unclear how to follow similar practices in ANP while still modelling both within-class and between-class interactions. Recently, Gordon et al.
(2019) have also extended CNP to use functional representations, but for the purpose of incorporating translation equivariance of the inputs as an inductive bias, rather than increasing representational capacity as we do. Their approach uses convnets to impose translation equivariance and does not learn a flexible iterative encoder. Pooling operations are usually used in encoder-decoder meta-learning to enforce permutation invariance in the encoder. As an example, the encoders in both CNP and NP use simple sum-pooling operations. More expressive pooling operations have been proposed to model interactions between data points. Murphy et al. (2019) introduce Janossy pooling, which applies permutation-sensitive functions to all reorderings and averages the outputs, while Lee et al. (2019a) use pooling by multihead attention (PMA), in which a finite query set attends to the processed key-value pairs. Loosely speaking, attention-based functional pooling can be seen as having the whole input domain $\mathcal{X}$ as the query set in PMA.

Gradient-Based Meta-Learning. Interestingly, many gradient-based meta-learning methods such as MAML can also be cast into an encoder-decoder formulation, because a gradient descent step is a valid permutation-invariant function. For a model $f(\cdot, \theta)$ parameterised by $\theta$, one gradient descent step on the context loss has the following form:

$$\theta_{t+1} = \theta_t - \alpha \sum_{i \in C} \nabla_\theta\, \ell(f(x_i; \theta_t), y_i), \qquad (18)$$

where $\ell$ is the loss function, $\alpha$ is the learning rate, and $\theta_t$ are the model parameters after t gradient steps. This corresponds to a special case of permutation-invariant functions where we take the instance-wise encoder to be $h_t(x_i, y_i; \theta_t) = \theta_t/|C| - \alpha \nabla_\theta\, \ell(f(x_i; \theta_t), y_i)$ and apply sum-pooling $\theta_{t+1} = \sum_i h_t(x_i, y_i; \theta_t)$. Multiple gradient-descent steps also result in a permutation-invariant function, which can be proved by induction. We refer to this as a gradient-based encoder.
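The identity between a gradient step and sum-pooling of instance encodings can be checked numerically. The sketch below is illustrative (a linear model with squared loss; all names are ours): the two functions compute Equation (18) directly and via the instance-wise encoder $h_t$, and both are invariant to reordering the context.

```python
import numpy as np

def grad_step(theta, xs, ys, grad_l, alpha=0.1):
    """Eq. (18): one gradient step on the context loss."""
    return theta - alpha * sum(grad_l(x, y, theta) for x, y in zip(xs, ys))

def grad_step_as_pooling(theta, xs, ys, grad_l, alpha=0.1):
    """The same step written as sum-pooling of instance encodings
    h_t(x_i, y_i; theta_t) = theta_t / |C| - alpha * grad l_i."""
    h = lambda x, y: theta / len(xs) - alpha * grad_l(x, y, theta)
    return sum(h(x, y) for x, y in zip(xs, ys))
```

Because the per-instance terms enter only through a sum, the resulting encoder is permutation-invariant in the context, which is what places MAML-style adaptation inside the encoder-decoder formulation.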
It follows that popular meta-learning methods such as MAML can be seen as part of the encoder-decoder formulation. More specifically, in MAML we learn an initialisation of the model parameters $\theta_0$ from training tasks, and adapt to new tasks by running T gradient steps from the learned initialisation. Therefore, $\theta_T$ can be seen as the task representation (albeit a very high-dimensional one) produced by a gradient-based encoder. The success of MAML on a variety of tasks can be partially explained by the high-dimensional representation and the iterative adaptation by gradient descent, supporting our use of a functional (infinite-dimensional) representation and an iterative updating procedure. Note, however, that the update rule in MAML operates in parameter space rather than in function space as in our case. Under the same encoder-decoder formulation, a comparison between MAML and MetaFun can be made, which partially explains why MetaFun can be desirable. Firstly, the updates in MAML must lie in its parametric space, while there is no parametric constraint in MetaFun, as illustrated in Figure 3. Secondly, MAML uses gradient-based updates, while MetaFun uses learned local updates, which potentially contain more information than gradients. Finally, MAML does not explicitly consider interactions between data points, while both within-context and context-target interactions are modelled in MetaFun.

4. Experiments

We evaluate our proposed model on both few-shot regression and classification tasks. In all experiments that follow, we partition the data into training, validation and test meta-sets, each containing data from disjoint tasks. For quantitative results, we train each model with 5 different random seeds and report the mean and standard deviation of the test accuracy. For further details on hyperparameter tuning, see the Appendix. All experiments are performed using TensorFlow (Abadi et al., 2015), and the code is available.2

4.1.
1-D Function Regression

We first explore a 1D sinusoid regression task in which we visualise the updating procedure in function space, providing intuition for the learned functional updates. Then we incorporate Gaussian uncertainty into the model, and compare our predictive uncertainty against that of the GP which generates the data.

Table 1. Few-shot regression on sinusoids. MAML can benefit from more parameters, but MetaFun still outperforms all MAML variants despite using fewer parameters than large MAML. We report the mean and standard deviation of 5 independent runs.

Model | 5-shot MSE | 10-shot MSE
Original MAML | 0.390 ± 0.156 | 0.114 ± 0.010
Large MAML | 0.208 ± 0.009 | 0.061 ± 0.004
Very Wide MAML | 0.205 ± 0.013 | 0.059 ± 0.010
MetaFun | 0.040 ± 0.008 | 0.017 ± 0.005

Visualisation of functional updates. We train a T-step MetaFun model with dot-product functional pooling on a simple sinusoid regression task from Finn et al. (2017), where each task uses data points from a sine wave. The amplitude A and phase b of the sinusoid vary across tasks and are randomly sampled during training and at test time, with A ∼ U(0.1, 5.0) and b ∼ U(0, π). The x-coordinates are uniformly sampled from U(−5.0, 5.0). Figure 3 shows that our proposed algorithm learns a smooth transition from the initial state to the final prediction at t = T = 5. Note that although only 5 context points on a single phase of the sinusoid are given at test time, the final iteration makes predictions close to the ground truth across the whole period. As a comparison, we use MAML as an example of updating in parameter space. The original MAML (40 units × 2 hidden layers) can fit the sinusoid quite well after several iterations from the learned initialisation. However, the prediction is not as good, particularly on the left side where there are no context points (see Figure 3B). As we increase the model size to a large MAML (256 units × 3 hidden layers), the updates become much smoother (Figure 3C) and the predictions are closer to the ground truth.
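The sinusoid task distribution described above is easy to reproduce. The sampler below is a sketch: the amplitude, phase and input ranges follow the text, while the function name and the additive phase convention $y = A\sin(x + b)$ are assumptions on our part.

```python
import numpy as np

def sample_sinusoid_task(n_context=5, n_target=10, rng=None):
    """Sample one sinusoid regression task: A ~ U(0.1, 5.0), b ~ U(0, pi),
    x ~ U(-5, 5).  The additive phase convention is an assumption."""
    rng = rng if rng is not None else np.random.default_rng()
    A = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=n_context + n_target)
    y = A * np.sin(x + b)
    # Split into context (labelled) and target (to be predicted) sets.
    return (x[:n_context], y[:n_context]), (x[n_context:], y[n_context:])
```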
We further conduct experiments with a very wide MAML (1024 units × 3 hidden layers), but the performance cannot be further improved (Figure 3D). In Table 1, we compare the mean squared error averaged across tasks. MetaFun performs much better than all MAML variants, even though it uses fewer parameters (116,611) than large MAML (132,353).

2A TensorFlow implementation of our model is available at github.com/jinxu06/metafun-tensorflow

Figure 3. [Panels: (A) MetaFun, (B) MAML (40x2), (C) Large MAML (256x3), (D) Very Wide MAML (1024x3).] MetaFun is able to learn smooth updates and recover the ground-truth function almost perfectly, while the updates given by MAML are comparatively less smooth, especially for MAML with fewer parameters.

Figure 4. [Panels: (A, C) MetaFun predictions; (B, D) GP ground truth.] Predictive uncertainties for MetaFun match those of the oracle GP very closely in both the 5-shot and 15-shot cases. The model is trained on random context sizes ranging from 1 to 20.

Predictive uncertainties. As another simple regression example, we demonstrate that MetaFun, like CNP, can produce good predictive uncertainties. We use synthetic data generated by a GP with an RBF kernel and Gaussian observation noise (µ = 0, σ = 0.1), and our decoder produces both predictive means and variances. As in Kim et al. (2019), we found that MetaFun-DFP can produce somewhat piecewise-constant mean predictions, which is less appealing in this situation. On the other hand, MetaFun-KFP (with deep kernels) performed much better, as can be seen in Figure 4. We consider the cases of 5 and 15 context points, and compare our predictions to those of the oracle GP. In both cases, our model gave very good predictions.

4.2.
Classification: miniImageNet and tieredImageNet

The miniImageNet dataset (Vinyals et al., 2016) consists of 100 classes selected randomly from the ILSVRC-12 dataset (Russakovsky et al., 2015), and each class contains 600 randomly sampled images. We follow the split of Ravi & Larochelle (2016), in which the dataset is divided into training (64 classes), validation (16 classes), and test (20 classes) meta-sets. The tieredImageNet dataset (Ren et al., 2018) contains a larger subset of the ILSVRC-12 dataset, whose classes are further grouped into 34 higher-level nodes. These nodes are divided into training (20 nodes), validation (6 nodes), and test (8 nodes) meta-sets. This dataset is considered more challenging because the split is near the root of the ImageNet hierarchy (Ren et al., 2018). For both datasets, we use the pre-trained features provided by Rusu et al. (2019). Following the commonly used experimental setting, each few-shot classification task consists of 5 randomly sampled classes from a meta-set. Within each class, we have either 1 example (1-shot) or 5 examples (5-shot) as context, and 15 examples as target. For all experiments, hyperparameters are chosen by training on the training meta-set and comparing target accuracy on the validation meta-set. We conduct randomised hyperparameter search (Bergstra & Bengio, 2012); the search space is given in the Appendix. Then, with the model configured by the chosen hyperparameters, we train on the union of the training and validation meta-sets, and report the final target accuracy on the test meta-set. In Table 2 we compare our approach to other meta-learning methods. The numbers presented are the mean and standard deviation of 5 independent runs. The table demonstrates that our model outperforms the previous state-of-the-art on 1-shot and 5-shot classification tasks for the more challenging tieredImageNet.
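The episode construction described above (5-way tasks, 1 or 5 context examples and 15 target examples per class) can be sketched as follows. Function and argument names are illustrative, not from the paper's code, and the sampler operates on precomputed feature vectors as in the experiments.

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, n_target=15, rng=None):
    """Sample one N-way, k-shot episode from a meta-set of features and labels.
    Each sampled class contributes k_shot context and n_target target examples,
    relabelled 0..n_way-1 within the episode."""
    rng = rng if rng is not None else np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    ctx_x, ctx_y, tgt_x, tgt_y = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + n_target]
        ctx_x.append(features[idx[:k_shot]]);  ctx_y += [new_label] * k_shot
        tgt_x.append(features[idx[k_shot:]]);  tgt_y += [new_label] * n_target
    return (np.concatenate(ctx_x), np.array(ctx_y),
            np.concatenate(tgt_x), np.array(tgt_y))
```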
As for miniImageNet, we note that previous work, such as MetaOptNet-SVM (Lee et al., 2019b), used significant data augmentation to regularise the model and hence achieved superior results. For a fair comparison, we also equipped each model with data augmentation and report accuracy both with and without it. Note, however, that MetaOptNet-SVM (Lee et al., 2019b) uses a different data augmentation scheme involving horizontal flips, random crops, and colour (brightness, contrast, and saturation) jitter, whereas MetaFun, Qiao et al. (2018), and LEO (Rusu et al., 2019) only use image features obtained by averaging the representations of different crops and of their horizontally mirrored versions. In 1-shot settings, MetaFun matches the previous state-of-the-art performance, while in 5-shot settings we obtain significantly better results. Table 2 reports results for both MetaFun-DFP (using dot-product attention) and MetaFun-KFP (using deep kernels). Although both demonstrate state-of-the-art performance, MetaFun-KFP generally outperforms MetaFun-DFP on 5-shot problems, but performs slightly worse on 1-shot problems.

Table 2. Few-shot classification test accuracy.

miniImageNet 5-way
Models                                         1-shot          5-shot
(Without deep residual network feature extraction):
Matching Networks (Vinyals et al., 2016)       43.56 ± 0.84%   55.31 ± 0.73%
Meta-Learner LSTM (Ravi & Larochelle, 2016)    43.44 ± 0.77%   60.60 ± 0.71%
MAML (Finn et al., 2017)                       48.70 ± 1.84%   63.11 ± 0.92%
LLAMA (Grant et al., 2018)                     49.40 ± 1.83%   -
REPTILE (Nichol et al., 2018)                  49.97 ± 0.32%   65.99 ± 0.58%
PLATIPUS (Finn et al., 2018)                   50.13 ± 1.86%   -
(Without data augmentation):
Meta-SGD (Li et al., 2017)                     54.24 ± 0.03%   70.86 ± 0.04%
SNAIL (Mishra et al., 2018)                    55.71 ± 0.99%   68.88 ± 0.92%
Bauer et al. (2017)                            56.30 ± 0.40%   73.90 ± 0.30%
Munkhdalai et al. (2018)                       57.10 ± 0.70%   70.04 ± 0.63%
TADAM (Oreshkin et al., 2018)                  58.50 ± 0.30%   76.70 ± 0.30%
Qiao et al. (2018)                             59.60 ± 0.41%   73.74 ± 0.19%
LEO (Rusu et al., 2019)                        61.76 ± 0.08%   77.59 ± 0.12%
MetaFun-DFP                                    62.12 ± 0.30%   77.78 ± 0.12%
MetaFun-KFP                                    61.16 ± 0.15%   78.20 ± 0.16%
(With data augmentation):
Qiao et al. (2018)                             63.62 ± 0.58%   78.83 ± 0.36%
LEO                                            63.97 ± 0.20%   79.49 ± 0.70%
MetaOptNet-SVM (Lee et al., 2019b)             64.09 ± 0.62%   80.00 ± 0.45%
MetaFun-DFP                                    64.13 ± 0.13%   80.82 ± 0.17%
MetaFun-KFP                                    63.39 ± 0.15%   80.81 ± 0.10%

tieredImageNet 5-way
Models                                         1-shot          5-shot
(Without deep residual network feature extraction):
MAML (Finn et al., 2017)                       51.67 ± 1.81%   70.30 ± 0.08%
Prototypical Nets (Snell et al., 2017)         53.31 ± 0.89%   72.69 ± 0.74%
Relation Net [in Liu et al. (2019)]            54.48 ± 0.93%   71.32 ± 0.78%
Transductive Prop. Nets (Liu et al., 2019)     57.41 ± 0.94%   71.55 ± 0.74%
(With deep residual network feature extraction):
Meta-SGD                                       62.95 ± 0.03%   79.34 ± 0.06%
LEO                                            66.33 ± 0.05%   81.44 ± 0.09%
MetaOptNet-SVM                                 65.81 ± 0.74%   81.75 ± 0.58%
MetaFun-DFP                                    67.72 ± 0.14%   82.81 ± 0.15%
MetaFun-KFP                                    67.27 ± 0.20%   83.28 ± 0.12%

4.3. Ablation Study

As stated in Section 2.2, our model has three learnable components: the local update function, the functional pooling, and the decoder. In this section we explore the effects of using different versions of these components. We also investigate how model performance changes with different numbers of iterations. Table 3 shows that the neural-network-parameterised local update functions described in Section 2.1 consistently outperform gradient-based local update functions, despite the latter having built-in inductive biases. Interestingly, the choice between dot-product attention and deep kernels in functional pooling is problem dependent. We found that MetaFun with deep kernels usually performs better than MetaFun with dot-product attention on 5-shot classification tasks, but worse on 1-shot tasks. We conjecture that the deep kernel is better able to fuse the information across the 5 images per class compared to attention. In the comparative experiments in Section 4.2 we reported results on both.
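The two functional-pooling variants compared here can be contrasted schematically. The sketch below is an illustration rather than our exact parameterisation: it computes softmax-normalised dot-product attention weights in the style of Vaswani et al. (2017), and a squared-exponential kernel on transformed inputs in the style of deep kernel learning (Wilson et al., 2016), where `transform` stands in for a learned feature map.

```python
import numpy as np

def dot_product_attention_weights(q, k, scale=None):
    """Row-normalised dot-product similarities between queries q [m, d]
    and keys k [n, d], as in scaled dot-product attention."""
    scale = scale if scale is not None else np.sqrt(q.shape[-1])
    logits = q @ k.T / scale
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)      # rows sum to 1

def deep_se_kernel(q, k, transform=lambda x: x, lengthscale=1.0):
    """Squared-exponential kernel on transformed inputs. With the identity
    transform this reduces to a plain SE kernel; a learned transform makes
    it a 'deep' kernel."""
    a_q, a_k = transform(q), transform(k)
    sq_dists = ((a_q[:, None, :] - a_k[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))
```

The attention weights are normalised over the context, whereas the kernel values are not; this difference in how context points compete for influence is one plausible source of the 1-shot versus 5-shot behaviour discussed above.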
In addition, we investigate how a simple squared exponential (SE) kernel performs on these few-shot classification tasks. This corresponds to using an identity input transformation function a in the deep kernels. Table 3 shows that the SE kernel is consistently worse than the deep kernels, indicating that the heavily parameterised deep kernel is necessary for these problems.

Table 3. Ablation study. We conduct an independent randomised hyperparameter search for each number presented, and report means and standard deviations over 5 independent runs for each.

Functional    Local update   Decoder   miniImageNet                   tieredImageNet
pooling       function                 1-shot         5-shot          1-shot         5-shot
Attention     NN             Yes       62.12 ± 0.30%  77.78 ± 0.12%   67.72 ± 0.14%  82.81 ± 0.15%
Deep Kernel   NN             Yes       61.16 ± 0.15%  78.20 ± 0.16%   67.27 ± 0.20%  83.28 ± 0.12%
Attention     Gradient       Yes       59.63 ± 0.19%  75.84 ± 0.04%   62.55 ± 0.10%  78.18 ± 0.09%
Deep Kernel   Gradient       Yes       59.73 ± 0.21%  76.41 ± 0.14%   65.24 ± 0.11%  80.31 ± 0.16%
SE Kernel     NN             Yes       60.04 ± 0.19%  75.25 ± 0.12%   60.81 ± 0.30%  79.70 ± 0.20%
Deep Kernel   Gradient       No        57.67 ± 0.16%  73.55 ± 0.04%   62.53 ± 0.17%  76.86 ± 0.07%

Figure 5. Target accuracy of our approach for varying numbers of iterations T = 1, ..., 6, over different few-shot learning problems (MetaFun-DFP and MetaFun-KFP on miniImageNet and tieredImageNet, 1-shot and 5-shot). For each problem, we use the same configuration of hyperparameters except for the number of iterations and the choice between attention and deep kernels. Error bars (standard deviations) are obtained by training the same model 5 times with different random seeds.

Next, we looked into directly applying functional gradient descent with a parameterised deep kernel to these tasks. This corresponds to removing the decoder and using deep kernels with the gradient-based local update function (see Section 3).
Unsurprisingly, this did not fare as well, given that it has only one trainable component (the deep kernel) and the updates are applied directly to the predictions rather than to a latent functional representation. Finally, Figure 5 illustrates the effect of using different numbers of iterations T. On all few-shot classification tasks, using multiple iterations (two is often enough) significantly outperforms a single iteration. We also note that this performance gain diminishes as more iterations are added. In Section 4.2 we treated the number of iterations as one of the hyperparameters.

5. Conclusions and Future Work

In this paper, we proposed a novel functional approach to meta-learning called MetaFun. The proposed approach learns to generate a functional task representation and an associated functional update rule, which iteratively improves the task representation directly in function space. We evaluated MetaFun on both few-shot regression and classification tasks, and demonstrated that it matches or exceeds previous state-of-the-art results on the miniImageNet and tieredImageNet few-shot classification tasks. Interesting future research directions include a) exploring a stochastic encoder, and hence working with stochastic functional representations akin to the Neural Process (NP), and b) using local update functions and functional pooling components whose parameters change across iterations instead of being shared, where the added flexibility could lead to further performance gains.

Acknowledgements

We would like to thank Jonathan Schwarz for valuable discussions, and the anonymous reviewers for their feedback. Jin Xu and Yee Whye Teh acknowledge funding from Tencent AI Lab through the Oxford-Tencent Collaboration on Large Scale Machine Learning project. Jean-Francois Ton is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1).
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.

Aronszajn, N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

Bauer, M., Rojas-Carulla, M., Świątkowski, J. B., Schölkopf, B., and Turner, R. E. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

Bloem-Reddy, B. and Teh, Y. W. Probabilistic symmetries and invariant neural networks. Journal of Machine Learning Research, 21(90):1–61, 2020. URL http://jmlr.org/papers/v21/19-322.html.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135, 2017.

Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning.
In Advances in Neural Information Processing Systems, pp. 9516–9527, 2018.

Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. A. Conditional neural processes. In International Conference on Machine Learning, pp. 1690–1699, 2018a.

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S., and Teh, Y. W. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.

Gordon, J., Bruinsma, W. P., Foong, A. Y., Requeima, J., Dubois, Y., and Turner, R. E. Convolutional conditional neural processes. arXiv preprint arXiv:1910.13556, 2019.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2018.

Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. In International Conference on Learning Representations, 2019.

Koch, G. Siamese neural networks for one-shot image recognition. Master's thesis, University of Toronto, 2015.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753, 2019a.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019b.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Liu, Y., Lee, J., Park, M., Kim, S., Yang, E., Hwang, S. J., and Yang, Y. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.

Marino, J., Yue, Y., and Mandt, S. Iterative amortized inference.
In International Conference on Machine Learning, pp. 3403–3412, 2018.

Mason, L., Baxter, J., Bartlett, P. L., Frean, M., et al. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.

Munkhdalai, T., Yuan, X., Mehri, S., and Trischler, A. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning, pp. 3661–3670, 2018.

Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In International Conference on Learning Representations, 2019.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Oreshkin, B., López, P. R., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731, 2018.

Qiao, S., Liu, C., Shen, W., and Yuille, A. L. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238, 2018.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.

Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Rusu, A.
A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Wagstaff, E., Fuchs, F. B., Engelcke, M., Posner, I., and Osborne, M. On the limitations of representing functions on sets. arXiv preprint arXiv:1901.09006, 2019.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016.

Y. Guo, P. Bartlett, A. S., and Williamson, R. C. Norm-based regularization of boosting. Submitted to Journal of Machine Learning Research, 2001.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.