Published as a conference paper at ICLR 2021

META-LEARNING WITH NEURAL TANGENT KERNELS

Yufan Zhou, Zhenyi Wang, Jiayi Xian, Changyou Chen, Jinhui Xu
Department of Computer Science and Engineering, State University of New York at Buffalo
{yufanzho,zhenyiwa,jxian,changyou,jinhui}@buffalo.edu

ABSTRACT

Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability of fast adapting to new tasks. However, as a double-looped optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory. Extensive experimental studies demonstrate the advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. Another interesting feature of our proposed methods is that they are more robust to adversarial attacks and out-of-distribution adaptation than popular baselines, as demonstrated in our experiments.

1 INTRODUCTION

Meta-learning (Schmidhuber, 1987) has made tremendous progress in the last few years. It aims to learn abstract knowledge from many related tasks so that fast adaptation to new and unseen tasks becomes possible. For example, in few-shot learning, meta-learning corresponds to learning a meta-model or meta-parameters that can quickly adapt to new tasks with a limited number of data samples. Among all existing meta-learning methods, Model Agnostic Meta-Learning (MAML) (Finn et al., 2017) is perhaps one of the most popular and flexible ones, with a number of follow-up works such as (Nichol et al., 2018; Finn et al., 2018; Yao et al., 2019; Khodak et al., 2019a;b; Denevi et al., 2019; Fallah et al., 2020; Lee et al., 2020; Tripuraneni et al., 2020). MAML adopts a double-looped optimization framework, where adaptation is achieved by one or several gradient-descent steps in the inner-loop optimization. Such a framework can lead to undesirable issues related to computational inefficiency and sub-optimal solutions. The main reasons are that 1) it is computationally expensive to back-propagate through a stochastic-gradient-descent chain, and 2) it is hard to tune the number of adaptation steps in the inner loop, as it can differ between training and testing. Several previous works tried to address these issues, but they can only alleviate them to a certain extent. For example, first-order MAML (FOMAML) (Finn et al., 2017) ignores the high-order terms of the standard MAML, which can speed up training but may lead to deteriorated performance; MAML with Implicit Gradient (iMAML) (Rajeswaran et al., 2019) directly minimizes the objective of the outer loop without performing the inner-loop optimization, but it still needs an iterative solver to estimate the meta-gradient.
To better address these issues, we propose two algorithms that generalize meta-learning to the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK) (Jacot et al., 2018). In this RKHS, instead of using parameter adaptation, we propose to perform an implicit function adaptation. To this end, we introduce two algorithms to avoid explicit function adaptation: one replaces the function adaptation step in the inner loop with a new meta-objective with a fast-adaptive regularizer inspired by MAML; the other solves the adaptation problem analytically based on tools from NTK so that the meta-objective can be directly evaluated on samples in closed form. When restricting the function space to be an RKHS, the solutions to the proposed two algorithms become conveniently solvable. In addition, we provide theoretical analysis of our proposed algorithms in the cases of using fully-connected neural networks and convolutional neural networks as the meta-model. Our analysis shows close connections between our methods and existing ones. In particular, we prove that one of our algorithms is closely related to MAML with some high-order terms ignored in the meta-objective function, thus endowing effective optimization.

(The first two authors contributed equally. Correspondence to Changyou Chen (changyou@buffalo.edu). The research of the first and fifth authors was supported in part by NSF through grants CCF-1716400 and IIS-1910492.)

In summary, our main contributions are:

- We re-analyze the meta-learning problem and introduce two new algorithms for meta-learning in RKHS. Different from all existing meta-learning algorithms, our proposed methods can be solved efficiently without cumbersome chain-based adaptations.
- We conduct theoretical analysis of the proposed algorithms, which suggests that they are closely related to the existing MAML methods when fully-connected neural networks and convolutional neural networks are used as the meta-model.
- We conduct extensive experiments to validate our algorithms. Experimental results indicate the effectiveness of our proposed methods on standard few-shot learning, robustness to adversarial attacks, and out-of-distribution adaptation.

2 PRELIMINARIES

2.1 META-LEARNING

Meta-learning can be roughly categorized into black-box adaptation methods (Andrychowicz et al., 2016; Graves et al., 2014; Mishra et al., 2018), optimization-based methods (Finn et al., 2017), non-parametric methods (Vinyals et al., 2016; Snell et al., 2017; Triantafillou et al., 2020) and Bayesian meta-learning methods (Finn et al., 2018; Yoon et al., 2018; Ravi & Beatson, 2019). In this paper, we focus on the framework of Model Agnostic Meta-Learning (MAML) (Finn et al., 2017), which has two key components: meta initialization and fast adaptation. Specifically, MAML solves the meta-learning problem through a double-looped optimization procedure. In the inner loop, MAML runs a task-specific adaptation procedure to transform a meta-parameter, $\theta$, into task-specific parameters, $\{\phi_m\}_{m=1}^B$, for a total of $B$ different tasks. In the outer loop, MAML minimizes the total loss $\sum_{m=1}^B \mathcal{L}(f_{\phi_m})$ with respect to the meta-parameter $\theta$, where $f_{\phi_m}$ is the model adapted on task $m$, typically represented by a deep neural network. It is worth noting that one potential problem in MAML is computing the meta-gradient $\nabla_\theta \sum_{m=1}^B \mathcal{L}(f_{\phi_m})$: it requires one to differentiate through the whole inner-loop optimization path, which can be very inefficient.
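To make the double loop and its cost concrete, the following is a minimal PyTorch sketch of one MAML outer-loop step (our illustration, not code from the paper; the `forward(params, x)` function and the task iterator are assumed placeholders):

```python
import torch

def maml_meta_step(model_params, tasks, alpha=0.01, k=1):
    """One outer-loop step of MAML (a minimal sketch, not the authors' code).

    `tasks` yields (x_tr, y_tr, x_te, y_te); `model_params` is a list of
    tensors with requires_grad=True; `forward(params, x)` is an assumed
    placeholder for the network.
    """
    meta_loss = 0.0
    for x_tr, y_tr, x_te, y_te in tasks:
        phi = [p for p in model_params]              # phi_0 = theta
        for _ in range(k):                           # inner loop: k adaptation steps
            loss = ((forward(phi, x_tr) - y_tr) ** 2).mean()
            grads = torch.autograd.grad(loss, phi, create_graph=True)
            # create_graph=True keeps the SGD chain alive so the outer loss can
            # back-propagate through the whole adaptation path -- exactly the
            # computational burden discussed above.
            phi = [p - alpha * g for p, g in zip(phi, grads)]
        meta_loss = meta_loss + ((forward(phi, x_te) - y_te) ** 2).mean()
    meta_loss.backward()                             # meta-gradient w.r.t. theta
```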
2.2 GRADIENT FLOW

Our proposed method relies on the concept of gradient flow. Generally speaking, gradient flow is a continuous-time version of gradient descent. In a finite-dimensional parameter space, a gradient flow is defined by an ordinary differential equation (ODE), $d\theta^t/dt = -\nabla_{\theta^t} F(\theta^t)$, with a starting point $\theta^0$ and a function $F: \mathbb{R}^d \rightarrow \mathbb{R}$. Gradient flow is also known as the steepest descent curve. One can generalize gradient flows to infinite-dimensional function spaces. Specifically, given a function space $\mathcal{H}$, a functional $F: \mathcal{H} \rightarrow \mathbb{R}$, and a starting point $f^0 \in \mathcal{H}$, a gradient flow is similarly defined as the solution of $df^t/dt = -\nabla_{f^t} F(f^t)$. This is a curve in the function space $\mathcal{H}$. In this paper, we use the notation $\nabla_{f^t} F(f^t)$, instead of $\nabla_{\mathcal{H}} F(f^t)$, to denote the general functional derivative of the energy functional $F$ with respect to the function $f^t$ (Villani, 2008).

2.3 THE NEURAL TANGENT KERNEL

The Neural Tangent Kernel (NTK) is a recently proposed technique for characterizing the dynamics of a neural network under gradient descent (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019). NTK allows one to analyze deep neural networks (DNNs) in the RKHS induced by the NTK. One immediate benefit is that the loss functional in the function space is often convex, even when it is highly non-convex in the parameter space (Jacot et al., 2018). (Let $\mathcal{H}$ be the function space and $F$ the realization function for a neural network as defined in Section 3.2; note that even if a functional loss (e.g., the L2 loss) $E: \mathcal{H} \rightarrow \mathbb{R}$ is convex on $\mathcal{H}$, the composition $E \circ F$ is in general not.) This property allows one to better understand the behavior of DNNs. Specifically, let $f_\theta$ be a DNN parameterized by $\theta$. The corresponding NTK $\Theta$ is defined as $\Theta(x_1, x_2) = \nabla_\theta f_\theta(x_1)\, \nabla_\theta f_\theta(x_2)^\top$, where $x_1, x_2$ are two data points. In our paper, we will define meta-learning on the RKHS induced by such a kernel.
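To make the empirical NTK concrete, here is a small sketch that evaluates $\Theta(x_1, x_2)$ by stacking per-example gradients (our illustration; `f(params, x)` is an assumed scalar-output network, and this loop-based Jacobian is didactic rather than efficient):

```python
import torch

def empirical_ntk(f, params, x1, x2):
    """Empirical NTK: Theta(x1, x2)[i, j] = <grad_theta f(x1_i), grad_theta f(x2_j)>.

    `f(params, x)` is an assumed scalar-output network; per-example Jacobian
    rows are built one by one, so this is a didactic sketch only.
    """
    def jacobian_rows(x):
        rows = []
        for i in range(x.shape[0]):
            out = f(params, x[i:i + 1]).squeeze()
            grads = torch.autograd.grad(out, params)   # fresh graph per example
            rows.append(torch.cat([g.reshape(-1) for g in grads]))
        return torch.stack(rows)                       # (n, P) Jacobian
    return jacobian_rows(x1) @ jacobian_rows(x2).T     # (n1, n2) kernel matrix
```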
3 META-LEARNING IN RKHS

We first define the meta-learning problem in a general function space, and then restrict the function space to be an RKHS, where two frameworks will be proposed to make meta-learning feasible in RKHS, along with some theoretical analysis. For simplicity, in the following we will hide the superscript time $t$ unless necessary, e.g., when the analysis involves time-changing quantities.

3.1 META-LEARNING IN FUNCTION SPACE

Given a function space $\mathcal{H}$, a distribution of tasks $P(\mathcal{T})$, and a loss function $\mathcal{L}$, the goal of meta-learning is to find a meta function $f \in \mathcal{H}$ that performs well after simple adaptation on a specific task. Let $D^{tr}_m$ and $D^{test}_m$ be the training and testing sets, respectively, sampled from the data distribution of task $\mathcal{T}_m$. The meta-learning problem on the function space $\mathcal{H}$ is defined as:

$$f^* = \arg\min_{f \in \mathcal{H}} E(f), \quad \text{with } E(f) = \mathbb{E}_{\mathcal{T}_m}\left[\mathcal{L}\left(\text{Adapt}(f, D^{tr}_m),\, D^{test}_m\right)\right], \tag{1}$$

where Adapt denotes some adaptation algorithm, e.g., several steps of gradient descent; $E: \mathcal{H} \rightarrow \mathbb{R}$ is called the energy functional, which is used to evaluate the model represented by the function $f$. In theory, solving equation 1 is equivalent to solving the gradient flow equation $df^t/dt = -\nabla_{f^t} E(f^t)$. However, solving the gradient flow equation is generally infeasible, since i) it is hard to directly apply optimization methods in function space and ii) the energy functional $E$ contains an adaptation algorithm Adapt, making the functional gradient infeasible to compute. Thus, a better way is to design a special energy functional that can be directly optimized without running the specific adaptation algorithm. In the following, we first specify the functional meta-learning problem in RKHS, and then propose two methods to derive efficient solutions for the problem.

3.2 META-LEARNING IN RKHS

We consider a function $f$ that is parameterized by $\theta \in \mathbb{R}^P$, denoted as $f_\theta$, with $P$ being the number of parameters. Define a realization function $F: \mathbb{R}^P \rightarrow \mathcal{H}$ that maps parameters to a function. With these, we can define an energy function in the parameter space as $E \equiv E \circ F: \mathbb{R}^P \rightarrow \mathbb{R}$, with $\circ$ being the composition operator. Consequently, with an initialization $\theta^0$, we can define the gradient flow of $E(\theta^t)$ in parameter space as $d\theta^t/dt = -\nabla_{\theta^t} E(\theta^t)$. In the following, we first establish an equivalence between the gradient flow in RKHS and the gradient flow in the parameter space. We then propose two algorithms for meta-learning in the RKHS induced by the NTK.

Theorem 1 Let $\mathcal{H}$ be the RKHS induced by the NTK $\Theta$ of $f_\theta$. With $f^0 = f_{\theta^0}$, the gradient flow of $E(f^t)$ coincides with the function evolution of $f_{\theta^t}$ driven by the gradient flow of $E(\theta^t)$.

The proof of Theorem 1 relies on the properties of the NTK (Jacot et al., 2018), and is provided in the Appendix. Theorem 1 serves as a foundation of our proposed methods: it indicates that solving the meta-learning problem in RKHS can be done by appropriate manipulations in the parameter space. In the following, we describe two different approaches, termed Meta-RKHS-I and Meta-RKHS-II, respectively.

3.3 META-RKHS-I: META-LEARNING IN RKHS WITHOUT ADAPTATION

Our goal is to design an energy functional that has no adaptation component, but is capable of achieving fast adaptation. For this purpose, we first introduce two definitions: the empirical loss function $\mathcal{L}(f_\theta, D_m)$ and the expected loss function $L_m(f_\theta)$. Let $D_m = \{x_{m,i}, y_{m,i}\}_{i=1}^n$ be a set containing the data of a regression task $\mathcal{T}_m$. The empirical loss function and the expected loss function are defined as:

$$\mathcal{L}(f_\theta, D_m) = \frac{1}{n}\sum_{i=1}^{n} \left\|f_\theta(x_{m,i}) - y_{m,i}\right\|^2, \quad L_m(f_\theta) = \mathbb{E}_{x_m, y_m}\left[\left\|f_\theta(x_m) - y_m\right\|^2\right].$$

Our idea is to define a regularized functional such that it endows the ability of fast adaptation in RKHS. Our solution is based on a property of the standard MAML. We start by analyzing the meta-objective of MAML with a $k$-step gradient-descent adaptation, i.e., applying $k$ gradient-descent steps in the inner loop. The objective can be formulated as

$$\theta^* = \arg\min_\theta \mathbb{E}_{\mathcal{T}_m}\left[\mathcal{L}(f_\phi, D^{test}_m)\right], \quad \text{with } \phi = \theta - \alpha \sum_{i=0}^{k-1} \nabla_{\theta_i} \mathcal{L}(f_{\theta_i}, D^{tr}_m),$$

where $\alpha$ is the learning rate of the inner loop, $\theta_0 = \theta$, and $\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} \mathcal{L}(f_{\theta_i}, D^{tr}_m)$. By Taylor expansion, we have

$$\mathbb{E}_{\mathcal{T}_m}\left[\mathcal{L}(f_\phi, D^{test}_m)\right] \approx \mathbb{E}_{\mathcal{T}_m}\left[\mathcal{L}(f_\theta, D^{test}_m) - \alpha \sum_{i=0}^{k-1} \nabla_{\theta_i} \mathcal{L}(f_{\theta_i}, D^{tr}_m)\, \nabla_\theta \mathcal{L}(f_\theta, D^{test}_m)^\top\right]. \tag{2}$$

Since $D^{tr}_m$ and $D^{test}_m$ come from the same distribution, equation 2 is an unbiased estimator of

$$M_k = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta) - \sum_{i=0}^{k-1} \beta_i\right], \quad \text{where } \beta_i = \alpha\, \nabla_{\theta_i} L_m(f_{\theta_i})\, \nabla_\theta L_m(f_\theta)^\top. \tag{3}$$

We focus on the case of $k = 1$, which is $M_1 = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta)\right] - \alpha\, \mathbb{E}_{\mathcal{T}_m}\left[\left\|\nabla_\theta L_m(f_\theta)\right\|^2\right]$. The first term on the right-hand side is the traditional multi-task loss evaluated at $\theta$ for all tasks. The second term corresponds to the negative gradient norm; minimizing it means choosing a $\theta$ with the maximum gradient norm. Intuitively, when $\theta$ is not a stationary point of a task, one should choose the steepest descent direction to reduce the loss maximally for that task, thus leading to fast adaptation.
The above understanding suggests the following regularized energy functional, $\tilde{E}_\alpha$, for meta-learning in the RKHS induced by the NTK, enabling fast function adaptation:

$$\tilde{E}(\alpha, f_\theta) = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta) - \alpha \left\|\nabla_{f_\theta} L_m(f_\theta)\right\|^2_{\mathcal{H}}\right], \tag{4}$$

where $\|\cdot\|_{\mathcal{H}}$ denotes the functional norm in $\mathcal{H}$, and $\alpha$ is a hyper-parameter. The above objective is inspired by the Taylor expansion of the MAML objective, but is defined in the RKHS induced by the NTK. Its connection with MAML and some function-space properties will be discussed later.

Solving the Function Optimization Problem To minimize equation 4, we first derive Theorem 2 to reduce the function optimization problem to a parameter optimization problem.

Theorem 2 Let $f_\theta$ be a neural network with parameter $\theta$ and $\mathcal{H}$ be the RKHS induced by the NTK $\Theta$ of $f_\theta$. Then, the following hold: $\tilde{E}(\alpha, f_\theta) = M_1$, and $\left\|\nabla_{f_\theta} L_m(f_\theta)\right\|^2_{\mathcal{H}} = \left\|\nabla_\theta L_m(f_\theta)\right\|^2$.

Theorem 2 is crucial to our approach, as it indicates that solving problem 4 is no more difficult than the original parameter-based MAML, although it only considers the one-step adaptation case. Next, we show that multi-step adaptation in the parameter space can also be well approximated by the objective in equation 4, but with a scaled regularization parameter $\alpha$. In the following, we consider the squared loss $\mathcal{L}$; the case of the cross-entropy loss is discussed in the Appendix. We assume that $f_\theta$ is parameterized by either a fully-connected or a convolutional neural network, and only consider the impact of the number of hidden layers $L$ in our theoretical results.

Theorem 3 Let $f_\theta$ be a fully-connected neural network with $L$ hidden layers and ReLU activation function, $s_1, \ldots, s_{L+1}$ be the spectral norms of the weight matrices, $s = \max_h s_h$, and $\alpha$ be the learning rate of gradient descent. If $\alpha \leq O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s)$, then the following holds:

$$\left|M_k - \tilde{E}(k\alpha, f_\theta)\right| \leq O\left(\frac{1}{L}\right).$$

Theorem 4 Let $f_\theta$ be a convolutional neural network with $L - l$ convolutional layers, $l$ fully-connected layers, and ReLU activation function, and let $d_x$ be the input dimension. Denote by $W^h$ the parameter vector of the convolutional layer for $h \leq L - l$, and the weight matrices of the fully-connected layers for $L - l + 1 \leq h \leq L + 1$. ($\|\cdot\|_2$ means both the spectral norm of a matrix and the Euclidean norm of a vector; for ease of later notation, we write the gradient $\nabla_{\theta_i} \mathcal{L}$, and thus the parameter as well, as a row vector.) Define $s_h = \sqrt{d_x}\, \|W^h\|_2$ if $h = 1, \ldots, L - l$, and $s_h = \|W^h\|_2$ if $L - l + 1 \leq h \leq L + 1$. Let $s = \max_h s_h$ and $\alpha$ be the learning rate of gradient descent. If $\alpha \leq O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s)$, the following holds:

$$\left|M_k - \tilde{E}(k\alpha, f_\theta)\right| \leq O\left(\frac{1}{L}\right).$$

The above theorems indicate that, for a meta-model with fully-connected and convolutional layers, the proposed Meta-RKHS-I can be an efficient approximation of MAML with a bounded error.

Comparisons with Reptile and MAML Similar to Reptile and MAML, the testing stage of Meta-RKHS-I also requires gradient-based adaptation on meta-test tasks. By Theorem 1, we know that the gradient flow of an energy functional can be approximated by gradient descent in the parameter space. Reptile with 1-step adaptation (Nichol et al., 2018) is equivalent to an approximation of the gradient flow of $\tilde{E}(\alpha, f_\theta)$ with $\alpha = 0$, which does not include the fast-adaptation regularization of our method. For a fairer comparison of efficiency, we discuss the computational complexity later.
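To illustrate how equation 4 can be optimized using Theorem 2's equivalence, here is a hedged PyTorch sketch of one Meta-RKHS-I meta-step (our illustration, not the authors' code; `forward(params, x)` is an assumed placeholder). The functional norm $\|\nabla_{f_\theta} L_m\|^2_{\mathcal{H}}$ is computed as the parameter-space gradient norm $\|\nabla_\theta L_m\|^2$:

```python
import torch

def meta_rkhs1_step(model_params, tasks, alpha=0.01):
    """One meta-step of Meta-RKHS-I (a sketch under Theorem 2's equivalence)."""
    energy = 0.0
    for x, y in tasks:
        task_loss = ((forward(model_params, x) - y) ** 2).mean()
        grads = torch.autograd.grad(task_loss, model_params, create_graph=True)
        grad_norm_sq = sum((g ** 2).sum() for g in grads)
        # subtracting the gradient-norm term encourages a theta from which
        # a single gradient step reduces the task loss maximally (equation 4)
        energy = energy + task_loss - alpha * grad_norm_sq
    energy.backward()   # one double-backprop, no unrolled inner-loop chain
```

Note that a single double-backprop replaces the unrolled $k$-step adaptation chain of MAML, which is the source of the efficiency gain discussed later.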
From the equivalent parameter-optimization form indicated in Theorem 2, we know that our energy functional $\tilde{E}(\alpha, f_\theta)$ is closely related to MAML. However, with this form, our method does not need the explicit adaptation steps in training (i.e., the inner loop of MAML), leading to a simpler optimization problem. We will show that our proposed method leads to better results.

3.4 META-RKHS-II: META-LEARNING IN RKHS WITH A CLOSED-FORM ADAPTATION

In this section, we present our second solution for meta-learning in RKHS by deriving a closed-form adaptation function, i.e., we focus on a case where $\text{Adapt}(f, D^{tr}_m)$ is analytically solvable using the theory of NTK. Specifically, we are given a loss function $\mathcal{L}$ and tasks $\mathcal{T}_m$ with a randomly split training set $D^{tr}_m = \{x^{tr}_{m,i}, y^{tr}_{m,i}\}_{i=1}^n$ and testing set $D^{test}_m$. Let $\theta^t_m$ and $f^t_{m,\theta}$ denote the parameters and the corresponding function at time $t$, adapted for task $\mathcal{T}_m$ from the meta-parameter $\theta$ and meta-function $f_\theta$, respectively. From the NTK theory (Jacot et al., 2018; Arora et al., 2019; Lee et al., 2019), we can write the parameter/function evolution as:

$$\frac{d\theta^t_m}{dt} = -\nabla_{\theta^t_m}\mathcal{L}(f^t_{m,\theta}, D^{tr}_m), \quad \frac{df^t_{m,\theta}}{dt} = \frac{d\theta^t_m}{dt}\, \nabla_{\theta^t_m} f^t_{m,\theta} = -\sum_{i=1}^{n} \frac{\partial \mathcal{L}(f^t_{m,\theta}, D^{tr}_m)}{\partial f^t_{m,\theta}(x^{tr}_{m,i})}\, \Theta(x^{tr}_{m,i}, \cdot).$$

The above differential equation corresponds to the adaptation step, i.e., how to adapt the meta parameter/function for task $m$. By the NTK theory, we can show that this admits closed-form solutions; in our meta-learning setting, this means that no explicit adaptation steps are necessary. To see why, we first investigate the regression case, where the loss function $\mathcal{L}$ is the squared loss. Let $x \in D^{test}_m$ be a test data point. As shown in Arora et al. (2019); Lee et al. (2019), with a large enough neural network we can safely assume that the NTK will not change much during training. In this case, we have a closed-form solution for $f^t_{m,\theta}$:

$$f^t_{m,\theta}(x) = f_\theta(x) + H(x, X^{tr}_m)\, H^{-1}(X^{tr}_m, X^{tr}_m)\left(e^{-t\, H(X^{tr}_m, X^{tr}_m)} - I\right)\left(f_\theta(X^{tr}_m) - Y^{tr}\right), \tag{5}$$

where $e^{(\cdot)}$ is the matrix exponential map, which can be approximated by the Padé approximation (Arioli et al., 1996); $H(X^{tr}_m, X^{tr}_m)$ is an $n \times n$ kernel matrix with its $(i, j)$ element being $\Theta(x_{m,i}, x_{m,j})$; $H(x, X^{tr}_m)$ is a $1 \times n$ vector with its $i$-th element being $\Theta(x, x_{m,i})$; $f_\theta(X^{tr}_m) \in \mathbb{R}^n$ contains the predictions on all training data at initialization; and $Y^{tr} \in \mathbb{R}^n$ contains the target values of the training data. In particular, at time $t = \infty$, we have

$$f^\infty_{m,\theta}(x) = f_\theta(x) + H(x, X^{tr}_m)\, H^{-1}(X^{tr}_m, X^{tr}_m)\left(Y^{tr} - f_\theta(X^{tr}_m)\right). \tag{6}$$

The above results allow us to directly define an energy functional by substituting $\text{Adapt}(f, D^{tr}_m)$ in equation 1 with the closed-form solution $f^t_{m,\theta}$. In other words, our new energy functional is

$$E(t, f_\theta) = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f^t_{m,\theta})\right], \tag{7}$$

where $f^t_{m,\theta}$ is defined in equation 5, and $L_m(f^t_{m,\theta})$ is the expectation of $\mathcal{L}(f^t_{m,\theta}, D^{test}_m)$. For classification problems, we follow the same strategy as in Arora et al. (2019) to extend regression to classification. More details can be found in the Appendix, including the algorithm in Appendix A.

Table 1: Running time comparison per iteration, with $C_1 = d_x p + L p^2$ and $C_2 = d_x p + L d_x p^2$.

|                 | FOMAML       | Reptile   | Meta-RKHS-I | Meta-RKHS-II |
|-----------------|--------------|-----------|-------------|--------------|
| Fully-connected | O(n(k+1)C1)  | O(nkC1)   | O(nC1)      | O(nC1 + n³)  |
| Convolutional   | O(n(k+1)C2)  | O(nkC2)   | O(nC2)      | O(nC2 + n³)  |
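As a concrete illustration of the closed-form adaptation in equations 5 and 6, here is a minimal PyTorch sketch (our illustration; the kernel matrices are assumed precomputed, e.g., with an empirical-NTK routine like the one in Section 2.3):

```python
import torch

def ntk_adapt_predict(f0_x, f0_Xtr, y_tr, H_xX, H_XX, t=None):
    """Closed-form NTK adaptation of equations 5 and 6 (a sketch).

    f0_x: (m,) meta-model predictions at test points; f0_Xtr, y_tr: (n,)
    predictions/targets on the support set; H_xX: (m, n) and H_XX: (n, n)
    empirical-NTK matrices. t=None gives the t = infinity solution.
    """
    if t is None:                                    # equation 6
        return f0_x + H_xX @ torch.linalg.solve(H_XX, y_tr - f0_Xtr)
    # finite time t, equation 5; here we use the exact matrix exponential
    # (the paper approximates it with a Pade approximation instead)
    decay = torch.matrix_exp(-t * H_XX)
    I = torch.eye(H_XX.shape[0], dtype=H_XX.dtype)
    return f0_x + H_xX @ torch.linalg.solve(H_XX, (decay - I) @ (f0_Xtr - y_tr))
```

At $t \to \infty$ the exponential term vanishes and the two branches coincide, matching equation 6.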
On Potential Robustness of Meta-RKHS-II Our extensive empirical studies show that Meta-RKHS-II is a more robust model than related baselines. We provide an intuitive explanation of this potential robustness, as we find current theories of both robust machine learning and NTK insufficient for a formal explanation. Our explanation is based on properties of both the meta-learning framework and NTK: 1) Strong initialization (meta-model): for NTK to generalize well, we argue that it is necessary to start the model from a good initialization. This is automatically achieved in our meta-learning setting, where the meta-model serves as the initialization for NTK predictions. This is supported by recent research (Fort et al., 2020), which shows that there is a chaotic stage in NTK prediction with finite neural networks, and that the NTK regime is reachable with a good initialization. 2) Low-complexity classification boundary: it is known that NTK is a linear model in the NTK regime. Intuitively, generating adversarial samples against a less complex model should be relatively harder, because there is less data in the vicinity of the decision boundary compared to a more complex model, making the probability of a successful attack smaller. Thus we argue that our model can be more robust than standard meta-learning models. 3) Finite-time adaptation: our NTK-based model can adapt with different time steps, and finite time steps can be more robust to adversarial attacks than the infinite-time limit, partly due to the complexity of back-propagating gradients. We note that each individual factor might not be enough to ensure robustness; rather, we argue it is the combined effect of these factors that leads to the robustness of our model. A formal analysis is out of the scope of this paper and left for future work.

Connection with Meta-RKHS-I The proposed two methods choose different strategies to avoid explicit adaptation in meta-learning, and so seem to be two very different algorithms. We prove the theorem below, which indicates that the difference between the underlying gradient flows of the two algorithms indeed increases w.r.t. both $T$ and the depth $L$ of a DNN (we only consider the impacts of $T$ and $L$).

Theorem 5 Let $f_\theta$ be a neural network with $L$ hidden layers, with each layer being either fully-connected or convolutional. Assume that $L < \infty$. Then, $\text{error}(T) = |\tilde{E}(T, f_\theta) - E(T, f_\theta)|$ is a non-decreasing function of $T$. Furthermore, for arbitrary $T > 0$ we have $\text{error}(T) \leq O\left(T^{2L+3}\right)$.

Actually, Meta-RKHS-II implicitly contains a functional-gradient-norm term, because

$$E(T, f_\theta) = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta) - \int_0^T \left\|\nabla_{\theta^t} L_m(f^t_{m,\theta})\right\|^2 dt\right].$$

The difference from Meta-RKHS-I mainly comes from the fact that Meta-RKHS-I can be regarded as an approximation of time-discrete adaptation, while Meta-RKHS-II is based on time-continuous adaptation. In our experiments, we observe that Meta-RKHS-I is as fast as FOMAML, which means that it is more computationally efficient than the standard MAML; meanwhile, Meta-RKHS-II is the more robust model in the tasks of adversarial attack and out-of-distribution adaptation.

Connection with iMAML Our proposed method is similar to the iMAML algorithm (Rajeswaran et al., 2019) in the sense that both methods try to solve meta-learning without differentiating through the optimization path. Different from iMAML, which still relies on an iterative solver, our method only needs to solve a simpler optimization problem thanks to the closed-form adaptation.

3.5 TIME COMPLEXITY ANALYSIS

We compare the time complexity of our proposed methods with that of other first-order meta-learning methods.
Without loss of generality, we analyze the complexity for an $L$-layer MLP or an $L$-layer convolutional neural network. Recall that $d_x$ is the input dimension. Assume each layer has width (filter number) $O(p)$. Let $n$ be the data batch size and $k$ the number of adaptation steps in the inner-loop optimization. We summarize the time complexity in Table 1, where we simply assume the complexity of multiplying matrices of sizes $a \times b$ and $b \times c$ to be $O(abc)$. Note that in the meta-learning setting, $n$ is typically small, indicating the efficiency of our proposed methods.

[Figure 1: Performance of a randomly initialized network and our methods on sine-wave regression; panels: (a) Random Initialization, (b) Meta-RKHS-I, (c) Meta-RKHS-II. The models before/after adaptation are shown in dotted/dashed lines; the samples used for adaptation are also shown.]

4 EXPERIMENTS

We conduct a set of experiments to evaluate the effectiveness of our proposed methods, including a sine-wave regression toy experiment, few-shot classification, robustness to adversarial attacks, out-of-distribution generalization, and an ablation study. Due to space limits, more results are provided in the Appendix. We compare our models with related baselines, including MAML (Finn et al., 2017), first-order MAML (FOMAML) (Finn et al., 2017), Reptile (Nichol et al., 2018) and iMAML (Rajeswaran et al., 2019). Results are reported as mean and variance over three independent runs.

4.1 REGRESSION

Following Finn et al. (2017); Nichol et al. (2018), we first test our proposed methods on the 1-dimensional sine-wave regression problem. This problem is instructive: a model is trained on many different sine waves with different amplitudes and phases, and tested by adapting the trained model to new sine waves with only a few data points, using a fixed number of gradient-descent steps. Following Finn et al. (2017); Nichol et al. (2018), we use a fully-connected neural network with 2 hidden layers and the ReLU activation function. The results are shown in Figure 1.

4.2 FEW-SHOT IMAGE CLASSIFICATION

For this experiment, we choose two popular datasets adopted for meta-learning: Mini-ImageNet and FC-100 (Oreshkin et al., 2018). The cross-entropy loss is adopted for Meta-RKHS-I, while the squared loss is used for Meta-RKHS-II, following Arora et al. (2019); Novak et al. (2019). Similar to Finn et al. (2017), the model architecture is set to be a four-layer convolutional neural network with ReLU activation. The filter number is set to 32. The Adam optimizer (Kingma & Ba, 2015) is used to minimize the energy functional. The meta batch size is set to 16, and the learning rate is set to 0.01 for Meta-RKHS-II.

Table 2: Few-shot classification results on Mini-ImageNet and FC-100.

| Algorithm    | Mini-ImageNet 5-way 1-shot | Mini-ImageNet 5-way 5-shot | FC-100 5-way 1-shot | FC-100 5-way 5-shot |
|--------------|----------------------------|----------------------------|---------------------|---------------------|
| MAML         | 48.70 ± 1.84%              | 63.11 ± 0.93%              | 38.00 ± 1.95%       | 49.34 ± 0.97%       |
| FOMAML       | 48.07 ± 1.75%              | 63.15 ± 0.91%              | 37.73 ± 1.93%       | 49.05 ± 0.99%       |
| iMAML        | 49.30 ± 1.88%              | 64.89 ± 0.95%              | 38.38 ± 1.70%       | 49.41 ± 0.80%       |
| Reptile      | 49.70 ± 1.83%              | 65.91 ± 0.84%              | 38.40 ± 1.94%       | 50.50 ± 0.87%       |
| Meta-RKHS-I  | 51.10 ± 1.82%              | 66.19 ± 0.80%              | 38.90 ± 1.90%       | 51.47 ± 0.86%       |
| Meta-RKHS-II | 50.53 ± 2.09%              | 65.40 ± 0.91%              | 41.20 ± 2.17%       | 51.36 ± 0.96%       |

The results are shown in Table 2. Note that our results for Reptile differ from those in Nichol et al. (2018) because we re-evaluate it under the same setting as Finn et al. (2017), i.e., 10 steps of adaptation are applied during testing. Our results for iMAML are based on the implementation of Spigler (2019).
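Since test-time adaptation is plain gradient descent on the support set for all compared methods, a minimal sketch of the 10-step evaluation protocol mentioned above may be useful (a hedged illustration; `forward(params, x)` and `loss_fn` are assumed placeholders):

```python
import torch

def test_time_adapt(meta_params, x_sup, y_sup, loss_fn, steps=10, lr=0.01):
    """Test-time adaptation sketch: `steps` plain gradient-descent updates
    on the support set (cf. the 10-step protocol noted above)."""
    phi = [p.detach().clone().requires_grad_(True) for p in meta_params]
    for _ in range(steps):
        loss = loss_fn(forward(phi, x_sup), y_sup)
        grads = torch.autograd.grad(loss, phi)
        # first-order updates: no need to keep the graph at test time
        phi = [(p - lr * g).detach().requires_grad_(True)
               for p, g in zip(phi, grads)]
    return phi
```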
It is observed that our proposed methods achieve better accuracy than the baselines. Interestingly, our Meta-RKHS-I performs better than FOMAML (this is also the case in other experiments), although they share a similar objective. We conjecture this is because Meta-RKHS-I restricts the function to be in an RKHS, making the function space smaller and thus easier to optimize compared to the unrestricted FOMAML. Between our two algorithms, there is not always a winner on all tasks. We note that Meta-RKHS-I is more efficient in training; however, we show below that Meta-RKHS-II is better in terms of robustness to adversarial attacks and out-of-distribution generalization.

4.3 ROBUSTNESS TO ADVERSARIAL ATTACKS

We now compare the adversarial robustness of our methods with other popular baselines. We adopt both white-box and black-box attacks in this experiment. For the white-box attacks, we adopt strong attacks including the PGD attack (Madry et al., 2017), the BPDA attack (Athalye et al., 2018) and the SPSA attack (Uesato et al., 2018).

[Figure 2: Black-box attack (robust accuracy vs. number of queries) on Mini-ImageNet 5-way 1-shot (left), FC-100 5-way 1-shot (middle) and Mini-ImageNet 5-way 5-shot (right), comparing MAML, FOMAML, iMAML, Reptile, Meta-RKHS-I, Meta-RKHS-II, Meta-RKHS-II_t100_PQ1 and Meta-RKHS-II_t100_PQ2.]

[Figure 3: BPDA attack on Mini-ImageNet 5-way 5-shot (left) and FC-100 5-way 5-shot (right).]

[Figure 4: SPSA attack on Mini-ImageNet 5-way 5-shot (left) and FC-100 5-way 5-shot (right).]

For the PGD attack, we use the ℓ∞ norm and compare the results on Mini-ImageNet and FC-100; we compare the robust accuracy under different attack magnitudes, using a 20-step attack with a step size of 2/255. For the BPDA attack, we apply median smoothing, JPEG filtering and bit squeezing, adapted from Guo et al. (2018), as input-transformation defense strategies. For the SPSA attack, we follow Uesato et al. (2018) and set the Adam learning rate to 0.01 and the perturbation size to δ = 0.01. For the black-box attack, we adopt the strong query-efficient attack method of Guo et al. (2019); following their setting, we use a fixed step size of 0.2.

We consider both finite-time and infinite-time adaptation in this experiment. For finite-time adaptation, Padé approximations with P = Q = 1 and P = Q = 2 are considered for approximating the matrix exponential (Butcher & Chipman, 1992). We use Meta-RKHS-II_t100_PQ1 and Meta-RKHS-II_t100_PQ2 to denote the methods using finite time t = 100 with P = Q = 1 or P = Q = 2, respectively. We observe that other finite times t yield similar predictions, so we only consider t = 100.

The results from the black-box attack in Figure 2 indicate the robustness of our Meta-RKHS-II. In fact, the gaps are significantly large, making it the only usefully robust model in the adversarial-attack setting. Our Meta-RKHS-I is not as robust as Meta-RKHS-II, but still slightly outperforms the other baselines. Regarding the white-box attacks, the results in Figures 3, 4 and 5 again show that our proposed Meta-RKHS-II is significantly more robust than the baselines under the three strong attacks. It is also interesting to see that our Meta-RKHS-I performs slightly better than Meta-RKHS-II in some rare cases, e.g., in the Mini-ImageNet 5-way 1-shot case when the attack magnitude is not too small.
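For reference, the ℓ∞ PGD attack in the setting described above (20 steps, step size 2/255) can be sketched as follows (a hedged illustration; `model` is an adapted classifier returning logits, and inputs are assumed to lie in [0, 1]):

```python
import torch

def pgd_attack(model, x, y, eps, steps=20, step_size=2 / 255):
    """l_inf PGD attack sketch matching the setting above."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()            # ascent step
            x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)   # project to l_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # valid pixel range
        x_adv.requires_grad_(True)
    return x_adv.detach()
```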
More results are presented in the Appendix.

[Figure 5: ℓ∞-norm PGD attack on Mini-ImageNet 5-way 5-shot (left), Mini-ImageNet 5-way 1-shot (middle) and FC-100 5-way 5-shot (right), comparing MAML, FOMAML, iMAML, Reptile, Meta-RKHS-I, Meta-RKHS-II, Meta-RKHS-II_t100_PQ1 and Meta-RKHS-II_t100_PQ2.]

Table 4: Meta-RKHS-II with different time t.

| Dataset / Setting          | t = 0.1       | t = 1         | t = 10        | t = 100       | t = ∞         |
|----------------------------|---------------|---------------|---------------|---------------|---------------|
| Mini-ImageNet 5-way 1-shot | 49.67 ± 2.23% | 48.27 ± 2.23% | 50.53 ± 2.09% | 49.13 ± 2.19% | 48.70 ± 2.28% |
| Mini-ImageNet 5-way 5-shot | 64.51 ± 0.93% | 64.28 ± 0.98% | 65.40 ± 0.91% | 64.24 ± 1.06% | 64.95 ± 0.96% |
| FC-100 5-way 1-shot        | 36.50 ± 2.10% | 38.80 ± 2.32% | 41.20 ± 2.17% | 38.80 ± 2.21% | 37.60 ± 2.13% |
| FC-100 5-way 5-shot        | 48.35 ± 1.02% | 49.79 ± 1.04% | 51.36 ± 0.96% | 48.59 ± 1.09% | 49.48 ± 0.98% |

4.4 OUT-OF-DISTRIBUTION GENERALIZATION

We adopt a strategy similar to Lee et al. (2020) to test a model's ability to generalize to out-of-distribution datasets. In this setting, the state of the art is achieved by Bayesian TAML (Lee et al., 2020). Different from their setting, which considers any-shot learning with the maximum number of examples per class being as large as 50, we only focus on standard 1-shot or 5-shot learning; we thus modify their code to accommodate our standard setting. The CUB (Wah et al., 2011) and VGG Flower (Nilsback & Zisserman, 2008) fine-grained datasets are used in this experiment, where all images are resized to 84 × 84. We follow Lee et al. (2020) to split these datasets into meta-training/validation/testing sets. We first train all the methods on the Mini-ImageNet or FC-100 datasets, then conduct meta-testing on the CUB and VGG Flower datasets. The results are shown in Table 3. Again, our methods achieve the best results, with Meta-RKHS-II outperforming the state-of-the-art method, indicating the robustness of our proposed methods. More results are presented in the Appendix.

Table 3: Meta-testing on different out-of-distribution datasets with models trained on Mini-ImageNet.

| Algorithm            | CUB 5-way 1-shot | VGG Flower 5-way 1-shot | CUB 5-way 5-shot | VGG Flower 5-way 5-shot |
|----------------------|------------------|-------------------------|------------------|-------------------------|
| MAML                 | 34.23 ± 1.52%    | 52.98 ± 1.76%           | 52.36 ± 0.94%    | 67.52 ± 1.30%           |
| FOMAML               | 35.32 ± 1.69%    | 53.86 ± 1.64%           | 52.02 ± 0.71%    | 68.83 ± 1.16%           |
| Reptile              | 35.61 ± 1.38%    | 53.57 ± 1.58%           | 51.93 ± 0.89%    | 71.62 ± 1.25%           |
| iMAML                | 40.55 ± 0.61%    | 54.97 ± 0.80%           | 46.31 ± 2.03%    | 60.67 ± 1.91%           |
| Bayesian TAML (SOTA) | 41.57 ± 0.60%    | 58.56 ± 0.66%           | 61.78 ± 0.56%    | 77.95 ± 0.46%           |
| Meta-RKHS-I          | 36.73 ± 1.26%    | 54.79 ± 1.61%           | 54.19 ± 0.73%    | 72.76 ± 1.08%           |
| Meta-RKHS-II         | 45.36 ± 0.87%    | 60.80 ± 1.02%           | 65.21 ± 0.64%    | 78.25 ± 0.49%           |

4.5 ABLATION STUDY

We conduct several ablation studies, including: comparing Reptile with Meta-RKHS-I under different numbers of adaptation steps (results shown in the Appendix), testing the impact of choosing different times t in Meta-RKHS-II (results shown in Table 4), and studying the impact of the network architecture with different numbers of CNN feature channels (results shown in the Appendix). It is interesting to see that a finite time (around t = 10) achieves the best accuracy, although the infinite-time case guarantees a stationary point. This indicates that a stationary point obtained from the limited training data in the adaptation step is not always the best choice, because the model may easily overfit the limited training data, leading to worse test results.

5 CONCLUSION

We develop meta-learning in RKHS, and propose two practical algorithms that allow efficient adaptation in the function space by avoiding the complicated adaptation procedures of traditional methods.
We show connections between our proposed methods and existing ones. Extensive experiments suggest that, compared to popular strong baselines, our methods are more effective, generalize better, and are more robust to adversarial attacks and out-of-distribution shifts.

REFERENCES

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of Machine Learning Research, volume 97, pp. 242-252, Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/v97/allen-zhu19a.html.

Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981-3989, 2016.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.

J. C. Butcher and F. H. Chipman. Generalized Padé approximations to the exponential function. BIT Numerical Mathematics, 32:118-130, 1992.

Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399, 2019.

Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, 2020.

Chelsea Finn and Sergey Levine. Meta-learning: from few-shot learning to rapid reinforcement learning. In ICML 2019 Meta-Learning Tutorial, 2019.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.

Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Advances in Neural Information Processing Systems, 2020.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018.

Chuan Guo, Jacob R. Gardner, Yurong You, Andrew Gordon Wilson, and Kilian Q. Weinberger. Simple black-box adversarial attacks. In International Conference on Machine Learning, 2019.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571-8580, 2018.

Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, 2019a.
Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, 2019b.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Hae Beom Lee, Hayeon Lee, Donghyun Na, Saehoon Kim, Minseop Park, Eunho Yang, and Sung Ju Hwang. Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks. In International Conference on Learning Representations, 2020.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32, pp. 8572-8583. Curran Associates, Inc., 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

M. Arioli, B. Codenotti, and C. Fassino. The Padé method for computing the matrix exponential. Linear Algebra and its Applications, June 1996.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019.

Boris N. Oreshkin, Pau Rodriguez, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 2018.

Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 2019.

Sachin Ravi and Alex Beatson. Amortized Bayesian meta-learning. In International Conference on Learning Representations, 2019.

Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: an overview, 2016.

Jürgen Schmidhuber. Evolutionary principles in self-referential learning. Diploma thesis, Technische Universität München, Germany, 14 May 1987.

Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.

Giacomo Spigler. Meta-learnt priors slow down catastrophic forgetting in neural networks. arXiv preprint arXiv:1909.04170, September 2019.

Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020.

Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.

Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, 2018.
C. Villani. Optimal Transport: Old and New, volume 338. Springer, 2008. doi: 10.1007/978-3-540-71050-9.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. Hierarchically structured meta-learning. In International Conference on Machine Learning, 2019.

Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.

A ALGORITHMS

Our proposed algorithms for meta-learning in the RKHS are summarized in Algorithm 1.

Algorithm 1 Meta-Learning in RKHS
Require: p(T): distribution over tasks; randomly initialized neural network parameters θ.
while not done do
    Sample a batch of tasks {T_m}_{m=1}^B ~ p(T)
    for all T_m do
        Sample a batch of data points D_m, or sample two batches of data points D^tr_m, D^test_m
    end for
    Evaluate the energy functional by equation 4 with {D_m}_{m=1}^B, or evaluate the energy functional by equation 7 with {D^tr_m, D^test_m}_{m=1}^B
    Minimize the energy functional w.r.t. θ
end while

B PROOF OF THEOREM 1

Theorem 1 If $f_\theta$ is a neural network with parameter $\theta \in \mathbb{R}^P$ and $\mathcal{H}$ is the Reproducing Kernel Hilbert Space (RKHS) induced by $\Theta$, where $\Theta$ is the Neural Tangent Kernel (NTK) of $f_\theta$, then with initialization $f^0 = f_{\theta^0}$, the gradient flow of $E(f^t)$ coincides with the function evolution of $f_{\theta^t}$ induced by the gradient flow of $E(\theta^t)$.

Proof Without loss of generality, we can rewrite $E(f) = \mathbb{E}_{\mathcal{T}_m}\{\mathbb{E}_{(x_m, y_m)}[C(f(x_m), y_m)]\}$ for some function $C(\cdot, \cdot)$. For a neural network $f_\theta$ with parameter $\theta \in \mathbb{R}^P$, the gradient flow of $E$ in $\mathbb{R}^P$ is

$$\frac{d\theta^t}{dt} = -\nabla_{\theta^t} E(\theta^t) = -\nabla_{\theta^t}(E \circ F)(\theta^t) = -\mathbb{E}_{\mathcal{T}_m}\left\{\mathbb{E}_{(x_m, y_m)}\left[\frac{\partial C(f_{\theta^t}(x_m), y_m)}{\partial f_{\theta^t}(x_m)}\, \nabla_{\theta^t} f_{\theta^t}(x_m)\right]\right\}.$$

It follows that the dynamics of $f_{\theta^t}$ is

$$\frac{df_{\theta^t}}{dt} = \frac{d\theta^t}{dt}\, \nabla_{\theta^t} f_{\theta^t}(\cdot)^\top = -\mathbb{E}_{\mathcal{T}_m}\left\{\mathbb{E}_{(x_m, y_m)}\left[\frac{\partial C(f_{\theta^t}(x_m), y_m)}{\partial f_{\theta^t}(x_m)}\, \Theta^t(x_m, \cdot)\right]\right\}, \tag{8}$$

where $\Theta^t$ is the Neural Tangent Kernel of the neural network $f_{\theta^t}$ (Jacot et al., 2018). Let $\mathcal{H}^t$ be the Reproducing Kernel Hilbert Space induced by the kernel $\Theta^t$, and let $V_{x_m}: \mathcal{H} \rightarrow \mathbb{R}$ be the evaluation functional at $x_m$, defined as $V_{x_m}(f) = f(x_m)$. Then, for an arbitrary function $g$ and a small perturbation $\epsilon$, we have

$$\langle \nabla_f V_{x_m}(f), g \rangle = \lim_{\epsilon \to 0} \frac{V_{x_m}(f + \epsilon g) - V_{x_m}(f)}{\epsilon} = \lim_{\epsilon \to 0} \frac{f(x_m) + \epsilon g(x_m) - f(x_m)}{\epsilon} = g(x_m) = \langle \Theta^t(x_m, \cdot), g \rangle,$$

so that $\nabla_f V_{x_m}(f) = \Theta^t(x_m, \cdot)$, i.e., $\nabla_f f(x_m) = \Theta^t(x_m, \cdot)$. With an initial function $f^0 = f_{\theta^0} \in \mathcal{H}$, the gradient flow of $E$ in $\mathcal{H}$ is

$$\frac{df^t}{dt} = -\nabla_{f^t} E(f^t) = -\mathbb{E}_{\mathcal{T}_m}\left\{\mathbb{E}_{(x_m, y_m)}\left[\nabla_{f^t} C(f^t(x_m), y_m)\right]\right\} = -\mathbb{E}_{\mathcal{T}_m}\left\{\mathbb{E}_{(x_m, y_m)}\left[\frac{\partial C(f^t(x_m), y_m)}{\partial f^t(x_m)}\, \Theta^t(x_m, \cdot)\right]\right\}. \tag{9}$$

We can complete the proof by comparing equation 8 and equation 9.

C PROOF OF THEOREM 2

Theorem 2 If $f_\theta$ is a neural network with parameter $\theta$ and $\mathcal{H}$ is the Reproducing Kernel Hilbert Space (RKHS) induced by $\Theta$, where $\Theta$ is the Neural Tangent Kernel (NTK) of $f_\theta$, then $M_1 = \tilde{E}(\alpha, f_\theta)$, and $\beta_0 = \alpha \|\nabla_\theta L_m(f_\theta)\|^2 = \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}}$.

Proof Without loss of generality, we rewrite $L_m(f_\theta) = \mathbb{E}_{x_m, y_m}[C(f_\theta(x_m), y_m)]$. In the regression task, we have $C(f_\theta(x_m), y_m) = \frac{1}{2}\|f_\theta(x_m) - y_m\|^2$.
In the classification task, we have $C(f_\theta(x_m), y_m) = -y_m \log(f_\theta(x_m))^\top$, where $\log$ is the element-wise logarithm operation. Then

$$\|\nabla_\theta L_m(f_\theta)\|^2 = \nabla_\theta L_m(f_\theta)\, \nabla_\theta L_m(f_\theta)^\top = \nabla_\theta \mathbb{E}_{x_m, y_m}\left[C(f_\theta(x_m), y_m)\right]\, \nabla_\theta \mathbb{E}_{x'_m, y'_m}\left[C(f_\theta(x'_m), y'_m)\right]^\top$$

$$= \mathbb{E}_{x_m, y_m}\, \mathbb{E}_{x'_m, y'_m}\left[\frac{\partial C(f_\theta(x_m), y_m)}{\partial f_\theta(x_m)}\, \nabla_\theta f_\theta(x_m)\, \nabla_\theta f_\theta(x'_m)^\top\, \frac{\partial C(f_\theta(x'_m), y'_m)}{\partial f_\theta(x'_m)}^\top\right]$$

$$= \mathbb{E}_{x_m, y_m}\, \mathbb{E}_{x'_m, y'_m}\left[\frac{\partial C(f_\theta(x_m), y_m)}{\partial f_\theta(x_m)}\, \Theta(x_m, x'_m)\, \frac{\partial C(f_\theta(x'_m), y'_m)}{\partial f_\theta(x'_m)}^\top\right]$$

$$= \left\langle \mathbb{E}_{x_m, y_m}\left[\frac{\partial C(f_\theta(x_m), y_m)}{\partial f_\theta(x_m)}\, \Theta(x_m, \cdot)\right],\ \mathbb{E}_{x'_m, y'_m}\left[\frac{\partial C(f_\theta(x'_m), y'_m)}{\partial f_\theta(x'_m)}\, \Theta(x'_m, \cdot)\right]\right\rangle_{\mathcal{H}}$$

$$= \left\langle \nabla_{f_\theta} L_m(f_\theta),\, \nabla_{f_\theta} L_m(f_\theta)\right\rangle_{\mathcal{H}} = \left\|\nabla_{f_\theta} L_m(f_\theta)\right\|^2_{\mathcal{H}},$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$. In the above equations, we used the definition of the Neural Tangent Kernel (NTK), the property of the inner product in an RKHS, and the definition of the evaluation functional and its gradient in the RKHS. Recall that $\tilde{E}(\alpha, f_\theta) = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta) - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}}\right]$, where $\beta_i = \alpha\, \nabla_{\theta_i} L_m(f_{\theta_i})\, \nabla_\theta L_m(f_\theta)^\top$ with $\theta_0 = \theta$ and $\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i}\mathcal{L}(f_{\theta_i}, D^{tr}_m)$. The result is now straightforward.

D PROOF OF THEOREM 3

The proof techniques we use are similar to those of previous works such as Arora et al. (2019); Allen-Zhu et al. (2019). We summarize some of the differences. Different from previous works, which typically assume that the neural network is Gaussian initialized, we make no such assumption, as we are trying to learn a good meta-initialization in the meta-learning setting. Previous works investigate the behavior of models during training, while we focus on revealing the connections between different meta-learning algorithms. Finally, previous work focuses on single-task regression/classification problems, while we focus on the meta-learning problem.

Theorem 3 Let $f_\theta$ be a fully-connected neural network with $L$ hidden layers and ReLU activation function, $s_1, \ldots, s_{L+1}$ be the spectral norms of the weight matrices, $s = \max_h s_h$, and $\alpha$ be the learning rate of gradient descent. If $\alpha \leq O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s)$, then the following holds: $\left|\tilde{E}(k\alpha, f_\theta) - M_k\right| \leq O(1/L)$.

Proof We first prove the case of $k = 2$, i.e., applying a two-step gradient-descent adaptation in MAML. We need to prove the following theorem first.

Theorem 6 Let $f_\theta$ be a fully-connected neural network with $L$ hidden layers, and $x$ be a data sample. Represent the neural network by $f_\theta(x) = \sigma(\sigma(\cdots \sigma(x W^1) \cdots W^{L-1}) W^L) W^{L+1}$, where $W^1, \ldots, W^{L+1}$ denote the weight matrices and $\sigma$ is the ReLU activation function. Let $s_1, \ldots, s_{L+1}$ be the spectral norms of the weight matrices, $s = \max_h s_h$, $\alpha$ be the learning rate of gradient descent, $\tilde{\theta}$ be the parameter resulting from one step of gradient descent (with value $f_{\tilde{\theta}}(x)$), and $\|\cdot\|_F$ the Frobenius norm. If $\alpha \leq O(q s^{-L})$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, then $\left\|\nabla_\theta f_{\tilde{\theta}}(x) - \nabla_\theta f_\theta(x)\right\|_F \leq O(1/L)$.

Remark 1 Theorem 6 states that, for a neural network with $L$ hidden layers, if the learning rate of gradient descent is bounded, then the norm of the derivative w.r.t. all the parameters will not change too much, although there are $O(Lm^2)$ parameters, where $m$ denotes the maximum width of the hidden layers. We use row vectors instead of column vectors for consistency; this does not affect our results. For simplicity, we will write $g^h(x)$ as $g^h$. The bias terms of the neural network can be introduced by adding an additional coordinate and are thus omitted in Theorem 6. Without loss of generality, we assume $\|x\| \leq 1$, which can be achieved by data normalization in pre-processing. Let $g^h(x) = \sigma(\sigma(\cdots \sigma(x W^1) \cdots W^{h-1}) W^h)$ be the activation at the $h$-th hidden layer, with $g^0(x) = x$ and $g^{L+1}(x) = f_\theta(x)$.
Define diagonal matrices $D^h$ with $D^h_{(i,i)} = 1\{g^{h-1} W^h \geq 0\}$, and

$$b^h = \begin{cases} I_{d_y}, & \text{if } h = L + 1 \\ b^{h+1}\, (W^{h+1})^\top D^h, & \text{otherwise,} \end{cases}$$

where $I_{d_y}$ is a $d_y \times d_y$ identity matrix. We first prove the following lemma.

Lemma 7 Given a neural network as stated in Theorem 6, let $\|\cdot\|_2$ denote the spectral norm, $\Delta W^h = \tilde{W}^h - W^h$ denote a perturbation of the weight matrices, $\tilde{g}^h(x)$ denote the resulting value after perturbation, and $\Delta g^h(x) = \tilde{g}^h(x) - g^h(x)$. If $s \geq 1$ and $\|\Delta W^h\|_2 \leq O(s^{-L}/L)$ for all $h$, then $\|\Delta g^h\| \leq O\left(\frac{1}{L s^{L-h+1}}\right)$. If $s < 1$ and $\|\Delta W^h\|_2 \leq O(q)$ for all $h$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \max(q, s)$, then

$$\|\Delta g^h\| \leq O(r^{h-1} q) = \begin{cases} O\left(\frac{1}{L s^{L-h+1}}\right), & \text{if } 1/(Ls^L) \leq L^{-1/(L+1)} \\ O\left(L^{-h/(L+1)}\right), & \text{if } 1/(Ls^L) > L^{-1/(L+1)}. \end{cases}$$

Proof The proof of Lemma 7 is based on induction. We first prove the case of $s \geq 1$. Note that $g^0 = x$, so $\|\Delta g^0\| = 0 \leq O\left(\frac{1}{L s^{L+1}}\right)$ always holds. For $\Delta g^1$, by the 1-Lipschitz property of the ReLU activation,

$$\|\Delta g^1\| = \|\sigma(x \tilde{W}^1) - \sigma(x W^1)\| \leq \|x \tilde{W}^1 - x W^1\| \leq \|\Delta W^1\|_2 \leq O\left(\frac{1}{L s^L}\right),$$

so the hypothesis holds for $\Delta g^1$. Now, assume the hypothesis holds for $\Delta g^h$. Then, again using the 1-Lipschitz property of the ReLU activation,

$$\|\Delta g^{h+1}\| = \|\sigma(\tilde{g}^h \tilde{W}^{h+1}) - \sigma(g^h W^{h+1})\| \leq \|\tilde{g}^h \tilde{W}^{h+1} - g^h W^{h+1}\| \leq \|\Delta g^h\| \|W^{h+1}\|_2 + \|g^h\| \|\Delta W^{h+1}\|_2 + \|\Delta g^h\| \|\Delta W^{h+1}\|_2$$

$$\leq O(s)\, O\left(\frac{1}{L s^{L-h+1}}\right) + O(s^h)\, O\left(\frac{1}{L s^L}\right) + O\left(\frac{1}{L s^{L-h+1}}\right) O\left(\frac{1}{L s^L}\right) \leq O\left(\frac{1}{L s^{L-h}}\right).$$

The last inequalities come from the facts that $\|g^h\| = \|\sigma(\sigma(\cdots \sigma(x W^1) \cdots W^{h-1}) W^h)\| \leq O(s^h)$ and $s \geq 1$. Thus, we have proved the lemma in the case $s \geq 1$.

Now, we prove the first part of the case $s < 1$, i.e., $\|\Delta g^h\| \leq O(r^{h-1} q)$. Because $\|\Delta g^0\| = 0$, the hypothesis for $\Delta g^0$ always holds. For $\Delta g^1$, we have $\|\Delta g^1\| = \|\sigma(x \tilde{W}^1) - \sigma(x W^1)\| \leq \|x \tilde{W}^1 - x W^1\| \leq \|\Delta W^1\|_2 \leq O(q)$, so the hypothesis holds for $\Delta g^1$. Now, assume the hypothesis holds for $\Delta g^h$. Then we have

$$\|\Delta g^{h+1}\| \leq \|\Delta g^h\| \|W^{h+1}\|_2 + \|g^h\| \|\Delta W^{h+1}\|_2 + \|\Delta g^h\| \|\Delta W^{h+1}\|_2 \leq O(s)\, O(r^{h-1} q) + O(s^h)\, q + q\, O(r^{h-1} q) \leq O(r^h q),$$

where the last inequality comes from the facts that $r = \max(q, s)$ and $s^h < s < 1$. Next we consider the second part of the case $s < 1$. If $1/(Ls^L) \leq L^{-1/(L+1)}$, we know that $q = 1/(Ls^L)$ and

$$\frac{1}{Ls^L} \leq L^{-1/(L+1)} \iff \frac{L^{1/(L+1)}}{L} \leq s^L \iff L^{-L/(L+1)} \leq s^L,$$

which means $q \leq s$, and thus $r = s$. Then we have $\|\Delta g^h\| = O(r^{h-1} q) = O(s^{h-1} q) = O\left(\frac{s^{h-1}}{Ls^L}\right) = O\left(\frac{1}{L s^{L-h+1}}\right)$. If $1/(Ls^L) > L^{-1/(L+1)}$, we know that $q = L^{-1/(L+1)}$ and $q > s$; then $r = q$, and $\|\Delta g^h\| = O(r^{h-1} q) = O(q^h) = O(L^{-h/(L+1)})$. Thus, Lemma 7 also holds for the case $s < 1$, which completes the proof.

We now prove a similar lemma for $b^h$.

Lemma 8 Given a neural network as stated in Theorem 6, let $\|\cdot\|_2$ denote the spectral norm, $\Delta W^h = \tilde{W}^h - W^h$ denote a perturbation of the weight matrices, $\tilde{b}^h$ denote the resulting value after perturbation, and $\Delta b^h = \tilde{b}^h - b^h$. If $s \geq 1$ and $\|\Delta W^h\|_2 \leq O(s^{-L}/L)$ for all $h$, then $\|\Delta b^h\| \leq O\left(\frac{1}{L s^h}\right)$. If $s < 1$ and $\|\Delta W^h\|_2 \leq O(q)$ for all $h$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, then

$$\|\Delta b^h\| \leq \begin{cases} O\left(\frac{1}{L s^h}\right), & \text{if } 1/(Ls^L) \leq L^{-1/(L+1)} \\ O\left(L^{(h-L-1)/(L+1)}\right), & \text{if } 1/(Ls^L) > L^{-1/(L+1)}. \end{cases}$$

Proof Recall that $b^h = I_{d_y}$ if $h = L + 1$ and $b^h = b^{h+1} (W^{h+1})^\top D^h$ otherwise, where $I_{d_y}$ is a $d_y \times d_y$ identity matrix and $D^h_{(i,i)} = 1\{g^{h-1} W^h \geq 0\}$. It is easy to see that $\|b^h\| \leq O(s^{L-h+1})$, because $\|D^h\|_2 \leq 1$ and $\|W^h\|_2 \leq s$. We first prove the case $s \geq 1$. We know that $\|\Delta b^{L+1}\| = 0 \leq O(s^{-L-1}/L)$ always holds. For $h \leq L$, we can rewrite $b^h$ as $b^h = I_{d_y} (W^{L+1})^\top D^L (W^L)^\top D^{L-1} \cdots (W^{h+1})^\top D^h$. Then we have

$$b^h (g^h)^\top = I_{d_y} (W^{L+1})^\top D^L (W^L)^\top D^{L-1} \cdots (W^{h+1})^\top D^h (g^h)^\top. \tag{10}$$

Because $f_\theta = g^{L+1} = x W^1 D^1 W^2 D^2 \cdots D^L W^{L+1} = g^h W^{h+1} D^{h+1} \cdots D^L W^{L+1}$ and $g^h = g^h D^h$ with $D^h = (D^h)^\top$, we can rewrite equation 10 as $b^h (g^h)^\top = f_\theta^\top$.
Then

$$\tilde{b}^h (\tilde{g}^h)^\top - b^h (g^h)^\top = f_{\tilde{\theta}}^\top - f_\theta^\top = (\Delta g^{L+1})^\top \leq O\left(\frac{1}{L}\right)$$

by Lemma 7. Consequently, we have

$$\tilde{b}^h (\tilde{g}^h)^\top - b^h (g^h)^\top = \Delta b^h (g^h)^\top + b^h (\Delta g^h)^\top + \Delta b^h (\Delta g^h)^\top \leq O\left(\frac{1}{L}\right).$$

Since $\|g^h\| \leq O(s^h)$, $\|\Delta g^h\| \leq O\left(\frac{1}{L s^{L-h+1}}\right)$ and $\|b^h\| \leq O(s^{L-h+1})$ always hold, and since $L \geq 1$ and $s \geq 1$, we simply have $\|\Delta b^h\| \leq O\left(\frac{1}{L s^h}\right)$.

Now, we prove the case $s < 1$. Similarly, we have $\tilde{b}^h (\tilde{g}^h)^\top - b^h (g^h)^\top = f_{\tilde{\theta}}^\top - f_\theta^\top = (\Delta g^{L+1})^\top$, and the same decomposition applies, this time using $\|\Delta g^h\| \leq O(r^{h-1} q)$ together with $\|g^h\| \leq O(s^h)$ and $\|b^h\| \leq O(s^{L-h+1})$. If $1/(Ls^L) \leq L^{-1/(L+1)}$, then $s^{L+1} \geq 1/L$, and we get $\|\Delta b^h\| \leq O\left(\frac{1}{L s^h}\right)$. If $1/(Ls^L) > L^{-1/(L+1)}$, then $s^{L+1} < 1/L$, and we get $\|\Delta b^h\| \leq O\left(L^{(h-L-1)/(L+1)}\right)$. This completes the proof.

Lemma 9 Given a neural network as stated in Theorem 6, let $\|\cdot\|_F$ be the Frobenius norm, $W^1, \ldots, W^{L+1}$ be the weight matrices of the neural network, $\Delta W^h = \tilde{W}^h - W^h$ be a perturbation of the weight matrices, $\theta^h$ be the parameter vector containing all the elements of $W^h$, $\Delta \theta^h = \tilde{\theta}^h - \theta^h$ be the corresponding perturbation, and $f_{\tilde{\theta}}(x)$ be the resulting value after perturbation. If $s \geq 1$ and $\|\Delta W^h\|_2 \leq O(s^{-L}/L)$ for all $h$, then for any weight matrix the following holds: $\left\|\frac{\partial f_{\tilde{\theta}}(x)}{\partial W^h} - \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O\left(\frac{1}{Ls}\right)$. If $s < 1$ and $\|\Delta W^h\|_2 \leq O(q)$ for all $h$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, then for any weight matrix the following holds: $\left\|\frac{\partial f_{\tilde{\theta}}(x)}{\partial W^h} - \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O\left(\max\left(s^L, L^{-L/(L+1)}\right)\right)$.

Proof We first prove the case $d_y = 1$, i.e., the output of the neural network is 1-dimensional. In this case, the derivative with respect to $W^h$ is $\frac{\partial f_\theta(x)}{\partial W^h} = (b^h)^\top g^{h-1}$. Then we have

$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F = \left\|(\tilde{b}^h)^\top \tilde{g}^{h-1} - (b^h)^\top g^{h-1}\right\|_F \leq \left\|(\Delta b^h)^\top g^{h-1}\right\|_F + \left\|(b^h + \Delta b^h)^\top \Delta g^{h-1}\right\|_F.$$

Recall that $\|g^h\| \leq O(s^h)$ and $\|b^h\| \leq O(s^{L+1-h})$. When $s \geq 1$, from Lemma 7 and Lemma 8 we know that $\|\Delta g^h\| \leq O\left(\frac{1}{Ls^{L-h+1}}\right)$ and $\|\Delta b^h\| \leq O\left(\frac{1}{Ls^h}\right)$. Then we have

$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O(s^{h-1})\, O\left(\frac{1}{Ls^h}\right) + O(s^{L+1-h})\, O\left(\frac{1}{Ls^{L-h+2}}\right) + O\left(\frac{1}{Ls^{L-h+2}}\right) O\left(\frac{1}{Ls^h}\right) \leq O\left(\frac{1}{Ls}\right).$$

When $s < 1$, from Lemma 7 and Lemma 8 we know the corresponding two-case bounds on $\|\Delta g^h\|$ and $\|\Delta b^h\|$. If $1/(Ls^L) \leq L^{-1/(L+1)}$, we have

$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O(s^{h-1})\, O\left(\frac{1}{Ls^h}\right) + O(s^{L-h+1})\, O\left(\frac{1}{Ls^{L-h+2}}\right) + O\left(\frac{1}{Ls^{L-h+2}}\right) O\left(\frac{1}{Ls^h}\right).$$

Since $1/(Ls^L) \leq L^{-1/(L+1)}$ implies $1/L \leq s^{L+1}$ (from the proof of Lemma 7), we have $\frac{1}{Ls^h} \leq s^{L-h+1}$, and we can conclude that $\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O(s^L)$. If $1/(Ls^L) > L^{-1/(L+1)}$, we have

$$\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O(s^{h-1})\, O\left(L^{(h-L-1)/(L+1)}\right) + O(s^{L+1-h})\, O\left(L^{-(h-1)/(L+1)}\right) + O\left(L^{-(h-1)/(L+1)}\right) O\left(L^{(h-L-1)/(L+1)}\right).$$

Since $1/(Ls^L) > L^{-1/(L+1)}$ implies $1/L > s^{L+1}$ (from the proof of Lemma 7), we have $L^{(h-L-1)/(L+1)} > s^{L-h+1}$ and $L^{-(h-1)/(L+1)} > s^{h-1}$; therefore $\left\|\Delta \frac{\partial f_\theta(x)}{\partial W^h}\right\|_F \leq O\left(L^{-L/(L+1)}\right)$, because $s < 1$. We have proved the lemma for the case $d_y = 1$. For the case $d_y > 1$, note that $\left\|\Delta \nabla_\theta f_\theta(x)\right\|_F^2 = \sum_{i=1}^{d_y} \left\|\Delta \nabla_\theta f_{\theta,i}(x)\right\|_F^2$, where $f_{\theta,i}(x)$ is the $i$-th dimension of $f_\theta(x)$, so the bound follows directly from the 1-dimensional case; since $d_y$ is a constant, we ignore it. This completes the proof.

Now we can prove Theorem 6. If $\tilde{W}^h$ is obtained by one step of gradient descent starting from $W^h$ (and $\tilde{\theta}$ from $\theta$) with learning rate $\alpha$, then for any weight matrix we have

$$\|\Delta W^h\|_2 = \alpha \left\|\nabla_{W^h} \mathcal{L}(\theta)\right\|_2 \leq \alpha \left\|\nabla_{W^h} \mathcal{L}(\theta)\right\|_F = \alpha \left\|\nabla_{\theta^h} \mathcal{L}(\theta)\right\| = \alpha \left\|\frac{1}{n}\sum_{i=1}^n \left[f_\theta(x_i) - y_i\right] \nabla_{\theta^h} f_\theta(x_i)\right\| \leq \alpha\, \frac{1}{n}\sum_{i=1}^n c_i \sqrt{d_y}\, O(s^{L-h+1})\, O(s^{h-1}) \leq \alpha\, O(s^L),$$

where the $c_i = \|f_\theta(x_i) - y_i\|$ are constants. If $\alpha \leq O(s^{-2L}/L)$ when $s \geq 1$, then for any weight matrix we have $\|\Delta W^h\|_2 \leq \alpha\, O(s^L) \leq O(s^{-L}/L)$. If $\alpha \leq O(q s^{-L})$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$, when $s < 1$, then for any weight matrix we have $\|\Delta W^h\|_2 \leq \alpha\, O(s^L) \leq O(q)$.
By Lemma 9, we can conclude that $\left\|\nabla_\theta f_{\tilde{\theta}}(x) - \nabla_\theta f_\theta(x)\right\|_F \leq O(1/L)$, which is the claim of Theorem 6. When $s \geq 1$, we know that $s^L \geq 1 \geq L^{-L/(L+1)}$; then $\frac{1}{Ls^L} \leq \frac{1}{L} \leq L^{-1/(L+1)}$, and thus $1/(Ls^L) = \min(1/(Ls^L), L^{-1/(L+1)})$ when $s \geq 1$. Hence, for the case $s \geq 1$, we can rewrite $\alpha \leq O(s^{-2L}/L) = O(q s^{-L})$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$, which completes the proof of Theorem 6.

Now, we prove Theorem 3 with $k = 2$, i.e., a two-step gradient-descent adaptation. We know that $\beta_1 = \alpha\, \nabla_\theta L_m(f_{\tilde{\theta}})\, \nabla_\theta L_m(f_\theta)^\top$ and $\|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}} = \|\nabla_\theta L_m(f_\theta)\|^2$. Thus, we have

$$\beta_1 - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}} = \alpha \left[\nabla_\theta L_m(f_{\tilde{\theta}}) - \nabla_\theta L_m(f_\theta)\right] \nabla_\theta L_m(f_\theta)^\top$$
$$= \alpha\, \mathbb{E}_{(x_m, y_m)}\left\{\left[f_{\tilde{\theta}}(x_m) - y_m\right] \nabla_\theta f_{\tilde{\theta}}(x_m) - \left[f_\theta(x_m) - y_m\right] \nabla_\theta f_\theta(x_m)\right\} \nabla_\theta L_m(f_\theta)^\top$$
$$= \alpha\, \mathbb{E}_{(x_m, y_m)}\left\{\left[f_\theta(x_m) - y_m + \Delta f_\theta(x_m)\right] \nabla_\theta f_{\tilde{\theta}}(x_m) - \left[f_\theta(x_m) - y_m\right] \nabla_\theta f_\theta(x_m)\right\} \nabla_\theta L_m(f_\theta)^\top$$
$$= \alpha\, \mathbb{E}_{(x_m, y_m)}\left\{\Delta f_\theta(x_m)\, \nabla_\theta f_{\tilde{\theta}}(x_m) + \left[f_\theta(x_m) - y_m\right] \Delta \nabla_\theta f_\theta(x_m)\right\} \nabla_\theta L_m(f_\theta)^\top$$
$$\leq \alpha \left[O\left(\frac{1}{L}\right) O(s^L) + O\left(\frac{1}{L}\right)\right] \left\|\nabla_\theta L_m(f_\theta)\right\| \leq O\left(\frac{q}{L}\right) \left\|\nabla_\theta L_m(f_\theta)\right\|,$$

where we used Lemma 7 (so that $|\Delta f_\theta(x_m)| \leq O(1/L)$), Theorem 6, and $\alpha \leq O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s)$. In the case $d_y = 1$, we have $\left\|\frac{\partial f_\theta(x)}{\partial W^h}\right\|_F = \|(b^h)^\top g^{h-1}\| \leq O(s^L)$, as already shown in the proof of Lemma 9. Then

$$\left\|\nabla_\theta L_m(f_\theta)\right\| = O\left(\sqrt{\sum_{h=1}^{L+1} \left\|\frac{\partial L_m(f_\theta)}{\partial \theta^h}\right\|^2}\right) = O\left(\sqrt{L}\, s^L\right).$$

In the case $d_y \geq 1$, the bound is simply scaled by a constant factor of $\sqrt{d_y}$. Thus we have

$$\beta_1 - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}} \leq O\left(\frac{q}{L}\right) \left\|\nabla_\theta L_m(f_\theta)\right\| \leq O\left(q s^L\right) \leq O\left(\frac{1}{L}\right),$$

because $q = \min(1/(Ls^L), L^{-1/(L+1)}) \leq 1/(Ls^L)$, which completes the proof for the case $k = 2$.

For the case $k > 2$, we only need to make sure that the bound on the learning rate always holds. Fortunately, since $k$ is a finite constant, according to what we have already shown in the proofs of the previous lemmas, every step of gradient descent will not change the spectral norms of the weight matrices too much: $\|\Delta W^h\|_2 \leq O(s^{-L}/L)$ for all $h$ if $s \geq 1$, and $\|\Delta W^h\|_2 \leq O(q)$ for all $h$ if $s < 1$, where $q = \min(1/(Ls^L), L^{-1/(L+1)})$. Thus, we may assume that the bound on the learning rate always holds during the adaptation. Using the triangle inequality to generalize the result from $k = 2$ to $k > 2$, we obtain, for all $1 \leq i \leq k - 1$,

$$\beta_i - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}} \leq O\left(\frac{1}{L}\right).$$

Recall that $\tilde{E}(\alpha, f_\theta) = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta) - \alpha \|\nabla_{f_\theta} L_m(f_\theta)\|^2_{\mathcal{H}}\right]$ and $M_k = \mathbb{E}_{\mathcal{T}_m}\left[L_m(f_\theta) - \sum_{i=0}^{k-1} \beta_i\right]$, where $\beta_i = \alpha\, \nabla_{\theta_i} L_m(f_{\theta_i})\, \nabla_\theta L_m(f_\theta)^\top$ with $\theta_0 = \theta$ and $\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} \mathcal{L}(f_{\theta_i}, D^{tr}_m)$. The result is now straightforward.

E PROOF OF THEOREM 4

Theorem 4 Let $f_\theta$ be a convolutional neural network with $L - l$ convolutional layers, $l$ fully-connected layers, and ReLU activation function, and let $d_x$ be the input dimension. Denote by $W^h$ the parameter vector of the convolutional layer for $h \leq L - l$, and the weight matrices of the fully-connected layers for $L - l + 1 \leq h \leq L + 1$. $\|\cdot\|_2$ means both the spectral norm of a matrix and the Euclidean norm of a vector. Define $s_h = \sqrt{d_x}\, \|W^h\|_2$ if $h = 1, \ldots, L - l$, and $s_h = \|W^h\|_2$ if $L - l + 1 \leq h \leq L + 1$. Let $s = \max_h s_h$ and $\alpha$ be the learning rate of gradient descent. If $\alpha \leq O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s)$, the following holds: $|M_k - \tilde{E}(k\alpha, f_\theta)| \leq O(1/L)$.

Proof We prove Theorem 4 by first transforming the convolutional neural network into an equivalent fully-connected neural network and then applying Theorem 3. First of all, we assume that there are $c_h$ channels in the output $g^h(x)$ of the $h$-th convolutional layer, for $h = 0, \ldots, L - l$; for the fully-connected layers, define $c_{L-l+1} = \cdots = c_{L+1} = 1$. We represent the dimensionality of the input data by $x \in \mathbb{R}^{d_x c_0}$. Instead of using matrices, we represent the output of every convolutional layer by a $d_x c_h$-length vector $g^h = (g^h_1, g^h_2, \ldots, g^h_{d_x})$, where every $g^h_i = (g^h_{i,1}, g^h_{i,2}, \ldots, g^h_{i,c_h})$ is a $c_h$-length vector containing the values of the different channels at the same position.
E PROOF OF THEOREM 4

Theorem 4 Let $f_\theta$ be a convolutional neural network with $L - l$ convolutional layers and $l$ fully-connected layers, with ReLU activations, and let $d_x$ be the input dimension. Denote by $W^h$ the parameter vector of the convolutional layer for $h \le L - l$, and the weight matrices of the fully-connected layers for $L - l + 1 \le h \le L + 1$; $\|\cdot\|_2$ means both the spectral norm of a matrix and the Euclidean norm of a vector. Define

$$s_h = \begin{cases}\sqrt{d_x}\,\|W^h\|_2, & \text{if } h = 1, \ldots, L-l\\ \|W^h\|_2, & \text{if } L-l+1 \le h \le L+1,\end{cases}$$

and let $s = \max_h s_h$ and $\alpha$ be the learning rate of gradient descent. If $\alpha \le O(qr)$ with $q = \min(1/(Ls^L), L^{-1/(L+1)})$ and $r = \min(s^{-L}, s^{-1})$, the following holds:

$$\big|M_k - \tilde E(k\alpha, f_\theta)\big| \le O\Big(\frac{1}{L}\Big).$$

Proof We prove Theorem 4 by first transforming the convolutional neural network into an equivalent fully-connected neural network and then applying Theorem 3.

First of all, we assume that there are $c_h$ channels in the $h$-th convolutional layer's output $g^h(x)$, where $h = 0, \ldots, L-l$. For the fully-connected layers, define $c_{L-l+1} = \cdots = c_{L+1} = 1$. We may represent the input data by $x \in \mathbb R^{d_x c_0}$. Instead of using matrices, we represent the output of every convolutional layer by a vector of length $d_x c_h$, $g^h = (g^h_1, g^h_2, \ldots, g^h_{d_x})$, where every $g^h_i = (g^h_{i,1}, g^h_{i,2}, \ldots, g^h_{i,c_h})$ is a vector of length $c_h$ containing the values of the different channels at the same position.

We assume that for every element $g^h_{i,j}$ of $g^h_i$, its value is completely determined by the elements of a set $Q^{h-1}_i$, where $Q^{h-1}_i$ contains $kc_{h-1}$ elements at fixed positions in $g^{h-1}$ for a given $i$. In other words, every element of the output of a convolutional layer is determined by elements at fixed positions in the output of the previous layer; this is exactly how a convolutional layer works in deep learning. If we use $g^{h-1}_{Q^{h-1}_i}$ to denote the concatenation of the elements $g^{h-1}_{a,b} \in Q^{h-1}_i$, then $g^{h-1}_{Q^{h-1}_i}$ is a vector of length $kc_{h-1}$, where $k$ is the kernel size. Then we have

$$g^h_i = \sigma\big(g^{h-1}_{Q^{h-1}_i}\, U^h_i\big),$$

where $U^h_i \in \mathbb R^{kc_{h-1} \times c_h}$ is a $kc_{h-1} \times c_h$ matrix. For notational simplicity, one can define a matrix $U^h \in \mathbb R^{d_x c_{h-1} \times d_x c_h}$, where every column of $U^h$ has only $kc_{h-1}$ non-zero elements, which satisfies

$$g^h = \sigma(g^{h-1} U^h).$$

By the properties of the convolutional layer, we know the following facts: one can represent $U^h$ by $U^h = (V^h_1, V^h_2, \ldots, V^h_{d_x})$, where $V^h_i \in \mathbb R^{d_x c_{h-1} \times c_h}$ is a sub-matrix of $U^h$; every $V^h_i$ contains the same set of elements as $W^h$, while these elements are located at different positions; and every $V^h_i$ can be obtained from any other $V^h_j$ by swapping rows.

Let us define $U^{L-l+1} = W^{L-l+1}, \ldots, U^{L+1} = W^{L+1}$ for the fully-connected layers and the output layer. Then we can represent the neural network just as in Theorem 3 by

$$f_\theta(x) = \sigma(\sigma(\cdots\sigma(x U^1)\cdots U^{L-1})U^L)U^{L+1}, \qquad x \in \mathbb R^{d_x c_0}.$$

Now let $t_h$ be the spectral norm of $U^h$, and $t = \max_h t_h$. By Theorem 3, we know that we want $\alpha \le O(qr)$ with $q$ and $r$ defined through $t$; it remains to relate $t$ to $s$. Because every $V^h_i$ contains the same set of elements, we know that every $V^h_i$ has the same Frobenius norm. Because every $V^h_i$ can be obtained from any other $V^h_j$ by swapping rows, we know that every $V^h_i$ has the same rank. We know that

$$\frac{1}{\sqrt{r_h}}\,\|V^h_1\|_F \le \|V^h_1\|_2 \le \|U^h\|_2 \le \|U^h\|_F = \sqrt{d_x}\,\|V^h_1\|_F = \sqrt{d_x}\,\|W^h\|_2,$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $r_h$ denotes the rank of $V^h_1$. The last equality holds because the matrix $V^h_1$ and the vector $W^h$ have the same set of elements. With $s_h$ and $s = \max_h s_h$ defined as in the statement of the theorem, from the above we know that $t_h = \Theta(s_h)$, because $s_h/\sqrt{d_x r_h} \le t_h \le s_h$. So we also have $t = \Theta(s)$. Then the conclusion is straightforward.
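The transformation used in this proof is easy to check numerically. The following sketch is our own illustration (it assumes a 1-D convolution with stride 1 and circular padding, details that the proof leaves unspecified, and it assumes $k \le d_x$): it builds the structured matrix $U^h$ from the convolution weights $W^h$, verifies the pre-activation identity $g^{h-1}U^h = \mathrm{conv}(g^{h-1})$, and checks $\|U^h\|_F = \sqrt{d_x}\,\|W^h\|_F$:

```python
# Hypothetical numerical illustration: a 1-D circular convolution written as
# the structured matrix U^h from the proof of Theorem 4. Positions are
# flattened position-major, so g^h is a (d_x * c_h)-vector and every column
# of U^h has only k*c_in non-zero entries.
import numpy as np

rng = np.random.default_rng(0)
d_x, c_in, c_out, k = 8, 3, 4, 3       # positions, channels, kernel size (k <= d_x)

W = rng.normal(size=(k, c_in, c_out))  # convolution weights W^h
g = rng.normal(size=(d_x, c_in))       # previous layer's output g^{h-1}

# Direct circular convolution: output position i sees positions i, ..., i+k-1.
conv = np.zeros((d_x, c_out))
for i in range(d_x):
    patch = np.concatenate([g[(i + j) % d_x] for j in range(k)])  # g^{h-1}_{Q^{h-1}_i}
    conv[i] = patch @ W.reshape(k * c_in, c_out)

# The same linear map as one matrix U^h of shape (d_x*c_in, d_x*c_out).
U = np.zeros((d_x * c_in, d_x * c_out))
for i in range(d_x):                   # column block i is the sub-matrix V^h_i
    for j in range(k):
        p = (i + j) % d_x              # input position j of the receptive field
        U[p * c_in:(p + 1) * c_in, i * c_out:(i + 1) * c_out] = W[j]

assert np.allclose(g.reshape(-1) @ U, conv.reshape(-1))
# Every V^h_i holds the same elements as W^h, so ||U||_F = sqrt(d_x) * ||W||_F.
print(np.linalg.norm(U), np.sqrt(d_x) * np.linalg.norm(W))
```

The assertion confirms that, before the activation, the convolutional layer is exactly the matrix multiplication $g^{h-1}U^h$, and the printed norms agree as the proof requires.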
F REVISION OF THEOREM 3 AND THEOREM 4 IN THE CLASSIFICATION CASE

We now show how to obtain results similar to Theorem 3 and Theorem 4 for classification problems, where the cross-entropy loss is used instead of the squared loss. We need two more restrictions in the classification case:

1. There exist matrices $A$ and $B$ such that $g^L A \le \mathrm{softmax}(g^L W^{L+1}) \le g^L B$ element-wise for all data points, where softmax is the softmax operation at the last layer.

2. For any data point $x$ that belongs to the $c$-th class, there exists a constant $\epsilon > 0$ such that $f_{\theta,c}(x) \ge \epsilon$, i.e., the output of the neural network has a lower bound at the true-class position.

The proof is similar to the proof in the regression case; we briefly describe the differences here. Firstly, in the classification case the softmax function is used at the last layer. By the first restriction, we can get rid of the softmax function by introducing new matrices, which again leads to a bound on the learning rate as in the regression case. Secondly, if the loss function is the cross-entropy loss, we have

$$\nabla_\theta\mathcal L_m(f_\theta) = -E_{(x_m,y_m)}\Big[\frac{1}{f_{\theta,c_m}(x_m)}\,\nabla_\theta f_{\theta,c_m}(x_m)\Big],$$

where $c_m$ denotes the class of $x_m$ (e.g., if $x_m$ belongs to the third class, then $c_m = 3$), and $f_{\theta,c_m}(x_m)$ denotes the $c_m$-th dimension of $f_\theta(x_m)$. We want a lower bound on $f_{\theta,c}(x)$ to exist so that the gradient $\nabla_\theta\mathcal L_m(f_\theta)$ can be further bounded. Then we can prove similar theorems by following the steps of the regression case.

G PROOF OF THEOREM 5

Theorem 5 Let $f_\theta$ be a neural network with $L$ hidden layers, with each layer being either fully-connected or convolutional. Assume that $L < \infty$. Then $\mathrm{error}(T) = |\tilde E(T, f_\theta) - E(T, f_\theta)|$ is a non-decreasing function of $T$. Furthermore, for arbitrary $T > 0$ we have $\mathrm{error}(T) \le O(T^{2L+3})$.

Proof Recall that $E(t, f_\theta)$ is defined based on $f^t_{m,\theta}$, which is the resulting function whose parameters evolve according to the gradient flow

$$\frac{d\theta^t_m}{dt} = -\nabla_{\theta^t_m}\mathcal L(f^t_{m,\theta}, D^{tr}_m).$$

We actually have the following (Santambrogio, 2016): $\|\Delta\theta\| = \|\theta^0 - \theta^t\| \le O(\sqrt t)$. For simplicity and clarity, we use $\Delta$ to denote the change of any vector or matrix. Thus, we know that $\|\Delta W^h\|_2 \le \|\Delta W^h\|_F \le \|\Delta\theta\| \le O(\sqrt t)$. Just as in the proofs of Lemma 7, Lemma 8 and Lemma 9, we can show that

$$\|\Delta g^h\| \le O(t^{h/2}), \qquad \|\Delta b^h\| \le O(t^{(L-h+1)/2}), \qquad \Big\|\nabla_\theta f_{\tilde\theta}(x) - \nabla_\theta f_\theta(x)\Big\|_F \le O(t^{(L+1)/2})$$

by mathematical induction; we skip the details here. Note that, different from the previous theorems, here we focus on the time $t$, and thus hide the effect of the spectral norms by treating them as constants. Then, using the same add-and-subtract decomposition as in the proof of Theorem 3, we have

$$\big\|\Delta\nabla_\theta\mathcal L_m\big\| = \big\|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta}) - \nabla_\theta\mathcal L_m(f_\theta)\big\| = \Big\|E_{(x_m,y_m)}\Big\{[f^t_{m,\theta}(x_m) - y_m]\,\nabla_{\theta^t} f^t_{m,\theta}(x_m) - [f_\theta(x_m) - y_m]\,\nabla_\theta f_\theta(x_m)\Big\}\Big\| \le O(t^{L+1}).$$

Recall that

$$E(T, f_\theta) = E_{T_m}\big[\mathcal L_m(f^T_{m,\theta})\big] = E_{T_m}\Big[\mathcal L_m(f_\theta) + \int_0^T \frac{\partial}{\partial t}\mathcal L_m(f^t_{m,\theta})\,dt\Big] = E_{T_m}\Big[\mathcal L_m(f_\theta) - \int_0^T \big\|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})\big\|^2\,dt\Big]$$

and

$$\tilde E(T, f_\theta) = E_{T_m}\Big[\mathcal L_m(f_\theta) - T\,\|\nabla_{f_\theta}\mathcal L_m(f_\theta)\|_{\mathcal H}^2\Big] = E_{T_m}\Big[\mathcal L_m(f_\theta) - T\,\|\nabla_\theta\mathcal L_m(f_\theta)\|^2\Big].$$

Because

$$\big|\tilde E(T, f_\theta) - E(T, f_\theta)\big| = \Big|E_{T_m}\Big[\int_0^T \big\|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})\big\|^2\,dt - T\,\|\nabla_\theta\mathcal L_m(f_\theta)\|^2\Big]\Big| \le E_{T_m}\int_0^T \Big[2\,\|\nabla_\theta\mathcal L_m(f_\theta)\|\,\|\Delta\nabla_\theta\mathcal L_m\| + \|\Delta\nabla_\theta\mathcal L_m\|^2\Big]\,dt,$$

we have

$$\mathrm{error}(T) = \big|\tilde E(T, f_\theta) - E(T, f_\theta)\big| \le O\Big(\frac{L+1}{2L+3}\,T^{2L+3}\Big) = O(T^{2L+3})$$

by simple calculation.

On the other hand, define $G(\tau) = \int_0^\tau \|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})\|^2\,dt$, so that $G'(\tau) = \|\nabla_{\theta^\tau}\mathcal L_m(f^\tau_{m,\theta})\|^2$, and assume that $\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})$ is continuous at $t = 0$. Then, we have

$$\tilde E(T, f_\theta) - E(T, f_\theta) = E_{T_m}\Big[\int_0^T \big\|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})\big\|^2\,dt - T\,\|\nabla_\theta\mathcal L_m(f_\theta)\|^2\Big] = E_{T_m}\big[G(T) - T\,G'(0)\big],$$

where $T\,G'(0) = G(0) + T\,G'(0)$ (note that $G(0) = 0$) is the first-order approximation of $G(T)$ at $\tau = 0$. When $T \le 1$, $G(T) - T\,G'(0)$ can be taken as a local truncation error (i.e., the error that occurs in one step of a numerical approximation). When $T$ increases, the difference is no better than the global truncation error over $T$ steps:

$$G(T) - \sum_i \big(i - (i-1)\big)\,G'(i) = \sum_i \int_{i-1}^{i}\Big[\big\|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})\big\|^2 - \big\|\nabla_{\theta^i}\mathcal L_m(f^{t=i}_{m,\theta})\big\|^2\Big]\,dt,$$

whose summands are controlled by $\Delta^t_i\nabla\mathcal L_m = \nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta}) - \nabla_{\theta^i}\mathcal L_m(f^{t=i}_{m,\theta})$ as shown previously, where $i$ is the $i$-th time step and $G'(i)$ is the derivative of $G$ at time step $i$. Now we can see that $\tilde E(T, f_\theta) - E(T, f_\theta)$ relates closely to the difference between the gradients $\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})$ at different time steps (i.e., $\Delta^t_i\nabla\mathcal L_m$), to $\|\nabla_{\theta^t}\mathcal L_m(f^t_{m,\theta})\|$, and to $T$. The first two quantities reflect how flat or sharp the landscape of $\mathcal L_m(f_{m,\theta})$ is near $t = 0$; we can wrap them into a constant $C_0(L, t=0)$. Then the error is at least $C_0(L, t=0)\,O(T)$. For a landscape that is smooth enough, we can further take a first-order approximation of $\Delta^t_i\nabla\mathcal L_m$ and obtain $C(L, t=0)\,O(T^2)$, where $C(L, t=0)$ plays the role of a second-order derivative of $\mathcal L$.

H SOME EXPERIMENTAL DETAILS

H.1 IMPLEMENTATION OF CLASSIFICATION FOR META-RKHS-II

As we mentioned earlier, our proposed energy functional with closed-form adaptation cannot be directly applied to classification problems. We handle this challenge following Arora et al. (2019). For a classification problem with $C = d_y$ classes, every data point $x$ is associated with a $C$-dimensional encoding of its one-hot label $y$, whose entries are $(C-1)/C$ at the correct class and $-1/C$ at the incorrect classes. In the prediction, $Y^{tr}$ is replaced by the encoding of the training data, and $f_\theta(x)$ is replaced by $f_\theta(x)\,[1, \ldots, 1] \in \mathbb R^{n \times d_y}$ for dimension consistency. At testing time, we compute the prediction for each test data point and choose the position with the largest value as its predicted class.
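For concreteness, a minimal sketch of this prediction rule is given below (our own illustration: `kern` is a placeholder for any kernel function, whereas Meta-RKHS-II uses the meta-model's NTK, which this sketch does not compute; the ridge term and all sizes are hypothetical):

```python
# Sketch of the closed-form classification rule with the Arora et al. (2019)
# label encoding, for a generic kernel `kern` (the paper uses the NTK here).
import numpy as np

def encode(labels, C):
    """Encoding with (C-1)/C at the true class and -1/C elsewhere."""
    Y = np.full((len(labels), C), -1.0 / C)
    Y[np.arange(len(labels)), labels] = (C - 1.0) / C
    return Y

def predict(kern, X_tr, labels_tr, X_te, C, ridge=1e-6):
    # Kernel regression f(x) = K(x, X_tr) K(X_tr, X_tr)^{-1} Y_tr,
    # applied column-wise to the C-dimensional encoded labels.
    K_tr = kern(X_tr, X_tr) + ridge * np.eye(len(X_tr))  # small ridge for stability
    K_te = kern(X_te, X_tr)
    scores = K_te @ np.linalg.solve(K_tr, encode(labels_tr, C))
    return scores.argmax(axis=1)   # largest entry gives the predicted class

# Example with an RBF stand-in kernel, like the baseline compared in I.1 below.
rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(25, 8)), rng.integers(0, 5, size=25)
print(predict(rbf, X_tr, y_tr, rng.normal(size=(10, 8)), C=5))
```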
I EXTRA EXPERIMENTAL RESULTS

I.1 COMPARISON WITH THE RBF KERNEL

An interesting question is what the results of other kernels would be, without introducing extra model components or networks. We provide the results of using the RBF (Gaussian) kernel here: 42.1 ± 1.9 (5-way 1-shot) and 54.9 ± 1.1 (5-way 5-shot) on Mini-ImageNet, and 32.4 ± 2.0 (5-way 1-shot) and 38.2 ± 0.9 (5-way 5-shot) on FC-100. These are worse than the NTK-based Meta-RKHS-II, showing the superiority of using the NTK.

I.2 MORE RESULTS ON OUT-OF-DISTRIBUTION GENERALIZATION

We provide more results on the out-of-distribution generalization experiments here. From the results, we find that the proposed methods are more robust and generalize better across datasets.

Table 5: Meta-testing on different out-of-distribution datasets with models trained on FC-100.

                          5-WAY 1-SHOT                      5-WAY 5-SHOT
ALGORITHM                 CUB             VGG FLOWER        CUB             VGG FLOWER
MAML                      31.58 ± 1.89%   50.82 ± 1.94%     41.72 ± 1.29%   65.19 ± 1.36%
FOMAML                    32.34 ± 1.57%   49.90 ± 1.78%     41.96 ± 1.53%   66.87 ± 1.45%
REPTILE                   33.56 ± 1.40%   46.77 ± 1.81%     42.79 ± 1.38%   67.97 ± 0.71%
IMAML                     32.49 ± 1.52%   49.96 ± 1.98%     38.92 ± 1.62%   59.80 ± 1.82%
BAYESIAN TAML (SOTA)      31.82 ± 0.49%   49.58 ± 0.55%     43.97 ± 0.57%   67.36 ± 0.53%
META-RKHS-I               34.12 ± 1.34%   48.81 ± 1.89%     43.31 ± 1.43%   69.02 ± 0.62%
META-RKHS-II              36.35 ± 1.07%   59.75 ± 1.23%     49.92 ± 0.68%   76.32 ± 0.58%

I.3 MORE RESULTS ON ADVERSARIAL ATTACK

We now show some extra results on adversarial attacks in the following figures. Consistent with the results in the main text, we find that our proposed methods are more robust to adversarial attacks.

[Figure 6: FC-100 5-way 5-shot black-box attacks (left) and 5-way 1-shot PGD ℓ∞-norm attack (right); accuracy against the number of attack queries for MAML, FOMAML, Reptile, iMAML, Meta-RKHS-I, and Meta-RKHS-II (including the t100_PQ1 and t100_PQ2 variants).]

I.4 IMPACT OF GRADIENT NORM IN META-RKHS-I

In this experiment, we compare our proposed Meta-RKHS-I with Reptile by evaluating the trained models with different numbers of adaptation steps at testing time. The comparison is shown in Figure 7. As we can see, Meta-RKHS-I always obtains better results than Reptile, which supports our idea that the learned function should be close to the task-specific optimum and have a large functional gradient norm; these two conditions together lead to the ability of fast adaptation.

[Figure 7: Reptile (dashed) vs. Meta-RKHS-I (solid) with different numbers of testing adaptation steps (x-axis): (a) Mini-ImageNet 5-way 1-shot, (b) Mini-ImageNet 5-way 5-shot, (c) FC-100 5-way 1-shot, (d) FC-100 5-way 5-shot.]

I.5 IMPACT OF NETWORK ARCHITECTURE FOR DIFFERENT META-LEARNING MODELS

In this section, we compare different meta-learning models using CNN architectures with 100 or 200 feature channels and with 4 or 5 convolutional layers, respectively.

Table 6: Few-shot classification results on Mini-ImageNet with different numbers of feature channels, using 4 convolutional layers.
                          100 CHANNELS                      200 CHANNELS
ALGORITHM                 5-WAY 1-SHOT    5-WAY 5-SHOT      5-WAY 1-SHOT    5-WAY 5-SHOT
MAML                      49.50 ± 1.58%   64.31 ± 1.07%     48.91 ± 1.69%   63.96 ± 0.82%
FOMAML                    48.69 ± 1.62%   63.73 ± 0.76%     48.55 ± 1.86%   63.18 ± 0.96%
IMAML                     49.30 ± 1.94%   62.89 ± 0.95%     48.23 ± 1.58%   62.25 ± 0.83%
REPTILE                   50.20 ± 1.69%   64.12 ± 0.92%     48.72 ± 1.97%   63.67 ± 0.79%
META-RKHS-I               51.23 ± 1.79%   66.69 ± 0.73%     51.54 ± 1.64%   65.92 ± 0.92%
META-RKHS-II              51.37 ± 2.31%   66.97 ± 0.98%     50.96 ± 2.15%   65.21 ± 0.87%

Table 7: Few-shot classification results on Mini-ImageNet with different numbers of feature channels, using 5 convolutional layers.

                          100 CHANNELS                      200 CHANNELS
ALGORITHM                 5-WAY 1-SHOT    5-WAY 5-SHOT      5-WAY 1-SHOT    5-WAY 5-SHOT
MAML                      49.87 ± 1.65%   65.78 ± 1.18%     48.62 ± 1.82%   63.25 ± 0.75%
FOMAML                    48.93 ± 1.71%   64.37 ± 0.80%     48.27 ± 1.74%   62.95 ± 0.83%
IMAML                     48.03 ± 1.76%   62.15 ± 0.83%     47.52 ± 1.73%   61.77 ± 0.89%
REPTILE                   50.62 ± 1.83%   64.53 ± 0.97%     49.33 ± 1.89%   63.26 ± 0.70%
META-RKHS-I               52.45 ± 1.88%   66.07 ± 0.69%     51.37 ± 1.92%   65.39 ± 0.98%
META-RKHS-II              50.92 ± 2.16%   66.45 ± 0.91%     50.43 ± 2.42%   64.17 ± 1.06%