# Gradient-EM Bayesian Meta-learning

Yayi Zou (Didi AI Labs @ Silicon Valley, yz725@cornell.edu), Xiaoqi Lu (Columbia University, lx2170@columbia.edu)

Abstract

Bayesian meta-learning enables robust and fast adaptation to new tasks with uncertainty assessment. The key idea behind Bayesian meta-learning is empirical Bayes inference of a hierarchical model. In this work, we extend this framework to include a variety of existing methods, before proposing our variant based on the Gradient-EM algorithm. Our method improves computational efficiency by avoiding back-propagation in the meta-update step, which is expensive for deep neural networks. Furthermore, it gives flexibility to the inner-update optimization procedure by decoupling it from the meta-update. Experiments on sinusoidal regression, few-shot image classification, and policy-based reinforcement learning show that our method not only achieves better accuracy with less computation cost, but is also more robust to uncertainty.

1 Introduction

Meta-learning, also known as learning to learn, has gained tremendous attention in both academia and industry, especially with applications to few-shot learning [3]. These methods exploit the shared structure of the multi-task setting, so that learning from previous tasks helps master new tasks faster. Early fast meta-learning algorithms were gradient-based and deterministic, which may cause overfitting at both the inner level and the meta level [15]. With growing interest in prediction-uncertainty evaluation and overfitting control, later studies explored probabilistic meta-learning methods [6, 27, 4]. Bayesian inference is widely considered one of the most convenient choices because its Occam's Razor property [14] automatically prevents the overfitting that occurs very often in deep neural networks (DNNs). It also provides reliable predictive uncertainty because of its probabilistic nature. This makes Bayesian methods important for DNNs which, as [8] shows, are usually poorly calibrated in their predictive uncertainty, unlike shallow neural networks.

The theoretical foundation of Bayesian meta-learning is hierarchical Bayes (HB) [5] or empirical Bayes (EB) [22], where the former can be seen as adding a hyper-prior over the latter [20]. For simplicity, in this paper we focus on EB, restricting the learning of meta-parameters to point estimates. A common solution of EB is a bi-level iterative optimization procedure [20, 13], where the inner-update refers to adaptation to a given task and the meta-update optimizes the meta-training objective.

Starting from a generalized meta-learning setting, we propose a general Bayesian framework and claim its optimality under certain metrics. This framework extends the original optimization framework with a train/val split in the inner-update procedure to mitigate in-task overfitting, which is important for NN-based meta-learning. We also hypothesize a mechanism by which the EB framework achieves fast adaptation (few inner-update gradient steps) under Gaussian parameterization, along with empirical evidence. Moreover, we successfully adapt this EB framework to RL, both theoretically and empirically, which has not been done before. We show that many important previous works in (Bayesian) meta-learning [20, 4, 27, 3, 17] can be included in this extended framework.
However, in these previous works the meta-update step requires backpropagation through the inner optimization process [18], which imposes large computation and memory burdens as the number of inner-update gradient steps increases. This limits possible applications, especially those that require many inner-update gradient steps or involve large datasets (Appendix C.2).

Motivated by the above observations, we propose a gradient-based Bayesian algorithm inspired by the Gradient-EM algorithm. By designing a new way to compute the gradient of the meta loss in Bayesian MAML, we arrive at an algorithm that decouples meta-update and inner-update and thus avoids the computation and memory burden of previous methods, making it scalable to a large number of inner-update gradient steps. In addition, it allows great flexibility in the choice of inner-update optimization method, because it only requires the value of the result of the inner-update optimization, not the optimization process itself (for example, in experiments we use Adam for classification and Trust Region Policy Optimization for RL). The separability of meta-update and inner-update also makes it a potentially useful scheme for distributed learning and private learning. In experiments, we show that our method can quickly learn the knowledge and uncertainty of a novel task with less computation burden in sinusoidal regression, image classification, and reinforcement learning.

2 Problem Formulation and Framework

2.1 General Meta-learning Setting

We set up the K-shot meta-learning framework on reinforcement learning (RL) with episode length $H$ as in [3]; supervised learning is the special case $H = 1$. With a decision rule (policy) $f$ we can sample rollout data $D = \{x_t, a_t, r_t\}_{t=1}^{H}$ from the task environment, and $f$ can be evaluated on $D$ with a loss function $L(D, f)$. We assume each task $\tau$ is sampled i.i.d. from the task space $\mathcal{T}$, following some task distribution $P(\tau)$. During the meta-training phase we are given a set of training tasks $\mathcal{T}^{\text{meta-train}}$. For each task $\tau$ in this set, we collect $K$ sample rollouts of the current policy $f$, denoted $D^{tr}_\tau$, and another $K$ sample rollouts after one policy-gradient update of $f$, denoted $D^{val}_\tau$ ($f$ is not needed to generate samples in supervised learning). At the meta-testing phase, for a randomly sampled task $\tau$, $D^{tr}_\tau$ is first provided. We are then required to return $f_\tau$ based on $\{D^{tr}_{\tau'} \cup D^{val}_{\tau'} : \tau' \in \mathcal{T}^{\text{meta-train}}\}$ and $D^{tr}_\tau$, and evaluate its expected loss $l_\tau = \mathbb{E}_{D_\tau \sim \tau} L(D_\tau, f_\tau)$ on more samples generated from that task. The objective is an algorithm that produces a decision rule $f_\tau$ minimizing the expected test loss over all tasks, $\mathbb{E}_{\tau \sim P(\tau)}\, l_\tau$.

Figure 1: (a) Minimal $A$. Each dot (in a vector space) represents weights that make the NN model perform well on a certain task; different colors represent different tasks. The figure illustrates that many good solutions (local optima of the NN loss function) exist for each task. The area within the dotted circle is the small neighboring zone $A$ in which each color has at least one point. (b) Graphical model.

2.2 Extended Empirical Bayesian Meta-learning Framework

We consider a parameterized decision rule $f_\theta$ and construct a corresponding generative model $\log P(D|\theta) = -L(D, f_\theta)$ (we leave the details of this construction for RL to Appendix B.1).
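To make the hierarchical structure of Figure 1(b) concrete, the following minimal Python sketch samples the generative process $P(\tau) \to \theta_\tau \to D_\tau$, using the sinusoid family that appears later in Section 5.1 as the task distribution. The function names and the identification of $\theta_\tau$ with the task parameters $(A, w, b)$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Sample tau ~ P(tau): the sinusoid family of Section 5.1,
    y = A sin(w x + b) + eps. Here theta_tau is identified with (A, w, b)."""
    A = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, 2 * np.pi)
    w = rng.uniform(0.5, 2.0)
    return A, w, b

def sample_rollout(theta_tau, K=10):
    """Sample D_tau = {(x_i, y_i)}_{i=1..K} ~ P(D | theta_tau)."""
    A, w, b = theta_tau
    x = rng.uniform(-5.0, 5.0, size=K)
    y = A * np.sin(w * x + b) + rng.normal(0.0, 0.01 * A, size=K)
    return x, y

theta_tau = sample_task()
D_tr = sample_rollout(theta_tau)    # K-shot support set D^tr_tau
D_val = sample_rollout(theta_tau)   # query set D^val_tau
```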
For each task $\tau$, denote the best policy parameter, equivalently the best-fitted parameter of the underlying generative model, by $\theta_\tau = \arg\min_\theta \mathbb{E}_{D_\tau \sim \tau} L(D_\tau, f_\theta) = \arg\max_\theta \mathbb{E}_{D_\tau \sim \tau} \log P(D_\tau|\theta)$. In general such a maximizer is not unique; this is discussed in Section 2.3. With $\theta_\tau$ uniquely defined, we have a distribution $P(\theta_\tau)$ induced by $P(\tau)$ (change of variables). We summarize the graphical model in Figure 1(b). Under perfect approximation, the ground-truth generator matches our generative model, $P(D_\tau|\tau) = P(D_\tau|\theta_\tau)$, resulting in the following proposition (proof in Appendix A.1):

Proposition 1. Suppose a data generator is represented by the hierarchical model $P(D_\tau|\theta_\tau)$ and $P(\theta_\tau)$, and define $L(Q; D) = \log \mathbb{E}_{\theta \sim Q} P(D|\theta)$ for a distribution $Q$ over $\theta$. Let $(D^{tr}_\tau, D^{val}_\tau)$ be independent samples from task $\tau$, and consider $Q$ determined by $D^{tr}_\tau$ via $Q = g(D^{tr}_\tau)$. Then

$$P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau)) = \arg\max_g \mathbb{E}_\tau\, L(g(D^{tr}_\tau); D^{val}_\tau) \qquad (1)$$

where $P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau))$ is the posterior given the prior $P(\theta_\tau)$ and the observations $D^{tr}_\tau$.

Two observations follow. First, the proposition identifies the best decision rule we can produce at meta-testing, namely the one obtained from the posterior $P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau))$. Second, it suggests an estimator of $P(\theta_\tau)$ for meta-training: $\arg\max_{P(\theta_\tau)} \sum_\tau L(P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau)); D^{val}_\tau)$. We prove in Appendix A.2 that this estimator is not only asymptotically consistent but also enjoys good asymptotic normality: it converges quickly to the true value with small variance as the number of tasks increases.

We further parameterize $P(\theta_\tau)$ as $P(\theta_\tau; \Theta)$ and introduce the shorthand $L(\Theta, D^{tr}_\tau; D^{val}_\tau) = L(P(\theta_\tau|D^{tr}_\tau; \Theta); D^{val}_\tau)$, so the meta-training optimization can be written as $\arg\max_\Theta \sum_\tau L(\Theta, D^{tr}_\tau; D^{val}_\tau)$. For clarity we denote $L^{[1]}_\tau = L(\Theta; D^{tr}_\tau \cup D^{val}_\tau)$ and $L^{[2]}_\tau = L(\Theta, D^{tr}_\tau; D^{val}_\tau)$, and also $L^{[1]} = \mathbb{E}_\tau L^{[1]}_\tau$, $L^{[2]} = \mathbb{E}_\tau L^{[2]}_\tau$. The estimation of $P(\theta_\tau; \Theta)$ then becomes $\arg\max_\Theta L^{[2]}$. This extends the popular MLE approach in empirical Bayes, which maximizes the marginal log-likelihood $L^{[1]}$, the special case of $L^{[2]}$ with $|D^{tr}_\tau| = 0$. There is a bias/variance trade-off between $L^{[1]}$ and $L^{[2]}$: using $L^{[2]}$ as the meta loss mitigates in-task overfitting, while $L^{[1]}$ extracts more information from the data.

Combining the above two observations, a stochastic-gradient approach to meta-training is given in Algorithm 1: at iteration $t$, the gradient $\nabla_\Theta L^{[i]}_\tau$ for each task in the $t$-th meta-training batch is computed by the subroutine Meta-Gradient, and then a gradient-ascent step on $\Theta$ is performed. A variational inference (VI) approach to meta-testing is also included, where the posterior $P(\theta_\tau|D^{tr}_\tau; \Theta)$ is estimated with the $\hat\Theta$ learned during meta-training held fixed. These subroutines are discussed in detail in Section 3.
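Before turning to the algorithms, here is a minimal Monte-Carlo sketch of the meta objective: $L(Q; D) = \log \mathbb{E}_{\theta \sim Q} P(D|\theta)$ can be estimated from samples of $Q$ with a log-sum-exp. The toy sinusoid likelihood and the fixed noise level are assumptions carried over from the sketch above, not the paper's evaluation code.

```python
import numpy as np
from scipy.special import logsumexp

def log_lik(D, theta, noise=0.1):
    """log P(D | theta) for the toy sinusoid model; 'noise' is assumed known."""
    x, y = D
    A, w, b = theta
    resid = y - A * np.sin(w * x + b)
    return (-0.5 * np.sum(resid ** 2) / noise ** 2
            - len(y) * np.log(noise * np.sqrt(2 * np.pi)))

def L_of_Q(theta_samples, D):
    """Monte-Carlo estimate of L(Q; D) = log E_{theta~Q} P(D|theta)."""
    ll = np.array([log_lik(D, th) for th in theta_samples])
    return logsumexp(ll) - np.log(len(ll))

# L^[1]_tau: theta_samples drawn from the prior P(theta; Theta),
#            scored on D^tr u D^val.
# L^[2]_tau: theta_samples drawn from the posterior P(theta | D^tr; Theta),
#            scored on D^val only.
```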
2.3 Non-uniqueness and Fast Adaptation

For neural networks $f_\theta$, many local optima achieve similarly good performance on each task. We observe in an empirical study (Appendix C.3) that the key to fast adaptation for gradient-based algorithms is to find a small neighboring zone $A$ in which most tasks have at least one good parameter (Figure 1(a)). The intuition is that when the $\{\theta_\tau\}$ are close enough, they can be learned within a few gradient steps starting from any point in that neighboring zone (our experiments show that perturbing the initial point within that area still yields good performance at meta-test).

The existence of this small neighboring zone $A$ depends on the parametric model $f_\theta$ and the task distribution $P(\tau)$. We further demonstrate its existence (Appendix C.3) with neural networks as the parametric model and a Gaussian parameterization of $P(\theta_\tau; \Theta)$ for uni-modal task distributions such as sinusoidal functions. Even when no single small neighboring zone $A$ exists (e.g., for multi-modal task distributions such as a mixture of sinusoidal, linear, and quadratic functions), a solution may be provided by an extension to mixtures of Gaussians [7, 19]. In this work we focus on the uni-modal situation and leave the extension to future work.

```
Algorithm Meta-train()
 1  randomly initialize Θ, set t = 0
 2  while not done do
 3      sample a batch of tasks T_t ~ P(τ)
 4      for each task τ ∈ T_t do
 5          sample D^tr_τ, D^val_τ ~ τ
 6          compute ∇_Θ L^[i]_τ by Subroutine Meta-Gradient(Θ^(t), {D^tr_τ, D^val_τ})
 7      Θ^(t+1) = Θ^(t) + β Σ_{τ∈T_t} ∇_Θ L^[i]_τ
 8      t = t + 1

Algorithm Meta-test()
 1  require: learned Θ and D^tr_τ from a new task τ
 2  compute the posterior λ_τ = VI(Θ, D^tr_τ)
 3  sample θ_τ ~ P(θ; λ_τ)
 4  return f(·; θ_τ) for evaluation

Subroutine VI(Θ, D_τ)
 1  initialize λ_τ at Θ
 2  while not done do
 3      sample ε ~ p(ε)
 4      λ_τ = λ_τ + α ∇_{λ_τ} [log P(D_τ | g(λ_τ, ε)) − KL(P(θ_τ; λ_τ) ‖ P(θ_τ; Θ))]
 5  return λ_τ
```

Algorithm 1: Extended Empirical Bayes Meta-learning framework. VI reparameterizes $\theta_\tau \sim P(\theta_\tau; \lambda_\tau)$ using a differentiable transformation $g(\lambda_\tau, \epsilon)$ of an auxiliary noise variable $\epsilon$, such that $\theta_\tau = g(\lambda_\tau, \epsilon)$ with $\epsilon \sim p(\epsilon)$ [12].

3 Method

In this section, we first introduce the gradient-based variational inference subroutine VI, which relates to a variety of existing methods, and then present our proposed subroutine Meta-Gradient, inspired by the Gradient-EM algorithm, comparing it with the most widely used existing methods for this subroutine.

3.1 Variational Inference

Notice that this framework requires computing posteriors over complex models such as neural networks. To achieve this, we approximate the posterior with the same parametric family $P(\theta_\tau; \lambda_\tau)$ used to approximate the prior $P(\theta_\tau; \Theta)$, and use variational inference to compute the parameters, as in previous work [20]. Let $P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))$ be the approximation of the posterior $P(\theta_\tau|D_\tau; \Theta)$ obtained by minimizing their KL divergence. Since

$$L(\Theta; D_\tau) = \log P(D_\tau; \Theta) = KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau|D_\tau; \Theta)] + \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau, \theta_\tau; \Theta) - \log P(\theta_\tau; \lambda_\tau)] \qquad (2)$$

is constant in $\lambda_\tau$, we have $\lambda_\tau(D_\tau; \Theta) = \arg\min_{\lambda_\tau} KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau|D_\tau; \Theta)] = \arg\max_{\lambda_\tau} \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau, \theta_\tau; \Theta) - \log P(\theta_\tau; \lambda_\tau)] = \arg\max_{\lambda_\tau} \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau|\theta_\tau)] - KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau; \Theta)]$. So the inference process finds $\lambda_\tau$ maximizing the evidence lower bound

$$\mathrm{ELBO}^{(\tau)}(\lambda_\tau; \Theta) = \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau|\theta_\tau)] - KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau; \Theta)]$$

via mini-batch gradient descent [20]. The gradients of the KL terms are calculated analytically in the Gaussian case, while the gradients of the expectations are computed by Monte Carlo with reparameterization, along with variance-reduction tricks [11, 28]. By the analysis in Section 2.3, only a few gradient steps are needed for this process once $\Theta$ is well learned by our framework. We summarize the subroutine VI in Algorithm 1.

A special case worth mentioning is the use of a delta function for the posterior approximation, $P(\theta_\tau; \lambda_\tau) = \delta_{\mu_{\lambda_\tau}}(\theta_\tau)$, for which $\lambda_\tau(D_\tau; \Theta) = \arg\max_{\mu_{\lambda_\tau}}[\log P(D_\tau|\mu_{\lambda_\tau}) - \|\mu_{\lambda_\tau} - \mu_\Theta\|^2/(2\Sigma_\Theta^2)]$. This is exactly the inner-update step of iMAML [18], and of MAML [3] and Reptile [17] if we replace the $\ell_2$ regularization term with choosing $\mu_\Theta$ as the initial point for gradient-based optimization: $\mu_{\lambda_\tau}(\mu_\Theta) = \mu_\Theta + \alpha \nabla_\theta \log P(D_\tau|\theta)|_{\theta=\mu_\Theta}$.
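A minimal PyTorch sketch of the VI subroutine for a diagonal-Gaussian posterior follows. It assumes a user-supplied differentiable log-likelihood `log_lik_fn`; the use of Adam and the hyper-parameter values are illustrative choices, not the paper's exact settings.

```python
import torch
from torch.distributions import Normal, kl_divergence

def vi(Theta, data, log_lik_fn, steps=10, lr=1e-2, n_mc=5):
    """Subroutine VI: fit lambda_tau = (mu, log_sigma) of a diagonal-Gaussian
    posterior P(theta; lambda) by ascending ELBO(lambda; Theta), initialized
    at the prior Theta = (mu0, log_sigma0)."""
    mu0, log_sigma0 = (t.detach() for t in Theta)
    mu = mu0.clone().requires_grad_(True)
    log_sigma = log_sigma0.clone().requires_grad_(True)
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    prior = Normal(mu0, log_sigma0.exp())
    for _ in range(steps):
        opt.zero_grad()
        q = Normal(mu, log_sigma.exp())
        theta = q.rsample((n_mc,))          # reparameterization: theta = g(lambda, eps)
        exp_ll = torch.stack([log_lik_fn(data, th) for th in theta]).mean()
        kl = kl_divergence(q, prior).sum()  # analytic KL between diagonal Gaussians
        (-(exp_ll - kl)).backward()         # descend the negative ELBO
        opt.step()
    return mu.detach(), log_sigma.detach()
```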
3.2 Meta-Gradient

The essential part of this meta-learning framework is computing the gradient $\nabla_\Theta L^{[i]}_\tau$. We show below that this problem reduces to computing $\nabla_\Theta L(\Theta; D_\tau) = \nabla_\Theta \log P(D_\tau; \Theta)$ given $D_\tau$ and $\Theta$. For $L^{[1]}$, this is direct. For $L^{[2]}$, there are two approaches. The first is to compute $\lambda_\tau(D^{tr}_\tau; \Theta)$ as stated above, then

$$\nabla_\Theta L^{[2]}_\tau = \nabla_\Theta L(\Theta, D^{tr}_\tau; D^{val}_\tau) = \nabla_\Theta L(\lambda_\tau(D^{tr}_\tau; \Theta); D^{val}_\tau) = \nabla_{\Theta'} L(\Theta'; D^{val}_\tau)\big|_{\Theta'=\lambda_\tau(D^{tr}_\tau;\Theta)} \cdot \nabla_\Theta \lambda_\tau(D^{tr}_\tau; \Theta) \qquad (3)$$

where $\nabla_\Theta \lambda_\tau(D^{tr}_\tau; \Theta)$ can be computed by auto-differentiation (if $\lambda_\tau(D^{tr}_\tau; \Theta)$ is computed by a gradient-based algorithm). This approach is widely used in previous work such as [3, 4, 27, 6]. The second approach is ours, shown in subroutine Meta-Gradient: GEM-BML+ below. We use the property $L(\Theta, D^{tr}_\tau; D^{val}_\tau) = L(\Theta; D^{tr}_\tau \cup D^{val}_\tau) - L(\Theta; D^{tr}_\tau)$ (proof in Appendix A.4), so that $\nabla_\Theta L^{[2]}_\tau = \nabla_\Theta L(\Theta; D^{tr}_\tau \cup D^{val}_\tau) - \nabla_\Theta L(\Theta; D^{tr}_\tau)$ is the difference of two $L^{[1]}$-type terms, reducing the problem to computing terms of the form $\nabla_\Theta L(\Theta; D_\tau)$.

Gradient-EM Estimator. We propose an efficient way to compute $\nabla_\Theta L(\Theta; D_\tau)$ through the gradient of the complete log-likelihood. This is guaranteed by the following Gradient-EM theorem, inspired by an observation in [23].

Theorem 1. $\nabla_\Theta L(\Theta; D) = \mathbb{E}_{\theta \sim P(\theta|D;\Theta)} \nabla_\Theta \log P(D, \theta; \Theta)$.

Proof:

$$\nabla_\Theta L(\Theta; D) = \nabla_\Theta \log P(D|\Theta) = \frac{1}{P(D|\Theta)} \nabla_\Theta \int P(D, \theta|\Theta)\, d\theta = \int \frac{P(D, \theta|\Theta)}{P(D|\Theta)} \nabla_\Theta \log P(D, \theta|\Theta)\, d\theta = \int P(\theta|D, \Theta)\, \nabla_\Theta \log P(D, \theta|\Theta)\, d\theta = \mathbb{E}_{\theta \sim P(\theta|D;\Theta)} \nabla_\Theta \log P(D, \theta; \Theta).$$

Under the hierarchical modeling structure, $\nabla_\Theta \log P(D_\tau, \theta_\tau|\Theta) = \nabla_\Theta \log[P(D_\tau|\theta_\tau)\, P(\theta_\tau; \Theta)] = \nabla_\Theta \log P(\theta_\tau; \Theta)$, since $P(D_\tau|\theta_\tau)$ does not depend on $\Theta$. Combining with Theorem 1, $\nabla_\Theta L(\Theta; D_\tau) = \mathbb{E}_{\theta_\tau \sim P(\theta|D_\tau;\Theta)} \nabla_\Theta \log P(\theta_\tau; \Theta)$. After using VI to compute the approximate posterior parameter $\lambda_\tau(D_\tau, \Theta)$, the above estimator becomes $\hat g = \mathbb{E}_{\theta_\tau \sim P(\theta_\tau;\lambda_\tau(D_\tau,\Theta))} \nabla_\Theta \log P(\theta_\tau; \Theta)$, which can be calculated analytically in the Gaussian case as we show in Appendix B.7. This gives us two Meta-Gradient subroutines, GEM-BML and GEM-BML+, for $\nabla_\Theta L^{[1]}_\tau$ and $\nabla_\Theta L^{[2]}_\tau$ respectively. We name Algorithm 1 with these two subroutines our algorithms GEM-BML and GEM-BML+.

```
1  Subroutine Meta-Gradient: GEM-BML(Θ, {D^tr, D^val})
2      compute the posterior λ^tr = VI(Θ, D^tr)
3      compute the posterior λ^tr∪val = VI(λ^tr, D^val)
4      return ĝ = E_{θ~P(θ; λ^tr∪val)} ∇_Θ log P(θ; Θ)

1  Subroutine Meta-Gradient: GEM-BML+(Θ, {D^tr, D^val})
2      compute the posterior λ^tr = VI(Θ, D^tr)
3      compute the posterior λ^tr∪val = VI(λ^tr, D^val)
4      return ĝ = E_{θ~P(θ; λ^tr∪val)} ∇_Θ log P(θ; Θ) − E_{θ~P(θ; λ^tr)} ∇_Θ log P(θ; Θ)
```
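For the diagonal-Gaussian parameterization, the estimator $\hat g$ in Line 4 of the subroutines has a closed form (the paper derives it in Appendix B.7). The sketch below is our reconstruction under the convention $\Theta = (\mu, \log\sigma)$, which is an assumption on the parameterization.

```python
import numpy as np

def gem_grad(Theta, lam):
    """Closed form of g_hat = E_{theta ~ N(lam)} grad_Theta log N(theta; Theta)
    for diagonal Gaussians, with Theta = (mu, log_sigma) and
    lam = (mu_q, log_sigma_q). A sketch of the Gaussian case."""
    mu, log_sigma = Theta
    mu_q, log_sigma_q = lam
    var, var_q = np.exp(2 * log_sigma), np.exp(2 * log_sigma_q)
    g_mu = (mu_q - mu) / var                              # d/d mu
    g_log_sigma = -1.0 + (var_q + (mu_q - mu) ** 2) / var  # d/d log_sigma
    return g_mu, g_log_sigma

def gem_bml_plus_grad(Theta, lam_tr, lam_tr_val):
    """grad_Theta L^[2]_tau as the difference of two GEM terms
    (Subroutine GEM-BML+): posterior after D^tr u D^val minus after D^tr."""
    g1 = gem_grad(Theta, lam_tr_val)
    g2 = gem_grad(Theta, lam_tr)
    return tuple(a - b for a, b in zip(g1, g2))
```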
ELBO Gradient Estimator. As a comparison, one of the most widely used methods to optimize $L^{[i]}$ in Bayesian meta-learning is optimizing the ELBO [20] (see Appendix B.6 for other existing methods and a comparative analysis). Here we show it is actually another way to estimate $\nabla_\Theta L(\Theta; D_\tau)$. According to equation (2), when the VI approximation error $KL[P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))\,\|\,P(\theta_\tau|D_\tau; \Theta)]$ is small enough, we have $L(\Theta; D_\tau) \approx \mathbb{E}_{P(\theta_\tau;\lambda_\tau(D_\tau;\Theta))}[\log P(D_\tau, \theta_\tau; \Theta) - \log P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))] = \mathbb{E}_{P(\theta_\tau;\lambda_\tau(D_\tau;\Theta))}[\log P(D_\tau|\theta_\tau)] - KL[P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))\,\|\,P(\theta_\tau|\Theta)] = \mathrm{ELBO}^{(\tau)}(\lambda_\tau(D_\tau; \Theta); \Theta)$. So the gradient can be computed by

$$\nabla_\Theta L(\Theta; D_\tau) \approx \nabla_\Theta \mathrm{ELBO}^{(\tau)}(\lambda_\tau(D_\tau; \Theta); \Theta) = \partial_{\lambda_\tau} \mathrm{ELBO}^{(\tau)}(\lambda_\tau; \Theta)\big|_{\lambda_\tau=\lambda_\tau(D_\tau;\Theta)} \cdot \nabla_\Theta \lambda_\tau(D_\tau; \Theta) + \partial_\Theta \mathrm{ELBO}^{(\tau)}(\lambda_\tau; \Theta)\big|_{\lambda_\tau=\lambda_\tau(D_\tau;\Theta)}. \qquad (4)$$

The first partial-gradient term can be computed by the same method as in Section 3.1, and the second can be calculated analytically in the Gaussian case. In fact, Gradient-EM (GEM) can also be viewed as a coordinate-ascent algorithm for the ELBO, a variant of EM, as we show in Appendix B.3.

Compared to the ELBO gradient, GEM avoids the backprop computation of $\nabla_\Theta \lambda_\tau(D_\tau; \Theta)$, which gives it a series of advantages, as we specify in Section 4.2. Both the GEM and ELBO gradients have estimation error arising from the discrepancy between the VI-estimated posterior and the true posterior. We show empirically in Appendix C.1 that GEM has consistently lower estimation error than the ELBO gradient. We also show in Appendix A.3 that our method has a theoretical bound on the estimation error in terms of the VI discrepancy:

$$\|\hat g - \nabla_\Theta L(\Theta; D_\tau)\| \le M \sqrt{D_{KL}(P(\theta|D_\tau; \Theta)\,\|\,P(\theta; \lambda_\tau(D_\tau, \Theta)))}$$

where $M$ is a bounded constant.

4 Analysis

4.1 Matrix of Related Works

We compare Gradient-EM (our method) with the ELBO gradient over the two loss functions $L^{[1]}$ and $L^{[2]}$, summarized in Table 1.

| | $L^{[1]} = \mathbb{E}_{\tau \sim \mathcal{T}} L(\Theta; D^{tr}_\tau \cup D^{val}_\tau)$ | $L^{[2]} = \mathbb{E}_{\tau \sim \mathcal{T}} L(\lambda_\tau(D^{tr}_\tau; \Theta); D^{val}_\tau)$ |
|---|---|---|
| ELBO gradient | Amortized BML [20] | related to PMAML [4] |
| Gradient-EM | GEM-BML (our method); reduces to Reptile [17] in the delta case | KL-Chaser Loss (related to the l2-Chaser Loss of BMAML [27]) |

Table 1: Matrix of related works.

It turns out each element of this matrix is related to a previous work or our method. Notice that this matrix can be extended with more columns (e.g., one more column for $L^{[2]'} = \mathbb{E}_{\tau \sim \mathcal{T}}[L(\Theta; D^{tr}_\tau \cup D^{val}_\tau) - L(\Theta; D^{tr}_\tau)]$) and more rows, yielding a larger matrix with blank elements (models) that have not been explored before. For example, the KL-Chaser Loss model in the bottom right of Table 1 has not been studied before. We leave a thorough study of all combinations to future work. Here we only show how MAML and Reptile fit into this Bayesian framework; further details are left to Appendix B.4.

To see this, consider using fixed variance parameters for both the prior $P(\theta_\tau; \mu_\Theta, \Sigma_\Theta = C_0)$ and the posterior $P(\theta_\tau; \mu_{\lambda_\tau}, \Sigma_{\lambda_\tau} = C_\tau)$, and let $C_\tau \to 0$ so the posterior becomes the delta distribution $\delta_{\mu_{\lambda_\tau}}(\theta_\tau)$. We can compute $\mu_{\lambda_\tau}(\mu_\Theta)$ by gradient descent from $\mu_\Theta$. MAML uses $L^{[2]}$ as the meta-loss function. Under the delta-distribution posterior we have

$$L^{[2]}_\tau = \log \int P(D_\tau|\theta_\tau)\, \delta_{\mu_{\lambda_\tau}(\mu_\Theta)}(\theta_\tau)\, d\theta_\tau = \log P(D_\tau|\mu_{\lambda_\tau}(\mu_\Theta)),$$

so $\nabla_\Theta L^{[2]} = \nabla_{\mu_\Theta} \log P(D_\tau|\mu_{\lambda_\tau}(\mu_\Theta))$ can be computed directly by back-propagation through the neural network. On the other hand, Reptile uses $L^{[1]}$ as the meta-loss function. Using the GEM gradient $\hat g$ we have

$$\nabla_\Theta L^{[1]}_\tau = \mathbb{E}_{\delta_{\mu_{\lambda_\tau}(\mu_\Theta)}(\theta_\tau)} \nabla_\Theta \log P(\theta_\tau; \mu_\Theta, \Sigma_\Theta = C_0) = -\nabla_{\mu_\Theta} \frac{\|\mu_\Theta - \mu_{\lambda_\tau}(\mu_\Theta)\|^2}{2 C_0} \propto \mu_{\lambda_\tau}(\mu_\Theta) - \mu_\Theta,$$

which is the Reptile gradient. Also notice that if we further let $C_0 \to 0$, so that the prior also becomes a delta, then

$$\nabla_\Theta L^{[1]}_\tau = \nabla_{\mu_\Theta} \log \int P(D_\tau|\theta_\tau)\, \delta_{\mu_\Theta}(\theta_\tau)\, d\theta_\tau = \nabla_{\mu_\Theta} \log P(D_\tau|\mu_\Theta).$$

This corresponds to "pre-train", which simply trains a single model to fit the data of all tasks combined.
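To see the delta-case reduction numerically, here is a small sketch of the resulting meta-update: with fixed prior variance $C_0$ and a delta posterior at the adapted weights $\phi_\tau$, the GEM gradient direction is $(\phi_\tau - \mu_\Theta)/C_0$, i.e. a Reptile step. The `inner_opt` argument and the toy quadratic inner loss are placeholders for any inner-update optimizer, not the paper's code.

```python
import numpy as np

def reptile_meta_step(mu_Theta, inner_opt, tasks, beta=0.1, C0=1.0):
    """Delta-posterior special case of the GEM meta-gradient: with the
    posterior a delta at phi_tau = inner_opt(mu_Theta, tau),
    g_hat = (phi_tau - mu_Theta) / C0 is the Reptile update direction."""
    g = np.zeros_like(mu_Theta)
    for tau in tasks:
        phi_tau = inner_opt(mu_Theta, tau)   # e.g. a few SGD/Adam steps
        g += (phi_tau - mu_Theta) / C0
    return mu_Theta + beta * g               # ascent step of Algorithm 1

# Toy usage: each task tau is a target vector; the "inner optimizer" takes
# a few gradient steps on the squared loss ||w - tau||^2.
def toy_inner_opt(w0, tau, lr=0.25, steps=3):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2 * (w - tau)
    return w

mu = np.zeros(2)
for _ in range(50):
    mu = reptile_meta_step(mu, toy_inner_opt,
                           tasks=[np.array([1.0, 0.0]), np.array([0.0, 1.0])])
# mu converges toward the task centroid [0.5, 0.5].
```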
4.2 Advantages of GEM

Observe that all methods in the above matrix first compute the posterior parameters $\lambda_\tau(D_\tau; \Theta)$ and then use them to compute the sampled meta-loss gradient $\nabla_\Theta L^{[i]}_\tau$. Following the convention of [3], we call the step of computing $\lambda_\tau(D_\tau; \Theta)$ the inner-update and the step of computing $\nabla_\Theta L^{[i]}_\tau$ the meta-update. Notice that both the $L^{[2]} = \mathbb{E}_{\tau \sim \mathcal{T}} L(\lambda_\tau(D^{tr}_\tau; \Theta); D^{val}_\tau)$ column and the ELBO-gradient row involve the computation of $\nabla_\Theta \lambda_\tau(D_\tau; \Theta)$ (equations (3) and (4)). The meta-update of these three methods therefore has to backpropagate through the inner optimization process, which brings a number of burdens and limitations, while GEM avoids this computation and thus gains the advantages mentioned in the Introduction.

Also notice that, assuming independence between neural-network layers, the meta-update of our algorithm (Line 4 of Subroutine GEM-BML(+)) can be computed across layers in parallel, which may greatly reduce computation time in deep neural networks. A detailed analysis of our advantages is given in Appendix B.5.

5 Experiment

5.1 Regression

The purpose of this experiment is to test our methods' fast-adaptation ability and robustness to meta-level uncertainty. We compare our models GEM-BML and GEM-BML+ with MAML [3], Reptile [17], and the Bayesian meta-learning benchmarks BMAML [27] and Amortized BML [20] on the same sinusoidal regression problem. We first apply the default setting of [3], then a more challenging setting containing more uncertainty, as proposed in [27], to demonstrate robustness to meta-level uncertainty. Data for each task are generated from $y = A\sin(wx + b) + \epsilon$, with amplitude $A$, frequency $w$, and phase $b$ as task parameters and observation noise $\epsilon$. Task parameters are sampled from uniform distributions $A \in [0.1, 5.0]$, $b \in [0.0, 2\pi]$, $w \in [0.5, 2.0]$, and the observation noise follows $\epsilon \sim N(0, (0.01A)^2)$; $x$ ranges over $[-5.0, 5.0]$. For each task, $K = 10$ observations ($\{x_i, y_i\}$ pairs) are given. The underlying network architecture (2 hidden layers of size 40 with ReLU activations) is the same as in [3] to make a fair comparison.

Figure 2: (a) Sinusoidal regression results: meta-test error in the default and challenging settings after 40000 meta-train iterations. (b) Computation and memory trade-offs (normalized GPU memory and compute time versus number of inner-update gradient steps, 1 to 5) with a 4-layer CNN on the 1-shot, 5-class MiniImageNet task, comparing GEM-BML+ (ours), BMAML (M=5), Reptile, MAML, foMAML, Amortized BML, and implicit MAML (CG=2). (BMAML is beyond the range of the plot.)

In Figure 2(a), we plot the mean-squared-error (MSE) performance on test tasks during the meta-test process under both settings. Under the default setting, our methods show fast-adaptation ability similar to previous methods. The challenging-setting results show that the Bayesian methods GEM-BML(+), BMAML, and Amortized BML can still extract information in a high-uncertainty environment, while the non-Bayesian models MAML and Reptile fail to learn. We also observe that our model provides a stable meta-train learning curve and continues to improve as more gradient steps are performed, without overfitting. This demonstrates the robustness of Bayesian methods, a result of their probabilistic nature and their ability to control overfitting.

5.2 Classification

This experiment aims to answer the following questions: (1) Does our model save computation time and memory by avoiding meta-update backprop, as we claimed? (2) Can our methods scale to few-shot image classification benchmarks and achieve good accuracy and predictive uncertainty? To study (1), we turn to the MiniImageNet [21] dataset on the 1-shot, 5-class task.
We compare GEM-BML+ (GEM-BML is even less expensive) with MAML and its first-order variant foMAML, Reptile, iMAML, Amortized BML, and BMAML in Figure 2(b). Just like other first-order meta-learning algorithms and iMAML, which decouple the inner-update and meta-update, the memory usage of GEM-BML(+) is independent of the number of inner-update gradient steps, since inner-update computations other than the final-step result need not be stored. In contrast, MAML-like algorithms (MAML, Amortized BML) need memory that grows linearly with the number of inner-update gradient steps. The situation is similar for compute time: MAML-like algorithms require expensive backprop over the inner-update optimization during the meta-update, so their compute cost grows at a faster rate than that of GEM-BML(+), foMAML, Reptile, and iMAML (iMAML has a relatively high base compute cost because of Hessian computation).

To study (2), we applied our method to N-class image classification on the Omniglot and MiniImageNet datasets, popular few-shot learning benchmarks [25, 24, 21]. Notice that the purpose of this experiment is not to compete with the state of the art on these benchmarks but to provide an apples-to-apples comparison with prior works within our extended empirical Bayes framework. So for a fair comparison, we use the identical backbone convolutional architecture [3] as these prior works. Note, however, that this backbone can be replaced with other architectures, leading to better results for all algorithms [2, 10]; we leave improving our method with better backbones to challenge the state of the art to future work. The inner-update is computed using Adam, demonstrating the flexibility of our methods in the choice of inner-update optimizer. The results in Tables 2 and 3 show that our methods perform as well as the best prior methods within our extended empirical Bayes framework.

| Omniglot | 1-shot, 5-class | 5-shot, 5-class | 1-shot, 20-class | 5-shot, 20-class |
|---|---|---|---|---|
| MAML | 98.7 ± 0.4% | 99.9 ± 0.1% | 95.8 ± 0.3% | 98.9 ± 0.2% |
| first-order MAML | 98.3 ± 0.5% | 99.2 ± 0.2% | 89.4 ± 0.5% | 97.9 ± 0.1% |
| Reptile | 97.68 ± 0.04% | 99.48 ± 0.06% | 89.43 ± 0.14% | 97.12 ± 0.32% |
| iMAML | 99.50 ± 0.26% | 99.74 ± 0.11% | 96.18 ± 0.36% | 99.14 ± 0.1% |
| GEM-BML+ (ours) | 99.23 ± 0.42% | 99.64 ± 0.08% | 96.24 ± 0.35% | 98.94 ± 0.25% |

Table 2: Few-shot classification on the Omniglot dataset. The ± shows 95% confidence intervals over different testing tasks. All compared results are from the original literature.

Predictive uncertainty is the probability associated with the predicted class label, indicating how likely the predictions are to be correct. To measure the predictive uncertainty of the models, we apply two quantitative metrics, ECE and MCE [16, 8], on the MiniImageNet dataset. Smaller ECE and MCE indicate a better-calibrated model; a perfectly calibrated model achieves 0 on both metrics. The results of ECE and MCE for our models and previous works are shown in Table 3. Our model is slightly better calibrated than the state-of-the-art Bayesian meta-learning model Amortized BML and clearly outperforms the non-Bayesian models. This shows our model can learn a good prior and make well-calibrated probability predictions, an advantage of the Bayesian approach.

| MiniImageNet | 1-shot, 5-class |
|---|---|
| MAML [3] | 48.70 ± 1.84% |
| first-order MAML [3] | 48.07 ± 1.75% |
| Reptile [17] | 49.97 ± 0.32% |
| iMAML [18] | 49.30 ± 1.88% |
| Amortized BML [20] | 45.0 ± 0.60% |
| GEM-BML+ (ours) | 50.03 ± 1.63% |

| Predictive uncertainty | ECE | MCE |
|---|---|---|
| MAML | 0.0471 | 0.1104 |
| Amortized BML | 0.0124 | 0.0257 |
| GEM-BML+ (ours) | 0.0102 | 0.0197 |

Table 3: Accuracy and predictive-uncertainty measurement of few-shot classification on the MiniImageNet dataset. Smaller ECE and MCE indicate a better-calibrated model.
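For reference, here is a standard sketch of how ECE and MCE are computed by confidence binning [16, 8]; the bin count and equal-width binning scheme are conventional choices, not necessarily the paper's exact evaluation code.

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected / Maximum Calibration Error: bin predictions by confidence
    and compare each bin's average confidence to its empirical accuracy."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap   # bin-weighted average gap
        mce = max(mce, gap)             # worst-bin gap
    return ece, mce

# Usage: ece, mce = ece_mce(max_softmax_probs, predictions == labels)
```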
5.3 Reinforcement Learning

We test and compare the models on the same 2D Navigation and MuJoCo continuous-control tasks as used in [3]. See Appendix C.4.3 for detailed descriptions of the experiment settings and hyper-parameters. For a fair comparison, we use the same policy network architecture as [3], with two hidden layers of 100 ReLU units each. At meta-train time, we collect $K$ sample rollouts of the current policy and another $K$ sample rollouts after one policy-gradient update, as in [3], where $K$ is the inner-batch size. At meta-test time, we compare adaptation to a new task with up to 3 gradient updates, each with 40 samples. We compare against two baseline models: MAML and Reptile. MAML uses TRPO in the meta-update to boost performance, while our meta-update is data-free, as specified in the sections above. For inner-updates, thanks to our model's flexibility in choosing the inner-update optimizer, we can use either vanilla policy gradient (REINFORCE) (Williams, 1992) or the specially designed TRPO proposed by [3]. We find that the TRPO inner-update performs better on 2D navigation, while vanilla policy gradient tends to be better on the MuJoCo continuous-control tasks. We hypothesize that in complex task settings the task-distribution variance tends to be higher ($A$ is larger in Figure 1(a)), while TRPO limits the step size of each inner-update, making the task parameters hard to reach within a few gradient steps.

Figure 3: Reinforcement learning results.

As shown in Figure 3, GEM-BML+ outperforms MAML, while Reptile and GEM-BML perform less well. This shows the $L^{[2]}$ variant is necessary in RL, which has high in-task variance and is easily overfitted. Previous work [17] showed it is hard to adapt algorithms with the advantage of a data-free meta-update (Reptile-like algorithms) to RL, but with our $L^{[2]}$ variant we can adapt to RL while preserving this advantage. Our results show that the key to adaptation in RL is the $L^{[2]}$ variant, highlighting the contribution of the GEM-BML+ algorithm.

6 Related Works

Hierarchical Bayes (HB) and empirical Bayes (EB) were studied extensively in the past [9] to exploit statistical connections between related tasks. Since then, deep neural networks (DNNs) have attracted enormous attention, and efforts to measure the uncertainty of DNNs followed, in which Bayesian and sampling methods are widely applied [1]. The research trend in multi-task learning and transfer learning likewise shifted to the fine-tuning framework for DNNs. Model-Agnostic Meta-Learning (MAML) [3] emerged from this motivation: to find good initial parameters that can be quickly adapted to new tasks within a few gradient steps. Recently, Bayesian models have made a strong comeback because of their probabilistic nature in uncertainty measurement and automatic overfitting prevention. [26] applied hierarchical Bayesian modeling to multi-task reinforcement learning. [6] related MAML to a hierarchical Bayesian model and proposed a Laplace approximation to capture isotropic Gaussian uncertainty of the model parameters. BMAML [27] used Stein Variational Gradient Descent (SVGD) to obtain posterior samples and proposed a Chaser Loss to prevent meta-level overfitting. PMAML [4] also proposed a gradient-based method to obtain a fixed measure of prior and posterior uncertainty. Amortized BML [20] proposed a MAML-like variational inference method for amortized Bayesian meta-learning. None of the methods above can make the inner-update and meta-update separable, which greatly limits the flexibility of the inner-update optimization process. iMAML [18] proposes an implicit-gradient method for MAML that makes the inner-update and meta-update separable, at the cost of computing second-order derivatives and solving a QP in each meta-update step. Notice that this method is not within our framework, though it shares the style, because its update is derived by approximating MAML rather than from a Bayesian approach.
7 Conclusion

Inspired by the Gradient-EM algorithm, we have proposed the GEM-BML(+) algorithms for Bayesian meta-learning. Our method is based on the theoretical insight of the Gradient-EM theorem and the Bayesian formulation of multi-task meta-learning. It avoids backprop in the meta-update and decouples the meta-update from the inner-update. We have tested our method on sinusoidal regression, few-shot image classification, and reinforcement learning to demonstrate its advantages. For future work, we plan to apply our method to state-of-the-art image classification backbones and to extend our work with nonparametric Gaussian approximations to handle multimodal and dynamic task-distribution situations.

Broader Impact

Meta-learning algorithms can be applied in AI products that require fast adaptation with few data points. Examples include: 1) facial recognition systems for enterprises where only a few photos of each employee are taken as training samples; 2) manufacturing robots that master new tasks quickly from a few human demonstrations; 3) AI assistants that customize to a new user after a few interactions. Our research, in particular, makes an impact by introducing a novel approach that improves the computational efficiency, robustness, and other properties of meta-learning. This generally benefits future AI research in meta-learning rather than having a direct impact on a specific product. In particular, our method improves computational efficiency and uncertainty prediction, and has potential use in building distributed and privacy-protected meta-learning systems. It also has a potential advantage for deep NNs because the meta-update can be parallelized across network layers. Specifically, under a distributed setting where the meta-update and inner-update take place on separate devices, unlike previous methods, our method avoids transmission of gradients, which can leak user data as shown by recent research [29]. It may help protect user privacy and enhance the decentralization of AI, preventing the monopolization of AI.

Acknowledgements

The authors would like to thank Zhiwei Qin for his support and detailed feedback on an early draft of the paper, Prof. Yuhong Guo for technical advice, Prof. Jieping Ye and Prof. Hongtu Zhu for their support, Tristan Deleu for his support on implementation, and the anonymous reviewers for their comments. The authors would also like to thank Qian Qiao for helpful support, without which this work would not have been accomplished.

References

[1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

[2] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.

[3] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[4] C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.

[5] I. J. Good. Some history of the hierarchical Bayesian methodology. Trabajos de estadística y de investigación operativa, 31(1):489, 1980.

[6] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

[7] E. Grant, G. Jerfel, K. Heller, and T. L. Griffiths. Modulating transfer between tasks in gradient-based meta-learning. 2018.

[8] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. JMLR.org, 2017.
[9] T. Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. 1998.

[10] J. Kim, S. Lee, S. Kim, M. Cha, J. K. Lee, Y. Choi, Y. Choi, D.-Y. Cho, and J. Kim. Auto-Meta: Automated gradient based meta learner search. arXiv preprint arXiv:1806.06927, 2018.

[11] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[13] M. J. Lindstrom and D. M. Bates. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404):1014–1022, 1988.

[14] D. J. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.

[15] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.

[16] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[17] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

[18] A. Rajeswaran, C. Finn, S. Kakade, and S. Levine. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.

[19] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560, 2000.

[20] S. Ravi and A. Beatson. Amortized Bayesian meta-learning. 2018.

[21] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.

[22] H. Robbins and S. Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, volume 102, page 109. Springer, New York, 1985.

[23] R. Salakhutdinov, S. T. Roweis, and Z. Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 672–679, 2003.

[24] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[25] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[26] A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015–1022. ACM, 2007.

[27] J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 7343–7353, 2018.

[28] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4203–4212, 2018.

[29] L. Zhu, Z. Liu, and S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems, pages 14747–14756, 2019.