# Gradient-EM Bayesian Meta-learning

Yayi Zou (Didi AI Labs @ Silicon Valley, yz725@cornell.edu), Xiaoqi Lu (Columbia University, lx2170@columbia.edu)

Abstract

Bayesian meta-learning enables robust and fast adaptation to new tasks with uncertainty assessment. The key idea behind Bayesian meta-learning is empirical Bayes inference of a hierarchical model. In this work, we extend this framework to include a variety of existing methods, before proposing our variant based on the Gradient-EM algorithm. Our method improves computational efficiency by avoiding back-propagation in the meta-update step, which is expensive for deep neural networks. Furthermore, it gives flexibility to the inner-update optimization procedure by decoupling it from the meta-update. Experiments on sinusoidal regression, few-shot image classification, and policy-based reinforcement learning show that our method not only achieves better accuracy with less computation cost, but is also more robust to uncertainty.

1 Introduction

Meta-learning, also known as learning to learn, has gained tremendous attention in both academia and industry, especially with applications to few-shot learning [3]. These methods exploit the shared structure of the multi-task setting, so that learning from previous tasks helps master new tasks faster. Early fast meta-learning algorithms were gradient-based and deterministic, which may cause overfitting at both the inner level and the meta level [15]. With growing interest in prediction-uncertainty evaluation and overfitting control, later studies explored probabilistic meta-learning methods [6, 27, 4]. Bayesian inference is widely considered one of the most convenient choices because its Occam's Razor property [14] automatically prevents the overfitting that occurs very often in deep neural networks (DNNs). It also provides reliable predictive uncertainty because of its probabilistic nature. This makes Bayesian methods important for DNNs which, as [8] shows, are usually poorly calibrated in their predictive uncertainty, unlike shallow neural networks.

The theoretical foundation of Bayesian meta-learning is hierarchical Bayes (HB) [5] or empirical Bayes (EB) [22], where the former can be seen as adding a hyper-prior over the latter [20]. For simplicity, in this paper we focus on EB, restricting the learning of meta-parameters to point estimates. A common solution of EB is a bi-level iterative optimization procedure [20, 13], where the inner-update refers to adaptation to a given task and the meta-update optimizes the meta-training objective.

Starting from a generalized meta-learning setting, we propose a general Bayesian framework and claim its optimality under certain metrics. This framework extends the original optimization framework with a train/val split in the inner-update procedure to mitigate in-task overfitting, which is important for NN-based meta-learning. We also hypothesize a mechanism by which the EB framework achieves fast adaptation (few inner-update gradient steps) under Gaussian parameterization, along with empirical evidence. Moreover, we successfully adapt this EB framework to RL, both theoretically and empirically, which has not been done before. We show that many important previous works in (Bayesian) meta-learning [20, 4, 27, 3, 17] can be included in this extended framework.
However, in these previous works the meta-update step requires backpropagation through the inner optimization process [18], which imposes large computation and memory burdens as the number of inner-update gradient steps increases. This limits possible applications, especially those that require many inner-update gradient steps or involve large datasets (Appendix C.2).

Motivated by the above observations, we propose a gradient-based Bayesian algorithm inspired by the Gradient-EM algorithm. By designing a new way to compute the gradient of the meta loss in Bayesian MAML, we arrive at an algorithm that decouples meta-update and inner-update and thus avoids the computation and memory burden of previous methods, making it scalable to a large number of inner-update gradient steps. In addition, it allows great flexibility in the choice of inner-update optimization method, because it only requires the value of the result of the inner-update optimization, not the optimization process itself (for example, in experiments we use Adam for classification and Trust Region Policy Optimization for RL). The separability of meta-update and inner-update also makes it a potentially useful scheme for distributed learning and private learning. In experiments, we show that our method can quickly learn the knowledge and uncertainty of a novel task with less computation burden in sinusoidal regression, image classification, and reinforcement learning.

2 Problem Formulation and Framework

2.1 General Meta-learning Setting

We set up the K-shot meta-learning framework on reinforcement learning (RL) with episode length $H$ as in [3]; supervised learning is the special case $H = 1$. With a decision rule (policy) $f$ we can sample rollout data $D = \{x_t, a_t, r_t\}_{t=1}^{H}$ from the task environment, and $f$ can be evaluated on $D$ with a loss function $L(D, f)$. We assume each task $\tau$ is sampled i.i.d. from the task space $\mathcal{T}$, following some task distribution $P(\tau)$. During the meta-training phase we are given a set of training tasks $\mathcal{T}^{\text{meta-train}}$. For each task $\tau$ in this set, we collect $K$ sample rollouts of the current policy $f$, denoted $D^{tr}_\tau$, and another $K$ sample rollouts after one policy-gradient update of $f$, denoted $D^{val}_\tau$ ($f$ is not needed to generate samples in supervised learning). At the meta-testing phase, for a randomly sampled task $\tau$, $D^{tr}_\tau$ is first provided. We are then required to return $f_\tau$ based on $\{D^{tr}_{\tau'} \cup D^{val}_{\tau'} : \tau' \in \mathcal{T}^{\text{meta-train}}\}$ and $D^{tr}_\tau$, and evaluate its expected loss $l_\tau = \mathbb{E}_{D_\tau \sim \tau} L(D_\tau, f_\tau)$ on more samples generated from that task. The objective is an algorithm that produces a decision rule $f_\tau$ minimizing the expected test loss over all tasks, $\mathbb{E}_{\tau \sim P(\tau)}\, l_\tau$.

Figure 1: (a) Minimal $A$. Each dot (in a vector space) represents weights that make the NN model perform well on a certain task; different colors represent different tasks. The figure illustrates that many good solutions (local optima of the NN loss function) exist for each task. The area within the dotted circle is the small neighboring zone $A$ in which each color has at least one point. (b) Graphical model.

2.2 Extended Empirical Bayesian Meta-learning Framework

We consider a parameterized decision rule $f_\theta$ and construct a corresponding generative model $\log P(D|\theta) = -L(D, f_\theta)$ (we leave the details of this construction for RL to Appendix B.1).
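To make the hierarchical structure of Figure 1(b) concrete, the following minimal Python sketch samples the generative process $P(\tau) \to \theta_\tau \to D_\tau$, using the sinusoid family that appears later in Section 5.1 as the task distribution. The function names and the identification of $\theta_\tau$ with the task parameters $(A, w, b)$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Sample tau ~ P(tau): the sinusoid family of Section 5.1,
    y = A sin(w x + b) + eps. Here theta_tau is identified with (A, w, b)."""
    A = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, 2 * np.pi)
    w = rng.uniform(0.5, 2.0)
    return A, w, b

def sample_rollout(theta_tau, K=10):
    """Sample D_tau = {(x_i, y_i)}_{i=1..K} ~ P(D | theta_tau)."""
    A, w, b = theta_tau
    x = rng.uniform(-5.0, 5.0, size=K)
    y = A * np.sin(w * x + b) + rng.normal(0.0, 0.01 * A, size=K)
    return x, y

theta_tau = sample_task()
D_tr = sample_rollout(theta_tau)    # K-shot support set D^tr_tau
D_val = sample_rollout(theta_tau)   # query set D^val_tau
```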
For each task $\tau$, denote the best policy parameter, equivalently the best-fitted parameter of the underlying generative model, by $\theta_\tau = \arg\min_\theta \mathbb{E}_{D_\tau \sim \tau} L(D_\tau, f_\theta) = \arg\max_\theta \mathbb{E}_{D_\tau \sim \tau} \log P(D_\tau|\theta)$. In general such a maximizer is not unique; this is discussed in Section 2.3. With $\theta_\tau$ uniquely defined, we have a distribution $P(\theta_\tau)$ induced by $P(\tau)$ (change of variables). We summarize the graphical model in Figure 1(b). Under perfect approximation, the ground-truth generator matches our generative model, $P(D_\tau|\tau) = P(D_\tau|\theta_\tau)$, resulting in the following proposition (proof in Appendix A.1):

Proposition 1. Suppose a data generator is represented by the hierarchical model $P(D_\tau|\theta_\tau)$ and $P(\theta_\tau)$, and define $L(Q; D) = \log \mathbb{E}_{\theta \sim Q} P(D|\theta)$ for a distribution $Q$ over $\theta$. Let $(D^{tr}_\tau, D^{val}_\tau)$ be independent samples from task $\tau$, and consider $Q$ determined by $D^{tr}_\tau$ via $Q = g(D^{tr}_\tau)$. Then

$$P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau)) = \arg\max_g \mathbb{E}_\tau\, L(g(D^{tr}_\tau); D^{val}_\tau) \qquad (1)$$

where $P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau))$ is the posterior given the prior $P(\theta_\tau)$ and the observations $D^{tr}_\tau$.

Two observations follow. First, the proposition identifies the best decision rule we can produce at meta-testing, namely the one obtained from the posterior $P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau))$. Second, it suggests an estimator of $P(\theta_\tau)$ for meta-training: $\arg\max_{P(\theta_\tau)} \sum_\tau L(P(\theta_\tau|D^{tr}_\tau; P(\theta_\tau)); D^{val}_\tau)$. We prove in Appendix A.2 that this estimator is not only asymptotically consistent but also enjoys good asymptotic normality: it converges quickly to the true value with small variance as the number of tasks increases.

We further parameterize $P(\theta_\tau)$ as $P(\theta_\tau; \Theta)$ and introduce the shorthand $L(\Theta, D^{tr}_\tau; D^{val}_\tau) = L(P(\theta_\tau|D^{tr}_\tau; \Theta); D^{val}_\tau)$, so the meta-training optimization can be written as $\arg\max_\Theta \sum_\tau L(\Theta, D^{tr}_\tau; D^{val}_\tau)$. For clarity we denote $L^{[1]}_\tau = L(\Theta; D^{tr}_\tau \cup D^{val}_\tau)$ and $L^{[2]}_\tau = L(\Theta, D^{tr}_\tau; D^{val}_\tau)$, and also $L^{[1]} = \mathbb{E}_\tau L^{[1]}_\tau$, $L^{[2]} = \mathbb{E}_\tau L^{[2]}_\tau$. The estimation of $P(\theta_\tau; \Theta)$ then becomes $\arg\max_\Theta L^{[2]}$. This extends the popular MLE approach in empirical Bayes, which maximizes the marginal log-likelihood $L^{[1]}$, the special case of $L^{[2]}$ with $|D^{tr}_\tau| = 0$. There is a bias/variance trade-off between $L^{[1]}$ and $L^{[2]}$: using $L^{[2]}$ as the meta loss mitigates in-task overfitting, while $L^{[1]}$ extracts more information from the data.

Combining the above two observations, a stochastic-gradient approach to meta-training is given in Algorithm 1: at iteration $t$, the gradient $\nabla_\Theta L^{[i]}_\tau$ for each task in the $t$-th meta-training batch is computed by the subroutine Meta-Gradient, and then a gradient-ascent step on $\Theta$ is performed. A variational inference (VI) approach to meta-testing is also included, where the posterior $P(\theta_\tau|D^{tr}_\tau; \Theta)$ is estimated with the $\hat\Theta$ learned during meta-training held fixed. These subroutines are discussed in detail in Section 3.
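Before turning to the algorithms, here is a minimal Monte-Carlo sketch of the meta objective: $L(Q; D) = \log \mathbb{E}_{\theta \sim Q} P(D|\theta)$ can be estimated from samples of $Q$ with a log-sum-exp. The toy sinusoid likelihood and the fixed noise level are assumptions carried over from the sketch above, not the paper's evaluation code.

```python
import numpy as np
from scipy.special import logsumexp

def log_lik(D, theta, noise=0.1):
    """log P(D | theta) for the toy sinusoid model; 'noise' is assumed known."""
    x, y = D
    A, w, b = theta
    resid = y - A * np.sin(w * x + b)
    return (-0.5 * np.sum(resid ** 2) / noise ** 2
            - len(y) * np.log(noise * np.sqrt(2 * np.pi)))

def L_of_Q(theta_samples, D):
    """Monte-Carlo estimate of L(Q; D) = log E_{theta~Q} P(D|theta)."""
    ll = np.array([log_lik(D, th) for th in theta_samples])
    return logsumexp(ll) - np.log(len(ll))

# L^[1]_tau: theta_samples drawn from the prior P(theta; Theta),
#            scored on D^tr u D^val.
# L^[2]_tau: theta_samples drawn from the posterior P(theta | D^tr; Theta),
#            scored on D^val only.
```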
2.3 Non-uniqueness and Fast Adaptation

For neural networks $f_\theta$, many local optima achieve similarly good performance on each task. We observe in an empirical study (Appendix C.3) that the key to fast adaptation for gradient-based algorithms is to find a small neighboring zone $A$ in which most tasks have at least one good parameter (Figure 1(a)). The intuition is that when the $\{\theta_\tau\}$ are close enough, they can be learned within a few gradient steps starting from any point in that neighboring zone (our experiments show that perturbing the initial point within that area still yields good performance at meta-test).

The existence of this small neighboring zone $A$ depends on the parametric model $f_\theta$ and the task distribution $P(\tau)$. We further demonstrate its existence (Appendix C.3) with neural networks as the parametric model and a Gaussian parameterization of $P(\theta_\tau; \Theta)$ for uni-modal task distributions such as sinusoidal functions. Even when no single small neighboring zone $A$ exists (e.g., for multi-modal task distributions such as a mixture of sinusoidal, linear, and quadratic functions), a solution may be provided by an extension to mixtures of Gaussians [7, 19]. In this work we focus on the uni-modal situation and leave the extension to future work.

```
Algorithm Meta-train()
 1  randomly initialize Θ, set t = 0
 2  while not done do
 3      sample a batch of tasks T_t ~ P(τ)
 4      for each task τ ∈ T_t do
 5          sample D^tr_τ, D^val_τ ~ τ
 6          compute ∇_Θ L^[i]_τ by Subroutine Meta-Gradient(Θ^(t), {D^tr_τ, D^val_τ})
 7      Θ^(t+1) = Θ^(t) + β Σ_{τ∈T_t} ∇_Θ L^[i]_τ
 8      t = t + 1

Algorithm Meta-test()
 1  require: learned Θ and D^tr_τ from a new task τ
 2  compute the posterior λ_τ = VI(Θ, D^tr_τ)
 3  sample θ_τ ~ P(θ; λ_τ)
 4  return f(·; θ_τ) for evaluation

Subroutine VI(Θ, D_τ)
 1  initialize λ_τ at Θ
 2  while not done do
 3      sample ε ~ p(ε)
 4      λ_τ = λ_τ + α ∇_{λ_τ} [log P(D_τ | g(λ_τ, ε)) − KL(P(θ_τ; λ_τ) ‖ P(θ_τ; Θ))]
 5  return λ_τ
```

Algorithm 1: Extended Empirical Bayes Meta-learning framework. VI reparameterizes $\theta_\tau \sim P(\theta_\tau; \lambda_\tau)$ using a differentiable transformation $g(\lambda_\tau, \epsilon)$ of an auxiliary noise variable $\epsilon$, such that $\theta_\tau = g(\lambda_\tau, \epsilon)$ with $\epsilon \sim p(\epsilon)$ [12].

3 Method

In this section, we first introduce the gradient-based variational inference subroutine VI, which relates to a variety of existing methods, and then present our proposed subroutine Meta-Gradient, inspired by the Gradient-EM algorithm, comparing it with the most widely used existing methods for this subroutine.

3.1 Variational Inference

Notice that this framework requires computing posteriors over complex models such as neural networks. To achieve this, we approximate the posterior with the same parametric family $P(\theta_\tau; \lambda_\tau)$ used to approximate the prior $P(\theta_\tau; \Theta)$, and use variational inference to compute the parameters, as in previous work [20]. Let $P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))$ be the approximation of the posterior $P(\theta_\tau|D_\tau; \Theta)$ obtained by minimizing their KL divergence. Since

$$L(\Theta; D_\tau) = \log P(D_\tau; \Theta) = KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau|D_\tau; \Theta)] + \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau, \theta_\tau; \Theta) - \log P(\theta_\tau; \lambda_\tau)] \qquad (2)$$

is constant in $\lambda_\tau$, we have $\lambda_\tau(D_\tau; \Theta) = \arg\min_{\lambda_\tau} KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau|D_\tau; \Theta)] = \arg\max_{\lambda_\tau} \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau, \theta_\tau; \Theta) - \log P(\theta_\tau; \lambda_\tau)] = \arg\max_{\lambda_\tau} \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau|\theta_\tau)] - KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau; \Theta)]$. So the inference process finds $\lambda_\tau$ maximizing the evidence lower bound

$$\mathrm{ELBO}^{(\tau)}(\lambda_\tau; \Theta) = \mathbb{E}_{P(\theta_\tau;\lambda_\tau)}[\log P(D_\tau|\theta_\tau)] - KL[P(\theta_\tau; \lambda_\tau)\,\|\,P(\theta_\tau; \Theta)]$$

via mini-batch gradient descent [20]. The gradients of the KL terms are calculated analytically in the Gaussian case, while the gradients of the expectations are computed by Monte Carlo with reparameterization, along with variance-reduction tricks [11, 28]. By the analysis in Section 2.3, only a few gradient steps are needed for this process once $\Theta$ is well learned by our framework. We summarize the subroutine VI in Algorithm 1.

A special case worth mentioning is the use of a delta function for the posterior approximation, $P(\theta_\tau; \lambda_\tau) = \delta_{\mu_{\lambda_\tau}}(\theta_\tau)$, for which $\lambda_\tau(D_\tau; \Theta) = \arg\max_{\mu_{\lambda_\tau}}[\log P(D_\tau|\mu_{\lambda_\tau}) - \|\mu_{\lambda_\tau} - \mu_\Theta\|^2/(2\Sigma_\Theta^2)]$. This is exactly the inner-update step of iMAML [18], and of MAML [3] and Reptile [17] if we replace the $\ell_2$ regularization term with choosing $\mu_\Theta$ as the initial point for gradient-based optimization: $\mu_{\lambda_\tau}(\mu_\Theta) = \mu_\Theta + \alpha \nabla_\theta \log P(D_\tau|\theta)|_{\theta=\mu_\Theta}$.
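A minimal PyTorch sketch of the VI subroutine for a diagonal-Gaussian posterior follows. It assumes a user-supplied differentiable log-likelihood `log_lik_fn`; the use of Adam and the hyper-parameter values are illustrative choices, not the paper's exact settings.

```python
import torch
from torch.distributions import Normal, kl_divergence

def vi(Theta, data, log_lik_fn, steps=10, lr=1e-2, n_mc=5):
    """Subroutine VI: fit lambda_tau = (mu, log_sigma) of a diagonal-Gaussian
    posterior P(theta; lambda) by ascending ELBO(lambda; Theta), initialized
    at the prior Theta = (mu0, log_sigma0)."""
    mu0, log_sigma0 = (t.detach() for t in Theta)
    mu = mu0.clone().requires_grad_(True)
    log_sigma = log_sigma0.clone().requires_grad_(True)
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    prior = Normal(mu0, log_sigma0.exp())
    for _ in range(steps):
        opt.zero_grad()
        q = Normal(mu, log_sigma.exp())
        theta = q.rsample((n_mc,))          # reparameterization: theta = g(lambda, eps)
        exp_ll = torch.stack([log_lik_fn(data, th) for th in theta]).mean()
        kl = kl_divergence(q, prior).sum()  # analytic KL between diagonal Gaussians
        (-(exp_ll - kl)).backward()         # descend the negative ELBO
        opt.step()
    return mu.detach(), log_sigma.detach()
```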
3.2 Meta-Gradient

The essential part of this meta-learning framework is computing the gradient $\nabla_\Theta L^{[i]}_\tau$. We show below that this problem reduces to computing $\nabla_\Theta L(\Theta; D_\tau) = \nabla_\Theta \log P(D_\tau; \Theta)$ given $D_\tau$ and $\Theta$. For $L^{[1]}$, this is direct. For $L^{[2]}$, there are two approaches. The first is to compute $\lambda_\tau(D^{tr}_\tau; \Theta)$ as stated above, then

$$\nabla_\Theta L^{[2]}_\tau = \nabla_\Theta L(\Theta, D^{tr}_\tau; D^{val}_\tau) = \nabla_\Theta L(\lambda_\tau(D^{tr}_\tau; \Theta); D^{val}_\tau) = \nabla_{\Theta'} L(\Theta'; D^{val}_\tau)\big|_{\Theta'=\lambda_\tau(D^{tr}_\tau;\Theta)} \cdot \nabla_\Theta \lambda_\tau(D^{tr}_\tau; \Theta) \qquad (3)$$

where $\nabla_\Theta \lambda_\tau(D^{tr}_\tau; \Theta)$ can be computed by auto-differentiation (if $\lambda_\tau(D^{tr}_\tau; \Theta)$ is computed by a gradient-based algorithm). This approach is widely used in previous work such as [3, 4, 27, 6]. The second approach is ours, shown in subroutine Meta-Gradient: GEM-BML+ below. We use the property $L(\Theta, D^{tr}_\tau; D^{val}_\tau) = L(\Theta; D^{tr}_\tau \cup D^{val}_\tau) - L(\Theta; D^{tr}_\tau)$ (proof in Appendix A.4), so that $\nabla_\Theta L^{[2]}_\tau = \nabla_\Theta L(\Theta; D^{tr}_\tau \cup D^{val}_\tau) - \nabla_\Theta L(\Theta; D^{tr}_\tau)$ is the difference of two $L^{[1]}$-type terms, reducing the problem to computing terms of the form $\nabla_\Theta L(\Theta; D_\tau)$.

Gradient-EM Estimator. We propose an efficient way to compute $\nabla_\Theta L(\Theta; D_\tau)$ through the gradient of the complete log-likelihood. This is guaranteed by the following Gradient-EM theorem, inspired by an observation in [23].

Theorem 1. $\nabla_\Theta L(\Theta; D) = \mathbb{E}_{\theta \sim P(\theta|D;\Theta)} \nabla_\Theta \log P(D, \theta; \Theta)$.

Proof:

$$\nabla_\Theta L(\Theta; D) = \nabla_\Theta \log P(D|\Theta) = \frac{1}{P(D|\Theta)} \nabla_\Theta \int P(D, \theta|\Theta)\, d\theta = \int \frac{P(D, \theta|\Theta)}{P(D|\Theta)} \nabla_\Theta \log P(D, \theta|\Theta)\, d\theta = \int P(\theta|D, \Theta)\, \nabla_\Theta \log P(D, \theta|\Theta)\, d\theta = \mathbb{E}_{\theta \sim P(\theta|D;\Theta)} \nabla_\Theta \log P(D, \theta; \Theta).$$

Under the hierarchical modeling structure, $\nabla_\Theta \log P(D_\tau, \theta_\tau|\Theta) = \nabla_\Theta \log[P(D_\tau|\theta_\tau)\, P(\theta_\tau; \Theta)] = \nabla_\Theta \log P(\theta_\tau; \Theta)$, since $P(D_\tau|\theta_\tau)$ does not depend on $\Theta$. Combining with Theorem 1, $\nabla_\Theta L(\Theta; D_\tau) = \mathbb{E}_{\theta_\tau \sim P(\theta|D_\tau;\Theta)} \nabla_\Theta \log P(\theta_\tau; \Theta)$. After using VI to compute the approximate posterior parameter $\lambda_\tau(D_\tau, \Theta)$, the above estimator becomes $\hat g = \mathbb{E}_{\theta_\tau \sim P(\theta_\tau;\lambda_\tau(D_\tau,\Theta))} \nabla_\Theta \log P(\theta_\tau; \Theta)$, which can be calculated analytically in the Gaussian case as we show in Appendix B.7. This gives us two Meta-Gradient subroutines, GEM-BML and GEM-BML+, for $\nabla_\Theta L^{[1]}_\tau$ and $\nabla_\Theta L^{[2]}_\tau$ respectively. We name Algorithm 1 with these two subroutines our algorithms GEM-BML and GEM-BML+.

```
1  Subroutine Meta-Gradient: GEM-BML(Θ, {D^tr, D^val})
2      compute the posterior λ^tr = VI(Θ, D^tr)
3      compute the posterior λ^tr∪val = VI(λ^tr, D^val)
4      return ĝ = E_{θ~P(θ; λ^tr∪val)} ∇_Θ log P(θ; Θ)

1  Subroutine Meta-Gradient: GEM-BML+(Θ, {D^tr, D^val})
2      compute the posterior λ^tr = VI(Θ, D^tr)
3      compute the posterior λ^tr∪val = VI(λ^tr, D^val)
4      return ĝ = E_{θ~P(θ; λ^tr∪val)} ∇_Θ log P(θ; Θ) − E_{θ~P(θ; λ^tr)} ∇_Θ log P(θ; Θ)
```
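For the diagonal-Gaussian parameterization, the estimator $\hat g$ in Line 4 of the subroutines has a closed form (the paper derives it in Appendix B.7). The sketch below is our reconstruction under the convention $\Theta = (\mu, \log\sigma)$, which is an assumption on the parameterization.

```python
import numpy as np

def gem_grad(Theta, lam):
    """Closed form of g_hat = E_{theta ~ N(lam)} grad_Theta log N(theta; Theta)
    for diagonal Gaussians, with Theta = (mu, log_sigma) and
    lam = (mu_q, log_sigma_q). A sketch of the Gaussian case."""
    mu, log_sigma = Theta
    mu_q, log_sigma_q = lam
    var, var_q = np.exp(2 * log_sigma), np.exp(2 * log_sigma_q)
    g_mu = (mu_q - mu) / var                              # d/d mu
    g_log_sigma = -1.0 + (var_q + (mu_q - mu) ** 2) / var  # d/d log_sigma
    return g_mu, g_log_sigma

def gem_bml_plus_grad(Theta, lam_tr, lam_tr_val):
    """grad_Theta L^[2]_tau as the difference of two GEM terms
    (Subroutine GEM-BML+): posterior after D^tr u D^val minus after D^tr."""
    g1 = gem_grad(Theta, lam_tr_val)
    g2 = gem_grad(Theta, lam_tr)
    return tuple(a - b for a, b in zip(g1, g2))
```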
ELBO Gradient Estimator. As a comparison, one of the most widely used methods to optimize $L^{[i]}$ in Bayesian meta-learning is optimizing the ELBO [20] (see Appendix B.6 for other existing methods and a comparative analysis). Here we show it is actually another way to estimate $\nabla_\Theta L(\Theta; D_\tau)$. According to equation (2), when the VI approximation error $KL[P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))\,\|\,P(\theta_\tau|D_\tau; \Theta)]$ is small enough, we have $L(\Theta; D_\tau) \approx \mathbb{E}_{P(\theta_\tau;\lambda_\tau(D_\tau;\Theta))}[\log P(D_\tau, \theta_\tau; \Theta) - \log P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))] = \mathbb{E}_{P(\theta_\tau;\lambda_\tau(D_\tau;\Theta))}[\log P(D_\tau|\theta_\tau)] - KL[P(\theta_\tau; \lambda_\tau(D_\tau; \Theta))\,\|\,P(\theta_\tau|\Theta)] = \mathrm{ELBO}^{(\tau)}(\lambda_\tau(D_\tau; \Theta); \Theta)$. So the gradient can be computed by

$$\nabla_\Theta L(\Theta; D_\tau) \approx \nabla_\Theta \mathrm{ELBO}^{(\tau)}(\lambda_\tau(D_\tau; \Theta); \Theta) = \partial_{\lambda_\tau} \mathrm{ELBO}^{(\tau)}(\lambda_\tau; \Theta)\big|_{\lambda_\tau=\lambda_\tau(D_\tau;\Theta)} \cdot \nabla_\Theta \lambda_\tau(D_\tau; \Theta) + \partial_\Theta \mathrm{ELBO}^{(\tau)}(\lambda_\tau; \Theta)\big|_{\lambda_\tau=\lambda_\tau(D_\tau;\Theta)}. \qquad (4)$$

The first partial-gradient term can be computed by the same method as in Section 3.1, and the second can be calculated analytically in the Gaussian case. In fact, Gradient-EM (GEM) can also be viewed as a coordinate-ascent algorithm for the ELBO, a variant of EM, as we show in Appendix B.3.

Compared to the ELBO gradient, GEM avoids the backprop computation of $\nabla_\Theta \lambda_\tau(D_\tau; \Theta)$, which gives it a series of advantages, as we specify in Section 4.2. Both the GEM and ELBO gradients have estimation error arising from the discrepancy between the VI-estimated posterior and the true posterior. We show empirically in Appendix C.1 that GEM has consistently lower estimation error than the ELBO gradient. We also show in Appendix A.3 that our method has a theoretical bound on the estimation error in terms of the VI discrepancy:

$$\|\hat g - \nabla_\Theta L(\Theta; D_\tau)\| \le M \sqrt{D_{KL}(P(\theta|D_\tau; \Theta)\,\|\,P(\theta; \lambda_\tau(D_\tau, \Theta)))}$$

where $M$ is a bounded constant.

4 Analysis

4.1 Matrix of Related Works

We compare Gradient-EM (our method) with the ELBO gradient over the two loss functions $L^{[1]}$ and $L^{[2]}$, summarized in Table 1.

| | $L^{[1]} = \mathbb{E}_{\tau \sim \mathcal{T}} L(\Theta; D^{tr}_\tau \cup D^{val}_\tau)$ | $L^{[2]} = \mathbb{E}_{\tau \sim \mathcal{T}} L(\lambda_\tau(D^{tr}_\tau; \Theta); D^{val}_\tau)$ |
|---|---|---|
| ELBO gradient | Amortized BML [20] | related to PMAML [4] |
| Gradient-EM | GEM-BML (our method); reduces to Reptile [17] in the delta case | KL-Chaser Loss (related to the l2-Chaser Loss of BMAML [27]) |

Table 1: Matrix of related works.

It turns out each element of this matrix is related to a previous work or our method. Notice that this matrix can be extended with more columns (e.g., one more column for $L^{[2]'} = \mathbb{E}_{\tau \sim \mathcal{T}}[L(\Theta; D^{tr}_\tau \cup D^{val}_\tau) - L(\Theta; D^{tr}_\tau)]$) and more rows, yielding a larger matrix with blank elements (models) that have not been explored before. For example, the KL-Chaser Loss model in the bottom right of Table 1 has not been studied before. We leave a thorough study of all combinations to future work. Here we only show how MAML and Reptile fit into this Bayesian framework; further details are left to Appendix B.4.

To see this, consider using fixed variance parameters for both the prior $P(\theta_\tau; \mu_\Theta, \Sigma_\Theta = C_0)$ and the posterior $P(\theta_\tau; \mu_{\lambda_\tau}, \Sigma_{\lambda_\tau} = C_\tau)$, and let $C_\tau \to 0$ so the posterior becomes the delta distribution $\delta_{\mu_{\lambda_\tau}}(\theta_\tau)$. We can compute $\mu_{\lambda_\tau}(\mu_\Theta)$ by gradient descent from $\mu_\Theta$. MAML uses $L^{[2]}$ as the meta-loss function. Under the delta-distribution posterior we have

$$L^{[2]}_\tau = \log \int P(D_\tau|\theta_\tau)\, \delta_{\mu_{\lambda_\tau}(\mu_\Theta)}(\theta_\tau)\, d\theta_\tau = \log P(D_\tau|\mu_{\lambda_\tau}(\mu_\Theta)),$$

so $\nabla_\Theta L^{[2]} = \nabla_{\mu_\Theta} \log P(D_\tau|\mu_{\lambda_\tau}(\mu_\Theta))$ can be computed directly by back-propagation through the neural network. On the other hand, Reptile uses $L^{[1]}$ as the meta-loss function. Using the GEM gradient $\hat g$ we have

$$\nabla_\Theta L^{[1]}_\tau = \mathbb{E}_{\delta_{\mu_{\lambda_\tau}(\mu_\Theta)}(\theta_\tau)} \nabla_\Theta \log P(\theta_\tau; \mu_\Theta, \Sigma_\Theta = C_0) = -\nabla_{\mu_\Theta} \frac{\|\mu_\Theta - \mu_{\lambda_\tau}(\mu_\Theta)\|^2}{2 C_0} \propto \mu_{\lambda_\tau}(\mu_\Theta) - \mu_\Theta,$$

which is the Reptile gradient. Also notice that if we further let $C_0 \to 0$, so that the prior also becomes a delta, then

$$\nabla_\Theta L^{[1]}_\tau = \nabla_{\mu_\Theta} \log \int P(D_\tau|\theta_\tau)\, \delta_{\mu_\Theta}(\theta_\tau)\, d\theta_\tau = \nabla_{\mu_\Theta} \log P(D_\tau|\mu_\Theta).$$

This corresponds to "pre-train", which simply trains a single model to fit the data of all tasks combined.
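To see the delta-case reduction numerically, here is a small sketch of the resulting meta-update: with fixed prior variance $C_0$ and a delta posterior at the adapted weights $\phi_\tau$, the GEM gradient direction is $(\phi_\tau - \mu_\Theta)/C_0$, i.e. a Reptile step. The `inner_opt` argument and the toy quadratic inner loss are placeholders for any inner-update optimizer, not the paper's code.

```python
import numpy as np

def reptile_meta_step(mu_Theta, inner_opt, tasks, beta=0.1, C0=1.0):
    """Delta-posterior special case of the GEM meta-gradient: with the
    posterior a delta at phi_tau = inner_opt(mu_Theta, tau),
    g_hat = (phi_tau - mu_Theta) / C0 is the Reptile update direction."""
    g = np.zeros_like(mu_Theta)
    for tau in tasks:
        phi_tau = inner_opt(mu_Theta, tau)   # e.g. a few SGD/Adam steps
        g += (phi_tau - mu_Theta) / C0
    return mu_Theta + beta * g               # ascent step of Algorithm 1

# Toy usage: each task tau is a target vector; the "inner optimizer" takes
# a few gradient steps on the squared loss ||w - tau||^2.
def toy_inner_opt(w0, tau, lr=0.25, steps=3):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2 * (w - tau)
    return w

mu = np.zeros(2)
for _ in range(50):
    mu = reptile_meta_step(mu, toy_inner_opt,
                           tasks=[np.array([1.0, 0.0]), np.array([0.0, 1.0])])
# mu converges toward the task centroid [0.5, 0.5].
```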
4.2 Advantages of GEM

Observe that all methods in the above matrix first compute the posterior parameters $\lambda_\tau(D_\tau; \Theta)$ and then use them to compute the sampled meta-loss gradient $\nabla_\Theta L^{[i]}_\tau$. Following the convention of [3], we call the step of computing $\lambda_\tau(D_\tau; \Theta)$ the inner-update and the step of computing $\nabla_\Theta L^{[i]}_\tau$ the meta-update. Notice that both the $L^{[2]} = \mathbb{E}_{\tau \sim \mathcal{T}} L(\lambda_\tau(D^{tr}_\tau; \Theta); D^{val}_\tau)$ column and the ELBO-gradient row involve the computation of $\nabla_\Theta \lambda_\tau(D_\tau; \Theta)$ (equations (3) and (4)). The meta-update of these three methods therefore has to backpropagate through the inner optimization process, which brings a number of burdens and limitations, while GEM avoids this computation and thus gains the advantages mentioned in the Introduction.

Also notice that, assuming independence between neural-network layers, the meta-update of our algorithm (Line 4 of Subroutine GEM-BML(+)) can be computed across layers in parallel, which may greatly reduce computation time in deep neural networks. A detailed analysis of our advantages is given in Appendix B.5.

5 Experiment

5.1 Regression

The purpose of this experiment is to test our methods' fast-adaptation ability and robustness to meta-level uncertainty. We compare our models GEM-BML and GEM-BML+ with MAML [3], Reptile [17], and the Bayesian meta-learning benchmarks BMAML [27] and Amortized BML [20] on the same sinusoidal regression problem. We first apply the default setting of [3], then a more challenging setting containing more uncertainty, as proposed in [27], to demonstrate robustness to meta-level uncertainty. Data for each task are generated from $y = A\sin(wx + b) + \epsilon$, with amplitude $A$, frequency $w$, and phase $b$ as task parameters and observation noise $\epsilon$. Task parameters are sampled from uniform distributions $A \in [0.1, 5.0]$, $b \in [0.0, 2\pi]$, $w \in [0.5, 2.0]$, and the observation noise follows $\epsilon \sim N(0, (0.01A)^2)$; $x$ ranges over $[-5.0, 5.0]$. For each task, $K = 10$ observations ($\{x_i, y_i\}$ pairs) are given. The underlying network architecture (2 hidden layers of size 40 with ReLU activations) is the same as in [3] to make a fair comparison.

Figure 2: (a) Sinusoidal regression results: meta-test error in the default and challenging settings after 40000 meta-train iterations. (b) Computation and memory trade-offs (normalized GPU memory and compute time versus number of inner-update gradient steps, 1 to 5) with a 4-layer CNN on the 1-shot, 5-class MiniImageNet task, comparing GEM-BML+ (ours), BMAML (M=5), Reptile, MAML, foMAML, Amortized BML, and implicit MAML (CG=2). (BMAML is beyond the range of the plot.)

In Figure 2(a), we plot the mean-squared-error (MSE) performance on test tasks during the meta-test process under both settings. Under the default setting, our methods show fast-adaptation ability similar to previous methods. The challenging-setting results show that the Bayesian methods GEM-BML(+), BMAML, and Amortized BML can still extract information in a high-uncertainty environment, while the non-Bayesian models MAML and Reptile fail to learn. We also observe that our model provides a stable meta-train learning curve and continues to improve as more gradient steps are performed, without overfitting. This demonstrates the robustness of Bayesian methods, a result of their probabilistic nature and their ability to control overfitting.

5.2 Classification

This experiment aims to answer the following questions: (1) Does our model save computation time and memory by avoiding meta-update backprop, as we claimed? (2) Can our methods scale to few-shot image classification benchmarks and achieve good accuracy and predictive uncertainty? To study (1), we turn to the MiniImageNet [21] dataset on the 1-shot, 5-class task.
We compare GEM-BML+ (GEM-BML is even less expensive) with MAML and its first-order variant foMAML, Reptile, iMAML, Amortized BML, and BMAML in Figure 2(b). Just like other first-order meta-learning algorithms and iMAML, which decouple the inner-update and meta-update, the memory usage of GEM-BML(+) is independent of the number of inner-update gradient steps, since inner-update computations other than the final-step result need not be stored. In contrast, MAML-like algorithms (MAML, Amortized BML) need memory that grows linearly with the number of inner-update gradient steps. The situation is similar for compute time: MAML-like algorithms require expensive backprop over the inner-update optimization during the meta-update, so their compute cost grows at a faster rate than that of GEM-BML(+), foMAML, Reptile, and iMAML (iMAML has a relatively high base compute cost because of Hessian computation).

To study (2), we applied our method to N-class image classification on the Omniglot and MiniImageNet datasets, popular few-shot learning benchmarks [25, 24, 21]. Notice that the purpose of this experiment is not to compete with the state of the art on these benchmarks but to provide an apples-to-apples comparison with prior works within our extended empirical Bayes framework. So for a fair comparison, we use the identical backbone convolutional architecture [3] as these prior works. Note, however, that this backbone can be replaced with other architectures, leading to better results for all algorithms [2, 10]; we leave improving our method with better backbones to challenge the state of the art to future work. The inner-update is computed using Adam, demonstrating the flexibility of our methods in the choice of inner-update optimizer. The results in Tables 2 and 3 show that our methods perform as well as the best prior methods within our extended empirical Bayes framework.

| Omniglot | 1-shot, 5-class | 5-shot, 5-class | 1-shot, 20-class | 5-shot, 20-class |
|---|---|---|---|---|
| MAML | 98.7 ± 0.4% | 99.9 ± 0.1% | 95.8 ± 0.3% | 98.9 ± 0.2% |
| first-order MAML | 98.3 ± 0.5% | 99.2 ± 0.2% | 89.4 ± 0.5% | 97.9 ± 0.1% |
| Reptile | 97.68 ± 0.04% | 99.48 ± 0.06% | 89.43 ± 0.14% | 97.12 ± 0.32% |
| iMAML | 99.50 ± 0.26% | 99.74 ± 0.11% | 96.18 ± 0.36% | 99.14 ± 0.1% |
| GEM-BML+ (ours) | 99.23 ± 0.42% | 99.64 ± 0.08% | 96.24 ± 0.35% | 98.94 ± 0.25% |

Table 2: Few-shot classification on the Omniglot dataset. The ± shows 95% confidence intervals over different testing tasks. All compared results are from the original literature.

Predictive uncertainty is the probability associated with the predicted class label, indicating how likely the predictions are to be correct. To measure the predictive uncertainty of the models, we apply two quantitative metrics, ECE and MCE [16, 8], on the MiniImageNet dataset. Smaller ECE and MCE indicate a better-calibrated model; a perfectly calibrated model achieves 0 on both metrics. The results of ECE and MCE for our models and previous works are shown in Table 3. Our model is slightly better calibrated than the state-of-the-art Bayesian meta-learning model Amortized BML and clearly outperforms the non-Bayesian models. This shows our model can learn a good prior and make well-calibrated probability predictions, an advantage of the Bayesian approach.

| MiniImageNet | 1-shot, 5-class |
|---|---|
| MAML [3] | 48.70 ± 1.84% |
| first-order MAML [3] | 48.07 ± 1.75% |
| Reptile [17] | 49.97 ± 0.32% |
| iMAML [18] | 49.30 ± 1.88% |
| Amortized BML [20] | 45.0 ± 0.60% |
| GEM-BML+ (ours) | 50.03 ± 1.63% |

| Predictive uncertainty | ECE | MCE |
|---|---|---|
| MAML | 0.0471 | 0.1104 |
| Amortized BML | 0.0124 | 0.0257 |
| GEM-BML+ (ours) | 0.0102 | 0.0197 |

Table 3: Accuracy and predictive-uncertainty measurement of few-shot classification on the MiniImageNet dataset. Smaller ECE and MCE indicate a better-calibrated model.
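For reference, here is a standard sketch of how ECE and MCE are computed by confidence binning [16, 8]; the bin count and equal-width binning scheme are conventional choices, not necessarily the paper's exact evaluation code.

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected / Maximum Calibration Error: bin predictions by confidence
    and compare each bin's average confidence to its empirical accuracy."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap   # bin-weighted average gap
        mce = max(mce, gap)             # worst-bin gap
    return ece, mce

# Usage: ece, mce = ece_mce(max_softmax_probs, predictions == labels)
```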
5.3 Reinforcement Learning

We test and compare the models on the same 2D Navigation and MuJoCo continuous-control tasks as used in [3]. See Appendix C.4.3 for detailed descriptions of the experiment settings and hyper-parameters. For a fair comparison, we use the same policy network architecture as [3], with two hidden layers of 100 ReLU units each. At meta-train time, we collect $K$ sample rollouts of the current policy and another $K$ sample rollouts after one policy-gradient update, as in [3], where $K$ is the inner-batch size. At meta-test time, we compare adaptation to a new task with up to 3 gradient updates, each with 40 samples. We compare against two baseline models: MAML and Reptile. MAML uses TRPO in the meta-update to boost performance, while our meta-update is data-free, as specified in the sections above. For inner-updates, thanks to our model's flexibility in choosing the inner-update optimizer, we can use either vanilla policy gradient (REINFORCE) (Williams, 1992) or the specially designed TRPO proposed by [3]. We find that the TRPO inner-update performs better on 2D navigation, while vanilla policy gradient tends to be better on the MuJoCo continuous-control tasks. We hypothesize that in complex task settings the task-distribution variance tends to be higher ($A$ is larger in Figure 1(a)), while TRPO limits the step size of each inner-update, making the task parameters hard to reach within a few gradient steps.

Figure 3: Reinforcement learning results.

As shown in Figure 3, GEM-BML+ outperforms MAML, while Reptile and GEM-BML perform less well. This shows the $L^{[2]}$ variant is necessary in RL, which has high in-task variance and is easily overfitted. Previous work [17] showed it is hard to adapt algorithms with the advantage of a data-free meta-update (Reptile-like algorithms) to RL, but with our $L^{[2]}$ variant we can adapt to RL while preserving this advantage. Our results show that the key to adaptation in RL is the $L^{[2]}$ variant, highlighting the contribution of the GEM-BML+ algorithm.

6 Related Works

Hierarchical Bayes (HB) and empirical Bayes (EB) were studied extensively in the past [9] to exploit statistical connections between related tasks. Since then, deep neural networks (DNNs) have attracted enormous attention, and efforts to measure the uncertainty of DNNs followed, in which Bayesian and sampling methods are widely applied [1]. The research trend in multi-task learning and transfer learning likewise shifted to the fine-tuning framework for DNNs. Model-Agnostic Meta-Learning (MAML) [3] emerged from this motivation: to find good initial parameters that can be quickly adapted to new tasks within a few gradient steps. Recently, Bayesian models have made a strong comeback because of their probabilistic nature in uncertainty measurement and automatic overfitting prevention. [26] applied hierarchical Bayesian modeling to multi-task reinforcement learning. [6] related MAML to a hierarchical Bayesian model and proposed a Laplace approximation to capture isotropic Gaussian uncertainty of the model parameters. BMAML [27] used Stein Variational Gradient Descent (SVGD) to obtain posterior samples and proposed a Chaser Loss to prevent meta-level overfitting. PMAML [4] also proposed a gradient-based method to obtain a fixed measure of prior and posterior uncertainty. Amortized BML [20] proposed a MAML-like variational inference method for amortized Bayesian meta-learning. None of the methods above can make the inner-update and meta-update separable, which greatly limits the flexibility of the inner-update optimization process. iMAML [18] proposes an implicit-gradient method for MAML that makes the inner-update and meta-update separable, at the cost of computing second-order derivatives and solving a QP in each meta-update step. Notice that this method is not within our framework, though it shares the style, because its update is derived by approximating MAML rather than from a Bayesian approach.
7 Conclusion

Inspired by the Gradient-EM algorithm, we have proposed the GEM-BML(+) algorithms for Bayesian meta-learning. Our method is based on the theoretical insight of the Gradient-EM theorem and the Bayesian formulation of multi-task meta-learning. It avoids backprop in the meta-update and decouples the meta-update from the inner-update. We have tested our method on sinusoidal regression, few-shot image classification, and reinforcement learning to demonstrate its advantages. For future work, we plan to apply our method to state-of-the-art image classification backbones and to extend our work with nonparametric Gaussian approximations to handle multimodal and dynamic task-distribution situations.

Broader Impact

Meta-learning algorithms can be applied in AI products that require fast adaptation with few data points. Examples include: 1) facial recognition systems for enterprises where only a few photos of each employee are taken as training samples; 2) manufacturing robots that master new tasks quickly from a few human demonstrations; 3) AI assistants that customize to a new user after a few interactions. Our research, in particular, makes an impact by introducing a novel approach that improves the computational efficiency, robustness, and other properties of meta-learning. This generally benefits future AI research in meta-learning rather than having a direct impact on a specific product. In particular, our method improves computational efficiency and uncertainty prediction, and has potential use in building distributed and privacy-protected meta-learning systems. It also has a potential advantage for deep NNs because the meta-update can be parallelized across network layers. Specifically, under a distributed setting where the meta-update and inner-update take place on separate devices, unlike previous methods, our method avoids transmission of gradients, which can leak user data as shown by recent research [29]. It may help protect user privacy and enhance the decentralization of AI, preventing the monopolization of AI.

Acknowledgements

The authors would like to thank Zhiwei Qin for his support and detailed feedback on an early draft of the paper, Prof. Yuhong Guo for technical advice, Prof. Jieping Ye and Prof. Hongtu Zhu for their support, Tristan Deleu for his support on implementation, and the anonymous reviewers for their comments. The authors would also like to thank Qian Qiao for helpful support, without which this work would not have been accomplished.

References

[1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

[2] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.

[3] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[4] C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.

[5] I. J. Good. Some history of the hierarchical Bayesian methodology. Trabajos de estadística y de investigación operativa, 31(1):489, 1980.

[6] E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

[7] E. Grant, G. Jerfel, K. Heller, and T. L. Griffiths. Modulating transfer between tasks in gradient-based meta-learning. 2018.

[8] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. JMLR.org, 2017.
[9] T. Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. 1998.

[10] J. Kim, S. Lee, S. Kim, M. Cha, J. K. Lee, Y. Choi, Y. Choi, D.-Y. Cho, and J. Kim. Auto-Meta: Automated gradient based meta learner search. arXiv preprint arXiv:1806.06927, 2018.

[11] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[13] M. J. Lindstrom and D. M. Bates. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404):1014–1022, 1988.

[14] D. J. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.

[15] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.

[16] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[17] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

[18] A. Rajeswaran, C. Finn, S. Kakade, and S. Levine. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.

[19] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560, 2000.

[20] S. Ravi and A. Beatson. Amortized Bayesian meta-learning. 2018.

[21] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.

[22] H. Robbins and S. Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, volume 102, page 109. Springer, New York, 1985.

[23] R. Salakhutdinov, S. T. Roweis, and Z. Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 672–679, 2003.

[24] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[25] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[26] A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015–1022. ACM, 2007.

[27] J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 7343–7353, 2018.

[28] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4203–4212, 2018.

[29] L. Zhu, Z. Liu, and S. Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems, pages 14747–14756, 2019.