# Sharp-MAML: Sharpness-Aware Model-Agnostic Meta Learning

Momin Abbas\*¹, Quan Xiao\*¹, Lisha Chen\*¹, Pin-Yu Chen², Tianyi Chen¹

**Abstract.** Model-agnostic meta learning (MAML) is currently one of the dominant approaches for few-shot meta-learning. Despite its effectiveness, the optimization of MAML can be challenging due to the innate bilevel problem structure. Specifically, the loss landscape of MAML is much more complex, with possibly more saddle points and local minimizers, than its empirical risk minimization counterpart. To address this challenge, we leverage the recently invented sharpness-aware minimization and develop a sharpness-aware MAML approach that we term Sharp-MAML. We empirically demonstrate that Sharp-MAML and its computation-efficient variant can outperform the plain-vanilla MAML baseline (e.g., +3% accuracy on Mini-Imagenet). We complement the empirical study with the convergence rate analysis and the generalization bound of Sharp-MAML. To the best of our knowledge, this is the first empirical and theoretical study on sharpness-aware minimization in the context of bilevel learning. The code is available at https://github.com/mominabbass/Sharp-MAML.

## 1. Introduction

Humans tend to easily learn new concepts using only a handful of samples. In contrast, modern deep neural networks require thousands of samples to train a model that generalizes well to unseen data (Krizhevsky et al., 2012). Meta learning is a remedy to this problem, whereby new concepts can be learned using a limited number of samples (Schmidhuber, 1987; Vilalta & Drissi, 2002). Meta learning offers fast adaptation to unseen tasks (Thrun & Pratt, 2012; Novak & Gowin, 1984) and has been widely studied to produce state-of-the-art results in a variety of few-shot learning settings, including language and vision tasks

\*Equal contribution. ¹Rensselaer Polytechnic Institute, Troy, NY. ²IBM Thomas J.
Watson Research Center, NY, USA. Correspondence to: Tianyi Chen. *Proceedings of the 39th International Conference on Machine Learning*, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

(Munkhdalai & Yu, 2017; Nichol & Schulman, 2018; Snell et al., 2017; Wang et al., 2016; Li & Malik, 2017; Vinyals et al., 2016; Andrychowicz et al., 2016; Brock et al., 2018; Zintgraf et al., 2019a; Wang et al., 2019; Achille et al., 2019; Li et al., 2018a; Hsu et al., 2018; Obamuyide & Vlachos, 2019).

In particular, model-agnostic meta learning (MAML) is one of the most popular optimization-based meta learning frameworks for few-shot learning (Finn et al., 2017; Vuorio et al., 2019; Yin et al., 2020; Obamuyide & Vlachos, 2019). MAML aims to learn an initialization such that, after applying only a few gradient descent updates to it, the adapted task-specific model achieves the desired performance on the validation dataset. MAML has been successfully applied in various data-limited applications, including medical image analysis (Maicas et al., 2018), language modelling (Huang et al., 2018), and object detection (Wang et al., 2020).

Despite its recent success in some applications, MAML faces a variety of optimization challenges. For example, MAML incurs a high computation cost due to second-order derivatives, requires searching over multiple hyperparameters, and is sensitive to neural network architectures (Antoniou et al., 2019). Even if various optimization techniques can potentially overcome these training challenges (e.g., make the training error small), there is no guarantee that a meta-model learned with limited training samples attains a small generalization or testing error on unseen tasks with unseen data (Rothfuss et al., 2021).
These training and generalization challenges of MAML are partially due to the nested (i.e., bilevel) structure of the problem, where the upper-level problem learns a shared model initialization and the lower-level problem optimizes task-specific models (Finn et al., 2017; Rajeswaran et al., 2019a). This is in sharp contrast to the more widely known single-level learning framework: empirical risk minimization (ERM). As a result, the training and generalization challenges in ERM not only remain in MAML but may also be exacerbated by the bilevel structure of MAML. For example, as we will show later, the nonconvex loss landscape of MAML contains possibly more saddle points and local minimizers than its ERM counterpart, many of which do not have good generalization performance. Recent works have proposed various useful techniques to improve the generalization performance (Grant et al., 2018a; Park & Oliva, 2019; Antoniou et al., 2019; Kao et al., 2021), but none of them approach it from the perspective of the optimization landscape.

Given the nested nature of bilevel learning models such as MAML, this paper aims to answer the following question:

*How can we find nonconvex bilevel learning models such as MAML that generalize well?*

In an attempt to provide a satisfactory answer, we use MAML as a concrete case of bilevel learning and incorporate the recently proposed sharpness-aware minimization (SAM) algorithm (Foret et al., 2021) into the MAML baseline. Originally designed for single-level problems such as ERM, SAM improves the generalization ability of non-convex models by leveraging the connection between generalization and the sharpness of the loss landscape (Foret et al., 2021). We demonstrate the power of integrating SAM into MAML by: i) empirically showing that it outperforms the popular MAML baseline; and ii) theoretically showing that it leads to a potentially improved generalization bound.
To the best of our knowledge, this is the first study on sharpness-aware minimization in the context of bilevel optimization.

### 1.1. Our contributions

We summarize our contributions below.

**(C1) Sharpness-aware optimization for MAML with improved empirical performance.** We theoretically and empirically discover that the loss landscape of bilevel models such as MAML is more involved than its ERM counterpart, with possibly more saddle points and local minimizers. To overcome this challenge, we develop a sharpness-aware MAML approach that we term Sharp-MAML, along with a computation-efficient variant. Intuitively, Sharp-MAML avoids the sharp local minima of the MAML loss and achieves better generalization performance. We empirically demonstrate that Sharp-MAML can outperform the plain-vanilla MAML baseline.

**(C2) Optimization analysis of Sharp-MAML, including MAML as a special case.** We establish the O(1/√T) convergence rate of Sharp-MAML through the lens of recent bilevel optimization analysis (Chen et al., 2021a), where T is the number of iterations. This corresponds to an O(ε⁻²) sample complexity, as a fixed number of samples is used per iteration. The convergence rate and sample complexity match those of training single-level ERM models, and improve upon the known O(ε⁻³) sample complexity of MAML.

**(C3) Generalization analysis of Sharp-MAML demonstrating its improved generalization performance.** We quantify the generalization performance of models learned by Sharp-MAML through the lens of a recently developed probably approximately correct (PAC)-Bayes framework (Farid & Majumdar, 2021). The generalization bound justifies the desired empirical performance of models learned with Sharp-MAML and provides insight into why models learned through Sharp-MAML can generalize better than those learned with MAML.
### 1.2. Technical challenges

Due to the bilevel structure of both SAM and MAML, formally quantifying the optimization and generalization performance of Sharp-MAML is highly nontrivial. Specifically, the state-of-the-art convergence analysis of bilevel optimization (e.g., (Chen et al., 2021a)) only applies to the case where the upper- and lower-level problems are both minimization problems. Unfortunately, this prerequisite is not satisfied in Sharp-MAML. In addition, the existing analysis of MAML in (Fallah et al., 2020) requires a growing batch size and thus results in a suboptimal O(ε⁻³) sample complexity. From the theoretical perspective, this work not only broadens the applicability of the recent analysis of bilevel optimization (Chen et al., 2021a) to tackle Sharp-MAML problems, but also tightens the analysis of the original MAML (Fallah et al., 2020). For the generalization analysis of Sharp-MAML, different from the classical PAC-Bayes analysis for single-level problems as in SAM (Foret et al., 2021), both the lower- and upper-level problems of MAML contribute to the generalization error. Going beyond the PAC-Bayes analysis in (Foret et al., 2021), we further discuss how the choice of the perturbation radius in SAM affects the bound, providing insight into why Sharp-MAML improves over MAML in terms of generalization ability.

### 1.3. Related work

We review related work from the following three aspects.

**Loss landscape of non-convex optimization.** The connection between the flatness of minima and the generalization performance of the minimizers has been studied both theoretically and empirically; see, e.g., (Dziugaite & Roy, 2016; Dinh et al., 2017; Keskar et al., 2017; Neyshabur et al., 2019). In a recent study, (Jiang et al., 2019) showed empirically that sharpness-based measures have the highest correlation with generalization. Furthermore, (Izmailov et al., 2018) showed that averaging model weights during training yields flatter minima that generalize better.
**Sharpness-aware minimization.** Motivated by the connection between the sharpness of a minimum and generalization performance, (Foret et al., 2021) developed the SAM algorithm, which encourages the learning algorithm to converge to a flat minimum and thereby improves generalization. Recent follow-up works showed the efficacy of SAM in various settings. Notably, (Bahri et al., 2021) used SAM to improve the generalization performance of language models such as the text-to-text Transformer (Raffel et al., 2020) and its multilingual counterpart (Xue et al., 2020). More importantly, they empirically showed that the gains achieved by SAM are even larger when the training data are limited. Furthermore, (Chen et al., 2021b) showed that vision models such as transformers (Dosovitskiy et al., 2020) and MLP-mixers (Tolstikhin et al., 2021) suffer from sharp loss landscapes and can be better trained via SAM. They showed that the generalization performance of the resulting models improves across various tasks including supervised, adversarial, contrastive, and transfer learning (e.g., an 11.0% increase in top-1 accuracy). However, existing efforts have focused on improving generalization in single-level problems such as ERM (Bahri et al., 2021; Chen et al., 2021b). Different from these works based on single-level ERM, we study SAM in the context of MAML through the lens of bilevel optimization.

Recent works aim to reduce the computation overhead of SAM. In (Du et al., 2021), two new variants of SAM were proposed, namely Stochastic Weight Perturbation and Sharpness-sensitive Data Selection, both of which improve the efficiency of SAM without sacrificing generalization performance. While this work showed remarkable improvement on a standard ERM model, whether it can reduce the computation overhead (without sacrificing generalization) of a MAML model is unknown.

**Model-agnostic meta learning.**
Since it was first developed in (Finn et al., 2017), MAML has been one of the most popular optimization-based meta learning tools for fast few-shot learning. Recent studies revealed that the choice of the lower-level optimizer affects the generalization performance of MAML (Grant et al., 2018a; Antoniou et al., 2019; Park & Oliva, 2019). (Antoniou et al., 2019) pointed out a variety of issues in training MAML, such as sensitivity to neural network architectures, which leads to instability during training and high computational overhead at both training and inference time. They proposed multiple ways to improve the generalization error and stabilize MAML training, calling the resulting framework MAML++. Many recent works focus on analyzing the generalization ability of MAML (Farid & Majumdar, 2021; Denevi et al., 2018; Rothfuss et al., 2021; Chen & Chen, 2022) and improving the generalization performance of MAML (Finn & Levine, 2018; Gonzalez & Miikkulainen, 2020; Park & Oliva, 2019). However, these works do not take into account the geometry of the loss landscape of MAML. In addition to generalization ability, recent works (Wang et al., 2021; Goldblum et al., 2020; Xu et al., 2020) investigated MAML from another important perspective, adversarial robustness: the capability of a model to defend against adversarially perturbed inputs (also known as adversarial attacks in some literature). In contrast, we focus on improving the generalization performance of models trained by MAML, with theoretical guarantees.

## 2. Preliminaries and Motivations

In this section, we first review the basics of MAML and describe the optimization difficulty of learning MAML models, followed by an introduction of the SAM method.

### 2.1. Problem formulation of MAML

The goal of few-shot learning is to train a model that can quickly adapt to a new task using only a few datapoints (usually 1-5 samples per task). Consider M few-shot learning tasks $\{\mathcal{T}_m\}_{m=1}^{M}$ drawn from a distribution $p(\mathcal{T})$.
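To make this setup concrete, the construction of one N-way K-shot task, with a fine-tuning (support) set and a validation (query) set, can be sketched as follows. This is an illustrative sketch with a synthetic data pool; the function and variable names are our own, not from the paper's code.

```python
import random

def sample_episode(pool, n_way=5, k_shot=1, q_query=15, seed=None):
    """Sample one N-way K-shot task: a fine-tuning (support) set D_m
    and a separate validation (query) set D'_m for the same classes."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(pool), n_way)  # the N "ways" of this task
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = rng.sample(pool[cls], k_shot + q_query)
        support += [(x, label) for x in examples[:k_shot]]  # D_m
        query += [(x, label) for x in examples[k_shot:]]    # D'_m
    return support, query

# Toy pool: 20 classes with 30 synthetic examples each.
pool = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
D_m, D_m_val = sample_episode(pool, n_way=5, k_shot=1, q_query=15, seed=0)
```

In a 5-way 1-shot task this yields 5 support examples for fine-tuning and 5 × 15 query examples for the meta-update, mirroring the evaluation protocol used later in Section 5.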
Each task m has a fine-tuning training set $D_m = \{(x_i, y_i)\}_{i=1}^{n}$ and a separate validation set $D'_m = \{(x_i, y_i)\}_{i=1}^{n}$, where the data are independently and identically distributed (i.i.d.) according to the per-task data distribution $\mathcal{P}_m$. MAML seeks to learn a good initialization of the model parameter θ (called the meta-model) such that fine-tuning θ via a small number of gradient updates leads to fast learning on a new task.

Consider a per-datum loss $l: \Theta \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$; define the generic empirical loss over a finite-sample dataset D as $L(\theta; D) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$ and the generic population loss over a data distribution $\mathcal{P}$ as $L(\theta; \mathcal{P}) = \mathbb{E}_{(x,y)\sim\mathcal{P}}[l(\theta, x, y)]$. For a particular task m, these become $L(\theta; D_m)$ or $L(\theta; D'_m)$ and $L(\theta; \mathcal{P}_m)$.

MAML can be formulated as a bilevel optimization problem, where the fine-tuning stage forms a task-specific lower-level problem while the meta-model optimization forms a shared upper-level problem. Namely, the optimization problem of MAML is (Rajeswaran et al., 2019a):

$$\min_{\theta}\ \sum_{m=1}^{M} L(\theta^*_m(\theta); D'_m) \tag{1a}$$

$$\text{s.t.}\quad \theta^*_m(\theta) = \arg\min_{\theta_m}\ L(\theta_m; D_m) + \frac{\|\theta_m - \theta\|^2}{2\beta_{\mathrm{low}}}, \quad \forall m \tag{1b}$$

where $\beta_{\mathrm{low}}$ denotes the lower-level step size. The bilevel optimization problem in (1) is difficult to solve because each upper-level update (1a) requires calling the lower-level optimization oracle (1b) multiple times. There exist many MAML algorithms that solve (1) efficiently, such as Reptile (Nichol & Schulman, 2018) and first-order MAML (Finn et al., 2017), an approximation to MAML obtained by ignoring second-order derivatives. We instead use the one-step gradient update (Finn et al., 2017) to approximate the lower-level problem:

$$\min_{\theta}\ F(\theta) := \sum_{m=1}^{M} L(\hat\theta_m(\theta); D'_m) \quad \text{s.t.}\quad \hat\theta_m(\theta) = \theta - \beta_{\mathrm{low}}\, \nabla L(\theta; D_m). \tag{2}$$

**Generalization performance.** We are particularly interested in the generalization performance of a meta-model θ.

*Figure 1. Loss landscapes of MAML and ERM for a single task on the CIFAR-100 dataset. Details of the architecture are given in Section 5.1.*
*We use the cross-entropy loss following the process in (Li et al., 2018b). Left: loss landscape of a MAML model (5-way 1-shot). Right: loss landscape of a standard ERM model.*

A meta-model θ is obtained by solving the empirical MAML problem (2). Its generalization performance is measured by the expected population loss

$$L(\theta_m(\theta); \mathcal{P}) \triangleq \mathbb{E}_{\mathcal{T}_m}\, \mathbb{E}_{(x,y)\sim\mathcal{P}_m}\big[L(\theta_m(\theta; D_m); D'_m)\big] \tag{3}$$

where the expectation is taken over the randomness of the sampled tasks as well as the data in the training and validation datasets per sampled task. For notational simplicity, we define the marginal data distribution for the variable $(x, y)$ as

$$\mathcal{P}(x, y) \triangleq \mathbb{E}_{\mathcal{T}_m}[\mathcal{P}_m(x, y)] = \int \mathcal{P}(x, y \mid \mathcal{T}_m)\, \mathcal{P}(\mathcal{T}_m)\, d\mathcal{T}_m$$

and we use $\mathcal{P}$ to represent $\mathcal{P}(x, y)$ thereafter.

### 2.2. A local minimizer of ERM implies one of MAML

Nevertheless, even with the simple lower-level gradient descent step (2), training the upper-level meta-model θ still requires differentiating through the lower-level update. In other words, the meta-update requires second-order information (i.e., the Hessian) of the objective function with respect to θ, making problem (1) more involved than an ERM formulation of multi-task learning (called ERM hereafter), given by

$$\min_{\theta}\ \sum_{m=1}^{M} L(\theta; D_m). \tag{5}$$

To understand the difficulty of the MAML objective in (2), we visualize its loss landscape for a particular task m and compare it with that of ERM for the same task. Figure 1 shows the per-task loss landscapes of a meta-model θ in (1a) and a standard ERM model in (5) on the CIFAR-100 dataset. We find that the loss landscape of a meta-model is indeed much more involved, with more local minima, making the optimization problem difficult to solve. The following lemma also characterizes the complex landscape of the meta-model on a particular task m; its proof is deferred to Appendix B.

**Lemma 1** (Local minimizers of MAML). *Consider the one-step gradient fine-tuning step (2). For any $m \le M$, assume $L(\theta; D_m)$ has continuous third-order derivatives.*
*Then for a particular task m, the following two statements hold: a) the stationary points of $L(\theta; D_m)$ are also stationary points of $L(\hat\theta_m(\theta); D_m)$; and b) the local minimizers of $L(\theta; D_m)$ are also local minimizers of $L(\hat\theta_m(\theta); D_m)$.*

Lemma 1 shows that for a given task m, the number of stationary points and local minimizers of ERM's loss $L(\theta; D_m)$ is no larger than that of MAML's loss $L(\hat\theta_m(\theta); D_m)$, which is aligned with the empirical observations in Figure 1. While some of the local minimizers in MAML's loss landscape are indeed effective few-shot learners, there are a number of sharp local minimizers in MAML that may have undesired generalization performance. This also suggests that the optimization of MAML can be more challenging than its ERM counterpart.

### 2.3. Sharpness-aware minimization

SAM is a recently developed technique that leverages the geometry of the loss landscape to improve generalization by simultaneously minimizing the loss value and the loss sharpness (Foret et al., 2021). Given the empirical loss $L(\theta; D)$, the goal of training is to choose θ with a low population loss $L(\theta; \mathcal{P})$. SAM achieves this through the following optimization problem:

$$\min_{\theta}\ L^{\mathrm{sam}}(\theta; D) \quad \text{with} \quad L^{\mathrm{sam}}(\theta; D) \triangleq \max_{\|\epsilon\|_2 \le \alpha} L(\theta + \epsilon; D). \tag{6}$$

Given θ, the maximization in (6) seeks the weight perturbation ε within the Euclidean ball of radius α that maximizes the empirical loss. If we define the sharpness as

$$\max_{\|\epsilon\|_2 \le \alpha} \big[L(\theta + \epsilon; D) - L(\theta; D)\big] \tag{7}$$

then (6) essentially minimizes the sum of the sharpness and the empirical loss $L(\theta; D)$. While the maximization in (6) is generally costly, a closed-form approximate maximizer was proposed in (Foret et al., 2021) by invoking a Taylor expansion of the empirical loss.
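To illustrate, this closed-form approximate maximizer (a normalized gradient-ascent step) and the resulting descent update can be sketched on a toy quadratic loss. The loss, step size, and radius below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def loss(theta):
    # Toy empirical loss L(θ; D): a simple quadratic bowl.
    return 0.5 * np.sum(theta ** 2)

def grad(theta):
    return theta  # analytic gradient of the toy quadratic loss

def sam_step(theta, alpha=0.1, beta=0.1):
    """One SAM iteration: ascend to the first-order approximate maximizer
    ε = α ∇L(θ)/||∇L(θ)||₂, then descend using the gradient at θ + ε."""
    g = grad(theta)
    eps = alpha * g / np.linalg.norm(g)  # approximate worst-case perturbation
    return theta - beta * grad(theta + eps)

theta = np.array([3.0, 4.0])
theta_next = sam_step(theta)  # loss decreases even though the gradient
                              # is taken at the perturbed point θ + ε
```

On this toy loss one can check that the perturbation has norm exactly α and that the update still decreases the loss, which is the behavior the formal two-step procedure below formalizes.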
In this case, SAM seeks a flat minimum by iteratively applying the following two-step procedure at each iteration t:

$$\epsilon(\theta_t) = \alpha\, \nabla L(\theta_t; D) / \|\nabla L(\theta_t; D)\|_2 \tag{8a}$$

$$\theta_{t+1} = \theta_t - \beta_t\, \nabla L(\theta_t + \epsilon(\theta_t); D) \tag{8b}$$

where $\beta_t$ is an appropriately scheduled learning rate. In (8b) and thereafter, the notation $\nabla L(\theta + \epsilon_m(\theta))$ means $\nabla L(\theta + \epsilon_m(\theta)) \triangleq \nabla_x L(x)\big|_{x = \theta + \epsilon_m(\theta)}$. SAM works particularly well for complex and non-convex problems with a myriad of local minima, where different minima yield models with different generalization abilities.

## 3. Sharp-MAML: Sharpness-Aware MAML

As discussed in Section 2.1, MAML has a complex loss landscape with multiple local and global minima that may yield similar values of the empirical loss $L(\theta; D)$ while having significantly different generalization performance. Therefore, we propose integrating SAM with MAML, which yields a new bilevel optimization problem.

### 3.1. Problem formulation of Sharp-MAML

We propose a unified optimization framework for sharpness-aware MAML, which we term Sharp-MAML, using two hyperparameters $\alpha_{\mathrm{up}} \ge 0$ and $\alpha_{\mathrm{low}} \ge 0$:

$$\min_{\theta}\ \max_{\|\epsilon\|_2 \le \alpha_{\mathrm{up}}}\ \sum_{m=1}^{M} L(\theta^*_m(\theta + \epsilon); D'_m) \qquad \text{(upper)} \tag{P}$$

$$\text{s.t.}\quad \theta^*_m(\theta) = \arg\min_{\theta_m}\ \max_{\|\epsilon_m\|_2 \le \alpha_{\mathrm{low}}} L(\theta_m + \epsilon_m; D_m) + \frac{\|\theta_m - \theta\|^2}{2\beta_{\mathrm{low}}}, \quad m = 1, \dots, M. \qquad \text{(lower)}$$

Compared with the bilevel formulation of MAML in (1), the Sharp-MAML formulation above is a four-level problem. However, in our algorithm design we efficiently approximate the two maximizations in (P), so that the cost of Sharp-MAML is almost the same as that of MAML. In what follows, we list three main technical questions that we aim to address.

**Q1.** The choice of $\alpha_{\mathrm{up}}, \alpha_{\mathrm{low}}$ determines the specific scenario of integrating SAM with MAML. Applying SAM to both the fine-tuning and meta-update stages would be computationally very expensive. Spurred by this, we ask: *Is it possible to achieve better generalization by incorporating SAM into only either the upper- or the lower-level problem?*
**Q2.** Both MAML in (1) and SAM in (6) are bilevel optimization problems requiring several lower-level optimization steps. Thus, we also study whether or not computationally-efficient alternatives (e.g., ESAM (Du et al., 2021), ANIL (Raghu et al., 2020)) can promise good generalization.

**Q3.** The theoretical motivation for SAM was given in (Foret et al., 2021) by bounding the generalization ability in terms of the neighborhood-wise training loss. Spurred by this, we further ask: *Can we explain and theoretically justify why integrating SAM with MAML is effective in promoting the generalization performance of MAML models?*

### 3.2. Algorithm development

Based on (P), we focus on three variants of Sharp-MAML that differ in their respective computational complexity:

(a) **Sharp-MAML_low**: SAM is applied only to the fine-tuning step, i.e., $\alpha_{\mathrm{low}} > 0$ and $\alpha_{\mathrm{up}} = 0$.
(b) **Sharp-MAML_up**: SAM is applied only to the meta-update step, i.e., $\alpha_{\mathrm{low}} = 0$ and $\alpha_{\mathrm{up}} > 0$.
(c) **Sharp-MAML_both**: SAM is applied to both the fine-tuning and meta-update steps, i.e., $\alpha_{\mathrm{up}}, \alpha_{\mathrm{low}} > 0$.

Below we introduce only Sharp-MAML_both in detail and leave the pseudo-code of the other two variants to Appendix A, since they can be deduced from Sharp-MAML_both. For convenience, we define the biased mini-batch gradient descent (BGD) at the point $\theta_t + \epsilon$ using the gradient at $\theta_t + \epsilon + \epsilon_m$ as

$$\mathrm{BGD}_m(\theta_t, \epsilon, \epsilon_m) \triangleq \theta_t + \epsilon - \beta_{\mathrm{low}}\, \widetilde\nabla L(\theta_t + \epsilon + \epsilon_m; D_m) \tag{9}$$

where ε and $\epsilon_m$ are perturbation vectors computed according to the specific Sharp-MAML variant, and $\widetilde\nabla L(\,\cdot\,; D_m)$ is an unbiased estimator of $\nabla L(\,\cdot\,; D_m)$ obtained by mini-batch evaluation. Moreover, letting $\widetilde\theta_m(\theta_t) = \mathrm{BGD}_m(\theta_t, \epsilon, \epsilon_m)$, we define

$$\nabla_{\theta_t} L(\widetilde\theta_m(\theta_t); D'_m) \triangleq \big(I - \beta_{\mathrm{low}}\, \nabla^2 L(\theta_t + \epsilon + \epsilon_m; D_m)\big)\, \nabla L(\widetilde\theta_m(\theta_t); D'_m) \tag{10}$$

where $\nabla^2 L(\theta_t + \epsilon + \epsilon_m; D_m)$ is the Hessian matrix of $L(\,\cdot\,; D_m)$ at $\theta_t + \epsilon + \epsilon_m$.

**Sharp-MAML_both.** For each task m, we compute its corresponding perturbation $\epsilon_m(\theta_t)$ as follows:

$$\epsilon_m(\theta_t) = \alpha_{\mathrm{low}}\, \widetilde\nabla L(\theta_t; D_m) / \|\widetilde\nabla L(\theta_t; D_m)\|_2. \tag{11}$$
Thereafter, the fine-tuning step performs gradient descent at $\theta_t$ using the gradient at the perturbed point $\theta_t + \epsilon_m(\theta_t)$ via (9):

$$\theta^1_m(\theta_t) = \mathrm{BGD}_m(\theta_t, 0, \epsilon_m(\theta_t)). \tag{12}$$

After we obtain $\theta^1_m(\theta_t)$ for all tasks, we compute the mini-batch gradient estimator of the upper-level loss, i.e., $h = \widetilde\nabla_{\theta_t} \sum_{m=1}^{M} L(\theta^1_m(\theta_t); D'_m)$, which is an unbiased estimator of the upper-level gradient $\nabla_{\theta_t} \sum_{m=1}^{M} L(\theta^1_m(\theta_t); D'_m)$, and use it to compute the meta perturbation $\epsilon(\theta_t)$ via

$$\epsilon(\theta_t) = \alpha_{\mathrm{up}}\, h / \|h\|_2. \tag{13}$$

Afterwards, we compute the new perturbed fine-tuned parameter, denoted by $\theta^2_m(\theta_t)$, using the gradient at the point $\theta_t + \epsilon(\theta_t) + \epsilon_m(\theta_t)$ in (9):

$$\theta^2_m(\theta_t) = \mathrm{BGD}_m(\theta_t, \epsilon(\theta_t), \epsilon_m(\theta_t)). \tag{14}$$

Finally, for the meta-update stage, we evaluate the upper-level loss at the fine-tuned parameter $\theta^2_m(\theta_t)$ obtained from (14) and update the meta-parameter θ via

$$\theta_{t+1} = \theta_t - \beta_{\mathrm{up}} \sum_{m=1}^{M} \widetilde\nabla_{\theta_t} L(\theta^2_m(\theta_t); D'_m). \tag{15}$$

**Algorithm 1** Pseudo-code for Sharp-MAML_both; red lines need to be modified for Sharp-MAML_up; blue lines need to be modified for Sharp-MAML_low.

```
 1: Require: p(T): distribution over tasks
 2: Require: β_low, β_up: step sizes
 3: Require: α_low > 0, α_up > 0: perturbation radii
 4: for t = 1, ..., T do
 5:   Sample a batch of tasks T_m ~ p(T)
 6:   for all T_m do
 7:     Sample K examples from D_m
 8:     Evaluate ∇̃L(θ_t; D_m)
 9:     Compute the perturbation ε_m(θ_t) via (11)
10:     Compute the fine-tuned parameter θ¹_m(θ_t) via (12)
11:     Sample data from D'_m for the meta-update
12:   end for
13:   Compute Σ_{m=1}^M ∇̃L(θ¹_m(θ_t); D'_m)
14:   Compute the perturbation ε(θ_t) via (13)
15:   Update θ via (15) using θ²_m(θ_t) from (14)
16: end for
```

See the pseudocode of Sharp-MAML_both in Algorithm 1. The algorithms for Sharp-MAML_up and Sharp-MAML_low can be deduced by setting $\epsilon_m(\theta_t) = 0$ and $\epsilon(\theta_t) = 0$ in Algorithm 1, respectively; they are formally stated in Algorithm 3 and Algorithm 2 in Appendix A.
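To make the sequence of updates (11)-(15) concrete, one Sharp-MAML_both iteration can be sketched on toy per-task quadratic losses, for which the Hessian in (10) is the identity so the meta-gradient is exact rather than estimated. All losses, targets, and constants below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy per-task losses: L(θ; D_m) = ½||θ − c_m||² and L(θ; D'_m) = ½||θ − v_m||².
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]      # fine-tuning targets c_m
val_centers = [np.array([1.5, 0.5]), np.array([0.5, 1.5])]  # validation targets v_m
grad = lambda theta, c: theta - c  # ∇L; the Hessian here is the identity I

def sharp_maml_both_step(theta, a_low=0.05, a_up=0.05, b_low=0.1, b_up=0.1):
    # (11): per-task lower-level perturbations ε_m(θ)
    eps_m = [a_low * grad(theta, c) / np.linalg.norm(grad(theta, c)) for c in centers]
    # (12): fine-tune with the gradient at θ + ε_m, i.e. BGD_m(θ, 0, ε_m)
    theta1 = [theta - b_low * grad(theta + e, c) for e, c in zip(eps_m, centers)]
    # Upper-level gradient h via (10); here I − β_low ∇²L = (1 − β_low) I
    h = sum((1 - b_low) * grad(t1, v) for t1, v in zip(theta1, val_centers))
    # (13): meta perturbation ε(θ)
    eps = a_up * h / np.linalg.norm(h)
    # (14): perturbed fine-tuned parameters, i.e. BGD_m(θ, ε, ε_m)
    theta2 = [theta + eps - b_low * grad(theta + eps + e, c)
              for e, c in zip(eps_m, centers)]
    # (15): meta-update with the upper-level gradients at θ²_m
    meta_grad = sum((1 - b_low) * grad(t2, v) for t2, v in zip(theta2, val_centers))
    return theta - b_up * meta_grad

theta = np.zeros(2)
theta_next = sharp_maml_both_step(theta)
```

Setting `a_low = 0` or `a_up = 0` in this sketch recovers the Sharp-MAML_up and Sharp-MAML_low variants, respectively, mirroring how they are deduced from Algorithm 1.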
## 4. Theoretical Analysis of Sharp-MAML

In this part, we rigorously analyze the performance of the proposed Sharp-MAML method in terms of its convergence rate and generalization error.

### 4.1. Optimization analysis

To quantify the optimization performance of solving the one-step version of (1), we introduce the following assumptions.

**Assumption 1** (Lipschitz continuity). *Assume that $L(\theta; D'_m)$, $\nabla L(\theta; D_m)$, $\nabla L(\theta; D'_m)$, and $\nabla^2 L(\theta; D_m)$, $\forall m$, are Lipschitz continuous with constants $\ell_0, \ell_1, \ell_1, \ell_2$, respectively.*

**Assumption 2** (Stochastic derivatives). *Assume that $\widetilde\nabla L(\theta; D_m)$, $\widetilde\nabla^2 L(\theta; D_m)$, and $\widetilde\nabla L(\theta; D'_m)$ are unbiased estimators of $\nabla L(\theta; D_m)$, $\nabla^2 L(\theta; D_m)$, and $\nabla L(\theta; D'_m)$, respectively, and that their variances are bounded by $\sigma^2$.*

Assumptions 1-2 also appear, in similar forms, in the convergence analysis of meta learning and bilevel optimization (Finn et al., 2019; Rajeswaran et al., 2019a; Fallah et al., 2020; Chen et al., 2021a; Ji et al., 2022; Chen et al., 2022). With the above assumptions, we introduce a novel biased MAML framework that includes MAML and Sharp-MAML as special cases, and obtain the following theorem. The proof is deferred to Appendix C.

**Theorem 1.** *Under Assumptions 1-2, choosing the step sizes and perturbation radii as $\beta_{\mathrm{low}}, \beta_{\mathrm{up}}, \alpha_{\mathrm{up}} = O(1/\sqrt{T})$ and $\alpha_{\mathrm{low}} = O(1)$ with proper constants, the iterates $\{\theta_t\}$ generated by Sharp-MAML_up, Sharp-MAML_low, and Sharp-MAML_both satisfy*

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\, \|\nabla F(\theta_t)\|^2 = O\!\Big(\frac{1}{\sqrt{T}}\Big)$$

*where $F(\theta)$ is the objective function of MAML in (2).*

Theorem 1 implies that, by choosing a proper perturbation threshold, all three versions of Sharp-MAML can still find ε-stationary points of the MAML objective (2) with $O(\epsilon^{-2})$ iterations and $O(\epsilon^{-2})$ samples, which matches or even improves the state-of-the-art sample complexity of MAML (Rajeswaran et al., 2019a; Fallah et al., 2020; Ji et al., 2022).

### 4.2. Generalization analysis

To analyze the generalization error of Sharp-MAML, we make assumptions similar to those of Theorem 2 in (Foret et al., 2021). Recall the population loss $L(\theta; \mathcal{P}) = \mathbb{E}_{(x,y)\sim\mathcal{P}}[l(\theta, x, y)]$.
Denote the stationary point obtained by the Sharp-MAML_up algorithm as $\hat\theta$. Note that Sharp-MAML adopts gradient descent (GD) as the lower-level algorithm, which is uniformly stable in the sense of Definition 1.

**Definition 1** ((Hardt et al., 2016)). *An algorithm A is γ-uniformly stable if, for all datasets $S, S' \in \mathcal{Z}^n$ that differ in at most one example, we have*

$$\sup_{(x,y)}\ \big|\mathbb{E}\big[l(A(S); x, y) - l(A(S'); x, y)\big]\big| \le \gamma \tag{17}$$

*where $A(S)$ and $A(S')$ are the outputs of algorithm A given datasets S and S'.*

With the above definition of uniform stability, we are ready to establish the generalization performance. We defer the proof of Theorem 2 to Appendix D.

**Theorem 2.** *Assume the loss function is bounded: $0 \le L(\hat\theta_m; D) \le 1$ for $\hat\theta_m$ defined in (2) and any D. Define $F(\theta; \mathcal{P}) = \mathbb{E}_{D\sim\mathcal{P}}[F(\theta; D)]$. Assume $F(\hat\theta; \mathcal{P}) \le \mathbb{E}_{\epsilon\sim\mathcal{N}(0, \alpha^2 I)}\big[F(\hat\theta + \epsilon; \mathcal{P})\big]$ at the stationary point $\hat\theta$ of Sharp-MAML_up. For the parameter $\theta_m(\hat\theta; D)$ learned with a $\gamma_A$-uniformly stable algorithm A from $\hat\theta \in \mathbb{R}^k$, with probability $1 - \delta$ over the choice of the training set $D \sim \mathcal{P}$ with $|D| = nM$, it holds that*

$$F(\hat\theta; \mathcal{P}) \le \max_{\|\epsilon\|_2 \le \alpha} F(\hat\theta + \epsilon; D) + \gamma_A + \sqrt{\frac{k \ln\!\Big(1 + \frac{\|\hat\theta\|_2^2}{\alpha^2}\big(1 + \sqrt{\ln(nM)/k}\big)^2\Big) + 2\ln\frac{1}{\delta} + 5\ln(nM)}{nM - 1}}. \tag{18}$$

**Improved upper bound on the generalization error.** Theorem 2 shows that the difference between the population loss and the empirical loss of Sharp-MAML_up is bounded by the stability of the lower-level update, $\gamma_A$, plus another $O(k/(nM))$ term that vanishes as the number of meta-training data goes to infinity. The lower-level update GD has uniform stability of order $O(1/n)$ (Hardt et al., 2016). It is also worth noting that the upper bound on the population loss on the right-hand side (RHS) of (18) is a function of α, and for any sufficiently small $\alpha_0 > 0$, we can find some $\alpha_1 > \alpha_0$ at which this function takes a smaller value than at $\alpha_0$ (see the proof in Appendix D).
This suggests that a choice of α arbitrarily close to zero (in which case Sharp-MAML reduces to the original MAML method) is not optimal in terms of the generalization error upper bound. Therefore, Sharp-MAML has a smaller generalization error upper bound than conventional MAML. The analysis can be extended to Sharp-MAML_both in a similar way.

## 5. Numerical Results

In this section, we demonstrate the effectiveness of Sharp-MAML by comparing it with several popular MAML baselines in terms of generalization and computation cost. We evaluate Sharp-MAML in the 5-way 1-shot and 5-way 5-shot settings on the Mini-Imagenet dataset and present the results on the Omniglot dataset in Appendix E.

### 5.1. Experiment setups

Our model follows the same architecture used by (Vinyals et al., 2016), comprising 4 modules, each with a 3×3 convolution with 64 filters followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU non-linearity, and 2×2 max-pooling. We follow the experimental protocol in (Finn et al., 2017). The models were trained using the SAM algorithm¹ with Adam as the base optimizer and a learning rate of 0.001. Following (Ravi & Larochelle, 2017), 15 examples per class were used to evaluate the post-update meta-gradient. The values of $\alpha_{\mathrm{low}}, \alpha_{\mathrm{up}}$ are taken from the set {0.05, 0.005, 0.0005, 0.00005}, and each experiment is run for each value with three random seeds. We choose the number of inner gradient steps from the set {3, 5, 7, 10} and the step size from the set {0.1, 0.01, 0.001}.

¹We used the open-source SAM PyTorch implementation available at https://github.com/davda54/sam

*Table 1. Results on Mini-Imagenet (5-way 1-shot). Our reproduced result of MAML is close to that of the original.*
| Algorithm | Accuracy |
|---|---|
| Matching Nets | 43.56% |
| iMAML (Rajeswaran et al., 2019b) | 49.30% |
| CAVIA (Zintgraf et al., 2019b) | 47.24% |
| Reptile (Nichol & Schulman, 2018) | 49.97% |
| FOMAML (Nichol & Schulman, 2018) | 48.07% |
| LLAMA (Grant et al., 2018b) | 49.40% |
| BMAML (Yoon et al., 2018) | 49.17% |
| MAML (reproduced†) | 47.13% |
| Sharp-MAML_low | 49.72% |
| Sharp-MAML_up | 49.56% |
| Sharp-MAML_both | 50.28% |

†Reproduced using the Torchmeta (Deleu et al., 2019) library.

*Table 2. Results on Mini-Imagenet (5-way 5-shot). Our reproduced result of MAML is close to that of the original.*

| Algorithm | Accuracy |
|---|---|
| Matching Nets | 55.31% |
| CAVIA (Zintgraf et al., 2019b) | 59.05% |
| Reptile (Nichol & Schulman, 2018) | 65.99% |
| FOMAML (Nichol & Schulman, 2018) | 63.15% |
| BMAML (Yoon et al., 2018) | 64.23% |
| MAML (reproduced†) | 62.20% |
| Sharp-MAML_low | 63.18% |
| Sharp-MAML_up | 63.06% |
| Sharp-MAML_both | 65.04% |

†Reproduced using the Torchmeta (Deleu et al., 2019) library.

For Sharp-MAML_both, we use the same value of $\alpha_{\mathrm{low}}$ and $\alpha_{\mathrm{up}}$ in each experiment. We report the best results in Tables 1 and 2. One Sharp-MAML update executes two backpropagation operations (i.e., one to compute ε(θ) and another to compute the final gradient). Therefore, for a fair comparison, we run each MAML training for twice as many epochs as each Sharp-MAML training and report the best score achieved by each MAML run across either the standard or the doubled epoch count.

### 5.2. Sharp-MAML versus MAML baselines

As baselines, we use MAML (Finn et al., 2017), Matching Nets (Vinyals et al., 2016), CAVIA (Zintgraf et al., 2019b), Reptile (Nichol & Schulman, 2018), FOMAML (Nichol & Schulman, 2018), and BMAML (Yoon et al., 2018). In Tables 1 and 2, we report the accuracy of the three variants of Sharp-MAML and the other baselines on the Mini-Imagenet dataset in the 5-way 1-shot and 5-way 5-shot settings, respectively. We observe that Sharp-MAML_low outperforms MAML in all cases, exhibiting the advantage of our methods.
The results on the Omniglot dataset are reported in Table 4 and Table 5 of the Appendix. Our results verify the efficacy of all three variants of Sharp-MAML, suggesting that SAM indeed improves the generalization performance of bilevel models like MAML by seeking out flatter minima. Since Sharp-MAML requires one more gradient computation per iteration than the original MAML, for a fair comparison, we report the execution times in Table 3.

Table 3. Results on Mini-Imagenet (5-way 1-shot). Execution time is normalized to the MAML training time.

ALGORITHMS            ACCURACY   TIME
MAML (reproduced)     47.13%     x1
Sharp-MAMLlow         49.72%     x2.60
Sharp-MAMLup          49.56%     x3.60
Sharp-MAMLboth        50.28%     x4.60
Sharp-MAMLlow-ANIL    49.19%     x1.40
ESharp-MAMLlow        48.90%     x2.20
ESharp-MAMLlow-ANIL   49.03%     x1.20

The results show that Sharp-MAMLlow requires the least additional computation while still achieving significant performance gains. Sharp-MAMLup and Sharp-MAMLboth also improve the performance significantly, but both approaches have a higher computation cost than Sharp-MAMLlow, since additional Hessians must be computed for the meta-update gradient.

5.3. Ablation study and loss landscape visualization

We conduct an ablation study on the effect of the perturbation radii αlow and αup on the three Sharp-MAML variants. Figure 2 and Figure 3 summarize the results on the Mini-Imagenet dataset. We observe that all three Sharp-MAML variants outperform the original MAML for almost all the values of αlow, αup used in our experiments. Therefore, integrating SAM at either or both stages gives better performance than the original MAML for a wide range of perturbation sizes, reducing the need to fine-tune these hyperparameters. This also suggests that SAM effectively avoids bad local minima in the MAML loss landscape (cf. Figure 1) for a wide range of perturbation radii.
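The Hessian cost mentioned above comes from the chain rule for the meta-gradient, which carries a factor (I − β_low ∇²L). As a self-contained illustration (not the paper's implementation), the meta-gradient of a one-step MAML objective on a toy quadratic task loss can be computed in closed form and checked against finite differences; the matrices and step size below are illustrative assumptions.

```python
import numpy as np

BETA_LOW = 0.1  # illustrative inner step size

def maml_meta_gradient(theta, A, b):
    """Meta-gradient of F(theta) = L(theta - beta_low * grad L(theta)) for the
    quadratic task loss L(theta) = 0.5*theta^T A theta - b^T theta, via the
    chain rule: (I - beta_low * Hessian) @ grad L at the fine-tuned parameter."""
    grad = lambda th: A @ th - b
    hess = A                                      # Hessian of the quadratic loss
    theta_ft = theta - BETA_LOW * grad(theta)     # one inner fine-tuning step
    return (np.eye(len(theta)) - BETA_LOW * hess) @ grad(theta_ft)

A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([1.0, 2.0])
theta = np.array([3.0, 3.0])
g = maml_meta_gradient(theta, A, b)
# For this toy problem the analytic meta-gradient is [3.2, 3.6].

# Finite-difference check of the meta-objective F(theta) = L(theta_ft(theta)).
def F(th):
    th_ft = th - BETA_LOW * (A @ th - b)
    return 0.5 * th_ft @ A @ th_ft - b @ th_ft

e = 1e-5
fd = np.array([(F(theta + e * v) - F(theta - e * v)) / (2 * e) for v in np.eye(2)])
```

Evaluating this correction term at a perturbed point, as Sharp-MAMLup and Sharp-MAMLboth do, is precisely what requires the extra Hessian computations reported in Table 3.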
In Figure 4, we plot the loss landscapes of MAML and Sharp-MAML, and observe that Sharp-MAML indeed finds landscapes that are smoother than that of the original MAML, consistent with our theoretical characterization of improved generalization performance. Furthermore, the generalization error of Sharp-MAMLboth is 34.58%/8.56%, compared to 37.46%/11.58% for MAML, in the 5-way 1-shot/5-shot Mini-Imagenet settings, which explains the advantage of our approach.

5.4. Computationally-efficient version of Sharp-MAML

Figure 2. Performance under different values of αlow, αup on Mini-Imagenet (5-way 1-shot). For Sharp-MAMLboth, we used the same value of αlow and αup (i.e., αlow = αup).

Figure 3. Performance under different values of αlow, αup on Mini-Imagenet (5-way 5-shot). For Sharp-MAMLboth, we used the same value of αlow and αup (i.e., αlow = αup).

Next, we investigate whether the computational overhead of Sharp-MAMLlow can be further reduced by leveraging the computationally-efficient MAML variant, almost-no-inner-loop (ANIL) (Raghu et al., 2020), and the computationally-efficient SAM variant, ESAM (Du et al., 2021). Sharp-MAML-ANIL denotes ANIL combined with Sharp-MAMLlow; ESharp-MAML denotes ESAM combined with Sharp-MAMLlow; ESharp-MAML-ANIL is Sharp-MAMLlow with both ANIL and ESAM. In ANIL, fine-tuning is only applied to the task-specific head, with the representation network frozen at the meta-model. Motivated by (Raghu et al., 2020), we ask whether incorporating Sharp-MAMLlow in ANIL can ameliorate the computational overhead while preserving the performance gains obtained with Sharp-MAMLlow. ANIL decomposes the meta-model θ into two parts: the representation encoding network, denoted by θ_r, and the classification head, denoted by θ_c, i.e., θ = [θ_r, θ_c]. Different from (1b), ANIL then only fine-tunes θ_c for a specific task m, given by

θ*_m(θ) = arg min_{θ_m: θ_{r,m} = θ_r} L(θ_{c,m}, θ_{r,m}; D_m).
(19)

In other words, the initialized representation θ_r, which comprises most of the network, is unchanged during fine-tuning. ESAM leverages two training strategies, Stochastic Weight Perturbation (SWP) and Sharpness-Sensitive Data Selection (SDS). SWP saves computation by stochastically selecting a subset of weights to perturb in each iteration, and SDS judiciously selects a subset of data that is sensitive to sharpness. To be specific, SWP uses a gradient mask v = (v_1, ..., v_M) with v_i i.i.d. Bern(ξ) for each i. In SDS, instead of computing L_N(θ + ϵ(θ)) over the full set of samples N, a subset of samples N+ is selected whose loss values increase the most with ϵ(θ); that is,

N+ ≜ {(x_i, y_i) ∈ N : l(θ + ϵ; x_i, y_i) − l(θ; x_i, y_i) > τ}

and the remaining samples, with l(θ + ϵ; x_i, y_i) − l(θ; x_i, y_i) < τ, are discarded, where the threshold τ controls the size of N+. Furthermore, µ = |N+|/|N| is the ratio of the number of selected samples to the batch size and determines the exact value of τ. In practice, µ is selected to maximize efficiency while preserving generalization performance.

Figure 4. One-task cross-entropy loss landscapes of MAML and different variants of Sharp-MAML trained on the CIFAR-100 dataset (5-way 1-shot setting) using class one. The plots are generated following (Li et al., 2018b). Details of the architecture are given in Section 5.1.

In Table 3, we report our results on three computationally-efficient versions of Sharp-MAML. We find that Sharp-MAML-ANIL is comparable in performance to Sharp-MAML while requiring almost 86% less computation. ESharp-MAML also reduces the computation, but incurs a slight performance loss. We suspect that this is due to the nested structure of the meta-learning problem, which adversely affects the two training strategies used in ESAM. We further investigate the effect of combining both ANIL and ESAM with Sharp-MAML and observe a significant reduction in computation (116% faster) with slight performance degradation as compared to Sharp-MAML.
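The SDS selection rule described above can be sketched as follows. This is our reading of the selection step, not the authors' code: given per-sample losses before and after the perturbation, keep the fraction µ of samples whose loss increases the most (which implicitly fixes the threshold τ). The toy per-sample losses below are illustrative assumptions.

```python
import numpy as np

def sds_select(per_sample_loss, theta, eps, mu=0.5):
    """Sharpness-sensitive data selection (SDS) sketch: keep the fraction mu of
    samples whose loss increases the most when theta is perturbed by eps."""
    base = per_sample_loss(theta)            # l(theta; x_i, y_i) for all i
    pert = per_sample_loss(theta + eps)      # l(theta + eps; x_i, y_i)
    gain = pert - base                       # sharpness-sensitivity of each sample
    k = max(1, int(mu * len(gain)))
    # The threshold tau is implied by mu: taking the top-k gains is equivalent
    # to selecting all samples with gain above some tau.
    return np.argsort(gain)[-k:]

# Toy per-sample quadratic losses l_i(theta) = 0.5 * a_i * (theta - c_i)^2;
# larger a_i means a sharper sample.
a = np.array([1.0, 5.0, 10.0, 0.1])
c = np.zeros(4)
loss_fn = lambda th: 0.5 * a * (th - c) ** 2
idx = sds_select(loss_fn, theta=1.0, eps=0.1, mu=0.5)
# idx contains the indices of the sharpest samples (largest a_i), i.e., 1 and 2
```

Only the selected subset is then used in the SAM update, which is where the computational saving comes from.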
When compared to MAML, ESharp-MAML-ANIL performs considerably better (+1.90% gain in accuracy) while requiring only 20% more computation.

6. Conclusions

In this paper, we study sharpness-aware minimization (SAM) in the context of model-agnostic meta-learning (MAML) through the lens of bilevel optimization. We name our new MAML method Sharp-MAML. Through a systematic empirical and theoretical study, we find that adding SAM to either or both of the fine-tuning and meta-update stages improves the generalization performance. We further find that incorporating SAM in the fine-tuning stage alone offers the best trade-off between performance and computation. To further reduce computational overhead, we leverage techniques such as efficient SAM and almost-no-inner-loop to speed up Sharp-MAML without sacrificing generalization.

Acknowledgements

The work was partially supported by NSF MoDL-SCALE Grant 2134168 and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

References

Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., and Perona, P. Task2vec: Task embedding for meta-learning. In Proc. International Conference on Computer Vision, pp. 6430-6439, Seoul, South Korea, 2019.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Proc. Advances in Neural Information Processing Systems, pp. 3981-3989, Barcelona, Spain, December 2016.

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. In Proc. International Conference on Learning Representations, New Orleans, LA, 2019.

Bahri, D., Mobahi, H., and Tay, Y. Sharpness-aware minimization improves language model generalization. arXiv preprint arXiv:2110.08529, 2021.

Brock, A., Lim, T., Ritchie, J. M., and Weston, N. SMASH: One-shot model architecture search through hypernetworks. In Proc.
International Conference on Learning Representations, Vancouver, Canada, April 2018.

Chen, L. and Chen, T. Is Bayesian model-agnostic meta learning better than model-agnostic meta learning, provably? In Proc. International Conference on Artificial Intelligence and Statistics, March 2022.

Chen, T., Sun, Y., and Yin, W. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. In Proc. Advances in Neural Information Processing Systems, volume 34, Virtual, 2021a.

Chen, T., Sun, Y., Xiao, Q., and Yin, W. A single-timescale method for stochastic bilevel optimization. In Proc. International Conference on Artificial Intelligence and Statistics, pp. 2466-2488, March 2022.

Chen, X., Hsieh, C., and Gong, B. When vision transformers outperform resnets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.0154, 2021b.

Deleu, T., Würfl, T., Samiei, M., Cohen, J. P., and Bengio, Y. Torchmeta: A meta-learning library for PyTorch, 2019. URL https://arxiv.org/abs/1909.06576.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Proc. Advances in Neural Information Processing Systems, volume 31, Montreal, Canada, 2018.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. In Proc. International Conference on Machine Learning, pp. 1019-1028, Sydney, Australia, August 2017.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. International Conference on Learning Representations, Virtual, April 2020.

Du, J., Yan, H., Feng, J., Zhou, J. T., Zhen, L., Goh, R. S. M., and Tan, V. Y. F. Efficient sharpness-aware minimization for improved training of neural networks. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, Virtual, June 2021.

Dziugaite, G. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proc. Conference on Uncertainty in Artificial Intelligence, Sydney, Australia, August 2016.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082-1092, 2020.

Farid, A. and Majumdar, A. Generalization bounds for meta-learning via PAC-Bayes and uniform stability. Advances in Neural Information Processing Systems, 34, 2021.

Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In Proc. International Conference on Learning Representations, Vancouver, Canada, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning, Sydney, Australia, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. In International Conference on Machine Learning, pp. 1920-1930, 2019.

Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In Proc. International Conference on Learning Representations, Virtual, April 2021.

Goldblum, M., Fowl, L., and Goldstein, T. Adversarially robust few-shot learning: A meta-learning approach. In Proc. Advances in Neural Information Processing Systems, Virtual, December 2020.

Gonzalez, S. and Miikkulainen, R. Improved training speed, accuracy, and data utilization through loss function optimization. In Proc. IEEE Congress on Evolutionary Computation, pp. 1-8, Glasgow, United Kingdom, 2020.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T.
Recasting gradient-based meta-learning as hierarchical Bayes. In Proc. International Conference on Learning Representations, April 2018a.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. In Proc. International Conference on Learning Representations, Vancouver, Canada, 2018b.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225-1234, 2016.

Hsu, K., Levine, S., and Finn, C. Unsupervised learning via meta-learning. In Proc. International Conference on Learning Representations, Vancouver, Canada, April 2018.

Huang, P., Wang, C., Singh, R., Yih, W., and He, X. Natural language to structured query generation via meta-learning. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 732-738, New Orleans, LA, 2018.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning, Lille, France, June 2015.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Proc. Conference on Uncertainty in Artificial Intelligence, pp. 876-885, Monterey, CA, 2018.

Ji, K., Yang, J., and Liang, Y. Theoretical convergence of multi-step model-agnostic meta-learning. Journal of Machine Learning Research, 23(29):1-41, 2022.

Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. Fantastic generalization measures and where to find them. In Proc. International Conference on Learning Representations, New Orleans, LA, 2019.

Kao, C.-H., Chiu, W.-C., and Chen, P.-Y. MAML is a noisy contrastive learner. arXiv preprint arXiv:2106.15367, 2021.
Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. On large-batch training for deep learning: Generalization gap and sharp minima. In Proc. International Conference on Learning Representations, Toulon, France, May 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems, pp. 1097-1105, Lake Tahoe, NV, 2012.

Langley, P. Crafting papers on machine learning. In Proc. International Conference on Machine Learning, pp. 1207-1216, Stanford, CA, 2000.

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302-1338, 2000.

Li, D., Yang, Y., Song, Y., and Hospedales, T. M. Learning to generalize: Meta-learning for domain generalization. In Proc. AAAI Conference on Artificial Intelligence, New Orleans, LA, February 2018a.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Proc. Advances in Neural Information Processing Systems, Montreal, Canada, December 2018b.

Li, K. and Malik, J. Learning to optimize. In Proc. International Conference on Learning Representations, Toulon, France, May 2017.

Maicas, G., Bradley, A. P., Nascimento, J. C., Reid, I., and Carneiro, G. Training medical image analysis systems like radiologists. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 546-554, Granada, Spain, 2018.

Munkhdalai, T. and Yu, H. Meta networks. In Proc. International Conference on Machine Learning, pp. 2554-2563, Sydney, Australia, August 2017.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. In Proc. Advances in Neural Information Processing Systems, pp. 5947-5956, Vancouver, Canada, 2019.

Nichol, A. and Schulman, J. Reptile: a scalable meta learning algorithm. arXiv preprint arXiv:1803.02999, 2018.
Novak, J. D. and Gowin, D. B. Learning how to learn. Cambridge University Press, 1984.

Obamuyide, A. and Vlachos, A. Model-agnostic meta-learning for relation classification with limited supervision. In Proc. Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019.

Park, E. and Oliva, J. B. Meta-curvature. In Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, December 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1-67, 2020.

Raghu, A., Raghu, M., and Bengio, S. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In Proc. International Conference on Learning Representations, Virtual, April 2020.

Rajeswaran, A., Finn, C., Kakade, S., and Levine, S. Meta-learning with implicit gradients. In Proc. Advances in Neural Information Processing Systems, pp. 113-124, Vancouver, Canada, December 2019a.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Proc. Advances in Neural Information Processing Systems, pp. 113-124, Vancouver, Canada, 2019b.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proc. International Conference on Learning Representations, Toulon, France, 2017.

Rothfuss, J., Fortuin, V., Josifoski, M., and Krause, A. PACOH: Bayes-optimal meta-learning with PAC-guarantees. In Proc. International Conference on Machine Learning, pp. 9116-9126, Virtual, 2021.

Schmidhuber, J. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-...-hook. http://www.idsia.ch/~juergen/diploma.html, 1987.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Proc. Advances in Neural Information Processing Systems, pp. 4077-4087, Long Beach, CA, December 2017.
Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.

Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Keysers, D., Uszkoreit, J., Lucic, M., and Dosovitskiy, A. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv:2105.01601, 2021.

Vilalta, R. and Drissi, Y. A perspective view and survey of meta-learning. Artificial Intelligence Review, 2002.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. Matching networks for one shot learning. In Proc. Advances in Neural Information Processing Systems, Barcelona, Spain, December 2016.

Vuorio, R., Sun, S., Hu, H., and Lim, J. J. Multimodal model-agnostic meta-learning via task-aware modulation. In Proc. Advances in Neural Information Processing Systems, pp. 1-12, Vancouver, Canada, December 2019.

Wang, G., Luo, C., Sun, X., Xiong, Z., and Zeng, W. Tracking by instance detection: A meta-learning approach. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6288-6297, Virtual, 2020.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Wang, R., Xu, K., Liu, S., Chen, P., Weng, T., Gan, C., and Wang, M. On fast adversarial robustness adaptation in model-agnostic meta-learning. In Proc. International Conference on Learning Representations, Virtual, May 2021.

Wang, Y., Ramanan, D., and Hebert, M. Meta-learning to detect rare objects. In Proc. International Conference on Computer Vision, Seoul, Korea, 2019.

Xu, H., Li, Y., Liu, X., Liu, H., and Tang, J. Yet meta learning can adapt fast, it can also break easily. arXiv preprint arXiv:2009.01672, 2020.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer.
arXiv preprint arXiv:2010.11934, 2020.

Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. Meta-learning without memorization. In Proc. International Conference on Learning Representations, Virtual, April 2020.

Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Proc. Advances in Neural Information Processing Systems, volume 31, Montreal, Canada, 2018.

Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In Proc. International Conference on Machine Learning, pp. 7693-7702, Long Beach, CA, June 2019a.

Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. In Proc. International Conference on Machine Learning, pp. 7693-7702, Long Beach, CA, 2019b.

Supplementary Material for "Sharp-MAML: Sharpness-Aware Model-Agnostic Meta Learning"

A. Omitted Pseudo-code in the Main Manuscript

In this section, we present the omitted pseudo-code of MAML and two Sharp-MAML algorithms.

A.1. MAML algorithm

The pseudo-code of plain-vanilla MAML is summarized in Algorithm 2.
Algorithm 2 MAML for few-shot supervised learning
1: Require: p(T): distribution over tasks
2: Require: β_low, β_up: step sizes
3: for t = 1, ..., T do
4:   Sample a batch of tasks T_m ∼ p(T)
5:   for all T_m do
6:     Sample K examples from D_m = {x_i, y_i}
7:     Evaluate ∇L(θ_t; D_m)
8:     Compute the fine-tuned parameter θ*_m(θ_t) via (2)
9:     Sample datapoints from D'_m = {x_i, y_i} for the meta-update
10:  end for
11:  Update the meta-model by θ_{t+1} = θ_t − β_up ∇_{θ_t} ∑_{m=1}^M L(θ*_m(θ_t); D'_m)
12: end for

Algorithm 3 Sharp-MAMLup
1: Require: p(T): distribution over tasks
2: Require: β_low, β_up: step sizes
3: Require: α_low > 0, α_up > 0: perturbation radii
4: for t = 1, ..., T do
5:   Sample a batch of tasks T_m ∼ p(T)
6:   for all T_m do
7:     Sample K examples from D_m
8:     Evaluate ∇L(θ_t; D_m)
9:     Compute the fine-tuned parameter θ^1_m(θ_t) = θ_t − β_low ∇L(θ_t; D_m)
10:    Sample data from D'_m for the meta-update
11:  end for
12:  Compute ∇_{θ_t} ∑_{m=1}^M L(θ^1_m(θ_t); D'_m)
13:  Compute the perturbation ϵ(θ_t) via (20)
14:  Update θ_{t+1} via (21)
15: end for

A.2. Sharp-MAMLup algorithm

In this case, ϵ_m(θ_t) = 0 for all m and t, so we have θ^1_m(θ_t) = θ_t − β_low ∇L(θ_t; D_m) and θ^2_m(θ_t) = θ_t + ϵ(θ_t) − β_low ∇L(θ_t + ϵ(θ_t); D_m). Defining h = ∇_{θ_t} ∑_{m=1}^M L(θ^1_m(θ_t); D'_m), the upper perturbation ϵ(θ_t) can be computed by

ϵ(θ_t) = α_up h / ‖h‖_2.    (20)

Let ϵ_t = ϵ(θ_t); then the final meta-update can be written as

θ_{t+1} = θ_t − β_up ∇_{θ_t} ∑_{m=1}^M L(θ_t + ϵ_t − β_low ∇L(θ_t + ϵ_t; D_m); D'_m).    (21)

The pseudo-code is summarized in Algorithm 3.

Algorithm 4 Sharp-MAMLlow
1: Require: p(T): distribution over tasks
2: Require: β_low, β_up: step sizes
3: Require: α_low > 0, α_up > 0: perturbation radii
4: for t = 1, ..., T do
5:   Sample a batch of tasks T_m ∼ p(T)
6:   for all T_m do
7:     Sample K examples D_m from T_m
8:     Evaluate ∇L(θ_t; D_m)
9:     Compute the perturbation ϵ_m(θ_t) via (11)
10:    Compute the fine-tuned parameter θ^1_m(θ_t) via (22)
11:    Sample data D'_m for the meta-update
12:  end for
13:  Update the meta-model θ_{t+1} via (23)
14: end for

A.3.
Sharp-MAMLlow algorithm

In this case, ϵ(θ_t) = 0 for all t, so we have θ^1_m(θ_t) = θ^2_m(θ_t) = θ_t − β_low ∇L(θ_t + ϵ_m(θ_t); D_m). Then the final meta-update can be written as

θ^1_m(θ_t) = θ_t − β_low ∇L(θ_t + ϵ_m(θ_t); D_m)    (22)
θ_{t+1} = θ_t − β_up ∇_{θ_t} ∑_{m=1}^M L(θ^1_m(θ_t); D'_m).    (23)

The pseudo-code is summarized in Algorithm 4.

B. Proof of Lemma 1

Proof. Since L(θ; D_m) ∈ C^3, a stationary point θ of L(θ; D_m) satisfies

∇L(θ; D_m) = 0    (24)

and a local minimizer θ of L(θ; D_m) satisfies

∇L(θ; D_m) = 0 and ∇²L(θ; D_m) ⪰ 0.    (25)

Next, we compute the gradient of L(θ*_m(θ); D_m) by the chain rule, that is,

∇L(θ*_m(θ); D_m) = (I − β_low ∇²L(θ; D_m)) ∇L(θ − β_low ∇L(θ; D_m); D_m)    (26)

and the Hessian of L(θ*_m(θ); D_m), that is,

∇²L(θ*_m(θ); D_m) = ∇[(I − β_low ∇²L(θ; D_m)) ∇L(θ − β_low ∇L(θ; D_m); D_m)]
= −β_low ∇³L(θ; D_m) ∇L(θ − β_low ∇L(θ; D_m); D_m) + (I − β_low ∇²L(θ; D_m))² ∇²L(θ − β_low ∇L(θ; D_m); D_m).    (27)

Plugging (24) into (26), we get ∇_θ L(θ*_m(θ); D_m) = 0, which implies that θ is also a stationary point of L(θ*_m(θ); D_m). Moreover, plugging (25) into (27), we get ∇_θ L(θ*_m(θ); D_m) = 0 and

∇²_θ L(θ*_m(θ); D_m) = (I − β_low ∇²L(θ; D_m))² ∇²L(θ; D_m) ⪰ 0

which implies that θ is also a local minimizer of L(θ*_m(θ); D_m). If θ is a stationary point of L(θ; D_m) for all m ∈ M, then θ is also a stationary point of L(θ*_m(θ); D_m) for all m ∈ M. Thus, θ is also a stationary point of ∑_{m=1}^M L(θ*_m(θ); D_m). Likewise, the statement is also true for local minimizers.

C. Convergence Analysis

C.1. Convergence analysis of MAML (Finn et al., 2017)

We provide theoretical analysis for MAML (Finn et al., 2017). First, we state the exact form of MAML as follows:

min_θ (1/M) ∑_{m=1}^M L(θ*_m(θ); D'_m)    (28a)
s.t. θ*_m(θ) = θ − β_low ∇L(θ; D_m), for all m ∈ M.    (28b)

The problem (28) can be reformulated as

min_θ F(θ) = (1/M) ∑_{m=1}^M L(θ*_m(θ); D'_m)    (29a)
s.t. θ*_m(θ) = arg min_{θ_m} ∇L(θ; D_m)^⊤ (θ_m − θ) + (1/(2β_low)) ‖θ_m − θ‖².
(29b)

Next, to show the connection between the MAML formulation and ALSET (Chen et al., 2021a), we concatenate the θ_m into a new vector φ = [θ_1^⊤, ..., θ_M^⊤]^⊤ and define

F(θ) = f(φ*(θ)), f(φ) = (1/M) ∑_{m=1}^M L(θ_m; D'_m),
g(θ, φ) = (1/M) ∑_{m=1}^M [ ∇L(θ; D_m)^⊤ (θ_m − θ) + (1/(2β_low)) ‖θ_m − θ‖² ]

where φ*(θ) = arg min_φ g(θ, φ). Then the Jacobian and Hessian of f and g can be computed by

∇_φ f(φ) = (1/M) [∇L(θ_1; D'_1); ...; ∇L(θ_M; D'_M)],
∇_{φθ} g(θ, φ) = (1/M) blkdiag(∇²L(θ_1; D_1) − β_low^{-1} I, ..., ∇²L(θ_M; D_M) − β_low^{-1} I),
∇_{φφ} g(θ, φ) = β_low^{-1} I

where I denotes the identity matrix. According to the expression of ∇F(θ) in ALSET (Chen et al., 2021a), we can verify that MAML's gradient takes the form

∇F(θ) = −∇_{θφ} g(θ, φ) [∇_{φφ} g(θ, φ)]^{-1} ∇_φ f(φ) = (1/M) ∑_{m=1}^M (I − β_low ∇²L(θ; D_m)) ∇L(θ_m; D'_m).    (30)

Moreover, since g(θ, φ) is a quadratic function with respect to φ, the strong convexity and Lipschitz continuity assumptions hold.2 The assumptions about the upper-level function also hold under Assumption 1. Then, for notational simplicity, we consider the single-sample case with K = 1 and define three independent samples for stochastic gradient and Hessian computation as ξ_m := (x, y) ∈ D_m, ψ_m := (x, y) ∈ D_m, ξ'_m := (x, y) ∈ D'_m, so the corresponding K-batch gradient and Hessian estimators used in the MAML algorithms can be written as

∇L(θ; D_m, ξ_m) = (1/K) ∑_{(x,y) ∈ ξ_m} ∇l(θ; x, y), ∇²L(θ; D_m, ψ_m) = (1/K) ∑_{(x,y) ∈ ψ_m} ∇²l(θ; x, y), ∇L(θ; D'_m, ξ'_m) = (1/K) ∑_{(x,y) ∈ ξ'_m} ∇l(θ; x, y).

2The assumption that ∇²g is Lipschitz continuous in Assumption 1 of (Chen et al., 2021a) reduces to ∇_{φφ}g and ∇_{φθ}g being Lipschitz continuous, which is satisfied under Assumption 1.

Based on these notations, we can write the stochastic update of the MAML algorithm (Finn et al., 2017) as

θ_{t+1} = θ_t − (β_up/M) ∑_{m=1}^M (I − β_low ∇²L(θ_t; D_m, ψ_m)) ∇L(θ_m^{t+1}; D'_m, ξ'_m),
θ_m^{t+1} = θ_t − β_low ∇L(θ_t; D_m, ξ_m).

Then the MAML algorithm can be seen as a special case of ALSET, so we have the following lemma.

Lemma 2.
Under Assumptions 1-2, choosing stepsizes β_low, β_up = O(1/√T) with proper constants, the iterates {θ_t} generated by MAML (Finn et al., 2017) satisfy

(1/T) ∑_{t=1}^T E‖∇F(θ_t)‖² = O(1/√T).

C.2. Convergence analysis of a generic biased MAML

Since Sharp-MAML can be treated as a biased-update version of MAML (Finn et al., 2017), we first analyze a generic biased MAML algorithm. Consider the biased MAML update

θ_{t+1} = θ_t − (β_up/M) ∑_{m=1}^M (I − β_low ∇̂²L(θ_t; D_m, ψ_m)) ∇L(θ̂_m^{t+1}; D'_m, ξ'_m),
θ̂_m^{t+1} = θ_t − β_low ∇̂L(θ_t; D_m, ξ_m)

where ∇̂²L(θ_t; D_m, ψ_m) and ∇̂L(θ_t; D_m, ξ_m) are biased estimators of ∇²L(θ_t; D_m) and ∇L(θ_t; D_m), respectively. With the notation

∇̂L(θ; D_m) ≜ E_{ξ_m}[∇̂L(θ; D_m, ξ_m)], ∇̂²L(θ; D_m) ≜ E_{ψ_m}[∇̂²L(θ; D_m, ψ_m)],

we make the following assumptions.

Assumption 3 (Stochastic derivatives). Assume that ∇̂L(θ; D_m, ξ_m) and ∇̂²L(θ; D_m, ψ_m) are unbiased estimators of ∇̂L(θ; D_m) and ∇̂²L(θ; D_m), respectively, and that their variances are bounded by σ_b².

Assumption 4. Assume that ‖∇̂²L(θ_t; D_m) − ∇²L(θ_t; D_m)‖ ≤ γ_h and ‖∇̂L(θ_t; D_m) − ∇L(θ_t; D_m)‖ ≤ γ_g for all m ∈ M.

Throughout the proof, we use the filtrations

F_t = σ{θ̂_1^0, ..., θ̂_M^0, θ_0, ..., θ_t, θ̂_1^{t+1}, ..., θ̂_M^{t+1}}, F'_t = σ{θ̂_1^0, ..., θ̂_M^0, θ_0, ..., θ_t}

where σ{·} denotes the σ-algebra generated by the random variables. We also denote

h_t = (1/M) ∑_{m=1}^M (I − β_low ∇̂²L(θ_t; D_m, ψ_m)) ∇L(θ̂_m^{t+1}; D'_m, ξ'_m),
h̄_t = E[h_t | F_t] = (1/M) ∑_{m=1}^M (I − β_low ∇̂²L(θ_t; D_m)) ∇L(θ̂_m^{t+1}; D'_m).

Lemma 3. Under Assumptions 1-4, we have

E‖∇F(θ_t) − h̄_t‖² ≤ 4ℓ_1² β_low² (γ_g² + σ_b²) + 4β_low² (ℓ_0² γ_h² + 4(γ_h² + ℓ_1²) ℓ_1² β_low² (γ_g² + σ_b²)).    (31)

Proof. Since E[θ̂_m^{t+1} | F'_t] = θ_t − β_low ∇̂L(θ_t; D_m), from Assumption 4 we have ‖θ*_m(θ_t) − E[θ̂_m^{t+1} | F'_t]‖ ≤ β_low γ_g. Taking expectation with respect to F'_t, we get

E‖θ*_m(θ_t) − θ̂_m^{t+1}‖² ≤ 2 E‖θ*_m(θ_t) − E[θ̂_m^{t+1} | F'_t]‖² + 2 E‖E[θ̂_m^{t+1} | F'_t] − θ̂_m^{t+1}‖² ≤ 2β_low² γ_g² + 2β_low² σ_b².
Then, using the Lipschitz continuity of ∇L(θ; D'_m), we obtain

E‖∇L(θ*_m(θ_t); D'_m) − ∇L(θ̂_m^{t+1}; D'_m)‖² ≤ 2ℓ_1² β_low² (γ_g² + σ_b²).    (32)

On the other hand, by observing that ‖∇̂²L(θ_t; D_m)‖² ≤ 2‖∇²L(θ_t; D_m)‖² + 2‖∇²L(θ_t; D_m) − ∇̂²L(θ_t; D_m)‖² ≤ 2(γ_h² + ℓ_1²), we have

E‖∇²L(θ_t; D_m) ∇L(θ*_m(θ_t); D'_m) − ∇̂²L(θ_t; D_m) ∇L(θ̂_m^{t+1}; D'_m)‖²
≤ 2 E[‖∇²L(θ_t; D_m) − ∇̂²L(θ_t; D_m)‖² ‖∇L(θ*_m(θ_t); D'_m)‖²] + 2 E[‖∇̂²L(θ_t; D_m)‖² ‖∇L(θ*_m(θ_t); D'_m) − ∇L(θ̂_m^{t+1}; D'_m)‖²]
≤ 2ℓ_0² γ_h² + 8(γ_h² + ℓ_1²) ℓ_1² β_low² (γ_g² + σ_b²).    (33)

Thus, using (32) and (33), we get

E‖∇F(θ_t) − h̄_t‖² = E ‖(1/M) ∑_{m=1}^M [ (I − β_low ∇²L(θ_t; D_m)) ∇L(θ*_m(θ_t); D'_m) − (I − β_low ∇̂²L(θ_t; D_m)) ∇L(θ̂_m^{t+1}; D'_m) ]‖²
≤ (2/M) ∑_{m=1}^M E‖∇L(θ*_m(θ_t); D'_m) − ∇L(θ̂_m^{t+1}; D'_m)‖² + (2β_low²/M) ∑_{m=1}^M E‖∇²L(θ_t; D_m) ∇L(θ*_m(θ_t); D'_m) − ∇̂²L(θ_t; D_m) ∇L(θ̂_m^{t+1}; D'_m)‖²
≤ 4ℓ_1² β_low² (γ_g² + σ_b²) + 4β_low² (ℓ_0² γ_h² + 4(γ_h² + ℓ_1²) ℓ_1² β_low² (γ_g² + σ_b²))

from which the proof is complete.

Lemma 4. Under Assumptions 1-4, choosing stepsizes β_low, β_up = O(1/√T) and bias levels γ_g, γ_h = O(1) with proper constants, the iterates {θ_t} generated by biased MAML satisfy

(1/T) ∑_{t=1}^T E‖∇F(θ_t)‖² = O(1/√T).

Proof. First, we bound the variance of the stochastic biased meta-gradient estimator h_t:

E[‖h_t − h̄_t‖² | F_t] ≤ (2/M) ∑_{m=1}^M E[‖(I − β_low ∇̂²L(θ_t; D_m, ψ_m)) ∇L(θ̂_m^{t+1}; D'_m, ξ'_m) − (I − β_low ∇̂²L(θ_t; D_m)) ∇L(θ̂_m^{t+1}; D'_m)‖² | F_t]
≤ (4/M) ∑_{m=1}^M E[‖∇L(θ̂_m^{t+1}; D'_m, ξ'_m) − ∇L(θ̂_m^{t+1}; D'_m)‖² | F_t] + (4β_low²/M) ∑_{m=1}^M E[‖∇̂²L(θ_t; D_m, ψ_m) ∇L(θ̂_m^{t+1}; D'_m, ξ'_m) − ∇̂²L(θ_t; D_m) ∇L(θ̂_m^{t+1}; D'_m)‖² | F_t]
≤ 4ℓ_1² σ² + (8β_low²/M) ∑_{m=1}^M E[‖∇L(θ̂_m^{t+1}; D'_m, ξ'_m)‖² ‖∇̂²L(θ_t; D_m, ψ_m) − ∇̂²L(θ_t; D_m)‖² | F_t] + (8β_low²/M) ∑_{m=1}^M E[‖∇̂²L(θ_t; D_m)‖² ‖∇L(θ̂_m^{t+1}; D'_m, ξ'_m) − ∇L(θ̂_m^{t+1}; D'_m)‖² | F_t]
≤ 4ℓ_1² σ² + 8β_low² σ_b² E[‖∇L(θ̂_m^{t+1}; D'_m, ξ'_m)‖² | F_t] + 16(γ_h² + ℓ_1²) β_low² σ²
≤ 4ℓ_1² σ² + 16(γ_h² + ℓ_1²) β_low² σ² + 16β_low² σ_b² (σ² + ℓ_0²) ≜ σ̃²    (34)

where (34) follows from

E[‖∇L(θ̂_m^{t+1}; D'_m, ξ'_m)‖² | F_t] ≤ 2‖∇L(θ̂_m^{t+1}; D'_m)‖² + 2 E[‖∇L(θ̂_m^{t+1}; D'_m, ξ'_m) − ∇L(θ̂_m^{t+1}; D'_m)‖² | F_t] ≤ 2(ℓ_0² + σ²).

Then, according to Lemma 4 in (Chen et al., 2021a), F is smooth with constant L_F = O(1).
Using the smoothness property with Lemma 2 in (Chen et al., 2021a), it follows that

E[F(θ_{t+1}) | F_t] ≤ F(θ_t) + E[⟨∇F(θ_t), θ_{t+1} − θ_t⟩ | F_t] + (L_F/2) E[‖θ_{t+1} − θ_t‖² | F_t]
= F(θ_t) − β_up ⟨∇F(θ_t), h̄_t⟩ + (L_F β_up²/2) E[‖h_t‖² | F_t]
(a)≤ F(θ_t) − β_up ⟨∇F(θ_t), h̄_t⟩ + (L_F β_up²/2) ‖h̄_t‖² + (L_F β_up²/2) σ̃²
(b)= F(θ_t) − (β_up/2) ‖∇F(θ_t)‖² − (β_up/2 − L_F β_up²/2) ‖h̄_t‖² + (β_up/2) ‖∇F(θ_t) − h̄_t‖² + (L_F β_up²/2) σ̃²

where (a) uses E[‖A‖² | B] = ‖E[A | B]‖² + E[‖A − E[A | B]‖² | B] together with (34), and (b) uses 2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖². Then, using the result in Lemma 3 and choosing β_up ≤ 1/L_F, telescoping and rearranging, we obtain

(1/T) ∑_{t=1}^T E‖∇F(θ_t)‖² ≤ 2F(θ_1)/(β_up T) + L_F β_up σ̃² + 4β_low² (ℓ_1²(γ_g² + σ_b²) + ℓ_0² γ_h² + 4(γ_h² + ℓ_1²) ℓ_1² β_low² (γ_g² + σ_b²)).    (35)

Choosing β_up, β_low = O(1/√T) and γ_g, γ_h = O(1), we obtain the O(1/√T) convergence result for biased MAML.

C.3. Convergence analysis of Sharp-MAML

Thanks to the discussion in Section C.2, the convergence of Sharp-MAML is straightforward. Here we only prove the result for Sharp-MAMLboth, since the same results for the other two variants follow by setting α_up = 0 or α_low = 0 accordingly.

Lemma 5. Under Assumptions 1-2, choosing stepsizes β_low, β_up = O(1/√T) and perturbation radii α_up = O(1/√T), α_low = O(1) with proper constants, the iterates {θ_t} generated by Sharp-MAMLboth satisfy

(1/T) ∑_{t=1}^T E‖∇F(θ_t)‖² = O(1/√T).

Proof. Recalling the update of Sharp-MAMLboth, we rewrite it as

θ_{t+1} = θ_t − (β_up/M) ∑_{m=1}^M (I − β_low ∇²L(θ_t + ϵ(θ_t) + ϵ_m(θ_t); D_m, ψ_m)) ∇L(θ̂_m^{t+1}; D'_m, ξ'_m);
θ̂_m^{t+1} = θ_t + ϵ(θ_t) − β_low ∇L(θ_t + ϵ(θ_t) + ϵ_m(θ_t); D_m, ξ_m).

Since ∇L(θ; D_m) and ∇²L(θ; D_m) are Lipschitz continuous with constants ℓ_1 and ℓ_2 according to Assumption 1, we have

‖∇L(θ_t + ϵ(θ_t) + ϵ_m(θ_t); D_m) − ϵ(θ_t)/β_low − ∇L(θ_t; D_m)‖ ≤ ℓ_1(α_up + α_low) + α_up/β_low,
‖∇²L(θ_t + ϵ(θ_t) + ϵ_m(θ_t); D_m) − ∇²L(θ_t; D_m)‖ ≤ ℓ_2(α_up + α_low),

which satisfy the conditions of Lemma 4 if α_up = O(1/√T) and α_low = O(1). Thus, we arrive at the conclusion.

D.
D. Generalization Analysis

We build on the recently developed PAC-Bayes bound for meta learning (Farid & Majumdar, 2021), restated below.

Lemma 6. Assume the loss function $L(\cdot)$ is bounded: $0 \le L(h, D) \le 1$ for any $h$ in the hypothesis space and any $D$ in the sample space. For hypotheses $h_{A(\theta,\mathcal{D})}$ learned with a $\gamma_A$-uniformly stable algorithm $A$, a data-independent prior $P_{\theta,0}$ over initializations $\theta$, and $\delta \in (0,1)$, with probability at least $1-\delta$ over a sampling of the meta-training dataset $\mathcal{D} \sim P$ with $|\mathcal{D}| = n$, which includes meta-training data from all tasks, the following holds simultaneously for all distributions $P_\theta$ over $\theta$:
$$\mathbb{E}_{D\sim P}\,\mathbb{E}_{\theta\sim P_\theta} L(h_{A(\theta,\mathcal{D})}, D) \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta\sim P_\theta} L\big(h_{A(\theta,\mathcal{D})}, (x_i, y_i)\big) + \gamma_A + \sqrt{\frac{D_{\rm KL}(P_\theta \,\|\, P_{\theta,0}) + \ln\frac{2\sqrt{n}}{\delta}}{4n}}.$$

We can then obtain the following generalization bound.

Theorem 3. Assume the loss function $L(\cdot)$ is bounded: $0 \le L(\theta_m; D) \le 1$ for any $\theta_m$ and any $D$, and that $L(\theta_m(\hat\theta); P) \le \mathbb{E}_{\epsilon\sim\mathcal{N}(0,\alpha^2 I)}\big[L(\theta_m(\hat\theta+\epsilon); P)\big]$ at the stationary point $\hat\theta$ of Sharp-MAML_up. For parameters $\theta_m(\hat\theta; \mathcal{D})$ learned with a $\gamma_A$-uniformly stable algorithm $A$ from $\hat\theta \in \mathbb{R}^k$, with probability $1-\delta$ over the choice of the training set $\mathcal{D} \sim P$ with $|\mathcal{D}| = n$, it holds that
$$L(\theta_m(\hat\theta); P) \le \max_{\|\epsilon\|_2\le\alpha} L(\theta_m(\hat\theta+\epsilon); \mathcal{D}) + \gamma_A + \sqrt{\frac{k\ln\Big(1+\frac{\|\hat\theta\|_2^2}{k\alpha^2}\big(1+\sqrt{\ln n/k}\big)^2\Big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1)}{4n}}.$$

Proof. Since Lemma 6 holds for any prior $P_{\theta,0}$ and posterior $P_\theta$, let $P_{\theta,0} = P = \mathcal{N}(0, \sigma_P^2 I)$ and $P_\theta = Q = \mathcal{N}(\theta, \alpha^2 I)$. Then
$$D_{\rm KL}(Q\,\|\,P) = \frac{1}{2}\Big[\operatorname{tr}\big(\Sigma_P^{-1}\Sigma_Q\big) + (\mu_P-\mu_Q)^\top \Sigma_P^{-1}(\mu_P-\mu_Q) - k + \ln\frac{|\Sigma_P|}{|\Sigma_Q|}\Big] = \frac{1}{2}\Big[\frac{k\alpha^2 + \|\theta\|_2^2}{\sigma_P^2} - k + k\ln\frac{\sigma_P^2}{\alpha^2}\Big].$$
Let $\mathcal{T} = \{c\exp((1-j)/k) \mid j\in\mathbb{N}\}$ be the set of candidate values for $\sigma_P^2$. If for any $j\in\mathbb{N}$ the PAC-Bayes bound in Lemma 6 holds for $\sigma_P^2 = c\exp((1-j)/k)$ with probability $1-\delta_j$, where $\delta_j = \frac{6\delta}{\pi^2 j^2}$, then by the union bound all bounds with respect to all $\sigma_P^2 \in \mathcal{T}$ hold simultaneously with probability at least $1 - \sum_{j=1}^\infty \frac{6\delta}{\pi^2 j^2} = 1-\delta$.

First consider $\|\theta\|^2 \le \alpha^2(\exp(4n/k)-1)$; then $k\alpha^2 + \|\theta\|_2^2 \le k\alpha^2(\exp(4n/k)+1)$. Now set $j = \big\lceil 1 - k\ln\big((\alpha^2 + \|\theta\|_2^2/k)/c\big)\big\rceil$. By setting $c = \alpha^2(1+\exp(4n/k))$, we have $\ln\big((\alpha^2 + \|\theta\|_2^2/k)/c\big) < 0$, and thus we can ensure that $j\in\mathbb{N}$.
Furthermore, for $\sigma_P^2 = c\exp((1-j)/k)$, we have
$$\alpha^2 + \|\theta\|_2^2/k \,\le\, \sigma_P^2 \,\le\, \exp(1/k)\big(\alpha^2 + \|\theta\|_2^2/k\big),$$
where the first inequality is derived from $1-j \ge k\ln\big((\alpha^2+\|\theta\|_2^2/k)/c\big)$, and the second from $1-j \le k\ln\big((\alpha^2+\|\theta\|_2^2/k)/c\big)+1$. The KL-divergence term can then be further bounded as
$$\begin{aligned}
D_{\rm KL}(Q\,\|\,P) &= \frac{1}{2}\Big[\frac{k\alpha^2 + \|\theta\|_2^2}{\sigma_P^2} - k + k\ln\frac{\sigma_P^2}{\alpha^2}\Big] \le \frac{1}{2}\Big[\frac{k\alpha^2+\|\theta\|_2^2}{\alpha^2+\|\theta\|_2^2/k} - k + k\ln\frac{\exp(1/k)\big(\alpha^2+\|\theta\|_2^2/k\big)}{\alpha^2}\Big] \le \frac{1}{2}\Big[1 + k\ln\Big(1+\frac{\|\theta\|_2^2}{k\alpha^2}\Big)\Big].
\end{aligned}$$
Given that the bound corresponding to $j$ holds with probability $1-\delta_j$ with $\delta_j = \frac{6\delta}{\pi^2 j^2}$, the $\ln$ term above can be bounded as
$$\ln\frac{1}{\delta_j} = \ln\frac{1}{\delta} + \ln\frac{\pi^2 j^2}{6} \le \ln\frac{1}{\delta} + \ln\frac{\pi^2 k^2\ln^2\big(c/(\alpha^2+\|\theta\|_2^2/k)\big)}{6} \le \ln\frac{1}{\delta} + \ln\frac{\pi^2 k^2 \ln^2(c/\alpha^2)}{6} = \ln\frac{1}{\delta} + \ln\frac{\pi^2 k^2 \ln^2(1+\exp(4n/k))}{6} \le \ln\frac{1}{\delta} + \ln\frac{\pi^2 k^2 (4n/k)^2}{6} \le \ln\frac{1}{\delta} + 2\ln(6n).$$
Therefore, for $\epsilon\sim\mathcal{N}(0,\sigma^2 I)$, the generalization bound is
$$\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\sigma^2 I)}\big[L(\theta_m(\theta+\epsilon); P)\big] \le \mathbb{E}_{\epsilon\sim\mathcal{N}(0,\sigma^2 I)}\big[L(\theta_m(\theta+\epsilon); \mathcal{D})\big] + \gamma_A + \sqrt{\frac{\frac{1}{2}k\ln\big(1+\frac{\|\theta\|_2^2}{k\sigma^2}\big) + \frac{1}{2} + \ln 72 + \ln\frac{1}{\delta} + 2\ln n}{4n}}.$$
By Lemma 1 in (Laurent & Massart, 2000), for $\epsilon\sim\mathcal{N}(0,\sigma^2 I)$ and any positive $t$:
$$\Pr\Big(\|\epsilon\|_2^2 - k\sigma^2 \ge 2\sigma^2\sqrt{kt} + 2t\sigma^2\Big) \le \exp(-t).$$
Therefore, with probability at least $1-1/\sqrt{n}$, we have
$$\|\epsilon\|_2^2 \le \sigma^2\Big(2\ln\sqrt{n} + k + 2\sqrt{k\ln\sqrt{n}}\Big) \le \sigma^2 k\big(1+\sqrt{\ln n/k}\big)^2,$$
so choosing $\sigma = \alpha/\big(1+\sqrt{\ln n/k}\big)$ ensures $\|\epsilon\|_2 \le \alpha$ with probability at least $1-1/\sqrt{n}$. At the stationary point $\hat\theta$ obtained by Sharp-MAML, we have
$$\begin{aligned}
L(\theta_m(\hat\theta); P) &\le \mathbb{E}_{\epsilon\sim\mathcal{N}(0,\alpha^2 I)}\big[L(\theta_m(\hat\theta+\epsilon); P)\big] \\
&\le \Big(1-\frac{1}{\sqrt n}\Big)\max_{\|\epsilon\|_2\le\alpha} L(\theta_m(\hat\theta+\epsilon);\mathcal{D}) + \frac{1}{\sqrt n} + \gamma_A + \sqrt{\frac{\frac{1}{2}k\ln\big(1+\frac{\|\hat\theta\|_2^2}{k\sigma^2}\big) + \frac{1}{2} + \ln 72 + \ln\frac{1}{\delta} + 2\ln n}{4n}} \\
&\le \max_{\|\epsilon\|_2\le\alpha} L(\theta_m(\hat\theta+\epsilon);\mathcal{D}) + \gamma_A + \sqrt{\frac{k\ln\Big(1+\frac{\|\hat\theta\|_2^2}{k\alpha^2}\big(1+\sqrt{\ln n/k}\big)^2\Big) + 14 + 2\ln\frac{1}{\delta} + 5\ln n}{4n}} \qquad (18)
\end{aligned}$$
where the last inequality holds due to $1-1/\sqrt{n} \le 1$ and Jensen's inequality.

Next, consider $\|\hat\theta\|^2 > \alpha^2(\exp(4n/k)-1)$. In this case, the RHS of (18) satisfies
$$\begin{aligned}
&\max_{\|\epsilon\|_2\le\alpha} L(\theta_m(\hat\theta+\epsilon);\mathcal{D}) + \gamma_A + \sqrt{\frac{k\ln\Big(1+\frac{\|\hat\theta\|_2^2}{k\alpha^2}\big(1+\sqrt{\ln n/k}\big)^2\Big) + 14 + 2\ln\frac{1}{\delta} + 5\ln n}{4n}} \\
&> \max_{\|\epsilon\|_2\le\alpha} L(\theta_m(\hat\theta+\epsilon);\mathcal{D}) + \gamma_A + \sqrt{\frac{4n + 14 + 2\ln\frac{1}{\delta} + 5\ln n}{4n}} \\
&> \max_{\|\epsilon\|_2\le\alpha} L(\theta_m(\hat\theta+\epsilon);\mathcal{D}) + \gamma_A + 1 \ge L(\theta_m(\hat\theta); P),
\end{aligned}$$
so the bound holds trivially, which completes the proof.
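The diagonal-Gaussian KL identity invoked at the start of this proof is easy to verify numerically. The snippet below (a sanity check, not part of the paper) compares the closed form against the equivalent sum of per-dimension univariate KL divergences.

```python
import numpy as np

# Numeric check of the identity used above: for Q = N(theta, alpha^2 I) and
# P = N(0, sigma_P^2 I) over R^k,
#   KL(Q || P) = 0.5 * [ (k*alpha^2 + ||theta||_2^2) / sigma_P^2
#                        - k + k * ln(sigma_P^2 / alpha^2) ].

def kl_closed_form(theta, alpha2, sigma2):
    k = theta.size
    return 0.5 * ((k * alpha2 + theta @ theta) / sigma2 - k + k * np.log(sigma2 / alpha2))

def kl_per_dimension(theta, alpha2, sigma2):
    # Diagonal Gaussians factorize, so the KL is a sum of univariate KLs:
    # KL(N(mu, a^2) || N(0, s^2)) = 0.5*ln(s^2/a^2) + (a^2 + mu^2)/(2 s^2) - 0.5.
    return float(np.sum(0.5 * np.log(sigma2 / alpha2)
                        + (alpha2 + theta ** 2) / (2 * sigma2) - 0.5))
```

Both forms agree to machine precision, and the value is nonnegative, as any KL divergence must be; it vanishes exactly when the two distributions coincide.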
D.1. Discussion: Choice of the Perturbation Radius α

The upper bound on the population loss on the RHS of (18) is a function of $\alpha$. A choice of $\alpha > 0$ close to zero approximates the original MAML method without SAM. We explain why SAM improves the generalization ability of MAML by showing that, for any sufficiently small $\alpha_0 > 0$, there exists $\alpha_1 > \alpha_0$ at which the upper bound on the population loss takes a smaller value than at $\alpha_0$.

Proof. Let $c = \frac{\|\theta\|_2^2}{k}\big(1+\sqrt{\ln n/k}\big)^2$ and denote
$$g(\alpha) = \max_{\|\epsilon\|_2\le\alpha} L(\theta+\epsilon;\mathcal{D}) + \sqrt{\frac{k\ln\big(1+\frac{c}{\alpha^2}\big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1)}{4n}} + \gamma_A.$$
Since $0 \le L(\cdot) \le 1$, it follows that for any $0 < \alpha_0 < \big(c/(\exp(4n/k)-1)\big)^{1/2}$, we have $k\ln(1+c/\alpha_0^2) > 4n$, so that
$$\sqrt{\frac{k\ln\big(1+\frac{c}{\alpha_0^2}\big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1)}{4n}} > 1 \ge \max_{\|\epsilon\|_2\le\alpha} L(\theta+\epsilon;\mathcal{D}).$$
Choose any
$$\alpha_1 > \Big(\frac{c}{\big(1+c/\alpha_0^2\big)\exp(-4n/k) - 1}\Big)^{1/2} > \Big(\frac{c}{\big(1+c/\alpha_0^2\big) - 1}\Big)^{1/2} = \alpha_0;$$
then it follows that
$$k\ln\Big(1+\frac{c}{\alpha_1^2}\Big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1) \le k\ln\Big(\big(1+\frac{c}{\alpha_0^2}\big)e^{-4n/k}\Big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1) = -4n + k\ln\Big(1+\frac{c}{\alpha_0^2}\Big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1) < k\ln\Big(1+\frac{c}{\alpha_0^2}\Big) + 2\ln\frac{1}{\delta} + 5\ln n + \mathcal{O}(1),$$
so that $g(\alpha_1) < g(\alpha_0)$, which completes the proof.

D.2. Discussion: Justification of the Assumption

To obtain the generalization bound, we assume that the population loss satisfies
$$L(\theta_m(\hat\theta); P) \le \mathbb{E}_{\epsilon\sim\mathcal{N}(0,\alpha^2 I)}\big[L(\theta_m(\hat\theta+\epsilon); P)\big] \tag{36}$$
at the stationary point $\hat\theta$ of Sharp-MAML_up. We next give some discussion to justify this assumption. If $\hat\theta$ is a local minimizer of $L(\theta_m(\hat\theta); \mathcal{D})$, then, since $\|\epsilon\|_2^2 \le \alpha^2$ with high probability, $L(\theta_m(\hat\theta); \mathcal{D}) \le \mathbb{E}_{\epsilon\sim\mathcal{N}(0,\alpha^2 I)}\big[L(\theta_m(\hat\theta+\epsilon); \mathcal{D})\big]$. Assume the empirically observed $\mathcal{D}$ is representative of $P$ and preserves the local property of the loss landscape $L(\theta_m(\hat\theta); P)$ around $\hat\theta$, i.e., for $\mathcal{D}\sim P$ with $|\mathcal{D}|\to\infty$, $L(\theta_m(\hat\theta); \mathcal{D}) \le \mathbb{E}_{\epsilon\sim\mathcal{N}(0,\alpha^2 I)}\big[L(\theta_m(\hat\theta+\epsilon); \mathcal{D})\big]$ with high probability; then we have (36).

Table 4. Results on Omniglot (20-way 1-shot).

ALGORITHM                             ACCURACY
MATCHING NETS                         93.8%
REPTILE (NICHOL & SCHULMAN, 2018)     89.43%
FOMAML (NICHOL & SCHULMAN, 2018)      89.40%
MAML (REPRODUCED)                     91.77%
SHARP-MAML_low                        92.89%
SHARP-MAML_up                         92.96%
SHARP-MAML_both                       93.47%

Table 5. Results on Omniglot (20-way 5-shot).
ALGORITHM                             ACCURACY
MATCHING NETS                         98.50%
FOMAML (NICHOL & SCHULMAN, 2018)      97.12%
REPTILE (NICHOL & SCHULMAN, 2018)     97.90%
MAML (REPRODUCED)                     96.16%
SHARP-MAML_low                        96.59%
SHARP-MAML_up                         96.62%
SHARP-MAML_both                       96.64%

E. Additional Experiments

In this section, we provide additional details of the experimental setup and present our results on the Omniglot dataset.

Few-shot classification on the Omniglot dataset. We use the same experimental setup as (Finn et al., 2017). For training and testing, we use a single inner gradient step with a learning rate of 0.1. The batch size is set to 16 for the 20-way learning setting. Following (Ravi & Larochelle, 2017), 15 examples per class are used to evaluate the post-update meta-gradient. The values of $\alpha_{\rm low}$ and $\alpha_{\rm up}$ are chosen via grid search over the set {0.05, 0.005, 0.0005, 0.00005}, and each experiment is run for three random seeds per value. The number of inner gradient steps is chosen from the set {3, 5, 7, 10}, and the step size is chosen via grid search from the set {0.1, 0.01, 0.001}. For Sharp-MAML_both, we use the same value of $\alpha_{\rm low}$ and $\alpha_{\rm up}$ in each experiment. The reproduced MAML result in the 20-way 1-shot setting is close to that of MAML++ (Antoniou et al., 2019). In the 20-way 1-shot setting, we observe a similar trend: Sharp-MAML_both achieves the best accuracy of 93.47%, compared with 91.77% for MAML. The performance gain of Sharp-MAML on the Omniglot dataset is not as significant as on the Mini-Imagenet dataset because the former task is much simpler.
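The hyperparameter sweep described above can be organized as a simple grid. In this sketch, the candidate sets are the ones listed in the text, while `run_experiment` is a hypothetical stand-in for a single Omniglot training-and-evaluation run and is not part of the paper's code.

```python
from itertools import product

# Grid-search sketch for the sweep described in the text. Sharp-MAML_both
# ties alpha_low = alpha_up, so one alpha value covers both radii.

ALPHAS = [0.05, 0.005, 0.0005, 0.00005]   # grid for alpha_low (= alpha_up)
INNER_STEPS = [3, 5, 7, 10]               # candidate inner gradient steps
STEP_SIZES = [0.1, 0.01, 0.001]           # candidate step sizes
SEEDS = [0, 1, 2]                         # three random seeds per setting

def sweep(run_experiment):
    """Evaluate every configuration and return {config: result}."""
    results = {}
    for alpha, steps, lr, seed in product(ALPHAS, INNER_STEPS, STEP_SIZES, SEEDS):
        cfg = dict(alpha_low=alpha, alpha_up=alpha,
                   inner_steps=steps, step_size=lr, seed=seed)
        results[tuple(sorted(cfg.items()))] = run_experiment(**cfg)
    return results
```

Averaging the returned accuracies over the three seeds per configuration then selects the setting reported in the tables above.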