# Efficient Variance Reduction for Meta-Learning

Hansi Yang¹  James T. Kwok¹

Abstract

Meta-learning tries to learn meta-knowledge from a large number of tasks. However, the stochastic meta-gradient can have large variance due to data sampling (from each task) and task sampling (from the whole task distribution), leading to slow convergence. In this paper, we propose a novel approach that integrates variance reduction with first-order meta-learning algorithms such as Reptile. It retains the bilevel formulation, which better captures the structure of meta-learning, but does not require storing the vast number of task-specific parameters needed by general bilevel variance reduction methods. Theoretical results show that it has a fast convergence rate due to variance reduction. Experiments on benchmark few-shot classification data sets demonstrate its effectiveness over state-of-the-art meta-learning algorithms with and without variance reduction.

1. Introduction

Meta-learning (Hospedales et al., 2021), or learning to learn (Thrun & Pratt, 1998), aims to quickly learn new tasks by utilizing meta-knowledge from tasks that are already learned. This is especially useful for deep networks, as they typically have to be trained on a huge amount of labeled samples for each task. Meta-learning has been successfully used in various applications such as few-shot learning (Wang et al., 2020; Finn et al., 2017; Nichol et al., 2018), reinforcement learning (Clavera et al., 2019; Gupta et al., 2018), neural architecture search (Elsken et al., 2020), and semi-supervised learning (Shu et al., 2019; Ren et al., 2020). In this paper, we focus on a particularly well-known family of meta-learning algorithms based on MAML (Finn et al., 2017) and its variants (such as Reptile (Nichol et al., 2018) and ANIL (Raghu et al., 2020)).

¹Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Correspondence to: James Kwok.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Mathematically, meta-learning is often formulated as a bilevel optimization problem (Franceschi et al., 2018). The outer problem learns some meta-parameter useful to all the tasks, while the lower-level problem (one for each task) learns a task-specific model by adapting the meta-parameter. As meta-learning involves learning from a lot of tasks, the variance of the stochastic meta-gradient comes from two sources: (i) variance of data samples for each task, and (ii) variance of task samples from the task distribution. This is different from simpler machine learning problems involving only one task, in which the variance comes only from the data samples. When the tasks are diverse, meta-learning algorithms can suffer from large variance in their updates, leading to slow convergence (Ghadimi & Lan, 2013; Ji et al., 2020). To obtain faster convergence, a natural approach is to use variance reduction (Gower et al., 2020), which accelerates convergence by reducing the variance in the stochastic gradients. As demonstrated by classic variance reduction methods such as SVRG (Johnson & Zhang, 2013) and SARAH (Nguyen et al., 2017), this can achieve convergence rates faster than SGD both theoretically and empirically. However, the batch gradient has to be computed occasionally, which can still be very expensive when the training data set is large.
To alleviate this problem, a recent variance reduction algorithm, STORM (Cutkosky & Orabona, 2019), performs variance reduction without the need for batch gradients, while still achieving the same convergence rate as previous variance reduction methods. Very recently, Wang et al. (2021) made an initial attempt to use variance reduction in meta-learning. They proposed VFML, which integrates STORM into the first-order meta-learning algorithm Reptile. However, VFML ignores the bilevel structure in meta-learning. Moreover, a theoretical study on its variance reduction properties is lacking.

While classic variance reduction algorithms mainly focus on single-level stochastic optimization problems, there are recent extensions to bilevel stochastic optimization (Khanduri et al., 2021; Yang et al., 2021). In principle, these can be straightforwardly incorporated into meta-learning algorithms by simply replacing the original gradients with their variance-reduced counterparts. However, during optimization, this requires storing the task-specific parameters for all the tasks. When the number of tasks is large (as is typically the case) or the task model is huge (as in many deep learning models), the resulting storage cost can be prohibitive.

In this paper, we propose an efficient variance reduction method that can be used with various meta-learning algorithms. The proposed family of variance-reduced variants is aware of the bilevel optimization structure in meta-learning, while removing the need for storing task-specific parameters in existing bilevel variance reduction methods. We show theoretically that it achieves a faster convergence rate due to variance reduction. Experiments on benchmark few-shot image classification data sets also demonstrate the effectiveness of the proposed method.

A summary of this paper's contributions is as follows: (i) we propose a novel variance reduction method which can be integrated into various meta-learning algorithms; (ii) we provide theoretical analysis demonstrating that the proposed method has a faster convergence rate; and (iii) extensive experiments on benchmark data sets show that it has faster convergence and better performance than existing meta-learning algorithms with and without variance reduction.

2. Related works

2.1. Meta-learning

Given a set of tasks $\mathcal I$, meta-learning (Hospedales et al., 2021) is commonly formulated as the following bilevel optimization problem:

$$\min_w \frac{1}{|\mathcal I|}\sum_{i\in\mathcal I} L((w, \theta_i(w)); D_i^{vald}) \tag{1}$$
$$\text{s.t. } \theta_i \equiv \theta_i(w) = \arg\min_\theta L((w, \theta); D_i^{tr}), \tag{2}$$

where $w$ is the meta-parameter shared among all tasks, $\theta_i(w)$ is the parameter specific to task $i$, $D_i^{vald}$ and $D_i^{tr}$ are the meta-validation and meta-training data, respectively, for task $i$, $L((w,\theta); D) = \mathbb E_{\xi\sim D}[\ell((w,\theta);\xi)]$ is the loss of task $i$'s model on data $D$, and $\ell((w,\theta);\xi)$ is the loss on a stochastic sample $\xi$ drawn from $D$. Since our focus is on learning the meta-parameter, we will simplify notations in the sequel and use $L_i(w; D)$ for $L((w,\theta_i); D)$ and $\ell(w;\xi_i)$ for $\ell((w,\theta_i);\xi_i)$.

In this paper, we focus on the family of MAML algorithms (Finn et al., 2017), in which the meta-parameter is used as a meta-initialization for the $\theta_i$'s. The outer loop (1) finds a suitable meta-initialization, while the inner loop (2) adapts $w$ to each task. In many cases, the inner problem is complex and cannot be explicitly solved.
Instead of finding the exact minimizer of $L_i$, MAML performs only a single step of SGD: $\theta_i = w - \alpha\nabla\ell(w;\xi_i)$, where $\xi_i$ is sampled from the training data $D_i^{tr}$ of task $i$. This can also be naturally extended to $K$-step SGD with $K > 1$, which allows better adaptation.

In general, bilevel optimization is expensive. While the meta-gradient on $w$ can be obtained from the implicit function theorem, it requires computing the Hessian matrix or its inverse. To alleviate this problem, methods such as FOMAML (Finn et al., 2017) and Reptile (Nichol et al., 2018) approximate the meta-gradient by using only first-order information. Specifically, FOMAML simply uses the model gradient after task-specific adaptation as the meta-gradient, while Reptile uses the average gradient during adaptation. In practice, Reptile usually has better empirical performance than MAML and FOMAML.

Another problem with the bilevel formulation of meta-learning is that it requires storing the $\theta_i$'s of all the tasks. This can lead to a huge storage cost when the number of tasks is large and/or each $\theta_i$ has a large number of parameters. To address this problem, methods like MAML, FOMAML and Reptile do not explicitly keep the $\theta_i$'s for all tasks. Instead, they directly use the $\theta_i$'s (as a function of $w$) in the outer objective, leading to the single-level optimization problem:
$$\min_w \frac{1}{|\mathcal I|}\sum_{i\in\mathcal I} L(w - \alpha\nabla L(w; D_i^{tr}); D_i^{vald}).$$
Domain randomized search (DRS) (Gao & Sener, 2020), also called joint training (Finn et al., 2019), is another way to avoid the bilevel formulation, by simply optimizing the losses of all tasks together. While it shows good results in some applications (Gao & Sener, 2020), DRS does not consider multi-step task adaptation, and can have inferior performance to the other meta-learning algorithms, as will also be demonstrated empirically in Section 4.

2.2. Variance Reduction in Stochastic Optimization

Variance reduction (Gower et al., 2020) has been commonly used to reduce the variance in stochastic gradients and thus accelerate optimization. Pioneering works such as SAG (Roux et al., 2012), SDCA (Shalev-Shwartz & Zhang, 2013) and SVRG (Johnson & Zhang, 2013) are only applicable to strongly-convex problems. More recently, variance reduction methods for general non-convex problems have also been developed (Allen-Zhu & Hazan, 2016; Nguyen et al., 2017; Fang et al., 2018). However, they still require the occasional computation of the batch gradient, which can be expensive on large training sets.

Recently, Cutkosky & Orabona (2019) proposed a momentum-based variance reduction algorithm called STORM (Algorithm 1). It computes a variance-reduced gradient (step 6) by using only the stochastic gradients at two successive iterates ($w_t$ and $w_{t-1}$) on the same stochastic sample $\xi_t$, without requiring the time-consuming batch gradient computation. Note that it reduces to standard SGD when all the $\gamma_t$'s are set to one. Moreover, STORM (and its variant STORM+ (Levy et al., 2021)) has the same asymptotic convergence rate as other variance reduction methods.

Algorithm 1 STORM (Cutkosky & Orabona, 2019).
1: Input: $w_0$, step-sizes $\{\eta_t\}$, decay parameters $\{\gamma_t\}$.
2: $c_0 = \nabla\ell(w_0; \xi_0)$
3: $w_1 = w_0 - \eta_0 c_0$
4: for $t = 1$ to $T-1$ do
5:   sample $\xi_t$
6:   $c_t = \nabla\ell(w_t; \xi_t) + (1-\gamma_t)(c_{t-1} - \nabla\ell(w_{t-1}; \xi_t))$
7:   $w_{t+1} = w_t - \eta_t c_t$
8: end for
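For concreteness, the STORM update (steps 5-7 of Algorithm 1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; `grad_fn(w, xi)` is an assumed helper standing for the stochastic gradient $\nabla\ell(w;\xi)$.

```python
import numpy as np

def storm_step(w, w_prev, c_prev, xi, grad_fn, eta, gamma):
    """One STORM iteration (steps 5-7 of Algorithm 1); minimal sketch.

    grad_fn(w, xi) is an assumed helper returning the stochastic
    gradient of the loss at w on sample xi.
    """
    g_t = grad_fn(w, xi)          # gradient at the current iterate
    g_prev = grad_fn(w_prev, xi)  # gradient at the previous iterate, SAME sample
    c_t = g_t + (1.0 - gamma) * (c_prev - g_prev)  # variance-reduced estimate
    return w - eta * c_t, c_t

# Toy usage on the quadratic loss l(w; xi) = 0.5 * ||w - xi||^2.
rng = np.random.default_rng(0)
grad = lambda w, xi: w - xi
w_prev = np.ones(3)                            # w_0
c = grad(w_prev, rng.standard_normal(3))       # c_0
w = w_prev - 0.1 * c                           # w_1
for t in range(1, 100):
    xi = rng.standard_normal(3)
    w_next, c = storm_step(w, w_prev, c, xi, grad, eta=0.1, gamma=1.0 / (t + 1))
    w_prev, w = w, w_next
```

Setting `gamma = 1.0` makes the correction term vanish and recovers plain SGD, matching the remark above.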
Algorithm 2 Reptile (Nichol et al., 2018).
1: Input: $w_0$, step-sizes $\{\eta_t\}$ and $\alpha$, number of local steps $K$.
2: for $t = 0$ to $T-1$ do
3:   sample tasks $I_t \subseteq \mathcal I$
4:   for $i \in I_t$ do
5:     $u^i_0 = w_t$
6:     for $k = 0$ to $K-1$ do
7:       obtain samples $\xi^i_{k,t}$ from the support data of task $i$
8:       $u^i_{k+1} = u^i_k - \alpha\nabla\ell(u^i_k; \xi^i_{k,t})$
9:     end for
10:     $c^i_t = \frac{1}{K\alpha}(w_t - u^i_K)$
11:   end for
12:   $c_t = \frac{1}{|I_t|}\sum_{i\in I_t} c^i_t$
13:   $w_{t+1} = w_t - \eta_t c_t$
14: end for

Very recently, momentum-based variance reduction has also been extended to stochastic bilevel optimization. Examples include SUSTAIN (Khanduri et al., 2021), MRBO/VRBO (Yang et al., 2021), RSVRB (Guo et al., 2021), and VR-BiAdam (Huang & Huang, 2021). With the help of variance reduction, these methods achieve the same asymptotic convergence rate and are faster than bilevel algorithms without variance reduction (Ji et al., 2021; Chen et al., 2021).

3. Variance Reduction for Meta-Learning

Recall from Section 2.1 that meta-learning can be formulated as either a bilevel or a single-level optimization problem. In both cases, a straightforward approach to reduce the variance in the stochastic gradients is to use the corresponding variance reduction methods. For example, with the bilevel optimization formulation, one can use the recent methods in (Khanduri et al., 2021; Yang et al., 2021). However, recall that this demands a lot of storage, and can be infeasible when the number of tasks is large. Moreover, to avoid overfitting the often limited data available in each task, the inner loop typically performs only a small number of gradient descent steps, whereas effective variance reduction usually requires a sufficiently large number of steps (Allen-Zhu & Hazan, 2016; Cutkosky & Orabona, 2019). Alternatively, by formulating meta-learning as a single-level optimization problem, one can use recent variance reduction methods such as STORM. Very recently, an initial attempt in this direction is VFML (Wang et al., 2021), which integrates STORM into Reptile. However, it lacks a formal study of its theoretical properties. Experiments in Section 4 also show that VFML has inferior performance.

In this section, we propose a novel variance reduction algorithm for meta-learning that avoids the pitfalls of directly applying existing variance reduction methods in bilevel or single-level optimization. The idea is to keep utilizing the double-loop structure of the bilevel formulation, which is known to more accurately capture the structure of meta-learning (Gao & Sener, 2020), while maintaining the efficiency of the single-level formulation, which does not require storing a vast number of task-specific parameters.

Algorithm 3 VR-Reptile (Variance-Reduced Reptile).
1: Input: initial weight $w_0$, stepsizes $\{\eta_t\}$ and $\alpha$, number of local steps $K$, decay parameters $\{\gamma_t\}$.
2: sample tasks $I_0 \subseteq \mathcal I$
3: for $i \in I_0$ do
4:   $u^i_0 = w_0$
5:   for $k = 0$ to $K-1$ do
6:     obtain samples $\xi^i_{k,0}$ from the support data of task $i$
7:     $u^i_{k+1} = u^i_k - \alpha\nabla\ell(u^i_k; \xi^i_{k,0})$
8:   end for
9:   $c^i_0 = \frac{1}{K\alpha}(w_0 - u^i_K)$
10: end for
11: $\bar c_0 = \frac{1}{|I_0|}\sum_{i\in I_0} c^i_0$
12: $w_1 = w_0 - \eta_0\bar c_0$
13: for $t = 1$ to $T-1$ do
14:   sample tasks $I_t \subseteq \mathcal I$
15:   for $i \in I_t$ do
16:     $u^i_0 = w_t$
17:     $v^i_0 = w_{t-1}$
18:     for $k = 0$ to $K-1$ do
19:       obtain samples $\xi^i_{k,t}$ from the support data of task $i$
20:       $u^i_{k+1} = u^i_k - \alpha\nabla\ell(u^i_k; \xi^i_{k,t})$
21:       $v^i_{k+1} = v^i_k - \alpha\nabla\ell(v^i_k; \xi^i_{k,t})$
22:     end for
23:     $d^i_{t-1} = \frac{1}{K\alpha}(w_{t-1} - v^i_K)$
24:     $c^i_t = \frac{1}{K\alpha}(w_t - u^i_K)$
25:   end for
26:   $\bar d_{t-1} = \frac{1}{|I_t|}\sum_{i\in I_t} d^i_{t-1}$
27:   $\bar c_t = \frac{1}{|I_t|}\sum_{i\in I_t} c^i_t + (1-\gamma_t)(\bar c_{t-1} - \bar d_{t-1})$
28:   $w_{t+1} = w_t - \eta_t\bar c_t$
29: end for
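The outer loop of Algorithm 3 (steps 14-28) can be sketched as follows. This is an illustrative Python sketch under our reading of the algorithm, not released code; `grad` is an assumed helper for the stochastic gradient $\nabla\ell$, and `task_batches` collects the sampled inner-loop minibatches so that the same samples can be replayed from both $w_t$ and $w_{t-1}$.

```python
import numpy as np

def adapt(w, alpha, batches, grad):
    """Run K inner SGD steps from w on the given support minibatches."""
    u = np.copy(w)
    for xi in batches:
        u = u - alpha * grad(u, xi)
    return u

def vr_reptile_outer_step(w, w_prev, c_bar_prev, task_batches,
                          alpha, eta, gamma, grad):
    """One outer iteration of VR-Reptile (steps 14-28 of Algorithm 3); sketch.

    task_batches[i] is the list of K support minibatches xi^i_{k,t} of task i;
    grad(u, xi) is an assumed helper returning the stochastic gradient.
    Only running sums are kept, so no per-task parameters are stored.
    """
    K = len(task_batches[0])
    c_sum, d_sum = np.zeros_like(w), np.zeros_like(w)
    for batches in task_batches:
        u_K = adapt(w, alpha, batches, grad)       # c_t^i starts from w_t ...
        v_K = adapt(w_prev, alpha, batches, grad)  # ... d_{t-1}^i from w_{t-1}, same samples
        c_sum += (w - u_K) / (K * alpha)           # step 24
        d_sum += (w_prev - v_K) / (K * alpha)      # step 23
    n = len(task_batches)
    c_bar = c_sum / n + (1.0 - gamma) * (c_bar_prev - d_sum / n)  # step 27
    return w - eta * c_bar, c_bar                                 # step 28
```

Note that only the two running sums are accumulated, mirroring the space-efficiency argument in Section 3.1 below.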
Algorithm 4 VFML (Wang et al., 2021).
1: Input: $w_0$, stepsizes $\{\eta_t\}$ and $\alpha$, number of local steps $K$, decay parameters $\{\beta_t\}$ and $\{\gamma_t\}$.
2: sample tasks $I_{-1} \subseteq \mathcal I$
3: for $i \in I_{-1}$ do
4:   $m^i_0 = \nabla\ell(w_0; \xi^i_{0,-1})$
5: end for
6: $m_0 = \frac{1}{|I_{-1}|}\sum_{i\in I_{-1}} m^i_0$
7: for $t = 0$ to $T-1$ do
8:   sample tasks $I_t \subseteq \mathcal I$
9:   for $i \in I_t$ do
10:     $u^i_0 = w_t$
11:     for $k = 0$ to $K-1$ do
12:       $u^i_{k+1} = u^i_k - \alpha(\gamma_t\nabla\ell(u^i_k; \xi^i_{k,t}) + (1-\gamma_t)m_t)$
13:     end for
14:     $\tilde m^i_t = \nabla\ell(w_t; \xi^i_{K,t})$
15:     $c^i_t = \frac{1}{K\alpha}(w_t - u^i_K)$
16:   end for
17:   $\bar c_t = \frac{1}{|I_t|}\sum_{i\in I_t} c^i_t$
18:   $\tilde m_t = \frac{1}{|I_t|}\sum_{i\in I_t}\tilde m^i_t$
19:   $w_{t+1} = w_t - \eta_t\bar c_t$
20:   for $i \in I_t$ do
21:     $m^i_{t+1} = \nabla\ell(w_{t+1}; \xi^i_{K,t})$
22:   end for
23:   $m_{t+1} = \frac{1}{|I_t|}\sum_{i\in I_t} m^i_{t+1} + (1-\beta_t)(m_t - \tilde m_t)$
24: end for

3.1. Proposed Algorithm

In the following, for easy comparison with VFML, we focus on Reptile (Algorithm 2). The proposed integration, which will be called VR-Reptile (variance-reduced Reptile), is shown in Algorithm 3. The integrations with other first-order meta-learning algorithms (such as MAML, FOMAML and BMG (Flennerhag et al., 2022)) are analogous, and are presented in Appendix B.1.

The main part of Algorithm 3 is in steps 13-29. Recall that STORM computes a variance-reduced gradient estimate by using the stochastic gradients at two successive iterates (step 6 of Algorithm 1). Applying this to the stochastic gradient update $c_t$ in the outer loop of Reptile (step 13 in Algorithm 2), we obtain the update in step 27 of Algorithm 3. Here, $\bar d_{t-1}$ is the stochastic gradient estimate (analogous to $\bar c_t$ in Algorithm 3, or $c_t$ in Reptile), but evaluated at the previous iterate ($w_{t-1}$) on the current batch of samples $\xi^i_{k,t}$. The components $c^i_t$'s (resp. $d^i_{t-1}$'s) in $\bar c_t$ (resp. $\bar d_{t-1}$), which correspond to the local update on each task $i$ in the inner loop, are computed in steps 16-24. Again, the $c^i_t$'s are based on $w_t$, while the $d^i_{t-1}$'s are based on $w_{t-1}$. Note that the bilevel optimization structure in meta-learning is still preserved. Moreover, it can easily be seen that when all the $\gamma_t$'s are set to 1, VR-Reptile reduces to Reptile, in the same manner as STORM reduces to standard SGD in this case.

In the implementation, we do not need to store the task-specific model parameters (the $u^i_K$'s and $v^i_K$'s), as they are used only once, to compute the $c^i_t$'s and $d^i_{t-1}$'s. Indeed, neither the $c^i_t$'s nor the $d^i_{t-1}$'s have to be stored separately, as they only need to be summed in the updates of $\bar c_t$ and $\bar d_{t-1}$. Thus, this is much more space-efficient than a direct application of the variance reduction methods for stochastic bilevel optimization, which requires storing all the task-specific parameters.

Remark 3.1. Recall that VFML (shown in Algorithm 4) also aims at integrating STORM into Reptile. However, it is very different from the proposed Algorithm 3. While Algorithm 3 computes a variance-reduced gradient estimate $\bar c_t$ for the update of the target meta-parameter $w$ (step 27), VFML computes a STORM-like variance-reduced estimate of the intermediate gradient $\nabla\ell(u^i_k; \xi^i_{k,t})$ that is used only by the local variable $u^i_{k+1}$ (step 12 in Algorithm 4). However, the number of gradient descent steps performed in the inner loop is typically small, and not enough for effective variance reduction. Moreover, performing variance reduction on an intermediate gradient makes theoretical analysis of VFML difficult.
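To make the contrast in Remark 3.1 concrete, the schematic sketch below shows where each algorithm applies its correction; the function names are hypothetical and `grad` is an assumed gradient helper.

```python
# VFML: the momentum-mixed gradient (step 12 of Algorithm 4) only steers
# the task-specific iterate u_k inside the short K-step inner loop.
def vfml_inner_step(u_k, m_t, xi, alpha, gamma, grad):
    g = gamma * grad(u_k, xi) + (1.0 - gamma) * m_t
    return u_k - alpha * g

# VR-Reptile: the STORM-style correction (step 27 of Algorithm 3) is applied
# to the meta-update itself, which directly moves the meta-parameter w.
def vr_reptile_meta_update(c_t, d_tm1, c_bar_prev, gamma):
    return c_t + (1.0 - gamma) * (c_bar_prev - d_tm1)
```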
3.2. Theoretical Analysis

In Section 3.2.1, we first study the (gradient of the) loss function that Reptile and the proposed VR-Reptile are implicitly minimizing. Section 3.2.2 then shows that VR-Reptile has a faster convergence rate than vanilla Reptile due to variance reduction. The analysis can easily be extended to show that VR-MAML/VR-FOMAML/VR-BMG also have faster convergence rates than vanilla MAML/FOMAML/BMG.

3.2.1. Loss Implicitly Used in Reptile

For Reptile (Algorithm 2), let $\xi^i_{0:K-1,t} \equiv \{\xi^i_{0,t},\dots,\xi^i_{K-1,t}\}$ be the $K$ i.i.d. samples from the training data $D_i^{tr}$ of task $i$. At step 13, the meta-parameter $w_t$ is updated with $-\eta_t c_t$, in which $c_t$ is an average over the $c^i_t$'s from the sampled tasks in $I_t$ (step 12). Note that
$$c^i_t = \frac1K\sum_{k=0}^{K-1}\nabla\ell(u^i_k; \xi^i_{k,t}).$$
As each gradient $\nabla\ell(u^i_k; \xi^i_{k,t})$ is a gradient field, and thus path-independent by the gradient theorem (Rudin, 1976), summing them together means that $c^i_t$ is also path-independent, and thus a gradient field by the converse of the gradient theorem. Hence, each $c^i_t$ can be considered as the stochastic gradient of some loss $\tilde\ell$ on samples $\xi^i_{0:K-1,t}$ (i.e., $c^i_t \equiv \nabla\tilde\ell(w_t; \xi^i_{0:K-1,t})$). Let $\tilde L_i(w_t) = \mathbb E_{\xi^i_{0:K-1,t}\sim D_i^{tr}}[\tilde\ell(w_t; \xi^i_{0:K-1,t})]$. The Reptile update can then be viewed as a stochastic gradient step on an implicit loss $\tilde L$, with $\tilde L(w_t) = \frac{1}{|\mathcal I|}\sum_{i\in\mathcal I}\tilde L_i(w_t)$.

When $K = 1$, as can be seen from Reptile (Algorithm 2), $\nabla\tilde\ell(w_t;\xi^i_{0,t}) \equiv c^i_t = \nabla\ell(w_t;\xi^i_{0,t})$. Hence,
$$\nabla\tilde L_i(w_t) = \mathbb E_{\xi^i_{0,t}\sim D_i^{tr}}\nabla\tilde\ell(w_t;\xi^i_{0,t}) = \mathbb E_{\xi^i_{0,t}\sim D_i^{tr}}\nabla\ell(w_t;\xi^i_{0,t}) = \nabla L(w_t; D_i^{tr}),$$
which matches the gradient of task $i$'s training loss. This agrees with the fact that Reptile reduces to SGD on the loss $L$ in this case. Similarly, for VR-Reptile, when $K = 1$, we have
$$c^i_t = \nabla\tilde\ell(w_t;\xi^i_{0,t}) = \nabla\ell(u^i_0;\xi^i_{0,t}) = \nabla\ell(w_t;\xi^i_{0,t}), \qquad d^i_{t-1} = \nabla\tilde\ell(w_{t-1};\xi^i_{0,t}) = \nabla\ell(v^i_0;\xi^i_{0,t}) = \nabla\ell(w_{t-1};\xi^i_{0,t}),$$
which are the stochastic gradients at $w_t$ and $w_{t-1}$ with the same stochastic sample $\xi^i_{0,t}$. Since the $\xi^i_{0,t}$'s are sampled from different tasks and $\bar d_{t-1}$ is the average over all the $d^i_{t-1}$'s, $\bar d_{t-1}$ becomes the stochastic gradient of $\tilde L$ at $w_{t-1}$. $\bar c_t$ in step 27 (which is the same as step 6 of STORM (Algorithm 1)) then becomes the variance-reduced gradient of $\tilde L$ at $w_t$. Thus, while Reptile reduces to SGD when $K = 1$, VR-Reptile reduces to (the faster) STORM in this case.

3.2.2. Convergence Properties

First, we introduce the following smoothness assumption on the loss $\ell$, which is commonly used for stochastic optimization algorithms (Cutkosky & Orabona, 2019).

Assumption 3.2. $\ell$ is $M$-Lipschitz smooth w.r.t. $w$ (i.e., for any $w, w'$ and $\xi$, $\|\nabla\ell(w;\xi) - \nabla\ell(w';\xi)\| \le M\|w - w'\|$).

The following Proposition shows that the implicit loss $\tilde\ell$ is also Lipschitz-smooth. All the proofs are in Appendix A.

Proposition 3.3. $\tilde\ell$ is $\tilde M$-Lipschitz smooth w.r.t. $w_t$, where $\tilde M = \frac{(1+\alpha M)^K - 1}{\alpha K}$.

Corollary 3.4. $\tilde L_i(w)$ and $\tilde L(w)$ are $\tilde M$-Lipschitz smooth.

Next, we decompose the variance of $c^i_t$ (which equals $\frac1K\sum_{k=0}^{K-1}\nabla\ell(u^i_k;\xi^i_{k,t})$) into two parts:
$$\mathbb E_i\,\mathbb E_{\xi^i_{0:K-1,t}\sim D_i^{tr}}\|c^i_t - \nabla\tilde L(w_t)\|^2 = \mathbb E_i\left[\mathbb E_{\xi^i_{0:K-1,t}\sim D_i^{tr}}\|c^i_t - \nabla\tilde L_i(w_t)\|^2\right] + \mathbb E_i\|\nabla\tilde L_i(w_t) - \nabla\tilde L(w_t)\|^2. \tag{3}$$
The first part is due to data sampling, while the second part is due to task sampling. To bound the first part, we assume that the stochastic variance of $\nabla\ell(w;\xi)$ is bounded. This is also commonly assumed for stochastic gradient methods (Ghadimi & Lan, 2013; Cutkosky & Orabona, 2019).

Assumption 3.5. For any $w$ and task $i$, there exists a constant $\sigma^2$ such that $\mathbb E_{\xi\sim D_i^{tr}}\|\nabla\ell(w;\xi) - \nabla L_i(w)\|^2 \le \sigma^2$.

For simplicity, we assume the same $\sigma^2$ for all tasks. This can easily be extended to the case where different tasks have different variance bounds. The following Proposition bounds the first part in (3). When $K = 1$, it reduces to the condition in Assumption 3.5.

Proposition 3.6. Define
$$\varsigma^2 = \frac{2^K(1+M^2\alpha^2)^K - 1}{K(1+2M^2\alpha^2)}\cdot\frac{2M^2\alpha^2\sigma^2}{1+2M^2\alpha^2} + \frac{\sigma^2}{1+2M^2\alpha^2}.$$
For any $w_t$ and task $i$, we have $\mathbb E_{\xi^i_{0:K-1,t}\sim D_i^{tr}}\|c^i_t - \nabla\tilde L_i(w_t)\|^2 \le \varsigma^2$. When $K = 1$, we have $\varsigma = \sigma$.
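As a quick check of the last claim, substituting $K = 1$ into the expression for $\varsigma^2$ indeed recovers the bound $\sigma^2$ of Assumption 3.5:
$$\varsigma^2\Big|_{K=1} = \frac{2(1+M^2\alpha^2)-1}{1+2M^2\alpha^2}\cdot\frac{2M^2\alpha^2\sigma^2}{1+2M^2\alpha^2} + \frac{\sigma^2}{1+2M^2\alpha^2} = \frac{2M^2\alpha^2\sigma^2 + \sigma^2}{1+2M^2\alpha^2} = \sigma^2.$$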
For effective meta-learning, we assume that the tasks do not differ too much, as in (Fallah et al., 2020; Ji et al., 2020).

Assumption 3.7. There exists a positive constant $\delta^2$ such that for any $w$ and two different tasks $i, j$, $\|\nabla L_i(w) - \nabla L_j(w)\|^2 \le \delta^2$.

Proposition 3.8. Define
$$\tilde\delta^2 = 2\delta^2 + \frac{(1+4KM^2\alpha^2)^K}{4K^2M^2\alpha^2}\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) + 8KM^2\alpha^2(\delta^2+2\sigma^2).$$
For any $w_t$ and tasks $i, j$, we have $\|\nabla\tilde L_i(w_t) - \nabla\tilde L_j(w_t)\|^2 \le \tilde\delta^2$.

This can then be used to bound the second term in (3).

Corollary 3.9. If the tasks $i$ are uniformly sampled from $\mathcal I$, then $\mathbb E_i\|\nabla\tilde L_i(w_t) - \nabla\tilde L(w_t)\|^2 \le \tilde\delta^2$, where the expectation is taken over the set of training tasks.

Combining Proposition 3.6 and Corollary 3.9, we have:

Corollary 3.10. $\mathbb E_i\,\mathbb E_{\xi^i_0,\dots,\xi^i_{K-1}\sim D_i^{tr}}\|c^i_t - \nabla\tilde L(w_t)\|^2 \le \tilde\sigma^2$, where $\tilde\sigma^2 \equiv \varsigma^2 + \tilde\delta^2$.

With a suitable $\alpha$, $\tilde M$ can be upper-bounded by a constant not depending on $K$. Consequently, $\tilde\delta^2$ also grows linearly (but not exponentially) with $K$.

Corollary 3.11. With $\alpha = \frac{1}{KM}$, $\tilde M = M\left(\left(1+\frac1K\right)^K - 1\right) \le (e-1)M$, and
$$\tilde\delta^2 = 2\delta^2 + \frac{(1+\frac4K)^K}{4}(1+KM)(\delta^2+2\sigma^2) + \frac8K(\delta^2+2\sigma^2) \le 2\delta^2 + \left(\frac{e^4}{4} + 8 + \frac{e^4KM}{4}\right)(\delta^2+2\sigma^2).$$

The following Theorem shows the convergence rate of Reptile on the implicit loss.

Theorem 3.12. For any $\eta_0 > 0$, set $\eta_t = \frac{\eta_0}{(1+t)^{1/2}}$. Then Reptile (Algorithm 2) satisfies:
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\left[\|\nabla\tilde L(w_t)\|^2\right] \le \frac{2\sqrt2\,G_1}{\eta_0\sqrt T},$$
where $G_1 = \mathbb E[\tilde L(w_0) - \tilde L^*] + \tilde M\tilde\sigma^2\eta_0^2\ln(T+1)$ and $\tilde L^* = \min_w\tilde L(w)$.

Table 1. Statistics for the data sets used.

| data set | training classes | validation classes | testing classes | samples per class |
|---|---|---|---|---|
| Meta-Dataset: bird | 64 | 16 | 20 | 60 |
| Meta-Dataset: texture | 30 | 7 | 10 | 120 |
| Meta-Dataset: aircraft | 64 | 16 | 20 | 100 |
| Meta-Dataset: fungi | 64 | 16 | 20 | 150 |
| mini-ImageNet | 64 | 16 | 20 | 600 |

Theorem 3.12 matches the asymptotic convergence rate of SGD on problems with non-convex objectives (Ghadimi & Lan, 2013). The difference is that we prove convergence w.r.t. $\|\nabla\tilde L(w_t)\|^2$ (i.e., the gradient norm of the implicit loss) instead of $\|\nabla L(w_t)\|^2$. The following Theorem shows that, with the use of variance reduction, VR-Reptile has a faster convergence rate than Reptile. The bound also matches the convergence rate of STORM (Theorems 1 and 2 in (Cutkosky & Orabona, 2019)).

Theorem 3.13. Let $\eta_t = \frac{1}{4\tilde M}\left(t + \left(\frac{65}{28}\right)^3\right)^{-1/3}$ and $\gamma_{t+1} = \frac{65}{28}\left(t + \left(\frac{65}{28}\right)^3\right)^{-2/3}$. Algorithm 3 satisfies:
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\left[\|\nabla\tilde L(w_t)\|^2\right] \le \frac{4\tilde M G_2}{T^{2/3}} + \frac{65\tilde M G_2}{7T},$$
where $G_2 = 8\,\mathbb E[\tilde L(w_0) - \tilde L^*] + \frac{65\tilde\sigma^2}{28\tilde M} + \left(\frac{65}{28}\right)^2\frac{\tilde\sigma^2}{4\tilde M}\ln T$.

Table 2. Number of outer-loop training iterations on the data sets.

| data set | 1-shot 5-way | 5-shot 5-way |
|---|---|---|
| Meta-Dataset: Bird | 60,000 | 20,000 |
| Meta-Dataset: Texture | 40,000 | 30,000 |
| Meta-Dataset: Aircraft | 20,000 | 20,000 |
| Meta-Dataset: Fungi | 60,000 | 20,000 |
| mini-ImageNet | 60,000 | 60,000 |

Figure 1. Variance of updates during training.

4. Experiments

As in (Finn et al., 2017; Nichol et al., 2018), we perform meta-learning experiments in the few-shot image classification setting. Experiments are performed on mini-ImageNet (Vinyals et al., 2016; Ravi & Larochelle, 2017) and Meta-Dataset (Triantafillou et al., 2020). Meta-Dataset has been popularly used for meta-learning. It consists of image classification data sets from different domains. In this experiment, we use (i) bird, (ii) texture, (iii) aircraft, and (iv) fungi, as in (Yao et al., 2019). While Meta-Dataset focuses on fine-grained classification, mini-ImageNet contains more diverse images. A summary of these data sets is in Table 1. Experiments are performed in the 1-shot 5-way and 5-shot 5-way settings. Following (Finn et al., 2017; Nichol et al., 2018), we use the CONV4 model as the base learner. It is a 4-layer CNN, in which each layer contains 64 3×3 convolutional filters, followed by batch normalization, ReLU activation, and 2×2 max-pooling.
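A PyTorch-style sketch of this CONV4 base learner is shown below. The layer sizes follow the description above; the 84×84 RGB input resolution (standard for mini-ImageNet) and the linear 5-way classification head are our assumptions, as they are not spelled out in the text.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One CONV4 block: 3x3 conv (64 filters), batch norm, ReLU, 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Conv4(nn.Module):
    """4-layer CNN base learner for few-shot classification (sketch)."""
    def __init__(self, n_way: int = 5, in_ch: int = 3, img_size: int = 84):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_ch, 64), conv_block(64, 64),
            conv_block(64, 64), conv_block(64, 64),
        )
        feat_dim = 64 * (img_size // 16) ** 2  # four 2x2 poolings halve 4 times
        self.classifier = nn.Linear(feat_dim, n_way)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```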
For all data sets, the hyper-parameter settings follow Reptile (Nichol et al., 2018): we use vanilla SGD as the optimizer for the outer loop, and Adam for the inner loop. The learning rate for SGD is 1, and no momentum is used. The learning rate for Adam is 0.001, the first-order momentum weight is 0, and the second-order momentum weight is 0.99. The number of gradient descent steps $K$ in the inner loop is 5. The number of iterations in the outer loop ($T$ in Algorithm 3) is shown in Table 2.

The following groups of meta-learning baselines are compared: (i) standard single-level optimization methods, including (a) MAML, (b) FOMAML, (c) Reptile, (d) BMG, and (e) DRS (Gao & Sener, 2020), which optimizes the average loss of all tasks; (ii) variants of the first group¹ (denoted MAML+STORM, FOMAML+STORM, Reptile+STORM, BMG+STORM and DRS+STORM), in which the stochastic gradients are replaced by the variance-reduced counterparts obtained with STORM; (iii) variants of the first group integrated with the proposed method (denoted VR-MAML, VR-FOMAML, VR-Reptile and VR-BMG);² (iv) VFML (Wang et al., 2021); (v) ANIL (Raghu et al., 2020), which uses the bilevel formulation for meta-learning. As the bilevel formulation needs to keep the $\theta_i$'s for all tasks $i$, it is infeasible to use the whole CONV4 model as $\theta_i$.³ To alleviate this problem, we only adapt the parameters in the last layer of CONV4; (vi) variants of ANIL using a straightforward combination with variance reduction methods for bilevel optimization, including (a) SUSTAIN (Khanduri et al., 2021), (b) MRBO (Yang et al., 2021), and (c) VRBO (Yang et al., 2021).

¹ These algorithms are shown in Appendix B.2.
² VR-MAML, VR-FOMAML and VR-BMG are shown in Appendix B.1. DRS does not use a bilevel structure, and so cannot be integrated with the proposed method.
³ For example, mini-ImageNet has 64 classes, and about 7.6×10⁶ 5-way classification tasks in the meta-training set. Even if the deep network has only 0.1M parameters (which is small), the task models take a total of 7.6×10⁶ × 0.1M = 760G memory.

For performance evaluation, we follow (Finn et al., 2017; Nichol et al., 2018) and report the average accuracy over 1,000 5-way classification tasks randomly sampled from the meta-testing set. Each method is repeated 3 times.
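These optimizer settings translate directly into, e.g., the following PyTorch calls. This is a sketch of our reading of the setup (the placeholder linear models stand in for the CONV4 learner and its task-specific copy), not released code.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)       # placeholder for the CONV4 meta-learner
task_model = nn.Linear(10, 5)  # placeholder for the task-specific copy

# Outer loop: plain SGD on the meta-parameters, learning rate 1, no momentum.
meta_opt = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.0)

# Inner loop: Adam on the task-specific copy, lr 0.001, betas (0, 0.99).
task_opt = torch.optim.Adam(task_model.parameters(), lr=1e-3, betas=(0.0, 0.99))

K = 5  # number of inner-loop gradient descent steps
```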
4.1. Results on mini-ImageNet

Table 3 shows the testing accuracies of the various methods in the 1-shot and 5-shot settings.

Table 3. Classification accuracy (%) on mini-ImageNet.

| method | single/bilevel | 1-shot 5-way | 5-shot 5-way |
|---|---|---|---|
| MAML | single | 48.7 ± 1.8 | 63.1 ± 0.9 |
| FOMAML | single | 48.1 ± 1.8 | 63.2 ± 0.9 |
| Reptile | single | 50.0 ± 0.3 | 66.0 ± 0.6 |
| BMG | single | 50.7 ± 0.5 | 65.6 ± 0.6 |
| DRS | single | 24.5 ± 0.8 | 30.4 ± 0.6 |
| ProtoNet | single | 49.4 ± 0.8 | 68.2 ± 0.7 |
| MAML+STORM | single | 47.9 ± 1.4 | 61.6 ± 1.2 |
| FOMAML+STORM | single | 48.0 ± 1.6 | 63.4 ± 1.1 |
| Reptile+STORM | single | 49.9 ± 0.3 | 66.2 ± 0.3 |
| BMG+STORM | single | 46.7 ± 0.6 | 60.9 ± 0.8 |
| DRS+STORM | single | 24.7 ± 1.1 | 30.3 ± 0.7 |
| VR-MAML | single | 49.2 ± 1.4 | 63.6 ± 0.8 |
| VR-FOMAML | single | 48.3 ± 1.2 | 63.4 ± 0.6 |
| VR-Reptile | single | 50.4 ± 0.4 | 67.6 ± 0.8 |
| VR-BMG | single | 51.4 ± 0.3 | 68.4 ± 0.6 |
| VFML | single | 49.6 ± 0.5 | 66.2 ± 0.8 |
| ANIL | bilevel | 46.9 ± 0.4 | 61.4 ± 0.2 |
| ANIL+SUSTAIN | bilevel | 47.0 ± 0.4 | 61.8 ± 0.3 |
| ANIL+MRBO | bilevel | 47.2 ± 0.5 | 62.0 ± 0.2 |
| ANIL+VRBO | bilevel | 47.2 ± 0.4 | 61.9 ± 0.2 |

Figure 2. Training accuracy with number of outer-loop iterations on Meta-Dataset in the 1-shot 5-way setting. (Panels: (a) Bird, (b) Texture, (c) Aircraft, (d) Fungi.)

As can be seen, integrating the meta-learning algorithms with any of the variance reduction methods generally leads to better performance. In particular, the proposed VR-BMG achieves the best overall performance. The superiority of the VR variants over the STORM variants demonstrates that explicitly considering the double-loop meta-learning structure is useful. On the other hand, ANIL and its variance-reduced variants, which are based on the bilevel formulation, perform less well than those using the single-level formulation. As ANIL can only adapt part of its model in order to be computationally feasible, this greatly limits model flexibility, leading to inferior performance. DRS and DRS+STORM are also much inferior to the other baselines, suggesting that accurately modeling the structure of meta-learning is more important than the ease of optimization. Also, VFML generally has worse performance than VR-Reptile, as has been discussed in Remark 3.1.

Table 4. Classification accuracy (%) on the meta-testing set from Meta-Dataset in 1-shot 5-way classification.

| method | single/bilevel | Bird | Texture | Aircraft | Fungi |
|---|---|---|---|---|---|
| MAML | single | 57.44 | 33.74 | 57.98 | 41.80 |
| FOMAML | single | 56.14 | 31.48 | 56.94 | 39.70 |
| Reptile | single | 60.82 | 34.66 | 54.08 | 42.84 |
| BMG | single | 60.96 | 34.78 | 54.16 | 42.76 |
| DRS | single | 33.64 | 23.82 | 28.24 | 24.96 |
| ProtoNet | single | 60.78 | 34.66 | 56.62 | 40.24 |
| MAML+STORM | single | 57.16 | 33.78 | 57.52 | 41.88 |
| FOMAML+STORM | single | 55.94 | 31.52 | 57.16 | 39.48 |
| Reptile+STORM | single | 61.02 | 34.84 | 56.48 | 43.56 |
| BMG+STORM | single | 56.22 | 31.18 | 53.88 | 42.04 |
| DRS+STORM | single | 33.78 | 23.96 | 28.36 | 25.12 |
| VR-MAML | single | 57.62 | 34.04 | 57.88 | 41.66 |
| VR-FOMAML | single | 57.08 | 31.48 | 56.46 | 39.88 |
| VR-Reptile | single | 62.04 | 35.10 | 57.54 | 45.34 |
| VR-BMG | single | 62.14 | 35.24 | 57.68 | 45.26 |
| VFML | single | 61.32 | 34.32 | 53.94 | 42.88 |
| ANIL | bilevel | 56.82 | 32.68 | 56.84 | 41.84 |
| ANIL+SUSTAIN | bilevel | 56.78 | 32.74 | 56.88 | 41.92 |
| ANIL+MRBO | bilevel | 56.74 | 32.76 | 56.92 | 41.94 |
| ANIL+VRBO | bilevel | 56.96 | 32.90 | 56.94 | 42.06 |

Figure 3. Training accuracy with number of outer-loop iterations on Meta-Dataset in the 5-shot 5-way setting. (Panels: (a) Bird, (b) Texture, (c) Aircraft, (d) Fungi.)

Figure 4. Testing accuracy (%) of VR-Reptile on 1-shot 5-way mini-ImageNet with different $K$'s (x-axis: number of inner iterations $K$, from 0 to 20).

Next, we demonstrate that the proposed method can reduce the variance. We focus on three gradient-based meta-learning algorithms (MAML, FOMAML and Reptile) and their variance-reduced variants (VR-MAML, VR-FOMAML and VR-Reptile). Figure 1 shows $\frac{\mathbb E\|c_t - \mathbb E[c_t]\|^2}{\|\mathbb E[c_t]\|^2}$, the variance of the weight update $c_t$ relative to its squared norm, for the different methods on mini-ImageNet in the 1-shot 5-way setting. As expected, the update variance becomes smaller with variance reduction. MAML and FOMAML have smaller gradient variance than Reptile, which explains the smaller improvements from variance reduction on MAML/FOMAML than on Reptile.
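The relative variance plotted in Figure 1 can be estimated from a batch of sampled updates, e.g., as in the following sketch (`updates` is an assumed list of meta-update vectors $c_t$ collected at the same iterate).

```python
import numpy as np

def relative_update_variance(updates):
    """Monte-Carlo estimate of E||c - E[c]||^2 / ||E[c]||^2 (Figure 1)."""
    c = np.stack(updates)                       # shape: (n_samples, dim)
    mean = c.mean(axis=0)                       # estimate of E[c]
    var = ((c - mean) ** 2).sum(axis=1).mean()  # estimate of E||c - E[c]||^2
    return var / (mean ** 2).sum()              # normalized by ||E[c]||^2
```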
Table 5. Classification accuracy (%) on the meta-testing set from Meta-Dataset in 5-shot 5-way classification.

| method | single/bilevel | Bird | Texture | Aircraft | Fungi |
|---|---|---|---|---|---|
| MAML | single | 74.56 | 45.68 | 69.06 | 53.68 |
| FOMAML | single | 73.64 | 42.82 | 66.38 | 52.18 |
| Reptile | single | 74.60 | 43.26 | 66.46 | 52.88 |
| BMG | single | 74.52 | 43.74 | 66.64 | 53.02 |
| DRS | single | 53.34 | 33.28 | 41.06 | 37.64 |
| ProtoNet | single | 74.22 | 49.86 | 71.38 | 53.94 |
| MAML+STORM | single | 74.86 | 45.26 | 68.48 | 53.72 |
| FOMAML+STORM | single | 73.72 | 42.78 | 66.22 | 52.28 |
| Reptile+STORM | single | 75.24 | 44.60 | 67.48 | 52.54 |
| BMG+STORM | single | 73.16 | 42.32 | 66.46 | 51.74 |
| DRS+STORM | single | 53.48 | 33.42 | 41.12 | 37.72 |
| VR-MAML | single | 75.06 | 46.18 | 68.36 | 53.86 |
| VR-FOMAML | single | 74.28 | 43.28 | 66.98 | 52.16 |
| VR-Reptile | single | 76.48 | 46.94 | 71.62 | 54.24 |
| VR-BMG | single | 76.56 | 47.28 | 71.48 | 54.38 |
| VFML | single | 74.38 | 44.48 | 65.64 | 52.76 |
| ANIL | bilevel | 73.68 | 41.96 | 68.74 | 52.84 |
| ANIL+SUSTAIN | bilevel | 73.74 | 42.12 | 68.82 | 52.78 |
| ANIL+MRBO | bilevel | 73.78 | 42.18 | 68.78 | 52.86 |
| ANIL+VRBO | bilevel | 73.88 | 42.22 | 68.74 | 52.82 |

4.2. Results on Meta-Dataset

Tables 4 and 5 show the testing accuracies in the 1-shot and 5-shot settings, respectively, for the four few-shot data sets in Meta-Dataset. The observations are generally similar to those on mini-ImageNet in Section 4.1. Integrating variance reduction into the different meta-learning algorithms leads to the best overall performance, as demonstrated by VR-BMG and VR-Reptile. Moreover, meta-learning methods based on the single-level formulation generally have better performance than methods based on the bilevel formulation.

Figures 2 and 3 show the training accuracy with the number of outer-loop iterations (in (1)) for the 1-shot and 5-shot settings. To reduce clutter, we only show results for Reptile and related methods (VFML, Reptile+STORM and VR-Reptile). As can be seen, VR-Reptile converges much faster than the other baselines, showing the benefit of variance reduction and verifying Theorem 3.13. On the other hand, Reptile+STORM does not show faster convergence compared to Reptile. This is because the inner loop involves only a small number of gradient descent steps, while variance reduction methods like STORM typically require a sufficiently large number of steps to be effective. VFML converges even slower than Reptile in most cases. This can partly be attributed to its lack of theoretical study on convergence properties.

4.3. Ablation Study: Number of Inner Iterations K

Finally, we study the influence of $K$ (the number of inner-loop iterations) on VR-Reptile. We use the mini-ImageNet data set under the 1-shot 5-way setting. The testing accuracies with different $K$'s are shown in Figure 4. As can be seen, a $K$ that is too small or too large leads to inferior performance, while performance with $4 \le K \le 10$ is very similar. To be consistent with (Finn et al., 2017; Nichol et al., 2018), we set $K = 5$ in all previous experiments.

5. Conclusion

In this paper, we propose a novel variance reduction method, VR-Reptile, to accelerate the convergence of meta-learning. VR-Reptile utilizes the double-loop structure of meta-learning algorithms, but does not require storing the task-specific parameters. Theoretical results demonstrate that VR-Reptile has a faster convergence rate than Reptile due to variance reduction. Experiments on benchmark few-shot classification data sets demonstrate its superiority over meta-learning algorithms with and without variance reduction.

Acknowledgements

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 16200021).

References
Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, 2016.

Chen, T., Sun, Y., and Yin, W. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. In Advances in Neural Information Processing Systems, 2021.

Clavera, I., Nagabandi, A., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., and Finn, C. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, 2019.

Elsken, T., Staffler, B., Metzen, J. H., and Hutter, F. Meta-learning of neural architectures for few-shot learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, 2020.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. In International Conference on Machine Learning, 2019.

Flennerhag, S., Schroecker, Y., Zahavy, T., van Hasselt, H., Silver, D., and Singh, S. Bootstrapped meta-learning. In International Conference on Learning Representations, 2022.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, 2018.

Gao, K. and Sener, O. Modeling and optimization trade-off in meta-learning. In Advances in Neural Information Processing Systems, 2020.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Gower, R. M., Schmidt, M., Bach, F., and Richtárik, P. Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968–1983, 2020.

Guo, Z., Hu, Q., Zhang, L., and Yang, T. Randomized stochastic variance-reduced methods for multi-task stochastic bilevel optimization. Technical Report arXiv:2105.02266, 2021.

Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, 2018.

Hospedales, T. M., Antoniou, A., Micaelli, P., and Storkey, A. J. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(1):1–20, 2021.

Huang, F. and Huang, H. BiAdam: Fast adaptive bilevel optimization methods. Technical Report arXiv:2106.11396, 2021.

Ji, K., Yang, J., and Liang, Y. Theoretical convergence of multi-step model-agnostic meta-learning. Technical Report arXiv:2002.07836, 2020.

Ji, K., Yang, J., and Liang, Y. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, 2021.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
Khanduri, P., Zeng, S., Hong, M., Wai, H. T., Wang, Z., and Yang, Z. A near-optimal algorithm for stochastic bilevel optimization via double-momentum. In Advances in Neural Information Processing Systems, 2021.

Levy, K. Y., Kavis, A., and Cevher, V. STORM+: Fully adaptive SGD with recursive momentum for nonconvex optimization. In Advances in Neural Information Processing Systems, 2021.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, 2017.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. Technical Report arXiv:1803.02999, 2018.

Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2020.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.

Ren, Z., Yeh, R., and Schwing, A. Not all unlabeled data are equal: Learning to weight data in semi-supervised learning. In Advances in Neural Information Processing Systems, 2020.

Roux, N., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

Rudin, W. Principles of Mathematical Analysis. McGraw-Hill, 1976.

Shalev-Shwartz, S. and Zhang, T. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, 2013.

Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. Meta-weight-net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, 2019.

Thrun, S. and Pratt, L. Learning to Learn: Introduction and Overview. Springer US, 1998.

Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P.-A., and Larochelle, H. Meta-Dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020.

Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.

Wang, L., Huang, K., Ma, T., Gu, Q., and Huang, J. Variance-reduced first-order meta-learning for natural language processing tasks. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1–34, 2020.

Yang, J., Ji, K., and Liang, Y. Provably faster algorithms for bilevel optimization. In Advances in Neural Information Processing Systems, 2021.

Yao, H., Wei, Y., Huang, J., and Li, Z. Hierarchically structured meta-learning. In International Conference on Machine Learning, 2019.

A. Proofs

A.1. Proof of Proposition 3.3

We first construct the following two sequences from $w_t$ and $w_t'$:
$$u^i_0 = w_t,\quad u^i_{k+1} = u^i_k - \alpha\nabla\ell(u^i_k; \xi^i_{k,t}); \qquad u'^i_0 = w_t',\quad u'^i_{k+1} = u'^i_k - \alpha\nabla\ell(u'^i_k; \xi^i_{k,t}).$$
From the definition of $\tilde\ell$ in Section 3.2.1, we have
$$\left\|\nabla\tilde\ell(w_t; \xi^i_{0:K-1,t}) - \nabla\tilde\ell(w_t'; \xi^i_{0:K-1,t})\right\| = \frac1K\left\|\sum_{k=0}^{K-1}\left(\nabla\ell(u^i_k; \xi^i_{k,t}) - \nabla\ell(u'^i_k; \xi^i_{k,t})\right)\right\| \le \frac MK\sum_{k=0}^{K-1}\|u^i_k - u'^i_k\|,$$
where the last inequality comes from Assumption 3.2. We now bound $\|u^i_k - u'^i_k\|$ for $k = 0,\dots,K-1$ in terms of $\|w_t - w_t'\|$. For $k = 0$, obviously $\|u^i_0 - u'^i_0\| = \|w_t - w_t'\|$. For $k > 0$, we have the induction
$$\|u^i_{k+1} - u'^i_{k+1}\| \le \|u^i_k - u'^i_k\| + \alpha\|\nabla\ell(u^i_k; \xi^i_{k,t}) - \nabla\ell(u'^i_k; \xi^i_{k,t})\| \le (1+\alpha M)\|u^i_k - u'^i_k\|,$$
where the last inequality again comes from Assumption 3.2. This gives $\|u^i_k - u'^i_k\| \le (1+\alpha M)^k\|w_t - w_t'\|$ for $k = 0,\dots,K-1$. Summing over $k$,
$$\left\|\nabla\tilde\ell(w_t; \xi^i_{0:K-1,t}) - \nabla\tilde\ell(w_t'; \xi^i_{0:K-1,t})\right\| \le \frac MK\sum_{k=0}^{K-1}(1+\alpha M)^k\|w_t - w_t'\| = \frac{(1+\alpha M)^K - 1}{\alpha K}\|w_t - w_t'\|.$$
Setting $\tilde M = \frac{(1+\alpha M)^K - 1}{\alpha K}$ then gives the desired result.

A.2. Proof of Proposition 3.6

Throughout this proof, we abbreviate $\mathbb E_{\xi^i_{0,t},\dots,\xi^i_{K-1,t}\sim D_i}[\cdot]$ as $\mathbb E_\xi[\cdot]$. From the definition of $c^i_t$ in Section 3.2.1, we have
$$\mathbb E_\xi[c^i_t] = \frac1K\sum_{k=0}^{K-1}\mathbb E_\xi[\nabla\ell(u^i_k; \xi^i_{k,t})], \quad\text{where } u^i_0 = w_t,\; u^i_{k+1} = u^i_k - \alpha\nabla\ell(u^i_k; \xi^i_{k,t}). \tag{4}$$
Hence,
$$\mathbb E_\xi\|c^i_t - \mathbb E_\xi[c^i_t]\|^2 = \mathbb E_\xi\left\|\frac1K\sum_{k=0}^{K-1}\left(\nabla\ell(u^i_k;\xi^i_{k,t}) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\right)\right\|^2 \le \frac1K\sum_{k=0}^{K-1}\mathbb E_\xi\left\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\right\|^2.$$
We thus need to bound each summand. For $k = 0,\dots,K-1$,
$$\mathbb E_\xi\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\|^2 = \mathbb E_\xi\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \nabla L_i(u^i_k)\|^2 + \mathbb E_\xi\|\nabla L_i(u^i_k) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\|^2, \tag{5}$$
where the cross-term vanishes because $\xi^i_{k,t}$ and $u^i_k$ are independent, as all the stochastic samples $\xi^i_{0,t},\dots,\xi^i_{k,t}$ are i.i.d. For the first term in (5), Assumption 3.5 gives
$$\mathbb E_\xi\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \nabla L_i(u^i_k)\|^2 = \mathbb E_{\xi^i_{k,t}\sim D_i}\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \nabla L_i(u^i_k)\|^2 \le \sigma^2. \tag{6}$$
For the second term in (5), when $k = 0$, $u^i_0 = w_t$ is deterministic and the term is zero. When $k > 0$, $u^i_k$ depends on $\xi^i_{0,t},\dots,\xi^i_{k-1,t}$. Define $\bar u^i_k$ satisfying
$$\bar u^i_0 = u^i_0 = w_t, \qquad \bar u^i_k = \mathbb E_{\xi^i_{0,t},\dots,\xi^i_{k-1,t}\sim D_i}[u^i_k]. \tag{7}$$
Since $\mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})] = \mathbb E_\xi[\nabla L_i(u^i_k)]$, the second term in (5) is the variance of $\nabla L_i(u^i_k) - \nabla L_i(\bar u^i_k)$ w.r.t. the stochastic samples $\xi^i_{0,t},\dots,\xi^i_{k-1,t}$ (the other samples do not affect $u^i_k$), and a variance is bounded by the corresponding second moment:
$$\mathbb E_\xi\|\nabla L_i(u^i_k) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\|^2 \le \mathbb E_\xi\|\nabla L_i(u^i_k) - \nabla L_i(\bar u^i_k)\|^2.$$
From Assumption 3.2, we have
$$\mathbb E_\xi\|\nabla L_i(u^i_k) - \nabla L_i(\bar u^i_k)\|^2 \le M^2\,\mathbb E_\xi\|u^i_k - \bar u^i_k\|^2. \tag{8}$$
For $\mathbb E_\xi\|u^i_k - \bar u^i_k\|^2$, from the definitions of $u^i_k$ and $\bar u^i_k$ in (4) and (7),
$$\mathbb E_\xi\|u^i_k - \bar u^i_k\|^2 \le 2\,\mathbb E_\xi\|u^i_{k-1} - \bar u^i_{k-1}\|^2 + 2\alpha^2\,\mathbb E_\xi\|\nabla\ell(u^i_{k-1};\xi^i_{k-1,t}) - \mathbb E_\xi[\nabla\ell(u^i_{k-1};\xi^i_{k-1,t})]\|^2, \tag{9}$$
where the last term is the quantity we want to bound, at index $k-1$ instead of $k$. Combining (6), (8) and (9) with (5),
$$\mathbb E_\xi\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\|^2 \le \sigma^2 + 2M^2\,\mathbb E_\xi\|u^i_{k-1} - \bar u^i_{k-1}\|^2 + 2M^2\alpha^2\,\mathbb E_\xi\|\nabla\ell(u^i_{k-1};\xi^i_{k-1,t}) - \mathbb E_\xi[\nabla\ell(u^i_{k-1};\xi^i_{k-1,t})]\|^2. \tag{10}$$
Using the base cases
$$\mathbb E_\xi\|\nabla\ell(u^i_0;\xi^i_{0,t}) - \mathbb E_\xi[\nabla\ell(u^i_0;\xi^i_{0,t})]\|^2 = \mathbb E_{\xi^i_{0,t}}\|\nabla\ell(u^i_0;\xi^i_{0,t}) - \nabla L_i(u^i_0)\|^2 \le \sigma^2, \qquad \mathbb E_\xi\|u^i_0 - \bar u^i_0\|^2 = 0,$$
and unrolling the coupled recursions (9) and (10), we can obtain
$$\mathbb E_\xi\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\|^2 \le 2^k(1+M^2\alpha^2)^k\,\frac{2M^2\alpha^2\sigma^2}{1+2M^2\alpha^2} + \frac{\sigma^2}{1+2M^2\alpha^2}.$$
Summing $k$ from $0$ to $K-1$,
$$\mathbb E_\xi\|c^i_t - \mathbb E_\xi[c^i_t]\|^2 \le \frac1K\sum_{k=0}^{K-1}\mathbb E_\xi\|\nabla\ell(u^i_k;\xi^i_{k,t}) - \mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})]\|^2 \le \frac{2^K(1+M^2\alpha^2)^K - 1}{K(1+2M^2\alpha^2)}\cdot\frac{2M^2\alpha^2\sigma^2}{1+2M^2\alpha^2} + \frac{\sigma^2}{1+2M^2\alpha^2}.$$
Finally, defining
$$\varsigma^2 = \frac{2^K(1+M^2\alpha^2)^K - 1}{K(1+2M^2\alpha^2)}\cdot\frac{2M^2\alpha^2\sigma^2}{1+2M^2\alpha^2} + \frac{\sigma^2}{1+2M^2\alpha^2}$$
concludes the proof.

A.3. Proof of Proposition 3.8

Similar to the proof of Proposition 3.6, we first obtain
$$\mathbb E_\xi[c^i_t] = \frac1K\sum_{k=0}^{K-1}\mathbb E_\xi[\nabla\ell(u^i_k;\xi^i_{k,t})] = \frac1K\sum_{k=0}^{K-1}\mathbb E_\xi[\nabla L_i(u^i_k)],$$
which then gives, for two tasks $i, j$,
$$\left\|\mathbb E_\xi[c^i_t] - \mathbb E_\xi[c^j_t]\right\|^2 \le \frac1K\sum_{k=0}^{K-1}\left\|\mathbb E_\xi[\nabla L_i(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2. \tag{11}$$
Here and below, the expectation is over the inner-loop samples $\xi^i_{0:K-1,t}\sim D_i$ and $\xi^j_{0:K-1,t}\sim D_j$ of both tasks. Next, we bound each summand for $k = 0,\dots,K-1$. Note that when $k = 0$,
$$\left\|\mathbb E_\xi[\nabla L_i(u^i_0)] - \mathbb E_\xi[\nabla L_j(u^j_0)]\right\|^2 = \|\nabla L_i(w_t) - \nabla L_j(w_t)\|^2 \le \delta^2,$$
which is directly obtained from Assumption 3.7.
For $k > 0$, inserting $\pm\mathbb E_\xi[\nabla L_j(u^i_k)]$,
$$\left\|\mathbb E_\xi[\nabla L_i(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2 \le 2\left\|\mathbb E_\xi[\nabla L_i(u^i_k) - \nabla L_j(u^i_k)]\right\|^2 + 2\left\|\mathbb E_\xi[\nabla L_j(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2 \le 2\delta^2 + 2\left\|\mathbb E_\xi[\nabla L_j(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2, \tag{12}$$
where the first term is bounded by Assumption 3.7. For the second term in (12), using the update rule, Jensen's inequality and Assumption 3.2,
$$\left\|\mathbb E_\xi[\nabla L_j(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2 \le M^2\,\mathbb E_\xi\left\|u^i_{k-1} - \alpha\nabla\ell(u^i_{k-1};\xi^i_{k-1,t}) - u^j_{k-1} + \alpha\nabla\ell(u^j_{k-1};\xi^j_{k-1,t})\right\|^2 \le M^2\,\mathbb E_\xi\left[2\|u^i_{k-1} - u^j_{k-1}\|^2 + 2\alpha^2\|\nabla\ell(u^i_{k-1};\xi^i_{k-1,t}) - \nabla\ell(u^j_{k-1};\xi^j_{k-1,t})\|^2\right]. \tag{13}$$
For the first term in (13), by the Cauchy-Schwarz inequality,
$$\mathbb E_\xi\|u^i_k - u^j_k\|^2 = \mathbb E_\xi\left\|\sum_{l=0}^{k-1}\alpha\left(\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla\ell(u^j_l;\xi^j_{l,t})\right)\right\|^2 \le k\alpha^2\sum_{l=0}^{k-1}\mathbb E_\xi\|\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla\ell(u^j_l;\xi^j_{l,t})\|^2. \tag{14}$$
Substituting (14) into (13),
$$\left\|\mathbb E_\xi[\nabla L_j(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2 \le 2(k-1)M^2\alpha^2\sum_{l=0}^{k-2}\mathbb E_\xi\|\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla\ell(u^j_l;\xi^j_{l,t})\|^2 + 2kM^2\alpha^2\,\mathbb E_\xi\|\nabla\ell(u^i_{k-1};\xi^i_{k-1,t}) - \nabla\ell(u^j_{k-1};\xi^j_{k-1,t})\|^2. \tag{15}$$
We now bound $\mathbb E_\xi\|\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla\ell(u^j_l;\xi^j_{l,t})\|^2$ for $l = 0,\dots,K-1$. First,
$$\|\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla\ell(u^j_l;\xi^j_{l,t})\|^2 \le 4\|\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla L_i(u^i_l)\|^2 + 4\|\nabla L_i(u^i_l) - \nabla L_j(u^i_l)\|^2 + 4\|\nabla L_j(u^i_l) - \nabla L_j(u^j_l)\|^2 + 4\|\nabla L_j(u^j_l) - \nabla\ell(u^j_l;\xi^j_{l,t})\|^2. \tag{16}$$
Taking expectations, the first and last terms are bounded by Assumption 3.5, the second by Assumption 3.7, and the third by Assumption 3.2 together with (14):
$$\mathbb E_\xi\|\nabla\ell(u^i_l;\xi^i_{l,t}) - \nabla\ell(u^j_l;\xi^j_{l,t})\|^2 \le 4\delta^2 + 8\sigma^2 + 4M^2\,\mathbb E_\xi\|u^i_l - u^j_l\|^2 \le 4\delta^2 + 8\sigma^2 + 4lM^2\alpha^2\sum_{m=0}^{l-1}\mathbb E_\xi\|\nabla\ell(u^i_m;\xi^i_{m,t}) - \nabla\ell(u^j_m;\xi^j_{m,t})\|^2.$$
Since we always have $l < K$ by definition, this implies the recursion on the partial sums:
$$\sum_{m=0}^{l}\mathbb E_\xi\|\nabla\ell(u^i_m;\xi^i_{m,t}) - \nabla\ell(u^j_m;\xi^j_{m,t})\|^2 \le 4\delta^2 + 8\sigma^2 + (1+4KM^2\alpha^2)\sum_{m=0}^{l-1}\mathbb E_\xi\|\nabla\ell(u^i_m;\xi^i_{m,t}) - \nabla\ell(u^j_m;\xi^j_{m,t})\|^2. \tag{17}$$
To derive the final bound, consider $l = 0$ in (17):
$$\mathbb E_\xi\|\nabla\ell(u^i_0;\xi^i_{0,t}) - \nabla\ell(u^j_0;\xi^j_{0,t})\|^2 = \mathbb E_\xi\|\nabla\ell(w_t;\xi^i_{0,t}) - \nabla L_i(w_t)\|^2 + \|\nabla L_i(w_t) - \nabla L_j(w_t)\|^2 + \mathbb E_\xi\|\nabla L_j(w_t) - \nabla\ell(w_t;\xi^j_{0,t})\|^2 \le \delta^2 + 2\sigma^2,$$
where the cross-terms vanish since the norm only depends on the independent samples $\xi^i_{0,t}, \xi^j_{0,t}$, and the other stochastic samples are irrelevant to the expectation. Unrolling (17) then gives
$$\sum_{m=0}^{l}\mathbb E_\xi\|\nabla\ell(u^i_m;\xi^i_{m,t}) - \nabla\ell(u^j_m;\xi^j_{m,t})\|^2 \le (1+4KM^2\alpha^2)^l\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) - \frac{\delta^2+2\sigma^2}{KM\alpha^2}. \tag{18}$$
Combining (15), (16) and (18), and dropping the $-\frac{\delta^2+2\sigma^2}{KM\alpha^2}$ term, which is always negative,
$$\left\|\mathbb E_\xi[\nabla L_j(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2 \le 2(k-1)M^2\alpha^2(1+4KM^2\alpha^2)^{k-1}\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) + 8kM^2\alpha^2(\delta^2+2\sigma^2). \tag{19}$$
Using (19) in (12),
$$\left\|\mathbb E_\xi[\nabla L_i(u^i_k)] - \mathbb E_\xi[\nabla L_j(u^j_k)]\right\|^2 \le 2\delta^2 + 4(k-1)M^2\alpha^2(1+4KM^2\alpha^2)^{k-1}\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) + 16kM^2\alpha^2(\delta^2+2\sigma^2). \tag{20}$$
Using (20) in (11) and summing over $k$,
$$\left\|\mathbb E_\xi[c^i_t] - \mathbb E_\xi[c^j_t]\right\|^2 \le 2\delta^2 + 4M^2\alpha^2\,\frac{(1+4KM^2\alpha^2)^K}{16K^2M^4\alpha^4}\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) + 8KM^2\alpha^2(\delta^2+2\sigma^2) = 2\delta^2 + \frac{(1+4KM^2\alpha^2)^K}{4K^2M^2\alpha^2}\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) + 8KM^2\alpha^2(\delta^2+2\sigma^2).$$
Finally, defining
$$\tilde\delta^2 = 2\delta^2 + \frac{(1+4KM^2\alpha^2)^K}{4K^2M^2\alpha^2}\left(1+\frac{1}{KM\alpha^2}\right)(\delta^2+2\sigma^2) + 8KM^2\alpha^2(\delta^2+2\sigma^2)$$
concludes the proof.

A.4. Proof of Theorem 3.12

First, we need the following Lemma.

Lemma A.1. If $\eta_t \le \frac{1}{2\tilde M}$ in Algorithm 2, then
$$\mathbb E\,\tilde L(w_{t+1}) \le \mathbb E\,\tilde L(w_t) - \frac{\eta_t}{2}\mathbb E\|\nabla\tilde L(w_t)\|^2 + \frac{\eta_t^2\tilde M}{2}\tilde\sigma^2.$$

Proof. Since $\tilde L$ is $\tilde M$-Lipschitz smooth,
$$\mathbb E\,\tilde L(w_{t+1}) \le \mathbb E\left[\tilde L(w_t) + \nabla\tilde L(w_t)^\top(w_{t+1} - w_t) + \frac{\tilde M}{2}\|w_{t+1} - w_t\|^2\right] = \mathbb E\left[\tilde L(w_t) - \eta_t\nabla\tilde L(w_t)^\top c_t + \frac{\eta_t^2\tilde M}{2}\|c_t\|^2\right].$$
For Algorithm 2, $c_t = \frac{1}{|I_t|}\sum_{i\in I_t}c^i_t$. Since the tasks $i$ are independently sampled, obviously $\mathbb E[c_t] = \mathbb E_i[c^i_t] = \nabla\tilde L(w_t)$, which gives
$$\mathbb E\|c_t\|^2 = \mathbb E\|c_t - \nabla\tilde L(w_t)\|^2 + 2\,\mathbb E\left[(c_t - \nabla\tilde L(w_t))^\top\nabla\tilde L(w_t)\right] + \mathbb E\|\nabla\tilde L(w_t)\|^2 \le \tilde\sigma^2 + \mathbb E\|\nabla\tilde L(w_t)\|^2,$$
where the last inequality comes from Corollary 3.10. Then
$$\mathbb E\,\tilde L(w_{t+1}) \le \mathbb E\left[\tilde L(w_t) - \eta_t\left(1 - \frac{\eta_t\tilde M}{2}\right)\|\nabla\tilde L(w_t)\|^2 + \frac{\eta_t^2\tilde M}{2}\tilde\sigma^2\right].$$
Since $\eta_t \le \frac{1}{2\tilde M}$, we have $\eta_t(1 - \frac{\eta_t\tilde M}{2}) \ge \frac{\eta_t}{2}$, and this concludes the proof. ∎

In Lemma A.1, summing $t$ from $0$ to $T-1$ gives
$$\sum_{t=0}^{T-1}\frac{\eta_t}{2}\mathbb E\|\nabla\tilde L(w_t)\|^2 \le \mathbb E[\tilde L(w_0) - \tilde L(w_T)] + \frac{\tilde M\tilde\sigma^2}{2}\sum_{t=0}^{T-1}\eta_t^2 \le \mathbb E[\tilde L(w_0) - \tilde L(w_T)] + \tilde M\tilde\sigma^2\eta_0^2\ln(T+1),$$
where we have used the fact that $\sum_{t=0}^{T-1}\eta_t^2 = \sum_{t=0}^{T-1}\frac{\eta_0^2}{1+t} \le 2\eta_0^2\ln(T+1)$.
Also, note that $\eta_t > \eta_T$ for any $t = 0,\dots,T-1$, and $\mathbb E\,\tilde L(w_T) \ge \tilde L^*$, which gives
$$\frac{\eta_T}{2}\sum_{t=0}^{T-1}\mathbb E\|\nabla\tilde L(w_t)\|^2 \le \mathbb E[\tilde L(w_0) - \tilde L^*] + \tilde M\tilde\sigma^2\eta_0^2\ln(T+1) = G_1,$$
where $G_1$ denotes the right-hand side for notational simplicity. To obtain the final result, we divide both sides by $\frac{T\eta_T}{2}$, for which we have
$$\frac{2}{T\eta_T} = \frac{2(1+T)^{1/2}}{T\eta_0} \le \frac{2\sqrt{2T}}{T\eta_0} = \frac{2\sqrt2}{\eta_0\sqrt T},$$
which comes from the concavity of the square root function. Then we obtain
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\|\nabla\tilde L(w_t)\|^2 \le \frac{2\sqrt2\,G_1}{\eta_0\sqrt T},$$
which concludes the proof.

A.5. Proof of Theorem 3.13

We extend Theorem 3.13 to the following, which allows a flexible choice of $\eta_t$; Theorem 3.13 is the special case $b = \frac14$.

Theorem A.2. For any $b > 0$, set
$$\eta_t = \frac{b}{\tilde M}\left(\left(7b+\frac{1}{28b^2}\right)^3 + t\right)^{-1/3}, \qquad \gamma_{t+1} = c\eta_t^2 \text{ for all } t = 0,\dots,T-1, \quad\text{where } c = \tilde M^2\left(28+\frac{1}{7b^3}\right).$$
Then VR-Reptile (Algorithm 3) satisfies
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\|\nabla\tilde L(w_t)\|^2 \le \frac{G_2\tilde M/b}{T^{2/3}} + \frac{G_2\tilde M\left(7+\frac{1}{28b^3}\right)}{T},$$
where $G_2 = 8\,\mathbb E[\tilde L(w_0) - \tilde L^*] + \left(7+\frac{1}{28b^3}\right)\frac{\tilde\sigma^2}{4\tilde M} + \frac{c^2b^3\tilde\sigma^2}{16\tilde M^5}\ln T$ and $\tilde L^* = \min_w\tilde L(w)$.

First, we need the following Lemma.

Lemma A.3. If $\eta_t \le \frac{1}{4\tilde M}$ in Algorithm 3, then
$$\mathbb E\,\tilde L(w_{t+1}) \le \mathbb E\,\tilde L(w_t) - \frac{\eta_t}{4}\mathbb E\|\nabla\tilde L(w_t)\|^2 + \frac{3\eta_t}{4}\mathbb E\|e_t\|^2,$$
where $e_t = \bar c_t - \nabla\tilde L(w_t)$ and the expectation is taken over all tasks and training data.

Proof. Since $\tilde L$ is $\tilde M$-Lipschitz smooth,
$$\mathbb E\,\tilde L(w_{t+1}) \le \mathbb E\left[\tilde L(w_t) - \eta_t\nabla\tilde L(w_t)^\top\bar c_t + \frac{\eta_t^2\tilde M}{2}\|\bar c_t\|^2\right].$$
Since $\bar c_t = e_t + \nabla\tilde L(w_t)$, we have
$$\nabla\tilde L(w_t)^\top\bar c_t = \|\nabla\tilde L(w_t)\|^2 + \nabla\tilde L(w_t)^\top e_t, \qquad \|\bar c_t\|^2 = \|e_t + \nabla\tilde L(w_t)\|^2 \le 2\|e_t\|^2 + 2\|\nabla\tilde L(w_t)\|^2.$$
Together with $-\nabla\tilde L(w_t)^\top e_t \le \frac12(\|\nabla\tilde L(w_t)\|^2 + \|e_t\|^2)$, this gives
$$\mathbb E\,\tilde L(w_{t+1}) \le \mathbb E\left[\tilde L(w_t) - \frac{\eta_t}{2}(1 - 2\eta_t\tilde M)\|\nabla\tilde L(w_t)\|^2 + \frac{\eta_t}{2}(1 + 2\eta_t\tilde M)\|e_t\|^2\right].$$
Since $\eta_t \le \frac{1}{4\tilde M}$, we have $\frac{\eta_t}{2}(1 - 2\eta_t\tilde M) \ge \frac{\eta_t}{4}$ and $\frac{\eta_t}{2}(1 + 2\eta_t\tilde M) \le \frac{3\eta_t}{4}$, which concludes the proof. ∎

The following Lemma bounds $\|e_t\|^2$.

Lemma A.4. For any $t > 0$, we have
$$\mathbb E\|e_t\|^2 \le \frac{2\gamma_t^2\tilde\sigma^2}{|I_t|} + (1-\gamma_t)^2(1 + 4\tilde M^2\eta_{t-1}^2)\,\mathbb E\|e_{t-1}\|^2 + 4(1-\gamma_t)^2\tilde M^2\eta_{t-1}^2\,\mathbb E\|\nabla\tilde L(w_{t-1})\|^2.$$

Proof. From Algorithm 3,
$$e_t = \bar c_t - \nabla\tilde L(w_t) = \gamma_t\left(\frac{1}{|I_t|}\sum_{i\in I_t}c^i_t - \nabla\tilde L(w_t)\right) + (1-\gamma_t)\left(\frac{1}{|I_t|}\sum_{i\in I_t}(c^i_t - d^i_{t-1}) - (\nabla\tilde L(w_t) - \nabla\tilde L(w_{t-1}))\right) + (1-\gamma_t)e_{t-1},$$
where the last term is exactly $(1-\gamma_t)e_{t-1}$ and is independent of the first two terms. Using the fact that for any two vectors $w, u$ we always have $\|w+u\|^2 \le 2\|w\|^2 + 2\|u\|^2$,
$$\mathbb E\|e_t\|^2 \le 2\gamma_t^2\,\mathbb E\left\|\frac{1}{|I_t|}\sum_{i\in I_t}c^i_t - \nabla\tilde L(w_t)\right\|^2 + 2(1-\gamma_t)^2\,\mathbb E\left\|\frac{1}{|I_t|}\sum_{i\in I_t}(c^i_t - d^i_{t-1}) - (\nabla\tilde L(w_t) - \nabla\tilde L(w_{t-1}))\right\|^2 + (1-\gamma_t)^2\,\mathbb E\|e_{t-1}\|^2. \tag{21}$$
For the first term in (21), since the tasks $i$ are independently sampled,
$$\mathbb E\left\|\frac{1}{|I_t|}\sum_{i\in I_t}c^i_t - \nabla\tilde L(w_t)\right\|^2 = \frac{1}{|I_t|}\mathbb E\|c^i_t - \nabla\tilde L(w_t)\|^2 \le \frac{\tilde\sigma^2}{|I_t|}.$$
For the second term in (21), similarly,
$$\mathbb E\left\|\frac{1}{|I_t|}\sum_{i\in I_t}(c^i_t - d^i_{t-1}) - (\nabla\tilde L(w_t) - \nabla\tilde L(w_{t-1}))\right\|^2 \le \frac{1}{|I_t|}\sum_{i\in I_t}\mathbb E\|c^i_t - d^i_{t-1}\|^2.$$
For each $\|c^i_t - d^i_{t-1}\|^2$, Proposition 3.3 gives $\|c^i_t - d^i_{t-1}\|^2 \le \tilde M^2\|w_t - w_{t-1}\|^2$, as $c^i_t$ and $d^i_{t-1}$ are the gradients of $\tilde\ell$ at $w_t$ and $w_{t-1}$ on the same samples. Combining all these, and using $w_t - w_{t-1} = -\eta_{t-1}\bar c_{t-1}$ with $\|\bar c_{t-1}\|^2 \le 2\|e_{t-1}\|^2 + 2\|\nabla\tilde L(w_{t-1})\|^2$,
$$\mathbb E\|e_t\|^2 \le \frac{2\gamma_t^2\tilde\sigma^2}{|I_t|} + 2(1-\gamma_t)^2\tilde M^2\eta_{t-1}^2\,\mathbb E\|\bar c_{t-1}\|^2 + (1-\gamma_t)^2\,\mathbb E\|e_{t-1}\|^2 \le \frac{2\gamma_t^2\tilde\sigma^2}{|I_t|} + (1-\gamma_t)^2(1 + 4\tilde M^2\eta_{t-1}^2)\,\mathbb E\|e_{t-1}\|^2 + 4(1-\gamma_t)^2\tilde M^2\eta_{t-1}^2\,\mathbb E\|\nabla\tilde L(w_{t-1})\|^2,$$
which concludes the proof. ∎

Now we are ready to prove Theorem A.2 (and hence Theorem 3.13).

Proof.
Now we are ready to prove Theorem 3.13.

Proof of Theorem 3.13. From Lemma A.4, we first consider bounding $\mathbb{E}[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2]$, i.e., the difference in the (scaled) variance, which is given by:
$$\mathbb{E}\big[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2\big] \le \frac{2\gamma_{t+1}^2\bar\sigma^2}{\eta_t|I_{t+1}|} + \Big(\frac{(1-\gamma_{t+1})^2(1+4\bar M^2\eta_t^2)}{\eta_t} - \frac{1}{\eta_{t-1}}\Big)\mathbb{E}\|e_t\|^2 + 4(1-\gamma_{t+1})^2\bar M^2\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2.$$
Recall that $\eta_t = \frac{b}{\bar M\left((7b+\frac{1}{28b^2})^3+t\right)^{1/3}}$ and $\gamma_{t+1} = c\eta_t^2$, which implies that
$$\eta_t \le \eta_0 = \frac{1}{\bar M(7+\frac{1}{28b^3})} < \frac{1}{4\bar M}, \qquad \gamma_{t+1} \le c\eta_0^2 = \bar M^2\Big(28+\frac{1}{7b^3}\Big)\cdot\frac{1}{\bar M^2(7+\frac{1}{28b^3})^2} = \frac{28+\frac{1}{7b^3}}{(7+\frac{1}{28b^3})^2} < 1,$$
i.e., $\eta_t \le \frac{1}{4\bar M}$ and $0 < \gamma_{t+1} \le 1$ always hold for $t \ge 0$. Thus, we have $0 \le 1-\gamma_{t+1} < 1$, which gives:
$$\mathbb{E}\big[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2\big] \le \frac{2\gamma_{t+1}^2\bar\sigma^2}{\eta_t|I_{t+1}|} + \Big(\frac{1-\gamma_{t+1}}{\eta_t} - \frac{1}{\eta_{t-1}} + 4\bar M^2\eta_t\Big)\mathbb{E}\|e_t\|^2 + 4\bar M^2\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2.$$
Summing $t$ from $0$ to $T-1$, we have:
$$\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2\big] \le \sum_{t=0}^{T-1}\frac{2\gamma_{t+1}^2\bar\sigma^2}{\eta_t|I_{t+1}|} + \sum_{t=0}^{T-1}\Big(\frac{1-\gamma_{t+1}}{\eta_t} - \frac{1}{\eta_{t-1}} + 4\bar M^2\eta_t\Big)\mathbb{E}\|e_t\|^2 + 4\bar M^2\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2. \tag{22}$$

For the first term in (22), since $\eta_t = \frac{b}{\bar M\left((7b+\frac{1}{28b^2})^3+t\right)^{1/3}}$ and $\gamma_{t+1} = c\eta_t^2$, we have:
$$\frac{2\gamma_{t+1}^2\bar\sigma^2}{\eta_t|I_{t+1}|} = \frac{2c^2\eta_t^3\bar\sigma^2}{|I_{t+1}|} = \frac{2c^2b^3\bar\sigma^2}{\bar M^3\big((7b+\frac{1}{28b^2})^3+t\big)|I_{t+1}|} \le \frac{2c^2b^3\bar\sigma^2}{\bar M^3(2+t)},$$
where the last two inequalities come from $(7b+\frac{1}{28b^2})^3 > 2$ and $|I_{t+1}| \ge 1$. Summing over $t$ then gives $\sum_{t=0}^{T-1}\frac{2\gamma_{t+1}^2\bar\sigma^2}{\eta_t|I_{t+1}|} \le \frac{2c^2b^3\bar\sigma^2}{\bar M^3}\ln T$.

For the second term in (22), we first bound $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}$. Obviously, $x^{1/3}$ is a concave function, which gives $(x+y)^{1/3} \le x^{1/3} + \frac{y}{3}x^{-2/3}$. Therefore, we have:
$$\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} = \frac{\bar M}{b}\bigg[\Big(\Big(7b+\frac{1}{28b^2}\Big)^3+t\Big)^{1/3} - \Big(\Big(7b+\frac{1}{28b^2}\Big)^3+(t-1)\Big)^{1/3}\bigg] \le \frac{\bar M}{3b}\Big(\Big(7b+\frac{1}{28b^2}\Big)^3+(t-1)\Big)^{-2/3}.$$
Since $(7b+\frac{1}{28b^2})^3 > 2$ for $b > 0$, we have $\frac{1}{2}\big((7b+\frac{1}{28b^2})^3+t\big) \le (7b+\frac{1}{28b^2})^3 + t - 1$ for any $t \ge 0$, which implies that:
$$\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \le \frac{\bar M}{3b}\cdot 2^{2/3}\Big(\Big(7b+\frac{1}{28b^2}\Big)^3+t\Big)^{-2/3} = \frac{2^{2/3}\bar M^3}{3b^3}\eta_t^2.$$
Since $\eta_t \le \frac{1}{4\bar M}$, we have:
$$\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \le \frac{2^{2/3}\bar M^3}{3b^3}\eta_t^2 \le \frac{2^{2/3}\bar M^2}{12b^3}\eta_t < \frac{\bar M^2}{7b^3}\eta_t, \tag{23}$$
where the last inequality comes from $12/2^{2/3} > 7$. For the $-\frac{\gamma_{t+1}}{\eta_t} + 4\bar M^2\eta_t$ term, we have:
$$-\frac{\gamma_{t+1}}{\eta_t} + 4\bar M^2\eta_t = (4\bar M^2 - c)\eta_t = \Big(4\bar M^2 - \bar M^2\Big(28+\frac{1}{7b^3}\Big)\Big)\eta_t = -24\bar M^2\eta_t - \frac{\bar M^2}{7b^3}\eta_t. \tag{24}$$
Combining (23) and (24), we have:
$$\Big(\frac{1-\gamma_{t+1}}{\eta_t} - \frac{1}{\eta_{t-1}} + 4\bar M^2\eta_t\Big)\mathbb{E}\|e_t\|^2 \le \Big(\frac{\bar M^2}{7b^3}\eta_t - 24\bar M^2\eta_t - \frac{\bar M^2}{7b^3}\eta_t\Big)\mathbb{E}\|e_t\|^2 = -24\bar M^2\eta_t\,\mathbb{E}\|e_t\|^2.$$
Now we have:
$$\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2\big] \le \frac{2c^2b^3\bar\sigma^2}{\bar M^3}\ln T - 24\bar M^2\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|e_t\|^2 + 4\bar M^2\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2.$$
Dividing both sides by $32\bar M^2$ gives:
$$\frac{1}{32\bar M^2}\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2\big] \le \frac{c^2b^3\bar\sigma^2}{16\bar M^5}\ln T - \frac{3}{4}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|e_t\|^2 + \frac{1}{8}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2.$$

Consider the potential function $\Phi_t = L(w_t) + \frac{1}{32\bar M^2}\eta_{t-1}^{-1}\|e_t\|^2$. We have:
$$\mathbb{E}[\Phi_{t+1} - \Phi_t] = \mathbb{E}[L(w_{t+1}) - L(w_t)] + \mathbb{E}\Big[\frac{1}{32\bar M^2}\eta_t^{-1}\|e_{t+1}\|^2 - \frac{1}{32\bar M^2}\eta_{t-1}^{-1}\|e_t\|^2\Big] \le \mathbb{E}\Big[-\frac{\eta_t}{4}\|\nabla L(w_t)\|^2 + \frac{3\eta_t}{4}\|e_t\|^2 + \frac{1}{32\bar M^2}\eta_t^{-1}\|e_{t+1}\|^2 - \frac{1}{32\bar M^2}\eta_{t-1}^{-1}\|e_t\|^2\Big],$$
where the first part is bounded from Lemma A.3. Summing $t$ from $0$ to $T-1$ gives:
$$\begin{aligned}
\mathbb{E}[\Phi_T - \Phi_0]
&\le \sum_{t=0}^{T-1}\mathbb{E}\Big[-\frac{\eta_t}{4}\|\nabla L(w_t)\|^2 + \frac{3\eta_t}{4}\|e_t\|^2\Big] + \frac{1}{32\bar M^2}\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t^{-1}\|e_{t+1}\|^2 - \eta_{t-1}^{-1}\|e_t\|^2\big] \\
&\le \sum_{t=0}^{T-1}\mathbb{E}\Big[-\frac{\eta_t}{4}\|\nabla L(w_t)\|^2 + \frac{3\eta_t}{4}\|e_t\|^2\Big] + \frac{c^2b^3\bar\sigma^2}{16\bar M^5}\ln T - \frac{3}{4}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|e_t\|^2 + \frac{1}{8}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2 \\
&= -\frac{1}{8}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla L(w_t)\|^2 + \frac{c^2b^3\bar\sigma^2}{16\bar M^5}\ln T,
\end{aligned}$$
which implies that:
$$\sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t\|\nabla L(w_t)\|^2\big] \le 8\,\mathbb{E}[\Phi_0 - \Phi_T] + \frac{c^2b^3\bar\sigma^2}{16\bar M^5}\ln T \le 8\,\mathbb{E}[L(w_0) - L^*] + \frac{1}{4\bar M^2}\eta_{-1}^{-1}\mathbb{E}\|e_0\|^2 + \frac{c^2b^3\bar\sigma^2}{16\bar M^5}\ln T.$$
Since $\eta_t$ is decreasing w.r.t. $t$, we have $\eta_t > \eta_T$ for $t = 0,\ldots,T-1$. For $\eta_{-1}$, we have:
$$\frac{1}{\eta_{-1}} = \frac{\bar M\big((7b+\frac{1}{28b^2})^3 - 1\big)^{1/3}}{b} \le \bar M\Big(7+\frac{1}{28b^3}\Big).$$
For $\mathbb{E}\|e_0\|^2 = \mathbb{E}\|c_0 - \nabla L(w_0)\|^2$, we have $\mathbb{E}\|e_0\|^2 \le \bar\sigma^2$ from Proposition 3.10. Combining all these,
$$\eta_T\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla L(w_t)\|^2\big] \le \sum_{t=0}^{T-1}\mathbb{E}\big[\eta_t\|\nabla L(w_t)\|^2\big] \le 8\,\mathbb{E}[L(w_0) - L^*] + \Big(7+\frac{1}{28b^3}\Big)\frac{\bar\sigma^2}{4\bar M} + \frac{c^2b^3\bar\sigma^2}{16\bar M^5}\ln T = G_2,$$
where $G_2$ denotes the whole right-hand side for notational simplicity.
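(As a side check that is not part of the proof, the two schedule conditions established above are easy to test numerically. The Python sketch below, with an assumed smoothness constant $\bar M$, confirms $\eta_t \le \frac{1}{4\bar M}$ and $0 < \gamma_{t+1} \le 1$ over a grid of $b$ and $t$.)

```python
# Numerical sanity check (ours) of the schedule conditions used in the proof:
#   eta_t = b / (M_bar * ((7b + 1/(28 b^2))^3 + t)^(1/3)),
#   gamma_{t+1} = c * eta_t^2 with c = M_bar^2 * (28 + 1/(7 b^3)),
# which should satisfy eta_t <= 1/(4 M_bar) and 0 < gamma_{t+1} <= 1 for t >= 0.
M_bar = 2.0  # assumed smoothness constant
for b in [0.1, 0.5, 1.0, 5.0]:
    c = M_bar**2 * (28 + 1 / (7 * b**3))
    for t in range(10000):
        eta = b / (M_bar * ((7 * b + 1 / (28 * b**2)) ** 3 + t) ** (1 / 3))
        gamma = c * eta**2
        assert eta <= 1 / (4 * M_bar) and 0 < gamma <= 1
print("eta_t <= 1/(4 M_bar) and gamma_{t+1} in (0, 1] hold on the tested grid")
```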
To obtain the final result, we divide both sides by $T\eta_T$, for which we have:
$$\frac{1}{T\eta_T} = \frac{\bar M\big((7b+\frac{1}{28b^2})^3 + T\big)^{1/3}}{bT} \le \frac{\bar M\big((7b+\frac{1}{28b^2}) + T^{1/3}\big)}{bT} = \frac{\bar M/b}{T^{2/3}} + \frac{\bar M(7+\frac{1}{28b^3})}{T},$$
where we use $(x+y)^{1/3} \le x^{1/3} + y^{1/3}$. Then we have:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla L(w_t)\|^2\big] \le \frac{G_2\,\bar M/b}{T^{2/3}} + \frac{G_2\,\bar M(7+\frac{1}{28b^3})}{T},$$
which concludes the proof.

B. Proposed Variance-Reduced Variants for Popular Meta-Learning Algorithms

B.1. Integration with MAML and FOMAML

Besides the integration with Reptile as discussed in Section 3.1, the proposed method can also be integrated with other meta-learning algorithms such as MAML, FOMAML and BMG. The resultant algorithms, which will be called VR-MAML, VR-FOMAML and VR-BMG, respectively, are shown in Algorithm 5. As with VR-Reptile, we introduce the additional estimates $c_t$ and $d_{t-1}$, and apply the same variance reduction steps (a Python sketch of this update pattern follows the two listings below).

B.2. Combining STORM with Meta-Learning

Algorithm 6 shows how STORM can be integrated into meta-learning algorithms, by replacing all stochastic gradients in these algorithms with the variance-reduced gradients of STORM.

Algorithm 5 VR-MAML/FOMAML/BMG (Variance-Reduced MAML/FOMAML/BMG)
1: Input: $w_0$, step sizes $\{\eta_t\}$ and $\alpha$, number of local steps $K$, decay parameters $\{\gamma_t\}$ ($\gamma_t = 1$ means no variance reduction).
2: sample tasks $I_0 \subseteq I$
3: for $i \in I_0$ do
4:   $u^i_{0,0} = w_0$
5:   for $k = 0$ to $K-1$ do
6:     obtain samples $\xi^i_{k,0}$ from the support data of task $i$
7:     $u^i_{k+1,0} = u^i_{k,0} - \alpha\nabla\ell(u^i_{k,0}, \xi^i_{k,0})$
8:   end for
9:   if VR-BMG then
10:     for $k = K$ to $K+M-1$ do
11:       obtain samples $\xi^i_{k,0}$ from the support data of task $i$
12:       $u^i_{k+1,0} = u^i_{k,0} - \alpha\nabla\ell(u^i_{k,0}, \xi^i_{k,0})$
13:     end for
14:   end if
15:   obtain samples $\xi^i_{K,0}$ from the query data of task $i$
16:   if VR-MAML then
17:     $c^i_0 = \nabla_{w_0}\ell(u^i_{K,0}, \xi^i_{K,0})$
18:   else if VR-FOMAML then
19:     $c^i_0 = \nabla\ell(u^i_{K,0}, \xi^i_{K,0})$
20:   else if VR-BMG then
21:     $c^i_0 = \nabla_{w_0}\big(\frac{1}{2}\|u^i_{K+M,0} - u^i_{K,0}\|^2\big)$
22:   end if
23: end for
24: $c_0 = \frac{1}{|I_0|}\sum_{i\in I_0} c^i_0$
25: $w_1 = w_0 - \eta_0 c_0$
26: for $t = 1$ to $T-1$ do
27:   sample tasks $I_t \subseteq I$
28:   for $i \in I_t$ do
29:     $u^i_{0,t} = w_t$, $v^i_{0,t-1} = w_{t-1}$
30:     for $k = 0$ to $K-1$ do
31:       obtain samples $\xi^i_{k,t}$ from the support data of task $i$
32:       $u^i_{k+1,t} = u^i_{k,t} - \alpha\nabla\ell(u^i_{k,t}, \xi^i_{k,t})$, $v^i_{k+1,t-1} = v^i_{k,t-1} - \alpha\nabla\ell(v^i_{k,t-1}, \xi^i_{k,t})$
33:     end for
34:     if VR-BMG then
35:       for $k = K$ to $K+M-1$ do
36:         obtain samples $\xi^i_{k,t}$ from the support data of task $i$
37:         $u^i_{k+1,t} = u^i_{k,t} - \alpha\nabla\ell(u^i_{k,t}, \xi^i_{k,t})$, $v^i_{k+1,t-1} = v^i_{k,t-1} - \alpha\nabla\ell(v^i_{k,t-1}, \xi^i_{k,t})$
38:       end for
39:     end if
40:     obtain samples $\xi^i_{K,t}$ from the query data of task $i$
41:     if VR-MAML then
42:       $d^i_{t-1} = \nabla_{w_{t-1}}\ell(v^i_{K,t-1}, \xi^i_{K,t})$, $c^i_t = \nabla_{w_t}\ell(u^i_{K,t}, \xi^i_{K,t})$
43:     else if VR-FOMAML then
44:       $d^i_{t-1} = \nabla\ell(v^i_{K,t-1}, \xi^i_{K,t})$, $c^i_t = \nabla\ell(u^i_{K,t}, \xi^i_{K,t})$
45:     else if VR-BMG then
46:       $d^i_{t-1} = \nabla_{w_{t-1}}\big(\frac{1}{2}\|v^i_{K+M,t-1} - v^i_{K,t-1}\|^2\big)$, $c^i_t = \nabla_{w_t}\big(\frac{1}{2}\|u^i_{K+M,t} - u^i_{K,t}\|^2\big)$
47:     end if
48:   end for
49:   $d_{t-1} = \frac{1}{|I_t|}\sum_{i\in I_t} d^i_{t-1}$
50:   $c_t = \frac{1}{|I_t|}\sum_{i\in I_t} c^i_t + (1-\gamma_t)(c_{t-1} - d_{t-1})$
51:   $w_{t+1} = w_t - \eta_t c_t$
52: end for

Algorithm 6 MAML/FOMAML/Reptile+STORM: meta-learning algorithms with stochastic gradients replaced by STORM.
1: Input: $w_0$, step sizes $\{\eta_t\}$ and $\alpha$, number of local steps $K$.
2: for $t = 0$ to $T-1$ do
3:   sample tasks $I_t \subseteq I$
4:   for $i \in I_t$ do
5:     $u^i_{0,t} = w_t$
6:     for $k = 0$ to $K-1$ do
7:       obtain data samples $\xi^i_{k,t}$ from the support data of task $i$
8:       if $k > 0$ then
9:         $m^i_{k,t} = \nabla\ell(u^i_{k,t}, \xi^i_{k,t}) + (1-\gamma_k)\big(m^i_{k-1,t} - \nabla\ell(u^i_{k-1,t}, \xi^i_{k,t})\big)$
10:      else
11:        $m^i_{k,t} = \nabla\ell(u^i_{k,t}, \xi^i_{k,t})$
12:      end if
13:      $u^i_{k+1,t} = u^i_{k,t} - \alpha m^i_{k,t}$
14:    end for
15:    obtain data samples $\xi^i_{K,t}$ from the query data of task $i$
16:    $c^i_t = \nabla_{w_t}\ell(u^i_{K,t}, \xi^i_{K,t})$ (MAML), $c^i_t = \nabla\ell(u^i_{K,t}, \xi^i_{K,t})$ (FOMAML), or $c^i_t = \frac{1}{K\alpha}(w_t - u^i_{K,t})$ (Reptile)
17:  end for
18:  $c_t = \frac{1}{|I_t|}\sum_{i\in I_t} c^i_t$
19:  $w_{t+1} = w_t - \eta_t c_t$
20: end for
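To make the $c_t$/$d_{t-1}$ pattern of Algorithm 5 concrete, the following minimal Python sketch implements one variance-reduced meta-step with the Reptile meta-gradient $c^i_t = \frac{1}{K\alpha}(w_t - u^i_{K,t})$. It is our own illustration: the task interface (`grad`, `draw`), the toy quadratic tasks, and all hyperparameter values are assumptions rather than the paper's implementation. It mirrors two key points of the listings above: the replayed trajectory from $w_{t-1}$ reuses the same samples as the trajectory from $w_t$, and only $c_{t-1}$ is carried across meta-iterations, so no task-specific parameters are stored.

```python
import numpy as np

def vr_meta_step(w, w_prev, c_prev, sampled_tasks, alpha=0.01, K=5,
                 eta=0.1, gamma=0.1):
    """One variance-reduced meta-update (a sketch of lines 27-51 of Algorithm 5,
    instantiated with the Reptile meta-gradient c_i = (w - u_K) / (K * alpha)).
    Each task is a pair (grad, draw): grad(u, xi) returns a stochastic inner
    gradient at u on sample xi, and draw() draws a fresh support sample --
    both are assumed placeholders for illustration."""
    c_list, d_list = [], []
    for grad, draw in sampled_tasks:
        u, v = w.copy(), w_prev.copy()   # trajectories from w_t and w_{t-1}
        for _ in range(K):
            xi = draw()                  # the SAME sample drives both trajectories
            u = u - alpha * grad(u, xi)
            v = v - alpha * grad(v, xi)
        c_list.append((w - u) / (K * alpha))       # c_t^i at w_t
        d_list.append((w_prev - v) / (K * alpha))  # d_{t-1}^i at w_{t-1}
    # STORM-style correction (line 50); only c_prev is carried between rounds,
    # so no per-task parameters are stored across meta-iterations.
    c = np.mean(c_list, axis=0) + (1 - gamma) * (c_prev - np.mean(d_list, axis=0))
    return w - eta * c, c                          # line 51

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 3
    def make_task(center):
        # Toy quadratic task (assumed example): loss 0.5 * ||u - (center + noise)||^2.
        grad = lambda u, xi: u - (center + xi)
        draw = lambda: 0.1 * rng.standard_normal(dim)
        return grad, draw
    tasks = [make_task(rng.standard_normal(dim)) for _ in range(4)]
    w_prev = np.zeros(dim)
    # First step uses gamma = 1 (no correction), as in lines 2-25 of Algorithm 5.
    w, c = vr_meta_step(np.zeros(dim), w_prev, np.zeros(dim), tasks, gamma=1.0)
    for _ in range(50):  # a real run would resample the task mini-batch each round
        w_next, c = vr_meta_step(w, w_prev, c, tasks)
        w_prev, w = w, w_next
    print("meta-parameter after 50 VR steps:", w)
```

Setting `gamma = 1.0` recovers the plain update without variance reduction, matching the remark on the input line of Algorithm 5.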