Journal of Machine Learning Research 23 (2022) 1-41. Submitted 7/20; Revised 5/21; Published 2/22.

Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning

Kaiyi Ji (ji.367@osu.edu), Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
Junjie Yang (yang.4972@osu.edu), Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
Yingbin Liang (liang.889@osu.edu), Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA

Editor: Suvrit Sra

Abstract

As a popular meta-learning approach, the model-agnostic meta-learning (MAML) algorithm has been widely used due to its simplicity and effectiveness. However, the convergence of the general multi-step MAML still remains unexplored. In this paper, we develop a new theoretical framework to provide such a convergence guarantee for two types of objective functions that are of interest in practice: (a) the resampling case (e.g., reinforcement learning), where loss functions take the form of an expectation and new data are sampled as the algorithm runs; and (b) the finite-sum case (e.g., supervised learning), where loss functions take the finite-sum form with given samples. For both cases, we characterize the convergence rate and the computational complexity to attain an $\epsilon$-accurate solution for multi-step MAML in the general nonconvex setting. In particular, our results suggest that the inner-stage stepsize needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence. From the technical perspective, we develop novel techniques to deal with the nested structure of the meta gradient for multi-step MAML, which can be of independent interest.

Keywords: computational complexity, convergence rate, finite-sum, meta-learning, multi-step MAML, nonconvex, resampling
©2022 Kaiyi Ji, Junjie Yang, and Yingbin Liang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-720.html.

1. Introduction

Meta-learning, or learning to learn (Thrun and Pratt, 2012; Naik and Mammone, 1992; Bengio et al., 1991; Schmidhuber, 1987), is a powerful tool for quickly learning new tasks by using the prior experience from related tasks. Recent works have empowered this idea with neural networks, and their proposed meta-learning algorithms have been shown to enable fast learning over unseen tasks using only a few samples by efficiently extracting the knowledge from a range of observed tasks (Santoro et al., 2016; Vinyals et al., 2016; Finn et al., 2017a). Current meta-learning algorithms can be generally categorized into metric-learning based (Koch et al., 2015; Snell et al., 2017), model-based (Vinyals et al., 2016; Munkhdalai and Yu, 2017), and optimization-based (Finn et al., 2017a; Nichol and Schulman, 2018; Rajeswaran et al., 2019) approaches. Among them, optimization-based meta-learning is a simple and effective approach used in a wide range of domains including classification/regression (Rajeswaran et al., 2019), reinforcement learning (Finn et al., 2017a), robotics (Al-Shedivat et al., 2018), federated learning (Chen et al., 2018), and imitation learning (Finn et al., 2017b).

Model-agnostic meta-learning (MAML) (Finn et al., 2017a) is a popular optimization-based method, which is simple and generally compatible with models trained with gradient descent. MAML consists of two nested stages, where the inner stage runs a few steps of (stochastic) gradient descent for each individual task, and the outer stage updates the meta parameter over all the sampled tasks.
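The two-stage structure can be sketched in code. The following is a minimal, first-order illustration (it omits the second-order terms of the exact meta gradient, in the spirit of the FOMAML variant discussed below); the one-dimensional quadratic task family, stepsizes, and iteration counts are all illustrative assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task family: 1-D least squares l_i(w) = 0.5 * (a*w - b)^2,
# with task-specific (a, b) drawn when the task is sampled.
def sample_task():
    a, b = rng.normal(1.0, 0.1), rng.normal(0.0, 1.0)
    return lambda w: a * (a * w - b)  # gradient of l_i at w

def maml_step(w, alpha, beta, n_inner=5, n_tasks=8):
    """One outer-stage update. The inner stage adapts to each task with
    n_inner gradient steps; the outer gradient is taken at the adapted
    parameter (first-order variant: the (I - alpha * Hessian) products
    of the exact meta gradient are omitted)."""
    meta_grad = 0.0
    for _ in range(n_tasks):
        grad = sample_task()
        w_i = w                        # inner stage initializes at w
        for _ in range(n_inner):
            w_i -= alpha * grad(w_i)
        meta_grad += grad(w_i)         # outer stage: gradient at w_i
    return w - beta * meta_grad / n_tasks

w = 5.0
for _ in range(200):                   # outer stage over sampled tasks
    w = maml_step(w, alpha=0.02, beta=0.1)
```

Starting from a poor initialization, the meta parameter drifts toward a point from which a few inner steps solve any sampled task well.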
The goal of MAML is to find a good meta initialization $w$ based on the observed tasks such that for a new task, starting from this $w$, a few (stochastic) gradient steps suffice to find a good model parameter. Such an algorithm has been demonstrated to have superior empirical performance (Antoniou et al., 2019; Grant et al., 2018; Zintgraf et al., 2018; Nichol et al., 2018). Recently, the theoretical convergence of MAML has also been studied. Specifically, Finn et al. (2019) extended MAML to the online setting, and analyzed the regret for strongly convex objective functions. Fallah et al. (2020a) provided an analysis of one-step MAML for general nonconvex functions, where each inner stage takes a single stochastic gradient descent (SGD) step.

In practice, MAML training often takes multiple SGD steps at the inner stage, for example in Finn et al. (2017a); Antoniou et al. (2019) for supervised learning and in Finn et al. (2017a); Fallah et al. (2020b) for reinforcement learning, in order to attain a higher test accuracy (i.e., better generalization performance) even at the price of a higher computational cost. Compared to single-step MAML, multi-step MAML has been shown to achieve better test performance. For example, as shown in Fig. 5 of Finn et al. (2017a) and Table 2 of Antoniou et al. (2019), the test accuracy improves as the number of inner-loop steps increases. In particular, in the original MAML work (Finn et al., 2017a), 5 inner-loop steps are taken in the training of a 20-way convolutional MAML model. In addition, some important variants of MAML also take multiple inner-loop steps, which include but are not limited to ANIL (Almost No Inner Loop) (Raghu et al., 2020) and BOIL (Body Only update in Inner Loop) (Oh et al., 2021). For these reasons, it is important and meaningful to analyze the convergence of multi-step MAML, and the resulting analysis can be helpful for studying other MAML-type variants.
However, the theoretical convergence of such multi-step MAML algorithms has not been established yet. In fact, several mathematical challenges arise in the theoretical analysis when the inner stage of MAML takes multiple steps. First, the meta gradient of multi-step MAML has a nested and recursive structure, which requires the performance analysis of an optimization path over a nested structure. In addition, the multi-step update also yields a complicated bias error in the Hessian estimation, as well as a statistical correlation between the Hessian and gradient estimators, both of which cause further difficulty in the analysis of the meta gradient. The main contribution of this paper lies in the development of a new theoretical framework for analyzing the general multi-step MAML, with techniques for handling the above challenges.

1.1 Main Contributions

We develop a new theoretical framework, under which we characterize the convergence rate and the computational complexity to attain an $\epsilon$-accurate solution for multi-step MAML in the general nonconvex setting. Specifically, for the resampling case where each iteration needs sampling of fresh data (e.g., in reinforcement learning), our analysis enables us to decouple the Hessian approximation error from the gradient approximation error based on a novel bound on the distance between two different inner optimization paths, which facilitates the analysis of the overall convergence of MAML. For the finite-sum case where the objective function is based on pre-assigned samples (e.g., supervised learning), we develop novel techniques to handle the difference between the two losses over the training and test sets in the analysis.
Our analysis provides a guideline for choosing the inner-stage stepsize at the order of $O(1/N)$ and shows that $N$-step MAML is guaranteed to converge with the gradient and Hessian computation complexities growing only linearly with $N$, which is consistent with the empirical observations in Antoniou et al. (2019). In addition, for problems where Hessians are small, e.g., most classification/regression meta-learning problems (Finn et al., 2017a), we show that the inner stepsize $\alpha$ can be set larger while still maintaining convergence, which explains the empirical findings for MAML training in Finn et al. (2017a); Rajeswaran et al. (2019).

1.2 Related Work

Optimization-based meta-learning. Optimization-based meta-learning approaches have been widely used due to their simplicity and efficiency (Li et al., 2017; Ravi and Larochelle, 2016; Finn et al., 2017a). As a pioneer along this line, MAML (Finn et al., 2017a) aims to find an initialization such that gradient descent from it achieves fast adaptation. Many follow-up studies (Grant et al., 2018; Finn et al., 2019; Jerfel et al., 2018; Finn and Levine, 2018; Finn et al., 2018; Mi et al., 2019; Liu et al., 2019; Rothfuss et al., 2019; Foerster et al., 2018; Fallah et al., 2020a; Raghu et al., 2020; Collins et al., 2020) have extended MAML from different perspectives. For example, Finn et al. (2019) provided a follow-the-meta-leader extension of MAML for online learning. Alternatively to meta-initialization algorithms such as MAML, meta-regularization approaches aim to learn a good bias for a regularized empirical risk minimization problem for intra-task learning (Alquier et al., 2017; Denevi et al., 2018b,a, 2019; Rajeswaran et al., 2019; Balcan et al., 2019; Zhou et al., 2019). Balcan et al. (2019) formalized a connection between meta-initialization and meta-regularization from an online learning perspective. Zhou et al. (2019) proposed an efficient meta-learning approach based on a minibatch proximal update. Raghu et al.
(2020) proposed an efficient variant of MAML named ANIL (Almost No Inner Loop) by adapting only a small subset (e.g., the head) of neural network parameters in the inner loop. Ji and Liang (2021); Ji et al. (2020b) proposed efficient bilevel optimization algorithms for meta-learning with performance guarantees.

Various Hessian-free MAML algorithms have been proposed to avoid the costly computation of second-order derivatives, which include but are not limited to FOMAML (Finn et al., 2017a), Reptile (Nichol and Schulman, 2018), ES-MAML (Song et al., 2020), and HF-MAML (Fallah et al., 2020a). In particular, FOMAML (Finn et al., 2017a) omits all second-order derivatives in its meta-gradient computation, and HF-MAML (Fallah et al., 2020a) estimates the meta gradient in one-step MAML using a Hessian-vector product approximation. This paper focuses on the original MAML algorithm, but the techniques here can be extended to analyze Hessian-free multi-step MAML.

Optimization theory for meta-learning. A theoretical property of MAML was initially established in Finn and Levine (2018), which showed that MAML is a universal learning algorithm approximator under certain conditions. MAML-type algorithms have then been studied from the optimization perspective, where the convergence rate and computational complexity are typically characterized. Finn et al. (2019) analyzed online MAML for a strongly convex objective function under a bounded-gradient assumption. Fallah et al. (2020a) developed a convergence analysis of one-step MAML for a general nonconvex objective in the resampling case. Our study here provides a new convergence analysis for multi-step MAML in the nonconvex setting for both the resampling and finite-sum cases. Since the initial version of this manuscript was posted on arXiv, there have been a few more recent studies on multi-step MAML. Wang et al.
(2020b,a) studied the global optimality of MAML with over-parameterized neural networks, while our analysis focuses on general nonconvex functions. Kim et al. (2020) proposed an efficient extension of multi-step MAML by gradient reuse in the inner loop, while our analysis focuses on the most basic MAML algorithm. Ji et al. (2020a) analyzed the convergence and complexity performance of the multi-step ANIL algorithm, which is an efficient simplification of MAML that adapts only partial parameters in the inner loop. We emphasize that the study here is the first along the line of studies on multi-step MAML. We note that a concurrent work, Fallah et al. (2020b), also studies multi-step MAML for the reinforcement learning setting, where they design an unbiased multi-step estimator. As a comparison, our estimator is biased due to the data sampling in the inner loop, and hence we need extra developments to control this bias, e.g., by bounding the difference between the batch-gradient and stochastic-gradient parameter updates in the inner loop.

Another type of meta-learning algorithm has also been studied as a bilevel optimization problem. Rajeswaran et al. (2019) proposed a meta-regularization variant of MAML named iMAML via bilevel optimization, and analyzed its convergence by assuming that the regularized empirical risk minimization problem in the inner optimization stage is strongly convex. Likhosherstov et al. (2020) studied the convergence properties of a class of first-order bilevel optimization algorithms.

Statistical theory for meta-learning. Zhou et al. (2019) statistically demonstrated the importance of the prior hypothesis in reducing the excess risk via a regularization approach. Du et al. (2020) studied few-shot learning from a representation learning perspective, and showed that representation learning can provide a sufficient rate improvement in both linear regression and learning neural networks. Tripuraneni et al.
(2020) studied a multi-task linear regression problem with a shared low-dimensional representation, and proposed a sample-efficient algorithm with a performance guarantee. Arora et al. (2020) proposed a representation learning approach for imitation learning via bilevel optimization, and demonstrated the improved sample complexity brought by representation learning.

2. Problem Setup

In this paper, we study the convergence of the multi-step MAML algorithm. We consider two types of objective functions that are commonly used in practice: (a) the resampling case (Finn et al., 2017a; Fallah et al., 2020a), where loss functions take the form of an expectation and new data are sampled as the algorithm runs; and (b) the finite-sum case (Antoniou et al., 2019), where loss functions take the finite-sum form with given samples. The resampling case occurs often in reinforcement learning, where data are continuously sampled as the algorithm iterates, whereas the finite-sum case typically occurs in classification problems, where the datasets are already sampled in advance. In Appendix A, we provide examples for these two types of problems.

2.1 Resampling Case: Problem Setup and Multi-Step MAML

Suppose a set $\mathcal{T} = \{\mathcal{T}_i, i \in I\}$ of tasks is available for learning, and tasks are sampled based on a probability distribution $p(\mathcal{T})$ over the task set. Assume that each task $\mathcal{T}_i$ is associated with a loss $l_i(w): \mathbb{R}^d \to \mathbb{R}$ parameterized by $w$. The goal of multi-step MAML is to find a good initial parameter $w$ such that after observing a new task, a few gradient descent steps starting from such a point $w$ can efficiently approach the optimizer (or a stationary point) of the corresponding loss function. Towards this end, multi-step MAML consists of two nested stages, where the inner stage consists of multiple steps of (stochastic) gradient descent for each individual task, and the outer stage updates the meta parameter over all the sampled tasks.
More specifically, at each inner stage, each task $\mathcal{T}_i$ initializes at the meta parameter, i.e., $\tilde{w}^i_0 := w$, and runs $N$ gradient descent steps as

$$\tilde{w}^i_{j+1} = \tilde{w}^i_j - \alpha \nabla l_i(\tilde{w}^i_j), \quad j = 0, 1, ..., N-1. \quad (1)$$

Thus, the loss of task $\mathcal{T}_i$ after the $N$-step inner-stage iteration is given by $l_i(\tilde{w}^i_N)$, where $\tilde{w}^i_N$ depends on the meta parameter $w$ through the iteration updates in (1), and can hence be written as $\tilde{w}^i_N(w)$. We further define $L_i(w) := l_i(\tilde{w}^i_N(w))$, and hence the overall meta objective is given by

$$\min_{w \in \mathbb{R}^d} L(w) := \mathbb{E}_{i \sim p(\mathcal{T})}[L_i(w)] := \mathbb{E}_{i \sim p(\mathcal{T})}[l_i(\tilde{w}^i_N(w))]. \quad (2)$$

Then the outer stage of the meta update is a gradient descent step to optimize the above objective function. Using the chain rule, we provide a simplified form (see Appendix B for its derivation) of the gradient $\nabla L_i(w)$ as

$$\nabla L_i(w) = \prod_{j=0}^{N-1} \left(I - \alpha \nabla^2 l_i(\tilde{w}^i_j)\right) \nabla l_i(\tilde{w}^i_N), \quad (3)$$

where $\tilde{w}^i_0 = w$ for all tasks. Hence, the full gradient descent step of the outer stage for (2) can be written as

$$w_{k+1} = w_k - \beta_k \, \mathbb{E}_{i \sim p(\mathcal{T})}\left[\prod_{j=0}^{N-1} \left(I - \alpha \nabla^2 l_i(\tilde{w}^i_{k,j})\right) \nabla l_i(\tilde{w}^i_{k,N})\right], \quad (4)$$

where the index $k$ is added to $\tilde{w}^i_j$ in (3) to denote that these parameters are at the $k$-th iteration of the meta parameter $w$.

Algorithm 1 Multi-step MAML in the resampling case
1: Input: initial parameter $w_0$, inner stepsize $\alpha > 0$
2: for $k = 1, ..., K$ do
3:   Sample a batch $B_k \subset I$ of i.i.d. tasks by distribution $p(\mathcal{T})$
4:   for all tasks $\mathcal{T}_i$ in $B_k$ do
5:     for $j = 0, 1, ..., N-1$ do
6:       Sample a training set $S^i_{k,j}$ and update $w^i_{k,j+1} = w^i_{k,j} - \alpha \nabla l_i(w^i_{k,j}; S^i_{k,j})$
7:     end for
8:   end for
9:   Sample $T^i_k$ and $D^i_{k,j}$, and compute $\hat{G}_i(w_k)$ through (7) for all $i \in B_k$
10:  Update $w_{k+1} = w_k - \beta_k \sum_{i \in B_k} \hat{G}_i(w_k) / |B_k|$
11: end for

The inner- and outer-stage updates of MAML given in (1) and (4) involve the gradient $\nabla l_i(\cdot)$ and the Hessian $\nabla^2 l_i(\cdot)$ of the loss function $l_i(\cdot)$, which takes the form of an expectation over the distribution of data samples as given by

$$l_i(\cdot) = \mathbb{E}_\tau \, l_i(\cdot; \tau), \quad (5)$$

where $\tau$ represents the data sample. In practice, these two quantities based on the population loss function are estimated by samples.
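As a sanity check on the product form (3), the following sketch (our own illustration, not from the paper; the quadratic task loss, dimension, and constants are assumptions) computes the meta gradient of a single task and verifies it against a finite-difference approximation of $L_i(w) = l_i(\tilde{w}^i_N(w))$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, alpha = 3, 4, 0.05

# Hypothetical quadratic task loss l(w) = 0.5 * w^T A w - b^T w, so that
# grad l(w) = A w - b and the Hessian is the constant matrix A.
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                  # symmetric positive definite
b = rng.normal(size=d)
grad = lambda w: A @ w - b

def inner_path(w):
    """The N inner-stage gradient steps of (1); returns all iterates."""
    ws = [w]
    for _ in range(N):
        ws.append(ws[-1] - alpha * grad(ws[-1]))
    return ws

def meta_grad(w):
    """Meta gradient via (3): the product of (I - alpha * Hessian)
    factors applied to the gradient at the last inner iterate."""
    g = grad(inner_path(w)[-1])
    for _ in range(N):
        g = (np.eye(d) - alpha * A) @ g  # Hessian at every iterate is A
    return g

# Finite-difference check of the meta objective L(w) = l(w_N(w)).
w0 = rng.normal(size=d)
l_of = lambda w: 0.5 * w @ A @ w - b @ w
L_meta = lambda w: l_of(inner_path(w)[-1])
eps = 1e-6
fd = np.array([(L_meta(w0 + eps * e) - L_meta(w0 - eps * e)) / (2 * eps)
               for e in np.eye(d)])
assert np.allclose(meta_grad(w0), fd, atol=1e-5)
```

For a quadratic loss the Hessian is constant, so the product in (3) collapses to a matrix power; for general losses each factor would be evaluated at its own iterate $\tilde{w}^i_j$.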
Specifically, each task $\mathcal{T}_i$ samples a batch $\Omega$ of data under the current parameter $w$, and uses

$$\nabla l_i(\cdot; \Omega) := \frac{1}{|\Omega|}\sum_{\tau \in \Omega} \nabla l_i(\cdot; \tau) \quad \text{and} \quad \nabla^2 l_i(\cdot; \Omega) := \frac{1}{|\Omega|}\sum_{\tau \in \Omega} \nabla^2 l_i(\cdot; \tau)$$

as unbiased estimates of the gradient $\nabla l_i(\cdot)$ and the Hessian $\nabla^2 l_i(\cdot)$, respectively. For the practical multi-step MAML shown in Algorithm 1, at the $k$-th outer stage, we sample a set $B_k$ of tasks. Then, at the inner stage, each task $\mathcal{T}_i \in B_k$ samples a training set $S^i_{k,j}$ for each iteration $j$ of the inner stage, uses $\nabla l_i(w^i_{k,j}; S^i_{k,j})$ as an estimate of $\nabla l_i(\tilde{w}^i_{k,j})$ in (1), and runs an SGD update as

$$w^i_{k,j+1} = w^i_{k,j} - \alpha \nabla l_i(w^i_{k,j}; S^i_{k,j}), \quad j = 0, ..., N-1, \quad (6)$$

where the initialization parameter $w^i_{k,0} = w_k$ for all $i \in B_k$. At the $k$-th outer stage, we draw batches $T^i_k$ and $D^i_{k,j}$ of data samples independent from each other and both independent from $S^i_{k,j}$, and use $\nabla l_i(w^i_{k,N}; T^i_k)$ and $\nabla^2 l_i(w^i_{k,j}; D^i_{k,j})$ to estimate $\nabla l_i(\tilde{w}^i_{k,N})$ and $\nabla^2 l_i(\tilde{w}^i_{k,j})$ in (4), respectively. Then, the meta parameter $w_{k+1}$ at the outer stage is updated by an SGD step as shown in line 10 of Algorithm 1, where the estimated gradient $\hat{G}_i(w_k)$ has the form

$$\hat{G}_i(w_k) = \prod_{j=0}^{N-1} \left(I - \alpha \nabla^2 l_i(w^i_{k,j}; D^i_{k,j})\right) \nabla l_i(w^i_{k,N}; T^i_k). \quad (7)$$

For simplicity, we suppose the sizes of $S^i_{k,j}$, $D^i_{k,j}$ and $T^i_k$ are $S$, $D$ and $T$ in this paper.

Algorithm 2 Multi-step MAML in the finite-sum case
1: Input: initial parameter $w_0$, inner stepsize $\alpha > 0$
2: for $k = 1, ..., K$ do
3:   Sample a batch $B_k \subset I$ of i.i.d. tasks by distribution $p(\mathcal{T})$
4:   for all tasks $\mathcal{T}_i$ in $B_k$ do
5:     for $j = 0, 1, ..., N-1$ do
6:       Update $w^i_{k,j+1} = w^i_{k,j} - \alpha \nabla l_{S_i}(w^i_{k,j})$
7:     end for
8:   end for
9:   Update $w_{k+1} = w_k - \frac{\beta_k}{|B_k|} \sum_{i \in B_k} \hat{G}_i(w_k)$
10: end for

2.2 Finite-Sum Case: Problem Setup and Multi-Step MAML

In the finite-sum case, each task $\mathcal{T}_i$ is pre-assigned a support/training sample set $S_i$ and a query/test sample set $T_i$. Differently from the resampling case, these sample sets are fixed, and no additional fresh data are sampled as the algorithm runs.
The goal here is to learn an initial parameter $w$ such that for each task $i$, after $N$ gradient descent steps on data from $S_i$ starting from this $w$, we find a parameter $\tilde{w}^i_N$ that performs well on the test data set $T_i$. Thus, each task $\mathcal{T}_i$ is associated with two fixed loss functions

$$l_{S_i}(w) := \frac{1}{|S_i|}\sum_{\tau \in S_i} l_i(w; \tau) \quad \text{and} \quad l_{T_i}(w) := \frac{1}{|T_i|}\sum_{\tau \in T_i} l_i(w; \tau)$$

with a finite-sum structure, where $l_i(w; \tau)$ is the loss at a single sample point $\tau$ and a parameter $w$. Then, the meta objective function takes the form

$$\min_{w \in \mathbb{R}^d} L(w) := \mathbb{E}_{i \sim p(\mathcal{T})}[L_i(w)] = \mathbb{E}_{i \sim p(\mathcal{T})}[l_{T_i}(\tilde{w}^i_N)], \quad (8)$$

where $\tilde{w}^i_N$ is obtained by

$$\tilde{w}^i_{j+1} = \tilde{w}^i_j - \alpha \nabla l_{S_i}(\tilde{w}^i_j), \quad j = 0, 1, ..., N-1, \quad \text{with } \tilde{w}^i_0 := w. \quad (9)$$

We want to emphasize that $S_i$ and $T_i$ are both training datasets (together they form the meta-training datasets), and (8) is the meta-training loss, i.e., the empirical loss for estimating the expected loss at test time. (8) does not involve anything correlated with the test error. During the test period, MAML is evaluated over different meta-test datasets that are separate from the meta-training datasets $S_i$ and $T_i$. Similarly to the resampling case, we define the expected losses $l_S(w) = \mathbb{E}_i l_{S_i}(w)$ and $l_T(w) = \mathbb{E}_i l_{T_i}(w)$, and the meta gradient step of the outer stage for (8) can be written as

$$w_{k+1} = w_k - \beta_k \, \mathbb{E}_{i \sim p(\mathcal{T})}\left[\prod_{j=0}^{N-1}\left(I - \alpha \nabla^2 l_{S_i}(\tilde{w}^i_{k,j})\right) \nabla l_{T_i}(\tilde{w}^i_{k,N})\right], \quad (10)$$

where the index $k$ is added to $\tilde{w}^i_j$ in (9) to denote that these parameters are at the $k$-th iteration of the meta parameter $w$. As shown in Algorithm 2, MAML in the finite-sum case has a nested structure similar to that in the resampling case, except that it does not sample fresh data at each iteration. In the inner stage, MAML performs a sequence of full gradient descent steps (instead of stochastic gradient steps as in the resampling case) for each task $i \in B_k$, given by

$$w^i_{k,j+1} = w^i_{k,j} - \alpha \nabla l_{S_i}(w^i_{k,j}), \quad \text{for } j = 0, ..., N-1, \quad (11)$$

where $w^i_{k,0} = w_k$ for all $i \in B_k$.
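Putting (9)-(11) together, a minimal numeric sketch of the finite-sum inner and outer stages might look as follows (our own illustration: the one-dimensional least-squares task family, data sizes, stepsizes, and iteration counts are all assumptions; the outer update scales the query gradient by the product of $(1 - \alpha \, l_{S_i}'')$ factors, which is the scalar analogue of the matrix product in (10)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1-D least-squares tasks: l_i(w; (x, y)) = 0.5 * (x*w - y)^2,
# each task holding fixed support (S_i) and query (T_i) samples.
def make_task(w_true):
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    ys = xs * w_true + 0.1 * rng.normal(size=10)
    yq = xq * w_true + 0.1 * rng.normal(size=10)
    return (xs, ys), (xq, yq)

def grad(w, data):                 # gradient of the finite-sum loss
    x, y = data
    return np.mean(x * (x * w - y))

def hess(data):                    # Hessian (a scalar here), constant in w
    x, _ = data
    return np.mean(x * x)

def task_meta_grad(w, support, query, alpha=0.05, N=3):
    """Per-task meta gradient: N full-gradient inner steps on the support
    loss, then the query-loss gradient at w_N, scaled by the product of
    (1 - alpha * support Hessian) factors."""
    w_j = w
    for _ in range(N):
        w_j = w_j - alpha * grad(w_j, support)
    return grad(w_j, query) * (1.0 - alpha * hess(support)) ** N

tasks = [make_task(w_true) for w_true in rng.normal(1.0, 0.3, size=20)]
w = 0.0
for _ in range(300):               # outer stage: SGD over task minibatches
    batch = rng.choice(len(tasks), size=5, replace=False)
    w = w - 0.2 * np.mean([task_meta_grad(w, *tasks[i]) for i in batch])
```

Because the least-squares Hessian does not depend on $w$, the product of factors is a simple power here; in general each factor is evaluated at its own inner iterate $w^i_{k,j}$.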
As a result, the parameter $w^i_{k,j}$ (which denotes the parameter due to the full gradient update) in the update step (11) is equal to $\tilde{w}^i_{k,j}$ in (10) for all $j = 0, ..., N$. At the outer-stage iteration, the meta optimization of MAML performs an SGD step as shown in line 9 of Algorithm 2, where $\hat{G}_i(w_k)$ is given by

$$\hat{G}_i(w_k) = \prod_{j=0}^{N-1}\left(I - \alpha \nabla^2 l_{S_i}(w^i_{k,j})\right) \nabla l_{T_i}(w^i_{k,N}). \quad (12)$$

Compared with the resampling case, the biggest difference in analyzing Algorithm 2 in the finite-sum case is that the losses $l_{S_i}(\cdot)$ and $l_{T_i}(\cdot)$ used in the inner and outer stages, respectively, are different from each other, whereas in the resampling case they are both equal to $l_i(\cdot)$, which takes the expectation over the corresponding samples. Thus, the convergence analysis for the finite-sum case requires developing different techniques. For simplicity, we assume that the sizes of all $B_k$ are $B$.

3. Convergence of Multi-Step MAML in the Resampling Case

In this section, we first state some basic assumptions on the meta loss functions in Section 3.1, then describe several challenges in analyzing multi-step MAML in Section 3.2, then present several properties of the meta gradient in Section 3.3, and finally provide the convergence and complexity results for multi-step MAML in Section 3.4.

3.1 Basic Assumptions

We adopt the following standard assumptions (Fallah et al., 2020a; Rajeswaran et al., 2019). Let $\|\cdot\|$ denote the $\ell_2$ norm for vectors and the spectral norm for matrices.

Assumption 1 The loss $l_i(\cdot)$ of task $\mathcal{T}_i$ given by (5) satisfies:
1. The loss $l_i(\cdot)$ is bounded below, i.e., $\inf_{w \in \mathbb{R}^d} l_i(w) > -\infty$.
2. $\nabla l_i(\cdot)$ is $L_i$-Lipschitz, i.e., for any $w, u \in \mathbb{R}^d$, $\|\nabla l_i(w) - \nabla l_i(u)\| \le L_i \|w - u\|$.
3. $\nabla^2 l_i(\cdot)$ is $\rho_i$-Lipschitz, i.e., for any $w, u \in \mathbb{R}^d$, $\|\nabla^2 l_i(w) - \nabla^2 l_i(u)\| \le \rho_i \|w - u\|$.

By the definition of the objective function $L(\cdot)$ in (2), item 1 of Assumption 1 implies that $L(\cdot)$ is bounded below. In addition, item 2 implies $\|\nabla^2 l_i(w)\| \le L_i$ for any $w \in \mathbb{R}^d$. For notational convenience, we take $L = \max_i L_i$ and $\rho = \max_i \rho_i$.
The following assumptions impose bounded-variance conditions on $\nabla l_i(w)$, $\nabla l_i(w; \tau)$ and $\nabla^2 l_i(w; \tau)$.

Assumption 2 The stochastic gradient $\nabla l_i(\cdot)$ (with $i$ uniformly randomly chosen from the set $I$) has bounded variance, i.e., there exists a constant $\sigma > 0$ such that, for any $w \in \mathbb{R}^d$, $\mathbb{E}_i \|\nabla l_i(w) - \nabla l(w)\|^2 \le \sigma^2$, where the expected loss function $l(w) := \mathbb{E}_i l_i(w)$.

Assumption 3 For any $w \in \mathbb{R}^d$ and $i \in I$, there exist constants $\sigma_g, \sigma_H > 0$ such that $\mathbb{E}_\tau \|\nabla l_i(w; \tau) - \nabla l_i(w)\|^2 \le \sigma_g^2$ and $\mathbb{E}_\tau \|\nabla^2 l_i(w; \tau) - \nabla^2 l_i(w)\|^2 \le \sigma_H^2$.

Note that the above assumptions are made only on the individual loss functions $l_i(\cdot)$ rather than on the total loss $L(\cdot)$, because some conditions do not hold for $L(\cdot)$, as shown later.

3.2 Challenges of Analyzing Multi-Step MAML

Several new challenges arise when we analyze the convergence of multi-step MAML (with $N \ge 2$) compared to the one-step case (with $N = 1$). First, each iteration of the meta parameter affects the overall objective function via a nested structure of $N$-step SGD optimization paths over all tasks. Hence, our analysis of the convergence of such a meta parameter needs to characterize the nested structure and the recursive updates. Second, the meta-gradient estimator $\hat{G}_i(w_k)$ given in (7) involves $\nabla^2 l_i(w^i_{k,j}; D^i_{k,j})$ for $j = 1, ..., N-1$, which are all biased estimators of $\nabla^2 l_i(\tilde{w}^i_{k,j})$ in terms of the randomness over $D^i_{k,j}$. This is because $w^i_{k,j}$ is a stochastic estimator of $\tilde{w}^i_{k,j}$ obtained via the random training sets $S^i_{k,t}$, $t = 0, ..., j-1$, along an $N$-step SGD optimization path in the inner stage. In fact, such a bias error occurs only for multi-step MAML with $N \ge 2$ (it equals zero for $N = 1$), and requires additional effort to handle.
Third, both the Hessian terms $\nabla^2 l_i(w^i_{k,j}; D^i_{k,j})$ for $j = 1, ..., N-1$ and the gradient term $\nabla l_i(w^i_{k,N}; T^i_k)$ in the meta-gradient estimator $\hat{G}_i(w_k)$ given in (7) depend on the sample sets $S^i_{k,t}$ used in the inner-stage iterations to obtain $w^i_{k,N}$, and hence they are statistically correlated even conditioned on $w_k$. Such a complication also occurs only for multi-step MAML with $N \ge 2$ and requires new treatment (the two terms are independent for $N = 1$).

Solutions to address the above challenges. The first challenge is mainly caused by the recursive structure of the meta gradient $\nabla L(w)$ in (4) and the meta-gradient estimator $\hat{G}_i(w_k)$ given in (7). For example, when analyzing the smoothness of the meta gradient $\nabla L(w)$, we need to characterize the gap between the two quantities $\prod_{j=0}^{N-1}(I - \alpha \nabla^2 l_i(\tilde{w}^i_j))$ and $\prod_{j=0}^{N-1}(I - \alpha \nabla^2 l_i(\tilde{u}^i_j))$, where $\tilde{w}^i_j$ and $\tilde{u}^i_j$ are the $j$-th iterates of two different inner-loop updating paths. Then, using the error decomposition strategy that $\|f_1 f_2 - f_1' f_2'\| \le \|f_1 - f_1'\| \|f_2\| + \|f_1'\| \|f_2 - f_2'\|$, we can decompose this gap into $N$ parts, where each part corresponds to a distance $\|\tilde{w}^i_j - \tilde{u}^i_j\|$. The remaining step is to bound the distances $\|\tilde{w}^i_j - \tilde{u}^i_j\|$, $j = 0, ..., N-1$, by finding the relationship between $\|\tilde{w}^i_{j+1} - \tilde{u}^i_{j+1}\|$ and $\|\tilde{w}^i_j - \tilde{u}^i_j\|$ based on the inner-loop gradient descent updates.

To address the second and third challenges, we first use the strategy proposed for the first challenge to decompose the error into $N$ components, each taking the form $\|w^i_{k,j} - \tilde{w}^i_{k,j}\|$, where $w^i_{k,j}$ and $\tilde{w}^i_{k,j}$ are the $j$-th stochastic-gradient and true-gradient iterates of the inner loop at iteration $k$. The remaining step is to upper-bound the first- and second-moment distances between $w^i_{k,j}$ and $\tilde{w}^i_{k,j}$ for all $j = 0, ..., N$ by finding the relationship between $\|w^i_{k,j+1} - \tilde{w}^i_{k,j+1}\|$ and $\|w^i_{k,j} - \tilde{w}^i_{k,j}\|$ based on the inner-loop stochastic gradient updates.
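The two-factor error-decomposition strategy used here extends to $N$-factor products by telescoping; a sketch of the resulting bound (our own restatement, with the shorthand $A_j$, $B_j$ introduced only for this illustration) is:

```latex
\left\| \prod_{j=0}^{N-1} A_j \;-\; \prod_{j=0}^{N-1} B_j \right\|
\;\le\; \sum_{t=0}^{N-1} \Big( \prod_{j=0}^{t-1} \|A_j\| \Big)\, \|A_t - B_t\| \,\Big( \prod_{j=t+1}^{N-1} \|B_j\| \Big),
```

with $A_j = I - \alpha \nabla^2 l_i(\tilde{w}^i_j)$ and $B_j = I - \alpha \nabla^2 l_i(\tilde{u}^i_j)$, so that $\|A_j\|, \|B_j\| \le 1 + \alpha L$ by Assumption 1 and $\|A_t - B_t\| \le \alpha \rho \|\tilde{w}^i_t - \tilde{u}^i_t\|$ by the $\rho$-Lipschitzness of the Hessian, which yields the claimed $N$-part decomposition in terms of the per-step distances.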
3.3 Properties of the Meta Gradient

Differently from the conventional gradient, whose corresponding loss is evaluated directly at the current parameter $w$, the meta gradient has a more complicated nested structure with respect to $w$, because its loss is evaluated at the final output of the inner optimization stage, which consists of $N$ SGD updates. As a result, analyzing the meta gradient is very different from, and more challenging than, analyzing the conventional gradient. In this subsection, we establish some important properties of the meta gradient which are useful for characterizing the convergence of multi-step MAML. Recall that $\nabla L(w) = \mathbb{E}_{i \sim p(\mathcal{T})}[\nabla L_i(w)]$ with $\nabla L_i(w)$ given by (3). The following proposition characterizes the Lipschitz property of the gradient $\nabla L(\cdot)$.

Proposition 1 Suppose that Assumptions 1, 2 and 3 hold. For any $w, u \in \mathbb{R}^d$, we have

$$\|\nabla L(w) - \nabla L(u)\| \le \left((1 + \alpha L)^{2N} L + C_L \, \mathbb{E}_i \|\nabla l_i(w)\|\right) \|w - u\|,$$

where $C_L$ is a positive constant given by

$$C_L = \left((1 + \alpha L)^{N-1} \alpha \rho + \frac{\rho}{L}(1 + \alpha L)^N \left((1 + \alpha L)^{N-1} - 1\right)\right)(1 + \alpha L)^N. \quad (13)$$

The proof of Proposition 1 handles the first challenge described in Section 3.2. More specifically, we bound the differences between $\tilde{w}^i_j$ and $\tilde{u}^i_j$ along two separate paths $(\tilde{w}^i_j, j = 0, ..., N)$ and $(\tilde{u}^i_j, j = 0, ..., N)$, and then connect these differences to the distance $\|w - u\|$. Proposition 1 shows that the objective $L(\cdot)$ has a gradient-Lipschitz parameter $L_w = (1 + \alpha L)^{2N} L + C_L \, \mathbb{E}_i \|\nabla l_i(w)\|$, which can be unbounded due to the fact that $\|\nabla l_i(w)\|$ may be unbounded. Similarly to Fallah et al. (2020a), we use

$$\hat{L}_{w_k} = (1 + \alpha L)^{2N} L + \frac{C_L}{|B_k'|} \sum_{i \in B_k'} \|\nabla l_i(w_k; D^i_{L_k})\| \quad (14)$$

to estimate $L_{w_k}$ at the meta parameter $w_k$, where we independently sample the data sets $B_k'$ and $D^i_{L_k}$. As will be shown in Theorem 5, we set the meta stepsize $\beta_k$ to be inversely proportional to $\hat{L}_{w_k}$ to handle this possible unboundedness. We next characterize several estimation properties of the meta-gradient estimator $\hat{G}_i(w_k)$ in (7). Here, we address the second and third challenges described in Section 3.2.
We first quantify how far the stochastic gradient iterate $w^i_{k,j}$ deviates from the true gradient iterate $\tilde{w}^i_{k,j}$, providing upper bounds on the first- and second-moment distances between $w^i_{k,j}$ and $\tilde{w}^i_{k,j}$ for all $j = 0, ..., N$ as below.

Proposition 2 Suppose that Assumptions 1, 2 and 3 hold. Then, for any $j = 0, ..., N$ and $i \in B_k$, we have

First moment: $\mathbb{E}\big(\|w^i_{k,j} - \tilde{w}^i_{k,j}\| \,\big|\, w_k\big) \le \left((1 + \alpha L)^j - 1\right) \frac{\sigma_g}{L\sqrt{S}}$;

Second moment: $\mathbb{E}\big(\|w^i_{k,j} - \tilde{w}^i_{k,j}\|^2 \,\big|\, w_k\big) \le \left((1 + \alpha L + 2\alpha^2 L^2)^j - 1\right) \frac{\alpha \sigma_g^2}{(1 + \alpha L) L S}$.

Proposition 2 shows that we can effectively upper-bound the point-wise distance between the two paths by choosing $\alpha$ and $S$ properly. Based on Proposition 2, we provide an upper bound on the first-moment estimation error of the meta-gradient estimator $\hat{G}_i(w_k)$.

Proposition 3 Suppose Assumptions 1, 2 and 3 hold, and define the constants

$$C_{err_1} = (1 + \alpha L)^{2N} \sigma_g, \quad C_{err_2} = \frac{(1 + \alpha L)^{4N} \rho \sigma_g}{\left(2 - (1 + \alpha L)^{2N}\right) L^2}. \quad (15)$$

Let $e_k := \mathbb{E}[\hat{G}_i(w_k)] - \nabla L(w_k)$ be the estimation error. If the inner stepsize $\alpha < (2^{\frac{1}{2N}} - 1)/L$, then, conditioning on $w_k$, we have

$$\|e_k\| \le \frac{C_{err_1}}{\sqrt{S}} + \frac{C_{err_2}}{\sqrt{S}}\left(\|\nabla L(w_k)\| + \sigma\right). \quad (16)$$

Note that the estimation error for the multi-step case shown in Proposition 3 involves a term $O\big(\|\nabla L(w_k)\|/\sqrt{S}\big)$, which cannot be avoided due to the Hessian approximation error caused by the randomness over the inner-loop sample sets $S^i_{k,j}$. Somewhat interestingly, our later analysis shows that this term does not affect the final convergence rate if we choose the size $S$ properly. The following proposition provides an upper bound on the second moment of the meta-gradient estimator $\hat{G}_i(w_k)$.

Proposition 4 Suppose that Assumptions 1, 2 and 3 hold. Define the constants

$$C_{squ_1} = 3\left(\frac{\alpha^2 \sigma_H^2}{D} + (1 + \alpha L)^2\right)^N \sigma_g^2, \quad C_{squ_3} = \frac{2 C_{squ_1} (1 + \alpha L)^{2N}}{\left(2 - (1 + \alpha L)^{2N}\right)^2 \sigma_g^2},$$

$$C_{squ_2} = C_{squ_1}\left(\frac{(1 + 2\alpha L + 2\alpha^2 L^2)^N - 1}{\alpha L(1 + \alpha L)} - 1\right). \quad (17)$$

If the inner stepsize $\alpha < (2^{\frac{1}{2N}} - 1)/L$, then, conditioning on $w_k$, we have

$$\mathbb{E}\|\hat{G}_i(w_k)\|^2 \le \frac{C_{squ_1}}{T} + \frac{C_{squ_2}}{S} + C_{squ_3}\left(\|\nabla L(w_k)\|^2 + \sigma^2\right). \quad (18)$$
By choosing the set sizes $D$, $T$, $S$ and the inner stepsize $\alpha$ properly, the factor $C_{squ_3}$ in the second-moment error bound (18) can be kept at a constant level, and the first two error terms $\frac{C_{squ_1}}{T}$ and $\frac{C_{squ_2}}{S}$ can be made sufficiently small, so that the variance of the meta-gradient estimator can be well controlled in the convergence analysis, as shown later.

3.4 Main Convergence Result

By using the properties of the meta gradient established in Section 3.3, we provide the convergence rate for the multi-step MAML of Algorithm 1 in the following theorem.

Theorem 5 Suppose that Assumptions 1, 2 and 3 hold. Set the meta stepsize $\beta_k = \frac{1}{C_\beta \hat{L}_{w_k}}$, where $C_\beta > 0$ is a positive constant and $\hat{L}_{w_k}$ is the approximated smoothness parameter given by (14). For $\hat{L}_{w_k}$ in (14), we choose $|B_k'| > \frac{4 C_L^2 \sigma^2}{3(1+\alpha L)^{4N} L^2}$ and $|D^i_{L_k}| > \frac{64 \sigma_g^2 C_L^2}{(1+\alpha L)^{4N} L^2}$ for all $i \in B_k'$, where $C_L$ is given by (13). Define

$$\chi = \left(2 - (1+\alpha L)^{2N}\right)(1+\alpha L)^{2N} L, \quad \xi = \frac{6}{C_\beta L}\left(C_{err_1}^2 + C_{err_2}^2 \sigma^2\right),$$

$$\phi = \frac{2}{C_\beta^2 L}\left(\frac{C_{squ_1}}{T} + \frac{C_{squ_2}}{S} + C_{squ_3}\sigma^2\right), \quad \theta = 2 - (1+\alpha L)^{2N} - \frac{C_{err_2}^2}{S} - \frac{C_{squ_3}}{B}, \quad (19)$$

where $C_{err_1}, C_{err_2}$ are given in (15) and $C_{squ_1}, C_{squ_2}, C_{squ_3}$ are given in (17). Choose the inner stepsize $\alpha < (2^{\frac{1}{2N}} - 1)/L$, and choose $C_\beta$, $S$ and $B$ such that $\theta > 0$. Then, Algorithm 1 finds a solution $w_\zeta$ such that

$$\mathbb{E}\|\nabla L(w_\zeta)\|^2 \le \theta^{-1}\left(\frac{\chi \Delta}{K} + \frac{\xi}{S} + \frac{\phi}{B}\right), \quad (20)$$

where $\Delta = L(w_0) - L^*$ with $L^* = \inf_{w \in \mathbb{R}^d} L(w)$. Note that for $\chi$ in Theorem 5, we have replaced the notation $C_l$ by $(1 + \alpha L)^{2N} - 1$ based on its definition.

The proof of Theorem 5 (see Section 5.1 for details) consists of four main steps: step 1 bounds an iterative meta update using the meta-gradient smoothness established in Proposition 1; step 2 characterizes the first-moment estimation error of the meta-gradient estimator $\hat{G}_i(w_k)$ using Proposition 3; step 3 characterizes the second-moment estimation error of $\hat{G}_i(w_k)$ using Proposition 4; and step 4 combines steps 1-3 and telescopes to yield the convergence.
In Theorem 5, the convergence rate given by (20) mainly contains three parts: the first term $\theta^{-1}\Delta/K$ (with $\Delta = L(w_0) - L^*$) indicates that the meta parameter converges sublinearly with the number $K$ of meta iterations; the second term $\xi\theta^{-1}/S$ captures the estimation error of $\nabla l_i(w^i_{k,j}; S^i_{k,j})$ for approximating the full gradient $\nabla l_i(w^i_{k,j})$, which can be made sufficiently small by choosing a large sample size $S$; and the third term $\varphi\theta^{-1}/B$ captures the estimation error and variance of the stochastic meta gradient, which can be made small by choosing large $B$, $T$ and $D$ (note that $\varphi$ depends on both $1/T$ and $1/D$).

It is worth mentioning that our results here focus on the resampling case, where fresh data are resampled as the algorithm runs. This case often arises in bandit or reinforcement learning settings, where the batch sizes $S, B, D, T$ can be chosen to be large, so that the resulting convergence errors are small. For the cases where $S, B, D, T$ are small, however, our results in Theorem 5 contain large convergence errors. It is possible to use techniques such as variance reduction to reduce or even remove such errors; this is not the focus of this paper and requires future effort to address.

Our analysis reveals several insights for the convergence of multi-step MAML as follows. (a) To guarantee convergence, we require $\alpha L < 2^{\frac{1}{2N}} - 1$ (e.g., $\alpha = \Theta(\frac{1}{NL})$). Hence, if the number $N$ of inner gradient steps is large and $L$ is not small (e.g., for some RL problems), we need to choose a small inner stepsize $\alpha$ so that the last output of the inner stage retains a strong dependence on the initialization (i.e., the meta parameter). This is also explained in Rajeswaran et al. (2019), where they add a regularizer $\lambda\|w' - w\|^2$ to make sure the inner-loop output $w'$ has a close connection to the initialization $w$.
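The stepsize requirements just discussed can be checked numerically. The following sanity check (ours, with an arbitrary smoothness constant $L$, which cancels in $\alpha L$) verifies that $\alpha = \frac{1}{8NL}$ satisfies $\alpha L < 2^{\frac{1}{2N}} - 1$ for every $N$ and keeps the expansion factor $(1+\alpha L)^{2N}$ below $e^{1/4} < 2$, and that the larger choice $\alpha = (c^{1/N}-1)/L$ pins $(1+\alpha L)^N$ at the constant $c$, so any $1 < c < \sqrt 2$ also meets the condition.

```python
import math

# Check the inner-stepsize conditions discussed above.
L = 3.0          # arbitrary positive smoothness constant; cancels in alpha*L
c = 1.3          # any constant in (1, sqrt(2)) keeps (1+alpha*L)**(2N) < 2
for N in [1, 2, 5, 10, 100, 10_000]:
    # Choice used in Corollary 6: alpha = 1/(8*N*L).
    alpha = 1.0 / (8 * N * L)
    assert alpha * L < 2 ** (1 / (2 * N)) - 1            # Theorem 5 condition
    assert (1 + alpha * L) ** (2 * N) < math.exp(0.25) < 2

    # Larger choice alpha = (c**(1/N) - 1)/L keeps (1+alpha*L)**N == c exactly.
    alpha_big = (c ** (1 / N) - 1) / L
    assert abs((1 + alpha_big * L) ** N - c) < 1e-9
    assert alpha_big > alpha                             # indeed a larger stepsize
    assert (1 + alpha_big * L) ** (2 * N) < 2            # condition still holds
```

The loop passes for all tested $N$, illustrating that both stepsize scalings keep the inner-loop expansion factor at a constant level independent of $N$.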
(b) For problems with small Hessians, such as many classification/regression problems (Finn et al., 2017a), $L$ (an upper bound on the spectral norm of the Hessian matrices) is small, and hence we can choose a larger $\alpha$. This explains the empirical findings in Finn et al. (2017a); Antoniou et al. (2019), whose experiments tend to set a larger stepsize for regression problems with smaller Hessians.

We next specify the selection of parameters to simplify the convergence result in Theorem 5 and derive the complexity of Algorithm 1 for finding an $\epsilon$-accurate stationary point.

Corollary 6 Under the setting of Theorem 5, choose $\alpha = \frac{1}{8NL}$, $C_\beta = 100$, and let the batch sizes satisfy $S \ge \frac{15\rho^2\sigma_g^2}{L^4}$ and $D \ge \frac{\sigma_H^2}{L^2}$. Then we have
$$\mathbb{E}\|\nabla L(w_\zeta)\| \le \mathcal{O}\Big(\frac{1}{K} + \frac{\sigma_g^2(\sigma^2+1)}{S} + \frac{\sigma_g^2+\sigma^2}{B} + \frac{\sigma_g^2}{TB} + \sqrt{\frac{1}{K} + \frac{\sigma_g^2(\sigma^2+1)}{S} + \frac{\sigma_g^2+\sigma^2}{B} + \frac{\sigma_g^2}{TB}}\Big).$$
To achieve $\mathbb{E}\|\nabla L(w_\zeta)\| < \epsilon$, Algorithm 1 requires at most $\mathcal{O}(\frac{1}{\epsilon^2})$ iterations, with $\mathcal{O}(\frac{N}{\epsilon^4})$ gradient computations and $\mathcal{O}(\frac{N}{\epsilon^2})$ Hessian computations per meta iteration.

Differently from conventional SGD, which requires a gradient complexity of $\mathcal{O}(\frac{1}{\epsilon^4})$, MAML requires a higher gradient complexity by a factor of $\mathcal{O}(\frac{1}{\epsilon^2})$, which is unavoidable because MAML requires $\mathcal{O}(\frac{1}{\epsilon^2})$ tasks to achieve an $\epsilon$-accurate meta point, whereas SGD runs only over one task. Corollary 6 shows that given a properly chosen inner stepsize, e.g., $\alpha = \Theta(\frac{1}{NL})$, MAML is guaranteed to converge with both the gradient and the Hessian computation complexities growing only linearly with $N$. These results explain some empirical findings for MAML training in Rajeswaran et al. (2019). The above results can also be obtained by using a larger stepsize such as $\alpha = \Theta\big((c^{\frac{1}{N}}-1)/L\big) > \Theta\big(\frac{1}{NL}\big)$ with a certain constant $c > 1$.

4. Convergence of Multi-Step MAML in Finite-Sum Case

In this section, we provide several properties of the meta gradient for the finite-sum case, and then analyze the convergence and complexity of Algorithm 2.
Differently from the resampling case, we develop novel techniques to handle the difference between the two losses over the training and test sets (i.e., the inner- and outer-loop losses) in the analysis, whereas these two losses are the same for the resampling case.

4.1 Basic Assumptions

We state several standard assumptions for the analysis in the finite-sum case.

Assumption 4 For each task $\mathcal T_i$, the loss functions $l_{S_i}(\cdot)$ and $l_{T_i}(\cdot)$ in (8) satisfy:
1. $l_{S_i}(\cdot)$ and $l_{T_i}(\cdot)$ are bounded below, i.e., $\inf_{w\in\mathbb{R}^d} l_{S_i}(w) > -\infty$ and $\inf_{w\in\mathbb{R}^d} l_{T_i}(w) > -\infty$.
2. The gradients $\nabla l_{S_i}(\cdot)$ and $\nabla l_{T_i}(\cdot)$ are $L$-Lipschitz continuous, i.e., for any $w, u \in \mathbb{R}^d$, $\|\nabla l_{S_i}(w) - \nabla l_{S_i}(u)\| \le L\|w - u\|$ and $\|\nabla l_{T_i}(w) - \nabla l_{T_i}(u)\| \le L\|w - u\|$.
3. The Hessians $\nabla^2 l_{S_i}(\cdot)$ and $\nabla^2 l_{T_i}(\cdot)$ are $\rho$-Lipschitz continuous, i.e., for any $w, u \in \mathbb{R}^d$, $\|\nabla^2 l_{S_i}(w) - \nabla^2 l_{S_i}(u)\| \le \rho\|w - u\|$ and $\|\nabla^2 l_{T_i}(w) - \nabla^2 l_{T_i}(u)\| \le \rho\|w - u\|$.

The following assumption provides two conditions on the gradients $\nabla l_{S_i}(\cdot)$ and $\nabla l_{T_i}(\cdot)$.

Assumption 5 For all $w \in \mathbb{R}^d$, the gradients $\nabla l_{S_i}(w)$ and $\nabla l_{T_i}(w)$ satisfy:
1. $\nabla l_{T_i}(\cdot)$ has a bounded variance, i.e., there exists a constant $\sigma > 0$ such that $\mathbb{E}_i\|\nabla l_{T_i}(w) - \nabla l_T(w)\|^2 \le \sigma^2$, where $\nabla l_T(\cdot) = \mathbb{E}_i[\nabla l_{T_i}(\cdot)]$.
2. For each $i \in I$, there exists a constant $b_i > 0$ such that $\|\nabla l_{S_i}(w) - \nabla l_{T_i}(w)\| \le b_i$.

Instead of imposing a bounded-variance condition on the stochastic gradient $\nabla l_{S_i}(w)$, we alternatively assume the difference $\|\nabla l_{S_i}(w) - \nabla l_{T_i}(w)\|$ to be upper-bounded by a constant, which is more reasonable because the sample sets $S_i$ and $T_i$ are often sampled from the same distribution and share certain statistical similarity. We note that the second condition also implies $\|\nabla l_{S_i}(w)\| \le \|\nabla l_{T_i}(w)\| + b_i$, which is weaker than the bounded-gradient assumption made in papers such as Finn et al. (2019). It is worth mentioning that the second condition can be relaxed to $\|\nabla l_{S_i}(w)\| \le c_i\|\nabla l_{T_i}(w)\| + b_i$ for a constant $c_i > 0$. Without loss of generality, we consider $c_i = 1$ for simplicity.

4.2 Properties of Meta Gradient

We develop several important properties of the meta gradient.
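Throughout, the meta gradient inherits the product structure $\nabla L_i(w) = \prod_{j=0}^{N-1}(I - \alpha\nabla^2 l_{S_i}(\tilde w^i_j))\nabla l_{T_i}(\tilde w^i_N)$. As a quick numerical check of this product form (our sketch, not the paper's code), the example below uses a single synthetic quadratic loss $l(w) = \frac{1}{2}w^\top A w - b^\top w$ in both the inner and outer roles, so the Hessian is the constant matrix $A$ and the product collapses to $(I - \alpha A)^N$; we compare against a central finite difference of the meta objective $L(w) = l(\tilde w_N(w))$.

```python
import numpy as np

# Verify the product form of the meta gradient on a quadratic loss
#   l(w) = 0.5 * w^T A w - b^T w,
# whose Hessian is the constant matrix A, so
#   grad L(w) = (I - alpha*A)**N @ grad l(w_N).
rng = np.random.default_rng(0)
d, N, alpha = 4, 3, 0.1
M = rng.normal(size=(d, d))
A = M @ M.T / d + np.eye(d)            # symmetric positive definite Hessian
b = rng.normal(size=d)

grad = lambda w: A @ w - b             # gradient of the quadratic loss

def inner_loop(w, steps=N):
    for _ in range(steps):
        w = w - alpha * grad(w)        # N inner gradient descent steps
    return w

def meta_loss(w):
    wN = inner_loop(w)
    return 0.5 * wN @ A @ wN - b @ wN  # L(w) = l(w_N(w))

w0 = rng.normal(size=d)
wN = inner_loop(w0)
meta_grad = np.linalg.matrix_power(np.eye(d) - alpha * A, N) @ grad(wN)

# Central finite difference of the meta objective, coordinate by coordinate.
eps = 1e-6
fd = np.array([(meta_loss(w0 + eps * e) - meta_loss(w0 - eps * e)) / (2 * eps)
               for e in np.eye(d)])
assert np.allclose(meta_grad, fd, atol=1e-5)
```

Because the meta objective is itself quadratic in $w$ here, the central difference is exact up to floating-point rounding, so the two gradients agree to high precision.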
The following proposition characterizes a Lipschitz property of the gradient of the objective function, which takes the form
$$\nabla L(w) = \mathbb{E}_{i\sim p(\mathcal T)}\prod_{j=0}^{N-1}\big(I - \alpha\nabla^2 l_{S_i}(\tilde w^i_j)\big)\nabla l_{T_i}(\tilde w^i_N),$$
where the weights $\tilde w^i_j$, $i \in I$, $j = 0, \dots, N$ are given by the gradient descent steps in (9).

Proposition 7 Suppose that Assumptions 4 and 5 hold. Then, for any $w, u \in \mathbb{R}^d$, we have
$$\|\nabla L(w) - \nabla L(u)\| \le L_w\|w - u\|, \qquad L_w = (1+\alpha L)^{2N}L + C_b b + C_L\mathbb{E}_i\|\nabla l_{T_i}(w)\|,$$
where $b = \mathbb{E}_i[b_i]$ and $C_b, C_L > 0$ are constants given by
$$C_b = \Big((1+\alpha L)^{N-1}\alpha\rho + \frac{\rho}{L}(1+\alpha L)^{N}\big((1+\alpha L)^{N-1}-1\big)\Big)\big((1+\alpha L)^{N}-1\big),$$
$$C_L = \Big((1+\alpha L)^{N-1}\alpha\rho + \frac{\rho}{L}(1+\alpha L)^{N}\big((1+\alpha L)^{N-1}-1\big)\Big)(1+\alpha L)^{N}. \qquad (21)$$

Proposition 7 shows that $\nabla L(w)$ has a Lipschitz parameter $L_w$. Similarly to (14), we use the following construction
$$\hat L_{w_k} = (1+\alpha L)^{2N}L + C_b b + C_L\frac{1}{|B'_k|}\sum_{i\in B'_k}\|\nabla l_{T_i}(w_k)\| \qquad (22)$$
at the $k$th outer-stage iteration to approximate $L_{w_k}$, where $B'_k \subset I$ is chosen independently from $B_k$. It can be verified that the gradient estimator $\widehat G_i(w_k)$ given in (12) is an unbiased estimate of $\nabla L(w_k)$. Thus, our next step is to upper-bound the second moment of $\widehat G_i(w_k)$.

Proposition 8 Suppose that Assumptions 4 and 5 hold, and define constants
$$A_{squ1} = \frac{4(1+\alpha L)^{4N}}{\big(2-(1+\alpha L)^{2N}\big)^2}, \qquad A_{squ2} = \frac{4(1+\alpha L)^{8N}}{\big(2-(1+\alpha L)^{2N}\big)^2}(\sigma+b)^2 + 2(1+\alpha L)^{4N}(\sigma^2+\tilde b), \qquad (23)$$
where $\tilde b = \mathbb{E}_{i\sim p(\mathcal T)}[b_i^2]$. If $\alpha < (2^{\frac{1}{2N}}-1)/L$, then conditioning on $w_k$, we have
$$\mathbb{E}\|\widehat G_i(w_k)\|^2 \le A_{squ1}\|\nabla L(w_k)\|^2 + A_{squ2}.$$

Based on the above properties, we next characterize the convergence of multi-step MAML.

4.3 Main Convergence Results

In this subsection, we provide the convergence and complexity analysis for Algorithm 2 based on the properties established in the previous subsection.

Theorem 9 Let Assumptions 4 and 5 hold, and apply Algorithm 2 to solve the objective function (8). Choose the meta stepsize $\beta_k = \frac{1}{C_\beta \hat L_{w_k}}$ with $\hat L_{w_k}$ given by (22), where $C_\beta > 0$ is a constant. For $\hat L_{w_k}$ in (22), we choose the batch size $|B'_k|$ such that $|B'_k| \ge \frac{2C_L^2\sigma^2}{(C_b b + (1+\alpha L)^{2N}L)^2}$, where $C_b$ and $C_L$ are given by (21).
Define constants
$$\xi = \frac{\big(2-(1+\alpha L)^{2N}\big)(1+\alpha L)^{2N}L}{C_L} + \frac{\big(2-(1+\alpha L)^{2N}\big)C_b b}{C_L} + (1+\alpha L)^{3N}b,$$
$$\theta = \big(2-(1+\alpha L)^{2N}\big)\Big(\frac{1}{C_\beta} - \frac{1}{C_\beta^2}\Big(\frac{A_{squ1}}{B} + 1\Big)\Big), \qquad \varphi = \frac{A_{squ2}}{LC_\beta^2}, \qquad (24)$$
where $C_b$, $C_L$, $A_{squ1}$ and $A_{squ2}$ are given by (21) and (23). Choose $\alpha < (2^{\frac{1}{2N}}-1)/L$, and choose $C_\beta$ and $B$ such that $\theta > 0$. Then, Algorithm 2 attains a solution $w_\zeta$ such that
$$\mathbb{E}\|\nabla L(w_\zeta)\| \le \frac{\Delta}{2\theta K} + \frac{\varphi}{2\theta B} + \sqrt{\xi\Big(\frac{\Delta}{2\theta K} + \frac{\varphi}{2\theta B}\Big)}, \qquad (25)$$
where $\Delta = L(w_0) - \inf_{w\in\mathbb{R}^d} L(w)$.

The parameters $\theta$, $\varphi$ and $\xi$ in Theorem 9 take complicated forms. The following corollary specifies the parameters $C_\beta, \alpha$ in Theorem 9 and provides a simplified result for Algorithm 2.

Corollary 10 Under the setting of Theorem 9, choose $\alpha = \frac{1}{8NL}$ and $C_\beta = 80$. Then we have
$$\mathbb{E}\|\nabla L(w_\zeta)\| \le \mathcal{O}\Big(\frac{1}{K} + \frac{\sigma^2}{B} + \sqrt{\frac{1}{K} + \frac{\sigma^2}{B}}\Big).$$
In addition, suppose the batch size $B$ further satisfies $B \ge C_B\sigma^2\epsilon^{-2}$, where $C_B$ is a sufficiently large constant. Then, to achieve an $\epsilon$-approximate stationary point, Algorithm 2 requires at most $K = \mathcal{O}(\epsilon^{-2})$ iterations, with $\mathcal{O}\big((T + NS)\epsilon^{-2}\big)$ gradient computations and $\mathcal{O}(NS\epsilon^{-2})$ Hessian computations per iteration, where $T$ and $S$ correspond to the sample sizes of the pre-assigned sets $T_i$, $i \in I$ and $S_i$, $i \in I$.

5. Proofs of Main Results

In this section, we provide the proofs of the main results for MAML in the resampling case and the finite-sum case, respectively. This section is organized as follows. For the resampling case, Section 5.1 provides the proofs of the convergence properties of multi-step MAML, which include Propositions 1, 2, 3 and 4 on the properties of the meta gradient, and Theorem 5 and Corollary 6 on the convergence and complexity performance of multi-step MAML. The proofs of these results require several technical lemmas, which we relegate to Appendix C. Next, for the finite-sum case, Section 5.2 provides the proofs of the convergence properties of multi-step MAML, which include Propositions 7 and 8 on the properties of the meta gradient, and Theorem 9 and Corollary 10 on the convergence and complexity of multi-step MAML.
The proofs of these results rely on several technical lemmas, which we relegate to the Appendix D. 5.1 Proofs for Section 3: Convergence of Multi-Step MAML in Resampling Case To simplify notations, we let Si j and Di j denote the randomness over Si k,m, Di k,m, m = 0, ..., j 1 and let Sj and Dj denote all randomness over Si j, Di j, i I, respectively. Proof of Proposition 1 First recall that Li(w) = QN 1 j=0 (I α 2li( ewi j)) li( ewi N). Then, we have Li(w) Li(u) j=0 (I α 2li( ewi j)) j=0 (I α 2li(eui j)) li( ewi N) + (1 + αL)N li( ewi N) li(eui N) j=0 (I α 2li( ewi j)) j=0 (I α 2li(eui j)) (1 + αL)N li(w) + (1 + αL)NL ewi N eui N j=0 (I α 2li( ewi j)) j=0 (I α 2li(eui j)) | {z } V (N) (1 + αL)N li(w) + (1 + αL)2NL w u , (26) where (i) follows from Lemma 12, and (ii) follows from Lemma 11. We next upper-bound the term V (N) in the above inequality. Specifically, define a more general quantity V (m) Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning by replacing N in V (N) with m. Then, we have j=0 (I α 2li( ewi j)) α 2li( ewi m 1) α 2li(eui m 1) j=0 (I α 2li( ewi j)) j=0 (I α 2li(eui j)) I α 2li(eui m 1) (1 + αL)m 1 α 2li( ewi m 1) α 2li(eui m 1) j=0 (I α 2li( ewi j)) j=0 (I α 2li(eui j)) (1 + αL)m 1αρ ewi m 1 eui m 1 + (1 + αL)V (m 1) (1 + αL)m 1αρ(1 + αL)m 1 w u + (1 + αL)V (m 1). (27) Telescoping (27) over m from 1 to N and noting V (1) αρ w u , we have V (N) (1 + αL)N 1V (1) + m=0 αρ(1 + αL)2(N m) 2 w u (1 + αL)m = (1 + αL)N 1αρ w u + αρ(1 + αL)N N 2 X m=0 (1 + αL)m w u (1 + αL)N 1αρ + ρ L(1 + αL)N((1 + αL)N 1 1) w u . (28) Recalling the definition of CL and Combining (26), (28), we have Li(w) Li(u) CL li(w) + (1 + αL)2NL w u . Based on the above inequality, we have L(w) L(u) = Ei p(T )( Li(w) Li(u)) Ei p(T ) ( Li(w) Li(u)) CLEi p(T ) li(w) + (1 + αL)2NL w u , which finishes the proof. Ji, Yang, and Liang Proof of Proposition 2 We first prove the first-moment bound. 
Conditioning on wk, we have E Sim wi k,m ewi k,m (i) =E Sim wi k,m 1 α li(wi k,m 1; Si k,m 1) ( ewi k,m 1 α li( ewi k,m 1)) E Sim wi k,m 1 ewi k,m 1 + αE Sim li(wi k,m 1; Si k,m 1) li(wi k,m 1) + αE Sim li(wi k,m 1) li( ewi k,m 1) ESi k,m 1 li(wi k,m 1; Si k,m 1) li(wi k,m 1) Si m 2 + (1 + αL)E Si m 1 wi k,m 1 ewi k,m 1 (ii) (1 + αL)E Si m 1 wi k,m 1 ewi k,m 1 + α σg where (i) follows from (1) and (6), and (ii) follows from Assumption 3. Telescoping the above inequality over m from 1 to j and using the fact that wi k,0 = ewi k,0 = wk, we have E Si j wi k,j ewi k,j ((1 + αL)j 1) σg which finishes the proof of the first-moment bound. We next begin to prove the secondmoment bound. Conditioning on wk, we have E Sim wi k,m ewi k,m 2 =E Si m 1 wi k,m 1 ewi k,m 1 2 + α2E Sim li(wi k,m 1; Si k,m 1) li( ewi k,m 1) 2 ESi k,m 1 wi k,m 1 ewi k,m 1, li(wi k,m 1; Si k,m 1) li( ewi k,m 1) Si m 1 (i) E Si m 1 wi k,m 1 ewi k,m 1 2 2αE Si m 1 wi k,m 1 ewi k,m 1, li(wi k,m 1) li( ewi k,m 1) + α2E Sim 2 li(wi k,m 1; Si k,m 1) li(wi k,m 1) 2 + 2 li(wi k,m 1) li( ewi k,m 1) 2 (ii) E Si m 1 wi k,m 1 ewi k,m 1 2 + 2αE Si m 1 wi k,m 1 ewi k,m 1 li(wi k,m 1) li( ewi k,m 1)) + α2E Sim 2 li(wi k,m 1; Si k,m 1) li(wi k,m 1) 2 + 2 li(wi k,m 1) li( ewi k,m 1) 2 E Si m 1 wi k,m 1 ewi k,m 1 2 + 2αLE Si m 1 wi k,m 1 ewi k,m 1 2 + 2α2E Si m 1 σ2 g S + L2 wi k,m 1 ewi k,m 1 2 1 + 2αL + 2α2L2 E Si m 1 wi k,m 1 ewi k,m 1 2 + 2α2σ2 g S , where (i) follows from ESi k,m 1 li(wi k,m 1; Si k,m 1) = li(wi k,m 1) and (ii) follows from the inequality that a, b a b for any vectors a, b. Noting that wi k,0 = ewi k,0 = wk and telescoping the above inequality over m from 1 to j, we obtain E Si j wi k,j ewi k,j 2 (1 + 2αL + 2α2L2)j 1 ασ2 g L(1 + αL)S . Then,taking the expectation over wk in the above inequality finishes the proof. Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning Proof of Proposition 3 Recall the definition that j=0 (I α 2li(wi k,j; Di k,j)) li(wi k,N; T i k). 
Then, conditioning on wk, we have E b Gi(wk) =E SN,i p(T )E DN N 1 Y I α 2li(wi k,j; Di k,j) ET i k li(wi k,N; T i k) SN, i =E SN,i p(T ) j=0 EDi k,j I α 2li(wi k,j; Di k,j) SN, i li(wi k,N) =E SN,i p(T ) I α 2li(wi k,j) li(wi k,N), (29) which, combined with L(wk) = Ei p(T ) QN 1 j=0 (I α 2li( ewi k,j)) li( ewi k,N), yields E b Gi(wk) L(wk) (i) E SN,i p(T ) I α 2li(wi k,j) li(wi k,N) j=0 (I α 2li( ewi k,j)) li( ewi k,N) E SN,i p(T ) I α 2li(wi k,j) li(wi k,N) j=0 (I α 2li(wi k,j)) li( ewi k,N) + E SN,i p(T ) I α 2li(wi k,j) li( ewi k,N) j=0 (I α 2li( ewi k,j)) li( ewi k,N) I α 2li(wi k,j) j=0 (I α 2li( ewi k,j)) li( ewi k,N) + (1 + αL)NE SN,i li(wi k,N) li( ewi k,N) (ii) (1 + αL)NE SN,i li(wk) I α 2li(wi k,j) j=0 (I α 2li( ewi k,j)) + (1 + αL)NLE SN,i wi k,N ewi k,N (iii) (1 + αL)NEi li(wk) E SN I α 2li(wi k,j) j=0 (I α 2li( ewi k,j)) i | {z } R(N) + (1 + αL)N((1 + αL)N 1 σg Ji, Yang, and Liang where (i) follows from the Jensen s inequality, (ii) follows from Lemma 12 that li( ewi k,N) (1 + αL)N li(wk) , and (iii) follows from item 1 in Proposition 2. Our next step is to upper-bound the term R(N). To simplify notations, we define a general quantity R(m) by replacing N in R(N) with m, and we use E Sm|i( ) to denote E Sm( |i). Then, we have R(m) E Sm|i I α 2li(wi k,j) j=0 (I α 2li(wi k,j))(I α 2li( ewi k,m 1) j=0 (I α 2li(wi k,j))(I α 2li( ewi k,m 1) j=0 (I α 2li( ewi k,j)) (1 + αL)m 1αρE Sm|i wi k,m 1 ewi k,m 1 + (1 + αL)R(m 1) (i) αρ(1 + αL)m 1((1 + αL)m 1 1) σg S + (1 + αL)R(m 1) αρ(1 + αL)N 1 (1 + αL)N 1 1 σg S + (1 + αL)R(m 1), (31) where (i) follows from Proposition 2. 
Telescoping the above inequality over m from 2 to N and using R(1) = 0, we have R(N) ((1 + αL)N 1 1)2(1 + αL)N 1 ρσg Thus, conditioning on wk and combining (32) and (30), we have E b Gi(wk) L(wk) ((1 + αL)N 1 1)2 ρ L(1 + αL)2N 1 σg S Ei p(T ) li(wk) + (1 + αL)N((1 + αL)N 1 σg ((1 + αL)N 1 1)2 ρ L(1 + αL)2N 1 σg 1 Cl + σ 1 Cl + (1 + αL)N((1 + αL)N 1 σg where the last inequality follows from Lemma 15. Rearranging the above inequality and using Cerr1 and Cerr2 defined in Proposition 3 finish the proof. Proof of Proposition 4 Recall b Gi(wk) = QN 1 j=0 (I α 2li(wi k,j; Di k,j)) li(wi k,N; T i k). Conditioning on wk, we have E b Gi(wk) 2 j=0 (I α 2li(wi k,j; Di k,j)) 2 li(wi k,N; T i k) 2 SN, i j=0 E DN I α 2li(wi k,j; Di k,j) 2 SN, i li(wi k,N; T i k) 2 SN, i Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning We next upper-bound P and Q in (33). Note that wi k,j, j = 0, ..., N 1 are deterministic when conditioning on SN, i, and wk. Thus, conditioning on SN, i, and wk, we have I α 2li(wi k,j; Di k,j) 2 =Var I α 2li(wi k,j; Di k,j) + I α 2li(wi k,j) 2 α2σ2 H D + (1 + αL)2. (34) We next bound Q term. Conditioning on SN, i and wk, we have ET i k li(wi k,N; T i k) 2 (i) 3ET i k li(wi k,N; T i k) li(wi k,N) 2 + 3ET i k li(wi k,N) li( ewi k,N) 2 + 3ET i k li( ewi k,N) 2 (ii) 3σ2 g T + 3L2 wi k,N ewi k,N 2 + 3(1 + αL)2N li(wk) 2, (35) where (i) follows from the inequality that Pn i=1 a 2 n Pn i=1 a 2, and (ii) follows from Lemma 12. Thus, conditioning on wk and combining (33), (34) and (35), we have E b Gi(wk) 2 3 α2σ2 H D + (1 + αL)2 N σ2 g T + L2E wi k,N ewi k,N 2 + (1 + αL)2NE li(wk) 2 which, in conjunction with Proposition 2, yields E b Gi(wk) 2 3(1 + αL)2N α2σ2 H D + (1 + αL)2 N ( l(wk) 2 + σ2) + Csqu1 Based on Lemma 15 and conditioning on wk, we have l(wk) 2 2 (1 Cl)2 L(wk) + 2C2 l (1 Cl)2 σ2, which, in conjunction with 2x2 (1 x)2 + 1 2 (1 x)2 and (36), finishes the proof. 
Proof of Theorem 5 The proof of Theorem 5 consists of four main steps: step 1 of bounding an iterative meta update by the meta-gradient smoothness established by Proposition 1; step 2 of characterizing first-moment error of the meta-gradient estimator b Gi(wk) by Proposition 3; step 3 of characterizing second-moment error of the meta-gradient estimator b Gi(wk) by Proposition 4; and step 4 of combining steps 1-3, and telescoping to yield the convergence. To simplify notations, define the smoothness parameter of the meta-gradient as Lwk = (1 + αL)2NL + CLEi p(T ) li(wk) , where CL is given in (13). Based on the smoothness of the gradient L(w) given by Proposition 1, we have L(wk+1) L(wk) + L(w), wk+1 wk + Lwk 2 wk+1 wk 2 Ji, Yang, and Liang Note that the randomness from βk depends on B k and Di Lk, i B k, and thus is independent of Si k,j, Di k,j and T i k for i Bk, j = 0, ..., N. Then, taking expectation over the above inequality, conditioning on wk, and recalling ek := E b Gi(wk) L(wk), we have E(L(wk+1)|wk) L(wk) E(βk) L(wk), L(wk) + ek + Lwk E(β2 k)E 1 B P i Bk b Gi(wk) 2 Then, applying Lemma 16 in the above inequality yields E(L(wk+1)|wk) L(wk) 4 5Cβ 1 Lwk L(wk) 2 + 4 5Cβ 1 Lwk | L(wk), ek | B E b Gi(wk) 2 + E b Gi(wk) 2 . L(wk) 4 5Cβ 1 Lwk L(wk) 2 + 2 5Cβ 1 Lwk L(wk) 2 + 2 5Cβ B E b Gi(wk) 2 + E b Gi(wk) 2 . (37) Then, applying Propositions 3 and 4 to the above inequality yields E(L(wk+1)|wk) L(wk) 2 5Cβ 1 Lwk L(wk) 2 + 2 1 B E b Gi(wk) 2 + 4 1 Lwk L(wk) 2 + 6 5CβLwk + 12 C2 βLwk C2 err2 S L(wk) 2 + C2 err1 S + C2 err2σ2 L(wk) 2 CβLwk C2 err2 S Csqu3 + 6 CβLwk S C2 err1 + C2 err2σ2 + 2 C2 βLwk B S + Csqu3σ2 . (38) Recalling Lwk = (1 + αL)2NL + CLEi li(wk) , we have Lwk L and (i) (1 + αL)2NL + CLσ 1 Cl + CL 1 Cl L(wk) , (39) where (i) follows from Assumption 2 and Lemma 15. Combining (38) and (39) yields E(L(wk+1)|wk) L(wk) + 6 CβL C2 err1 + C2 err2σ2 1 S + Csqu3σ2 1 1 5 3 5 + 6 Cβ CβB 2 Cβ (1 + αL)2NL + CLσ 1 Cl + CL 1 Cl L(wk) L(wk) 2. 
(40) Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning Based on the notations in (19), we rewrite (40) as E(L(wk+1)|wk) L(wk) + ξ B θ L(wk) 2 χ + L(wk) . Unconditioning on wk in the above inequality and telescoping the above inequality over k from 0 to K 1, we have k=0 E θ L(wk) 2 where = L(w0) L . Choosing ζ from {0, ..., K 1} uniformly at random, we obtain from (41) that E θ L(wζ) 2 Consider a function f(x) = x2 c+x, x > 0, where c > 0 is a constant. Simple computation shows that f (x) = 2c2 (x+c)3 > 0. Thus, using Jensen s inequality in (42), we have θ(E L(wζ) )2 χ + E L(wζ) Rearranging the above inequality yields θ 1 B , (44) which finishes the proof. Proof of Corollary 6 Since α = 1 8NL, we have (1 + αL)N = 1 + 1 8N N = e N log(1+ 1 8N ) e1/8 < 5 4, (1 + αL)2N < e1/4 < 3 which, in conjunction with (15), implies that Cerr1 < 5σg 16 , Cerr2 < 3ρσg Furthermore, noting that D σ2 H/L2, we have Csqu1 3(1 + 2αL + 2α2L2)Nσ2 g < 3e9/32σ2 g < 4σ2 g, Csqu2 < 1.3σ2 g 8 < σ2 g 5 , Csqu3 11. (46) Ji, Yang, and Liang Based on (13), we have 128 ρ L < 3 5 ρ L and CL (i) > ρ L((N 1)αL) > 1 16 ρ L, (47) where (i) follows from the inequality that (1+a)n > 1+an. Then, using (45), (46) and (47), we obtain from (19) that σ2 g, φ 1 5000L 3σ2 g T + σ2 g 5S + 11σ2 < 1 1000L(σ2 g + 3σ2) 5 9 16 ρ2σ2 g L4 1 S 11 100B 1 = L 1500ρ, χ 24L2 ρ + σ. (48) Then, treating , ρ, L as constants and using (20), we obtain E L(wζ) O 1 K + σ2 g(σ2 + 1) S + σ2 g + σ2 B + σ2 g TB + 1 K + σ2g(σ2 + 1) S + σ2g + σ2 Then, choosing batch sizes S CSσ2 g(σ2 + 1) max(σ, 1)ϵ 2, B CB(σ2 g + σ2) max(σ, 1)ϵ 2 and TB > CT σ2 g max(σ, 1)ϵ 2, we have E L(wζ) O 1 1 K + 1 σϵ2 After at most K = CK max(σ, 1)ϵ 2 iterations, the above inequality implies, for constants CS, CB, CT and CK large enough, E L(wζ) ϵ. Recall that we need |B k| > 4C2 Lσ2 3(1+αL)4NL2 and |Di Lk| > 64σ2 g C2 L (1+αL)4NL2 for building stepsize βk at each iteration k. 
Based on the selected parameters, we have 3(1 + αL)4NL2 4σ2 3L2 3ρ 5L Θ(σ2), 64σ2 g C2 L (1 + αL)4NL2 < Θ(σ2 g), which implies |B k| = Θ(σ2) and |Di Lk| = Θ(σ2 g). Then, since the batch size D = Θ(σ2 H/L2), the total number of gradient computations at each meta iteration k is given by B(NS + T) + |B k||Di Lk| O(Nϵ 4 + ϵ 2). Furthermore, the total number of Hessian computations at each meta iteration is given by BND O(Nϵ 2). This completes the proof. 5.2 Proofs for Section 4: Convergence of Multi-Step MAML in Finite-Sum Case In this subsection, we provide proofs for the convergence properties of multi-step MAML in the finite-sum case. Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning Proof of Proposition 7 By the definition of Li( ), we have Li(w) Li(u) j=0 (I α 2l Si( ewi j)) l Ti( ewi N) j=0 (I α 2l Si(eui j)) l Ti( ewi N) j=0 (I α 2l Si(eui j)) l Ti( ewi N) j=0 (I α 2l Si(eui j)) l Ti(eui N) j=0 (I α 2l Si( ewi j)) j=0 (I α 2l Si(eui j)) l Ti( ewi N) + (1 + αL)N l Ti( ewi N) l Ti(eui N) . (49) We next upper-bound A in the above inequality. Specifically, we have j=0 (I α 2l Si( ewi j)) j=0 (I α 2l Si( ewi j))(I α 2l Si(eui N 1)) j=0 (I α 2l Si( ewi j))(I α 2l Si(eui N 1)) j=0 (I α 2l Si(eui j)) (1 + αL)N 1αρ + ρ L(1 + αL)N (1 + αL)N 1 1 w u , (50) where the last inequality uses an approach similar to (28). Combining (49) and (50) yields Li(w) Li(u) (1 + αL)N 1αρ + ρ L(1 + αL)N (1 + αL)N 1 1 w u l Ti( ewi N) + (1 + αL)NL ewi N eui N . (51) To upper-bound l Ti( ewi N) in (51), using the mean value theorem, we have l Ti( ewi N) = l Ti(w j=0 α l Si( ewi j)) (i) l Ti(w) + αL j=0 (1 + αL)j l Si(w) (ii) (1 + αL)N l Ti(w) + (1 + αL)N 1 bi, (52) where (i) follows from Lemma 17, and (ii) follows from Assumption 5. In addition, using an approach similar to Lemma 11, we have ewi N eui N (1 + αL)N w u . 
(53) Ji, Yang, and Liang Combining (51), (52) and (53) yields Li(w) Li(u) (1 + αL)N 1αρ + ρ L(1 + αL)N (1 + αL)N 1 1 (1 + αL)N l Ti(w) w u + (1 + αL)N 1αρ + ρ L(1 + αL)N (1 + αL)N 1 1 (1 + αL)N 1 bi w u + (1 + αL)2NL w u , which, in conjunction with Cb and CL given in (21), yields Li(w) Li(u) (1 + αL)2NL + Cbbi + CL l Ti(w) w u . Based on the above inequality and Jensen s inequality, we finish the proof. Proof of Proposition 8 Conditioning on wk, we have E b Gi(wk) 2 =E j=0 (I α 2l Si(wi k,j)) l Ti(wi k,N) 2 (1 + αL)2NE l Ti(wi k,N) 2, which, using an approach similar to (52), yields E b Gi(wk) 2 (1 + αL)2N2(1 + αL)2NE l Ti(wk) 2 + 2(1 + αL)2N (1 + αL)N 1 2Eib2 i 2(1 + αL)4N( l T (wk) 2 + σ2) + 2(1 + αL)2N (1 + αL)N 1 2eb (i) 2(1 + αL)4N 2 C2 1 l T (wk) 2 + 2C2 2 C2 1 + σ2 + 2(1 + αL)2N (1 + αL)N 1 2eb 4(1 + αL)4N C2 1 l T (wk) 2 + 4(1 + αL)4NC2 2 C2 1 + 2(1 + αL)4N(σ2 + eb), (54) where (i) follows from Lemma 19, and constants C1 and C2 are given by (74). Noting that C2 = (1 + αL)2N 1 σ + (1 + αL)N (1 + αL)N 1 b < (1 + αL)2N 1 (σ + b) and using the definitions of Asqu1, Asqu2 in (23), we finish the proof. Proof of Theorem 9 Based on the smoothness of L( ) established in Proposition 7, we have L(wk+1) L(wk) βk D L(wk), 1 b Gi(wk) E + Lwkβ2 k 2 Taking the conditional expectation given wk over the above inequality and noting that the randomness over βk is independent of the randomness over b Gi(wk), we have E(L(wk+1)|wk) wk L(wk) 2 + Lwk b Gi(wk) 2 wk . (55) Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning Note that, conditioning on wk, b Gi(wk) 2 1 B Asqu1 L(wk) 2 + Asqu2 + L(wk) 2 (56) where the inequality follows from Proposition 8. Then, combining (56), (55) and applying Lemma 20, we have E(L(wk+1)|wk) L(wk) 1 Lwk Cβ 1 Lwk C2 β B + 1 L(wk) 2 + Asqu2 Lwk C2 βb. 
(57) Recalling that Lwk = (1 + αL)2NL + Cbb + CLEi p(T ) l Ti(wk) and conditioning on wk, we have Lwk L and Lwk (1 + αL)2NL + Cbb + CL( l T (wk) + σ) (i) (1 + αL)2NL + Cbb + CL C2 C1 + σ + CL C1 L(wk) , (58) where (i) follows from Lemma 19. Combining (58) and (57) yields E(L(wk+1)|wk) 1 Cβ 1 C2 β B + 1 L(wk) 2 (1 + αL)2NL + Cbb + CL C2 C1 + σ + CL C1 L(wk) + 1 LC2 β 1 Cβ 1 C2 β B + 1 L(wk) 2 C1 CL (1 + αL)2NL + b C1Cb CL + C2 + C1σ + L(wk) + 1 LC2 β 1 Cβ 1 C2 β B + 1 L(wk) 2 C1 CL (1 + αL)2NL + b C1Cb CL + (1 + αL)N((1 + αL)2N 1)b + L(wk) + Asqu2 LC2 βB , (59) where the last equality follows from the definitions of C1, C2 in (74). Combining the definitions in (24) with (59) and taking the expectation over wk, we have E θ L(wk) 2 ξ + L(wk) E(L(wk) L(wk+1)) + φ Telescoping the above bound over k from 0 to K 1 and choosing ζ from {0, ..., K 1} uniformly at random, we have E θ L(wζ) 2 Using an approach similar to (43), we obtain from (60) that (E L(wζ) )2 ξ + E L(wζ) which further implies that E L(wζ) 2θK + φ 2θB + 2θK + φ 2θB which finishes the proof. Ji, Yang, and Liang Proof of Corollary 10 Since α = 1 8NL, we have (1 + αL)4N < e0.5 < 2, and thus Asqu1 < 32, Asqu2 < 8(σ + b)2 + 4(σ2 + eb), CL < 5ρ 32NL + ρ L (1 + αL)N 1 1 > ρ LαL(N 1) > ρ 16L, 32 ρ L 1 4 < ρ which, in conjunction with (24), yields L 200ρ, φ 2(σ + b)2 + (σ2 + eb) 1600L , ξ 24L2 Combining (63) and (25) yields 2θK + φ 2θB + 2θK + φ 2θB Then, based on the parameter selection that B CBσ2ϵ 2 and after at most K = Ckϵ 2 iterations, we have E L(wζ) O 1 Then, for CB, CK large enough, we obtain from the above inequality that E L(wζ) ϵ. Thus, the total number of gradient computations is given by B(T +NS) = O(ϵ 2(T +NS)). Furthermore, the total number of Hessian computations is given by BNS = O(NSϵ 2) at each iteration. Then, the proof is complete. 6. 
Conclusion and Future Work

In this paper, we provide a new theoretical framework for analyzing the convergence of the multi-step MAML algorithm in both the resampling case and the finite-sum case. Our analysis covers most applications of interest, including reinforcement learning and supervised learning. Our analysis reveals that a properly chosen inner stepsize is crucial for guaranteeing MAML to converge with a complexity that increases only linearly with $N$ (the number of inner-stage gradient updates). Moreover, for problems with small Hessians, the inner stepsize can be set larger while maintaining convergence. Our results also provide justifications for the empirical findings in training MAML. We expect that our analysis framework can be applied to understand the convergence of MAML in other scenarios such as various RL problems and Hessian-free MAML algorithms.

Acknowledgments

The work was supported in part by the U.S. National Science Foundation under Grants CCF-1761506, ECCS-1818904, and CCF-1900145.

Appendix A. Examples for Two Types of Objective Functions

A.1 RL Example for Resampling Case

RL problems are often captured by objective functions in the expectation form. Consider an RL meta-learning problem, where each task corresponds to a Markov decision process (MDP) with horizon $H$. Each RL task $\mathcal T_i$ is associated with an initial state distribution $\rho_i$, a policy $\pi_w$ parameterized by $w$ that gives a distribution over the action set at each state, and a transition distribution kernel $q_i(x_{t+1}|x_t, a_t)$ at time steps $t = 0, \dots, H-1$. Then, the loss $l_i(w)$ is defined as the negative total reward, i.e.,
$$\text{(RL example)}: \quad l_i(w) := -\mathbb{E}_{\tau\sim p_i(\cdot|w)}[R(\tau)],$$
where $\tau = (s_0, a_0, s_1, a_1, \dots, s_{H-1}, a_{H-1})$ is a trajectory following the distribution $p_i(\cdot|w)$, and the total reward is $R(\tau) = \sum_{t=0}^{H-1}\gamma^t R(s_t, a_t)$ with $R(\cdot,\cdot)$ given as a reward function.
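As a minimal concrete instance of this resampling RL objective (our toy construction, not from the paper), consider a horizon-one "bandit" task with a softmax policy $\pi_w$ over two actions: the loss is $l_i(w) = -\mathbb{E}_{a\sim\pi_w}[R(a)]$, and the REINFORCE score-function estimator $g_i(w;\tau) = -R(a)\nabla_w\log\pi_w(a)$ is unbiased for $\nabla l_i(w)$. Since the action set is finite, the expectation of the estimator can be taken exactly:

```python
import numpy as np

# Horizon-one bandit with a softmax policy over two actions; check that the
# expectation of the REINFORCE estimator equals the exact gradient of the
# negative expected reward l(w) = -sum_a pi_w(a) R(a).
R = np.array([1.0, 0.3])                       # per-action rewards
w = np.array([0.2, -0.5])                      # policy logits

pi = np.exp(w) / np.exp(w).sum()               # softmax policy pi_w
grad_log_pi = lambda a: np.eye(2)[a] - pi      # grad_w log pi_w(a)

# Exact gradient of l(w), via the softmax Jacobian d pi / d w.
jac_pi = np.diag(pi) - np.outer(pi, pi)
grad_l = -jac_pi @ R

# Exact expectation of the REINFORCE estimator over the two actions.
expected_g = sum(pi[a] * (-R[a]) * grad_log_pi(a) for a in range(2))
assert np.allclose(expected_g, grad_l)
```

In practice one averages $g_i(w;\tau)$ over sampled trajectories rather than enumerating actions; the identity checked here is what makes that sample average an unbiased gradient estimate.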
The estimated gradient here is $\nabla l_i(w; \Omega) := \frac{1}{|\Omega|}\sum_{\tau\in\Omega} g_i(w; \tau)$, where $g_i(w; \tau)$ is an unbiased policy gradient estimator such that $\mathbb{E}_{\tau\sim p_i(\cdot|w)}\, g_i(w; \tau) = \nabla l_i(w)$, e.g., REINFORCE (Williams, 1992) or G(PO)MDP (Baxter and Bartlett, 2001). In addition, the estimated Hessian is $\nabla^2 l_i(w; \Omega) := \frac{1}{|\Omega|}\sum_{\tau\in\Omega} H_i(w; \tau)$, where $H_i(w; \tau)$ is an unbiased policy Hessian estimator, e.g., DiCE (Foerster et al., 2018) or LVC (Rothfuss et al., 2019).

A.2 Classification Example for Finite-Sum Case

The risk minimization problem in classification often has a finite-sum objective function. For example, the mean-squared error (MSE) loss takes the form of
$$\text{(Classification example)}: \quad l_{S_i}(w) := \frac{1}{|S_i|}\sum_{(x_j, y_j)\in S_i}\|y_j - \phi(w; x_j)\|^2$$
(similarly for $l_{T_i}(w)$), where $(x_j, y_j)$ is a feature-label pair and $\phi(w; \cdot)$ can be a deep neural network parameterized by $w$.

Appendix B. Derivation of Simplified Form of Gradient $\nabla L_i(w)$ in (3)

First note that $L_i(w_k) = l_i(\tilde w^i_{k,N})$, where $\tilde w^i_{k,N}$ is obtained by the following gradient descent updates
$$\tilde w^i_{k,j+1} = \tilde w^i_{k,j} - \alpha\nabla l_i(\tilde w^i_{k,j}), \quad j = 0, 1, \dots, N-1, \quad \text{with } \tilde w^i_{k,0} := w_k. \qquad (64)$$
Then, by the chain rule, we have
$$\nabla L_i(w_k) = \nabla_{w_k} l_i(\tilde w^i_{k,N}) = \prod_{j=0}^{N-1}\frac{\partial \tilde w^i_{k,j+1}}{\partial \tilde w^i_{k,j}}\,\nabla l_i(\tilde w^i_{k,N}),$$
which, in conjunction with (64), implies that
$$\nabla L_i(w_k) = \prod_{j=0}^{N-1}\frac{\partial\big(\tilde w^i_{k,j} - \alpha\nabla l_i(\tilde w^i_{k,j})\big)}{\partial \tilde w^i_{k,j}}\,\nabla l_i(\tilde w^i_{k,N}) = \prod_{j=0}^{N-1}\big(I - \alpha\nabla^2 l_i(\tilde w^i_{k,j})\big)\nabla l_i(\tilde w^i_{k,N}),$$
which finishes the proof.

Appendix C. Auxiliary Lemmas for MAML in Resampling Case

In this section, we derive some useful lemmas for proving the propositions given in Section 3.3 on the properties of the meta gradient and the main results, Theorem 5 and Corollary 6. The first lemma provides a bound on the difference $\tilde w^i_j - \tilde u^i_j$ for $j = 0, \dots, N$, $i \in I$, where $\tilde w^i_j$, $j = 0, \dots, N$, $i \in I$ are given through the gradient descent updates in (1) and $\tilde u^i_j$, $j = 0, \dots, N$ are defined in the same way.

Lemma 11 For any $i \in I$, $j = 0, \dots, N$ and $w, u \in \mathbb{R}^d$, we have $\|\tilde w^i_j - \tilde u^i_j\| \le (1+\alpha L)^j\|w - u\|$.
Proof Based on the updates that ewi m = ewi m 1 α li( ewi m 1) and eui m = eui m 1 α li(eui m 1), we obtain, for any i I, ewi m eui m = ewi m 1 α li( ewi m 1) eui m 1 + α li(eui m 1) (i) ewi m 1 eui m 1 + αL ewi m 1 eui m 1 (1 + αL) ewi m 1 eui m 1 , where (i) follows from the triangle inequality. Telescoping the above inequality over m from 1 to j, we obtain ewi j eui j (1 + αL)j ewi 0 eui 0 , which, in conjunction with the fact that ewi 0 = w and eui 0 = u, finishes the proof. The following lemma provides an upper bound on li( ewi j) for all i I and j = 0, ..., N, where ewi j is defined in the same way as in Lemma 11. Lemma 12 For any i I, j = 0, ..., N and w Rd, we have li( ewi j) (1 + αL)j li(w) . Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning Proof For m 1, we have li( ewi m) = li( ewi m) li( ewi m 1) + li( ewi m 1) li( ewi m) li( ewi m 1) + li( ewi m 1) L ewi m ewi m 1 + li( ewi m 1) (1 + αL) li( ewi m 1) , where the last inequality follows from the update ewi m = ewi m 1 α li( ewi m 1). Then, telescoping the above inequality over m from 1 to j yields li( ewi j) (1 + αL)j li( ewi 0) , which, combined with the fact that ewi 0 = w, finishes the proof. The following lemma gives an upper bound on the quantity I Qm j=0(I αVj) for all matrices Vj Rd d, j = 0, ..., m that satisfy Vj L. Lemma 13 For all matrices Vj Rd d, j = 0, ..., m that satisfy Vj L, we have j=0 (I αVj) (1 + αL)m+1 1. Proof First note that the product Qm j=0(I αVj) can be expanded as j=0 (I αVj) = I j=0 αVj + X 0 p 0. (65) Ji, Yang, and Liang Proof First note that ewi N can be rewritten as ewi N = w α PN 1 j=0 li ewi j . Then, based on the mean value theorem (MVT) for vector-valued functions (Mc Leod, 1965), we have, there exist constants rt, t = 1, ..., d satisfying Pd t=1 rt = 1 and vectors w t Rd, t = 1, ..., d such that li( ewi N) = li w α j=0 li ewi j = li(w) + d X t=1 rt 2li(w t) α j=0 li ewi j t=1 rt 2li(w t) li(w) α t=1 rt 2li(w t) j=1 li ewi j . 
(66)

For simplicity, we define $K(N) := \prod_{j=0}^{N-1}\big(I - \alpha \nabla^2 l_i(\widetilde w^i_j)\big)$. Then, using (66), we bound $\nabla l_i(w) - \nabla L_i(w)$ as

$\|\nabla l_i(w) - \nabla L_i(w)\| = \big\|\nabla l_i(w) - K(N)\nabla l_i(\widetilde w^i_N)\big\|$

$= \Big\|\Big(I - K(N)\Big(I - \alpha \sum_{t=1}^d r_t \nabla^2 l_i(w'_t)\Big)\Big)\nabla l_i(w) + \alpha K(N)\sum_{t=1}^d r_t \nabla^2 l_i(w'_t)\sum_{j=1}^{N-1}\nabla l_i(\widetilde w^i_j)\Big\|$

$\overset{(i)}{\le} \Big\|I - K(N)\Big(I - \alpha \sum_{t=1}^d r_t \nabla^2 l_i(w'_t)\Big)\Big\|\,\|\nabla l_i(w)\| + \alpha L(1+\alpha L)^N \sum_{j=1}^{N-1}\|\nabla l_i(\widetilde w^i_j)\|$

$\overset{(ii)}{\le} \Big\|I - K(N)\Big(I - \alpha \sum_{t=1}^d r_t \nabla^2 l_i(w'_t)\Big)\Big\|\,\|\nabla l_i(w)\| + \alpha L(1+\alpha L)^N \sum_{j=1}^{N-1}(1+\alpha L)^j\|\nabla l_i(w)\|$

$\overset{(iii)}{\le} \big((1+\alpha L)^{N+1} - 1\big)\|\nabla l_i(w)\| + (1+\alpha L)^{N+1}\big((1+\alpha L)^{N-1} - 1\big)\|\nabla l_i(w)\|$

$= \big((1+\alpha L)^{2N} - 1\big)\|\nabla l_i(w)\|,$

where (i) follows from the fact that $\|\nabla^2 l_i(u)\| \le L$ for any $u \in \mathbb{R}^d$ and $\sum_{t=1}^d r_t = 1$, together with the inequality $\|\sum_j a_j\| \le \sum_j \|a_j\|$; (ii) follows from Lemma 12; and (iii) follows from Lemma 13, because $K(N)\big(I - \alpha \sum_t r_t \nabla^2 l_i(w'_t)\big)$ is a product of $N+1$ factors of the form $I - \alpha V$ with $\|V\| \le L$.

Recall that the expected loss is $l(w) := \mathbb{E}_{i \sim p(\mathcal{T})}\, l_i(w)$ and the objective function is $L(w) := \mathbb{E}_{i \sim p(\mathcal{T})}\, L_i(w)$. Based on the above lemmas, we next provide an upper bound on $\|\nabla l(w)\|$ using $\|\nabla L(w)\|$.

Lemma 15 For any $w \in \mathbb{R}^d$, we have

$\|\nabla l(w)\| \le \frac{1}{1 - C_l}\|\nabla L(w)\| + \frac{C_l}{1 - C_l}\sigma,$

where the constant $C_l$ is given by $C_l = (1+\alpha L)^{2N} - 1$.

Proof Based on the definition of $l(w)$, we have

$\|\nabla l(w)\| = \big\|\mathbb{E}_{i\sim p(\mathcal{T})}\big(\nabla l_i(w) - \nabla L_i(w) + \nabla L_i(w)\big)\big\| \le \big\|\mathbb{E}_{i\sim p(\mathcal{T})}\nabla L_i(w)\big\| + \big\|\mathbb{E}_{i\sim p(\mathcal{T})}\big(\nabla l_i(w) - \nabla L_i(w)\big)\big\| \le \|\nabla L(w)\| + \mathbb{E}_{i\sim p(\mathcal{T})}\|\nabla l_i(w) - \nabla L_i(w)\| \overset{(i)}{\le} \|\nabla L(w)\| + C_l\, \mathbb{E}_{i\sim p(\mathcal{T})}\|\nabla l_i(w)\| \overset{(ii)}{\le} \|\nabla L(w)\| + C_l\big(\|\nabla l(w)\| + \sigma\big),$

where (i) follows from Lemma 14, and (ii) follows from Assumption 2. Then, rearranging the above inequality completes the proof.

Recall from (14) that we choose the meta stepsize $\beta_k = \frac{1}{C_\beta \widehat L_{w_k}}$, where $C_\beta$ is a positive constant and $\widehat L_{w_k} = (1+\alpha L)^{2N} L + C_L \frac{1}{|B'_k|}\sum_{i \in B'_k} \|\nabla l_i(w_k; D^i_{L_k})\|$. Using an approach similar to Lemma 4.11 in Fallah et al. (2020a), we establish the following lemma to provide the first- and second-moment bounds for $\beta_k$.

Lemma 16 Suppose that Assumptions 1, 2 and 3 hold. Set the meta stepsize $\beta_k = \frac{1}{C_\beta \widehat L_{w_k}}$ with $\widehat L_{w_k}$ given by (14), where $|B'_k| > \frac{4 C_L^2 \sigma^2}{3(1+\alpha L)^{4N} L^2}$ and $|D^i_{L_k}| > \frac{64 \sigma_g^2 C_L^2}{(1+\alpha L)^{4N} L^2}$ for all $i \in B'_k$.
Then, conditioning on $w_k$, we have

$\mathbb{E}\beta_k \ge \frac{1}{5 C_\beta L_{w_k}}, \qquad \mathbb{E}\beta_k^2 \le \frac{4}{C_\beta^2 L_{w_k}^2},$

where $L_{w_k} = (1+\alpha L)^{2N} L + C_L\, \mathbb{E}_{i\sim p(\mathcal{T})}\|\nabla l_i(w_k)\|$ with $C_L$ given in (13).

Proof Let $\widetilde L_{w_k} = 4L + \frac{4 C_L}{(1+\alpha L)^{2N}} \frac{1}{|B'_k|}\sum_{i\in B'_k}\|\nabla l_i(w_k; D^i_{L_k})\|$, so that $\beta_k = \frac{4}{C_\beta (1+\alpha L)^{2N} \widetilde L_{w_k}}$. Note that $|B'_k| > \frac{4 C_L^2 \sigma^2}{3(1+\alpha L)^{4N} L^2}$ and $|D^i_{L_k}| > \frac{64 \sigma_g^2 C_L^2}{(1+\alpha L)^{4N} L^2}$ for all $i \in B'_k$. Then, using an approach similar to (61) in Fallah et al. (2020a) and conditioning on $w_k$, we have

$\mathbb{E}\Big[\frac{1}{\widetilde L_{w_k}^2}\Big] \le \frac{\sigma_\beta^2/(4L)^2 + \mu_\beta^2/(4L+\mu_\beta)^2}{\sigma_\beta^2 + \mu_\beta^2},$ (67)

where $\sigma_\beta^2$ and $\mu_\beta$ are the variance and mean of the random variable $\frac{4 C_L}{(1+\alpha L)^{2N}}\frac{1}{|B'_k|}\sum_{i\in B'_k}\|\nabla l_i(w_k; D^i_{L_k})\|$. Using an approach similar to (62) in Fallah et al. (2020a), conditioning on $w_k$ and using $|D^i_{L_k}| > \frac{64 \sigma_g^2 C_L^2}{(1+\alpha L)^{4N} L^2}$, we have

$\frac{4 C_L}{(1+\alpha L)^{2N}}\,\mathbb{E}_i\|\nabla l_i(w_k)\| - L \le \mu_\beta \le \frac{4 C_L}{(1+\alpha L)^{2N}}\,\mathbb{E}_i\|\nabla l_i(w_k)\| + L,$ (68)

which implies that $\mu_\beta + 5L \ge \frac{4}{(1+\alpha L)^{2N}} L_{w_k}$, and thus using (67), $\frac{\mu_\beta+5L}{4L+\mu_\beta} \le \frac{5}{4}$ and $\frac{(\mu_\beta+5L)^2}{16 L^2} \le \frac{\mu_\beta^2}{8L^2} + \frac{25}{8}$ yields

$\mathbb{E}\Big[\frac{1}{\widetilde L_{w_k}^2}\Big] \le \frac{(1+\alpha L)^{4N}}{16 L_{w_k}^2}\cdot \frac{\mu_\beta^2\big(\frac{25}{16} + \frac{\sigma_\beta^2}{8L^2}\big) + \frac{25\sigma_\beta^2}{8}}{\sigma_\beta^2 + \mu_\beta^2}.$ (69)

Furthermore, conditioning on $w_k$, $\sigma_\beta^2$ is bounded by

$\sigma_\beta^2 = \frac{16 C_L^2}{(1+\alpha L)^{4N}|B'_k|}\,\mathrm{Var}\big(\|\nabla l_i(w_k; D^i_{L_k})\|\big) \le \frac{16 C_L^2}{(1+\alpha L)^{4N}|B'_k|}\Big(\sigma^2 + \frac{\sigma_g^2}{|D^i_{L_k}|}\Big) \overset{(i)}{\le} \frac{16 C_L^2 \sigma^2}{(1+\alpha L)^{4N}|B'_k|} + \frac{L^2}{4} \overset{(ii)}{\le} 12L^2 + \frac{L^2}{4},$ (70)

where (i) follows from $|D^i_{L_k}| > \frac{64 \sigma_g^2 C_L^2}{(1+\alpha L)^{4N} L^2}$ for all $i \in B'_k$ and $|B'_k| \ge 1$, and (ii) follows from $|B'_k| > \frac{4 C_L^2 \sigma^2}{3(1+\alpha L)^{4N} L^2}$. Then, plugging (70) in (69) and noting that $\frac{25}{16} + \frac{\sigma_\beta^2}{8L^2} \le 4$ and $\frac{25\sigma_\beta^2}{8} \le 4\sigma_\beta^2$ yields $\mathbb{E}\big[\frac{1}{\widetilde L_{w_k}^2}\big] \le \frac{(1+\alpha L)^{4N}}{4 L_{w_k}^2}$. Then, noting that $\beta_k = \frac{4}{C_\beta(1+\alpha L)^{2N}\widetilde L_{w_k}}$ and conditioning on $w_k$, we have

$\mathbb{E}\beta_k^2 = \frac{16}{C_\beta^2 (1+\alpha L)^{4N}}\,\mathbb{E}\Big[\frac{1}{\widetilde L_{w_k}^2}\Big] \le \frac{4}{C_\beta^2 L_{w_k}^2}.$ (71)

In addition, by Jensen's inequality and conditioning on $w_k$, we have

$\mathbb{E}\beta_k = \frac{4}{C_\beta(1+\alpha L)^{2N}}\,\mathbb{E}\Big[\frac{1}{\widetilde L_{w_k}}\Big] \ge \frac{4}{C_\beta(1+\alpha L)^{2N}}\cdot\frac{1}{\mathbb{E}\widetilde L_{w_k}} = \frac{4}{C_\beta(1+\alpha L)^{2N}}\cdot\frac{1}{4L + \mu_\beta} \overset{(i)}{\ge} \frac{1}{C_\beta}\cdot\frac{1}{4L(1+\alpha L)^{2N} + L_{w_k}} \overset{(ii)}{\ge} \frac{1}{5 C_\beta L_{w_k}},$ (72)

where (i) follows from (68) and (ii) follows from the fact that $L_{w_k} > (1+\alpha L)^{2N} L$.

Appendix D. Auxiliary Lemmas for MAML in Finite-Sum Case

In this section, we provide some useful lemmas to prove the propositions in Section 4.2 on properties of the meta gradient, as well as the main results Theorem 9 and Corollary 10.
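Before stating the lemmas, the finite-sum inner loop on the support loss, and the gradient-growth bound of Lemma 17 below, can be illustrated on a synthetic least-squares support loss. The data, dimensions, and stepsize here are hypothetical choices for the sketch, not quantities from the analysis:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, alpha, n = 4, 5, 0.08, 20

# Hypothetical finite-sum support loss l_S(w) = (1/n) sum_j (y_j - x_j^T w)^2,
# whose gradient is (2/n) X^T (X w - y) and smoothness constant is ||(2/n) X^T X||_2.
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
grad = lambda w: (2.0 / n) * X.T @ (X @ w - y)
L = np.linalg.norm((2.0 / n) * X.T @ X, 2)

w = rng.standard_normal(d)
g0 = np.linalg.norm(grad(w))
for j in range(N + 1):
    # Lemma 17-type bound: the gradient norm grows at most by (1 + alpha L) per step.
    assert np.linalg.norm(grad(w)) <= (1 + alpha * L) ** j * g0 + 1e-12
    w = w - alpha * grad(w)   # inner-loop update (9) on the support loss
```

For a least-squares loss the inner-loop gradient satisfies $\nabla l_S(\widetilde w_{j+1}) = (I - \alpha H)\nabla l_S(\widetilde w_j)$ with constant Hessian $H$, so the $(1+\alpha L)^j$ bound holds with room to spare.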
The following lemma provides an upper bound on $\|\nabla l_{S_i}(\widetilde w^i_j)\|$ for all $i \in \mathcal{I}$ and $j = 0, \ldots, N$, where $\widetilde w^i_j$ is defined by (9) with $\widetilde w^i_0 = w$.

Lemma 17 For any $i \in \mathcal{I}$, $j = 0, \ldots, N$ and $w \in \mathbb{R}^d$, we have $\|\nabla l_{S_i}(\widetilde w^i_j)\| \le (1+\alpha L)^j \|\nabla l_{S_i}(w)\|$.

Proof The proof is similar to that of Lemma 12, and is thus omitted.

We next provide a bound on $\|\nabla l_{T_i}(w) - \nabla L_i(w)\|$, where

$\nabla L_i(w) = \prod_{j=0}^{N-1}\big(I - \alpha \nabla^2 l_{S_i}(\widetilde w^i_j)\big)\,\nabla l_{T_i}(\widetilde w^i_N).$

Lemma 18 For any $i \in \mathcal{I}$ and $w \in \mathbb{R}^d$, we have

$\|\nabla l_{T_i}(w) - \nabla L_i(w)\| \le \big((1+\alpha L)^N - 1\big)\|\nabla l_{T_i}(w)\| + (1+\alpha L)^N\big((1+\alpha L)^N - 1\big)\|\nabla l_{S_i}(w)\|.$

Proof Using the mean value theorem (MVT), there exist constants $r_t$, $t = 1, \ldots, d$ satisfying $\sum_{t=1}^d r_t = 1$ and vectors $w'_t \in \mathbb{R}^d$, $t = 1, \ldots, d$ such that

$\nabla l_{T_i}(\widetilde w^i_N) = \nabla l_{T_i}\Big(w - \alpha \sum_{j=0}^{N-1}\nabla l_{S_i}(\widetilde w^i_j)\Big) = \nabla l_{T_i}(w) - \alpha \sum_{t=1}^d r_t \nabla^2 l_{T_i}(w'_t)\sum_{j=0}^{N-1}\nabla l_{S_i}(\widetilde w^i_j).$

Based on the above equality, we have

$\|\nabla l_{T_i}(w) - \nabla L_i(w)\| = \Big\|\nabla l_{T_i}(w) - \prod_{j=0}^{N-1}\big(I - \alpha \nabla^2 l_{S_i}(\widetilde w^i_j)\big)\nabla l_{T_i}(\widetilde w^i_N)\Big\|$

$\le \Big\|\Big(I - \prod_{j=0}^{N-1}\big(I - \alpha \nabla^2 l_{S_i}(\widetilde w^i_j)\big)\Big)\nabla l_{T_i}(w)\Big\| + \Big\|\prod_{j=0}^{N-1}\big(I - \alpha \nabla^2 l_{S_i}(\widetilde w^i_j)\big)\,\alpha \sum_{t=1}^d r_t \nabla^2 l_{T_i}(w'_t)\sum_{j=0}^{N-1}\nabla l_{S_i}(\widetilde w^i_j)\Big\|$

$\overset{(i)}{\le} \big((1+\alpha L)^N - 1\big)\|\nabla l_{T_i}(w)\| + \alpha L(1+\alpha L)^N \sum_{j=0}^{N-1}\|\nabla l_{S_i}(\widetilde w^i_j)\|$

$\overset{(ii)}{\le} \big((1+\alpha L)^N - 1\big)\|\nabla l_{T_i}(w)\| + \alpha L(1+\alpha L)^N \sum_{j=0}^{N-1}(1+\alpha L)^j\|\nabla l_{S_i}(w)\|$

$= \big((1+\alpha L)^N - 1\big)\|\nabla l_{T_i}(w)\| + (1+\alpha L)^N\big((1+\alpha L)^N - 1\big)\|\nabla l_{S_i}(w)\|,$

where (i) follows from Lemma 13, $\big\|\prod_{j=0}^{N-1}(I - \alpha \nabla^2 l_{S_i}(\widetilde w^i_j))\big\| \le (1+\alpha L)^N$ and $\big\|\sum_{t=1}^d r_t \nabla^2 l_{T_i}(w'_t)\big\| \le \sum_{t=1}^d r_t\|\nabla^2 l_{T_i}(w'_t)\| \le L$, and (ii) follows from Lemma 17. Then, the proof is complete.

Recall that $\nabla l_T(w) = \mathbb{E}_{i\sim p(\mathcal{T})}\nabla l_{T_i}(w)$, $L(w) = \mathbb{E}_{i\sim p(\mathcal{T})} L_i(w)$ and $b = \mathbb{E}_{i\sim p(\mathcal{T})}[b_i]$. The following lemma provides an upper bound on $\|\nabla l_T(w)\|$.

Lemma 19 For any $w \in \mathbb{R}^d$, we have

$\|\nabla l_T(w)\| \le \frac{1}{C_1}\|\nabla L(w)\| + \frac{C_2}{C_1},$

where constants $C_1, C_2 > 0$ are given by

$C_1 = 2 - (1+\alpha L)^{2N}, \qquad C_2 = \big((1+\alpha L)^{2N} - 1\big)\sigma + (1+\alpha L)^N\big((1+\alpha L)^N - 1\big)b.$
(74)

Proof First note that

$\|\nabla l_T(w)\| = \big\|\mathbb{E}_i\big(\nabla l_{T_i}(w) - \nabla L_i(w)\big) + \nabla L(w)\big\| \le \|\nabla L(w)\| + \mathbb{E}_i\|\nabla l_{T_i}(w) - \nabla L_i(w)\|$

$\overset{(i)}{\le} \|\nabla L(w)\| + \mathbb{E}_i\Big[\big((1+\alpha L)^N - 1\big)\|\nabla l_{T_i}(w)\| + (1+\alpha L)^N\big((1+\alpha L)^N - 1\big)\|\nabla l_{S_i}(w)\|\Big]$

$\overset{(ii)}{\le} \|\nabla L(w)\| + \big((1+\alpha L)^N - 1\big)\big(\|\nabla l_T(w)\| + \sigma\big) + (1+\alpha L)^N\big((1+\alpha L)^N - 1\big)\big(\mathbb{E}_i\|\nabla l_{T_i}(w)\| + \mathbb{E}_i b_i\big)$

$\le \|\nabla L(w)\| + \big((1+\alpha L)^{2N} - 1\big)\|\nabla l_T(w)\| + \big((1+\alpha L)^{2N} - 1\big)\sigma + (1+\alpha L)^N\big((1+\alpha L)^N - 1\big)b,$

where (i) follows from Lemma 18, and (ii) follows from Assumption 5. Then, rearranging the above inequality and using the definitions of $C_1$ and $C_2$ in (74), the proof is complete.

The following lemma provides the first- and second-moment bounds on $1/\widehat L_{w_k}$, where $\widehat L_{w_k} = (1+\alpha L)^{2N} L + C_b b + C_L \frac{1}{|B'_k|}\sum_{i\in B'_k}\|\nabla l_{T_i}(w_k)\|$.

Lemma 20 If the batch size satisfies $|B'_k| \ge \frac{2 C_L^2 \sigma^2}{(C_b b + (1+\alpha L)^{2N} L)^2}$, then conditioning on $w_k$, we have

$\mathbb{E}\Big[\frac{1}{\widehat L_{w_k}}\Big] \ge \frac{1}{L_{w_k}}, \qquad \mathbb{E}\Big[\frac{1}{\widehat L_{w_k}^2}\Big] \le \frac{2}{L_{w_k}^2},$

where $L_{w_k}$ is given by $L_{w_k} = (1+\alpha L)^{2N} L + C_b b + C_L\,\mathbb{E}_{i\sim p(\mathcal{T})}\|\nabla l_{T_i}(w_k)\|$.

Proof Conditioning on $w_k$ and using an approach similar to (67), we have

$\mathbb{E}\Big[\frac{1}{\widehat L_{w_k}^2}\Big] \le \frac{\sigma_\beta^2/\big(C_b b + (1+\alpha L)^{2N} L\big)^2 + \mu_\beta^2/\big(\mu_\beta + C_b b + (1+\alpha L)^{2N} L\big)^2}{\sigma_\beta^2 + \mu_\beta^2},$ (75)

where $\mu_\beta$ and $\sigma_\beta^2$ are the mean and variance of the random variable $\frac{C_L}{|B'_k|}\sum_{i\in B'_k}\|\nabla l_{T_i}(w_k)\|$. Noting that $\mu_\beta = C_L\,\mathbb{E}_{i\sim p(\mathcal{T})}\|\nabla l_{T_i}(w_k)\|$, we have $L_{w_k} = (1+\alpha L)^{2N} L + C_b b + \mu_\beta$, and thus

$L_{w_k}^2\,\mathbb{E}\Big[\frac{1}{\widehat L_{w_k}^2}\Big] \le \frac{\sigma_\beta^2\,\frac{(\mu_\beta + C_b b + (1+\alpha L)^{2N} L)^2}{(C_b b + (1+\alpha L)^{2N} L)^2} + \mu_\beta^2}{\sigma_\beta^2 + \mu_\beta^2} \le \frac{2\sigma_\beta^2 + \mu_\beta^2 + \frac{2\sigma_\beta^2 \mu_\beta^2}{(C_b b + (1+\alpha L)^{2N} L)^2}}{\sigma_\beta^2 + \mu_\beta^2},$ (76)

where the last inequality follows from $(a+b)^2 \le 2a^2 + 2b^2$. Note that, conditioning on $w_k$,

$\sigma_\beta^2 = \frac{C_L^2}{|B'_k|}\mathrm{Var}\big(\|\nabla l_{T_i}(w_k)\|\big) \le \frac{C_L^2 \sigma^2}{|B'_k|},$

which, in conjunction with $|B'_k| \ge \frac{2 C_L^2 \sigma^2}{(C_b b + (1+\alpha L)^{2N} L)^2}$, yields

$2\sigma_\beta^2 \le \big(C_b b + (1+\alpha L)^{2N} L\big)^2.$ (77)

Combining (77) and (76) yields $\mathbb{E}\big[1/\widehat L_{w_k}^2\big] \le 2/L_{w_k}^2$. In addition, conditioning on $w_k$, we have $\mathbb{E}\widehat L_{w_k} = L_{w_k}$, and thus

$\mathbb{E}\Big[\frac{1}{\widehat L_{w_k}}\Big] \overset{(i)}{\ge} \frac{1}{\mathbb{E}\widehat L_{w_k}} = \frac{1}{L_{w_k}},$ (78)

where (i) follows from Jensen's inequality. Then, the proof is complete.

References

Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments.
In International Conference on Learning Representations (ICLR), 2018.

Pierre Alquier, Massimiliano Pontil, et al. Regret bounds for lifelong learning. In Artificial Intelligence and Statistics (AISTATS), pages 261–269, 2017.

Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. In International Conference on Learning Representations (ICLR), 2019.

Sanjeev Arora, Simon S Du, Sham Kakade, Yuping Luo, and Nikunj Saunshi. Provable representation learning for imitation learning via bi-level optimization. In International Conference on Machine Learning (ICML), 2020.

Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning (ICML), pages 424–433, 2019.

Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Y Bengio, S Bengio, and J Cloutier. Learning a synaptic learning rule. In International Joint Conference on Neural Networks (IJCNN). IEEE, 1991.

Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876, 2018.

Liam Collins, Aryan Mokhtari, and Sanjay Shakkottai. Distribution-agnostic model-agnostic meta-learning. arXiv preprint arXiv:2002.04766, 2020.

Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018a.

Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Learning to learn around a common mean. In Advances in Neural Information Processing Systems (NeurIPS), pages 10169–10179, 2018b.

Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399, 2019.
Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1082–1092, 2020a.

Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Provably convergent policy gradient methods for model-agnostic meta-reinforcement learning. arXiv preprint arXiv:2002.05135, 2020b.

Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Conference on Learning Representations (ICLR), 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), pages 1126–1135, 2017a.

Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning (CoRL), pages 357–368, 2017b.

Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 9516–9527, 2018.

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning (ICML), pages 1920–1930, 2019.

Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson. DiCE: The infinitely differentiable Monte Carlo estimator. In International Conference on Machine Learning (ICML), pages 1529–1538, 2018.

Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes.
In International Conference on Learning Representations (ICLR), 2018.

Ghassen Jerfel, Erin Grant, Thomas L Griffiths, and Katherine Heller. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv preprint arXiv:1812.06080, 2018.

Kaiyi Ji and Yingbin Liang. Lower bounds and accelerated algorithms for bilevel optimization. arXiv preprint arXiv:2102.03926, 2021.

Kaiyi Ji, Jason D Lee, Yingbin Liang, and H Vincent Poor. Convergence of meta-learning with task-specific adaptation over partial parameters. arXiv preprint arXiv:2006.09486, 2020a.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Nonasymptotic analysis and faster algorithms. arXiv preprint arXiv:2010.07962, 2020b.

Jin-Hwa Kim, Junyoung Park, and Yongseok Choi. Multi-step estimation for gradient-based meta-learning. arXiv preprint arXiv:2006.04298, 2020.

Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, and Adrian Weller. UFO-BLO: Unbiased first-order bilevel optimization. arXiv preprint arXiv:2006.03631, 2020.

Hao Liu, Richard Socher, and Caiming Xiong. Taming MAML: Efficient unbiased meta-reinforcement learning. In International Conference on Machine Learning (ICML), pages 4061–4071, 2019.

Robert M McLeod. Mean value theorems for vector valued functions. Proceedings of the Edinburgh Mathematical Society, 14(3):197–209, 1965.

Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 3151–3157, 2019.

Tsendsuren Munkhdalai and Hong Yu. Meta networks.
In International Conference on Machine Learning (ICML), pages 2554–2563, 2017.

Devang K Naik and Richard J Mammone. Meta-neural networks that learn by learning. In IEEE International Joint Conference on Neural Networks (IJCNN), pages 437–442, 1992.

Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Jaehoon Oh, Hyungjun Yoo, Chang Hwan Kim, and Se-Young Yun. BOIL: Towards representation change for few-shot learning. In International Conference on Learning Representations (ICLR), 2021.

Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations (ICLR), 2020.

Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 113–124, 2019.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2016.

Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. In International Conference on Learning Representations (ICLR), 2019.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), pages 1842–1850, 2016.

Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning.
In Advances in Neural Information Processing Systems (NeurIPS), pages 4077–4087, 2017.

Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. ES-MAML: Simple Hessian-free meta learning. In International Conference on Learning Representations (ICLR), 2020.

Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 2012.

Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 3630–3638, 2016.

Haoxiang Wang, Ruoyu Sun, and Bo Li. Global convergence and induced kernels of gradient-based meta-learning with neural nets. arXiv preprint arXiv:2006.14606, 2020a.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. On the global optimality of model-agnostic meta-learning. In International Conference on Machine Learning (ICML), 2020b.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, and Jiashi Feng. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems (NeurIPS), pages 1532–1542, 2019.

Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. CAML: Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642, 2018.