# MAML and ANIL Provably Learn Representations

Liam Collins¹, Aryan Mokhtari¹, Sewoong Oh², Sanjay Shakkottai¹

¹Department of Electrical and Computer Engineering, The University of Texas at Austin. ²School of Computer Science and Engineering, University of Washington. Correspondence to: Liam Collins.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Recent empirical evidence has driven conventional wisdom to believe that gradient-based meta-learning (GBML) methods perform well at few-shot learning because they learn an expressive data representation that is shared across tasks. However, the mechanics of GBML have remained largely mysterious from a theoretical perspective. In this paper, we prove that two well-known GBML methods, MAML and ANIL, as well as their first-order approximations, are capable of learning a common representation among a set of given tasks. Specifically, in the well-known multi-task linear representation learning setting, they are able to recover the ground-truth representation at an exponentially fast rate. Moreover, our analysis illuminates that the driving force causing MAML and ANIL to recover the underlying representation is that they adapt the final layer of their model, which harnesses the underlying task diversity to improve the representation in all directions of interest. To the best of our knowledge, these are the first results to show that MAML and/or ANIL learn expressive representations and to rigorously explain why they do so.

1. Introduction

A widely popular approach to achieve fast adaptation in multi-task learning settings is to learn a representation that extracts the important features shared across tasks (Maurer et al., 2016). However, our understanding of how multi-task representation learning should be done and why certain methods work well is still nascent. Recently, a paradigm known as meta-learning has emerged as a powerful means of learning multi-task representations. This was sparked in large part by the introduction of Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), which achieved impressive results in few-shot image classification and reinforcement learning scenarios, and led to a series of related gradient-based meta-learning (GBML) methods (Raghu et al., 2020; Nichol & Schulman, 2018; Antoniou et al., 2019; Hospedales et al., 2021). Surprisingly, MAML does not explicitly try to learn a useful representation; instead, it aims to find a good initialization for a small number of task-specific gradient descent steps, agnostic of whether the learning model contains a representation. Nevertheless, Raghu et al. (2020) empirically argued that MAML's impressive performance on neural networks is likely due to its tendency to learn a shared representation across tasks. To make this argument, they noticed that MAML's representation does not change significantly when adapted to each task. Moreover, they showed that a modified version of MAML that freezes the representation during local adaptation, known as the Almost-No-Inner-Loop algorithm (ANIL), typically performs at least as well as MAML on few-shot image classification tasks. Yet it is still not well understood why these algorithms that search for a good initialization for gradient descent should find a useful global representation shared among tasks.
Thus, in this paper, we aim to address the following questions: Do MAML and ANIL provably learn high-quality representations? If so, why?

To answer these questions we consider the multi-task linear representation learning setting (Maurer et al., 2016; Tripuraneni et al., 2021; Du et al., 2020) in which each task is a noisy linear regression problem in $\mathbb{R}^d$ with optimal solution lying in a shared $k$-dimensional subspace, where $k \ll d$. The learning model is a two-layer linear network consisting of a representation (the first layer of the model) and a head (the last layer). The goal is to learn a representation that projects data onto the shared subspace so as to reduce the number of samples needed to find the optimal regressor for a new task from $\Omega(d)$ to $\Omega(k)$.

Main contributions. We prove, for the first time, that both MAML and ANIL, as well as their first-order approximations, are capable of representation learning and recover the ground-truth subspace in this setting. Our analysis reveals that MAML and ANIL's distinctive adaptation updates for the last layer of the learning model are critical to their recovery of the ground-truth representation.

Figure 1. Distance of the learned representation from the ground-truth for ANIL, MAML, and average risk minimization run on task population losses in the multi-task linear representation learning setting.

Figure 1 visualizes this observation: all meta-learning approaches (Exact ANIL, MAML, and their first-order (FO) versions that ignore second-order derivatives) approach the ground truth exponentially fast, while a non-meta-learning baseline of average loss minimization empirically fails to recover the ground truth. We show that the inner loop updates of the head exploit task diversity to make the outer loop updates bring the representation closer to the ground truth. However, MAML's inner loop updates for the representation can inhibit this behavior; thus, our results for MAML require an initialization with error related to task diversity, whereas ANIL requires only constant error. We also show that ANIL learns the ground-truth representation with only $O(\frac{k^3 d}{n} + k^3) \ll d$ samples per task, demonstrating that ANIL's representation learning is sample-efficient.

Related work. Several works have studied why meta-learning algorithms are effective; please see Appendix A for a comprehensive discussion. Building off Raghu et al. (2020), most of these works have studied meta-learning from a representation learning perspective (Goldblum et al., 2020; Saunshi et al., 2021; Arnold et al., 2021; Wang et al., 2021a; Kao et al., 2022). Among these, Ni et al. (2021); Bouniot et al. (2020); Setlur et al. (2020) and Kumar et al. (2021) showed mixed empirical impacts of training task diversity on model performance. Most related to our work is (Saunshi et al., 2020), which proved that the continuous version of a first-order GBML method, Reptile (Nichol & Schulman, 2018), learns a one-dimensional linear representation in a two-task setting with a specific initialization, explicit regularization, and infinite samples per task. Other works studied multi-task representation learning in the linear setting we consider from a statistical perspective (Maurer et al., 2016; Du et al., 2020; Tripuraneni et al., 2021). Furthermore, Collins et al. (2021) and Thekumparampil et al. (2021) gave optimization results for gradient-based methods in this setting.
However, the algorithms they studied are customized for the assumed low-dimensional linear representation model, which makes it relatively easy to learn the correct representation efficiently. A more challenging task is to understand how general-purpose and model-agnostic meta-learning algorithms perform, such as the algorithms we study.

Notations. We use bold capital letters for matrices and bold lowercase letters for vectors. We use $\mathcal{O}^{d\times k}$ to denote the set of matrices in $\mathbb{R}^{d\times k}$ with orthonormal columns. A hat above a matrix, e.g. $\hat{B}$, implies the matrix is a member of $\mathcal{O}^{d\times k}$. We let $\mathrm{col}(B)$ denote the column space of $B$, and $\mathrm{col}_\perp(B)$ denote its orthogonal complement. $\mathcal{N}(0, \sigma^2)$ denotes the Gaussian distribution with mean 0 and variance $\sigma^2$. $O(\cdot)$ and $\Omega(\cdot)$ hide constant factors, and $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ hide constant and logarithmic factors.

2. Problem Formulation

We employ the multi-task linear representation learning framework (Maurer et al., 2016; Du et al., 2020; Tripuraneni et al., 2021) studied in prior works. Each task in this setting is a linear regression problem in $\mathbb{R}^d$. We index tasks by $(t, i)$, corresponding to the $i$-th task sampled on iteration $t$. The inputs $x_{t,i} \in \mathbb{R}^d$ and labels $y_{t,i} \in \mathbb{R}$ for the $(t,i)$-th task are sampled i.i.d. from a distribution $P_{t,i}$ over $\mathbb{R}^d \times \mathbb{R}$ such that
$$x_{t,i} \sim p, \qquad z_{t,i} \sim \mathcal{N}(0, \sigma^2), \qquad y_{t,i} = \langle \theta_{*,t,i}, x_{t,i} \rangle + z_{t,i},$$
where $\theta_{*,t,i} \in \mathbb{R}^d$ is the ground-truth regressor for task $(t,i)$, $p$ is a distribution over $\mathbb{R}^d$, and $z_{t,i}$ is white Gaussian noise with variance $\sigma^2$. Each task has a set of $m$ samples $\mathcal{D}_{t,i} := \{(x_{t,i,j}, y_{t,i,j})\}_{j\in[m]}$ drawn i.i.d. from $P_{t,i}$ available for training.

To account for shared information across tasks, we suppose there exists a matrix $B_* \in \mathcal{O}^{d\times k}$ such that the ground-truth regressors $\{\theta_{*,t,i}\}_i$ for all tasks lie in $\mathrm{col}(B_*)$, so they can be written as $\theta_{*,t,i} = B_* w_{*,t,i}$ for all $t, i$, where $w_{*,t,i} \in \mathbb{R}^k$. We refer to $B_*$ as the ground-truth representation and $w_{*,t,i}$ as the ground-truth head for task $(t,i)$. The task environment consists of $B_*$ and a distribution $\nu$ over ground-truth heads. With knowledge of $\mathrm{col}(B_*)$, we can reduce the number of samples needed to solve a task from $\Omega(d)$ to $\Omega(k)$ by projecting the task data onto $\mathrm{col}(B_*)$, then learning a head in $\mathbb{R}^k$. The question becomes how to learn the ground-truth subspace $\mathrm{col}(B_*)$.

The learning model consists of a representation $B \in \mathbb{R}^{d\times k}$ and a head $w \in \mathbb{R}^k$. We would like the column space of $B$ to be close to that of $B_*$, measured as follows.

Definition 1 (Principal angle distance). Let $\hat{B} \in \mathcal{O}^{d\times k}$ and $\hat{B}_{*,\perp} \in \mathcal{O}^{d\times (d-k)}$ denote orthonormal matrices whose columns span $\mathrm{col}(B)$ and $\mathrm{col}_\perp(B_*)$, respectively. Then the principal angle distance between $B$ and $B_*$ is
$$\mathrm{dist}(B, B_*) := \|\hat{B}_{*,\perp}^\top \hat{B}\|_2. \qquad (1)$$

For shorthand, we denote $\mathrm{dist}_t := \mathrm{dist}(B_t, B_*)$. Notice that $\dim(\mathrm{col}(B_*)) = k$. Thus, the learned representation $B$ must extract $k$ orthogonal directions belonging to $\mathrm{col}(B_*)$. As we will show, MAML and ANIL's task-specific adaptation of the head critically leverages task diversity to learn $k$ such directions.

3. Algorithms

Here we formally state the implementation of ANIL and MAML for the problem described above. First, letting $\theta := [w; \mathrm{vec}(B)] \in \mathbb{R}^{(d+1)k}$ denote the vector of model parameters, we define the population loss for task $(t,i)$:
$$L_{t,i}(\theta) := \tfrac{1}{2}\,\mathbb{E}_{(x_{t,i}, y_{t,i}) \sim P_{t,i}}\big[(\langle Bw, x_{t,i}\rangle - y_{t,i})^2\big].$$
Often we approximate this loss with the finite-sample loss for a dataset $\mathcal{D}'_{t,i} := \{(x_{t,i,j}, y_{t,i,j})\}_{j\in[m']}$:
$$\hat{L}_{t,i}(\theta; \mathcal{D}'_{t,i}) := \frac{1}{2m'}\sum_{j=1}^{m'}\big(\langle Bw, x_{t,i,j}\rangle - y_{t,i,j}\big)^2.$$
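To make this setup concrete, the following is a minimal NumPy sketch, not taken from the paper, of the data model above and of the principal angle distance in Definition 1. We take the feature distribution $p$ to be standard Gaussian (consistent with Assumption 3 in Section 4.1); the helper names `sample_task` and `principal_angle_dist` and all constants are our own illustrative choices.

```python
import numpy as np

def sample_task(B_star, sigma=0.1, m=20, rng=None):
    """Draw m samples from one task: x ~ N(0, I_d), y = <B_* w_*, x> + noise."""
    rng = np.random.default_rng(rng)
    d, k = B_star.shape
    w_star = rng.standard_normal(k)              # ground-truth head w_{*,t,i}
    X = rng.standard_normal((m, d))              # features, here p = N(0, I_d)
    y = X @ (B_star @ w_star) + sigma * rng.standard_normal(m)
    return X, y, w_star

def principal_angle_dist(B, B_star):
    """dist(B, B_*) = || Bhat_{*,perp}^T Bhat ||_2 from Definition 1."""
    k = B_star.shape[1]
    B_hat = np.linalg.qr(B)[0]                   # orthonormal basis of col(B)
    # left singular vectors k..d of B_* span the orthogonal complement of col(B_*)
    B_star_perp = np.linalg.svd(B_star, full_matrices=True)[0][:, k:]
    return np.linalg.norm(B_star_perp.T @ B_hat, ord=2)

rng = np.random.default_rng(0)
d, k = 50, 3
B_star = np.linalg.qr(rng.standard_normal((d, k)))[0]    # B_* in O^{d x k}
B0 = np.linalg.qr(rng.standard_normal((d, k)))[0]
X, y, _ = sample_task(B_star, rng=rng)
print(principal_angle_dist(B0, B_star))                  # near 1 for a random B0 when k << d
```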
MAML. MAML minimizes the average loss across tasks after a small number of task-specific gradient updates. Here, we consider that the task-specific updates are one step of minibatch SGD with batch $\mathcal{D}^{in}_{t,i}$ consisting of $m_{in}$ i.i.d. samples from $P_{t,i}$. Specifically, the loss function that MAML minimizes is
$$\min_\theta \; L^{MAML}(\theta) := \mathbb{E}_{w_{*,t,i},\, \mathcal{D}^{in}_{t,i}}\Big[L_{t,i}\big(\theta - \alpha \nabla_\theta \hat{L}_{t,i}(\theta; \mathcal{D}^{in}_{t,i})\big)\Big], \qquad (2)$$
where for ease of notation we have written $\mathbb{E}_{w_{*,t,i},\, \mathcal{D}^{in}_{t,i}}$ as shorthand for $\mathbb{E}_{w_{*,t,i}\sim\nu,\; \mathcal{D}^{in}_{t,i}\sim P_{t,i}^{m_{in}}}$. MAML essentially solves (2) with minibatch SGD. At iteration $t$, it draws $n$ tasks with ground-truth heads $\{w_{*,t,i}\}_{i\in[n]}$ drawn from $\nu$, and for each drawn task, draws $m$ samples contained in $\mathcal{D}_{t,i}$ i.i.d. from $P_{t,i}$. MAML then partitions $\mathcal{D}_{t,i}$ into $\mathcal{D}^{in}_{t,i}$ and $\mathcal{D}^{out}_{t,i}$ such that $|\mathcal{D}^{in}_{t,i}| = m_{in}$, $|\mathcal{D}^{out}_{t,i}| = m_{out}$, and $m_{in} + m_{out} = m$ (we assume $m_{in} < m$). For task $(t,i)$, in what is known as the inner loop, MAML takes a task-specific stochastic gradient step from the initial model $(B_t, w_t)$ using the samples $\mathcal{D}^{in}_{t,i}$ and step size $\alpha$ to obtain the adapted parameters $\theta_{t,i}$:
$$\theta_{t,i} = \begin{bmatrix} w_{t,i} \\ \mathrm{vec}(B_{t,i}) \end{bmatrix} = \begin{bmatrix} w_t - \alpha \nabla_w \hat{L}_{t,i}(B_t, w_t; \mathcal{D}^{in}_{t,i}) \\ \mathrm{vec}(B_t) - \alpha \nabla_{\mathrm{vec}(B)} \hat{L}_{t,i}(B_t, w_t; \mathcal{D}^{in}_{t,i}) \end{bmatrix}.$$
Then, in the so-called outer loop, MAML takes a minibatch SGD step with respect to the loss after task-specific adaptation using the samples $\{\mathcal{D}^{out}_{t,i}\}_{i\in[n]}$ and step size $\beta$:
$$\theta_{t+1} = \theta_t - \frac{\beta}{n}\sum_{i=1}^{n}\big(I - \alpha \nabla^2_\theta \hat{L}_{t,i}(\theta_t; \mathcal{D}^{out}_{t,i})\big)\nabla_\theta \hat{L}_{t,i}(\theta_{t,i}; \mathcal{D}^{out}_{t,i}).$$
Note that the above Exact MAML update requires expensive second-order derivative computations. In practice, FO-MAML, which drops the Hessian, is often used, since it typically achieves similar performance (Finn et al., 2017).

ANIL. Surprisingly, Raghu et al. (2020) noticed that training neural nets with a modified version of MAML that lacks inner loop updates for the representation resulted in models that matched and sometimes even exceeded the performance of models trained by MAML on few-shot image classification tasks. This modified version is ANIL, and its inner loop updates in our linear case are given as follows:
$$\theta_{t,i} = \begin{bmatrix} w_{t,i} \\ \mathrm{vec}(B_{t,i}) \end{bmatrix} = \begin{bmatrix} w_t - \alpha \nabla_w \hat{L}_{t,i}(B_t, w_t; \mathcal{D}^{in}_{t,i}) \\ \mathrm{vec}(B_t) \end{bmatrix}.$$
In the outer loop, ANIL again takes a minibatch SGD step with respect to the loss after the inner loop update. Then, the outer loop updates for Exact ANIL are given by:
$$\theta_{t+1} = \theta_t - \frac{\beta}{n}\sum_{i=1}^{n}\hat{H}_{t,i}(\theta_t; \mathcal{D}^{out}_{t,i})\,\nabla_\theta \hat{L}_{t,i}(\theta_{t,i}; \mathcal{D}^{out}_{t,i}),$$
where, for Exact ANIL,
$$\hat{H}_{t,i}(\theta_t; \mathcal{D}^{out}_{t,i}) := \begin{bmatrix} I_k - \alpha \nabla^2_{w}\hat{L}_{t,i}(\theta_t; \mathcal{D}^{out}_{t,i}) & 0 \\ -\alpha \nabla^2_{\mathrm{vec}(B), w}\hat{L}_{t,i}(\theta_t; \mathcal{D}^{out}_{t,i}) & I_{dk} \end{bmatrix}.$$
To avoid computing second-order derivatives, we can instead treat $\hat{H}_{t,i}$ as the identity operator, in which case we call the algorithm FO-ANIL.

3.1. Role of Adaptation

Now we present new intuition for MAML and ANIL's representation learning ability, which motivates our proof structure. The key observation is that the outer loop gradients for the representation are evaluated at the inner loop-adapted parameters; this harnesses the power of task diversity to improve the representation in all $k$ directions. This is easiest to observe in the FO-ANIL case with $m_{in} = m_{out} = \infty$. In this case, the update for the representation is given as:
$$B_{t+1} = B_t\Big(I_k - \underbrace{\frac{\beta}{n}\sum_{i=1}^{n} w_{t,i}w_{t,i}^\top}_{\text{FO-ANIL prior weight}}\Big) + B_*\underbrace{\frac{\beta}{n}\sum_{i=1}^{n} w_{*,t,i}w_{t,i}^\top}_{\text{FO-ANIL signal weight}}. \qquad (3)$$
If the prior weight is small and the signal weight is large, then the update replaces energy from $\mathrm{col}(B_t)$ with energy from $\mathrm{col}(B_*)$. Roughly, this is true as long as $\Psi_t := \frac{1}{n}\sum_{i=1}^{n} w_{t,i}w_{t,i}^\top$ is well-conditioned, i.e. the $w_{t,i}$'s are diverse. Assuming $w_{t,i} \approx w_{*,t,i}$ for each task, $\Psi_t$ is well-conditioned if and only if the tasks are diverse. This shows how task diversity causes the column space of the representation learned by FO-ANIL to approach the ground truth.
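As an illustration of this mechanism, the sketch below runs the FO-ANIL population update (3), together with the corresponding outer-loop update of the head, on synthetic diverse tasks and reports the principal angle distance, which should decay toward zero as in Figure 1. This is a toy sketch under our own choices of dimensions and step sizes, not the authors' experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 3, 20                 # ambient dim, subspace dim, tasks per iteration
alpha, beta, T = 0.25, 0.1, 3000    # inner/outer step sizes (illustrative choices)

B_star = np.linalg.qr(rng.standard_normal((d, k)))[0]              # ground-truth representation
B = np.linalg.qr(rng.standard_normal((d, k)))[0] / np.sqrt(alpha)  # so that alpha * B^T B = I_k
w = np.zeros(k)                                                    # w_0 = 0

def dist(B):
    B_hat = np.linalg.qr(B)[0]
    return np.linalg.norm((np.eye(d) - B_star @ B_star.T) @ B_hat, ord=2)

for t in range(T):
    W_star = rng.standard_normal((n, k))      # diverse ground-truth heads w_{*,t,i} ~ N(0, I_k)
    # inner loop: adapt only the head, using population gradients
    W_adapt = ((np.eye(k) - alpha * B.T @ B) @ w)[:, None] + alpha * B.T @ B_star @ W_star.T
    # outer loop (FO-ANIL): population gradients evaluated at the adapted heads
    Resid = B @ W_adapt - B_star @ W_star.T   # columns are B_t w_{t,i} - B_* w_{*,t,i}
    B, w = B - beta * Resid @ W_adapt.T / n, w - beta * (B.T @ Resid).mean(axis=1)

print(dist(B))   # should be close to 0: col(B) has approached col(B_*), cf. Figure 1
```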
For FO-MAML, we observe similar behavior, with a caveat. The representation update is:
$$B_{t+1} \overset{(a)}{=} B_t - \frac{\beta}{n}\sum_{i=1}^{n} B_{t,i}w_{t,i}w_{t,i}^\top + B_*\frac{\beta}{n}\sum_{i=1}^{n} w_{*,t,i}w_{t,i}^\top \overset{(b)}{=} B_t\Big(I_k - \underbrace{(I_k - \alpha w_t w_t^\top)\frac{\beta}{n}\sum_{i=1}^{n} w_{t,i}w_{t,i}^\top}_{\text{FO-MAML prior weight}}\Big) + B_*\underbrace{\frac{\beta}{n}\sum_{i=1}^{n}\big(1 - \alpha\langle w_{t,i}, w_t\rangle\big)w_{*,t,i}w_{t,i}^\top}_{\text{FO-MAML signal weight}}.$$
Equation (a) is similar to (3) except that one $B_t$ is replaced by the $B_{t,i}$'s resulting from inner loop adaptation. Expanding $B_{t,i}$ in (b), we notice that the prior weight is at least as large as in (3), since $\lambda_{\max}(I_k - \alpha w_t w_t^\top) \le 1$, but it can still be small as long as the $w_{t,i}$'s are diverse and $\|w_t\|_2$ is small. Thus we conclude that FO-MAML can also learn the representation, yet its inner loop adaptation complicates its ability to do so.

Comparison with no inner-loop adaptation. Compare these updates to the case when there is no inner loop adaptation, i.e. we run SGD on the non-adaptive objective $\min_\theta \mathbb{E}_{w_{*,t,i}}[L_{t,i}(\theta)]$ instead of (2). In this case, $B_{t+1}$ is:
$$B_{t+1} = B_t\big(I_k - \beta w_t w_t^\top\big) + \beta B_* \bar{w}_{*,t} w_t^\top, \qquad (4)$$
where $\bar{w}_{*,t} := \frac{1}{n}\sum_{i=1}^{n} w_{*,t,i}$. Observe that the coefficient of $B_t$ in the update has rank at least $k-1$, while the coefficient of $B_*$ is rank 1. Thus, $\mathrm{col}(B_{t+1})$ can approach $\mathrm{col}(B_*)$ in at most one direction on any iteration. Empirically, $w_t$ points in roughly the same direction throughout training, preventing this approach from learning $\mathrm{col}(B_*)$ (e.g. see Figure 1).

Technical challenges. The intuition on the role of adaptation, while appealing, makes strong assumptions; most notably that the $w_{t,i}$'s are diverse enough to improve the representation and that the algorithm dynamics are stable. To show these points, we observe that $w_{t,i}$ can be written as the linear combination of a vector distinct for each task and a vector that is shared across all tasks at time $t$. Showing that the shared vector is small implies that the $w_{t,i}$'s are diverse, and we can control the magnitude of the shared vector by controlling $\|w_t\|_2$ and $\|I_k - \alpha B_t^\top B_t\|_2$. Showing that these quantities are small at all times also ensures the stability of the algorithms. Meanwhile, we must prove that $\|\hat{B}_{*,\perp}^\top B_t\|_2$ and $\mathrm{dist}_t = \|\hat{B}_{*,\perp}^\top \hat{B}_t\|_2$ are contracting. It is not obvious that any of these conditions hold individually; in fact, they require a novel multi-way inductive argument to show that they hold simultaneously for each $t$ (see Section 5).

4. Main Results

In this section we formalize the intuition discussed previously and prove that both MAML and ANIL, as well as their first-order approximations, are capable of learning the column space of the ground-truth representation. To do so, we first make the following assumption concerning the diversity of the sampled ground-truth heads.

Assumption 1 (Task diversity). The eigenvalues of the symmetric matrix $\Psi_{*,t} := \frac{1}{n}\sum_{i=1}^{n} w_{*,t,i}w_{*,t,i}^\top$ are uniformly bounded below and above by $\mu_*^2$ and $L_*^2$, respectively,¹ i.e., $\mu_*^2 I_k \preceq \Psi_{*,t} \preceq L_*^2 I_k$ for all $t \in [T]$.

The lower bound on the eigenvalues of the matrix $\Psi_{*,t}$ ensures that the $k \times k$ matrix $\Psi_{*,t}$ is full rank and hence the vectors $\{w_{*,t,i}\}_{i=1}^{n}$ span $\mathbb{R}^k$; therefore they are diverse. However, the diversity level of the tasks is defined by the ratio of the extreme eigenvalues of the matrix $\Psi_{*,t}$, i.e., $\kappa_* := L_*/\mu_*$. If this ratio is close to 1, then the ground-truth heads are very diverse and have equal energy in all directions. On the other hand, if $\kappa_*$ is large, then the ground-truth heads are not very diverse, as their energy is mostly focused in a specific direction. Hence, as the following results reveal, smaller $\kappa_*$ leads to faster convergence for ANIL and MAML.
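For a quick numerical illustration of Assumption 1 (our own, not from the paper), the sketch below compares the eigenvalues of $\Psi_{*,t}$ for isotropic versus highly anisotropic ground-truth heads; the ratio $\kappa_* = L_*/\mu_*$ is close to 1 in the first case and large in the second. The helper name `diversity_stats` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 5, 50

def diversity_stats(W_star):
    """Eigenvalues of Psi_{*,t} = (1/n) sum_i w_{*,t,i} w_{*,t,i}^T and kappa_* = L_*/mu_*."""
    Psi = W_star.T @ W_star / W_star.shape[0]
    eig = np.linalg.eigvalsh(Psi)                # ascending order
    mu2, L2 = eig[0], eig[-1]
    return mu2, L2, np.sqrt(L2 / mu2)

# diverse tasks: isotropic heads, so kappa_* is close to 1
W_iso = rng.standard_normal((n, k))
# poorly diverse tasks: most energy along one direction, so kappa_* is large
W_aniso = rng.standard_normal((n, k)) * np.array([1.0, 0.1, 0.1, 0.1, 0.1])

print(diversity_stats(W_iso))     # mu_*^2 and L_*^2 of similar size
print(diversity_stats(W_aniso))   # mu_*^2 << L_*^2, i.e. large kappa_*
```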
Now we are ready to state our main results for the ANIL and FO-ANIL algorithms in the infinite-sample case.

Theorem 1. Consider the infinite-sample case for ANIL and FO-ANIL, where $m_{in} = m_{out} = \infty$. Further, suppose the conditions in Assumption 1 hold and the initial weights are selected as $w_0 = 0_k$ and $\alpha B_0^\top B_0 = I_k$. Let the step sizes be chosen as $\alpha = O(1/L_*)$ and $\beta = O(\alpha\kappa_*^{-4})$ for ANIL and $\beta = O(\alpha\kappa_*^{-4}\min(1, \mu_*^2/\eta_*^2))$ for FO-ANIL, where $\eta_*$ satisfies $\|\frac{1}{n}\sum_{i=1}^{n} w_{*,t,i}\|_2 \le \eta_*$ for all times $t \in [T]$ almost surely. If the initial error satisfies the condition $\mathrm{dist}_0 \le 0.9$, then almost surely for both ANIL and FO-ANIL we have
$$\mathrm{dist}(B_T, B_*) \le \big(1 - 0.5\beta\alpha E_0\mu_*^2\big)^{T-1}, \qquad (5)$$
where $E_0 := 0.9 - \mathrm{dist}_0^2$.

Theorem 1 shows that both FO-ANIL and Exact ANIL learn a representation that approaches the ground truth exponentially fast as long as the initial representation $B_0$ is normalized and is a constant distance away from the ground truth, the initial head $w_0 = 0$, and the sampled tasks are diverse. Note that $\beta$ is larger for ANIL, and hence its convergence is faster, demonstrating the benefit of second-order updates.

¹We could instead assume the ground-truth heads are sub-gaussian and use standard concentration results to show that with $n = \Omega(k + \log(T))$, the set of ground-truth heads $\{w_{*,t,i}\}_{i=1}^{n}$ sampled on iteration $t$ is $(1 + O(1), 1 - O(1))$-diverse for all $T$ iterations with high probability, but instead we assume generic bounds for simplicity.

Next, we state our results for FO-MAML and Exact MAML in the same infinite-sample setting. Due to the adaptation of both the representation and the head, the MAML and FO-MAML updates involve third- and fourth-order products of the ground-truth heads, unlike the ANIL and FO-ANIL updates, which involve at most second-order products. To analyze the higher-order terms, we assume that the energy in each ground-truth head is balanced.

Assumption 2 (Task incoherence). For all times $t \in [T]$ and tasks $i \in [n]$, we almost surely have $\|w_{*,t,i}\|_2 \le c\sqrt{k}\,L_*$, where $c$ is a constant.

Next, as discussed in Section 3.1, MAML's adaptation of the representation complicates its ability to learn the ground-truth subspace. As a result, we require an additional condition to show that MAML learns the representation: the distance of the initialization to the ground truth must be small, in the sense that it must scale with the task diversity and inversely with $k$. We formalize this in the following theorem.

Theorem 2. Consider the infinite-sample case for MAML, where $m_{in} = m_{out} = \infty$. Further, suppose the conditions in Assumptions 1 and 2 hold, the initial weights are selected as $w_0 = 0_k$ and $\alpha B_0^\top B_0 = I_k$, and the step sizes satisfy $\alpha = O(k^{-2/3}L_*^{-1}T^{-1/4})$ and $\beta = O(\alpha\kappa_*^{-4})$. If $\mathrm{dist}_0 = O(k^{-0.75}\kappa_*^{-1.5})$, then almost surely
$$\mathrm{dist}(B_T, B_*) \le \big(1 - 0.5\beta\alpha E_0\mu_*^2\big)^{T-1},$$
where $E_0 := 0.9 - \mathrm{dist}_0^2$.

Theorem 2 shows that the initial representation learning error for MAML must scale as $O(k^{-0.75}\kappa_*^{-1.5})$, which can be much smaller than the constant scaling that is sufficient for ANIL to learn the representation (see Theorem 1).

Next we give the main result for FO-MAML, which requires an additional condition that the norm of the average of the ground-truth heads sampled on each iteration is small. This condition arises due to the fact that the FO-MAML updates are approximations of the exact MAML updates, and thus have a bias that depends on the average of the ground-truth heads. Without control of this bias, the iterates $B_t$ and $w_t$ may diverge.

Theorem 3. Consider the infinite-sample case for FO-MAML, where $m_{in} = m_{out} = \infty$.
Further, suppose the conditions in Assumptions 1 and 2 hold, the initial weights are selected as $w_0 = 0_k$ and $\alpha B_0^\top B_0 = I_k$, and the step sizes satisfy $\alpha = O(\frac{1}{\sqrt{k}L_*})$ and $\beta = O(\alpha\kappa_*^{-4})$. If the initial error satisfies $\mathrm{dist}_0 = O(k^{-0.5}\kappa_*^{-1})$, and the average of the true heads almost surely satisfies $\|\frac{1}{n}\sum_{i=1}^{n} w_{*,t,i}\|_2 = O(k^{-1.5}\kappa_*^{-3}\mu_*)$ for all times $t$, then almost surely
$$\mathrm{dist}(B_T, B_*) \le \big(1 - 0.5\beta\alpha E_0\mu_*^2\big)^{T-1},$$
where $E_0 := 0.9 - \mathrm{dist}_0^2$.

Theorem 3 shows that FO-MAML learns $\mathrm{col}(B_*)$ as long as the initial principal angle is small and $\|\frac{1}{n}\sum_{i=1}^{n} w_{*,t,i}\|_2 = O(k^{-1.5}\kappa_*^{-3}\mu_*)$ on all iterations, due to the biased updates. Note that the FO-ANIL updates are also biased, but this bias scales with $\|I_k - \alpha B_t^\top B_t\|_2$, which is eventually decreasing quickly enough to make the cumulative error induced by the bias negligible without any additional conditions. In contrast, $\|I_k - \alpha B_t^\top B_t\|_2$ is not guaranteed to decrease for FO-MAML due to the inner loop adaptation of the representation, so we need the additional condition.

To the best of our knowledge, the above theorems are the first results to show that ANIL, MAML, and their first-order approximations learn representations in any setting. Moreover, they are the first to show how task diversity plays a key role in representation learning from an optimization perspective. Due to the restrictions on $\beta$ and $\alpha$, Theorems 1 and 2 show that the rate of contraction of the principal angle distance diminishes with less task diversity. Thus, the more diverse the tasks, i.e. the smaller $\kappa_*$, the faster ANIL and MAML learn the representation. Additionally, the less diverse the tasks, the more accurate the initialization that MAML requires, and the tighter the true heads must be centered around zero to control the FO-MAML bias.

4.1. Finite-sample results

Thus far we have only considered the infinite-sample case, i.e., $m_{in} = m_{out} = \infty$, to highlight the reasons that the adaptation updates in MAML and ANIL are essential for representation learning. Next, we study the finite-sample setting. Indeed, establishing our results for the finite-sample case is more challenging, but the mechanisms by which ANIL and MAML learn representations for finite $m_{in}$ and $m_{out}$ are very similar to the infinite-sample case, and the finite-sample problem reduces to showing concentration of the updates to the infinite-sample updates. For MAML, this concentration requires assumptions on sixth- and eighth-order products of the data, which arise due to the inner-loop updates. In light of this, for the sake of readability we only give the finite-sample result for ANIL and FO-ANIL, whose analyses require only standard assumptions on the data, as stated below.

Assumption 3 (Sub-gaussian feature distribution). For $x \sim p$, $\mathbb{E}[x] = 0$ and $\mathrm{Cov}(x) = I_d$. Moreover, $x$ is $I_d$-sub-gaussian in the sense that $\mathbb{E}[\exp(v^\top x)] \le \exp(\frac{\|v\|_2^2}{2})$ for all $v \in \mathbb{R}^d$.

Under this assumption, we can show the following.

Theorem 4 (ANIL Finite Samples). Consider the finite-sample case for ANIL and FO-ANIL. Suppose Assumptions 1, 2 and 3 hold, $\alpha = O((\sqrt{k}L_* + \sigma)^{-1})$, and $\beta$ is chosen as in Theorem 1. For some $\delta > 0$ to be defined later, let $E_0 = 0.9 - \mathrm{dist}_0^2 - \delta$ and assume $E_0$ is lower bounded by a positive constant. Suppose the sample sizes satisfy $m_{in} = \tilde{\Omega}(M_{in})$ and $m_{out} = \tilde{\Omega}(M_{out})$ for some expressions $M_{in}, M_{out}$ to be defined later.
Then both ANIL and FO-ANIL satisfy
$$\mathrm{dist}(B_T, B_*) \le \big(1 - 0.5\beta\alpha\mu_*^2\big)^{T-1} + O(\delta),$$
where for ANIL, $M_{in} = k^3 + \frac{k^3 d}{n}$ and $M_{out} = k^2 + \frac{dk + k^3}{n}$, and for FO-ANIL, $M_{in} = k^2$ and $M_{out} = \frac{dk + k^3}{n}$, with probability at least $1 - \frac{T}{\mathrm{poly}(n)} - \frac{T}{\mathrm{poly}(m_{in})} - O(Te^{-90k})$.

For ease of presentation, the $\tilde{\Omega}(\cdot)$ notation excludes log factors and all parameters besides $k$, $d$ and $n$; please refer to Theorem 8 in Appendix E for the full statement. We focus on the dimension parameters and $n$ here to highlight the sample complexity benefits conferred by ANIL and FO-ANIL compared to solving each task separately ($n = 1$). Theorem 4 shows that ANIL requires only $m_{in} + m_{out} = \tilde{\Omega}(k^3 + \frac{k^3 d}{n})$ samples per task to reach a neighborhood of the ground-truth solution. Since $k \ll d$ and $n$ can be large, this sample complexity is far less than the $\Omega(d)$ required to solve each task individually (Hsu et al., 2012). Note that more samples are required for Exact ANIL because the second-order updates involve higher-order products of the data, which have heavier tails than the analogous terms for FO-ANIL.

5. Proof sketch

We now discuss how we prove the results in greater detail. We focus on the FO-ANIL case because the presentation is simplest yet still illuminates the key ideas used in all proofs.

5.1. Theorem 1 (FO-ANIL)

Intuition. Our goal is to show that the distance between the column spaces of $B_t$ and $B_*$, i.e. $\mathrm{dist}_t := \|\hat{B}_{*,\perp}^\top \hat{B}_t\|_2$, converges to zero at a linear rate for all $t$. We will use an inductive argument in which we assume favorable conditions hold up to time $t$ and prove they continue to hold at time $t+1$. To show $\mathrm{dist}_{t+1}$ is linearly decaying, it is helpful to first consider the non-normalized energy in the subspace orthogonal to the ground truth, namely $\|\hat{B}_{*,\perp}^\top B_{t+1}\|_2$. We observed in equation (3) that if the inner loop-adapted heads $w_{t,i}$ at time $t$ are diverse, then the FO-ANIL update of the representation subtracts energy from the previous representation and adds energy from the ground-truth representation. Examining (3) closer, we notice that the only energy in the column space of the new representation that can be orthogonal to the ground-truth subspace is contributed by the previous representation, and this energy is contracting at a rate proportional to the condition number of the matrix formed by the adapted heads. In particular, if we define the matrix $\Psi_t := \frac{1}{n}\sum_{i=1}^{n} w_{t,i}w_{t,i}^\top$, then we have
$$\|\hat{B}_{*,\perp}^\top B_{t+1}\|_2 = \|\hat{B}_{*,\perp}^\top B_t(I_k - \beta\Psi_t)\|_2 \le \big(1 - \beta\lambda_{\min}(\Psi_t)\big)\|\hat{B}_{*,\perp}^\top B_t\|_2, \qquad (6)$$
as long as $\beta \le 1/\lambda_{\max}(\Psi_t)$. Therefore, to show that the normalized energy $\|\hat{B}_{*,\perp}^\top \hat{B}_{t+1}\|_2$ approaches zero, we aim to show:

(I) The condition number of $\Psi_t$ stays controlled and finite, which implies linear convergence of the non-normalized energy in $\mathrm{col}_\perp(B_*)$ according to (6); and

(II) The minimum singular value of the representation $B_{t+1}$ stays the same. Otherwise, the energy orthogonal to the ground-truth subspace could be decreasing while the representation becomes singular, which would mean the distance to the ground-truth subspace is not decreasing.

To show (I), note that the adapted heads are given by:
$$w_{t,i} = \underbrace{\Delta_t w_t}_{\text{non-unique}} + \underbrace{\alpha B_t^\top B_* w_{*,t,i}}_{\text{unique}}, \qquad (7)$$
where $\Delta_t := I_k - \alpha B_t^\top B_t$. The vector $\Delta_t w_t$ is present in every $w_{t,i}$, so we refer to it as the non-unique part of $w_{t,i}$. On the other hand, $\alpha B_t^\top B_* w_{*,t,i}$ is the unique part of $w_{t,i}$. Equation (7) shows that if the non-unique part of each $w_{t,i}$ is relatively small compared to the unique part, then $\Psi_t \approx \alpha^2 B_t^\top B_* \Psi_{*,t} B_*^\top B_t$, meaning the $w_{t,i}$'s are almost as diverse as the ground-truth heads.
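A short numerical check of this claim (our own illustration, reusing the notation from the sketches above): when $\Delta_t$ and $w_t$ are small and $B_t$ is within a constant distance of $B_*$, the adapted-head second-moment matrix $\Psi_t$ computed from the exact decomposition (7) is close to $\alpha^2 B_t^\top B_* \Psi_{*,t} B_*^\top B_t$ and has eigenvalues bounded away from zero. The specific scales of the perturbations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n, alpha = 20, 3, 20, 0.25

B_star = np.linalg.qr(rng.standard_normal((d, k)))[0]
Q = np.linalg.qr(B_star + 0.3 * rng.standard_normal((d, k)))[0]   # within constant distance of B_*
B = (Q + 0.02 * rng.standard_normal((d, k))) / np.sqrt(alpha)     # approximately alpha * B^T B = I_k
w = 0.01 * rng.standard_normal(k)                                 # small global head
Delta = np.eye(k) - alpha * B.T @ B                               # Delta_t, small but nonzero here

W_star = rng.standard_normal((n, k))                              # ground-truth heads
W_adapt = (Delta @ w)[:, None] + alpha * B.T @ B_star @ W_star.T  # decomposition (7), exact

Psi_t = W_adapt @ W_adapt.T / n                                   # (1/n) sum_i w_{t,i} w_{t,i}^T
Psi_approx = alpha**2 * B.T @ B_star @ (W_star.T @ W_star / n) @ B_star.T @ B

print(np.linalg.norm(Psi_t - Psi_approx, ord=2))   # small relative to the eigenvalues below
print(np.linalg.eigvalsh(Psi_t))                   # bounded away from 0: the adapted heads are diverse
```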
So we aim to show that $\|\Delta_t\|_2$ and $\|w_t\|_2$ remain small for all $t$. We specifically need to show they are small compared to $\sigma_{\min}^2(B_t^\top B_*)$, since this quantity roughly lower bounds the energy in the diverse part of the $w_{t,i}$'s. One can show that $\sigma_{\min}^2(\hat{B}_t^\top B_*) = 1 - \mathrm{dist}_t^2$, so we need to use that $\mathrm{dist}_t$ is decreasing in order to lower bound the energy in the unique part of $w_{t,i}$. It is also convenient to track $\|\Delta_t\|_2$ in order to show (II), since $\|\Delta_{t+1}\|_2 \le \epsilon$ implies $\sigma_{\min}(B_{t+1}) \ge \sqrt{\frac{1-\epsilon}{\alpha}}$. Note that for (II), we need control of $\|\Delta_{t+1}\|_2$, whereas to show (I) we needed control of $\|\Delta_t\|_2$. This difference in time indices is accounted for by the induction we will soon discuss.

It is now evident why it makes sense to initialize with $\|\Delta_0\|_2 = 0$ and $\|w_0\|_2 = 0$ (in fact, they do not have to be exactly zero; any initialization with $\|w_0\|_2 = O(\sqrt{\alpha})$ and $\|\Delta_0\|_2 = O(\alpha^2)$ would suffice). However, proving that $\|\Delta_t\|_2$ and $\|w_t\|_2$ remain small is difficult because the algorithm lacks explicit regularization or a normalization step after each round. Empirically, $\sigma_{\min}(B_t)$ may decrease and $\|w_t\|_2$ may increase on any particular round, so it is not clear why $\sigma_{\min}(B_t)$ does not go to zero (i.e. $\|\Delta_t\|_2$ does not go to 1) and $\|w_t\|_2$ does not blow up. To address these issues, one could add an explicit regularization term to the loss functions or an orthonormalization step to the algorithm, but doing so is empirically unnecessary and would not be consistent with the ANIL formulation or algorithm.

Inductive structure. We overcome the aforementioned challenges by executing a multi-way induction that involves the following six inductive hypotheses:

1. $A_1(t) := \{\|w_t\|_2 = O(\sqrt{\alpha}\,\min(1, \mu_*^2/\eta_*^2)\,\eta_*)\}$,
2. $A_2(t) := \{\|\Delta_t\|_2 \le \rho\|\Delta_{t-1}\|_2 + O(\beta^2\alpha^2 L_*^4\,\mathrm{dist}_{t-1}^2)\}$,
3. $A_3(t) := \{\|\Delta_t\|_2 \le \tfrac{1}{10}\}$,
4. $A_4(t) := \{0.9\,\alpha E_0\mu_*^2\, I_k \preceq \Psi_t \preceq 1.2\,\alpha L_*^2\, I_k\}$,
5. $A_5(t) := \{\|\hat{B}_{*,\perp}^\top B_t\|_2 \le \rho\|\hat{B}_{*,\perp}^\top B_{t-1}\|_2\}$,
6. $A_6(t) := \{\mathrm{dist}_t = \|\hat{B}_{*,\perp}^\top \hat{B}_t\|_2 \le \rho^{t-1}\}$,

where $\rho = 1 - 0.5\beta\alpha E_0\mu_*^2$. Our previous intuition motivates our choice of the inductive hypotheses $A_1(t), \dots, A_5(t)$ as intermediaries to ultimately show that $\mathrm{dist}_t$ linearly converges to zero. More specifically, $A_1(t)$, $A_2(t)$, and $A_3(t)$ bound $\|w_t\|_2$ and $\|\Delta_t\|_2$, $A_4(t)$ controls the diversity of the inner loop-adapted heads, and $A_5(t)$ and $A_6(t)$ confirm that the learned representation approaches the ground truth. We employ two upper bounds on $\|\Delta_t\|_2$ because we need to use that $\{\|\Delta_t\|_2\}_t$ is both summable ($A_2(t)$) and uniformly small ($A_3(t)$) to complete different parts of the induction. In particular, if true for all $t$, $A_2(t)$ shows that $\|\Delta_t\|_2$ may initially increase, but eventually linearly converges to zero due to the linear convergence of $\mathrm{dist}_t$.

The initialization implies each inductive hypothesis holds at time $t = 1$. We must show they hold at time $t+1$ if they hold up to time $t$. To do this, we employ the logic visualized in Figure 2. The top-level events ($A_1(t+1)$, $A_2(t+1)$, $A_5(t+1)$) are most immediate in the sense that they follow directly from other events at all times up to and including $t$ (via the dashed green arrows). The proofs of all other events at time $t+1$ require the occurrence of other events at time $t+1$, with more logical steps needed as one moves down the graph, and solid red arrows denoting implications from and to time $t+1$. In particular, $A_3(t+1)$ requires the events up to and including time $t$ and a top-level event at $t+1$, namely $A_2(t+1)$, so it is in the second level. Similarly, $A_6(t+1)$ requires events up to and including time $t$ and the second-level event at $t+1$, so it is in the third level, and so on.
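To see the induction at work empirically, one can extend the FO-ANIL sketch from Section 3.1 to track the three quantities the hypotheses control: $\|w_t\|_2$, $\|\Delta_t\|_2$, and $\mathrm{dist}_t$. The snippet below is again our own illustration with arbitrary dimensions and step sizes; in a run with diverse tasks, the first two quantities should stay small while $\mathrm{dist}_t$ decays.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 20, 3, 20
alpha, beta, T = 0.25, 0.1, 2000

B_star = np.linalg.qr(rng.standard_normal((d, k)))[0]
B = np.linalg.qr(rng.standard_normal((d, k)))[0] / np.sqrt(alpha)   # alpha * B^T B = I_k, so Delta_0 = 0
w = np.zeros(k)                                                     # w_0 = 0

def inductive_quantities(B, w):
    B_hat = np.linalg.qr(B)[0]
    dist_t = np.linalg.norm((np.eye(d) - B_star @ B_star.T) @ B_hat, ord=2)   # tracked by A6
    delta_t = np.linalg.norm(np.eye(k) - alpha * B.T @ B, ord=2)              # tracked by A2, A3
    return np.linalg.norm(w), delta_t, dist_t                                 # ||w_t|| tracked by A1

for t in range(T):
    W_star = rng.standard_normal((n, k))
    W_adapt = ((np.eye(k) - alpha * B.T @ B) @ w)[:, None] + alpha * B.T @ B_star @ W_star.T
    Resid = B @ W_adapt - B_star @ W_star.T
    B, w = B - beta * Resid @ W_adapt.T / n, w - beta * (B.T @ Resid).mean(axis=1)
    if t % 400 == 0:
        # ||w_t||_2 and ||Delta_t||_2 should remain small while dist_t decays toward zero
        print(t, inductive_quantities(B, w))
```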
Recall that our intuition is that diverse adapted heads lead to contraction of the non-normalized representation distance. This logic drives the implication $A_4(t) \implies A_5(t+1)$. We then reasoned that contraction of the non-normalized distance leads to linear convergence of the distance as long as the minimum singular value of the representation is controlled from below. This intuition is captured in the implication $A_5(t+1) \cap A_3(t+1) \implies A_6(t+1)$. We also discussed that the diversity of the adapted heads depends on the global head being small, the representation being close to a scaled orthonormal matrix, and the representation distance being bounded away from 1 at the start of that iteration. This ensures that the adapted heads are again diverse on iteration $t+1$; in particular, $A_1(t+1) \cap A_3(t+1) \cap A_6(t+1) \implies A_4(t+1)$. The other implications in the graph are technical and needed to control $\|w_{t+1}\|_2$ and $\|\Delta_{t+1}\|_2$.

Proving the implications. We now formally discuss each implication, starting with the top level. Full proofs are provided in Appendix C.

$A_4(t) \implies A_5(t+1)$. This is true by equation (6).

$A_1(t) \cap A_3(t) \cap A_6(t) \implies A_2(t+1)$. It can be shown that $\Delta_{t+1}$ is of the form
$$\Delta_{t+1} = \Delta_t\big(I_k - \beta\alpha^2 B_t^\top B_* \Psi_{*,t} B_*^\top B_t\big) + N_t \qquad (8)$$
for some matrix $N_t$ whose norm is upper bounded by a linear combination of $\|\Delta_t\|_2$ and $\mathrm{dist}_t$. We next use
$$\lambda_{\min}\big(B_t^\top B_* \Psi_{*,t} B_*^\top B_t\big) \ge \mu_*^2\,\sigma_{\min}^2(B_t^\top B_*) \gtrsim \frac{\mu_*^2}{\alpha}\big(1 - \mathrm{dist}_t^2\big), \qquad (9)$$
where (9) follows by $\sigma_{\min}^2(\hat{B}_t^\top B_*) = 1 - \mathrm{dist}_t^2$ and $A_3(t)$. The proof follows by applying $A_6(t)$ to control $1 - \mathrm{dist}_t^2$.

$\big(\cap_{s=1}^{t} A_2(s) \cap A_6(s)\big) \implies A_1(t+1)$. This is the most difficult implication to show. The FO-ANIL dynamics are such that $\|w_t\|_2$ may increase on every iteration throughout the entire execution of the algorithm. However, we can exploit the fact that the amount by which it increases is proportional to $\|\Delta_t\|_2$, which we can show is summable due to the linear convergence of $\mathrm{dist}_t$. First, we have
$$w_{t+1} = \big(I_k - \beta B_t^\top B_t \Delta_t\big)w_t + \frac{\beta}{n}\sum_{i=1}^{n}\Delta_t B_t^\top B_* w_{*,t,i},$$
which implies that $\|w_t\|_2$ increases on each iteration by $O(\frac{\beta}{\sqrt{\alpha}}\|\Delta_t\|_2\,\eta_*)$. In particular,
$$\|w_{t+1}\|_2 \overset{(a)}{\le} \Big(1 + \frac{2\beta}{\alpha}\|\Delta_t\|_2\Big)\|w_t\|_2 + \frac{2\beta\eta_*}{\sqrt{\alpha}}\|\Delta_t\|_2 \overset{(b)}{\le} \frac{2\beta\eta_*}{\sqrt{\alpha}}\sum_{s=1}^{t}\|\Delta_s\|_2\prod_{r=s}^{t}\Big(1 + \frac{2\beta}{\alpha}\|\Delta_r\|_2\Big) \overset{(c)}{\le} \frac{2\beta\eta_*}{\sqrt{\alpha}}\sum_{s=1}^{t}\|\Delta_s\|_2\Big(1 + \frac{1}{t-s}\sum_{r=s}^{t}\frac{2\beta}{\alpha}\|\Delta_r\|_2\Big)^{t-s},$$
where (b) follows by recursively applying (a) for $t, t-1, \dots$, and (c) follows by the AM-GM inequality.

Figure 2. Logical flow of the proof. Note that there are no cycles among the implications from $t+1$ to $t+1$, so the logic is consistent.

Next, for any $s \in [t]$, we recursively apply $A_2(s), A_2(s-1), \dots$ and use $A_6(r)$ for all $r \in [s]$ to obtain, for an absolute constant $c$,
$$\|\Delta_s\|_2 \overset{(d)}{\le} c\sum_{r=0}^{s-1}\rho^{s-r}\beta^2\alpha^2 L_*^4\,\mathrm{dist}_r^2 \le c\,\rho^{s}\sum_{r=0}^{s-1}\rho^{r}\beta^2\alpha^2 L_*^4.$$
Plugging (d) into (c), computing the sum of the geometric series, and applying the choice of $\beta$ completes the proof.

$A_2(t+1) \cap A_3(t) \implies A_3(t+1)$. This follows straightforwardly since $\beta$ is chosen sufficiently small.

$A_3(t+1) \cap \big(\cap_{s=1}^{t+1} A_5(s)\big) \cap A_6(t) \implies A_6(t+1)$. Using the definition of the principal angle distance, the Cauchy-Schwarz inequality, and $\cap_{s=1}^{t+1} A_5(s)$, we can show
$$\mathrm{dist}_{t+1} \le \frac{1}{\sigma_{\min}(B_{t+1})}\|\hat{B}_{*,\perp}^\top B_{t+1}\|_2 \le \frac{\sigma_{\max}(B_0)}{\sigma_{\min}(B_{t+1})}\,\rho^{t}\,\mathrm{dist}_0,$$
from which the proof follows after applying $A_3(t+1)$ and the initial conditions. Note that here we have normalized the representation only once, at time $t+1$, and used the contraction of the non-normalized energy to recurse from $t+1$ to 0, resulting in a $\frac{\sigma_{\max}(B_0)}{\sigma_{\min}(B_{t+1})}$ scaling error.
If we instead tried to directly show the contraction of the distance, and thereby normalized analytically on every round, we would obtain $\mathrm{dist}_{t+1} \le \prod_{s=0}^{t}\frac{\sigma_{\max}(B_s)}{\sigma_{\min}(B_{s+1})}\,\rho^{t}\,\mathrm{dist}_0$, meaning a $\prod_{s=0}^{t}\frac{\sigma_{\max}(B_s)}{\sigma_{\min}(B_{s+1})}$ scaling error, which is too large because $B_s$ is in fact not normalized on every round.

$A_1(t+1) \cap A_3(t+1) \cap A_6(t+1) \implies A_4(t+1)$. This follows by expanding each $w_{t,i}$ as in (7) and using similar logic as in (9).

5.2. Other results: ANIL, FO-MAML, and MAML

For ANIL, the inductive structure is nearly identical. The only meaningful change in the proof is that the second-order updates imply $\|w_{t+1}\|_2 - \|w_t\|_2 = O(\|\Delta_t\|_2^2)$, which is smaller than the $O(\|\Delta_t\|_2)$ increase for FO-ANIL, and thereby allows us to control $\|w_{t+1}\|_2$ with a potentially larger $\beta$.

For FO-MAML and MAML, recall that the inner loop update of the representation weakens the benefit of adapted-head diversity (see Section 3.1). Thus, larger adapted-head diversity is needed to learn $\mathrm{col}(B_*)$. Specifically, we require a tighter bound of $\|\Delta_t\|_2 = O(\alpha^2)$, compared to the $\|\Delta_t\|_2 = O(1)$ bound in ANIL, and for FO-MAML, we also require a tighter bound on $\|w_t\|_2$ (recall from Section 5 that smaller $\|\Delta_t\|_2$ and $\|w_t\|_2$ improve adapted-head diversity). Moreover, to obtain tight bounds on $\|w_{t+1}\|_2$ we can no longer use that $\|w_{t+1}\|_2 - \|w_t\|_2$ is controlled by $\|\Delta_t\|_2$, due to additional terms in the outer loop update. To overcome these issues, we must make stricter assumptions on the initial distance and, in the case of FO-MAML, on the average ground-truth head. Please see Appendix D for details.

Finally, the proof of Theorem 4 relies on showing concentration of the finite-sample gradients to the population gradients. The principal challenge is showing this concentration for fourth-order products of the data that arise in the ANIL updates, since we cannot apply standard methods to these higher-order products while maintaining $o(d)$ samples per task. Instead, we leverage the low-rankness of the products by applying a truncated version of the concentration result for low-rank random matrices from (Magen & Zouzias, 2011). We also use the L4-L2 hypercontractivity of the data to control the bias in these higher-order products. Details are found in Appendix E.

6. Numerical simulations

In this section we run numerical simulations to verify our theoretical findings. First, we explore the effect of task diversity on MAML's rate of convergence to the ground-truth representation. In Figure 3, we execute MAML on the task population losses ($m_{in} = m_{out} = \infty$) in the multi-task linear representation learning setting. We set $d = 100$ and $k = n = 5$. On each round, the ground-truth heads are sampled i.i.d. from $\mathcal{N}(0, \mathrm{diag}([1, \dots, 1, \mu^2]))$, where $\mu^2 < 1$. We randomly draw $B_*$ and $B_0$ at the start of algorithm execution. The parameter $\mu^2$ controls task diversity, with larger $\mu^2$ meaning the ground-truth heads are closer to isotropic and therefore more diverse. The results show that MAML's linear convergence rate improves with greater task diversity, consistent with Theorem 2.

Figure 3. Task diversity improves convergence rate. Principal angle distance vs. number of iterations for MAML with varying ground-truth head distributions. The larger the value of $\mu^2$, the more diverse the ground-truth heads.

Figure 4. MAML and FO-MAML initialization and ground-truth mean conditions are empirically necessary. (Left) Random $B_0$. (Right) Methodical $B_0$.
In both cases, the mean ground-truth head is far from zero.

Next, we show that the additional conditions, relative to ANIL and FO-ANIL, required for MAML and FO-MAML to learn the ground-truth representation are empirically necessary. That is, (i) MAML and FO-MAML require a good initialization relative to the underlying task diversity, and (ii) FO-MAML further requires the ground-truth heads to be concentrated around zero. To test these conditions, we set $d = 20$, $n = k = 3$, randomly draw $B_*$, and use the task population losses. The ground-truth heads are drawn as $w_{*,t,i} \sim \mathcal{N}(10\cdot 1_k, I_k)$. Ground-truth task diversity is thus low, since most of the energy points in the direction $1_k$. In Figure 4 (left), we use a random Gaussian initialization of $B_0$, which has $\mathrm{dist}_0 \approx 0.99$. In Figure 4 (right), we initialize with a noisy version of $B_*$ satisfying $\mathrm{dist}_0 \in [0.65, 0.7]$. The plots show that in this low-diversity setting, MAML requires a good initialization to achieve linear convergence, whereas FO-MAML cannot obtain it even with a good initialization, as $\mathbb{E}[w_{*,t,i}] \neq 0$. Lastly, note that Figure 1 employs the same setting as Figure 4 (left), except that the mean of the ground-truth heads is zero in the former case, which leads to all four GBML approaches learning $\mathrm{col}(B_*)$.

7. Conclusion

Our analysis reveals that ANIL, MAML, and their first-order approximations exploit task diversity via inner adaptation steps of the head to recover the ground-truth representation in the multi-task linear representation learning setting. Further, task diversity helps these algorithms exhibit an implicit regularization that keeps the learned representation well-conditioned. However, the inner adaptation of the representation plays a restrictive role, inhibiting MAML and FO-MAML from achieving global convergence. To the best of our knowledge, these are the first results showing that GBML algorithms can learn a low-dimensional subspace.

Acknowledgements

This research is supported in part by NSF Grants 2127697, 2019844, 2107037, and 2112471, ARO Grant W911NF2110226, ONR Grant N00014-19-1-2566, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.

References

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. In Seventh International Conference on Learning Representations, 2019.

Arnold, S., Iqbal, S., and Sha, F. When MAML Can Adapt Fast and How to Assist When it Cannot. In International Conference on Artificial Intelligence and Statistics, pp. 244-252. PMLR, 2021.

Balcan, M.-F., Khodak, M., and Talwalkar, A. Provable Guarantees for Gradient-Based Meta-Learning. In International Conference on Machine Learning, pp. 424-433. PMLR, 2019.

Baxter, J. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

Bernacchia, A. Meta-Learning with Negative Learning Rates. In International Conference on Learning Representations, 2020.

Bertinetto, L., Henriques, J. F., Torr, P., and Vedaldi, A. Meta-Learning with Differentiable Closed-form Solvers. In International Conference on Learning Representations, 2018.

Bouniot, Q., Redko, I., Audigier, R., Loesch, A., Zotkin, Y., and Habrard, A. Towards Better Understanding Meta-learning Methods through Multi-task Representation Learning Theory. arXiv preprint arXiv:2010.01992, 2020.

Bullins, B., Hazan, E., Kalai, A., and Livni, R. Generalize Across Tasks: Efficient Algorithms for Linear Representation Learning. In Algorithmic Learning Theory, pp.
235 246. PMLR, 2019. Caruana, R. Multitask Learning. Machine learning, 28(1): 41 75, 1997. Chen, J., Wu, X.-M., Li, Y., Li, Q., Zhan, L.-M., and Chung, F.-l. A Closer Look at the Training Strategy for Modern Meta-Learning. Advances in Neural Information Processing Systems, 33, 2020. Chua, K., Lei, Q., and Lee, J. D. How fine-tuning allows for effective Meta-Learning. Advances in Neural Information Processing Systems, 34, 2021. Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting Shared Representations for Personalized Federated Learning. In International Conference on Machine Learning, pp. 2089 2099. PMLR, 2021. Collins, L., Mokhtari, A., and Shakkottai, S. How Does the Task Landscape Affect MAML performance? arxiv preprint ar Xiv:2010.14672, 2022. Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental Learning-to-Learn with Statistical Guarantees. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, volume 34, pp. 457 466. AUAI, 2018. Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-Shot Learning via Learning the Representation, Provably. In International Conference on Learning Representations, 2020. Fallah, A., Mokhtari, A., and Ozdaglar, A. On the Convergence Theory of Gradient-Based Model-Agnostic Meta Learning Algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082 1092. PMLR, 2020a. Fallah, A., Mokhtari, A., and Ozdaglar, A. Personalized Federated learning with Theoretical Guarantees: A Model Agnostic Meta-Learning Approach. In Advances in Neural Information Processing Systems, volume 33, pp. 3557 3568, 2020b. Fallah, A., Mokhtari, A., and Ozdaglar, A. Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks. Advances in Neural Information Processing Systems, 34, 2021. Finn, C., Abbeel, P., and Levine, S. Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126 1135. JMLR. org, 2017. Finn, C., Xu, K., and Levine, S. Probabilistic Model Agnostic Meta-Learning. Advances in Neural Information Processing Systems, 31, 2018. Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online Meta-Learning. In International Conference on Machine Learning, pp. 1920 1930. PMLR, 2019. Goldblum, M., Reich, S., Fowl, L., Ni, R., Cherepanova, V., and Goldstein, T. Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks. In International Conference on Machine Learning, pp. 3607 3616. PMLR, 2020. Hospedales, T. M., Antoniou, A., Micaelli, P., and Storkey, A. J. Meta-Learning in Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. Hsu, D., Kakade, S. M., and Zhang, T. Random Design Analysis of Ridge Regression. In Conference on learning theory, pp. 9 1. JMLR Workshop and Conference Proceedings, 2012. Ji, K., Lee, J. D., Liang, Y., and Poor, H. V. Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters. Advances in Neural Information Processing Systems, 33:11490 11500, 2020a. Ji, K., Yang, J., and Liang, Y. Multi-Step Model-Agnostic Meta-Learning: Convergence and Improved algorithms. Co RR, abs/2002.07836, 2020b. Jiang, Y., Koneˇcn y, J., Rush, K., and Kannan, S. Improving Federated Learning Personalization via Model Agnostic Meta Learning. ar Xiv preprint ar Xiv:1909.12488, 2019. Kao, C. H., Chiu, W.-C., and Chen, P.-Y. MAML is a noisy contrastive learner in classification. 
In International Conference on Learning Representations, 2022. Kumar, R., Deleu, T., and Bengio, Y. Effect of Diversity in Meta-Learning. In Fifth Workshop on Meta-Learning at the Conference on Neural Information Processing Systems, 2021. Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta Learning with Differentiable Convex Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657 10665, 2019. Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. ar Xiv preprint ar Xiv:1707.09835, 2017. Magen, A. and Zouzias, A. Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication. In Proceedings of the twenty-second annual ACMSIAM symposium on Discrete Algorithms, pp. 1422 1436. SIAM, 2011. MAML and ANIL Provably Learn Representations Maurer, A., Pontil, M., and Romera-Paredes, B. The Benefit of Multitask Representation Learning. Journal of Machine Learning Research, 17(81):1 32, 2016. Mc Namara, D. and Balcan, M.-F. Risk Bounds for Transferring Representations With and Without Fine-Tuning. In International Conference on Machine Learning, pp. 2373 2381. PMLR, 2017. Ni, R., Goldblum, M., Sharaf, A., Kong, K., and Goldstein, T. Data augmentation for Meta-Learning. In International Conference on Machine Learning, pp. 8152 8161. PMLR, 2021. Nichol, A. and Schulman, J. Reptile: A Scalable Meta Learning Algorithm. ar Xiv preprint ar Xiv:1803.02999, 2:2, 2018. Oh, J., Yoo, H., Kim, C., and Yun, S.-Y. BOIL: Towards Representation Change for Few-Shot Learning. In International Conference on Learning Representations, 2020. Raghu, A., Raghu, M., Bengio, S., and Vinyals, O. Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. In International Conference on Learning Representations, 2020. Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-Learning with Implicit Gradients. Advances in Neural Information Processing Systems, 32, 2019. Ravi, S. and Larochelle, H. Optimization as a Model for Few-Shot Learning. 2016. Saunshi, N., Zhang, Y., Khodak, M., and Arora, S. A Sample Complexity Separation Between Non-Convex and Convex Meta-Learning. In International Conference on Machine Learning, pp. 8512 8521. PMLR, 2020. Saunshi, N., Gupta, A., and Hu, W. A Representation Learning Perspective on the Importance of Train-Validation Splitting in Meta-Learning. In International Conference on Machine Learning, pp. 9333 9343. PMLR, 2021. Schmidhuber, J. Evolutionary Principles in Self-Referential Learning, or on Learning how to Learn: the Meta-Meta-... Ph D thesis, Technische Universit at M unchen, 1987. Setlur, A., Li, O., and Smith, V. Is Support Set Diversity Necessary for Meta-Learning? ar Xiv preprint ar Xiv:2011.14048, 2020. Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077 4087, 2017. Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Statistically and Computationally Efficient Linear Meta-Representation Learning. Advances in Neural Information Processing Systems, 34, 2021. Tripuraneni, N., Jordan, M., and Jin, C. On the Theory of Transfer Learning: The Importance of Task Diversity. Advances in Neural Information Processing Systems, 33: 7852 7862, 2020. Tripuraneni, N., Jin, C., and Jordan, M. Provable Meta Learning of Linear Representations. In International Conference on Machine Learning, pp. 10434 10443. PMLR, 2021. Vershynin, R. 
High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching Networks for One Shot Learning. Advances in Neural Information Processing Systems, 29:3630 3638, 2016. Wang, H., Zhao, H., and Li, B. Bridging Multi-task Learning and Meta-Learning: Towards Efficient Training and Effective Adaptation. In International Conference on Machine Learning. PMLR, 2021a. Wang, L., Cai, Q., Yang, Z., and Wang, Z. On the Global Optimality of Model-Agnostic Meta-Learning. In International Conference on Machine Learning, pp. 9837 9846. PMLR, 2020. Wang, X., Yuan, S., Wu, C., and Ge, R. Guarantees for Tuning the Step Size Using a Learning-to-Learn Approach. In International Conference on Machine Learning, pp. 10981 10990. PMLR, 2021b. Xu, Z. and Tewari, A. Representation Learning Beyond Linear Prediction Functions. Advances in Neural Information Processing Systems, 34, 2021. Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian Model-Agnostic Meta-Learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7343 7353, 2018. Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient Meta Learning via Minibatch Proximal Update. In Advances in Neural Information Processing Systems, pp. 1532 1542, 2019. Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast Context Adaptation via Meta-Learning. In International Conference on Machine Learning, pp. 7693 7702. PMLR, 2019. MAML and ANIL Provably Learn Representations A. Additional Related Work Meta-learning background. Multi-task representation learning and meta-learning have been of theoretical interest for many years (Schmidhuber, 1987; Caruana, 1997; Baxter, 2000). Recently, meta-learning methods have garnered much attention due to successful implementations in few-shot learning scenarios with deep networks. These modern approaches are roughly grouped into three categories: model-based (Ravi & Larochelle, 2016), metric-based (Snell et al., 2017; Vinyals et al., 2016), and gradient-based (Finn et al., 2017). In this paper we focus on gradient-based methods. Gradient-based meta-learning and MAML. The practicality and simplicity of model-agnostic meta-learning (MAML) (Finn et al., 2017) has led to many experimental and theoretical studies of gradient-based meta-learning in addition to those mentioned in Section 1. There have been numerous algorithms proposed as extensions of MAML (Li et al., 2017; Finn et al., 2018; Yoon et al., 2018; Antoniou et al., 2019; Nichol & Schulman, 2018; Rajeswaran et al., 2019; Zhou et al., 2019; Raghu et al., 2020; Zintgraf et al., 2019), and MAML has been applied to online (Finn et al., 2019) and federated (Fallah et al., 2020b; Jiang et al., 2019) learning settings. Theoretical analyses of MAML and related methods have included sample complexity guarantees in online settings (Balcan et al., 2019; Denevi et al., 2018), general convergence guarantees (Fallah et al., 2020a; Ji et al., 2020b;a), and landscape analysis (Wang et al., 2020; Collins et al., 2022). Other works have studied the choice of inner loop step size (Wang et al., 2021b; Bernacchia, 2020) and generalization (Chen et al., 2020; Fallah et al., 2021), all without splitting model parameters. Gradient-based meta-learning and representation learning. 
A growing line of research has endeavored to develop and understand gradient-based meta-learning with a representation learning perspective. Besides ANIL, multiple other meta-learning methods fix the representation in the inner loop (Lee et al., 2019; Bertinetto et al., 2018). Goldblum et al. (2020) showed that these meta-learners learn representations that empirically exhibit the desirable behavior of clustering features by class. However, they also gave evidence suggesting this is not true for MAML since it adapts the feature extractor during the inner loop. Meanwhile, other works have argued for the benefits of adapting the representation in the inner loop both experimentally, when the head is fixed (Oh et al., 2020), and theoretically, when the task optimal solutions may not share a representation (Chua et al., 2021). Two recent works have argued that ANIL behaves similarly to empirically successful approaches for representation learning. Wang et al. (2021a) showed that the models learned by ANIL and multi-task learning with a shared representation and unique heads are close in function space for sufficiently wide and deep Re LU networks, when the inner loop learning rate and number of inner adaptation steps for ANIL is small. Kao et al. (2022) noticed that ANIL with the global head initialized at zero at the start of each round is a noisy contrastive learner in the sense that outer loop update for the representation is a gradient step with respect to a contrastive loss, which suggests that ANIL should learn quality representations. Moreover, they showed that zeroing the global head at the start of each round empirically improves the performance of both ANIL and MAML. However, neither work shows that ANIL, let alone MAML, can in fact learn expressive representations. Additionally, our analysis rigorously explains the observation from Kao et al. (2022) that having small wt 2 aids representation learning. Meta-learning and task diversity. Initial efforts to empirically understand the effects of meta-training task diversity on meta-learning performance with neural networks have shown a promising connection between the two, although the picture is not yet clear. Ni et al. (2021) and Bouniot et al. (2020) made modifications to the the meta-training task distribution and the meta-learning objective, respectively, to improve the effective task diversity, and both resulted in significant improvements in performance for a range of meta-learners. On the other hand, Setlur et al. (2020) and Kumar et al. (2021) empirically argued that reducing the overall diversity of the meta-training dataset does not restrict meta-learning. However, Setlur et al. (2020) only considered reducing intra-task data diversity, not the diversity of the tasks themselves (as no classes were dropped from the meta-training dataset), and the results due to Kumar et al. (2021) showed that reducing the overall number of tasks seen during meta-training hurts performance for most meta-learners, including MAML. Multi-task linear representation learning. Several works have studied a similar multi-task linear representation learning setting as ours (Saunshi et al., 2021; Thekumparampil et al., 2021; Collins et al., 2021; Du et al., 2020; Tripuraneni et al., 2021; Bullins et al., 2019; Maurer et al., 2016; Mc Namara & Balcan, 2017), but did not analyze MAML or ANIL. 
Moreover, multiple works have shown that task diversity is necessary to learn generalizable representations from a statistical perspective (Du et al., 2020; Tripuraneni et al., 2020; Xu & Tewari, 2021; Tripuraneni et al., 2021). Our work complements these by showing the benefit of task diversity to gradient-based meta-learning methods from an optimization perspective. MAML and ANIL Provably Learn Representations B. General Lemmas First we define the following notations used throughout the proofs. Notation Definition t Ik αB t Bt t Id αBt B t L maxt [T ] σ0.5 max 1 n Pn i=1 w ,t,iw ,t,i L µ 0 < µ mint [T ] σ0.5 min 1 n Pn i=1 w ,t,iw ,t,i η maxt [T ] 1 n Pn i=1 w ,t,i 2 η L Lmax maxt [T ],i [n] w ,t,i 2 Lmax c k L for constant c κ L /µ κ ,max Lmax/µ Now we have the following general lemmas. Lemma 1. Suppose Assumption 1 holds and for some t, dist2 t 1 1 τ dist2 0. Also, suppose s 2 τ for all s [t]. Then, for E0 := 1 τ dist2 0, i=1 B t B w ,t,iw ,t,i B Bt E0µ2 α (10) Proof. First note that since σmin(A1A2) σmin(A1)σmin(A2) for any two square matrices A1, A2, we have i=1 w ,t,iw ,t,i B Bt σ2 min(B Bt)σmin i=1 w ,t,iw ,t,i σ2 min(B Bt)µ2 σ2 min(B ˆBt)σ2 min(Rt)µ2 σ2 min(B ˆBt) 1 τ where Bt = ˆBt Rt is the QR-factorization of Bt. Next, observe that dist2 t := B , ˆBt 2 2 = (Id B B )ˆBt 2 2 = max u Rk: u 2=1 u ˆB t (Id B B )(Id B B )ˆBtu = max u Rk: u 2=1 u ˆB t (Id B B )ˆBtu = max u Rk: u 2=1 u (Ik ˆB t B B ˆBt)u = 1 σ2 min(B ˆBt) = σ2 min(B ˆBt) = 1 dist2 t 1 1 1 τ dist2 0 . (12) which gives the result after combining with (11). Note that all four algorithms considered (FO-ANIL, Exact ANIL, FO-MAML, Exact MAML) execute the same inner loop update procedure for the head. The following lemma characterizes the diversity of the inner loop-updated heads for all four algorithms, under some assumptions on the behavior of distt and t 2 which we will show are indeed satisfied later. Lemma 2. Suppose Assumption 1 holds and that on some iteration t, FO-ANIL, Exact ANIL, FO-MAML, and Exact MAML MAML and ANIL Provably Learn Representations satisfy dist2 t 1 1 τ dist2 0 and s 2 τ for all s [t]. Then the inner loop-updated heads on iteration t satisfy: i=1 wt,iw t,i αE0µ2 2(1 + t 2) α t 2 wt 2η (13) i=1 wt,iw t,i ( t 2 wt 2 + p α(1 + t 2)L )2. (14) Proof. We first lower bound the minimum singular value. Observe that 1 t 2 α σ2 min(Bt) σ2 max(Bt) 1+ t 2 α by Weyl s inequality. Next, we have i=1 wt,iw t,i i=1 (Ik αB t Bt)wtw t (Ik αB t Bt) + α(Ik αB t Bt)wtw ,t,i B Bt + αB t B w ,t,iw t (Ik αB t Bt) + α2B t B w ,t,iw ,t,i B Bt i=1 α2B t B w ,t,iw ,t,i B Bt i=1 w ,t,i B Bt i=1 w ,t,i B Bt αE0µ2 2(1 + t 2) α t 2 wt 2η (17) where (15) follows by Weyl s inequality and the fact that B t Btwtw t B t Bt 0, (16) follows by Lemma 1, and (17) follows by the Cauchy-Schwarz inequality. Now we upper bound the maximum singular value of 1 n Pn i=1 wt,iw t,i. We have i=1 wt,iw t,i (Ik αB t Bt)wtw t (Ik αB t Bt) 2 + 2α (Ik αB t Bt)wt 2 i=1 B t B w ,t,i i=1 B t B w ,t,i 2 t 2 2 wt 2 2 + 2 t 2 wt 2 p α(1 + t 2)η + α(1 + t 2)L2 ( t 2 wt 2 + p α(1 + t 2)L )2. (18) Lemma 3. Suppose the sequence {ws}t+1 s=0 satisfies: ws+1 2 (1 + ξ1,s) ws 2 + ξ2,s (19) where ξ1,s 0, ξ2,s 0 for all s [t] and Pt s=1 ξ1,s 1. Then: MAML and ANIL Provably Learn Representations Proof. We have wt+1 2 (1 + ξ1,t) wt 2 + ξ2,t (1 + ξ1,t)2 wt 1 2 + ξ2,t(1 ξ1,t) + ξ2,t ... s=1 (1 + ξ1,s) + r=s (1 + ξ1,r) r=s (1 + ξ1,r) (21) where (21) is due to w0 2 = 0 and (22) follows from the AM-GM inequality. 
Next, note that 1 + 1 t s Pt r=s ξ1,r t s is of the form 1 + a x x, where x = t s and a = Pt r=s ξ1,r. Since 1 + a x x is an increasing function of x, we can upper bound it by its limit as x , which is ea. Thus we have s=1 ξ2,s exp where (23) follows from the numerical inequality exp(x) 1 + 2x for all x [0, 1]. Lemma 4. Suppose that Bt+1 = Bt βGt and Gt = t St Bt χSt Bt t + Nt (24) for Nt Rd k and a positive semi-definite matrix St Rk k. Then t+1 2 t 2 1 (1 + χ)βασmin(B t St Bt) + 2βα B t Nt 2 + β2α Gt 2 2 (25) Proof. By expanding t+1, Bt+1, and Gt, we obtain t+1 = I αB t+1Bt+1 = I αB t Bt + βαB Gt + βαG t Bt β2αG t Gt = t βα t B t St Bt χβαB t St Bt t + βαB t Nt χβαB t St Bt t βα t B t St Bt + βαN t Bt β2αG t Gt (26) 2 t Ik (1 + χ)βαB t St Bt 2 Ik (1 + χ)βαB t St Bt t + βα(B t Nt + N t Bt) β2αG t Gt (27) t+1 2 t 2 Ik (1 + χ)βαB t St Bt 2 + 2βα B t Nt 2 + β2α Gt 2 2 t 2 1 (1 + χ)βασmin(B t St Bt) + 2βα B t Nt 2 + β2α Gt 2 2 (28) where the last inequality follows by the triangle and Weyl inequalities. MAML and ANIL Provably Learn Representations C. ANIL Infinite Samples We start by considering the infinite sample case, wherein min =mout = . Let E0 := 0.9 dist2 0. We restate Theorem 1 here with full constants. Theorem 5 (ANIL Infinite Samples). Let min =mout = and define E0 :=0.9 dist2 0. Suppose Assumption 1 holds and dist0 0.9. Let α < 1 L , αB 0 B0 = Ik and wt = 0. Then FO-ANIL with β αE3 0µ 180κ4 min(1, µ2 η2 ) and Exact ANIL with β αE2 0 40κ4 both satisfy that after T iterations, dist(BT , B ) 1 0.5βαE0µ2 T 1 (29) Proof. The proof uses an inductive argument with the following six inductive hypotheses: 1. A1(t) := { wt 2 αE0 10 min(1, µ2 η2 )η } 2. A2(t) := { t 2 (1 0.5βαE0µ2 ) t 1 2 + 5 4α2β2L4 dist2 t 1}, 3. A3(t) := { t 2 1 10}, 4. A4(t) := {0.9αE0µ2 Ik 1 n Pn i=1 wt,iw t,i 1.2αL2 Ik}, 5. A5(t) := { B , Bt 2 1 0.5βαE0µ2 B , Bt 1 2}, 6. A6(t) := {distt 1 0.5βαE0µ2 t}. These conditions hold for iteration t = 0 due to the choice of initialization (B0, w0) satisfying Ik αB 0 B0 = 0 and w0 = 0. We will show that if they hold for all iterations up to and including iteration t for an arbitrary t, then they hold at iteration t + 1. 1. Tt s=0{A2(s) A6(s)} = A1(t + 1). This is Lemma 5 for FO-ANIL and Lemma 9 for Exact ANIL. 2. A1(t) A3(t) A5(t) = A2(t + 1). This is Lemma 6 for FO-ANIL and Lemma 10 for Exact ANIL. 3. A2(t+1) A3(t) = A3(t + 1). This is Corollary 1 for FO-ANIL and Corollary 2 for Exact ANIL. 4. A1(t + 1) A3(t + 1) A6(t + 1) = A4(t + 1). This is Lemma 7 for FO-ANIL and Lemma 12 for Exact ANIL. 5. FO-ANIL: A4(t) = A5(t + 1). This is Lemma 8. Exact ANIL: A1(t) A3(t) A4(t) = A5(t+1). This is Lemma 11. The slight discrepancy here in the implications is due to the extra terms in the outer loop representation update for Exact ANIL. 6. A3(t+1) Tt+1 s=0 A5(s) = A6(t + 1). Recall distt+1 = B , ˆBt+1 2 where ˆBt+1 is the orthogonal matrix resulting from the QR factorization of Bt+1, i.e. Bt+1 = ˆBt+1Rt+1 for an upper triangular matrix Rt+1. By A3(t+1) and t+1 s=0A5(s) we have 1 t+1 2 α distt+1 = 1 t+1 2 α B , Bt+1 2 σmin(Bt+1) B , Bt+1 2 1 0.5βαE0µ2 t B , B0 2 1 α 1 0.5βαE0µ2 t B , B0 2 = 1 α 1 0.5βαE0µ2 t dist0 . MAML and ANIL Provably Learn Representations Dividing both sides by 1 t+1 2 α and using the facts that dist0 3 10 and t+1 2 1 10 yields 1 t+1 2 1 0.5βαE0µ2 t dist0 10 3 1 0.5βαE0µ2 t dist0 1 0.5βαE0µ2 t , (30) as desired. C.1. 
FO-ANIL First note that the inner loop updates for FO-ANIL can be written as: wt,i = wt α w Lt,i(Bt, wt) = (Ik αB t Bt)wt + αB t B w ,t,i, (31) while the outer loop updates for the head and representation are: wt+1 = wt β i=1 w Lt,i(Bt, wt,i) i=1 B t Btwt,i + β i=1 B t B w ,t,i (32) Bt+1 = Bt β i=1 BLt,i(Bt, wt,i) i=1 Btwt,iw t,i + β i=1 B w ,t,iw t,i (33) = Bt β(Id αBt B t ) 1 i=1 (Btwt B w ,t,i)w t,i (34) Lemma 5 (FO-ANIL A1(t + 1)). Suppose we are in the setting of Theorem 1, and that the events A2(s) and A6(s) hold for all s [t]. Then wt+1 2 1 10 αE0 min(1, µ2 η2 )η . (35) Proof. For all s = 1, . . . , t, the outer loop updates for FO-ANIL can be written as: ws+1 = ws β i=1 w Ls,i(Bs, ws,i) = ws β i=1 B s Bsws,i + β i=1 B s B w ,s,i (36) Substituting the definition of ws,i, we have ws+1 = ws βB s Bs(I αB s Bs)ws αβ i=1 B s Bs B s B w ,s,i + β i=1 B s B w ,s,i = (Ik β(I αB s Bs)B s Bs)ws + β(I αB s Bs)B s B 1 n i=1 w ,s,i (37) Note that St s=0 A3(s) implies σmax(B s Bs) 1+ s 2 α for all s {0, . . . , t + 1}. Let c := 1.1. Using σmax(B s Bs) c α with (37), we obtain ws+1 2 (1 + cβ α s 2) ws 2 + cβ α s 2η (38) MAML and ANIL Provably Learn Representations for all s {0, . . . , t}. Therefore, by applying Lemma 3 with ξ1,s = cβ α s 2 and ξ2,s = cβ α s 2η , we have Next, let ρ := 1 0.5βαE0µ2 . By St s=0 A2(s), we have for any s [t] s 2 ρ s 1 2 + 5 4α2β2L4 dist2 s 1 ρ2 s 2 2 + 5 4α2β2ρL4 dist2 s 2 + 5 4α2β2L4 dist2 s 1 ... r=0 ρs 1 r dist2 r r=0 ρs 1 r dist2 r (40) since 0 2 = 0 by choice of initialization. Next, we have that dists ρs for all s {0, ..., t} by St s=0 A5(s). Thus, for any s {0, ..., t}, we can further bound s 2 as r=0 ρs 1 rρ2r 4α2β2L4 ρs 1 s 1 X ρs 1 5α2β2L4 4(1 ρ) ρs 1 5βαL4 2E0µ2 , (41) which means that cβ αρs 1 5βαL4 2E0µ2 η β αρr 1 5βαL4 2E0µ2 2.5cβ2 α L4 η E0µ2 1 + 5cβ2 L4 E0µ2 2.5cβ2 α L4 η E0µ2 s=1 ρs 1 1 + 6β2 L4 ρs 3β2 α L4 η E0µ2 s=1 ρs 1 1 + 12 βL4 αE2 0µ4 6β2 α L4 η E0µ2 s=1 1.5ρs 1 (42) 18 βκ4 η αE2 0 1 10 αE0 min(1, µ2 η2 )η (43) where (42) and (43) follow since β αE3 0 180κ4 min(1, µ2 η2 )η . Remark 1. As referred to in Section 5, it is not necessary to start with w0 2 and 0 2 strictly equal to zero. Precisely, it can be shown that the above lemma still holds with w0 2 c αE0 min(1, µ2 η2 )η and 0 2 c βαL4 E0µ2 for a sufficiently MAML and ANIL Provably Learn Representations small absolute constant c. Inductive hypothesis A6(t + 1) would also continue to hold under this initialization, with a slightly different constant. These are the only times we use w0 2 = 0 2 = 0, so the rest of the proof would hold. Similar statements can be made regarding the rest of the algorithms. Lemma 6 (FO-ANIL A2(t + 1)). Suppose we are in the setting of Theorem 1 and that A1(t), A3(t), A6(t) hold. Then A2(t + 1) holds, i.e. t+1 2 1 0.5βαE0µ2 t 2 + 5 4β2α2L4 dist2 t . (44) Proof. Let Gt be the outer loop gradient for the representation, i.e. Gt = 1 β (Bt Bt+1). We aim to apply Lemma 4, we write Gt as t St Bt + Nt, for some positive definite matrix St and another matrix Nt. We have i=1 (Btwt,i B w ,t,i)w t,i i=1 (Bt twt + αBt B t B w ,t,i B w ,t,i)w t,i i=1 t(Btwt B w ,t,i)w t,i i=1 w ,t,iw ,t,i i=1 t Btwtw t,i 1 i=1 t B w ,t,iw t t (45) = t St Bt + Nt where (45) follows by expanding wt,i, and St = αB 1 n Pn i=1 w ,t,iw ,t,i B and Nt = 1 n Pn i=1 t Btwtw t,i 1 n Pn i=1 t B w ,t,iw t t. 
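As an aside, to make the FO-ANIL updates (31)–(34) concrete, the sketch below performs one outer iteration at the population level, i.e., with the infinite-sample task losses $L_{t,i}(B, w) = \tfrac{1}{2}\|Bw - B_* w_{*,t,i}\|_2^2$. It is an illustration only: the function name, its signature, and the explicit loops are our choices and not the authors' implementation.

```python
import numpy as np

def fo_anil_step(B, w, W_star_heads, B_star, alpha, beta):
    """One population-level FO-ANIL outer iteration in the linear setting.

    B: (d, k) representation; w: (k,) global head; W_star_heads: (n, k) ground-truth
    heads of the sampled tasks. Gradients are those of the infinite-sample losses
    L_{t,i}(B, w) = 0.5 * ||B w - B_star w_{*,t,i}||^2.
    """
    # Inner loop (eq. (31)): adapt only the head, one gradient step per task.
    adapted_heads = []
    for w_star in W_star_heads:
        resid = B @ w - B_star @ w_star
        adapted_heads.append(w - alpha * B.T @ resid)

    # Outer loop (eqs. (32)-(33)): first-order gradients evaluated at the adapted heads.
    grad_w = np.zeros_like(w)
    grad_B = np.zeros_like(B)
    for w_star, w_i in zip(W_star_heads, adapted_heads):
        resid_i = B @ w_i - B_star @ w_star
        grad_w += B.T @ resid_i
        grad_B += np.outer(resid_i, w_i)
    n_tasks = len(W_star_heads)
    return B - beta * grad_B / n_tasks, w - beta * grad_w / n_tasks
```

Iterating this map from an initialization with $\alpha B_0^\top B_0 = I_k$ and $w_0 = 0$ is one way to visualize the behavior established in Lemmas 5–8: the global head remains small while the representation aligns with $B_*$.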
Since σmin(B t St Bt) E0µ2 (by Lemma 1), we have by Lemma 4 t+1 2 (1 βαE0µ2 ) t 2 + 2βα B t Nt 2 + β2α Gt 2 2 (46) To bound B t Nt 2, we have i=1 t B t Btwtw t,i 1 i=1 t B t B w ,t,iw t t t B t Btwtw t t 2 + α i=1 t B t Btwtw ,t,i B Bt i=1 t B t B w ,t,iw t t α t 2 2 wt 2 2 + c α( t 2 + t 2 2) wt 2η c E0µ2 1000κ4 t 2 + 11 100c E0µ2 t 2 8µ2 t 2 (47) where we have used A1(t) and A3(t) and the fact that min(1, µ2 η2 )η2 µ2 . To bound Gt 2 2 we have Gt 2 t St Bt 2 + Nt 2 c αL2 ( t 2 + distt) + i=1 t Btwtw t,i i=1 t B w ,t,iw t t 2 c αL2 ( t 2 + distt) + c α t 2 2 wt 2 2 + 2c t 2 wt 2η + t 2 wt η distt c αL2 t 2 + c αL2 distt + c 1000 αµ2 t 2 + 3c 10 αL η t 2 1.5 αL2 t 2 + 1.1 αL2 distt (48) MAML and ANIL Provably Learn Representations where (48) follows since η L . Therefore Gt 2 2 αL4 (2.5 t 2 2 + 3.3 t 2 + 5 4αL4 t 2 + 5 4αL4 dist2 t t+1 2 1 βαE0µ2 + 0.25βαE0µ2 + 4β2α2L4 t 2 + 5 4β2α2L4 dist2 t 1 0.5βαE0µ2 t 2 + 5 4β2α2L4 dist2 t (49) where in (49) we have used β αE3 0 180κ4 , α 1 L , and E0 1. Corollary 1 (FO-ANIL A3(t + 1)). Suppose we are in the setting of Theorem 1. If inductive hypotheses A2(t + 1) and A3(t) hold, then A3(t + 1) holds, i.e. t+1 2 1 10 (50) Proof. Note that according to equation (49), we have t+1 2 (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L4 (1 0.5βαE0µ2 ) 1 10 + 5 720E0βαµ2 (51) where equation (51) is satisfied by our choice of β αE3 0 180κ4 and α 1 L and inductive hypothesis A3(t). Lemma 7 (FO-ANIL A4(t + 1)). Suppose the conditions of Theorem 1 are satisfied and inductive hypotheses A1(t), A3(t) and A6(t) hold. Then A4(t + 1) holds, i.e. i=1 wt+1,iw t+1,i i=1 wt+1,iw t+1,i Proof. By Lemma 2 and inductive hypotheses A1(t), A3(t) and A6(t), we have i=1 wt,iw t,i αE0µ2 0.022αE0µ2 0.9αE0µ2 i=1 wt,iw t,i 100 αE0κ 1 + 1.1αL )2 1.2αL2 (52) where we have used the fact that min(1, µ2 η2 )η2 µ2 to lower bound the minimum singular value. Lemma 8 (FO-ANIL A5(t + 1)). Suppose the conditions of Theorem 1 are satisfied. If inductive hypotheses A4(t) holds, then A5(t + 1) holds, i.e. B , Bt+1 2 1 0.5βαE0µ2 B , Bt 2. (53) Proof. Note from A4(t + 1) that σmax 1 n Pn i=1 wt,iw t,i ) 1 1 αL2 1 L . Thus, since β αE3 0 180κ4 1 L σmax 1 n Pn i=1 wt,iw t,i ) 1, we have by Weyl s inequality that B , Bt+1 2 B , Bt 2 i=1 wt,iw t,i i=1 wt,iw t,i B , Bt 2 1 0.5βαE0µ2 . MAML and ANIL Provably Learn Representations C.2. Exact ANIL To study Exact ANIL, first note that the inner loop updates are identical to those for FO-MAML. However, the outer loop updates are different. Here, we have wt+1 = wt β i=1 w Ft,i(Bt, wt) Bt+1 = Bt β i=1 BFt,i(Bt, wt) where for all t, i: Ft,i(Bt, wt) := Lt,i(Bt, wt α w Lt,i(Bt, wt)) := 1 2 vt,i 2 2 (54) vt,i := Bt twt + αBt B t B w ,t,i B w ,t,i = t(Btwt B w ,t,i) (55) w Ft,i(Bt, wt) = B t tvt,i (56) BFt,i(Bt, wt) = vt,iw t t + αvt,iw ,t,i B Bt αBtwtv t,i Bt αBt B t vt,iw t + αB w ,t,iv t,i Bt (57) One can observe that for w, the outer loop gradient is the same as in the FO-ANIL case but with an extra αB t t factor. Meanwhile, the first two terms in the outer loop gradient for B compose the outer loop gradient in the FO-ANIL case, while the other three terms are new. We deal with these differences in the following lemmas. Lemma 9 (Exact ANIL A1(t + 1)). Suppose the conditions of Theorem 1 are satisfied and A2(s) and A6(s) hold for all s [t], then A1(t + 1) holds, i.e. wt+1 2 1 10 αE0 min(1, µ2 η2 )η . (58) Proof. 
Similarly to the FO-ANIL case, we can show that for any s [t], ws+1 = ws β i=1 w Fs,i(Bs, ws) = (Ik β s B s Bs s)ws + β 2 s B s B 1 n i=1 w ,s,i (59) Note that St s=0 A3(s) implies σmax(B s Bs) 1+ s 2 α for all s {0, . . . , t+1}. Let c := 1.1. Unlike in the first-order case, the coefficient of ws in (59) is the identity matrix minus a positive semi-definite matrix, so this coefficient has spectral norm at most 1 (as β is sufficiently small). So,we can bound ws+1 2 as: ws+1 2 ws 2 + cβ α s 2 2η (60) which allows us to apply Lemma 3 with ξ1,s = 0 and ξ2,s = cβ α s 2 2η for all s [t]. This results in: cβ α s 2 2η . MAML and ANIL Provably Learn Representations Next, note that s 2 (1 0.5βαE0µ2 ) s 1 2 + 5 4β2α2L4 dist2 s 1 ... r=1 (1 0.5βαE0µ2 )s 1 r( 5 4β2α2L4 dist2 r) 4L4 β2α2(1 0.5βαE0µ2 )s 1 s 1 X r=1 (1 0.5βαE0µ2 )r 5βαL4 2E0µ2 (1 0.5βαE0µ2 )s 1 (61) s=1 c25β3α1.5L8 4E2 0µ4 (1 0.5βαE0µ2 )2s 2η 14β2 αL8 E3 0µ6 η 10κ2 η (62) 10 min(1, µ2 η2 )η . (63) where (62) follows by choice of β αE2 0 40κ4 and α 1/L , and (63) follows since η L . Lemma 10 (Exact ANIL A2(t + 1)). Suppose the conditions of Theorem 1 are satisfied and A1(t), A3(t) and A5(t) hold, then A2(t + 1) holds, i.e. t+1 2 (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L4 dist2 t . (64) Proof. Let Gt := 1 n Pn i=1 BFt,i(Bt, wt) = 1 β (Bt Bt+1) again be the outer loop gradient for the representation, where BFt,i(Bt, wt) is written in (57). Note that Gt can be re-written as: Gt = t St Bt St Bt t + Nt (65) where St := αB 1 n Pn i=1 w ,t,iw ,t,i B and vt,iw t t + α t Btwtw ,t,i B Bt αBtwtv t,i Bt αBt B t vt,iw t + αB w ,t,iw t B t Bt t (66) Since Lemma 1 shows that σmin(B t St Bt) E0µ2 , Lemma 4 (with χ = 1) implies that t+1 2 (1 2βαE0µ2 ) t 2 + 2βα B t Nt 2 + β2α2 Gt 2 2 (67) MAML and ANIL Provably Learn Representations It remains to control B t Nt 2 and Gt 2. Note that i=1 B t vt,iw t t i=1 B t t Btwtw ,t,i B Bt i=1 B t Btwtv t,i Bt i=1 B t Bt B t vt,iw t i=1 B t B w ,t,iw t B t Bt t c α wt 2 t 2 2 c α wt 2 + η + 2 c α wt 2 t 2η + 2 c α wt 2 t 2 c α wt 2 + η 5 α wt 2 t 2η + 3 α wt 2 2 t 2 0.6E0µ2 t 2 (68) by inductive hypotheses A1(t) and A3(t) and the fact that min(1, µ2 η2 )η2 µ2 . Next, Gt 2 t St Bt 2 + St Bt t 2 + Nt 2 c α(2 t 2 + distt)L2 + Nt 2 c α(2 t 2 + distt)L2 + wt 2 t 2 c α wt 2 t 2 + ( t 2 + distt)η + 4c wt 2 t 2η + 2 c2 α wt 2 2 t 2 c α(2 t 2 + distt)L2 + 6 wt 2 t 2η + 3 α wt 2 2 t 2 3 αL2 t 2 + c αL2 distt = Gt 2 2 αL4 (9 t 2 2 + 7 t 2 + 5 αL4 (8 t 2 + 5 4 dist2 t) (69) Combining (67), (68) and (69) yields t+1 2 (1 2βαE0µ2 ) t 2 + 2βα B t Nt 2 + β2α Gt 2 2 (1 2βαE0µ2 + 1.2βαE0µ2 + 8β2α2L4 ) t 2 + 5 4β2α4 dist2 t (1 0.5βαE0µ2 ) t 2 + 5 4β2α4 dist2 t (70) where the last inequality follows since β αE2 0 40κ4 and α 1/L . Corollary 2 (Exact ANIL A3(t + 1)). Suppose the conditions of Theorem 1 are satisfied. If A2(t + 1) and A3(t) hold. Then A3(t + 1) holds, i.e. t+1 2 1 10 (71) Proof. Note that according to equation (70), we have t+1 2 (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L4 (1 0.5βαE0µ2 ) 1 4E0βαµ2 (72) where equation (72) is satisfied by the choice of β αE2 0 40κ4 . and inductive hypothesis A3(t). MAML and ANIL Provably Learn Representations Lemma 11 (Exact-ANIL A4(t + 1)). Suppose the conditions of Theorem 1 are satisfied and that inductive hypotheses A1(t), A3(t) and A6(t) hold. Then A4(t + 1) holds, i.e. i=1 wt+1,iw t+1,i i=1 wt+1,iw t+1,i Proof. The proof is identical to that of Lemma 7. Lemma 12 (Exact ANIL A5(t + 1)). Suppose the conditions of Theorem 1 are satisfied. 
If inductive hypothesis A4(t) holds, then A5(t + 1) holds, that is B , Bt+1 2 (1 0.5βαE0µ2 ) B , Bt 2 Proof. Note that from (57), the outer loop gradient for the (t, i)-th task can be re-written as: BFt,i(Bt, wt) = vt,iw t t + αvt,iw ,t,i B Bt αBtwtv t,i Bt αBt B t vt,iw t + αB w ,t,iv t,i Bt = Btwt,iw t,i B w ,t,iw t,i αBtwtv t,i Bt αBt B t vt,iw t + αB w ,t,iv t,i Bt Therefore, noting Gt = 1 n Pn i=1 BFt,i(Bt, wt), and using B , B = 0, we have B , Bt+1 2 = B , (Bt βGt) 2 i=1 wt,iw t,i + βα i=1 (wtv t,i Bt + B t vt,iw t ) 2 i=1 wt,iw t,i 2 + 2βα B , Bt 2 i=1 wtv t,i Bt 1 0.9βαE0µ2 + βα 3E0 100 min(1, µ2 η2 )η2 B , Bt 2(1 0.5βαE0µ2 ) where (73) follows by inductive hypotheses A1(t), A3(t), and A4(t), and A3(t) and the fact that min(1, µ2 η2 )η2 µ2 . D. MAML Infnite Samples D.1. FO-MAML We consider FO-MAML when min = mout = . In this case, the inner loop updates are: wt,i = wt α w Lt,i(Bt, wt) = (Id αB t Bt)wt + αB t ˆB w ,t,i Bt,i = Bt α BLt,i(Bt, wt) = Bt(Ik αwtw t ) + αB w ,t,iw t (74) The outer loop updates are: wt+1 = wt β i=1 w Lt,i(Bt,i, wt,i) = wt β i=1 B t,i(Bt,iwt,i B w ,t,i) Bt+1 = Bt β i=1 BLt,i(Bt,i, wt,i) = Bt β i=1 (Bt,iwt,i B w ,t,i)w t,i (75) MAML and ANIL Provably Learn Representations Now we state the main result for Exact MAML in the infinite sample case. Due to third and higher-order products of the ground-truth heads that arise in the FO-MAML and MAML updates, we require an upper bound on the maximum w ,t,i. We define the parameter Lmax as follows. Assumption 4. There exists Lmax < such that almost surely for all t [T], we have max i [n] w ,t,i 2 Lmax (76) Note that if Assumption 2 holds, we have Lmax = O( k L ). Here we prove a slightly more general version of Theorem 3 in which we allow for arbitrary finite Lmax. Note that Theorem 6 immediately implies Theorem 2 after applying Assumption 2. First we state the following assumption, then we prove the theorem. Assumption 5 (Initialization and small average ground-truth heads). The following holds almost surely: dist0 4µ 5Lmax and, for all t [T], 2 η 2E2 0µ4 L3max . (77) Theorem 6 (FO-MAML Infinite Samples). Let min = mout = and define E0 := 0.9 dist2 0. Suppose that α 1 4Lmax , β αE2 0 60κ4 , αB t Bt = Ik, w0 = 0 and Assumptions 1, 4 and 5 hold. Then FO-MAML satisfies that for all T Z+, dist(BT , B ) (1 0.5βαE0µ2 )T 1. (78) Proof. The proof follows by showing that the following inductive hypotheses hold for all t [T]: 1. A1(t) := { wt 2 E2 0 10 αµ κ 3 ,max} 2. A2(t) := { t 2 E0 3. A3(t) := { B , Bt 2 (1 0.5βαE0µ2 ) B , Bt 1 2} 4. A4(t) := {distt 10 3 (1 0.5βαE0µ2 )t 1 dist0} 5. A5(t) := {distt (1 0.5βαE0µ2 )t 1} These conditions hold for iteration t = 0 due to the choice of initialization. Now, assuming they hold for arbitrary t, we will show they hold at t + 1. 1. A1(t) A2(t) A4(t) = A1(t + 1). This is Lemma 13. 2. A1(t) A2(t) A4(t) = A2(t + 1). This is Lemma 14. 3. A1(t) A2(t) A4(t) = A3(t + 1). This is Lemma 15. 4. A2(t + 1) Tt+1 s=1 A3(s) = A4(t + 1) A5(t + 1). Note that A2(t + 1) Tt+1 s=1 A3(s) implies 1 t+1 2 α distt+1 = 1 t+1 2 α B , Bt+1 2 σmin(Bt+1) B , Bt+1 2 1 0.5βαE0µ2 t B , B0 2 1 α 1 0.5βαE0µ2 t B , B0 2 = 1 α 1 0.5βαE0µ2 t dist0 . MAML and ANIL Provably Learn Representations Dividing both sides by 1 t+1 2 α and using the facts that dist0 3 10 and t+1 2 1 10 yields 1 t+1 2 1 0.5βαE0µ2 t dist0 10 3 1 0.5βαE0µ2 t dist0 1 0.5βαE0µ2 t , (79) as desired. Lemma 13 (FO-MAML A1(t + 1). Suppose the conditions of Theorem 6 are satisfied and A1(t),A2(t) and A4(t) hold. Then A1(t + 1) holds, i.e. 
wt+1 2 E2 0 10 αµ κ 3 ,max. (80) Proof. Let Gt,i be the inner loop gradient for the representation for the (t, i)-th task, in particular Gt,i = Btwtw t B w ,t,iw t . By expanding the outer loop update for the head, we obtain: i=1 (Ik βB t,i Bt,i(I αB t Bt))wt + β 1 i=1 B t,i(I αBt,i B t )B w ,t,i i=1 (Ik βB t,i Bt,i(I αB t Bt))wt + βα2 1 i=1 B t Gt,i B t B w ,t,i i=1 B t,i(I αBt B t )B w ,t,i βα3 1 i=1 G t,i Gt,i B t B w ,t,i i=1 (Ik βB t,i Bt,i(I αB t Bt))wt i=1 B t (Btwtw t B w ,t,iw t )B t B w ,t,i i=1 B t,i(I αBt B t )B w ,t,i βα3 1 i=1 G t,i Gt,i B t B w ,t,i i=1 (Ik βB t,i Bt,i(I αB t Bt))wt βα2B t B i=1 w ,t,iw ,t,i + βα2B t Btwtw t B t B i=1 B t,i(I αBt B t )B w ,t,i βα3 1 i=1 G t,i Gt,i B t B w ,t,i Ik βα2B t B i=1 w ,t,iw ,t,i wt + Nt (81) where Nt := β 1 n Pn i=1 B t,i Bt,i twt + βα2B t Btwtw t B t B 1 n Pn i=1 w ,t,i + β 1 n Pn i=1 B t,i t B w ,t,i βα3 1 n Pn i=1 G t,i Gt,i B t B w ,t,i. Since σmin(B t B 1 n Pn i=1 w ,t,iw ,t,i B Bt) 1 αE0µ2 by Lemma 1, and β 1 2αL2 , we have Ik βα2B t B i=1 w ,t,iw ,t,i 2 wt 2 + Nt 2 (1 βαE0µ2 ) wt 2 + Nt 2 (82) MAML and ANIL Provably Learn Representations The remainder of the proof deals with bounding Nt 2. First note that St s=0 A2(s) with α 1/(4Lmax) implies σmax(B s Bs) 1+ s 2 α for all s {0, . . . , t+1}. In turn, this means that α1.5 Bs 3 2 1.1 Let c := 1.1. We consider each of the four terms in Nt separately. Using α Bt 2, α1.5 Bt 3 2 c and the Cauchy-Schwarz and triangle inequalities, we have i=1 B t,i Bt,i 2 β Bt 2 2+2α Bt 2 i=1 w ,t,iw ,t,i 2 wt 2 2 α + 2c α wt 2η + α2L2 wt 2 2) t 2 βα2 B t Btwtw t B t B 1 n 2 cβ αη wt 2 2 i=1 B t,i t B w ,t,i 2 cβ α t 2η + βαL2 max wt 2 t 2 + βαL2 max wt 2 dist2 t (83) i=1 G t,i Gt,i B t B w ,t,i 2 cβα2.5 c wt 2 2 α η + 2c wt 2L2 α + L3 max wt 2 2. (84) Note that the dist2 t in (83) is due to the fact that B (Ik ˆBt ˆB t )B 2 = B (Ik ˆBt ˆB t )(Ik ˆBt ˆB t )B 2 dist2 t. Combining these bounds and applying inductive hypotheses A2(t) and A3(t) yields i=1 B t,i Bt,i 2 + βα2 B t Btwtw t B t B 1 n i=1 B t,i t B w ,t,i 2 + βα3 1 n i=1 G t,i Gt,i B t B w ,t,i α + 2c α wt 2η + α2L2 wt 2 2) t 2 wt 2 + cβ αη wt 2 2 + cβ α t 2η + βαL2 max wt 2 t 2 + βαL2 max wt 2 dist2 t +cβα2.5 c wt 2 2 α η + 2c wt 2L2 α + L3 max wt 2 2 2c 100βα1.5µ3 κ 3 ,max E3 0 + 2c 10βα1.5µ2 η E0 + 1 10βα1.5µ3 κ ,max E2 0 dist2 t Thus we have wt+1 2 1 βαE0µ2 wt 2 + 2c 100βα1.5µ3 κ 3 ,max E3 0 + 2c 10βα1.5µ2 η E0 + 1 10βα1.5µ3 κ ,max E2 0 dist2 t 1 10E2 0 αµ κ 1 ,max 1 10βα1.5E3 0µ3 κ 1 ,max + 2c 100βα1.5µ3 κ 3 ,max E3 0 + 2c 10βα1.5µ2 η E0 10βα1.5µ3 κ ,max E2 0 dist2 0 1 10E2 0 αµ κ 1 ,max (85) where (85) follows by Assumption 5, namely: η 2E2 0µ4 L3max and dist0 4µ 5Lmax . (86) Lemma 14 (FO-MAML A2(t + 1)). Suppose the conditions of Theorem 6 are satisfied and A1(t),A2(t) and A4(t) hold. Then A2(t + 1) holds almost surely, i.e. 10 α2µ2 . (87) MAML and ANIL Provably Learn Representations Proof. We will employ Lemma (4), which requires writing the outer loop gradient for the representation, i.e. Gt := 1 β (Bt Bt+1), as Gt = t St Bt χBt St t + Nt, for some positive definite matrix St and a matrix Nt (note that this Nt is different from the Nt from that was used in the previous lemma). 
To this end, we expand the outer loop gradient: i=1 Bt,iwt,iw t,i B w ,t,iw t,i i=1 (Bt,i(Ik αB t Bt)wt (Id αBt,i B t )B w ,t,i)w t,i i=1 Bt,i twtw t,i t B w ,t,iw t,i + α2Btwtw t B t B w ,t,iw t,i α2B w ,t,iw t B t B w ,t,iw t,i i=1 w ,t,iw ,t,i Bt,i twtw t,i t B w ,t,iw t t + α2Btwtw t B t B w ,t,iw t,i α2B w ,t,iw t B t B w ,t,iw t,i = t St Bt + Nt (88) where St := B α 1 n Pn i=1 w ,t,iw ,t,i B , Bt,i twtw t,i t B w ,t,iw t t + α2Btwtw t B t B w ,t,iw t,i α2B w ,t,iw t B t B w ,t,iw t,i , (89) and χ = 0. Since σmin(B t St Bt) E0µ2 (by Lemma 1), Lemma 4 shows t+1 2 (1 βαE0µ2 ) t 2 + 2βα B t Nt 2 + β2α Gt 2 2 (90) So, the remainder of the proof is to bound B t Nt 2 and Gt 2 2. First we deal with B t Nt 2. We have i=1 B t Bt,i twtw t,i i=1 B t t B w ,t,iw t t i=1 α2B t Btwtw t B t B w ,t,iw t,i i=1 α2B t B w ,t,iw t B t B w ,t,iw t,i MAML and ANIL Provably Learn Representations We consider each of the four terms in (91) separately. 1 n i=1 B t Bt,i twtw t,i 2 B t BtΛt twtwt t 2 i=1 B t BtΛt twtw ,t,i B Bt i=1 B t B w ,t,iw t twtwt t i=1 B t B w ,t,iw t twtw ,t,i B Bt α t 2 2 wt 2 2 + c α t 2 wt 2η + c α t 2 2 wt 3 2η + cα t 2 wt 2 2L2 1 n i=1 B t t B w ,t,iw t t 2 c α t 2 2 wt 2η i=1 α2B t Btwtw t B t B w ,t,iw t,i 2 c α wt 2 2( t 2 wt 2η + αL2 ) i=1 α2B t B w ,t,iw t B t B w ,t,iw t,i 2 cα wt 2( t 2 wt 2L2 + αL3 max) Therefore, after applying inductive hypotheses A1(t) and A2(t), we obtain B t Nt 2 2c E2 0 10 α2µ4 . Next we bound Gt 2 2. Note that Gt 2 t St Bt 2 + Nt 2, and t St Bt 2 c αL2 ( t 2 + distt) c αL2 (α2µ2 E0 + distt). i=1 Bt,i twtw t,i i=1 t B w ,t,iw t t i=1 α2Btwtw t B t B w ,t,iw t,i i=1 α2B w ,t,iw t B t B w ,t,iw t,i Gt 2 2 c αL2 (α2µ2 + distt) + 3c E0 10 α2.5µ4 2 3c2α5L4 µ4 + 2c2αL4 dist2 t 3αL4 (92) which means that t+1 2 (1 βαE0µ2 ) t 2 + 2c E2 0 10 βα3µ4 + 3β2α2L4 1 10α2E0µ2 1 10βα3E2 0µ4 + 2c E2 0 10 βα3µ4 + 3β2α2L4 1 10α2E0µ2 (93) where (93) follows by choice of β αE2 0 60κ4 . MAML and ANIL Provably Learn Representations Lemma 15 (FO-MAML A3(t + 1)). Suppose the conditions of Theorem 6 are satisfied and A1(t), A2(t), and A4(t) hold. Then A3(t + 1) holds almost surely, i.e. B , Bt+1 2 (1 0.5βαE0µ2 ) B , Bt 2. Proof. Recalling the definition of Bt+1 from (75) and noting that B , B = 0, we obtain B , Bt+1 = B , Bt Ik β(Ik αwtw t ) 1 i=1 wt,iw t,i Next, using the triangle and Cauchy-Schwarz inequalities, we obtain Ik β Ik αwtw t 1 i=1 wt,iw t,i i=1 wt,iw t,i i=1 wt,iw t,i i=1 wt,iw t,i + βα wt 2 2 i=1 wt,iw t,i 2 1 β αE0µ2 cη α wt 2 t 2 (95) + cβα wt 2 2 wt 2 t 2 2 + η α wt 2 t 2 + αL2 1 βαE0µ2 + 2 E2 0 100βα3µ3 η κ 3 max, + c E4 0 100βα3µ2 L2 κ 6 ,max 1 0.5βαE0µ2 (96) where (95) follows by the diversity of the inner loop-updated heads (Lemma 2) and (96) follows from α 1/(4Lmax). D.2. Exact MAML The first step in the analysis is to compute the second-order outer loop updates for Exact MAML. To do so, we must compute the loss on task i at iteration t after one step of gradient descent for both the representation and head. Let Λt := Ik αwtw t , t := Ik αB t Bt, and t := Id αBt B t . Note that Ft,i(Bt, wt) := Lt,i(Bt α BLt,i(Bt, wt), wt α w Lt,i(Bt, wt)) = 1 2 vt,i 2 2 (97) vt,i = BtΛt twt + αBtΛt B t B w ,t,i + αB w ,t,iw t twt + α2B w ,t,iw t B t B w ,t,i B w ,t,i = t(Btwt B w ,t,i) α(Btwt B w ,t,i)w t twt α2Btwtw t B t B w ,t,i + α2B w ,t,iw ,t,i B Btwt = ( t (αωt + α2at,i)Id)(Btwt B w ,t,i) (98) where at,i := w ,t,i B Btwt t, i and ωt := w t twt t. 
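As a numerical sanity check on the algebra leading to (98), the snippet below verifies that the residual obtained by explicitly adapting both layers with one inner gradient step (the inner updates in (74)) coincides with the closed form $\big((I_d - \alpha B_t B_t^\top) - (\alpha \omega_t + \alpha^2 a_{t,i}) I_d\big)(B_t w_t - B_* w_{*,t,i})$. The dimensions and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, alpha = 12, 3, 0.05
B_star, _ = np.linalg.qr(rng.normal(size=(d, k)))
B = rng.normal(size=(d, k)) / np.sqrt(d)
w = rng.normal(size=k)
w_star = rng.normal(size=k)

r = B @ w - B_star @ w_star                    # pre-adaptation residual B_t w_t - B_* w_{*,t,i}

# (i) Residual after adapting both layers with one inner gradient step, as in (74).
w_adapted = w - alpha * B.T @ r
B_adapted = B - alpha * np.outer(r, w)
v_direct = B_adapted @ w_adapted - B_star @ w_star

# (ii) Closed form from (98): v = ((I_d - alpha B B^T) - (alpha*omega + alpha^2*a) I_d) r.
omega = w @ (np.eye(k) - alpha * B.T @ B) @ w  # w_t^T (I_k - alpha B_t^T B_t) w_t
a = w_star @ B_star.T @ B @ w                  # w_{*,t,i}^T B_*^T B_t w_t
v_closed = (np.eye(d) - alpha * B @ B.T) @ r - (alpha * omega + alpha**2 * a) * r

assert np.allclose(v_direct, v_closed)
```

The scalars $\omega_t$ and $a_{t,i}$ rescale the pre-adaptation residual, and both reappear throughout the bounds in Lemmas 16–19.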
The outer loop updates for Exact MAML are given by: wt+1 = wt β i=1 w Ft,i(Bt, wt) Bt+1 = Bt β i=1 BFt,i(Bt, wt) Again, we prove a more general version of Theorem 2 in which we allow for general Lmax. First we make the following assumption. Assumption 6 (Exact MAML Initialization). The distance of the initial representation to the ground-truth representation satisfies: dist0 1 17κ 1.5 ,max. (99) MAML and ANIL Provably Learn Representations Theorem 7 (Exact MAML Infinite Samples). Let min = mout = and define E0 := 0.9 dist2 0. Suppose that α E1/4 0 κ3/4 (L /Lmax)1/4 4Lmax T 1/4 and β E0α 10κ4 , αB 0 B0 = Ik, w0 = 0 and Assumptions 1, 6, and 4 hold. Then Exact MAML satisfies dist T (1 0.5βαE0µ2 )T 1 (100) Proof. The proof follows by showing that the following inductive hypotheses hold for all t [T]: 1. A1(t) := { wt 2 wt 1 2 + 16βα3.5L5 maxt + 3βα1.5L3 max dist2 t} 2. A2(t) := { wt 2 E0 3. A3(t) := t 2 α2L2 max 4. A4(t) := { B , Bt 2 (1 0.5βαE0µ2 ) B , Bt 1 2} 5. A5(t) := {distt 10 3 (1 0.5βαE0µ2 )t 1 dist0} 6. A6(t) := {distt (1 0.5βαE0µ2 )t 1} These conditions hold for iteration t = 0 due to the choice of initialization. Now, assuming they hold for arbitrary t, we will show they hold at t + 1. 1. A2(t) A3(t) = A1(t + 1). This is Lemma 16. 2. Tt s=1{A1(s) A5(s)} = A2(t + 1). This is Lemma 17. 3. A2(t) A3(t) A5(t) = A3(t + 1). This is Lemma 18. 4. A2(t) A3(t) A5(t) = A4(t + 1). This is Lemma 19. 5. A3(t+1) Tt+1 s=1 A4(s) = A5(t+1) A6(t+1). Note that A3(t+1) Tt+1 s=1 A4(s) and α 1/(4Lmax) implies 1 0.1 α distt+1 = 1 0.1 α B , ˆBt+1 2 σmin(Bt+1) B , ˆBt+1 2 1 0.5βαE0µ2 t B , B0 2 1 α 1 0.5βαE0µ2 t B , ˆB0 2 (101) = 1 α 1 0.5βαE0µ2 t dist0, where (101) follows due to initialization B0 2 = 1 α. This implies 10 3 1 0.5βαE0µ2 t dist0 1 0.5βαE0µ2 t (102) since α 1/(4Lmax) and dist0 3 10 by Assumption 6. Next, we complete the proof of Theorem 7 by proving the following lemmas. Lemma 16 (Exact MAML A1(t)). Suppose Assumptions 1 and 6 hold, and A2(t) and A3(t) hold. Then wt+1 2 wt 2 + 16α3.5L5 max + 3α1.5L3 max dist2 t (103) MAML and ANIL Provably Learn Representations Proof. Using (98) and the chain rule (while noting that at,i is a function of wt), we find that for all i [n], the gradient of Ft,i(Bt, wt) with respect to wt is: w Ft,i(Bt, wt) = (Bt t αωt Bt α2at,i Bt) (Bt t αωt Bt α2at,i Bt)wt B t ( t αωt Id α2at,i Id)2B w ,t,i 2α twt(Btwt B w ,t,i) vt,i α2B t B w ,t,i(Btwt B w ,t,i) vt,i = (Bt t αωt Bt α2at,i Bt) (Bt t αωt Bt α2at,i Bt)wt + Nt,i (104) where Nt,i := B t ( t αωt Id α2at,i Id)2B w ,t,i 2α twt(Btwt B w ,t,i) vt,i α2B t B w ,t,i(Btwt B w ,t,i) vt,i. Thus, wt+1 = wt β i=1 w Ft,i(Bt, wt) i=1 (Bt t αωt Bt α2at,i Bt) (Bt t αωt Bt α2at,i Bt) i=1 Nt,i (105) which implies that i=1 (Bt t αωt Bt α2at,i Bt) (Bt t αωt Bt α2at,i Bt) where (106) follows since 1 n Pn i=1(Bt t αωt Bt α2at,i Bt) (Bt t αωt Bt α2at,i Bt) is PSD and β is sufficiently small. Next, we upper bound 1 n Pn i=1 Nt,i 2, and to do so, we first use the triangle inequality to write 1 n i=1 B t ( t αωt Id α2at,i Id)2B w ,t,i i=1 α twt(Btwt B w ,t,i) vt,i i=1 α2B t B w ,t,i(Btwt B w ,t,i) vt,i We will bound each of the three terms above shortly. First, note that A4(t) and α 1/(4Lmax) implies α 1 α2L2 max α σ2 min(Bt) σ2 max(Bt) 1 + α2L2 max α 17/16 In turn, this implies that Bt 3 2 1.1. Let c := 1.1. 
Also, note that B t t = t B t and B t B 2 = B (Id αBt B t )B 2 B (Id ˆBt ˆB t )B 2 + B (ˆBt ˆB t αBt B t )B 2 B (Id ˆBt ˆB t ) (Id ˆBt ˆB t )B 2 + ˆBt ˆB t αBt B t 2 = dist2 t + ˆBt(Ik αRt R t )ˆB t 2 dist2 t + t 2 (109) MAML and ANIL Provably Learn Representations where Rt Rk k is the upper triangular matrix resulting from the QR decomposition of Bt. We will use these observations along with inductive hypotheses A2(t) and A3(t) and the Cauchy-Schwarz and triangle inequalities to separately bound each of the terms from (107) as follows. Let c2 := E0/20. Then we have: i=1 B t ( t (αωt + α2at,i)Id)2B w ,t,i i=1 2 t B t B w ,t,i i=1 αωt t B t B w ,t,i i=1 α2at,i t B t B w ,t,i i=1 α3ωtat,i B t B w ,t,i i=1 α2ω2 t B t B w ,t,i i=1 α4a2 t,i B t B w ,t,i 2 cα3.5L4 maxη + 2cα4.5L4 maxη wt 2 2 + 2cα3L2 max L2 wt 2 + 2cα4L2 max L2 wt 3 2 + cα3.5L4 maxη wt 4 2 + cα2.5L3 max wt 2 2 cα3.5L4 maxη + 2cc2 2α5.5L4 maxη µ2 κ 2 ,max + 2cc2α3.5L2 max L2 µ κ 1 ,max + 2cc3 2α5.5L2 max L2 µ3 κ 3 ,max + cc4 2α5.5L4 maxη µ4 κ 4 ,max + cc2 2α3.5L3 maxµ2 κ 2 ,max 4cα3.5L4 max(η + µ ) (110) i=1 α twt(Btwt B w ,t,i) vt,i i=1 α twtw t B t ( t αωt Id α2at,i Id)(Btwt B w ,t,i) i=1 α twtw ,t,i B ( t αωt Id α2at,i Id)(Btwt B w ,t,i) 2 cα4L4 max wt 3 2 + 2cα4.5L4 maxη wt 2 2 + cα5L4 max wt 5 2 + 2cα5.5L4 maxη wt 4 2 + cα3.5L2 maxη wt 4 2 + 2cα4L2 max L2 wt 3 2 + cα5L6 max wt 3 2 + cα4.5L5 max wt 2 2 + cα5L6 max wt 2 + cα3L4 max dist2 t wt 2 cc3 2α5.5L4 maxµ3 κ 3 ,max + 2cc2 2α5.5L4 maxη µ2 κ 2 ,max + cc5 2α7.5L4 maxµ5 κ 5 ,max + 2cc4 2α7.5L4 maxη µ4 κ 4 ,max + cc4 2α5.5L2 maxη µ4 κ 4 ,max + 2cc3 2α5.5L2 max L2 µ3 κ 3 ,max + cc3 2α7.5L6 maxµ3 κ 3 ,max + cc2 2α5.5L5 maxµ2 κ 2 ,max + cc2α5.5L6 maxµ κ 1 ,max + cc2α3.5L4 maxµ κ 1 ,max dist2 t 9cc2α5.5L5 maxµ2 + cc2α3.5L3 maxµ2 dist2 t (111) MAML and ANIL Provably Learn Representations i=1 α2B t B w ,t,i(Btwt B w ,t,i) vt,i i=1 α2B t B w ,t,iw t B t ( t αωt Id α2at,i Id)(Btwt B w ,t,i) i=1 α2B t B w ,t,iw ,t,i B ( t αωt Id α2at,i Id)(Btwt B w ,t,i) 2 cα2.5L2 maxη wt 2 2 + cα3L2 max L2 wt 2 + cα3.5L2 maxη wt 4 2 + cα2.5L3 max wt 2 2 + cα4L2 max L2 wt 3 2 + cα2L2 wt 3 2 + cα3L2 max L2 wt 2 + cα2.5L3 max wt 2 2 + cα3.5L2 max L3 max + cα1.5L3 max dist2 t +cα3L4 max wt 2 + cα4.5L5 max wt 2 2 + cα4L2 max L2 η wt 3 2 cc2 2α3.5L2 maxη µ2 κ 2 ,max + cc2α3.5L2 max L2 µ κ 1 ,max + cc4 2α5.5L2 maxη µ4 κ 4 ,max + cc2 2α3.5L3 maxµ2 κ 2 ,max + cc3 2α5.5L2 max L2 µ3 κ 3 ,max + cc3 2α3.5L2 µ3 κ 3 ,max + cc2α3.5L2 max L2 µ κ 1 ,max + cc2 2α3.5L3 maxµ2 κ 2 ,max + cα3.5L2 max L3 max + cα1.5L3 max dist2 t +cc2α3.5L4 maxµ κ 1 ,max + cc2 2α5.5L5 maxµ2 κ 2 ,max + cα5.5L2 max L2 η µ3 κ 3 ,max 9cα3.5L5 max + cα1.5L3 max dist2 t (112) where we have used c2 = E0/20 and α 1/(4Lmax) to reduce terms. Next, combining the above bounds with (107) yields: 1 n 2 14cα3.5L5 max + 2cα1.5L3 max dist2 t (113) Applying c = 1.1 yields the result. Lemma 17 (Exact MAML A2(t + 1)). Suppose the conditions of Theorem 7 are satisfied, and Tt s=1{A1(s) A5(s)}. Then 20 αµ . (114) Proof. By inductive hypotheses A1(1), . . . , A1(t), we have ws+1 2 ws 2 + 16βα3.5L5 max + 3βα1.5L3 max dist2 s for all s [t], so we can invoke Lemma 3 with ξ1,s = 0 s [t] and ξ2,s = 16βα3.5L5 max + 3βα1.5L3 max dist2 s. This results in s=1 16βα3.5L5 max + 3βα1.5L3 max dist2 s (115) Next, we invoke inductive hypotheses A6(1), . . . , A6(t) to obtain dist2 s 10 9 (1 0.5βαE0µ2 )2(s 1) dist2 0 for all s [t]. 
Therefore s=1 16βα3.5L5 max + 3βα1.5L3 maxγ2(s 1) 16βα3.5L5 maxt + 3βα1.5L3 max 9 (1 0.5βαE0µ2 )2(s 1) dist2 0 16βα3.5L5 maxt + 10 3 βα1.5L3 max dist2 0 0.5βαE0µ2 (116) 16βα3.5L5 maxt + 20 3 αL3 max dist2 0 E0µ2 E0 20 αµ (117) MAML and ANIL Provably Learn Representations where (116) is due to the sum of a geometric series and (117) follows since β α 10κ4 E0µ 640α3L5max T (as α is sufficiently small) and the initial representation satisfies 3 αL3 max dist2 0 E0µ2 E0 3 L3 max + 2µ3 )dist2 0 + µ3 dist4 0 which is implied by dist0 1 17κ 1.5 ,max (118) Lemma 18 (Exact MAML A3(t + 1)). Suppose the conditions of Theorem 7 are satisfied and A2(t), A3(t) and A5(t) hold. Then A3(t + 1) holds, namely t+1 2 α2L2 max (119) Proof. According to Lemma 4, we can control t+1 by controlling Gt, recalling that Gt = 1 β (Bt Bt+1) Rd k is the outer loop gradient with respect to the representation at time t. Before studying Gt, we must compute the outer loop gradient with respect to the representation for task i. Again we use the fact that Ft,i(Bt, wt) = 1 2 vt,i 2 2 and apply the chain rule to obtain: BFt,i(Bt, wt) = vt,iw t tΛt αBt(wtv t,i BtΛt + Λt B t vt,iw t ) + α(vt,iw ,t,i B + B w ,t,iv t,i)BtΛt 2α2(w ,t,i B vt,i)Btwtw t + α2B w ,t,iv t,i B w ,t,iw t (120) Note that Gt = 1 n Pn i=1 BFt,i(Bt, wt). We aim to write Gt as Gt = t St Bt St Bt t + Nt for some positive definite S so that we can apply Lemma 4. It turns out that of the five terms in (120), the only one with sub -terms that contribute to St is the third term. To see this, note that α(vt,iw ,t,i B + B w ,t,iv t,i)BtΛt = α( t B w ,t,iw ,t,i B Bt + B w ,t,iw ,t,i B Bt t) + α((vt,i + t B w ,t,i)w ,t,i B + B w ,t,i(vt,i + w ,t,i B t))BtΛt + α2 t B w ,t,iw ,t,i B Bt + B w ,t,iw ,t,i B Bt t)wtw t = ( t St Bt + St Bt t) + α((vt,i + t B w ,t,i)w ,t,i B + B w ,t,i(vt,i + w ,t,i B t))BtΛt + α2 t B w ,t,iw ,t,i B Bt + B w ,t,iw ,t,i B Bt t)wtw t . where St := α 1 n Pn i=1 B w ,t,iw ,t,i B . Thus we can write Gt = t St Bt St Bt t + Nt (121) i=1 vt,iw t tΛt 1 i=1 α2Btwtw ,t,i B vt,iw t + 1 i=1 α2B w ,t,iw ,t,i B vt,iw t i=1 αBt(wtv t,i BtΛt + Λt B t vt,iw t ) + 1 i=1 α(vt,i + t B w ,t,i)w ,t,i B BtΛt i=1 α t B w ,t,iw ,t,i B Btαwtw t (122) MAML and ANIL Provably Learn Representations Note that t 2 1 10 due to A3(t) and choice of α 1 4Lmax . Therefore, Lemma 1 implies that σmin(B t St Bt) E0µ2 where E0 = 1 1 10 dist2 0. Thus by Lemma 4 with χ = 1, we have t+1 2 t 2(1 2βαE0µ2 ) + 2βα B t Nt 2 + β2α Gt 2 2 (123) So it remains to control B t Nt 2 and Gt 2 2. First we deal with B t Nt 2 by upper bounding the norm of B t times each of the six terms in (122). As before, we use c = 1.1 as an absolute constant that satisfies σ3 max(Bt) c/α1.5. 
We have i=1 B t vt,iw t tΛt i=1 B t ( t (αωt + α2at,i)Id)(Btwt B w ,t,i)w t tΛt i=1 t B t (Btwt B w ,t,i)w t tΛt i=1 (αωt + α2at,i)B t (Btwt B w ,t,i)w t tΛt α t 2 2 wt 2 2 + c α t 2 2 wt 2η + c t 2 2 wt 4 2 + c α t 2 wt 3 2η + c α t 2 2 wt 3 2η + cα t 2 wt 2 2L2 i=1 α2B t Btwtw ,t,i B vt,iw t i=1 αwtw ,t,i B ( t (αωt + α2at,i)Id) (Btwt B w ,t,i)w t 2 c α t 2 wt 3 2η + cα( t 2 + dist2 t) wt 2 2L2 max + cα1.5 t 2 wt 5 2η + cα2 t 2 w 4 2L2 max + cα2 wt 4 2L2 + cα2.5 wt 3 2L3 max 1 n i=1 α2B t B w ,t,iw ,t,i B vt,iw t 2 cα t 2 wt 2 2L2 + cα1.5( t 2 + dist2 t) wt 2L3 max + cα2 t 2 wt 4 2L2 + cα2.5 t 2 w 3 2L3 max + cα2.5 wt 3 2L3 max + cα3 wt 2 2L4 max MAML and ANIL Provably Learn Representations 1 n i=1 αB t Bt(wtv t,i BtΛt + Λt B t vt,iw t ) i=1 αB t Btwtv t,i BtΛt i=1 wt(Btwt B w ,t,i) ( t (αωt + α2at,i)Id)BtΛt α t wt 2 2 + 2c α t 2 wt 2η + 2c t 2 wt 4 2 + 2c α t 2 wt 3 2η + 2c α wt 3 2η + 2cα wt 2 2L2 1 n i=1 αB t (vt,i + t B w ,t,i)w ,t,i B BtΛt 2 c α t 2 wt 2η + c α t 2 wt 3 2η + cα t 2 wt 2 2L2 + cα wt 2 2L2 + cα3/2 wt 2L3 max 1 n i=1 αB t t B w ,t,iw ,t,i B Btαwtw t 2 α t 2 wt 2 2L2 Let c2 := E0/20. We can combine the above bounds and use inductive hypotheses A2(t) and A3(t) to obtain the following bound on B t Nt 2: B t Nt 2 2cc2 2α4L4 µ2 κ 2 ,max + cc2α4L4 η µ κ 1 ,max + 4cc4 2α6L4µ4 κ 4 ,max + 4cc3 2α4L2 maxη µ3 κ 3 ,max + cc3 2α6L4 η µ3 κ 3 ,max + 3cc2 2α4L2 L3 maxµ κ 1 ,max + 3cc2 2α2L3 maxµ κ 1 ,max dist2 t + cc5 2α6L2 η µ5 κ 5 ,max + 2cc3 2α4µ3 L3 maxκ 3 ,max + 2cc3 2α6µ3 L3 max L2 κ 3 ,max + cc2 2α4L4 maxµ2 κ 2 ,max + 2cc2 2α2L2 maxµ2 κ 2 ,max + 3cc2α2L2 maxη µ κ 1 ,max + 2cc4 2α4L2 maxµ4 κ 4 ,max + 2cc3 2α2η µ3 κ 3 ,max + 4cc2α2L3 maxµ κ 1 ,max + 2cc2α4L2 max L2 µ2 κ 2 ,max 21cc2α4(L4 η + L3 max L2 )µ κ 1 ,max + 3cc2 2α2L3 maxµ κ 1 ,max dist2 t + 4cc2α2(L3 max + L2 max(η + µ ))µ κ 1 ,max 10cc2α2L2 maxµ2 + 3cc2 2α2L2 maxµ2 dist2 t 13cc2α2L2 maxµ2 15c2α2L2 maxµ2 (124) using that α 1/L , c = 1.1 and combining like terms. We have not optimized constants. Next we bound Gt 2 2. First, by (121) and the triangle and Cauchy-Schwarz inequalities, Gt 2 t St 2 Bt 2 + St 2 Bt 2 t 2 + Nt 2 c α(distt +2 t 2)L2 + Nt 2. (125) We have already bounded B t Nt by separately bounding B t times each of the six terms in Nt. We obtain a similar bound on Nt 2 by separately considering each of the six terms in Nt (see equation (122)). Of these terms, all but the first and last can be easily bounded by multiplying our previous bounds by α (to account for no Bt). The other two terms are more complicated because we have previously made the reduction B t t 2 = t B t 2 c α t 2, but now that there is no B t to multiply with t, we must control t via t B 2 t 2 + distt. 
Specifically, for the easy four MAML and ANIL Provably Learn Representations terms we have i=1 α2Btwtw ,t,i B vt,iw t 2 cα t 2 wt 3 2η + cα1.5( t 2 + dist2 t) wt 2 2L2 max + cα2 t 2 wt 5 2η + cα2.5 t 2 w 4 2L2 max + cα2.5 wt 4 2L2 + cα3 wt 3 2L3 max 1 n i=1 α2B w ,t,iw ,t,i B vt,iw t 2 cα1.5 t 2 wt 2 2L2 + cα2( t 2 + dist2 t) wt 2L3 max + cα2.5 t 2 wt 4 2L2 + cα3 t 2 w 3 2L3 max + cα3 wt 3 2L3 max + cα3.5 wt 2 2L4 max i=1 αBt(wtv t,i BtΛt + Λt B t vt,iw t ) 2 2c α t wt 2 2 + 2c t 2 wt 2η + 2c α t 2 wt 4 2 + 2cα t 2 wt 3 2η + 2cα wt 3 2η + 2cα1.5 wt 2 2L2 1 n i=1 α(vt,i + t B w ,t,i)w ,t,i B BtΛt 2 c t 2 wt 2η + cα t 2 wt 3 2η + cα1.5 t 2 wt 2 2L2 + cα1.5 wt 2 2L2 + cα2 wt 2L3 max and for the first and last term from (122), we have i=1 vt,iw t tΛt 2 c α t 2 2 wt 2 2 + c t 2( t 2 + distt) wt 2η + c α t 2 2 wt 4 2 + cα t 2 wt 3 2η + cα t 2 2 wt 3 2η + cα1.5 t 2 wt 2 2L2 1 n i=1 α t B w ,t,iw ,t,i B Btαwtw t 2 α1.5( t 2 + distt) wt 2 2L2 MAML and ANIL Provably Learn Representations Combining these bounds and applying inductive hypotheses A2(t) and A3(t) yields Nt 2 cc3 2α4.5L2 maxη µ3 κ 3 ,max + cc2 2α4.5L2 L2 maxµ2 κ 2 ,max + cc2 2α2.5L2 maxµ2 κ 2 ,max dist2 t + cc5 2α6.5L2 η µ5 κ 5 ,max + cc4 2α6.5L2 max L2 maxµ4 κ 4 ,max + cc2α4.5L2 µ4 κ 4 ,max + cc3 2α4.5L3 maxµ3 κ 3 ,max + cc2 2α4.5L2 L2 maxµ2 κ 2 ,max + cc2α4.5L3 max L2 maxµ κ 1 ,max + cc2α2.5L3 maxµ κ 1 ,max dist2 t + cc4 2α6.5L2 L2 maxµ4 κ 4 ,max + cc3 2α6.5L3 max L2 maxµ3 κ 3 ,max + cc3 2α4.5L3 maxµ3 κ 3 ,max + cc2 2α4.5L4 maxµ2 κ 2 ,max + 2cc2 2α2.5L2 µ2 κ 2 ,max + 2cc2α2.5L2 η µ κ 1 ,max + 2cc4 2α4.5L2 maxµ4 κ 4 ,max + 2cc3 2α4.5L2 maxη µ3 κ 3 ,max + 2cc3 2α2.5η µ3 κ 3 ,max + 2cc2 2α2.5L2 µ2 κ 2 ,max + cc2α2.5L2 maxµ2 κ 2 ,max + cc3 2α4.5L2 maxη µ3 κ 3 ,max + cc2 2α4.5L2 max L2 µ2 κ 2 ,max + cc2 2α2.5L2 µ2 κ 2 ,max + cc2α2.5L3 maxµ κ 1 ,max + cc2 2α4.5L4µ2 κ 2 ,max + cc2α4.5L4η µ κ 1 ,max + cc2α2.5L2 maxη µ κ 1 ,max dist2 t +cc4 2α4.5L4µ4 κ 4 ,max + cc3 2α4.5L2 maxη µ3 κ 3 ,max + cc3 2α6.5L4η µ3 κ 3 ,max + cc2 2α4.5L2 max L2 µ2 κ 2 ,max + cc2 2α4.5L2 max L2 µ2 κ 2 ,max + cc2 2α2.5L2 µ2 κ 2 ,max distt cc2 2α2.5L2 maxµ2 κ 2 ,max + 5cc4 2α6.5L2 maxµ6 + 19cc2α4.5L2 max L2 maxµ2 + cc2α2.5L3 maxµ κ 1 ,max dist2 t + 2cc2 2α2.5L2 µ2 κ 2 ,max + 2cc2α2.5L2 η µ κ 1 ,max + 2cc3 2α2.5η µ3 κ 3 ,max + 2cc2 2α2.5L2 µ2 κ 2 ,max + cc2α2.5L2 maxµ2 κ 2 ,max + cc2 2α2.5L2 µ2 κ 2 ,max + cc2α2.5L3 maxµ κ 1 ,max + cc2α2.5L2 maxη µ κ 1 ,max dist2 t + cc2α2.5L2 µ2 κ 2 ,max distt 24cc2α4.5L2 max L2 maxµ2 + 12cc2α2.5L2 maxµ2 + 3cc2α2.5L2 maxµ2 distt 14cc2α2.5L2 maxµ2 + 3cc2α2.5L2 maxµ2 distt 17cc2α2.5L2 maxµ2 9 8cc2 αµ2 (126) using α 1/(4Lmax), distt 1, and again, c2 = E0/20. Thus, Gt 2 2 c α(distt +2 t 2)L2 + 9 2α L2 + c2µ2 2 (127) using c = 1.1 in (127). Returning to (123) and applying our bounds on B t Nt 2 and Gt 2 2, along with inductive hypothesis A3(t), yields t+1 2 = t 2(1 2βαE0µ2 ) + 30c2βα3L2 maxµ2 + 3β2α2L4 α2L2 max(1 2βαE0µ2 ) + 30c2βα3L2 maxµ2 + 3β2α2L4 α2L2 max 2βα3E0L2 maxµ2 + 30c2βα3L2 maxµ2 + 3β2α2L4 α2L2 max (128) where the last inequality follows due to β E0α 10κ4 αE0L2 maxµ2 6L4 and c2 = E0/20. MAML and ANIL Provably Learn Representations Lemma 19 (Exact MAML A4(t + 1)). Suppose the conditions of Theorem 7 are satisfied and A2(t), A3(t) and A5(t) hold. Then A4(t + 1) holds, i.e. B , Bt+1 2 (1 0.5βαE0µ2 ) B , Bt 2. (129) Proof. Recall from (121) that the outer loop gradient for the representation satisfies Gt = t St Bt St Bt t + Nt (130) where St := α 1 n Pn i=1 B w ,t,iw ,t,i B and Nt 2 9 8cc2 αµ2 , where c2 := E0/20. 
As a result, B , Bt+1 2 = B , (Bt β( t St Bt St Bt t + Nt)) 2 = B , (Bt + βSt Bt βαBt B t St Bt + βSt Bt βαSt Bt B t Bt βNt 2 = B , Bt(Ik βαB t St Bt) βB , Nt 2 B , Bt 2 Ik βαB t St Bt 2 + β B , Nt 2 (131) where the last equality follows because B , St = α 1 n Pn i=1 B , B w ,t,iw ,t,i B = 0. Note that due to Lemma 1 and t 2 1 10, σmin(B t St Bt) E0µ2 where E0 = 1 1 10 dist2 0. Therefore, by Weyl s inequality, Ik βαB t St Bt 2 1 βαE0µ2 . (132) Furthermore, from (126), we have B , Nt 2 Nt 2 9 4c2 αµ2 = B , Bt+1 2 B , Bt 2(1 βαE0µ2 ) + 5 4c2 αµ2 (133) Next, recall that B , Bt 2 σmin(B , Bt) q 9 10σmin(B , ˆBt)/ α = q 1 dist2 t/ α q 9 dist2 0/ α = E0/ α due to inductive hypotheses A3(t) and A4(t) and E0 := 0.9 dist2 0. Therefore, using c2 2E3/2 0 /5, we obtain 5 4c2 αµ2 0.5 αE3/2 0 µ2 0.5αE0µ2 B , Bt 2 = B , Bt+1 2 B , Bt 2(1 0.5βαE0µ2 ) (134) E. ANIL Finite Samples First we define the following notations for the finite-sample case. The inner loop update for the head of the i-th task on iteration t is given by: wt,i = wt α w ˆLi(Bt, wt, Din i ) = (Ik αB t Σin t,i Bt)wt + αB t Σin t,i B w ,t,i + α min B t (Xin t,i) zin t,i. (135) MAML and ANIL Provably Learn Representations Notation Explanation Xin t,i := [xin t,i,1, . . . , xin t,n,min] Data for inner loop gradient Σin t,i := 1 min Pmin j=1 xin t,i,j(xin t,i,j) Empirical covariance matrix for inner loop gradient zin t,i := [zt,i,1, . . . , zt,i,min] Additive noise for samples for inner loop gradient in t,i := Ik αB t Σin t,i Bt Finite-sample analogues of t in t,i := Id αBt B t Σin t,i Finite-sample analogues of t Xout t,i := [xout t,i,1, . . . , xout t,i,mout] Data for outer loop gradient Σout t,i := 1 mout Pmout j=1 xout t,i,j(xout t,i,j) Empirical covariance matrix for outer loop gradient zout t,i := [zt,i,1, . . . 
, zt,i,mout] Additive noise for samples for outer loop gradient log(n) m Local concentration parameter δm,d2 := 10 d2 nm Global concentration parameter For Exact ANIL, the finite-sample loss after the inner loop update is given by: ˆFt,i(Bt, wt; Din t,i, Dout t,i ) := ˆLt,i(Bt, wt α w ˆLt,i(Bt, wt; Din t,i); Dout t,i ) j=1 (x t,i,j Bt(wt αB t Σin t,i Btwt + αB t Σin t,i B w ,t,i α min B t (Xin t,i) zin t,i) x t,i,j B w ,t,i) zout t,i,j)2 j=1 (x t,i,j in t,i(Btwt B w ,t,i) + α min x t,i,j Bt B t (Xin t,i) zin t,i zt,i,j)2 = 1 2mout ˆvt,i 2 2 ˆvt,i := Xout t,i in t,i(Btwt B w ,t,i) + α min Xout t,i Bt B t (Xin t,i) zin t,i zout t,i Therefore, using the chain rule, the exact outer loop gradients for the i-th task are: B ˆFt,i(Bt, wt; Din t,i, Din t,i) = ( in t,i) 1 mout (Xout t,i ) ˆvt,iw t α 1 mout (Xout t,i ) ˆvt,iw t B t Σin t,i Bt + α 1 mout (Xout t,i ) ˆvt,iw ,t,i B Σin t,i Bt αΣin t,i Btwtˆv t,i 1 mout Xout t,i Bt + αΣin t,i B w ,t,iˆv t,i 1 mout Xout t,i Bt + α2 minmout (Xout t,i ) ˆvt,i(zin t,i) Xin t,i Bt + α2 minmout (Xin t,i) zin t,iˆv t,i Xout t,i Bt w ˆFt,i(Bt, wt; Din t,i, Dout t,i ) = B t ( in t,i) 1 mout (Xout t,i ) ˆvt,i α mout B t (Xout t,i ) zout t,i Meanwhile, the first-order outer loop gradients for the i-th task are B ˆLt,i(Bt, wt,i; Din t,i, Dout t,i ) = B t Σout t,i Btwt,i B t Σout t,i B w ,t,i = Σout t,i (Btwt,i B w ,t,i)w t,i α mout (Xout t,i ) zout t,i w t,i = Σout t,i (Bt( in t,iwt + αB t Σin t,i B w ,t,i) B w ,t,i)( in t,iwt + αB t Σin t,i B w ,t,i) α mout (Xout t,i ) zout t,i w t,i = Σout t,i in t,i(Btwt B w ,t,i)( in t,iwt + αB t Σin t,i B w ,t,i) α mout (Xout t,i ) zout t,i w t,i w ˆLt,i(Bt, wt,i; Din t,i, Dout t,i ) = B t Σout t,i in t,i(Btwt B w ,t,i) α mout B t (Xout t,i ) zout t,i MAML and ANIL Provably Learn Representations i=1 B ˆFt,i(Bt, wt), GB,t := 1 i=1 BFt,i(Bt, wt) i=1 w ˆFt,i(Bt, wt), Gw,t := 1 i=1 w ˆFt,i(Bt, wt) Now we are ready to state the result. Theorem 8 (ANIL Finite Samples). Suppose Assumptions 1, 2 and 3 hold. Let E0 :=0.9 dist2 0 δ for some δ (0, 1) to be defined shortly and assume E0 is a positive constant. Suppose the initialization further satisfies αB 0 B0 = Ik and w0 = 0, and let the step sizes be chosen as α c k L +σ, and β c αE2 0 κ4 for ANIL and β c αE3 0µ κ4 min 1, µ2 η2 FO-ANIL, for some absolute constant c . Then there exists a constant c > 0 such that, for ANIL, if mout c T 2 k2(L +σ)2 nη2 κ8 + c T k3κ2 (κ2 +σ2/µ2 ) n + c n + log(n))κ 2 ( σ2 L2 + k) + ck + c log(n) min c T 2(k2 + k log(n)) (L +σ)2 η2 κ8 + c T(k3 + k log(n))(κ4 + σ4 T k3d log(nmin) L2 + 1) (136) and for FO-ANIL, if mout c T dk nκ2 + c T dkσ2 n L2 κ2 + c T 2k3κ4 n + c T 2k3σ4 nµ4 + c kµ2 η2 κ6 + c kσ2 min c T(k + log(n))(kκ2 + σ2 µ2 ) + c T 2k3κ4 n + c T 2k2κ2 σ2 µ2 n + c T 2k2(L2 +σ2) η2 κ8 n then both ANIL and FO-ANIL satisfy that after T iterations, dist(BT , B ) 1 0.5βαE0µ2 T 1 + O(δ) (137) with probability at least 1 O(T exp( 90k)) T poly(n) T poly(min), where for ANIL, kκ σ/µ )(k p d log(nmin) + k log(nmin) + d log1.5(nmin) + log2(nmin)) d log(nmin) + log1.5(nmin)) + (kκ2 + and for FO-ANIL, µ + σ2 µ2 min ) dk nmout (138) Proof. The proof uses an inductive argument with the following five inductive hypotheses: 1. A1(t) := { wt 2 αE0 10 min(1, µ2 η2 )η } 2. A2(t) := { t 2 (1 0.5βαE0µ2 ) t 1 2 + 5 4α2β2L4 dist2 t 1 +βαζ2}, 3. A3(t) := { t 2 1 10}, 4. A4(t) := { B , Bt 2 1 0.5βαE0µ2 B , Bt 1 2 + β αζ4}, MAML and ANIL Provably Learn Representations 5. A5(t) := {distt 10 3 1 0.5βαE0µ2 t dist0 +δ}. 
where ζ2 is defined separately for ANIL and FO-ANIL in Lemmas 34 and 28, respectively, and ζ4 is defined separately for ANIL and FO-ANIL in Lemmas 35 and 29, respectively. These conditions hold for iteration t = 0 due to the choice of initialization (B0, w0). We will show that if they hold for all iterations up to and including iteration t for an arbitrary t, then they hold at iteration t + 1 with probability at least 1 1 poly(n) 1 poly(min) O(exp( 90k)). 1. Tt s=0{A2(s) A6(s)} = A1(t + 1). This is Lemma 27 for FO-ANIL and Lemma 33 for Exact ANIL. 2. A1(t) A3(t) A5(t) = A2(t + 1). This is Lemma 28 for FO-ANIL and Lemma 34 for Exact ANIL. 3. A1(t) A2(t+1) A3(t) A5(t) = A3(t + 1). This is Corollary 3 for FO-ANIL and Corollary 4 for Exact ANIL. 4. A1(t) A3(t) A5(t) = A4(t + 1). This is Lemma 29 for FO-ANIL and Lemma 35 for Exact ANIL. 5. A3(t+1) t+1 s=1A4(s) = A5(t + 1). By A3(t + 1) and A4(t + 1) we have: B , Bt+1 2 (1 0.5βαE0µ2 ) B , Bt 2 + β αζ4 (1 0.5βαE0µ2 )2 B , Bt 1 2 + (1 0.5βαE0µ2 )β αζ4 + β αζ4 ... (1 0.5βαE0µ2 )t B , B0 2 + β αζ4 s=0 (1 0.5βαE0µ2 ) s (1 0.5βαE0µ2 )t B , B0 2 + β αζ4 1 (1 0.5βαE0µ2 ) = (1 0.5βαE0µ2 )t B , B0 2 + 2ζ4 αE0µ2 . (139) Now we orthogonalize Bt and B0 via the QR-factorization, writing Bt = ˆBt Rt and B0 = ˆB0R0. By inductive hypothesis A3(t + 1), we have σmin(Bt+1) 0.9 α , and by the initialization we have σmax(B0) 1 α. Thus, using (139) and the definition of the principal angle distance, we have dist(Bt+1, B ) (1 0.5βαE0µ2 )t dist(B0, B ) R0 2 + 2ζ4 αE0µ2 10 3 (1 0.5βαE0µ2 )t dist(B0, B ) + 3ζ4 (1 0.5βαE0µ2 )t + δ (141) where ε = O( ζ4 After T rounds, we have that the inductive hypotheses hold on every round with probability at least (1 1 poly(n) 1 poly(min) O(exp( 90k)))T 1 O(T exp( 90k)) T poly(n) T poly(min) (142) where the inequality follows by the Weierstrass Inequality, completing the proof. Throughout the proof we will re-use c, c , c , etc. to denote absolute constants. E.1. General Concentration Lemmas We start with generic concentration results for random matrices and vectors that will be used throughout the proof. We use χE to denote the indicator random variable for the event E, i.e. χE = 1 if E holds and χE = 0 otherwise. MAML and ANIL Provably Learn Representations Lemma 20. Let X1 = [x1,1, . . . , x1,m1] Rm1 d have rows which are i.i.d. samples from a mean-zero, Id-sub-gaussian distribution, and let X1,1, . . . , X1,n be independent copies of X1. Likewise, let X2 = [x2,1, . . . , x2,m2] Rm2 d have rows which are i.i.d. samples from a mean-zero, Id-sub-gaussian distribution, and let X2,1, . . . , X2,n be independent copies of X2 (and independent of X1,1, . . . , X1,n). Define Σ1,i := 1 m1 X 1,i X1,i and Σ2,i := 1 m2 X 2,i X2,i for all i [n]. Let the elements of z1 Rm1 and z2 Rm2 be i.i.d. samples from N(0, σ2). Further, let Cℓ,i Rd d ℓ/2 for ℓ= 1, . . . , 6 be fixed matrices for i [n], and let cℓ:= maxi [n] Cℓ,i 2 for ℓ= 1, . . . , 6. Let δm,dl := c log(n) m and δm,dl := c 10 dl nm for some absolute constant c. Assume that in all cases below, each δ and δ is less than 1. 
Then the following hold: n Pn i=1 C 1,iΣ1,i C2,i C 1,i C2,i 2 c1c2 δm1,d0+d1 2e 90(d0+d1) n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 1,i C2,i C 3,i C4,i 2 σc1c2c3c4 (1 + δm2,d1+d2) δm1,d0+d2 + δm2,d0+d2 2e 90(d0+d2) + 2n 99 n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 1,i C2,i C 3,i C4,i 2 σc1c2c3c4 (1 + δm2,d1+d2) δm1,d0+d2 + δm2,d1+d2 2e 90(d0+d2) + 2n 99 n Pn i=1 C 1,i X 1,iz1,i 2 σc1 δm1,d0 2e 90d0 n Pn i=1 C 1,iΣ1,i C2,i C 3,i 1 m2 X 2,iz2,i 2 c1c2c3(1+ δm1,d0)δm2,d1 2e 90d1 + 2n 99 n Pn i=1 C 1,i 1 m1 X 1,iz1,ic 2,iΣ2,i C3,i 2 σc1c2c3(1 + δm2,d1) δm1,d0 2e 90d0 + 2n 99 n Pn i=1 C 1,i 1 m1 X 1,iz1,i 1 m2 z 2,i X2,i C2,i 2 σ2c1c2 δm1,d0δm2,d1 2e 90d0 + 2e 90d1 n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 5,iΣ2,i C6,i C 1,i C2,i C 3,i C4,i C 5,i C6,i 2 c1c2c3c4c5c6 (1 + δm2,d1+d2)(1 + δm2,d2+d3) δm1,d0+d3 + δm2,d2+d3 + (1 + δm2,d2+d3)δm2,d1+d2 2e 90(d0+d3) + 4n 99 n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 5,iΣ1,i C6,i C 1,i C2,i C 3,i C4,i C 5,i C6,i 2 c1c2c3c4c5c6 (1 + δm1,d0+d1)(1 + δm1,d2+d3) δm2,d1+d2 + δm1,d0+d3 + (1 + δm1,d2+d3)δm1,d0+d1 2e 90(d0+d3) + 4n 99 n Pn i=1 C 1,iΣ1,i C2,i C 3,i X 2,iz2,ic 4,i C 5,iΣ2,i C6,i 2 σc1c2c3c4c5c6 1 + δm1,d0+d1 (1+δm2,d3) δm2,d1 2e 90d1 + 4n 99 n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 5,i X 1,iz1,i 2 σc1c2c3c4c5c6 (1 + δm1,d0+d1) 1+ δm2,d1+d2 δm1,d2 2e 90d2 + 4n 99 n Pn i=1 C 1,i X 1,iz1,ic 2,i C 3,iΣ1,i C4,i C 5,iΣ2,i C6,i 2 σc1c2c3c4c5c6 (1 + δm2,d2+d3) (1+δm2,d1+d2) δm1,d0 2e 90d0 + 6n 99 MAML and ANIL Provably Learn Representations n Pn i=1 C 1,iΣ1,i C2,i C 3,i X 2,iz2,iz 2,i X2,i C4,i 2 σ2c1c2c3c4c5δm2,d1δm2,d2 1+ δm1,d0+d2 2e 90d1 + 2e 90d2 + 4n 99 Proof. We give the proofs for (1), (2), and (8) since the rest of the proofs follow using analogous arguments. In all cases, the proofs are standard applications of Bernstein s inequality. 1. For any fixed unit vector u Rd0, ru,i,j := u C 1,iw1,i,j is sub-gaussian with sub-gaussian norm at most c C1,i 2. Likewise, for any fixed unit vector v Rd1, rv,i,j := v C 2,ix2,i,j is sub-gaussian with norm at most c C2,i 2 for an absolute constant c. Furthermore, E[rv,i,jru,i,j] = u C 1,ix1,i,jx 1,i,j C2,iv = u C 1,i C2,iv. Therefore, i=1 C 1,iΣ1,i C2,i C 1,i C2,i j=1 (rv,i,jru,i,j E[rv,i,jru,i,j]) (143) is the sum of nm1 independent, mean-zero, sub-exponential random variables with norm O( C1,i 2 C2,i 2). By Bernstein s inequality we have j=1 (E[rv,i,jru,i,j] rv,i,jru,i,j) c max i [n] C1,i 2 C2,i 2 max d0+d1+λ nm1 , ( d0+d1+λ)2 for some absolute constant c and any λ > 0, with probability at least 1 2e λ2 over the outer loop samples. Let Sd0 1 and Sd1 1 denote the unit spheres in Rd0 and Rk, respectively. From Corollary 4.2.13 in (Vershynin, 2018), we know that there exists 1 4-nets M1 and M2 on Sd0 1 and Sd1 1 with cardinalities at most 9d0 and 9d1, respectively. Thus, conditioning on using the variational definition of the spectral norm, and taking a union bound over the 1 4-nets, we have 1 n i=1 C 1,iΣ1,i C2,i C 1,i C2,i 2 = max v Sd0 1,u Sd1 1 j=1 (E[rv,i,jru,i,j] rv,i,jru,i,j) 2 max v M1,u M2 j=1 (E[rv,i,jru,i,j] rv,i,jru,i,j) c max i [n] C1,i 2 C2,i 2 max d0+d1+λ nm1 , ( d0+d1+λ)2 for some absolute constant c, with probability at least 1 2 9d0+d1e λ2 over the outer loop samples. Choose λ = 10 d and let nm1 11 d0 + d1 to obtain that, 1 n i=1 C 1,iΣ1,i C2,i C 1,i C2,i 2 c max i [n] C1,i 2 C2,i 2 δm1,d0+d1 cc1c2 δm1,d0+d1 with probability at least 1 2e 90(d0+d1). 2. Let E := 1 n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 1,i C2,i C 3,i C4,i. 
We have i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 1,i C2,i C 3,i C4,i i=1 C 1,i(Σ1,i Id)C2,i C 3,iΣ2,i C4,i + C 1,i C2,i C 3,iΣ2,i C4,i C 1,i C2,i C 3,i C4,i i=1 C 1,i(Σ1,i Id)C2,i C 3,iΣ2,i C4,i | {z } :=E1 i=1 C 1,i C2,i C 3,i(Σ2,i Id)C4,i | {z } :=E2 MAML and ANIL Provably Learn Representations We first consider E1 2. For any i [n], we have by Theorem 4.6.1 in (Vershynin, 2018), C 3,iΣ2,i C4,i C 3,i C4,i 2 c max i [n] C3,i 2 C4,i 2δm2,d1+d2 = cc3c4δm2,d1+d2 (145) with probability at least 1 2n 100. Union bounding over all i [n] and using the triangle inequality gives P A := n {Σ2,i}i [n] : C 3,iΣ2,i C4,i 2 cc3c4(1 + δm2,d1+d2) i [n] o 1 2n 99. (146) Next, for any fixed set {Σ2,i}i [n] A, the d2-dimensional random vectors {x1,i,j C2,i C 3,iΣ2,i C4,i}i [n],j [m] are sub-gaussian with sub-gaussian norms at most c c2c3c4(1 + δm2,d1+d2). Likewise, the d0-dimensional random vectors {C 1,ixi,j }i [n],j {2,...,m} are sub-gaussian with norms at most c. Thus using the same argument as in the proof of (1.), we have i=1 C 1,i(Σ1,i Id)C2,i C 3,iΣ2,i C4,i < c c1c2c3c4(1 + δm2,d1+d2) δm1,d0+d2 {Σ2,i}i [n], {Σ2,i}i [n] A 1 2e 90(d0+d2). (148) for an absolute constant c . Integrating over all {Σ2,i}i [n] A and using δm2,d1+d2 1 yields i=1 C 1,i(Σ1,i Id)C2,i C 3,iΣ2,i C4,i 2 < c c1c2c3c4 δm1,d0+d2 A 1 2e 90(d0+d2). (149) Therefore, by the law of total probability and (146), we have P E1 2 c c1c2c3c4 δm1,d0+d2 2e 90(d0+d2) + P(Ac) 2e 90(d0+d2) + 2n 99. (150) Next, we have from (1.) that E2 2 = 1 n Pn i=1 C 1,i C2,i C 3,i(Σ2,i Id)C4,i 2 cc1c2c3c4 δm2,d0+d2 with probability at least 1 2e 90(d0+d2). Finally, combining our bounds on the two terms in (144) via a union bound yields P E 2 c c1c2c3c4( δm1,d0+d2 + (1 + δm2,d1+d2) δm2,d0+d2) (151) as desired. Note that we could instead use (146) to bound E2 2, which would result in the bound (3.). 8. Let E := 1 n Pn i=1 C 1,iΣ1,i C2,i C 3,iΣ2,i C4,i C 5,iΣ2,i C6,i C 1,i C2,i C 3,i C4,i C 5,i C6,i. We make a similar argument as in the proof of (2.) We have i=1 C 1,i(Σ1,i Id)C2,i C 3,iΣ2,i C4,i C 5,iΣ2,i C6,i | {z } :=E1 i=1 C 1,i C2,i C 3,i(Σ2,i Id)C4,i C 5,iΣ2,i C6,i | {z } :=E2 i=1 C 1,i C2,i C 3,i C4,i C 5,i(Σ2,i Id)C6,i | {z } :=E3 We know from Theorem 4.6.1 in (Vershynin, 2018) that P( C 3,i(Σ2,i Id)C4,i 2 cc3c4δm2,d1+d2) 1 2n 100 and P( C 5,iΣ2,i C6,i 2 cc5c6(1 + δm2,d2+d3)) 1 2n 100. Union bounding these events over MAML and ANIL Provably Learn Representations i [n] gives P( E2 2 cc1c2c3c4c5c6δm2,d1+d2(1 + δm2,d2+d3)) 1 4e 99. Union bounding over the same events, we also have P( E3 2 cc1c2c3c4c5c6δm2,d2+d3) 1 2n 99. Next, we make a similar argument as in (2.) to control E1 2, except that here A is defined as A := {Σ2,i}i [n] : C 3,iΣ2,i C4,i 2 cc3c4(1 + δm2,d1+d2), C 5,iΣ2,i C6,i 2 cc5c6(1 + δm2,d2+d3) i [n] , (153) which occurs with probability at least 1 4n 99 (which is implied by our discussion of bounding E3 2). Thus, following the logic in (2.), we obtain P( E1 2 cc1c2c3c4c5c6(1 + δm2,d1+d2)(1 + δm2,d2+d3) δm1,d0+d3) 1 4n 99 2e 90(d0+d3). Combining all bounds yields the desired result. More generally, we add and subtract terms to show concentration through either a Σ Id matrix, or an Xz matrix, with off terms bounded for each i by sub-gaussianity. Lemma 21. Consider the setting described in Lemma 20. Further, suppose min(d1, d2, d3) = 1 and max(d1, d2, d3) = k. 
Then the following events each hold with probability at most c (e 100 + n 99 + m1 99) for absolute constants c, c : U1 := 1 n Pn i=1 Σ1,i C2,i C 3,iΣ2,i C4,i C 5,iΣ1,i C6,i C2,i C 3,iΣ2,i C4,i C 5,i C6,i 2 cc2c3c4c5c6( δ+ k n Pn i=1 Σ1,i C2,i C 3,iΣ2,i C4,i C 5,i X 1,iz1,i 2 cσc2c3c4c5 δ n Pn i=1 Σ1,ic2,iz 1,i X1,i C4,i C 5,iΣ2,i C6,i 2 cσc2c3c4c5 δ n Pn i=1 X 1,iz1,ic 2,i C 3,iΣ1,i C4,i C 5,iΣ2,i C6,i 2 cσc1c2c3c4c5c6 ( d log(nm1)+log(nm1)) log(nm1) nm1 U5 := 1 nm2 1 Pn i=1 X t,izt,iz t,i Xt,i C2,i C 3,iΣ2,i C4,i 2 d log(nm1)+log(nm1)) log(nm1) nm1 + 1 d log(nm1) + k log(nm1) + d log1.5(nm1) + log2(nm1))/ nm1. Proof. 1. Similarly to previous proofs involving sums of products of independent matrices, the idea is to first use that one set of matrices is small with high probability, then condition on these sets of matrices being small to isolate the randomness of the other matrices. Note that matrix C 3,iΣ2,i C4,i has maximum dimension at most k, so by Lemma 20, for any i [n], { C 3,iΣ2,i C4,i 2 cc3c4(1 + δm1,k)} holds with probability at most n100. Applying a union bound over [n] gives that A := i [n]{ C 3,iΣ2,i C4,i 2 cc3c4(1 + δm1,k)} holds with probability at least 1 n 99. Conditioning on A, and using δm1,k 1, we can apply Lemma 22 to obtain that i=1 Σ1,i C2,i C 3,iΣ2,i C4,i C 5,iΣ1,i C6,i C2,i C 3,iΣ2,i C4,i C 5,i C6,i 2 cc2c3c4c5c6( δ+ 1+C2 occurs with probability at most . Since P(U1) P(U1|A) + P(Ac), we obtain the result. 2. We make the same argument as for (1) except that we apply Lemma 23 instead of Lemma 22. MAML and ANIL Provably Learn Representations 3. Again, we use Lemma 23 as in (2). 4. Again, we use Lemma 23 as in (2). 5. Here we make the same argument as (1) except that we apply Lemma 24 instead of Lemma 22. The following is a slightly generalized version of Theorem 1.1 in Magen & Zouzias (2011): here, the random matrices are not necessarily identically distributed, whereas they are identically distributed in Magen & Zouzias (2011). However, the proof from (Magen & Zouzias, 2011) does not rely on the matrices being identically distributed, so the same proof from Magen & Zouzias (2011) holds without modification for the below result. Theorem 9 (Theorem 1.1 in (Magen & Zouzias, 2011)). Let 0 < ϵ < 1 and M1, . . . , MN be a sequence of independent symmetric random matrices that satisfy 1 N PN i=1 E[Mi] 2 1 and Mi 2 B and rank(Mi) r almost surely for all i [N]. Set N = Ω(B log(B/ϵ2)/ϵ2). If r N almost surely, then i=1 Mi E[Mi] ! 2 1 poly(N) (155) The following lemma again gives generic concentration results but for a more difficult set of matrices. The key technical contribution is a truncated version of Theorem 9. Lemma 22. Suppose that x is a random vector with E[x] = 0d, and Cov(x) = Id, and is Id-sub-gaussian. Let {xi,j}i [n],j [m] be nm independent copies of x. Further, let Cℓ,i Rd d ℓ/2 for ℓ= 2, 3, 4 be fixed matrices for i [n], and let cℓ:= maxi [n] Cℓ,i 2 for ℓ= 2, 3, 4. Denote Σi = 1 m Pm j=1 xi,jx i,j. Then, if m max(1, C2cc2c3c4d1), i=1 Σi C2,i C 3,iΣi C4,i 1 i=1 C2,i C 3,i C4,i log(nm) + p log(nm) + p log(m) nm c2c3c4 log(nm) + q max(c2c3c4, 1) + c d1 for an absolute constant c, with probability at least 1 2m 99 1 poly(nm) 2e 90d1. As in previous cases, in this lemma we would like to show concentration of fourth-order products of sub-gaussian random vectors with only m = poly(k) O( d n + 1) samples per task. 
The issue here, unlike in the cases in Lemma 20, is that the leading Σi has no dimensionality reduction - there is no product matrix C1,i to bring the d-dimensional random vectors that compose the leftmost Σi to a lower dimension. Thus, we would need m = Ω(d) samples per task to show concentration of each Σi (or Σi C2,i). We must get around this by averaging over n. However, doing so requires dealing with fourth-order products of random vectors instead of bounding each of the two copies of Σi in the i-th term separately (perhaps along with their dimensionality-reducing products). Due to the fourth-order products, we cannot apply standard concentrations based on sub-gaussian and sub-exponential tails. Instead, we leverage the low rank (at most k) of the matrices involved by applying a truncated version of the of the concentration result for bounded, low-rank random matrices in (Magen & Zouzias, 2011). Proof. Throughout the proof we use c as a generic absolute constant. First note that by expanding Σi and the triangle MAML and ANIL Provably Learn Representations inequality, 1 n i=1 Σi C2,i C 3,iΣi C4,i 1 i=1 C2,i C 3,i C4,i j,j =j xi,jx i,j C2,i C 3,ixi,j x i,j C4,i m(m 1) i=1 C2,i C 3,i C4,i j xi,jx i,j C2,i C 3,ixi,jx i,j C4,i m nm2 i=1 C2,i C 3,i C4,i j =j xi,j x i,j C2,i C 3,ixi,jx i,j C4,i (m 1) i=1 C2,i C 3,i C4,i j=1 xi,jx i,j C2,i C 3,ixi,jx i,j C4,i 1 nm i=1 C2,i C 3,i C4,i Note that E is unbiased while E is biased due to the fourth-order product. We first bound E 2. Step 1: Bound E 2. Add and subtract (m 1) nm Pn i=1 1 m Pm j=1 C2,i C 3,ixi,jx i,j C4,i to obtain j =j xi,j x i,j Id C2,i C 3,ixi,jx i,j C4,i i=1 C2,i C 3,i j=1 xi,jx i,j Id j =j xi,j x i,j Id C2,i C 3,ixi,jx i,j C4,i + cc2c3c4 δm,max(d ,d2) (156) where (156) follows with probability at least 1 2e 90(d1+d2) by Lemma 20, and d denotes d if the C2,i s are distinct, and denotes d1 otherwise (since if these matrices are equal, they can be factored out of the norm, in which case we show concentration of d1 d2-dimensional random matrices). To deal with the first term in (156), note that as mentioned before, we need to show concentration over i [n] to avoid requiring m = Ω(d). Ideally, we could also concentrate over j [m], but we would lose independence of the summands. Thus, we reorder the sum and use the triangle inequality to write m 1 j =j xi,j x i,j Id C2,i C 3,ixi,jx i,j C4,i j =j xi,j x i,j Id C2,i C 3,ixi,jx i,j C4,i j =j xi,j x i,j Id C2,i C 3,ixi,jx i,j C4,i | {z } =:Ej For each j [m], define Ej as the event { C 3,ixi,j 2 (γ + d1)c3, C 4,ixi,j 2 (γ + d2)c4 i [n]} for some γ > 0. Note that xi,j and C 2,ixi,j are d (resp. d2)-dimensional sub-gaussian random vectors with sub-gaussian norm at most c (resp. cc2). Thus Ej occurs with probability at least 1 2ne cγ2. Then using the law of total probability, for any ϵ > 0, we have P ( Ej 2 ϵ) P ( Ej 2 ϵ|Ej) + P Ec j P ( Ej 2 ϵ|Ej) + 2ne cγ2 (157) MAML and ANIL Provably Learn Representations Consider E1. For any fixed set {xi,1}i [n] E1, the d2-dimensional random vectors {xi,j C2,i C 3,ixi,1x i,1C4,i}i [n],j {2,...,m} are sub-gaussian with norms at most c (γ + d1)(γ + d2)c2c3c4. Likewise, the d-dimensional random vectors {xi,j }i [n],j {2,...,m} are sub-gaussian with norms at most c. Thus using Bernstein s inequality, we can bound P E1 2 < c (γ + p d2)c2c3c4 max d+d2+λ n(m 1), d+d2+λ2 {xi,1}i [n], {xi,1}i [n] E1 1 2e λ2. (158) for λ > 0 and an absolute constant c . Integrating over all {xi,1}i [n] E1 yields P E1 2 < c (γ + p d2)c2c3c4 max d+d2+λ n(m 1), d+d2+λ2 1 2e λ2. 
(159) Therefore, using (157), we have P E1 2 c (γ + p d2)c2c3c4 max d+d2+λ n(m 1), d+d2+λ2 2e λ2 + 2ne cγ2. (160) Repeating the same argument for all j [m] and applying a union bound gives j=1 Ej 2 c (γ + p d2)c2c3c4 max d+d2+λ n(m 1), d+d2+λ2 2me λ2 + 2mne cγ2. (161) Choose λ = 10 p log(m) and γ = 10 p log(mn), and use p n(m 1) d + d2 + 10 p log(m) to obtain j=1 Ej 2 c c2c3c4( p log(nm) + p log(nm) + p 2m 99 + 2(mn) 99cd1 + 2(mn) 99cd2 = P E 2 c c2c3c4( p log(nm) + p log(nm) + p + cc2c3c4 δm,max(d ,d2) 2m 99 + 2(mn) 99cd1 + 2(mn) 99cd2 + 2e 90(d1+d2) (162) = P E 2 c c2c3c4( p log(nm) + p log(nm) + p 2m 99 + 2(mn) 99cd1 + 2(mn) 99cd2 + 2e 90(d1+d2) (163) where (162) follows from (156) and (163) follows by the fact that δm,max(d ,d2) is dominated. Step 2: Bound E 2. Bounding E 2 is challenging because we must deal with fourth-order products in xi,j, which may have heavy tails. However, we can leverage the independence and low-rank of the summands, combined with the sub-gaussian tails of each random vector. Second, we must control the bias in E , which we achieve by appealing to C-L4-L2 hypercontractivity. First note that by the triangle inequality j=1 = xi,jx i,j C2,i C 3,ixi,jx i,j C4,i i=1 C2,i C 3,i C4,i It remains to control the first norm. To do so, we employ Theorem 9 (a.k.a. Theorem 1.1 from (Magen & Zouzias, 2011)) which characterizes the concentration of low-rank, bounded, symmetric random matrices with small expectation. Thus, in order to apply this theorem, we must truncate and symmetrize the random matrices, and control their expectation. MAML and ANIL Provably Learn Representations Define Eℓ,i,j := { Cℓ,ixi,j 2 c(ρ + pd ℓ/2 )cℓ} for some ρ > 0 and ℓ= 2, 3, 4 and all i, j, and E1,i,j := { xi,j 2 c(ρ + d)} for some ρ > 0 and ℓ= 2, 3, 4 and all i, j. Let χEℓ,i,j be the indicator random variable for the event Eℓ,i,j. Define the truncated random variables xℓ,i,j := χEℓ,i,j Cℓ,ixi,j for ℓ= 2, 3, 4 and all i, j and x1,i,j := χEℓ,i,jxi,j for all i, j. Let Si,j := xi,jx i,j C2,i C 3,ixi,jx i,j C4,i/m and Si,j := x1,i,j x 2,i,j x3,i,j x 4,i,j/m for each i, j. Note that due to sub-gaussianity and earlier arguments, P( i,j 4 ℓ=1 Eℓ,i,j) 2mn P4 ℓ=1 e cρ2 = 8mne cρ2. Thus, for any ϵ > 0, 2 ϵ + 8nme cρ2 (165) First, form the lifted, symmetric matrices Si,j := 0 Si,j S i,j 0 for all i, j, and note that Pn i=1 Pm j=1 Si,j 2 = 2 Pn i=1 Pm j=1 Si,j 2. Also note that by definition, Si,j 2 B := d) Q4 ℓ=2(ρ + pd ℓ/2 ) max(c2c3c4, 1) for all i, j almost surely, and the Si,j s are independent. We still must control E[ Si,j] 2. We have that E[ Si,j] 2 = 2 E[ Si,j] 2. Using Lemma 25 (with C1 = Id), we obtain m E[ Si,j] 2 m C2 C2,i 2 C3,i 2 C4,i 2 m C2c2c3c4d1 for all i [n], j [m]. Thus, E[ Si,j] 2 1 for all i, j as m 2C2c2c3c4d1. Next, note that each Si,j is rank at most min(d, d1, d2), so Si,j is rank at most 2 min(d, d1, d2). Now we are ready to apply Theorem 9. Doing so, we obtain: j=1 E[ Si,j] 2 ϵ 1 poly(nm) (167) as long as nm c B log(B/ϵ2)/ϵ2 and nm c min(d, d1, d2). Setting ϵ = c B nm yields j=1 E[ Si,j] 2 B nm 1 poly(nm) (168) as long as nm Bec B, which always holds since we will soon choose ρ = p log(nm) and we have chosen B appropriately. Therefore, with probability at least 1 poly(nm), we have j=1 E[ Si,j] j=1 E[ Si,j] B 2 nm + C2d1 i=1 C2,i 2 C3,i 2 C4,i 2 which implies that E 2 c nm(ρ + max(c2c3c4, 1) + (1 + C2)d1 m c2c3c4 (169) with probability at least 1 1 poly(nm) 8nme cρ2 by (164) and (165). 
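As a brief numerical aside (illustrative only; the matrix sizes are arbitrary and the normalization follows the usual Hermitian-dilation convention, which may differ from the one in the text), the following checks the basic properties of the lifted symmetric matrices formed above: the nonzero eigenvalues of the dilation are the plus/minus singular values of the rectangular block, its spectral norm matches that of the block, and its rank is at most doubled, which is what makes Theorem 9 applicable to the rectangular summands.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 7))                     # stand-in for a rectangular summand S_{i,j}
H = np.block([[np.zeros((4, 4)), S],
              [S.T, np.zeros((7, 7))]])             # lifted symmetric (Hermitian dilation) matrix
eigs = np.linalg.eigvalsh(H)                        # eigenvalues in ascending order
svals = np.linalg.svd(S, compute_uv=False)
print(np.allclose(np.sort(eigs)[-4:], np.sort(svals)))           # top eigenvalues = singular values of S
print(np.isclose(np.abs(eigs).max(), svals.max()))               # spectral norm is preserved
print(np.linalg.matrix_rank(H) == 2 * np.linalg.matrix_rank(S))  # rank at most doubles
```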
Choose ρ = 10 p log(nm) and recall that C is an absolute constant to obtain E 2 c nm( p log(nm) + q max(c2c3c4, 1) + c d1 m c2c3c4 (170) MAML and ANIL Provably Learn Representations with probability at least 1 1 poly(nm). Combining Steps 1 and 2, we have 1 n i=1 Σi C2,i C 3,iΣi C4,i 1 i=1 C2,i C 3,i C4,i log(nm) + p log(nm) + p log(m) nm c2c3c4 log(nm) + q max(c2c3c4, 1) + d1 for an absolute constant c with probability at least 1 2m 99 1 poly(nm) 2e 90d1. Lemma 23. Suppose that x is a random vector with mean-zero, Id-sub-gaussian distribution over Rd. Let {xi,j}i [n],j [m] be nm independent copies of x. Denote Σi = 1 m Pm j=1 xi,jx i,j and Xi = [xi,1, . . . , xi,m] for all i [n]. Let z = [z1, . . . , zm] Rm be a vector whose elements are i.i.d. draws from N(0, σ2), and let {zi}i [n] be n independent copies of z. Further, let Cℓ,i Rd d ℓ/2 for ℓ= 2, 3, 5 be fixed matrices for i [n], and let c4,i Rd. Also define cℓ:= maxi [n] Cℓ,i 2 for ℓ= 2, 3, 5, c4 := maxi [n] c4,i 2. Then, i=1 Σi C2,i C 3,i Xizi log(nm))(d1+log(nm)) nm i=1 X i zic 4,iΣi C 5,i log(mn))( d1+ log(nm))log(nm) nm for an absolute constant c, each with probability at least 1 2m 99 1 poly(nm). Proof. We only show the proof for (i) as the proof for (ii) follows by similar arguments. We argue similarly to the proof of Lemma 22. We have 1 n i=1 Σi C2,i C 3,i Xizi j =j xi,j x i,j C2,i C 3,ixi,jzi,j j=1 xi,jx i,j C2,i C 3,ixi,jzi,j Step 1: e 2. Add and subtract m 1 nm Pn i=1 1 m Pm j=1 C2,i C 3,ixi,jzi,j to obtain j=1 C2,i C 3,ixi,jzi,j j =j xi,j x i,j Id C2,i C 3,ixi,jzi,j cσc2c3 δm,d1 + m 1 j =j xi,j x i,j Id C2,i C 3,ixi,jzi,j where the second inequality follows with probability at least 1 e 90d1 by Lemma 20. Next, m 1 j =j xi,j x i,j Id C2,i C 3,ixi,jzi,j j =j xi,j x i,j Id C2,i C 3,ixi,jzi,j MAML and ANIL Provably Learn Representations By sub-gaussianity, we have that with probability at least 1 4(nm) 99, C3,ixi,j 2 cc3( d1 + p log(nm)) and zi,j 2 cσ p log(nm) for all i [n], j [m]. Thus, as in previous arguments, we have j =j xi,j x i,j Id C2,i C 3,ixi,jzi,j log(m))( d1+ for all j [m] with probability at least 1 2m 99 4(nm) 99, resulting in e 2 cσc2c3 δm,d1 + cσc2c3( log(m))( d1 + p log(m))( d1 + p log(nm) nm (172) with probability at least 1 2m 99 4(nm) 99. Step 2: e 2. For e , we again use Theorem 9. Define Eℓ,i,j and xℓ,i,j as in Lemma 22 for ℓ= 1, 2, 3 and i [n] and j [m]. Define E4,i,j = {|zi,j| cσ p log(nm)} and zi,j = χE4,i,jzi,j for all i [n] and j [m]. Define si,j = xi,jx i,j C2,i C 3,ixi,jzi,j/m and si,j = x1,i,j x 2,i,j x3,i,j zi,j/m, then we have si,j = si,j for all i, j with probability at least 1 1 poly(nm). Also, si,j B := cσc2c3( log(nm))(d1 + log(nm)) p log(nm). Next, by the symmetry of the Gaussian distribution, E[ zi,j] = 0, thus E[ si,j] 2 = 0 by independence. Defining si,j as in Lemma 22, we can now apply Theorem 9 as in Lemma 22 to obtain: log(nm))(d1+log(nm)) 1 poly(nm) (173) which, recalling e 2 = 1 nm Pn i=1 Pm j=1 si,j 2 , implies P e 2 cσc2c3( log(nm))(d1+log(nm)) log(nm) nm 1 poly(nm) (174) Combining (172) and (174) completes the proof. Lemma 24. Suppose that x is a random vector with mean-zero, Id-sub-gaussian distribution over Rd. Let {xi,j}i [n],j [m] be nm independent copies of x. Denote Σi = 1 m Pm j=1 xi,jx i,j and Xi = [xi,1, . . . , xi,m] for all i [n]. Let z = [z1, . . . , zm] Rm be a vector whose elements are i.i.d. draws from N(0, σ2), and let {zi}i [n] be n independent copies of z. Further, let Ci Rd d1 be fixed matrices for i [n], and let c := maxi [n] Ci 2. 
Then, i=1 X i ziz i Xi Ci log(nm))( d1+ log(nm) nm + σ2 c for an absolute constant c with probability at least 1 2m 99 1 poly(nm). Proof. We have 1 n i=1 X i ziz i Xi Ci j =j xi,jzi,j zi,jx i,j Ci j=1 z2 i,jxi,jx i,j Ci MAML and ANIL Provably Learn Representations Step 1: E 2. Note that m 1 j=1 xi,jzi,j j =j zi,j x i,j Ci i=1 xi,jzi,j j =j zi,j x i,j Ci Next, with probability at least 1 4(nm) 99, Cixi,j 2 c( d1 + p log(nm)) and zi,j 2 cσ p log(nm) for all i [n], j [m]. Thus, by conditioning on this event as in previous arguments, we can show m 1 i=1 xi,jzi,j j =j zi,j x i,j Ci log(m))( d1 + p for all j [m] with probability at least 1 2m 99 4(nm) 99, resulting in log(m))( d1 + p log(nm) nm (175) with probability at least 1 2m 99 4(nm) 99. Step 2: E 2. Define E1,i,j = { xi,j c( log(nm))}, E2,i,j = { Cixi,j 2 c( d1 + p log(nm))} and E3,i,j = {|zi,j| cσ p log(nm)} for all i [n], j [m]. Define x1,i,j = χE1,i,jxi,j, x2,i,j = χE2,i,j C i xi,j, and zi,j = χE3,i,jzi,j for all i [n] and j [m]. Define Si,j = z2 i,jx1,i,jx 2,i,j Ci/m and Si,j = z2 i,j x1,i,j x 2,i,j/m, then we have Si,j = Si,j for all i, j with probability at least 1 1 poly(nm). Also, Si,j B := cσ2 c( log(nm))( d1 + p log(nm). Note that by the law of total expectation, E[ Si,j] 2 = E[Si,j E1,i,j E2,i,j E3,i,j] 2P(E1,i,j, E2,i,j, E3,i,j) E[Si,j E1,i,j E2,i,j E3,i,j] 2P(E1,i,j E2,i,j E3,i,j) + E[Si,j Ec 1,i,j Ec 2,i,j Ec 3,i,j] 2P(Ec 1,i,j Ec 2,i,j E3,i,j) = E[Si,j] 2 Now, defining Si,j as in Lemma 22, we can now apply Theorem 9 as in Lemma 22 to obtain for m σ2 c: Si,j E[ Si,j] 2 cσ2 c( log(nm))( d1+ 1 poly(nm) (176) Now, note that 1 nm j=1 Si,j E[ Si,j] 2 + 1 nm j=1 E[ Si,j] 2 Si,j E[ Si,j] 2 + σ2 c Thus, recalling E 2 = 1 nm Pn i=1 Pm j=1 Si,j 2 , we have P E 2 cσ2 c( log(nm))( d1+ log(nm) nm + σ2 c 1 1 poly(nm) (178) Combining (175) and (178) completes the proof. Fact 1. Suppose x p satisfies E[x] = 0, Cov(x) = Id and x is Id-sub-gaussian, as in Assumption 3. Then x is C-L4-L2 hypercontractive for an absolute constant C, that is for any u Rd : u 2 = 1, E[ u, xi,j 4] C2(E[ u, xi,j 2])2 (179) MAML and ANIL Provably Learn Representations Lemma 25 (L4-L2 hypercontractive implication). Suppose x Rd is C-L4-L2 hypercontractive, E[x] = 0, and Cov(x) = Id. Further, let Cℓ Rd d ℓ/2 for ℓ= 1, 2, 3, 4 be fixed matrices for i [n], and let cℓ:= maxi [n] Cℓ,i 2 for ℓ= 1, 2, 3, 4. Given scalar thresholds aℓfor ℓ= 1, . . . , 4, form the truncated random vectors xℓ:= χ C ℓx 2 aℓCℓx. Then, E[ x1 x 2 x3 x 4 ] 2 C2 C1 2 C2 2 C3 2 C4 2d1. (180) Proof. First we note that if a random vector x is C-L4-L2 hypercontractive, then for any fixed matrix C Rd d1, then the random vector C x Rd1 is also C-L4-L2 hypercontractive, since for any unit vector u, 1 Cu 4 2 E[ u, C x 4] = E[ Cu Cu 2 , x 4] C2(E[ Cu Cu 2 , x 2])2 = 1 Cu 4 2 C2(E[ u, C x 2])2 = E[ u, C x 4] C2(E[ u, C x 2])2 Also, if the random vector x is C-L4-L2 hypercontractive then the truncated random vector x := χ x 2 a x 2 is also C-L4-L2 hypercontractive. To see this, observe that by the law of total expectation, E u, x 4 = E u, χ x 2 ax 4 = E u, x 4 x 2 a P( x 2 a) E[ u, x 4] C2(E[ u, x 2])2 (181) So we have that the truncated random vectors { xh}4 h=1 are C-L4-L2 hypercontractive. Next, pick some u Rd1 : u 2 1 and v Rd4 : v 1. 
By the Cauchy-Schwarz inequality and C-L4-L2 hypercontractivity, we have E[u x1 x 2 x3 x 4 v] (E[(u x1 x 4 v)2]E[( x 2 x3)2])1/2 (E[(u x1)4]E[( x 4 v)4])1/4(E[Tr( x2 x 3 )2])1/2 C(E[(u x1)2]E[( x 4 v)2])1/2 E d X ℓ=1 e ℓ x2 x 3 eℓ C(E[(u x1)2]E[( x 4 v)2])1/2 X ℓ,ℓ E e ℓ x2 x 3 eℓe ℓ x2 x 3 eℓ 1/2 C(E[(u x1)2]E[( x 4 v)2])1/2 X E (e ℓ x2)4]E[( x 3 eℓ)4]E[(e ℓ x2)4]E[( x 3 eℓ )4 1/4 1/2 C2(E[(u x1)2]E[( x 4 v)2])1/2 (E[(e ℓ x2)2])2(E[( x 3 eℓ)2])2(E[(e ℓ x2)2])2(E[( x 3 eℓ )2])2 1/4 1/2 (182) where eℓis the ℓ-th standard basis vector in Rd1. Note that by the law of total expectation and the nonnegativity of U C 1 xx C1U, E[( x 1 u)2] = E u C 1 xx C1u C 1 x 2 a P( C 1 x 2 a) E[u C 1 xx C1u] = u C 1 C1u C1 2 2 Therefore, applying the same logic for E[(e ℓ x2)2], E[(e ℓ x3)2], and E[(e ℓ x4)2], and using (182), we obtain E[u x1 x 2 x3 x 4 v] C2 C1 2 C4 2 ℓ,ℓ C2 2 2 C3 2 2 1/2 = C2 C1 2 C2 2 C3 2 C4 2d1 Repeating this argument over all unit vectors u, v completes the proof. Next, we characterize the diversity of the inner loop-updated heads for both ANIL and FO-ANIL. Note that now we are analyzing ANIL and FO-ANIL specifically rather than studying generic matrix concentration. MAML and ANIL Provably Learn Representations Lemma 26. Let wt,i be the inner loop-updated head for the i-th task at iteration t for ANIL and FO-ANIL for all i [n]. Define µ2 := σmin 1 n Pn i=1 wt,iw t,i and L2 := σmax 1 n Pn i=1 wt,iw t,i . Assume t 2 1 10 and Assumption 1, 2, and 3 hold. Then i=1 wt,iw t,i L2 := 2 t 2 wt 2 + αL + δmin,k( wt 2 + αLmax + ασ) 2 (183) i=1 wt,iw t,i µ2 := 0.9αE0µ2 2.2 α wt 2 t 2η 2 t 2 wt 2 δmin,k( wt 2 + αLmax + ασ) 2.2 α δmin,k( wt 2 + αL + ασ)Lmax (184) with probability at least 1 4n 99 6e 90k. Proof. Note that wt,i can be written as: wt,i = wt αB t Σin t,i Btwt + αB t Σin t,i B w ,t,i = r + si + p1,i + p2,i + p3,i (185) where r = twt, si = αB t B w ,t,i, p1,i := α(B t Bt B t Σin t,i Bt)wt, p2,i := α(B t B B t Σin t,i B )w ,t,i, and p3,i := α min B t (Xin t,i) zin t,i for all i [n] (for ease of notation we drop the iteration index t). Note that since t 2 1 10, 11/10 α . As a result, for any i [n], from Lemma 20 we have p1,i 2 1.1 wt 2δmin,k, p2,i 2 1.1αLmaxδmin,k, p3,i 2 1.1ασδmin,k (186) each with probability at least 1 2n 100. Thus, all of these events happen simultaneously with probability at least 1 6n 100 via a union bound. Further, a union bound over all i [n] shows that A := i [n] p1,i 2 1.1 wt 2δmin,k p2,i 2 1.1αLmaxδmin,k p3,i 2 1.1ασδmin,k occurs with probability at least 1 6n 99. Thus by the triangle inequality, a i=1 p1,i + p2,i + p3,i 2 1.1δmin,k( wt 2 + αLmax + ασ) (187) with probability at least 1 6n 99, and i=1 (r + si)(r + si) 2 t 2 2 wt 2 2 + 2.2 α t 2 wt 2η + 1.1αL2 (188) MAML and ANIL Provably Learn Representations i=1 wt,iw t,i i=1 (r + si)(r + si) 2 + 2 i=1 (r + si)(p1,i + p2,i + p3,i) 2 i=1 (p1,i + p2,i + p3,i)(p1,i + p2,i + p3,i) 2 t 2 2 wt 2 2 + 2.2 α t 2 wt 2η + 1.1αL2 + 2 i=1 r(p1,i + p2,i + p3,i) 2 + 2α B t B 2 ... (p1,i + p2,i + p3,i) ... + max i [n] p1,i + p2,i + p3,i 2 2 (189) t 2 2 wt 2 2 + 2.2 α t 2 wt 2η + 1.1αL2 + 2.2 t 2 2 wt 2δmin,k( wt 2 + αLmax + ασ) + 2.2 αL δmin,k( wt 2 + αLmax + ασ) + 1.12( wt 2 + αLmax + ασ)2δ2 min,k 2 t 2 wt 2 + αL + δmin,k( wt 2 + αLmax + ασ) 2 (190) where W ,t = [w ,t,1, . . . , w ,t,n] , (189) follows from the triangle inequality, and (190) follows with probability at least 1 6n 99 from the discussion above. We make an analogous argument to lower bound σmin 1 n Pn i=1 wt,iw t,i . 
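As an aside (the lower-bound argument resumes immediately below), the following sketch simulates the ANIL/FO-ANIL inner-loop head update in the multi-task linear representation setting and compares the extreme eigenvalues of the adapted-head Gram matrix (1/n) sum_i w_{t,i} w_{t,i}^T with those of the ground-truth heads, the diversity-transfer behavior that Lemma 26 quantifies. The problem sizes, step size, noise level, and the particular scaling of B_t below are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n, alpha, m_in = 50, 3, 20, 0.1, 25
B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))   # ground-truth representation
W_star = rng.standard_normal((n, k))                    # rows are diverse ground-truth heads w_{*,t,i}
B_t, _ = np.linalg.qr(rng.standard_normal((d, k)))
B_t /= np.sqrt(alpha)                                   # scaled so that alpha * B_t^T B_t = I_k
w_t = 0.1 * rng.standard_normal(k)                      # current shared head

heads = []
for i in range(n):
    X = rng.standard_normal((m_in, d))                              # inner-loop data for task i
    y = X @ B_star @ W_star[i] + 0.01 * rng.standard_normal(m_in)   # noisy linear-regression labels
    grad_w = B_t.T @ X.T @ (X @ B_t @ w_t - y) / m_in               # gradient of the task loss w.r.t. the head only
    heads.append(w_t - alpha * grad_w)                              # inner-loop head adaptation

G = np.mean([np.outer(w, w) for w in heads], axis=0)
print("adapted heads      (sigma_min, sigma_max):", np.linalg.eigvalsh(G)[[0, -1]])
print("ground-truth heads (sigma_min, sigma_max):", np.linalg.eigvalsh(W_star.T @ W_star / n)[[0, -1]])
```

With diverse ground-truth heads, the adapted-head Gram matrix stays well conditioned (up to factors involving alpha and the alignment between B_t and B_star), which is the property the outer loop exploits.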
This time, we only need to bound first-order products of the p matrices, which concentrate around zero as n becomes large. So now we are able to obtain finite-sample dependence on δmin,k (which decays with 1 n) instead of δmin,k (which does not), as follows. i=1 wt,iw t,i i=1 (r + si)(r + si) + (r + si)(p1,i + p2,i + p3,i) + (p1,i + p2,i + p3,i)(r + si) + (p1,i + p2,i + p3,i)(p1,i + p2,i + p3,i) ! i=1 (r + si)(r + si) ! i=1 (r + si)(p1,i + p2,i + p3,i) 2 i=1 (r + si)(p1,i + p2,i + p3,i) 2 0.9αE0µ2 2.2 α wt 2 t 2η 2 t 2 wt 2 δmin,k( wt 2 + αLmax + ασ) 2.2 α δmin,k( wt 2 + αL + ασ)Lmax where the last inequality follows with probability at least 1 6e 90k. E.2. FO-ANIL For FO-ANIL, inner loop update for the head of the i-th task on iteration t is given by: wt,i = wt α w ˆLi(Bt, wt, Din i ) = (Ik αB t Σin t,i Bt)wt + αB t Σin t,i B w ,t,i + α min B t (Xin t,i) zin t,i. (191) MAML and ANIL Provably Learn Representations The outer loop updates for the head and representation are: wt+1 = wt β i=1 w ˆLi(Bt, wt,i, Dout t,i ) B t Σout t,i Btwt,i B t Σout t,i B w ,i 2 mout B t (Xout,g t,i ) zout,g t,i (192) Bt+1 = Bt β i=1 B ˆLi(Bt, wt,i, Dout t,i ) Σout t,i Btwt,iw t,i Σout t,i B w ,t,iw t,i 2 mout (Xout t,i ) zout t,i w t,i (193) Lemma 27 (FO-ANIL, Finite samples A1(t + 1)). For any t, suppose that A2(s), A3(s) and A4(s) occur for all s [t]. Then wt+1 2 1 10 αE0 min(1, µ2 η2 )η (194) with probability at least 1 1 poly(n) . Proof. The proof follows similar structure as in the analogous proof for the infinite-sample case. Recall the outer loop updates for ANIL (here we replace t with s): ws+1 = (Ik βB s Bs(Ik αB s Bs))ws + β(Ik αB s Bs)B s B 1 n + αβB s Bs 1 n B s Bs B s Σin s,i Bs ws αβB s Bs 1 n i=1 (B s B B s Σin s,i B )w ,s,i + αβB s Bs 1 n i=1 B s (Xin s,i) zin s,i + β i=1 (B s Bs B s Σout s,i Bs)ws,i i=1 (B s B B s Σout s,i B )w ,s,i + 2β nmout i=1 B s (Xout s,i ) zout s,i (195) Note that St s=0 A3(s) implies σmax(B s Bs) 1+ s 2 α for all s {0, . . . , t+1}. Also, we can straightforwardly use Lemma 20 with the Cauchy-Schwartz inequality to obtain, for some absolute constant c, αβB s Bs 1 n B s Bs B s Σin s,i Bs wt α ws 2 δmin,k αβB s Bs 1 n i=1 (B s B B t Σin s,i Bs)w ,s,i 2 c β αLmax δmin,k (196) i=1 (B s Bs B s Σout s,i Bs)ws,i α δmout,k max i [n] ws,i 2 α δmout,k( s 2 ws 2+ αLmax + δmin,k ws 2 + αδmin,k Lmax + ασδmin,k) α δmout,k( s 2 ws 2+c αLmax + ασδmin,k) (197) β n i=1 (B s B B s Σout s,i B )w ,s,i 2 c β αLmax δmout,k i=1 B s (Xout s,i ) zout s,i 2 c β ασ δmout,k MAML and ANIL Provably Learn Representations using that δmin,k < 1 and ws 2 c αη in (197). Thus using (195) and the Cauchy-Schwarz and triangle inequalities, we have for an absolute constant c: ws+1 2 (1 + c β α s 2) ws 2 + c β α s 2η + c β α ws 2 δmin,k + c β α δmin,k(Lmax + σ) + c β α δmout,k αLmax + c β αLmax δmout,k + c β ασ δmout,k α s 2) ws 2 + t β α s 2η + t β α (Lmax + σ) ( δmin,k + δmout,k) α s 2) ws 2 + c β α s 2η + β αζ1 (198) using ws,i 2 αLmax, where ζ1 = c(Lmax + σ)( δmin,k + δmout,k). Thus, by Lemma 3, we have wt+1 2 c β α s=1 ( s 2η + ζ1) Next, let ρ := 1 0.5βαE0µ2 and ζ2 = O (Lmax + σ)2 δmout,k + (L2 max + Lmaxσ)( δmin,k + δ2 min,k) + σ2δ2 min,k + βα(Lmax + σ)4 δ2 mout,d as defined in (219). By Ss r=0 A2(r), we have s+1 2 ρ s 2 + cα2β2L4 dist2 s +βαζ2 ρ2 s 1 2 + ρ(cα2β2 dist2 s 1 +βαζ2) + cα2β2 dist2 s +βαζ2 ... r=0 ρs r(cα2β2L4 dist2 r +βαζ2) r=0 ρs r(cα2β2L4 dist2 r +βαζ2) (200) since I αB 0 B0 2 = 0 by choice of initialization. Now, we have that dists ρs +ε for all s {0, ..., t} by St s=0 A5(s). 
Thus, for any s {0, ..., t}, we have r=0 ρs r(cα2β2L4 dist2 r +βαζ2) r=0 ρs r(cα2β2L4 (2ρ2r + 2ε2) + βαζ2) r=0 ρs rρ2r + 2cα2β2L4 r=0 ρs rε2 + βα 2cρs α2β2L4 1 ρ + (2cα2β2L4 ε2 + βαζ2) 1 1 ρ 2cρsβαL2 κ2 /E0 + 2cε2βαL2 κ2 /E0 + ζ2/(E0µ2 ) =: ϵs (201) Now, applying equation (199) yields s=1 (ϵsη + ζ1) s=1 (ϵsη + ζ1) 1 + 4cβρsκ4 /(αE2 0) + 4c(t s)β2ε2κ2 L2 /E0 + 2(t s)βζ2/(αE0µ2 ) (202) s=1 (ϵsη + ζ1) 1 + 4cβκ4 /(αE2 0) + 4c Tβ2ε2κ2 L2 /E0 + 2Tβζ2/(αE0µ2 ) (203) MAML and ANIL Provably Learn Representations where (202) follows by plugging in the definition of ϵr and using the sum of a geometric series. In order for the RHS of (203) to be at most 1 10 αE0 min(1, µ2 η2 )η as desired, we can ensure that 4cβκ4 /(αE2 0) + 4c Tβ2ε2κ2 L2 /E0 + 2Tβζ2/(αE0µ2 ) 1 for all s and β α Pt s=1(ϵsη + ζ1) 1 10 αE0 min(1, µ2 η2 )η . To satisfy the first condition, it is sufficient to have β c αE2 0 κ4 ε2 c E0 T β2κ2 L2 ζ2 c αE0µ2 T β For the second condition, it is sufficient to have β cαE3 0κ 4 min(1, µ2 η2 ) ζ1 c κ4 η T E2 0 t X s=1 ϵs c κ4 E2 0 = ε2 c 1 T βαµ2 E0 ζ2 c L2 κ2 T E0 (204) However, for Corollary 3, will need a tighter bound on ζ2, namely ζ2 c E0µ2 T . In summary, the tightest bounds are: β cαE3 0κ 4 min(1, µ2 η2 ) (205) ε2 c 1 T βαµ2 E0 (206) ζ1 c κ4 η T E2 0 (207) ζ2 c E0µ2 T (208) To determine when these conditions hold, we must recall the scaling of ε, ζ1, ζ2. ε = O( Lmax(Lmax+σ) µ2 δmout,d) ζ1 = O((Lmax + σ)( δmin,k + δmout,k)) ζ2 = O (L2 max + Lmaxσ)( δmin,k + δmout,k + δ2 min,k) + σ2δ2 min,k + βα δ2 mout,d(Lmax + σ)2(Lmax + σδmin,k)2 Thus, in order to satisfy (205)-(208), we can choose: mout c βα d T E0(Lmax+σ)4 nµ2 + T 2k E2 0L2 max(Lmax+σ)2 nµ4 + T 2E4 0k(Lmax+σ)2 nη2 κ8 + βα T L2 max(Lmax+σ)2E0d Recalling that Lmax c k L , β ακ 4 , and α 1 Lmax+σ, we see that our choice of mout as nκ2 + Tdkσ2 n L2 κ2 + T 2k3κ4 n + T 2k3σ4 nµ4 + kµ2 η2 κ6 + kσ2 is sufficient, where we have treated E0 as a constant. For min, we can choose: min c T(k + log(n))E0(Lmax + σ)2 µ2 + c T 2k E2 0L2 max(Lmax + σ)2 nµ4 + c T 2k E4 0(Lmax + σ)2 MAML and ANIL Provably Learn Representations which is satisfied by min c T(k + log(n))(kκ2 + σ2 µ2 ) + c T 2k3κ4 n + c T 2k2κ2 σ2 µ2 n + c T 2k2(L2 +σ2) η2 κ8 n Since min and mout satisfy these conditions, we have completed the proof. Lemma 28 (FO-ANIL, Finite samples, A2(t + 1)). Suppose the conditions of Theorem 8 are satisfied and inductive hypotheses A1(t), A3(t) and A5(t) hold. Then A2(t + 1) holds with high probability, i.e. t+1 2 (1 0.5βαE0µ2 ) t 2 + cβ2α2L4 dist2 t +βαζ2 (209) for an absolute constant c and ζ2 = O (Lmax +σ)2 δmout,k +(L2 max +Lmaxσ)( δmin,k +δ2 min,k)+σ2δ2 min,k +βα(Lmax + σ)4 δ2 mout,d , with probability at least 1 1 poly(n). Proof. Note that we can write: Bt+1 = Bt β t 1 i=1 (Btwt B w ,t,i)((Ik αB t Bt)wt + αB t B w ,t,i) i=1 (Btwt B w ,t,i)(α(B t Bt B t Σin t,i Bt)wt + α(B t B B t Σin t,i B )w ,t,i) (210) i=1 (Btwt B w ,t,i)(αB t 1 min (Xin t,i) zt,i) (211) Id Σout t,i (Btwt,i B w ,t,i) w t,i + β nmout i=1 (Xout t,i ) zout t,i w t,i (212) i=1 Bt B t Id Σin t,i (Btwt B w ,t,i) w t,i + βα nmin i=1 Bt B t (Xin t,i) zin t,iw t,i (213) = Bpop t+1 + β(E1 + E2 + E3 + E4) (214) where Bt,pop := Bt β t 1 n Pn i=1(Btwt B w ,t,i)( twt + αB t B w ,t,i) denotes the update of the representation in the infinite sample case, and E1, E2, E3 and E4 are the finite-sample error terms in lines (210), (211), (212) and (213), respectively. From (214) and the triangle inequality, we can compute the final bound. 
t+1 2 Ik αB t,pop Bt,pop 2 + 2βα B t,pop(E1 + E2 + E3 + E4) 2 + β2α E1 + E2 + E3 + E4 2 2 (215) Note that from Corollary 1 and the fact that Bt 2 1.1/ α by A3(t), and β, α are sufficiently small, we have that Bpop t+1 2 1.1 α. Also, clearly Bpop t+1 Rd k. Therefore by the concentration results in Lemma 20 and the triangle and MAML and ANIL Provably Learn Representations Cauchy-Schwarz inequalities, we have, for an absolute constant c, max i [n] wt,i 2 t 2 wt 2 + c αLmax + δmin,k wt 2 + c ασδmin,k B t,pop E1 2 c α t 2 wt 2 2 δmin,k + c α t 2Lmax wt 2 δmin,k + c α t 2 wt 2Lmax δmin,k + t 2L2 max δmin,k c L2 max δmin,k B t,pop E2 2 c α t 2 wt 2σ δmin,k + c t 2Lmaxσ δmin,k c Lmaxσ δmin,k B t,pop E3 2 c α δmout,k t 2 wt 2 + αLmax + δmin,k wt 2 + ασδmin,k 2 + c α δmout,k Lmax t 2 wt 2 + αLmax + δmin,k wt 2 + ασδmin,k + c ασ δmout,k t 2 wt 2 + αLmax + δmin,k wt 2 + ασδmin,k c δmout,k(Lmax + σ) (Lmax + σδmin,k) B t,pop E4 2 cδmin,k wt 2 α + Lmax α t 2 wt 2 n + Lmax α n + δmin,k wt 2 + ασδmin,k + c α( t 2 wt 2σ δmin,k + Lmaxσ δmin,k + σδ2(min, k) wt 2 α + σ2δ2(min, k)) cδmin,k Lmax( Lmax n + Lmaxδmin,k + σδmin,k) + c(Lmaxσ δmin,k + Lmaxσδ2(min, k) + σ2δ2(min, k)) c δmout,k(Lmax + σ) (Lmax + σδmin,k) + L2 max(δ2(min, k) + δmin,k) + c Lmaxσ(δ2 min,k + δmin,k) + cσ2δ2 min,k (216) with probability at least 1 1 poly(n). Thus B t,pop(E1 + E2 + E3 + E4) 2 c δmout,k n (Lmax + σ) (Lmax + σδmin,k) + c(L2 max + Lmaxσ)δmin,k n + L2 maxδ2 min,k + Lmaxσδ2 min,k + σ2δ2 min,k (217) E1 2 ( t 2 wt 2 α + distt Lmax + t 2Lmax)(δmin,k wt 2 + αδmin,k Lmax) 1 n αL2 max δmin,k n E2 2 ( t 2 wt 2 α + distt Lmax + t 2Lmax) ασ δmin,k n αLmaxσ δmin,k n E3 2 δmout,d α(Lmax + σ)(Lmax + σδmin,k) E4 2 α δmin,k wt 2 α + Lmax α t 2 wt 2 n + αLmax n + δmin,k wt 2 + ασδmin,k + t 2 wt 2σ δmin,k α + Lmaxσ δmin,k + σδ2 min,k wt 2 α + σ2δ2 min,k αδmin,k Lmax Lmax n + Lmaxδmin,k + σδmin,k + α Lmaxσ δmin,k n + Lmaxσδ2 min,k + σ2δ2 min,k c α δmout,k(Lmax + σ) (Lmax + σδmin,k) + αL2 max(δ2 min,k + δmin,k) + αLmaxσ(δ2 min,k + δmin,k) + ασ2δ2 min,k MAML and ANIL Provably Learn Representations E1 + E2 + E3 + E4 2 α(L2 max + Lmaxσ)( δmin,k + δ2 min,k) + δmout,d α(Lmax + σ)(Lmax + σδmin,k) + ασ2δ2 min,k Now, from (215) and the triangle inequality, we can compute the final bound. Ik αB t,pop Bt,pop 2 + cβα δmout,k(Lmax + σ) (Lmax + σδmin,k) + (L2 max + Lmaxσ)( δmin,k + δ2 min,k) + σ2δ2 min,k + cβ2α2 (L2 max + Lmaxσ)( δmin,k + δ2 min,k) + δmout,d(Lmax + σ)(Lmax + σδmin,k) + σ2δ2 min,k 2 Ik αB t,pop Bt,pop 2 + cβα δmout,k(Lmax + σ) (Lmax + σδmin,k) + (L2 max + Lmaxσ)( δmin,k + δ2 min,k) + σ2δ2 min,k + cβ2α2 δ2 mout,d(Lmax + σ)2(Lmax + σδmin,k)2 = Ik αB t,pop Bt,pop 2 + βαζ2 (1 0.5βαE0µ2 ) t 2 + cβ2α2L4 dist2 t +βαζ2 (218) where the last line follows from Lemma 6 (note that all conditions for that lemma are satisfied by Ik αB t,pop Bt,pop 2), and ζ2 = O (Lmax + σ)2 δmout,k + (L2 max + Lmaxσ)( δmin,k + δ2 min,k) + σ2δ2 min,k + βα(Lmax + σ)4 δ2 mout,d (219) Corollary 3 (FO-ANIL, Finite samples A3(t + 1)). Suppose that A2(t + 1) and A3(t) hold. Then t+1 2 1 10 (220) Proof. From A2(t + 1) we have t+1 2 (1 0.5βαE0µ2 ) t 2 + cβ2α2L4 dist2 t +βαζ2 (1 0.5βαE0µ2 ) 1 10 + cβ2α2L4 + βαζ2 1 10 0.25βαE0µ2 + cβ2α2L4 (221) where (221) follows as long as ζ2 0.25E0µ2 , and (222) follows since β c αE3 0κ 4. Lemma 29 (FO-ANIL, Finite samples, A4(t + 1)). Suppose A1(t), A3(t) and A5(t) hold. Then A4(t + 1) holds, i.e. B , Bt+1 2 (1 0.5βαE0µ2) B , Bt 2 + β αζ4 where ζ4 = O((Lmax + σ)(Lmax + σδmin,k) δmout,d) with probability at least 1 1 poly(n). Proof. 
Using (213), we have ˆB , Bt+1 = ˆB , Bt i=1 wt,iw t,i i=1 B , Id Σout t,i (Btwt,i B w ,t,i)w t,i | {z } =:E1 i=1 B , (Xout t,i ) zout t,i w t,i | {z } =:E2 MAML and ANIL Provably Learn Representations Next, we can use the concentration results in Lemma 20 to show that all of the following inequalities hold with probability at least 1 1 poly(n) max i [n] wt,i 2 t 2 wt 2 + c αLmax + δmin,k wt 2 + c ασδmin,k c αLmax + c ασδmin,k E1 2 c Lmax max i [n] wt,i 2 δmout,d (224) E2 2 cσ max i [n] wt,i 2 δmout,d (225) Thus we have ˆB , Bt+1 2 ˆB , Bt(Ik β i=1 wt,iw t,i) 2 + β αζ4 where ζ4 = O((Lmax +σ)(Lmax +σδmin,k) δmout,d) with probability at least 1 1 poly(n). Next, recall from Lemma 29 that i=1 wt,iw t,i L2 := 2 t 2 wt 2 + αL + δmin,k( wt 2 + αLmax + ασ) 2 i=1 wt,iw t,i µ2 := 0.9αE0µ2 2.2 α wt 2 t 2η 2 t 2 wt 2 δmin,k( wt 2 + αLmax + ασ) 2.2 α δmin,k( wt 2 + αL + ασ)Lmax with probability at least 1 1 poly(n). Apply inductive hypotheses A1(t) and A3(t) to obtain i=1 wt,iw t,i 4αL2 + 4α(Lmax + σ)2δ2 min,k 12αL2 by choice of min = Ω((k + log(n))(Lmax + σ)2) . This means that we have β σmax 1 n Pn i=1 wt,iw t,i 1 since we have chosen β = O(ακ 4 ). Also, we have i=1 wt,iw t,i 0.8αE0µ2 2.3αLmax(Lmax + σ) δmin,k 0.5αE0µ2 (226) where the last inequality follows since min = Ω k2 n (kκ4 + κ2 σ2µ 2 ) , recalling that Lmax c k L . Thus, using the above and Weyl s inequality with β σmax 1 n Pn i=1 wt,iw t,i 1, we obtain: ˆB , Bt+1 2 ˆB , Bt 2 (1 0.5βαE0α2) + β αζ4 (227) E.3. Exact ANIL Lemma 30 (Exact ANIL FS representation concentration I). For Exact ANIL, consider any t [T]. With probability at least 1 1 poly(n) 1 poly(min) ce 90k, ˆGB,t GB,t 2 = αζ2,a, (228) MAML and ANIL Provably Learn Representations Lmax(Lmax + σ)( Lmax(Lmax + σ)( Lmax(Lmax + σ)(k p d log(nmin) + k log(nmin) + d log1.5(nmin) + log2(nmin)) d log(nmin) + log1.5(nmin)) + Lmax(Lmax + σ) Lmax(Lmax + σ) Proof. Let qt,i := Btwt B w ,t,i. First recall that ˆGB,t = 1 n Pn i=1 B ˆFt,i(Bt, wt), where B ˆFt,i(Bt, wt) = ( in t,i) 1 mout (Xout t,i ) ˆvt,iw t α 1 mout (Xout t,i ) ˆvt,iq t,iΣin t,i Bt αΣin t,iqt,iˆv t,i 1 mout Xout t,i Bt + α2 minmout (Xout t,i ) ˆvt,i(zin t,i) Xin t,i Bt + α2 minmout (Xin t,i) zin t,iˆv t,i Xout t,i Bt where ˆvt,i = Xout t,i in t,iqt,i + α min Xout t,i Bt B t Xin t,izin t,i zout t,i . Also, GB,t = 1 n Pn i=1 BFt,i(Bt, wt), where BFt,i(Bt, wt) = tvt,iw t αvt,iq t,i Bt αqt,iv t,i Bt and vt,i = tqt,i. Thus, ˆGB,t GB,t 2 i=1 ( in t,i) 1 mout X t,iˆvt,iw t tvt,iw t | {z } =:E1 1 mout (Xout t,i ) ˆvt,iq t,iΣin t,i Bt vt,iq t,i Bt | {z } =:E2 i=1 Σin t,iqt,iˆv t,i 1 mout Xout t,i Bt qt,iv t,i Bt | {z } =:E3 2 + α 1 nminmout i=1 (Xout t,i ) ˆvt,i(zin t,i) Xin t,i Bt | {z } =:E4 + α 1 nminmout i=1 (Xin t,i) zin t,iˆv t,i Xout t,i Bt | {z } =:E5 We will further decompose each of the above terms into terms for which we can apply concentration results from Lemmas MAML and ANIL Provably Learn Representations 20 and 21. First we bound E1 2. 
We have i=1 ( in t,i) 1 mout X t,iˆvt,iw t tvt,iw t i=1 ( in t,i) Σout t,i in t,iqt,iw t t tqt,iw t α min ( in t,i) Σout t,i Bt B t (Xin t,i) zin t,iw t i=1 ( in t,i) 1 mout (Xout t,i ) zout t,i w t i=1 Σout t,i qt,iw t qt,iw t | {z } =:E1,1 i=1 αΣin t,i Bt B t Σout t,i qt,iw t αBt B t qt,iw t | {z } =:E1,2 i=1 αΣout t,i Bt B t Σin t,iqt,iw t αBt B t qt,iw t | {z } =:E1,3 i=1 α2Σin t,i Bt B t Σout t,i Bt B t Σin t,iqt,iw t α2Bt B t Bt B t qt,iw t | {z } =:E1,4 α min Σout t,i Bt B t (Xin t,i) zin t,iw t | {z } =:E1,5 min Σin t,i Bt B t Σout t,i Bt B t (Xin t,i) zin t,iw t | {z } =:E1,6 1 mout (Xout t,i ) zout t,i w t | {z } =:E1,7 i=1 Σin t,i Bt B t 1 mout (Xout t,i ) zout t,i w t | {z } =:E1,8 Note that after factoring out trailing wt s where necessary, each of the above matrices is in the form that is bounded in Lemma 20 or Lemma 21. We apply the bounds from those lemmas and use α Bt 2 2 = O(1), wt 2 = O( α min(1, η2 /µ2 )η ), and maxi [n] qt,i 2 = O(Lmax) to obtain that each of the following bounds hold with probability at least 1 1 poly(n) 1 poly(min) c e 90k, for some absolute constants c, c . E1,1 2 c α Lmax L E1,2 2 + E1,3 2 c α Lmax L κ2 ( δmout,d + δmin,d) E1,4 2 c α Lmax L d log(nmin)+ d log1.5(nmin)+log2(nmin) nmin + 1+C2k min + δmout,k E1,5 2 c α σL E1,6 2 c α σL d log(nmin)+ d log1.5(nmin)+log2(nmin) nmin + δmout,k E1,7 2 c α σL E1,8 2 c α σL κ2 δmout,k (230) MAML and ANIL Provably Learn Representations For E2 2, we have i=1 Σout t,i in t,iqt,iq t,iΣin t,i Bt tqt,iq t,i Bt i=1 Σout t,i α min Bt B t (Xin t,i) zin t,iq t,iΣin t,i Bt 1 mout (Xout t,i ) zout t,i q t,iΣin t,i Bt i=1 Σout t,i qt,iq t,iΣin t,i Bt qt,iq t,i Bt | {z } =:E2,1 i=1 Σout t,i Bt B t Σin t,iqt,iq t,iΣin t,i Bt Bt B t qt,iq t,i Bt | {z } =:E2,2 i=1 Σout t,i Bt B t 1 min (Xin t,i) zin t,iq t,iΣin t,i Bt | {z } =:E2,3 1 mout (Xout t,i ) zout t,i q t,iΣin t,i Bt | {z } =:E2,4 As before, we apply the bounds from Lemmas 20 and 21 and use α Bt 2 2 = O(1), wt 2 = O( αη /κ2 ), and maxi [n] qt,i 2 = O(Lmax) to obtain that each of the following bounds hold with probability at least 1 1 poly(n) 1 poly(min) c e 90k, for some absolute constants c, c . E2,1 2 c L2 max α ( δmout,d + δmin,k) E2,2 2 c L2 max α ( δmout,d + δmin,k) E2,3 2 c Lmaxσ α δmin,k E2,4 2 c Lmaxσ α δmout,d For E3 2, we have i=1 Σin t,iqt,iq t,i( in t,i) Σout t,i Bt qt,iq t,i t Bt i=1 Σin t,iqt,i α min (zin t,i) (Xin t,i)Bt B t Σout t,i Bt i=1 Σin t,iqt,i 1 mout (zout t,i ) Xout t,i Bt i=1 Σin t,iqt,iq t,iΣout t,i Bt qt,iq t,i Bt | {z } =:E3,1 i=1 Σin t,iqt,iq t,iΣin t,i Bt B t Σout t,i Bt qt,iq t,i t Bt | {z } =:E3,2 i=1 Σin t,iqt,i α min (zin t,i) (Xin t,i)Bt B t Σout t,i Bt | {z } =:E3,3 i=1 Σin t,iqt,i 1 mout (zout t,i ) Xout t,i Bt | {z } =:E3,4 MAML and ANIL Provably Learn Representations Each term is bounded as follows with probability at least 1 1 poly(n) 1 poly(min) c e 90k, for some absolute constants c, c . 
E3,1 2 c L2 max α ( δmin,d + δmout,k) E3,2 2 c L2 max α kd log(nmin)+ k log(nmin)+ d log1.5(nmin)+log2(nmin) nmin + E3,3 2 c Lmaxσ α d log(nmin)+k log(nmin)+ d log1.5(nmin)+log2(nmin) nmin + E3,4 2 c Lmaxσ α δmout,k For E4 2, we have E4 2 1 nmin i=1 Σout t,i in t,iqt,i(zin t,i) Xin t,i Bt i=1 Σout t,i Bt B t 1 m2 in (Xin t,i) zin t,i(zin t,i) Xin t,i Bt + 1 nminmout i=1 (Xout t,i ) zout t,i (zin t,i) Xin t,i Bt i=1 Σout t,i qt,i(zin t,i) Xin t,i Bt | {z } =:E4,1 2 + α 1 nmin i=1 Σout t,i Bt B t Σin t,iqt,i(zin t,i) Xin t,i Bt | {z } =:E4,2 i=1 Σout t,i Bt B t 1 m2 in (Xin t,i) zin t,i(zin t,i) Xin t,i Bt | {z } =:E4,3 + 1 nminmout i=1 (Xout t,i ) zout t,i (zin t,i) Xin t,i Bt | {z } =:E4,4 Each term is bounded as follows with probability at least 1 1 poly(n) 1 poly(min) c e 90k, for some absolute constants c, c . E4,1 2 c Lmaxσ α δmin,k E4,2 2 c Lmaxσ α δmin,k α δmout,dδmin,k For E5 2, we have E5 2 1 nmin i=1 (Xin t,i) zin t,iq t,iΣout t,i Bt | {z } =:E5,1 i=1 (Xin t,i) zin t,iq t,iΣin t,i Bt B t Σout t,i Bt | {z } =:E5,2 i=1 (Xin t,i) zin t,i(zin t,i) Xin t,i Bt B t Σout t,i Bt | {z } =:E5,3 2 + 1 nminmout i=1 (Xin t,i) zin t,i(zout t,i ) Xout t,i Bt | {z } =:E5,4 Each term is bounded as follows with probability at least 1 1 poly(n) 1 poly(min) c e 90k, for some absolute constants MAML and ANIL Provably Learn Representations E5,1 2 c Lmaxσ α δmin,d E5,2 2 c Lmaxσ α kd log(nmin)+ d log1.5(nmin)+log2(nmin) nmin + d log(nmin)+log1.5(nmin) nmin + α δmin,dδmout,k Applying a union bound over these events yields that ˆGB,t GB,t 2 Lmax(Lmax + σ)( Lmax(Lmax + σ)( Lmax(Lmax + σ)(k p d log(nmin) + k log(nmin) + d log1.5(nmin) + log2(nmin)) d log(nmin) + log1.5(nmin)) Lmax(Lmax + σ) with probability at least 1 1 poly(n) 1 poly(min) c e 90k for absolute constants c, c . Lemma 31 (Exact ANIL FS representation concentration II). For Exact ANIL, consider any t [T]. With probability at least 1 ce 100k 1 poly(n) for an absolute constant c: B t ˆGB,t B t GB,t 2 ζ2,b log(n) min Lmax(Lmax + σ) + σ2( log(n) min + k nmout ) + k nmout Lmax(Lmax + σ) . (231) Proof. We adapt the proof of Lemma 30. Multiplying ˆGB,t GB,t on the left by B t serves to reduce the dimensionality of ˆGB,t GB,t from Rd k to Rk k. This means that all of the d dependence in the previous concentration result for ˆGB,t GB,t 2 is reduced to k. Moreover, we no longer need to apply the complicated bounds on sums of fourth-order products (Lemma 21) to show concentration at a rate of d nmin , since we can afford to show concentration of each second order product at a rate log(n) min (see Lemma 20). Finally, we must divide the remaining bound from Lemma 30 by α since Bt 2 = Θ( 1 α). Making these changes yields the result. Lemma 32 (Exact ANIL FS head concentration). For Exact ANIL, consider any t [T]. With probability at least 1 ce 100k 1 poly(n) for an absolute constant c, we have ˆGw,t Gw,t 2 1 αζ1, (232) where ζ1 = O( (Lmax+σ)( log(n)) min + (Lmax+σ) MAML and ANIL Provably Learn Representations Proof. 
We have: ˆGw,t Gw,t 2 = 1 n i=1 B t ( in t,i) 1 mout (Xout t,i ) ˆvt,i 1 mout B t (Xout t,i ) zout t,i B t t tqt,i i=1 B t ( in t,i) Σout t,i in t,iqt,i t t B t qt,i 1 mout B t (Xout t,i ) zout t,i α min B t ( in t,i) Σout t,i Bt B t (Xin t,i) zin t,i i=1 B t ( in t,i) 1 mout (Xout t,i ) zout t,i i=1 t B t Σout t,i tqt,i t B t tqt,i | {z } =:E1 1 mout B t (Xout t,i ) zout t,i | {z } =:E2 i=1 B t Σin t,i Bt B t Σout t,i Bt B t Σin t,iqt,i α2 i=1 B t Bt B t Σout t,i Bt B t qt,i | {z } =:E3 α min B t tΣout t,i Bt B t (Xin t,i) zin t,i | {z } =:E4 α2 min B t Σin t,i Bt B t Σout t,i Bt B t (Xin t,i) zin t,i | {z } =:E5 α2 min B t Bt B t Σout t,i Bt B t (Xin t,i) zin t,i | {z } =:E6 i=1 B t 1 mout (Xout t,i ) zout t,i | {z } =:E7 i=1 B t (Σin t,i Id)Bt B t 1 mout (Xout t,i ) zout t,i | {z } =:E8 By Lemma 20 and the facts that Bt 2 = 1 α, t 2 = O(1), and maxi qt,i 2 = Lmax we have P( E1 2 t 2 Bt 2 max i tqt,i 2 δmout,k) 2e 90k P( E2 2 σ Bt 2 δmout,k) 2e 90k P( E3 2 α2 Bt 5 2 max i qt,i 2δmin,k 8n 99 P( E4 2 α Bt 3 2σ t 2 δmin,k) 4n 99 P( E5 2 α2 Bt 5 2σδmin,k 6n 99 P( E6 2 α2 Bt 5 2σ δmin,k) 4n 99 P( E7 2 Bt 2σ t 2 δmout,k) 2e 90k P( E8 2 α Bt 3 2σδmin,k δmout,k) 2e 90k + 2n 99 Combining these bounds with a union bound yields: ˆGw,t Gw,t 2 c α (Lmax+σ)( log(n)) min + c α (Lmax+σ) k nmout (233) MAML and ANIL Provably Learn Representations with probability at least 1 c e 90k 1 poly(n) for absolute constants c, c . Lemma 33 (Exact ANIL, Finite samples, A1(t + 1)). For Exact ANIL, suppose A2(s) and A5(s) hold for all s [t]. Then wt+1 2 1 10 αE0 min 1, µ2 η2 with probability at least 1 ce 100k 1 poly(n) for an absolute constant c. Proof. For any s [t], we have ws+1 2 = ws βGw,s + β(Gw,s ˆGw,s) 2 ws βGw,s 2 + β Gw,s ˆGw,s 2 ws 2 + c β α s 2 2η + β Gw,s ˆGw,s 2 (235) ws 2 + c β α s 2 2η + β αζ1 (236) where ζ1 is defined as in Lemma 32, (235) follows from equation (60) and (236) follows from Lemma 32. This will allow us to apply Lemma 3 with ξ1,s = 0 and ξ2,s = cβ α( s 2 2η + ζ1). Before doing so, let ζ2 be defined as in Lemma 34 and ζ4 := ζ2,a, corresponding to Lemma 35. Observe that for any s [t], we can recursively apply A2(s), A2(s 1), . . . to obtain s 2 (1 0.5βαE0µ2 ) s 1 2 + β2α2L4 dist2 s 1 +βαζ2 ... r=1 (1 0.5βαE0µ2 )s 1 r(cβ2α2L4 dist2 r +βαζ2) r=1 (1 0.5βαE0µ2 )s 1 r(cβ2α2L4 (2ρ2r + β2αζ2 4) + βαζ2) r=1 (1 0.5βαE0µ2 )s 1 rρ2r + c α3β4L4 r=1 (1 0.5βαE0µ2 )s 1 rζ2 4 r=1 (1 0.5βαE0µ2 )s 1 rζ2 c α2β2L4 (1 0.5βαE0µ2 )s 1 0.5βαµ2 + c α3β4L4 ζ2 4 0.5βαµ2 + βαζ2 0.5βαµ2 c αβκ2 L2 (1 0.5βαE0µ2 )s 1 + c α2β3κ2 L2 ζ2 4 + ζ2 Therefore, via Lemma 3, s=1 c β α s 2 2η + cβ αζ1 β3α1.5L8 E2 0µ4 (1 0.5βαE0µ2 )2s 2η + tβ7α3.5κ4 L4 η ζ4 4 + t β α η µ4 ζ2 2 + t β αζ1 c β2 αL8 E3 0µ6 η + c tβ7α3.5κ4 L4 η ζ4 4 + c t β α η µ4 ζ2 2 + c t β αζ1 (238) κ2 min(1, µ2 η2 )η + tβ7α3.5κ4 L4 η ζ4 4 + t αE0 L4 min(1, µ2 η2 )η ζ2 2 + t αE0 κ4 min(1, µ2 η2 )ζ1 (239) κ2 min(1, µ2 η2 )η + t αE0 L4 min(1, µ2 η2 )η ζ2 2,b + tβ2α2.5 E0 L4 min(1, µ2 η2 )η ζ4 2,a κ4 min(1, µ2 η2 )ζ1 (240) MAML and ANIL Provably Learn Representations where (238) follows by the sum of a geometric series and (239) follows by choice of β c αE2 0 κ4 for a sufficiently small constant c, (240) follows by using the definitions of ζ4 and ζ2, the numerical inequality (a+b)2 2a2 +2b2, and subsuming the dominated term. In order for the RHS (240) to be at most αE0 10 min(1, µ2 η2 )η , we require the following: T , ζ2,b L2 T , ζ2,a L βαT 0.25 (241) However, from Corollary 4 we require tighter bounds on ζ2,b and ζ2,a when T is small. 
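Before the tightened choices of the concentration terms are stated, the following scalar sketch (with entirely hypothetical constants standing in for alpha, beta, E_0, mu_*^2, L_* and the noise-floor terms zeta_2, zeta_4; none of the values come from the paper) illustrates the qualitative behavior the induction tracks: the A2- and A4-type recursions contract geometrically until they reach a floor set by the finite-sample terms, which is why the zeta's must be driven small by the choices of m_in and m_out below.

```python
import numpy as np

# Hypothetical stand-in constants; not values from the paper.
alpha, beta, E0, mu2, L4 = 0.1, 0.01, 1.0, 1.0, 1.0
zeta2, zeta4 = 1e-4, 1e-3
rho = 1 - 0.5 * beta * alpha * E0 * mu2   # contraction factor appearing in the A2/A4 recursions

Delta, dist = 0.0, 0.9                    # Delta_0 = 0 by the initialization; dist_0 < 1
for _ in range(20000):
    Delta = rho * Delta + beta**2 * alpha**2 * L4 * dist**2 + beta * alpha * zeta2   # A2-type recursion
    dist = rho * dist + beta * np.sqrt(alpha) * zeta4                                # A4-type recursion

print(f"dist after 20000 steps : {dist:.2e}  (floor ~ {beta * np.sqrt(alpha) * zeta4 / (1 - rho):.2e})")
print(f"Delta after 20000 steps: {Delta:.2e}")
```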
Accounting for these, it is sufficient to choose ζ2,b c E0µ2 ζ2,a c E0µ βαT 0.25 (242) We also require mout ck + c log(n) so that the concentration results hold. This implies that we need mout c T 2 k(Lmax+σ)2 nη2 κ8 + c T k L2 max(Lmax+σ)2 n E2 0µ4 + c T βα E0µ2 (L2 max(Lmax + σ)2(k + log(n)) + (Lmax + σ)4 d + ck + c log(n) For min, we need min c T 2 (Lmax+σ)2(k+log(n)) η2 κ8 + c T (Lmax+σ)4(k+log(n)) E2 0µ4 + c T 0.25 βαLmaxk E0κ T βαL2 max(Lmax+σ)2(k+log(n)) T βα(Lmax+σ)4k2d log(nmin) n E0µ2 (243) under the natural assumption that k = Ω(log(nmin)). Note that if min satisfies the above lower bound, this implies min >> k + log(n), as needed. Using our upper bounds on β and α, replacing Lmax with k L , and treating E0 as a constant gives the final results: mout c T 2 k2(L +σ)2 nη2 κ8 + c T k3κ2 (κ2 +σ2/µ2 ) n + c n + log(n))κ 2 ( σ2 L2 + k) + ck + c log(n) min c T 2(k2 + k log(n)) (L +σ)2 η2 κ8 + c T(k3 + k log(n))(κ4 + σ4 T k3d log(nmin) Lemma 34 (Exact ANIL Finite samples A2(t + 1)). Suppose the conditions of Theorem 8 are satisfied and A1(t), A3(t) and A5(t) hold. Then with probability at least 1 ce 90k 1 poly(n) 1 poly(min) for an absolute constant c, t+1 2 (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L4 dist2 t +βαζ2, (244) ζ2 := 2ζ2,b + βαζ2 2,a, and ζ2,a and ζ2,b are defined in Lemmas 30 and 31, respectively. Proof. As in Lemma 28, let Bt+1 = Bpop t+1 + β(GB,t ˆGB,t), and let pop t+1 = Ik α(Bpop t+1) Bpop t+1. Note that the bound from Lemma 10 applies to pop t+1 2 This results in t+1 2 = pop t+1 βαB t (GB,t ˆGB,t) βα(GB,t ˆGB,t) Bt + β2α(GB,t ˆGB,t) (GB,t ˆGB,t) 2 pop t+1 2 + 2βα B t (GB,t ˆGB,t) 2 + β2α GB,t ˆGB,t) 2 2 (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L4 dist2 t +2βα B t (GB,t ˆGB,t) 2 + β2α GB,t ˆGB,t 2 2 (245) (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L4 dist2 t +2βαζ2,b + β2α2ζ2 2,a (246) MAML and ANIL Provably Learn Representations where (245) follows from Lemma 10 and (246) ζ2,a and ζ2,b are defined in Lemmas 30 and 31, respectively. Define ζ2 := 2ζ2,b + βαζ2 2,a to complete the proof. Corollary 4 (Exact ANIL, Finite samples, A3(t + 1)). Suppose the conditions of Theorem 8 are satisfied and A2(t + 1) and A3(t) hold. Then t+1 2 1 10 (247) Proof. By A2(t + 1) and A3(t), we have t+1 2 (1 0.5βαE0µ2 ) t 2 + 5 4β2α2L2 dist2 t +2βαζ2,b + β2α2ζ2 2,a 1 10 0.05βαE0µ2 + 5 4β2α2L4 + 2βαζ2,b + β2α2ζ2 2,a 1 10 0.04βαE0µ2 + 2βαζ2,b + β2α2ζ2 2,a (248) where (248) follows by choice of β = c αE0 κ4 and (249) follows by ζ2,b c E0µ2 and ζ2 2,a c E0µ2 βα for a sufficiently small constant c. Lemma 35 (Exact-ANIL, Finite samples, A4(t + 1)). Suppose the conditions of Theorem 8 are satisfied and A1(t), A3(t) and A5(t) hold. Then A4(t + 1) holds with high probability , i.e. B , Bt+1 2 (1 0.5βαE0µ2 ) B , Bt 2 + β αζ4 (250) where ζ4 = ζ2,a where ζ2,a is defined in Lemma 30, with probability at least 1 ce 90k 1 poly(n) 1 poly(min) for an absolute constant c. Proof. We have ˆB , Bt+1 2 = ˆB , (Bt βGB,t) + β ˆB , (GB,t ˆGB,t) 2 ˆB , (Bt βGB,t) 2 + β GB,t ˆGB,t 2 (1 0.5βαE0µ2 ) ˆB , Bt 2 + β GB,t ˆGB,t 2 (251) (1 0.5βαE0µ2 ) ˆB , Bt 2 + β αζ2,a (252) where (251) follows by Lemma 12 (note that all the required conditions are satisfied) and (252) holds with probability at least 1 ce 90k 1 poly(n) 1 poly(min) for an absolute constant c according to Lemma 30, where ζ2,a is defined therein. F. Additional simulation and details In all experiments, we generated B by sampling a matrix in Rd k with i.i.d. standard normal elements, then orthogonalizing this matrix by computing its QR-factorization. 
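A minimal sketch of this generation step (d = 20 and k = 3 are the dimensions quoted below for Figures 1 and 4; the seed is arbitrary); the initializations described next reuse the same recipe:

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
d, k = 20, 3                                      # dimensions used for Figures 1 and 4
A = rng.standard_normal((d, k))                   # i.i.d. standard normal entries
B_star, _ = np.linalg.qr(A)                       # orthogonalize via QR to obtain the ground-truth B_*
assert np.allclose(B_star.T @ B_star, np.eye(k))  # columns are orthonormal
```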
The same procedure was used to generate $B_0$ in cases with random initialization, except that the result of the QR-factorization was scaled by $1/\sqrt{\alpha}$ such that $\Delta_0 = 0$, and for the case of methodical initialization (Figure 4 (right)), we initialized with an orthogonalized and scaled linear combination of Gaussian noise and $B_*$ such that $\mathrm{dist}_0 \in [0.65, 0.7]$ and $\Delta_0 = 0$. Meanwhile, we set $w_0 = 0$. We used step sizes $\beta = \alpha = 0.05$ in all cases for Figure 4, which were tuned optimally. Figure 1 uses the same setting of $d = 20$, $n = k = 3$, and Gaussian ground-truth heads as in Figure 4, except that the mean of the ground-truth heads is shifted to zero. We are therefore able to use the larger step sizes of $\alpha = \beta = 0.1$ and observe faster convergence in this case, as task diversity is larger since the ground-truth heads are isotropic, and $L_*$ and $L_{\max}$ are smaller. Additionally, in Figure 1, Avg. Risk Min. is the algorithm that tries to minimize $\mathbb{E}_{w_{*,t,i}}[L_{t,i}(B, w)]$ via standard mini-batch SGD. It is equivalent to ANIL and MAML with no inner loop ($\alpha = 0$). In Figure 3, we use $d = 100$, $k = n = 5$, and $\alpha = \beta = 0.1$. All results are averaged over 5 random trials.
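For completeness, here is a hedged sketch of the random initialization just described together with a subspace-distance metric of the kind plotted in Figure 1; the particular formula, the spectral norm of (I - B_* B_*^T) applied to an orthonormal basis of the learned column space, and the helper name `subspace_dist` are illustrative assumptions rather than quotations from the paper.

```python
import numpy as np

def subspace_dist(B, B_star):
    """Principal-angle distance ||(I - B_* B_*^T) B_hat||_2, where B_hat is an
    orthonormal basis of col(B); assumed here as one standard choice of metric."""
    B_hat, _ = np.linalg.qr(B)
    return np.linalg.norm(B_hat - B_star @ (B_star.T @ B_hat), 2)

rng = np.random.default_rng(0)
d, k, alpha = 20, 3, 0.1
B_star, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Random initialization: QR-orthonormalize a Gaussian matrix, then scale by 1/sqrt(alpha)
# so that Delta_0 = I_k - alpha * B_0^T B_0 = 0; the head starts at zero.
B0, _ = np.linalg.qr(rng.standard_normal((d, k)))
B0 /= np.sqrt(alpha)
w0 = np.zeros(k)

print("Delta_0 = 0:", np.allclose(np.eye(k) - alpha * B0.T @ B0, 0))
print("dist(B_0, B_*) =", subspace_dist(B0, B_star))
```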