Published as a conference paper at ICLR 2021

META-LEARNING WITH NEGATIVE LEARNING RATES

Alberto Bernacchia
MediaTek Research
alberto.bernacchia@mtkresearch.com

ABSTRACT

Deep learning models require a large amount of data to perform well. When data is scarce for a target task, we can transfer the knowledge gained by training on similar tasks to quickly learn the target. A successful approach is meta-learning, or learning to learn a distribution of tasks, where learning is represented by an outer loop, and to learn by an inner loop of gradient descent. However, a number of recent empirical studies argue that the inner loop is unnecessary and simpler models work equally well or even better. We study the performance of MAML as a function of the learning rate of the inner loop, where zero learning rate implies that there is no inner loop. Using random matrix theory and exact solutions of linear models, we calculate an algebraic expression for the test loss of MAML applied to mixed linear regression and nonlinear regression with overparameterized models. Surprisingly, while the optimal learning rate for adaptation is positive, we find that the optimal learning rate for training is always negative, a setting that has never been considered before. Therefore, not only does the performance increase by decreasing the learning rate to zero, as suggested by recent work, but it can be increased even further by decreasing the learning rate to negative values. These results help clarify under what circumstances meta-learning performs best.

1 INTRODUCTION

Deep learning models represent the state of the art in several machine learning benchmarks (LeCun et al. (2015)), and their performance does not seem to stop improving when adding more data and computing resources (Rosenfeld et al. (2020), Kaplan et al. (2020)). However, they require a large amount of data and compute to start with, which are often not available to practitioners. The approach of fine-tuning has proved very effective to address this limitation: pre-train a model on a source task, for which a large dataset is available, and use this model as the starting point for a quick additional training (fine-tuning) on the small dataset of the target task (Pan & Yang (2010), Donahue et al. (2014), Yosinski et al. (2014)). This approach is popular because pre-trained models are often made available by institutions that have the resources to train them.

In some circumstances, multiple source tasks are available, all of which have scarce data, as opposed to a single source task with abundant data. This case is addressed by meta-learning, in which a model gains experience over multiple source tasks and uses it to improve its learning of future target tasks. The idea of meta-learning is inspired by the ability of humans to generalize across tasks, without having to train on any single task for a long time. A meta-learning problem is solved by a bi-level optimization procedure: an outer loop optimizes meta-parameters across tasks, while an inner loop optimizes parameters within each task (Hospedales et al. (2020)).

The idea of meta-learning has gained some popularity, but a few recent papers argue that a simple alternative to meta-learning, in which the inner loop is removed entirely, is just good enough (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). Other studies find the opposite (Goldblum et al. (2020), Collins et al. (2020), Gao & Sener (2020)). It is hard to resolve the debate because there is little theory available to explain these findings.
In this work, using random matrix theory and exact solutions of linear models, we derive an algebraic expression for the average test loss of MAML, a simple and successful meta-learning algorithm (Finn et al. (2017)), as a function of its hyperparameters. In particular, we study its performance as a function of the inner loop learning rate during meta-training. Setting this learning rate to zero is equivalent to removing the inner loop, as advocated by recent work (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)). Surprisingly, we find that the optimal learning rate is negative, thus performance can be increased by reducing the learning rate below zero. In particular, we find the following:

- In the problem of mixed linear regression, we prove that the optimal learning rate is always negative in overparameterized models. The same result holds in underparameterized models provided that the optimal learning rate is small in absolute value. We validate the theory by running extensive experiments.
- We extend these results to the case of nonlinear regression and wide neural networks, in which the output can be approximated by a linear function of the parameters (Jacot et al. (2018), Lee et al. (2019)). While in this case we cannot prove that the optimal learning rate is always negative, preliminary experiments suggest that the result holds in this case as well.

2 RELATED WORK

The field of meta-learning includes a broad range of problems and solutions; see Hospedales et al. (2020) for a recent review focusing on neural networks and deep learning. In this context, meta-learning has received increased attention in the past few years: several new benchmarks have been introduced, and a large number of algorithms and models have been proposed to solve them (Vinyals et al. (2017), Bertinetto et al. (2019), Triantafillou et al. (2020)). Despite the surge in empirical work, theoretical work is still lagging behind. Similar to our work, a few other studies used random matrix theory and exact solutions to calculate the average test loss for the problem of linear regression (Advani & Saxe (2017), Hastie et al. (2019), Nakkiran (2019)). To our knowledge, our study is the first to apply this technique to the problem of meta-learning with multiple tasks. Our results reduce to those of linear regression in the case of one single task. Furthermore, we are among the first to apply the framework of the Neural Tangent Kernel (Jacot et al. (2018), Lee et al. (2019)) to the problem of meta-learning (a few papers appeared after our submission: Yang & Hu (2020), Wang et al. (2020a), Zhou et al. (2021)).

Similar to us, a few theoretical studies looked at the problem of mixed linear regression in the context of meta-learning. In Denevi et al. (2018), Bai et al. (2021), a meta-parameter is used to bias the task-specific parameters through a regularization term. Kong et al. (2020) looks at whether many tasks with small data can compensate for a lack of tasks with big data. Tripuraneni et al. (2020), Du et al. (2020) study the sample complexity of representation learning. However, none of these studies look into the effect of the learning rate on performance, which is our main focus. In this work, we focus on MAML, a simple and successful meta-learning algorithm (Finn et al. (2017)).
A few theoretical studies have investigated MAML, looking at: universality of the optimization algorithm (Finn & Levine (2018)), Bayesian inference interpretation (Grant et al. (2018)), proof of convergence (Ji et al. (2020)), difference between convex and non-convex losses (Saunshi et al. (2020)), global optimality (Wang et al. (2020b)), and the effect of the inner loop (Collins et al. (2020), Gao & Sener (2020)). Again, none of these studies look at the effect of the learning rate, the main subject of our work. The theoretical work of Khodak et al. (2019) connects the learning rate to task similarity, while the work of Li et al. (2017) meta-learns the learning rate.

3 META-LEARNING AND MAML

In this work, we follow the notation of Hospedales et al. (2020) and we use MAML (Finn et al. (2017)) as the meta-learning algorithm. We assume the existence of a distribution of tasks τ and, for each task, a loss function Lτ and a distribution of data points Dτ = {xτ, yτ} with input xτ and label yτ. We assume that the loss function is the same for all tasks, Lτ = L, but each task is characterized by a different distribution of the data. The empirical meta-learning loss is evaluated on a sample of m tasks, and a sample of nv validation data points for each task:

$$L^{meta}\left(\omega; D_t, D_v\right) = \frac{1}{m n_v} \sum_{i=1}^{m} \sum_{j=1}^{n_v} L\left(\theta\left(\omega; D_t^{(i)}\right); x_j^{v(i)}, y_j^{v(i)}\right) \tag{1}$$

The training set $D_t^{(i)} = \left\{x_j^{t(i)}, y_j^{t(i)}\right\}_{j=1:n_t}$ and validation set $D_v^{(i)} = \left\{x_j^{v(i)}, y_j^{v(i)}\right\}_{j=1:n_v}$ are drawn independently from the same distribution in each task i. The function θ represents the adaptation of the meta-parameter ω, which is evaluated on the training set. Different meta-learning algorithms correspond to a different choice of θ; we describe below the choice of MAML (Eq.3), the subject of this study.

During meta-training, the loss of Eq.1 is optimized with respect to the meta-parameter ω, usually by stochastic gradient descent, starting from an initial point ω0. The optimum is denoted as ω*(Dt, Dv). This optimization is referred to as the outer loop, while the computation of θ is referred to as the inner loop of meta-learning. During meta-testing, a new (target) task is given and θ adapts on a set Dr of nr target data points. The final performance of the model is computed on test data Ds of the target task. Therefore, the test loss is equal to

$$L^{test} = L^{meta}\left(\omega^*(D_t, D_v); D_r, D_s\right) \tag{2}$$

In MAML, the inner loop corresponds to a few steps of gradient descent, with a given learning rate αt. In this work we consider the simple case of a single gradient step:

$$\theta\left(\omega; D_t^{(i)}\right) = \omega - \alpha_t \nabla_\omega \left[\frac{1}{n_t}\sum_{j=1}^{n_t} L\left(\omega; x_j^{t(i)}, y_j^{t(i)}\right)\right] \tag{3}$$

If the learning rate αt is zero, then parameters are not adapted during meta-training and θ(ω) = ω. In that case, a single set of parameters is learned across all data and there is no inner loop. However, it is important to note that a distinct learning rate αr is used during meta-testing. A setting similar to this has been advocated in a few recent studies (Chen et al. (2020a), Tian et al. (2020), Dhillon et al. (2020), Chen et al. (2020b), Raghu et al. (2020)).

Figure 1: Graphical model of data generation in mixed linear regression.

We show that, intuitively, the optimal learning rate at meta-testing (adaptation) time αr is always positive. Surprisingly, in the family of problems considered in this study, we find that the optimal learning rate during meta-training αt is instead negative.
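To make the setup concrete, the following is a minimal sketch (our own illustration, not the paper's code) of the one-step inner loop of Eq.3 and the meta-loss of Eq.1, for the squared loss and a linear model; all function and variable names here are ours:

```python
import numpy as np

def inner_step(omega, X_t, y_t, alpha_t):
    # One-step MAML adaptation (Eq. 3) for a linear model with squared loss:
    # the gradient of L = |y - X w|^2 / (2 n_t) w.r.t. w is X^T (X w - y) / n_t.
    n_t = X_t.shape[0]
    grad = X_t.T @ (X_t @ omega - y_t) / n_t
    return omega - alpha_t * grad

def meta_loss(omega, tasks, alpha_t):
    # Empirical meta-loss (Eq. 1): adapt on each task's training set,
    # then score the adapted parameters on that task's validation set.
    losses = []
    for X_t, y_t, X_v, y_v in tasks:
        theta = inner_step(omega, X_t, y_t, alpha_t)
        losses.append(np.mean((y_v - X_v @ theta) ** 2) / 2)
    return float(np.mean(losses))
```

During meta-training the outer loop would minimize this meta-loss over ω; in the linear setting studied below, that minimization is carried out exactly (see section 7.3).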
We note that the setting αt = 0 effectively does not use the nt training data points; therefore we could in principle add this data to the validation set. We do not consider this option here, since we are interested in a wide range of possible values of αt, as opposed to the specific case αt = 0.

4 MIXED LINEAR REGRESSION

We study MAML applied to the problem of mixed linear regression. Note that the goal here is not to solve the problem of mixed linear regression, but to probe the performance of MAML as a function of its hyperparameters. In mixed linear regression, each task is characterized by a different linear function, and a model is evaluated by the mean squared error loss function. We assume a generative model in the form of $y = x^T w + z$, where x is the input vector (of dimension p), y is the output (scalar), z is noise (scalar), and w is a vector of generating parameters (of dimension p); therefore p represents both the number of parameters and the input dimension. All distributions are assumed Gaussian:

$$x \sim \mathcal{N}\left(0, I_p\right) \qquad y\,|\,x, w \sim \mathcal{N}\left(x^T w, \sigma^2\right) \qquad w \sim \mathcal{N}\left(w_0, \frac{\nu^2}{p}\, I_p\right) \tag{4}$$

where Ip is the p × p identity matrix, σ is the label noise, w0 is the task mean and ν represents the task variability. Different meta-training tasks i correspond to different draws of the generating parameters w(i), while the parameters for the meta-testing task are denoted by w*. We denote by superscripts t, v, r, s the training, validation, target and test data, respectively. A graphical model of data generation is shown in Figure 1.

Using random matrix theory and exact solutions of linear models, we calculate the test loss as a function of the following hyperparameters: the number of training tasks m, the number of data points per task for training (nt), validation (nv) and target (nr), and the learning rate for training αt and for adaptation to the target αr. Furthermore, we have the hyperparameters specific to the mixed linear regression problem: p, ν, σ, w0. Since we use exact solutions to the linear problem, our approach is equivalent to running the outer loop optimization until convergence (see section 7.1 in the Appendix for details).

5 RESULTS

We derive results in two cases: overparameterized (p > nvm) and underparameterized (p < nvm).

5.1 OVERPARAMETERIZED CASE

In the overparameterized case, the number of parameters p is larger than the total number of validation data across tasks nvm. In this case, since the data does not fully constrain the parameters, the optimal value of ω found during meta-training depends on the initial condition used for optimization, which we call ω0.

Theorem 1. Consider the algorithm of section 3 (MAML one-step), and the data generating model of section 4 (mixed linear regression). Let p > nvm. Let p(ξ) and nt(ξ) be any functions of order O(ξ) as ξ → ∞. Let |ω0 − w0| be of order O(ξ^{-1/4}). Then the test loss of Eq.2, averaged over the entire data distribution (see Eq.27 in the Appendix), is equal to

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1+\frac{\alpha_r^2 p}{n_r}\right) + \frac{h_r}{2}\left[\nu^2\left(1+\frac{n_v m}{p}\right) + \left(1-\frac{n_v m}{p}\right)\left|\omega_0-w_0\right|^2 + \frac{n_v m}{p}\,\frac{\sigma^2\left(1+\frac{\alpha_t^2 p}{n_t}\right)}{h_t}\right] + O\left(\xi^{-3/2}\right) \tag{5}$$

where we define the following expressions

$$h_t = \left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t} \tag{6}$$

$$h_r = \left(1-\alpha_r\right)^2 + \alpha_r^2\,\frac{p+1}{n_r} \tag{7}$$

Proof. The proof of this Theorem can be found in the Appendix, sections 7.3, 7.3.1.
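Since Eq.5 is explicit in the hyperparameters, it can be evaluated directly. The sketch below (our own code, based on the reconstruction of Eq.5 above) scans the loss over a grid of αt at the parameter values of Figure 2:

```python
import numpy as np

def test_loss_overparam(a_t, a_r, p, n_t, n_v, n_r, m, sigma, nu, dist0):
    # Average test loss of Eq. 5 (valid for p > n_v * m); dist0 = |omega_0 - w_0|^2.
    h_t = (1 - a_t) ** 2 + a_t ** 2 * (p + 1) / n_t   # Eq. 6
    h_r = (1 - a_r) ** 2 + a_r ** 2 * (p + 1) / n_r   # Eq. 7
    q = n_v * m / p
    return (sigma ** 2 / 2 * (1 + a_r ** 2 * p / n_r)
            + h_r / 2 * (nu ** 2 * (1 + q) + (1 - q) * dist0
                         + q * sigma ** 2 * (1 + a_t ** 2 * p / n_t) / h_t))

# Parameters of Figure 2b; the grid minimum falls at a negative alpha_t.
grid = np.linspace(-2, 2, 401)
loss = [test_loss_overparam(a, 0.2, 60, 30, 2, 20, 3, 1.0, 0.5, 0.0) for a in grid]
print("optimal alpha_t ~", grid[int(np.argmin(loss))])
```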
The loss always increases with the output noise σ and the task variability ν. Overfitting is expressed in Eq.5 by the term |ω0 − w0|, the distance between the initial condition ω0 of the optimization and the ground truth mean of the generating model w0. Adding more validation data nv and tasks m may increase or decrease the loss depending on the size of this term relative to the noise (Nakkiran (2019)), as does reducing the number of parameters p. However, the loss always decreases with the number of data points for the target task nr, as that data only affects the adaptation step.

Our main focus is studying how the loss is affected by the learning rates, during training (αt) and adaptation (αr). The loss is a quadratic and convex function of αr, therefore it has a unique minimum. While it is possible to compute the optimal value of αr from Eq.5, here we just note that the loss is a sum of two quadratic functions, one with a minimum at αr = 0 and another with a minimum at αr = 1/(1 + (p+1)/nr); therefore the optimal learning rate is in between the two values and is always positive. This is intuitive, since a positive learning rate for adaptation implies that the parameters get closer to the optimum for the target task. An example of the loss as a function of the adaptation learning rate αr is shown in Figure 2a, where we also show the results of experiments in which we run MAML empirically. The good agreement between theory and experiment suggests that Eq.5 is accurate.

However, the training learning rate αt shows the opposite: by taking the derivative of Eq.5 with respect to αt, it is possible to show that it has a unique absolute minimum at a negative value of αt. This can be proved by noting that this function has the same finite value for large positive or negative αt, its derivative is always positive at αt = 0, and it has one minimum (−) and one maximum (+) at the values

$$\alpha_t^{\pm} = \frac{-\left(n_t+1\right) \pm \sqrt{\left(n_t+1\right)^2 + 4\, n_t\, p}}{2p} \tag{8}$$

Note that the argmax αt+ is always positive, while the argmin αt− is always negative. This result is counter-intuitive, since a negative learning rate pushes parameters towards higher values of the loss. However, learning of the meta-parameter ω is performed by the outer loop (minimizing Eq.1), for which there is no learning rate since we are using the exact solution to the linear problem, and thus we are effectively training to convergence. Therefore, it is not obvious a priori whether the inner loop (Eq.3) should push parameters towards higher or lower values of the loss. An example of the loss as a function of the training learning rate αt is shown in Figure 2b, where we also show the results of experiments in which we run MAML empirically. Here the theory slightly underestimates the experimental loss, but the overall shapes of the curves are in good agreement, suggesting that Eq.5 is accurate. Additional experiments are shown in the Appendix, Figure 6.

Figure 2: Average test loss of MAML as a function of the learning rate, on overparameterized mixed linear regression, as predicted by our theory and confirmed in experiments. a) Effect of the learning rate αr during adaptation. b) Effect of the learning rate αt during training. The optimal learning rate during adaptation is positive, while that during training is negative. Values of parameters: nt = 30, nv = 2, nr = 20, m = 3, p = 60, σ = 1, ν = 0.5, ω0 = 0, w0 = 0. In panel a) we set αt = 0.2, in panel b) we set αr = 0.2. In the experiments, each run is evaluated on 100 test tasks of 50 data points each, and each point is an average over 100 runs (a) or 1000 runs (b).
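As a quick numerical check of the critical points in Eq.8 (as reconstructed above), the following sketch (our own) computes both roots and confirms their signs:

```python
import numpy as np

def critical_alphas(n_t, p):
    # Critical points of the alpha_t-dependent factor of Eq. 5, i.e. roots of
    # p * a^2 + (n_t + 1) * a - n_t = 0 (Eq. 8).
    disc = np.sqrt((n_t + 1) ** 2 + 4 * n_t * p)
    return (-(n_t + 1) - disc) / (2 * p), (-(n_t + 1) + disc) / (2 * p)

a_min, a_max = critical_alphas(n_t=30, p=60)  # argmin < 0 < argmax
print(a_min, a_max)
```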
5.2 UNDERPARAMETERIZED CASE

In the underparameterized case, the number of parameters p is smaller than the total number of validation data across tasks nvm. In this case, since the data fully constrains the parameters, the optimal value of ω found during meta-training is unique. We prove the following result.

Theorem 2. Consider the algorithm of section 3 (MAML one-step), and the data generating model of section 4 (mixed linear regression). Let p < nvm. Let nv(ξ) and nt(ξ) be any functions of order O(ξ). For ξ, m → ∞, the test loss of Eq.2, averaged over the entire data distribution (see Eq.27 in the Appendix), is equal to

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1+\frac{\alpha_r^2 p}{n_r}\right) + \frac{h_r}{2}\left\{\nu^2 + \frac{p}{h_t^2\, n_v m}\left[\sigma^2\left(h_t + \frac{\alpha_t^2}{n_t}\left(\left(n_v+1\right)g_1 + p\, g_2\right)\right) + \frac{\nu^2}{p}\left(\left(n_v+1\right)g_3 + p\, g_4\right)\right]\right\} + O\left(\left(m\xi\right)^{-3/2}\right) \tag{9}$$

where hr, ht are defined as in the previous section (Eqs.6, 7), and the gi are order O(1) polynomials in αt, see Eqs.98-101 in the Appendix.

Proof. The proof of this Theorem can be found in the Appendix, sections 7.3, 7.3.2.

Again, the loss always increases with the output noise σ and the task variability ν. Furthermore, in this case the loss always decreases with the number of data points nv, nr, and tasks m. Note that, for a very large number of tasks m, the loss does not depend on the meta-training hyperparameters αt, nv, nt: when the number of tasks is infinite, it does not matter whether we run the inner loop, nor how much data we have for each task.

As in the overparameterized case, the loss is a quadratic and convex function of the adaptation learning rate αr, and there is a unique minimum. While the value of the argmin is different, in this case as well the loss is a sum of two quadratic functions, one with a minimum at αr = 0 and another with a minimum at αr = 1/(1 + (p+1)/nr); therefore the optimal learning rate is again in between the same two values and is always positive. Similar comments apply to this case: a positive learning rate for adaptation implies that the parameters get closer to the optimum for the target task. An example of the loss as a function of the adaptation learning rate αr is shown in Figure 3a, where we also show the results of experiments in which we run MAML empirically. The good agreement between theory and experiment suggests that Eq.9 is accurate.

As a function of the training learning rate αt, the loss in Eq.9 is the ratio of two fourth order polynomials, therefore it is not straightforward to determine its behaviour. However, it is possible to show that the following holds

$$\frac{\partial \bar{L}^{test}}{\partial \alpha_t}\bigg|_{\alpha_t=0} > 0 \tag{10}$$

suggesting that performance is always better for negative values of αt around zero. Even if counter-intuitive, this finding aligns with that of the previous section, and similar comments apply. An example of the loss as a function of the training learning rate αt is shown in Figure 3b, where we also show the results of experiments in which we run MAML empirically. A good agreement is observed between theory and experiment, again suggesting that Eq.9 is accurate. Additional experiments are shown in the Appendix, Figure 6.

Figure 3: Average test loss as a function of the learning rate, on underparameterized mixed linear regression, as predicted by our theory and confirmed in experiments. a) Effect of the learning rate αr during testing. b) Effect of the learning rate αt during training. The optimal learning rate during testing is always positive, while that during training is negative. Values of parameters: nt = 5, nv = 25, nr = 10, m = 40, p = 30, σ = 0.2, ν = 0.2. In panel a) we set αt = 0.2, in panel b) we set αr = 0.2. In the experiments, the model is evaluated on 100 tasks of 50 data points each, and each point is an average over 100 (a) or 1000 (b) runs.
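Eq.9 can likewise be evaluated directly; the following is a sketch (our own implementation of Eqs.9 and 98-107, as reconstructed in the Appendix), scanned over αt at the parameter values of Figure 3b:

```python
import numpy as np

def test_loss_underparam(a_t, a_r, p, n_t, n_v, n_r, m, sigma, nu):
    # Average test loss of Eq. 9 (valid for p < n_v * m).
    mu2 = (n_t + p + 1) / n_t                                        # Eq. 102
    mu3 = (n_t**2 + p**2 + 3*n_t*p + 3*n_t + 3*p + 4) / n_t**2       # Eq. 103
    mu4 = (n_t**3 + p**3 + 6*n_t**2*p + 6*n_t*p**2 + 6*n_t**2        # Eq. 104
           + 6*p**2 + 17*n_t*p + 21*n_t + 21*p + 20) / n_t**3
    mu11 = (n_t**2*p + 2*n_t) / (n_t**2 * p)                         # Eq. 105
    mu21 = (n_t**2*p + n_t*p**2 + n_t*p + 4*n_t + 4*p + 4) / (n_t**2 * p)
    mu22 = (n_t**3*p + n_t*p**3 + 2*n_t**2*p**2 + 2*n_t**2*p + 2*n_t*p**2
            + 8*n_t**2 + 8*p**2 + 21*n_t*p + 20*n_t + 20*p + 20) / (n_t**3 * p)
    g1 = 1 - 2*a_t*mu2 + a_t**2*mu3                                  # Eq. 98
    g2 = 1 - 2*a_t*mu11 + a_t**2*mu21                                # Eq. 99
    g3 = 1 - 4*a_t + 6*a_t**2*mu2 - 4*a_t**3*mu3 + a_t**4*mu4        # Eq. 100
    g4 = (1 - 4*a_t + 2*a_t**2*mu2 + 4*a_t**2*mu11                   # Eq. 101
          - 4*a_t**3*mu21 + a_t**4*mu22)
    h_t = (1 - a_t)**2 + a_t**2 * (p + 1) / n_t
    h_r = (1 - a_r)**2 + a_r**2 * (p + 1) / n_r
    var = (p / (h_t**2 * n_v * m)) * (
        sigma**2 * (h_t + a_t**2 / n_t * ((n_v + 1)*g1 + p*g2))
        + nu**2 / p * ((n_v + 1)*g3 + p*g4))
    return sigma**2/2 * (1 + a_r**2 * p / n_r) + h_r/2 * (nu**2 + var)

grid = np.linspace(-1, 1, 201)
loss = [test_loss_underparam(a, 0.2, 30, 5, 25, 10, 40, 0.2, 0.2) for a in grid]
print("optimal alpha_t ~", grid[int(np.argmin(loss))])
```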
5.3 NON-GAUSSIAN THEORY IN OVERPARAMETERIZED MODELS

In the previous sections we studied the performance of MAML applied to the problem of mixed linear regression. It remains unclear whether the results in the linear case are relevant for the more interesting case of nonlinear problems. Inspired by recent theoretical work, we consider the case of nonlinear regression with squared loss

$$L\left(\omega\right) = \mathbb{E}_x\, \mathbb{E}_{y|x}\; \frac{1}{2}\left[y - f\left(x, \omega\right)\right]^2 \tag{11}$$

where y is a target output and f(x, ω) is the output of a neural network with input x and parameters ω. The introduction of the Neural Tangent Kernel showed that, in the limit of infinitely wide neural networks, the output is a linear function of its parameters during the entire course of training (Jacot et al. (2018), Lee et al. (2019)). This is expressed by a first order Taylor expansion

$$f\left(x, \omega\right) \simeq f\left(x, \omega_0\right) + k\left(x, \omega_0\right)^T\left(\omega - \omega_0\right) \tag{12}$$

$$k\left(x, \omega_0\right) = \nabla_\omega f\left(x, \omega\right)\big|_{x, \omega_0} \tag{13}$$

The parameters ω remain close to the initial condition ω0 during the entire course of training, a phenomenon referred to as lazy training (Chizat et al. (2020)), and therefore the output can be linearized around ω0. Intuitively, in a model that is heavily overparameterized, the data does not constrain the parameters, and a parameter that minimizes the loss in Eq.11 can be found in the vicinity of any initial condition ω0. Note that, while the output of the neural network is linear in the parameters, it remains a nonlinear function of its input, through the vector of nonlinear functions k in Eq.13.

By substituting Eq.12 into Eq.11, the nonlinear regression becomes effectively linear, in the sense that the loss is a quadratic function of the parameters ω, and all nonlinearities are contained in the functions k of Eq.13, which are fixed by the initial condition ω0. This suggests that we can carry over the theory developed in the previous section to this problem. However, in this case the input to the linear regression problem is effectively k(x), and some of the assumptions made in the previous section are not acceptable. In particular, even if we assume that x is Gaussian, k(x) is a nonlinear function of x and cannot be assumed Gaussian. We prove the following result, which generalizes the result of section 5.1 to non-Gaussian inputs and weights.

Theorem 3. Consider the algorithm of section 3 (MAML one-step), with ω0 = 0, and the data generating model of section 4, where the input x and the weights w are not necessarily Gaussian, and have zero mean and covariances, respectively, $\Sigma = \mathbb{E}\, xx^T$ and $\Sigma_w = \mathbb{E}\, ww^T$. Let F be the matrix of fourth order moments $F = \mathbb{E}\left[\left(x^T \Sigma x\right) xx^T\right]$. Let p > nvm. Let p(ξ) and nt(ξ) be any functions of order O(ξ) as ξ → ∞. Let $\mathrm{Tr}\,\Sigma_w^2$ be of order O(ξ^{-1}), and let the variances of matrix products of the rescaled inputs $x/\sqrt{p}$, up to sixth order, be of order O(ξ^{-1}) (see Eqs.134-136 in the Appendix). Then the test loss of Eq.2, averaged over the entire data distribution (see Eq.27 in the Appendix), is equal to

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1+\frac{\alpha_r^2}{n_r}\,\mathrm{Tr}\,\Sigma^2\right) + \frac{1}{2}\,\mathrm{Tr}\left(\Sigma_w H_r\right) + \frac{n_v m}{2}\;\frac{\mathrm{Tr}\left(H_r H_t\right)\left\{\mathrm{Tr}\left(\Sigma_w H_t\right) + \sigma^2\left[1+\frac{\alpha_t^2}{n_t}\,\mathrm{Tr}\,\Sigma^2\right]\right\}}{\mathrm{Tr}\left(H_t\right)^2} + O\left(\xi^{-3/2}\right) \tag{14}$$

where we define the following matrices

$$H_t = \Sigma\left(I - \alpha_t \Sigma\right)^2 + \frac{\alpha_t^2}{n_t}\left(F - \Sigma^3\right) \tag{15}$$

$$H_r = \Sigma\left(I - \alpha_r \Sigma\right)^2 + \frac{\alpha_r^2}{n_r}\left(F - \Sigma^3\right) \tag{16}$$

Proof. The proof of this Theorem can be found in the Appendix, section 7.4.

Note that this result reduces to Eqs.5, 6, 7 when Σ = I, Σw = I ν²/p, F = I(p+2), ω0 = 0, w0 = 0.
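As a sanity check on Eqs.15-16 (as reconstructed above), the following sketch (our own) verifies the reduction just noted: with Σ = I and F = (p+2)I, the matrix Ht collapses to ht I with ht from Eq.6:

```python
import numpy as np

p, n_t, a_t = 8, 5, -0.3
Sigma = np.eye(p)
F = (p + 2) * np.eye(p)  # fourth-moment matrix of a standard Gaussian input
H_t = (Sigma @ (np.eye(p) - a_t * Sigma) @ (np.eye(p) - a_t * Sigma)
       + a_t ** 2 / n_t * (F - np.linalg.matrix_power(Sigma, 3)))   # Eq. 15
h_t = (1 - a_t) ** 2 + a_t ** 2 * (p + 1) / n_t                     # Eq. 6
assert np.allclose(H_t, h_t * np.eye(p))
```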
This expression for the loss is more difficult to analyze than those given in the previous sections, because it involves traces of nonlinear functions of matrices, all elements of which are free hyperparameters. Nevertheless, it is possible to show that, as a function of the adaptation learning rate αr, the loss in Eq.14 is still a quadratic function. As a function of the training learning rate αt, the loss in Eq.14 is the ratio of two fourth order polynomials, but it is difficult to draw any conclusions since their coefficients do not appear to have simple relationships.

Even if the influence of the hyperparameters is not easy to predict, the expression in Eq.14 can still be used to quickly probe the behavior of the loss empirically, by plugging in example values for Σ, Σw, F, since computing the expression is very fast. Here we choose the values of Σ, Σw by a single random draw from a Wishart distribution

$$\Sigma \sim \frac{1}{p}\, W\left(I_p, p\right) \qquad \Sigma_w \sim \frac{\nu^2}{p^2}\, W\left(I_p, p\right) \tag{17}$$

Note that the number of degrees of freedom of the distribution is equal to the size of the matrices, p, therefore these covariances display significant correlations. Furthermore, we choose $F = 2\Sigma^3 + \Sigma\,\mathrm{Tr}\,\Sigma^2$, which is the value taken when x follows a Gaussian distribution. Therefore, we effectively test the loss in Eq.14 for a Gaussian distribution, as in the previous section, but we stress that the expression is valid for any distribution of x within the assumptions of Theorem 3. We also run experiments of MAML, applied again to mixed linear regression, but now using the covariance matrices drawn in Eq.17.

Figure 4 shows the loss in Eq.14 as a function of the learning rates, during adaptation (panel a) and training (panel b). Qualitatively, we observe a similar behaviour as in section 5.1: the adaptation learning rate has a unique minimum at a positive value of αr, while the training learning rate shows better performance for negative values of αt. Again, there is a good agreement between theory and experiment, suggesting that Eq.14 is a good approximation.

Figure 4: Average test loss of MAML as a function of the learning rate, on overparameterized mixed linear regression with Wishart covariances, as predicted by our theory and confirmed in experiments. a) Effect of the learning rate αr during adaptation. b) Effect of the learning rate αt during training. The optimal learning rate during adaptation is positive, while that during training appears to be negative. Values of parameters: nt = 30, nv = 2, nr = 20, m = 3, p = 60, σ = 1, ν = 0.5, ω0 = 0, w0 = 0. In panel a) we set αt = 0.2, in panel b) we set αr = 0.2. In the experiments, each run is evaluated on 100 tasks of 50 data points each, and each point is an average over 100 runs (a) or 500 runs (b).
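A sketch of the probing procedure just described (our own code; the 1/p and 1/p² normalizations of the Wishart draws in Eq.17 are our reading of the garbled original): draw Σ and Σw once, set F to its Gaussian value, and scan Eq.14 over αt:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_t, n_v, n_r, m, sigma, nu, a_r = 60, 30, 2, 20, 3, 1.0, 0.5, 0.2

G = rng.standard_normal((p, p))
Sigma = G @ G.T / p                    # ~ (1/p) Wishart(I_p, p), Eq. 17
G = rng.standard_normal((p, p))
Sigma_w = nu**2 * (G @ G.T) / p**2     # ~ (nu^2/p^2) Wishart(I_p, p)
F = 2 * np.linalg.matrix_power(Sigma, 3) + Sigma * np.trace(Sigma @ Sigma)

def H(alpha, n):
    # Eqs. 15-16
    A = np.eye(p) - alpha * Sigma
    return Sigma @ A @ A + alpha**2 / n * (F - np.linalg.matrix_power(Sigma, 3))

def loss(a_t):
    # Eq. 14
    H_t, H_r = H(a_t, n_t), H(a_r, n_r)
    tr2 = np.trace(Sigma @ Sigma)
    return (sigma**2/2 * (1 + a_r**2/n_r * tr2) + np.trace(Sigma_w @ H_r)/2
            + n_v*m/2 * np.trace(H_r @ H_t)
              * (np.trace(Sigma_w @ H_t) + sigma**2*(1 + a_t**2/n_t * tr2))
              / np.trace(H_t)**2)

grid = np.linspace(-1, 1, 201)
print("optimal alpha_t ~", grid[int(np.argmin([loss(a) for a in grid]))])
```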
5.4 NONLINEAR REGRESSION

To investigate whether negative learning rates improve performance on nonlinear regression in practice, we studied the simple case of MAML with a neural network applied to a quadratic function. Specifically, the target output is generated according to $y = \left(w^T x + b\right)^2 + z$, where b is a bias term. The data x, z and the generating parameters w are sampled as described in section 4 (in addition, the bias b is drawn from a Gaussian distribution of zero mean and unit variance). We use a 2-layer feed-forward neural network with ReLU activation functions. Weights are initialized following a Gaussian distribution of zero mean and variance equal to the inverse number of inputs. We report results with a network width of 400 in both layers; results were similar with larger network widths. We use the square loss function and we train the neural network in the outer loop with stochastic gradient descent with a learning rate of 0.001 for 5000 epochs (until convergence). We used most parameters identical to section 5.1: nt = 30, nv = 2, nr = 20, m = 3, p = 60, σ = 1, ν = 0.5, w0 = 0. The learning rate for adaptation was set to αr = 0.01. Note that in section 5.1 the model was initialized at the ground truth of the generative model (ω0 = w0), while here the neural network parameters are initialized at random.

Figure 5 shows the test loss as a function of the learning rate αt. The best performance is obtained for a negative learning rate of αt = −0.0075.

6 DISCUSSION

Figure 5: Average test loss of MAML as a function of the learning rate, on nonlinear (quadratic) regression using a 2-layer feedforward neural network. The optimal learning rate is negative, consistent with results on the linear case. Each run is evaluated on 1000 test tasks, and each point is an average over 10 runs. Error bars show standard errors. Note the qualitative similarity with Figures 2b and 4b.

We calculated algebraic expressions for the average test loss of MAML applied to a simple family of linear models, as a function of the hyperparameters. Surprisingly, we showed that the optimal value of the learning rate of the inner loop during training is negative. This finding seems to carry over to more interesting nonlinear models in the overparameterized case. However, additional work is necessary to establish the conditions under which the optimal learning rate may be positive, for example by probing Eq.14 more extensively.

A negative optimal learning rate is surprising and counter-intuitive, since negative learning rates push parameters towards higher values of the loss. However, the meta-training loss is minimized by the outer loop, therefore it is not immediately obvious whether the learning rate of the inner loop should be positive, and we show that in some circumstances it should not. However, perhaps obviously, we also show that the learning rate during adaptation at test time should always be positive, otherwise the target task cannot be learned.

In this work, we considered nonlinear models only in the overparameterized case. However, typical applications of MAML (and meta-learning in general) implement relatively small models, due to the heavy computational load of running bi-level optimization, including both an outer and an inner loop. Our theory applies to regression problems, and assumes a limited number of tasks where data is independently drawn in each task, while some applications use a large number of tasks with correlated draws (for example, images may be shared across tasks in few-shot image classification, see Bertinetto et al. (2019)). Our theory is valid at the exact optimum of the outer loop, which is equivalent to training the outer loop to convergence; therefore overfitting may occur in the outer loop of our model. Another limitation of our theory is represented by the assumptions on the input and task covariance, which have no correlations in Theorems 1, 2, and are subject to some technical assumptions in Theorem 3.

To the best of our knowledge, nobody has considered training meta-learning models with negative learning rates in the inner loop.
Given that some studies advocate removing the inner loop altogether, which is similar to setting the learning rate to zero, it would be interesting to try a negative one. On the other hand, it is possible that a negative learning rate does not work in classification problems, in nonlinear models, or using inputs or tasks with a complex structure, settings that are outside the theory presented in this work.

ACKNOWLEDGMENTS

We would like to thank Paolo Grazieschi for helping with formalizing the theorems, and Ritwik Niyogi for helping with the nonlinear regression experiments.

REFERENCES

Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, October 2017. URL http://arxiv.org/abs/1710.03667.

Yu Bai, Minshuo Chen, Pan Zhou, Tuo Zhao, Jason D. Lee, Sham Kakade, Huan Wang, and Caiming Xiong. How Important is the Train-Validation Split in Meta-Learning? arXiv:2010.05843, February 2021. URL http://arxiv.org/abs/2010.05843.

Luca Bertinetto, João F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv:1805.08136, July 2019. URL http://arxiv.org/abs/1805.08136.

Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A Closer Look at Few-shot Classification. arXiv:1904.04232, January 2020a. URL http://arxiv.org/abs/1904.04232.

Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A New Meta-Baseline for Few-Shot Learning. arXiv:2003.04390, March 2020b. URL http://arxiv.org/abs/2003.04390.

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. arXiv:1812.07956, January 2020. URL http://arxiv.org/abs/1812.07956.

Liam Collins, Aryan Mokhtari, and Sanjay Shakkottai. Why Does MAML Outperform ERM? An Optimization Perspective. arXiv:2010.14672, December 2020. URL http://arxiv.org/abs/2010.14672.

Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Learning To Learn Around A Common Mean. In Advances in Neural Information Processing Systems 31, pp. 10169-10179. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8220-learning-to-learn-around-a-common-mean.pdf.

Guneet S. Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A Baseline for Few-Shot Image Classification. arXiv:1909.02729, March 2020. URL http://arxiv.org/abs/1909.02729.

Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. ICML, 2014.

Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-Shot Learning via Learning the Representation, Provably. arXiv:2002.09434, February 2020. URL http://arxiv.org/abs/2002.09434.

Chelsea Finn and Sergey Levine. Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate any Learning Algorithm. arXiv:1710.11622, February 2018. URL http://arxiv.org/abs/1710.11622.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv:1703.03400, March 2017. URL http://arxiv.org/abs/1703.03400.

Katelyn Gao and Ozan Sener. Modeling and Optimization Trade-off in Meta-learning. arXiv:2010.12916, October 2020. URL http://arxiv.org/abs/2010.12916.

Micah Goldblum, Steven Reich, Liam Fowl, Renkun Ni, Valeriia Cherepanova, and Tom Goldstein. Unraveling Meta-Learning: Understanding Feature Representations for Few-Shot Tasks. arXiv:2002.06753, March 2020. URL http://arxiv.org/abs/2002.06753.

Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting Gradient-Based Meta-Learning as Hierarchical Bayes. ICLR, 2018.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv:1903.08560, November 2019. URL http://arxiv.org/abs/1903.08560.

Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-Learning in Neural Networks: A Survey. arXiv:2004.05439, April 2020. URL http://arxiv.org/abs/2004.05439.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv:1806.07572, June 2018. URL http://arxiv.org/abs/1806.07572.

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Multi-Step Model-Agnostic Meta-Learning: Convergence and Improved Algorithms. arXiv:2002.07836, February 2020. URL http://arxiv.org/abs/2002.07836.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv:2001.08361, January 2020. URL http://arxiv.org/abs/2001.08361.

Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive Gradient-Based Meta-Learning Methods. arXiv:1906.02717, December 2019. URL http://arxiv.org/abs/1906.02717.

Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh. Meta-learning for mixed linear regression. arXiv:2002.08936, February 2020. URL http://arxiv.org/abs/2002.08936.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, May 2015. doi: 10.1038/nature14539. URL http://www.nature.com/articles/nature14539.

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. arXiv:1902.06720, February 2019. URL http://arxiv.org/abs/1902.06720.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. arXiv:1707.09835, September 2017. URL http://arxiv.org/abs/1707.09835.

Preetum Nakkiran. More Data Can Hurt for Linear Regression: Sample-wise Double Descent. arXiv:1912.07242, December 2019. URL http://arxiv.org/abs/1912.07242.

Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, October 2010. doi: 10.1109/TKDE.2009.191. URL http://ieeexplore.ieee.org/document/5288526/.
Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. arXiv:1909.09157, February 2020. URL http://arxiv.org/abs/1909.09157.

Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A Constructive Prediction of the Generalization Error Across Scales. ICLR, 2020.

Nikunj Saunshi, Yi Zhang, Mikhail Khodak, and Sanjeev Arora. A Sample Complexity Separation between Non-Convex and Convex Meta-Learning. arXiv:2002.11172, February 2020. URL http://arxiv.org/abs/2002.11172.

Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, and Phillip Isola. Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need? arXiv:2003.11539, June 2020. URL http://arxiv.org/abs/2003.11539.

Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. arXiv:1903.03096, February 2020. URL http://arxiv.org/abs/1903.03096.

Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable Meta-Learning of Linear Representations. arXiv:2002.11684, February 2020. URL http://arxiv.org/abs/2002.11684.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching Networks for One Shot Learning. arXiv:1606.04080, December 2017. URL http://arxiv.org/abs/1606.04080.

Haoxiang Wang, Ruoyu Sun, and Bo Li. Global Convergence and Generalization Bound of Gradient-Based Meta-Learning with Deep Neural Nets. arXiv:2006.14606, November 2020a. URL http://arxiv.org/abs/2006.14606.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. On the Global Optimality of Model-Agnostic Meta-Learning. arXiv:2006.13182, June 2020b. URL http://arxiv.org/abs/2006.13182.

Greg Yang and Edward J. Hu. Feature Learning in Infinite-Width Neural Networks. arXiv:2011.14522, November 2020. URL http://arxiv.org/abs/2011.14522.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? arXiv:1411.1792, November 2014. URL http://arxiv.org/abs/1411.1792.

Yufan Zhou, Zhenyi Wang, Jiayi Xian, Changyou Chen, and Jinhui Xu. Meta-Learning with Neural Tangent Kernels. arXiv:2102.03909, February 2021. URL http://arxiv.org/abs/2102.03909.

Figure 6: Average test loss of MAML as a function of the learning rate αt (training) on mixed linear regression, showing the transition from strongly overparameterized (a), to weakly overparameterized (b), weakly underparameterized (c) and strongly underparameterized (d). As expected, predictions of the theory are accurate only in panels (a) and (d). The amount of validation data increases from panels (a) to (d), with the following values: m = 1, nv = 2 (a); m = 5, nv = 5 (b); m = 10, nv = 10 (c); m = 10, nv = 40 (d). Other parameters are equal to: nt = 40, nr = 40, p = 50, σ = 0.5, ν = 0.5, αr = 0.2, ω0 = 0, w0 = (0.1, 0.1, ..., 0.1) (note that overfitting occurs since ω0 ≠ w0). In the experiments, each run is evaluated on 100 test tasks of 50 data points each, and each point is an average over 100 runs.
7 APPENDIX

7.1 DEFINITION OF THE LOSS FUNCTION

We consider the problem of mixed linear regression $y = Xw + z$ with squared loss, where X is an n × p matrix of input data, each row being one of n data vectors of dimension p, z is an n × 1 noise vector, w is a p × 1 vector of generating parameters and y is an n × 1 output vector. Data is collected for m tasks, each with a different value of the parameters w and a different realization of the input X and noise z. We denote by w(i) the parameters for task i, for i = 1, ..., m. For a given task i, we denote by Xt(i), Xv(i) the input data for, respectively, the training and validation sets, by zt(i), zv(i) the corresponding noise vectors and by yt(i), yv(i) the output vectors. We denote by nt, nv the data sample size for the training and validation sets, respectively. For a given task i, the training output is equal to

$$y^{t(i)} = X^{t(i)} w^{(i)} + z^{t(i)} \tag{18}$$

Similarly, the validation output is equal to

$$y^{v(i)} = X^{v(i)} w^{(i)} + z^{v(i)} \tag{19}$$

We consider MAML as a model for meta-learning (Finn et al. (2017)). The meta-training loss is equal to

$$L^{meta} = \frac{1}{2 n_v m} \sum_{i=1}^{m} \left|y^{v(i)} - X^{v(i)} \theta^{(i)}\left(\omega\right)\right|^2 \tag{20}$$

where vertical brackets denote the Euclidean norm, and the estimated parameters θ(i)(ω) are equal to the one-step gradient update on the single-task training loss $L^{(i)} = \left|y^{t(i)} - X^{t(i)} \theta^{(i)}\right|^2 / 2 n_t$, with initial condition given by the meta-parameter ω. The single gradient update is equal to

$$\theta^{(i)}\left(\omega\right) = \left(I_p - \frac{\alpha_t}{n_t}\, X^{t(i)T} X^{t(i)}\right)\omega + \frac{\alpha_t}{n_t}\, X^{t(i)T} y^{t(i)} \tag{21}$$

where Ip is the p × p identity matrix and αt is the learning rate. We seek to minimize the meta-training loss with respect to the meta-parameter ω, namely

$$\omega^* = \arg\min_\omega\, L^{meta} \tag{22}$$

We evaluate the solution ω* by calculating the meta-test loss

$$L^{test} = \frac{1}{2 n_s}\left|y^s - X^s \theta^*\right|^2 \tag{23}$$

Note that the test loss is calculated over test data Xs, zs, and test parameters w*, namely

$$y^s = X^s w^* + z^s \tag{24}$$

Furthermore, the estimated parameters θ* are calculated on a separate set of target data Xr, zr, namely

$$\theta^* = \left(I_p - \frac{\alpha_r}{n_r}\, X^{rT} X^r\right)\omega^* + \frac{\alpha_r}{n_r}\, X^{rT} y^r \tag{25}$$

$$y^r = X^r w^* + z^r \tag{26}$$

Note that the learning rate and sample size can be different at testing, and are denoted by αr, nr, ns. We are interested in calculating the average test loss, that is, the test loss of Eq.23 averaged over the entire data distribution, equal to

$$\bar{L}^{test} = \mathbb{E}_w\, \mathbb{E}_{z^t}\, \mathbb{E}_{X^t}\, \mathbb{E}_{z^v}\, \mathbb{E}_{X^v}\, \mathbb{E}_{w^*}\, \mathbb{E}_{z^s}\, \mathbb{E}_{X^s}\, \mathbb{E}_{z^r}\, \mathbb{E}_{X^r}\; \frac{1}{2 n_s}\left|y^s - X^s \theta^*\right|^2 \tag{27}$$

7.2 DEFINITION OF PROBABILITY DISTRIBUTIONS

We assume that all random variables are Gaussian. In particular, we assume that the rows of the matrix X are independent, and each row, denoted by x, is distributed according to a multivariate Gaussian with zero mean and unit covariance

$$x \sim \mathcal{N}\left(0, I_p\right) \tag{28}$$

where Ip is the p × p identity matrix. Similarly, the noise is distributed following a multivariate Gaussian with zero mean and variance equal to σ², namely

$$z \sim \mathcal{N}\left(0, \sigma^2 I_n\right) \tag{29}$$

Finally, the generating parameters are also distributed according to a multivariate Gaussian, of variance ν²/p, namely

$$w \sim \mathcal{N}\left(w_0, \frac{\nu^2}{p}\, I_p\right) \tag{30}$$

The generating parameter w is drawn once and kept fixed within a task, and drawn independently for different tasks. The values of x and z are drawn independently in all tasks and datasets (training, validation, target, test).
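The definitions of sections 7.1-7.2 translate directly into a simulation; the following is a minimal sketch (our own) of sampling one task and computing the meta-training loss of Eqs.20-21:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n_t, n_v, a_t = 20, 5, 10, 4, 0.2
sigma, nu, w0 = 0.5, 0.5, np.zeros(p)

def sample_task():
    # One task of section 7.1: draw w (Eq. 30), then data and outputs (Eqs. 18-19, 28-29).
    w = w0 + rng.standard_normal(p) * nu / np.sqrt(p)
    X_t = rng.standard_normal((n_t, p))
    X_v = rng.standard_normal((n_v, p))
    y_t = X_t @ w + sigma * rng.standard_normal(n_t)
    y_v = X_v @ w + sigma * rng.standard_normal(n_v)
    return X_t, y_t, X_v, y_v

def meta_loss(omega, tasks):
    # Eqs. 20-21: one-step adapted parameters, then validation error.
    total = 0.0
    for X_t, y_t, X_v, y_v in tasks:
        theta = (np.eye(p) - a_t / n_t * X_t.T @ X_t) @ omega + a_t / n_t * X_t.T @ y_t
        total += np.sum((y_v - X_v @ theta) ** 2)
    return total / (2 * n_v * m)

print(meta_loss(np.zeros(p), [sample_task() for _ in range(m)]))
```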
In order to perform the calculations in the next section, we need the following results.

Lemma 1. Let X be a Gaussian n × p random matrix with independent rows, where each row has covariance equal to Ip, the p × p identity matrix. Then:

$$\mathbb{E}\left[X^T X\right] = n\, I_p \tag{31}$$

$$\mathbb{E}\left[\left(X^T X\right)^2\right] = n\left(n + p + 1\right) I_p = n^2 \mu_2 I_p \tag{32}$$

$$\mathbb{E}\left[\left(X^T X\right)^3\right] = n\left(n^2 + p^2 + 3np + 3n + 3p + 4\right) I_p = n^3 \mu_3 I_p \tag{33}$$

$$\mathbb{E}\left[\left(X^T X\right)^4\right] = n\left(n^3 + p^3 + 6n^2 p + 6np^2 + 6n^2 + 6p^2 + 17np + 21n + 21p + 20\right) I_p = n^4 \mu_4 I_p \tag{34, 35}$$

$$\mathbb{E}\left[X^T X\; \mathrm{Tr}\left(X^T X\right)\right] = \left(n^2 p + 2n\right) I_p = p\, n^2 \mu_{1,1} I_p \tag{36}$$

$$\mathbb{E}\left[\left(X^T X\right)^2 \mathrm{Tr}\left(X^T X\right)\right] = n\left(n^2 p + np^2 + np + 4n + 4p + 4\right) I_p = p\, n^3 \mu_{2,1} I_p \tag{37}$$

$$\mathbb{E}\left[X^T X\; \mathrm{Tr}\left(\left(X^T X\right)^2\right)\right] = n\left(n^2 p + np^2 + np + 4n + 4p + 4\right) I_p = p\, n^3 \mu_{1,2} I_p \tag{38}$$

$$\mathbb{E}\left[\left(X^T X\right)^2 \mathrm{Tr}\left(\left(X^T X\right)^2\right)\right] = n\left(n^3 p + np^3 + 2n^2 p^2 + 2n^2 p + 2np^2 + 8n^2 + 8p^2 + 21np + 20n + 20p + 20\right) I_p = p\, n^4 \mu_{2,2} I_p \tag{39, 40}$$

where the last equality in each of these expressions defines the variables µ. Furthermore, for any n × n symmetric matrix C and any p × p symmetric matrix D, independent of X:

$$\mathbb{E}\left[X^T C X\right] = \mathrm{Tr}\left(C\right) I_p \tag{41}$$

$$\mathbb{E}\left[X^T X\, D\, X^T X\right] = n\left(n+1\right) D + n\, \mathrm{Tr}\left(D\right) I_p \tag{42}$$

Proof. The Lemma follows by direct computation of the above expectations, using Isserlis' theorem. In particular, for higher order exponents, combinatorics plays a crucial role in counting products of different Gaussian variables in an effective way.

Lemma 2. Let Xv(i), Xt(i) be Gaussian random matrices, of size respectively nv × p and nt × p, with independent rows, where each row has covariance equal to Ip, the p × p identity matrix. Let p(ξ) and nt(ξ) be any functions of order O(ξ) as ξ → ∞. Then:

$$X^{v(i)} X^{v(i)T} = p\, I_{n_v} + O\left(\xi^{1/2}\right) \tag{43}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} = p\, n_t\, I_{n_v} + O\left(\xi^{3/2}\right) \tag{44}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} = p\, n_t\left(n_t + p + 1\right) I_{n_v} + O\left(\xi^{5/2}\right) \tag{45}$$

Note that the order O(ξ) applies to all elements of the matrix in each expression. For i ≠ j:

$$X^{v(i)} X^{v(j)T} = O\left(\xi^{1/2}\right) \tag{46}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(j)T} = O\left(\xi^{3/2}\right) \tag{47}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} X^{t(j)T} X^{t(j)} X^{v(j)T} = O\left(\xi^{5/2}\right) \tag{48}$$

Furthermore, for any positive real number δ and for any p × p symmetric matrix D independent of X, where Tr(D) and Tr(D²) are both of order O(ξ^δ):

$$X^{v(i)} D\, X^{v(i)T} = \mathrm{Tr}\left(D\right) I_{n_v} + O\left(\xi^{\delta/2}\right) \tag{49}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} D\, X^{v(i)T} = \mathrm{Tr}\left(D\right) n_t\, I_{n_v} + O\left(\xi^{1+\delta/2}\right) \tag{50}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} D\, X^{t(i)T} X^{t(i)} X^{v(i)T} = \mathrm{Tr}\left(D\right) n_t\left(n_t + p + 1\right) I_{n_v} + O\left(\xi^{2+\delta/2}\right) \tag{51}$$

$$X^{v(i)} D\, X^{v(j)T} = O\left(\xi^{\delta/2}\right) \tag{52}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} D\, X^{v(j)T} = O\left(\xi^{1+\delta/2}\right) \tag{53}$$

$$X^{v(i)} X^{t(i)T} X^{t(i)} D\, X^{t(j)T} X^{t(j)} X^{v(j)T} = O\left(\xi^{2+\delta/2}\right) \tag{54}$$

Proof. The Lemma follows by direct computation of the expectations and variances of each term.

Lemma 3. Let Xv, Xt be Gaussian random matrices, of size respectively nv × p and nt × p, with independent rows, where each row has covariance equal to Ip, the p × p identity matrix. Let nv(ξ) and nt(ξ) be any functions of order O(ξ) for ξ → ∞. Then:

$$X^{vT} X^v = n_v\, I_p + O\left(\xi^{1/2}\right) \tag{55}$$

$$X^{tT} X^t\, X^{vT} X^v = n_t n_v\, I_p + O\left(\xi^{3/2}\right) \tag{56}$$

$$X^{tT} X^t\, X^{vT} X^v\, X^{tT} X^t = n_v n_t\left(n_t + p + 1\right) I_p + O\left(\xi^{5/2}\right) \tag{57}$$

Note that the order O(ξ) applies to all elements of the matrix in each expression.

Proof. The Lemma follows by direct computation of the expectations and variances of each term.
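The moment identities of Lemma 1 are straightforward to spot-check by simulation; for instance, a Monte-Carlo estimate of Eq.32 (our own check):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 6, 4, 100_000
acc = np.zeros((p, p))
for _ in range(trials):
    X = rng.standard_normal((n, p))
    M = X.T @ X
    acc += M @ M
acc /= trials
# Eq. 32: E[(X^T X)^2] = n (n + p + 1) I_p; here n (n + p + 1) = 66.
print(np.round(acc, 1), n * (n + p + 1))
```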
7.3 PROOF OF THEOREMS 1 AND 2

We calculate the average test loss as a function of the hyperparameters nt, nv, nr, p, m, αt, αr, σ, ν, w0. Using the expression in Eq.24 for the test output, we rewrite the test loss in Eq.27 as

$$\bar{L}^{test} = \mathbb{E}\; \frac{1}{2 n_s}\left|X^s\left(w^* - \theta^*\right) + z^s\right|^2 \tag{58}$$

We start by averaging this expression with respect to Xs, zs, noting that θ* does not depend on the test data. We further average with respect to w*, but note that θ* depends on the test parameters, so we average only terms that do not depend on θ*. Using Eq.31, the result is

$$\bar{L}^{test} = \frac{\sigma^2}{2} + \frac{1}{2}\left(\nu^2 + \left|w_0\right|^2\right) + \mathbb{E}\left[\frac{1}{2}\left|\theta^*\right|^2 - \left(w_0 + \delta w^*\right)^T \theta^*\right] \tag{59}$$

where we define δw* = w* − w0. The second term in the expectation is linear in θ* and can be averaged over Xr, zr, using Eq.25 and noting that ω* does not depend on the target data. The result is

$$\mathbb{E}_{X^r}\, \mathbb{E}_{z^r}\, \theta^* = \left(1 - \alpha_r\right)\omega^* + \alpha_r\left(w_0 + \delta w^*\right) \tag{60}$$

Using Eq.60, we average over w* the second term in the expectation of Eq.59 and find

$$\bar{L}^{test} = \frac{\sigma^2}{2} + \left(\frac{1}{2} - \alpha_r\right)\left(\nu^2 + \left|w_0\right|^2\right) - \left(1 - \alpha_r\right) w_0^T\, \mathbb{E}\,\omega^* + \mathbb{E}\,\frac{1}{2}\left|\theta^*\right|^2 \tag{61}$$

We average the last term of this expression over zr, w*, using Eq.25 and noting that ω* does not depend on the target data and test parameters. The result is

$$\mathbb{E}_{w^*}\, \mathbb{E}_{z^r}\left|\theta^*\right|^2 = \left|\omega^*\right|^2 + \frac{\alpha_r^2}{n_r^2}\left[\left(\omega^* - w_0\right)^T\left(X^{rT} X^r\right)^2\left(\omega^* - w_0\right) + \frac{\nu^2}{p}\,\mathrm{Tr}\left(\left(X^{rT} X^r\right)^2\right)\right] - \frac{2\alpha_r}{n_r}\,\omega^{*T} X^{rT} X^r\left(\omega^* - w_0\right) + \frac{\alpha_r^2 \sigma^2}{n_r^2}\,\mathrm{Tr}\left(X^r X^{rT}\right) \tag{62, 63}$$

We now average over Xr, again noting that ω* does not depend on the target data. Using Eqs.31, 32, we find

$$\mathbb{E}_{X^r}\, \mathbb{E}_{w^*}\, \mathbb{E}_{z^r}\left|\theta^*\right|^2 = \left|\omega^*\right|^2 + \alpha_r^2\left(1 + \frac{p+1}{n_r}\right)\left(\nu^2 + \left|\omega^* - w_0\right|^2\right) - 2\alpha_r\,\omega^{*T}\left(\omega^* - w_0\right) + \frac{\alpha_r^2 \sigma^2 p}{n_r} \tag{64}$$

We can now rewrite the average test loss of Eq.61 as

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1 + \frac{\alpha_r^2 p}{n_r}\right) + \frac{1}{2}\left[\left(1-\alpha_r\right)^2 + \alpha_r^2\,\frac{p+1}{n_r}\right]\left(\nu^2 + \mathbb{E}\left|\omega^* - w_0\right|^2\right) \tag{65}$$

In order to average the last term, we need an expression for ω*. We note that the loss in Eq.20 is quadratic in ω, therefore the solution of Eq.22 can be found using standard linear algebra. In particular, the loss in Eq.20 can be rewritten as

$$L^{meta} = \frac{1}{2 n_v m}\left|\gamma - B\omega\right|^2 \tag{66}$$

where γ is a vector of shape nvm × 1, and B is a matrix of shape nvm × p. The vector γ is a stack of m vectors

$$\gamma = \begin{pmatrix} X^{v(1)}\left(I_p - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)}\right) w^{(1)} - \frac{\alpha_t}{n_t}\, X^{v(1)} X^{t(1)T} z^{t(1)} + z^{v(1)} \\ \vdots \\ X^{v(m)}\left(I_p - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)}\right) w^{(m)} - \frac{\alpha_t}{n_t}\, X^{v(m)} X^{t(m)T} z^{t(m)} + z^{v(m)} \end{pmatrix} \tag{67}$$

Similarly, the matrix B is a stack of m matrices

$$B = \begin{pmatrix} X^{v(1)}\left(I_p - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)}\right) \\ \vdots \\ X^{v(m)}\left(I_p - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)}\right) \end{pmatrix} \tag{68}$$

We denote by Ip the p × p identity matrix. The expression for ω* that minimizes Eq.66 depends on whether the problem is overparameterized (p > nvm) or underparameterized (p < nvm), therefore we distinguish these two cases in the following sections.

7.3.1 OVERPARAMETERIZED CASE (THEOREM 1)

In the overparameterized case (p > nvm), under the assumption that the inverse of BB^T exists, the value of ω that minimizes Eq.66 is equal to

$$\omega^* = B^T\left(B B^T\right)^{-1}\gamma + \left[I_p - B^T\left(B B^T\right)^{-1} B\right]\omega_0 \tag{69}$$

The vector ω0 is interpreted as the initial condition of the parameter optimization of the outer loop, when optimized by gradient descent. Note that the matrix B does not depend on w, zt, zv, and $\mathbb{E}_w \mathbb{E}_{z^t} \mathbb{E}_{z^v}\,\gamma = B w_0$. We denote by δγ the deviation from the average, and we have

$$\omega^* - w_0 = B^T\left(B B^T\right)^{-1}\delta\gamma + \left[I_p - B^T\left(B B^T\right)^{-1} B\right]\left(\omega_0 - w_0\right) \tag{70}$$

We square this expression and average over w, zt, zv. We use the cyclic property of the trace and the fact that $B^T\left(BB^T\right)^{-1}B$ is a projection. The result is

$$\left|\omega^* - w_0\right|^2 = \mathrm{Tr}\left[\Gamma\left(B B^T\right)^{-1}\right] + \left(\omega_0 - w_0\right)^T\left[I_p - B^T\left(B B^T\right)^{-1} B\right]\left(\omega_0 - w_0\right) \tag{71}$$

The matrix Γ is defined as

$$\Gamma = \mathbb{E}_w\, \mathbb{E}_{z^t}\, \mathbb{E}_{z^v}\, \delta\gamma\, \delta\gamma^T = \begin{pmatrix} \Gamma^{(1)} & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \Gamma^{(m)} \end{pmatrix} \tag{72}$$

where the matrix blocks are given by the following expression

$$\Gamma^{(i)} = \frac{\nu^2}{p}\, X^{v(i)}\left(I_p - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)}\right)^2 X^{v(i)T} + \sigma^2\left(I_{n_v} + \frac{\alpha_t^2}{n_t^2}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T}\right) \tag{73}$$

It is convenient to rewrite the scalar product of Eq.71 in terms of the trace of outer products

$$\left|\omega^* - w_0\right|^2 = \mathrm{Tr}\left[\left(B B^T\right)^{-1}\left(\Gamma - B\left(\omega_0 - w_0\right)\left(\omega_0 - w_0\right)^T B^T\right)\right] + \left|\omega_0 - w_0\right|^2 \tag{74}$$

In order to calculate $\mathbb{E}\left|\omega^* - w_0\right|^2$ in Eq.65 we need to average this expression over the training and validation data. These averages are hard to compute since they involve nonlinear functions of the data. However, we can approximate these terms by assuming that p and nt are large, both of order O(ξ), where ξ is a large number. Furthermore, we assume that |ω0 − w0| is of order O(ξ^{-1/4}).
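Note that Eq.69 is directly computable; a minimal sketch (our own) of the exact outer-loop solution in the overparameterized case:

```python
import numpy as np

def solve_outer_overparam(B, gamma, omega0):
    # Exact minimizer of Eq. 66 reached by gradient descent from omega0
    # when p > n_v * m (Eq. 69): fit the data exactly along the row space of B,
    # keep the component of omega0 in the null space of B.
    BBt_inv = np.linalg.inv(B @ B.T)  # (n_v m) x (n_v m), assumed invertible
    return B.T @ (BBt_inv @ gamma) + omega0 - B.T @ (BBt_inv @ (B @ omega0))
```

With omega0 = 0 this is the minimum-norm solution, which is also the form used later for Theorem 3 (Eq.128).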
Using Lemma 2, together with the expressions of B (Eq.68) and Γ (Eqs.72, 73), we can prove that

$$\frac{1}{p}\, B B^T = \left[\left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t}\right] I_{n_v m} + O\left(\xi^{-1/2}\right) \tag{75}$$

$$\Gamma = \left[\nu^2\left(\left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t}\right) + \sigma^2\left(1 + \frac{\alpha_t^2 p}{n_t}\right)\right] I_{n_v m} + O\left(\xi^{-1/2}\right) \tag{76}$$

$$B\left(\omega_0 - w_0\right)\left(\omega_0 - w_0\right)^T B^T = \left|\omega_0 - w_0\right|^2\left[\left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t}\right] I_{n_v m} + O\left(\xi^{-1/2}\right) \tag{77}$$

Using Eq.75 and a Taylor expansion, the inverse of BB^T is equal to

$$\left(B B^T\right)^{-1} = \frac{1}{p}\left[\left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t}\right]^{-1} I_{n_v m} + O\left(\xi^{-3/2}\right) \tag{78}$$

Substituting the three expressions above into Eq.74, and ignoring terms of lower order, we find

$$\mathbb{E}\left|\omega^* - w_0\right|^2 = \left(1 - \frac{n_v m}{p}\right)\left|\omega_0 - w_0\right|^2 + \frac{n_v m}{p}\left[\nu^2 + \frac{\sigma^2\left(1 + \frac{\alpha_t^2 p}{n_t}\right)}{\left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t}}\right] \tag{79}$$

Substituting this expression into Eq.65, we find the value of the average test loss

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1+\frac{\alpha_r^2 p}{n_r}\right) + \frac{h_r}{2}\left[\nu^2\left(1+\frac{n_v m}{p}\right) + \left(1-\frac{n_v m}{p}\right)\left|\omega_0-w_0\right|^2 + \frac{n_v m}{p}\,\frac{\sigma^2\left(1+\frac{\alpha_t^2 p}{n_t}\right)}{h_t}\right] \tag{80}$$

where we define the following expressions

$$h_t = \left(1-\alpha_t\right)^2 + \alpha_t^2\,\frac{p+1}{n_t} \tag{81}$$

$$h_r = \left(1-\alpha_r\right)^2 + \alpha_r^2\,\frac{p+1}{n_r} \tag{82}$$

7.3.2 UNDERPARAMETERIZED CASE (THEOREM 2)

In the underparameterized case (p < nvm), under the assumption that the inverse of B^TB exists, the value of ω that minimizes Eq.66 is equal to

$$\omega^* = \left(B^T B\right)^{-1} B^T \gamma \tag{83}$$

Note that the matrix B does not depend on w, zt, zv, and $\mathbb{E}_w \mathbb{E}_{z^t} \mathbb{E}_{z^v}\,\gamma = B w_0$. We denote by δγ the deviation from the average, and we have

$$\left|\omega^* - w_0\right|^2 = \mathrm{Tr}\left[\left(B^T B\right)^{-1} B^T \delta\gamma\, \delta\gamma^T B\left(B^T B\right)^{-1}\right] \tag{84}$$

We need to average this expression in order to calculate $\mathbb{E}\left|\omega^* - w_0\right|^2$ in Eq.65. We start by averaging δγδγ^T over w, zt, zv, since B does not depend on those variables. Note that w, zt, zv are independent of each other and across tasks. As in the previous section, we denote by Γ the result of this operation, given by Eqs.72, 73. Finally, we need to average over the training and validation data

$$\mathbb{E}\left|\omega^* - w_0\right|^2 = \mathbb{E}_{X^t}\, \mathbb{E}_{X^v}\, \mathrm{Tr}\left[\left(B^T B\right)^{-1} B^T \Gamma B\left(B^T B\right)^{-1}\right] \tag{85}$$

It is hard to average this expression because it includes nonlinear functions of the data. However, we can approximate these terms by assuming that either m or ξ (or both) is a large number, where ξ is defined by assuming that both nt and nv are of order O(ξ). Using Lemma 3, together with the expression of B (Eq.68), and noting that each factor in Eq.85 has a sum over m independent terms, we can prove that

$$\frac{1}{n_v m}\, B^T B = \left(1 - 2\alpha_t + \alpha_t^2 \mu_2\right) I_p + O\left(\left(m\xi\right)^{-1/2}\right) \tag{86}$$

The expression for µ2 is given in Eq.32. Using this result and a Taylor expansion, the inverse is equal to

$$n_v m\left(B^T B\right)^{-1} = \left(1 - 2\alpha_t + \alpha_t^2 \mu_2\right)^{-1} I_p + O\left(\left(m\xi\right)^{-1/2}\right) \tag{87}$$

Similarly, the term B^TΓB is equal to its average plus a term of smaller order

$$\frac{1}{n_v m}\, B^T \Gamma B = \frac{1}{n_v m}\, \mathbb{E}\left[B^T \Gamma B\right] + O\left(\left(m\xi\right)^{-1/2}\right) \tag{88}$$

We substitute these expressions into Eq.85 and neglect the lower orders. Here we show how to calculate explicitly the expectation of B^TΓB. For ease of notation, we define the matrix $A^{t(i)} = I - \frac{\alpha_t}{n_t}\, X^{t(i)T} X^{t(i)}$.
Using the expressions of B (Eq.68) and Γ (Eqs.72, 73), the expression for B^TΓB is given by

$$B^T \Gamma B = \sigma^2 \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)} + \frac{\nu^2}{p} \sum_{i=1}^{m}\left(A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)}\right)^2 + \frac{\sigma^2 \alpha_t^2}{n_t^2} \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} X^{v(i)} A^{t(i)} \tag{89}$$

We use Eqs.31, 32 to calculate the average of the first term in Eq.89

$$\mathbb{E} \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)} = n_v m\left(1 - 2\alpha_t + \alpha_t^2 \mu_2\right) I_p \tag{90}$$

We use Eqs.31, 32, 33, 41, 36, 37, 38, 39 to calculate the average of the second term

$$\mathbb{E} \sum_{i=1}^{m}\left(A^{t(i)T} X^{v(i)T} X^{v(i)} A^{t(i)}\right)^2 = \mathbb{E}_{X^t} \sum_{i=1}^{m}\left[n_v\left(n_v+1\right) A^{t(i)4} + n_v\, A^{t(i)2}\, \mathrm{Tr}\left(A^{t(i)2}\right)\right] = \\ = m n_v\left(n_v+1\right)\left(1 - 4\alpha_t + 6\alpha_t^2 \mu_2 - 4\alpha_t^3 \mu_3 + \alpha_t^4 \mu_4\right) I_p + m n_v p\left(1 - 4\alpha_t + 2\alpha_t^2 \mu_2 + 4\alpha_t^2 \mu_{1,1} - 4\alpha_t^3 \mu_{2,1} + \alpha_t^4 \mu_{2,2}\right) I_p \tag{91, 92}$$

Finally, we compute the average of the third term, using Eqs.31, 32, 33, 34, 41, 36, 37

$$\mathbb{E} \sum_{i=1}^{m} A^{t(i)T} X^{v(i)T} X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T} X^{v(i)} A^{t(i)} = \\ = \mathbb{E}_{X^t} \sum_{i=1}^{m}\left[n_v\left(n_v+1\right) A^{t(i)T} X^{t(i)T} X^{t(i)} A^{t(i)} + n_v\, A^{t(i)T} A^{t(i)}\, \mathrm{Tr}\left(X^{t(i)T} X^{t(i)}\right)\right] = \\ = m n_v\left(n_v+1\right) n_t\left(1 - 2\alpha_t \mu_2 + \alpha_t^2 \mu_3\right) I_p + m n_v n_t p\left(1 - 2\alpha_t \mu_{1,1} + \alpha_t^2 \mu_{2,1}\right) I_p \tag{93, 94, 95}$$

Putting everything together in Eq.85, and applying the trace operator, we find the following expression for the meta-parameter variance

$$\mathbb{E}\left|\omega^* - w_0\right|^2 = \frac{p}{n_v m\left(1 - 2\alpha_t + \alpha_t^2 \mu_2\right)^2}\left\{\sigma^2\left[1 - 2\alpha_t + \alpha_t^2 \mu_2 + \frac{\alpha_t^2}{n_t}\left(\left(n_v+1\right)\left(1 - 2\alpha_t \mu_2 + \alpha_t^2 \mu_3\right) + p\left(1 - 2\alpha_t \mu_{1,1} + \alpha_t^2 \mu_{2,1}\right)\right)\right] + \frac{\nu^2}{p}\left[\left(n_v+1\right)\left(1 - 4\alpha_t + 6\alpha_t^2 \mu_2 - 4\alpha_t^3 \mu_3 + \alpha_t^4 \mu_4\right) + p\left(1 - 4\alpha_t + 2\alpha_t^2 \mu_2 + 4\alpha_t^2 \mu_{1,1} - 4\alpha_t^3 \mu_{2,1} + \alpha_t^4 \mu_{2,2}\right)\right]\right\} + O\left(\left(m\xi\right)^{-3/2}\right) \tag{96}$$

We rewrite this expression as

$$\mathbb{E}\left|\omega^* - w_0\right|^2 = \frac{p}{h_t^2\, n_v m}\left\{\sigma^2\left[h_t + \frac{\alpha_t^2}{n_t}\left(\left(n_v+1\right) g_1 + p\, g_2\right)\right] + \frac{\nu^2}{p}\left[\left(n_v+1\right) g_3 + p\, g_4\right]\right\} + O\left(\left(m\xi\right)^{-3/2}\right) \tag{97}$$

where we defined the following expressions for the gi

$$g_1 = 1 - 2\alpha_t \mu_2 + \alpha_t^2 \mu_3 \tag{98}$$

$$g_2 = 1 - 2\alpha_t \mu_{1,1} + \alpha_t^2 \mu_{2,1} \tag{99}$$

$$g_3 = 1 - 4\alpha_t + 6\alpha_t^2 \mu_2 - 4\alpha_t^3 \mu_3 + \alpha_t^4 \mu_4 \tag{100}$$

$$g_4 = 1 - 4\alpha_t + 2\alpha_t^2 \mu_2 + 4\alpha_t^2 \mu_{1,1} - 4\alpha_t^3 \mu_{2,1} + \alpha_t^4 \mu_{2,2} \tag{101}$$

and the µ are equal to

$$\mu_2 = \frac{1}{n_t}\left(n_t + p + 1\right) \tag{102}$$

$$\mu_3 = \frac{1}{n_t^2}\left(n_t^2 + p^2 + 3 n_t p + 3 n_t + 3 p + 4\right) \tag{103}$$

$$\mu_4 = \frac{1}{n_t^3}\left(n_t^3 + p^3 + 6 n_t^2 p + 6 n_t p^2 + 6 n_t^2 + 6 p^2 + 17 n_t p + 21 n_t + 21 p + 20\right) \tag{104}$$

$$\mu_{1,1} = \frac{1}{n_t^2 p}\left(n_t^2 p + 2 n_t\right) \tag{105}$$

$$\mu_{2,1} = \frac{1}{n_t^2 p}\left(n_t^2 p + n_t p^2 + n_t p + 4 n_t + 4 p + 4\right) \tag{106}$$

$$\mu_{2,2} = \frac{1}{n_t^3 p}\left(n_t^3 p + n_t p^3 + 2 n_t^2 p^2 + 2 n_t^2 p + 2 n_t p^2 + 8 n_t^2 + 8 p^2 + 21 n_t p + 20 n_t + 20 p + 20\right) \tag{107}$$

Substituting this expression back into Eq.65 returns the final expression for the average test loss, equal to

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1+\frac{\alpha_r^2 p}{n_r}\right) + \frac{h_r}{2}\left\{\nu^2 + \frac{p}{h_t^2\, n_v m}\left[\sigma^2\left(h_t + \frac{\alpha_t^2}{n_t}\left(\left(n_v+1\right) g_1 + p\, g_2\right)\right) + \frac{\nu^2}{p}\left(\left(n_v+1\right) g_3 + p\, g_4\right)\right]\right\} + O\left(\left(m\xi\right)^{-3/2}\right) \tag{108}$$

7.4 PROOF OF THEOREM 3

In this section, we relax some assumptions on the distributions of data and parameters. In particular, we do not assume a specific distribution for the input data vectors x and the generating parameter vector w, besides that different data vectors are independent, and so are data and parameters for different tasks. We further assume that those vectors have zero mean, and denote their covariances as

$$\Sigma = \mathbb{E}\, x x^T \tag{109}$$

$$\Sigma_w = \mathbb{E}\, w w^T \tag{110}$$

We will also use the following matrix, including fourth order moments

$$F = \mathbb{E}\left[\left(x^T \Sigma x\right) x x^T\right] \tag{111}$$

We do not make any assumption about the distribution of x, but we note that, if x is Gaussian, then $F = 2\Sigma^3 + \Sigma\,\mathrm{Tr}\,\Sigma^2$. We keep the assumption that the output noise is Gaussian and independent for different data points and tasks, with variance σ². Using the same notation as in the previous sections, we will also use the following expressions (for any p × p matrix A)

$$\mathbb{E}\left[X^T X\right] = n\,\Sigma \tag{112}$$

$$\mathbb{E}\,\mathrm{Tr}\left(\Sigma\, X^T X\, A\, X^T X\right) = \mathrm{Tr}\left[A\left(n^2 \Sigma^3 + n\left(F - \Sigma^3\right)\right)\right] \tag{113}$$

We proceed to derive the same formula under these less restrictive assumptions, in the overparameterized case only, following the same derivation as section 7.3. We further assume ω0 = 0, w0 = 0.
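For Gaussian inputs, the claimed value $F = 2\Sigma^3 + \Sigma\,\mathrm{Tr}\,\Sigma^2$ can be verified by simulation; a sketch of this check (our own):

```python
import numpy as np

rng = np.random.default_rng(3)
p, trials = 4, 200_000
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                                      # an arbitrary covariance
L = np.linalg.cholesky(Sigma)
x = rng.standard_normal((trials, p)) @ L.T               # x ~ N(0, Sigma)
quad = np.einsum('ti,ij,tj->t', x, Sigma, x)             # x^T Sigma x per sample
F_mc = np.einsum('t,ti,tj->ij', quad, x, x) / trials     # E[(x^T Sigma x) x x^T]
F_th = 2 * np.linalg.matrix_power(Sigma, 3) + Sigma * np.trace(Sigma @ Sigma)
print(np.max(np.abs(F_mc - F_th)))  # small for large trial counts
```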
Again we start from the expression in Eq.24 for the test output, and we rewrite the test loss in Eq.27 as

$$\bar{L}^{test} = \mathbb{E}\; \frac{1}{2 n_s}\left|X^s\left(w^* - \theta^*\right) + z^s\right|^2 \tag{114}$$

We average this expression with respect to Xs, zs, noting that θ* does not depend on the test data. We further average with respect to w*, but note that θ* depends on the test parameters, so we average only terms that do not depend on θ*. Using Eq.112, the result is

$$\bar{L}^{test} = \frac{\sigma^2}{2} + \frac{1}{2}\,\mathrm{Tr}\left(\Sigma\,\Sigma_w\right) + \mathbb{E}\left[\frac{1}{2}\,\theta^{*T} \Sigma\,\theta^* - w^{*T} \Sigma\,\theta^*\right] \tag{115}$$

The second term in the expectation is linear in θ* and can be averaged over Xr, zr, using Eq.25 and noting that ω* does not depend on the target data. The result is

$$\mathbb{E}_{X^r}\, \mathbb{E}_{z^r}\, \theta^* = \left(I - \alpha_r \Sigma\right)\omega^* + \alpha_r \Sigma\, w^* \tag{116}$$

Furthermore, we show below (Eq.128) that the following average holds

$$\mathbb{E}_w\, \mathbb{E}_{z^t}\, \mathbb{E}_{z^v}\, \omega^* = 0 \tag{117}$$

Combining Eqs.116, 117, we can calculate the second term in the expectation of Eq.115 and find

$$\bar{L}^{test} = \frac{\sigma^2}{2} + \frac{1}{2}\,\mathrm{Tr}\left(\Sigma\,\Sigma_w\right) - \alpha_r\,\mathrm{Tr}\left(\Sigma^2 \Sigma_w\right) + \mathbb{E}\,\frac{1}{2}\,\theta^{*T} \Sigma\,\theta^* \tag{118}$$

We start by averaging the third term of this expression over zr, w*, using Eq.25 and noting that ω* does not depend on the target data and test parameters. The result is

$$\mathbb{E}_{w^*}\, \mathbb{E}_{z^r}\, \theta^{*T} \Sigma\,\theta^* = \mathrm{Tr}\left[\Sigma\left(I - \frac{\alpha_r}{n_r} X^{rT} X^r\right)\omega^* \omega^{*T}\left(I - \frac{\alpha_r}{n_r} X^{rT} X^r\right)\right] + \frac{\alpha_r^2 \sigma^2}{n_r^2}\,\mathrm{Tr}\left(X^r \Sigma X^{rT}\right) + \frac{\alpha_r^2}{n_r^2}\,\mathrm{Tr}\left(\Sigma\, X^{rT} X^r\, \Sigma_w\, X^{rT} X^r\right) \tag{119, 120}$$

We now average over Xr, again noting that ω* does not depend on the target data. Using Eqs.112, 113, we find

$$\mathbb{E}_{X^r}\, \mathbb{E}_{w^*}\, \mathbb{E}_{z^r}\, \theta^{*T} \Sigma\,\theta^* = \mathrm{Tr}\left[\omega^* \omega^{*T}\left(\Sigma\left(I - \alpha_r \Sigma\right)^2 + \frac{\alpha_r^2}{n_r}\left(F - \Sigma^3\right)\right)\right] + \frac{\alpha_r^2 \sigma^2}{n_r}\,\mathrm{Tr}\,\Sigma^2 + \alpha_r^2\,\mathrm{Tr}\left[\Sigma_w\left(\Sigma^3 + \frac{1}{n_r}\left(F - \Sigma^3\right)\right)\right] \tag{121, 122}$$

We can now rewrite the average test loss in Eq.118 as

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1 + \frac{\alpha_r^2}{n_r}\,\mathrm{Tr}\,\Sigma^2\right) + \frac{1}{2}\,\mathrm{Tr}\left[\left(\Sigma_w + \mathbb{E}\,\omega^* \omega^{*T}\right) H_r\right] \tag{123}$$

where we define the following matrix

$$H_r = \Sigma\left(I - \alpha_r \Sigma\right)^2 + \frac{\alpha_r^2}{n_r}\left(F - \Sigma^3\right) \tag{124}$$

In order to average the last term, we need an expression for ω*. We note that the loss in Eq.20 is quadratic in ω, therefore the solution in Eq.22 can be found using standard linear algebra. In particular, the loss in Eq.20 can be rewritten as

$$L^{meta} = \frac{1}{2 n_v m}\left|\gamma - B\omega\right|^2 \tag{125}$$

where γ is a vector of shape nvm × 1, and B is a matrix of shape nvm × p. The vector γ is a stack of m vectors

$$\gamma = \begin{pmatrix} X^{v(1)}\left(I - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)}\right) w^{(1)} - \frac{\alpha_t}{n_t}\, X^{v(1)} X^{t(1)T} z^{t(1)} + z^{v(1)} \\ \vdots \\ X^{v(m)}\left(I - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)}\right) w^{(m)} - \frac{\alpha_t}{n_t}\, X^{v(m)} X^{t(m)T} z^{t(m)} + z^{v(m)} \end{pmatrix} \tag{126}$$

Similarly, the matrix B is a stack of m matrices

$$B = \begin{pmatrix} X^{v(1)}\left(I - \frac{\alpha_t}{n_t} X^{t(1)T} X^{t(1)}\right) \\ \vdots \\ X^{v(m)}\left(I - \frac{\alpha_t}{n_t} X^{t(m)T} X^{t(m)}\right) \end{pmatrix} \tag{127}$$

In the overparameterized case (p > nvm), under the assumption that the inverse of BB^T exists, the value of ω that minimizes Eq.125, and that also has minimum norm, is equal to

$$\omega^* = B^T\left(B B^T\right)^{-1}\gamma \tag{128}$$

Note that the matrix B does not depend on w, zt, zv, and $\mathbb{E}_w \mathbb{E}_{z^t} \mathbb{E}_{z^v}\,\gamma = 0$, therefore Eq.117 holds. In order to finish calculating Eq.123, we need to average the following term

$$\mathrm{Tr}\left(H_r\, \omega^* \omega^{*T}\right) = \mathrm{Tr}\left[\left(B B^T\right)^{-1} \gamma \gamma^T\left(B B^T\right)^{-1} B H_r B^T\right] \tag{129}$$

where we used the cyclic property of the trace. We start by averaging γγ^T over w, zt, zv, since B does not depend on those variables. Note that w, zt, zv are independent of each other and across tasks. We denote by Γ the result of this operation, which is equal to a block diagonal matrix

$$\Gamma = \mathbb{E}_w\, \mathbb{E}_{z^t}\, \mathbb{E}_{z^v}\, \gamma \gamma^T = \begin{pmatrix} \Gamma^{(1)} & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \Gamma^{(m)} \end{pmatrix} \tag{130}$$

where the matrix blocks are given by the following expression

$$\Gamma^{(i)} = X^{v(i)}\left(I - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)}\right) \Sigma_w \left(I - \frac{\alpha_t}{n_t} X^{t(i)T} X^{t(i)}\right) X^{v(i)T} + \sigma^2\left(I_{n_v} + \frac{\alpha_t^2}{n_t^2}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(i)T}\right) \tag{131, 132}$$

Finally, we need to average over the training and validation data

$$\mathbb{E}\,\mathrm{Tr}\left(H_r\, \omega^* \omega^{*T}\right) = \mathbb{E}_{X^t}\, \mathbb{E}_{X^v}\, \mathrm{Tr}\left[\left(B B^T\right)^{-1} \Gamma\left(B B^T\right)^{-1} B H_r B^T\right] \tag{133}$$

These averages are hard to compute since they involve nonlinear functions of the data. However, we can approximate these terms by assuming that p and nt are large, both of order O(ξ), where ξ is a large number.
Furthermore, we assume that $\mathrm{Tr}\,\Sigma_w^2$ is of order O(ξ^{-1}), and that the variances of matrix products of the rescaled inputs $x/\sqrt{p}$, up to sixth order, are all of order O(ξ^{-1}); in particular

$$\mathrm{Var}\left[\frac{1}{p}\, X^{v(i)} X^{v(j)T}\right] = O\left(\xi^{-1}\right) \tag{134}$$

$$\mathrm{Var}\left[\frac{1}{p^2}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{v(j)T}\right] = O\left(\xi^{-1}\right) \tag{135}$$

$$\mathrm{Var}\left[\frac{1}{p^3}\, X^{v(i)} X^{t(i)T} X^{t(i)} X^{t(j)T} X^{t(j)} X^{v(j)T}\right] = O\left(\xi^{-1}\right) \tag{136}$$

Then, using Eqs.112, 113 and the expressions of B (Eq.127) and Γ (Eqs.130, 131), we can prove that

$$B B^T = \mathrm{Tr}\left(H_t\right) I_{n_v m} + O\left(\xi^{1/2}\right) \tag{137}$$

$$\Gamma = \left[\mathrm{Tr}\left(\Sigma_w H_t\right) + \sigma^2\left(1 + \frac{\alpha_t^2}{n_t}\,\mathrm{Tr}\,\Sigma^2\right)\right] I_{n_v m} + O\left(\xi^{1/2}\right) \tag{138}$$

$$B H_r B^T = \mathrm{Tr}\left(H_r H_t\right) I_{n_v m} + O\left(\xi^{1/2}\right) \tag{139}$$

where, similar to Eq.124, we define

$$H_t = \Sigma\left(I - \alpha_t \Sigma\right)^2 + \frac{\alpha_t^2}{n_t}\left(F - \Sigma^3\right) \tag{140}$$

Note that all these terms are of order O(ξ). The inverse of BB^T can be found by a Taylor expansion

$$\left(B B^T\right)^{-1} = \mathrm{Tr}\left(H_t\right)^{-1} I_{n_v m} + O\left(\xi^{-3/2}\right) \tag{141}$$

Substituting these expressions into Eq.133, we find

$$\mathbb{E}\,\mathrm{Tr}\left(H_r\, \omega^* \omega^{*T}\right) = n_v m\;\frac{\mathrm{Tr}\left(H_r H_t\right)\left\{\mathrm{Tr}\left(\Sigma_w H_t\right) + \sigma^2\left[1 + \frac{\alpha_t^2}{n_t}\,\mathrm{Tr}\,\Sigma^2\right]\right\}}{\mathrm{Tr}\left(H_t\right)^2} + O\left(\xi^{-3/2}\right) \tag{142}$$

Substituting this expression into Eq.123, we find the value of the average test loss

$$\bar{L}^{test} = \frac{\sigma^2}{2}\left(1 + \frac{\alpha_r^2}{n_r}\,\mathrm{Tr}\,\Sigma^2\right) + \frac{1}{2}\,\mathrm{Tr}\left(\Sigma_w H_r\right) + \frac{n_v m}{2}\;\frac{\mathrm{Tr}\left(H_r H_t\right)\left\{\mathrm{Tr}\left(\Sigma_w H_t\right) + \sigma^2\left[1 + \frac{\alpha_t^2}{n_t}\,\mathrm{Tr}\,\Sigma^2\right]\right\}}{\mathrm{Tr}\left(H_t\right)^2} + O\left(\xi^{-3/2}\right) \tag{143, 144}$$