# Effective Meta-Regularization by Kernelized Proximal Regularization

Weisen Jiang (1,2), James T. Kwok (2), Yu Zhang (1,3)

(1) Department of Computer Science and Engineering, Southern University of Science and Technology
(2) Department of Computer Science and Engineering, Hong Kong University of Science and Technology
(3) Peng Cheng Laboratory

{wjiangar, jamesk}@cse.ust.hk, yu.zhang.ust@gmail.com

We study the problem of meta-learning, which has proved advantageous for accelerating the learning of new tasks from a few samples. Recent approaches based on deep kernels achieve state-of-the-art performance. However, the regularizers in their base learners are not learnable. In this paper, we propose an algorithm called MetaProx to learn a proximal regularizer for the base learner. We theoretically establish the convergence of MetaProx. Experimental results confirm the advantage of the proposed algorithm.

## 1 Introduction

Humans, by leveraging prior knowledge and experience, can easily learn new tasks from a handful of examples. In contrast, deep networks are data-hungry, and a large number of training samples are required to avoid overfitting. To reduce the labor-intensive and time-consuming process of data labeling, meta-learning (or learning to learn) [3, 37] aims to extract meta-knowledge from seen tasks to accelerate learning on unseen tasks. Recently, meta-learning has been receiving increasing attention due to its diverse successful applications in few-shot learning [41, 12, 39, 35], hyperparameter optimization [14], neural architecture search [24, 44], and reinforcement learning [29].

Many meta-learning algorithms operate on two levels: a base learner learns task-specific models in the inner loop, and a meta-learner learns the meta-parameter in the outer loop. A popular class of algorithms is based on meta-initialization [12, 25, 11, 38], such as the well-known MAML [12]. It learns a model initialization such that a good model for an unseen task can be learned from limited samples by a few gradient updates. However, computing the meta-gradient requires back-propagating through the entire inner optimization path, which is infeasible for large models and/or when there are many gradient steps. During testing, it is common for MAML's base learner to perform many gradient steps to seek a more accurate solution [12]. However, for regression with a linear base learner and the square loss, we will show that although the meta-learner can converge to the optimal meta-initialization, the base learner may overfit the training data at meta-testing.

Another class of meta-learning algorithms is based on meta-regularization [28, 43, 7, 8, 9], in which the base learner learns the task-specific model by minimizing the loss with a proximal regularizer (a regularizer biased towards the meta-parameter). Denevi et al. [7] use a linear model with an efficient closed-form solution for the base learner. However, extending this to nonlinear base learners requires computing the meta-gradient using matrix inversion, which can be infeasible for deep networks [28].
To introduce nonlinearity into the base learner, a recent approach is to make use of the kernel trick. For example, R2D2 [4] and MetaOptNet [22] use deep kernels [42] in meta-learning for few-shot classification. Specifically, the deep network is learned in the meta-learner, while a base kernel is used in the base learner. Though they achieve state-of-the-art performance, their base learners use a Tikhonov regularizer rather than a learnable proximal regularizer as in meta-regularization methods.

As learning a meta-regularizer has been shown to be effective in linear models for regression [7] and classification [8], in this paper we propose a kernel-based algorithm to meta-learn a proximal regularizer for a nonlinear base learner. After the kernel extension, the learnable function in the proximal regularizer is a function in the reproducing kernel Hilbert space (RKHS) induced by the base kernel. The proposed algorithm is guaranteed to converge to a critical point of the meta-loss, and its global convergence is also established. Experiments on various benchmark regression and classification datasets demonstrate the superiority of the proposed algorithm over the state-of-the-art.

Notations. Vectors are denoted by lowercase boldface, and matrices by uppercase boldface. For a vector $x$, $\|x\| = \sqrt{\sum_i x_i^2}$, and $\mathrm{diag}(x)$ constructs a diagonal matrix with $x$ on the diagonal. $\|\cdot\|_{\mathcal{H}}$ is the norm on the Hilbert space $\mathcal{H}$. $\mathcal{N}(0, \sigma^2)$ is the univariate normal distribution with mean zero and variance $\sigma^2$. $\mathcal{N}(m, \Sigma)$ is the multivariate normal distribution with mean $m$ and covariance matrix $\Sigma$.

## 2 Related Work

In meta-learning, a collection $\mathcal{T}$ of tasks is used to learn a meta-parameter $\theta$ and the base learner's parameters $\{w_1, \ldots, w_{|\mathcal{T}|}\}$. Each task $\tau$ is sampled from a given distribution $p(\tau)$, and has a support set $S_\tau = \{(x_i, y_i) : i = 1, \ldots, n_s\}$ and a query set $Q_\tau = \{(x_i, y_i) : i = 1, \ldots, n_q\}$, where $x \in \mathbb{R}^d$ are the features and $y$ the labels. Let $f(\cdot; w)$ be a model parameterized by $w$, and $\mathcal{L}(\mathcal{D}; w) \equiv \sum_{(x,y)\in\mathcal{D}} \ell(f(x; w), y)$ be the supervised loss on a data set $\mathcal{D}$, where $\ell(\cdot, \cdot)$ is a loss function assumed to be convex w.r.t. the first argument. In each meta-training iteration, a batch $B$ of tasks is randomly sampled from $\mathcal{T}$. The base learner takes a task $\tau$ from $B$ and the meta-parameter $\theta$ to build the model $f(\cdot; w_\tau)$. After all tasks in the batch are processed by the base learner, the meta-learner minimizes the loss $\sum_{\tau\in B}\mathcal{L}(Q_\tau; w_\tau)$ w.r.t. $\theta$, and the iteration repeats. During meta-testing, given an unseen task $\tau' \sim p(\tau)$, a model $f(\cdot; w_{\tau'})$ is similarly learned from $S_{\tau'}$ and $\theta$. Finally, its performance is evaluated on $Q_{\tau'}$.

Popular meta-learning algorithms usually construct the task-specific model by: (i) meta-initialization [12, 25, 11, 38], (ii) meta-regularization [7, 8, 9, 28, 43], or (iii) metric learning [35, 4, 22, 27].

A representative meta-learning algorithm based on meta-initialization is MAML [12]. Its base learner learns $w_\tau$ by gradient descent from a learnable initialization. Computing the meta-gradient $\nabla_\theta\mathcal{L}(Q_\tau; w_\tau)$ needs to back-propagate through the inner optimization path and involves second-order derivatives. This can be expensive for large models and/or when there are many gradient steps.

For meta-learning algorithms based on meta-regularization, the base learner learns $w_\tau$ by minimizing a proximally regularized loss

$$\mathcal{L}(S_\tau; w) + \frac{\lambda}{2}\|w - \theta\|^2, \qquad (1)$$

where $\lambda > 0$ is the regularization parameter. The meta-gradient can be computed directly from the learned $w_\tau$, without back-propagating through the inner optimization trajectory [28]. Though more efficient, this still takes $O(n_\theta^3)$ time [28], where $n_\theta$ is the number of parameters in $\theta$.
Metric learning methods have been widely studied in few-shot learning [39, 35, 4, 22, 27]. The meta-learner maps raw samples to an embedding space via a deep network, and then feeds the embeddings to the base learner to train a simple task model. Typical task models include the non-parametric prototype classifier (ProtoNet [35]), linear models such as ridge regression (R2D2 [4]), the SVM classifier (MetaOptNet-SVM [22]), and the softmax classifier (ANIL [27]). In particular, the base learners in R2D2 [4], MetaOptNet [22] and DKT [26] seek solutions in the dual space, and achieve state-of-the-art performance. However, their base learners use a Tikhonov regularizer, which is not learnable as in meta-regularization approaches.

## 3 Meta-Regularization by Kernelized Proximal Regularization

### 3.1 Meta-Initialization versus Meta-Regularization

In this section, we consider a simple regression setting with a linear model and the square loss. Each task $\tau$ is a linear regressor with parameter $w^*_\tau \in \mathbb{R}^d$. We assume that each input $x$ is sampled from $\mathcal{N}(0, \sigma_x^2 I)$, and the output $y$ is obtained as $x^\top w^*_\tau + \xi$, where $\xi \sim \mathcal{N}(0, \sigma_\xi^2)$ is random noise. We compare two representative meta-learning algorithms: (i) MAML [12], which is based on meta-initialization and performs one gradient descent step in the inner loop of the bilevel optimization problem; and (ii) learning around a common mean (denoted Common Mean) [7], which is based on meta-regularization. The latter learns the model parameters for task $\tau$ by minimizing the loss with a proximal regularizer around the meta-parameter $\theta$:

$$w^{(\mathrm{prox})}_\tau = \arg\min_w \sum_{(x_i, y_i)\in S_\tau} \frac{1}{2}(w^\top x_i - y_i)^2 + \frac{\lambda}{2}\|w - \theta\|^2 = (\lambda I + X_\tau^\top X_\tau)^{-1}(\lambda\theta + X_\tau^\top y_\tau), \qquad (2)$$

where $X_\tau = [x_1^\top; \ldots; x_{n_s}^\top]$ is the sample matrix from $S_\tau$, and $y_\tau = [y_1; \ldots; y_{n_s}]$ is the corresponding label vector. Algorithms 1 and 2 show MAML and Common Mean, respectively, for this problem.

Algorithm 1: MAML [12].
Require: step sizes $\gamma$ and $\eta_t$, batch size $b$.
1: for $t = 1, 2, 3, \ldots$ do
2: sample a batch $B_t$ of tasks from $p(\tau)$;
3: base learner:
4: for $\tau \in B_t$ do
5: $w^{(\mathrm{gd})}_\tau = \psi_t - \gamma X_\tau^\top(X_\tau\psi_t - y_\tau)$;
6: $g_\tau = \frac{1}{2}\sum_{(x,y)\in Q_\tau}\nabla_{\psi_t}(x^\top w^{(\mathrm{gd})}_\tau - y)^2$;
7: end for
8: meta-learner: $\psi_{t+1} = \psi_t - \eta_t\sum_{\tau\in B_t} g_\tau$;
9: end for

Algorithm 2: Common Mean [7].
Require: hyperparameter $\lambda$, step size $\eta_t$, batch size $b$.
1: for $t = 1, 2, 3, \ldots$ do
2: sample a batch $B_t$ of tasks from $p(\tau)$;
3: base learner:
4: for $\tau \in B_t$ do
5: $w^{(\mathrm{prox})}_\tau = \arg\min_w \frac{1}{2}\|X_\tau w - y_\tau\|^2 + \frac{\lambda}{2}\|w - \theta_t\|^2$;
6: $g_\tau = \frac{1}{2}\sum_{(x,y)\in Q_\tau}\nabla_{\theta_t}(x^\top w^{(\mathrm{prox})}_\tau - y)^2$;
7: end for
8: meta-learner: $\theta_{t+1} = \theta_t - \eta_t\sum_{\tau\in B_t} g_\tau$;
9: end for

Recently, Balcan et al. [2] studied the convex online meta-learning setting and showed that both approaches achieve the same average task regret. Here, we consider the offline setting. First, the following proposition shows that both MAML and Common Mean converge to the same meta-parameter. All proofs are in the appendix.

Proposition 1. Let $\eta_t = 1/t$ and assume that $\gamma < 1/\sigma_x^2$. Both $\psi_t$ in MAML (with one inner gradient step) and $\theta_t$ in Common Mean converge to $\bar{w} = \mathbb{E}_\tau w^*_\tau$.

The following proposition shows that $\bar w$ in Proposition 1 is also the best $\psi$ (for MAML) or $\theta$ (for Common Mean), in that it attains the smallest population risk for this meta-learning problem.

Proposition 2. Assume that $\gamma < 1/\sigma_x^2$. We have
$$\bar w = \arg\min_\theta \mathbb{E}_\tau\mathbb{E}_{S_\tau}\mathbb{E}_{Q_\tau}\sum_{(x,y)\in Q_\tau}(x^\top w^{(\mathrm{prox})}_\tau - y)^2 = \arg\min_\psi \mathbb{E}_\tau\mathbb{E}_{S_\tau}\mathbb{E}_{Q_\tau}\sum_{(x,y)\in Q_\tau}(x^\top w^{(\mathrm{gd})}_\tau - y)^2.$$

During meta-testing, we sample a task $\tau' \sim p(\tau)$ with parameter $w^*_{\tau'}$. Let $X_{\tau'}$ be the sample matrix from $S_{\tau'}$, and $y_{\tau'}$ the corresponding label vector. We assume that $X_{\tau'}$ is full rank and $n_s < d$. To simplify notations, we drop the subscript $\tau'$ in the following. Let the singular value decomposition of $X$ be $U\Sigma V^\top$ (where $\Sigma = \mathrm{diag}([\nu_1, \ldots, \nu_{n_s}])$), and let $V_\perp$ be $V$'s orthogonal complement.
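To make the two inner loops concrete, the following is a minimal NumPy sketch of the base learners in Algorithms 1 and 2 on a toy task. The function names and the toy data are illustrative, not part of the paper.

```python
import numpy as np

def maml_inner_step(psi, X, y, gamma):
    """One gradient step of the MAML base learner (Algorithm 1, step 5):
    w = psi - gamma * X^T (X psi - y)."""
    return psi - gamma * X.T @ (X @ psi - y)

def common_mean_inner(theta, X, y, lam):
    """Closed-form proximal base learner (Eq. (2)):
    w = (lam*I + X^T X)^{-1} (lam*theta + X^T y)."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, lam * theta + X.T @ y)

# toy 2-shot task: d = 5 features, n_s = 2 support samples
rng = np.random.default_rng(0)
d, ns = 5, 2
X, w_star = rng.normal(size=(ns, d)), rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=ns)   # noisy linear labels
theta = np.zeros(d)                          # current meta-parameter
print(maml_inner_step(theta, X, y, gamma=0.1))
print(common_mean_inner(theta, X, y, lam=1.0))
```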
As only forward passes are needed at meta-testing, it is common for the base learner in MAML to perform multiple gradient steps [12]. With the convex loss and linear model here, the base learner can obtain a globally optimal solution $w^{(\mathrm{gd}_\infty)}$ directly (which is equivalent to taking infinitely many gradient steps). As is common in few-shot learning, the number of support samples is much smaller than the feature dimensionality. Hence, $w^{(\mathrm{gd}_\infty)}$ is not unique, but depends on the learned initialization. Write $w^{(\mathrm{gd}_\infty)}$ as $w^{(\mathrm{gd}_\infty)} = V a^{(\mathrm{gd}_\infty)} + V_\perp b^{(\mathrm{gd}_\infty)}$. For gradient descent, the update direction $X^\top(Xw - y) = V\Sigma U^\top(Xw - y)$ is always in the span of $V$, and so $b^{(\mathrm{gd}_\infty)}$ remains unchanged. Similarly, write $w^* = Va^* + V_\perp b^*$ and $\theta = Va_0 + V_\perp b_0$. Moreover, let $\Delta a = a_0 - a^*$ and $\Delta b = b_0 - b^*$.

Proposition 3 ([17]). Assume that $\gamma < \min_{1\le j\le n_s} 1/\nu_j^2$. We have $\mathbb{E}_\xi\|w^{(\mathrm{gd}_\infty)} - w^*\|^2 = \|b^{(\mathrm{gd}_\infty)} - b^*\|^2 + \sum_{j=1}^{n_s}\left(\frac{\sigma_\xi}{\nu_j}\right)^2$, where the expectation is over the label noise vector $\xi$.

For Common Mean, using the Woodbury matrix identity,
$$w^{(\mathrm{prox})} = \theta + X^\top(\lambda I + XX^\top)^{-1}(y - X\theta) = V\big(a_0 + \Sigma U^\top(\lambda I + XX^\top)^{-1}(y - X\theta)\big) + V_\perp b_0 \equiv V a^{(\mathrm{prox})} + V_\perp b^{(\mathrm{prox})},$$
where $b^{(\mathrm{prox})} = b_0$. Assume that $w^{(\mathrm{gd}_\infty)}$ is initialized with $\theta$. Since $b^{(\mathrm{gd}_\infty)}$ remains unchanged, $w^{(\mathrm{prox})}$ and $w^{(\mathrm{gd}_\infty)}$ only differ in the components lying in the column space of $V$.

Proposition 4. $\mathbb{E}_\xi\|w^{(\mathrm{prox})} - w^*\|^2 = \|\Delta b\|^2 + \sum_{j=1}^{n_s}\left(\frac{\lambda\,\Delta a_j}{\lambda + \nu_j^2}\right)^2 + \sum_{j=1}^{n_s}\left(\frac{\sigma_\xi}{(\lambda/\nu_j) + \nu_j}\right)^2$, where the expectation is over the label noise vector $\xi$.

As can be seen, when the labels are noise-free ($\sigma_\xi^2 = 0$), $w^{(\mathrm{gd}_\infty)}$ performs better than $w^{(\mathrm{prox})}$. However, when the labels are noisy, as $n_s < d$, gradient descent always converges to zero training error and overfits the noisy labels. On the other hand, the estimation error of $w^{(\mathrm{prox})}$ equals that of $w^{(\mathrm{gd}_\infty)}$ when $\lambda = 0$. For $\lambda > 0$, it trades off between fitting the noisy labels (the last term in Proposition 4) and introducing an estimation bias on $a$ (the second term in Proposition 4).

### 3.2 Proposed Algorithm

It is straightforward to use the dual formulation for the Common Mean algorithm. When the square loss is used as $\ell(\cdot,\cdot)$, it is easy to see that the dual variable has the closed-form solution

$$\alpha_\tau = (I + \lambda^{-1}X_\tau X_\tau^\top)^{-1}(y_\tau - X_\tau\theta). \qquad (3)$$

Compared with the primal formulation, we only need to invert an $n_s \times n_s$ matrix (instead of the $d\times d$ matrix $\lambda I + X_\tau^\top X_\tau$). In meta-learning, usually $n_s \ll d$ (e.g., $n_s = 5$). From the dual solution $\alpha_\tau$, the primal solution can be recovered as $w_\tau = \theta + \lambda^{-1}X_\tau^\top\alpha_\tau$. Given a query example $(x, y)\in Q_\tau$, the model predicts $\hat y = x^\top w_\tau = x^\top\theta + \lambda^{-1}x^\top X_\tau^\top\alpha_\tau$. The loss gradient is $\nabla_\theta\ell(\hat y, y) = \nabla_1\ell(\hat y, y)\,\nabla_\theta\hat y$, where $\nabla_1\ell(\hat y, y)$ denotes the gradient w.r.t. the first argument, $\nabla_\theta\hat y = x + \lambda^{-1}(\nabla_\theta\alpha_\tau)^\top X_\tau x$, and $\nabla_\theta\alpha_\tau = -(I + \lambda^{-1}X_\tau X_\tau^\top)^{-1}X_\tau$. The complexity of computing $\nabla_\theta\ell(\hat y, y)$ is thus very low: $O(n_s^3 + n_s^2 d)$.
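As a sanity check on the dual form, here is a small NumPy sketch of Eq. (3); the helper names are ours, and the final assertion verifies the Woodbury-based primal recovery against the primal closed form of Eq. (2).

```python
import numpy as np

def common_mean_dual(theta, X, y, lam):
    """Dual solution of Eq. (3): alpha = (I + X X^T / lam)^{-1} (y - X theta).
    Only an n_s x n_s system is solved, versus d x d in the primal form."""
    ns = X.shape[0]
    return np.linalg.solve(np.eye(ns) + (X @ X.T) / lam, y - X @ theta)

def predict(theta, X, alpha, x, lam):
    """Query prediction: y_hat = x^T theta + x^T X^T alpha / lam, i.e. the
    recovered primal w = theta + X^T alpha / lam applied to the query x."""
    return x @ theta + (x @ X.T @ alpha) / lam

rng = np.random.default_rng(1)
d, ns, lam = 100, 5, 1.0      # n_s << d, as is typical in few-shot learning
X, theta = rng.normal(size=(ns, d)), rng.normal(size=d)
y = rng.normal(size=ns)
alpha = common_mean_dual(theta, X, y, lam)
print(predict(theta, X, alpha, rng.normal(size=d), lam))

# the dual recovery matches the primal closed form of Eq. (2)
w_primal = np.linalg.solve(lam * np.eye(d) + X.T @ X, lam * theta + X.T @ y)
assert np.allclose(theta + X.T @ alpha / lam, w_primal)
```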
The dual formulation also allows introducing nonlinearity with the kernel trick. Based on deep kernels [42], recent state-of-the-art methods (R2D2 [4], MetaOptNet [22], and DKT [26]) use a base kernel in the base learner and update the deep network in the meta-learner. However, their regularizers are not learnable. We consider learning a proximal regularizer. An input $x$ is mapped to $z = \mathrm{NN}(x; \varphi)$ in an embedding space $\mathcal E$ via a deep network parameterized by $\varphi$. With the dual formulation, $\theta$ in (2) allows extra flexibility over [22, 4, 7]. Specifically, $\theta$ becomes a function $f_\theta \in \mathcal H$, the reproducing kernel Hilbert space (RKHS) corresponding to a given kernel function $K$ on $\mathcal E\times\mathcal E$. The base learner obtains a model $f_\tau \in \mathcal H$ by minimizing

$$\sum_{(x_i, y_i)\in S_\tau}\ell(f(z_i), y_i) + \frac{\lambda}{2}\|f - f_\theta\|^2_{\mathcal H}, \qquad (4)$$

where $\|\cdot\|_{\mathcal H}$ is the norm on $\mathcal H$. By setting $f_\theta = 0$, this recovers the state-of-the-art methods MetaOptNet [22], R2D2 [4], and DKT [26]. However, Proposition 4 suggests that a good $f_\theta$ provides useful meta-knowledge by biasing the task model.

By the representer theorem [34], the solution of (4) is $f_\tau = f_\theta + \sum_{(x_i,y_i)\in S_\tau}\alpha_{\tau,i} K_{z_i}$, where $K_{z_i} = K(z_i, \cdot)\in\mathcal H$, and $\alpha_\tau = [\alpha_{\tau,1}; \ldots; \alpha_{\tau,n_s}]$ is obtained from the convex program

$$\min_{\alpha_\tau}\; \sum_{(x_i,y_i)\in S_\tau}\ell(f_\tau(z_i), y_i) + \alpha_\tau^\top K(Z_\tau, Z_\tau)\alpha_\tau, \qquad (5)$$

where $Z_\tau = [z_1^\top; \ldots; z_{n_s}^\top]$ and $K(Z_\tau, Z_\tau)$ is the kernel matrix. Note that the hyperparameter $\lambda$ in (4) is absorbed into $z$, as the network is learnable. With the square loss, the dual solution of (5) is $\alpha_\tau = (I + K(Z_\tau, Z_\tau))^{-1}(y_\tau - f_\theta(Z_\tau))$, where $f_\theta(Z_\tau) = [f_\theta(z_1); \ldots; f_\theta(z_{n_s})]$. For general loss functions, the dual problem has no closed-form solution, but it has only $n_s$ variables (which is usually small) and can be solved efficiently.

After the base learner has obtained the dual solution $\alpha_\tau$, the meta-learner updates $f_\theta$ and the network parameter $\varphi$ by one gradient descent step on the validation loss $\sum_{(x,y)\in Q_\tau}\ell(\hat y, y)$, where $\hat y \equiv f_\tau(z) = f_\theta(z) + K(Z_\tau, z)^\top\alpha_\tau$. Using the chain rule, $\nabla_{(\theta,\varphi)}\ell(\hat y, y) = \nabla_1\ell(\hat y, y)\,\nabla_{(\theta,\varphi)}\hat y$. The first component $\nabla_1\ell(\hat y, y)$ can be computed directly, and the second component is

$$\nabla_{(\theta,\varphi)}\hat y = \nabla_{(\theta,\varphi)}f_\theta(z) + \big(\nabla_{(\theta,\varphi)}K(Z_\tau, z)\big)^\top\alpha_\tau + \big(\nabla_{(\theta,\varphi)}\alpha_\tau\big)^\top K(Z_\tau, z). \qquad (6)$$

Both $\nabla_{(\theta,\varphi)}f_\theta(z)$ and $\nabla_{(\theta,\varphi)}K(Z_\tau, z)$ can be obtained by direct differentiation. By the chain rule, $\nabla_{(\theta,\varphi)}\alpha_\tau = \nabla_p\alpha_\tau\,\nabla_{(\theta,\varphi)}p$, where $p = [f_\theta(z_1); \ldots; f_\theta(z_{n_s}); K(Z_\tau, z_1); \ldots; K(Z_\tau, z_{n_s})]\in\mathbb{R}^{n_s + n_s^2}$ is the input to the dual problem, and $\nabla_{(\theta,\varphi)}p$ can be computed directly. When the square loss is used, $\nabla_p\alpha_\tau = -(I + K(Z_\tau, Z_\tau))^{-1}\big[\,I \mid I\otimes\alpha_\tau^\top\,\big]$. For a general loss, $\alpha_\tau$ is obtained by solving the convex program. Hence, $\alpha_\tau$ depends implicitly on $p$, and $\nabla_p\alpha_\tau$ can be obtained by implicit differentiation. Denote the dual objective in (5) by $g(p, \alpha)$. By the implicit function theorem [32],

$$\nabla_p\alpha_\tau = -\big(\nabla^2_\alpha g(p, \alpha_\tau)\big)^{-1}\nabla^2_{p\alpha}g(p, \alpha_\tau),$$

where

$$\nabla^2_\alpha g(p, \alpha_\tau) = \sum_{(x_i,y_i)\in S_\tau}\nabla^2_1\ell(f_\tau(z_i), y_i)\,K(Z_\tau, z_i)K(Z_\tau, z_i)^\top + K(Z_\tau, Z_\tau),$$
$$\nabla^2_{p\alpha}g(p, \alpha_\tau) = \big[\,K(Z_\tau, Z_\tau)D \mid (K(Z_\tau, Z_\tau)D)\otimes\alpha_\tau^\top + v^\top\otimes I + I\otimes\alpha_\tau^\top\,\big],$$
$$D = \mathrm{diag}\big([\nabla^2_1\ell(f_\tau(z_1), y_1); \ldots; \nabla^2_1\ell(f_\tau(z_{n_s}), y_{n_s})]\big), \quad v = [\nabla_1\ell(f_\tau(z_1), y_1); \ldots; \nabla_1\ell(f_\tau(z_{n_s}), y_{n_s})],$$

and $\otimes$ is the Kronecker product.

The whole procedure, called MetaProx, is shown in Algorithm 3. Let $n_\varphi$ and $n_\theta$ be the numbers of parameters in $\varphi$ and $\theta$, respectively. Computing $\nabla_{(\theta,\varphi)}\ell(\hat y, y)$ takes $O(n_s^3 + n_s^2(n_\theta + n_\varphi))$ time, which is linear in the number of meta-parameters. This is lower than other meta-learning algorithms (e.g., MAML [12] with a single step takes $O(n_\varphi^2)$ time; iMAML [28]: $O(n_\varphi^3)$; Common Mean [7]: $O(d^3)$).

Algorithm 3: MetaProx.
Require: step size $\eta_t$, batch size $b$.
1: for $t = 1, 2, \ldots, T$ do
2: sample a batch $B_t$ of tasks from $\mathcal T$;
3: base learner:
4: for $\tau \in B_t$ do
5: $z_i = \mathrm{NN}(x_i; \varphi_t)$ for each $(x_i, y_i)\in S_\tau$;
6: let $f_\tau(z; \alpha) \equiv f_{\theta_t}(z) + K(Z_\tau, z)^\top\alpha$ denote the task model w.r.t. the dual variables;
7: $\alpha_\tau = \arg\min_\alpha \sum_{(x_i,y_i)\in S_\tau}\ell(f_\tau(z_i; \alpha), y_i) + \alpha^\top K(Z_\tau, Z_\tau)\alpha$;
8: $g_\tau = \sum_{(x,y)\in Q_\tau}\nabla_{(\theta_t,\varphi_t)}\ell(\hat y, y)$, where $\hat y = f_\tau(z; \alpha_\tau)$ and $z = \mathrm{NN}(x; \varphi_t)$;
9: end for
10: meta-learner: $(\theta_{t+1}, \varphi_{t+1}) = (\theta_t, \varphi_t) - \eta_t\sum_{\tau\in B_t} g_\tau$;
11: end for
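For the square-loss case, the base learner and its prediction reduce to a few lines. Below is a minimal NumPy sketch; the RBF kernel and all names here are illustrative assumptions (the experiments in Section 4 use linear and cosine base kernels).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF base kernel between row-wise embeddings A (m x e) and B (n x e).
    Any PSD base kernel works here; RBF is used only for illustration."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def base_learner_square_loss(f_theta_Z, K_ZZ, y):
    """Dual solution of (5) for the square loss:
    alpha = (I + K(Z, Z))^{-1} (y - f_theta(Z))."""
    ns = K_ZZ.shape[0]
    return np.linalg.solve(np.eye(ns) + K_ZZ, y - f_theta_Z)

def task_predict(f_theta_z, K_Zz, alpha):
    """Task-model prediction: y_hat = f_theta(z) + K(Z, z)^T alpha."""
    return f_theta_z + K_Zz @ alpha

rng = np.random.default_rng(2)
e, ns = 16, 5                       # embedding dimension, support size
Z = rng.normal(size=(ns, e))        # support embeddings z_i = NN(x_i; phi)
y = rng.normal(size=ns)
theta = rng.normal(size=e)
f_theta = lambda Z_: Z_ @ theta     # linear f_theta(z) = theta^T z, as in Sec. 4.1
alpha = base_learner_square_loss(f_theta(Z), rbf_kernel(Z, Z), y)
z_query = rng.normal(size=(1, e))   # embedding of one query sample
print(task_predict(f_theta(z_query), rbf_kernel(z_query, Z)[0], alpha))
```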
Classification. We now consider the extension from regression to $N$-way classification. For task $\tau$, the base learner learns the model $f_\tau = [f_\tau^{(1)}; \ldots; f_\tau^{(N)}]$ for the $N$ classes by minimizing

$$\min_{f^{(1)},\ldots,f^{(N)}\in\mathcal H}\; \sum_{(x_i,y_i)\in S_\tau}\ell(\hat y_i, y_i) + \frac{\lambda}{2}\sum_{c=1}^{N}\|f^{(c)} - f_{\theta^{(c)}}\|^2_{\mathcal H}, \qquad (7)$$

where $\hat y_i = [f^{(1)}(z_i); \ldots; f^{(N)}(z_i)]$, and $f_{\theta^{(1)}}, \ldots, f_{\theta^{(N)}}\in\mathcal H$ are functions learned by the meta-learner. By the representer theorem [34], $f_\tau = [f_{\theta^{(1)}} + K(Z_\tau, \cdot)^\top\alpha^{(1)}_\tau; \ldots; f_{\theta^{(N)}} + K(Z_\tau, \cdot)^\top\alpha^{(N)}_\tau]$, where $[\alpha^{(1)}_\tau; \cdots; \alpha^{(N)}_\tau]$ is obtained from the convex program $\min_{\alpha^{(1)},\ldots,\alpha^{(N)}}\sum_{(x_i,y_i)\in S_\tau}\ell(\hat y_i, y_i) + \sum_{c=1}^{N}\alpha^{(c)\top}K(Z_\tau, Z_\tau)\alpha^{(c)}$. As $[\alpha^{(1)}_\tau; \cdots; \alpha^{(N)}_\tau]\in\mathbb{R}^{Nn_s}$ and both $N$ and $n_s$ are typically very small (e.g., $N = 5$ and $n_s = 5$ in 5-way 5-shot classification), this convex program can be solved efficiently. The meta-learner then updates the network parameter $\varphi$ and $\{f_{\theta^{(1)}}, \ldots, f_{\theta^{(N)}}\}$ by one gradient descent step on the validation loss $\sum_{(x_i,y_i)\in Q_\tau}\ell(\hat y_i, y_i)$. Computing $\nabla_{(\theta^{(1)},\ldots,\theta^{(N)},\varphi)}\ell(\hat y, y)$ takes $O\big((Nn_s)^3 + (Nn_s)^2(n_\varphi + \sum_{c=1}^{N}n_{\theta^{(c)}})\big)$ time, where $n_{\theta^{(c)}}$ is the size of $\theta^{(c)}$. This is again linear in the number of meta-parameters, and is thus very efficient.
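As an illustration of this convex program, the following is a hedged CVXPY sketch of the $N$-way dual with the cross-entropy loss (the solver route that the classification experiments in Section 4.2 take via CVXPYLayers [1]). The helper name, the linear kernel, and the jitter term are assumptions for the sketch, not the paper's exact implementation.

```python
import numpy as np
import cvxpy as cp

def nway_dual(F0, K, Y, jitter=1e-6):
    """Solve the N-way dual program with the cross-entropy loss:
    min_A  sum_i CE(F0_i + (K A)_i, y_i) + sum_c alpha^(c)T K alpha^(c),
    where A = [alpha^(1), ..., alpha^(N)] is n_s x N and Y is one-hot."""
    ns, N = F0.shape
    L = np.linalg.cholesky(K + jitter * np.eye(ns))    # K ~= L L^T
    A = cp.Variable((ns, N))
    logits = F0 + K @ A
    ce = cp.sum(cp.log_sum_exp(logits, axis=1)) - cp.sum(cp.multiply(Y, logits))
    reg = cp.sum_squares(L.T @ A)   # = sum_c alpha^(c)T (K + jitter I) alpha^(c)
    cp.Problem(cp.Minimize(ce + reg)).solve()
    return A.value

# toy 5-way 5-shot episode
rng = np.random.default_rng(3)
N, ns, e = 5, 5, 16
Z = rng.normal(size=(ns, e))
K = Z @ Z.T                                   # linear base kernel, illustrative
F0 = rng.normal(size=(ns, N))                 # [f_theta^(c)(z_i)] from the meta-learner
Y = np.eye(N)[rng.integers(0, N, size=ns)]    # one-hot labels
A = nway_dual(F0, K, Y)
print((F0 + K @ A).argmax(1))                 # predicted classes on the support set
```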
### 3.3 Theoretical Analysis

Let $\mathcal L_{\mathrm{meta}}(\theta, \varphi) = \frac{1}{|\mathcal T|}\sum_{\tau\in\mathcal T}\sum_{(x,y)\in Q_\tau}\ell(f_\theta(z; \alpha_\tau), y)$ be the empirical version of the population loss $\mathbb{E}_{\tau\sim p(\tau)}\sum_{(x,y)\in Q_\tau}\ell(f_\theta(z; \alpha_\tau), y)$, where $z = \mathrm{NN}(x; \varphi)$. With the linear kernel and square loss, the dual solution (3) is affine in the meta-parameter, and so is the primal solution $w_\tau = \theta + \lambda^{-1}X_\tau^\top\alpha_\tau$. Thus, the meta-loss $\mathcal L_{\mathrm{meta}}(\theta, \varphi)$ is convex, and convergence follows from convex optimization [5, 7]. After introducing nonlinearity, the meta-loss is no longer convex. The following introduces Lipschitz-smoothness assumptions, which are commonly used in stochastic non-convex optimization [15, 31] and in meta-learning in non-convex settings [11, 43].

Assumption 1 (Smoothness). (i) The deep network $\mathrm{NN}(x; \varphi)$ is Lipschitz-smooth, i.e., $\|\nabla_\varphi\mathrm{NN}(x; \varphi) - \nabla_\varphi\mathrm{NN}(x; \varphi')\| \le \beta_1\|\varphi - \varphi'\|$ with a Lipschitz constant $\beta_1 > 0$; (ii) the kernel $K(z, z')$ is Lipschitz-smooth w.r.t. $(z, z')$; (iii) $f_\theta(z)$ is Lipschitz-smooth w.r.t. $(\theta, z)$; (iv) $\nabla^2_1\ell(\hat y, y)$ is Lipschitz w.r.t. $\hat y$, i.e., $|\nabla^2_1\ell(\hat y, y) - \nabla^2_1\ell(\hat y', y)| \le \beta_2|\hat y - \hat y'|$ with a Lipschitz constant $\beta_2$; (v) $\mathbb{E}_{\tau\sim\mathcal T}\|\nabla_{(\theta,\varphi)}\sum_{(x,y)\in Q_\tau}\ell(f_\theta(z; \alpha_\tau), y) - \nabla_{(\theta,\varphi)}\mathcal L_{\mathrm{meta}}(\theta, \varphi)\|^2 = \sigma_g^2$, where $\tau\sim\mathcal T$ denotes a task sampled uniformly from $\mathcal T$.

The following lemma guarantees smoothness of the meta-loss.

Lemma 1. $\mathcal L_{\mathrm{meta}}(\theta, \varphi)$ is Lipschitz-smooth w.r.t. $(\theta, \varphi)$ with a Lipschitz constant $\beta_{\mathrm{meta}}$.

Theorem 1. Let the step size be $\eta_t = \min(1/\sqrt{T}, 1/(2\beta_{\mathrm{meta}}))$. Algorithm 3 satisfies $\min_{1\le t\le T}\mathbb{E}\|\nabla_{(\theta_t,\varphi_t)}\mathcal L_{\mathrm{meta}}(\theta_t, \varphi_t)\|^2 = O(\sigma_g^2/\sqrt{T})$, where the expectation is taken over the random training samples.

This rate is the same as for MAML [11, 19] and Meta-MinibatchProx [43]. For MAML with $J > 1$ gradient steps, Ji et al. [19] assume that the step size in the inner loop is of the order $1/J$. This slows down inner-loop learning when $J$ is large. On the other hand, MetaProx does not have this restriction, as its meta-gradient depends only on the last iterate rather than on all iterates along the trajectory.

Next, we study the global convergence of MetaProx. Prior works [13, 43] focus on the case where $\mathcal L_{\mathrm{meta}}(\theta, \varphi)$ is strongly convex in $(\theta, \varphi)$. This can be restrictive in deep learning. We instead only require $\ell(\hat y, y)$ to be strongly convex in $\hat y$. This assumption is easily met by commonly used loss functions such as the square loss, and the logistic loss on a compact domain. A recent work [40] studies the global convergence of MAML for over-parameterized neural networks. Over-parameterization is closely related to the assumption of uniform conditioning [18, 21, 23].

Assumption 2 (Uniform conditioning [23]). A multivariable function $M(\theta, \varphi)$ is $\mu$-uniformly conditioned if its tangent kernel [18] satisfies $\min_{(\theta,\varphi)}\lambda_{\min}\big(\nabla_{(\theta,\varphi)}M(\theta, \varphi)\,\nabla_{(\theta,\varphi)}M(\theta, \varphi)^\top\big) \ge \mu > 0$, where $\lambda_{\min}(\cdot)$ is the smallest eigenvalue of the matrix argument.

Assume that the loss $\ell(\cdot,\cdot)$ is $\rho$-strongly convex w.r.t. the first argument and that Assumption 1 holds. Let $x_{\tau,j}$ be the $j$th query example of task $\tau$, $z_{\tau,j}$ be its embedding, and $\hat y_{\tau,j} = f_\theta(z_{\tau,j}) + K(Z_\tau, z_{\tau,j})^\top\alpha_\tau$ be its prediction, where $\alpha_\tau$ is the dual solution. Let $M(\theta, \varphi) = [\hat y_{\tau_1,1}; \ldots; \hat y_{\tau_1,n_q}; \ldots; \hat y_{\tau_{|\mathcal T|},1}; \ldots; \hat y_{\tau_{|\mathcal T|},n_q}]$ be an auxiliary function that maps the meta-parameter to the predictions on all query examples. The following theorem shows that the proposed algorithm converges to a global minimum of the empirical risk $\mathcal L_{\mathrm{meta}}(\theta, \varphi)$ at a rate of $O(\sigma_g^2/\sqrt{T})$. The rate improves to exponential if the meta-learner adopts full gradient descent.

Theorem 2. Assume that $M(\theta, \varphi)$ is uniformly conditioned. (i) Let $\eta_t = \min(1/\sqrt{T}, 1/(2\beta_{\mathrm{meta}}))$. Algorithm 3 satisfies $\min_{1\le t\le T}\mathbb{E}\,\mathcal L_{\mathrm{meta}}(\theta_t, \varphi_t) - \min_{(\theta,\varphi)}\mathcal L_{\mathrm{meta}}(\theta, \varphi) = O(\sigma_g^2/\sqrt{T})$, where the expectation is taken over the random training samples. (ii) Let $\eta_t = \eta < \min(1/(2\beta_{\mathrm{meta}}), 4|\mathcal T|/(\rho\mu))$ and $B_t = \mathcal T$. Algorithm 3 satisfies $\mathcal L_{\mathrm{meta}}(\theta_t, \varphi_t) - \min_{(\theta,\varphi)}\mathcal L_{\mathrm{meta}}(\theta, \varphi) = O\big((1 - \eta\rho\mu/(4|\mathcal T|))^t\big)$.

## 4 Experiments

### 4.1 Few-shot Regression

Data sets. Experiments are performed on three data sets. (i) Sine. This is the sinusoid regression problem in [12]. Samples $x$ are uniformly sampled from $[-5, 5]$. Each task $\tau$ learns a sine function $y = a_\tau\sin(x + b_\tau) + \xi$, where $a_\tau\in[0.1, 5]$, $b_\tau\in[0, \pi]$, and $\xi\sim\mathcal N(0, \sigma_\xi^2)$ is the label noise. We consider both $\sigma_\xi^2 = 0$ (noise-free) and $\sigma_\xi^2 = 1$. In addition to the 5-shot setting in [12], we also evaluate the more challenging 2-shot setting. We randomly generate a meta-training set of 8000 tasks, a meta-validation set of 1000 tasks for early stopping, and a meta-testing set of 2000 tasks for performance evaluation. (ii) Sale. This is a real-world dataset from [36], which contains weekly purchase quantities of 811 products over 52 weeks. For each product (task), a sample is to predict the sales quantity for the current week from the sales quantities in the previous 5 weeks. Thus, each product contributes 47 samples. We evaluate the 5-shot and 1-shot settings. We randomly split the tasks into a meta-training set of 600 tasks, a meta-validation set of 100 tasks, and a meta-testing set of 111 tasks. (iii) QMUL, a multiview face dataset [16] from Queen Mary University of London. It consists of grayscale face images of 37 people (32 for meta-training and 5 for meta-testing). We follow the setting in [26] and evaluate the model on 10-shot regression. Each person has 133 facial images covering a viewsphere of ±90° in yaw and ±30° in tilt at 10° increments. A task is a trajectory taken from this discrete manifold for images of the same person, and the regression goal is to predict the tilt given an image. In the in-range setting, meta-training tasks are sampled from the entire manifold. In the more challenging out-of-range setting, meta-training tasks are sampled from the sub-manifold with yaw in [−90°, 0°]. In both settings, meta-testing tasks are sampled from the entire manifold. We randomly sample 2400 tasks for meta-training and 500 tasks for meta-testing. As in [26], we do not use a meta-validation set, since the dataset is small.
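For concreteness, here is a minimal generator for the Sine tasks described above. Uniform sampling of $a_\tau$ and $b_\tau$ is assumed, following the MAML setup [12]; the function name is illustrative.

```python
import numpy as np

def sample_sine_task(rng, n_support, n_query, noise_var=1.0):
    """Sample one sinusoid task: y = a*sin(x + b) + xi, with a ~ U[0.1, 5],
    b ~ U[0, pi], x ~ U[-5, 5], and xi ~ N(0, noise_var)."""
    a = rng.uniform(0.1, 5.0)
    b = rng.uniform(0.0, np.pi)
    def draw(n):
        x = rng.uniform(-5.0, 5.0, size=n)
        y = a * np.sin(x + b) + rng.normal(0.0, np.sqrt(noise_var), size=n)
        return x, y
    return draw(n_support), draw(n_query)

rng = np.random.default_rng(0)
# a noisy 2-shot task: 2 support samples, 50 query samples
(support_x, support_y), (query_x, query_y) = sample_sine_task(rng, 2, 50)
```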
Network Architecture. For Sine and Sale, we use the network in [12], which is a small multilayer perceptron with two hidden layers of size 40 and ReLU activations. For QMUL, we use the three-layer convolutional neural network in [26]. The embeddings are always taken from the last hidden layer. We use a simple linear kernel as the base kernel, and $f_\theta(z) = \theta^\top z$.

Implementation Details. We use the Adam optimizer [20] with a learning rate of 0.001. Each mini-batch has 16 tasks. For Sine and Sale, the model ($\varphi$ and $f_\theta$) is meta-trained for 40,000 iterations. To prevent overfitting on the meta-training set, we evaluate the meta-validation performance every 500 iterations, and stop training when the loss on the meta-validation set shows no significant improvement for 10 consecutive evaluations. For QMUL, we follow [26] and meta-train the model for 100 iterations. We repeat each experiment 30 times. For performance evaluation, we use the average mean squared error (MSE) on the meta-testing set.

Baselines. On Sine and Sale, we compare MetaProx with Common Mean [7], MAML [12], MetaOptNet-RR [22], Meta-MinibatchProx [43], and iMAML [28]. Common Mean is a linear model, and MetaOptNet-RR is equivalent to MetaProx when $f_\theta = 0$. Following [12], we set the number of inner gradient steps for MAML to 1 during meta-training, and to 20 during meta-validation and meta-testing. Both Meta-MinibatchProx [43] and iMAML [28] are meta-regularization approaches. For QMUL, we compare MetaProx with the baselines reported in [26] (namely, DKT [26], Feature Transfer [10], and MAML). As further baselines, we also compare with Meta-MinibatchProx and MetaOptNet-RR, to evaluate the improvement of MetaProx due to the learnable $f_\theta$.

[Figure 1: Convergence curves for few-shot sinusoid regression, with panels for the 2-shot and 5-shot settings under $\sigma_\xi^2 = 0$ and $\sigma_\xi^2 = 1$. Best viewed in color.]

Results on Sine. Figure 1 shows the convergence curves of MetaProx and the baselines. We do not show the convergence of Common Mean, as it does not use a neural network backbone like the other methods. As can be seen, MetaProx converges much faster and better than the non-kernel-based methods (MAML, iMAML and Meta-MinibatchProx). In the 2-shot settings, MetaProx converges to a loss smaller than that of MetaOptNet-RR. Figure 2 shows the learned functions on two meta-testing tasks ($\tau_1$ with $(a = 4.6, b = 3.2)$ and $\tau_2$ with $(a = 3.7, b = 0.5)$) in the 5-shot setting and the more challenging 2-shot setting. As can be seen, MetaProx always fits the target curve well. Though MAML, iMAML and Meta-MinibatchProx can fit the support samples, they perform worse in regions far from the support samples. This is especially noticeable in the 2-shot setting.

[Figure 2: Sinusoid regression: two meta-testing tasks $\tau_1$ and $\tau_2$ with different $\sigma_\xi$'s in the 2-shot and 5-shot settings. Best viewed in color.]

Table 1 shows the MSE on the meta-testing set. Obviously, Common Mean (a linear model) fails in this nonlinear regression task. MetaProx and MetaOptNet-RR significantly outperform the other baselines. MetaProx (with the learned $f_\theta$) performs better than MetaOptNet-RR, particularly when the data is noisy.

Table 1: Average MSE (with 95% confidence intervals) of few-shot regression on the Sine and Sale datasets (the confidence intervals on Sale are below 0.001 for all methods).

| method | Sine 2-shot (noise-free) | Sine 2-shot (noisy) | Sine 5-shot (noise-free) | Sine 5-shot (noisy) | Sale 1-shot | Sale 5-shot |
|---|---|---|---|---|---|---|
| Common Mean [7] | 4.58 ± 0.07 | 4.59 ± 0.07 | 4.29 ± 0.06 | 4.31 ± 0.06 | 0.090 | 0.074 |
| MAML [12] | 1.24 ± 0.12 | 1.91 ± 0.13 | 0.41 ± 0.03 | 1.15 ± 0.05 | 0.069 | 0.063 |
| iMAML [28] | 1.12 ± 0.11 | 1.84 ± 0.10 | 0.38 ± 0.02 | 1.02 ± 0.05 | 0.068 | 0.063 |
| Meta-MinibatchProx [43] | 1.15 ± 0.08 | 1.87 ± 0.09 | 0.37 ± 0.02 | 1.01 ± 0.03 | 0.081 | 0.064 |
| MetaOptNet-RR [22] | 0.18 ± 0.01 | 0.79 ± 0.01 | 0.01 ± 0.00 | 0.19 ± 0.01 | 0.088 | 0.068 |
| MetaProx (proposed) | 0.11 ± 0.01 | 0.43 ± 0.01 | 0.01 ± 0.00 | 0.13 ± 0.01 | 0.061 | 0.060 |

Results on Sale. As can be seen from Table 1, the linear model (Common Mean) performs poorly, as expected. MetaProx again outperforms the other baselines, particularly in the more challenging 1-shot setting.

Results on QMUL. Table 2 shows that MetaProx achieves the lowest MSE, and that the kernel methods (DKT+RBF, DKT+Spectral, MetaOptNet-RR, and MetaProx) perform better than the non-kernel-based methods (Feature Transfer, MAML, and Meta-MinibatchProx). MetaProx, with the learnable $f_\theta$, reduces the error of MetaOptNet-RR by half.

Table 2: Average MSE (with 95% confidence intervals) of few-shot regression on QMUL (10-shot). Results of the first four methods are from [26].

| method | in-range | out-of-range |
|---|---|---|
| Feature Transfer [10] | 0.22 ± 0.03 | 0.18 ± 0.01 |
| MAML [12] | 0.21 ± 0.01 | 0.18 ± 0.02 |
| DKT + RBF [26] | 0.12 ± 0.04 | 0.14 ± 0.03 |
| DKT + Spectral [26] | 0.10 ± 0.02 | 0.11 ± 0.02 |
| Meta-MinibatchProx [43] | 0.171 ± 0.022 | 0.193 ± 0.025 |
| MetaOptNet-RR [22] | 0.021 ± 0.007 | 0.039 ± 0.009 |
| MetaProx (proposed) | 0.012 ± 0.003 | 0.020 ± 0.005 |

### 4.2 Few-shot Classification

Datasets. We use the standard 5-way $K$-shot setting ($K = 1$ or 5) on the miniImageNet [39] dataset, which consists of 100 randomly chosen classes from ILSVRC-2012 [33]. Each class contains 600 images of size 84×84. We use the commonly used split in [30]: the 100 classes are randomly split into 64 for meta-training, 16 for meta-validation, and 20 for meta-testing.

Network Architecture. For the network backbone, we use the Conv4 in [12, 39] and the ResNet-12 in [22]. As the cosine similarity is more effective than the $\ell_2$ distance in few-shot classification [6], we adopt the cosine kernel $K(z, z') = \cos(z, z')$ as the base kernel, where $z$ is the embedding of sample $x$ extracted from the last hidden layer, as in regression. For each $c = 1, \ldots, 5$, $f_{\theta^{(c)}} = [K_{q^{(1)}}; \ldots; K_{q^{(5)}}]^\top\theta^{(c)}$ is a weighted prototype classifier on the embedding space, where $q^{(1)}, \ldots, q^{(5)}$ are the class centroids computed from $S_\tau$, and the weights $\{\theta^{(1)}, \ldots, \theta^{(5)}\}$ are meta-parameters.

Table 3: Accuracies (with 95% confidence intervals) of 5-way few-shot classification on miniImageNet using Conv4. † means that the result is obtained by rerunning the code with our setup here; other results are from the original publications (the result for the 5-shot setting is not reported in iMAML [28]).

| method | 1-shot | 5-shot |
|---|---|---|
| MAML [12] | 48.7 ± 1.8 | 63.1 ± 0.9 |
| FOMAML [12] | 48.1 ± 1.8 | 63.2 ± 0.9 |
| REPTILE [25] | 50.0 ± 0.3 | 66.0 ± 0.6 |
| iMAML [28] | 49.0 ± 1.8 | – |
| Meta-MinibatchProx [43] | 50.8 ± 0.9 | 67.4 ± 0.9 |
| ANIL [27] | 46.7 ± 0.4 | 61.5 ± 0.5 |
| R2D2 [4] | 49.5 ± 0.2 | 65.4 ± 0.3 |
| ProtoNet [35] | 49.4 ± 0.8 | 68.2 ± 0.7 |
| MetaOptNet-SVM(lin) [22] | 49.8 ± 0.9 | 66.9 ± 0.7 |
| MetaOptNet-SVM(cos) [22] | 50.1 ± 0.9 | 67.2 ± 0.6 |
| MetaProx (proposed) | 52.4 ± 1.0 | 68.8 ± 0.8 |
Baselines. We compare MetaProx with the state-of-the-art methods: (i) meta-initialization: MAML [12] and its variants FOMAML [12] and REPTILE [25]; (ii) meta-regularization: iMAML [28] and Meta-MinibatchProx [43]; and (iii) metric learning: ANIL [27], R2D2 [4], ProtoNet [35], and MetaOptNet [22] with SVM using the linear kernel and the cosine kernel.

Table 4: Accuracies (with 95% confidence intervals) of 5-way few-shot classification on miniImageNet using ResNet-12. † means that the result is obtained by rerunning the code with our setup here.

| method | 1-shot | 5-shot |
|---|---|---|
| FOMAML [12] | 57.41 ± 0.71 | 72.12 ± 0.54 |
| ANIL [27] | 59.66 ± 0.68 | 73.28 ± 0.49 |
| ProtoNet [35] | 59.25 ± 0.64 | 75.60 ± 0.48 |
| MetaOptNet-SVM(lin) [22] | 62.31 ± 0.64 | 78.21 ± 0.42 |
| MetaOptNet-SVM(cos) [22] | 62.75 ± 0.42 | 78.68 ± 0.24 |
| MetaProx (proposed) | 63.82 ± 0.23 | 79.12 ± 0.18 |

Implementation Details. The entire model is trained end-to-end. $\ell(\hat y, y)$ is the cross-entropy loss. The CVXPYLayers package [1] is used to solve the dual problem. We train the model for 80,000 iterations, and each mini-batch has 4 tasks. We use the Adam optimizer [20] with an initial learning rate of 0.001, which is then reduced by half every 2,500 iterations. To prevent overfitting, we evaluate the meta-validation performance every 500 iterations, and stop training when the meta-validation accuracy shows no significant improvement for 10 consecutive evaluations. We report the classification accuracy averaged over 600 tasks randomly sampled from the meta-testing set.
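To illustrate how the dual problem can be made differentiable end-to-end, here is a hedged sketch using CVXPYLayers. The layer construction, the canonicalized-label assumption, and the toy episode are ours, not the authors' released code; passing both the kernel matrix and its Cholesky factor as separate parameters is a simplification to keep the problem DPP-compliant.

```python
import numpy as np
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

def make_dual_layer(ns, N, Y):
    """Differentiable layer for the N-way dual program with cross-entropy loss.
    Y (ns x N one-hot) is baked in as a constant: episode labels are assumed
    canonicalized (class c occupies a fixed block of the support set)."""
    A = cp.Variable((ns, N))
    F0 = cp.Parameter((ns, N))    # f_theta^(c)(z_i) on the support embeddings
    Kp = cp.Parameter((ns, ns))   # kernel matrix K(Z, Z)
    Lp = cp.Parameter((ns, ns))   # Cholesky-type factor with Lp Lp^T = K
    logits = F0 + Kp @ A
    ce = cp.sum(cp.log_sum_exp(logits, axis=1)) - cp.sum(cp.multiply(Y, logits))
    reg = cp.sum_squares(Lp.T @ A)           # = sum_c alpha^(c)T K alpha^(c)
    prob = cp.Problem(cp.Minimize(ce + reg))
    return CvxpyLayer(prob, parameters=[F0, Kp, Lp], variables=[A])

N, k = 5, 1                                   # toy 5-way 1-shot episode
ns, e = N * k, 16
Y = np.kron(np.eye(N), np.ones((k, 1)))       # canonicalized one-hot labels
layer = make_dual_layer(ns, N, Y)

Z = torch.randn(ns, e, requires_grad=True)    # support embeddings NN(x; phi)
F0 = torch.randn(ns, N, requires_grad=True)   # meta-learned f_theta values
K = Z @ Z.T + 1e-4 * torch.eye(ns)            # linear kernel plus jitter
A, = layer(F0, K, torch.linalg.cholesky(K))   # differentiable dual solve
# toy meta-loss on the first support point, standing in for a query point
loss = (F0 + K @ A)[0].logsumexp(0) - (F0 + K @ A)[0, 0]
loss.backward()                               # meta-gradients w.r.t. Z and F0
```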
Results. Tables 3 and 4 show the results for Conv4 and ResNet-12, respectively. As can be seen, MetaProx is always the best. Compared with MetaOptNet-SVM, MetaProx performs better due to the learnable regularizer.

## 5 Conclusion

In this paper, we proposed MetaProx, an effective meta-regularization algorithm for meta-learning. MetaProx combines deep kernels with meta-regularization. By reformulating the problem in the dual space, a learnable proximal regularizer is introduced into the base learner, and the meta-parameters in the regularizer and the network are updated by the meta-learner. We also established the convergence of MetaProx. Extensive experiments on standard regression and classification datasets verify the effectiveness of learning a proximal regularizer.

Acknowledgements. This work is supported by NSFC grant 62076118.

## References

[1] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter. Differentiable convex optimization layers. In Neural Information Processing Systems, pages 9562–9574, 2019.
[2] M.-F. Balcan, M. Khodak, and A. Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424–433, 2019.
[3] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In International Joint Conference on Neural Networks, pages 969–975, 1991.
[4] L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2018.
[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2018.
[7] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Learning to learn around a common mean. In Neural Information Processing Systems, pages 10169–10179, 2018.
[8] G. Denevi, C. Ciliberto, R. Grazzi, and M. Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575, 2019.
[9] G. Denevi, M. Pontil, and C. Ciliberto. The advantage of conditional meta-learning for biased regularization and fine tuning. In Neural Information Processing Systems, 2020.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
[11] A. Fallah, A. Mokhtari, and A. Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1082–1092, 2020.
[12] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
[13] C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online meta-learning. In International Conference on Machine Learning, pages 1920–1930, 2019.
[14] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568–1577, 2018.
[15] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[16] S. Gong, S. McKenna, and J. J. Collins. An investigation into face pose distributions. In International Conference on Automatic Face and Gesture Recognition, pages 265–270, 1996.
[17] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841, 2018.
[18] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Neural Information Processing Systems, 2018.
[19] K. Ji, J. Yang, and Y. Liang. Multi-step model-agnostic meta-learning: Convergence and improved algorithms. Preprint arXiv:2002.07836, 2020.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[21] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Neural Information Processing Systems, 2019.
[22] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
[23] C. Liu, L. Zhu, and M. Belkin. Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. Preprint arXiv:2003.00307, 2020.
[24] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2018.
[25] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. Preprint arXiv:1803.02999, 2018.
[26] M. Patacchiola, J. Turner, E. J. Crowley, M. O'Boyle, and A. J. Storkey. Bayesian meta-learning for the few-shot setting via deep kernels. In Neural Information Processing Systems, 2020.
[27] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2019.
[28] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In Neural Information Processing Systems, pages 113–124, 2019.
[29] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pages 5331–5340, 2019.
[30] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[31] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.
[32] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 1976.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[34] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426, 2001.
[35] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Neural Information Processing Systems, pages 4077–4087, 2017.
[36] S. C. Tan and J. P. San Lau. Time series clustering: A superior alternative for market basket analysis. In International Conference on Advanced Data and Information Engineering, pages 241–248, 2014.
[37] S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.
[38] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, and H. Larochelle. Meta-Dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020.
[39] O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. Matching networks for one shot learning. In Neural Information Processing Systems, pages 3630–3638, 2016.
[40] H. Wang, R. Sun, and B. Li. Global convergence and induced kernels of gradient-based meta-learning with neural nets. Preprint arXiv:2006.14606, 2020.
[41] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys, 53(3):1–34, 2020.
[42] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, 2016.
[43] P. Zhou, X. Yuan, H. Xu, S. Yan, and J. Feng. Efficient meta learning via minibatch proximal update. In Neural Information Processing Systems, pages 1534–1544, 2019.
[44] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.