Published as a conference paper at ICLR 2022

ONLINE HYPERPARAMETER META-LEARNING WITH HYPERGRADIENT DISTILLATION

Hae Beom Lee1, Hayeon Lee1, Jaewoong Shin3, Eunho Yang1,2, Timothy M. Hospedales4,5, Sung Ju Hwang1,2
KAIST1, AITRICS2, Lunit3, South Korea; University of Edinburgh4, Samsung AI Centre Cambridge5, United Kingdom
{haebeom.lee, hayeon926, shinjw148, eunhoy, sjhwang82}@kaist.ac.kr, {t.hospedales}@ed.ac.uk

ABSTRACT

Many gradient-based meta-learning methods assume a set of parameters that do not participate in inner-optimization, which can be considered as hyperparameters. Although such hyperparameters can be optimized using existing gradient-based hyperparameter optimization (HO) methods, they suffer from the following issues: unrolled differentiation methods do not scale well to high-dimensional hyperparameters or to long horizons, Implicit Function Theorem (IFT) based methods are restrictive for online optimization, and short-horizon approximations suffer from short-horizon bias. In this work, we propose a novel HO method that overcomes these limitations by approximating the second-order term with knowledge distillation. Specifically, we parameterize a single Jacobian-vector product (JVP) for each HO step and minimize its distance from the true second-order term. Our method allows online optimization and is scalable in both the hyperparameter dimension and the horizon length. We demonstrate the effectiveness of our method on three meta-learning methods and two benchmark datasets.

1 INTRODUCTION

Meta-learning (Schmidhuber, 1987; Thrun & Pratt, 1998) aims to learn a learning process itself over a task distribution. Many gradient-based meta-learning approaches assume a set of parameters that do not participate in inner-optimization (Lee & Choi, 2018; Flennerhag et al., 2019; Raghu et al., 2019), which can be seen as hyperparameters. Those hyperparameters are important in helping the inner-learner converge faster and generalize better. Since they are usually very high-dimensional, e.g., element-wise learning rates (Li et al., 2017), we cannot meta-learn them with simple hyperparameter optimization (HO) techniques such as random search (Bergstra & Bengio, 2012) or Bayesian optimization (Snoek et al., 2012), whose search spaces would be prohibitively large.

Table 1: Comparison between the various gradient-based HO algorithms. 1-step denotes the one-step lookahead approximation (Luketina et al., 2016). The IFT entry for online optimization is partial: it supports online updates only near convergence (see below).

|                 | FMD | RMD | DrMAD | IFT     | 1-step | Ours |
|-----------------|-----|-----|-------|---------|--------|------|
| High-dim.       | X   | O   | O     | O       | O      | O    |
| Online opt.     | O   | X   | X     | partial | O      | O    |
| Constant memory | O   | X   | O     | O       | O      | O    |
| Horizon > 1     | O   | O   | O     | O       | X      | O    |

In this case, we can use gradient-based HO methods that directly optimize the high-dimensional hyperparameters by minimizing the validation loss w.r.t. the hyperparameters (Bengio, 2000). Due to the expensive computational cost of evaluating the hypergradients (i.e., the gradients w.r.t. the hyperparameters), there have been many efforts to improve the effectiveness and the efficiency of these algorithms. Unfortunately, none of the existing algorithms simultaneously satisfies the following criteria that should be met for practical use: 1) scalable in the hyperparameter dimension, 2) online optimization, 3) memory-efficient, and 4) avoiding short-horizon bias. See Table 1 for a comparison of the existing gradient-based HO algorithms on these four criteria.
Forward-Mode Differentiation (FMD) (Franceschi et al., 2017) in Table 1 forward-propagates Jacobians (i.e., derivatives of the update function) from the first to the last step, which is analogous to real-time recurrent learning (RTRL) (Williams & Zipser, 1989) in recurrent neural networks. FMD allows online optimization (i.e., updating the hyperparameters at every inner-step) with the intermediate Jacobians, and also computes the hypergradients over the entire horizon. However, a critical limitation is that its time and space complexity increase linearly with the hyperparameter dimension. Thus, we cannot use FMD for many practical meta-learning problems that come with millions of hyperparameters, which is the main problem we tackle in this paper.

Secondly, Reverse-Mode Differentiation (RMD) (Maclaurin et al., 2015) back-propagates Jacobian-vector products (JVPs) from the last to the initial step, which is structurally identical to backpropagation through time (BPTT) (Werbos, 1990). RMD is scalable in the hyperparameter dimension, but its space complexity increases linearly with the horizon length (i.e., the number of inner-gradient steps used to compute the hypergradient). It is possible to reduce the memory burden by checkpointing some of the previous weights and interpolating between them to approximate the trajectory (Fu et al., 2016). However, RMD and its variants are not scalable for online optimization: unlike FMD, they do not retain the intermediate Jacobians and thus need to recompute the whole second-order term at every online HO step.

Thirdly, algorithms based on the Implicit Function Theorem (IFT) are applicable to high-dimensional HO (Bengio, 2000; Pedregosa, 2016). Under the assumption that the main model parameters have converged, the best-response Jacobian, i.e., how the converged model parameters change w.r.t. the hyperparameters, can be expressed using only information available at the last step, such as the inverse Hessian at convergence. Thus, we do not have to explicitly unroll the previous update steps. Due to the heavy cost of computing the inverse-Hessian-vector product, Lorraine et al. (2020) propose to approximate it with an iterative method, which works well for high-dimensional HO problems. However, it is still not straightforward to use this method for online optimization because of the convergence assumption: hypergradients computed before convergence come with no quality guarantee.

To our knowledge, short-horizon approximations such as one-step lookahead (1-step in Table 1) (Luketina et al., 2016) are the only existing methods that fully support online optimization while being scalable in the hyperparameter dimension at the same time. One-step lookahead computes hypergradients over only a single update step and ignores the past learning trajectory, which is computationally efficient as only a single JVP is computed per online HO step. However, this approximation suffers from short-horizon bias (Wu et al., 2018) by definition.

In this paper, we propose a novel HO algorithm that simultaneously satisfies all of the aforementioned criteria for practical HO. The key idea is to distill the entire second-order term into a single JVP. As a result, we only need to compute a single JVP for each online HO step, while the distilled JVP considers longer horizons than short-horizon approximations such as one-step lookahead or first-order methods.
We summarize the contributions of this paper as follows:

- We propose HyperDistill, a novel HO algorithm that satisfies the aforementioned four criteria for practical HO, each of which is crucial for applying an HO algorithm to current meta-learning frameworks.
- We show how to efficiently distill the second-order term of the hypergradient into a single JVP.
- We empirically demonstrate that our algorithm converges faster and provides better generalization performance at convergence, with three recent meta-learning models and on two benchmark image datasets.

2 RELATED WORK

Hyperparameter optimization. When the hyperparameter dimension is small (e.g., less than 100), random search (Bergstra & Bengio, 2012) or Bayesian optimization (Snoek et al., 2012) works well. However, when the hyperparameters are high-dimensional, gradient-based HO is often preferred since random or Bayesian search becomes infeasible. Among the best-known methods for gradient-based HO are those based on the Implicit Function Theorem, which compute or approximate the inverse Hessian only at convergence. Bengio (2000) computes the exact inverse Hessian, and Luketina et al. (2016) approximate the inverse Hessian with the identity matrix, which is identical to the one-step lookahead approximation. Pedregosa (2016) approximates the inverse Hessian with the conjugate gradient (CG) method. Lorraine et al. (2020) propose a Neumann-series approximation, which is numerically more stable than the CG approximation. On the other hand, Domke (2012) proposes unrolled differentiation for solving bi-level optimization, and Shaban et al. (2019) analyze truncated unrolled differentiation, which is computationally more efficient. Unrolled differentiation can be further categorized into forward mode (FMD) and reverse mode (RMD) (Franceschi et al., 2017). FMD is more suitable for optimizing low-dimensional hyperparameters (Im et al., 2021; Micaelli & Storkey, 2020), whereas RMD is more scalable in the hyperparameter dimension. Maclaurin et al. (2015) propose a more memory-efficient RMD, which reverses the SGD trajectory with momentum. Fu et al. (2016) further reduce the memory burden of RMD by approximating the learning trajectory with linear interpolation. Luketina et al. (2016) can also be understood as a short-horizon approximation of RMD for online optimization. Our method also supports online optimization, but the critical difference is that our algorithm can alleviate the short-horizon bias (Wu et al., 2018). RMD is basically a type of backpropagation, and it is available in deep learning libraries (Grefenstette et al., 2019).

Meta-learning. Meta-learning (Schmidhuber, 1987; Thrun & Pratt, 1998) aims to learn a model that generalizes over a distribution of tasks (Vinyals et al., 2016; Ravi & Larochelle, 2016). While there exists a variety of approaches, in this paper we focus on gradient-based meta-learning (Finn et al., 2017), especially methods with high-dimensional hyperparameters that do not participate in inner-optimization. For instance, there have been many attempts to precondition the inner-gradients for faster inner-optimization, either by warping the parameter space with warp layers interleaved between consecutive layers (Lee & Choi, 2018; Flennerhag et al., 2019), or by directly modulating the inner-gradients with a diagonal (Li et al., 2017) or block-diagonal matrix (Park & Oliva, 2019).
Perturbation functions are another form of hyperparameters that help the inner-learner generalize better (Lee et al., 2019; Ryu et al., 2020; Tseng et al., 2020). It is also possible to let the whole feature extractor be hyperparameters and only adapt the last fully-connected layer (Raghu et al., 2019). On the other hand, some of the meta-learning literature does not assume a task distribution, but instead tunes hyperparameters with a holdout validation set, similarly to the conventional HO setting. In this case, the one-step lookahead method (Luketina et al., 2016) is mostly used for scalable online HO, in the context of domain generalization (Li et al., 2018), handling class imbalance (Ren et al., 2018; Shu et al., 2019), gradient-based neural architecture search (Liu et al., 2018), and tuning the coefficient of a norm-based regularizer (Balaji et al., 2018). Although we mainly focus on the meta-learning setting in this work, whose goal is to transfer knowledge through a task distribution, it is straightforward to apply our method to conventional HO problems.

3 BACKGROUND

In this section, we first introduce RMD and its approximations for efficient computation. We then introduce our novel algorithm that supports high-dimensional online HO over the entire horizon.

3.1 HYPERPARAMETER UNROLLED DIFFERENTIATION

We first introduce notation. Throughout this paper, we denote the weight by w and the hyperparameter by λ. The series of weights w_0, w_1, w_2, ..., w_T evolves with the update function w_t = Φ(w_{t−1}, λ; D_t) over steps t = 1, ..., T. The function Φ takes the previous weight w_{t−1} and the hyperparameter λ as inputs, and its form depends on the current mini-batch D_t. Note that w_1, w_2, ..., w_T are functions of the hyperparameter λ. The question is how to find a good hyperparameter λ that yields a good response w_T at the last step. In gradient-based HO, we find the optimal λ by minimizing the validation loss L_val as a function of λ:

    min_λ L_val(w_T(λ), λ)    (1)

Note that we let the loss function L_val(·, ·) itself be modulated by λ for generality. By the chain rule, the hypergradient decomposes as

    dL_val(w_T, λ)/dλ = ∂L_val(w_T, λ)/∂λ + α_T (dw_T/dλ),    (2)

where the first term is the first-order (FO) term g_T^FO and the second term is the second-order (SO) term g_T^SO. On the right-hand side, g_T^FO directly computes the gradient w.r.t. λ while fixing w_T, whereas g_T^SO computes the indirect effect of λ through the response w_T. The vector α_T = ∂L_val(w_T, λ)/∂w_T can be computed as easily as g_T^FO, but the response Jacobian dw_T/dλ is computationally more challenging, as it unrolls into the following form:

    dw_T/dλ = Σ_{t=1}^{T} ( Π_{s=t+1}^{T} A_s ) B_t,  where  A_s = ∂Φ(w_{s−1}, λ; D_s)/∂w_{s−1},  B_t = ∂Φ(w_{t−1}, λ; D_t)/∂λ.    (3)

Eq. (3) involves the Jacobians {A_s} and {B_t} at the intermediate steps. Evaluating them or their vector products is computationally expensive in terms of either time (FMD) or space (FMD, RMD) (Franceschi et al., 2017).

Algorithm 1 Reverse-HG (RMD)
1: Input: The last weight w_T and all the previous weights w_0, ..., w_{T−1}.
2: Output: Hypergradient g^FO + g^SO.
3: α ← ∂L(w_T, λ)/∂w_T;  g^FO ← ∂L(w_T, λ)/∂λ;  g^SO ← 0
4: for t = T downto 1 do
5:     g^SO ← g^SO + α B_t
6:     α ← α A_t
7: end for
8: return g^FO + g^SO

Algorithm 2 DrMAD (Fu et al., 2016)
1: Input: The last weight w_T and the initial weight w_0.
2: Output: Approximated hypergradient g^FO + g^SO.
3: α ← ∂L(w_T, λ)/∂w_T;  g^FO ← ∂L(w_T, λ)/∂λ;  g^SO ← 0
4: for t = T downto 1 do
5:     ŵ_{t−1} ← (1 − (t−1)/T) w_0 + ((t−1)/T) w_T
6:     g^SO ← g^SO + α B̂_t
7:     α ← α Â_t
8: end for
9: return g^FO + g^SO
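To make Algorithm 1 concrete, below is a minimal PyTorch sketch of the reverse-mode hypergradient over an unrolled SGD trajectory, not the authors' released code. It assumes Φ is one SGD step on a scalar training loss, that λ is a leaf tensor with requires_grad=True and actually enters Φ, and that the weights are flattened into single tensors; the products αA_t and αB_t (the paper's JVPs, vector-Jacobian products in autodiff terms) are computed with torch.autograd.grad. All helper names are illustrative.

```python
import torch

def inner_sgd_step(w, lam, batch, train_loss, lr=0.01):
    # One inner update Phi(w, lam; D): plain SGD on the training loss.
    # create_graph=True keeps the graph so we can differentiate through the step.
    g = torch.autograd.grad(train_loss(w, lam, batch), w, create_graph=True)[0]
    return w - lr * g

def rmd_hypergradient(ws, lam, batches, train_loss, val_loss):
    """Algorithm 1 (Reverse-HG). ws = [w_0, ..., w_T] is the stored trajectory."""
    w_T = ws[-1].detach().requires_grad_(True)
    L = val_loss(w_T, lam)
    # alpha = dL/dw_T, g_fo = direct dL/dlam (None if L does not use lam directly).
    alpha, g_fo = torch.autograd.grad(L, (w_T, lam), allow_unused=True)
    g_so = torch.zeros_like(lam)
    for t in range(len(batches), 0, -1):          # t = T, ..., 1
        w_prev = ws[t - 1].detach().requires_grad_(True)
        lam_t = lam.detach().requires_grad_(True)
        w_t = inner_sgd_step(w_prev, lam_t, batches[t - 1], train_loss)
        # alpha B_t and alpha A_t as vector-Jacobian products:
        aB, aA = torch.autograd.grad(w_t, (lam_t, w_prev), grad_outputs=alpha)
        g_so = g_so + aB
        alpha = aA
    return (g_fo if g_fo is not None else 0.0) + g_so
```

Replacing the stored w_{t−1} with the linear interpolation of Algorithm 2 turns this same loop into DrMAD, trading exactness for constant memory.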
How to approximate Eq. (3) is therefore the key to developing an efficient and effective HO algorithm.

3.2 REVERSE-MODE DIFFERENTIATION AND ITS APPROXIMATIONS

RMD is structurally analogous to backpropagation through time (BPTT) (Werbos, 1990). In RMD, we first obtain α_T = ∂L_val(w_T, λ)/∂w_T and back-propagate the A and B terms from the last to the first step in the form of JVPs (see Algorithm 1). Whereas RMD is much faster than FMD, since we only need to compute one or two JVPs per step, it usually requires storing all the previous weights w_0, ..., w_{T−1} to compute the previous-step JVPs, unless we consider reversible training with a momentum optimizer (Maclaurin et al., 2015). Therefore, when w is high-dimensional, RMD is only applicable to short-horizon problems such as few-shot learning (e.g., T = 5 in Finn et al. (2017)).

Trajectory approximation. Instead of storing all the previous weights for computing A and B, we can approximate the learning trajectory by linearly interpolating between the last weight w_T and the initial weight w_0. Algorithm 2 illustrates the resulting procedure, DrMAD (Fu et al., 2016), where each intermediate weight w_t is approximated by ŵ_t = (1 − t/T) w_0 + (t/T) w_T for t = 1, ..., T−1, and A and B are correspondingly approximated by Â_s = ∂Φ(ŵ_{s−1}, λ; D_s)/∂ŵ_{s−1} and B̂_t = ∂Φ(ŵ_{t−1}, λ; D_t)/∂λ. However, although DrMAD dramatically lowers the space complexity, it does not reduce the number of JVPs per hypergradient computation. For each online HO step t = 1, ..., T we would need to compute 2t − 1 JVPs, so the number of JVPs to complete a single trajectory accumulates to Σ_{t=1}^{T} (2t − 1) = T², which is clearly not scalable for an online optimization algorithm.

Short-horizon approximations. The one-step lookahead approximation (Luketina et al., 2016) is currently one of the most popular high-dimensional online HO methods that avoids computing an excessive number of JVPs (Li et al., 2018; Ren et al., 2018; Shu et al., 2019; Liu et al., 2018; Balaji et al., 2018). The idea is very simple: at each online HO step we only consider the last update step and ignore the rest of the learning trajectory for computational efficiency. Specifically, for each step t = 1, ..., T we compute the hypergradient by viewing w_{t−1} as a constant, which yields dw_t/dλ ≈ ∂Φ(w_{t−1}, λ; D_t)/∂λ = B_t (see Eq. (3)). Alternatively, we may completely ignore all the second-order derivatives, such that dw_t/dλ ≈ 0 (Flennerhag et al., 2019; Ryu et al., 2020). While these approximations enable online HO at low cost, they are intrinsically vulnerable to short-horizon bias (Wu et al., 2018) by definition.
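For reference, the one-step lookahead hypergradient g^FO + α_t B_t needs only a single backward pass through one unrolled step. A minimal PyTorch sketch under the same assumptions as the earlier RMD sketch (SGD update, λ a leaf tensor with requires_grad=True; names are illustrative):

```python
import torch

def one_step_hypergradient(w_prev, lam, batch, train_loss, val_loss, lr=0.01):
    """One-step lookahead (Luketina et al., 2016): g^FO + alpha_t * B_t.

    w_{t-1} is treated as a constant (detached), so backprop through the
    single unrolled step yields exactly one JVP per online HO step.
    """
    w_prev = w_prev.detach().requires_grad_(True)
    g = torch.autograd.grad(train_loss(w_prev, lam, batch), w_prev,
                            create_graph=True)[0]
    w_t = w_prev - lr * g                 # Phi(w_{t-1}, lam; D_t)
    L = val_loss(w_t, lam)
    # dL/dlam both through w_t (= alpha_t B_t) and directly (= g^FO):
    return torch.autograd.grad(L, lam)[0]
```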
We next introduce our novel online HO method based on knowledge distillation, which overcomes all of the aforementioned limitations at the same time.

4.1 HYPERGRADIENT DISTILLATION

The key idea is to distill the whole second-order term g_t^SO = α_t Σ_{i=1}^{t} (Π_{j=i+1}^{t} A_j) B_i in Eq. (2) into a single JVP evaluated at a distilled weight point w* and with a distilled dataset D*. We denote the normalized JVP as f_t(w, D) := σ(α_t ∂Φ(w, λ; D)/∂λ) with σ(x) = x / ‖x‖₂. Specifically, we want to solve the following knowledge distillation problem at each online HO step t = 1, ..., T:

    π*_t, w*_t, D*_t = argmin_{π, w, D} ‖π f_t(w, D) − g_t^SO‖₂    (4)

so that we can use π*_t f_t(w*_t, D*_t) instead of g_t^SO. Online optimization then becomes feasible because at each online HO step t = 1, ..., T we only need to compute the single JVP f_t(w*_t, D*_t), rather than the 2t − 1 JVPs required by RMD or DrMAD. Also, unlike short-horizon approximations, the whole trajectory information is distilled into the JVP, alleviating the short-horizon bias (Wu et al., 2018). Notice that solving Eq. (4) w.r.t. π alone is simply a vector projection:

    π_t(w, D) = f_t(w, D)ᵀ g_t^SO.    (5)

Plugging Eq. (5) into π in Eq. (4) and using ‖f_t(w, D)‖₂ = 1, we can easily convert the optimization problem in Eq. (4) into the following equivalent problem (see Appendix A):

    w*_t, D*_t = argmax_{w, D} π_t(w, D),    π*_t = π_t(w*_t, D*_t).    (6)

That is, w*_t and D*_t match the hypergradient direction, and π*_t matches its size.

Technical challenge. However, solving Eq. (6) requires evaluating g_t^SO for t = 1, ..., T, which is circular, since g_t^SO is exactly the target we aim to approximate. We next show how to approximately solve Eq. (6) without evaluating g_t^SO at all (for w*_t, D*_t), or by sparsely evaluating an approximation of g_t^SO (for π*_t).

4.2 DISTILLING THE HYPERGRADIENT DIRECTION

Hessian approximation. To circumvent this difficulty, we start by simplifying the optimization objective π_t(w, D) = f_t(w, D)ᵀ g_t^SO in Eq. (5). We approximate g_t^SO as

    g_t^SO ≈ ĝ_t^SO = Σ_{i=1}^{t} γ^{t−i} α_t B_i,    (7)

with γ ≥ 0, which we tune on a meta-validation set. Note that Eq. (7) is still too expensive for online optimization, as it consists of t JVPs; we thus need further distillation, which we explain below. Eq. (7) is simply a Hessian identity approximation. For instance, vanilla SGD with learning rate η_Inner corresponds to A_j = ∂/∂w_{j−1} (w_{j−1} − η_Inner ∇L_train(w_{j−1}, λ)). Approximating the Hessian as ∇²_{w_{j−1}} L_train(w_{j−1}, λ) ≈ kI, we have A_j ≈ I − η_Inner kI = (1 − η_Inner k) I = γI. Plugging Eq. (7) into Eq. (5) and letting f_t(w_{i−1}, D_i) := σ(α_t ∂Φ(w_{i−1}, λ; D_i)/∂λ) = σ(α_t B_i), we have

    π_t(w, D) ≈ π̂_t(w, D) = Σ_{i=1}^{t} δ_{t,i} f_t(w, D)ᵀ f_t(w_{i−1}, D_i),    (8)

where δ_{t,i} = γ^{t−i} ‖α_t B_i‖₂ ≥ 0. Instead of maximizing π_t directly, we now maximize π̂_t w.r.t. w and D as a proxy objective.

Lipschitz continuity assumption. We are now ready to see how to distill the hypergradient direction w*_t and D*_t without evaluating g_t^SO. The key observation is that the maximum of π̂_t in Eq. (8) is achieved when f_t(w, D) is well aligned with f_t(w_0, D_1), ..., f_t(w_{t−1}, D_t). This intuition is directly related to the following Lipschitz continuity assumption on f_t:

    ‖f_t(w, D) − f_t(w_{i−1}, D_i)‖₂ ≤ K ‖(w, D) − (w_{i−1}, D_i)‖_X,  for i = 1, ..., t,    (9)

where K ≥ 0 is the Lipschitz constant. Eq. (9) captures which (w, D) can minimize ‖f_t(w, D) − f_t(w_{i−1}, D_i)‖₂ over i = 1, ..., t, which is equivalent to maximizing f_t(w, D)ᵀ f_t(w_{i−1}, D_i), since ‖f_t(·, ·)‖₂ = 1. For the metric ‖·‖_X, we let K² ‖(w, D)‖²_X = K₁² ‖w‖₂² + K₂² ‖D‖₂², where K₁, K₂ ≥ 0 are additional constants that we introduce for notational convenience. Squaring both sides of Eq. (9) and summing over all i = 1, ..., t, we can easily derive the following lower bound of π̂_t (see Appendix B):

    Σ_{i=1}^{t} δ_{t,i} − (K₁²/2) Σ_{i=1}^{t} δ_{t,i} ‖w − w_{i−1}‖₂² − (K₂²/2) Σ_{i=1}^{t} δ_{t,i} ‖D − D_i‖₂² ≤ π̂_t(w, D).    (10)

We now maximize this lower bound instead of directly maximizing π̂_t. Interestingly, this corresponds to the following simple minimization problems for w and D:

    min_w Σ_{i=1}^{t} δ_{t,i} ‖w − w_{i−1}‖₂²,    min_D Σ_{i=1}^{t} δ_{t,i} ‖D − D_i‖₂².    (11)
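The w-problem in Eq. (11) is a weighted least-squares problem: setting its gradient Σ_i 2δ_{t,i}(w − w_{i−1}) to zero gives the δ-weighted mean w* = Σ_i δ_{t,i} w_{i−1} / Σ_i δ_{t,i}, which is exactly the form that Eq. (12) below approximates. A quick numerical sanity check of this fact (a sketch; the arrays are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
t, dim = 5, 3
ws = rng.normal(size=(t, dim))           # past iterates w_0, ..., w_{t-1}
delta = rng.uniform(0.1, 1.0, size=t)    # positive weights delta_{t,i}

# Closed-form minimizer of sum_i delta_i * ||w - w_{i-1}||^2: the weighted mean.
w_star = (delta[:, None] * ws).sum(0) / delta.sum()

def objective(w):
    return (delta * ((ws - w) ** 2).sum(1)).sum()

# The weighted mean beats random perturbations of itself.
assert all(objective(w_star) <= objective(w_star + rng.normal(scale=0.1, size=dim))
           for _ in range(100))
```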
Efficient sequential update. Eq. (11) tells us how to determine the distilled w*_t and D*_t at each HO step t = 1, ..., T. Since ‖α_t B_i‖₂ is expensive to compute, we approximate ‖α_t B_1‖₂ ≈ ‖α_t B_2‖₂ ≈ ⋯ ≈ ‖α_t B_t‖₂, yielding the following weighted average as the approximate solution for w*_t:

    w*_t ≈ (γ^{t−1} / Σ_{i=1}^{t} γ^{t−i}) w_0 + (γ^{t−2} / Σ_{i=1}^{t} γ^{t−i}) w_1 + ⋯ + (γ⁰ / Σ_{i=1}^{t} γ^{t−i}) w_{t−1}.    (12)

The following sequential update allows us to evaluate Eq. (12) efficiently at each HO step. Denoting p_t = (γ − γ^t)/(1 − γ^t) ∈ [0, 1), we have

    t = 1:  w*_1 ← w_0;    t ≥ 2:  w*_t ← p_t w*_{t−1} + (1 − p_t) w_{t−1}.    (13)

Note that the online update in Eq. (13) does not require evaluating g_t^SO: it only incorporates the past learning trajectory w_0, w_1, ..., w_{t−1} through the sequential updates. Therefore, the only additional cost is the memory for storing and updating the weighted running average w*_t. For D, we have assumed a Euclidean distance metric as with w, but it is not straightforward to define a Euclidean distance between datasets. Instead, we simply interpret p_t and 1 − p_t as probabilities with which we proportionally subsample each dataset:

    t = 1:  D*_1 ← D_1;    t ≥ 2:  D*_t ← SS(D*_{t−1}, p_t) ∪ SS(D_t, 1 − p_t),    (14)

where SS(D, p) denotes random SubSampling of round(|D|·p) instances from D. There may be a better distance metric for datasets and a corresponding solution, but we leave this as future work. See Algorithm 3 for the overall description of our algorithm, which we name HyperDistill.

Algorithm 3 HyperDistill
1: Input: γ ∈ [0, 1], initial λ, and initial φ.
2: Output: Learned hyperparameter λ.
3: for m = 1 to M do
4:     if m ∈ EstimationPeriod then
5:         θ ← LinearEstimation(γ, λ, φ)
6:     end if
7:     w_0 ← φ
8:     for t = 1 to T do
9:         w*_t, D*_t ← Eq. (13), Eq. (14)
10:        π*_t ← c_γ(t; θ) in Eq. (15)
11:        w_t ← Φ(w_{t−1}, λ; D_t)
12:        g ← g_t^FO + π*_t f_t(w*_t, D*_t)
13:        λ ← λ − η_Hyper g
14:    end for
15:    φ ← φ − η_Reptile (φ − w_T)
16: end for

Algorithm 4 LinearEstimation(γ, λ, φ)
1: w_0 ← φ
2: for t = 1 to T do
3:     w_t ← Φ(w_{t−1}, λ; D_t)
4: end for
5: α, α_T ← ∂L_val(w_T, λ)/∂w_T;  g^SO ← 0
6: for t = T downto 1 do
7:     ŵ_{t−1} ← (1 − (t−1)/T) w_0 + ((t−1)/T) w_T
8:     g^SO ← g^SO + α B̂_t  (Eq. (16))
9:     α ← α Â_t
10:    s ← T − t + 1
11:    w*_s, D*_s ← Eq. (17), Eq. (18)
12:    v_s ← α_T ∂Φ(w*_s, λ; D*_s)/∂λ
13:    x_s ← ‖v_s‖₂ (1 − γ^s)/(1 − γ);  y_s ← σ(v_s)ᵀ g^SO
14: end for
15: return (xᵀy)/(xᵀx)

Role of γ. Eq. (12) reveals the role of γ as a decaying factor: the larger γ, the longer the portion of the past learning trajectory we consider. In this sense, our method is a generalization of the one-step lookahead approximation, which corresponds to γ = 0 and yields w*_t = w_{t−1} and D*_t = D_t, ignoring all information about the past learning trajectory except the last step. Since γ = 0 may be too pessimistic in most cases, we need to carefully find a better-performing γ for each task.
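A minimal sketch of the sequential updates in Eq. (13) and Eq. (14), not the authors' implementation. A dataset is represented here as a NumPy index array, subsample plays the role of SS from the text, and all names are illustrative:

```python
import numpy as np

def subsample(data, p, rng):
    # SS(D, p): randomly keep round(|D| * p) instances of D.
    k = int(round(len(data) * p))
    return data[rng.choice(len(data), size=k, replace=False)]

class DistilledState:
    """Maintains w*_t (Eq. 13) and D*_t (Eq. 14) across inner steps."""
    def __init__(self, gamma, seed=0):
        self.gamma, self.t = gamma, 0
        self.w_star, self.d_star = None, None
        self.rng = np.random.default_rng(seed)

    def update(self, w_prev, batch):
        self.t += 1
        if self.t == 1:                      # w*_1 <- w_0,  D*_1 <- D_1
            self.w_star, self.d_star = w_prev.copy(), batch
            return
        g, t = self.gamma, self.t
        p = (g - g**t) / (1.0 - g**t)        # p_t in [0, 1); p = 0 when gamma = 0
        self.w_star = p * self.w_star + (1 - p) * w_prev
        self.d_star = np.concatenate([subsample(self.d_star, p, self.rng),
                                      subsample(batch, 1 - p, self.rng)])
```

Note that with γ = 0 the update collapses to w*_t = w_{t−1} and D*_t = D_t, recovering the one-step lookahead behavior described above.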
4.3 DISTILLING THE HYPERGRADIENT SIZE

We now need to plug the distilled w*_t and D*_t into π_t(w, D) in Eq. (5) to obtain the scaling factor π*_t = f_t(w*_t, D*_t)ᵀ g_t^SO for the online HO steps t = 1, ..., T. However, whereas evaluating the single JVP f_t(w*_t, D*_t) is tolerable, evaluating g_t^SO is again problematic, as it is the target we aim to approximate. Moreover, the trick we used in Sec. 4.2 does not straightforwardly apply to π*_t.

Linear estimator. We thus introduce a linear function c_γ(t; θ) that estimates π*_t by periodically fitting θ ∈ ℝ, the parameter of the estimator. Then at each HO step t we can use c_γ(t; θ) instead of fully evaluating π*_t. Based on the observation that the form of the lower bound in Eq. (10) is roughly proportional to Σ_{i=1}^{t} δ_{t,i} = Σ_{i=1}^{t} γ^{t−i} ‖α_t B_i‖₂, we conveniently set c_γ(t; θ) as follows:

    c_γ(t; θ) = θ ‖v_t‖₂ Σ_{i=1}^{t} γ^{t−i},  where  v_t := α_t ∂Φ(w*_t, λ; D*_t)/∂λ.    (15)

[Figure 1: Collecting g_s^SO along the interpolated trajectory (forward pass, then DrMAD back-propagation).]

Collecting samples. We next see how to collect the samples {(x_s, y_s)}_{s=1}^{T} for fitting the parameter θ, where x_s = ‖v_s‖₂ (1 − γ^s)/(1 − γ) and y_s = π*_s = f_s(w*_s, D*_s)ᵀ g_s^SO = σ(v_s)ᵀ g_s^SO. For this, we need to efficiently collect:
1. g_s^SO, the second-order term computed over a horizon of size s.
2. v_s, the distilled JVP computed over a horizon of size s.

1) g_s^SO: Note that DrMAD in Algorithm 2 (line 6) sequentially back-propagates g^SO for t = T, ..., 1. The important observation is that, at step t, this incomplete second-order term g^SO = Σ_{i=t}^{T} α_T Â_T Â_{T−1} ⋯ Â_{i+1} B̂_i can be seen as a valid second-order term computed over a horizon of size s = T − t + 1. This is because the reparameterization s = T − t + 1 gives

    g_s^SO = Σ_{i=1}^{s} α_T ( Π_{j=i+1}^{s} Â_{j+(T−s)} ) B̂_{i+(T−s)}    (16)

for s = 1, ..., T, which is nothing but shifting the trajectory index by T − s steps so that the last step is always T. Therefore, we can efficiently obtain the valid second-order term g_s^SO for all s = 1, ..., T through a single backward traversal of the interpolated trajectory (see Figure 1). Each g_s^SO requires only one or two additional JVPs. Also, since we use DrMAD instead of RMD, we only store w_0, so that the memory cost is constant w.r.t. the total horizon size T.

2) v_s: To compute the distilled JVP v_s, we first compute the distilled w*_s and D*_s as below, similarly to Eq. (13) and Eq. (14). Denoting p_s = (1 − γ^{s−1})/(1 − γ^s), we have

    s = 1:  w*_1 ← w_{T−1};    s ≥ 2:  w*_s ← p_s w*_{s−1} + (1 − p_s) w_{T−s}    (17)
    s = 1:  D*_1 ← D_T;        s ≥ 2:  D*_s ← SS(D*_{s−1}, p_s) ∪ SS(D_{T−s+1}, 1 − p_s)    (18)

We then compute the unnormalized distilled JVP as v_s = α_T ∂Φ(w*_s, λ; D*_s)/∂λ.

Estimating θ. Now we are ready to estimate θ. For x, we have x_s = ‖v_s‖₂ (1 − γ^s)/(1 − γ) and collect x = (x_1, ..., x_T). For y, we have y_s = σ(v_s)ᵀ g_s^SO and collect y = (y_1, ..., y_T). Finally, we estimate θ = (xᵀy)/(xᵀx). See Algorithm 3 and Algorithm 4 for the details. In practice, we set EstimationPeriod in Algorithm 3 to every 50 completed inner-optimizations, i.e., m ∈ {1, 51, 101, ...}. Thus, the computational cost of LinearEstimation is marginal in terms of wall-clock time (see Table 3).
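The closed-form fit at the end of Algorithm 4 is ordinary least squares through the origin. A minimal sketch, where x and y are the arrays collected during the backward pass and v_norm = ‖v_t‖₂ (σ(v) = v/‖v‖₂ as defined in Sec. 4.1):

```python
import numpy as np

def fit_theta(x, y):
    """Least-squares slope through the origin: argmin_theta ||theta * x - y||^2."""
    x, y = np.asarray(x), np.asarray(y)
    return float(x @ y / (x @ x))

def c_gamma(t, theta, v_norm, gamma):
    # Eq. (15): c_gamma(t; theta) = theta * ||v_t||_2 * sum_{i=1}^t gamma^{t-i},
    # using the geometric-series identity sum_{i=1}^t gamma^{t-i} = (1-gamma^t)/(1-gamma).
    return theta * v_norm * (1.0 - gamma**t) / (1.0 - gamma)
```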
5 EXPERIMENTS

Baselines. We demonstrate the efficacy of our algorithm by comparing against the following baselines.
1) First-Order Approximation (FO). Computationally the most efficient HO algorithm, which completely ignores the second-order term, i.e., g^SO = 0.
2) One-step Lookahead Approximation (1-step) (Luketina et al., 2016). The short-horizon approximation where only a single step is unrolled to compute each hypergradient.
3) DrMAD (Fu et al., 2016). An approximation of RMD that linearly interpolates between the initial and the last weight to save memory (see Algorithm 2).
4) Neumann IFT (N.IFT) (Lorraine et al., 2020). An IFT-based method that approximates the inverse-Hessian-vector product with a Neumann series. Note that this method supports online optimization around convergence. Specifically, among the total T = 100 inner-steps, N.IFT(N, K) means that for the last K steps we perform online HO, each with N inversion steps. It requires (N + 1)·K JVPs in total. We tune (N, K) among {(2, 25), (5, 10), (10, 5)}, roughly computing 50 JVPs per inner-optimization.
5) HyperDistill. Our high-dimensional online HO algorithm based on the idea of knowledge distillation. We tune the decaying factor within γ ∈ {0.9, 0.99, 0.999, 0.9999}. The linear regression is performed every 50 inner-optimization problems.

Target meta-learning models. We test on the following three meta-learning models.
1) Almost No Inner Loop (ANIL) (Raghu et al., 2019). The intuition of ANIL is that the need for task-specific adaptation diminishes when the task distribution is homogeneous. Following this intuition, based on a typical 4-layer convolutional network with 32 channels (Finn et al., 2017), we designate the three bottom layers as the high-dimensional hyperparameter, and the 4th convolutional layer and the last fully-connected layer as the weight, similarly to Javed & White (2019).

[Figure 2: Meta-training convergence measured in L_val(w_T, λ) with T = 100 inner-steps, for ANIL / tinyImageNet, WarpGrad / CIFAR100, WarpGrad / tinyImageNet, and MetaWeightNet / CIFAR100 (x-axis: # inner-optimizations solved; curves: FO, 1-step, DrMAD, N.IFT, HyperDistill). We report means and 95% confidence intervals over 5 meta-training runs.]

Table 2: Meta-test performance measured in test classification accuracy (%). We report means and 95% confidence intervals over 5 meta-training runs.

| Method       | Online optim. | # JVPs / inner-opt. | ANIL tinyImageNet | WarpGrad CIFAR100 | WarpGrad tinyImageNet | MetaWeightNet CIFAR100 |
|--------------|---------------|---------------------|-------------------|-------------------|-----------------------|------------------------|
| FO           | O             | 0                   | 53.62 ± 0.06      | 58.16 ± 0.52      | 53.54 ± 0.74          | N/A                    |
| 1-step       | O             | 50                  | 53.90 ± 0.43      | 58.18 ± 0.52      | 49.97 ± 2.46          | 58.45 ± 0.40           |
| DrMAD        | X             | 199                 | 49.84 ± 1.35      | 55.13 ± 0.64      | 50.71 ± 1.16          | 57.03 ± 0.42           |
| Neumann IFT  | partial       | {55, 60, 75}        | 53.76 ± 0.31      | 58.88 ± 0.65      | 50.15 ± 0.98          | 59.34 ± 0.27           |
| HyperDistill | O             | 58                  | 56.37 ± 0.27      | 60.91 ± 0.27      | 55.04 ± 0.52          | 60.82 ± 0.33           |

2) WarpGrad (Flennerhag et al., 2019). Secondly, we consider WarpGrad, whose goal is to meta-learn non-linear warp layers that facilitate fast inner-optimization and better generalization. We use a 3-layer convolutional network with 32 channels. Every layer is interleaved with two warp layers that do not participate in the inner-optimization; these constitute the high-dimensional hyperparameter.
3) MetaWeightNet (Shu et al., 2019). Lastly, we consider solving the label-corruption problem with MetaWeightNet, which meta-learns a small MLP that takes a 1-dimensional loss as input and outputs a reweighted loss. The parameters of the MLP are considered the high-dimensional hyperparameter. Labels are independently corrupted to random classes with probability 0.4. Note that we aim to meta-learn the MLP over a task distribution and apply it to diverse unseen tasks, instead of solving a single task. Also, for this meta-model the direct gradient is zero, g^FO = 0; in this case π in HyperDistill merely rescales the hyperparameter learning rate, so we simply set π = 1.

Use of Reptile. For all the above meta-learning models, we also meta-learn the weight initialization with Reptile (Nichol et al., 2018), representing a more practical meta-learning scenario than learning from random initialization. We use the Reptile learning rate η_Reptile = 1. Note that φ in Algorithm 3 and Algorithm 4 denotes the Reptile initialization parameter.

Task distribution. To generate each task, we randomly sample 10 classes (5,000 examples) from a given dataset and randomly split them into 2,500 training and 2,500 test examples.
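A minimal sketch of this task-generation procedure (names are illustrative; it assumes an array of integer class labels for the source dataset, and a plain 50/50 random split as described above):

```python
import numpy as np

def sample_task(labels, n_classes=10, rng=None):
    """Sample a task: pick n_classes classes and split their examples 50/50.

    labels: integer class label per example in the source dataset.
    Returns (train_idx, test_idx) index arrays into the dataset.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_classes, replace=False)
    idx = np.flatnonzero(np.isin(labels, classes))  # e.g., 10 x 500 = 5000 examples
    rng.shuffle(idx)
    half = len(idx) // 2
    return idx[:half], idx[half:]                   # 2500 train / 2500 test
```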
We consider the following two datasets.
1) tinyImageNet (Le & Yang, 2015). This dataset contains 200 classes of general categories. We split them into 100, 40, and 60 classes for meta-training, meta-validation, and meta-testing. Each class has 500 examples of size 64×64.
2) CIFAR100 (Krizhevsky et al., 2009). This dataset contains 100 classes of general categories. We split them into 50, 20, and 30 classes for meta-training, meta-validation, and meta-testing. Each class has 500 examples of size 32×32.

Experimental setup. Meta-training: For the inner-optimization of the weights, we use SGD with momentum 0.9 and set the learning rate to η_Inner = 0.1 for MetaWeightNet and η_Inner = 0.01 for the others. The number of inner-steps is T = 100 and the batch size is 100. We use random cropping and horizontal flipping as data augmentations. For the hyperparameter optimization, we also use SGD with momentum 0.9, with learning rate η_Hyper = 0.01 for MetaWeightNet and η_Hyper = 0.001 for the others, which we linearly decay toward 0 over a total of M = 1000 inner-optimizations. We perform parallel meta-learning with the meta-batch size set to 4. Meta-testing: We solve 500 tasks to measure average performance, with exactly the same inner-optimization setup as in meta-training. We repeat this over 5 different meta-training runs and report means and 95% confidence intervals (see Table 2). Code is publicly available at: https://github.com/haebeom-lee/hyperdistill

5.1 ANALYSIS

We perform the following analysis with the WarpGrad model on the CIFAR100 dataset.

HyperDistill provides faster convergence and better generalization. Figure 2 shows that HyperDistill converges much faster during meta-training than the baselines for all the meta-learning models and datasets we considered. We see that the convergence of offline methods such as DrMAD is significantly worse than that of a simple first-order method, demonstrating the importance of frequent updates via online optimization. HyperDistill converges significantly better than FO and 1-step because it is online and at the same time alleviates the short-horizon bias. As a result, Table 2 shows that the meta-test performance of HyperDistill is significantly better than that of the baselines, although it requires a comparable number of JVPs per inner-optimization.

[Figure 3: Cosine similarity to exact RMD in terms of (a) the hypergradient g^FO + g^SO and (b, c) the second-order term g^SO (x-axis: # inner-optimizations solved; methods include FO, 1-step, DrMAD, N.IFT(2, 1), N.IFT(10, 1), HyperDistill). The curves in (b) correspond to Eq. (7) with various γ.]

[Figure 4: (a, b) Samples collected from Algorithm 4 and the correspondingly fitted linear estimators (θ). (c) A fitted estimator and the range of the actual ground-truth estimator (the shaded area is one sigma). (d) The stability of the θ estimation. (e) Meta-test performance vs. computational cost in terms of the number of JVPs per inner-optimization.]

HyperDistill is a reasonable approximation of the true hypergradient. We see from Figure 3(a) that the hypergradient obtained from HyperDistill is more similar to the exact RMD hypergradient than those obtained from FO and 1-step, demonstrating that HyperDistill can actually alleviate the short-horizon bias. HyperDistill is even comparable to N.IFT(10, 1), which computes 11 JVPs, whereas HyperDistill computes only a single JVP. These results indicate that the approximation used in Eq. (7) and the DrMAD approximation in Eq. (16) are sufficiently accurate. Figure 3(b) shows that with careful tuning of γ (e.g., 0.99), the direction of the approximated second-order term in Eq. (7) can be much more accurate than the second-order term of 1-step (γ = 0).
In Figure 3(c), since HyperDistill distills such a good approximation, it provides a better direction of the second-order term than 1-step. Although the gap may seem marginal, even N.IFT(10, 1) performs similarly, showing that matching the direction of the second-order term without unrolling the full gradient steps is an inherently challenging problem. Figures 4(a) and 4(b) show that the samples collected according to Algorithm 4 are largely linear, supporting our choice of Eq. (15). Figures 4(c) and 4(d) show that the fitted θ is accurate and stable, explaining why we do not have to perform the estimation frequently. Note that the DrMAD approximation (Eq. (16)) is accurate (Figures 3(a) and 3(c)), which helps to predict the hypergradient size.

HyperDistill is computationally efficient. Figure 4(e) shows the superior computational efficiency of HyperDistill in terms of the trade-off between meta-test performance and the amount of JVP computation. Note that wall-clock time is roughly proportional to the number of JVPs per inner-optimization. In Appendix F, we can see that the actual increase in memory cost and wall-clock time is very marginal compared to the 1-step approximation.

6 CONCLUSION

In this work, we proposed a novel HO method, HyperDistill, that can optimize high-dimensional hyperparameters in an online manner by approximating the exact second-order term with knowledge distillation. We demonstrated that HyperDistill provides faster meta-convergence and better generalization performance on realistic meta-learning methods and datasets, and we verified that this is thanks to the accuracy of the proposed approximations.

ACKNOWLEDGEMENTS

This work was supported by the Google AI Focused Research Award, the Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD), the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

REFERENCES

Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems, 31, 2018.
Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889-1900, 2000.
James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 2012.
Justin Domke. Generic methods for optimization-based modeling. In Artificial Intelligence and Statistics, pp. 318-326. PMLR, 2012.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126-1135. PMLR, 2017.
Sebastian Flennerhag, Pablo G. Moreno, Neil D. Lawrence, and Andreas Damianou. Transferring knowledge across learning processes. In International Conference on Learning Representations, 2018.
Sebastian Flennerhag, Andrei A. Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin, and Raia Hadsell. Meta-learning with warped gradient descent. In International Conference on Learning Representations, 2019.
Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165-1173. PMLR, 2017.
Jie Fu, Hongyin Luo, Jiashi Feng, Kian Hsiang Low, and Tat-Seng Chua. DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1469-1475, 2016.
Edward Grefenstette, Brandon Amos, Denis Yarats, Phu Mon Htut, Artem Molchanov, Franziska Meier, Douwe Kiela, Kyunghyun Cho, and Soumith Chintala. Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727, 2019.
Daniel Jiwoong Im, Cristina Savin, and Kyunghyun Cho. Online hyperparameter optimization by real-time recurrent learning. arXiv preprint arXiv:2102.07813, 2021.
Khurram Javed and Martha White. Meta-learning representations for continual learning. Advances in Neural Information Processing Systems, 32, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
Hae Beom Lee, Taewook Nam, Eunho Yang, and Sung Ju Hwang. Meta dropout: Learning to perturb latent features for generalization. In International Conference on Learning Representations, 2019.
Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning, pp. 2927-2936. PMLR, 2018.
Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2018.
Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pp. 1540-1552. PMLR, 2020.
Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pp. 2952-2960. PMLR, 2016.
Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113-2122. PMLR, 2015.
Paul Micaelli and Amos Storkey. Non-greedy gradient-based hyperparameter optimization over long horizons. 2020.
Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
Eunbyung Park and Junier B. Oliva. Meta-curvature. Advances in Neural Information Processing Systems, 32, 2019.
Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737-746. PMLR, 2016.
Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2019.
Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334-4343. PMLR, 2018.
Jeong Un Ryu, Jaewoong Shin, Hae Beom Lee, and Sung Ju Hwang. MetaPerturb: Transferable regularizer for heterogeneous tasks and architectures. Advances in Neural Information Processing Systems, 33:11501-11512, 2020.
Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723-1732. PMLR, 2019.
Jaewoong Shin, Hae Beom Lee, Boqing Gong, and Sung Ju Hwang. Large-scale meta-learning with continual trajectory shifting. In International Conference on Machine Learning, pp. 9603-9613. PMLR, 2021.
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-Weight-Net: Learning an explicit mapping for sample weighting. Advances in Neural Information Processing Systems, 32, 2019.
Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
Sebastian Thrun and Lorien Pratt (eds.). Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998. ISBN 0-7923-8047-9.
Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, and Ming-Hsuan Yang. Cross-domain few-shot classification via learned feature-wise transformation. In International Conference on Learning Representations, 2020.
Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29, 2016.
Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.
Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.
Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.

A DERIVATION OF EQUATION (6)

Let f := f_t(w, D) and g := g_t^SO for notational simplicity, and note that ‖f‖ = 1. Then,

    π_t(w, D) = argmin_π ‖πf − g‖ = argmin_π ‖πf − g‖² = argmin_π ( π² fᵀf − 2π fᵀg + gᵀg ) = fᵀg,    (19)

where the last equality follows from fᵀf = 1. Plugging this into π in Eq. (19), and under the assumption π_t(w, D) = fᵀg ≥ 0, we have

    w*, D* = argmin_{w,D} ( (fᵀg)² fᵀf − 2 (fᵀg)(fᵀg) + gᵀg ) = argmin_{w,D} ( gᵀg − (fᵀg)² ) = argmax_{w,D} (fᵀg)² = argmax_{w,D} fᵀg = argmax_{w,D} π_t(w, D).    (20)

Note that Eq. (20) already encodes the closed-form solution π_t(w, D). Therefore, the above is a joint optimization, so we do not have to alternate between optimizing (w, D) and π.
B DERIVATION OF EQUATION (10)

Let f := f_t(w, D) and f_i := f_t(w_{i−1}, D_i) for notational simplicity. Note that ‖f‖ = ‖f_1‖ = ⋯ = ‖f_t‖ = 1, and we are given the following t inequalities:

    ‖f − f_i‖ ≤ K ‖(w, D) − (w_{i−1}, D_i)‖_X,  for i = 1, ..., t.

Squaring both sides, using ‖f − f_i‖² = 2 − 2 fᵀf_i for unit vectors, and multiplying by δ_{t,i},

    2δ_{t,i} − 2δ_{t,i} fᵀf_i ≤ K₁² δ_{t,i} ‖w − w_{i−1}‖² + K₂² δ_{t,i} ‖D − D_i‖²,  for i = 1, ..., t.

Summing the t inequalities over all i = 1, ..., t,

    2 Σ_{i=1}^{t} δ_{t,i} − 2 Σ_{i=1}^{t} δ_{t,i} fᵀf_i ≤ K₁² Σ_{i=1}^{t} δ_{t,i} ‖w − w_{i−1}‖² + K₂² Σ_{i=1}^{t} δ_{t,i} ‖D − D_i‖².

Rearranging the terms,

    Σ_{i=1}^{t} δ_{t,i} − (K₁²/2) Σ_{i=1}^{t} δ_{t,i} ‖w − w_{i−1}‖² − (K₂²/2) Σ_{i=1}^{t} δ_{t,i} ‖D − D_i‖² ≤ Σ_{i=1}^{t} δ_{t,i} fᵀf_i = π̂_t(w, D).

C META-VALIDATION PERFORMANCE

[Figure 5: Meta-validation performance as meta-training proceeds, for ANIL / tinyImageNet, WarpGrad / CIFAR100, WarpGrad / tinyImageNet, and MWN / CIFAR100 (x-axis: # inner-optimizations solved; curves: FO, 1-step, DrMAD, N.IFT, HyperDistill). We report means and 95% confidence intervals over 5 meta-training runs.]

Figure 5 shows the meta-validation performance as meta-training proceeds. Our HyperDistill shows much faster meta-convergence and better generalization at convergence than the baselines, which is consistent with the meta-training convergence shown in Figure 2.

D HYPER-HYPERPARAMETER ANALYSIS

[Figure 6: Meta-test performance over varying values of γ ∈ {0.9, 0.99, 0.999, 0.9999}, for each experimental setup, compared against FO, 1-step, and DrMAD. Red stars denote the actual γ we used for each experiment (found with a meta-validation set) and the corresponding performance.]

Our algorithm, HyperDistill, has a hyper-hyperparameter γ that we tune with a meta-validation set in the range {0.9, 0.99, 0.999, 0.9999}. Figure 6 shows that for all values of γ and for all the experimental setups we consider, HyperDistill outperforms all the baselines by significant margins. This demonstrates that the performance of HyperDistill is not very sensitive to the value of γ.

E MORE DETAILS OF METAWEIGHTNET EXPERIMENTS

[Figure 7: Learned loss weighting function (loss value on the x-axis) with each algorithm, including (d) HyperDistill.]

We provide the additional experimental setup for the MetaWeightNet (Shu et al., 2019) experiments. We use a 1-200(ReLU)-1 loss weighting network architecture, following the original paper. We also found that lower-bounding the output of the weighting function with 0.1 stabilizes the training. Figure 7 shows the loss weighting function learned with each algorithm. We see that the weighting function learned with HyperDistill tends to output lower values than the baselines.

F COMPUTATIONAL EFFICIENCY

Table 3: Memory (MB) and wall-clock time (s per inner-optimization) required by a single process, measured on an RTX 2080 Ti. Note that we run 4 processes in parallel in our actual experiments (meta-batch size 4), which requires roughly 4× the memory reported in this table.

| Method       | ANIL tinyImageNet | WarpGrad CIFAR100 | WarpGrad tinyImageNet | MetaWeightNet CIFAR100 |
|--------------|-------------------|-------------------|-----------------------|------------------------|
| FO           | 1430 / 6.23       | 1092 / 5.24       | 1840 / 7.01           | N/A                    |
| 1-step       | 1584 / 6.80       | 1650 / 6.88       | 3844 / 18.81          | 1214 / 6.17            |
| DrMAD        | 1442 / 20.88      | 1734 / 19.83      | 4148 / 57.09          | 1262 / 17.57           |
| Neumann IFT  | 1392 / 7.98       | 1578 / 7.43       | 3286 / 21.49          | 1262 / 6.93            |
| HyperDistill | 1638 / 6.92       | 1714 / 8.68       | 4098 / 22.15          | 1206 / 6.04            |

Table 3 shows the computational efficiency measured in actual memory usage and the average wall-clock time required to complete a single inner-optimization.
We can see from the table that, whereas our HyperDistill requires slightly more memory and wall-clock time than the 1-step or Neumann IFT methods, the additional cost is clearly tolerable considering the superior meta-test performance shown in Table 2.

G SINUSOIDAL REGRESSION

[Figure 8: Meta-convergence on sinusoidal regression (x-axis: # inner-optimizations solved, up to 30; curves: FO, 1-step, DrMAD, N.IFT(3, 5), HyperDistill).]

Table 4: Meta-test performance (MSE).

| Method       | MSE           |
|--------------|---------------|
| FO           | 0.567 ± 0.193 |
| 1-step       | 0.670 ± 0.283 |
| DrMAD        | 1.086 ± 0.176 |
| N.IFT        | 0.502 ± 0.146 |
| HyperDistill | 0.327 ± 0.052 |

In this section, we conduct sinusoidal experiments to demonstrate the efficacy of our method on a regression task.

Task distribution. Each task is to regress a curve sampled from the following distribution of sinusoidal functions: the amplitude and phase are sampled from U(0.1, 5) and U(0, π), respectively. The range of the input is [−5, 5], and the input and output dimensions are both 1 (Finn et al., 2017). We consider 10-shot regression problems.

Meta-model and network architecture. We set the meta-model to ANIL (Raghu et al., 2019). Given the 4-layer fully-connected ReLU network (1-100-100-100-1), the first three layers are set as the hyperparameter, and only the last layer is adapted to each given task.

Experimental setup. For the inner-optimization of the weights, we use SGD with momentum 0.9 and set the learning rate to η_Inner = 0.01. The number of inner-steps is T = 30. For the hyperparameter optimization, we use the Adam optimizer (Kingma & Ba, 2015) with learning rate η_Hyper = 0.001. The number of inner-optimizations solved during meta-training is M = 30. We perform parallel meta-learning with the meta-batch size set to 10. Meta-testing: We solve 1000 tasks to measure the average mean squared error (MSE), with exactly the same inner-optimization setup as in meta-training. We repeat this over 5 different meta-training runs and report means and 95% confidence intervals (see Table 4).

Results. In Figure 8 and Table 4, we see that HyperDistill shows better meta-convergence and meta-test performance than all the baselines. Comparing against the FO and 1-step baselines, we see that it is still important to consider longer horizons even in this relatively fewer-shot learning scenario. Also, DrMAD shows poor performance, demonstrating the importance of online optimization.
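For concreteness, the task distribution described at the start of this section can be sampled as follows (a minimal sketch; the function and variable names are illustrative):

```python
import numpy as np

def sample_sinusoid_task(k_shot=10, rng=None):
    """Sample one 10-shot sinusoidal regression task (Appendix G setup)."""
    rng = rng or np.random.default_rng()
    amplitude = rng.uniform(0.1, 5.0)        # A ~ U(0.1, 5)
    phase = rng.uniform(0.0, np.pi)          # b ~ U(0, pi)
    f = lambda x: amplitude * np.sin(x + phase)
    x = rng.uniform(-5.0, 5.0, size=(k_shot, 1))  # inputs in [-5, 5], 1-D in/out
    return x, f(x), f                        # support inputs, targets, true curve
```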
H STANDARD LEARNING SCENARIO

[Figure 9: Meta-train convergence in the first stage (x-axis: # steps, up to 10K; curves: 1-step, N.IFT, HyperDistill).]

[Figure 10: Meta-test convergence in the second stage (x-axis: # steps, up to 10K; curves: 1-step, N.IFT, HyperDistill).]

Table 5: Meta-test performance.

| Method       | Test ACC (%) |
|--------------|--------------|
| 1-step       | 70.15 ± 1.24 |
| N.IFT        | 70.85 ± 0.63 |
| HyperDistill | 72.68 ± 1.15 |

In this section, instead of the meta-learning setting, which involves a task distribution, we consider the standard learning scenario where we are given only a single classification task.

Two-stage learning. We consider the following two-stage learning scenario, which is a reasonable way to cast a standard classification task into a meta-learning problem (Liu et al., 2018). In the first stage, we split the whole CIFAR10 (Krizhevsky et al., 2009) training dataset into two sets with an equal number of instances (25,000 each). We then use one as a training set to optimize the weight w, and the other as a validation set to optimize the hyperparameter λ. In the second stage, we merge the two sets back into the original dataset and re-train w on it from random initialization, while the hyperparameter λ learned in the first stage is kept fixed.

MetaWeightNet (Shu et al., 2019). Again, we consider MetaWeightNet, which we used in Section 5. We use the same network structure for the loss weighting network and the same label-corruption strategy. Note that the original paper uses the 1-step strategy.

Experimental setup. In the first stage, for both the weight w and the hyperparameter λ, we use SGD with momentum 0.9 and set the learning rate to 0.01. The number of training steps is set to T = 10,000. In the second stage, as mentioned above, we reinitialize and train w with the merged dataset, while fixing the λ obtained from the first stage. We repeat this two-stage process 5 times and report means and 95% confidence intervals (see Table 5). Hyper-hyperparameters: For Neumann IFT, we compute the hypergradients (each with 5 inversion steps) every 5 gradient steps. For HyperDistill, we set γ = 0.9.

Results. In Figures 9 and 10 and Table 5, we see that HyperDistill shows much better convergence and test performance than all the baselines. The results demonstrate the effectiveness of our method for solving standard HO problems. Note that we cannot include FO here because there is no direct gradient for this MetaWeightNet model, i.e., g^FO = 0. Also, we do not consider offline methods like DrMAD because the horizon is now too long to backpropagate all the way through the learning process.

I FURTHER ANALYSIS ON SHORT-HORIZON BIAS

[Figure 11: Short-horizon bias: meta-training convergence on WarpGrad / CIFAR100 with γ ∈ {0, 0.3, 0.6, 0.9, 0.99} (x-axis: # inner-optimizations solved).]

In this section, we demonstrate the effect of short-horizon bias by showing the convergence plots with a varying decaying factor γ (see Figure 11). Note that in this controlled experiment we want to isolate the effect of γ, so we fix the scaling factor at π = 1. In Figure 11, γ = 0 corresponds to 1-step, which computes the hypergradients by unrolling only a single step and thus suffers from short-horizon bias. As we increase γ, the convergence roughly improves as well, demonstrating that the short-horizon bias can be alleviated by increasing γ. Also, see Figure 3(b), which shows how well the different values of γ can recover the true hypergradients; the best-performing γ appears to be 0.99 in terms of the cosine similarity to the true hypergradients.

J EXPERIMENTS WITH FOMAML

[Figure 12: Meta-train convergence on WarpGrad / CIFAR100 with FOMAML (x-axis: # inner-optimizations solved; curves: FO, 1-step, HyperDistill).]

[Figure 13: Meta-validation convergence on WarpGrad / CIFAR100 with FOMAML (x-axis: # inner-optimizations solved; curves: FO, 1-step, HyperDistill).]

Table 6: Meta-test performance.

| Method       | Test ACC (%) |
|--------------|--------------|
| FO           | 45.01 ± 1.29 |
| 1-step       | 46.03 ± 0.59 |
| HyperDistill | 51.43 ± 0.98 |

In order to demonstrate that HyperDistill works well with meta-learning algorithms other than Reptile (Nichol et al., 2018), we consider first-order MAML (FOMAML) (Finn et al., 2017). Note that we need the first-order approximation of MAML because the original MAML with second-order derivatives is too expensive with the long horizon T = 100. We can see from Figure 12 and Figure 13 that our method provides much faster and better meta-convergence than FO and 1-step. As a result, in Table 6 our HyperDistill achieves significantly better meta-test performance than the baselines. Note that we do not report the performance of DrMAD and Neumann IFT because they fail to meta-converge with FOMAML.
Note that the performance with FOMAML is much worse than with Reptile, which is a well-known result from the previous literature (Flennerhag et al., 2018; Shin et al., 2021). This is because FOMAML ignores the whole learning process except the very last step's gradient information, which becomes more critical with longer horizons, as the last-step gradient becomes arbitrarily uninformative about the initialization. We thus recommend using Reptile instead of FOMAML.