# adaptive_gradientbased_metalearning_methods__2b3d2a88.pdf

Adaptive Gradient-Based Meta-Learning Methods

Mikhail Khodak Carnegie Mellon University khodak@cmu.edu

Maria-Florina Balcan Carnegie Mellon University ninamf@cs.cmu.edu

Ameet Talwalkar Carnegie Mellon University & Determined AI talwalkar@cmu.edu

We build a theoretical framework for designing and understanding practical metalearning methods that integrates sophisticated formalizations of task-similarity with the extensive literature on online convex optimization and sequential prediction algorithms. Our approach enables the task-similarity to be learned adaptively, provides sharper transfer-risk bounds in the setting of statistical learning-to-learn, and leads to straightforward derivations of average-case regret bounds for efﬁcient algorithms in settings where the task-environment changes dynamically or the tasks share a certain geometric structure. We use our theory to modify several popular meta-learning algorithms and improve their meta-test-time performance on standard problems in few-shot learning and federated learning.

1 Introduction

Meta-learning, or learning-to-learn (LTL) [52], has recently re-emerged as an important direction for developing algorithms for multi-task learning, dynamic environments, and federated settings. By using the data of numerous training tasks, meta-learning methods seek to perform well on new, potentially related test tasks without using many samples. Successful modern approaches have also focused on exploiting the capabilities of deep neural networks, whether by learning multi-task embeddings passed to simple classiﬁers [51] or by neural control of optimization algorithms [46].

Because of its simplicity and ﬂexibility, a common approach is parameter-transfer, where all tasks use the same class of Θ-parameterized functions fθ : X 7 Y; often a shared model φ Θ is learned that is used to train within-task models. In gradient-based meta-learning (GBML) [23], φ is a meta-initialization for a gradient descent method over samples from a new task. GBML is used in a variety of LTL domains such as vision [38, 44, 35], federated learning [16], and robotics [20, 1]. Its simplicity also raises many practical and theoretical questions about the task-relations it can exploit and the settings in which it can succeed. Addressing these issues has naturally led several authors to online convex optimization (OCO) [55], either directly [24, 34] or from online-tobatch conversion [34, 19]. These efforts study how to ﬁnd a meta-initialization, either by proving algorithmic learnability [24] or giving meta-test-time performance guarantees [34, 19].

However, this recent line of work has so far considered a very restricted, if natural, notion of tasksimilarity closeness to a single ﬁxed point in the parameter space. We introduce a new theoretical framework, Average Regret-Upper-Bound Analysis (ARUBA), that enables the derivation of metalearning algorithms that can provably take advantage of much more sophisticated structure. ARUBA treats meta-learning as the online learning of a sequence of losses that each upper bounds the regret on a single task. These bounds often have convenient functional forms that are (a) sufﬁciently nice, so that we can draw upon the existing OCO literature, and (b) strongly dependent on both the task-data and the meta-initialization, thus encoding task-similarity in a mathematically accessible way. Using ARUBA we introduce or dramatically improve upon GBML results in the following settings:

33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada.

Adapting to the Task-Similarity: A major drawback of previous work is a reliance on knowing the task-similarity beforehand to set the learning rate [24] or regularization [19], or the use of a sub-optimal guess-and-tune approach using the doubling trick [34]. ARUBA yields a simple gradient-based algorithm that eliminates the need to guess the similarity by learning it on-the-ﬂy.

Adapting to Dynamic Environments: While previous theoretical work has largely considered a ﬁxed initialization [24, 34], in many practical applications of GBML the optimal initialization varies over time due to a changing environment [1]. We show how ARUBA reduces the problem of meta-learning in dynamic environments to a dynamic regret-minimization problem, for which there exists a vast array of online algorithms with provable guarantees that can be directly applied.

Adapting to the Inter-Task Geometry: A recurring notion in LTL is that certain model weights, such as feature extractors, are shared, whereas others, such as classiﬁcation layers, vary between tasks. By only learning a ﬁxed initialization we must re-learn this structure on every task. Using ARUBA we provide a method that adapts to this structure and determines which directions in Θ need to be updated by learning a Mahalanobis-norm regularizer for online mirror descent (OMD). We show how a variant of this can be used to meta-learn a per-coordinate learning-rate for certain GBML methods, such as MAML [23] and Reptile [44], as well as for Fed Avg, a popular federated learning algorithm [41]. This leads to improved meta-test-time performance on few-shot learning and a simple, tuning-free approach to effectively add user-personalization to Fed Avg.

Statistical Learning-to-Learn: ARUBA allows us to leverage powerful results in online-to-batch conversion [54, 33] to derive new bounds on the transfer risk when using GBML for statistical LTL [8], including fast rates in the number of tasks when the task-similarity is known and highprobability guarantees for a class of losses that includes linear regression. This improves upon the guarantees of Khodak et al. [34] and Denevi et al. [19] for similar or identical GBML methods.

1.1 Related Work

Theoretical LTL: The statistical analysis of LTL was formalized by Baxter [8]. Several works have built upon this theory for modern LTL, such as via a PAC-Bayesian perspective [3] or by learning the kernel for the ridge regression [18]. However, much effort has also been devoted to the online setting, often through the framework of lifelong learning [45, 5, 2]. Alquier et al. [2] consider a many-task notion of regret similar to the one we study in order to learn a shared data representation, although our algorithms are much more practical. Recently, Bullins et al. [11] developed an efﬁcient online approach to learning a linear data embedding, but such a setting is distinct from GBML and more closely related to popular shared-representation methods such as Proto Nets [51]. Nevertheless, our approach does strongly rely on online learning through the study of data-dependent regret-upperbounds, which has a long history of use in deriving adaptive single-task methods [40, 21]; however, in meta-learning there is typically not enough data to adapt to without considering multi-task data. Analyzing regret-upper-bounds was done implicitly by Khodak et al. [34], but their approach is largely restricted to using Follow-the-Leader (FTL) as the meta-algorithm. Similarly, Finn et al. [24] use FTL to show learnability of the MAML meta-initialization. In contrast, the ARUBA framework can handle general classes of meta-algorithms, which leads not only to new and improved results in static, dynamic, and statistical settings but also to signiﬁcantly more practical LTL methods.

GBML: GBML stems from the Model-Agnostic Meta-Learning (MAML) algorithm [23] and has been widely used in practice [1, 44, 31]. An expressivity result was shown for MAML by Finn and Levine [22], proving that the meta-learner can approximate any permutation-invariant learner given enough data and a speciﬁc neural architecture. Under strong-convexity and smoothness assumptions and using a ﬁxed learning rate, Finn et al. [24] show that the MAML meta-initialization is learnable, albeit via an impractical FTL method. In contrast to these efforts, Khodak et al. [34] and Denevi et al. [19] focus on providing ﬁnite-sample meta-test-time performance guarantees in the convex setting, the former for the SGD-based Reptile algorithm of Nichol et al. [44] and the latter for a regularized variant. Our work improves upon these analyses by considering the case when the learning rate, a proxy for the task-similarity, is not known beforehand as in Finn et al. [24] and Denevi et al. [19] but must be learned online; Khodak et al. [34] do consider an unknown task-similarity but use a doubling-trick-based approach that considers the absolute deviation of the task-parameters from the meta-initialization and is thus average-case suboptimal and sensitive to outliers. Furthermore, ARUBA can handle more sophisticated and dynamic notions of task-similarity and in certain settings can provide better statistical guarantees than those of Khodak et al. [34] and Denevi et al. [19].

2 Average Regret-Upper-Bound Analysis

Our main contribution is ARUBA, a framework for analyzing the learning of X-parameterized learning algorithms via reduction to the online learning of a sequence of functions Ut : X 7 R upper-bounding their regret on task t. We consider a meta-learner facing a sequence of online learning tasks t = 1, . . . , T, each with mt loss functions ℓt,i : Θ 7 R over action-space Θ Rd. The learner has access to a set of learning algorithms parameterized by x X that can be used to determine the action θt,i Θ on each round i [mt] of task t. Thus on each task t the meta-learner chooses xt X, runs the corresponding algorithm, and suffers regret Rt(xt) = Pmt i=1 ℓt,i(θt,i) minθ Pmt i=1 ℓt,i(θ). We propose to analyze the meta-learner s performance by studying the online learning of a sequence of regret-upper-bounds Ut(xt) Rt(xt), speciﬁcally by bounding the average regret-upper-bound UT = 1

T PT t=1 Ut(xt). The following two observations highlight why we care about this quantity:

1. Generality: Many algorithms of interest in meta-learning have regret guarantees Ut(x) with nice, e.g. smooth and convex, functional forms that depend strongly on both their parameterizations x X and the task-data. This data-dependence lets us adaptively set the parameterization xt X. 2. Consequences: By deﬁnition of Ut we have that UT bounds the task-averaged regret (TAR) RT = 1 T PT t=1 Rt(xt) [34]. Thus if the average regret-upper-bound is small then the metalearner will perform well on-average across tasks. In Section 5 we further show that a low average regret-upper-bound will also lead to strong statistical guarantees in the batch setting.

ARUBA s applicability depends only on ﬁnding a low-regret algorithm over the functions Ut; then by observation 2 we get a task-averaged regret bound where the ﬁrst term vanishes as T while by observation 1 the second term can be made small due to the data-dependent task-similarity:

RT UT o T (1) + min x 1 T

The Case of Online Gradient Descent: Suppose the meta-learner uses online gradient descent (OGD) as the within-task learning algorithm, as is done by Reptile [44]. OGD can be parameterized by an initialization φ Θ and a learning rate η > 0, so that X = {(φ, η) : φ Θ, η > 0}. Using the notation va:b = Pb i=a vi and t,j = ℓt,j(θt,j), at each round i of task t OGD plays θt,i = arg minθ Θ 1 2 θ φ 2 2 + η t,1:i 1, θ . The regret of this procedure when run on m convex G-Lipschitz losses has a well-known upper-bound [48, Theorem 2.11]

Ut(x) = Ut(φ, η) = 1

2η θ t φ 2 2 + ηG2m

i=1 ℓt,i(θt) ℓt,i(θ t ) = Rt(x) (1)

which is convex in the learning rate η and the initialization φ. Note the strong data dependence via θ t arg minθ Pmt i=1 ℓt,i(θ), the optimal action in hindsight. To apply ARUBA, ﬁrst note that if θ = 1

T θ 1:T is the mean of the optimal actions θ t on each task and V 2 = 1

T PT t=1 θ t θ 2 2 is their empirical variance, then minφ,η 1

T PT t=1 Ut(φ, η) = O(GV m). Thus by running a low-regret algorithm on the regret-upper-bounds Ut the meta-learner will suffer task-averaged regret at most o T (1) + O(GV m), which can be much better than the single-task regret O(GD m), where D is the ℓ2-radius of Θ, if V D, i.e. if the optimal actions θ t are close together. See Theorem 3.2 for the result yielded by ARUBA in this simple setting.

3 Adapting to Similar Tasks and Dynamic Environments

We now demonstrate the effectiveness of ARUBA for analyzing GBML by using it to prove a general bound for a class of algorithms that can adapt to both task-similarity, i.e. when the optimal actions θ t for each task are close to some good initialization, and to changing environments, i.e. when this initialization changes over time. The task-similarity will be measured using the Bregman divergence BR(θ||φ) = R(θ) R(φ) R(φ), θ φ of a 1-strongly-convex function R : Θ 7 R [10], a generalized notion of distance. Note that for R( ) = 1

2 2 2 we have BR(θ||φ) = 1

2 θ φ 2 2. A changing environment will be studied by analyzing dynamic regret, which for a sequence of actions {φt}t Θ taken by some online algorithm over a sequence of loss functions {ft : Θ 7 R}t is deﬁned w.r.t. a reference sequence Ψ = {ψt}t Θ as RT (Ψ) = PT t=1 ft(φt) ft(ψt). Dynamic regret measures the performance of an online algorithm taking actions φt relative to a potentially timevarying comparator taking actions ψt. Note that when we ﬁx ψt = ψ arg minψ Θ PT t=1 ft(ψ) we recover the standard static regret, in which the comparator always uses the same action.

Algorithm 1: Generic online algorithm for gradient-based parameter-transfer meta-learning. To run OGD within-task set R( ) = 1

2 2 2. To run FTRL within-task substitute ℓt,j(θ) for t,j, θ .

Set meta-initialization φ1 Θ and learning rate η1 > 0. for task t [T] do

for round i [mt] do

θt,i arg minθ Θ BR(θ||φt) + ηt t,1:i 1, θ // online mirror descent step Suffer loss ℓt,i(θt,i) Update φt+1, ηt+1 // meta-update of OMD initialization and learning rate

Putting these together, we seek to deﬁne variants of Algorithm 1 for which as T the average regret scales with VΨ, where V 2 Ψ = 1

T PT t=1 BR(θ t ||ψt), without knowing this quantity in advance. Note for ﬁxed ψt = θ = 1

T θ 1:T this measures the empirical standard deviation of the optimal taskactions θ t . Thus achieving our goal implies that average performance improves with task-similarity.

On each task t Algorithm 1 runs online mirror descent with regularizer 1

ηt BR( ||φt) for initialization φt Θ and learning rate ηt > 0. It is well-known that OMD and the related Follow-the-Regularized Leader (FTRL), for which our results also hold, generalize many important online methods, e.g. OGD and multiplicative weights [26]. For mt convex losses with mean squared Lipschitz constant G2 t they also share a convenient, data-dependent regret-upper-bound for any θ t Θ [48, Theorem 2.15]:

Rt Ut(φt, ηt) = 1

ηt BR(θ t ||φt) + ηt G2 tmt (2)

All that remains is to come up with update rules for the meta-initialization φt Θ and the learning rate ηt > 0 in Algorithm 1 so that the average over T of these upper-bounds Ut(φt, ηt) is small. While this can be viewed as a single online learning problem to determine actions xt = (φt, ηt) Θ (0, ), it is easier to decouple φ and η by ﬁrst deﬁning two function sequences {f init t }t and {f sim t }t:

f init t (φ) = BR(θ t ||φ)Gt mt f sim t (v) = BR(θ t ||φt) v + v Gt mt (3)

We show in Theorem 3.1 that to get an adaptive algorithm it sufﬁces to specify two OCO algorithms, INIT and SIM, such that the actions φt = INIT(t) achieve good (dynamic) regret over f init t and the actions vt = SIM(t) achieve low (static) regret over f sim t ; these actions then determine the update rules of φt and ηt = vt/(Gt mt). We will specialize Theorem 3.1 to derive algorithms that provably adapt to task similarity (Theorem 3.2) and to dynamic environments (Theorem 3.3).

To understand the formulation of f init t and f sim t , ﬁrst note that f sim t (v) = Ut(φt, v/(Gt mt)), so the online algorithm SIM over f sim t corresponds to an online algorithm over the regret-upper-bounds Ut when the sequence of initializations φt is chosen adversarially. Once we have shown that SIM is low-regret we can compare its losses f sim t (vt) to those of an arbitrary ﬁxed v > 0; this is the ﬁrst line in the proof of Theorem 3.1 (below). For ﬁxed v, each f init t (φt) is an afﬁne transformation of f sim t (v), so the algorithm INIT with low dynamic regret over f init t corresponds to an algorithm with low dynamic regret over the regret-upper-bounds Ut when ηt = v/(Gt mt) t. Thus once we have shown a dynamic regret guarantee for INIT we can compare its losses f init t (φt) to those of an arbitrary comparator sequence {ψt}t Θ; this is the second line in the proof of Theorem 3.1.

Theorem 3.1. Assume Θ Rd is convex, each task t [T] is a sequence of mt convex losses ℓt,i : Θ 7 R with mean squared Lipschitz constant G2 t, and R : Θ 7 R is 1-strongly-convex. Let INIT be an algorithm whose dynamic regret over functions {f init t }t w.r.t. any reference sequence Ψ = {ψt}T t=1 Θ is upper-bounded by Uinit T (Ψ). Let SIM be an algorithm whose static regret over functions {f sim t }t w.r.t. any v > 0 is upperbounded by a non-increasing function Usim T (v) of v.

If Algorithm 1 sets φt = INIT(t) and ηt = SIM(t) Gt mt then for V 2 Ψ =

PT t=1 BR(θ t ||ψt)Gt mt PT t=1 Gt mt it will achieve average regret

RT UT Usim T (VΨ)

Uinit T (Ψ)

v u u t Uinit T (Ψ)

Proof. For σt = Gt mt we have by the regret bound on OMD/FTRL (2) that

BR(θ t ||φt) vt + vt

σt min v>0 Usim T (v) +

BR(θ t ||φt) v + v σt

min v>0 Usim T (v) + Uinit T (Ψ)

BR(θ t ||ψt) v + v σt

Usim T (VΨ) + min

( Uinit T (Ψ)

Uinit T (Ψ)σ1:T

where the last line follows by substituting v = max VΨ, q

Uinit T (Ψ)/σ1:T

Similar Tasks in Static Environments: By Theorem 3.1, if we can specify algorithms INIT and SIM with sublinear regret over f init t and f sim t (3), respectively, then the average regret will converge to O(VΨ m) as desired. We ﬁrst show an approach in the case when the optimal actions θ t are close to a ﬁxed point in Θ, i.e. for ﬁxed ψt = θ = 1

T θ 1:T . Henceforth we assume the Lipschitz constant G and number of rounds m are the same across tasks; detailed statements are in the supplement.

Note that if R( ) = 1 2 2 2 then {f init t }t are quadratic functions, so playing φt+1 = 1 t θ 1:t has logarithmic regret [48, Corollary 2.2]. We use a novel strongly convex coupling argument to show that this holds for any such sequence of Bregman divergences, even for nonconvex BR(θ t || ). The second sequence {f sim t }t is harder because it is not smooth near 0 and not strongly convex if θ t = φt. We study a regularized sequence f sim t (v) = f sim t (v) + ε2/v for ε 0. Assuming a bound of D2 on the Bregman divergence and setting ε = 1/ 4

T, we achieve O(

T) regret on the original sequence by running exponentially-weighted online-optimization (EWOO) [28] on the regularized sequence:

0 v exp( γ P

s<t f sim s (v))dv R

s<t f sim s (v))dv for γ = 2 DG m min ε2

Note that while EWOO is inefﬁcient in high dimensions, we require only single-dimensional integrals. In the supplement we also show that simply setting v2 t+1 = ε2t + P

s t BR(θ s||φt) has only a slightly worse regret of O(T 3/5). These guarantees sufﬁce to show the following: Theorem 3.2. Under the assumptions of Theorem 3.1 and boundedness of BR over Θ, if INIT plays φt+1 = 1

t θ 1:t and SIM uses ε-EWOO (4) with ε = 1/ 4

T then Algorithm 1 achieves average regret

RT UT = O min 1 + 1

+ V m for V 2 = min φ Θ 1 T

t=1 BR(θ t ||φ)

Observe that if V , the average deviation of θ t , is ΩT (1) then the bound becomes O(V m) at rate O(1/

T), while if V = o T (1) the bound tends to zero. Theorem 3.1 can be compared to the main result of Khodak et al. [34], who set the learning rate via a doubling trick. We improve upon their result in two aspects. First, their asymptotic regret is O(D m), where D is the maximum distance between any two optimal actions. Note that V is always at most D , and indeed may be much smaller in the presence of outliers. Second, our result is more general, as we do not need convex BR(θ t || ). Remark 3.1. We assume an oracle giving a unique θ arg minθ Θ P

ℓ S ℓ(θ) for any ﬁnite loss sequence S, which may be inefﬁcient or undesirable. One can instead use the last or average iterate of within-task OMD/FTRL for the meta-update; in the supplement we show that this incurs an additional o( m) regret term under a quadratic growth assumption that holds in many practical settings [34].

Related Tasks in Changing Environments: In many settings we have a changing environment and so it is natural to study dynamic regret. This has been widely analyzed by the online learning community [15, 30], often by showing a dynamic regret bound consisting of a sublinear term plus a bound on the variation in the action or function space. Using Theorem 3.1 we can show dynamic guarantees for GBML via reduction to such bounds. We provide an example in the Euclidean geometry using the popular path-length-bound PΨ = PT t=2 ψt ψt 1 2 for reference actions Ψ = {ψt}T t=1 [55]. We use a result showing that OGD with learning rate η 1/β over α-stronglyconvex, β-strongly-smooth, and L-Lipschitz functions has a bound of O(L(1 + PΨ)) on its dynamic regret [42, Corollary 1]. Observe that in the case of R( ) = 1

2 2 2 the sequence f init t in Theorem 3.1 consists of DG m-Lipschitz quadratic functions. Thus using Theorem 3.1 we achieve the following:

Figure 1: Left - Theorem 3.2 improves upon [34, Theorem 2.1] via its dependence on the average deviation V rather than the maximal deviation D of the optimal task-parameters θ t (light blue). Right - a case where Theorem 3.3 yields a strong task-similarity-based guarantee via a dynamic comparator Ψ despite the deviation V being large.

Figure 2: Learning rate variation across layers of a convolutional net trained on Mini-Image Net using Algorithm 2. Following intuition outlined in Section 6, shared feature extractors are not updated much if at all compared to higher layers.

Theorem 3.3. Under Theorem 3.1 assumptions, bounded Θ, and R( ) = 1

2 2 2, if INIT is OGD with learning rate 1 G m and SIM uses ε-EWOO (4) with ε = 1/ 4

T then by using OGD within-task Algorithm 1 will achieve for any ﬁxed comparator sequence Ψ = {ψt}t [T ] Θ the average regret

for V 2 Ψ = 1 2T PT t=1 θ t ψt 2 2 and PΨ = PT t=2 ψt ψt 1 2.

This bound controls the average regret across tasks using the deviation VΦ of the optimal task parameters θ t from some reference sequence Φ, which is assumed to vary slowly or sparsely so that the path length PΦ is small. Figure 1 illustrates when such a guarantee improves over Theorem 3.2. Note also that Theorem 3.3 speciﬁes OGD as the meta-update algorithm INIT, so under the approximation that each task t s last iterate is close to θ t this suggests that simple GBML methods such as Reptile [44] or Fed Avg [41] are adaptive. The generality of ARUBA also allows for the incorporation of other dynamic regret bounds [25, 53] and other non-static notions of regret [27].

4 Adapting to the Inter-Task Geometry

Previously we gave improved guarantees for learning OMD under a simple notion of task-similarity: closeness of the optimal actions θ t . We now turn to new algorithms that can adapt to a more sophisticated task-similarity structure. Speciﬁcally, we study a class of learning algorithms parameterized by an initialization φ Θ and a symmetric positive-deﬁnite matrix H M Rd d which plays

θt,i = arg min θ Θ

1 2 θ φ 2 H 1 + t,1:i 1, θ (5)

This corresponds θt,i+1 = θt,i H t,i, so if the optimal actions θ t vary strongly in certain directions, a matrix emphasizing those directions improves within-task performance. By strong-convexity of 1 2 θ φ 2 H 1 w.r.t. H 1, the regret-upper-bound is Ut(φ, H) = 1

2 θ t φ 2 H 1 +Pm i=1 t,i 2 H [48, Theorem 2.15]. We ﬁrst study the diagonal case, i.e. learning a per-coordinate learning rate η Rd to get iteration θt,i+1 = θt,i ηt t,i. We propose to set ηt at each task t as follows:

s<t ε2s + 1

2(θ s φs)2 P

s<t ζ2s + Pms i=1 2 s,i for ε2 t = ε2

(t + 1)p , ζ2 t = ζ2

(t + 1)p t 0, where ε, ζ, p > 0 (6)

Observe the similarity between this update Ada Grad [21], which is also inversely related to the sum of the element-wise squares of all gradients seen so far. Our method adds multi-task information by setting the numerator to depend on the sum of squared distances between the initializations φt set by the algorithm and that task s optimal action θ t . This algorithm has the following guarantee: Theorem 4.1. Let Θ be a bounded convex subset of Rd, let D Rd d be the set of positive deﬁnite diagonal matrices, and let each task t [T] consist of a sequence of m convex Lipschitz loss functions ℓt,i : Θ 7 R. Suppose for each task t we run the iteration in Equation 5 setting φ = 1 t 1θ 1:t 1 and setting H = Diag(ηt) via Equation 6 for ε = 1, ζ = m, and p = 2

5. Then we achieve

RT UT = min φ Θ H D

( 1 Hjj + Hjj

θ t φ 2 H 1 2 +

i=1 t,i 2 H

As T the average regret converges to the minimum over φ, H of the last two terms, which corresponds to running OMD with the optimal initialization and per-coordinate learning rate on every task. The rate of convergence of T 2/5 is slightly slower than the usual 1/

T achieved in the previous section; this is due to the algorithm s adaptivity to within-task gradients, whereas previously we simply assumed a known Lipschitz bound Gt when setting ηt. This adaptivity makes the algorithm much more practical, leading to a method for adaptively learning a within-task learning rate using multi-task information; this is outlined in Algorithm 2 and shown to signiﬁcantly improve GBML performance in Section 6. Note also the per-coordinate separation of the left term, which shows that the algorithm converges more quickly on non-degenerate coordinates. The per-coordinate speciﬁcation of ηt (6) can be further generalized to learning a full-matrix adaptive regularizer, for which we show guarantees in Theorem 4.2. However, the rate is much slower, and without further assumptions such methods will have Ω(d2) computation and memory requirements.

Theorem 4.2. Let Θ be a bounded convex subset of Rd and let each task t [T] consist of a sequence of m convex Lipschitz loss functions ℓt,i : Θ 7 R. Suppose for each task t we run the iteration in Equation 5 with φ = 1 t 1θ 1:t 1 and H the unique positive deﬁnite solution of B2 t = HG2 t H for

B2 t = tε2Id + 1

s<t (θ s φs)(θ s φs)T and G2 t = tζ2Id + X

i=1 s,i T s,i

for ε = 1/ 8

T and ζ = m/ 8

T. Then for λj corresponding to the jth largest eigenvalue we have

RT UT = O 1

m + min φ Θ H 0

2λ2 1(H) λd(H) 1 + log T

θ t φ 2 H 1 2 +

i=1 t,i 2 H

5 Fast Rates and High Probability Bounds for Statistical Learning-to-Learn

Batch-setting transfer risk bounds have been an important motivation for studying LTL via online learning [2, 34, 19]. If the regret-upper-bounds are convex, which is true for most practical variants of OMD/FTRL, ARUBA yields several new results in the classical distribution over task-distributions setup of Baxter [8]. In Theorem 5.1 we present bounds on the risk ℓP( θ) of the parameter θ obtained by running OMD/FTRL on i.i.d. samples from a new task distribution P and averaging the iterates. Theorem 5.1. Assume Θ, X are convex Euclidean subsets. Let convex losses ℓt,i : Θ 7 [0, 1] be drawn i.i.d. Pt Q, {ℓt,i}i Pm t for distribution Q over tasks. Suppose they are passed to an algorithm with average regret upper-bound UT that at each t picks xt X to initialize a within-task method with convex regret upper-bound Ut : X 7 [0, B m], for B 0. If the within-task algorithm is initialized by x = 1

T x1:T and it takes actions θ1, . . . , θm on m i.i.d. losses from new task P Q then θ = 1

mθ1:m satisﬁes the following transfer risk bounds for any θ Θ (all w.p. 1 δ):

1. general case: EP Q EPm ℓP( θ) EP Q ℓP(θ ) + LT for LT = U m + B q

8 m T log 1

2. ρ-self-bounded losses ℓ: if ρ > 0 s.t. ρ Eℓ P ℓ(θ) Eℓ P( ℓ(θ) Eℓ P ℓ(θ))2 for all distributions P Q, where ℓ(θ) = ℓ(θ) ℓ(θ ) for any θ arg minθ Θ ℓP(θ), then for

LT as above we have EP Q ℓP( θ) EP Q ℓP(θ ) + LT + q

3. α-strongly-convex, G-Lipschitz regret-upper-bounds Ut: in parts 1 and 2 above we can

substitute LT = U + minx EP Q U(x)

U αm log 8 log T

δ + max{16G2,6αB m}

αm T log 8 log T

In the general case, Theorem 5.1 provides bounds on the excess transfer risk decreasing with U /m and 1/

m T. Thus if U improves with task-similarity so will the transfer risk as T . Note that the second term is 1/

m T rather than 1/

T as in most-analyses [34, 19]; this is because regret is m-bounded but the OMD regret-upper-bound is O( m)-bounded. The results also demonstrate ARUBA s ability to utilize specialized results from the online-to-batch conversion literature. This is witnessed by the guarantee for self-bounded losses, a class which Zhang [54] shows includes linear regression; we use a result by the same author to obtain high-probability bounds, whereas previous GBML bounds are in-expectation [34, 19]. We also apply a result due to Kakade and Tewari [33] for the case of strongly-convex regret-upper-bounds, enabling fast rates in the number of tasks T. The strongly-convex case is especially relevant for GBML since it holds for OGD with ﬁxed learning rate.

Algorithm 2: ARUBA: an approach for modifying a generic batch GBML method to learn a per-coordinate learning rate. Two specialized variants provided below. Input: T tasks, update method for meta-initialization, within-task descent method, settings ε, ζ, p > 0 Initialize b1 ε21d, g1 ζ21d for task t = 1, 2, . . . , T do

Set φt according to update method, ηt p

bt/gt Run descent method from φt with learning rate ηt: observe gradients t,1, . . . , t,mt obtain within-task parameter ˆθt bt+1 bt + ε21d (t+1)p + 1

gt+1 gt + ζ21d (t+1)p + Pmt i=1 2 t,i

Result: initialization φT , learning rate ηT = p

ARUBA++: starting with ηT,1 = ηT and g T,1 = g T , adaptively reset the learning rate by setting ˆg T,i+1 ˆg T,i+c 2 i for some c > 0 and then updating ηT,i+1 p

b T /g T,i+1. Isotropic: bt and gt are scalars tracking the sum of squared distances and sum of squared gradient norms, respectively.

Figure 3: Next-character prediction performance for recurrent networks trained on the Shakespeare dataset [12] using Fed Avg [41] and its modiﬁcations by Algorithm 2. Note that the two ARUBA methods require no learning rate tuning when personalizing the model (reﬁne), unlike both Fed Avg methods; this is a critical improvement in federated settings. Furthermore, isotropic ARUBA has negligible overhead by only communicating scalars.

We present two consequences of these results for the algorithms from Section 3 when run on i.i.d. data. To measure task-similarity we use the variance V 2 Q = minφ Θ EP Q EPm θ φ 2 2 of the empirical risk minimizer θ of an m-sample task drawn from Q. If VQ is known we can use strong-convexity of the regret-upper-bounds to obtain a fast rate for learning the initialization, as shown in the ﬁrst part of Corollary 5.1. The result can be loosely compared to Denevi et al. [19], who provide a similar asymptotic improvement but with a slower rate of O(1/

T) in the second term. However, their task-similarity measures the deviation of the true, not empirical, risk-minimizers, so the results are not directly comparable. Corollary 5.1 also gives a guarantee for when we do not know VQ and must learn the learning rate η in addition to the initialization; here we match the rate of Denevi et al. [19], who do not learn η, up to some additional fast o(1/ m) terms.

Corollary 5.1. In the setting of Theorems 3.2 & 5.1, if δ 1/e and Algorithm 1 uses within-task

OGD with initialization φt+1 = 1

t θ 1:t and step-size ηt = VQ+1/

T G m for VQ as above, then w.p. 1 δ

E P Q E Pm ℓP( θ) E P Q ℓP(θ ) + O VQ m + 1

If ηt is set adaptively using ε-EWOO as in Theorem 3.2 for ε = 1/ 4

m T + 1/ m then w.p. 1 δ

E P Q E Pm ℓP( θ) E P Q ℓP(θ ) + O

6 Empirical Results: Adaptive Methods for Few-Shot & Federated Learning

A generic GBML method does the following at iteration t: (1) initialize a descent method at φt; (2) take gradient steps with learning rate η to get task-parameter ˆθt; (3) update meta-initialization to φt+1. Motivated by Section 4, in Algorithm 2 we outline a generic way of replacing η by a per-coordinate rate learned on-the-ﬂy. This entails keeping track of two quantities: (1) bt Rd, a per-coordinate sum over s < t of the squared distances from the initialization φs to within-task parameter ˆθs; (2) gt Rd, a per-coordinate sum of the squared gradients seen so far. At task t we set η to be the element-wise square root of bt/gt, allowing multi-task information to inform the trajectory. For example, if along coordinate j the ˆθt,j is usually not far from initialization then bj will be small and thus so will ηj; then if on a new task we get a high noisy gradient along coordinate j the performance will be less adversely affected because it will be down-weighted by the learning rate. Single-task algorithms such as Ada Grad [21] and Adam [36] also work by reducing the learning rate along frequent directions.

20-way Omniglot 5-way Mini-Image Net 1-shot 5-shot 1-shot 5-shot 1st-Order MAML [23] 89.4 0.5 97.9 0.1 48.07 1.75 63.15 0.91 1st Reptile [44] w. Adam [36] 89.43 0.14 97.12 0.32 49.97 0.32 65.99 0.58 Order Reptile w. ARUBA 86.67 0.17 96.61 0.13 50.73 0.32 65.69 0.61 Reptile w. ARUBA++ 89.66 0.3 97.49 0.28 50.35 0.74 65.89 0.34 2nd 2nd-Order MAML 95.8 0.3 98.9 0.2 48.7 1.84 63.11 0.92 Order Meta-SGD [38] 95.93 0.38 98.97 0.19 50.47 1.87 64.03 0.94 Table 1: Meta-test-time performance of GBML algorithms on few-shot classiﬁcation benchmarks. 1st-order and 2nd-order results obtained from Nichol et al. [44] and Li et al. [38], respectively.

However, in meta-learning some coordinates may be frequently updated during meta-training because good task-weights vary strongly from the best initialization along them, and thus their gradients should not be downweighted; ARUBA encodes this intuition in the numerator using the distance-traveled per-task along each direction, which increases the learning rate along high-variance directions. We show in Figure 2 that this is realized in practice, as ARUBA assigns a faster rate to deeper layers than to lower-level feature extractors, following standard intuition in parameter-transfer meta-learning. As described in Algorithm 2, we also consider two variants: ARUBA++, which updates the meta-learned learning-rate at meta-test-time in a manner similar to Ada Grad, and Isotropic ARUBA, which only tracks scalar quantities and is thus useful for communication-constrained settings. Few-Shot Classiﬁcation: We ﬁrst examine if Algorithm 2 can improve performance on Omniglot [37] and Mini-Image Net [46], two standard few-shot learning benchmarks, when used to modify Reptile, a simple meta-learning method [44]. In its serial form Reptile is roughly the algorithm we study in Section 3 when OGD is used within-task and η is ﬁxed. Thus we can set Reptile+ARUBA to be Algorithm 2 with ˆθt the last iterate of OGD and the meta-update a weighted sum of ˆθt and φt. In practice, however, Reptile uses Adam [36] to exploit multi-task gradient information. As shown in Table 1, ARUBA matches or exceeds this baseline on Mini-Image Net, although on Omniglot it requires the additional within-task updating of ARUBA++ to show improvement. It is less clear how ARUBA can be applied to MAML [23], as by only taking one step the distance traveled will be proportional to the gradient, so η will stay ﬁxed. We also do not ﬁnd that ARUBA improves multi-step MAML perhaps not surprising as it is further removed from our theory due to its use of held-out data. In Table 1 we compare to Meta-SGD [38], which does learn a per-coordinate learning rate for MAML by automatic differentiation. This requires more computation but does lead to consistent improvement. As with the original Reptile, our modiﬁcation performs better on Mini-Image Net but worse on Omniglot compared to MAML and its modiﬁcation Meta-SGD. Federated Learning: A main goal in this setting is to use data on heterogeneous nodes to learn a global model without much communication; leveraging this to get a personalized model is an auxiliary goal [50], with a common application being next-character prediction on mobile devices. A popular method is Fed Avg [41], where at each communication round r the server sends a global model φr to a batch of nodes, which then run local OGD; the server then sets φr+1 to the average of the returned models. This can be seen as a GBML method with each node a task, making it easy to apply ARUBA: each node simply sends its accumulated squared gradients to the server together with its model. The server can use this information and the squared difference between φr and φr+1 to compute a learning rate ηr+1 via Algorithm 2 and send it to each node in the next round. We use Fed Avg with ARUBA to train a character LSTM [29] on the Shakespeare dataset, a standard benchmark of a thousand users with varying amounts of non-i.i.d. data [41, 12]. Figure 3 shows that ARUBA signiﬁcantly improves over non-tuned Fed Avg and matches the performance of Fed Avg with a tuned learning rate schedule. Unlike both baselines we also do not require step-size tuning when reﬁning the global model for personalization. This reduced need for hyperparameter optimization is crucial in federated settings, where the number of user-data accesses are extremely limited.

7 Conclusion

In this paper we introduced ARUBA, a framework for analyzing GBML that is both ﬂexible and consequential, yielding new guarantees for adaptive, dynamic, and statistical LTL via online learning. As a result we devised a novel per-coordinate learning rate applicable to generic GBML procedures, improving their training and meta-test-time performance on few-shot and federated learning. We see great potential for applying ARUBA to derive many other new LTL methods in a similar manner.

Acknowledgments

We thank Jeremy Cohen, Travis Dick, Nikunj Saunshi, Dravyansh Sharma, Ellen Vitercik, and our three anonymous reviewers for helpful feedback. This work was supported in part by DARPA FA875017C0141, National Science Foundation grants CCF-1535967, CCF-1910321, IIS-1618714, IIS-1705121, IIS-1838017, and IIS-1901403, a Microsoft Research Faculty Fellowship, a Bloomberg Data Science research grant, an Amazon Research Award, an Amazon Web Services Award, an Okawa Grant, a Google Faculty Award, a JP Morgan AI Research Faculty Award, and a Carnegie Bosch Institute Research Award. Any opinions, ﬁndings and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reﬂect the views of DARPA, the National Science Foundation, or any other funding agency.

[1] Maruan Al-Shedivat, Trapit Bansal, Yura Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the 6th International Conference on Learning Representations, 2018.

[2] Pierre Alquier, The Tien Mai, and Massimiliano Pontil. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statistics, 2017.

[3] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[4] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal, 19:357 367, 1967.

[5] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efﬁcient representations for lifelong learning and autoencoding. In Proceedings of the Conference on Learning Theory, 2015.

[6] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705 1749, 2005.

[7] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, 2008.

[8] Jonathan Baxter. A model of inductive bias learning. Journal of Artiﬁcial Intelligence Research, 12:149 198, 2000.

[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[10] Lev M. Bregman. The relaxation method of ﬁnding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200 217, 1967.

[11] Brian Bullins, Elad Hazan, Adam Kalai, and Roi Livni. Generalize across tasks: Efﬁcient algorithms for linear representation learning. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019.

[12] Sebastian Caldas, Peter Wu, Tian Li, Jakub Koneˇcný, H. Brendan Mc Mahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. ar Xiv, 2018.

[13] Nicoló Cesa-Bianchi and Claudio Gentile. Improved risk tail bounds for on-line algorithms. In Advances in Neural Information Processing Systems, 2005.

[14] Nicoló Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050 2057, 2004.

[15] Nicoló Cesa-Bianchi, Pierre Gaillard, Gabor Lugosi, and Gilles Stoltz. A new look at shifting regret. HAL, 2012.

[16] Fei Chen, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning for recommendation. ar Xiv, 2018.

[17] Chandler Davis. Notions generalizing convexity for functions deﬁned on spaces of matrices. In Proceedings of Symposia in Pure Mathematics, 1963.

[18] Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Incremental learningto-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, 2018.

[19] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. ar Xiv, 2019.

[20] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, 2017.

[21] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121 2159, 2011.

[22] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In Proceedings of the 6th International Conference on Learning Representations, 2018.

[23] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[24] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergei Levine. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019. To Appear.

[25] Eric C. Hall and Rebecca M. Willet. Online optimization in dynamic environments. ar Xiv, 2016.

[26] Elad Hazan. Introduction to online convex optimization. In Foundations and Trends in Optimization, volume 2, pages 157 325. now Publishers Inc., 2015.

[27] Elad Hazan and C. Seshadri. Efﬁcient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[28] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169 192, 2007.

[29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9: 1735 1780, 1997.

[30] Ali Jadbabaie, Alexander Rakhlin, and Shahin Shahrampour. Online optimization : Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artiﬁcial Intelligence and Statistics, 2015.

[31] Ghassen Jerfel, Erin Grant, Thomas L. Grifﬁths, and Katherine Heller. Online gradient-based mixtures for transfer modulation in meta-learning. ar Xiv, 2018.

[32] Sham Kakade and Shai Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.

[33] Sham Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, 2008.

[34] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradientbased meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019. To Appear.

[35] Jaehong Kim, Sangyeul Lee, Sungwan Kim, Moonsu Cha, Jung Kwon Lee, Youngduck Choi, Yongseok Choi, Dong-Yeon Choi, and Jiwon Kim. Auto-Meta: Automated gradient based meta learner search. ar Xiv, 2018.

[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

[37] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Conference of the Cognitive Science Society (Cog Sci), 2017.

[38] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learning quickly for few-shot learning. ar Xiv, 2017.

[39] Elliott H. Lieb. Convex trace functions and the Wigner-Yanase-Dyson conjecture. Advances in Mathematics, 11:267 288, 1973.

[40] H. Brendan Mc Mahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Conference on Learning Theory, 2010.

[41] H. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efﬁcient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artiﬁcal Intelligence and Statistics, 2017.

[42] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of the 55th IEEE Conference on Decision and Control, 2016.

[43] Ken-ichiro Moridomi, Kohei Hatano, and Eiji Takimoto. Online linear optimization with the log-determinant regularizer. IEICE Transactions on Information and Systems, E101-D(6): 1511 1520, 2018.

[44] Alex Nichol, Joshua Achiam, and John Schulman. On ﬁrst-order meta-learning algorithms. ar Xiv, 2018.

[45] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[46] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.

[47] Ankan Saha, Prateek Jain, and Ambuj Tewari. The interplay between stability and regret in online learning. ar Xiv, 2012.

[48] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 -194, 2011.

[49] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11, 2010.

[50] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017.

[51] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.

[52] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 1998.

[53] Lijun Zhang, Tianbao Yang, Jinfeng Yi, and Rong Jin Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems, 2017.

[54] Tong Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the International Conference on Learning Theory, 2005.

[55] Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.