# On the Stability and Generalization of Meta-Learning

Yunjuan Wang, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, ywang509@jhu.edu
Raman Arora, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, arora@cs.jhu.edu

Abstract

We focus on developing a theoretical understanding of meta-learning. Given multiple tasks drawn i.i.d. from some (unknown) task distribution, the goal is to find a good pre-trained model that can be adapted to a new, previously unseen, task with little computational and statistical overhead. We introduce a novel notion of stability for meta-learning algorithms, namely uniform meta-stability. We instantiate two uniformly meta-stable learning algorithms based on regularized empirical risk minimization and gradient descent and give explicit generalization bounds for convex learning problems with smooth losses and for weakly convex learning problems with non-smooth losses. Finally, we extend our results to stochastic and adversarially robust variants of our meta-learning algorithm.

1 Introduction

Traditional machine learning algorithms excel at generalizing, but they often require extensive training data and assume that both training and test data come from the same distribution or task. In real-world scenarios, large sets of training data from a single task are often lacking. Instead, training data may stem from diverse tasks with shared similarities, while test data come from entirely new tasks. The challenge is to rapidly adapt to these unseen tasks without the need to train from scratch. To address this challenge, meta-learning, also referred to as learning-to-learn, has emerged as an effective approach. Meta-learning has gained significant attention recently [Hospedales et al., 2021], with applications spanning various domains including computer vision [Nichol et al., 2018] and robotics [Al-Shedivat et al., 2017], and ranging from few-shot classification [Snell et al., 2017] and hyperparameter optimization [Franceschi et al., 2018] to personalized recommendation systems [Wang et al., 2022].

As the name suggests, meta-learning operates on two levels of abstraction to enhance learning over time. On an intra-task level, the learner needs to find models that perform well on individual tasks. On a meta-level, the learner needs to figure out useful meta-information, perhaps a prior over tasks, that relates different tasks and allows transferring and adapting knowledge to new unseen tasks efficiently (both in terms of statistical as well as computational overhead). It is typical to represent such meta-information in the form of a pre-trained model, which we can represent using certain meta-parameters. Distinct from a standard setting, meta-learning involves training on a diverse set of tasks. At test time, we evaluate the performance of the pre-trained model on new unseen tasks while allowing it to adapt using a small sample on the test task.

An increasing body of empirical research is dedicated to advancing meta-learning algorithms, among which model-agnostic meta-learning (MAML) [Finn et al., 2017] stands out as a prominent approach. MAML is designed to find a good meta-parameter $w$ which facilitates the learning of task-specific parameters through a single step of gradient descent.
In particular, given a set of $m$ tasks denoted as $\{D_j\}_{j=1}^m$, MAML estimates the meta-parameter as $w = \operatorname{argmin}_w \frac{1}{m}\sum_{j=1}^m L(u_j, D_j)$, where the task-specific parameters are computed as $u_j = w - \eta\nabla L(w, D_j)$.

However, a notable limitation of MAML is that it requires computing second-order derivatives, which is computationally demanding for deep neural networks in practical applications. This computational complexity also poses a challenge for a theoretical understanding of MAML, an aspect that remains largely under-explored. To mitigate this challenge, several MAML variants have been proposed, including first-order MAML [Finn et al., 2017], Reptile [Nichol et al., 2018], and iMAML [Rajeswaran et al., 2019]. Owing to its success, MAML has been used for robust adversarial meta-learning [Yin et al., 2018, Goldblum et al., 2020, Wang et al., 2021, Collins et al., 2020], differentially private meta-learning [Li et al., 2019], and personalized federated learning [Chen et al., 2018, Fallah et al., 2020].

Another popular framework for meta-learning is based on a proximal update, wherein the task-specific parameters are iteratively learned by minimizing the empirical loss plus an $\ell_2$ regularizer [Denevi et al., 2018, Zhou et al., 2019, Denevi et al., 2019a, 2020, Jiang et al., 2021]. Given a task $D$ and a meta-parameter $w$, the task-specific parameter $u$ is defined as $u = \operatorname{argmin}_u L(u, D) + \frac{\lambda}{2}\|u - w\|^2$. This regularization strategy ensures that the task-specific parameter remains close to the meta-parameter. A similar strategy has been explored in other contexts. For example, Kuzborskij and Orabona [2017] study the problem of hypothesis transfer learning and show a fast rate on the generalization error of a task-specific parameter $u$ returned by regularized empirical risk minimization conditioned on a good meta-parameter $w$. Yet, it remains unclear how to provably find such a good meta-parameter. Relatedly, Denevi et al. [2019b] study stochastic gradient descent with biased regularization for linear models and incrementally update the bias (meta-parameter). Concurrently, Zhou et al. [2019] proposed the Meta-Prox algorithm as a generic stochastic meta-learning approach. Specifically, given a set of meta-training tasks $D_1, \ldots, D_m$, the meta-parameter $w$ is estimated by solving $\min_w \sum_{j=1}^m \min_u L(u, D_j) + \frac{\lambda}{2}\|u - w\|^2$. Zhou et al. [2019] argue that Meta-Prox is a generalization of MAML, since the gradient descent update in MAML can be viewed as taking the first-order Taylor expansion of the objective [Zhou et al., 2019, Section 3.1].

In this work, we adopt the framework of Zhou et al. [2019] to study meta-learning from a theoretical perspective. Given $m$ tasks drawn i.i.d. from some (unknown) task distribution $\mu$, our goal is to find a good pre-trained model (the meta-parameter) which can be adapted to a new unseen task, drawn i.i.d. from $\mu$, at test time, using gradient descent. Our key contributions are as follows.

1. We introduce a novel notion of stability for meta-learning algorithms, namely uniform meta-stability. For a $\beta$-uniformly meta-stable algorithm, we bound the generalization gap by $O\big(\beta\log(mn/\delta) + \sqrt{\log(1/\delta)/(mn)}\big)$.

2. We consider two variants of task-specific learning based on regularized empirical risk minimization (RERM) and gradient descent (GD) within our meta-learning framework. We apply our stability-based analysis to these variants for learning problems with convex, smooth losses and weakly convex, non-smooth losses.
Our results are summarized in Table 1.

| Algorithm | Loss | Conditions | Uniform meta-stability $\beta$ |
| --- | --- | --- | --- |
| Algo. 1 with RERM | convex, $G$-Lipschitz | $\gamma \le \frac{1}{\lambda}$ | $\frac{G^2}{\lambda n} + \frac{G^2}{\lambda m}$ |
| Algo. 1 with RERM | convex, $H$-smooth, $M$-bounded | $\gamma \le \frac{1}{\lambda}$, $\lambda \ge H$ | $\frac{HM}{\lambda(2n-1)} + \frac{HM}{\lambda(m+1)}$ |
| Algo. 1 with GD | convex, $G$-Lipschitz, $H$-smooth | $\eta \le \frac{2}{H+2\lambda}$, $\gamma \le \frac{1}{\lambda T}$ | $\frac{G^2}{\lambda n} + \frac{G^2}{\lambda m}$ |
| Algo. 1 with GD | $\rho$-weakly convex, $G$-Lipschitz | $\eta \le \frac{1}{\lambda}$, $\gamma \le \frac{1}{\lambda T}$, $\lambda \ge 2\rho$ | $G^2\big(\sqrt{\eta} + \frac{1}{\lambda n}\big)$ |
| Algo. 3 with GD | $\rho$-weakly convex, $G$-Lipschitz | $\eta \le \frac{1}{\lambda}$, $\gamma \le \frac{1}{\lambda T}$, $\lambda \ge 2\rho$ | $G^2\sqrt{\eta}$ |

Table 1: Bounds (up to constants) on uniform meta-stability $\beta$ for different families of learning problems. Here, $\eta$ is the step-size of GD for task-specific learning, $\gamma$ is the step-size of GD for meta-parameter learning, $m$ is the number of tasks during training, and $n$ is the number of training data for the task at test time.

3. We extend our results to stochastic and adversarially robust variants of our meta-learning algorithm.

1.1 Related Work

Algorithmic Stability Analysis. In many machine learning problems, standard learning-theoretic tools, such as uniform convergence, do not apply since the associated complexity measures are unbounded or undefined (e.g., nearest neighbor classification), or yield guarantees that are not meaningful. Stability-based analysis is an alternative approach for obtaining generalization bounds in such settings, introduced by Bousquet and Elisseeff [2002] and further developed in a long line of influential works [Elisseeff et al., 2005, Mukherjee et al., 2006, Shalev-Shwartz et al., 2010, Liu et al., 2017]. More recently, there have been significant breakthroughs in this field with the work of Feldman and Vondrak [2018, 2019], Bousquet et al. [2020], and Klochkov and Zhivotovskiy [2021], improving the high-probability bounds for uniformly stable learning algorithms beyond those established by Bousquet and Elisseeff [2002]. These results are complemented by Hardt et al. [2016], who provide generalization bounds via algorithmic stability analysis of stochastic gradient descent for stochastic convex optimization with smooth loss functions. Subsequent work by Bassily et al. [2020] improves upon these results by removing the smoothness assumption, while Zhou et al. [2022] and Lei [2023] advance the state of the art by relaxing the convexity assumption.

Theoretical Guarantees for Meta-Learning. There has been significant progress in understanding the theoretical aspects of meta-learning, both in terms of convergence guarantees [Fallah et al., 2019, Ji et al., 2020, Mishchenko et al., 2023] and generalization guarantees. The first generalization analysis can be traced back to Baxter [2000], who assumed that all tasks are sampled i.i.d. from the same task distribution. Subsequent works have enriched the guarantees through various learning-theoretic constructs, including VC theory [Ben-David and Schuller, 2003, Maurer, 2009, Maurer et al., 2016], information-theoretic tools [Chen et al., 2021, Jose and Simeone, 2021, Jose et al., 2021, Rezazadeh et al., 2021, Hellström and Durisi, 2022], the PAC-Bayes framework [Pentina and Lampert, 2014, Amit and Meir, 2018, Rothfuss et al., 2021, Farid and Majumdar, 2021, Liu et al., 2021, Ding et al., 2021, Rezazadeh, 2022, Riou et al., 2023, Zakerinia et al., 2024], etc. Other works that do not rely on the task distribution assumption instead choose to get a handle on the bound by defining certain metrics to measure either the task similarity [Du et al., 2020, Tripuraneni et al., 2020, Guan and Lu, 2021] or the divergence between the new tasks and the training sample for the training tasks [Fallah et al., 2021].
Finally, several works focus on the online meta-learning setting, also referred to as lifelong learning [Pentina and Lampert, 2014, Balcan et al., 2019, Denevi et al., 2019a,b, Meunier and Alquier, 2021].

A prominent line of work, starting with that of Maurer [2005], focuses on giving theoretical guarantees for meta-learning via algorithmic stability analysis. More recently, Chen et al. [2020] establish connections between single-task learning and support/query (episodic) meta-learning algorithms, providing a generalization gap of $O(1/\sqrt{m})$ (where $m$ is the number of tasks) for smooth functions that is independent of the sample size $n$; this was shown to be nearly optimal in Guan et al. [2022]. Subsequently, Fallah et al. [2021] show a bound of $O(1/(mn))$ for strongly convex functions by leveraging a new notion of stability. Al-Shedivat et al. [2021] extend the result of Maurer [2005] to practical meta-learning algorithms for Lipschitz and smooth losses. Farid and Majumdar [2021] derive a PAC-Bayes bound to address the qualitatively different challenges of generalization within the task compared to that at the meta-level. Other relevant work includes analyzing the stability of bilevel optimization [Bao et al., 2021] and federated learning [Sun et al., 2024] for smooth functions.

2 Problem Setup and Preliminaries

Notation. Throughout the paper, we denote scalars and vectors with lowercase italics and lowercase bold Roman letters, respectively; e.g., $u$, $\mathbf{u}$. We work in a Euclidean space and use $\|\cdot\|$ and $\|\cdot\|_2$ to denote the $\ell_2$ norm. We use $[n]$ to represent the set $\{1, 2, \ldots, n\}$, and define $U[n]$ to be the uniform distribution over $[n]$. Let $\Pi_{\mathcal{W}}$ be the Euclidean projection onto $\mathcal{W}$. We adopt the standard $O$-notation and use $\lesssim$ and $O$ interchangeably. We use $\tilde{O}$ to hide poly-logarithmic dependence on the parameters.

Let $\mathcal{X}$, $\mathcal{Y}$ denote the input and output spaces, respectively. Consider a supervised learning setting where each data point is denoted by $z = (x, y)$, drawn from some unknown distribution $D$ over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. We consider a hypothesis space $\mathcal{H}$ (maps from $\mathcal{X}$ to $\mathcal{Y}$) parameterized by $w \in \mathcal{W}$, where $\mathcal{W} \subseteq \mathbb{R}^d$ is a closed set with radius $D$. Let $\ell: \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}_+$ denote the loss function. We say that a loss function $\ell$ is $M$-bounded if $\forall w \in \mathcal{W}, \forall z \sim D$, $\ell(w, z) \le M$; $\ell$ is $\mu$-strongly convex if $\forall w_1, w_2 \in \mathcal{W}, \forall z \sim D$, $\ell(w_1, z) \ge \ell(w_2, z) + \langle \nabla\ell(w_2, z), w_1 - w_2\rangle + \frac{\mu}{2}\|w_1 - w_2\|_2^2$; if $\mu = 0$, we say $\ell(\cdot, z)$ is convex. We say $\ell$ is $G$-Lipschitz continuous if $\forall w_1, w_2 \in \mathcal{W}, \forall z \sim D$, $|\ell(w_1, z) - \ell(w_2, z)| \le G\|w_1 - w_2\|_2$; $\ell$ is $H$-smooth if $\forall w_1, w_2 \in \mathcal{W}, \forall z \sim D$, $\|\nabla\ell(w_1, z) - \nabla\ell(w_2, z)\|_2 \le H\|w_1 - w_2\|_2$.

In a standard (single-task) learning setup, given a model $w$, the expected loss on task $D$ and the empirical loss on a training sample $S$ drawn i.i.d. from $D$ are defined, respectively, as follows: $L(w, D) = \mathbb{E}_{z\sim D}[\ell(w, z)]$ and $L(w, S) = \frac{1}{n}\sum_{z\in S}\ell(w, z)$.

In a meta-learning framework, we consider distributions $\{D_j\}_{j=1}^m$ associated with $m$ different tasks that are drawn from some (unknown) task distribution $\mu$. For each task $j$, we assume that the learner has access to $n$ training examples drawn i.i.d. from $D_j$, i.e., $S_j = \{z_j^i\}_{i=1}^n \sim D_j^n$. We denote the cumulative training data as $\mathbf{S} = \{S_j\}_{j=1}^m$, and refer to it as the meta-sample. A meta-learning algorithm $A$ takes the meta-sample $\mathbf{S}$ as input and outputs an algorithm $A(\mathbf{S}): (\mathcal{X}\times\mathcal{Y})^n \to \mathcal{H}$. The performance of the meta-algorithm $A$ is measured in terms of its ability to generalize w.r.t. loss $\ell(\cdot)$ to a new (previously unseen) task from the task distribution $\mu$; we also refer to it as the transfer risk: $L(A(\mathbf{S}), \mu) = \mathbb{E}_{D\sim\mu}\mathbb{E}_{S\sim D^n} L(A(\mathbf{S})(S), D)$.
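To make these quantities concrete, the following is a minimal sketch (not the authors' code) of how the transfer risk can be estimated by Monte Carlo once a meta-learned adaptation routine is available; the squared loss on linear models and the `sample_task` / `adapt` interfaces are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of estimating the transfer risk
# L(A(S), mu) by Monte Carlo. The squared loss on linear models and the
# `sample_task` / `adapt` interfaces below are illustrative assumptions.
import numpy as np

def empirical_loss(u, X, y):
    """L(u, S) = (1/n) * sum_{z in S} loss(u, z), with squared loss for illustration."""
    return float(np.mean((X @ u - y) ** 2))

def estimate_transfer_risk(adapt, sample_task, n, num_test_tasks=1000, num_eval=200):
    """Monte Carlo estimate of L(A(S), mu) = E_{D ~ mu} E_{S ~ D^n} L(A(S)(S), D).

    `adapt(X, y)` plays the role of A(S)(.): it maps a fresh size-n training
    sample from a new task to a task-specific parameter u.
    `sample_task()` draws a task D ~ mu and returns a sampler draw(k) that
    produces k labelled examples from D.
    """
    risks = []
    for _ in range(num_test_tasks):
        draw = sample_task()            # new task D ~ mu
        X_tr, y_tr = draw(n)            # S ~ D^n, used only for adaptation
        u = adapt(X_tr, y_tr)           # task-specific model A(S)(S)
        X_te, y_te = draw(num_eval)     # fresh examples to estimate L(u, D)
        risks.append(empirical_loss(u, X_te, y_te))
    return float(np.mean(risks))
```

The empirical multi-task risk defined below would be computed analogously by averaging `empirical_loss` of the adapted models over the $m$ training tasks.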
The goal of meta-learning is to learn a useful prior over tasks to help with rapid adaptation to new tasks. Formally, we pose the problem as learning a meta-model, parameterized by what we will refer to as the meta-parameter $w$, that performs well on a variety of tasks. The hope is that the meta-parameter $w$ can be adapted easily to a new task $D \sim \mu$; in particular, that a task-specific model $u$ can be quickly learned from a task-specific training set $S \sim D^n$ of size $n$ using the following proximal update:
$$u = \operatorname{argmin}_{u\in\mathcal{W}} \; L(u, S) + \frac{\lambda}{2}\|u - w\|^2,$$
where $\lambda > 0$ is a regularization parameter.

Algorithm 1: Prox Meta-Learning Algorithm $A$
Input: Meta-sample $\mathbf{S} = \{S_j\}_{j=1}^m$, epochs $T, K$, step sizes $\gamma, \eta$, regularization parameter $\lambda$
1: $w_1 = 0$.
2: for $t = 1, 2, \ldots, T$ do
3:   for $j = 1, \ldots, m$ do
4:     $u(w_t, S_j) = A_{\mathrm{task}}(w_t, S_j, K, \eta, \lambda)$  % using Algorithm 2
5:   end for
6:   Calculate the gradient, for all $j \in [m]$: $\nabla F_{S_j}(u(w_t, S_j), w_t) = -\lambda(u(w_t, S_j) - w_t)$.
7:   Update $w_{t+1} = w_t - \frac{\gamma}{m}\sum_{j=1}^m \nabla F_{S_j}(u(w_t, S_j), w_t)$
8:   $w_{t+1} = \Pi_{\mathcal{W}}(w_{t+1})$
9: end for
10: return $A_{\mathrm{task}}(w_{T+1}, \cdot, K, \eta, \lambda)$

Algorithm 2: Task-specific Algorithm $A_{\mathrm{task}}$
Input: Pre-trained model $w$, training data $S$, #epochs $K$, step size $\eta$, regularization parameter $\lambda$
1: Option 1 (RERM):
2:   $u(w, S) = \operatorname{argmin}_{u\in\mathcal{W}} L(u, S) + \frac{\lambda}{2}\|u - w\|^2$
3: Option 2 (GD): Set $u^{(1)}(w, S) = w$
4:   for $k = 1, 2, \ldots, K-1$ do
5:     $u^{(k+1)}(w, S) = u^{(k)}(w, S) - \eta\big(\nabla L(u^{(k)}(w, S), S) + \lambda(u^{(k)}(w, S) - w)\big)$
6:     $u^{(k+1)}(w, S) = \Pi_{\mathcal{W}}(u^{(k+1)}(w, S))$
7:   end for
8: return Option 1 (RERM): $u(w, S)$; Option 2 (GD): $\frac{1}{K}\sum_{k=1}^K u^{(k)}(w, S)$

The meta-parameter $w$ itself is learned on the given meta-sample $\mathbf{S}$ by minimizing a regularized empirical loss averaged over tasks, where the regularization term penalizes the task-specific models in proportion to the $\ell_2$ distance from the meta-parameter [Zhou et al., 2019]:
$$\hat{w} = \operatorname{argmin}_{w\in\mathcal{W}} \sum_{j=1}^m \min_{u\in\mathcal{W}} F_{S_j}(u, w) := \operatorname{argmin}_{w\in\mathcal{W}} \sum_{j=1}^m \min_{u\in\mathcal{W}} \; L(u, S_j) + \frac{\lambda}{2}\|u - w\|^2. \quad (1)$$

The formulation above involves a bi-level optimization problem. The upper-level optimization involves finding the meta-parameter $w$, which requires solving the lower-level optimization problem of finding task-specific model parameters $u$. We consider both Gradient Descent (GD) as well as Regularized Empirical Risk Minimization (RERM) for task-specific learning (see Algorithm 2 for more details); for meta-learning we employ a gradient descent method (see Algorithm 1). We would like to bound the transfer risk in terms of the empirical multi-task risk: $L(A(\mathbf{S}), \mathbf{S}) = \frac{1}{m}\sum_{j=1}^m L(A(\mathbf{S})(S_j), S_j)$. To do so, we rely on the stability of the meta-learning algorithm.

Stability of Meta-Learning Algorithm. Given a meta-sample $\mathbf{S} = \{S_j\}_{j=1}^m$, define $\mathbf{S}^{(j)}$ to be the meta-sample obtained by replacing the training sample $S_j$ for the $j$-th task, in $\mathbf{S}$, by another i.i.d. sample $S'_j \sim D_j^n$. We refer to $\mathbf{S}, \mathbf{S}^{(j)}$ as neighboring meta-samples. For a task-specific training sample $S = \{z^i\}_{i=1}^n$, let $S^{(i)}$ denote the training data obtained by replacing the $i$-th example $z^i \in S$ by another example $z' \sim D$ drawn independently; we refer to $S, S^{(i)}$ as neighboring samples.

Theorem 2.1 (Maurer [2005]). Suppose the meta-algorithm $A$ satisfies:
1. (Uniform Stability of Single-Task Learning) For any meta-sample $\mathbf{S}$ and any $S, S^{(i)}$, $|\ell(A(\mathbf{S})(S), z) - \ell(A(\mathbf{S})(S^{(i)}), z)| \le \beta$.
2. (Uniform Stability of Meta-Learning) For any $\mathbf{S}, \mathbf{S}^{(j)}$ and any given training set $S \sim D^n$, $|L(A(\mathbf{S})(S), S) - L(A(\mathbf{S}^{(j)})(S), S)| \le \beta'$.
Then, for $M$-bounded loss $\ell$, with probability at least $1-\delta$, we have that
$$L(A(\mathbf{S}), \mu) \le L(A(\mathbf{S}), \mathbf{S}) + (m\beta' + M)\sqrt{\log(1/\delta)/m} + \beta.$$
Theorem 2.1 follows using a simple extension of arguments in Bousquet and Elisseeff [2002].
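For concreteness, the following is a minimal NumPy sketch of Algorithm 1 with Option 2 (GD) of Algorithm 2, under illustrative assumptions: the constraint set $\mathcal{W}$ is taken to be a Euclidean ball of radius `D_radius`, and `grad_L(u, X, y)` is an assumed gradient oracle for the empirical loss $L(u, S)$. It is a sketch, not the authors' implementation.

```python
# A minimal NumPy sketch of Algorithms 1 and 2 (Option 2, GD); not the authors'
# implementation. `grad_L(u, X, y)` is an assumed gradient oracle for L(u, S),
# and the projection onto W is onto a Euclidean ball of radius D_radius.
import numpy as np

def project(w, D_radius):
    """Euclidean projection onto the ball W = {w : ||w|| <= D_radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= D_radius else w * (D_radius / norm)

def task_specific_gd(w, X, y, grad_L, K, eta, lam, D_radius):
    """Algorithm 2, Option 2: K steps of projected GD on
    F_S(u, w) = L(u, S) + (lam/2)||u - w||^2, returning the average iterate."""
    u = w.copy()
    iterates = [u.copy()]
    for _ in range(K - 1):
        u = u - eta * (grad_L(u, X, y) + lam * (u - w))
        u = project(u, D_radius)
        iterates.append(u.copy())
    return np.mean(iterates, axis=0)

def prox_meta_learning(meta_sample, grad_L, T, K, gamma, eta, lam, D_radius, dim):
    """Algorithm 1: projected gradient descent on the meta-parameter w.
    meta_sample is a list of (X_j, y_j) pairs, one per training task."""
    w = np.zeros(dim)
    m = len(meta_sample)
    for _ in range(T):
        # Inner loop: task-specific parameters u(w, S_j) for every task.
        us = [task_specific_gd(w, X, y, grad_L, K, eta, lam, D_radius)
              for (X, y) in meta_sample]
        # grad_w F_{S_j}(u, w) = -lam * (u - w); average over tasks and take a step.
        grad = -lam * (np.mean(us, axis=0) - w)
        w = project(w - gamma * grad, D_radius)
    # The returned "algorithm" adapts w_{T+1} to a new task's sample via Algorithm 2.
    return lambda X, y: task_specific_gd(w, X, y, grad_L, K, eta, lam, D_radius)
```

With Option 1 (RERM), `task_specific_gd` would be replaced by an exact minimizer of the regularized objective, e.g., a closed-form ridge solution when the loss is the squared loss.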
By utilizing sharper bounds tailored for uniformly stable algorithms [Bousquet et al., 2020], a tighter bound can be achieved, as demonstrated in Theorem 2.2 below. A similar result was shown in Guan et al. [2022] for episodic training algorithms (except there is no $\beta$).

Theorem 2.2. Suppose the meta-algorithm $A$ satisfies the same conditions as in Theorem 2.1. Then, for $M$-bounded loss $\ell$, with probability at least $1-\delta$, we have that
$$L(A(\mathbf{S}), \mu) \lesssim L(A(\mathbf{S}), \mathbf{S}) + \beta'\log(m)\log(1/\delta) + M\sqrt{\log(1/\delta)/m} + \beta.$$

3 Uniform Meta-Stability

Motivated by prior work (i.e., Theorem 2.1 and the definitions therein), we introduce a new notion of stability which measures the sensitivity of the learning algorithm as we replace both a task in the meta-sample as well as a single training example available for the task at test time.

Definition (Uniform Meta-Stability). We say that a meta-learning algorithm $A$ is $\beta$-uniformly meta-stable if for any neighboring meta-samples $\mathbf{S}, \mathbf{S}^{(j)}$, any neighboring samples $S, S^{(i)}$, any task $D \sim \mu$, and any $z \sim D$, we have that $|\ell(A(\mathbf{S})(S), z) - \ell(A(\mathbf{S}^{(j)})(S^{(i)}), z)| \le \beta$.

The definition above is rather natural. Intuitively, for a meta-learning algorithm to transfer well, we require that the learning algorithms, i.e., $A(\mathbf{S})$ and $A(\mathbf{S}')$, returned on two neighboring meta-samples, when trained on two neighboring samples, return models that predict similarly. Our first result bounds the generalization gap in terms of the uniform meta-stability parameter.

Theorem 3.1. Consider a meta-learning problem for some $M$-bounded loss function $\ell$ and task distribution $\mu$. Let $\mathbf{S}$ be a meta-sample consisting of training samples on $m$ tasks, each of size $n$, and let $S \sim D^n$ be a sample of size $n$ on a previously unseen task $D \sim \mu$. Then, for any $\beta$-uniformly meta-stable learning algorithm $A$, we have that with probability $1-\delta$,
$$L(A(\mathbf{S}), \mu) \lesssim L(A(\mathbf{S}), \mathbf{S}) + \beta\log(mn)\log(1/\delta) + M\sqrt{\log(1/\delta)/(mn)}.$$

The result above is a direct analogue of Theorem 2.1 with the stability parameters $\beta, \beta'$ both subsumed into a single meta-stability parameter. We do obtain a faster rate of convergence as we instantiate concrete algorithms and specialize our results to specific problems in Section 4.1; we will see a notable improvement in rates from $1/\sqrt{m}$ to $1/m$ for $n > m$. We conclude the section by presenting an alternate notion of algorithmic meta-stability and a basic result that directly bounds the generalization gap for the meta-learning problem.

Definition (On-Average Meta-Stability). Let $\mu$ be an (unknown) underlying task distribution. We say that a meta-learning algorithm $A$ is $\beta'$-on-average-replace-one-meta-stable if
$$\mathbb{E}_{\mathbf{S}\sim\{D_j^n\}_{j=1}^m,\,(S'_j, z'_j)\sim D_j^{n+1},\,\{D_j\}_{j=1}^m\sim\mu^m,\,j\sim U[m],\,i\sim U[n]}\Big[\ell(A(\mathbf{S})(S_j), z_j^i) - \ell(A(\mathbf{S}^{(j)})(S_j^{(i)}), z_j^i)\Big] \le \beta'.$$

Theorem 3.2. Let $\mu$ be an underlying task distribution. Given a meta-sample $\mathbf{S}$, test task $D \sim \mu$, and $S \sim D^n$, for any $\beta'$-on-average-replace-one-meta-stable meta-learning algorithm $A$, we have that
$$\mathbb{E}_{\mathbf{S}\sim\{D_j^n\}_{j=1}^m,\,\{D_j\}_{j=1}^m\sim\mu^m}\big[L(A(\mathbf{S}), \mu) - L(A(\mathbf{S}), \mathbf{S})\big] \le \beta'.$$

4 Bounding Transfer Risk

In this section, we consider the concrete meta-learning algorithm given in Algorithm 1.

4.1 Convex and Smooth Losses

We begin with meta-learning problems with convex, Lipschitz (and potentially smooth) losses.

Lemma 4.1. Assume that the loss function $\ell$ is convex and $G$-Lipschitz. Let $\mathbf{S}, \mathbf{S}^{(j)}$ denote neighboring meta-samples and $S, S^{(i)}$ the neighboring samples on a test task.
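The deviation appearing in the definition of uniform meta-stability can also be probed empirically. The following hedged sketch (illustrative names, not part of the paper) measures that deviation for one pair of neighboring meta-samples and neighboring test samples, assuming a meta-learning routine with the interface of the sketch following Algorithm 1.

```python
# An illustrative probe (not a proof, and not the authors' code) of the quantity
# |l(A(S)(S), z) - l(A(S^(j))(S^(i)), z)| from the definition of uniform meta-stability.
# `meta_alg(meta_sample)` is assumed to return an adaptation routine u = adapt(X, y).
import copy
import numpy as np

def sq_loss(u, x, y):
    """Squared loss at a single point z = (x, y), for illustration."""
    return float((x @ u - y) ** 2)

def meta_stability_probe(meta_alg, meta_sample, fresh_task_sample, S, z, replacement, j, i):
    """One-sample estimate of the meta-stability deviation.

    meta_sample: list of (X_k, y_k) training sets, one per training task.
    fresh_task_sample: an independent (X'_j, y'_j) drawn from the same task D_j.
    S = (X, y): the size-n training sample on the test task; z = (x, y) is a test point.
    replacement: an independent example (x', y') from the test task, swapped in at index i.
    """
    X, y = S
    x_test, y_test = z

    u = meta_alg(meta_sample)(X, y)                      # A(S)(S)

    meta_neighbor = copy.deepcopy(meta_sample)           # S^(j): replace task j's sample
    meta_neighbor[j] = fresh_task_sample

    X_i, y_i = X.copy(), y.copy()                        # S^(i): replace the i-th example
    X_i[i], y_i[i] = replacement

    u_neighbor = meta_alg(meta_neighbor)(X_i, y_i)       # A(S^(j))(S^(i))
    return abs(sq_loss(u, x_test, y_test) - sq_loss(u_neighbor, x_test, y_test))
```

Taking the maximum of this quantity over many draws of tasks, samples, indices, and test points gives a crude lower estimate of $\beta$, since the definition takes a supremum.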
Then, the following holds for Algorithm 1 with RERM for task-specific learning (i.e., Option 1 for Algorithm 2): for all $T \ge 1$,
$$\sup_{\mathbf{S}, S, j\in[m], i\in[n]} \big\|A(\mathbf{S})(S) - A(\mathbf{S}^{(j)})(S^{(i)})\big\| \lesssim G\Big(\frac{1}{\lambda n} + \frac{1}{\lambda m}\Big).$$
Further, if $\ell$ is convex, $M$-bounded, and $H$-smooth, then setting $\lambda \ge H$, $\gamma \le \frac{1}{\lambda}$, we have for all $T \ge 1$,
$$\sup_{\mathbf{S}, S, j\in[m], i\in[n]} \big\|A(\mathbf{S})(S) - A(\mathbf{S}^{(j)})(S^{(i)})\big\| \le \frac{2\sqrt{2HM}}{2\lambda n - H} + \frac{n}{2\lambda n - H}\cdot\frac{4\sqrt{2HM}}{m+1}.$$

We can now use the result above with Theorem 3.1 to get the following bound on the transfer risk.

Theorem 4.2. The following holds for Algorithm 1 with step-size $\gamma \le \frac{1}{\lambda}$ on a given meta-sample $\mathbf{S}$, and RERM for task-specific learning (i.e., Option 1 for Algorithm 2), for all $T \ge 1$:
1. For convex, $M$-bounded, and $G$-Lipschitz loss functions, with probability at least $1-\delta$,
$$L(A(\mathbf{S}), \mu) \lesssim L(A(\mathbf{S}), \mathbf{S}) + \frac{G^2}{\lambda}\Big(\frac{1}{n} + \frac{1}{m}\Big)\log(mn)\log(1/\delta) + M\sqrt{\frac{\log(1/\delta)}{mn}}.$$
2. For convex, $M$-bounded, and $H$-smooth loss functions ($H \le \lambda$), with probability at least $1-\delta$,
$$L(A(\mathbf{S}), \mu) \lesssim L(A(\mathbf{S}), \mathbf{S}) + \Big(\frac{HM}{(2n-1)\lambda} + \frac{HM}{(m+1)\lambda}\Big)\log(mn)\log(1/\delta) + M\sqrt{\frac{\log(1/\delta)}{mn}}.$$

Next, we give analogous results for GD for task-specific learning (i.e., Option 2 for Algorithm 2), albeit for smooth loss functions. Lemma 4.3 bounds the output sensitivity of the meta-learning algorithm. We use it with Theorem 3.1 to give the generalization guarantee in Theorem 4.4.

Lemma 4.3. Assume that the loss function is convex, $G$-Lipschitz, and $H$-smooth. Let $\mathbf{S}, \mathbf{S}^{(j)}$ denote neighboring meta-samples and $S, S^{(i)}$ the neighboring samples on a test task. Then the following holds for Algorithm 1 with GD for task-specific learning (i.e., Option 2 for Algorithm 2) with $\eta \le \frac{2}{H+2\lambda}$, for all $T \ge 1$, as long as we set $\gamma \le \frac{1}{\lambda T}$:
$$\sup_{\mathbf{S}, S, j\in[m], i\in[n]} \big\|A(\mathbf{S})(S) - A(\mathbf{S}^{(j)})(S^{(i)})\big\| \lesssim eG\Big(\frac{1}{\lambda n} + \frac{1}{\lambda m}\Big).$$

Theorem 4.4. Assume that the loss function is convex, $M$-bounded, $G$-Lipschitz, and $H$-smooth. Suppose we run Algorithm 1 for $T$ iterations with $\gamma \le \frac{1}{\lambda T}$ on a given meta-sample $\mathbf{S}$, and GD for task-specific learning (Option 2, Algorithm 2) with $\eta \le \frac{2}{H+2\lambda}$. Then, with probability at least $1-\delta$,
$$L(A(\mathbf{S}), \mu) \lesssim L(A(\mathbf{S}), \mathbf{S}) + \frac{G^2}{\lambda}\Big(\frac{1}{n} + \frac{1}{m}\Big)\log(mn)\log(1/\delta) + M\sqrt{\frac{\log(1/\delta)}{mn}}.$$

The results above show that meta-stable learning algorithms do not overfit. The bound on the generalization gap of $\tilde{O}\big(\frac{1}{m} + \frac{1}{n} + \frac{1}{\sqrt{mn}}\big)$ is tighter than what we would obtain using prior work. Indeed, we show that Theorem 2.2 yields a rate of $\tilde{O}\big(\frac{1}{n} + \frac{1}{\sqrt{m}}\big)$ (see Theorems C.2 and C.3 in the Appendix), which is worse for all $m \le n^2$. Notably, the bounds on the generalization gap are independent of the number of iterations $T$ of the meta-learning Algorithm 1 and the number of iterations $K$ of GD for Algorithm 2. This holds since the objective we are minimizing is strongly convex (given the strongly convex regularizer), which ensures that the output sensitivity bounds (in Lemmas 4.1 and 4.3) are independent of $T$ and $K$. In itself, this should not be surprising since we only bound the generalization error in terms of the empirical error; the latter may not be small unless the algorithms have converged. To get a better handle on the generalization error, we focus on excess (transfer) risk bounds in Section 4.3. But first, we give a similar development for another important problem class.

4.2 Weakly Convex and Non-smooth Losses

Here, we focus on a more practical setting of learning problems with loss functions that are weakly convex and non-smooth. The notion of weak convexity is often used in the non-convex optimization literature in a variety of problems, including robust phase retrieval [Davis et al., 2020] and dictionary learning [Davis and Drusvyatskiy, 2019]; see Drusvyatskiy [2017] for an extended discussion.

Definition. A function $f(w)$ is $\rho$-weakly convex w.r.t. $\|\cdot\|$ if $f(w) + \frac{\rho}{2}\|w\|^2$ is convex in $w$.
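To see why the regularized inner objective drives the $T$- and $K$-independent sensitivity bounds noted above, observe that if $L(\cdot, S)$ is $\rho$-weakly convex (with $\rho = 0$ covering the convex case), then
$$F_S(u, w) = L(u, S) + \frac{\lambda}{2}\|u - w\|^2 = \Big[L(u, S) + \frac{\rho}{2}\|u\|^2\Big] + \frac{\lambda-\rho}{2}\|u\|^2 - \lambda\langle u, w\rangle + \frac{\lambda}{2}\|w\|^2,$$
i.e., the sum of a convex function of $u$ and a $(\lambda-\rho)$-strongly convex quadratic in $u$. Hence $F_S(\cdot, w)$ is $(\lambda-\rho)$-strongly convex in $u$, and the requirement $\lambda \ge 2\rho$ used below keeps the strong convexity modulus at least $\lambda/2$.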
The class of weakly convex functions is contained within the larger class of non-smooth functions and semi-smooth functions [Mifflin, 1977]. It includes convex functions and smooth functions with Lipschitz continuous gradient as special cases; $\rho < 0$ implies that the function is strongly convex. An important example from a practical perspective is that of training over-parameterized two-layer neural networks with smooth activation functions using a smooth loss [Richards and Rabbat, 2021]. We first bound the sensitivity of Algorithm 1 for weakly convex and non-smooth losses.

Lemma 4.5. Assume that the loss function is $\rho$-weakly convex and $G$-Lipschitz. Let $\mathbf{S}, \mathbf{S}^{(j)}$ denote neighboring meta-samples and $S, S^{(i)}$ the neighboring samples on a test task. Then the following holds for Algorithm 1 with $\lambda \ge 2\rho$, and GD for task-specific learning (i.e., Option 2 for Algorithm 2) with $\eta \le \frac{1}{\lambda}$, for all $T \ge 1$, as long as we set $\gamma \le \frac{1}{\lambda T}$:
$$\sup_{\mathbf{S}, S, j\in[m], i\in[n]} \big\|A(\mathbf{S})(S) - A(\mathbf{S}^{(j)})(S^{(i)})\big\| \lesssim G\Big(\sqrt{\eta} + \frac{1}{\lambda n}\Big).$$

Using the result above in conjunction with Theorem 3.1 gives the following bound on the transfer risk.

Theorem 4.6. Assume that the loss function is $\rho$-weakly convex, $M$-bounded, and $G$-Lipschitz. Suppose we run Algorithm 1 for $T$ iterations with $\gamma \le \frac{1}{\lambda T}$, $\lambda \ge 2\rho$, on a meta-sample $\mathbf{S}$, and GD for task-specific learning (Option 2, Algorithm 2) with $\eta \le \frac{1}{\lambda}$. Then, with probability at least $1-\delta$,
$$L(A(\mathbf{S}), \mu) \lesssim L(A(\mathbf{S}), \mathbf{S}) + G^2\Big(\sqrt{\eta} + \frac{1}{\lambda n}\Big)\log(mn)\log(1/\delta) + M\sqrt{\frac{\log(1/\delta)}{mn}}.$$

The proof of Theorem 4.6 follows from Lemma 4.5 and Theorem 3.1. A few remarks are in order. For learning rate $\gamma \le \frac{1}{\lambda T}$, Theorem 4.6 gives a rate of $\tilde{O}\big(\sqrt{\eta} + \frac{1}{n} + \frac{1}{\sqrt{mn}}\big)$ on the generalization gap. This naturally suggests setting $\eta = \frac{1}{\lambda K}$, where $K \le \min\{m, n\}$ is the number of iterations of GD in task-specific learning. Then, similar to the discussion in Section 4.1, Theorem 4.6 gives a tighter bound, when $n > m$, than those derived using prior work (Theorem 2.2); we refer the reader to Theorem D.4 in the appendix for further details. Our proof technique shares similarities with Bassily et al. [2020]. However, our result is not a straightforward application of theirs, as we deal with a bi-level optimization problem and focus on weakly convex functions. It is worth noting that our results for weakly convex non-smooth losses require a regularization parameter $\lambda \ge 2\rho$, which can be chosen in practice using cross-validation.

The work most closely related to ours is that of Guan et al. [2022]. However, our results are fundamentally different from theirs in several aspects. First, the algorithms we study are different. Guan et al. [2022] focus on support/query (S/Q) training strategies (aka episodic training), where each task $S_j$ is split into two non-overlapping parts: the support set $S_j^{tr}$ for training the task-specific parameter and the query set $S_j^{ts}$ for measuring the algorithm's performance [Vinyals et al., 2016]. The meta-parameter is learned by minimizing the loss computed over the query set. Such an S/Q training strategy is popular for modern gradient-based meta-learning algorithms such as MAML for few-shot learning [Finn et al., 2017], where the optimization objective can be written as $\min_w \frac{1}{m}\sum_{j=1}^m L(w - \eta\nabla L(w, S_j^{tr}), S_j^{ts})$. One notable limitation is that Guan et al. [2022] assume that the loss function at the task level, e.g., $R(w, S_j) = L(w - \eta\nabla L(w, S_j^{tr}), S_j^{ts})$, is convex or (Hölder) smooth. Such an assumption is highly impractical, as demonstrated by Mishchenko et al. [2023, Theorems 1 and 2], who provide several counterexamples where $L$ is convex and smooth but $R$ is neither convex nor smooth.
In contrast, we directly deal with $L$ being weakly convex and non-smooth. Our approach requires a more involved proof that deals with the stability of bi-level optimization. This is in stark contrast with Guan et al. [2022], who directly reduce the meta-learning problem to a single-task learning problem without considering the bi-level structure of the problem.

The work of Fallah et al. [2021] proposed a notion of stability similar to ours. The difference is that they consider S/Q training and define the stability by changing a mini-batch of samples in $S_j^{tr}$ as well as a single sample in $S_j^{ts}$. Moreover, their focus is primarily on strongly convex losses. They discuss generalization to training tasks and unseen tasks separately, as they do not assume all tasks are sampled from the same task distribution. Another related work, Guan and Lu [2021], presents a generalization bound of $O(\sqrt{C/(mn)})$ under a task-relatedness assumption, where $C$ captures the logarithm of the covering number of the hypothesis class, which possibly depends on the dimension $d$. More recently, Riou et al. [2023] provide generalization bounds with a fast rate of $O(\frac{1}{n})$, albeit under an additional extended Bernstein's condition.

4.3 Excess Transfer Risk

In the previous sections, we focused on establishing that meta-stable rules do not overfit to the meta-sample. In this section, we focus on the question of whether the meta-learning Algorithm 1 can achieve a small generalization error, i.e., is it guaranteed to transfer well on unseen tasks? We address this by focusing on the computational aspects, i.e., by bounding the optimization error in terms of the number of iterations. Furthermore, we give bounds on excess risk, wherein the benchmark is the performance of the best possible in-class predictor. Let $u^* = \operatorname{argmin}_{u\in\mathcal{W}} L(u, D)$ and $u_j^* = \operatorname{argmin}_{u\in\mathcal{W}} L(u, S_j)$, $j\in[m]$, be the optimal task-specific hypotheses for the unseen task and the given training tasks, respectively. Given a meta-algorithm $A$, the excess transfer risk can be decomposed as follows:
$$\underbrace{L(A(\mathbf{S})(S), D) - L(u^*, D)}_{\text{Excess Transfer Risk }\,\mathcal{E}_{\mathrm{risk}}(A)} = \underbrace{L(A(\mathbf{S})(S), D) - \frac{1}{m}\sum_{j=1}^m L(A(\mathbf{S})(S_j), S_j)}_{\text{Generalization Gap }\,\mathcal{E}_{\mathrm{gen}}(A)} + \underbrace{\frac{1}{m}\sum_{j=1}^m\big[L(A(\mathbf{S})(S_j), S_j) - L(u_j^*, S_j)\big]}_{\text{Optimization and Approximation Error }\,\mathcal{E}_{\mathrm{opt+app}}(A)} + \frac{1}{m}\sum_{j=1}^m\big[L(u_j^*, S_j) - L(u^*, S_j)\big] + \underbrace{\frac{1}{m}\sum_{j=1}^m\big[L(u^*, S_j) - L(u^*, D)\big]}_{\mathbb{E}_{j\in[m],\,S_j\sim D_j^n,\,D_j\sim\mu,\,D\sim\mu}\,=\,0}$$
The third term on the right-hand side is non-positive, since $u_j^*$ minimizes $L(\cdot, S_j)$, and the last term is zero in expectation. Hence, to control the excess risk, we need to bound $\mathcal{E}_{\mathrm{gen}}(A)$ and $\mathcal{E}_{\mathrm{opt+app}}(A)$ simultaneously. The bounds on the first term are presented in the previous section. Here, we focus on analyzing the second term.

Theorem 4.7. Assume that the loss $\ell$ is convex and $G$-Lipschitz. Define $u_j^* = \operatorname{argmin}_u L(u, S_j)$, $j\in[m]$. Suppose we run Algorithm 1 for $T$ iterations with step-size $\gamma = \frac{1}{\lambda T}$, using GD for task-specific learning (i.e., Option 2 for Algorithm 2), to find an algorithm $A(\mathbf{S}) = A_{\mathrm{task}}(w_{T+1}, \cdot)$, which is then run on $S_j$ for $K$ iterations with step-size $\eta \le \frac{1}{2\lambda}$. Then, we have that
$$L(A(\mathbf{S})(S_j), S_j) - \inf_u L(u, S_j) \lesssim \frac{D^2}{\eta K} + G^2\eta + GD\eta\lambda + \lambda\|w_{T+1} - \hat{w}\|^2 + \lambda\sigma^2,$$
where $\hat{w}$ is defined in Equation (1). Here $\sigma^2 := \frac{1}{m}\sum_{j=1}^m\|\hat{w} - u_j^*\|^2$ is the approximation error, and $\|w_{T+1} - \hat{w}\|^2 \lesssim \frac{1}{T}\big(D^2 + \frac{D^2}{\lambda\eta K} + \frac{\eta(G+2\lambda D)^2}{\lambda}\big)$ is the optimization error.

Finally, to bound the excess transfer risk for convex and non-smooth losses, we use Theorem 4.6 with Theorem 4.7 to get that, in expectation over the sampling of data (meta-sample $\mathbf{S}$ and sample $S$),
$$\mathbb{E}[\mathcal{E}_{\mathrm{risk}}(A)] \le \mathbb{E}[\mathcal{E}_{\mathrm{gen}}(A)] + \mathbb{E}[\mathcal{E}_{\mathrm{opt+app}}(A)] \lesssim G^2\sqrt{\eta} + \frac{D^2}{\eta K} + G^2\eta + GD\eta\lambda + \frac{\lambda D^2}{T} + \eta(G+2\lambda D)^2 + \lambda\sigma^2.$$
By properly choosing the step size $\eta = O\big(\frac{1}{\lambda K^{2/3}}\big)$, we obtain that the expected excess transfer risk decays at a rate of $O\big(\frac{1}{\lambda K^{1/3}} + \frac{1}{\lambda m} + \frac{1}{T} + \lambda\sigma^2\big)$.
Similarly, for convex, Lipschitz, and smooth losses, applying Theorem 4.4 with Theorem 4.7 and selecting $\eta = O\big(\frac{1}{\lambda\sqrt{K}}\big)$ results in an expected excess transfer risk of $O\big(\frac{1}{\lambda\sqrt{K}} + \frac{1}{\lambda m} + \frac{1}{T} + \lambda\sigma^2\big)$. Therefore, as $K, T, m, n$ tend to infinity, the excess risk converges to $\sigma^2$. As $\sigma$ represents the average distance between the optimal task-specific parameters $u_j^*$ and the estimated meta-parameter $\hat{w}$, the excess risk is small when $\sigma$ is small. It is also typical to set the regularization parameter $\lambda$ inversely proportional to the sample size $n$ (e.g., $\lambda = O(1/\sqrt{n})$).

Denevi et al. [2019a] study the same algorithm as ours, except in the online setting. However, the function classes they consider are limited to compositions of linear hypothesis classes with convex and closed losses. In contrast, our work considers a broader range of functions, encompassing not only convex, Lipschitz, and smooth functions but also weakly convex and non-smooth functions. The bound on expected excess risk shown in Denevi et al. [2019a] takes the form $O\big(\frac{\mathrm{Var}_m}{\sqrt{n}} + \frac{1}{\sqrt{m}}\big)$, where $\mathrm{Var}_m$ captures the relatedness among the tasks sampled from the task environment. Unfortunately, this bound relies on a specific choice of $\lambda = O\big(\frac{1}{\mathrm{Var}_m}\big)$, which depends on $\mathrm{Var}_m$, a quantity that is often not known a priori in practice. To compare with our work, set $K = n$, $T = m$, $\eta = O(1/\sqrt{n})$, and $\lambda = O(1/\sqrt{n})$. Then, applying Theorem 4.4 with Theorem 4.7, we obtain that $\mathbb{E}[\mathcal{E}_{\mathrm{risk}}(A)] \lesssim \frac{\sqrt{n}}{m} + \frac{\max(1, \sigma^2)}{\sqrt{n}}$. Considering both $\mathrm{Var}_m$ and $\sigma$ as constants, the bound on expected excess risk based on our analysis is tighter than that of Denevi et al. [2019a] when $n \le m$, a common setting studied in the meta-learning framework. We also conduct a simple experiment to empirically verify the tightness of our generalization bounds, which we defer to Appendix A due to space limitations.

5 Implications of the Generalization Bounds

Next, we present stochastic and adversarially robust variants of the meta-learning Algorithm 1.

5.1 Proximal Meta-Learning with Stochastic Optimization

We adapt Algorithm 1 to utilize sampling with replacement, where at each iteration we process the training set of a single task; see Algorithm 3 for more details. We show that with high probability the sensitivity of this stochastic meta-learning algorithm is bounded.

Lemma 5.1. Assume that the loss function is $\rho$-weakly convex and $G$-Lipschitz. Let $\mathbf{S}, \mathbf{S}^{(j)}$ denote neighboring meta-samples and $S, S^{(i)}$ the neighboring samples on a test task. Then, with probability at least $1 - \exp(-T^2e^2/m^2)$, the following holds for Algorithm 3 with $\lambda \ge 2\rho$, and GD for task-specific learning (i.e., Option 2 for Algorithm 2) with $\eta \le \frac{1}{\lambda}$, for all $T \ge 1$, as long as we set $\gamma \le \frac{1}{\lambda T}$:
$$\sup_{\mathbf{S}, S, i\in[n], j\in[m]} \big\|A(\mathbf{S})(S) - A(\mathbf{S}^{(j)})(S^{(i)})\big\| \lesssim G\sqrt{\eta}.$$

Algorithm 3: Stochastic Prox Meta-Learning
Input: Meta-sample $\mathbf{S} = \{S_j\}_{j=1}^m$, epochs $T, K$, step sizes $\gamma, \eta$, regularization parameter $\lambda$
1: $w_1 = 0$.
2: for $t = 1, 2, \ldots, T$ do
3:   Sample $j_t \sim U[m]$.
4:   $u(w_t, S_{j_t}) = A_{\mathrm{task}}(w_t, S_{j_t}, K, \eta, \lambda)$.  % using Algorithm 2
5:   Calculate the gradient $\nabla F_{S_{j_t}}(u(w_t, S_{j_t}), w_t) = -\lambda(u(w_t, S_{j_t}) - w_t)$.
6:   Update $w_{t+1} = w_t - \gamma\nabla F_{S_{j_t}}(u(w_t, S_{j_t}), w_t)$.
7:   $w_{t+1} = \Pi_{\mathcal{W}}(w_{t+1})$.
8: end for
9: return $A_{\mathrm{task}}(w_{T+1}, \cdot, K, \eta, \lambda)$

5.2 Robust Adversarial Proximal Meta-Learning

We consider inference-time adversarial attacks with a general threat model $B: \mathcal{X} \to 2^{\mathcal{X}}$. Specifically, given an input example $x \in \mathcal{X}$, $B(x) \subseteq \mathbb{R}^d$ represents the set of all possible perturbations of $x$ that an adversary can choose from.
These include typical examples such as the $\ell_p$ threat models that are often considered in practice, or a discrete set of designed transformations. Given a model parameter $w$, let $\tilde{\ell}(w, z) = \max_{z'\in B(z)} \ell(w, z')$ denote the adversarial loss. We adapt the standard meta-learning framework simply by considering the robust variant, $\tilde{\ell}$, of the standard loss $\ell$. We denote the robust transfer risk and the empirical robust multi-task risk as $L_{\mathrm{rob}}(A(\mathbf{S}), \mu)$ and $L_{\mathrm{rob}}(A(\mathbf{S}), \mathbf{S})$, respectively. Now, given a meta-sample $\mathbf{S}$, the goal is to learn a robust prior (e.g., a pre-trained model) for rapid adaptation to, and robust generalization on, new tasks. We adopt the framework presented in Section 2, except that we use the robust loss for task-specific training; indeed, using GD (Option 2) on the robust loss in Algorithm 2 yields adversarial training. We use Algorithm 1 for meta-learning. We now relate a loss function with its adversarially robust counterpart.

Proposition 5.2. Given a loss function $\ell(\cdot, z)$ and its adversarial counterpart $\tilde{\ell}(\cdot, z)$, the following holds: (1) If $\ell$ is $G$-Lipschitz (in its first argument), then $\tilde{\ell}$ is $G$-Lipschitz. (2) $\tilde{\ell}$ is, in general, not $H$-smooth even if $\ell$ is $H$-smooth. (3) If $\ell$ is $H$-smooth in $w$, then $\tilde{\ell}$ is $H$-weakly convex in $w$.

Using the result above with Theorem 3.1 yields the following bound on the robust (transfer) risk.

Corollary 5.3. Assume that the loss $\ell$ is $M$-bounded, $G$-Lipschitz, and $H$-smooth. Suppose we run Algorithm 1 for $T$ iterations with $\gamma \le \frac{1}{\lambda T}$, $\eta \le \frac{1}{\lambda}$, $\lambda > 2H$, wherein the task-specific learning Algorithm 2 (GD) is invoked with the robust loss $\tilde{\ell}$. Then, with probability at least $1-\delta$,
$$L_{\mathrm{rob}}(A(\mathbf{S}), \mu) \lesssim L_{\mathrm{rob}}(A(\mathbf{S}), \mathbf{S}) + G^2\Big(\sqrt{\eta} + \frac{1}{\lambda n}\Big)\log(mn)\log(1/\delta) + M\sqrt{\frac{\log(1/\delta)}{mn}}.$$

Note that prior work on robust adversarial meta-learning [Yin et al., 2018, Goldblum et al., 2020, Wang et al., 2021] focuses on an empirical study of the problem; we present the first theoretical guarantees.

6 Conclusion

In this paper, we introduce a novel notion of stability for meta-learning algorithms, namely uniform meta-stability, and offer a tighter bound on the generalization gap for the meta-learning problem compared to the existing literature. We instantiate uniformly meta-stable learning algorithms and give generalization guarantees for both convex, smooth losses as well as weakly convex and non-smooth losses. Several avenues for further exciting research remain. For instance, it remains to be seen if our bounds are tight. Can we show lower bounds on the generalization error for meta-learning? Additionally, understanding how meta-learning relates to federated learning may offer insights on how to extend the theory to broader applications and inform the design of new algorithms. Finally, motivated by data privacy considerations, it would be interesting to extend our setup to privacy-preserving meta-learning, similar in spirit to the recent work of Zhou and Bassily [2022].

Acknowledgments and Disclosure of Funding

This research was supported, in part, by the DARPA GARD award HR00112020004, NSF CAREER award IIS-1943251, funding from the Institute for Assured Autonomy (IAA) at JHU, and the Spring '22 workshop on Learning and Games at the Simons Institute for the Theory of Computing. YW acknowledges the support of an Amazon Fellowship.

References

Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017. Maruan Al-Shedivat, Liam Li, Eric Xing, and Ameet Talwalkar. On data efficiency of meta-learning.
In International Conference on Artificial Intelligence and Statistics, pages 1369 1377. PMLR, 2021. Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended pac-bayes theory. In International Conference on Machine Learning, pages 205 214. PMLR, 2018. Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In International Conference on Machine Learning, pages 424 433. PMLR, 2019. Fan Bao, Guoqiang Wu, Chongxuan Li, Jun Zhu, and Bo Zhang. Stability and generalization of bilevel programming in hyperparameter optimization. Advances in neural information processing systems, 34:4529 4541, 2021. Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33: 4381 4391, 2020. Jonathan Baxter. A model of inductive bias learning. Journal of artificial intelligence research, 12: 149 198, 2000. Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 567 580. Springer, 2003. Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499 526, 2002. Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610 626. PMLR, 2020. Fei Chen, Mi Luo, Zhenhua Dong, Zhenguo Li, and Xiuqiang He. Federated meta-learning with fast convergence and efficient communication. ar Xiv preprint ar Xiv:1802.07876, 2018. Jiaxin Chen, Xiao-Ming Wu, Yanke Li, Qimai Li, Li-Ming Zhan, and Fu-lai Chung. A closer look at the training strategy for modern meta-learning. Advances in Neural Information Processing Systems, 33:396 406, 2020. Qi Chen, Changjian Shui, and Mario Marchand. Generalization bounds for meta-learning: An information-theoretic analysis. Advances in Neural Information Processing Systems, 34:25878 25890, 2021. Liam Collins, Aryan Mokhtari, and Sanjay Shakkottai. Task-robust model-agnostic meta-learning. Advances in Neural Information Processing Systems, 33:18860 18871, 2020. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207 239, 2019. Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908 1930, 2019. Damek Davis, Dmitriy Drusvyatskiy, and Courtney Paquette. The nonsmooth landscape of phase retrieval. IMA Journal of Numerical Analysis, 40(4):2652 2695, 2020. Giulia Denevi, Carlo Ciliberto, Dimitris Stamos, and Massimiliano Pontil. Learning to learn around a common mean. Advances in neural information processing systems, 31, 2018. Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to-learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566 1575. PMLR, 2019a. Giulia Denevi, Dimitris Stamos, Carlo Ciliberto, and Massimiliano Pontil. Online-within-online meta-learning. Advances in Neural Information Processing Systems, 32, 2019b. Giulia Denevi, Massimiliano Pontil, and Carlo Ciliberto. The advantage of conditional meta-learning for biased regularization and fine tuning. 
Advances in Neural Information Processing Systems, 33: 964 974, 2020. Nan Ding, Xi Chen, Tomer Levinboim, Sebastian Goodman, and Radu Soricut. Bridging the gap between practice and pac-bayes theory in few-shot meta-learning. Advances in Neural Information Processing Systems, 34:29506 29516, 2021. Dmitriy Drusvyatskiy. The proximal point method revisited. ar Xiv preprint ar Xiv:1712.06038, 2017. Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. ar Xiv preprint ar Xiv:2002.09434, 2020. Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil, and Leslie Pack Kaelbing. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(1), 2005. Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. arxiv preprint: 1908.10400, 2019. Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A metalearning approach. ar Xiv preprint ar Xiv:2002.07948, 2020. Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic metalearning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34:5469 5480, 2021. Alec Farid and Anirudha Majumdar. Generalization bounds for meta-learning via pac-bayes and uniform stability. Advances in neural information processing systems, 34:2173 2186, 2021. Vitaly Feldman and Jan Vondrak. Generalization bounds for uniformly stable algorithms. Advances in Neural Information Processing Systems, 31, 2018. Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pages 1270 1279. PMLR, 2019. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126 1135. PMLR, 2017. Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pages 1568 1577. PMLR, 2018. Micah Goldblum, Liam Fowl, and Tom Goldstein. Adversarially robust few-shot learning: A meta-learning approach. Advances in Neural Information Processing Systems, 33:17886 17895, 2020. Jiechao Guan and Zhiwu Lu. Task relatedness-based generalization bounds for meta learning. In International Conference on Learning Representations, 2021. Jiechao Guan, Yong Liu, and Zhiwu Lu. Fine-grained analysis of stability and generalization for modern meta learning algorithms. Advances in Neural Information Processing Systems, 35: 18487 18500, 2022. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225 1234. PMLR, 2016. Fredrik Hellström and Giuseppe Durisi. Evaluated cmi bounds for meta learning: Tightness and expressiveness. Advances in Neural Information Processing Systems, 35:20648 20660, 2022. Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9): 5149 5169, 2021. Kaiyi Ji, Jason D Lee, Yingbin Liang, and H Vincent Poor. Convergence of meta-learning with task-specific adaptation over partial parameters. 
Advances in Neural Information Processing Systems, 33:11490 11500, 2020. Weisen Jiang, James Kwok, and Yu Zhang. Effective meta-regularization by kernelized proximal regularization. Advances in Neural Information Processing Systems, 34:26212 26222, 2021. Sharu Theresa Jose and Osvaldo Simeone. Information-theoretic generalization bounds for metalearning and applications. Entropy, 23(1):126, 2021. Sharu Theresa Jose, Osvaldo Simeone, and Giuseppe Durisi. Transfer meta-learning: Informationtheoretic bounds and information meta-risk minimization. IEEE Transactions on Information Theory, 68(1):474 501, 2021. Yegor Klochkov and Nikita Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate o(1/n). Advances in Neural Information Processing Systems, 34:5065 5076, 2021. Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106:171 195, 2017. Yunwen Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. In The Thirty Sixth Annual Conference on Learning Theory, pages 191 227. PMLR, 2023. Jeffrey Li, Mikhail Khodak, Sebastian Caldas, and Ameet Talwalkar. Differentially private metalearning. ar Xiv preprint ar Xiv:1909.05830, 2019. Tianyu Liu, Jie Lu, Zheng Yan, and Guangquan Zhang. Pac-bayes bounds for meta-learning with data-dependent prior. ar Xiv preprint ar Xiv:2102.03748, 2021. Tongliang Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pages 2159 2167. PMLR, 2017. Andreas Maurer. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6 (6), 2005. Andreas Maurer. Transfer bounds for linear feature learning. Machine learning, 75(3):327 350, 2009. Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1 32, 2016. Dimitri Meunier and Pierre Alquier. Meta-strategy for learning tuning parameters with guarantees. Entropy, 23(10):1257, 2021. Robert Mifflin. Semismooth and semiconvex functions in constrained optimization. SIAM Journal on Control and Optimization, 15(6):959 972, 1977. Konstantin Mishchenko, Slavomir Hanzely, and Peter Richtárik. Convergence of first-order algorithms for meta-learning with moreau envelopes. ar Xiv preprint ar Xiv:2301.06806, 2023. Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25:161 193, 2006. Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. ar Xiv preprint ar Xiv:1803.02999, 2018. Anastasia Pentina and Christoph Lampert. A pac-bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991 999. PMLR, 2014. Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in neural information processing systems, 32, 2019. Yao-Feng Ren and Han-Ying Liang. On the best constant in marcinkiewicz zygmund inequality. Statistics & probability letters, 53(3):227 233, 2001. Arezou Rezazadeh. A unified view on pac-bayes bounds for meta-learning. In International Conference on Machine Learning, pages 18576 18595. PMLR, 2022. Arezou Rezazadeh, Sharu Theresa Jose, Giuseppe Durisi, and Osvaldo Simeone. 
Conditional mutual information-based generalization bound for meta learning. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1176 1181. IEEE, 2021. Dominic Richards and Mike Rabbat. Learning with gradient descent and weakly convex losses. In International Conference on Artificial Intelligence and Statistics, pages 1990 1998. PMLR, 2021. Charles Riou, Pierre Alquier, and Badr-Eddine Chérief-Abdellatif. Bayes meets bernstein at the meta level: an analysis of fast rates in meta-learning with pac-bayes. ar Xiv preprint ar Xiv:2302.11709, 2023. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, and Andreas Krause. Pacoh: Bayes-optimal meta-learning with pac-guarantees. In International Conference on Machine Learning, pages 9116 9126. PMLR, 2021. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635 2670, 2010. Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017. Zhenyu Sun, Xiaochun Niu, and Ermin Wei. Understanding generalization of federated learning via stability: Heterogeneity matters. In International Conference on Artificial Intelligence and Statistics, pages 676 684. PMLR, 2024. Nilesh Tripuraneni, Michael Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. Advances in neural information processing systems, 33:7852 7862, 2020. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016. Chunyang Wang, Yanmin Zhu, Haobing Liu, Tianzi Zang, Jiadi Yu, and Feilong Tang. Deep meta-learning in recommendation systems: A survey. ar Xiv preprint ar Xiv:2206.04415, 2022. Ren Wang, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Chuang Gan, and Meng Wang. On fast adversarial robustness adaptation in model-agnostic meta-learning. ar Xiv preprint ar Xiv:2102.10454, 2021. Jiancong Xiao, Yanbo Fan, Ruoyu Sun, Jue Wang, and Zhi-Quan Luo. Stability analysis and generalization bounds of adversarial training. Advances in Neural Information Processing Systems, 35:15446 15459, 2022. Yue Xing, Qifan Song, and Guang Cheng. On the algorithmic stability of adversarial training. Advances in neural information processing systems, 34:26523 26535, 2021. Chengxiang Yin, Jian Tang, Zhiyuan Xu, and Yanzhi Wang. Adversarial meta-learning. ar Xiv preprint ar Xiv:1806.03316, 2018. Hossein Zakerinia, Amin Behjati, and Christoph H Lampert. More flexible pac-bayesian metalearning by learning learning algorithms. ar Xiv preprint ar Xiv:2402.04054, 2024. Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, and Jiashi Feng. Efficient meta learning via minibatch proximal update. Advances in Neural Information Processing Systems, 32, 2019. Xinyu Zhou and Raef Bassily. Task-level differentially private meta learning. Advances in Neural Information Processing Systems, 35:20947 20959, 2022. Yi Zhou, Yingbin Liang, and Huishuai Zhang. Understanding generalization error of sgd in nonconvex optimization. Machine Learning, pages 1 31, 2022. Supplementary Material A Experiments In this section, we conduct a simple experiment to empirically verify our generalization bounds. Setting. 
Following the experimental setting in Nichol et al. [2018] and Zhou et al. [2019], we consider a synthetic one-dimensional sine wave regression problem. The goal is to approximate the distribution of parameters of function f(x; α, β) = α sin(x + β). The task environment µ is a joint distribution D(α, β) of the parameters α and β. We take D(α, β) to be a product distribution of D(α) = U([ 5, 5]), D(β) = U([0, π]). We generate the meta-sample by first sampling m training tasks, i.e., m pairs of (α, β) sampled independently from D(α, β). For each of these m tasks, we sample n = 10 points, x1, . . . , x10 uniformly on [ 5, 5] and label them as yi = f(xi; α, β). Similarly, at test time we generate a new task from the task distribution and generate a training sample of size n (by sampling x s uniformly on the interval [ 5, 5] and labeling them using f(x; α, β). We sample 1000 new tasks at test time. For each of the test task, we also generate an evaluation set of size 200, and use it to estimate the mean-squared error between the predictions of the learned model and the true labels. Our hypothesis class is a two layer network of width 40 and tanh( ) activation function. We run Algorithm 1 for T = 100 iterations with a step size of γ = 0.1 and regularization parameter λ = 0.5. Algorithm 2 (GD) is run for K = 15 iterations with step size η = 0.02. The experiment is conducted on a T4 GPU. Results. We report the transfer risk, the average empirical risk (over tasks), and the generalization gap for different values of m and n in Figure 1. In the plot on the left, we fix n = 10, and vary the number of tasks m from 10 to 5000. In the plot in the middle, we fix m = 1000, and change the number of samples n from 5 to 1000. In the plot on the right, we choose m = n, and scale m and n simultaneously from 10 to 1000. We observe that in all of these three scenarios, as m (and/or n) increase, both the generalization gap as well as the transfer risk decrease. Moreover, the generalization gap decreases at rates approximately O(1/m + 1/n + 1/ mn) for these three scenarios as suggested by our theoretical result. Figure 1: Error plots as a function of: (left) the number of tasks m for fixed n = 10; (middle) the number of training samples n on a test task for fixed m = 1K; (right) both m and n. B Missing Proofs of Section 3 The following Lemmas and Theorems are used for proving Theorem 3.1. Lemma B.1 (Bounded differences/Mc Diarmid s inequality). Consider a function f of independent random variables z1, . . . , zn that take their value in Z. Suppose that f satisfies the bounded differences property, namely, for any i = 1, . . . , n and any z1, . . . , zn, z i Z, it holds that f(z1, . . . , zn) f(z1, . . . , zi 1, z i, zi+1, . . . , zn) β. Then we have for any p 2, f(z1, . . . , zn) Ef(z1, . . . , zn) p 2 npβ. Theorem B.2 (Marcinkiewicz-Zygmund s inequality Ren and Liang [2001]). Let x1, . . . , xn be independent centered random variables with a finite p-th moment for p 2. Then Theorem B.3. Let Z = (z1, . . . , zn) be a vector of independent random variables each taking values in Z. Let Z = (Z1, . . . , Zm) be a vector of independent random vectors each taking values in Zn. Let gj,i : (Zn)m Zn R be some functions such that the following holds for any i [n], j [m]: (1). |E[gj,i(Z, Z)|Zj, zi]| M a.s., (2). E gj,i(Z, Z)|Z[m]\{j}, z[n]\{i} = 0 a.s., (3). gj,i has a bounded difference β w.r.t. all variables except the (j, i)-th variable. Then we have i=1 gj,i(Z, Z) mn β log (mn) + M mn. Proof of Theorem B.3. 
The proof is an extension of [Bousquet et al., 2020, Theorem 4]. Without loss of generality, we suppose that n = 2k, m = 2r. Otherwise, we can add extra functions that equal to zero. Consider a sequence of partitions C0, . . . , Ck with C0 = {{i} : i [n]}, Ck = {[n]}, and to get Cl from Cl+1 we split each subset into Cl+1 into two equal parts. We have C0 = {1} , . . . , 2k , C1 = {1, 2} , {3, 4} , . . . , 2k 1, 2k , Ck = 1, . . . , 2k . By construction, we have |Cl| = 2k l and |C| = 2l for each C Cl. For each i [n] and l = 0, . . . , k, denote by Cl(i) Cl the only set from Cl that contains i. In particular, C0(i) = {i} and Ck(i) = [n]. Similarly, we consider a sequence of partitions E0, . . . , Er with E0 = {{j} : j [m]}, Er = {[m]}, and to get Eq from Eq+1 we split each subset in Eq+1 into two equal parts. We have E0 = {{1} , . . . , {2r}} , E1 = {{1, 2} , {3, 4} , . . . , {2r 1, 2r}} , Er = {{1, . . . , 2r}} . By construction, we have |Eq| = 2r q and |E| = 2q for each E Eq. For each j [m] and q = 0, . . . , r, denote by Eq(j) Eq the only set from Eq that contains j. In particular, E0(j) = {j} and Er(j) = [m]. For each i [n], j [m] and every l = 0, . . . , k, q = 0, . . . , r, consider the random variables gq,l j,i = gq,l j,i(Zj, Z[m]\Eq(j), zi, z[n]\Cl(i)), i.e., conditioned on Zj, zi and all the vectors that are not in the same set as Zj in the partition Eq and all the variables that are not in the same set as zi in the partition Cl. In particular, g0,0 j,i = gj,i, gr,k j,i = E[gj,i|Zj, zi]. We can write a telescopic sum as follows: gj,i E[gj,i|Zj, zi] = q=0 gq,0 j,i gq+1,0 j,i + l=0 gr,l j,i gr,l+1 j,i , and the total sum of interest satisfies by the triangle inequality i=1 E [gj,i|Zj, zi] i=1 gq,0 j,i gq+1,0 j,i i=1 gr,l j,i gr,l+1 j,i Since |E[gj,i|Zj, zi]| M and E (E [gj,i|Zj, zi]) = 0, by applying Mc Diarmid inequality in Lemma B.1, we have i=1 E [gj,i|Zj, zi] We observe that gq+1,l+1 j,i (Zj, Z[m]\Eq+1(j), zi, z[n]\Cl+1(i)) = E h gq+1,l j,i (Zj, Z[m]\Eq+1(j), zi, z[n]\Cl(i))|zi, z[n]\Cl+1(i) i (The expectation is take w.r.t. the variable zs, s Cl+1(i)\Cl(i)) = E h gq,l+1 j,i (Zj, Z[m]\Eq(j), zi, z[n]\Cl+1(i))|Zj, Z[m]\Eq+1(j) i (The expectation is take w.r.t. the variable Zs, s Eq+1(j)\Eq(j)) As function gq,l j,i preserves the bounded differences property, if we apply Mc Diarmid s inequality conditioned on Zj, Z[m]\Eq+1(j), zi, z[n]\Cl+1(i), we obtain a uniform bound gq,0 j,i gq+1,0 j,i Zj, Z[m]\Eq+1(j), zi, z[n]\C0(i) 2 2q+1 β gr,l j,i gr,l+1 j,i Zj, Z[m]\Er(j), zi, z[n]\Cl+1(i) 2 as there are 2l indices in Cl+1(i)\Cl(i) and 2q indices in Eq+1(j)\Eq(j). Now we focus on P i C0 gq,0 j,i gq+1,0 j,i for Eq Eq and P i Cl gr,l j,i gr,l+1 j,i for Cl Cl, respectively. Since gq,0 j,i gq+1,0 j,i for j Eq, i C0 depends on Zj, Z[m]\Eq(j), zi, z[n]\C0(i), the terms are independent and zero mean conditioned on Z[m]\Eq(j). Applying Theorem B.2, we have i C0 gq,0 j,i gq+1,0 j,i gq,0 j,i gq+1,0 j,i 2 (Z[m]\Eq) Integrating with respect to (Z[m]\Eq) and using gq,0 j,i gq+1,0 j,i 2 2q+1 β, we have i C0 gq,0 j,i gq+1,0 j,i 2q+1 β = 12 Applying triangle inequality over all sets C0 C0, Eq Eq gives us that i [n] gq,0 j,i gq+1,0 j,i Eq Eq,C0 C0 j Eq,i C0 gq,0 j,i gq+1,0 j,i Similarly, gr,l j,i gr,l+1 j,i for j Er, i Cl depends on zi, z[n]\Cl+1(i), the terms are independent and zero mean conditioned on z[n]\Cl+1(i). 
Applying Theorem B.2, we have i Cl gr,l j,i gr,l+1 j,i 36 2l+r 1 2l+r X gr,l j,i gr,l+1 j,i 2 (z[n]\Cl) Integrating with respect to (z[n]\Cl) and using gr,l j,i gr,l+1 j,i 2 2l+1 β, we have i Cl gr,l j,i gr,l+1 j,i 2l+1 β = 12 2 2l+0.5r β. Applying triangle inequality over all sets Cl Cl, Er Er gives us that i [n] gr,l j,i gr,l+1 j,i Er Er,Cl Cl j Er,i Cl gr,l j,i gr,l+1 j,i 2 2l+0.5r β Recall that 2k < 2n, 2r < 2m due to the possible extension of the sample. Therefore we have i=1 gq,0 j,i gq+1,0 j,i i=1 gr,l j,i gr,l+1 j,i 2mn β( log (m) + log (n) ) mn β log (mn) Combined with Equation (2) get the required bound. We now restate and prove Theorem 3.1. Theorem 3.1. Consider a meta-learning problem for some M-bounded loss function ℓand task distribution µ. Let S be a meta-sample consisting of training samples on m tasks each of size n, and let S D be a sample of size n on a previously unseen task D µ. Then, for any β-uniformly meta-stable learning algorithm A, we have that with probability 1 δ, L(A(S), µ) L(A(S), S) + β log (mn) log (1/δ) + M p log (1/δ) /(mn). Proof of Theorem 3.1. In order to make use of Theorem B.3, we consider the following functions: gj,i = gj,i(Z, Z) = E(S j,z j) Dn+1 j ,Dj µE(S,z) Dn+1,D µℓ(A(S(j))(S), z) ℓ(A(S(j))(S(i) j ), zi j) By the definition of uniform meta stability, we can write the following decomposition: |mn (L(A(S), µ) L(A(S), S))| i=1 E(S,z) Dn+1,D µℓ(A(S)(S), z) ℓ(A(S)(Sj), zi j) i=1 E(S,z) Dn+1, D µ E(S j,z j) Dn+1 j , Dj µ ℓ(A(S)(S), z) ℓ(A(S(j))(S(i) j ), zi j) + ℓ(A(S(j))(S(i) j ), zi j) ℓ(A(S)(Sj), zi j) i=1 E(S,z) Dn+1, D µ E(S j,z j) Dn+1 j , Dj µ ℓ(A(S)(S), z) ℓ(A(S(j))(S(i) j ), zi j + mn β Moreover, we have E [gj,i|S1, . . . , Sj 1, Sj+1, . . . , Sm, z1, . . . , zi 1, zi+1, . . . , zn] = 0 and |gj,i| 2M a.s. for i [n], j [m]. Applying Theorem B.3 as well as [Bousquet and Elisseeff, 2002, Lemma 1] achieves the results. Theorem 3.2 can be directly proved by the definition of uniform meta-stability. Theorem 3.2. Let µ be an underlying task distribution. Given a meta-sample S, test task D µ, and S Dn, for any β -on-average-replace-one-meta-stable meta-learning algorithm A, we have that ES {Dn j } m j=1,{Dj}m j=1 µm[L(A(S), µ) L(A(S), S)] β. Proof of Theorem 3.2. Since S and z are both drawn i.i.d. from D, and S and S j are both drawn i.i.d. from µ, we have ES {Dn j} m j=1,{Dj}m j=1 µm ES Dn,D µL(A(S)(S), D) = ES {Dn j} m j=1,(S j) Dn j ,{Dj}m j=1 µm,j U[m]L(A(S(j))(Sj), Dj) = ES {Dn j} m j=1,(S j,z j) Dn+1 j ,{Dj}m j=1 µm,j U[m],i U[n]ℓ(A(S(j))(S(i) j ), zi j) ES {Dn j} m j=1,{Dj}m j=1 µm j=1 L(A(S)(Sj), Sj) = ES {Dn j} m j=1,{Dj}m j=1 µm,j U[m] [L(A(S)(Sj), Sj)] = ES {Dn j} m j=1,(S j,z j) Dn+1 j ,{Dj}m j=1 µm,j U[m],i U[n]ES Dnℓ(A(S)(Sj), zi j) As a result, we have ES {Dn j } m j=1,{Dj}m j=1 µm[L(A(S), µ) L(A(S), S)] = ES {Dn j} m j=1,(S j,z j) Dn+1 j ,{Dj}m j=1 µm,j U[m],i U[n] ℓ(A(S(j))(S(i) j ), zi j) ℓ(A(S)(Sj), zi j) β (By definition of β -on-average-replace-one-meta-stable) C Missing Proofs of Section 4.1 Lemma C.1 (Shalev-Shwartz and Ben-David [2014]). Given S and S(i), for a fixed w, define u(w, S) and u(w, S(i)) is achieved via Algo. 1 with Option 1 RERM. Then if ℓis convex, G-Lipschitz, we have sup S,i [n] u(w, S) u(w, S(i)) 4G λn. If ℓis convex and H-smooth (H λn 2 ), we have u(w, S) u(w, S(i)) ℓ(w, zi) + p Lemma 4.1. Assume that the loss function ℓis convex and G-Lipschitz loss. Let S, S(j) denote neighboring meta-samples and S, S(i) the neighboring samples on a test task. 
Then, the following holds for Algorithm 1 with RERM for task-specific learning (i.e., Option 1 for Algorithm 2) T 1, sup S,S,j [m],i [n] A(S)(S) A(S(j))(S(i)) G Further, if ℓis convex, M-bounded and H-smooth, then setting λ H, γ 1 λ, we have T 1, sup S,S,j [m],i [n] A(S)(S) A(S(j))(S(i)) 2 2HM 2λn H + n 2λn H 4 2HM (m + 1) . Proof of Lemma 4.1. We slightly abuse the notation, at iteration t, define wt = A(S), w t = A(S(j)). Given w T +1, define u(w T +1, S) = A(S)(S), u(w T +1, S(i)) = A(S(j))(S(i)). We first consider the setting where the loss ℓis convex, G-Lipschitz. Recall that FS(u, w) = L(u, S) + λ 2 u w 2. If ℓis convex, then FS(u, w) is λ-strongly-convex w.r.t u. Define u(w, S) = argminu W FS(u, w), u(w , S) = argminu W FS(u, w ). We have the following: FS(u(w , S), w) FS(u(w, S), w) λ u(w, S) u(w , S) 2 FS(u(w, S), w ) FS(u(w , S), w ) λ u(w, S) u(w , S) 2 Sum the above gives us that 2λ u(w, S) u(w , S) 2 FS(u(w , S), w) FS(u(w, S), w) + FS(u(w, S), w ) FS(u(w , S), w ) u(w , S) w 2 u(w, S) w 2 + u(w, S) w 2 u(w , S) w 2 = λ u(w, S) u(w , S), w w λ u(w, S) u(w , S) w w This gives us that u(w, S) u(w , S) 1 2 w w . (3) Similarly, define u(w , S ) = argminu W FS (u, w ), we have FS(u(w , S ), w) FS(u(w, S), w) λ u(w, S) u(w , S ) 2 FS (u(w, S), w ) FS (u(w , S ), w ) λ u(w, S) u(w , S ) 2 Sum the above gives us that 2λ u(w, S) u(w , S ) 2 FS(u(w , S ), w) FS(u(w, S), w) + FS (u(w, S), w ) FS (u(w , S ), w ) = L(u(w , S ), S) L(u(w, S), S) + L(u(w, S), S ) L(u(w , S ), S ) u(w , S ) w 2 u(w, S) w 2 + u(w, S) w 2 u(w , S ) w 2 2G u(w, S) u(w , S ) + λ u(w, S) u(w , S ), w w (ℓis G-Lipschitz) 2G u(w, S) u(w , S ) + λ u(w, S) u(w , S ) w w This gives us that u(w, S) u(w , S ) 1 Finally, at iteration t, we have j=1 u(wt, Sj) j=1 u(w t, S j) (Projection is non-expansive) (1 γλ)(wt w t) + γλ 1 u(wt, Sj) u(w t, S j) (1 γλ) wt w t + γλ m 1 m 1 2 wt w t + 1 2 wt w t + G (Equation (3), (4)) 2 ) wt w t + γG m Choose γ 1 λ, t. Rearrange gives us that wt+1 w t+1 (1 γλ/2)t+1 wt w t (1 γλ/2)t + γG (1 γλ/2)t+1 Note that at initialization when t = 1 we have w1 w 1 = 0. Telescoping from t = 1 to T + 1 w T +1 w T +1 (1 γλ/2)T γG (1 γλ/2)t+1 Calculate gives us that w T +1 w T +1 2G Similarly, define u(w T +1, S) = argminu W FS(u, w T +1), u(w T +1, S(i)) = argminu W FS(i)(u, w T ), we have FS(u(w T +1, S(i)), w T +1) FS(u(w T +1, S), w T +1) λ u(w T +1, S) u(w T +1, S(i)) 2 FS(i)(u(w T +1, S), w T +1) FS(i)(u(w T +1, S(i)), w T +1) λ u(w T +1, S) u(w T +1, S(i)) 2 Sum the above gives us that 2λ u(w T +1, S) u(w T +1, S(i)) 2 FS(u(w T +1, S(i)), w T +1) FS(u(w T +1, S), w T +1) + FS(i)(u(w T +1, S), w T +1) FS(i)(u(w T +1, S(i)), w T +1) = L(u(w T +1, S(i)), S) L(u(w T +1, S), S) + L(u(w T +1, S), S(i)) L(u(w T +1, S(i)), S(i)) u(w T +1, S(i)) w T +1 2 u(w T +1, S) w T +1 2 + u(w T +1, S) w T +1 2 u(w T +1, S(i)) w T +1 2 ! u(w T +1, S) u(w T +1, S(i)) + λ D u(w T +1, S(i)) u(w T +1, S(i)), w T +1 w T +1 E (ℓis G-Lipschitz) u(w T +1, S) u(w T +1, S(i)) + λ u(w T +1, S) u(w T +1, S(i)) w T +1 w T +1 This gives us that u(w T +1, S) u(w T +1, S(i)) 1 w T +1 w T +1 + 2G We now consider the surrogate loss ℓis convex, non-negative and H-smooth. Note that such loss is also self-bounded. 
From a similar argument, we have u(w, S) u(w , S) 1 2λ u(w, S) u(w , S ) 2 FS(u(w , S ), w) FS(u(w, S), w) + FS (u(w, S), w ) FS (u(w , S ), w ) = L(u(w , S ), S) L(u(w, S), S) + L(u(w, S), S ) L(u(w , S ), S ) u(w , S ) w 2 u(w, S) w 2 + u(w, S) w 2 u(w , S ) w 2 ( L(u(w, S), S) + L(u(w , S ), S ) ) u(w , S ) u(w, S) +H u(w , S ) u(w, S) 2 (ℓis H-smooth) + λ u(w, S) u(w , S ), w w 2HL(u(w, S), S) + p 2HL(u(w , S ), S ) u(w , S ) u(w, S) + H u(w , S ) u(w, S) 2 + λ u(w, S) u(w , S) w w (ℓis H-smooth) which is equivalent as u(w, S) u(w , S ) 2HL(u(w, S), S) + p 2HL(u(w , S ), S ) + λ w w L(u(w, S), S) + p L(u(w , S ), S ) + w w (6) Finally, at iteration t, we have wt+1 w t+1 j=1 u(wt, Sj) j=1 u(w t, S j) (Projection is non-expansive) (1 γλ)(wt w t) + γλ 1 u(wt, Sj) u(w t, S j) (1 γλ) wt w t + γλ m 1 2 wt w t L(u(wt, Sj), Sj) + q L(u(w t, S j), S j) + wt w t (Equation (3), (6)) 2m γλ wt w t + γ L(u(wt, Sj), Sj) + q L(u(w t, S j), S j) Telescope gives us that w T +1 w T +1 γ 2m γλ T t q L(u(wt, Sj), Sj)+ q L(u(w t, S j), S j) 2HM λ(m + 1), (7) where the last line holds if we consider M-bounded loss. Otherwise, we have w T +1 w T +1 2 2H λ(m + 1) max t [T ] L(u(wt, Sj), Sj) + r max t [T ] L(u(w t, S j), S j) Therefore, we have 2λ u(w, S) u(w , S(i)) 2 FS(u(w , S(i)), w) FS(u(w, S), w) + FS(i)(u(w, S), w ) FS(i)(u(w , S(i)), w ) = L(u(w , S(i)), S) L(u(w, S), S) + L(u(w, S), S(i)) L(u(w , S(i)), S(i)) u(w , S(i)) w 2 u(w, S) w 2 + u(w, S) w 2 u(w , S(i)) w 2 ℓ(u(w , S(i)), zi) ℓ(u(w , S(i)), z ) + ℓ(u(w , S), z ) ℓ(u(w , S), zi) + λ D u(w, S) u(w , S(i)), w w E 2Hℓ(u(w, S), zi) + q 2Hℓ(u(w , S(i)), z ) u(w , S(i)) u(w, S) u(w , S(i)) u(w, S) 2 + λ u(w, S) u(w , S) w w (ℓis H-smooth) Rearrange gives us, u(w, S) u(w , S(i)) 1 2λn H 2Hℓ(u(w, S), zi) + q 2Hℓ(u(w , S(i)), z ) + λn 2λn H w w (9) Plug in w T +1 and w T +1 gives us that u(w T +1, S) u(w T +1, S(i)) 2 2HM 2λn H + n 2λn H 4 2HM (m + 1) If we apply Lemma 4.1 and Lemma C.1 with Theorem 2.2, we have the following theorem. Theorem C.2. The following holds for Algorithm 1 with step-size γ 1 λ on a given meta-sample S, and RERM for task-specific learning (i.e., Option 1 for Algorithm 2), for all T 1: 1. For convex, M-bounded, and G-Lipschitz loss functions, with probability at least 1 δ L(A(S), µ) L(A(S), S) + G2 λm log (m) log (1/δ) + M m log (1/δ) + G2 2. For convex, M-bounded, and H-smooth loss functions (H λ), with probability at least 1 δ L(A(S), µ) L(A(S), S) + HM (m + 1)λ log (m) log (1/δ) + M m log (1/δ) + HM Proof of Theorem C.2. We slightly abuse the notation, at iteration t, define wt = A(S), w t = A(S(j)), u(w T +1, S) = A(S)(S), u(w T +1, S(i)) = A(S(j))(S(i)). Apply Lemma 4.1 gives us that L(A(S)(S), S) L(A(S(j))(S), S) = L(u(w T +1, S), S) L(u(w T +1, S), S) G u(w T +1, S) u(w T +1, S) (G-Lipschitz) w T +1 w T +1 (Equation (3)) λm Apply Lemma C.1 gives us that ℓ(A(S)(S), z) ℓ(A(S)(S(i)), z) = ℓ(u(w T +1, S), z) ℓ(u(w T +1, S(i)), z) G u(w T +1, S) u(w T +1, S(i)) λn Apply Theorem 2.2 with β = G2 λm, β = 4G2 λn achieves the results. 
Similarly, if the loss is M-bounded, convex, non-negative and H-smooth, we have L(A(S)(S), S) L(A(S(j))(S), S) = L(u(w T +1, S), S) L(u(w T +1, S), S) 2HL(u(w T +1, S), S) u(w T +1, S) u(w T +1, S) + H u(w T +1, S) u(w T +1, S) 2 (ℓis H-smooth) w T +1 w T +1 + H w T +1 w T +1 2 (Equation (7)) 4HM (m + 1)λ + 4H2M (m + 1)2λ2 8HM (m + 1)λ (λ H) Apply Lemma C.1 gives us that ℓ(A(S)(S), z) ℓ(A(S)(S(i)), z) = ℓ(u(w T +1, S), z) ℓ(u(w T +1, S(i)), z) 2Hℓ(u(w T +1, S), z) u(w T +1, S) u(w T +1, S(i)) + H u(w T +1, S) u(w T +1, S(i)) 2 Apply Theorem 2.2 with β = 8HM (m+1)λ, β = 24HM λn achieves the results. Theorem 4.2. The following holds for Algorithm 1 with step-size γ 1 λ on a given meta-sample S, and RERM for task-specific learning (i.e., Option 1 for Algorithm 2), for all T 1: 1. For convex, M-bounded, and G-Lipschitz loss functions, with probability at least 1 δ L(A(S), µ) L(A(S), S) + G2 log (mn) log (1/δ) + M p log (1/δ) mn . 2. For convex, M-bounded, and H-smooth loss functions (H λ), with probability at least 1 δ L(A(S), µ) L(A(S), S)+ HM (2n 1)λ + HM (m + 1)λ log (mn) log (1/δ)+ M p log (1/δ) mn . Proof of Theorem 4.2. We slightly abuse the notation, at iteration t, define wt = A(S), w t = A(S(j)), u(w T +1, S) = A(S)(S), u(w T +1, S(i)) = A(S(j))(S(i)). For ℓto be convex and GLipschitz, applying Lemma 4.1 gives us that ℓ(A(S)(S), z) ℓ(A(S(j))(S(i)), z) = ℓ(u(w T +1, S), z) ℓ(u(w T +1, S(i)), z) G u(w T +1, S) u(w T +1, S(i)) Further apply Theorem 3.1 gives us the result. For ℓto be convex, M-bounded and H-smooth, we have ℓ(A(S)(S), z) ℓ(A(S(j))(S(i)), z) = ℓ(u(w T +1, S), z) ℓ(u(w T +1, S(i)), z) 2Hℓ(u(w T +1, S), z) u(w T +1, S) u(w T +1, S(i)) + H u(w T +1, S) u(w T +1, S(i)) 2 2HM 2λn H + n 2λn H 4 2HM (m + 1) 2HM 2λn H + n 2λn H 4 2HM (m + 1) 4HM (2n 1)λ + 8HM (m + 1)λ + 8H2M (2n 1)2λ2 + 16H2M (m + 1)2λ2 12HM (2n 1)λ + 24HM (m + 1)λ (H λ) Apply Theorem 3.1 gives the results. Lemma 4.3. Assume that the loss function is convex, G-Lipschitz and H-smooth. Let S, S(j) denote neighboring meta-samples and S, S(i) the neighboring samples on a test task. Then the following holds for Algorithm 1 with GD for task-specific learning (i.e., Option 2 for Algorithm 2) with η 2 H+2λ, for all T 1 as long as we set γ 1 λT , sup S,S,j [m],i [n] A(S)(S) A(S(j))(S(i)) 4e G Proof of Lemma 4.3. We slightly abuse the notation, at iteration t, define wt = A(S), w t = A(S(j)), u(w T +1, S) = A(S)(S), u(w T +1, S(i)) = A(S(j))(S(i)). Recall that FS(u, w) = L(u, S) + λ 2 u w 2. If ℓis convex, then FS(u, w) is λ-strongly-convex w.r.t u. 
If ℓis H-smooth, then L(u, S) L(v, S), u v 1 H L(u, S) L(v, S) 2 Given S, S , for any w, w , we have u(k+1)(w, S) u(k+1)(w , S ) u(k)(w, S) u(k)(w , S ) η L(u(k)(w, S), S) + λ u(k)(w, S) w (Projection is non-expansive) L(u(k)(w , S ), S ) λ u(k)(w , S ) w (1 ηλ) u(k)(w, S) u(k)(w , S ) + ηλ w w + 2ηG Given S, for any w, w , we have u(k+1)(w, S) u(k+1)(w , S) u(k)(w, S) u(k)(w , S) (Projection is non-expansive) η L(u(k)(w, S), S)+λ u(k)(w, S) w L(u(k)(w , S), S) λ u(k)(w , S) w u(k)(w, S) u(k)(w , S) + λη(w w ) η L(u(k)(w, S), S) L(u(k)(w , S), S)+λ u(k)(w, S) u(k)(w , S) u(k)(w, S) u(k)(w , S) η L(u(k)(w, S), S) L(u(k)(w , S), S)+λ u(k)(w, S) u(k)(w , S) (1 λη)2 u(k)(w, S) u(k)(w , S) 2 + η2 2η(1 ηλ) L(u(k)(w, S), S) L(u(k)(w , S), S) 2 !1/2 (L is H-smooth, η 2 H+2λ) (1 λη) u(k)(w, S) u(k)(w , S) + λη w w (10) Combine the above two cases gives us that u(k+1)(w, Sj) u(k+1)(w , S j) u(k)(w, Sj) u(k)(w, S j) + 1 u(k)(w, Si) u(k)(w , S i) u(k)(w, Sj) u(k)(w , S j) + λη w w + 2ηG Given wt, w t, when k = 1, u(1)(wt, S j) = wt, u(1)(w t, Sj) = u(1)(w t, S j) = w t. Telescoping gives us that u(k)(wt, Sj) u(k)(w t, S j) 1 + (1 λη)k 1 wt w t + 2G 2 wt w t + 2G Finally, at iteration t, we have wt+1 w t+1 k=1 u(k)(wt, Sj) k=1 u(k)(w t, S j) (Projection is non-expansive) (1 γλ)(wt w t) + γλ 1 u(k)(wt, Sj) u(k)(w t, S j) (1 γλ) wt w t + γλ 2 wt w t + 2G (Equation (11)) = (1 + γλ) wt w t + 2γG Note that w1 = w 1 = 0. Choosing γ 1 λT and telescoping gives us that w T +1 w T +1 1 + 1 wt w t + 2G where the inequality holds because (1 + 1 T )T e. Similarly, we have u(k+1)(w, S) u(k+1)(w , S(i)) u(k)(w, S) u(k)(w , S(i)) η L(u(k)(w, S), S) + λ u(k)(w, S) w (Projection is non-expansive) L(u(k)(w , S(i)), S(i)) λ u(k)(w , S(i)) w (1 ηλ) u(k)(w, S) u(k)(w , S(i)) + η L(u(k)(w, S), S) L(u(k)(w , S(i)), S) + η L(u(k)(w , S(i)), S) L(u(k)(w , S(i)), S(i)) ηλ w w + 2Gη (1 ηλ) u(k)(w, S) u(k)(w , S(i)) + η L(u(k)(w, S), S) L(u(k)(w , S(i)), S) ηλ w w + 2Gη (1 λη)2 u(k)(w, S) u(k)(w , S(i)) 2 + η2 2η(1 ηλ) L(u(k)(w, S), S) L(u(k)(w , S), S(i)) 2 !1/2 (L is H-smooth, η 2 H+2λ) (1 λη) u(k)(w, S) u(k)(w , S(i)) + ηλ w w + 2Gη Therefore we have k [K 1], u(k+1)(w T +1, S) u(k+1)(w T +1, S(i)) 1 + (1 λη)k 1 w T +1 w T +1 + 2G As u(1)(w T +1, S) = w T +1, u(1)(w T +1, S) = w T +1, plug in Equation (12), we have k=1 u(k)(w T +1, S) 1 k=1 u(k)(w T +1, S(i)) min 2, 1 + 1 ληK w T +1 w T +1 + 2G min 2, 1 + 1 ληK The following theorem can be derived via Theorem 2.2 and Lemma 4.3. Theorem C.3. Consider a meta-learning problem with convex, M-bounded, G-Lipschitz and Hsmooth loss function. Then, after T iterations of Algorithm 1 with γ 1 λT on a given meta-sample S, and GD for task-specific learning (i.e., Option 2 for Algorithm 2) with η 2 H+2λ, we have with probability at least 1 δ, L(A(S), µ) L(A(S), S) + G2 λm log (m) log (1/δ) + M m log (1/δ) + G2 Proof of Theorem C.3. We slightly abuse the notation, at iteration t, define wt = A(S), w t = A(S(j)), u(K)(w T +1, S) = A(S)(S), u(K)(w T +1, S(i)) = A(S(j))(S(i)). If the loss is M-bounded, convex and G-Lipschitz, apply Equation (10) gives us L(A(S)(S), S) L(A(S(j))(S), S) = L(u(K)(w T +1, S), S) L(u(K)(w T +1, S), S) G u(K)(w T +1, S) u(K)(w T +1, S) (G-Lipschitz) G (1 λη) u(K)(w T +1, S) u(K)(w T +1, S) + λη w T +1 w T +1 (Equation (10)) G w T +1 w T +1 λm (Equation (12)) Given S, S(i). 
For any w, by Equation (13), for all k [K 1], we have u(k+1)(w, S) u(k+1)(w, S(i)) 2Gη n + (1 λη) u(k)(w, S) u(k)(w, S(i)) 2G λn (Telescope) Therefore, we have ℓ(A(S)(S), z) ℓ(A(S)(S(i)), z) k=1 u(k)(w T +1, S), z k=1 u(k)(w T +1, S(i)), z k=1 u(k)(w T +1, S) 1 k=1 u(k)(w T +1, S(i)) Apply Theorem 2.2 with β = 2e G2 λm , with β = 2G2 λn gives us the result. Theorem 4.4. Assume that the loss function is convex, M-bounded, G-Lipschitz and H-smooth. Suppose we run Algorithm 1 for T iterations with γ 1 λT on a given meta-sample S, and GD for task-specific learning (Option 2, Algorithm 2) with η 2 H+2λ. Then, with probability at least 1 δ, L(A(S), µ) L(A(S), S)+ G2 log (mn) log (1/δ)+ M p log (1/δ) mn . Proof of Theorem 4.4. We denote wt = A(S), w t = A(S(j)), u(w T +1, S) = A(S)(S), u(w T +1, S(i)) = A(S)(S(i)). For ℓto be convex and G-Lipschitz, applying Lemma 4.3 gives us that ℓ(A(S)(S), z) ℓ(A(S(j))(S(i)), z) k=1 u(k)(w T +1, S), z k=1 u(w T +1, S(i)), z k=1 u(k)(w T +1, S), z) 1 k=1 u(k)(w T +1, S(i)) min 2, 1 + 1 ληK Further apply Theorem 3.1 gives us the result. D Missing Proofs of Section 4.2 We start with a proposition that provide some equivalent characterizations of weak convexity. Proposition D.1 (Proposition 2.1 in Davis and Grimmer [2019]). Suppose f : Rd R { } is a closed function and ρ > 0, then the following are equivalent: 1. For any w1 Rd, f( ) + ρ 2 w1 is convex. 2. For any w1, w2 Rd with g(w1) f(w1), we have f(w2) f(w1) + g(w1), w2 w1 ρ 2 w2 w1 2 . 3. For any w1, w2 Rd and λ > 0, f(λw1 + (1 λ)w2) λf(w1) + (1 λ)f(w2) + ρλ(1 λ) 2 w1 w2 2 . Lemma D.2. [Bassily et al. [2020]] Given S and S(i), for a fixed w, consider u(w, S) and u(w, S(i)) are achieved via Algo. 1 with gradient descent for K iterations. Then if ℓis convex and G-Lipschitz, we have supi [n] 1 K PK k=1 u(k)(w, S) 1 K PK k=1 u(k)(w, S(i)) 4GKη Below we provide our key Lemma D.3 for the stability analysis. Lemma D.3. Consider a meta-learning problem with ρ-weakly convex and G-Lipschitz loss function. Let S, S(j) denote neighboring meta-samples and S, S(i) the neighboring samples on a test task. Then, after T iterations of Algorithm 1 with γ 1 λT , λ 2ρ, and GD for task-specific learning (i.e., Option 2 for Algorithm 2) with η 1 sup S,j [m] w T +1 w T +1 2e G rη Proof of Lemma D.3. If ℓis ρ-weakly convex, then from Proposition D.1 we have that ℓ(u) ℓ(v), u v ρ u v 2 , u, v Rd Given S and S , for any w and w , we have k [K 1], u(k+1)(w, S) u(k+1)(w , S ) u(k)(w, S) u(k)(w , S ) η L(u(k)(w, S), S) + λ u(k)(w, S) w (Projection is non-expansive) L(u(k)(w , S ), S) λ u(k)(w , S ) w ! = (1 ηλ) u(k)(w, S) u(k)(w , S ) + ηλ w w + 2ηG 1 + (1 ηλ)k w w + 2G λ (Telescope, u(1)(w, S) = w, u(1)(w , S) = w .) And therefore 1 K k=1 u(k)(w, S) 1 k=1 u(k)(w , S ) min 2, 1 + 1 ληK We now focus on the situation where we give S and S with a fix w. For simplicity, we define δk = u(k)(wt, S) u(k)(w t, S) . Note that δ1 = wt w t . We have δk+1 = u(k+1)(wt, S) u(k+1)(w t, S) u(k)(wt, S) u(k)(w t, S) η L(u(k)(wt, S), S)+λ u(k)(wt, S) wt (Projection is non-expansive) L(u(k)(w t, S), S) λ u(k)(w t, S) w t ! λη wt w t + (1 ηλ) u(k)(wt, S) u(k)(w t, S) η L(u(k)(wt, S), S) L(u(k)(w t, S), S) = λη wt w t + k where we define (1 ηλ) u(k)(wt, S) u(k)(w t, S) η L(u(k)(wt, S), S) L(u(k)(w t, S), S) . 
We have that 2 k = (1 ηλ)2δ2 k + 4η2G2 2η(1 ηλ) D u(k)(wt, S) u(k)(w t, S), L(u(k)(wt, S), S) L(u(k)(w t, S), S) E (1 ηλ)2δ2 k + 4η2G2 + 2η(1 ηλ)ρδ2 k (ℓis ρ-weakly convex) (1 ηλ)δ2 k + 4η2G2 (η 1 Therefore, we have δ2 k+1 + λ2η2 wt w t 2 2ληδk+1 wt w t 2 k (1 ηλ)δ2 k+4η2G2 (14) Rearrange it gives us that δ2 k+1 (1 ηλ)k+1 + λ2η2 wt w t 2 (1 ηλ)k+1 2λη wt w t δk+1 (1 ηλ)k+1 δ2 k (1 ηλ)k + 4η2G2 Telescoping from k = 1 to K gives us that δ2 K+1 (1 ηλ)K+1 + λ2η2 wt w t 2 2λη wt w t δk+1 (1 ηλ)k+1 + δ2 K+1 + λ2η2 wt w t 2 K X k=1 (1 ηλ)K k 2λη wt w t δK+1 λ + 2λη wt w t k=1 δk+1(1 ηλ)K k λ + 2λη(1 ηλ) wt w t k=1 δk(1 ηλ)K k (15) Now we start proving the following bound by induction: δK 2 wt w t + 2G rη This claim holds when k = 1. For the inductive step, we assume it holds for some k [K] and prove the result for k + 1. We consider the following two cases. If δk+1 maxs [k] δk, induction automatically holds. Otherwise, δk+1 > maxs [k] δs. Applying Equation (15) gives us that δ2 k+1 + λ2η2 wt w t 2 k X j=1 (1 ηλ)k j 2λη wt w t δk+1 λ + 2λη(1 ηλ) wt w t j=1 δk(1 ηλ)k j λ + 2λη(1 ηλ) wt w t δk+1 j=1 (1 ηλ)k j which is equivalent to δ2 k+1 + λ2η2 wt w t 2 k X j=1 (1 ηλ)k j λ + 2λη wt w t δk+1 j=1 (1 ηλ)k j Rearrange gives us that δk+1 λη wt w t j=1 (1 ηλ)k j j=1 (1 ηλ)k j λη wt w t 2 1 (1 ηλ)k+1 Therefore, we have k [K 1] u(k+1)(wt, S) u(k+1)(w t, S) j=1 (1 ηλ)k j 2 wt w t + 2G rη And therefore 1 K k=1 u(k)(wt, S) 1 k=1 u(k)(w t, S) 2 wt w t + 2G rη As a result, wt+1 w t+1 k=1 u(k)(wt, Sj) k=1 u(k)(w t, S j) (Projection is non-expansive) = (1 γλ) wt w t + γλ k=1 u(k)(wt, Sj) 1 k=1 u(k)(w t, S j) (1 γλ) wt w t + m 1 m γλ 2 wt w t + 2G rη 2 wt w t + 2G (1 + γλ) wt w t + 2Gγ p m Telescoping gives us that w T +1 w T +1 (1 + γλ)T 2G rη Choosing γ 1 λT gives us that w T +1 w T +1 (1 + 1 We remark that if we consider convex and non-smooth loss function by setting ρ = 0, then follow a similar argument, Equation (14) can be replaced by δ2 k+1 + λ2η2 wt w t 2 2ληδk+1 wt w t 2 k (1 ηλ)2δ2 k+4η2G2 And therefore Equation (16) can be replaced by u(k+1)(wt, S) u(k+1)(w t, S) 1 + (1 ηλ)2 k X j=1 (1 ηλ)2k 2j 2 2 ηλ wt w t + 2G rη 2 wt w t + 2G rη and the rest follows. Theorem D.4. Consider a meta-learning problem with ρ-weakly convex, M-bounded, G-Lipschitz loss function. Then, after T iterations of Algorithm 1 with γ 1 λT , λ 2ρ, and GD for task-specific learning (i.e., Option 2 for Algorithm 2) with η 1 λ, we have with probability at least 1 δ, L(A(S), µ) L(A(S), S)+ G2 rη log (m) log (1/δ)+ M m log (1/δ)+ G2 By setting η = 1 λK , we have L(A(S), µ) L(A(S), S)+ G2 log (m) log (1/δ)+ M m log (1/δ)+ G2 Proof of Theorem D.4. If the loss is M-bounded and G-Lipschitz, apply Lemma D.3 gives us L(A(S)(S), S) L(A(S(j))(S), S) k=1 u(k)(w T +1, S), S k=1 u(k)(w T +1, S), S k=1 u(k)(w T +1, S) 1 k=1 u(k)(w T +1, S) (G-Lipschitz) 2G w T +1 w T +1 + 2G2 rη λ (Equation (16)) 4e G2 + 2G2 rη λm (Lemma D.3) 4e G2 + 2G2 1 λ λm (Set η 1 λK ) On the other hand, applying Lemma D.2 gives us that ℓ(A(S)(S), z) ℓ(A(S)(S(i)), z) = ℓ(u(K+1)(w T +1, S), z) ℓ(u(K+1)(w T +1, S(i)), z) G u(K+1)(w T , S) u(K+1)(w T , S(i)) K (Set η 1 λK ) Plug back into Theorem 2.2 gives the result. Lemma 4.5. Assume that the loss function is ρ-weakly convex and G-Lipschitz. Let S, S(j) denote neighboring meta-samples and S, S(i) the neighboring samples on a test task. 
Then the following holds for Algorithm 1 with λ 2ρ, and GD for task-specific learning (i.e., Option 2 for Algorithm 2) with η 1 λ, for all T 1 as long as we set γ 1 λT , sup S,S,j [m],i [n] A(S)(S) A(S(j))(S(i)) (8e G + 2G) rη Proof of Lemma 4.5. We slightly abuse the notation, at outer iteration t, define wt = A(S), w t = A(S(j)). Given wt, at inner iteration k, define u(k)(wt, S) = A(S)(S), u(k)(w t, S(i)) = A(S(j))(S(i)). We now provide the upper bound on 1 K PK k=1 u(k)(w T +1, S) 1 K PK k=1 u(k)(w T +1, S(i)) . Recall that if ℓis ρ-weakly convex, then we have ℓ(u) ℓ(v), u v ρ u v 2 We apply a similar procedure as Lemma D.3. For simplicity, we define δk = u(k)(wt, S) u(k)(w t, S(i)) . Note that δ1 = wt w t . We have δk+1 = u(k+1)(wt, S) u(k+1)(w t, S(i)) u(k)(wt, S) u(k)(w t, S(i)) η L(u(k)(wt, S), S) + λ u(k)(wt, S) wt L(u(k)(w t, S(i)), S(i)) λ u(k)(w t, S(i)) w t ! λη wt w t + (1 ηλ) u(k)(wt, S) u(k)(w t, S(i)) η L(u(k)(wt, S), S) L(u(k)(w t, S(i)), S(i)) = λη wt w t + k where we define (1 ηλ) u(k)(wt, S) u(k)(w t, S(i)) η L(u(k)(wt, S), S) L(u(k)(w t, S(i)), S(i)) . We have that 2 k = (1 ηλ)2δ2 k + 4η2G2 2η(1 ηλ) D u(k)(w T +1, S) u(k)(w T +1, S(i)), L(u(k)(w T +1, S), S) L(u(k)(w T +1, S(i)), S(i)) E = (1 ηλ)2δ2 k + 4η2G2 2η(1 ηλ) D u(k)(w T +1, S) u(k)(w T +1, S(i)), L(u(k)(w T +1, S), S) L(u(k)(w T +1, S(i)), S) E D u(k)(w T +1, S) u(k)(w T +1, S(i)), ℓ(u(k)(w T +1, S(i)), zi) ℓ(u(k)(w T +1, S(i)), z ) E (1 ηλ)2δ2 k + 4η2G2 + 2η(1 ηλ)ρδ2 k + 4Gη(1 ηλ) (1 ηλ)δ2 k + 4η2G2 + 4Gη(1 ηλ) Therefore, we have δ2 k+1 + λ2η2 wt w t 2 2ληδk+1 wt w t 2 k (1 ηλ)δ2 k+4η2G2+ 4Gη(1 ηλ) Rearrange it gives us that δ2 k+1 (1 ηλ)k+1 + λ2η2 wt w t 2 (1 ηλ)k+1 2λη wt w t δk+1 (1 ηλ)k+1 δ2 k (1 ηλ)k + 4η2G2 (1 ηλ)k+1 + 4Gη(1 ηλ)δk n (1 ηλ)k+1 Telescoping from k = 1 to K gives us that δ2 K+1 (1 ηλ)K+1 + λ2η2 wt w t 2 2λη wt w t δk+1 (1 ηλ)k+1 + (1 ηλ)k+1 + 4Gη(1 ηλ)δk n (1 ηλ)k+1 δ2 K+1 + λ2η2 wt w t 2 K X k=1 (1 ηλ)K k 2λη wt w t δK+1 λ + 2λη wt w t k=1 δk+1(1 ηλ)K k + 4Gη(1 ηλ) k=1 δk(1 ηλ)K k λ + 2λη(1 ηλ) wt w t + 4Gη(1 ηλ) k=1 δk(1 ηλ)K k (17) Now we start proving the following bound by induction: λ + 4 wt w t + 8G(1 ηλ) This claim holds when k = 1. For the inductive step, we assume it holds for some k [K] and prove the result for k + 1. We consider the following two cases. If δk+1 maxs [k] δk, induction automatically holds. Otherwise, δk+1 > maxs [k] δs. Applying Equation (17) gives us that δ2 k+1 + λ2η2 wt w t 2 k X j=1 (1 ηλ)k j 2λη wt w t δk+1 λ + 2λη(1 ηλ) wt w t + 4Gη(1 ηλ) j=1 (1 ηλ)k j which is equivalent to 2λη wt w t +2λη(1 λη) wt w t j=1 (1 ηλ)k j+ 4Gη(1 ηλ) j=1 (1 λη)k j 2λη wt w t +2λη(1 λη) wt w t j=1 (1 ηλ)k j+ 4Gη(1 ηλ) j=1 (1 λη)k j Therefore, we have 2λη wt w t +2λη(1 λη) wt w t j=1 (1 ηλ)k j + 4Gη(1 ηλ) j=1 (1 λη)k j ! λ + 4 wt w t + 8G(1 ηλ) Plug in Lemma D.3 gives us that δk+1 (8e G + 2G) rη Therefore we have 1 K k=1 u(k)(w T +1, S) 1 k=1 u(k)(w T +1, S(i)) (8e G + 2G) rη Moreover, setting η = 1 λK gives us that 1 K k=1 u(k)(w T +1, S) 1 k=1 u(k)(w T +1, S(i)) Theorem 4.6. Assume that the loss function is ρ-weakly convex, M-bounded, and G-Lipschitz. Suppose we run Algorithm 1 for T iterations with γ 1 λT , λ 2ρ on a meta-sample S, and GD for task-specific learning (Option 2, Algorithm 2) with η 1 λ, Then, with probability at least 1 δ, L(A(S), µ) L(A(S), S) + G2 rη log (mn) log (1/δ)+ M p log (1/δ) mn . Proof of Theorem 4.6. 
For ℓto be G-Lipschitz, applying Lemma 4.1 gives us that ℓ(A(S)(S), z) ℓ(A(S(j))(S(i)), z) k=1 u(k)(w T +1, S), z k=1 u(k)(w T +1, S(i)), z k=1 u(k)(w T +1, S) 1 k=1 u(k)(w T +1, S(i)) (8e G2 + 2G2) rη Plug it back into Theorem 3.1 gives the result. Theorem D.5 (Restatement of Theorem 4.7). Assume the loss ℓis convex and G-Lipschitz. Define u j =argminu L(u, Sj), j [m]. Suppose we run Algorithm 1 with GD for task-specific learning with γ = 1 λT to find an algorithm A(S) = Atask(w T +1, ) which is then run on Sj for K iterations with step-size η 1 λ. Then, we have that L(A(S)(Sj), Sj) inf u L(u, Sj) D2 2η(1 ηλ)K + G2η 2(1 ηλ) + GDηλ 1 ηλ + λ w T +1 bw 2+λσ2 (1 ηλ)(2 ηλ) where σ2 := 1 K PK j=1 bw u j 2, with bw as defined in Equation (1). w T +1 bw 2 1 T 8D2 + 4D2 ηλK + η(G+2λD)2 ηλK + η(G+2λD)2 Proof of Theorem D.5. Recall the definition bw=argminw W 1 m Pm j=1minu h L(u; Sj)+ λ u (w, S) = argminu W h L(u; S) + λ 2 u w 2i , u j = argminu W L(u, Sj), j [m]. We slightly abuse the notation by defining u(k)(wt, Sj) = A(S)(Sj) at inner iteration k for given wt. Then we have u(k+1)(wt, Sj) u j 2 = ΠW u(k)(wt, Sj) η L(u(k)(wt, Sj), Sj) + λ(u(k)(wt, Sj) wt) u j 2 (1 ηλ) u(k)(wt, Sj) u j + ηλ wt u j η L(u(k)(wt, Sj), Sj) 2 (1 ηλ)2 u(k)(wt, Sj) u j 2 + η2 L(u(k)(wt, Sj), Sj) 2 + η2λ2 wt u j 2 + 2ηλ(1 ηλ) u(k)(wt, Sj) u j wt u j + η2λG wt u j 2η(1 ηλ) D L(u(k)(wt, Sj), Sj), u(k)(wt, Sj) u j E = (1 ηλ) u(k)(wt, Sj) u j + ηλ wt u j 2 + η2 L(u(k)(wt, Sj), Sj) 2 + η2λG wt u j 2η(1 ηλ) D L(u(k)(wt, Sj), Sj), u(k)(wt, Sj) u j E u(k)(wt, Sj) u j 2 + ηλ 2 ηλ wt u j 2 + η2G2 + η2λG wt u j ((a + b)2 (1 + p)a2 + (1 + 1/p)b2 with p = (2 ηλ)ηλ 2η(1 ηλ) D L(u(k)(wt, Sj), Sj), u(k)(wt, Sj) u j E Rearrange it and telescope it gives us that k=1 u(k)(wt, Sj), Sj k=1 L(u(k)(wt, Sj), Sj) L(u j, Sj) (Jensen s inequality) D L(u(k)(wt, Sj), Sj), u(k)(wt, Sj) u j E (Convexity) u j 2 + η2G2K + η2λG PK j=1 wt u j + ηλ 2 ηλ PK j=1 wt u j 2 2η(1 ηλ)K + G2η 2(1 ηλ) + 2GDηλ 1 ηλ + λ wt bw 2 + λσ2 (1 ηλ)(2 ηλ) (σ2 = 1 K PK k=1 u j bw 2) Follow from [Zhou et al., 2019, Theorem 1], we now control wt+1 bw 2. Define u (wt, Sj) = argminu FSj(u, wt) = argminu L(u, Sj) + λ 2 u wt 2. We start with the following: wt+1 bw 2 = k=1 u(k)(wt, Sj) k=1 u(k)(wt, Sj) wt bw 2 2γλ wt bw, wt 1 k=1 u(k)(wt, Sj) k=1 u(k)(wt, Sj) We now bound the latter two terms separately as follows: wt 1 k=1 u(k)(wt, Sj) k=1 u (wt, Sj) + 1 u (wt, Sj) u(k)(wt, Sj) k=1 u (wt, Sj) u(k)(wt, Sj) u (wt, Sj) 2 u(k)(wt, Sj) u (wt, Sj) 2 as well as * wt bw, wt 1 k=1 u(k)(wt, Sj) k=1 wt bw, wt u (wt, Sj) D wt bw, u(k)(wt, Sj) u (wt, Sj) E λ FSj(u(k), wt) 1 u(k)(wt, Sj) u (wt, Sj) 2 2 wt bw 2 1 u(k)(wt, Sj) u (wt, Sj) 2 m Pm j=1 FSj(u, w) is λ-strongly convex w.r.t. w) 2 wt bw 2 1 u(k)(wt, Sj) u (wt, Sj) 2 where the common term u(k)(wt, Sj) u (wt, Sj) 2 can be controlled as follows: u(k+1)(wt, Sj) u (wt, Sj) 2 u(k)(wt, Sj) η FSj(u(k)(wt, Sj), wt) u (wt, Sj) 2 u(k+1)(wt, Sj) u (wt, Sj) 2 2η D FSj(u(k)(wt, Sj), wt), u(k+1)(wt, Sj) u (wt, Sj) E + η2 FSj(u(k)(wt, Sj), wt) 2 (1 2ηλ) u(k+1)(wt, Sj) u (wt, Sj) 2 +η2 L(u(k)(wt, Sj), Sj) + λ(u(k)(wt, Sj) wt) 2 (FSj(u, w) is λ-strongly convex w.r.t. 
u) (1 2ηλ) u(k+1)(wt, Sj) u (wt, Sj) 2 + η2(G + 2λD)2 4(1 2ηλ)k D2 + η(G + 2λD)2 2λ (Telescoping) Plug back into Equation (19) gives us that wt bw 2 2γλ 2 wt bw 2 1 u(k)(wt, Sj) u (wt, Sj) 2 u(k)(wt, Sj) u (wt, Sj) 2 (1 γλ) wt bw 2 + γλ k=1 4(1 2ηλ)k D2 + η(G + 2λD)2 k=1 8(1 2ηλ)k D2 + η(G + 2λD)2 (1 γλ) wt bw 2+γλ 2D2 ηλK + η(G + 2λD)2 +γ2λ2 8D2+ 4D2 ηλK + η(G+2λD)2 Choosing γ = 1 λT gives us that T ) wt bw 2 + 1 ηλK + η(G + 2λD)2 ηλK + η(G + 2λD)2 ηλK + η(G+2λD)2 ηλK + η(G + 2λD)2 2λ (Telescope) E Missing Proofs in Section 5 Theorem E.1 (Bennett s inequality). Let x1, . . . , xn be independent r.v. with finite variance. Further assume |xi Exi| a a.s. for all i. Define Sn = Pn i=1 [xi E[xi]] and σ2 = Pn i=1 E (xi E[xi])2. Then for any t 0, P (Sn > t) exp σ2 where h(u) = (1 + u) log (1 + u) u. Lemma 5.1. Assume that the loss function is ρ-weakly convex and G-Lipschitz. Let S, S(j) denote neighboring meta-samples and S, S(i) the neighboring samples on a test task. Then, with probability at least 1 exp T 2e2/m2 , the following holds for Algorithm 3 with λ 2ρ, and GD for taskspecific learning (i.e., Option 2 for Algorithm 2) with η 1 λ, for all T 1 as long as we set γ 1 λT , sup S,S,i [n],j [m] A(S)(S) A(S(j))(S(i)) (8e G + 2G) rη Proof of Lemma 5.1. We slightly abuse the notation, at outer iteration t, define wt = A(S), w t = A(S(j)). Given wt, at inner iteration k, define u(k)(wt, S) = A(S)(S), u(k)(w t, S(i)) = A(S(j))(S(i)). From a similar argument as Lemma D.3, k [K 1], we have u(k+1)(wt, S) u(k+1)(w t, S ) 2 wt w t + 2G λ u(k+1)(wt, S) u(k+1)(w t, S) 2 wt w t + 2G rη Let us define rt = 1(Sjt = Sjt). Note that at every step t, EA(rt) = 1 m. Moreover, note that {rt : t [T]} is an independent sequence of Bernoulli random variables. As a result, wt+1 w t+1 k=1 u(k)(wt, Sjt) k=1 u(k)(w t, S jt) (Projection is non-expansive) = (1 γλ) wt w t + γλ k=1 u(k)(wt, Sjt) 1 k=1 u(k)(w t, S jt) (1 γλ) wt w t +γλ(1 rt) 2 wt w t +2G rη wt w t + 2G (1 + (1 rt)γλ) wt w t + 2Gγ p (1 + γλ) wt w t + 2Gγ p Telescoping gives us that w T +1 w T +1 2G rη λ(1 + γλ)T + 2Gγ t=1 (1 + γλ)t 1 rt Further taking expectation w.r.t the randomness of the algorithm and gives us that EA w T +1 w T +1 (1 + γλ)T 2G rη Choosing γ 1 λT gives us that EA w T +1 w T +1 (1 + 1 Plug this back into Equation (18) gives us that EA u(K+1)(w T+1, S) u(K+1)(w T+1, S(i)) 2G rη λ + 4EA wt w t + 8G(1 ηλ) (8e G + 2G) rη Setting γ 1 λT . We note that for each rt has variance smaller than 1 m. Define random variable xt := (1 + 1 T )t 1rt. We have |xt E[xt]| = (1 + 1 T )t 1 (rt E[rt]) e |xt E[xt]| (1 + 1 t=1 E (xi E[xt])2 Hence by Bennett s inequality Theorem E.1, we have Therefore, with probability at least 1 exp T 2e2/m2 , we have w T +1 w T +1 2G rη t 1 2e G rη and therefore with probability at least 1 exp T 2e2/m2 , we have k [K 1], u(K+1)(w T+1, S) u(K+1)(w T+1, S(i)) 2G rη λ + 4 wt w t + 8G(1 ηλ) (8e G + 2G) rη λn By triangle inequality, we have with probability at least 1 exp T 2e2/m2 , 1 K k=1 u(k)(w T+1, S) 1 k=1 u(k)(w T+1, S(i)) (8e G + 2G) rη Proposition 5.2. Given a loss function ℓ( , z) and its adversarial counterpart ℓ( , z), the following holds: (1) If ℓis G-Lipschitz (in its first argument), then ℓis G-Lipschitz. (2) ℓis not H-smooth even if ℓis H-smooth. (3) If ℓis H-smooth in w, then ℓis H-weakly convex in w. Proof of Proposition 5.2. Given w1, w2, define z1 argmax z B(z) ℓ(w1, z) z2 argmax z B(z) ℓ(w2, z). 
For the first item, it holds since
\[ |\tilde{\ell}(w_1, z) - \tilde{\ell}(w_2, z)| = |\ell(w_1, z_1) - \ell(w_2, z_2)| \le \max\big\{|\ell(w_1, z_1) - \ell(w_2, z_1)|,\; |\ell(w_1, z_2) - \ell(w_2, z_2)|\big\} \le G\|w_1 - w_2\|, \]
where the first inequality uses that $z_1$ maximizes $\ell(w_1, \cdot)$ and $z_2$ maximizes $\ell(w_2, \cdot)$ over $B(z)$.
For the second item, the non-smoothness of the adversarial loss has been verified in Xing et al. [2021] and Xiao et al. [2022].
For the third item, $\ell(w, z)$ being $H$-smooth in $w$ implies that $\ell(w, z)$ is $H$-weakly convex in $w$, and we further derive that $\tilde{\ell}(w, z)$ is $H$-weakly convex because
\begin{align*}
\tilde{\ell}(w_2, z) = \ell(w_2, z_2) &\ge \ell(w_2, z_1) && \text{(by definition of } z_1, z_2\text{)} \\
&\ge \ell(w_1, z_1) + \langle g(w_1, z_1), w_2 - w_1\rangle - \tfrac{\rho}{2}\|w_2 - w_1\|^2 && \text{(} g(w_1, z_1) \in \partial \ell(w_1, z_1)\text{; apply Proposition D.1)} \\
&= \tilde{\ell}(w_1, z) + \langle \tilde{g}(w_1, z), w_2 - w_1\rangle - \tfrac{\rho}{2}\|w_2 - w_1\|^2 && \text{(redefine } \tilde{g}(w_1, z) := g(w_1, z_1) \in \partial \tilde{\ell}(w_1, z)\text{),}
\end{align*}
which is exactly characterization (2) of Proposition D.1 for $\tilde{\ell}$ with parameter $\rho = H$.
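Proposition 5.2 is what licenses applying the weakly convex analysis of Section 4.2 to the adversarially robust objective of Section 5. For concreteness, the following PyTorch sketch shows how the inner maximization defining $\tilde{\ell}$ is commonly approximated in practice, here by projected gradient ascent over an $\ell_\infty$ ball; this is an illustration under our own assumptions (the names `model`, `loss_fn`, and the hyperparameters `eps`, `steps`, `step_size` are hypothetical), not the procedure analyzed in the paper, which treats the maximization abstractly.

```python
import torch

def adversarial_loss(model, loss_fn, x, y, eps=0.1, steps=10, step_size=0.02):
    """Sketch of the adversarial loss  tilde_l(w, z) = max_{z' in B(z)} l(w, z'),
    where B(z) is taken to be an l_inf ball of radius eps around x and the inner
    maximization is approximated by projected gradient ascent (illustrative only)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()   # ascent step on the perturbation
            delta.clamp_(-eps, eps)            # project back onto the ball B(z)
    return loss_fn(model(x + delta.detach()), y)  # differentiable in the model parameters w
```

By Proposition 5.2, $\tilde{\ell}$ inherits the Lipschitz constant of $\ell$, is generally non-smooth even when $\ell$ is smooth, and is $H$-weakly convex whenever $\ell$ is $H$-smooth; this is why the weakly convex, non-smooth analysis is the relevant one for the robust variant.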
NeurIPS Paper Checklist
A. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We further expand on the claims made in the abstract and introduction in Sections 3 and 4. The detailed proofs are provided in the Appendix.
B. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We list the required assumptions in the statement of each theorem. Several limitations are described together with future work in Section 6.
C. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All the proofs are provided in the Appendix.
D. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Please see Section A in the Appendix.
E. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Please see Section A in the Appendix. The code is provided in the supplementary file.
F. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Please see Section A in the Appendix.
G. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Please see Section A in the Appendix.
H. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Please see Section A in the Appendix.
I. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The theoretical nature of the results means there are minimal ethical concerns.
J. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: See Sections 5.2 and 6.
K. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.
L. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
M. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
N. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
O. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.