Fast-Rate PAC-Bayesian Generalization Bounds for Meta-Learning

Jiechao Guan 1 2   Zhiwu Lu 3 2

1 School of Information, Renmin University of China, Beijing, China. 2 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China. 3 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. Correspondence to: Zhiwu Lu, Jiechao Guan <2014200990@ruc.edu.cn>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

PAC-Bayesian error bounds provide a theoretical guarantee on the generalization abilities of meta-learning from training tasks to unseen tasks. However, it remains unclear how tight the PAC-Bayesian bounds for meta-learning can be. In this work, we propose a general PAC-Bayesian framework that handles single-task learning and meta-learning uniformly. With this framework, we generalize the two tightest PAC-Bayesian bounds (i.e., the kl-bound and the Catoni-bound) from single-task learning to standard meta-learning, resulting in fast convergence rates for PAC-Bayesian meta-learners. By minimizing the two derived bounds, we develop two meta-learning algorithms for classification problems with deep neural networks. For regression problems, by choosing the Gibbs optimal posterior for each training task, we obtain a closed-form formula for the minimizer of our Catoni-bound, leading to an efficient Gibbs meta-learning algorithm. Although minimizing our kl-bound does not yield a closed-form solution, we show that it can be extended to analyze the more challenging meta-learning setting where samples from different training tasks exhibit interdependencies. Experiments empirically show that our proposed meta-learning algorithms achieve competitive results with respect to the latest works.

1. Introduction

Inspired by human beings' ability to utilize past experience to efficiently learn a novel task, meta-learning, also referred to as learning to learn (Thrun & Pratt, 1998), has received much attention from the machine learning community in the last decade. The goal of meta-learning is to transfer the knowledge extracted from training tasks to unseen tasks for fast adaptation. Successful applications of this learning paradigm have been witnessed in computer vision (Snell et al., 2017; Ye et al., 2020), natural language processing (Wu et al., 2020; Chen et al., 2021), reinforcement learning (Li et al., 2021), and other related fields.

Apart from practical applications, the theoretical analysis of meta-learning has also developed over the last decade. The pioneering work was provided by (Baxter, 2000), which first assumed that all learning tasks are independently and identically distributed (i.i.d.) from a task environment. Under this assumption, subsequent works investigated meta-learning from the standpoint of VC-theory (Maurer, 2009; Maurer et al., 2016) or algorithmic stability (Maurer, 2005; Chen et al., 2020). In recent years, there has also emerged an interest in studying meta-learning via PAC-Bayesian analysis (Pentina & Lampert, 2014; 2015; Amit & Meir, 2018). However, it is still unclear how tight the PAC-Bayesian generalization bounds for meta-learning can be. This paper intends to tackle this issue and provides fast-rate PAC-Bayesian bounds for meta-learning. Our motivation is to generalize the two tightest PAC-Bayesian bounds, the kl-bound and the Catoni-bound, from the i.i.d.
single-task learning setting (Maurer, 2004; Catoni, 2007) to the standard meta-learning setting, where observations from different tasks are independent but not identically distributed. The main tool for achieving this goal is a useful lemma from (Berend & Tassa, 2010) that bounds the sum of independent bounded random variables by the sum of i.i.d. Bernoulli random variables. We can then apply the proof techniques of single-task learning to meta-learning, obtaining two fast-rate PAC-Bayesian bounds, which we still call the kl-bound and the Catoni-bound for convenience (Theorems 3-4). As a result, we unify the proof framework of PAC-Bayesian theory for single-task learning and meta-learning, and provide two of the tightest bounds to date for meta-learners (see the comparison of different bounds in Table 1). By setting the two derived bounds as minimization objectives, we also develop two bound-minimizing meta-learning algorithms for classification problems with deep neural networks.

Next, we give more theoretical results and applications of our two derived bounds. For the Catoni-bound, we show how to obtain a closed-form solution to its minimization. Concretely, by choosing the Gibbs posterior for each training task in this bound, we can derive an explicit form of the Gibbs optimal hyper-posterior, which minimizes the training objective in the PAC-Bayesian meta-learning setting. Directly approximating this closed-form hyper-posterior leads to an efficient meta-learning algorithm for regression problems. For the kl-bound, we extend it to the more challenging meta-learning setting where dependence exists among samples from different tasks. The approach we take to establish this result is based on the decomposition of the dependency graph of the training data (Ralaivola et al., 2010). The extended PAC-Bayesian kl-bound has a form analogous to our bound for the standard setting, up to a multiplicative factor that represents the degree of interdependence within the dependency graph. This bound thus reveals the impact of data dependence on meta-learning. Finally, we conduct experiments on several benchmarks. The empirical results show that our proposed meta-learning algorithms achieve competitive performance with respect to the latest works.

Our main contributions are summarized as follows: (1) We propose a unified framework that generalizes PAC-Bayesian analysis from single-task learning to meta-learning. The fast-rate PAC-Bayesian kl-bound and Catoni-bound are hence derived for meta-learning, followed by two bound-minimizing classification algorithms. (2) By choosing the Gibbs optimal posterior for each training task in our Catoni-bound, we obtain the explicit form of the optimal hyper-posterior. We thus develop an efficient regression algorithm that only needs to approximate the Gibbs optimal hyper-posterior, instead of learning the posterior for each task and the hyper-posterior simultaneously as in previous methods. (3) We extend our kl-bound to the meta-learning setting where tasks are dependent. To the best of our knowledge, this is the first PAC-Bayesian bound for meta-learning with data from different tasks exhibiting interdependencies. (4) Experimental results on both classification and regression problems empirically validate the effectiveness of our theoretical analysis for meta-learning, especially the tightness and applicability of our PAC-Bayesian Catoni-bound.

2. Related Work

PAC-Bayesian Theory.
The first PAC-Bayesian generalization bound was provided by (McAllester, 1998). (Seeger, 2002) developed the theory further and obtained the tightest PAC-Bayesian kl-bound. (Germain et al., 2009) provided a general method for proving a PAC-Bayesian bound. We need to point out that in the above works, the i.i.d. assumption is necessary to obtain the tightest kl-bound. Besides that, (Catoni, 2007) derived a generalization bound with a fast rate by assuming only the independence of the observations. Until now, the kl-bound in (Maurer, 2004) and Catoni's bound have remained among the tightest PAC-Bayesian bounds. More explanations of the similarities between these tight bounds can be found in (Audibert, 2010) [Chapter 2]. Other works extended PAC-Bayesian bounds to more general cases, such as the martingale setting (Seldin et al., 2012), heavy-tailed data (Alquier & Guedj, 2018), non-i.i.d. data (Ralaivola et al., 2010), or unbounded loss functions (Holland, 2019; Haddouche et al., 2021). In this work, we focus on bounded loss functions. We first propose generalized PAC-Bayesian kl- and Catoni-bounds for independent (not necessarily identically distributed) random variables (Theorem 2). Then, we show how to apply them to derive fast-rate PAC-Bayesian generalization error bounds for meta-learning.

Meta-Learning Theory. The first systematic analysis of meta-learning theory was proposed by (Baxter, 2000), which assumed that the data distributions of different tasks are sampled i.i.d. from the same task environment. Baxter then gave a covering-number-based generalization bound for meta-learning. After that, (Maurer, 2005) investigated meta-learning theory from the perspective of algorithmic stability and proposed a new indicator, called the transfer risk, to measure the performance of a meta-learning algorithm. Other works followed this line and aimed to provide tighter bounds on the transfer risk (Maurer, 2009; Denevi et al., 2018; Chen et al., 2020). In this work, we study meta-learning from the perspective of PAC-Bayesian analysis.

PAC-Bayesian Meta-Learning Theory. The pioneering work on PAC-Bayesian meta-learning theory was (Pentina & Lampert, 2014). They adopted the task environment notion from Baxter and proposed the concept of the hyper-posterior. The hyper-posterior is trained over the observed tasks and generates an informative prior for fast adaptation when a novel task is encountered. The PAC-Bayesian bound on the transfer risk (of the learned hyper-posterior) is always composed of three parts: the empirical multi-task error, an environment-level complexity, and a task-level complexity (see Table 1). (Pentina & Lampert, 2015) further generalized PAC-Bayesian meta-learning theory to non-i.i.d. cases where different tasks are dependent or the task environment changes. (Amit & Meir, 2018) provided a new PAC-Bayesian bound and applied their theoretical results to the optimization of deep neural networks. (Rothfuss et al., 2021) generalized the PAC-Bayesian meta-learning theorem to unbounded losses and derived the PAC-optimal hyper-posterior. (Farid & Majumdar, 2021) studied meta-learning with both PAC-Bayes and uniform stability analysis. However, both the environment-level complexity and the task-level complexity in these PAC-Bayesian bounds suffer from slow convergence rates (e.g., the environment-level and task-level complexities in (Pentina & Lampert, 2015) are of order O(1/\sqrt{n}) and O(1/(n\sqrt{m})), respectively, with n training tasks and m samples per task).
In this work, we derive two fast-rate PAC-Bayesian bounds for meta-learning (Theorems 3-4). Specifically, the task-level complexities in our kl-bound and Catoni-bound have a rate of O(1/(nm)). The environment-level complexity in our Catoni-bound also has a fast rate of O(1/n) (see Table 1 for detailed comparisons). Furthermore, we generalize our kl-bound to the meta-learning setting with dependent observations. Note that although (Pentina & Lampert, 2015) claimed a bound for dependent tasks, they still assumed independence of the samples from different tasks. In contrast, we provide a more general PAC-Bayesian bound for meta-learning with dependent samples from different tasks, and show that this bound is much tighter than that in (Pentina & Lampert, 2015) (see the discussion below Theorem 6).

Table 1. Different PAC-Bayesian meta-learning bounds on er(Q). Bound = Empirical Error + Environment-level Complexity + Task-level Complexity. n is the number of training tasks; m is the sample size per task. P, Q ∈ M1(M1(H)) are the hyper-prior and hyper-posterior, respectively; P, Q_i = Q(S_i, P) ∈ M1(H) are the prior and the posterior for the i-th training task. In our Catoni-bound, the constant C > 1. Explicit forms of the different bounds are given in Table 2 of the Appendix.

| Classical Bounds | Empirical Error | Environment-Level Complexity | Task-Level Complexity |
| --- | --- | --- | --- |
| (Pentina & Lampert, 2014) | ber(Q) | O( K(Q,P)/\sqrt{n} ) | O( [K(Q,P) + \sum_{i=1}^n E_{P~Q} K(Q_i,P)] / (n\sqrt{m}) ) |
| (Amit & Meir, 2018) | ber(Q) | O( \sqrt{[K(Q,P) + \ln n]/n} ) | O( \sqrt{[K(Q,P) + E_{P~Q} K(Q_i,P) + \ln(2nm)]/m} ) |
| (Rothfuss et al., 2021) | ber(Q) | O( K(Q,P)/\sqrt{n} ) | O( [K(Q,P) + \sum_{i=1}^n E_{P~Q} K(Q_i,P)] / (n\sqrt{m}) ) |
| kl-bound (ours) | ber(Q) | O( \sqrt{[K(Q,P) + \ln\sqrt{n}]/n} ) | O( [K(Q,P) + E_{P~Q}\sum_{i=1}^n K(Q_i,P) + \ln(nm)] / (mn) ) |
| Catoni-bound (ours) | C·ber(Q) | O( K(Q,P)/n ) | O( [K(Q,P) + E_{P~Q}\sum_{i=1}^n K(Q_i,P)] / (mn) ) |

3. Preliminary

A supervised learning problem is characterized by a sample space Z, a hypothesis space H, and a loss function l : H × Z → R, where Z = X × Y is the product of the input space X and the label space Y. We assume that l is [0,1]-valued. Indeed, if the loss function l is bounded and lies in the interval [0, M] (M > 0), we can use the rescaling technique and focus on the [0,1]-valued loss l/M. Let [K] denote the set {1, ..., K} for any integer K, and let M1(A) denote the set of probability measures over a set A. Throughout the paper, we ignore measurability issues.

3.1. PAC-Bayesian Theory for Single-Task Learning

In the PAC-Bayesian single-task learning setting, a task is characterized by an unknown distribution D over the space Z, from which a size-m sample S = {z_i}_{i=1}^m is provided, with each z_i drawn i.i.d. from D. For any hypothesis h ∈ H, denote by er(h, D) ≜ E_{z∼D} l(h, z) its expected error over D and by ber(h, S) ≜ (1/m) \sum_{i=1}^m l(h, z_i) its empirical error over S. In the PAC-Bayesian framework, the goal is to output a posterior Q = Q(S, P) ∈ M1(H) by training an algorithm with the sample S and a prior P ∈ M1(H) as input. The expected and empirical errors of the randomized classifier associated with a distribution Q are defined as er(Q, D) ≜ E_{h∼Q} er(h, D) and ber(Q, S) ≜ E_{h∼Q} ber(h, S), respectively. Denote the KL-divergence between distributions Q and P by K(Q, P) = E_{h∼Q} ln(dQ/dP), where dQ/dP is the Radon-Nikodym derivative of Q with respect to P. Denote the relative entropy between Bernoulli random variables with success rates p and q by kl(p, q) = p \ln(p/q) + (1−p) \ln((1−p)/(1−q)).
Then the PAC-Bayesian kl-bound and Catoni-bound for single-task learning are as follows:

Theorem 1 (Germain et al., 2009) [Corollaries 2.1-2.2] Let l be the binary-valued misclassification loss. For any fixed prior P ∈ M1(H), any data distribution D over Z, any δ > 0, and any positive constant λ > 0, with probability at least 1−δ over the draw of the i.i.d. sample S ∼ D^m, the following two inequalities hold for any posterior Q ∈ M1(H):

kl( ber(Q,S), er(Q,D) ) \le \frac{K(Q,P) + \ln(2\sqrt{m}/\delta)}{m},

er(Q,D) \le \Phi^{-1}_{\lambda/m}\Big( ber(Q,S) + \frac{K(Q,P) + \ln(1/\delta)}{\lambda} \Big),

where \Phi_a(p) = -a^{-1}\ln\{1 - [1-\exp(-a)]p\} and \Phi^{-1}_a denotes its inverse.

3.2. PAC-Bayesian Theory for Meta-Learning

In the PAC-Bayesian meta-learning setting, the learner is given n different training tasks, each associated with a data distribution D_i (1 ≤ i ≤ n) over the sample space Z (i.e., D_i ∈ M1(Z)). In each task, a size-m i.i.d. training sample S_i = {z_ij}_{j=1}^m ∼ D_i^m is provided. To develop meta-learning theory, we adopt the task environment concept proposed by (Baxter, 2000). Specifically, we assume that the n data distributions {D_i}_{i=1}^n are sampled from the same distribution, referred to as the environment τ. Thus, the environment τ can be regarded as a probability measure over the set of all distributions (i.e., τ ∈ M1(M1(Z))).

Next, we follow the PAC-Bayesian meta-learning framework proposed by (Pentina & Lampert, 2014). We regard the prior P ∈ M1(H) for each training task as a random variable. P is sampled from a distribution called the hyper-prior P ∈ M1(M1(H)), which is given before seeing the n training tasks. The goal of meta-learning is to output a hyper-posterior Q ∈ M1(M1(H)) by training a meta-learning algorithm with the hyper-prior P and the datasets {S_i}_{i=1}^n as input. When a new task is encountered, a new prior is sampled from the learned hyper-posterior Q, which contains the information of the n training tasks, for fast adaptation. Under the PAC-Bayesian framework, choosing an informative prior for a new task can achieve a tighter generalization bound and better performance (Catoni, 2007). Since different tasks are sampled from the same environment and share some similarities, the informative prior drawn from the learned hyper-posterior is expected to adapt to the new task quickly. Formally, the quality of the learned hyper-posterior Q is measured by the transfer risk (Maurer, 2009; Pentina & Lampert, 2014) on a new data distribution D sampled independently from the same task environment τ:

er(Q) ≜ E_{P\sim Q} E_{D\sim\tau} E_{S\sim D^m} er(Q(S,P), D).   (1)

Notice that we cannot minimize the transfer risk directly, since the environment τ and the sampled probability measure D are unknown. Instead, we choose to minimize the following empirical multi-task risk ber(Q) over the n training tasks to obtain a high-quality hyper-posterior Q:

ber(Q) ≜ E_{P\sim Q} \frac{1}{n}\sum_{i=1}^n ber(Q(S_i,P), S_i).   (2)

Furthermore, to obtain a PAC-Bayesian bound on er(Q), we also consider the following expected multi-task risk:

eer(Q) ≜ E_{P\sim Q} \frac{1}{n}\sum_{i=1}^n er(Q(S_i,P), D_i).   (3)

We write Q(S_i, P) as Q_i for short when the context is clear. The notations of PAC-Bayesian single-task learning and meta-learning are listed in Table 6 in Appendix H for convenient reference. PAC-Bayesian meta-learning theory thus gives an upper bound on the transfer risk er(Q) of the hyper-posterior Q in terms of its empirical multi-task risk ber(Q) or its expected multi-task risk eer(Q). Such results will be detailed in the next section.
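In practice, kl-type bounds such as the first inequality of Theorem 1 are converted into explicit risk bounds by numerically inverting the binary kl. The sketch below (our own illustration, not code from the paper; the numeric inputs are hypothetical) computes kl(p, q) and its upper inverse by bisection:

```python
import math

def binary_kl(p, q):
    """kl(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)), with 0 ln 0 := 0."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    term1 = 0.0 if p <= 0 else p * math.log(p / q)
    term2 = 0.0 if p >= 1 else (1 - p) * math.log((1 - p) / (1 - q))
    return term1 + term2

def kl_inverse(p_hat, bound, iters=100):
    """Largest q in [p_hat, 1] such that kl(p_hat, q) <= bound (bisection)."""
    lo, hi = p_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binary_kl(p_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: m = 1000 samples, K(Q,P) = 5 nats, delta = 0.05.
m, delta, kl_div, emp_err = 1000, 0.05, 5.0, 0.02
rhs = (kl_div + math.log(2 * math.sqrt(m) / delta)) / m
print(kl_inverse(emp_err, rhs))  # explicit upper bound on er(Q, D)
```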
4. Theoretical Results

We provide fast-rate PAC-Bayesian bounds for meta-learning in two common scenarios: (1) all samples and all tasks are independent (we also show how to derive the Gibbs optimal hyper-posterior by minimizing our Catoni-bound); (2) samples from different tasks are dependent.

4.1. PAC-Bayesian Bounds for Meta-Learning with Independent Samples

To give fast-rate PAC-Bayesian bounds for meta-learning, we need to choose a convex function D(p,q) and then bound the moment generating function (MGF) of D(p,q) (i.e., E exp{λD(er(Q), eer(Q))} and E exp{λD(eer(Q), ber(Q))}). Almost all existing works set D(p,q) = p − q, apply Hoeffding's lemma to bound the MGF of D(p,q), and finally obtain a PAC-Bayesian meta-learning bound of O(1/t + t/K) (∀t > 0), which suffers from a slow convergence rate of O(1/\sqrt{K}) (K > 0). In contrast, we set D(p,q) to kl(q,p) or \Phi_{\lambda/K}(p) − q (λ > 0),¹ as we do to obtain the PAC-Bayesian kl-bound and Catoni-bound in single-task learning. However, since eer(Q) and ber(Q) are sums of independent [0,1]-valued random variables (not i.i.d. {0,1}-valued ones as in Theorem 1), we cannot directly apply the results in Theorem 1 to bound the MGF of D(p,q). To overcome this challenge, we use the following lemma to bound the expectation of a function of the sum of independent [0,1]-valued random variables (rvs) by the expectation of the same function of the sum of i.i.d. {0,1}-valued ones. This result originates from (Berend & Tassa, 2010); more explanations can be found in Appendix A.

Lemma 1 Let {ξ_k}_{k=1}^K be a sequence of independent random variables with P(0 ≤ ξ_k ≤ 1) = 1, and let {η_k}_{k=1}^K be a sequence of i.i.d. Bernoulli random variables with Eη_k = K^{-1}\sum_{k=1}^K Eξ_k. Then for any convex function g,

E g\Big( \frac{1}{K}\sum_{k=1}^K ξ_k \Big) \le E g\Big( \frac{1}{K}\sum_{k=1}^K η_k \Big).

With this lemma, it is convenient to derive the generalized PAC-Bayesian kl-bound and Catoni-bound for independent [0,1]-valued random variables, as follows.

Theorem 2 Let F be a set of random variables f. Let S = {ξ_k}_{k=1}^K be a sequence of random variables with each component ξ_k (k ∈ [K]) drawn independently according to the measure µ_k over the set A_k. Let R(f) = \frac{1}{K}\sum_{k=1}^K E_{ξ_k} g_k(f, ξ_k) and r(f) = \frac{1}{K}\sum_{k=1}^K g_k(f, ξ_k), where g_k : F × A_k → [0,1] is a bounded function. Denote E_{f∼ρ} R(f) and E_{f∼ρ} r(f) by ρ(R) and ρ(r), respectively. Then ∀δ > 0, ∀λ > 0, for any predefined distribution π ∈ M1(F), with probability at least 1−δ over the draw of S, the following two inequalities hold for any measure ρ over F:

kl(ρ(r), ρ(R)) \le \frac{K(ρ,π) + \ln(2\sqrt{K}/\delta)}{K},

ρ(R) \le \Phi^{-1}_{\lambda/K}\Big( ρ(r) + \frac{K(ρ,π) + \ln(1/\delta)}{\lambda} \Big).

Proof Sketch. Note that D(ρ(R), ρ(r)) ≤ (1/λ)[K(ρ,π) + ln(E_{f∼π} E_S e^{λD(R(f),r(f))}/δ)] holds with high probability for any convex function D(·,·). With Lemma 1, we can bound E_S e^{λD(R(f),r(f))} by the MGF of a convex function of the sum of i.i.d. Bernoulli rvs. Setting D(p,q) to kl(q,p) or \Phi_{\lambda/K}(p) − q and using Theorem 1 finishes the proof.

¹ \Phi_a(p) = −a^{−1}\ln\{1 − [1 − \exp(−a)]p\}, a ∈ R, p ∈ [0,1].

Setting F = H, g_k = l, A_k = Z, we recover the result in Theorem 1. As shown in (Maurer, 2004) [Equation (2)], the PAC-Bayesian kl-bound on the right-hand side (RHS) of the first inequality above gives the optimal order in K. Applying Pinsker's inequality kl(p,q) ≥ 2(p−q)², we can obtain a bound on the deviation |ρ(R) − ρ(r)| with a slow convergence rate of O(1/\sqrt{K}). Applying the stronger version kl(p,q) ≥ (p−q)²/(2q) for q > p instead gives a fast convergence rate of O(\ln K / K) for the generalization bound.
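Lemma 1 can be checked by simulation. The following sketch (illustrative only; the Beta distributions and the convex test function g are arbitrary choices of ours) compares the two sides of the inequality by Monte-Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 20, 200_000

# Independent [0,1]-valued variables xi_k with heterogeneous means.
alphas = rng.uniform(0.5, 3.0, size=K)
betas = rng.uniform(0.5, 3.0, size=K)
xi = rng.beta(alphas, betas, size=(trials, K))

# i.i.d. Bernoulli eta_k with the matched mean (1/K) sum_k E xi_k.
mu = float(np.mean(alphas / (alphas + betas)))
eta = rng.binomial(1, mu, size=(trials, K))

g = lambda x: np.exp(5.0 * x)           # an arbitrary convex test function
lhs = g(xi.mean(axis=1)).mean()         # estimates E g((1/K) sum xi_k)
rhs = g(eta.mean(axis=1)).mean()        # estimates E g((1/K) sum eta_k)
print(lhs, rhs, lhs <= rhs)             # Lemma 1 predicts lhs <= rhs
```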
Such analysis leads to our fast-rate PAC-Bayesian kl-bound for meta-learning, stated below.

Theorem 3 For any predefined hyper-prior P, with probability at least 1−δ over the draw of the training samples {S_i}_{i=1}^n, the following holds for any hyper-posterior Q:

er(Q) \le ber(Q) + \sqrt{\frac{K(Q,P) + \ln(2\sqrt{n}/\delta)}{2n}} + \sqrt{\frac{2\,ber(Q)\,\Delta}{mn}} + \frac{2\Delta}{mn},

where Δ = K(Q,P) + E_{P∼Q}\sum_{i=1}^n K(Q_i, P) + \ln(2\sqrt{mn}/\delta).

To prove this result, we apply the kl-bound in Theorem 2 to bound kl(eer(Q), er(Q)) and kl(ber(Q), eer(Q)) respectively, and then use the union bound to obtain an upper bound on er(Q). As shown in (Pentina & Lampert, 2014; Amit & Meir, 2018), the proof of a bound for meta-learning is always divided into two parts: bounding the deviations er(Q) − eer(Q) and eer(Q) − ber(Q), respectively. The contribution of our Theorem 3 lies in the task-level complexity \sqrt{2\,ber(Q)\Delta/(mn)} + 2\Delta/(mn) on the deviation eer(Q) − ber(Q). When the empirical error ber(Q) is close to zero, this term has a magnitude of O(\ln(mn)/(mn)), which is tighter than the bound of O(1/(n\sqrt{m})) in (Pentina & Lampert, 2014, Theorem 1) when n ≪ m. Besides that, the environment-level complexity of our derived kl-bound, O(\sqrt{[K(Q,P)+\ln(2\sqrt{n}/\delta)]/n}), halves the logarithmic dependence on n in the numerator, whereas it is O(\sqrt{[K(Q,P)+\ln(2n/\delta)]/n}) in the bound of (Amit & Meir, 2018). Moreover, using the Catoni-bound in Theorem 2, we can obtain our tighter PAC-Bayesian Catoni-bound for meta-learning.

Theorem 4 For any predefined hyper-prior P, any δ ∈ (0,1), and any C_1, C_2 > 1, with probability at least 1−δ over the draw of the training samples {S_i}_{i=1}^n, the following holds for any hyper-posterior Q:

er(Q) \le \frac{C_1 C_2 \ln C_1 \ln C_2}{(C_1-1)(C_2-1)}\,ber(Q) + \frac{C_1}{C_1-1}\cdot\frac{K(Q,P) + \ln(2/\delta)}{n} + \frac{C_1 C_2 \ln C_1}{(C_1-1)(C_2-1)}\cdot\frac{K(Q,P) + E_{P\sim Q}\sum_{i=1}^n K(Q_i,P) + \ln(2/\delta)}{nm}.

The proof strategy is again to bound the deviations er(Q) − eer(Q) and eer(Q) − ber(Q) respectively. If we suppress the KL-complexities in the numerators, the bound on er(Q) − eer(Q) has order O(1/n) and the bound on eer(Q) − ber(Q) has order O(1/(mn)). Both bounds have a fast convergence rate w.r.t. the number of their observations. Therefore, Theorem 4 provides the tightest PAC-Bayesian bound for standard meta-learning in this paper. Setting the above kl-bound and Catoni-bound as objective functions leads to two meta-learning algorithms for classification problems, which will be detailed in the experiment section.

4.2. Optimizing the PAC-Bayesian Catoni-Bound with the Gibbs Optimal Hyper-posterior

In this subsection, we show how to utilize our Catoni-bound to obtain the closed-form formula of the Gibbs optimal hyper-posterior. As shown in (Zhang, 2006) [Equation (5)], one can derive an explicit form of the optimal posterior ρ for min_ρ {β ρ(r) + K(ρ,π)}, where β > 0. The resulting posterior is called the Gibbs optimal posterior (Catoni, 2007). Next, we apply this strategy to the minimization of our PAC-Bayesian Catoni-bound for meta-learning in Theorem 4 to establish the explicit form of the Gibbs optimal hyper-posterior. We first give a corollary of Theorem 4 obtained by choosing the Gibbs optimal posterior for each training task.

Corollary 1 ∀i ∈ [n], any prior P ∈ M1(H), and any training data {S_i}_{i=1}^n, let Q*_i be the Gibbs optimal posterior such that dQ*_i/dP = \exp\{-m\,ber(h, S_i)\}/Z(S_i, P), where Z(S_i, P) = \int_H e^{-m\,ber(h, S_i)}\,dP(h) is a normalization constant. Then ∀δ > 0 and ∀C_1 > 1, with probability at least 1−δ over the draw of the training datasets {S_i}_{i=1}^n, the following holds for any hyper-posterior Q:

er(Q) \le \frac{e\,C_1\ln C_1}{(C_1-1)(e-1)}\,E_{P\sim Q}\frac{1}{nm}\sum_{i=1}^n [-\ln Z(S_i, P)] + \frac{C_1}{C_1-1}\cdot\frac{K(Q,P) + \ln(2/\delta)}{n} + \frac{e\,C_1\ln C_1}{(C_1-1)(e-1)}\cdot\frac{K(Q,P) + \ln(2/\delta)}{nm}.
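Note that the empirical term E_{P∼Q}(1/(nm))\sum_i[−ln Z(S_i, P)] in Corollary 1 is an expectation of log-partition functions, which can be estimated with Monte-Carlo samples from the prior via a numerically stable log-mean-exp. A minimal sketch (our own illustration; the function name and the stand-in data are hypothetical):

```python
import numpy as np

def neg_log_Z(emp_errs, m):
    """Estimate -ln Z(S_i, P) = -ln E_{h~P} exp(-m * ber(h, S_i)).

    emp_errs: empirical errors ber(h_l, S_i) of L hypotheses h_1, ..., h_L
    drawn i.i.d. from the prior P. Uses a stable log-mean-exp.
    """
    a = -m * np.asarray(emp_errs)
    a_max = a.max()
    return -(a_max + np.log(np.mean(np.exp(a - a_max))))

# Hypothetical usage: L = 1000 prior draws, m = 50 samples in task i.
rng = np.random.default_rng(0)
emp_errs = rng.uniform(0.0, 0.3, size=1000)   # stand-in for ber(h_l, S_i)
print(neg_log_Z(emp_errs, m=50))
```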
An analogous result, also derived by choosing the Gibbs optimal posterior for each task, can be found in (Rothfuss et al., 2021) [Corollary 1]. Omitting the empirical error part, both Rothfuss's bound and ours share the same order of O((1/n + 1/(mn)) K(Q,P)). However, Rothfuss's bound has an extra constant (i.e., 1/8) that does not vanish as the size of the training samples increases. Therefore, our generalization bound has better asymptotic behaviour. Next, we obtain the explicit form of the Gibbs optimal hyper-posterior by minimizing the RHS of the inequality in Corollary 1.

Corollary 2 (Gibbs Optimal Hyper-posterior) For any hyper-prior P ∈ M1(M1(H)) and any training datasets {S_i}_{i=1}^n, the hyper-posterior Q* ∈ M1(M1(H)) that minimizes the PAC-Bayesian meta-learning bound in Corollary 1 has the following explicit form:

\frac{dQ*}{dP}(P) = \exp\Big\{ \frac{\beta}{nm}\sum_{i=1}^n \ln Z(S_i, P) \Big\} \Big/ Z(S, P),

where β = \frac{e\,C_1\ln C_1}{(C_1-1)(e-1)\,\alpha}, α = \frac{e\,C_1\ln C_1}{nm(C_1-1)(e-1)} + \frac{C_1}{n(C_1-1)}, and Z(S, P) = \int_{M1(H)} \exp\{\frac{\beta}{nm}\sum_{i=1}^n \ln Z(S_i, P)\}\,dP(P) is a normalization constant.

As pointed out by (Rothfuss et al., 2021) [Proposition 1], the explicit form of the Gibbs optimal hyper-posterior in Corollary 2 makes it much easier to optimize the meta-learning bound in Corollary 1 than to directly optimize the Catoni-bound in Theorem 4, which requires learning Q and {Q_i}_{i=1}^n simultaneously. We will use statistical inference methods to approximate the above Q* and thus develop an efficient Gibbs optimal hyper-posterior (GOHP) algorithm for regression problems in the experiment section.

4.3. PAC-Bayesian kl-Bound for Meta-Learning with Dependent Samples

In this subsection, we consider the meta-learning setting where dependence exists between different tasks and between different samples. Our strategy is to split the dependent random variables into groups of independent random variables. This splitting strategy originates from (Hoeffding, 1963) and was further developed by (Janson, 2004) with graph decomposition techniques. We introduce two key concepts related to the dependence graph that help us analyze the meta-learning setting with dependent samples.

Definition 1 (Dependence Graph) Let S = {ξ_1, ..., ξ_K} be a set of K random variables. The dependence graph Γ(S) = (V, E) of S is the graph with vertex set V = [K] whose edge set E satisfies: (i, j) ∉ E (i.e., there is no edge between i and j) ⟺ ξ_i and ξ_j are independent.

Definition 2 (Fractional Covers (Janson, 2004)) Let Γ = (V, E) be an undirected graph with V = [K]. C ⊆ V is independent if the vertices in C are pairwise independent (i.e., no two vertices in C are connected). C = {C_j}_{j=1}^J, with C_j ⊆ V, is a proper cover of V if each C_j is independent and ∪_{j=1}^J C_j = V. C = {(C_j, w_j)}_{j=1}^J, with C_j ⊆ V and w_j ∈ [0,1], is a proper exact fractional cover of V if each C_j is independent and, ∀i ∈ V, \sum_{j=1}^J w_j 1_{i∈C_j} = 1; w(C) = \sum_{j=1}^J w_j is defined as the chromatic weight of C. The fractional chromatic number χ*(Γ) is the minimum chromatic weight over all proper exact fractional covers of the dependence graph Γ = (V, E).
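As a concrete instance of Definition 2 (our own toy example, not from the paper), take K = 4 variables where ξ_1, ξ_2 are dependent and ξ_3, ξ_4 are dependent, with no other dependencies. Then C_1 = {1, 3} and C_2 = {2, 4} are independent sets, and {(C_1, 1), (C_2, 1)} is a proper exact fractional cover with chromatic weight 2, so χ*(Γ) ≤ 2. The sketch below verifies the exactness condition and computes w(C):

```python
# Vertices 1..4; edges encode dependence: (1,2) and (3,4).
edges = {(1, 2), (3, 4)}
V = {1, 2, 3, 4}
cover = [({1, 3}, 1.0), ({2, 4}, 1.0)]   # list of (C_j, w_j)

def is_independent(C, edges):
    # C is independent iff no edge joins two of its vertices.
    return all((i, j) not in edges and (j, i) not in edges
               for i in C for j in C if i < j)

proper = all(is_independent(C, edges) for C, _ in cover)
exact = all(abs(sum(w for C, w in cover if i in C) - 1.0) < 1e-9 for i in V)
w_C = sum(w for _, w in cover)            # chromatic weight w(C)
print(proper, exact, w_C)                 # True True 2.0
```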
Then we can obtain a chromatic PAC-Bayesian kl-bound for dependent random variables S = {ξ_1, ..., ξ_K}.

Theorem 5 Consider the setting of Theorem 2, with the only difference that S = {ξ_k}_{k=1}^K is a sequence of dependent random variables, and let χ*(S) denote the fractional chromatic number of the dependence graph of S. Then with probability at least 1−δ over the draw of S, the following holds for any measure ρ over F:

kl(ρ(r), ρ(R)) \le \frac{\chi^*(S)}{K}\Big[ K(ρ,π) + \ln\Big( \frac{2}{\delta}\sqrt{\frac{K}{\chi^*(S)}} \Big) \Big].

A previous result dealing with non-identically, non-independently distributed data is the PAC-Bayesian chromatic bound in (Ralaivola et al., 2010) [Theorem 28], whose order is about O(\sqrt{\chi^*(S)/K}). The main difference between Ralaivola's bound and ours is as follows: the bound in Ralaivola's Theorem 28 is obtained by directly using a chromatic concentration inequality from (Janson, 2004) to bound the moment generating function (MGF) of the kl-divergence of dependent samples; instead, we employ the graph decomposition techniques of (Janson, 2004) to partition the dependent samples into independent sets and then apply our Lemma 1 to bound the MGF of the kl-divergence of independent samples, leading to the tighter chromatic PAC-Bayes bound of O(\ln K / K) in Theorem 5. Finally, applying the above theorem to bound kl(eer(Q), er(Q)) and kl(ber(Q), eer(Q)) respectively, we obtain our chromatic PAC-Bayesian bound for meta-learning with dependent tasks and dependent samples.

Theorem 6 For any given hyper-prior P, with probability at least 1−δ over the draw of the training samples {S_i}_{i=1}^n, the following holds for any hyper-posterior Q:

er(Q) \le ber(Q) + \sqrt{\frac{\Delta_1}{2n}} + \sqrt{\frac{\Delta_2}{2mn}},

where

\Delta_1 = \chi^*(D)\Big[ K(Q,P) + \ln\Big( \frac{2}{\delta}\sqrt{\frac{n}{\chi^*(D)}} \Big) \Big], \quad \Delta_2 = \chi^*(S)\Big[ K(Q,P) + E_{P\sim Q}\sum_{i=1}^n K(Q_i,P) + \ln\Big( \frac{2}{\delta}\sqrt{\frac{mn}{\chi^*(S)}} \Big) \Big],

and χ*(D), χ*(S) denote the fractional chromatic numbers of the dependence graphs of D = {D_i}_{i=1}^n and S = {S_i}_{i=1}^n, respectively.

It is not difficult to see that when all samples S = {z_ij}_{i,j=1}^{n,m} in meta-learning are sampled independently, the proper exact fractional cover of Γ(S) = (V, E) is {(V, 1)}, and therefore χ*(S) = 1. We can also obtain χ*(D) = 1 when all distributions {D_i}_{i=1}^n are sampled independently from the task environment τ. In this case, Theorem 6 reduces to the PAC-Bayesian kl-bound for meta-learning under the independence assumption in Theorem 3. Another example of calculating χ*(D) in the dependent meta-learning setting is provided in Example 1 in Appendix D. We should point out that in Theorem 6 (or Proposition 8 in Appendix D), our environment-level complexity \sqrt{\chi^*(D)[K(Q,P) + \ln((2/\delta)\sqrt{n/\chi^*(D)})]/(2n)} on er(Q) − eer(Q) is tighter than that in (Pentina & Lampert, 2015) [Theorem 3], which only bounds the deviation er(Q) − eer(Q) with a term of order \sqrt{J[K(Q,P) + \ln(J/\delta)]/n}, where J denotes the size of the dependence graph Γ(D).
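For reference, Theorem 6 transcribes directly into a small bound calculator (the numeric inputs below are hypothetical; setting χ*(D) = χ*(S) = 1 recovers the independent case):

```python
import math

def chromatic_kl_meta_bound(emp_err, kl_hyper, sum_kl_tasks,
                            n, m, chi_D, chi_S, delta):
    """Theorem 6: er(Q) <= ber(Q) + sqrt(D1/(2n)) + sqrt(D2/(2mn))."""
    d1 = chi_D * (kl_hyper + math.log((2 / delta) * math.sqrt(n / chi_D)))
    d2 = chi_S * (kl_hyper + sum_kl_tasks
                  + math.log((2 / delta) * math.sqrt(m * n / chi_S)))
    return emp_err + math.sqrt(d1 / (2 * n)) + math.sqrt(d2 / (2 * m * n))

# Hypothetical values: n = 10 tasks, m = 2000 samples each, mild dependence.
print(chromatic_kl_meta_bound(emp_err=0.05, kl_hyper=3.0, sum_kl_tasks=40.0,
                              n=10, m=2000, chi_D=2, chi_S=4, delta=0.05))
```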
Nevertheless, we must also admit that we are currently unable to extend our PAC-Bayesian Catoni-bound to this generalized meta-learning setting. This may suggest that the kl-bound is more flexible than the Catoni-bound for learning settings where the training data exhibit dependencies.

5. Experiments

In this section, we empirically demonstrate the effectiveness of our theoretical analysis for meta-learning on classification and regression problems with deep neural networks. For classification problems, we directly set our kl-bound (Theorem 3) and Catoni-bound (Theorem 4) as minimization objectives, with the bounded cross-entropy loss (Pérez-Ortiz et al., 2021) for model optimization. For regression problems, we develop a Gibbs optimal hyper-posterior algorithm with Bayesian neural networks (GOHP-NN). Concretely, we choose the classical statistical inference method Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) to approximate the Gibbs optimal hyper-posterior Q* in Corollary 2 (with C_1 = 2). The mean squared error is selected as the loss function for regression problems. In practice, the squared loss is always bounded, so we choose it for model optimization for fair comparison with existing methods. The detailed pseudo-code of our proposed meta-learning algorithms for classification and regression problems can be found in Appendices E-F.

5.1. Experimental Setup

Classification Environments. We conduct classification experiments in three different task environments based on augmentations of the MNIST dataset (Yann, 1998). Each task from the same environment is constructed by a fixed number of pixel swaps to ensure task relatedness. The three environments are created by swapping 100/200/300 pixels, respectively, to increase the classification difficulty. During the meta-training phase, we choose 10 training tasks, each composed of 60,000 training examples; in the meta-test phase, each task is constructed with fewer training samples (2,000). We choose a fully-connected network with 3 hidden layers and a linear output layer as the backbone. All experimental details follow (Amit & Meir, 2018). We report the test bounds and test errors of the various methods on the novel tasks in Table 2.

Table 2. Comparisons of different PAC-Bayesian meta-learning methods. The average test bounds and test errors are reported over 20 test tasks (± shows the 95% confidence interval) in three different pixel-shuffled environments.

| Method | Test Bound (100 swaps) | Test Error % (100) | Test Bound (200 swaps) | Test Error % (200) | Test Bound (300 swaps) | Test Error % (300) |
| --- | --- | --- | --- | --- | --- | --- |
| Variational Bayes | N/A | 1.606 ± 0.001 | N/A | 1.962 ± 0.001 | N/A | 2.649 ± 0.130 |
| MAML | N/A | 1.876 ± 0.001 | N/A | 2.241 ± 0.002 | N/A | 2.788 ± 0.102 |
| (Seeger, 2002) | 0.133 ± 0.034 | 1.629 ± 0.000 | 0.285 ± 0.049 | 1.972 ± 0.001 | 0.408 ± 0.062 | 2.523 ± 0.001 |
| (Pentina & Lampert, 2014) | 0.190 ± 0.022 | 1.939 ± 0.001 | 0.240 ± 0.030 | 2.631 ± 0.002 | 0.334 ± 0.036 | 3.767 ± 0.003 |
| (Amit & Meir, 2018) | 0.126 ± 0.012 | 1.587 ± 0.001 | 0.197 ± 0.019 | 1.948 ± 0.001 | 0.270 ± 0.018 | 2.630 ± 0.001 |
| (Rothfuss et al., 2021) | 0.174 ± 0.023 | 1.921 ± 0.001 | 0.224 ± 0.030 | 2.634 ± 0.001 | 0.318 ± 0.036 | 3.754 ± 0.003 |
| kl-bound (ours) | 0.119 ± 0.024 | 1.746 ± 0.001 | 0.189 ± 0.027 | 2.594 ± 0.001 | 0.359 ± 0.042 | 2.993 ± 0.002 |
| Catoni-bound (ours) | 0.093 ± 0.027 | 1.545 ± 0.001 | 0.128 ± 0.025 | 1.889 ± 0.001 | 0.210 ± 0.035 | 2.433 ± 0.001 |

Regression Environments. We conduct regression experiments with one synthetic and four real-world meta-learning environments. The synthetic environment is composed of regression tasks that can be interpreted as a 2-dimensional mixture of Cauchy distributions plus a random Gaussian Process function. For the second environment, we employ datasets corresponding to different calibration sessions of the Swiss Free-Electron Laser (SwissFEL) (Milne et al., 2017). Two further environments are constructed using datasets from the PhysioNet 2012 challenge (Silva et al., 2012), which contains time series of electronic health measurements from ICU patients, namely the Glasgow Coma Scale (GCS) and the hematocrit value (HCT). Finally, we create the Berkeley-Sensor environment, where the tasks require predicting the temperature measurements of sensors installed in different locations of one building (Madden, 2004). More detailed information about the tasks in the regression environments can be found in Appendix F.1. We use the same experimental setup as (Rothfuss et al., 2021), and report the root mean squared errors (RMSE) over the novel tasks in Table 3.
The results of the other methods are cited directly from (Rothfuss et al., 2021) [Table 1]. Note that Rothfuss et al. do not report test bounds over the novel tasks in their original work, so Table 3 reports only the test errors for fair comparison.

Table 3. Comparison of meta-learning algorithms in terms of test RMSE in 5 regression environments. Reported are the mean and standard deviation across 5 seeds. Our GOHP-NN achieves competitive average error over the 5 environments.

| Method | Cauchy | SwissFEL | Physionet-GCS | Physionet-HCT | Berkeley-Sensor |
| --- | --- | --- | --- | --- | --- |
| Vanilla BNN (Liu & Wang, 2016) | 0.327 ± 0.008 | 0.529 ± 0.022 | 2.664 ± 0.274 | 3.938 ± 0.869 | 0.109 ± 0.004 |
| MLL-GP (Fortuin & Rätsch, 2019) | 0.216 ± 0.003 | 0.974 ± 0.093 | 1.654 ± 0.094 | 2.634 ± 0.144 | 0.058 ± 0.002 |
| MLAP (Amit & Meir, 2018) | 0.219 ± 0.004 | 0.486 ± 0.026 | 2.009 ± 0.248 | 2.470 ± 0.039 | 0.050 ± 0.005 |
| MAML (Finn et al., 2017) | 0.219 ± 0.004 | 0.730 ± 0.057 | 1.895 ± 0.141 | 2.413 ± 0.113 | 0.045 ± 0.003 |
| BMAML (Yoon et al., 2018) | 0.225 ± 0.004 | 0.577 ± 0.044 | 1.894 ± 0.062 | 2.500 ± 0.002 | 0.073 ± 0.014 |
| PACOH-GP (Rothfuss et al., 2021) | 0.209 ± 0.008 | 0.376 ± 0.024 | 1.498 ± 0.081 | 2.361 ± 0.047 | 0.065 ± 0.005 |
| PACOH-NN (Rothfuss et al., 2021) | 0.195 ± 0.001 | 0.372 ± 0.002 | 1.561 ± 0.061 | 2.405 ± 0.017 | 0.043 ± 0.001 |
| GOHP-NN (ours) | 0.198 ± 0.016 | 0.333 ± 0.013 | 1.521 ± 0.067 | 2.422 ± 0.013 | 0.043 ± 0.004 |

5.2. Experimental Results

Classification Results. From Table 2, we can see that minimizing our two proposed generalization bounds achieves competitive results w.r.t. existing PAC-Bayesian meta-learning methods, in terms of both test bounds and test errors over the novel tasks. In particular, our Catoni-bound method obtains the tightest test bounds and the lowest prediction errors over all tasks from the different environments. This is consistent with our theoretical result that the Catoni-bound is so far the tightest PAC-Bayesian bound for meta-learning. Meanwhile, the prediction performance of all methods gets worse as the number of pixel swaps increases. This indicates the importance of the task relatedness of the environment to the success of meta-learning approaches.

Regression Results. From Table 3, we can observe that our proposed GOHP-NN algorithm achieves comparable results w.r.t. the state-of-the-art PAC-Bayesian meta-learning method PACOH (Rothfuss et al., 2021) over the 5 regression environments. Specifically, our GOHP-NN obtains the lowest test errors on 2 regression environments and yields competitive test errors on the other 3. The detailed reasons why GOHP achieves results analogous to PACOH can be found in Remark 2 in Appendix F.
Therefore, we can conclude that our proposed PAC-Bayesian Catoni-bound obtains comparable performance with respect to the latest meta-learning regression methods in terms of average test errors on the novel tasks.

Convergence Analysis. We compare the convergence rates of our two PAC-Bayesian bounds and two classical bounds (Pentina & Lampert, 2014; Amit & Meir, 2018) in the 200-pixel-shuffled classification environment. The average test bounds and test errors are calculated over the novel tasks across a wide range of the number n of training tasks and across a large range of the sample size m per task. The visualization is shown in Figure 1.

Figure 1. Comparisons between our bounds (i.e., kl-bound & Catoni-bound) and other bounds (i.e., Pentina-bound & Amit-bound). Both test bounds and test errors are averaged over 20 meta-test classification tasks from the 200-pixel-shuffled environment. (a)-(b): Results across a range of the number n of training tasks. (c)-(d): Results across a range of the sample size m per training task.

We can find that: (1) Both the test bounds and the test errors of all methods decrease as n and m increase. (2) Our two proposed bounds achieve competitive performance with respect to the existing methods. In particular, our PAC-Bayesian Catoni-bound obtains the tightest test bounds and the lowest test errors over the novel tasks, empirically validating the effectiveness of our theoretical analysis for meta-learning.

6. Conclusions

This work provides a unified proof framework for PAC-Bayesian bounds in single-task learning and meta-learning. The tightest PAC-Bayesian kl-bound and Catoni-bound in single-task learning are generalized to the meta-learning setting, followed by two bound-minimizing meta-learning classification algorithms. Next, we show how to obtain the closed-form formula of the Gibbs optimal hyper-posterior by minimizing our Catoni-bound, leading to an efficient meta-learning regression algorithm. In addition, we obtain a chromatic PAC-Bayesian kl-bound for the more challenging meta-learning setting where the training data exhibit dependencies. Experiments on classification and regression problems further validate the effectiveness of our proposed PAC-Bayesian bounds. In particular, our Catoni-bound obtains the tightest test bounds and the lowest test errors on classification problems, and achieves comparable results with existing methods on regression problems. Overall, we show how to derive two fast-rate PAC-Bayesian bounds for meta-learning, and how to apply these bounds in different settings to yield further theoretical results.

Acknowledgements

Jiechao would like to thank Prof. Yong Liu from GSAI for insightful comments on the writing of this paper, and Prof. Zongfei Fu and Dr. Yin Gao from the School of Mathematics in RUC for helpful discussions. We also thank all reviewers for their constructive suggestions to improve the quality of this paper. This work was supported in part by the National Natural Science Foundation of China (61976220 and 61832017), the Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098), and the Large-Scale Pre-Training Program 468 of the Beijing Academy of Artificial Intelligence (BAAI). Prof. Zhiwu Lu is the corresponding author of this paper.

References

Alquier, P. and Guedj, B. Simpler PAC-Bayesian bounds for hostile data. Machine Learning, 107(5):887-902, 2018.

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning (ICML), pp. 205-214, 2018.

Audibert, J.-Y. PAC-Bayesian aggregation and multi-armed bandits. arXiv preprint arXiv:1011.3396, 2010.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

Berend, D. and Tassa, T.
Efficient bounds on Bell numbers and on moments of sums of random variables. Probability and Mathematical Statistics, 30:185-205, 2010.

Catoni, O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, 2007.

Chen, J., Wu, X., Li, Y., Li, Q., Zhan, L., and Chung, F. A closer look at the training strategy for modern meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 396-406, 2020.

Chen, Y., Guo, X., Wang, C., Qiu, J., Qi, G., Wang, M., and Li, H. Leveraging table content for zero-shot text-to-SQL with meta-learning. In AAAI Conference on Artificial Intelligence (AAAI), pp. 3992-4000, 2021.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Advances in Neural Information Processing Systems (NeurIPS), pp. 10190-10200, 2018.

Farid, A. and Majumdar, A. Generalization bounds for meta-learning via PAC-Bayes and uniform stability. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML), pp. 1126-1135, 2017.

Fortuin, V. and Rätsch, G. Deep mean functions for meta-learning in Gaussian processes. arXiv preprint arXiv:1901.08098, 2019.

Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pp. 353-360, 2009.

Haddouche, M., Guedj, B., Rivasplata, O., and Shawe-Taylor, J. PAC-Bayes unleashed: Generalisation bounds with unbounded losses. Entropy, 23(10), 2021.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Holland, M. J. PAC-Bayes under potentially heavy tails. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2711-2720, 2019.

Janson, S. Large deviations for sums of partly dependent random variables. Random Structures & Algorithms, 24(3):234-248, 2004.

Li, K., Gupta, A., Reddy, A., Pong, V. H., Zhou, A., Yu, J., and Levine, S. MURAL: Meta-learning uncertainty-aware rewards for outcome-driven reinforcement learning. In International Conference on Machine Learning (ICML), pp. 6346-6356, 2021.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2370-2378, 2016.

Madden, S. Intel lab data. http://db.csail.mit.edu/labdata/labdata.html, 2004. Accessed: Sep 8, 2020.

Maurer, A. A note on the PAC-Bayesian theorem. arXiv preprint arXiv:cs/0411099, 2004.

Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research (JMLR), 6:967-994, 2005.

Maurer, A. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research (JMLR), 17:81:1-81:32, 2016.

McAllester, D. A. Some PAC-Bayesian theorems. In Annual Conference on Learning Theory (COLT), pp. 230-234, 1998.

McAllester, D. A. Simplified PAC-Bayesian margin bounds. In Annual Conference on Learning Theory (COLT), pp. 203-215, 2003.

Milne, C. J., Schietinger, T., Aiba, M., Alarcon, A., Alex, J., Anghel, A., Arsov, V., Beard, C., Beaud, P., Bettoni, S., et al.
SwissFEL: The Swiss X-ray free-electron laser. Applied Sciences, 7:720:1-720:57, 2017. URL https://www.mdpi.com/2076-3417/7/7/720.

Pentina, A. and Lampert, C. H. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), pp. 991-999, 2014.

Pentina, A. and Lampert, C. H. Lifelong learning with non-i.i.d. tasks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1540-1548, 2015.

Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., and Szepesvári, C. Tighter risk certificates for neural networks. Journal of Machine Learning Research (JMLR), 22(227):1-40, 2021.

Ralaivola, L., Szafranski, M., and Stempfel, G. Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research (JMLR), 11:1927-1956, 2010.

Rothfuss, J., Fortuin, V., Josifoski, M., and Krause, A. PACOH: Bayes-optimal meta-learning with PAC-guarantees. In International Conference on Machine Learning (ICML), pp. 9116-9126, 2021.

Seeger, M. W. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research (JMLR), 3:233-269, 2002.

Seldin, Y., Laviolette, F., Cesa-Bianchi, N., Shawe-Taylor, J., and Auer, P. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.

Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In Computing in Cardiology, 2012.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4077-4087, 2017.

Thrun, S. and Pratt, L. Learning to Learn. Kluwer Academic Publishers, 1998.

Wu, Q., Lin, Z., Wang, G., Chen, H., Karlsson, B. F., Huang, B., and Lin, C. Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. In AAAI Conference on Artificial Intelligence (AAAI), pp. 9274-9281, 2020.

Yann, L. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Ye, H., Hu, H., Zhan, D., and Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8805-8814, 2020.

Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7343-7353, 2018.

Zhang, T. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307-1321, 2006.

A. Auxiliary Results

We first recall Lemmas 2-5 before establishing our results.

Lemma 2 (Berend & Tassa, 2010) [Lemma 5.2] Let {ξ_k}_{k=1}^K be independent random variables with P(0 ≤ ξ_k ≤ a_k) = 1. For 1 ≤ k ≤ K, let η_k be a random variable assuming only the values 0 and a_k and having the same expectation as ξ_k, i.e., Eη_k = Eξ_k. Then for any convex function g : R → R,

E g\Big( \sum_{k=1}^K ξ_k \Big) \le E g\Big( \sum_{k=1}^K η_k \Big).

Lemma 3 (Berend & Tassa, 2010) [Proposition 3.1] Let {ξ_k}_{k=1}^K be a sequence of independent random variables with P(0 ≤ ξ_k ≤ 1) = 1, and let {η_k}_{k=1}^K be a sequence of i.i.d. Bernoulli random variables with Eη_k = K^{-1}\sum_{k=1}^K Eξ_k. Then for any convex function g,

E g\Big( \sum_{k=1}^K ξ_k \Big) \le E g\Big( \sum_{k=1}^K η_k \Big).

Remark 1 Using Jensen's inequality for the convex function g, we have g(\sum_{k=1}^K Eη_k) = g(\sum_{k=1}^K Eξ_k) \le E g(\sum_{k=1}^K ξ_k).
Thus, what Lemma 3 does is to insert the term E g(\sum_{k=1}^K ξ_k) into Jensen's inequality g(\sum_{k=1}^K Eη_k) \le E g(\sum_{k=1}^K η_k). The inequality in Lemma 3 is therefore truly tight.

Corollary 3 (Lemma 1 in the main paper) In the setting of Lemma 2 or Lemma 3, for any convex function g, E g(\frac{1}{K}\sum_{k=1}^K ξ_k) \le E g(\frac{1}{K}\sum_{k=1}^K η_k).

Proof. The composition (g ∘ f)(x) of a convex function g and a linear function f(x) = ax + b is still convex. Setting a = 1/K, b = 0 completes the proof.

Lemma 4 (Change of Measure) Let F be a set of random variables f. Let S = {ξ_k}_{k=1}^K be a sequence of random variables with each component ξ_k (k ∈ [K]) drawn independently according to the measure µ_k over the set A_k. Then, for any functions R(f), r(f) over F (either of which may be a statistic of S), any reference measure π over F, any λ > 0, and any convex function D : R × R → R, the following holds for any measure ρ over F:

D(E_ρ R(f), E_ρ r(f)) \le \frac{1}{\lambda}\Big[ K(ρ, π) + \ln E_{f\sim\pi} e^{\lambda D(R(f), r(f))} \Big],

where K(ρ, π) denotes the KL-divergence between the distributions ρ and π.

Lemma 5 (Catoni, 2007) [Lemma 1.1.1] Given independent input-output pairs {(x_k, y_k)}_{k=1}^K ∈ (X × Y)^K and a class of classification rules F = {f : X → Y}, let R(f) = K^{-1}\sum_{k=1}^K P[f(x_k) ≠ y_k] and r(f) = K^{-1}\sum_{k=1}^K 1[f(x_k) ≠ y_k]. Then for any real constant λ ∈ R and any f ∈ F,

E \exp\{ \lambda[\Phi_{\lambda/K}(R(f)) - r(f)] \} \le 1,

where² \Phi_a(p) = -a^{-1}\ln\{1 - [1 - \exp(-a)]p\}.

² \Phi_a(p) is a one-to-one increasing function with respect to p ∈ [0,1], and is convex when a > 0.

B. Proofs of the PAC-Bayesian Bounds for Meta-Learning with Independent Samples

B.1. Proof of the PAC-Bayesian kl-Bound

Proposition 1 (Part of Theorem 2 in the main paper) In the setting of Lemma 4, set D(q, p) = kl(p, q), where kl(p, q) = p \ln(p/q) + (1−p)\ln((1−p)/(1−q)). Let R(f) = \frac{1}{K}\sum_{k=1}^K E_{ξ_k} g_k(f, ξ_k) and r(f) = \frac{1}{K}\sum_{k=1}^K g_k(f, ξ_k), where g_k : F × A_k → [0,1] is a bounded function. Then with probability at least 1−δ over the draw of S, the following holds for any measure ρ:

kl(ρ(r), ρ(R)) \le \frac{K(ρ, π) + \ln(2\sqrt{K}/\delta)}{K}.

In particular, we have the explicit generalization bound

|ρ(R) - ρ(r)| \le \sqrt{\frac{2ρ(r)\,\Delta}{K}} + \frac{2\Delta}{K}, \quad where \; \Delta = K(ρ, π) + \ln(2\sqrt{K}/\delta).

Proof. For any fixed f ∈ F, let {η_k}_{k=1}^K be i.i.d. Bernoulli random variables with Eη_k = \frac{1}{K}\sum_{k=1}^K E[g_k(f, ξ_k)]. Note that kl(p, q) is a convex function with respect to p and exp is a nondecreasing convex function; hence exp{λ kl(p, q)} is a convex function with respect to p (λ > 0). Then, setting λ = K, we have

E_S e^{K\,kl(r(f), R(f))} = E e^{K\,kl(\frac{1}{K}\sum_k g_k(f, ξ_k),\; \frac{1}{K}\sum_k Eη_k)} \le E e^{K\,kl(\frac{1}{K}\sum_k η_k,\; \frac{1}{K}\sum_k Eη_k)} \;(Corollary 3)\; \le 2\sqrt{K}.

The last inequality holds because, for a binomial random variable η ∼ B(K, µ),

E e^{K\,kl(\eta/K, \mu)} = \sum_{k=0}^K \binom{K}{k} \mu^k (1-\mu)^{K-k} e^{K\,kl(k/K, \mu)} \le 2\sqrt{K}

(Maurer, 2004). Then, recalling Lemma 4 and Markov's inequality, we have with probability at least 1−δ:

kl(ρ(r), ρ(R)) \le \frac{1}{K}\Big[ K(ρ, π) + \ln E_{f\sim\pi} e^{K\,kl(r(f), R(f))} \Big] \le \frac{1}{K}\Big[ K(ρ, π) + \ln\big( E_S E_{f\sim\pi} e^{K\,kl(r(f), R(f))}/\delta \big) \Big] \;(Markov)\; = \frac{1}{K}\Big[ K(ρ, π) + \ln\big( E_{f\sim\pi} E_S e^{K\,kl(r(f), R(f))}/\delta \big) \Big] \;(Fubini)\; \le \frac{K(ρ, π) + \ln(2\sqrt{K}/\delta)}{K},

which completes the proof of the first assertion. Using Pinsker's inequality kl(p, q) ≥ 2(p−q)², we can directly obtain an explicit upper bound on |ρ(R) − ρ(r)| that holds with probability at least 1−δ:

|ρ(R) - ρ(r)| \le \sqrt{\frac{K(ρ, π) + \ln(2\sqrt{K}/\delta)}{2K}}.

For the second assertion, note that the first assertion is equivalent to the following statement (McAllester, 2003) [Eq. (5)]:

∀ρ, \quad ρ(R) \le \sup\Big\{ \epsilon : kl(ρ(r), \epsilon) \le \frac{K(ρ, π) + \ln(2\sqrt{K}/\delta)}{K} \Big\}.

Thus we can use the tighter version of Pinsker's inequality, kl(p, q) ≥ (p−q)²/(2q) when q > p. Then, if kl(p, q) ≤ x, we have [q − (p + x)]² − x² − 2px ≤ 0, and so |q − p| \le x + \sqrt{x^2 + 2px} \le 2x + \sqrt{2px}.
Combining this with the bound on kl(ρ(r), ρ(R)) in the first assertion, we obtain the high-probability bound

|ρ(R) - ρ(r)| \le \sqrt{\frac{2ρ(r)[K(ρ, π) + \ln(2\sqrt{K}/\delta)]}{K}} + \frac{2[K(ρ, π) + \ln(2\sqrt{K}/\delta)]}{K}.

We use the above proposition to bound kl(eer(Q), er(Q)) and kl(ber(Q), eer(Q)) respectively for meta-learning.

Proposition 2 For any δ ∈ (0,1), with probability at least 1−δ over the draw of the n distributions {D_i}_{i=1}^n, the following holds for any hyper-posterior Q:

|er(Q) - eer(Q)| \le \sqrt{\frac{K(Q, P) + \ln(2\sqrt{n}/\delta)}{2n}}.

Proof. Notice that

er(Q) = E_{P\sim Q} E_{D\sim\tau,\,S\sim D^m}\,er(Q(S, P), D), \quad eer(Q) = E_{P\sim Q}\frac{1}{n}\sum_{i=1}^n er(Q(S_i, P), D_i).

Recalling Proposition 1, we set K = n, f = P, π = P, ρ = Q, ξ_i = (D_i, S_i), and g_i(f, ξ_i) = E_{h\sim Q(S_i, P)} E_{z\sim D_i} l(h, z) ∈ [0,1]. Thus, with Pinsker's inequality kl(p, q) ≥ 2(p−q)²,

|er(Q) - eer(Q)| \le \sqrt{\frac{K(Q, P) + \ln(2\sqrt{n}/\delta)}{2n}}.

Proposition 3 For any hyper-prior P and any δ ∈ (0,1), with probability at least 1−δ over the draw of the training samples S = {S_i}_{i=1}^n, the following holds for any hyper-posterior Q:

|eer(Q) - ber(Q)| \le \sqrt{\frac{\Delta_1}{2mn}}, \quad where \; \Delta_1 = K(Q, P) + E_{P\sim Q}\sum_{i=1}^n K(Q_i, P) + \ln(2\sqrt{mn}/\delta).

Proof. Notice that

eer(Q) = E_{P\sim Q} E_{(h_1,...,h_n)\sim Q_1\times\cdots\times Q_n}\frac{1}{n}\sum_{i=1}^n E_{z\sim D_i} l(h_i, z), \quad ber(Q) = E_{P\sim Q} E_{(h_1,...,h_n)\sim Q_1\times\cdots\times Q_n}\frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m l(h_i, z_{ij}).

Recalling Proposition 1, we set f = (P, h_1, ..., h_n), π = P × P^n, ρ = Q × \prod_{i=1}^n Q_i, where Q_i = Q(S_i, P), ξ_k = z_{ij}, and g_k(f, ξ_k) = l(h_i, z_{ij}). With probability at least 1−δ we have

|eer(Q) - ber(Q)| \le \sqrt{\frac{K(Q\times\prod_{i=1}^n Q_i,\; P\times P^n) + \ln(2\sqrt{mn}/\delta)}{2mn}}.

Further, notice that

K\Big(Q\times\prod_{i=1}^n Q_i,\; P\times P^n\Big) = E_{Q\times\prod_i Q_i}\ln\frac{d(Q\times\prod_i Q_i)}{d(P\times P^n)} = E_{Q\times\prod_i Q_i}\Big[\ln\frac{dQ}{dP} + \sum_{i=1}^n\ln\frac{dQ_i}{dP}\Big] = K(Q, P) + E_{P\sim Q}\sum_{i=1}^n K(Q_i, P),

which completes the whole proof.

Proof of Theorem 3 in the main paper. Note that the generalization bounds in Propositions 2-3 are both two-sided; a one-sided version can be stated by replacing 2/δ in the two-sided version with 1/δ. Therefore, combining the one-sided bounds in Propositions 2-3 and applying the union bound to these two high-probability inequalities finishes the proof.

B.2. Proof of the PAC-Bayesian Catoni-Bound

Proposition 4 (Part of Theorem 2 in the main paper) Set D(q, p) = \Phi_{\lambda/K}(q) − p in Lemma 4. Let R(f) = \frac{1}{K}\sum_{k=1}^K E_{ξ_k} g_k(f, ξ_k) and r(f) = \frac{1}{K}\sum_{k=1}^K g_k(f, ξ_k), where g_k : F × A_k → [0,1] is a bounded function. Then with probability at least 1−δ over the draw of S, the following holds for any measure ρ over F:

ρ(R) \le \Phi^{-1}_{\lambda/K}\Big[ ρ(r) + \frac{K(ρ, π) + \ln(1/\delta)}{\lambda} \Big] = \frac{1 - \exp\{-\frac{\lambda}{K}ρ(r) - \frac{K(ρ, π) + \ln(1/\delta)}{K}\}}{1 - \exp\{-\lambda/K\}} \le \frac{\lambda ρ(r)}{K[1 - \exp\{-\lambda/K\}]} + \frac{K(ρ, π) + \ln(1/\delta)}{K[1 - \exp\{-\lambda/K\}]}.

Proof. We proceed with arguments similar to those in the proof of Proposition 1. Denote by {η_k}_{k=1}^K independent Bernoulli random variables with Eη_k = E_{ξ_k} g_k(f, ξ_k). Since D(q, p) is a convex function of p, and exp is a nondecreasing convex function, for λ ∈ R_+ the map exp{λD(q, p)} is also convex with respect to p. Then we have

E_S e^{\lambda D(R(f), r(f))} = E e^{\lambda D(\frac{1}{K}\sum_k E_{ξ_k} g_k(f, ξ_k),\; \frac{1}{K}\sum_k g_k(f, ξ_k))} \le E e^{\lambda D(\frac{1}{K}\sum_k Eη_k,\; \frac{1}{K}\sum_k η_k)} \;(Corollary 3)\; \le 1 \;(Lemma 5).

Recalling Lemma 4 and Markov's inequality, we have with probability at least 1−δ:

\Phi_{\lambda/K}(ρ(R)) - ρ(r) = D(ρ(R), ρ(r)) \le \frac{1}{\lambda}\Big[ K(ρ, π) + \ln E_{f\sim\pi} e^{\lambda D(R(f), r(f))} \Big] \le \frac{1}{\lambda}\Big[ K(ρ, π) + \ln\big( E_S E_{f\sim\pi} e^{\lambda D(R(f), r(f))}/\delta \big) \Big] = \frac{K(ρ, π) + \ln(1/\delta)}{\lambda} + \frac{1}{\lambda}\ln E_{f\sim\pi} E_S e^{\lambda D(R(f), r(f))} \le \frac{K(ρ, π) + \ln(1/\delta)}{\lambda}.

Further, noting that the inverse function of \Phi_a(p) is \Phi^{-1}_a(q) = \frac{1 - \exp\{-aq\}}{1 - \exp\{-a\}} and using the basic inequality 1 − exp(−x) ≤ x, we finish the whole proof.

We can derive a more concise corollary of the above result.

Corollary 4 In the setting of Proposition 4, let λ = K ln C with C > 1; then exp{−λ/K} = 1/C.
Then with probability at least 1−δ over the draw of S, the following holds for any probability measure ρ:

ρ(R) \le \frac{C\ln C}{C-1}\,ρ(r) + \frac{C}{C-1}\cdot\frac{K(ρ, π) + \ln(1/\delta)}{K}.

With almost the same arguments as in the proofs of Propositions 2-3, we can immediately obtain the following two propositions for meta-learning by using Proposition 4 and Corollary 4; the detailed proofs are thus omitted.

Proposition 5 For any δ ∈ (0,1), with probability at least 1−δ over the draw of the n distributions {D_i}_{i=1}^n, the following holds for any C > 1 and any hyper-posterior Q:

er(Q) \le \frac{C\ln C}{C-1}\,eer(Q) + \frac{C}{C-1}\cdot\frac{K(Q, P) + \ln(1/\delta)}{n}.

Proposition 6 For any hyper-prior P and any δ ∈ (0,1), with probability at least 1−δ over the draw of the training samples S = {S_i}_{i=1}^n, the following holds for any hyper-posterior Q:

eer(Q) \le \frac{C\ln C}{C-1}\,ber(Q) + \frac{C}{C-1}\cdot\frac{K(Q, P) + E_{P\sim Q}\sum_{i=1}^n K(Q_i, P) + \ln(1/\delta)}{nm}.

Proof of Theorem 4 in the main paper. Combining Proposition 5 and Proposition 6 (each at confidence level 1−δ/2), we have with probability at least 1−δ, for all C_1, C_2 > 1:

er(Q) \le \frac{C_1 C_2\ln C_1\ln C_2}{(C_1-1)(C_2-1)}\,ber(Q) + \frac{C_1}{C_1-1}\cdot\frac{K(Q, P) + \ln(2/\delta)}{n} + \frac{C_1 C_2\ln C_1}{(C_1-1)(C_2-1)}\cdot\frac{K(Q, P) + E_{P\sim Q}\sum_{i=1}^n K(Q_i, P) + \ln(2/\delta)}{nm}.

C. Proofs of the Theoretical Results on the Gibbs Optimal Hyper-posterior

This section details the proofs of the theoretical results on optimizing our PAC-Bayesian Catoni-bound with the Gibbs optimal hyper-posterior. We first give the following helpful lemma, which exhibits the explicit form of the Gibbs optimal posterior for certain optimization problems.

Lemma 6 (Catoni, 2007) [Lemma 1.1.3] Let φ : F → R be a measurable function. Then for any predefined probability measure π ∈ M1(F) and any β > 0, the minimizing probability measure ρ* of the optimization problem

\min_{ρ\in M1(F)} \; β\,E_{f\sim ρ}\,φ(f) + K(ρ, π)

has the following explicit form:

\frac{dρ*}{dπ}(f) = \frac{e^{-βφ(f)}}{Z}, \quad where \; Z = \int_F e^{-βφ(f)}\,dπ(f) \; is the normalization constant.

Proof of Corollary 1 in the main paper. Recall Theorem 4. Setting C_2 = e, the PAC-Bayesian meta-learning bound on er(Q) in Theorem 4 can be rewritten as

er(Q) \le \underbrace{\frac{e\,C_1\ln C_1}{(C_1-1)(e-1)}\Big[ ber(Q) + \frac{E_{P\sim Q}\sum_{i=1}^n K(Q_i, P)}{nm} \Big]}_{I} + \frac{C_1}{C_1-1}\cdot\frac{K(Q, P) + \ln(2/\delta)}{n} + \frac{e\,C_1\ln C_1}{(C_1-1)(e-1)}\cdot\frac{K(Q, P) + \ln(2/\delta)}{nm}.

If we choose Q*_i as the Gibbs optimal posterior for each task, the bracket in term I becomes

ber(Q) + \frac{E_{P\sim Q}\sum_{i=1}^n K(Q*_i, P)}{nm} = E_{P\sim Q}\frac{1}{nm}\sum_{i=1}^n\big[ m\,ber(Q*_i, S_i) + K(Q*_i, P) \big] = E_{P\sim Q}\frac{1}{nm}\sum_{i=1}^n E_{h\sim Q*_i}\Big[ m\,ber(h, S_i) + \ln\frac{dQ*_i}{dP}(h) \Big] = E_{P\sim Q}\frac{1}{nm}\sum_{i=1}^n E_{h\sim Q*_i}\big[ m\,ber(h, S_i) - m\,ber(h, S_i) - \ln Z(S_i, P) \big] = E_{P\sim Q}\frac{1}{nm}\sum_{i=1}^n\big[ -\ln Z(S_i, P) \big],

which finishes the whole proof.

Proof of Corollary 2 in the main paper. Recalling Corollary 1, we can express the Gibbs optimal hyper-posterior as

Q* = \arg\min_{Q\in M1(M1(H))}\Big\{ \frac{e\,C_1\ln C_1}{(C_1-1)(e-1)}\,E_{P\sim Q}\frac{1}{nm}\sum_{i=1}^n[-\ln Z(S_i, P)] + \Big[ \frac{C_1}{n(C_1-1)} + \frac{e\,C_1\ln C_1}{nm(C_1-1)(e-1)} \Big] K(Q, P) \Big\}

(the terms involving ln(2/δ) do not depend on Q and can be dropped). Dividing the objective through by α = \frac{e\,C_1\ln C_1}{nm(C_1-1)(e-1)} + \frac{C_1}{n(C_1-1)} gives

Q* = \arg\min_{Q\in M1(M1(H))}\Big[ E_{P\sim Q}\Big( -\frac{β}{nm}\sum_{i=1}^n \ln Z(S_i, P) \Big) + K(Q, P) \Big].

Applying Lemma 6 to the above minimization problem yields the closed form of the Gibbs optimal hyper-posterior.
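Lemma 6 is easy to verify numerically on a finite hypothesis set: the Gibbs measure attains a lower objective value than any competing distribution. A small self-contained check (illustrative only, with arbitrary choices of φ, π, and β):

```python
import numpy as np

rng = np.random.default_rng(0)
F = 8                                  # finite "hypothesis set" of size 8
phi = rng.uniform(0.0, 1.0, size=F)    # arbitrary function phi: F -> R
pi = np.full(F, 1.0 / F)               # uniform reference measure pi
beta = 5.0

def objective(rho):
    # beta * E_rho[phi] + KL(rho || pi), with the convention 0 ln 0 = 0.
    kl = np.sum(np.where(rho > 0, rho * np.log(rho / pi), 0.0))
    return beta * np.dot(rho, phi) + kl

gibbs = pi * np.exp(-beta * phi)
gibbs /= gibbs.sum()                   # d rho*/d pi = e^{-beta phi} / Z

# The Gibbs measure should beat 1000 random candidate distributions.
best_other = min(objective(rng.dirichlet(np.ones(F))) for _ in range(1000))
print(objective(gibbs), "<=", best_other)
```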
D. Proof of the PAC-Bayesian kl-Bound for Meta-Learning with Dependent Samples

We first give a fundamental lemma about the exact proper fractional cover of the dependence graph.

Lemma 7 ((Janson, 2004)[Lemma 3.1]) If $\mathcal{C} = \{(C_j, w_j)\}_{j=1}^J$ is an exact fractional cover of the dependence graph $\Gamma = (V, E)$ with $V = [K]$, then every $k\in[K]$ satisfies $\sum_{j:\,k\in C_j} w_j = 1$. Further, $K = \sum_{j=1}^J w_j|C_j|$, where $|C_j|$ is the size of $C_j$.

Proposition 7 (Theorem 5 in the main paper) Let $\mathcal{F}$ be a set of functions $f$. Let $S = (\xi_1, \dots, \xi_K)$ be a size-$K$ random vector with each component $\xi_k$ ($k\in[K]$) drawn according to the measure $\mu_k$ over the set $A_k$. Let $R(f,S) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{\xi_k} g_k(f,\xi_k)$ and $r(f,S) = \frac{1}{K}\sum_{k=1}^K g_k(f,\xi_k)$, where $g_k: \mathcal{F}\times A_k\to[0,1]$ is a bounded function. Then for any reference measure $\pi$ over $\mathcal{F}$, with probability at least $1-\delta$ over the draw of $S$, the following holds for any measure $\rho$:
$$\mathrm{kl}\big(\mathbb{E}_\rho r(f,S),\ \mathbb{E}_\rho R(f,S)\big) \le \frac{\chi^*(S)}{K}\Big[K(\rho,\pi) + \ln\Big(\frac{2}{\delta}\sqrt{\frac{K}{\chi^*(S)}}\Big)\Big],$$
where $\chi^*(S)$ denotes the fractional chromatic number of the dependence graph of $S$.

Proof. According to Lemma 3.2 in (Janson, 2004), the fractional chromatic number is achieved by an exact proper fractional cover. Therefore, let $\mathcal{C} = \{(C_j, w_j)\}_{j=1}^J$ be an exact proper fractional cover of the dependence graph $\Gamma(S)$. Denote $\pi_j = \frac{w_j|C_j|}{K}$, so that $\sum_j \pi_j = 1$ by Lemma 7. Let $S^{(j)} = \{\xi_k\}_{k\in C_j}$; the elements of $S^{(j)}$ are mutually independent since $C_j$ is an independent set of $\Gamma(S)$. Then we have
$$\sum_{j=1}^J \pi_j\, r(f, S^{(j)}) = \frac{1}{K}\sum_{j=1}^J w_j \sum_{l\in C_j} g_l(f,\xi_l) = \frac{1}{K}\sum_{k=1}^K g_k(f,\xi_k) = r(f,S) \quad \text{(Lemma 7)},$$
and likewise
$$\sum_{j=1}^J \pi_j\, \mathbb{E}_{S^{(j)}} r(f, S^{(j)}) = \frac{1}{K}\sum_{j=1}^J w_j \sum_{l\in C_j} \mathbb{E}_{\xi_l} g_l(f,\xi_l) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{\xi_k} g_k(f,\xi_k) = R(f,S) \quad \text{(Lemma 7)}.$$
Then we have
$$\mathrm{kl}\big(\mathbb{E}_\rho r(f,S), \mathbb{E}_\rho R(f,S)\big) = \mathrm{kl}\Big(\mathbb{E}_\rho \sum_j \pi_j r(f,S^{(j)}),\ \mathbb{E}_\rho \sum_j \pi_j \mathbb{E}_{S^{(j)}} r(f,S^{(j)})\Big) \le \sum_{j=1}^J \pi_j\, \mathbb{E}_\rho\, \mathrm{kl}\big(r(f,S^{(j)}), R(f,S^{(j)})\big) \quad \text{(Jensen)}$$
$$\le \sum_{j=1}^J \frac{\pi_j}{|C_j|}\Big[K(\rho,\pi) + \ln \mathbb{E}_{f\sim\pi}\, e^{|C_j|\,\mathrm{kl}(r(f,S^{(j)}),R(f,S^{(j)}))}\Big] \le \sum_{j=1}^J \frac{\pi_j}{|C_j|}\Big[K(\rho,\pi) + \ln\Big(\frac{1}{\delta}\mathbb{E}_{S^{(j)}}\mathbb{E}_{f\sim\pi}\, e^{|C_j|\,\mathrm{kl}(r(f,S^{(j)}),R(f,S^{(j)}))}\Big)\Big]$$
$$\le \sum_{j=1}^J \frac{\pi_j}{|C_j|}\Big[K(\rho,\pi) + \ln\frac{2\sqrt{|C_j|}}{\delta}\Big] = \frac{w}{K}\Big[K(\rho,\pi) + \ln\frac{2}{\delta}\Big] + \frac{1}{K}\sum_{j=1}^J w_j \ln\sqrt{|C_j|},$$
where $w = \sum_{j=1}^J w_j$; the second inequality holds due to the change of measure lemma (cf. Lemma 4), the third uses Markov's inequality, and the last proceeds exactly as in the proof of Proposition 1. Further, denote $\alpha_j = \frac{w_j}{w}$, so that $\sum_j \alpha_j = 1$, and hence
$$\sum_{j=1}^J w_j \ln\sqrt{|C_j|} = w\sum_j \alpha_j \ln\sqrt{|C_j|} \le w\ln\sum_j \alpha_j\sqrt{|C_j|} \ \text{(Jensen)} \le w\ln\Big[\Big(\sum_j\alpha_j\Big)^{1/2}\Big(\sum_j\alpha_j|C_j|\Big)^{1/2}\Big] \ \text{(Cauchy--Schwarz)} = w\ln\sqrt{\frac{K}{w}}.$$
Combining the above results, we finally obtain
$$\mathrm{kl}\big(\mathbb{E}_\rho r(f,S), \mathbb{E}_\rho R(f,S)\big) \le \frac{w}{K}\Big[K(\rho,\pi) + \ln\Big(\frac{2}{\delta}\sqrt{\frac{K}{w}}\Big)\Big]. \quad (5)$$
The RHS of inequality (5) is an increasing function of $w$ (the detailed verification is left to readers), and hence attains its minimum over exact proper fractional covers at $w = \chi^*(S)$. This completes the whole proof.

With almost the same proceedings as in the proofs of Propositions 2-3, we can immediately obtain the following two propositions, which apply to the meta-learning setting where dependence exists among samples. The detailed proof is left to readers.

Proposition 8 For any $\delta\in(0,1)$, with probability at least $1-\delta$ over the draw of the $n$ distributions $\mathbf{D} = \{D_i\}_{i=1}^n$, the following holds for any hyper-posterior $\mathcal{Q}$:
$$|\mathrm{er}(\mathcal{Q}) - \widetilde{\mathrm{er}}(\mathcal{Q})| \le \sqrt{\frac{\chi^*(\mathbf{D})\Big[K(\mathcal{Q},\mathcal{P}) + \ln\Big(\frac{2}{\delta}\sqrt{\frac{n}{\chi^*(\mathbf{D})}}\Big)\Big]}{2n}},$$
where $\chi^*(\mathbf{D})$ denotes the fractional chromatic number of the dependence graph of $\mathbf{D}$.

Proposition 9 For any hyper-prior $\mathcal{P}$ and any $\delta\in(0,1)$, with probability at least $1-\delta$ over the draw of the training sample $\mathbf{S} = \{S_i\}_{i=1}^n$, the following holds for any hyper-posterior $\mathcal{Q}$:
$$|\widetilde{\mathrm{er}}(\mathcal{Q}) - \widehat{\mathrm{er}}(\mathcal{Q})| \le \sqrt{\frac{\chi^*(\mathbf{S})\,\Delta}{2mn}}, \quad \text{where } \Delta = K(\mathcal{Q},\mathcal{P}) + \mathbb{E}_{P\sim\mathcal{Q}}\sum_{i=1}^n K(Q_i,P) + \ln\Big(\frac{2}{\delta}\sqrt{\frac{mn}{\chi^*(\mathbf{S})}}\Big),$$
and $\chi^*(\mathbf{S})$ denotes the fractional chromatic number of the dependence graph of $\mathbf{S} = \{S_i\}_{i=1}^n = \{z_{ij}\}_{i=1,j=1}^{n,m}$.

Proof of Theorem 6 in the main paper. Combining Propositions 8-9 and applying the union bound gives the high-probability bound on the transfer risk $\mathrm{er}(\mathcal{Q})$.

We further provide an example to illustrate how to estimate the fractional chromatic number $\chi^*(\mathbf{S})$ of the dependence graph in the dependent meta-learning setting.

Example 1 Consider a meta-sample $\mathbf{S} = \{z_{ij}\}_{i,j=1}^{n,m}$ where, within each task $i$, the samples $z_{i1},\dots,z_{im}$ are drawn i.i.d. from $D_i$, but dependencies may exist between samples drawn from different distributions. Then we can take the exact fractional cover of $\Gamma(\mathbf{S})$ to be $\{(C_i, 1)\}_{i=1}^n$ with $C_i = \{z_{ij}: j\in[m]\}$, and hence $\chi^*(\mathbf{S}) \le \sum_{i=1}^n 1 = n$.
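To illustrate how the quantities in Proposition 7 and Example 1 might be estimated in practice, here is a small Python sketch (ours; the dependence pattern, graph, and numeric values are hypothetical): a greedy proper coloring of the dependence graph yields an exact integer-weight cover whose number of colors upper-bounds $\chi^*(\mathbf{S})$, which in turn inflates the kl budget as in the reconstructed bound above.

```python
import math

def greedy_coloring(adj):
    """Greedy proper coloring; each color class is an independent set, so the
    number of colors used upper-bounds the fractional chromatic number chi*."""
    colors = {}
    for v in sorted(adj, key=lambda u: -len(adj[u])):   # high degree first
        taken = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj) + 1) if c not in taken)
    return colors

# Hypothetical meta-sample with n = 3 tasks and m = 4 samples per task, where
# (as in Example 1) samples are i.i.d. within a task but any two samples from
# different tasks are assumed dependent.
n, m = 3, 4
adj = {(i, j): set() for i in range(n) for j in range(m)}
for v in adj:
    for u in adj:
        if u != v and u[0] != v[0]:                      # cross-task edge
            adj[v].add(u)

chi_ub = len(set(greedy_coloring(adj).values()))         # = n = 3 here
K, kl_div, delta = n * m, 5.0, 0.05
budget_indep = (kl_div + math.log(2 * math.sqrt(K) / delta)) / K
budget_dep = chi_ub * (kl_div + math.log(2 * math.sqrt(K / chi_ub) / delta)) / K
print(chi_ub, budget_indep, budget_dep)   # dependence inflates the kl budget
```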
E. Details of Classification Experiments

E.1. Distributions over Neural Networks

For classification problems, we can develop two meta-learning algorithms by directly setting our kl-bound and Catoni-bound as minimization objectives. It suffices to specify (1) the explicit form of the KL-divergences in our meta-learning bounds, and (2) how to approximate the expectation over $P\sim\mathcal{Q}$. To tackle the first issue, we need to define the distributions over the hypothesis class $\mathcal{H} = \{w: w\in\mathbb{R}^d\}$ (i.e., neural networks), where $d$ is the dimension of the parameter $w$.

Figure 2. Average test bounds and test errors of our two bound-minimizing meta-learning algorithms on new classification tasks for different numbers of training-tasks (1-10) and for different pixel-shuffled environments (Permuted Pixels with 100/200/300 pixel swaps; average over 20 meta-test tasks). (a)-(b): Test bounds of the kl-bound and Catoni-bound. (c)-(d): Test errors of the kl-bound and Catoni-bound.

As in previous work (Amit & Meir, 2018), we first set both the hyper-prior and the hyper-posterior over $M_1(\mathcal{H})$ to be isotropic Gaussians: $\mathcal{P} = N(0, \kappa_{\mathcal{P}}^2 I_{d\times d})$, $\mathcal{Q}_\theta = N(\theta, \kappa_{\mathcal{Q}}^2 I_{d\times d})$, where $\kappa_{\mathcal{P}}, \kappa_{\mathcal{Q}} > 0$ are predefined constants and $\theta$ is the optimization parameter. The KL-divergence between $\mathcal{Q}_\theta$ and $\mathcal{P}$ can then be calculated as
$$K(\mathcal{Q}_\theta, \mathcal{P}) = \frac{\|\theta\|_2^2 + d\,\kappa_{\mathcal{Q}}^2}{2\kappa_{\mathcal{P}}^2} + d\ln\frac{\kappa_{\mathcal{P}}}{\kappa_{\mathcal{Q}}} - \frac{d}{2}. \quad (6)$$
Next, we consider the form of the prior and posteriors over the hypothesis space $\mathcal{H}$. We define the prior $P_\theta$ and the posteriors $Q_{\phi_i}$ ($\phi_i\in\mathbb{R}^{2d}$ is the parameter) as factorized Gaussian distributions for computational convenience:
$$P_\theta(w) = \prod_{k=1}^d N(w_k;\, \mu_{P,k}, \sigma_{P,k}^2), \qquad Q_{\phi_i}(w) = \prod_{k=1}^d N(w_k;\, \mu_{i,k}, \sigma_{i,k}^2),$$
where $\theta = (\mu_P, \rho_P)\in\mathbb{R}^{2d}$ is composed of the means $\mu_{P,k}$ and the log-variances $\rho_{P,k} = \ln\sigma_{P,k}^2$ of each weight, $k\in[d]$. The posterior vectors $\phi_i = (\mu_i, \rho_i)\in\mathbb{R}^{2d}$ have a similar structure. The KL-divergence $K(Q_{\phi_i}, P_\theta)$ can then be calculated as
$$K(Q_{\phi_i}, P_\theta) = \frac{1}{2}\sum_{k=1}^d\Big[\ln\frac{\sigma_{P,k}^2}{\sigma_{i,k}^2} + \frac{\sigma_{i,k}^2 + (\mu_{i,k} - \mu_{P,k})^2}{\sigma_{P,k}^2} - 1\Big]. \quad (7)$$
Secondly, to approximate the expectation over $P\sim\mathcal{Q}$ in our kl-bound and Catoni-bound, we utilize the Monte-Carlo method. Concretely, we estimate the expectations by averaging several Monte-Carlo samples of $P$, adding Gaussian noise to the parameter $\theta$ during the meta-training process: $\tilde{\theta} = \theta + \epsilon$, $\epsilon\sim N(0, \kappa_{\mathcal{Q}}^2 I_{d\times d})$. Therefore, from the above discussions, we can derive two PAC-Bayesian algorithms for meta-learning classification problems with deep neural networks.
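The two KL terms above are straightforward to implement. The following PyTorch sketch (ours, not the paper's code) computes Eq. (6) and Eq. (7) together with the reparameterized Monte-Carlo draw $\tilde{\theta} = \theta + \epsilon$; the dimension factors in the first function follow the standard closed form for isotropic Gaussians, which the extracted version of Eq. (6) leaves ambiguous.

```python
import math
import torch

def kl_hyper(theta, kappa_q, kappa_p):
    """Eq. (6): KL( N(theta, kappa_Q^2 I) || N(0, kappa_P^2 I) )."""
    D = theta.numel()
    return ((theta.pow(2).sum() + D * kappa_q ** 2) / (2 * kappa_p ** 2)
            + D * math.log(kappa_p / kappa_q) - D / 2)

def kl_task(mu_q, log_var_q, mu_p, log_var_p):
    """Eq. (7): KL between the factorized Gaussians Q_{phi_i} and P_theta."""
    var_q, var_p = log_var_q.exp(), log_var_p.exp()
    return 0.5 * (log_var_p - log_var_q
                  + (var_q + (mu_q - mu_p).pow(2)) / var_p - 1).sum()

# Reparameterized Monte-Carlo draw of a prior from the hyper-posterior.
d, kappa_p, kappa_q = 10, 2000.0, 0.001
theta = torch.zeros(2 * d, requires_grad=True)   # (mu_P, rho_P)
theta_tilde = theta + kappa_q * torch.randn_like(theta)
print(kl_hyper(theta, kappa_q, kappa_p))
```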
The detailed pseudocode for setting our PAC-Bayesian Catoni-bound as the training objective is shown in Algorithm 1, where we set $C_1 = 2$, $C_2 = 3$ for convenience. The pseudocode for minimizing our kl-bound can be written in a similar way.

Algorithm 1 Catoni-bound-minimizing meta-learning algorithm (meta-training phase)
1: Input: Datasets from n training tasks: S1, ..., Sn.
2: Output: Parameters θ of the hyper-posterior Qθ.
3: Initialize:
4: θ = (µP, ρP) ∈ R^{2d}, φi = (µi, ρi) ∈ R^{2d}, i = 1, ..., n.
5: while not converged do
6:   for i ∈ {1, ..., n} do
7:     Sample a mini-batch S′i from dataset Si.
8:     Estimate E_{Pθ∼Qθ} b̂er(Qi, S′i) on the mini-batch S′i by averaging Monte-Carlo draws.
9:     Calculate K(Qθ, P) with Eq. (6).
10:    Estimate E_{Pθ∼Qθ} K(Qφi, Pθ) with Eq. (7) by averaging Monte-Carlo draws.
11:  end for
12:  Calculate the meta-training Catoni-bound in Eq. (4) from E_{Pθ∼Qθ} b̂er(Qi, S′i), K(Qθ, P) and E_{Pθ∼Qθ} K(Qφi, Pθ), i = 1, ..., n.
13:  Calculate the gradient of the Catoni-bound w.r.t. {θ, φ1, ..., φn} using backpropagation.
14:  Take an optimization step.
15: end while
16: Return θ

In practice, we set the parameters $\kappa_{\mathcal{P}} = 2000$ and $\kappa_{\mathcal{Q}} = 0.001$, and the confidence parameter $\delta = 0.1$. Adam is chosen as the optimizer with a learning rate of $10^{-3}$ for all experiments. During the meta-test phase, an informative prior is sampled randomly from the learned hyper-posterior $\mathcal{Q}_\theta$, and we take this prior together with the scarce data of the novel task as input to learn a posterior. The test bound is calculated on the novel task by using the PAC-Bayesian bound for single-task learning as the minimization objective (i.e., only the task-level complexity in the meta-training bound is calculated).

Figure 3. Average test bounds and test errors of our two bound-minimizing meta-learning algorithms on new classification tasks for different sample sizes per training-task (5,000-60,000) and for different pixel-shuffled environments (Permuted Pixels with 100/200/300 pixel swaps; average over 20 meta-test tasks). (a)-(b): Test bounds of the kl-bound and Catoni-bound. (c)-(d): Test errors of the kl-bound and Catoni-bound.
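As an illustration of step 12 of Algorithm 1, the sketch below (ours; the function and argument names are hypothetical) assembles the meta-training Catoni-bound of Theorem 4 from the per-task empirical errors and KL terms of steps 8-10; minimizing the returned tensor by backpropagation corresponds to steps 13-14.

```python
import math
import torch

def catoni_meta_objective(emp_errors, kl_hyper, kl_tasks, n, m,
                          C1=2.0, C2=3.0, delta=0.1):
    """Meta-training Catoni-bound (Theorem 4 shape) as a minimization objective.
    emp_errors: per-task estimates of E_{P~Q} ber(Q_i, S_i) (one tensor each);
    kl_hyper:   K(Q_theta, P) from Eq. (6);
    kl_tasks:   per-task estimates of E_{P~Q} K(Q_{phi_i}, P_theta) from Eq. (7)."""
    a = C1 * C2 * math.log(C1) * math.log(C2) / ((C1 - 1) * (C2 - 1))
    b = C1 / ((C1 - 1) * n)
    c = C1 * C2 * math.log(C1) / ((C1 - 1) * (C2 - 1) * n * m)
    ln_term = math.log(2.0 / delta)
    emp = torch.stack(list(emp_errors)).mean()
    task_kl = torch.stack(list(kl_tasks)).sum()
    return (a * emp
            + b * (kl_hyper + ln_term)
            + c * (kl_hyper + task_kl + ln_term))
```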
E.2. Convergence Analysis of the PAC-Bayesian kl-Bound and Catoni-Bound for Meta-Learning

In this subsection, we visualize the convergence behaviour of our kl-bound and Catoni-bound. The experiment is conducted over a large range of the number $n$ of training tasks and of the sample size $m$ per task, in different classification environments. Detailed information can be found in Figure 2 and Figure 3. We can observe that: (1) As the number of meta-training tasks increases, our meta-learning algorithms obtained by minimizing the proposed kl-bound and Catoni-bound achieve lower test bounds and lower test errors on the novel task. This verifies the asymptotic behaviour of our two meta-learning bounds. (2) When the number of training tasks or the sample size per task is rather small (i.e., $n = 1$ or $m = 5{,}000$), both our kl-bound and Catoni-bound suffer performance degradation. However, when $n \ge 2$ or $m \ge 10{,}000$, our two bound-minimizing algorithms obtain much better performance. This indicates the value of extracting knowledge from other similar training tasks that have sufficient training data. (3) Our Catoni-bound always attains lower values than the kl-bound, in terms of both the test bound and the test error, which is consistent with the relative tightness of the Catoni-bound.

F. Details of Regression Experiments

F.1. Regression Environments

We provide more details on the five regression environments in Table 4, including the number of training tasks and the sample size per task of each environment. A comprehensive introduction to each environment can be found in (Rothfuss et al., 2021)[Appendix E].

Table 4. The number n of meta-training tasks and the sample size m per task in different regression environments.
Environment: Cauchy | SwissFEL | Physionet (GCS & HCT) | Berkeley
n: 20 | 5 | 100 | 36
m: 20 | 200 | 4-24 | 288

F.2. Approximating the Gibbs Optimal Hyper-Posterior with the SVGD Inference Method

In this subsection, we detail how to employ the inference method SVGD (Liu & Wang, 2016) to approximate the Gibbs optimal hyper-posterior $\mathcal{Q}^*$; we borrow the idea from (Rothfuss et al., 2021) to develop our GOHP algorithm, whose meta-training pseudocode is given in Algorithm 2.

Algorithm 2 GOHP with SVGD approximation of Q* (meta-training phase)
1: Input: Hyper-prior P, datasets S1, ..., Sn.
2: Hyper-parameters: SVGD kernel function k(·,·), step size η, scale factor β in Eq. (8).
3: Output: Set of priors {Pφ1, ..., PφK}.
4: Initialize: φ := [φ1, ..., φK]ᵀ, with φk ∼ P.
5: while not converged do
6:   for k = 1, ..., K do
7:     for i = 1, ..., n do
8:       ln Z̃i,k ← MLL_Estimator(Si, Pφk)
9:     end for
10:    ∇φk ln Q*(φk) ← ∇φk ln P(φk) + (β/(nm)) Σᵢ ∇φk ln Z̃i,k
11:  end for
12:  φ ← φ + (η/K)(K ∇φ ln Q* + ∇φK)  // SVGD update
13: end while
14: Return {Pφ1, ..., PφK}

Concretely, SVGD approximates $\mathcal{Q}^*$ by a set of particles $\hat{\mathcal{Q}} = \{P_{\phi_1}, \dots, P_{\phi_K}\}$, where $P_\phi$ denotes a prior with parameter $\phi$. Initially, we sample $K$ particles $\phi_k$ from the hyper-prior $\mathcal{P}$. Then, according to the explicit form of $\mathcal{Q}^*$ in Corollary 2 of the main paper, we can compute the gradient of $\ln\mathcal{Q}^*$ w.r.t. the parameters $\phi_k$, $k\in[K]$:
$$\nabla_{\phi_k}\ln\mathcal{Q}^*(\phi_k) = \nabla_{\phi_k}\ln\mathcal{P}(\phi_k) + \frac{\beta}{nm}\sum_{i=1}^n \nabla_{\phi_k}\ln Z(S_i, P_{\phi_k}), \quad (8)$$
where the marginal log-likelihood (MLL) $\ln Z(S_i, P_{\phi_k})$ is approximated by numerical Monte Carlo estimates.
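A simple such estimator (a sketch under our naming, mirroring the MLL_Estimator call in step 8 of Algorithm 2) draws hypotheses from the prior and applies a log-sum-exp, since $\ln Z(S_i, P) = \ln \mathbb{E}_{h\sim P}\big[\exp(-m\,\widehat{\mathrm{er}}(h, S_i))\big]$:

```python
import math
import torch

def mll_estimator(sample_prior, emp_error, S_i, m, n_mc=128):
    """Monte Carlo estimate of ln Z(S_i, P) = ln E_{h~P}[exp(-m * ber(h, S_i))].
    sample_prior(): draws one hypothesis h from the prior P (reparameterized,
    so gradients can flow back to the prior's parameters phi_k);
    emp_error(h, S_i): empirical error ber(h, S_i) of hypothesis h on S_i."""
    log_terms = torch.stack([-m * emp_error(sample_prior(), S_i)
                             for _ in range(n_mc)])
    return torch.logsumexp(log_terms, dim=0) - math.log(n_mc)
```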
Then we update the particles with the SVGD update rule:
$$\varphi \leftarrow \varphi + \frac{\eta}{K}\big(\mathbf{K}\,\nabla_\varphi\ln\mathcal{Q}^* + \nabla_\varphi\mathbf{K}\big),$$
where $\varphi = [\phi_1, \dots, \phi_K]^\top$ is the stacked particle matrix, $\nabla_\varphi\ln\mathcal{Q}^* = [\nabla_{\phi_1}\ln\mathcal{Q}^*(\phi_1), \dots, \nabla_{\phi_K}\ln\mathcal{Q}^*(\phi_K)]^\top$ the stacked matrix of gradients, $\mathbf{K} = [k(\phi_k, \phi_{k'})]_{k,k'}$ the kernel matrix induced by the kernel function $k(\cdot,\cdot)$, and $\eta$ the step size for updates. The pseudocode for meta-training can be found in Algorithm 2.

F.3. Applying the Gibbs Optimal Hyper-Posterior to Novel Tasks

During the meta-test phase, the knowledge extracted from the $n$ training tasks is applied to a novel task by a base learner. The base learner is given a training dataset $S_{n+1}\sim D_{n+1}$, where $D_{n+1}$ is sampled from the same environment $\tau$. We again use a statistical inference method to approximate the Gibbs optimal posterior $Q^*(S_{n+1}, P)$ defined in Corollary 1 of the main paper, and we update the parameters $\theta$ of the neural networks for the novel task with an SVGD update rule analogous to Eq. (8). Algorithm 3 gives the steps of the training procedure during the meta-test phase.

Algorithm 3 GOHP with SVGD on the novel task (meta-test phase)
1: Input: Set of priors {Pφ1, ..., PφK}, dataset S_{n+1} from the novel task.
2: Hyper-parameters: kernel function k(·,·), SVGD step size η, number of particles L.
3: Output: A set of neural network parameters ∪_{k=1}^K {θ^k_1, ..., θ^k_L}.
4: for k = 1, ..., K do
5:   Initialize {θ^k_1, ..., θ^k_L}, with θ^k_l ∼ Pφk, l ∈ [L].
6:   while not converged do
7:     for l = 1, ..., L do
8:       ∇_{θ^k_l} ln Q*(θ^k_l) ← ∇_{θ^k_l} ln Pφk(θ^k_l) − m ∇_{θ^k_l} b̂er(h_{θ^k_l}, S_{n+1})
9:     end for
10:    θ^k_l ← θ^k_l + (η/L) Σ_{l′=1}^L [k(θ^k_{l′}, θ^k_l) ∇_{θ^k_{l′}} ln Q*(θ^k_{l′}) + ∇_{θ^k_{l′}} k(θ^k_{l′}, θ^k_l)]  // SVGD update
11:  end while
12: end for
13: Return ∪_{k=1}^K {θ^k_1, ..., θ^k_L}

For a data point $x^*$ from the held-out evaluation set $S^*\sim D_{n+1}$, the neural network predictor outputs a probability distribution $\hat{p}(y^*|x^*, S_{n+1}) = \frac{1}{KL}\sum_{k,l=1}^{K,L} p\big(y^*\,|\,h_{\theta^k_l}(x^*)\big)$. The mean prediction is then set to the expectation of the distribution $\hat{p}$, i.e., $\hat{y} = \mathbb{E}_{y^*\sim\hat{p}(y^*|x^*, S_{n+1})}[y^*]$. The root mean squared error (RMSE) over the novel task $D_{n+1}$ is thus calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{|S^*|}\sum_{(x^*, y^*)\in S^*}(y^* - \hat{y})^2}.$$

Remark 2 Note that our meta regression algorithm GOHP achieves results analogous to those of the latest PACOH algorithm on the regression datasets in Table 3 of the main paper. There are two reasons for GOHP's analogous performance w.r.t. PACOH: (1) Both algorithms minimize similar objectives. Although our bound of $O\big((\frac{1}{n} + \frac{1}{nm})K(\mathcal{Q},\mathcal{P})\big)$ in Corollary 1 is sharper than the bound of $O\big((\frac{1}{\sqrt{n}} + \frac{1}{n\sqrt{m}})K(\mathcal{Q},\mathcal{P})\big) + \theta$ in (Rothfuss et al., 2021)[Corollary 1], where $\theta$ is a constant that does not decrease as the number $n$ of training tasks or the sample size $m$ per task grows, PACOH omits the constant term $\theta$ during the optimization process. Thus, our minimization objective and that of PACOH are almost the same, except for the multiplicative factor before $K(\mathcal{Q},\mathcal{P})$. (2) Both algorithms use the same approximation technique: GOHP and PACOH employ the same inference method SVGD (Liu & Wang, 2016) to minimize their respective objectives and update the hyper-posterior $\mathcal{Q}$ iteratively.
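For completeness, the SVGD update used in step 12 of Algorithm 2 and step 10 of Algorithm 3 can be sketched as follows (ours; an RBF kernel with a fixed bandwidth is assumed, whereas in practice the bandwidth is often set by the median heuristic):

```python
import torch

def rbf_kernel(phi, h=1.0):
    """Kernel matrix K = [k(phi_k', phi_k)] and the repulsion term
    sum_{k'} grad_{phi_k'} k(phi_k', phi_k) for an RBF kernel (phi: K x D)."""
    diff = phi.unsqueeze(1) - phi.unsqueeze(0)            # K x K x D
    Kmat = torch.exp(-diff.pow(2).sum(-1) / (2 * h ** 2))
    grad_K = (Kmat.unsqueeze(-1) * diff / h ** 2).sum(1)  # K x D
    return Kmat, grad_K

def svgd_step(phi, grad_log_q, eta=1e-3):
    """One SVGD update: phi <- phi + (eta/K)(Kmat @ grad_log_q + grad_K)."""
    Kmat, grad_K = rbf_kernel(phi)
    return phi + eta / phi.shape[0] * (Kmat @ grad_log_q + grad_K)

phi = torch.randn(5, 3)          # 5 particles in R^3
phi = svgd_step(phi, -phi)       # e.g. grad log-density of N(0, I) is -phi
```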
G. Explicit Forms of Different PAC-Bayesian Meta-Learning Bounds

In this section, we provide the explicit forms of different PAC-Bayesian bounds for meta-learning in Table 5. Note that Table 5 is the detailed version of the PAC-Bayesian bounds listed in Table 1 of the main paper.

Table 5. Different PAC-Bayesian meta-learning bounds on $\mathrm{er}(\mathcal{Q})$. Bound = Empirical Error + Environment-level Complexity + Task-level Complexity. Here $n$ is the number of training tasks, $m$ is the size of each training dataset $S_i$ ($i\in[n]$), $\mathcal{P}, \mathcal{Q}\in M_1(M_1(\mathcal{H}))$ are the hyper-prior and hyper-posterior respectively, and $P, Q_i = Q(S_i, P)\in M_1(\mathcal{H})$ are the prior and the posterior for the $i$-th training task. Our Catoni-bound holds for any constants $C_1, C_2 > 1$.

Pentina & Lampert — Empirical Error: $\widehat{\mathrm{er}}(\mathcal{Q})$; Environment-level: $\frac{K(\mathcal{Q},\mathcal{P})}{\sqrt{n}}$; Task-level: $\frac{K(\mathcal{Q},\mathcal{P}) + \sum_{i=1}^n \mathbb{E}_{P\sim\mathcal{Q}} K(Q_i,P)}{2n\sqrt{m}} + \frac{1}{8\sqrt{m}} + \frac{\ln(2/\delta)}{n\sqrt{m}}$.

Amit & Meir — Empirical Error: $\widehat{\mathrm{er}}(\mathcal{Q})$; Environment-level: $\sqrt{\frac{K(\mathcal{Q},\mathcal{P}) + \ln(2n/\delta)}{2(n-1)}}$; Task-level: $\frac{1}{n}\sum_{i=1}^n\sqrt{\frac{K(\mathcal{Q},\mathcal{P}) + \mathbb{E}_{P\sim\mathcal{Q}}K(Q_i,P) + \ln(2nm/\delta)}{2(m-1)}}$.

Rothfuss et al. — Empirical Error: $\widehat{\mathrm{er}}(\mathcal{Q})$; Environment-level: $\frac{K(\mathcal{Q},\mathcal{P})}{\sqrt{n}} + \frac{1}{8\sqrt{n}} + \frac{\ln(1/\delta)}{\sqrt{n}}$; Task-level: $\frac{K(\mathcal{Q},\mathcal{P}) + \sum_{i=1}^n \mathbb{E}_{P\sim\mathcal{Q}} K(Q_i,P)}{n\sqrt{m}} + \frac{1}{8\sqrt{m}}$.

kl-bound (ours) — Empirical Error: $\widehat{\mathrm{er}}(\mathcal{Q})$; Environment-level: $\sqrt{\frac{K(\mathcal{Q},\mathcal{P}) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}$; Task-level: $\sqrt{\frac{2\,\widehat{\mathrm{er}}(\mathcal{Q})\,\Delta}{mn}} + \frac{2\Delta}{mn}$, where $\Delta = K(\mathcal{Q},\mathcal{P}) + \mathbb{E}_{P\sim\mathcal{Q}}\sum_{i=1}^n K(Q_i,P) + \ln\frac{2\sqrt{mn}}{\delta}$.

Catoni-bound (ours) — Empirical Error: $\frac{C_1C_2\ln C_1\ln C_2}{(C_1-1)(C_2-1)}\,\widehat{\mathrm{er}}(\mathcal{Q})$; Environment-level: $\frac{C_1\big(K(\mathcal{Q},\mathcal{P})+\ln(2/\delta)\big)}{(C_1-1)n}$; Task-level: $\frac{C_1C_2\ln C_1\big(K(\mathcal{Q},\mathcal{P})+\mathbb{E}_{P\sim\mathcal{Q}}\sum_{i=1}^n K(Q_i,P)+\ln(2/\delta)\big)}{(C_1-1)(C_2-1)nm}$.

Remark 3 We remark that it is infeasible to directly compare our results with the latest generalization bounds in (Farid & Majumdar, 2021), which combine PAC-Bayes analysis and algorithmic stability theory. The reasons are twofold: (1) It is hard to compute a precise PAC-Bayes bound from (Farid & Majumdar, 2021)[Theorem 3], as the uniform stability parameter (denoted $\beta$) in their bound depends on the $L$-Lipschitzness and $B$-smoothness of the neural networks, where the constants $L$ and $B$ are typically unknown and very large ($\gg 1$). (2) Nevertheless, we can make a rough comparison as follows. In the convex case, the stochastic gradient descent (SGD) algorithm with constant step sizes has stability parameter $\beta = O(L^2T/m)$, where $T$ is the total number of iterations. Recall that $T = 200$ and $m = 2{,}000$ in our meta-test phase, so $\beta \gg 200/2000 = 0.1$, which is already larger than our reported Catoni-bound in Table 2. Therefore, Farid's generalization bound is not tight enough even in the convex case, let alone in the non-convex case (e.g., with deep neural networks).

H. Notations of Single-Task Learning and Meta-Learning

In this section, we summarize the notations of PAC-Bayesian single-task learning and PAC-Bayesian meta-learning in Table 6 for the reader's convenience.

Table 6. Notations of PAC-Bayesian single-task learning and PAC-Bayesian meta-learning. $\mathcal{H}$ is the hypothesis space, and $l: \mathcal{H}\times\mathcal{Z}\to[0,1]$ is the bounded loss function. In the main paper, we write $\widehat{\mathrm{er}}(\mathcal{Q}) \triangleq \widehat{\mathrm{er}}(\mathcal{Q}, \mathbf{S})$ and $\mathrm{er}(\mathcal{Q}) \triangleq \mathrm{er}(\mathcal{Q}, \tau)$ for abbreviation.

PAC-Bayesian Single-Task Learning | PAC-Bayesian Meta-Learning
Sample $z\in\mathcal{Z}$ | Sample $S\in\mathcal{Z}^m$
Training set $S = \{z_i\}_{i=1}^m\in\mathcal{Z}^m$ | Training set $\mathbf{S} = \{S_i\}_{i=1}^n\in(\mathcal{Z}^m)^n$
Unknown task $D\in M_1(\mathcal{Z})$ | Unknown environment $\tau\in M_1(M_1(\mathcal{Z}))$
Prior $P\in M_1(\mathcal{H})$ | Hyper-prior $\mathcal{P}\in M_1(M_1(\mathcal{H}))$
Posterior $Q\in M_1(\mathcal{H})$ | Hyper-posterior $\mathcal{Q}\in M_1(M_1(\mathcal{H}))$
Empirical error $\widehat{\mathrm{er}}(Q, S) = \mathbb{E}_{h\sim Q}\frac{1}{m}\sum_{i=1}^m l(h, z_i)$ | Empirical error $\widehat{\mathrm{er}}(\mathcal{Q}, \mathbf{S}) = \mathbb{E}_{P\sim\mathcal{Q}}\frac{1}{n}\sum_{i=1}^n \widehat{\mathrm{er}}(Q(S_i, P), S_i)$
Expected error $\mathrm{er}(Q, D) = \mathbb{E}_{h\sim Q}\mathbb{E}_{z\sim D}\, l(h, z)$ | Transfer error $\mathrm{er}(\mathcal{Q}, \tau) = \mathbb{E}_{P\sim\mathcal{Q}}\mathbb{E}_{D\sim\tau}\mathbb{E}_{S\sim D^m}\,\mathrm{er}(Q(S, P), D)$