# smoothness_adaptive_hypothesis_transfer_learning__490861d4.pdf

Smoothness Adaptive Hypothesis Transfer Learning

Haotian Lin 1 Matthew Reimherr 1

Many existing two-phase kernel-based hypothesis transfer learning algorithms employ the same kernel regularization across phases and rely on the known smoothness of functions to obtain optimality. Therefore, they fail to adapt to the varying and unknown smoothness between the target/source and their offset. This paper introduces Smoothness Adaptive Transfer Learning (SATL), a two-phase kernel ridge regression (KRR)-based algorithm to address these limitations. We first demonstrate that employing a misspecified fixed bandwidth Gaussian kernel in target-only KRR learning can achieve minimax optimality when the true function resides in Sobolev spaces. Leveraging this result, SATL enables the estimators to provably and universally adapt to the varying and unknown Sobolev smoothness of the source and offset functions. We derive the minimax lower bound of the learning problem in excess risk and show that SATL achieves a matching upper bound up to logarithmic factors. The optimal statistical rate reveals the factors influencing the transfer dynamics and efficacy, including the source sample size and the relative strength between domains. The theoretical findings and the effectiveness of SATL are confirmed by several experiments.

1. Introduction

Nonparametric regression is one of the most prevalent statistical problems studied in many communities in past decades due to its flexibility in modeling data. A large number of algorithms have been proposed, such as kernel regression, local regression, smoothing splines, and regression trees, to name only a few. However, the effectiveness of all the algorithms in these existing works is based on having sufficient samples drawn from the same target domain. When

1Department of Statistics, Pennsylvania State University, University Park, PA, USA. Correspondence to: Haotian Lin <hzl435@psu.edu>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

samples are scarce, either due to costs or other constraints, the performance of these algorithms can suffer empirically and theoretically.

Hypothesis Transfer Learning (HTL) (Li & Bilmes, 2007; Kuzborskij & Orabona, 2013; Du et al., 2017), which leverages models trained on the source domain and uses samples from the target domain to learn the model shift to the target model, is an appealing and promising mechanism. When the parameters of interest are infinite-dimensional (e.g., nonparametric models), Lin & Reimherr (2024) employed the reproducing kernel Hilbert space (RKHS) norm as a metric for assessing similarity in functional regression frameworks, linking the transferred knowledge to the employed RKHS structure. They leveraged offset transfer learning (OTL), which is one of the most popular HTL algorithms, to obtain target estimators in a two-phase manner: first, training a source model on the large sample size source dataset, then estimating the offset model between the target and source using the target dataset and trained source model. However, a noteworthy observation is that both phases of estimating the source model and offset model utilize the same RKHS regularization. This goes against the principle that led to OTL s success, which posits that the offset should ideally have a simpler structure than the target and source. A similar limitation also appears in a series of twophase HTL algorithms for finite-dimensional models (e.g., multivariate/high-dimensional linear regression) (Bastani, 2021; Li et al., 2022; Tian & Feng, 2022), which typically utilize the ℓ1 or ℓ2 norm of the offset parameters as the similarity measure. Since ℓ1-norm can reflect sparsity and ℓ2 usually serves to control complexity, using the same norm regularization across both phases is more robust to model structure heterogeneous and more defensible in finitedimensional models than in infinite-dimensional ones.

In the realm of nonparametric regression, although OTL has shown great success in practice, there are only a few studies that provide theoretical analysis (Wang & Schneider, 2015; Du et al., 2017), and these works are still limited in terms of problem settings, estimation procedures, and theoretical bounds. For example, although Wang & Schneider (2015) noticed the nature of simple offsets, they didn t use any quantity (like Sobolev or H older smoothness) to formularize the difference of target/source models and their offset. Their KRR-based OTL algorithm also employed the same kernel

Smoothness Adaptive Hypothesis Transfer Learning

to train the source model and the offset and thus has similar limitations as Lin & Reimherr (2024) methodologically. Du et al. (2017) formalized the varying structures via different H older smoothness, but the theoretical results derived are under too ideal assumptions, unverifiable in practice. Besides, neither their approaches nor the statistical convergence rates were adaptive, and their upper bound overlooked certain factors, failing to provide deeper insights into the influence of domain properties on the transfer learning dynamics and efficacy. This raises the following fundamental question that motivates our study:

Can we develop an HTL algorithm so that the different structures (smoothness) of the target/source functions and their offset can be adaptively learned?

Main contributions. This work answers the above question positively and makes the following contributions:

We propose Smoothness Adaptive Transfer Learning (SATL), building upon the prevalent two-phase offset transfer learning paradigm. Specifically, we study the setting where the target/source function lies in Sobolev space with order m0 while the offset function lies in Sobolev space with order m (where m > m0). One key feature of SATL is its universal capability to adapt to the unknown and varying smoothness of the target, source, and offset functions.

We first begin by establishing the robustness of the Gaussian kernel in misspecified KRR, i.e., for regression functions belonging to certain fractional Sobolev spaces (or RKHSs that are norm equivalent to such Sobolev spaces), employing a fixed bandwidth Gaussian kernel in target-only KRR yields minimax optimal generalization error. Remarkably, the optimal order of the regularization parameters follows an exponential pattern, which differs from the variable bandwidth setting and we conduct comprehensive experiments to support the finding. Furthermore, we demonstrate that an estimator, developed through standard training and validation methods, achieves the same optimality up to a logarithmic factor without prior knowledge of the true smoothness.

Leveraging these new results of the Gaussian kernel, SATL employs Gaussian kernels in both learning phases, avoiding the saturation effect of KRR and thus ensuring its universal and consistent adaptability to the diverse and unknown smoothness levels m0 and m. We also establish the minimax statistical lower bound for the learning problem in terms of excess risk and show that SATL achieves minimax optimality since it enjoys a matching upper bound (up to logarithmic factors). Crucially, our results shed light on the impact of signal strength from both domains on the dynamic and efficacy of OTL, which, to the best of our knowledge, has been largely overlooked in the existing literature. This insight enhances our understanding of the contributions of each phase in the transfer learning process.

1.1. Related Literature

Transfer Learning. OTL (a.k.a. bias regularization transfer learning) has been extensively researched in supervised regression. The work in Kuzborskij & Orabona (2013; 2017) focused on OTL in linear regression, establishing generalization bounds through Rademacher complexity. Wang & Schneider (2015) derived generalization bounds for applying KRR on OTL without formularizing the simple offset structure. Wang et al. (2016) assumed target/source regression functions in the Sobolev ellipsoid with order m0 and the offset in a smoother power Sobolev ellipsoid. They used finite orthonormal basis functions for modeling, which becomes restrictive if the chosen basis is misaligned with the eigenfunctions of the Sobolev ellipsoid. Du et al. (2017) further proposed a transformation function for the offset, thereby integrating many preview OTL studies and offering upper bounds on excess risk for both kernel smoothing and KRR. Apart from regression settings, generalization bounds for classification problems with surrogate losses have been studied in Aghbalou & Staerman (2023) via stability analysis techniques. Other results that study HTL outside OTL can be found in Orabona et al. (2009); Cheng & Shang (2015); Minami et al. (2024). Besides, OTL can also be viewed as a case of representation learning Du et al. (2020); Tripuraneni et al. (2020); Xu & Tewari (2021) by viewing the trained source model as a representation for target tasks.

The idea of OTL has also been recently adopted by the statistics community, which typically involves regularizing the offset via different metrics in parameter spaces. For example, Bastani (2021); Li et al. (2022); Tian & Feng (2022) considered ℓ1-distance OTL for high-dimensional (generalized) linear regression. Duan & Wang (2022); Tian et al. (2023) considered ℓ2-distance for general linear models. Lin & Reimherr (2024) utilized RKHS-distance for functional linear models. However, all of these works used the same type of distance while estimating the source model and the offset. For nonparametric regression, another study by Cai & Pu (2024) assumed the target/source models lie in H older spaces while the offset can be approximated with any desired accuracy by a polynomial function in L1. They proposed an algorithm based on local polynomial regression that adapts to H older smoothness, but the approach can be computationally intensive in practice. KRR under the covariate shift setting has also been studied in several works. Ma et al. (2022) derived optimal rates of the generalization error under different likelihood ratio bound conditions and proposed rate-optimal estimator based on reweighting KRR. Wang (2023) introduced a pseudo-labeling algorithm to address TL scenarios where the labels in the target domain are unobserved. For nonparametric classification, Kpotufe & Martinet (2021); Cai & Wei (2021); Reeve et al. (2021) developed adaptive classifiers based on K-nearest neighbors that are rate-optimal in different distribution shift settings.

Smoothness Adaptive Hypothesis Transfer Learning

Misspecification in KRR. This line of research focuses on using misspecified kernels in target-only KRR to achieve optimal statistical convergence rates.

In the realm of variable bandwidths, Eberts & Steinwart (2013) derived the convergence rates of the excess risk for KRR using Gaussian kernels when the true regression function lies in a Sobolev space. They found that appropriate choices of regularization parameters and the bandwidth will yield a non-adaptive rate that can be arbitrarily close (but not equal) to the optimal rate under the bounded response assumption. Building upon this work, Hamm & Steinwart (2021) further improved the non-adaptive rate, attaining optimal rates up to logarithmic factors. It is worth noting that their results show that both the optimal regularization parameter and bandwidth should decay in polynomial patterns, which is different from ours. Apart from regression setting, Li & Yuan (2019) studied using variable bandwidth Gaussian kernels to achieve optimality in a series of nonparametric statistical tests.

Another line of research considers fixed bandwidth kernels. For instance, Wang & Jing (2022) investigated the misspecification of Mat ern kernel-based KRR. They demonstrated that even when the true regression functions belong to a Sobolev space, utilizing misspecified Mat ern kernels can still attain minimax optimal convergence rates or, in some cases, a slower convergence rate (referred to as the saturation effect of KRR). Similarly, several other works have presented similar results on general RKHS with polynomial eigen-decay rate, and the true function resides in the power space of the RKHS, see Steinwart et al. (2009); Dicker et al. (2017); Blanchard & M ucke (2018); Fischer & Steinwart (2020); Lin & Cevher (2020); Zhang et al. (2023) and more references therein.

2. Preliminaries

Problem Formulation. Consider the two nonparametric regression models

yp,i = fp(xp,i) + ϵp,i, p {T, S}

where p is the task index (T for target and S for source), fp are unknown regression functions, xp,i X Rd is a compact set with positive Lebesgure measure and Lipschitz boundary, and ϵp,i are i.i.d. random noise with zero mean. The target and source regression function, f T and f S, belong to the (fractional) Sobolev space Hm0 with smoothness m0 d/2 over X. The joint probability distribution ρp(x, y) is defined on X Y for the data points {(xp,i, yp,i)}np i=1, and µp represents the marginal distribution of ρp on X. In this work, we assume the model shift (a.k.a. posterior drift) setting, where µT is equal to µS, while the regression function f T differs from f S. The goal of this paper is to find a function ˆf T based on the combined

data {(x T,i, y T,i)}n T i=1 {(x S,i, y S,i)}n S i=1 that minimizes the generalization error on the target domain, i.e.

E( ˆf T ) = Ex µT [( ˆf T (x) f T (x))2].

Non-Transfer Scenario. In the absence of source data, recovering f T using KRR is referred to as target-only learning and has been extensively studied. We now state some of its well-known results.

Proposition 2.1 (Target-only Learning). For a symmetric and positive semi-definite kernel K : X X R, let HK be the RKHS associated with K (Wendland, 2004). The KRR estimator is

ˆf T = argmin f HK

i=1 (y T,i f(x T,i))2 + λ f 2 HK

and we call the kernel K as the imposed kernel. Then, the convergence rate of the generalization error of ˆf T , E( ˆf T ), is given as follows.

1. (Well-specified Kernel) If HK coincides with Hm0, E( ˆf T ) can reach the standard minimax convergence

rate in high-probability given λ n 2m0 2m0+d , i.e.

E( ˆf T ) = OP

n 2m0 2m0+d T

2. (Misspecified Kernel) If the K is the Mat ern kernel then its induced space is isomorphic to Hm 0 with

m 0 > d 2. Furthermore, given λ n 2m 0 2m0+d and γ = min{2, m0

m 0 } , then

E( ˆf T ) = OP

n 2γm 0 2γm 0+d T

3. (Saturation Effect) For m 0 < m0

2 and any choice of parameter λ(n T ) satisfying that n T , we have

E( ˆf T ) = ΩP

n 4m 0 4m 0+d T

The well-specified result is well-known and can be found in a line of past work (Geer, 2000; Caponnetto & De Vito, 2007). The misspecified kernel result comes from a combination (with a modification) of Theorem 15 and 16 in Wang & Jing (2022). The saturation effect is proved by Li et al. (2023). The Proposition 2.1 indicates that for target-only KRR, even when the smoothness of the imposed RKHS, m 0, disagrees with the smoothness m0 of the Sobolev space to which f T belongs, the optimal rate of convergence is still achievable if m 0 m0/2 with the λ appropriately chosen.

Smoothness Adaptive Hypothesis Transfer Learning

However, if m 0 < m0/2, i.e., the true function is much smoother than the estimator itself, the saturation effect occurs, meaning that the information-theoretic lower bound n 2m0/(2m0+d) T seemingly cannot be achieved regardless of the selection of the regularization parameters in KRR (Bauer et al., 2007; Gao et al., 2008).

Transfer Learning Framework. We introduce the KRRbased version of OTL for nonparametric regression, which serves as the backbone of our proposed algorithm. Formally, OTL obtains the estimator for f T as ˆf T = ˆf S + ˆfδ via two phases. In the first phase, it obtains ˆf S by KRR with the source dataset {(x S,i, y S,i)}n S i=1. In the second phase, it generates pseudo offset labels {y T,i ˆf S(x T,i)}n T i=1 and then learns the ˆfδ via KRR by replacing target labels by pseudo offset labels. The main idea of OTL is that the f S can be learned well given sufficiently large source samples, and the offset fδ can be learned with much fewer target samples. We formulate the OTL variant of KRR as Algorithm 1.

Algorithm 1 OTL-KRR Input: Target and source training data {(x T,i, y T,i)}n T i=1 {(x S,i, y S,i)}n S i=1}; Self-specified KRR imposed kernel K Output: Target function estimator ˆf T = ˆf S + ˆfδ. Phase 1:

ˆf S = argmin f HK

i=1 (y S,i f(x S,i))2 + λ1 f 2 HK

ˆfδ = argmin f HK

i=1 (y T,i ˆf S(x T,i) f(x T,i))2+λ2 f 2 HK

Model Assumptions. We first state the smoothness assumption on the offset function fδ := f T f S.

The learning framework (Algorithm 1) reveals a smoothnessagnostic nature: the imposed kernels K (also the associated RKHSs) stay the same regardless of the level of smoothness of f S and fδ. More specifically, based on Proposition 2.1, the learning algorithm is rate-optimal when the smoothness of both imposed RKHSs HK in both steps matches that of f S, and fδ, i.e. the smoothness of f S and fδ stay the same. However, in the transfer learning context, such a smoothness condition on the offset function may not be precise enough. One should rather consider the offset function smoother than the target/source functions themselves to represent the similarity between f S and f T .

To illustrate this point, consider the following example. Suppose f T = f S + fδ where f S is a complex function with

low smoothness (less regularized) while fδ is rather simple (well regularized), e.g. a linear function. Then f S can be estimated well via larger n S while fδ is a highly smooth function and can also be estimated well via small n T due to its simplicity. In this example, the effectiveness of OTL relies on the similarity between f T and f S, i.e., the offset fδ possessing a simpler structure than the target and source models. Such simpler offset assumptions have been proven to make OTL effective in other models, e.g., high-dimensional linear regression works (Bastani, 2021; Li et al., 2022; Tian & Feng, 2022) assume the offset coefficient should be sparser than target/source coefficients. This motivates our endeavor to introduce the following smoothness assumptions to quantify the similarity between target and source domains.

Assumption 2.2 (Smoothness of Target/Source). There exists an m0 d/2 such that f T and f S belong to Hm0.

Assumption 2.3 (Smoothness of Offset). There exists an m m0 such that fδ := f T f S belongs to Hm.

Remark 2.4. The results of this paper are applicable not only to Sobolev spaces but also to those general RKHSs that are norm equivalent to Sobolev spaces. Thus, we can assume that f S, f T , and fδ belong to RKHSs with different regularity. Due to norm equivalency, our discussion is primarily focused on Sobolev spaces, and we refer readers to Appendix B.2 for the analysis pertaining to general RKHSs.

Assumption 2.2 is a very common assumption in nonparametric regression literature, and Assumption 2.3 naturally holds if Assumption 2.2 is satisfied. Compared to the offset assumption in Wang et al. (2016) where fδ is assumed to belong to the power space of Hm0, our setting presents a unique challenge. Since we consider the offset function in a Sobolev space with higher smoothness, which doesn t necessarily share the same eigenfunctions with Hm0, this renders orthonormal basis modeling less promising. Assumption 2.3 also makes our setting conceptually align with contemporary transfer learning models. For instance, in prevalent pretraining-finetuning neural networks, the pretrained feature extractor tends to encompass a greater number of layers, while the newly added fine-tuning structure typically involves only a few layers. In this analogy, m0 and m in our setting are akin to the deeper pre-trained layers and the shallow fine-tuned layers.

We also state a standard assumption that frequently appears in KRR literature (Fischer & Steinwart, 2020; Zhang et al., 2023) to establish theoretical results, which controls the noise tail probability decay speed.

Assumption 2.5 (Moment of error). There are constants σ,L > 0 such that for any r 2, the noise, ϵ, satisfies

E [|ϵp|r | x] 1

2r!σ2Lr 2, for p {T, S}.

Smoothness Adaptive Hypothesis Transfer Learning

3. Target-Only KRR with Gaussian Kernels

3.1. Motivation for Employing Gaussian Kernel

To achieve optimality in Algorithm 1 under the smoothness assumptions, an applicable approach is to employ distinct kernels that can accurately capture the correct smoothness of f S and fδ during both phases. This approach, however, faces the practical challenge of identifying the unknown smoothness m0 and m, which, in turn, induce different kernels and RKHS; this is an issue also prevalent in the target-only KRR context.

One potential solution is to leverage the robustness of Mat ern kernels, i.e., employ a misspecified Mat ern kernel as the imposed kernel in KRR. As indicated by Proposition 2.1, the optimal convergence rate is still attainable for some appropriately chosen Mat ern kernels and regularization parameters. Nonetheless, this still faces two problems:

(1) The rate with misspecified Mat ern kernel in Proposition 2.1 is still non-adaptive, i.e. one still needs to know the true smoothness when tuning λ1 and λ2.

(2) The risk of the saturation effect of KRR happening in both phases when a less smooth kernel is chosen.

While the former can be potentially addressed by crossvalidation or data-driven adaptive approach, the second one is more fatal as one might end up choosing a less smooth kernel and never be able to achieve the information-theoretic lower bounds because of the saturation effect.

Hence, there is a clear demand for a kernel with a more general and robust property, i.e., in the target-only KRR, for the regression function that lies in Hm0 with any m0 d/2, employing such a kernel ensures that there s always an optimal λ such that the optimal convergence rate is achievable. Motivated by the fact that the Gaussian kernel is the limit of Mat ern kernel Kν as ν and the RKHS associated with the Gaussian kernel is contained in the Sobolev space Hν for any ν > d/2 (Fasshauer & Ye, 2011), we show that the Gaussian kernel indeed possesses this desired property.

3.2. Theoretical Results

Consider the Target-Only learning KRR setting, where f0 Hα0 (we use α0 to denote smoothness in target-only context to distinguish from TL context) and the underlying supervised learning model setup as

yi = f0(xi) + ϵi, i = 1, , n.

First, we show the non-adaptive rate for the Gaussian kernel. Theorem 3.1 (Non-Adaptive Rate). Under the Assumptions 2.5, let the imposed kernel, K, be the Gaussian kernel with fixed bandwidth and ˆf be the corresponding KRR

estimator based on data {(xi, yi)}n i=1. If f0 Hα0, by choosing log(1/λ) n

2 2α0+d , for any δ (0, 1), when n is sufficient large, with probability at least 1 δ, we have

ˆf f0 2 L2 C log 4

2 n 2α0 2α0+d ,

where C is a constant independent of n and δ.

Remark 3.2. Although Eberts & Steinwart (2013); Hamm & Steinwart (2021) has studied the robustness of Gaussian kernel on misspecified KRR, their results are built on variable bandwidths, and the nearly rate-optimal results are established given both bandwidth and λ decay polynomially in n. In contrast, our result is built on fixed bandwidth Gaussian kernels and achieves the optimal rate with the optimal λ that behaves differently from theirs.

We note to the reader that while the RKHS associated with the Mat ern kernel coincides with a Sobolev space (i.e., they are the same space with slightly different, though equivalent, norms), the Gaussian kernel does not, making the behavior of the optimal λ totally different compared to the misspecified Mat ern kernel scenarios in Proposition 2.1. Particularly, even if the Gaussian kernel is the limit of Mat ern kernel Kν as ν , setting the smoothness parameter m 0 of the imposed Mat ern kernel as infinity in misspecified kernel case of Proposition 2.1 will never yield analytical results but only tells the optimal order of λ should converge to 0 faster than polynomial (limm 0 n 2m 0/(2m0+d) = 0). On the other side, our result identifies that λ should converge to 0 exponentially in n.

The exponential decay form for the optimal λ originates from managing the approximation error. The standard real interpolation technique (like Proposition 2.1) is inadequate for controlling this error when the intermediate term lies in RKHS of Gaussian kernels. We address this by the Fourier transform of the RKHS, which reveals this exponential form. We refer readers to Appendix C.1 for more details. To further highlight our findings, we compare our results with existing state-of-the-art works on misspecified KRR and refer readers to Appendix C.4 for a detailed discussion.

To develop an adaptive procedure without known α0, we employ a standard training and validation approach (Steinwart & Christmann, 2008). To this end, we construct a finite set that is an arithmetic sequence, i.e., A = {αmin < < αmax} where {αi} satisfy αmin > d/2, αmax large enough such that α0 αmax and αi αi 1 1/ log n. Split dataset D = {(xi, yi)}n i=1 into

D1 := {(x1, y1), , (xj, yj)}

D2 := {(xj+1, yj+1), , (xn, yn)}

The adaptive estimator is obtained by following the training and validation approach.

Smoothness Adaptive Hypothesis Transfer Learning

1. For each α A, obtain non-adaptive estimator ˆfλα by KRR with dataset D1.

2. Obtain the adaptive estimator ˆfλ ˆ α by minimizing empirical L2 error on D2, i.e.

ˆfλ ˆ α = argmin α A

i=j+1 (yi ˆfλα(xi))2

When constructing the collection of non-adaptive estimators over A, Theorem 3.1 suggests choosing the regularization parameter λ = exp{ Cn2/2α+d} for some constant C.

The following theorem shows the estimator from the training and validation approach achieves an optimal minimax rate up to a logarithmic factor in n.

Theorem 3.3 (Adaptive Rate). Under the same conditions of Theorem 3.1 and A = {α1, , αN} with αj αj 1 1/ log n. Then, for δ (0, 1), when n is sufficient large, with probability 1 δ, we have

E( ˆfλ ˆ α) C log 4

2α0 2α0+d ,

where C is a constant independent of n and δ.

Remark 3.4. If the marginal distribution of x, µ, is known, one can also apply the well-known Lepski s method (Lepskii, 1991) to obtain an adaptive estimator without known α0, which also achieves optimal nonadaptive rate up to a logarithmic factor as training and validation approach does.

4. Smoothness Adaptive Transfer Learning

We formally propose Smoothness Adaptive Transfer Learning in Algorithm 2.

Algorithm 2 Smoothness Adaptive Transfer Learning (SATL) Input: Target and source dataset DT and DS;

1: Let the smoothness candidate set for the source model f S as MS = { Q1 log(n S), , Q1N1 log(n S)} and the candidate set for the offset model fδ as Mδ = { Q2 log(n T ), , Q2N2 log(n T )} for some fixed positive number Q1, Q2 and integer N1, N2. 2: Obtain the adaptive source model ˆf S via the training and validation with the Gaussian kernel and MS. 3: Generate the label ˆe T,i = y T,i ˆf S(x T,i) and the offset dataset as DT = {(x T,1, ˆe T,1), , (x T,n T , ˆe T,n T ))}. 4: Using the offset datasets DT to obtain the adaptive offset model ˆfδ via the training and validation with the Gaussian kernel and Mδ.

While SATL can be viewed as a specification of the Algorithm 1, the desirable property exhibited by the Gaussian kernel surpasses all other misspecified kernel choices by allowing estimators from both phases always to be able to adapt to the true Sobolev smoothness of the functions inherently even with unknown the true values of m0, m.

4.1. Theoretical Analysis

In order to provide concrete theoretical bounds, we assume the offset function of f T and f S in the h-ball of Hm, i.e. f S is said to be h-transferable to f T if fδ Hm h. Hence, the parameter space is defined as

Θ(h, R, m0, m) = {(ρT , ρS) :

f S Hm0 R, fδ Hm h}

for some positive constants R and h. We note that to achieve rigorous optimality in the context of transfer learning under the regression setting, such an upper bound for the distance between parameters from both domains is often required, e.g., ℓ1 or ℓ0 distance in high-dimensional setting (Li et al., 2022; Tian & Feng, 2022; He et al., 2024), Fisher-Rao distance in low-dimensional setting (Zhang et al., 2022), RKHS distance in functional setting (Lin & Reimherr, 2024), etc.

Theorem 4.1 (Optimality of SATL). Suppose the Assumption 2.2, 2.3, and 2.5 hold, and n S and n T are sufficiently large but still in transfer learning regime (n S n T ), we have the lower bound for the transfer learning problem and the upper bound of SATL as follows.

1. (Lower bound) For δ (0, 1), with probability 1 δ

inf f sup Θ(h,R,m0,m) P n f f T 2 L2 CδR2

n 2m0 2m0+d S + n 2m 2m+d T ξL

where ξL h2/R2 and C is a constant independent of n S, n T , R, h, and δ. The inf is taken over all possible estimators f based on the target and source data.

2. (Upper bound) Suppose that ˆf T is the output of SATL. For δ (0, 1), with probability 1 δ, we have

ˆf T f T 2 L2 C log 8

2 R2 + σ2 S

( n S log n S

2m0 2m0+d + n T log n T

where ξU (h2 +σ2 T )/(R2 +σ2 S) and C is a constant independent of n S, n T , R, h, and δ.

Theorem 4.1 indicates that the convergence rate of excess risk for SATL consists of two terms: the first term is the

Smoothness Adaptive Hypothesis Transfer Learning

rough estimation error, and the second is the offset estimation error. The first term represents the error of learning f T with the source samples, while the second term is the error of learning the offset function with the target samples. The terms ξL and ξU control the transfer dynamic and efficacy; see discussion on Section 4.2, and we refer readers to Appendix D for their origin. Besides, the upper bound is tight up to logarithmic factors, which is a price paid for adaptivity. Note that the upper bound can be exactly tight when the Sobolev smoothness m0 and m are known.

Finally, it is also worth highlighting how the refinement of a simple offset provides a better convergence rate compared to homogeneous kernel regularization (Lin & Reimherr, 2024). Based on the saturation effort of KRR, using the same Mat ern kernel with smoothness m0 for both phases in Algoritm 1 will lead the offset estimation error to be n 2γm0/(2γm0+d) T , where γ = min{2, m/m0}, which is never faster than n 2m/(2m+d) T .

4.2. OTL Transfer Dynamic and Efficacy

In this part, we discuss some insights provided by our upper bound into the transfer dynamic, i.e., how OTL benefits the learning over the target domain, and the transfer efficacy, i.e., whether the OTL is taking effect.

OTL Dynamic. For offset and source models, we term the quantities h2 + σ2 T and R2 + σ2 S as the model total strength, i.e., the sum of upper signal strength bound and noise variance. Then, ξU quantifies the relative model total strength between the offset and source models. In comparison to the convergence rate of the target-only baseline estimator, n 2m0/(2m0+d) T , our results indicate that the transfer learning dynamic depends jointly on the sample size in source domain n S, and the constant ξU. Specifically, when the relative model total strength between offset and source is small, the rough estimation error predominates, and thus, the statistical convergence rate of ˆf T is much faster than the target-only baseline given n S n T . Conversely, a large ξU will make the offset estimation error the dominant term, but the statistical rate keeps the same as the target-only baseline up to a constant.

OTL Efficacy. In earlier theoretical works, Wang et al. (2016); Du et al. (2017) failed to identify the presence of ξU in their statistical rates. The bounds in some recent works, e.g., Li et al. (2022); Tian & Feng (2022), only identified ξU h2, i.e., the corresponding upper bound should be

n S log n S

2m0 2m0+d + n T log n T

2m 2m+d h2 !

This bound claimed that the OTL takes effect when the magnitude of h is small while disregarding the influence of the

source model s total strength. Conversely, our results reveal a new perspective: the transfer efficacy within the OTL framework jointly depends on the properties of offset and source models. This means that even with the same offset model, whether OTL takes effect can differ given different source models. Thus, the constant ξU can be viewed as a similarity measure between source and target domains under the OTL framework.

We further illustrate how our results provide a more accurate interpretation of the OTL efficacy compared to the form (1) via the following example. In Figure 1, we construct two source-target pairs termed as (S0, T 0) and (S1, T 1) with identical h. While with identical h, the two pairs possess difference angle θ0 and θ1 given different source model s total strength, thus implying the different degree of similarity between source and target domains. Geometrically, one can interpret ξU as a factor that approximately represents the angle θ0 and θ1 between source and target domains. While the form (1) suggests two pairs have the same OTL efficacy, our upper bound indicates the set (S0, T 0) has higher efficacy, which aligns with the fact that (S0, T 0) possesses higher similarity.

Figure 1. Geometric illustration for how ξU will affect the OTL. The circle represents a ball centered around the source with radius h. The length of the red and blue lines represents the magnitude of the model s total strength. Two source-target pairs (denoted by red and blue) possess the same offset while the source models total strength is different, leading to different angles θ0 and θ1 between domains.

Remark 4.2. It should be noted that the above geometric interpretation of ξU is somewhat rough due to the use of the upper bound of f S Hm0 and fδ Hm. As a result, ξU cannot precisely reflect the exact angle between f T and f S but only the angle between domains as we termed above. However, with some additional assumptions, one can obtain a fine-grained angle interpretation. For example, if one uses the Sobolev norm of f S and fδ directly and assumes the signal-to-noise ratio of source and offset models are bounded below, then ξU fδ 2 Hm/ f S 2 Hm0 and thereby is able to reflect the exact angle between f T and f S.

Smoothness Adaptive Hypothesis Transfer Learning

5. Experiments

This section aims to confirm our theoretical results in the target-only KRR and transfer learning sections. 1

5.1. Experiments for Target-Only KRR

Let X = [0, 1] and the marginal distribution of x be the uniform distribution over [0, 1]. Our objective is to empirically confirm the adaptability of Gaussian kernels in target-only KRR when f0 Hα([0, 1]) for different smoothness α. Specifically, we explore cases where f0 belongs to H2 and H3. To generate such f0 with the desired Sobolev smoothness, we set f0 to be the sample path that is generated from the Gaussian process with isotropic Mat ern covariance kernels Kν (Stein, 1999). We set ν = 2.01 and 3.01 to generate the corresponding f0 with smoothness 2 and 3, see Corollary 4.15 in Kanagawa et al. (2018) for detail discussion about the connection between ν and α. Formally, we consider the following data generation procedure: yi = f0(xi) + σϵi, where ϵi are i.i.d. standard Gaussian

noise, {xi}n i=1 i.i.d. U([0, 1]) and σ = 0.5.

KRR (non-adaptive), =2

Est. Decay=-0.8 True Decay=-0.8

KRR (adaptive), =2

Est. Decay=-0.8 True Decay=-0.8

1000 1500 2000 2500

KRR (non-adaptive), =3

Est. Decay=-0.85 True Decay=-0.86

1000 1500 2000 2500

KRR (adaptive), =3

Est. Decay=-0.86 True Decay=-0.86

sample size (n)

Figure 2. Error decay curves of target-only KRR based on Gaussian kernel, both axes are in log scale. The blue curves denote the average generalization errors over 100 trials. The dashed black lines denote the theoretical decay rates.

We verify both the nonadaptive and adaptive rate presented in Theorem 3.1 and 3.3. The sample size ranges from 1000 to 3000 in intervals of 100. For different α, we set λ = exp{ Cn 2 2α+1 } with a fixed C. To evaluate the adaptivity rate, we set the candidate smoothness as [1, 2, 3, 4, 5] and split the dataset equally in size to implement training and validation. The generalization error ˆf f L2 is obtained by Simpson s rule. For each combination of n and α, we repeat the experiments 100 times and report the average generalization error. To demonstrate the convergence rate

1The code to reproduce our experimental results is available at https://github.com/haotianlin/SATL.

of the error is sharp, we regress the logarithmic average generalization error, i.e. log( ˆf f L2), on log(n) and compare the regression coefficient to its theoretical counterpart 2α 2α+1.

We try different values of C lies in the equally spaced sequence [0.05, 0.1, , 4], and report the optimal curve in Figure 2 under the best choice of C. Remarkably, the empirical data points align closely with the theoretical lines for both nonadaptive and adaptive rates. The estimated regression coefficients also closely agree with the theoretical counterparts. Additionally, we also report the generalization error decay estimation results for other values of C and refer to Appedix E for more details.

5.2. Experiments for Transfer Learning

We now illustrate our theoretical analysis of SATL through two experiments with synthetic data. We generate the target/source functions and the offset function as follows: (i) The target function f T is a sample path of the Gaussian process with Mat ern kernel K1.01 such that f T H1; (ii) The offset function fδ is a sample path of Gaussian process with Mat ern kernel Kν with ν = 2.01, 3.01, 4.01 such that the offset fδ belongs to H2, H3, H4 respectively. Hence, we consider the following data generation procedure:

{xi,T }n T i=1, {xi,S}n S i=1 i.i.d. U([0, 1])

yi,T = f T (xi,T ) + σϵi,T i = 1, , n T yi,S = f T (xi,S) + fδ(xi,S) + σϵi,S i = 1, , n S

where ϵi,p are i.i.d. standard Gaussian noise and σ = 0.5.

To demonstrate the transfer learning effect, we consider two different settings:

(1) Fixed n T scenario: Fix n T as 50 and vary n S.

(2) Varying n T scenario: Set n S = n3/2 T while varying n T , i.e., the source sample size grows in a polynomial order of target sample size.

In the first scenario, it is expected that the generalization error first decreases and then remains unchanged as n S increases since the offset estimation error (a constant for fixed n T ) eventually dominates. In the second scenario,

the generalization error satisfies E( ˆf T ) = O(n 3m0 2m0+1 T +

n 2m 2m+1 T ξU) = O(n 2m 2m+1 T ). We consider the finite basis expansion (FBE) TL algorithm proposed in Wang et al. (2016) as a competitor. The authors originally used the Fourier basis in their paper, which produced weak results in our setting. Therefore, we compared SATL to their algorithm with Fourier basis and an additional modification by employing Bspline. We refer to Appendix E for implementation details on different types of Fourier basis and the other experiments details.

Smoothness Adaptive Hypothesis Transfer Learning

h =1 h =3 h =5

FBE (Fourier)

500 1000 1500

FBE (Bspline)

500 1000 1500 500 1000 1500

Target-Only OTL (m =2) OTL (m =3) OTL (m =4)

(a) Fixed n T scenario

h =1 h =3 h =5

75 100 125 150

75 100 125 150 75 100 125 150

Target-Only SATL Theoretical

(b) Varying n T scenario

Figure 3. Generalization error under different h and smoothness of fδ. Each curve denotes the average error over 100 trails, and the shadow regions denote one standard error of the mean. The left figure contains results for fixed n T scenario while the right figure is for varying n T scenario. In Figure 3b, the green line denotes the theoretical upper bound n 2m/(2m+1) T (up to constants).

Figure 3a presents the generalization error for the fixed n T scenario. As n S increases, the generalization error initially decreases and then gradually levels off, consistent with our expectations. Furthermore, if the smoothness of the offset function fδ is higher, a smaller error is obtained, which agrees mildly with our theoretical analysis. Finally, compared to the FBE approach, SATL achieves overall smaller errors. Figure 3b presents the generalization error for the varying n T scenario. Here, the error term is expected to

be upper bounded by n 2m 2m+1 T . One can see our empirical error is consistent with the theoretical upper bound asymptotically in all settings. Besides, the SATL outperforms the target-only learning KRR baseline in all settings.

6. Future Direction

Beyond Sobolev Space. In developing the optimality of target-only KRR with Gaussian kernels, we use the Fourier transform technique to control the approximation error (see Appendix C.1), which is feasibly applied to RKHS that are norm-equivalent to fractional Sobolev spaces. Although this makes our results quite broadly applicable when one is primarily interested in the smoothness of the functions, it certainly doesn t cover all possible structures of interest (e.g., periodic functions, etc). It is of interest to develop other mathematical tools to extend the current results for Sobolev spaces to more general RKHS.

Few-Shot Transfer Learning. Theorem 4.1 is developed based on the asymptotic rates of Theorem 3.3. Thus, the validity of theoretical results of SATL requires both n T and n S sufficiently large to allow some lower order terms van-

ish (note that although n T and n S need to be sufficiently large, n S n T still remains, which shows that the settings we consider are still within the regime of transfer learning). In the case when n T is extremely small, e.g., few-shot transfer learning, one needs a non-asymptotic rate for KRR with fixed bandwidth Gaussian kernel to develop the upper bound.

7. Conclusion

We presented SATL, a kernel-based OTL that uses Gaussian kernels as imposed kernels. This enables the estimators to adapt to the varying and unknown smoothness in their corresponding functions. SATL achieves minimax optimality (up to a logarithmic factor) as the upper bound of SATL matched the lower bound of the OTL problem. Notably, our Gaussian kernels result in target-only learning also serves as a good supplement to misspecified kernel learning literature.

Impact Statement

This paper aims to theoretically achieve hypothesis transfer learning adaptively under a nonparametric regression setting and provide corresponding statistical guarantees. It provides insights about the transfer dynamic, i.e., when offset transfer learning improves performance compared to target-only learning, and the necessity of adaptive learning in statistical transfer learning. Since the work is focused on a theoretical perspective, there is no present immediate ethical impact or societal implication that we feel must be specifically highlighted here.

Smoothness Adaptive Hypothesis Transfer Learning

Aghbalou, A. and Staerman, G. Hypothesis transfer learning with surrogate classification losses: Generalization bounds through algorithmic stability. In International Conference on Machine Learning, pp. 280 303. PMLR, 2023.

Bastani, H. Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964 2984, 2021.

Bauer, F., Pereverzev, S., and Rosasco, L. On regularization algorithms in learning theory. Journal of complexity, 23 (1):52 72, 2007.

Blanchard, G. and M ucke, N. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971 1013, 2018.

Cai, T. T. and Pu, H. Transfer learning for nonparametric regression: Non-asymptotic minimax analysis and adaptive procedure. ar Xiv preprint ar Xiv:2401.12272, 2024.

Cai, T. T. and Wei, H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100 128, 2021.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331 368, 2007.

Cheng, G. and Shang, Z. Joint asymptotics for seminonparametric regression models with partially linear structure. The Annals of Statistics, 43(3):1351 1390, 2015.

De Vore, R. A. and Sharpley, R. C. Besov spaces on domains in rˆ{d}. Transactions of the American Mathematical Society, 335(2):843 864, 1993.

Dicker, L. H., Foster, D. P., and Hsu, D. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. 2017.

Du, S. S., Koushik, J., Singh, A., and P oczos, B. Hypothesis transfer learning via transformation functions. Advances in neural information processing systems, 30, 2017.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. ar Xiv preprint ar Xiv:2002.09434, 2020.

Duan, Y. and Wang, K. Adaptive and robust multi-task learning. ar Xiv preprint ar Xiv:2202.05250, 2022.

Eberts, M. and Steinwart, I. Optimal regression rates for SVMs using Gaussian kernels. Electronic Journal of Statistics, 7(none):1 42, 2013. doi: 10.1214/12-EJS760. URL https://doi.org/10.1214/12-EJS760.

Fasshauer, G. E. and Ye, Q. Reproducing kernels of generalized sobolev spaces via a green function approach with distributional operators. Numerische Mathematik, 119: 585 611, 2011.

Fischer, S. and Steinwart, I. Sobolev norm learning rates for regularized least-squares algorithms. The Journal of Machine Learning Research, 21(1):8464 8501, 2020.

Gao, J., Fan, W., Jiang, J., and Han, J. Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 283 291, 2008.

Geer, S. A. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.

Hamm, T. and Steinwart, I. Adaptive learning rates for support vector machines working on data with low intrinsic dimension. The Annals of Statistics, 49(6):3153 3180, 2021.

He, Z., Sun, Y., and Li, R. Transfusion: Covariate-shift robust transfer learning for high-dimensional regression. In International Conference on Artificial Intelligence and Statistics, pp. 703 711. PMLR, 2024.

Kanagawa, M., Hennig, P., Sejdinovic, D., and Sriperumbudur, B. K. Gaussian processes and kernel methods: A review on connections and equivalences. ar Xiv preprint ar Xiv:1807.02582, 2018.

Kpotufe, S. and Martinet, G. Marginal singularity and the benefits of labels in covariate-shift. The Annals of Statistics, 49(6):3299 3323, 2021.

Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In International Conference on Machine Learning, pp. 942 950. PMLR, 2013.

Kuzborskij, I. and Orabona, F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106:171 195, 2017.

Lepskii, O. On a problem of adaptive estimation in gaussian white noise. Theory of Probability & Its Applications, 35 (3):454 466, 1991.

Li, S., Cai, T. T., and Li, H. Transfer learning for highdimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):149 173, 2022.

Li, T. and Yuan, M. On the optimality of gaussian kernel based nonparametric tests against smooth alternatives. ar Xiv preprint ar Xiv:1909.03302, 2019.

Smoothness Adaptive Hypothesis Transfer Learning

Li, X. and Bilmes, J. A bayesian divergence prior for classiffier adaptation. In Artificial Intelligence and Statistics, pp. 275 282. PMLR, 2007.

Li, Y., Zhang, H., and Lin, Q. On the saturation effect of kernel ridge regression. In The Eleventh International Conference on Learning Representations, 2023.

Lin, H. and Reimherr, M. On hypothesis transfer learning of functional linear models. stat, 1050:22, 2024.

Lin, J. and Cevher, V. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. Journal of Machine Learning Research, 21 (147):1 63, 2020.

Ma, C., Pathak, R., and Wainwright, M. J. Optimally tackling covariate shift in rkhs-based nonparametric regression. ar Xiv preprint ar Xiv:2205.02986, 2022.

Mendelson, S. and Neeman, J. Regularization in kernel learning. 2010.

Minami, S., Fukumizu, K., Hayashi, Y., and Yoshida, R. Transfer learning with affine model transformation. Advances in Neural Information Processing Systems, 36, 2024.

Orabona, F., Castellini, C., Caputo, B., Fiorilla, A. E., and Sandini, G. Model adaptation with least-squares svm for adaptive hand prosthetics. In 2009 IEEE international conference on robotics and automation, pp. 2897 2903. IEEE, 2009.

Reeve, H. W., Cannings, T. I., and Samworth, R. J. Adaptive transfer learning. The Annals of Statistics, 49(6):3618 3649, 2021.

Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive approximation, 26(2):153 172, 2007.

Stein, M. L. Interpolation of spatial data: some theory for kriging. Springer Science & Business Media, 1999.

Steinwart, I. and Christmann, A. Support vector machines. Springer Science & Business Media, 2008.

Steinwart, I., Hush, D., and Scovel, C. An explicit description of the reproducing kernel hilbert spaces of gaussian rbf kernels. IEEE Transactions on Information Theory, 52(10):4635 4643, 2006.

Steinwart, I., Hush, D. R., Scovel, C., et al. Optimal rates for regularized least squares regression. In COLT, pp. 79 93, 2009.

Tian, Y. and Feng, Y. Transfer learning under highdimensional generalized linear models. Journal of the American Statistical Association, pp. 1 14, 2022.

Tian, Y., Gu, Y., and Feng, Y. Learning from similar linear representations: Adaptivity, minimaxity, and robustness. ar Xiv preprint ar Xiv:2303.17765, 2023.

Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in neural information processing systems, 33: 7852 7862, 2020.

Varshamov, R. R. Estimate of the number of signals in error correcting codes. Docklady Akad. Nauk, SSSR, 117: 739 741, 1957.

Wang, K. Pseudo-labeling for kernel ridge regression under covariate shift. ar Xiv preprint ar Xiv:2302.10160, 2023.

Wang, W. and Jing, B.-Y. Gaussian process regression: Optimality, robustness, and relationship with kernel ridge regression. Journal of Machine Learning Research, 23 (193):1 67, 2022.

Wang, X. and Schneider, J. G. Generalization bounds for transfer learning under model shift. In UAI, pp. 922 931, 2015.

Wang, X., Oliva, J. B., Schneider, J. G., and P oczos, B. Nonparametric risk and stability analysis for multi-task learning problems. In IJCAI, pp. 2146 2152, 2016.

Wendland, H. Scattered data approximation, volume 17. Cambridge university press, 2004.

Xu, Z. and Tewari, A. Representation learning beyond linear prediction functions. Advances in Neural Information Processing Systems, 34:4792 4804, 2021.

Zhang, H., Li, Y., Lu, W., and Lin, Q. On the optimality of misspecified kernel ridge regression. In International Conference on Machine Learning, pp. 41331 41353. PMLR, 2023.

Zhang, X., Blanchet, J., Ghosh, S., and Squillante, M. S. A class of geometric structures in transfer learning: Minimax bounds and optimality. In International Conference on Artificial Intelligence and Statistics, pp. 3794 3820. PMLR, 2022.

Smoothness Adaptive Hypothesis Transfer Learning

This appendix encompasses more discussions, experiments, and proofs of the theoretical results presented in the main body. Appendix A provides the introduction of the notation we used. Appendix B provides basic concepts of RKHS, Sobolev space, the interpolation space, and the results of norm-equivalence between RKHS and Sobolev space. Appendix C presents the proofs of the results in the Target-Only KRR section, including non-adaptive rate (Theorem 3.1) and adaptive rate (Theorem 3.3). A discussion of the comparison between this work and previous works is also presented. Appendix D contains the proofs of the upper bound and lower bounds of SATL. We present additional experiment details in Appendix E.

A Notation 13

B Foundation of RKHS 13

B.1 Basic Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

B.2 Norm Equivalency between RKHS and Sobolev Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

C Target-Only KRR Learning Results 15

C.1 Proof Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

C.2 Proof of Non-adaptive Rate (Theorem 3.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

C.2.1 Proof of the approximation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

C.2.2 Proof of the estimator error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

C.2.3 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

C.2.4 Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

C.2.5 Auxiliary Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

C.3 Proof of Adaptive Rate (Theorem 3.3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

C.4 Comparison to Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

D Smoothness Adaptive Transfer Learning Results 25

D.1 Proof of Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

D.2 Proof of Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

D.3 Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

D.4 Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

E Additional Simulation Results 30

E.1 Additional Results for Target-Only KRR with Gaussian kernels . . . . . . . . . . . . . . . . . . . . . . . 30

E.2 Additional Details for TL Algorithm Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Smoothness Adaptive Hypothesis Transfer Learning

A. Notation

The following notations are used throughout the rest of this work and follow standard conventions. For asymptotic notations: f(n) = O(g(n)) means for all c there exists k > 0 such that f(n) cg(n) for all n k; f(n) g(n) means f(n) = O(g(n)) and g(n) = O(f(n)); f(n) = Ω(g(n)) means for all c there exists k > 0 such that f(n) cg(n) for all n k. We use the asymptotic notations in probability OP( ). That is, for a positive sequence {an}n 1 and a non-negative random variable sequence {Xn}n 1, we say Xn = OP(an) if for any δ > 0, there exist Mδ and Nδ such that P(Xn Mδan) 1 δ, n Nδ. The definition of ΩP follows similarly.

For a function f L1(Rd), its Fourier transform is denoted as

F(f)(ω) = (2π) d/2 Z

Rd f(x)e ix T ωdx.

Since we assume µT = µS, we use L2(X, dµp) for p {T, S} to represent the Lebesgue L2 space and abbreviate it as L2 for simplicity when there is no confusion.

B. Foundation of RKHS

B.1. Basic Concept

In this section, we will present some facts about the RKHS that are useful in our proof and refer readers to Wendland (2004) for more details.

Assume K : X X R is a continuous positive definite kernel function defined on a compact set X Rd (with positive Lebesgure measure and Lipschitz boundary). Indeed, every positive definite kernel can be associated with a reproducing kernel Hilbert space (RKHS). The RKHS, HK, of K are usually defined as the closure of linear space span{K( , x), x X}. In a special case where the kernel function K(x, y) is equal to a translation invariant (stationary) function Φ(x y) = K(x, y) with Φ : Rd R, we can characterize the RKHS of K in terms of Fourier transforms, i.e.

f L2(Rd) C(Rd) : F(f) p

F(Φ) L2(Rd)

When X is a subset of Rd, such a definition still captures the regularity of functions in HK(X) via a norm equivalency result that holds as long as X has a Lipschitz boundary.

For an integer m, we introduce the integer-order Sobolev space, Wm,p(X). For vector α = (α1, , αd), define |α| = α1 + + αd and D(α) = |α|

xα1 1 x αd d denote the multivariate mixed partial weak derivative. Then

Wm,p(X) = n f Lp(X) : D(α)f Lp(Rd), |α| m o ,

where m is the smoothness order of the Sobolev space. In this paper, we only consider p = 2 and abbreviate Wm,2(X) := Hm(X). Later, in Appendix B.2, we will see one can define the Hm via Fourier transform of the reproducing kernel instead of weak derivative.

We now introduce the power space of an RKHS. For the reproducing kernel K, we can define its integral operator TK : L2 L2 as

TK(f)( ) = Z

T K(s, )f(s)ds.

LK is self-adjoint, positive-definite, and trace class (thus Hilbert-Schmidt and compact). By the spectral theorem for self-adjoint compact operators, there exists an at most countable index set N, a non-increasing summable positive sequence {τj}j 1 and an orthonormal basis of L2, {ej}j 1 such that the integrable operator can be expressed as

j N τj , ej L2ej.

Smoothness Adaptive Hypothesis Transfer Learning

The sequence {τj}j 1 and the basis {ej}j 1 are referred as the eigenvalues and eigenfunctions. The Mercer s theorem shows that the kernel K itself can be expressed as

K(x, x ) = X

j N τjej(x)ej(x ), x, x T ,

where the convergence is absolute and uniform.

We now introduce the fractional power integral operator and the composite integral operator of two kernels. For any s 0, the fractional power integral operator Ls K : L2 L2 is defined as

T s K( ) = X

j N τ s j , ej L2ej.

Then the power space [HK]s is defined as

s 2 j ej : (aj) ℓ2(N)

and equipped with the inner product f, g [HK]s = D T s

2 K (f), T s

For 0 < s1 < s2, the embedding [HK]s2 , [HK]s1 exists and is compact. A higher s indicates the functions in [HK]s

have higher regularity. When HK = Hm with m > d/2, the real interpolation indicates [Hm]s = Hms, s > 0.

B.2. Norm Equivalency between RKHS and Sobolev Space

Now, we state the result that connects the general RKHS and Sobolev space. Lemma B.1. Let K(x, x ) be the translation-invariant kernel and K L1(Rd). Suppose X has a Lipschitiz boundary, and the Fourier transform of K has the following spectral density of m, for m d/2,

c1(1 + 2 2)m F(K)( ) c2(1 + 2 2)m. (2)

for some constant 0 < c1 c2. Then, the associated RKHS of K, HK(X), is norm-equivalent to the Sobolev space Hm(X).

Hence, we can naturally define the Sobolev space of order m (m > d

Hm(Rd) = f L2(Rd) C(Rd) : F(f)( )(1 + 2 2)m L2(Rd) .

One advantage of this definition over the classical way that involves weak derivatives is it does not require m to be an integer, and thus one can consider the fractional Sobolev space, i.e. m R+. Such equivalence also holds on X by applying the extension theorem (De Vore & Sharpley, 1993). As an implication, let Km,ν denotes the isotropic Mat ern kernel (Stein, 1999), i.e.

Km,ν(x; ρ) = 21 ν

then the Fourier transform of Km,ν satisfies Equation 2 with m = ν + d

2, and thus the RKHS associated with Km,ν is norm equivalent to Sobolev space Hν+ d

2 (Wendland, 2004).

For a reproducing kernel that satisfies (2), we call it a kernel with Fourier decay rate m and denote it as Km. We further denote its associated RKHS as HKm(X). The Fourier decay rate m captures the regularity of HKm(X).

Now, we are ready to define the function space of f S, f T and fδ via the kernel regularity Assumption B.2 (Smoothness of Target/Source). There exists an m0 d/2 such that f T and f S belong to HKm0. Assumption B.3 (Smoothness of Offset). There exists an m m0 such that fδ := f T f S belongs to HKm.

The proof of all the theoretical results in Section 3 and 4 is built on the assumptions that the true functions are in Sobolev space. Via the norm equivalency (Lemma B.1), the true functions also reside in RKHSs associated with kernel Km0 and Km. Therefore, all the theoretical results still hold under Assumption B.2 and B.3.

Smoothness Adaptive Hypothesis Transfer Learning

C. Target-Only KRR Learning Results

C.1. Proof Scheme

Before formally proving the theoretical results, we first illustrate the main scheme for the proof of Theorem 3.1.

For the given imposed Gaussian kernel K, we define its corresponding integral operator TK : L2(X) L2(X) as

TK(f)( ) = Z

X f(x)K(x, )dρ(x)

where ρ(x) is the probability measure over X. We also note that the integral operator TK can also be viewed as a bounded linear operator on HK.

We now consider its empirical version when the sample {(xi, yi)}n i=1 are available. For a x X, we define the sampling operator as Kx : HK R and its adjoint operator K x : R HK as

Kx : HK R, f 7 f, Kx HK and K x : R HK, y 7 y Kx.

Then we can define the empirical version of TK, termed sample covariance operator, as TK,n : HK HK as

i=1 K xi Kxi.

With these notations, the KRR estimator can be written as

ˆf = (TK,n + λI) 1( 1

i=1 Kxiyi) := (TK,n + λI) 1gn

where gn = 1

n Pn i=1 Kxiyi HK and I is the identical operator. We further define the intermediate term as follows,

fλ := argmin f HK

(f0 f) 2 L2 + λ f 2 HK

and one can show fλ = (TK + λI) 1TK(f0) = (TK + λI) 1g.

Then by triangle inequality, ˆf f0 L2 ˆf fλ L2 | {z } estimation error

+ fλ f0 L2 | {z } approximation error

The following paragraphs discuss the analysis for each of the error terms and how they differ from previous works.

Approximation error. In classical misspecified kernel methods, one typically controls the approximation error via interpolation/power space technique. Specifically, the intermediate term fλ is placed in [Hα0]s while the true function in Hα0 (here, s denotes the interpolation index). Therefore, one can expand fλ and f0 under the same basis since fλ lies in the interpolation/power space of Hα0, which typically controls the approximation error in the form of λs f0 Hα0 and makes the optimal order of λ in n takes polynomial pattern like Proposition 2.1. This technique is widely used in many misspecified kernel literature like Theorem A.2 in Zhang et al. (2023) and etc.

However, since in our case, the intermediate term, fλ, lies in the RKHS associated with Gaussian kernels (s needs to be ), one can t expand fλ and f0 under the same basis and thus such a technique no longer holds. Therefore, the techniques we used are the Fourier transform of the Gaussian kernel and Plancherel Theorem, which allows us to prove

fλ f0 2 L2 fλ f0 2 L2 + λ fλ 2 HK Clog 1

α0 f0 2 Hα0.

Here, the second inequality is proved via the Plancherel Theorem, see Proposition C.2, and HK denote the RKHS associated with Gaussian kernels. We also highlight that this is a technical contribution of this paper.

Smoothness Adaptive Hypothesis Transfer Learning

Estimation error. Regarding the estimation error, we use the standard integral operator techniques origin from Smale & Zhou (2007) and follow a similar strategy as Fischer & Steinwart (2020); Zhang et al. (2023). While most of the current work deals with cases where the eigenvalue decay rate is polynomial, we refine the proof to handle the Gaussian kernel case, whose eigenvalues decay exponentially.

C.2. Proof of Non-adaptive Rate (Theorem 3.1)

In the following proof, we will use C, C1, and C2 to represent constants that could change from place to place. Unless specifically specified, we also omit the X in the norms or in the inner product notation. We use op to denote the operator norm of a bounded linear operator.

In addition, we denote the effective dimension as

N(λ) = tr((TK + λ) 1TK) =

C.2.1. PROOF OF THE APPROXIMATION ERROR

For the approximation error, we can directly apply Proposition C.2, which leads to

fλ f0 2 L2 log( 1

λ) α0 f0 2 Hα0.

Then selecting log(1/λ) n

2 2α0+d leads to

fλ f0 2 L2 n 2α0 2α0+d f0 2 Hα0.

C.2.2. PROOF OF THE ESTIMATOR ERROR

Theorem C.1. Suppose the Assumption (A1) to (A3) hold and f0 Lq Cq for some q. Then by choosing log(1/λ)

2 2α0+d , for any fixed δ (0, 1), when n is sufficient large, with probability 1 δ, we have

ˆf fλ L2 ln(4

δ )Cn α0 2α0+d

where C is a constant proportional to σ.

Proof. First, we notice that ˆf fλ L2 = T

1 2 K ˆf fλ HK

1 2 K (TK + λI) 1

2 op | {z } A1

1 2 (TK,n + λI) 1 (TK + λI)

1 2 op | {z } A2

(TK + λI) 1

2 (gn (TK,n + λI) fλ) HK | {z } A3

For the first term A1, we have

1 2 K (TK + λI) 1

2 op = sup i 1

Smoothness Adaptive Hypothesis Transfer Learning

For the second term, using Proposition C.3 with sufficient large n, we have

δ ( TK op + λ)

such that (TK + λI) 1

2 (TK TK,n) (TK + λI) 1

holds with probability 1 δ

A2 = (TK + λI) 1 2 (TK,n + λI 1)(TK + λI) 1 2 op

= (TK + λI) 1

2 (TK,n + λ) (TK + λI) 1

= I (TK + λI) 1

2 (TK,n TK) (TK + λI) 1

(TK + λI) 1

2 (TK TK,n) (TK + λI) 1

For the third term A3, notice (TK + λI) 1

2 (gn (TK,n + λI) fλ) HK = (TK + λI) 1

2 [(gn TK,n(fλ)) (g TK(fλ))] HK

Using the Proposition C.4, with probability 1 δ

T3 = (TK + λI) 1

2 (gn (TK,n + λI) fλ) HK Cln(4

δ )n α0 2α0+d

where C σ. Combing the bounds for T1, T2 and T3, we finish the proof.

C.2.3. PROOF OF THEOREM 3.1

Based on the bounds of the approximation and estimation error, we finish the proof as follows, ˆf f0 L2 ˆf fλ L2 + fλ f0 L2

= OP n (σ + f0 Hα0) n α0 2α0+d o .

C.2.4. PROPOSITIONS

Proposition C.2. Suppose fλ is defined as follows,

fλ = argmin f HK(X)

n f f0 2 L2(X) + λ f 2 HK(X) o .

Then, under the regularized conditions, the following inequality holds,

fλ f0 2 L2(X) + λ fλ 2 HK(X) Clog 1

α0 f0 2 Hα0.

Smoothness Adaptive Hypothesis Transfer Learning

Proof. Since X has Lipschitz boundary, there exists an extension mapping from L2(X) to L2(Rd), such that the smoothness of functions in L2(X) get preserved. Therefore, there exist constants C1 and C2 such that for any function g Hα0(X), there exists an extension of g, ge Hα0(Rd) satisfying

C1 ge Hα0(Rd) g Hα0(X) C2 ge Hα0(Rd).

Denote fλ,e = argmin f HK(Rd)

n f f0,e 2 L2(Rd) + λ f 2 HK(Rd) o

Then we have, fλ f0 2 L2(X) + λ fλ 2 HK(X) fλ,e|X f0 2 L2(X) + λ fλ,e|X 2 L2(X)

C2 fλ,e f0,e 2 L2(Rd) + λ fλ,e 2 L2(Rd) .

where fλ,e|X is the restriction of fλ,e on X. By Fourier transform of the Gaussian kernel and Plancherel Theorem, we have

fλ,e f0,e 2 L2(Rd) + λ fλ,e 2 HK(Rd)

Rd |F (f0,e) (ω) F (fλ,e) (ω)|2 dω + λ Z

Rd |F (fλ,e) (ω)|2 exp{C ω 2 2}dω

|F (f0,e) (ω) F (fλ,e) (ω)|2 + λ |F (fλ,e) (ω)|2 exp{C ω 2 2} dω

Rd λexp{C ω 2 2} 1 + λexp{C ω 2 2} |F (f0,e) (ω)|2 dω

λexp{C(1 + ω 2 2)} 1 + λexp{C(1 + ω 2 2)} |F (f0,e) (ω)|2 dω + Z

ΩC λexp{C(1 + ω 2 2)} 1 + λexp{C(1 + ω 2 2)} |F (f0,e) (ω)|2 dω

Ω λexp{C(1 + ω 2 2)} |F (f0,e) (ω)|2 dω + Z

ΩC |F (f0,e) (ω)|2 dω

where Ω= {ω : λexp{C(1 + ω 2 2)} < 1} and ΩC = Rd\Ω, and the third equality follows the definition of f e . Over ΩC, we notice that

(1 + ω 2 2) 1

α0 (1 + ω 2 2)α0 1.

Over Ω, we first note that the function h(ω) = exp{C(1 + ω 2 2)}/(1 + ω 2 2)α0 reaches its maximum Cα0λ 1log( 1

λ) α0 if λ satisfies λ < exp{ α0} and λlog( 1

λ)α0 Cα0exp{ C}. One can verify when λ 0 as n 0, the two previous inequality holds. Then

λexp{C(1 + ω 2 2)} Cα0log 1

α0 (1 + ω 2 2)α0 ω Ω.

Combining the inequality over Ωand ΩC,

f e f0,e 2 L2(Rd) + λ f e 2 HK(Rd)

Ω λexp{C(1 + ω 2 2)} |F (f0,e) (ω)|2 dω + Z

ΩC |F (f0,e) (ω)|2 dω

Rd(1 + ω 2 2)α0|F(f0,e)(ω)|2dω

α0 f0,e 2 Hα0(Rd)

α0 f0 2 Hα0(X)

which completes the proof.

Smoothness Adaptive Hypothesis Transfer Learning

Proposition C.3. For all δ (0, 1), with probability at least 1 δ, we have

(TK + λI) 1

2 (TK TK,n) (TK + λI) 1

2 op 4N(λ)B

B = ln(4N(λ)

δ ( TK op + λ)

Proof. Denote Ai = (TK + λI) 1

2 (TK TK,xi)(TK + λI) 1

2 , applying Lemma C.7, we get

Ai op (TK + λI) 1

2 TK,x (TK + λI) 1

2 op + (TK + λI) 1

2 TK,xi (TK + λI) 1

2 op 2E2 KN(λ)

Notice E A2 i E h (TK + λI) 1

2 TK,xi (TK + λI) 1

E2 KN(λ) E h (TK + λI) 1

2 TK,xi (TK + λI) 1

= E2 KN(λ)(TK + λI) 1TK := V

where A B denotes B A is a positive semi-definite operator. Notice

V op = N(λ) TK op TK op + λ N(λ), and tr(V ) = N(λ)2.

The proof is finished by applying Lemma C.8 to Ai and V .

Proposition C.4. Suppose that Assumptions in the estimation error theorem hold. We have (TK + λI) 1

2 gn (TK,n + λI) 1 fλ HK Cln(4

δ )n α0 2α0+d

where C is a constant.

Proof. Denote ξi = ξ(xi, yi) = (TK + λI) 1

2 (Kxiyi TK,xifλ)

ξx = ξ(x, y) = (TK + λI) 1

2 (Kxy TK,xfλ),

then it is equivalent to show 1 n

i=1 ξi E ξx

δ )n α0 2α0+d

Define Ω1 = {x X : |f0| t} and Ω2 = X\Ω1. We decompose ξi and ξx over Ω1 and Ω2, which leads to

(TK + λI) 1

2 gn (TK,n + λI) 1 fλ HK

i=1 ξi Ixi Ω1 E ξx Ix Ω1

HK | {z } I1

i=1 ξi Ixi Ω2

HK | {z } I2

+ E ξx Ix Ω2 HK | {z } I3

For I1, applying Proposition C.5, for any δ (0, 1), with probability 1 δ, we have

N(λ) n M + C2 p

N(λ) n + C1log( 1

Smoothness Adaptive Hypothesis Transfer Learning

where C1 = 8

2, C2 = 8σ and M = L + (N(λ) + 1)t. By choosing log(1/λ) exp{ Cn

2 2α0+d } and applying Lemma C.6, we have

for the second term in (3), C2 p

N(λ) n n α0 2α0+d .

for the third term, C1log( 1

N(λ) n C2 p

N(λ) n n α0 2α0+d .

for the first term,

N(λ) n M C1L p

N(λ) n + C1t N(λ) 3 2 n n α0 2α0+d given t n

2α0 d 2(2α0+d) .

Combining all facts, if t n

2α0 d 2(2α0+d) , with probability 1 δ we have

δ )n α0 2α0+d .

For I2, we have

P ( xi s.t. xi Ω2)

= 1 P (x / Ω2)n

= 1 P (|f0(x)| t)n

Letting τn 0 leading t n 1 q . That is to say, if t n 1 q holds, we have τn = P (I2 > I1) 0.

For I3, we have I3 E ξx Ix Ω2 HK

E (TK + λI) 1

2 K(x, ) HK |(y f0(x)) Ix Ω2|

E2 KN(λ) E |(y f0(x)) Ix Ω2|

E2 KN(λ) (E |(fλ f0(x)) Ix Ω2| + E |ϵIx Ω2|)

Using Cauchy-Schwarz inequality yields

E |(fλ f0(x)) Ix Ω2| f0 fλ L2 P(x Ω2) log( 1

In addition, we have E |ϵIx Ω2| σ E |Ix Ω2| σ(Cq)qt q.

Together, we have

2 (Cq)qt q + σ(Cq)qt q.

Notice if we pick q 2(2α0+d)

2α0 d , there exist t such that with probability 1 δ τn, we have

I1 + I2 + I3 Cln(2

δ )n α0 2α0+d .

For fixed δ, as n , τn is sufficiently small such that τn = o(δ), therefore without loss of generality, we can say with probability 1 δ τn, we have

I1 + I2 + I3 Cln(2

δ )n α0 2α0+d .

Smoothness Adaptive Hypothesis Transfer Learning

Proposition C.5. Under the same conditions as the Proposition, we have 1 n

i=1 ξi Ixi Ω1 E ξx Ix Ω1

N(λ) n M + C2 p

N(λ) n + C1log( 1

where C1 = 8

2, C2 = 8σ and M = L + (N(λ) + 1)t.

Proof. In order to use Bernstein inequality (Lemma C.9), we first bound the m-th moment of ξx Ix Ω1.

E ξx Ix Ω1 m HK = E (TK + λI) 1

2 Kx (y fλ(x)) Ix Ω1 m

E (TK + λI) 1

HK E (|(y fλ(x)) Ix Ω1|m | x) .

Using the inequality (a + b)m 2m 1 (am + bm), we have

|y fλ(x)|m 2m 1 fλ(x) f ρ (x) m + f ρ (x) y m

= 2m 1 fλ(x) f ρ (x) m + |ϵ|m .

Combining the inequalities, we have

E ξx Ix Ω1 m HK 2m 1 E (TK + λI) 1

fλ(x) f ρ (x) Ix Ω1 m

+ 2m 1 E (TK + λI) 1

HK E (|ϵIx Ω1|m | x)

We first focus on B2, by Lemma C.7, we have

E (TK + λI) 1

HK E2 KN(λ) m

By the error moment assumption, we have

E (|ϵIx Ω1|m | x) E (|ϵ|m | x) 1

together, we have

N(λ) 2 (2LN(λ))m 2 . (4)

Turning to bounding B1, we first have

(fλ f0)Ix Ω1 L fλIx Ω1 L + f0Ix Ω1 L (TK + λI) 1TK(f0)Ix Ω1 L + f0Ix Ω1 L

(TK + λI) 1TK op + 1 f0Ix Ω1 L (N(λ) + 1) t := M.

With bounds on approximation error, we get the upper bound for B1 as

B1 2m 1N(λ) m

2 (fλ f0)Ix Ω1 m 2 L (fλ f0)Ix Ω1 2 L2

2 M m 2log( 1

2m! 2log( 1

N(λ) 2 2M p

Smoothness Adaptive Hypothesis Transfer Learning

Denote L = 2(L + M) p

N(λ) + 2log( 1

and combine the upper bounds for B1 and B2, i.e. (5) and (4), then we have

E ξx Ix Ω1 m HK 1

2m! σ2 Lm 2.

The proof is finished by applying Bernstein inequality i.e. Lemma C.9.

C.2.5. AUXILIARY LEMMA

Lemma C.6. If sj = C1 exp( C2j2), by choosing log(1/λ) n

2 2α0+d , we have

Proof. For a positive integer J 1

sj sj + λ +

J exp{ C2x2}dx

λ C1exp{ C2J2}

where we use the fact that the eigenvalue of the Gaussian kernel decays at an exponential rate, i.e. sj C1 exp{ C2j2} and the inequality Z

2 }dt exp{ x2

Then select J = n

d 2α0+d and λ = exp{ C n

2 2α0+d } with C C2 leads to

Lemma C.7. For µ-almost x X, we have

(TK + λI) 1

HK E2 KN(λ), and E (TK + λI) 1

For some constant EK. Consequently, we also have (TK + λI) 1

2 TK,x (TK + λI) 1

2 op E2 KN(λ).

Proof. We first state a fact on the Gaussian kernel. If K is a Gaussian kernel function with fixed bandwidth, then there exists a constant EK such that the eigenfunction of K is uniformly bounded for all j 1, i.e. supj 1 ej L EK. This is indeed the so-called uniformly bounded eigenfunction assumption that usually appears in nonparametric regression literature, especially for those who consider misspecified kernel in KRR, see Mendelson & Neeman (2010); Wang & Jing (2022). Based on the explicit construction of the RKHS associated with the Gaussian kernel (Steinwart et al., 2006), we know the uniformly bounded eigenfunction holds for the Gaussian kernel.

Smoothness Adaptive Hypothesis Transfer Learning

Based on the fact of uniformly bounded eigenfunction, we know e2 j(x) E2 K for all x X and j 1. Then, we prove the first inequality by the following procedure,

(TK + λI) 1

sj + λsjej(x)ej( )

sj sj + λe2 j(x)

The second inequality follows given the fact that E e2 j(x) = 1. The third inequality comes from the observation that for any f HK

(TK + λI) 1

2 TK,x (TK + λI) 1

2 (f) = D (TK + λI) 1

2 K(x, ), f E

HK (TK + λI) 1

and (TK + λI) 1

2 TK,x (TK + λI) 1

2 op = sup f Hk =1 (TK + λI) 1

2 TK,x (TK + λI) 1

= (TK + λI) 1

The following lemma provides the concentration inequality about self-adjoint Hilbert-Schmidt operator-valued random variables, which is widely used in related kernel method literature, e.g., Theorem 27 in Fischer & Steinwart (2020), Lemma 26 in Lin & Cevher (2020) and Lemma E.3 in Zhang et al. (2023).

Lemma C.8. (Lemma E.3 in Zhang et al. (2023)) Let (X, B, µ) be a probability space, and H be a separable Hilbert space. Suppose A1, , An are i.i.d. random variables whose values in the set of self-adjoint Hilbert-Schmidt operators. If E Ai = 0 and the operator norm Ai L µ-a.e. x X, and there exists a self-adjoint positive semi-definite trace class operator V with E A2 i V . Then for δ (0, 1), with probability at least 1 δ, we have

where β = log(4tr(V )/δ V ).

Lemma C.9. (Bernstein inequality) Let (Ω, B, P) be a probability space, H be a separable Hilbert space, and ξ : Ω H be a random variable with

for all m > 2. Then for δ (0, 1), ξi are i.i.d. random variables, with probability at least 1 δ, we have

C.3. Proof of Adaptive Rate (Theorem 3.3)

Proof. To simplify the notation, for a given smoothness α and sample size n, we define

ψn(α) = n log n

Smoothness Adaptive Hypothesis Transfer Learning

First, we show that it is sufficient to consider the true Sobolev space α in A = {α1, , αN} with αj αj 1 1/ log n. If α0 (αj 1, αj), then Hαj Hα0 Hαj 1. Therefore, since ψn(α0) is squeezed between ψn(αj 1) and ψn(αj), we just need to show ψn(αj 1) ψn(αj). By the definition of ψn(α), the claim follows since

log ψn(αj 1)

ψn(αj) = 2αj 1 2αj 1 + d + 2αj 2αj + d

log n log n (αj αj 1) log n 1.

Therefore, we assume f0 Hαi where i {1, 2, , N}.

2 + 1 , i.e. m n

2 , by Theorem 3.1, for some constants C that doesn t depend on n, we have

E( ˆfλα,D1) log 4

2 (E(λα, m) + A(λα, m)) (6)

for all α A simultaneously with probability at least 1 Nδ. Here, E(λ, n) and A(λ, n) denote the estimation and approximation error that depends on the regularization parameter λ and sample size n in non-adaptive rate proof.

Furthermore, by Theorem 7.2 in Steinwart & Christmann (2008) and Assumption 2.5, we have

E( ˆfλ ˆ α) < 6 inf α A E( ˆfλα) + 128σ2L2 log 1

δ + log(1 + N)

< 6 inf α A E( ˆfλα) + 512σ2L2 log 1

δ + log(1 + N)

with probability 1 δ, where the last inequality is based on the fact that n m n

Combining (6) and (7), we have

E( ˆfλ ˆ α) < 6 log 4

2 inf α A E(λα, m) + A(λα, m) + 512σ2L2 log 1

δ + log(1 + N)

2 m 2α0 2α0+d + 512σ2L2 log 1

δ + log(1 + N)

2 n 2α0 2α0+d + 512σ2L2 log 1

δ + log(1 + N)

with probability at least 1 (1 + N)δ. With a variable transformation, we have

E( ˆfλ ˆ α) 12C log 4(1 + N)

2 n 2α0 2α0+d + 512σ2L2 log 1+N

δ + log(1 + N)

with probability 1 δ. Therefore, for the first term

12C log 4(1 + N)

2 n 2α0 2α0+d 24C

2 log2(1 + N) + 1

n 2α0 2α0+d

2α0 2α0+d + 24Cn 2α0 2α0+d

where the first inequality is based on the fact that a + b < ab + 1 for a, b > 1, while the second inequality is based on the fact that log(x) x

α0 2α0+d for some n such that log(log n)/ log n < 1/4. For the second term,

512σ2L2 log 1+N

δ + log(1 + N)

n 512σ2L2 log 1

δ + 1 + 2 log(1 + N)

512σ2L2 log 1

δ + 1 + 2 log n

The proof is finished by combining (8), (9) and (10).

Smoothness Adaptive Hypothesis Transfer Learning

C.4. Comparison to Previous Work

In Table 1, we compare our results with some state-of-the-art works (to the best of our knowledge) that consider general/Mat ern misspecified kernels and Gaussian kernels in target-only setting KRR. For a detailed review of the optimality of misspecified KRR, we refer readers to Zhang et al. (2023).

Wang & Jing (2022) considered the true function lies in Hα0 while the imposed kernels are misspecified Mat ern kernels. On the other hand, Zhang et al. (2023) considered the minimax optimality for misspecified KRR in general RKHS, i.e., the imposed kernel is K while the true function f0 [HK]s for s (0, 2]. However, when the RKHS is specified as the Sobolev space, the results in both papers are equivalent by applying the real interpolation technique in Appendix B. Therefore, we place them in the same row. Unlike the necessary conditions that the imposed RKHS must fulfill α 0 > α0/2 to achieve optimality (Wang & Jing, 2022; Zhang et al., 2023), our results circumvent this requirement, thereby being more robust. Compared to other works on Gaussian kernel-based KRR, our result shows that the optimality can be achieved only via a fixed bandwidth Gaussian kernel.

We would like to note that our statistical rates and Eberts & Steinwart (2013); Hamm & Steinwart (2021) might not be directly comparable. Our results are derived under bounded moment assumption, i.e., Assumption 2.5, while results of Eberts & Steinwart (2013); Hamm & Steinwart (2021) are derived from bounded response assumption, i.e., there exists a constant such that Y [ M, M]. Moreover, Hamm & Steinwart (2021) considers a broader space (Besov space) and both regression/classification problems, while whether these hold on fixed bandwidth settings is still unknown. Although with slightly different settings, the table highlights the difference in optimal order of λ (and γ) for fixed and variable bandwidth settings.

Table 1. Comparison of generalization error convergence rate (non-adaptive) between our result and the prior literature. Here, we assume the mean function f0 belongs to Sobolev space Hα0, imposed RKHS means the RKHS that ˆf belongs to. in column γ means the bandwidth is fixed during training and does not have an optimal order in n. HK means the RKHS associated with the Gaussian kernel while Hα 0 means the Sobolev space with smoothness order α 0.

Paper Imposed RKHS Rate λ γ

Wang & Jing (2022), Zhang et al. (2023) Hα 0, α 0 > α0

2 n 2α0 2α0+d n 2α 0 2α0+d

Eberts & Steinwart (2013) HK n 2α0 2α0+d +η, η > 0 n 1 n 1 2α0+d

Hamm & Steinwart (2021) HK n 2α0 2α0+d logd+1(n) n 1 n 1 2α0+d

This work HK n 2α0 2α0+d exp{ Cn

D. Smoothness Adaptive Transfer Learning Results

D.1. Proof of Lower Bound

In this part, we proof the alternative version for the lower bound, i.e.

inf f sup Θ(h,m0,m) P f f T 2 L2 CδR2 (n S + n T ) 2m0 2m0+d + n m 2m+d T ξL 1 δ

for some constant C that are independent of n S, n T , R, h and δ, and ξL h2/R2.

This alternative form is also used to prove the lower bound in other transfer learning contexts like high-dimensional linear regression or GLM, see Li et al. (2022); Tian & Feng (2022). However, the upper bound we derive for SATL can still be

sharp since in the transfer learning regime, it is always assumed n S n T , and leads to (n S + n T ) 2m0 2m0+d n 2m0 2m0+d S .

On the other hand, one can modify the first phase in OTL by including the target dataset to obtain ˆf S, which produces

an alternative upper bound (n S + n T ) 2m0 2m0+d + n 2m 2m+d T ξL, and mathematically aligns with the alternative lower bound we mention above. However, we would like to note that such a modified OTL is not computationally efficient for transfer learning since for each new upcoming target task, OTL needs to recalculate a new ˆf S with the combination of the target and source datasets.

Smoothness Adaptive Hypothesis Transfer Learning

Note that any lower bound for a specific case will immediately yield a lower bound for the general case. Therefore, we consider the following two cases.

(1) Consider h = 0, i.e. both source and target data are drawn from ρT . In this case, the problem can be viewed as obtaining the lower bound for classical nonparametric regression with sample size n T + n S and prediction function as f T Hm0 with f T Hm0 R. Then using the Proposition D.1, we have

inf f sup Θ(h,m0,m) P f f T 2 L2 C1δR2(n T + n S) 2m0 2m0+d 1 δ,

where C1 is independent of δ, R, n S and n T .

(2) Consider f S = 0, i.e. the source model has no similarity to f T and all the information about f T is stored in the target dataset. By the assumptions, we have f T {f : f Hm, f Hm h}. Again, using the Proposition D.1, we have

inf f sup Θ(h,m0,m) P f f T 2 L2 C2δh2(n T ) 2m 2m+d 1 δ,

where C2 is independent of δ, h, and n T .

Combining the lower bound in case (1) and case (2), we obtain the desired lower bound.

Discussion. Here, we prove the asymptotic lower bound as

ΩP R2(n T + n S) 2m0 2m0+d + h2(n T ) 2m 2m+d . (11)

By factoring out the R2, we obtain the form in Theorem 4.1, i.e.,

ΩP (n T + n S) 2m0 2m0+d + ξL(n T ) 2m 2m+d , (12)

where the constant should be proportional to R2. We present the results in form (12) instead of the form (11) as we would like to emphasize how the transfer dynamic and efficacy depend on both h and R, compared to the form (1) in most existing OTL works. The constant ξU in the upper bound is designed in the same philosophy.

D.2. Proof of Upper Bound

The final estimator for target regression function is ˆf T = ˆf S + ˆfδ. By triangle inequality

ˆf T f T L2 ˆf S f S L2 + ˆfδ fδ L2 (13)

For the first term in the r.h.s. of (13), since the marginal distribution over X are equivalent for both target and source, applying Theorem 3.3 directly leads to with probability at least 1 δ

ˆf S f S L2 C log 4

n S log n S

where C is independent of n S and δ, and proportional to p

σ2 S + f S 2 Hm0 and thus p

For the second term, we modify the proof of Theorem 1 with the same logic. Note that the estimated offset function has the

Smoothness Adaptive Hypothesis Transfer Learning

following expression

ˆfδ = (TK,n T + λ2I) 1 1 n T

i=1 Kx T,i y T,i ˆf S(x T,i) !

= (TK,n T + λ2I) 1 1 n T

i=1 Kx T,i f S(x T,i) ˆf S(x T,i) + fδ(x T,i) + ϵT,i !

= (TK,n T + λ2I) 1 1 n T

i=1 Kx T,i f S(x T,i) ˆf S(x T,i) !

| {z } ˆ fδ1

+ (TK,n T + λ2I) 1 1 n T

i=1 Kx T,i (fδ(x T,i) + ϵT,i)

| {z } ˆ fδ2

We decompose the estimated offset function ˆfδ into ˆfδ1 and ˆfδ2 since in the second phase, we are using estimated source function ˆf S to generate pseudo label instead of the true source function. Therefore, the term ˆfδ1 counts the introduced bias of using the estimated version of f S. By triangle inequality, one has ˆfδ fδ L2 ˆfδ1 L2 | {z } D1

+ ˆfδ2 fδ L2 | {z } D2

For D2, since the label is observed from true offset function fδ with noise, we can directly apply Theorem 3.3. Define the intermediate term fδ,λ as fδ,λ = (TK + λ2I) 1 (TK(fδ))

To control the approximation error, using Proposition C.2 with the fact that fδ Hm h, we have

fδ,λ fδ 2 L2 log( 1

For estimation error, the same proof the same proof procedure of Theorem C.1 can be directly applied. Finally, by applying Theorem 3.3, we have

D2 = ˆfδ2 fδ L2 C log 4

n T log n T

where C is independent of n T and δ, and proportional to p

σ2 T + fδ 2 Hm and thus p

Turning to D1, ˆfδ1 L2 = (TK,n T + λ2I) 1 TK,n T f S ˆf S L2

(TK,n T + λ2I) 1 TK,n T op

f S ˆf S L2

f S ˆf S L2 .

For the second inequality, we used the fact that the upper bound of the largest eigenvalue of (TK,n T + λ2I) 1 TK,n T is bounded by 1.

Finally, the proof is finished by combining the result of ˆf S and ˆfδ and noticing the results hold with probability at least (1 δ) (1 δ) 1 2δ.

D.3. Propositions

Proposition D.1 (Lower bound for target-only KRR). In target-only KRR problem, suppose the observed data are {(xi, yi)}n i=1 and the underlying true function f0 {f Hm0 : f Hm0 R} := Bm0(R). Then, when n is sufficiently

Smoothness Adaptive Hypothesis Transfer Learning

large, one has

inf f sup f Bm0(R) P f f 2 L2 Cδn 2m0 2m0+d 1 δ,

where the constant C is proportional to R2 and independent of δ and n.

Remark D.2. The proof for the lower bound in target-only KRR is standard and can be found in many works. However, while most of the works omit the property of constant C, we prove it is proportional to R2.

Proof. For every f Bm0(R), define the probability distribution Pf on X Y so that y = f(x) + ϵ, where ϵ N(0, σ2) and σ = min(σ, L). Such form of σ ensures the Assumption 2.5 holds.

We construct a series function, term f0, f1, , f N Bm0(R), as follows,

k=M+1 θ(i) k T

Here, K denotes the reproducing kernel of Hm0 and ϕk represents the k-th eigenfunction. We set M the smallest integer great than n

d 2m0+d , and θ(0), θ(1), , θ(N) {0, 1}M for some N 2M/8 such that the for all 0 i j N, the Hamming distance between θ(i) and θ(j) greater than M/8. One can then verify fi Bm0(R)

(θ(i) k NC0)2

1 2 K(ϕi) 2

by having C0 R.

Denote sj as the j-th eigenvalue of K and by the properties of Sobolev space, we have

d sj C2j 2m0

where C1 and C2 are some constants.

With Lemma D.4,

KL(P n fi, P n fj) = n 2 σ2 fi fj L2

n 2 σ2 C2 0 M s MH(θ(i), θ(j))

n 2 σ2 C2 0s M

C2C0 n 2 σ2 M 2m0

where the first inequality use the fact that s M , s2M, the second inequality based on the fact that θ(i) and θ(j) are elements in {0, 1}M, and the third inequality based on the property of the eigenvalue of K. Notice we take M the smallest integer great than n

d 2m0+d , for a fixed a (0, 1/8), letting

KL(P n fi, P n fj) C2C2 0 n 2 σ2 M 2m0

d C2 C2 0 2 σ2 M alog 2

8 M a log N,

C2C2 0 σ2 alog 2

Smoothness Adaptive Hypothesis Transfer Learning

fi fj 2 L2 = C2 0 M

θ(i) k θ(j) k 2 sk

C2 0 M s2MH θ(i) k , θ(j) k

d n 2m0 2m0+d ,

where the second inequality is based on Lemma D.5.

To use Lemma D.3, we set C0 = αR a for some positive α, thus for any fixed R and a (0, 1/8), we can choose an α such that

C0 R and C2 0 σ2

are satisfied. Therefore, by applying Lemma D.3,w e have

inf f sup f Bm0(R) P f f 2 L2 C R2an 2m0 2m0+d

where C = (α22 2m0

d C1)/8. When n is sufficiently large so that N is sufficiently large, the L.H.S. of the above inequality holds a probability greater than 1 3a. For δ (0, 1), choose a = δ/3 completes the proof.

We provide some Lemma that are important for proving the lower bound. All these lemmas are standard for proving the lower bound and can be found in extensive works.

Lemma D.3. Suppose that there is a non-parametric class of functions Θ and a (semi-)distance d( , ) on Θ. {Pθ, θ Θ} is a family of probability distributions indexed by Θ. Assume that N 2 and suppose that Θ contains elements θ0, θ1, , θN such that,

(1) d (θj, θk) 2s > 0, 0 j < k N;

(2) Pj P0, j = 1, , N, and

j=1 K (Pj, P0) a log N,

with 0 < a < 1/8 and Pj = Pθj, j = 0, 1, , N. Then

inf ˆθ sup θ Θ Pθ(d(ˆθ, θ) s)

Lemma D.4. Suppose that µ is a distribution on X and fi L2(X, µ). Suppose that

y = fi(x) + ϵ, i = 1, 2,

where ϵ N(0, σ2) are independent Gaussian random error. Denote the two corresponding distributions on X Y as ρi, i = 1, 2. The KL divergence of two probability distributions on Ωis

K (P1, P2) := Z

Smoothness Adaptive Hypothesis Transfer Learning

if P1 P2 and otherwise K (P1, P2) := . Then we have

KL (ρn 1, ρn 2) = n KL (ρ1, ρ2) = n 2σ2 f1 f2 2 L2(X,dµ) ,

where ρn i denotes the independent product of n distributions ρi, i = 1, 2.

Lemma D.5 (Varshamov-Gilbert bound (Varshamov, 1957)). Denote Ω= {ω = (ω1, , ωM) , ωi {0, 1}} = {0, 1}M. Let m 8, there exists a subset ω(0), , ω(M) of Ωsuch that ω(0) = (0, , 0),

H ω(i), ω(j) :=

ω(i) k ω(j) k M

8 , 0 i < j N,

and N 2M/8. Here H ω(i), ω(j) is the Hamming distance between ω(i) and ω(j).

E. Additional Simulation Results

E.1. Additional Results for Target-Only KRR with Gaussian kernels

In Section 5.1, we only present the best lines with the optimal C. In this part, we report the generalization error decay curve for different C. We report the results in Figure 4. Each subfigure contains 7 different lines that center around the optimal line. One can see even with different C, the empirical error decay curves are still aligned with the theoretical ones.

1000 1500 2000 2500

KRR (non-adaptive), =2

c=0.5, coef=-0.8 c=0.6, coef=-0.77 c=0.7, coef=-0.77 coef=-0.8

1000 1500 2000 2500

KRR (non-adaptive), =3

c=3.25, coef=-0.84 c=3.35, coef=-0.85 c=3.45, coef=-0.85 coef=-0.85

1000 1500 2000 2500

KRR (adaptive), =2

c=0.5, coef=-0.8 c=0.6, coef=-0.76 c=0.7, coef=-0.75 coef=-0.8

1000 1500 2000 2500

KRR (adaptive), =3

c=2.9, coef=-0.85 c=3.0, coef=-0.86 c=3.1, coef=-0.86 coef=-0.86

sample size (n)

Figure 4. Error decay curves of target-only KRR based on Gaussian kernel, both axes are in log scale. The curves with different colors correspond to different C and denote the average logarithmic generalization errors over 100 trials. The dashed black lines denote the theoretical decay rates.

E.2. Additional Details for TL Algorithm Comparison

Implementation of Finite Basis Expansion: Denote the finite basis estimator (FBE) for a regression function as

j=1 βj Bj(x)

where Bj are given finite basis or spline functions with a different order, and M denotes the truncation number, which generally controls the variance-bias trade-off of the estimator. Then, the transfer learning procedure proposed in Wang et al. (2016) can be summarized by the following 4 steps.

1. Estimate f S using the FBE and source data, output ˆf S,M1

2. Produce the pseudo label ˆy T,i using ˆf S,M1 and x T,i, obtain the offset estimation as y T,i ˆy T,i.

3. Estimate the offset function using the FBE with {(y T,i ˆy T,i, x T,i)}NT i=1, output ˆfδ,M2.

4. Return ˆf S,M1 + ˆfδ,M2 as the estimator for f T .

Smoothness Adaptive Hypothesis Transfer Learning

Regularization Selection in SATL: In target-only KRR results, for all n, we fixed the constant C and reported the best generalization error decay curves in Figure 2 and other error decay curves for other Cs in Figure 4. In SATL, one can also conduct a similar tuning strategy and select the best performer C. However, this can be computationally insufficient. For example, if one has 40 candidates for C, then there would be a total of 402 constants combinations in a two-step transfer learning process. Such a problem also appeared in FBE approaches where one needs to tune the optimal M (number of bases or the degree of B-spline).

Therefore, for each α, we determine the constant C in exp{ Cn 1 2α+d } via following cross-validation (CV) approach. We consider the largest sample size in the current setting, i.e., largest n S while estimating the source model and largest n T while estimating the offset, and the estimate C is obtained by the classical K-fold CV, then the estimated ˆC is used for all sample size in the experiments.

Additional Results for different basis in FBE: In this part, we provide a detailed description of the implementation for the comparison between SATL and FBE in Figure 3a and 3b. In our implementation, we consider the finite basis as (1) Fourier basis Bj(x) =

2cos(π k x) (which was used in Wang et al. (2016)) and (2) Bj being the j-th order B-spline. In Wang et al. (2016), the authors use m1 = m2 = 500, but we notice this will hugely degrade the algorithm performance. Therefore, we use CV to select m1 and m2 to optimize the FBE algorithm performance.