Journal of Machine Learning Research 24 (2023) 1-53. Submitted 1/23; Revised 8/23; Published 9/23.

Optimal Parameter-Transfer Learning by Semiparametric Model Averaging

Xiaonan Hu (xnhu@amss.ac.cn)
School of Mathematical Sciences, Capital Normal University, Beijing, 100048, China
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China

Xinyu Zhang (xinyu@amss.ac.cn), corresponding author
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, 230026, Anhui, China

Editor: Amos Storkey

Abstract

In this article, we focus on prediction for a target model by transferring information from source models. For flexibility, we use semiparametric additive frameworks for the target and source models. Inheriting the spirit of parameter-transfer learning, we assume that different models possibly share common knowledge across parametric components that is helpful for the target predictive task. Unlike existing parameter-transfer approaches, which need to construct auxiliary source models by parameter similarity with the target model and then adopt a regularization procedure, we propose a frequentist model averaging strategy with a J-fold cross-validation criterion so that auxiliary parameter information from different models can be adaptively transferred through data-driven weight assignments. The asymptotic optimality and weight convergence of our proposed method are established under some regularity conditions. Extensive numerical results demonstrate the superiority of the proposed method over competing methods.

Keywords: asymptotic optimality, cross-validation, negative transfer, prediction, weighting

1. Introduction

Numerous machine learning techniques have been successfully applied in many fields under data-driven paradigms.
As one of the most classical machine learning techniques, supervised learning commonly uses labeled training data to fit a model and then makes predictions or inferences on unlabeled testing data based on the resulting model. The performance is generally satisfactory when sufficient training data are available. However, in many real-world applications, it may be expensive or even unrealistic to collect such an amount of data. For example, in some medical and biological studies, it is challenging to obtain a great deal of patient information from a single medical institute due to ethical or cost issues, while many institutes own related data sources, which forms the concept of data islands. Considering policy risks, organizational interests, and individual privacy, it is difficult to cooperate by directly integrating all the data sets from multiple owners for unified analysis. Modeling the data sets separately may suffer from deteriorated generalization in cross-domain scenarios, and it also wastes resources and brings additional costs. Therefore, reasonably taking advantage of multi-source data to deal with these challenges motivates transfer learning, the aim of which is to improve the performance of a specific target task by transferring common knowledge shared across similar source domains (Pan and Yang, 2009; Blanchard et al., 2021). As a prevailing topic in computer science, transfer learning has been displaying great potential in modern applications, such as medical and biological studies (Shin et al., 2016), computer vision (Long et al., 2015), natural language processing (Raffel et al., 2020), and recommendation systems (Pan et al., 2010).

(c) 2023 Xiaonan Hu and Xinyu Zhang. License: CC-BY 4.0; see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/23-0030.html.
Among the existing research, there are few studies providing theoretical support for transfer learning frameworks, which prompts our exploration in this work. Prediction is an important task in economic and statistical analysis, and statistical regression models are commonly adopted because of their convenience and interpretability. Semiparametric models, as a traditional class of statistical regression models, provide a flexible way to understand the complicated relationship between the response and the set of covariates by simultaneously considering parametric and nonparametric components. Although transfer learning has been widely studied for decades, little attention has been paid to it in statistical research. Specifically, the impact of transfer learning under semiparametric frameworks remains unclear, which motivates our study to fill this gap. In this article, we consider a partially linear model (PLM) with additive structures, which is a common type of semiparametric model in the literature, such as Stone (1985, 1986), Ma et al. (2006), Wang et al. (2011), and Ma and Zhu (2013). To accommodate the framework of transfer learning, inheriting the spirit of parameter-transfer learning (Pan and Yang, 2009), we further assume that different models on multiple data sets possibly share common knowledge across parametric components, while allowing heterogeneity in nonparametric components. In recent years, a few studies on transfer learning under statistical models have sprung up. Bastani (2021) studies the single-source parameter-transfer approach under high-dimensional linear models and derives the estimation error bound, where the sample size of the source domain is larger than the dimension. Li et al. (2021) extend Bastani's work to the multi-source transfer learning framework with high-dimensional target and source models under some weaker assumptions, and the minimax optimality of the estimation error bound under $\ell_q$-regularization ($q \in [0, 1]$) is proven.
Tian and Feng (2022) further study the multi-source transfer learning framework in generalized linear model settings and develop a consistent procedure to detect transferable sources. In addition, Li et al. (2022b) extend the multi-source transfer learning framework to Gaussian graphical models and construct a multiple testing procedure for edge detection with false discovery rate control. Different from our work, existing studies usually construct auxiliary source models based on parameter similarity between the target and source models and adopt a regularization technique that requires properly selecting tuning parameters. These procedures also require equal dimensions for all the models and integrate data sets in the estimation process, which may bring limitations in practical application. Specifically, when heterogeneous data are collected from different owners, it is often difficult to aggregate the complete data sets from all parties without boundaries, and generally, only summary statistics can be obtained to protect data privacy. In addition, these works mainly focus on parameter estimation, while little attention is paid to out-of-sample prediction. There are some other topics closely related to our work, such as multi-task learning (Evgeniou and Pontil, 2004; Ando et al., 2005) and integrative analysis (Ma et al., 2011; Liu et al., 2014). However, their goals are mainly to simultaneously estimate parameters for all the models considering the similarity and heterogeneity among model parameters. In addition, they generally assume that all the models are correctly specified, whereas our framework does not require this assumption. To make better predictions for the target model, model averaging is an effective strategy combining information from multiple candidate models.
A rich literature on model averaging has accumulated over several decades (Clarke, 2003), and in this work we consider asymptotically optimal methods in the frequentist model averaging framework. For model averaging studies on PLM, Zhang and Wang (2019) study the optimal model averaging method for PLM with heteroscedasticity and propose a Mallows-type weight choice criterion in a kernel smoothing framework. Furthermore, frequentist model averaging estimators for other variants of partially linear models have also been studied, such as varying-coefficient partially linear models (Li et al., 2018; Zhu et al., 2019; Li et al., 2022a), partially linear functional additive models (Liu and Zhang, 2021), and semi-functional partially linear models (Jiang et al., 2021). In this article, we use the model averaging methodology as a bridge for knowledge transfer between possibly shared parameter information and the predictive task of the target model. Different from traditional model averaging procedures, we combine parameter information from multiple semiparametric models using samples from heterogeneous populations. To estimate multiple semiparametric models, a polynomial spline-based estimator is adopted to approximate nonparametric functions, which can be implemented more cheaply than kernel-based smoothing approaches. Our approach has some appealing advantages. First, when the target model is misspecified, our procedure can asymptotically obtain optimal prediction in the sense of achieving the lowest possible out-of-sample prediction risk. Second, when the target model is correctly specified, those models possibly sharing auxiliary information are automatically distinguished by weight assignments.
In the studies of transfer learning, knowledge transfer may sometimes even hurt the learning performance of the target task when source models are not related to the target model, a phenomenon referred to as negative transfer in the literature (Pan and Yang, 2009). Some recent works also address this problem, such as Li et al. (2021) and Tian and Feng (2022), where algorithms are introduced to construct auxiliary source models based on parameter similarity. Further, theoretical properties support that transferring knowledge among such auxiliary source models can improve the performance of the target model under certain conditions. Our proposed method does not require knowing auxiliary source models in advance and theoretically ensures that the parameter transfer asymptotically occurs in potential auxiliary models, which attempts to address the negative transfer problem from a new perspective. It can be seen that our method has theoretical guarantees in both correct and incorrect target model settings, which is a desirable feature in applications. Finally, our procedure offers an alternative strategy for massive data analysis. Specifically, we can split the full data set into many batches and estimate each batch of data in parallel. We can then aggregate estimators through our transfer learning mechanism to achieve predictions. This approach is similar to the divide-and-conquer technique in the literature of distributed learning (Zhang et al., 2013; Battey et al., 2018). By transmitting only summary statistics instead of pooling multiple data sets together, our framework provides a feasible strategy to effectively protect the privacy of individual data. Relevant studies can be found in the literature of meta-analysis (Xie et al., 2011; Kundu et al., 2019). In conclusion, the primary contributions of this work can be summarized as follows.
In contrast to traditional frequentist model averaging approaches for semiparametric models, we adopt a spline-based estimator and propose a data-driven weight choice criterion in the scenario of multiple populations, which provides more insights for model averaging research. For transfer learning frameworks, we develop a parameter-transfer approach aimed at the predictive task under statistical models. We take a model averaging strategy to adaptively transfer possibly shared parameter information instead of selecting auxiliary models. Some appealing properties for parameter-transfer learning are established from a statistical view.

The rest of the paper is organized as follows. In Section 2, we introduce our model framework and weight choice criterion. Section 3 provides the theoretical properties of our approach, including asymptotic optimality and weight convergence under certain regularity conditions. Extensive simulation studies and a real data example are presented in Section 4 and Section 5, respectively. Concluding remarks are summarized in Section 6. All the technical details and additional numerical results are presented in the Appendix.

2. Optimal Parameter-Transfer Approach

In this section, we first introduce our semiparametric model setting under the transfer learning framework and then propose a parameter-transfer approach based on frequentist model averaging. We further provide a cross-validation-based procedure to choose proper weights.

2.1 Model Framework

Assume that the target data $\{y_i^{(0)}, x_i^{(0)}, z_i^{(0)}\}$ for $i = 1, \dots, n_0$ and the source data $\{y_i^{(m)}, x_i^{(m)}, z_i^{(m)}\}$ for $m = 1, \dots, M$, $i = 1, \dots, n_m$, are independent samples from $M + 1$ heterogeneous populations. For the $m$th data set, $m = 0, \dots, M$, $y_i^{(m)}$ are continuous scalar responses, $x_i^{(m)} = (x_{i1}^{(m)}, \dots, x_{ip}^{(m)})^T$ are $p$-dimensional i.i.d. observations, and $z_i^{(m)} = (z_{i1}^{(m)}, \dots, z_{iq_m}^{(m)})^T$ are $q_m$-dimensional i.i.d. observations.
Here, different $z_i^{(m)}$ are allowed in different data sets. Suppose that the target and source samples follow $M + 1$ semiparametric additive linear models, referred to as the target model and the source models, as follows. For $m = 0, \dots, M$, $i = 1, \dots, n_m$,

$$y_i^{(m)} = \mu_i^{(m)} + \varepsilon_i^{(m)} = (x_i^{(m)})^T \beta^{(m)} + g^{(m)}(z_i^{(m)}) + \varepsilon_i^{(m)}, \qquad (1)$$

where $\mu_i^{(m)}$ contains both a parametric component $(x_i^{(m)})^T \beta^{(m)}$ with an unknown parameter $\beta^{(m)} \in \mathbb{R}^p$ and a nonparametric component $g^{(m)}(z_i^{(m)}) = \sum_{l=1}^{q_m} g_l^{(m)}(z_{il}^{(m)})$ with a commonly adopted additive structure, $g_l^{(m)}$ is a one-dimensional unknown smooth function, and $\varepsilon_i^{(m)}$ are independent random errors with $E(\varepsilon_i^{(m)} \mid x_i^{(m)}, z_i^{(m)}) = 0$ and $E\{(\varepsilon_i^{(m)})^2 \mid x_i^{(m)}, z_i^{(m)}\} = \sigma_{i,m}^2$. Note that $\beta^{(m)}$ in different source models are allowed to be identical to or different from that of the target model. Here, the dimension $p$ of the parametric component of each model is allowed to go to infinity, and $q_m$ is fixed for $m = 0, \dots, M$.

In the context of transfer learning, we expect to improve the prediction of the new response $y_{n_0+1}^{(0)}$ given the corresponding covariates $\{x_{n_0+1}^{(0)}, z_{n_0+1}^{(0)}\}$ from the target population by transferring common knowledge from the source models. To accommodate the transfer learning framework under semiparametric models, we specifically borrow the idea of parameter-transfer approaches and further assume that source models possibly share parameter information with the target model. Unlike some recent works on transfer learning under linear models (Li et al., 2021) and generalized linear models (Tian and Feng, 2022), our proposed method does not need to construct auxiliary source models based on parameter similarity or integrate data sets through a regularization procedure. To estimate our models (1), we consider a polynomial spline-based method to approximate the nonparametric parts.
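To make the setting concrete, the following sketch simulates heterogeneous samples from model (1). The function name `simulate_model` and all numerical choices (sample sizes, coefficients, nonparametric functions, error scale) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_model(n, beta, g_funcs, sigma=0.5, seed=0):
    """Draw n samples from one semiparametric additive model:
    y_i = x_i^T beta + sum_l g_l(z_il) + eps_i, as in model (1).
    All names and settings here are illustrative."""
    rng = np.random.default_rng(seed)
    p, q = len(beta), len(g_funcs)
    x = rng.normal(size=(n, p))          # parametric covariates
    z = rng.uniform(size=(n, q))         # additive-part covariates on [0, 1]
    mu = x @ beta + sum(g(z[:, l]) for l, g in enumerate(g_funcs))
    y = mu + sigma * rng.normal(size=n)  # add heteroscedasticity here if desired
    return x, z, y

# Target model (m = 0) and one source model sharing the same beta but a
# different nonparametric part (a possible "informative" source model).
beta = np.array([1.0, -0.5])
x0, z0, y0 = simulate_model(200, beta, [np.sin, lambda t: t ** 2], seed=1)
x1, z1, y1 = simulate_model(500, beta, [np.cos], seed=2)
```

Each population may use its own nonparametric part $g^{(m)}$ and its own $q_m$, while the parametric coefficient $\beta^{(m)}$ may or may not coincide with the target's, mirroring the heterogeneity allowed by model (1).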
The corresponding theoretical properties have been well established in the literature (De Boor, 2001). Let $\Psi_l^{(m)}$ be the polynomial spline space consisting of functions of degree $r_l^{(m)} - 1$ and $S_l^{(m)}$ be the number of interior knots in the interval $[0, 1]$ for the $m$th model. Here, the number of interior knots can vary across models and is allowed to diverge as the sample size increases. Assume that there exists a normalized B-spline basis $B_l^{(m)}(z) = \{b_{l1}(z), \dots, b_{l v_l^{(m)}}(z)\}^T$ of the spline space, where $v_l^{(m)} = r_l^{(m)} + S_l^{(m)}$, and $v_l^{(m)}$ is allowed to increase with the sample size. As discussed in De Boor (2001), nonparametric functions can be well approximated by linear combinations of B-spline basis functions under certain conditions. Therefore, the estimator can be written as a least squares problem:

$$\hat\theta^{(m)} = \arg\min_{\theta^{(m)}} \sum_{i=1}^{n_m} \left\{ y_i^{(m)} - (d_i^{(m)})^T \theta^{(m)} \right\}^2, \quad m = 0, \dots, M, \qquad (2)$$

where $d_i^{(m)} = [(x_i^{(m)})^T, \{B_1^{(m)}(z_{i1}^{(m)})\}^T, \dots, \{B_{q_m}^{(m)}(z_{iq_m}^{(m)})\}^T]^T$, $\theta^{(m)} = \{(\beta^{(m)})^T, (\gamma_1^{(m)})^T, \dots, (\gamma_{q_m}^{(m)})^T\}^T$, and $\gamma_l^{(m)} = (\gamma_{l1}^{(m)}, \dots, \gamma_{l v_l^{(m)}}^{(m)})^T$ for $m = 0, \dots, M$ and $l = 1, \dots, q_m$. Let the total dimension of the $m$th model be $p_m = \sum_{l=1}^{q_m} v_l^{(m)} + p$.

2.2 Model Averaging Prediction Procedure

In this section, we introduce a frequentist model averaging strategy to transfer possibly shared parameter information from multiple models for prediction of the target model, which has the advantages of achieving asymptotically optimal prediction and adaptively using potential auxiliary models through reasonable weight assignments. Since we have little prior knowledge about auxiliary parameter information from source models in practice, we simply put all $M$ source models into our transfer learning framework. To develop this framework, we first construct $M + 1$ candidate models based on B-spline basis approximations.
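A minimal numerical sketch of the least squares estimator (2): for simplicity we use a truncated-power spline basis in place of the paper's normalized B-spline basis (both span the same spline space, so the fitted values agree). The names `spline_basis` and `fit_plm` and the knot/degree choices are our illustrative assumptions.

```python
import numpy as np

def spline_basis(z, n_knots=5, degree=3):
    """Truncated-power spline basis on [0, 1]; a stand-in for the
    normalized B-spline basis B_l^{(m)} of the paper (same span)."""
    knots = np.linspace(0, 1, n_knots + 2)[1:-1]              # interior knots
    cols = [z ** d for d in range(1, degree + 1)]             # polynomial part
    cols += [np.clip(z - k, 0, None) ** degree for k in knots]  # truncated powers
    return np.column_stack(cols)

def fit_plm(x, z, y, n_knots=5):
    """Least squares estimator (2): regress y on the stacked design
    d_i = [x_i, basis(z_i1), ..., basis(z_iq)]."""
    d = np.column_stack([x] + [spline_basis(z[:, l], n_knots)
                               for l in range(z.shape[1])])
    theta, *_ = np.linalg.lstsq(d, y, rcond=None)
    return theta, d

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 2))
z = rng.uniform(size=(300, 2))
y = (x @ np.array([1.0, -0.5]) + np.sin(2 * np.pi * z[:, 0])
     + z[:, 1] ** 2 + 0.1 * rng.normal(size=300))
theta, d = fit_plm(x, z, y)
beta_hat = theta[:2]   # parametric component, should be close to (1, -0.5)
```

The first $p$ entries of $\hat\theta^{(m)}$ give the parametric estimate $\hat\beta^{(m)}$, which is the only piece transferred between models in the procedure that follows.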
Specifically, the estimators of $\mu_i^{(0)}$ corresponding to the $M + 1$ models, including the target model ($m = 0$) and the $M$ source models ($m = 1, \dots, M$) with possibly shared parameters, are defined as

$$\hat\mu_{i,m}^{(0)} = (d_i^{(0)})^T \hat\theta_m^{(0)} = \begin{cases} (x_i^{(0)})^T \hat\beta^{(0)} + \sum_{l=1}^{q_0} \{B_l^{(0)}(z_{il}^{(0)})\}^T \hat\gamma_l^{(0)}, & m = 0, \\ (x_i^{(0)})^T \hat\beta^{(m)} + \sum_{l=1}^{q_0} \{B_l^{(0)}(z_{il}^{(0)})\}^T \hat\gamma_l^{(0)}, & m = 1, \dots, M, \end{cases} \qquad (3)$$

where $\hat\theta_m^{(0)} = \{(\hat\beta^{(m)})^T, (\hat\gamma_1^{(0)})^T, \dots, (\hat\gamma_{q_0}^{(0)})^T\}^T$. Slightly different from the construction of candidate models in the previous model averaging literature, there is uncertainty about which model's parameter information can be transferred to the target model. In other words, the informative level of the different models is not clear, and only the information in $\hat\beta^{(m)}$ is allowed to be transferred between models.

Algorithm 1: Trans-SMAP
Input: Training samples $\{(x_i^{(m)}, z_i^{(m)}, y_i^{(m)}); i = 1, \dots, n_m, m = 0, \dots, M\}$ from the target and source models (1), and the new sample $\{x_{n_0+1}^{(0)}, z_{n_0+1}^{(0)}\}$ from the target model.
Output: Prediction of $y_{n_0+1}^{(0)}$ associated with the new sample $\{x_{n_0+1}^{(0)}, z_{n_0+1}^{(0)}\}$.
Step 1. Estimate the parameter $\theta^{(m)}$ of each model from the training samples by (2), and denote the estimator by $\hat\theta^{(m)}$.
Step 2. Split the training samples from the target model into $J$ subgroups, with $2 \le J \le n_0$.
Step 3. For each $j \in \{1, \dots, J\}$:
  Step 3.1. For $m = 1, \dots, M$, estimate $\theta^{(m)}$ separately by (2) with all the training samples, and estimate $\theta^{(0)}$ with only the training samples in subgroup $G_j^c$.
  Step 3.2. For $i \in G_j$, $m = 0, \dots, M$, compute the prediction of $y_i^{(0)}$ based on (3) as $\hat\mu_{i,m,[G_j^c]}^{(0)}$.
  Step 3.3. Construct the weighted combination $\hat\mu_{i,[G_j^c]}^{(0)}(w) = \sum_{m=0}^M w_m \hat\mu_{i,m,[G_j^c]}^{(0)}$.
Step 4. Select the weight vector $\hat w$ by minimizing the $J$-fold cross-validation criterion (5).
Step 5. Given the new sample $\{x_{n_0+1}^{(0)}, z_{n_0+1}^{(0)}\}$ from the target model, obtain the model averaging prediction by plugging in $\hat w$ and $\hat\theta^{(m)}$, that is, $\hat\mu_{n_0+1}^{(0)}(\hat w) = \sum_{m=0}^M \hat w_m \hat\mu_{n_0+1,m}^{(0)}$.

Following previous studies on model averaging, the final prediction can be defined as a weighted average of $\hat\mu_{i,m}^{(0)}$, expressed as $\hat\mu_i^{(0)}(w) = \sum_{m=0}^M w_m \hat\mu_{i,m}^{(0)}$, where $w = (w_0, \dots, w_M)^T$ is the weight vector in the space $\mathcal{W} = \{w \in [0, 1]^{M+1} : \sum_{m=0}^M w_m = 1\}$. To determine a proper choice of weights, we adopt a $J$-fold ($J > 1$) cross-validation criterion. Specifically, we randomly divide the target samples into $J$ mutually exclusive subgroups $G_1, \dots, G_J$. For simplicity, we assume that all subgroups have equal size $n_0/J$, which is a positive integer. For $j = 1, \dots, J$, let $G_j^c = \{1, \dots, n_0\} \setminus G_j$ denote the set $\{1, \dots, n_0\}$ excluding the elements in $G_j$. Then the $J$-fold cross-validation based weight choice criterion is defined as

$$CV(w) = \frac{1}{n_0} \sum_{j=1}^{J} \sum_{i \in G_j} \left\{ y_i^{(0)} - \hat\mu_{i,[G_j^c]}^{(0)}(w) \right\}^2, \qquad (4)$$

where $\hat\mu_{i,[G_j^c]}^{(0)}(w)$ is the weighted average of $\hat\mu_{i,m,[G_j^c]}^{(0)}$, and the definition of $\hat\mu_{i,m,[G_j^c]}^{(0)}$ is similar to $\hat\mu_{i,m}^{(0)}$ except that the estimator is based on the data corresponding to the subgroup $G_j^c$. Note that criterion (4) reduces to the leave-one-out cross-validation criterion when $J = n_0$. The weight vector is obtained by solving the constrained optimization problem

$$\hat w = \arg\min_{w \in \mathcal{W}} CV(w). \qquad (5)$$

Then the resulting model averaging prediction of $y_{n_0+1}^{(0)}$ associated with the new sample from the target model is given by $\hat\mu_{n_0+1}^{(0)}(\hat w)$. We summarize our procedure in Algorithm 1 and term it Transfer learning for Semiparametric Model Averaging Prediction (Trans-SMAP). As suggested by an anonymous referee, we provide a discussion of the computational complexity of Algorithm 1 in Appendix B.2.
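The weight choice (5) is a quadratic program over the simplex $\mathcal{W}$; a minimal sketch using `scipy.optimize.minimize` follows. Here `mu_cv` stands for the matrix of cross-fitted candidate predictions $\hat\mu_{i,m,[G_j^c]}^{(0)}$, whose construction (Steps 1-3 of Algorithm 1) is omitted; the function name and the toy data are our assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def cv_weights(y, mu_cv):
    """Minimize criterion (5): CV(w) = mean((y - mu_cv @ w)^2) over the
    simplex W = {w >= 0, sum(w) = 1}. Each column of mu_cv holds the
    cross-fitted predictions of one candidate model."""
    n_models = mu_cv.shape[1]
    obj = lambda w: np.mean((y - mu_cv @ w) ** 2)
    res = minimize(obj, np.full(n_models, 1.0 / n_models),
                   bounds=[(0.0, 1.0)] * n_models,
                   constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},),
                   method="SLSQP")
    return res.x

# Toy check: candidate 0 matches y exactly, so it should receive
# nearly all of the weight; the other two candidates are off.
y = np.arange(10.0)
mu_cv = np.column_stack([y, y + 3.0, -y])
w = cv_weights(y, mu_cv)
```

Because the criterion is quadratic in $w$, a dedicated quadratic programming solver could be used equally well; SLSQP is chosen here only to keep the sketch short.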
Remark 1. The rationale of the proposed criterion is to find a proper weighted averaging estimator by minimizing the expected squared loss $E[\{y_{n_0+1}^{(0)} - \hat\mu_{n_0+1}^{(0)}(w)\}^2]$, and we consider a data-driven approach that approximates it through the $J$-fold cross-validation criterion (4). Formally, we prove that $CV(w)$ is an asymptotically unbiased estimator of $E[\{y_{n_0+1}^{(0)} - \hat\mu_{n_0+1}^{(0)}(w)\}^2]$, which we aim to minimize. The details are provided in Appendix A.2.

Remark 2. The choice of $J$ in criterion (4) is usually uncertain in practice. Since there is no theoretically optimal value, we use the 5-fold CV criterion for computational efficiency in this paper. To be more convincing, we design additional experiments comparing different choices of $J$ together with the leave-one-out procedure in terms of prediction error and time consumption. In conclusion, we find that the prediction performance is not sensitive to the choice of $J$, but the time consumption of the $J$-fold CV criteria can be greatly reduced compared to the leave-one-out procedure as the sample size increases. More details are provided in Appendix C.5.

3. Theoretical Properties

In this section, we establish some statistical properties of our method. Specifically, we derive the properties under a misspecified target model and under a correctly specified target model, respectively. Note that a correctly specified target model in our setting means that all the variables in the true target model (1) are included when fitting the model. Define the risk function as $R(w) = E[\{\mu_{n_0+1}^{(0)} - \hat\mu_{n_0+1}^{(0)}(w)\}^2]$ and the prediction risk function as $PR(w) = E[\{y_{n_0+1}^{(0)} - \hat\mu_{n_0+1}^{(0)}(w)\}^2]$. It is easy to see the decomposition $PR(w) = E\{(y_{n_0+1}^{(0)} - \mu_{n_0+1}^{(0)})^2\} + R(w)$, where the first term is unrelated to $w$. This decomposition implies that our objective is equivalent to minimizing $R(w)$, based on Remark 1.
Therefore, we first establish the asymptotic optimality of our model averaging procedure with respect to minimizing $R(w)$. Note that all limiting processes throughout this paper correspond to $n \to \infty$, where $n = \min_{0 \le m \le M} n_m$. Let $\bar p = \max_{0 \le m \le M} p_m$ and $v^{(0)} = \max_{1 \le l \le q_0} v_l^{(0)}$. In addition, we allow the number of source models $M$ to go to infinity. Let $a \vee b$ denote $\max\{a, b\}$ and $a \wedge b$ denote $\min\{a, b\}$.

3.1 Asymptotic Optimality under a Misspecified Target Model

For convenience, before formally stating the theoretical properties, we define some notation. Suppose that the pseudo-true values of the parameters exist for each model, and let $\tilde\theta_m^{(0)} = \{(\tilde\beta^{(m)})^T, (\tilde\gamma_1^{(0)})^T, \dots, (\tilde\gamma_{q_0}^{(0)})^T\}^T$ denote the corresponding values for $m = 1, \dots, M$. For $j = 1, \dots, J$ and $i \in G_j$, denote the in-sample prediction of the $m$th model using the subgroup samples in $G_j^c$ by $\hat\mu_{i,m,[G_j^c]}^{(0)} = (d_i^{(0)})^T \hat\theta_{m,[G_j^c]}^{(0)}$, where $\hat\theta_{m,[G_j^c]}^{(0)} = \{(\hat\beta_{[G_j^c]}^{(0)})^T, (\hat\gamma_{1,[G_j^c]}^{(0)})^T, \dots, (\hat\gamma_{q_0,[G_j^c]}^{(0)})^T\}^T$ for $m = 0$ and $\hat\theta_{m,[G_j^c]}^{(0)} = \{(\hat\beta^{(m)})^T, (\hat\gamma_{1,[G_j^c]}^{(0)})^T, \dots, (\hat\gamma_{q_0,[G_j^c]}^{(0)})^T\}^T$ for $m = 1, \dots, M$. Then the weighted averaging estimator can be written as $\hat\mu_{i,[G_j^c]}^{(0)}(w) = \sum_{m=0}^M w_m \hat\mu_{i,m,[G_j^c]}^{(0)}$. Note that $\hat\theta_{m,[G_j^c]}^{(0)}$ and $\hat\theta_m^{(0)}$ have identical limiting values $\tilde\theta_m^{(0)}$ in large samples. In addition, the in-sample prediction of the $m$th model based on the pseudo-true values is defined as $\tilde\mu_{i,m}^{(0)} = (d_i^{(0)})^T \tilde\theta_m^{(0)}$, and the corresponding averaging prediction is $\tilde\mu_i^{(0)}(w) = \sum_{m=0}^M w_m \tilde\mu_{i,m}^{(0)}$.

Now, we introduce notation for the prediction associated with the new sample. Define the predictions of $y_{n_0+1}^{(0)}$ under the $m$th model based on $\tilde\theta_m^{(0)}$ and $\hat\theta_m^{(0)}$ as $\tilde\mu_{n_0+1,m}^{(0)} = (d_{n_0+1}^{(0)})^T \tilde\theta_m^{(0)}$ and $\hat\mu_{n_0+1,m}^{(0)} = (d_{n_0+1}^{(0)})^T \hat\theta_m^{(0)}$. Then the corresponding averaging predictions are $\tilde\mu_{n_0+1}^{(0)}(w) = \sum_{m=0}^M w_m \tilde\mu_{n_0+1,m}^{(0)}$ and $\hat\mu_{n_0+1}^{(0)}(w) = \sum_{m=0}^M w_m \hat\mu_{n_0+1,m}^{(0)}$.
Further, denote the risk function calculated at the pseudo-true values by $\tilde R(w) = E[\{\mu_{n_0+1}^{(0)} - \tilde\mu_{n_0+1}^{(0)}(w)\}^2]$. Let $\xi_n = \inf_{w \in \mathcal{W}} \tilde R(w)$ be the minimum risk over the class of averaging estimators. Let $O(\tilde\theta_m^{(0)}, c)$ denote a neighborhood of $\tilde\theta_m^{(0)}$ for some constant $c$ such that $\|\tilde\theta_m^{(0)} - \theta\| \le c$ for any $\theta \in O(\tilde\theta_m^{(0)}, c)$. To establish the asymptotic optimality of our method, we state some regularity conditions as follows.

Condition 1. Let $s$ be a positive integer and $t \in (0, 1]$ such that $\kappa = s + t > 1.5$, and let $\mathcal{F}$ denote the collection of functions $f$ on $[0, 1]$ whose $s$th derivative $f^{[s]}$ exists and satisfies a Lipschitz condition of order $t$; that is, $|f^{[s]}(x') - f^{[s]}(x)| \le C_0 |x' - x|^t$ for $0 \le x', x \le 1$ and some positive constant $C_0$. Then, (i) the nonparametric functions $g_l^{(m)}$ for $l = 1, \dots, q_m$, $m = 0, \dots, M$ in model (1) belong to $\mathcal{F}$; (ii) the number of interior knots for each spline approximation satisfies $n_m^{1/(2\kappa)} \lesssim S_l^{(m)} \lesssim n_m^{1/3}$.

Condition 2. For $m = 0, \dots, M$, the distribution of $z_{il}^{(m)}$ for $l = 1, \dots, q_m$ is absolutely continuous, and its density is bounded away from zero and infinity uniformly over $l$.

Condition 3. Suppose that $M \le n$. Uniformly for $m = 0, \dots, M$, there exist limiting values $\tilde\theta^{(m)}$ such that $\|\hat\theta^{(m)} - \tilde\theta^{(m)}\| = O_p(p_m^{1/2} n_m^{-1/2} M^{1/2})$, where $\hat\theta^{(m)} = \{(\hat\beta^{(m)})^T, (\hat\gamma_1^{(m)})^T, \dots, (\hat\gamma_{q_m}^{(m)})^T\}^T$ and $\tilde\theta^{(m)} = \{(\tilde\beta^{(m)})^T, (\tilde\gamma_1^{(m)})^T, \dots, (\tilde\gamma_{q_m}^{(m)})^T\}^T$.

Condition 4. For $m = 0, \dots, M$, $\bar p = o(n_m^{1/2})$.

Condition 5. The expectations $E\{(\mu_i^{(0)})^4\}$, $E\{(\tilde\mu_{i,m}^{(0)})^4\}$, and $E\{(\varepsilon_i^{(0)})^4\}$ exist for $m = 0, \dots, M$.

Condition 6. For $i = 1, \dots, n_0$, $j = 1, \dots, J$, (i) $\hat\mu_{i,m,[G_j^c]}^{(0)}$ is differentiable with respect to $\hat\theta_{m,[G_j^c]}^{(0)}$; (ii) there exists a positive constant $c$ such that $\sup_{\theta \in O(\tilde\theta_m^{(0)}, c)} \big\| \partial \hat\mu_{i,m,[G_j^c]}^{(0)} / \partial \hat\theta_{m,[G_j^c]}^{(0)} \big|_{\hat\theta_{m,[G_j^c]}^{(0)} = \theta} \big\| < \infty$ uniformly for $m = 0, \dots, M$.
Condition 7. $\xi_n^{-1} \sup_{w \in \mathcal{W}} \left| \{\mu_{n_0+1}^{(0)} - \hat\mu_{n_0+1}^{(0)}(w)\}^2 - \{\mu_{n_0+1}^{(0)} - \tilde\mu_{n_0+1}^{(0)}(w)\}^2 \right|$ is uniformly integrable.

Condition 8. $\{(\bar p n^{-1/2} M^{1/2}) \vee (\bar p^2 n^{-1} M)\} \xi_n^{-1} = o(1)$.

Conditions 1 and 2 are mild and commonly assumed for nonparametric models with spline-based approximation in the literature, such as Stone (1985), Stone (1986), Wang et al. (2011), and Zhang and Liang (2011). Condition 3 ensures that the estimator $\hat\theta^{(m)}$ of each model has the limiting value $\tilde\theta^{(m)}$. This can be regarded as a variant of a common condition for establishing asymptotic properties in the literature; see, for example, Zhang et al. (2016), Ando and Li (2017), and Zhang and Liu (2023). When the number of source models $M$ is fixed, the required convergence rates are slower than the rates for parametric models established in White (1982) due to the semiparametric model settings. We further weaken the convergence rates to accommodate uniform convergence under a possibly diverging $M$. Note that the dimension of the parametric component in each model is allowed to be divergent in our setting, and the extension to high-dimensional settings with $p > n_m$ is left for future study. Condition 4 restricts the divergence rates of the dimensions of the target and source models based on B-spline approximations as the sample size increases, which is commonly assumed in the literature (Liao et al., 2021). Suppose that the polynomial degree of all spline basis functions is fixed in our setting. Theorem 2 in Stone (1986) shows that the cubic spline estimator of a nonparametric function achieves the optimal convergence rate if the number of interior knots is of order $n_m^{1/5}$. When the spline estimators of each model achieve the optimal convergence rate in our framework, Condition 4 still holds. Conditions 5-7 are mild technical conditions.
Specifically, Condition 5 includes some general moment constraints on the conditional expectation, prediction, and error, which are commonly assumed in the literature (Wan et al., 2010; Ando and Li, 2017; Zhang and Wang, 2019). Condition 6 concerns boundedness and differentiability; it is also adopted in Zhang and Liu (2023). Condition 7 is mainly imposed for the technical need of taking expectations in the proof. Condition 8 plays an important role in the proof of our theorem. It restricts the divergence rate of the number of source models and requires that the target model be sufficiently misspecified. Similar conditions in the literature are Condition 7 in Ando and Li (2014), Condition C.6 in Zhang et al. (2016), and Assumption 5 in Zhang and Liu (2023). A detailed explanation of the misspecification of the target model under Condition 8 is given in Appendix A.1. Note that Condition 8 also implies an upper bound on the number of source models $M$. Specifically, if we assume that the target model is misspecified and $\bar p n^{-1/2} M^{1/2} = o(1)$, then we have $M^{1/2} = o(\bar p^{-1} n^{1/2} \xi_n)$ based on Condition 8. If we further assume $\xi_n^{-1} = O(1)$ and $\bar p = O(n^{1/2 - \zeta})$ for $0 < \zeta \le 1/2$ based on Condition 4, then the order of $M$ is $o(n^{2\zeta})$. Next, we formally present the theoretical property under the above conditions in Theorem 1. The proof is provided in Appendix A.3.

Theorem 1. Under Conditions 1-8, we have
$$\frac{R(\hat w)}{\inf_{w \in \mathcal{W}} R(w)} \to 1$$
in probability.

Theorem 1 indicates that the proposed model averaging prediction achieves asymptotic optimality in the sense of attaining the lowest possible out-of-sample prediction risk, which is a fundamental but important property in the frequentist model averaging literature. Unlike most previous works, the proposed procedure is constructed using multiple data sets, and the corresponding asymptotic optimality is established based on the out-of-sample prediction risk, which is more practical for the predictive task in our context.
It is worth noting that our result of asymptotic optimality always holds regardless of whether the source models are correct, which is not surprising since the target model is our main concern and naturally dominates the performance.

3.2 Weight Convergence under a Correct Target Model

We now turn to the case of a correct target model. Since there are no requirements on the model specification of the source models, they may contain both correct and misspecified models. Define the informative models as the models having the same pseudo-true value $\tilde\beta^{(m)}$ as the target model, and let $\mathcal{I} \subseteq \{0, \dots, M\}$ be the corresponding set of indices. Obviously, $0 \in \mathcal{I}$. Let $\mathcal{I}^c$ be the complement of $\mathcal{I}$. Note that the informative models are characterized by the parameter effects in the limit, which also reflects the similarity between the estimators of the target and source models in a sense. We further define the sum of the weight estimators of the informative models as $\hat\tau = \sum_{m \in \mathcal{I}} \hat w_m$. To study the theoretical property of the weights, we rely on an alternative to Condition 8 and some other technical conditions. Let $\mathcal{W}^* = \{w \in \mathcal{W} : \sum_{m \in \mathcal{I}} w_m = 0\}$ be the subset of $\mathcal{W}$ that assigns all the weights to the models belonging to $\mathcal{I}^c$. The conditions are presented as follows.

Condition 9. $(v^{(0)})^{1/2 - \kappa} = o\{(\bar p n^{-1/2} M^{1/2}) \vee (\bar p^2 n^{-1} M)\}$, where $\kappa$ is defined in Condition 1.

Condition 10. $E|\mu_{n_0+1}^{(0)} - \tilde\mu_{n_0+1,m}^{(0)}| = O(1)$ uniformly for $m \in \mathcal{I}^c$.

Condition 11. $\{(\bar p n^{-1/2} M^{1/2}) \vee (\bar p^2 n^{-1} M)\} \{\inf_{w \in \mathcal{W}^*} \tilde R(w)\}^{-1} = o(1)$.

Condition 12. $\{\inf_{w \in \mathcal{W}^*} \tilde R(w)\}^{-1} = O(1)$.

Condition 9 specifically constrains the divergence rate of the dimension of the spline basis in the target model. Condition 10 restricts the boundedness of the approximation between $\mu_{n_0+1}^{(0)}$ and $\tilde\mu_{n_0+1,m}^{(0)}$ for the models belonging to $\mathcal{I}^c$; this restriction is common and reasonable.
Condition 11 imposes a restriction on the growth rate of the minimum risk of the averaging prediction over the models in $I^c$, which extends Assumption 6 in Zhang and Liu (2023) to semiparametric model settings. Note that Condition 11 implies that the risk of the target model is sufficiently large when the target model is misspecified, so we do not need the constraint in Condition 8. Based on this condition, the resulting risk will converge to zero when combining any informative source models with the correct target model. By contrast, the risk will be asymptotically much larger than $(pn^{-1/2}M^{1/2}) \vee (p^2n^{-1}M)$ when combining any source model belonging to $I^c$, even with the correct target model. Therefore, it is intuitively clear that none of the models in $I^c$ will asymptotically be assigned nonzero weights under our criterion. Condition 12 restricts the risk of the model averaging estimator under misspecified models. It can be seen that Condition 11 is satisfied if we have $p^2M = o(n)$ combined with Condition 12. We summarize the property of weight convergence in the following Theorem 2, and the proof is provided in Appendix A.4.

Theorem 2 If Conditions 1-6 and 9-11 are satisfied, then $\hat{\tau} \to 1$ in probability.

Theorem 2 demonstrates that our procedure asymptotically assigns all the weights to informative models when the target model is correctly specified, which can be regarded as a type of consistency property in model selection. In other words, the proposed model averaging criterion can consistently select informative models through weight assignments, which is an important distinction from existing parameter-transfer learning frameworks. From the definition of the informative models, both correct and incorrect source models may contribute to the prediction if the parameter information is similar enough.
In addition, our method yields robust performance even when several models have strong dissimilarity, because those models are asymptotically assigned small or zero weights. As discussed in previous sections, we hope to avoid the negative transfer problem in practice. Next, we further discuss the difference between the upper bound of the risk of our method and that of the least squares estimator on the target data only, which is summarized in the following corollary. Let $\bar{R}(\hat{w})$ and $\bar{R}_0$ denote the upper bounds of the risk of our Trans-SMAP and of the least squares estimator on the target data, respectively. The technical details are provided in Appendix A.5.

Corollary 3 Assume that the dimensions and sample sizes of the target and source data satisfy $p^2n^{-1} = O(p_0^2 n_0^{-1})$. If Conditions 1-8 hold, or Conditions 1-7, Conditions 9-10, and Condition 12 hold, then $\bar{R}(\hat{w}) = O_p(\bar{R}_0)$.

Corollary 3 demonstrates that the upper bound of the risk of our Trans-SMAP has no larger order than that of the least squares estimator on the target data, regardless of whether the target model is correct. Hence, in a sense, it also verifies that our procedure provides a reasonable strategy to mitigate the potential negative transfer problem.

4. Simulation Studies

In this section, we evaluate the finite sample performance of our procedure in various numerical experiments. For comparison, we consider the following seven competing procedures: transfer learning by the simple averaging procedure (termed Trans-Simp MA), transfer learning by the smoothed AIC and BIC (Buckland et al., 1997) based model averaging procedures (termed Trans-SAIC and Trans-SBIC), the least squares estimator using the target data only (termed LSE-Tar), the least squares estimator using all the data (termed LSE-All), Trans-Lasso (Li et al., 2021), and Trans-GLM (Tian and Feng, 2022).
Specifically, Trans-Simp MA, Trans-SAIC, and Trans-SBIC construct the corresponding weighted averaging estimators with equal weights $1/(M+1)$, with weights $\exp(-\mathrm{AIC}_m/2)/\sum_{m=0}^{M}\exp(-\mathrm{AIC}_m/2)$, and with weights $\exp(-\mathrm{BIC}_m/2)/\sum_{m=0}^{M}\exp(-\mathrm{BIC}_m/2)$, respectively, where $\mathrm{AIC}_m = \log\{n_m^{-1}\sum_{i=1}^{n_m}(y_i^{(m)} - \hat{\mu}_i^{(m)})^2\} + 2p_m/n_m$ and $\mathrm{BIC}_m = \log\{n_m^{-1}\sum_{i=1}^{n_m}(y_i^{(m)} - \hat{\mu}_i^{(m)})^2\} + p_m\log n_m/n_m$. The purpose of comparing these methods is to verify the superiority of our proposed method. LSE-Tar performs the prediction with the least squares estimator (2) using the target data. LSE-All performs the prediction with the least squares estimator based on all the target and source data by minimizing the following integrative loss function
$$L(\beta, \gamma_1^{(0)}, \ldots, \gamma_{q_0}^{(0)}, \ldots, \gamma_{q_M}^{(M)}) = \Big(2\sum_{m=0}^{M} n_m\Big)^{-1}\sum_{m=0}^{M}\sum_{i=1}^{n_m}\Big[y_i^{(m)} - (x_i^{(m)})^T\beta - \sum_{l=1}^{q_m}\{B_l^{(m)}(z_{il}^{(m)})\}^T\gamma_l^{(m)}\Big]^2.$$
The purpose of considering LSE-Tar and LSE-All is to understand the effect of reasonable knowledge transfer. To comprehensively demonstrate the superiority of our procedure, we also consider two recent transfer learning approaches, Trans-Lasso and Trans-GLM, related to our framework. All experiments are implemented in R software, and more details can be seen in Appendix B.1.

4.1 Simulation Design

Set the target sample size $n_0 = 150$ and the source sample sizes $(n_1, n_2, n_3) = (200, 200, 150)$. For the parametric components, $x_i^{(m)}$ from the target and source models are generated from a 6-dimensional multivariate normal distribution $N(0, \Sigma)$ with $\Sigma = [\Sigma_{aa'}]_{6\times 6}$, where $\Sigma_{aa'} = 0.5^{|a-a'|}$. Set the parametric coefficient vectors of the target and source models as $\beta^{(0)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T$, $\beta^{(1)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3, 1.8)^T + \delta_1$, $\beta^{(2)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T + \delta_2$, and $\beta^{(3)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T + \delta_3$, where $\delta_1, \delta_2, \delta_3$ are the parametric coefficient differences relative to the target model.
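The smoothed AIC and BIC weights defined earlier in this section can be computed directly from each candidate model's residuals. The following is a minimal Python sketch; the function name and interface are ours for illustration, not the paper's R implementation:

```python
import numpy as np

def smoothed_ic_weights(resid_list, p_list, criterion="aic"):
    """Smoothed AIC/BIC model-averaging weights.

    resid_list: list of residual vectors, one per candidate model
    p_list: number of parameters in each candidate model
    """
    ic = []
    for resid, p in zip(resid_list, p_list):
        resid = np.asarray(resid, dtype=float)
        n = len(resid)
        sigma2 = np.mean(resid ** 2)          # n^{-1} sum (y_i - mu_hat_i)^2
        penalty = 2 * p / n if criterion == "aic" else p * np.log(n) / n
        ic.append(np.log(sigma2) + penalty)
    ic = np.asarray(ic)
    # subtract the minimum before exponentiating for numerical stability;
    # this leaves the normalized weights unchanged
    w = np.exp(-(ic - ic.min()) / 2)
    return w / w.sum()
```

Models with smaller information criterion values receive larger weights, and the weights sum to one by construction.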
Further, we set $\delta_1 = 0.02$, $\delta_2 = 0.3$, and $\delta_3 = 0$, so the parameters of the first and second source models differ from those of the target model, and the third source model is informative because its coefficients are exactly the same as those of the target model. Note that the parametric coefficient of the first source model is a 7-dimensional vector, whereas the others are 6-dimensional vectors. Here, we always omit the last component of $x_i^{(1)}$ when fitting the first source model, so that model is misspecified. For the other models, we do not ignore any components, and then they are all correctly specified. The above setting is the case of the correct target model. When the target model is misspecified, we keep the other settings unchanged except setting $\beta^{(0)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3, 0.1)^T$, and we similarly omit the last component of $x_i^{(0)}$. We set the dimension of the nonparametric component of each model as $q_m = 3$ for $m = 0, \ldots, M$, and we generate $z_{il}^{(m)}$ from a uniform distribution $U(0, 1)$. The following nonlinear functions for the different models are considered: $g^{(0)}(u) = 2(u_1 - 0.5)^3 + \sin(\pi u_2) + u_3$, $g^{(1)}(u) = 2(u_1 + 0.5)^3 + \cos(\pi u_2) + u_3$, $g^{(2)}(u) = 2.5(u_1 + 0.3)^3 + \sin(\pi u_2) + 1.5u_3$, and $g^{(3)}(u) = 1.8(u_1 + 0.3)^3 + \cos(\pi u_2) + u_3$. In order to accommodate the settings of Trans-Lasso and Trans-GLM for convenient comparison, we consider the scenario of multiple data sets with equal dimensions, but our framework is not limited to this setting. Hence, we conduct additional simulation studies in heterogeneous dimension settings in Appendix C.2. The random error term $\varepsilon_i^{(m)}$ for the $M + 1$ models follows a normal distribution $N(0, 0.5^2)$ with $\sigma_{i,m} = 0.5$ for $m = 0, \ldots, M$ and $i = 1, \ldots, n_m$. Here, we mainly consider the homoscedastic setting as an example. Since our framework is compatible with heteroscedasticity, we further consider a heteroscedastic design in Appendix C.4.
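As a concrete illustration, the target model of the design above can be simulated as follows. This is an illustrative reconstruction in Python, not the paper's R code; the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(2023)

def ar_cov(dim, rho=0.5):
    """Covariance matrix with Sigma_{aa'} = rho^{|a - a'|}."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def gen_target(n, beta, sigma=0.5):
    """One sample of size n from the target model of the simulation design.

    Linear part x^T beta plus the additive nonlinear part
    g(u) = 2(u1 - 0.5)^3 + sin(pi u2) + u3, with N(0, sigma^2) errors.
    """
    dim = len(beta)
    x = rng.multivariate_normal(np.zeros(dim), ar_cov(dim), size=n)
    z = rng.uniform(size=(n, 3))
    g = 2 * (z[:, 0] - 0.5) ** 3 + np.sin(np.pi * z[:, 1]) + z[:, 2]
    mu = x @ beta + g            # true conditional mean
    y = mu + rng.normal(0, sigma, size=n)
    return x, z, y, mu

beta0 = np.array([1.4, 1.2, 1.0, 0.8, 0.65, 0.3])
x, z, y, mu = gen_target(150, beta0)
```

The source models follow the same pattern with their own coefficient vectors and nonlinear functions.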
To evaluate the prediction performance, we generate new samples from the target model with sample size $n^* = 500$. Furthermore, we design alternative settings that increase the sample sizes to $(300, 350, 350, 250)$ and $(500, 550, 500, 450)$ while keeping the other settings invariant. All the experiments are replicated 500 times. Following a referee's suggestion, we conduct an additional simulation study in high-dimensional settings that matches the assumptions of competing methods like Trans-Lasso to provide a relatively fair comparison. Further details regarding this simulation study can be found in Appendix C.8. Next, we consider increasing the number of source models. Let the sample sizes be $(n_0, \ldots, n_6) = (150, 200, 150, 200, 150, 150, 200)$. The parametric coefficients of the target and source models are set as $\beta^{(0)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T$, $\beta^{(1)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3, 1.8)^T + \delta_1$, $\beta^{(2)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T + \delta_2$, $\beta^{(3)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T + \delta_3$, $\beta^{(4)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3, 1.8)^T + \delta_4$, $\beta^{(5)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T + \delta_5$, and $\beta^{(6)} = (1.4, 1.2, 1, 0.8, 0.65, 0.3)^T + \delta_6$, where $\delta_1 = 0.02$, $\delta_2 = 0.02$, $\delta_3 = 0.3$, $\delta_4 = 0$, $\delta_5 = 0.02$, $\delta_6 = 0.3$. Let the nonlinear functions for the different models be $g^{(0)}(u) = 2(u_1 - 0.5)^3 + \sin(\pi u_2) + u_3$, $g^{(1)}(u) = 2(u_1 + 0.5)^3 + \cos(\pi u_2) + u_3$, $g^{(2)}(u) = 2.5(u_1 + 0.3)^3 + \sin(\pi u_2) + 1.5u_3$, $g^{(3)}(u) = 1.8(u_1 + 0.3)^3 + \cos(\pi u_2) + u_3$, $g^{(4)}(u) = 1.5(u_1 + 0.5)^3 + \cos(2\pi u_2) + u_3^2$, $g^{(5)}(u) = (u_1 + 0.6)^2 + \cos(\pi u_2) + 1.3u_3^2$, and $g^{(6)}(u) = 1.3(u_1 + 0.5)^2 + \cos(2\pi u_2) + 1.6u_3$. Similarly, we design additional settings with the larger sample sizes $(300, 350, 300, 350, 300, 300, 350)$ and $(500, 550, 500, 550, 500, 500, 550)$.

4.2 Simulation Results

We evaluate all the methods by the mean squared error (MSE) on the new samples of size $n^*$, which is expressed as $\mathrm{MSE} = \sum_{i=1}^{n^*}(\hat{\mu}_i^{(0)} - \mu_i^{(0)})^2/n^*$. The results of the averaged MSE based on 500 replications are reported in Table 1.
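The out-of-sample MSE criterion just defined, and its average over replications, can be computed with a small helper (ours, for illustration):

```python
import numpy as np

def mse(mu_hat, mu_true):
    """Out-of-sample MSE: sum_i (mu_hat_i - mu_i)^2 / n_star."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    mu_true = np.asarray(mu_true, dtype=float)
    return float(np.mean((mu_hat - mu_true) ** 2))

def averaged_mse(replications):
    """Average the MSE over (mu_hat, mu_true) pairs from repeated experiments."""
    return float(np.mean([mse(h, t) for h, t in replications]))
```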
The rows labeled Uplift Rate in each table represent the percentage improvement of the averaged MSE of the proposed method relative to the smallest result among the alternative methods.

Figure 1: Stacked area plots of the averaged weight assignments based on our method when the target model is correctly specified with M = 3 (first row) and M = 6 (second row).

According to the results in Table 1, we see that the proposed Trans-SMAP outperforms all the competing methods. The improvement is particularly significant as the number of models increases, especially in the correct target model scenario. When the target model is misspecified, the superiority is not surprising because Theorem 1 provides the theoretical guarantee of asymptotic optimality. The proposed method still has advantages when the target model is correctly specified, because our procedure asymptotically uses the informative models based on Theorem 2, whereas LSE-Tar does not borrow any auxiliary information, and the other approaches may not use auxiliary information effectively. It is worth noting that the recent transfer learning approaches, Trans-Lasso and Trans-GLM, are also not comparable to our method. Apart from the inherent limitations discussed in the introduction, the most important reason is that these procedures are aimed at parametric models.
To evaluate the performance under various levels of noise, we let the variance of the random error vary such that $R^2 = \mathrm{var}(\mu_i^{(0)})/\mathrm{var}(y_i^{(0)})$ ranges from 0.1 to 0.9 in increments of 0.1, and the detailed results are provided in Appendix C.1. Additionally, we design more general experiments to further demonstrate the stability of our procedure under potential negative transfer scenarios in Appendix C.3. The details are reported in Tables 4-7, and the corresponding results still support our method.

                       Correct Target Model             Misspecified Target Model
Method             n0 = 150   n0 = 300   n0 = 500    n0 = 150   n0 = 300   n0 = 500
M = 3
Trans-SMAP           0.027      0.013      0.008       0.034      0.021      0.015
                    (0.010)    (0.005)    (0.003)     (0.010)    (0.005)    (0.003)
Trans-Simp MA        0.239      0.226      0.220       0.218      0.205      0.198
                    (0.038)    (0.028)    (0.023)     (0.034)    (0.024)    (0.021)
Trans-SBIC           0.188      0.173      0.165       0.180      0.165      0.156
                    (0.028)    (0.020)    (0.016)     (0.027)    (0.019)    (0.016)
Trans-SAIC           0.183      0.170      0.165       0.174      0.161      0.155
                    (0.027)    (0.020)    (0.016)     (0.026)    (0.019)    (0.016)
LSE-Tar              0.030      0.015      0.009       0.038      0.022      0.016
                    (0.011)    (0.005)    (0.003)     (0.011)    (0.005)    (0.004)
LSE-All              0.347      0.309      0.260       0.316      0.279      0.234
                    (0.065)    (0.046)    (0.032)     (0.062)    (0.040)    (0.029)
Trans-Lasso          0.123      0.110      0.104       0.130      0.117      0.112
                    (0.013)    (0.008)    (0.006)     (0.014)    (0.009)    (0.007)
Trans-GLM            0.124      0.109      0.103       0.131      0.116      0.112
                    (0.017)    (0.007)    (0.006)     (0.017)    (0.008)    (0.007)
Uplift Rate         11.11%     15.38%     12.50%      11.76%      4.76%      6.67%
M = 6
Trans-SMAP           0.026      0.013      0.008       0.034      0.020      0.015
                    (0.010)    (0.004)    (0.003)     (0.010)    (0.005)    (0.003)
Trans-Simp MA        0.211      0.199      0.197       0.194      0.181      0.178
                    (0.026)    (0.020)    (0.018)     (0.024)    (0.019)    (0.016)
Trans-SBIC           0.195      0.176      0.173       0.182      0.166      0.161
                    (0.023)    (0.017)    (0.015)     (0.021)    (0.016)    (0.014)
Trans-SAIC           0.189      0.174      0.172       0.177      0.164      0.160
                    (0.023)    (0.017)    (0.015)     (0.021)    (0.016)    (0.014)
LSE-Tar              0.031      0.015      0.009       0.039      0.022      0.016
                    (0.012)    (0.005)    (0.003)     (0.013)    (0.006)    (0.003)
LSE-All              0.336      0.278      0.259       0.311      0.256      0.238
                    (0.049)    (0.031)    (0.025)     (0.043)    (0.030)    (0.024)
Trans-Lasso          0.124      0.109      0.104       0.133      0.117      0.111
                    (0.016)    (0.008)    (0.006)     (0.015)    (0.009)    (0.007)
Trans-GLM            0.127      0.108      0.104       0.136      0.116      0.111
                    (0.019)    (0.008)    (0.006)     (0.019)    (0.008)    (0.006)
Uplift Rate         19.23%     15.38%     12.50%      14.71%     10.00%      6.67%

Table 1: The averaged MSE of out-of-sample prediction for different methods. Standard errors are given in parentheses.

Figure 2: The averaged sum of weights for informative models.

We further evaluate the performance of weight estimation. Figure 1 shows the results of averaged weight assignments when the target model is correctly specified, where areas with different colors denote averaged weights under various $R^2$. It can be observed that large weights tend to be assigned to the target model as well as to source models with small differences in parameter effects, and this trend becomes more significant as $R^2$ grows. This is not unexpected, because the target model is of interest, and smaller differences indicate possibly more informative models for knowledge transfer. Notice that the weight for the informative source model is even larger than the weight for the target model; a possible reason is that the informative source model has a larger sample size than the target model. To exclude the influence of confounding factors on the weight estimation, we simply adjust the settings to let the target and source sample sizes be equal but vary over $\{100, 300, 600, 1000\}$, and we set only one source model as the misspecified model, with the remaining models being informative. The relationship between the sum of weights for the informative models and the sample size is illustrated in Figure 2, which clearly verifies the property in Theorem 2 that the sum of weights for informative models approaches 1 as the sample size increases.

5.
Empirical Data Analysis

In this section, we apply our approach to analyze housing rental information data in Beijing, drawn from a publicly available data set at http://www.idatascience.cn/dataset. Our primary goal is to predict the monthly rent, which is conducive to better understanding and following the housing rental market. Considering the similarity of geographical location, population structure, and rental demand, we choose five adjacent districts in southwestern Beijing for our analysis, and the specific locations of the rental houses are marked in Figure 7. Overall, the data set for our analysis contains 1409 observations with 33 variables distributed over five districts (Daxing, Fangshan, Fengtai, Mentougou, and Shijingshan). To accommodate the transfer learning framework, we take the data from the different districts as multi-source data sets. The sample sizes for the source domains of Daxing, Fangshan, Fengtai, Mentougou, and Shijingshan are $(n_1, \ldots, n_5) = (291, 247, 339, 263, 269)$. The response variable, denoted by $Y^{(m)}$ for $m = 1, 2, 3, 4, 5$, is the natural logarithm of the monthly rent. After excluding irrelevant variables by preliminary variable selection, only ten covariates remain in our models: the number of rooms ($X_1^{(m)}$), the number of restrooms ($X_2^{(m)}$), the number of living rooms ($X_3^{(m)}$), total area ($X_4^{(m)}$), whether there is a bed ($X_5^{(m)}$), whether there is a wardrobe ($X_6^{(m)}$), whether there is an air conditioner ($X_7^{(m)}$), whether there is fuel gas ($X_8^{(m)}$), total floor ($Z_1^{(m)}$), and the number of schools within 3 km ($Z_2^{(m)}$). More details can be seen in Table 10 in Appendix C.6. All the covariates have been properly transformed and scaled.
Since we have little prior knowledge of whether the relationship between the response and each predictor is linear or nonlinear, we further conduct marginal visualization for each district to determine our model specification. The marginal relationships between the natural logarithm of the monthly rent and the ten predictors are plotted in Figures 8-12 in Appendix C.6. Since our framework theoretically allows transferring among parametric regression models, we construct an ordinary linear regression model for Fengtai. Therefore, we adopt the following models in this analysis:
$$Y^{(m)} = \begin{cases} \beta_0^{(m)} + \sum_{j=1}^{8}\beta_j^{(m)}X_j^{(m)} + \sum_{j'=1}^{2} g_{j'}^{(m)}(Z_{j'}^{(m)}) + \varepsilon^{(m)} & (m = 1, 2, 4, 5), \\ \beta_0^{(m)} + \sum_{j=1}^{8}\beta_j^{(m)}X_j^{(m)} + \sum_{j'=1}^{2}\gamma_{j'}^{(m)}Z_{j'}^{(m)} + \varepsilon^{(m)} & (m = 3). \end{cases}$$
To further demonstrate the replicability of our proposal, each data set in turn is regarded as the target domain, with the others serving as source domains. We then consider the multiple combinations of target model and source models and carry out our procedure one by one. Next, we fit the semiparametric models introduced in Section 2.1. To evaluate the out-of-sample prediction risk, we randomly split the target samples into two subgroups of equal size as the training and testing data. Then we calculate the mean squared prediction error $\mathrm{MSPE}_{[k]} = 2\|Y_{[k]}^{(m)} - \hat{Y}_{[k]}^{(m)}\|^2/n_m$, $m = 1, \ldots, 5$, where the subscript $[k]$ denotes the $k$th replication and $n_m$ is the sample size of the $m$th data set. We repeat the above process 500 times, and the corresponding results are illustrated in Figure 3. Following a referee's advice, we further conduct an additional simulation study mimicking the real data structure in Appendix C.7. The detailed results are summarized in Table 11. According to Figure 3, it can be seen that all the CV-criterion-based versions of Trans-SMAP perform similarly, and they outperform the alternative methods for most of the target domains.
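The repeated random-split evaluation described above can be sketched as follows. This is a minimal illustration in Python; an ordinary least-squares fit stands in as a placeholder for any fitted predictor (the paper's actual models are semiparametric):

```python
import numpy as np

rng = np.random.default_rng(1)

def repeated_split_mspe(X, y, n_rep=500):
    """Repeated random equal-split evaluation of out-of-sample prediction.

    Each replication: half the sample trains a least-squares fit
    (placeholder predictor), and the other half is scored by
    MSPE = 2 * ||y_test - yhat_test||^2 / n, where n is the full
    sample size (so the divisor matches the test-set size n/2).
    Returns the MSPE averaged over replications.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    scores = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        tr, te = perm[: n // 2], perm[n // 2:]
        coef, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        resid = y[te] - X[te] @ coef
        scores.append(2 * np.sum(resid ** 2) / n)
    return float(np.mean(scores))
```

With, say, 500 replications this reproduces the averaging scheme behind the boxplots, up to the choice of fitted model.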
Specifically, Trans-SMAP (5-fold CV) yields a smaller median MSPE than Trans-Simp MA, Trans-SBIC, and Trans-SAIC for Daxing, Fangshan, Fengtai, and Mentougou, which demonstrates the superiority of our weight choice criterion with theoretical support. LSE-Tar performs worse than Trans-SMAP (5-fold CV) for all target domains except Fengtai, where the advantage of LSE-Tar is relatively small. LSE-All performs much worse than our method in all scenarios. The poor performance of LSE-Tar and LSE-All results from their ineffective use of potential auxiliary information. The other parameter-transfer approaches, Trans-Lasso and Trans-GLM, also perform substantially or slightly worse than Trans-SMAP (5-fold CV) for all of these target domains except Mentougou. Moreover, the similar performance across the CV criteria shows that the prediction performance of our method is not sensitive to the choice of J. This conclusion is consistent with the results of our simulation studies, which are detailed in Figure 6 in Appendix C.5. However, it is important to note that our analysis of the influence of J is based on a numerical perspective, and a rigorous theoretical analysis is left for future research. We further examine the weight assignments for the different models achieved by our criterion in Table 2.

              Daxing  Fangshan  Fengtai  Mentougou  Shijingshan
Daxing         0.366    0.114    0.163     0.010       0.347
Fangshan       0.189    0.297    0.002     0.397       0.116
Fengtai        0.125    0.003    0.656     0.000       0.216
Mentougou      0.000    0.199    0.000     0.780       0.020
Shijingshan    0.266    0.131    0.139     0.281       0.183

Table 2: The averaged weight assignments for different target domains. The rows represent different target domains, and the columns are the corresponding models.

Figure 3: Boxplots of the MSPE for different target domains in the housing rental information data analysis.
To be more intuitive, we plot a directed graph in Appendix C.6 to visualize the transfer network. Some interesting findings can be observed from the results. First, the weights are adaptively assigned to different models for different target domains. Specifically, the weights for some models are very small or even exactly zero, which partly reflects weak transferability. Second, from the weight assignments for the target domains of Fengtai and Mentougou, the target model plays a more important role than the source models, indicating that only limited knowledge may be transferred to these target tasks. By contrast, for the target domains of Fangshan and Shijingshan, the weight assignments and the MSPE show that the source models indeed improve the prediction, and Mentougou is informative for both of these target models. Third, we find that Fangshan and Mentougou mutually serve as each other's most informative source model, so they may help each other improve performance. In summary, the empirical data example demonstrates the effectiveness of our proposed Trans-SMAP in terms of the MSPE compared to competing approaches, as well as proper weight assignments for the different models, which suggests a promising strategy for predictive tasks in future applications.

6. Concluding Remarks

In the context of transfer learning, we propose an optimal parameter-transfer approach for prediction under a flexible semiparametric additive framework. We develop a model averaging approach to transfer possibly shared parameter information from source models to the target model. The asymptotic optimality of the out-of-sample prediction risk under a misspecified target model and the weight convergence under a correct target model are derived. Extensive numerical results demonstrate our effectiveness compared to alternative methods and further support the theoretical findings.
Note that our framework allows adopting purely parametric specifications for some of the models, which demonstrates its flexibility in applications. Even though equal dimensions of the parametric components are assumed in our setting, the proposed method theoretically allows a more general scenario with different dimensions. In addition, our procedure provides a feasible strategy for dealing with massive data. Specifically, we can first split the data and carry out the estimation for each data set in parallel, after which we aggregate the corresponding estimators to construct the prediction by our strategy. Since only parameter estimators are exposed across the multiple data sets in our procedure, our approach can, in a sense, effectively protect the privacy of the original data. Several promising directions are worth further research. First, it would be interesting to further study statistical inference for the resulting model averaging prediction. In this regard, some asymptotic distribution theories for frequentist model averaging estimators have been established in the literature; see related works by Hjort and Claeskens (2003), Liu (2015), and Zhang and Liu (2019). Second, optimal parameter-transfer approaches under variants of the semiparametric framework are also appealing, such as varying-coefficient models, single-index models, and their generalized versions with extensions to high-dimensional scenarios. Third, it is intuitive to combine multiple models by traditional model averaging approaches instead of using a single model for each data set. Fourth, it is necessary to consider a data-driven procedure to select J in our criterion instead of using some given values, and the theoretical investigation warrants further research. Last, transferring shared information of the nonparametric components is a very interesting topic. One possible strategy is to directly transfer the estimates of the nonparametric functions in different models.
Alternatively, we could consider transferring the hyperparameters of the corresponding nonparametric estimation methods, such as the number of internal knots or the degree of the piecewise polynomial in spline-based approaches, and the kernel function or bandwidth in kernel-based methods. The specific methodology needs in-depth research in the future.

Acknowledgments

The authors are very grateful to the action editor and two anonymous referees for their constructive comments and suggestions that substantially improved the original manuscript. Zhang's work was supported by the National Natural Science Foundation of China under Grants 71925007, 72091212, 71988101 and 12288201, and the CAS Project for Young Scientists in Basic Research under Grant YSBR-008. Hu's work was supported by the China Postdoctoral Science Foundation under Grant 2021M703428. No potential conflict of interest is reported by the authors.

Appendix A. Technical Details

In this appendix, we provide some technical details for Section 3 and the proofs of Theorem 1, Theorem 2, and Corollary 3.

A.1 Verification of Condition 8

Let $\|f\|^2 = E\{f^2(X)\} = \int_0^1 f^2(x)p(x)\,dx$ denote the squared $L_2$ norm of a function $f$ on $[0, 1]$, where $p(x)$ is the density of $X$. Denote the additive function of the target model by $g^{(0)}$, which can be represented in the form $g^{(0)}(z_{n_0+1}^{(0)}) = \sum_{l=1}^{q_0} g_l^{(0)}(z_{n_0+1,l}^{(0)})$, where each component $g_l^{(0)}$ belongs to the space $\mathcal{F}$ introduced in Condition 1. Let $\tilde{g}^{(0)}$ be the additive spline approximation for the target model with the form $\tilde{g}^{(0)}(z_{n_0+1}^{(0)}) = \sum_{l=1}^{q_0} \tilde{g}_l^{(0)}(z_{n_0+1,l}^{(0)}) = \sum_{l=1}^{q_0} (B_l^{(0)}(z_{n_0+1,l}^{(0)}))^T \tilde{\gamma}_l^{(0)}$, where each component $\tilde{g}_l^{(0)}$ belongs to the space $\Psi^{(0)}$ introduced in Section 2.1.
In order to better understand Condition 8, suppose that the target model is correctly specified. According to Lemma 8 of Stone (1986), we have $\|g^{(0)} - \tilde{g}^{(0)}\|^2 = O\{(v^{(0)})^{-2\kappa}\}$, where $v^{(0)} = \max_{1\le l\le q_0} v_l^{(0)} = \max_{1\le l\le q_0}\{r_l^{(0)} + S_l^{(0)}\}$ is the maximal dimension of the B-spline bases for the target model and $\kappa$ is defined in Condition 1. Then
$$\xi_n = \inf_{w\in W}\tilde{R}(w) = \inf_{w\in W} E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2 \le E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1,0}^{(0)}\big\}^2$$
$$= E\Big[(x_{n_0+1}^{(0)})^T(\beta^{(0)} - \tilde{\beta}^{(0)}) + \sum_{l=1}^{q_0} g_l^{(0)}(z_{n_0+1,l}^{(0)}) - \sum_{l=1}^{q_0}(B_l^{(0)}(z_{n_0+1,l}^{(0)}))^T\tilde{\gamma}_l^{(0)}\Big]^2$$
$$= E\big\{g^{(0)}(z_{n_0+1}^{(0)}) - \tilde{g}^{(0)}(z_{n_0+1}^{(0)})\big\}^2 = O\{(v^{(0)})^{-2\kappa}\}. \qquad (6)$$
Hence, Condition 8 is violated under Condition 1.

A.2 Proof of Asymptotic Unbiasedness

Proof Recall the notations
$$\hat{\mu}_{i,[G_j^c]}^{(0)}(w) = \sum_{m=0}^{M} w_m \hat{\mu}_{i,m,[G_j^c]}^{(0)}, \quad i\in G_j, \; j = 1,\ldots,J, \qquad \hat{\mu}_{n_0+1}^{(0)}(w) = \sum_{m=0}^{M} w_m \hat{\mu}_{n_0+1,m}^{(0)}.$$
Based on the definition of $CV(w)$, we have the decompositions
$$E\{CV(w)\} = \frac{1}{n_0}\sum_{j=1}^{J}\sum_{i\in G_j} E\big\{y_i^{(0)} - \hat{\mu}_{i,[G_j^c]}^{(0)}(w)\big\}^2 = E\big[\{y_i^{(0)} - \mu_i^{(0)}\}^2\big] + \frac{1}{n_0}\sum_{j=1}^{J}\sum_{i\in G_j} E\big\{\mu_i^{(0)} - \hat{\mu}_{i,[G_j^c]}^{(0)}(w)\big\}^2,$$
$$PR(w) = E\big\{y_{n_0+1}^{(0)} - \hat{\mu}_{n_0+1}^{(0)}(w)\big\}^2 = E\big\{y_{n_0+1}^{(0)} - \mu_{n_0+1}^{(0)}\big\}^2 + E\big\{\mu_{n_0+1}^{(0)} - \hat{\mu}_{n_0+1}^{(0)}(w)\big\}^2.$$
Comparing the above formulas, since the samples from the target population are independent and identically distributed, we have $E[\{y_i^{(0)} - \mu_i^{(0)}\}^2] = E[\{y_{n_0+1}^{(0)} - \mu_{n_0+1}^{(0)}\}^2]$ and $E[\{\mu_i^{(0)} - \hat{\mu}_{i,[G_j^c]}^{(0)}(w)\}^2] = E[\{\mu_{n_0+1}^{(0)} - \hat{\mu}_{n_0+1,[G_j^c]}^{(0)}(w)\}^2]$. Note that $\hat{\mu}_{n_0+1}^{(0)}(w)$ and $\hat{\mu}_{n_0+1,[G_j^c]}^{(0)}(w)$ are weighted averaging predictions with the same definition, except that the latter uses $n_0 - n_0/J$ observations to estimate the parameters $\theta_m^{(0)}$ instead of all the samples. Since $n_0$ and $n_0 - n_0/J$ have the same order for any $J \in \{2, \ldots, n_0\}$ as $n \to \infty$, $\hat{\theta}_m^{(0)}$ and $\hat{\theta}_{m,[G_j^c]}^{(0)}$ have the same limiting values. Therefore, $\hat{\mu}_{n_0+1}^{(0)}(w) - \hat{\mu}_{n_0+1,[G_j^c]}^{(0)}(w) \to 0$ in probability, and then $E\{CV(w)\} = PR(w) + o(1)$ for any $w$. This completes the proof.
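For intuition, once the cross-fitted predictions are in hand, the $J$-fold criterion $CV(w)$ is a quadratic function of $w$ minimized over the probability simplex. A minimal numerical sketch follows; the exponentiated-gradient solver is our choice for illustration (a quadratic-programming solver would work equally well), and the interface is ours:

```python
import numpy as np

def cv_weights(y, preds, n_iter=2000, lr=0.1):
    """Minimize CV(w) = ||y - P w||^2 / n over the probability simplex.

    y: (n,) target responses, stacked over the J folds
    preds: (n, M+1) matrix; column m holds the cross-fitted prediction of
      model m for each target observation (fitted without its own fold).
    Uses multiplicative (exponentiated-gradient) updates, which keep the
    weights nonnegative and summing to one at every step.
    """
    y = np.asarray(y, dtype=float)
    P = np.asarray(preds, dtype=float)
    n, k = P.shape
    w = np.full(k, 1.0 / k)                    # start from equal weights
    for _ in range(n_iter):
        grad = 2.0 * P.T @ (P @ w - y) / n     # gradient of the quadratic CV loss
        w = w * np.exp(-lr * grad)             # multiplicative update
        w /= w.sum()                           # renormalize onto the simplex
    return w
```

In this toy form, a model whose cross-fitted predictions track the target closely should absorb most of the weight, mimicking the adaptive transfer behavior analyzed above.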
A.3 Proof of Theorem 1

First, we list the following lemma without proof, which is provided as Lemma 1 in Zhang (2010), Lemma 1 in Gao et al. (2019), and Lemma 1 in Zhang and Liu (2023), to derive the asymptotic optimality for completeness.

Lemma 4 Let $\hat{w} = \arg\min_{w\in W}\{\tilde{R}(w) + a_n(w) + b_n\}$, where $b_n$ does not depend on $w$. If $\sup_{w\in W}|a_n(w)|/\tilde{R}(w) = o_p(1)$, $\sup_{w\in W}|R(w) - \tilde{R}(w)|/\tilde{R}(w) = o_p(1)$, and there exists a positive constant $c$ so that $\lim_{n\to\infty}\inf_{w\in W}\tilde{R}(w) \ge c$ almost surely, then we have $R(\hat{w})/\inf_{w\in W} R(w) \to 1$ in probability.

Now we formally prove Theorem 1. Define
$$CV^*(w) = CV(w) - \frac{1}{n_0}\sum_{i=1}^{n_0}(y_i^{(0)} - \mu_i^{(0)})(y_i^{(0)} + \mu_i^{(0)}).$$
Since $n_0^{-1}\sum_{i=1}^{n_0}(y_i^{(0)} - \mu_i^{(0)})(y_i^{(0)} + \mu_i^{(0)})$ is unrelated to $w$, our weight choice criterion is equivalent to $\hat{w} = \arg\min_{w\in W} CV^*(w)$. According to Lemma 4, Theorem 1 is valid if the following equalities hold:
$$\sup_{w\in W}|R(w) - \tilde{R}(w)|/\tilde{R}(w) = o_p(1), \qquad (7)$$
$$\sup_{w\in W}|CV^*(w) - \tilde{R}(w)|/\tilde{R}(w) = o_p(1). \qquad (8)$$
We first consider (7). Observe that
$$\xi_n^{-1}\sup_{w\in W}\Big|\big\{\mu_{n_0+1}^{(0)} - \hat{\mu}_{n_0+1}^{(0)}(w)\big\}^2 - \big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\Big|$$
$$= \xi_n^{-1}\sup_{w\in W}\Big|\big\{\hat{\mu}_{n_0+1}^{(0)}(w) - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}\big\{\hat{\mu}_{n_0+1}^{(0)}(w) - \tilde{\mu}_{n_0+1}^{(0)}(w) + 2\tilde{\mu}_{n_0+1}^{(0)}(w) - 2\mu_{n_0+1}^{(0)}\big\}\Big|$$
$$= \xi_n^{-1}\sup_{w\in W}\bigg|\bigg\{\sum_{m=0}^{M} w_m(\hat{\theta}_m^{(0)} - \tilde{\theta}_m^{(0)})^T\frac{\partial\hat{\mu}_{n_0+1,m}^{(0)}}{\partial\theta}\bigg|_{\hat{\theta}_m^{(0)}=\bar{\theta}_m}\bigg\}\bigg\{\sum_{m=0}^{M} w_m(\hat{\theta}_m^{(0)} - \tilde{\theta}_m^{(0)})^T\frac{\partial\hat{\mu}_{n_0+1,m}^{(0)}}{\partial\theta}\bigg|_{\hat{\theta}_m^{(0)}=\bar{\theta}_m} + 2\sum_{m=0}^{M} w_m\tilde{\mu}_{n_0+1,m}^{(0)} - 2\mu_{n_0+1}^{(0)}\bigg\}\bigg|$$
$$\le \xi_n^{-1}\sup_{w\in W}\bigg\{\sum_{m=0}^{M} w_m\big\|\hat{\theta}_m^{(0)} - \tilde{\theta}_m^{(0)}\big\|\bigg\|\frac{\partial\hat{\mu}_{n_0+1,m}^{(0)}}{\partial\theta}\bigg|_{\hat{\theta}_m^{(0)}=\bar{\theta}_m}\bigg\|\bigg\}\bigg\{\sum_{m=0}^{M} w_m\big\|\hat{\theta}_m^{(0)} - \tilde{\theta}_m^{(0)}\big\|\bigg\|\frac{\partial\hat{\mu}_{n_0+1,m}^{(0)}}{\partial\theta}\bigg|_{\hat{\theta}_m^{(0)}=\bar{\theta}_m}\bigg\| + 2\sum_{m=0}^{M} w_m|\tilde{\mu}_{n_0+1,m}^{(0)}| + 2|\mu_{n_0+1}^{(0)}|\bigg\}$$
$$= \xi_n^{-1}O_p(pn^{-1/2}M^{1/2} + p^2n^{-1}M) = o_p(1), \qquad (9)$$
where the second equality uses Condition 6 (with $\bar{\theta}_m$ lying between $\hat{\theta}_m^{(0)}$ and $\tilde{\theta}_m^{(0)}$), the inequality uses Conditions 5 and 6, and the last equality uses Condition 8. Therefore, we have
$$\sup_{w\in W}\frac{|R(w) - \tilde{R}(w)|}{\tilde{R}(w)} \le \xi_n^{-1}\sup_{w\in W}|R(w) - \tilde{R}(w)| = \xi_n^{-1}\sup_{w\in W}\Big|E\big\{\mu_{n_0+1}^{(0)} - \hat{\mu}_{n_0+1}^{(0)}(w)\big\}^2 - E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\Big|$$
$$\le E\bigg[\xi_n^{-1}\sup_{w\in W}\Big|\big\{\mu_{n_0+1}^{(0)} - \hat{\mu}_{n_0+1}^{(0)}(w)\big\}^2 - \big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\Big|\bigg] = o(1),$$
where the last equality is based on Condition 7 and (9). Hence, we obtain (7). Next, we consider (8).
Similar to the derivation of (9), we have
$$\xi_n^{-1}\sup_{w\in W}\frac{1}{n_0}\sum_{j=1}^{J}\sum_{i\in G_j}\Big|\big\{y_i^{(0)} - \hat{\mu}_{i,[G_j^c]}^{(0)}(w)\big\}^2 - \big\{y_i^{(0)} - \tilde{\mu}_i^{(0)}(w)\big\}^2\Big|$$
$$= \xi_n^{-1}\sup_{w\in W}\frac{1}{n_0}\sum_{j=1}^{J}\sum_{i\in G_j}\Big|\big\{\hat{\mu}_{i,[G_j^c]}^{(0)}(w) - \tilde{\mu}_i^{(0)}(w)\big\}\big\{\hat{\mu}_{i,[G_j^c]}^{(0)}(w) - \tilde{\mu}_i^{(0)}(w) + 2\tilde{\mu}_i^{(0)}(w) - 2\mu_i^{(0)} - 2\varepsilon_i^{(0)}\big\}\Big|$$
$$\le \xi_n^{-1}\sup_{w\in W}\frac{1}{n_0}\sum_{j=1}^{J}\sum_{i\in G_j}\bigg\{\sum_{m=0}^{M} w_m\big\|\hat{\theta}_{m,[G_j^c]}^{(0)} - \tilde{\theta}_m^{(0)}\big\|\bigg\|\frac{\partial\hat{\mu}_{i,m,[G_j^c]}^{(0)}}{\partial\theta}\bigg|_{\hat{\theta}_{m,[G_j^c]}^{(0)}=\bar{\theta}_m}\bigg\|\bigg\}\bigg\{\sum_{m=0}^{M} w_m\big\|\hat{\theta}_{m,[G_j^c]}^{(0)} - \tilde{\theta}_m^{(0)}\big\|\bigg\|\frac{\partial\hat{\mu}_{i,m,[G_j^c]}^{(0)}}{\partial\theta}\bigg|_{\hat{\theta}_{m,[G_j^c]}^{(0)}=\bar{\theta}_m}\bigg\| + 2\sum_{m=0}^{M} w_m|\tilde{\mu}_{i,m}^{(0)}| + 2|\mu_i^{(0)}| + 2|\varepsilon_i^{(0)}|\bigg\}$$
$$= \xi_n^{-1}O_p(pn^{-1/2}M^{1/2} + p^2n^{-1}M). \qquad (10)$$
Observe that
$$\sup_{w\in W}\frac{|CV^*(w) - \tilde{R}(w)|}{\tilde{R}(w)} \le \xi_n^{-1}\sup_{w\in W}\bigg|\frac{1}{n_0}\sum_{j=1}^{J}\sum_{i\in G_j}\big\{y_i^{(0)} - \hat{\mu}_{i,[G_j^c]}^{(0)}(w)\big\}^2 - \frac{1}{n_0}\sum_{i=1}^{n_0}(y_i^{(0)} - \mu_i^{(0)})(y_i^{(0)} + \mu_i^{(0)}) - E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\bigg|$$
$$\le \xi_n^{-1}\sup_{w\in W}\bigg|\frac{1}{n_0}\sum_{i=1}^{n_0}\big\{y_i^{(0)} - \tilde{\mu}_i^{(0)}(w)\big\}^2 - \frac{1}{n_0}\sum_{i=1}^{n_0}(y_i^{(0)} - \mu_i^{(0)})(y_i^{(0)} + \mu_i^{(0)}) - E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\bigg| + \xi_n^{-1}O_p(pn^{-1/2}M^{1/2} + p^2n^{-1}M)$$
$$\le \xi_n^{-1}\sup_{w\in W}\bigg|\frac{1}{n_0}\sum_{i=1}^{n_0}\big\{\mu_i^{(0)} - \tilde{\mu}_i^{(0)}(w)\big\}^2 - E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\bigg| + \xi_n^{-1}\sup_{w\in W}\bigg|\frac{2}{n_0}\sum_{i=1}^{n_0}\tilde{\mu}_i^{(0)}(w)(y_i^{(0)} - \mu_i^{(0)})\bigg| + \xi_n^{-1}O_p(pn^{-1/2}M^{1/2} + p^2n^{-1}M), \qquad (11)$$
where the second inequality is based on (10). Hence, to obtain (8), it suffices to prove
$$\xi_n^{-1}\sup_{w\in W}\bigg|\frac{1}{n_0}\sum_{i=1}^{n_0}\big\{\mu_i^{(0)} - \tilde{\mu}_i^{(0)}(w)\big\}^2 - E\big\{\mu_{n_0+1}^{(0)} - \tilde{\mu}_{n_0+1}^{(0)}(w)\big\}^2\bigg| = o_p(1) \qquad (12)$$
and
$$\xi_n^{-1}\sup_{w\in W}\bigg|\frac{1}{n_0}\sum_{i=1}^{n_0}\tilde{\mu}_i^{(0)}(w)(y_i^{(0)} - \mu_i^{(0)})\bigg| = o_p(1).$$
To prove (12), observe that
$$n_0^{-1}\sum_{i=1}^{n_0}\big\{\mu_i^{(0)}-\widetilde\mu_i^{(0)}(w)\big\}^2-E\big\{\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1}^{(0)}(w)\big\}^2=\sum_{m=0}^{M}\sum_{m'=0}^{M}w_mw_{m'}\Big[n_0^{-1}\sum_{i=1}^{n_0}\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)-E\big\{\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m'}^{(0)}\big)\big\}\Big],$$
so that
$$\sup_{w\in\mathcal{W}}\Big|n_0^{-1}\sum_{i=1}^{n_0}\big\{\mu_i^{(0)}-\widetilde\mu_i^{(0)}(w)\big\}^2-E\big\{\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1}^{(0)}(w)\big\}^2\Big|=n_0^{-1/2}\sup_{w\in\mathcal{W}}\Big|\sum_{m=0}^{M}\sum_{m'=0}^{M}w_mw_{m'}\widehat\eta_{m,m'}\Big|\le n_0^{-1/2}\sup_{m,m'}\big|\widehat\eta_{m,m'}\big|=n_0^{-1/2}o_p(\xi_nn_0^{1/2}), \qquad (14)$$
where
$$\widehat\eta_{m,m'}=n_0^{-1/2}\sum_{i=1}^{n_0}\Big[\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)-E\big\{\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m'}^{(0)}\big)\big\}\Big],$$
and the last equality in (14) is due to the following (15) and (16). From Condition 5, we have
$$\mathrm{var}\big\{\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)\big\}=O(1) \qquad (15)$$
uniformly for $m,m'\in\{0,\ldots,M\}$. By Chebyshev's inequality, we derive that for any $\nu>0$,
$$\Pr\Big\{\xi_n^{-1}n_0^{-1/2}\sup_{m,m'}\big|\widehat\eta_{m,m'}\big|>\nu\Big\}\le\sum_{m=0}^{M}\sum_{m'=0}^{M}\Pr\big\{\xi_n^{-1}n_0^{-1/2}\big|\widehat\eta_{m,m'}\big|>\nu\big\}\le\xi_n^{-2}n_0^{-1}\nu^{-2}\sum_{m=0}^{M}\sum_{m'=0}^{M}\mathrm{var}\big\{\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)\big\}=O(\xi_n^{-2}n_0^{-1}M^2)=o(1). \qquad (16)$$
Together with Condition 8, we have $\sup_{m,m'}|\widehat\eta_{m,m'}|=o_p(\xi_nn_0^{1/2})$, and then obtain (12). Similar to the derivation of (12), from Condition 5, we have
$$\mathrm{var}\big\{\widetilde\mu_{i,m}^{(0)}\big(y_i^{(0)}-\mu_i^{(0)}\big)\big\}=O(1) \qquad (17)$$
uniformly for $m=0,\ldots,M$. Further, for any $\nu>0$,
$$\Pr\bigg\{\xi_n^{-1}\sup_m\Big|n_0^{-1}\sum_{i=1}^{n_0}\widetilde\mu_{i,m}^{(0)}\big(y_i^{(0)}-\mu_i^{(0)}\big)\Big|>\nu\bigg\}\le\sum_{m=0}^{M}\Pr\bigg\{\xi_n^{-1}\Big|n_0^{-1}\sum_{i=1}^{n_0}\widetilde\mu_{i,m}^{(0)}\big(y_i^{(0)}-\mu_i^{(0)}\big)\Big|>\nu\bigg\}\le\xi_n^{-2}n_0^{-1}\nu^{-2}\sum_{m=0}^{M}\mathrm{var}\big\{\widetilde\mu_{i,m}^{(0)}\big(y_i^{(0)}-\mu_i^{(0)}\big)\big\}=O(\xi_n^{-2}n_0^{-1}M)=o(1). \qquad (18)$$
Therefore, we have
$$\xi_n^{-1}\sup_{w\in\mathcal{W}}\Big|n_0^{-1}\sum_{i=1}^{n_0}\widetilde\mu_i^{(0)}(w)\big(y_i^{(0)}-\mu_i^{(0)}\big)\Big|=\xi_n^{-1}\sup_{w\in\mathcal{W}}\Big|\sum_{m=0}^{M}w_m\,n_0^{-1}\sum_{i=1}^{n_0}\widetilde\mu_{i,m}^{(0)}\big(y_i^{(0)}-\mu_i^{(0)}\big)\Big|\le\xi_n^{-1}\sup_m\Big|n_0^{-1}\sum_{i=1}^{n_0}\widetilde\mu_{i,m}^{(0)}\big(y_i^{(0)}-\mu_i^{(0)}\big)\Big|=o_p(1),$$
where the last equality is based on (17) and (18). Then (13) is obtained. This completes the proof of Theorem 1.

A.4 Proof of Theorem 2
Proof Theorem 2 trivially holds if $\mathcal{I}^c$ is empty, so we only need to discuss the case that $\mathcal{I}^c$ is not empty.
Similar to the derivation of (14), for any constant $\nu>0$,
$$\Pr\bigg\{\sup_{w\in\mathcal{W}}\Big|n_0^{-1}\sum_{i=1}^{n_0}\big\{\mu_i^{(0)}-\widetilde\mu_i^{(0)}(w)\big\}^2-E\big\{\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1}^{(0)}(w)\big\}^2\Big|>Mn_0^{-1/2}\nu\bigg\}$$
$$\le\Pr\bigg\{\sup_{m,m'}\Big|n_0^{-1}\sum_{i=1}^{n_0}\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)-E\big\{\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m'}^{(0)}\big)\big\}\Big|>Mn_0^{-1/2}\nu\bigg\}$$
$$\le\sum_{m=0}^{M}\sum_{m'=0}^{M}\Pr\bigg\{\Big|n_0^{-1}\sum_{i=1}^{n_0}\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)-E\big\{\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m'}^{(0)}\big)\big\}\Big|>Mn_0^{-1/2}\nu\bigg\}$$
$$\le M^{-2}n_0\nu^{-2}\sum_{m=0}^{M}\sum_{m'=0}^{M}\mathrm{var}\bigg\{n_0^{-1}\sum_{i=1}^{n_0}\big(\mu_i^{(0)}-\widetilde\mu_{i,m}^{(0)}\big)\big(\mu_i^{(0)}-\widetilde\mu_{i,m'}^{(0)}\big)\bigg\}=O(\nu^{-2}),$$
where the second inequality uses Boole's inequality, the third inequality uses Chebyshev's inequality, and the last equality is based on (15). Therefore, it follows that
$$\sup_{w\in\mathcal{W}}\Big|n_0^{-1}\sum_{i=1}^{n_0}\big\{\mu_i^{(0)}-\widetilde\mu_i^{(0)}(w)\big\}^2-E\big\{\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1}^{(0)}(w)\big\}^2\Big|=O_p(n_0^{-1/2}M). \qquad (19)$$
Similarly,
$$\sup_{w\in\mathcal{W}}\Big|n_0^{-1}\sum_{i=1}^{n_0}\widetilde\mu_i^{(0)}(w)\big(y_i^{(0)}-\mu_i^{(0)}\big)\Big|=O_p(n_0^{-1/2}M^{1/2}). \qquad (20)$$
Hence, combining (19), (20), and (11), we have
$$CV^*(w)=\widetilde R(w)+O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M). \qquad (21)$$
Let $\vartheta$ be a weight vector with $\vartheta_m=0$ for $m\in\mathcal{I}$ and $\vartheta_m=w_m/(1-\tau)$ for $m\in\mathcal{I}^c$, where $\tau=\sum_{m\in\mathcal{I}}w_m$. According to Lemma 7 and Lemma 8 in Stone (1986), we have $\|g^{(0)}-\widetilde g^{(0)}\|_\infty=O\{(v^{(0)})^{1/2-\kappa}\}$, where $\|f\|_\infty=\sup_{0\le x\le1}|f(x)|$ denotes the sup-norm of the function $f$ on $[0,1]$. Then using $\vartheta$, we have
$$\widetilde R(w)=E\big\{\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1}^{(0)}(w)\big\}^2=E\bigg\{\sum_{m=0}^{M}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}^2$$
$$=E\bigg\{\sum_{m\in\mathcal{I}^c}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)+\sum_{m\in\mathcal{I}}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}^2$$
$$=E\bigg\{\sum_{m\in\mathcal{I}^c}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}^2+E\bigg\{\sum_{m\in\mathcal{I}}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}^2+2E\bigg[\bigg\{\sum_{m\in\mathcal{I}^c}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}\bigg\{\sum_{m\in\mathcal{I}}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}\bigg]$$
$$=(1-\tau)^2E\bigg\{\sum_{m\in\mathcal{I}^c}(1-\tau)^{-1}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}^2+\tau^2E\big\{g^{(0)}(z_{n_0+1}^{(0)})-\widetilde g^{(0)}(z_{n_0+1}^{(0)})\big\}^2+2\tau E\bigg[\bigg\{\sum_{m\in\mathcal{I}^c}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg\}\big\{g^{(0)}(z_{n_0+1}^{(0)})-\widetilde g^{(0)}(z_{n_0+1}^{(0)})\big\}\bigg]$$
$$=(1-\tau)^2\widetilde R(\vartheta)+O\{(v^{(0)})^{1-2\kappa}\}+O\{(v^{(0)})^{1/2-\kappa}\}E\bigg|\sum_{m\in\mathcal{I}^c}w_m\big(\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)\bigg|$$
$$=(1-\tau)^2\widetilde R(\vartheta)+O\{(v^{(0)})^{1-2\kappa}\}+O\{(v^{(0)})^{1/2-\kappa}\}=(1-\tau)^2\widetilde R(\vartheta)+O\{(v^{(0)})^{1/2-\kappa}\}, \qquad (22)$$
where the fifth equality is based on the definition of informative models, which share the same pseudo-true values as the target model; the sixth equality is similar to the derivation of (6); and the last but one equality uses Condition 10. Here, $\widetilde R(\vartheta)$ denotes the function of $\vartheta$ with the same definition as previously. Then based on (21), (22), and Condition 9, replacing $w$ with $\widehat w$, we have
$$CV^*(\widehat w)=(1-\widehat\tau)^2\widetilde R(\widehat\vartheta)+O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M). \qquad (23)$$
Note that in all the functions of $w$, such as $\widetilde R(w)$, we calculate expectations first and then plug in $\widehat w$. Let $\bar w$ be the weight vector with the first component one and the others zero. Then we have
$$CV^*(\bar w)=\widetilde R(\bar w)+O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M)=O\{(v^{(0)})^{1-2\kappa}\}+O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M)=O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M). \qquad (24)$$
Next, from (23), (24), and the fact that $\widehat w$ minimizes $CV^*(w)$, we have
$$(1-\widehat\tau)^2\widetilde R(\widehat\vartheta)+O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M)=O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M).$$
Hence,
$$(1-\widehat\tau)^2\inf_{w\in\mathcal{W}}\widetilde R(w)\le O_p(pn^{-1/2}M^{1/2}+p^2n^{-1}M), \qquad (25)$$
which implies $\widehat\tau\to1$ in probability based on Condition 11. This completes the proof.

A.5 Proof of Corollary 3
Proof To prove the corollary, we need to consider both the case of a correctly specified target model and that of a misspecified target model. When the target model is misspecified, the result of Theorem 1 ensures that our method yields the minimum risk, which obviously implies the conclusion. Next, we mainly discuss the case of a correctly specified target model. When the target model is correct, the prediction of the least squares estimator on the target data can be written as
$$\widehat\mu_{n_0+1}^{(0)}-\mu_{n_0+1}^{(0)}=\big(\widehat\mu_{n_0+1}^{(0)}-\widetilde\mu_{n_0+1}^{(0)}\big)+\big(\widetilde\mu_{n_0+1}^{(0)}-\mu_{n_0+1}^{(0)}\big)=O_p(p_0n_0^{-1/2}M^{1/2})+O\{(v^{(0)})^{-\kappa}\}=O_p(p_0n_0^{-1/2}M^{1/2}),$$
where the last equality is based on the definition of $v^{(0)}$ and Condition 1.
Then the risk of the least squares estimator on the target data is $O(p_0^2n_0^{-1}M)$. In addition, the prediction of Trans-SMAP satisfies
$$\widehat\mu_{n_0+1}^{(0)}(\widehat w)=\sum_{m=0}^{M}\widehat w_m\widehat\mu_{n_0+1,m}^{(0)}=\mu_{n_0+1}^{(0)}+\sum_{m=0}^{M}\widehat w_m\big(\widehat\mu_{n_0+1,m}^{(0)}-\mu_{n_0+1}^{(0)}\big)$$
$$=\mu_{n_0+1}^{(0)}+\sum_{m\in\mathcal{I}}\widehat w_m\big(\widehat\mu_{n_0+1,m}^{(0)}-\mu_{n_0+1}^{(0)}\big)+\sum_{m\in\mathcal{I}^c}\widehat w_m\big(\widehat\mu_{n_0+1,m}^{(0)}-\mu_{n_0+1}^{(0)}\big)$$
$$=\mu_{n_0+1}^{(0)}+\sum_{m\in\mathcal{I}}\widehat w_m\big(\widehat\mu_{n_0+1,m}^{(0)}-\widetilde\mu_{n_0+1,m}^{(0)}\big)+\sum_{m\in\mathcal{I}}\widehat w_m\big(\widetilde\mu_{n_0+1,m}^{(0)}-\mu_{n_0+1}^{(0)}\big)+\sum_{m\in\mathcal{I}^c}\widehat w_m\big(\widehat\mu_{n_0+1,m}^{(0)}-\mu_{n_0+1}^{(0)}\big)$$
$$=\mu_{n_0+1}^{(0)}+O_p(pn^{-1/2}M^{1/2})+O\{(v^{(0)})^{1/2-\kappa}\}+O_p(1-\widehat\tau)$$
$$=\mu_{n_0+1}^{(0)}+O_p(pn^{-1/2}M^{1/2})+O\{(v^{(0)})^{1/2-\kappa}\}+O_p\Big[(p^{1/2}n^{-1/4}M^{1/4}+pn^{-1/2}M^{1/2})\big\{\inf_{w\in\mathcal{W}}\widetilde R(w)\big\}^{-1/2}\Big]$$
$$=\mu_{n_0+1}^{(0)}+O_p(pn^{-1/2}M^{1/2}),$$
where the last but one equality is based on (25) and the last equality is based on Conditions 1, 9, and 12. Therefore, the risk of Trans-SMAP is $O_p(p^2n^{-1}M)$, and we have $R(\widehat w)=O_p(\bar R_0)$ as long as $p^2n^{-1}=O(p_0^2n_0^{-1})$. This completes the proof.

Appendix B. Implementation Details in Numerical Experiments

B.1 Implementation Details of Different Methods
In our simulation study, we implement all the numerical experiments with R software. To implement our Trans-SMAP procedure, we apply cubic B-splines to approximate the additive functions, set $r_l^{(m)}=3$ for all spline estimators, and specify the number of knots through the argument df in the R function bs. Here, we set df = 3 for each spline estimator in the $M+1$ models for simplicity and efficiency. Note that the number of knots can also be properly determined by criteria such as cross-validation. Since the estimation accuracy of the nonparametric components is not our goal, to reduce the computational complexity, we do not focus on selecting the number of knots and simply adopt a fixed setting in the simulation study. The optimization of our weight criterion can be formulated as a constrained quadratic programming problem, which can be efficiently solved by the existing function solve.QP in the R software package quadprog.
Specifically, let
$$Q=n_0^{-1}\sum_{j=1}^{J}\sum_{i\in G_j}\big\{\big(y_i^{(0)}\mathbf{1}-\widehat{\mathbf{y}}_{i,[G_j^c]}\big)\big(y_i^{(0)}\mathbf{1}-\widehat{\mathbf{y}}_{i,[G_j^c]}\big)^{T}\big\},$$
where $\widehat{\mathbf{y}}_{i,[G_j^c]}=(\widehat\mu_{i,0,[G_j^c]}^{(0)},\ldots,\widehat\mu_{i,M,[G_j^c]}^{(0)})^{T}$ and $\mathbf{1}$ is an $(M+1)\times1$ column vector of ones. Further, with $\sum_{m=0}^{M}w_m=1$, we have
$$CV(w)=n_0^{-1}\sum_{j=1}^{J}\sum_{i\in G_j}\big\{y_i^{(0)}-\widehat\mu_{i,[G_j^c]}^{(0)}(w)\big\}^2=n_0^{-1}\sum_{j=1}^{J}\sum_{i\in G_j}\big\{w^{T}\big(y_i^{(0)}\mathbf{1}-\widehat{\mathbf{y}}_{i,[G_j^c]}\big)\big\}^2=w^{T}Qw.$$
The recent transfer learning methods, Trans-Lasso and Trans-GLM, can be easily implemented via open-source programs and the R package glmtrans. In addition, we have also attempted to compare our method with the integrative analysis method for semiparametric models (Li et al., 2019) in a small experiment. We find that the integrative analysis method is less effective than any other method used in this article, possibly for the following reasons. First, it adopts a group-lasso-type penalization that requires a suitable tuning parameter selection, and the resulting biased estimators may lead to unsatisfactory prediction. Second, the goal and framework of integrative analysis, which can be regarded as multi-task learning, differ from transfer learning: it aims to identify important predictors and estimate parameters in high-dimensional settings and has no theoretical guarantee for out-of-sample prediction; meanwhile, all the models are assumed to be correctly specified and of equal concern.

B.2 Computational Complexity Analysis of the Trans-SMAP Procedure
The calculation of our algorithm is mainly concentrated in the following two stages: cross-validation (Step 3.1) and the optimization of the weight criterion (Step 4). In the cross-validation step, we need to solve the parameter estimation of each model in each iteration. The computational burden of this step mainly comes from the B-spline expansion of the nonparametric components and the least squares estimation of equation (2).
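Returning to the quadratic form $CV(w)=w^{T}Qw$ of the weight criterion in Appendix B.1, the optimization over the weight simplex can be sketched numerically as follows. This is a minimal Python stand-in for the R call to solve.QP: the projected-gradient solver, the function names, and the synthetic residual matrix are ours for illustration only.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    ind = np.arange(1, v.size + 1)
    rho = ind[u - cssv / ind > 0][-1]
    theta = cssv[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

def solve_weights(resid, n_iter=2000):
    """Minimize w' Q w over the simplex, where column m of `resid` holds the
    cross-validated residuals y_i - mu_hat_{i,m,[-fold]} of candidate model m,
    so that Q = resid' resid / n0 matches the Q defined above."""
    n0, K = resid.shape
    Q = resid.T @ resid / n0
    # Safe step size 1 / (2 * lambda_max(Q)) for the gradient 2 Q w.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q).max() + 1e-12)
    w = np.full(K, 1.0 / K)          # start from uniform weights
    for _ in range(n_iter):
        w = project_to_simplex(w - step * 2.0 * Q @ w)
    return w
```

For instance, if one candidate model has much smaller cross-validated residuals than the others, the solver concentrates nearly all the weight on that model, which mirrors how the criterion adaptively transfers information.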
Specifically, the B-splines of degree $r_l^{(m)}$ for the $l$th covariate in the function $g_l^{(m)}(\cdot)$ require $O((r_l^{(m)})^2)$ computation using De Boor's algorithm (De Boor, 2001; Toraichi et al., 1987). Assuming that $J$ is a positive constant, the computational complexity of Step 3.1 is $\sum_{m=0}^{M}O(q_m(r_l^{(m)})^2+p_m^2n_m)$. Moreover, we can estimate the $M+1$ models in this step in parallel. In the optimization of the weight criterion step, Appendix B.1 shows that $CV(w)$ is a quadratic function of the weights, so we can formulate the optimization problem as a constrained quadratic programming problem. Under some weak conditions, we can use the ellipsoid or interior point method to solve the quadratic programming problem in polynomial time (Kozlov et al., 1979). The computational complexity of solving the weights in (5) is $O((M+1)n_0^2)$. Hence, the total computational complexity of our algorithm is $\sum_{m=0}^{M}O(q_m(r_l^{(m)})^2+p_m^2n_m)+O((M+1)n_0^2)$. In summary, although our algorithm may seem complicated, the computational burden is acceptable in theory. Moreover, the computational efficiency has been validated in our numerical simulation studies. For instance, we compare several cross-validation procedures with different choices of J in Appendix C.5.

Figure 4: The scaled averaged MSE of out-of-sample prediction in homogeneous dimension settings with (a) M = 3 and (b) M = 6. Since the numerical experiments of Trans-Lasso for n0 ∈ {300, 500} and R² = 0.1 are infeasible, the corresponding results are not plotted.
The results show that even the most time-consuming procedure, leave-one-out cross-validation, takes only 3.865 seconds in a single replicate under the settings of M = 3 and n0 = 500. Therefore, we believe that our proposed algorithm is computationally feasible for practical applications.

Appendix C. Additional Numerical Results

C.1 Supplemental Results in Homogeneous Dimension Settings
Figure 4 presents the relationship between R² and the scaled MSE with respect to Trans-SMAP. It can be seen from Figure 4 that Trans-SMAP yields the smallest MSE over most of the range of R². Specifically, the advantage of our method over Trans-Simp MA, Trans-SBIC, Trans-SAIC, LSE-All, Trans-Lasso, and Trans-GLM becomes apparent as R² increases gradually. For example, when the target model is correctly specified, the gain of our method is possibly due to large weights being assigned to informative models, which can also be validated from Figure 1 in Section 4.2. Note that Trans-SMAP always performs slightly better than LSE-Tar in all of our scenarios. As the sample size increases, Trans-SMAP still dominates alternative methods for a wide range of R².

Figure 5: The scaled averaged MSE of out-of-sample prediction in heterogeneous dimension settings with (a) M = 3 and (b) M = 6. Since the numerical experiments of Trans-Lasso for n0 ∈ {300, 500} and R² = 0.1 are infeasible, the corresponding results are not plotted.
C.2 Simulation Study in Heterogeneous Dimension Settings
In this section, we design additional settings similar to those in Section 4.1, except that we generate multiple data sets with heterogeneous dimensions of the nonparametric parts. For M = 3, we set the dimensions of the nonparametric component for each model as (q0, q1, q2, q3) = (3, 2, 2, 1), and consider the following nonlinear functions for the different models: g⁽⁰⁾(u) = 2(u1 − 0.5)³ + sin(πu2) + u3, g⁽¹⁾(u) = 2(u1 − 0.5)³ + sin(πu1) + u2, g⁽²⁾(u) = 2(u1 − 0.5)³ + sin(πu2) + u1, and g⁽³⁾(u) = 2(u1 − 0.5)³ + sin(πu1) + u1. For M = 6, let the dimensions of the nonparametric variables be (q0, ..., q6) = (3, 2, 2, 1, 3, 2, 2), and let the corresponding nonlinear functions be g⁽⁰⁾(u) = 2(u1 − 0.5)³ + sin(πu2) + u3, g⁽¹⁾(u) = 2(u1 − 0.5)³ + sin(πu1) + u2, g⁽²⁾(u) = 2(u1 − 0.5)³ + sin(πu2) + u1, g⁽³⁾(u) = 2(u1 − 0.5)³ + sin(πu1) + u1, g⁽⁴⁾(u) = 2(u1 − 0.5)³ + cos(πu2) + u3, g⁽⁵⁾(u) = 2(u1 − 0.5)³ + cos(πu1) + u2, and g⁽⁶⁾(u) = 2(u1 − 0.5)³ + cos(πu2) + u1. All the other settings are consistent with the design in Section 4.1. Since the frameworks of Trans-Lasso and Trans-GLM require equal dimensions of covariates, we omit these two methods in this simulation study. The MSEs of our proposed Trans-SMAP and the alternative methods are shown in Table 3 and Figure 5. The corresponding results are similar to those in Figure 4 and Table 1 in Section 4.2. It can be seen that our method still outperforms all the competitive methods under various simulation settings, which reflects the flexibility and effectiveness of our approach in more practical scenarios.

C.3 Stability Analysis under Various Dissimilarities of Parameter Effects
In Section 4.1, we only design the comparison studies under fixed differences of parametric coefficients.
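The four nonparametric components of the M = 3 design above can be written down directly. The following Python sketch records them; the uniform covariate draw in the helper is an assumption for illustration only, since the full design (parametric parts, coefficients, and noise) follows Section 4.1.

```python
import numpy as np

# Nonparametric components of the heterogeneous-dimension design with M = 3;
# the dimensions are (q0, q1, q2, q3) = (3, 2, 2, 1).
def g0(u):  # target model, u = (u1, u2, u3)
    return 2.0 * (u[0] - 0.5) ** 3 + np.sin(np.pi * u[1]) + u[2]

def g1(u):  # source model 1, u = (u1, u2)
    return 2.0 * (u[0] - 0.5) ** 3 + np.sin(np.pi * u[0]) + u[1]

def g2(u):  # source model 2, u = (u1, u2)
    return 2.0 * (u[0] - 0.5) ** 3 + np.sin(np.pi * u[1]) + u[0]

def g3(u):  # source model 3, u = (u1,)
    return 2.0 * (u[0] - 0.5) ** 3 + np.sin(np.pi * u[0]) + u[0]

def draw_nonparametric(rng, n, q, g):
    """Draw n rows of q Uniform(0, 1) covariates (an assumed law, for
    illustration) and evaluate the additive component g row-wise."""
    z = rng.uniform(size=(n, q))
    return z, np.array([g(row) for row in z])
```

The functions share the same cubic term but differ in which covariates enter the sine term and the linear term, which is what makes the sources only partially informative for the target.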
To further demonstrate the stability of our procedure in potential negative transfer scenarios, we conduct the following simulation by varying the difference of parametric coefficients for the target model. Specifically, let the values of δ2 for M = 3 and δ3, δ6 for M = 6 in the coefficient vectors vary over {0.1, 0.3, 0.5, 0.7, 0.9}, and all the other settings are consistent with the settings in Section 4.1.

                     Correct Target Model             Misspecified Target Model
Method          n0 = 150  n0 = 300  n0 = 500  |  n0 = 150  n0 = 300  n0 = 500
M = 3
Trans-SMAP        0.026     0.013     0.008   |    0.035     0.021     0.016
                 (0.009)   (0.005)   (0.003)  |   (0.011)   (0.005)   (0.003)
Trans-Simp MA     0.238     0.224     0.219   |    0.218     0.206     0.199
                 (0.037)   (0.029)   (0.024)  |   (0.034)   (0.027)   (0.021)
Trans-SBIC        0.182     0.169     0.166   |    0.173     0.163     0.156
                 (0.025)   (0.021)   (0.017)  |   (0.026)   (0.020)   (0.016)
Trans-SAIC        0.180     0.167     0.165   |    0.171     0.162     0.156
                 (0.025)   (0.021)   (0.017)  |   (0.026)   (0.019)   (0.016)
LSE-Tar           0.029     0.014     0.009   |    0.039     0.022     0.016
                 (0.011)   (0.005)   (0.003)  |   (0.013)   (0.006)   (0.003)
LSE-All           0.329     0.289     0.244   |    0.300     0.266     0.220
                 (0.060)   (0.045)   (0.032)  |   (0.057)   (0.040)   (0.028)
Uplift Rate      11.54%     7.69%    12.50%   |   11.43%     4.76%     0.00%
M = 6
Trans-SMAP        0.025     0.013     0.007   |    0.033     0.021     0.015
                 (0.009)   (0.005)   (0.003)  |   (0.010)   (0.005)   (0.003)
Trans-Simp MA     0.210     0.200     0.196   |    0.192     0.184     0.179
                 (0.024)   (0.020)   (0.016)  |   (0.024)   (0.017)   (0.015)
Trans-SBIC        0.187     0.177     0.173   |    0.176     0.168     0.162
                 (0.021)   (0.017)   (0.015)  |   (0.022)   (0.015)   (0.013)
Trans-SAIC        0.185     0.176     0.173   |    0.174     0.167     0.162
                 (0.021)   (0.017)   (0.015)  |   (0.022)   (0.015)   (0.013)
LSE-Tar           0.030     0.015     0.009   |    0.038     0.023     0.017
                 (0.011)   (0.006)   (0.003)  |   (0.011)   (0.006)   (0.003)
LSE-All           0.301     0.249     0.227   |    0.276     0.228     0.208
                 (0.043)   (0.030)   (0.022)  |   (0.041)   (0.026)   (0.020)
Uplift Rate      20.00%    15.38%    28.57%   |   15.15%     9.52%    13.33%

Table 3: The averaged MSE of out-of-sample prediction in heterogeneous dimension settings. The standard errors are given in parentheses.
Here, we display the results of the averaged MSE based on 500 replications in homoscedastic settings with a fixed level of noise, where Tables 4-5 present the corresponding results for heterogeneous dimension settings and Tables 6-7 for homogeneous dimension settings. Similarly, it is clearly observed from all the tables that our proposed Trans-SMAP still yields significant improvement compared to the competitive methods. In addition, we find that the prediction accuracy of Trans-SMAP, LSE-Tar, Trans-Lasso, and Trans-GLM is insensitive to the level of dissimilarity, while the performance of the other methods deteriorates as the difference increases. Specifically, the stable performance shown by Trans-SMAP, Trans-Lasso, and Trans-GLM is due to their suitable strategies of knowledge transfer, among which our method brings additional predictive benefits compared to LSE-Tar. Note that the inferior performance of Trans-Lasso and Trans-GLM compared to LSE-Tar mainly results from the semiparametric model settings. For the dissimilarity-sensitive methods, such as Trans-Simp MA, Trans-SBIC, Trans-SAIC, and LSE-All, the results demonstrate that transferring information from certain sources can even lead to unsatisfactory performance compared to LSE-Tar. Therefore, in a sense, the proposed Trans-SMAP has the ability to avoid the negative transfer problem.

C.4 Simulation Study in Heteroscedastic Settings
In this section, we supplement additional simulation studies in heteroscedastic settings to evaluate our method comprehensively. Based on the model settings in Section 2.1, our method allows for heteroscedastic cases. For simplicity, we only consider generating data following the homogeneous settings, except that the random errors of the mth model are normally distributed with heteroscedasticity as ε_i^{(m)} ~ N(0, 0.5(x_{i1}^{(m)})²) for m = 0, ..., M. The corresponding results of the MSE are presented in Tables 8 and 9.
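The heteroscedastic error mechanism just described is simple to reproduce. The sketch below is a minimal Python illustration (the function name is ours); the Gaussian draw matches the stated variance 0.5(x_{i1}^{(m)})², i.e. a standard deviation of sqrt(0.5)·|x_{i1}^{(m)}|.

```python
import numpy as np

def heteroscedastic_errors(rng, x1):
    """Draw eps_i ~ N(0, 0.5 * x1_i^2) for each observation: the error
    standard deviation sqrt(0.5) * |x1_i| scales with the first covariate."""
    x1 = np.asarray(x1, dtype=float)
    return rng.normal(loc=0.0, scale=np.sqrt(0.5) * np.abs(x1))
```

Under this design, observations with large |x_{i1}^{(m)}| are noisier, which is exactly the feature that separates the heteroscedastic settings of this section from the homoscedastic ones.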
According to the results, Trans-SMAP similarly performs the best in both the correct and misspecified target model settings. It is worth noting that the improvement of Trans-SMAP is more significant than that in the homoscedastic settings, based on the uplift rates in the tables.

C.5 Comparison of Various CV Criteria
To examine the impact of the choice of J in our cross-validation criterion, we analyze the performance of the 2-fold CV, 5-fold CV, 10-fold CV, and leave-one-out CV based Trans-SMAP in terms of MSE and time consumption. For simplicity, we generate data following the homogeneous settings for M = 3 as an example, and the corresponding results are illustrated in Figure 6. From Figure 6 (a) and (c), it can be seen that Trans-SMAP performs similarly well under all the CV criteria and better than the alternative methods. However, the leave-one-out procedure takes a larger amount of time than the alternative criteria as the sample size increases, based on Figure 6 (b). For instance, the 5-fold CV takes 0.039 seconds, which is approximately 100 times faster than the 3.865 seconds of the leave-one-out CV when the target sample size is n0 = 500.
Hence, we advocate using the J-fold CV criterion in this article instead of the leave-one-out CV criterion.

                           Correct Target Model                          Misspecified Target Model
δ2                0.1     0.3     0.5     0.7     0.9    |    0.1     0.3     0.5     0.7     0.9
n0 = 150
Trans-SMAP       0.026   0.027   0.027   0.026   0.026   |   0.035   0.035   0.034   0.034   0.035
                (0.010) (0.010) (0.010) (0.010) (0.010)  |  (0.010) (0.011) (0.011) (0.010) (0.011)
Trans-Simp MA    0.115   0.237   0.431   0.693   1.030   |   0.106   0.221   0.405   0.654   0.976
                (0.024) (0.038) (0.051) (0.074) (0.097)  |  (0.023) (0.035) (0.052) (0.066) (0.086)
Trans-SBIC       0.054   0.183   0.414   0.744   1.191   |   0.054   0.175   0.398   0.731   1.152
                (0.013) (0.028) (0.051) (0.089) (0.137)  |  (0.013) (0.027) (0.052) (0.087) (0.129)
Trans-SAIC       0.054   0.181   0.409   0.735   1.176   |   0.054   0.173   0.393   0.722   1.138
                (0.013) (0.028) (0.051) (0.088) (0.136)  |  (0.013) (0.027) (0.051) (0.086) (0.128)
LSE-Tar          0.029   0.031   0.030   0.031   0.030   |   0.039   0.039   0.038   0.039   0.039
                (0.011) (0.012) (0.011) (0.012) (0.011)  |  (0.012) (0.013) (0.012) (0.011) (0.012)
LSE-All          0.153   0.326   0.601   0.976   1.445   |   0.140   0.303   0.564   0.914   1.386
                (0.037) (0.059) (0.099) (0.167) (0.242)  |  (0.035) (0.054) (0.100) (0.153) (0.237)
Uplift Rate     11.54%  14.81%  11.11%  19.23%  15.38%   |  11.43%  11.43%  11.76%  14.71%  11.43%
n0 = 300
Trans-SMAP       0.013   0.013   0.013   0.013   0.013   |   0.021   0.021   0.021   0.020   0.021
                (0.005) (0.004) (0.005) (0.005) (0.005)  |  (0.005) (0.005) (0.005) (0.005) (0.005)
Trans-Simp MA    0.102   0.226   0.416   0.680   1.009   |   0.091   0.206   0.394   0.648   0.971
                (0.017) (0.028) (0.043) (0.062) (0.080)  |  (0.016) (0.026) (0.041) (0.059) (0.076)
Trans-SBIC       0.043   0.171   0.399   0.731   1.168   |   0.042   0.162   0.388   0.715   1.149
                (0.008) (0.021) (0.042) (0.072) (0.109)  |  (0.008) (0.018) (0.041) (0.069) (0.103)
Trans-SAIC       0.042   0.170   0.396   0.725   1.159   |   0.042   0.161   0.385   0.709   1.140
                (0.008) (0.021) (0.041) (0.071) (0.109)  |  (0.008) (0.018) (0.041) (0.069) (0.103)
LSE-Tar          0.014   0.014   0.014   0.015   0.014   |   0.022   0.022   0.022   0.022   0.022
                (0.005) (0.005) (0.005) (0.005) (0.005)  |  (0.005) (0.005) (0.005) (0.005) (0.005)
LSE-All          0.129   0.292   0.542   0.889   1.324   |   0.115   0.267   0.517   0.858   1.265
                (0.024) (0.042) (0.072) (0.119) (0.174)  |  (0.022) (0.041) (0.072) (0.114) (0.169)
Uplift Rate      7.69%   7.69%   7.69%  15.38%   7.69%   |   4.76%   4.76%   4.76%  10.00%   4.76%
n0 = 500
Trans-SMAP       0.008   0.008   0.008   0.008   0.007   |   0.016   0.016   0.015   0.016   0.015
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
Trans-Simp MA    0.099   0.219   0.412   0.675   1.003   |   0.086   0.200   0.385   0.641   0.963
                (0.014) (0.024) (0.036) (0.053) (0.078)  |  (0.013) (0.023) (0.035) (0.052) (0.067)
Trans-SBIC       0.039   0.165   0.394   0.727   1.156   |   0.038   0.157   0.380   0.709   1.136
                (0.007) (0.018) (0.036) (0.064) (0.106)  |  (0.006) (0.016) (0.037) (0.061) (0.090)
Trans-SAIC       0.039   0.165   0.394   0.726   1.155   |   0.038   0.157   0.380   0.708   1.135
                (0.007) (0.018) (0.036) (0.064) (0.106)  |  (0.006) (0.016) (0.036) (0.061) (0.090)
LSE-Tar          0.009   0.009   0.009   0.009   0.008   |   0.017   0.017   0.016   0.016   0.016
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
LSE-All          0.116   0.244   0.446   0.723   1.055   |   0.101   0.224   0.417   0.684   1.024
                (0.019) (0.032) (0.053) (0.083) (0.123)  |  (0.017) (0.030) (0.052) (0.083) (0.121)
Uplift Rate     12.50%  12.50%  12.50%  12.50%  14.29%   |   6.25%   6.25%   6.67%   0.00%   6.67%

Table 4: The averaged MSE of out-of-sample prediction in heterogeneous dimension settings for M = 3. The standard errors are given in parentheses.
                           Correct Target Model                          Misspecified Target Model
δ3, δ6            0.1     0.3     0.5     0.7     0.9    |    0.1     0.3     0.5     0.7     0.9
n0 = 150
Trans-SMAP       0.025   0.025   0.025   0.025   0.025   |   0.033   0.033   0.033   0.033   0.033
                (0.010) (0.010) (0.010) (0.009) (0.010)  |  (0.010) (0.010) (0.010) (0.010) (0.011)
Trans-Simp MA    0.076   0.209   0.439   0.757   1.171   |   0.069   0.195   0.410   0.721   1.126
                (0.014) (0.026) (0.041) (0.063) (0.092)  |  (0.014) (0.022) (0.040) (0.060) (0.084)
Trans-SBIC       0.050   0.186   0.440   0.808   1.278   |   0.050   0.177   0.415   0.775   1.247
                (0.011) (0.022) (0.044) (0.078) (0.118)  |  (0.011) (0.020) (0.042) (0.072) (0.112)
Trans-SAIC       0.050   0.184   0.434   0.797   1.261   |   0.050   0.175   0.410   0.765   1.231
                (0.011) (0.022) (0.043) (0.077) (0.117)  |  (0.011) (0.020) (0.042) (0.071) (0.111)
LSE-Tar          0.030   0.030   0.030   0.030   0.029   |   0.038   0.038   0.038   0.039   0.038
                (0.011) (0.012) (0.011) (0.011) (0.011)  |  (0.012) (0.011) (0.012) (0.012) (0.012)
LSE-All          0.104   0.303   0.643   1.126   1.727   |   0.093   0.279   0.599   1.067   1.673
                (0.020) (0.045) (0.083) (0.143) (0.227)  |  (0.019) (0.040) (0.079) (0.146) (0.220)
Uplift Rate     20.00%  20.00%  20.00%  20.00%  16.00%   |  15.15%  15.15%  15.15%  18.18%  15.15%
n0 = 300
Trans-SMAP       0.012   0.012   0.012   0.012   0.013   |   0.020   0.020   0.020   0.021   0.020
                (0.005) (0.004) (0.004) (0.005) (0.004)  |  (0.005) (0.005) (0.005) (0.004) (0.005)
Trans-Simp MA    0.065   0.200   0.426   0.747   1.151   |   0.058   0.183   0.400   0.710   1.111
                (0.009) (0.020) (0.034) (0.053) (0.081)  |  (0.009) (0.018) (0.033) (0.054) (0.078)
Trans-SBIC       0.040   0.177   0.427   0.795   1.266   |   0.040   0.167   0.408   0.762   1.237
                (0.006) (0.017) (0.035) (0.063) (0.101)  |  (0.006) (0.016) (0.036) (0.064) (0.100)
Trans-SAIC       0.040   0.176   0.425   0.791   1.260   |   0.040   0.166   0.406   0.759   1.231
                (0.006) (0.017) (0.035) (0.063) (0.101)  |  (0.006) (0.016) (0.036) (0.063) (0.100)
LSE-Tar          0.014   0.014   0.014   0.014   0.015   |   0.022   0.023   0.022   0.023   0.023
                (0.005) (0.005) (0.005) (0.005) (0.005)  |  (0.005) (0.006) (0.006) (0.005) (0.005)
LSE-All          0.082   0.249   0.529   0.921   1.431   |   0.073   0.228   0.494   0.881   1.386
                (0.013) (0.029) (0.057) (0.096) (0.155)  |  (0.012) (0.026) (0.056) (0.091) (0.148)
Uplift Rate     16.67%  16.67%  16.67%  16.67%  15.38%   |  10.00%  15.00%  10.00%   9.52%  15.00%
n0 = 500
Trans-SMAP       0.008   0.007   0.008   0.007   0.008   |   0.015   0.015   0.015   0.015   0.015
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
Trans-Simp MA    0.061   0.195   0.420   0.739   1.148   |   0.053   0.177   0.393   0.706   1.099
                (0.008) (0.017) (0.031) (0.054) (0.075)  |  (0.006) (0.016) (0.029) (0.051) (0.075)
Trans-SBIC       0.036   0.173   0.421   0.784   1.258   |   0.035   0.162   0.401   0.760   1.223
                (0.005) (0.015) (0.032) (0.061) (0.090)  |  (0.004) (0.014) (0.031) (0.058) (0.093)
Trans-SAIC       0.036   0.172   0.421   0.783   1.255   |   0.035   0.162   0.400   0.758   1.220
                (0.005) (0.015) (0.032) (0.061) (0.090)  |  (0.004) (0.014) (0.031) (0.058) (0.092)
LSE-Tar          0.009   0.008   0.009   0.009   0.009   |   0.016   0.016   0.016   0.016   0.016
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
LSE-All          0.074   0.227   0.484   0.843   1.319   |   0.064   0.207   0.453   0.810   1.264
                (0.010) (0.022) (0.043) (0.079) (0.123)  |  (0.008) (0.021) (0.040) (0.073) (0.120)
Uplift Rate     12.50%  14.29%  12.50%  28.57%  12.50%   |   6.67%   6.67%   6.67%   6.67%   6.67%

Table 5: The averaged MSE of out-of-sample prediction in heterogeneous dimension settings for M = 6. The values of δ3 and δ6 are set equal and vary over {0.1, 0.3, 0.5, 0.7, 0.9}. The standard errors are given in parentheses.
                           Correct Target Model                          Misspecified Target Model
δ2                0.1     0.3     0.5     0.7     0.9    |    0.1     0.3     0.5     0.7     0.9
n0 = 150
Trans-SMAP       0.026   0.026   0.026   0.027   0.026   |   0.034   0.034   0.034   0.035   0.035
                (0.010) (0.010) (0.010) (0.010) (0.010)  |  (0.010) (0.010) (0.010) (0.011) (0.011)
Trans-Simp MA    0.116   0.238   0.427   0.694   1.026   |   0.104   0.218   0.404   0.660   0.976
                (0.024) (0.037) (0.055) (0.073) (0.095)  |  (0.022) (0.036) (0.051) (0.071) (0.094)
Trans-SBIC       0.056   0.188   0.424   0.777   1.228   |   0.055   0.179   0.414   0.758   1.202
                (0.013) (0.028) (0.054) (0.090) (0.134)  |  (0.013) (0.027) (0.051) (0.087) (0.138)
Trans-SAIC       0.055   0.183   0.411   0.752   1.188   |   0.054   0.173   0.401   0.733   1.162
                (0.013) (0.028) (0.052) (0.087) (0.130)  |  (0.012) (0.027) (0.050) (0.085) (0.134)
LSE-Tar          0.030   0.031   0.030   0.030   0.030   |   0.038   0.037   0.038   0.038   0.039
                (0.011) (0.011) (0.012) (0.011) (0.011)  |  (0.012) (0.011) (0.012) (0.012) (0.012)
LSE-All          0.172   0.347   0.616   0.997   1.456   |   0.153   0.317   0.579   0.958   1.408
                (0.040) (0.066) (0.107) (0.164) (0.224)  |  (0.036) (0.059) (0.101) (0.160) (0.218)
Trans-Lasso      0.123   0.124   0.123   0.125   0.123   |   0.130   0.130   0.131   0.132   0.134
                (0.014) (0.015) (0.014) (0.016) (0.013)  |  (0.014) (0.014) (0.015) (0.015) (0.016)
Trans-GLM        0.125   0.125   0.124   0.126   0.124   |   0.133   0.133   0.133   0.134   0.134
                (0.018) (0.018) (0.017) (0.019) (0.016)  |  (0.018) (0.019) (0.019) (0.018) (0.019)
Uplift Rate     15.38%  19.23%  15.38%  11.11%  15.38%   |  11.76%   8.82%  11.76%   8.57%  11.43%
n0 = 300
Trans-SMAP       0.013   0.013   0.013   0.013   0.013   |   0.021   0.021   0.020   0.021   0.021
                (0.005) (0.005) (0.005) (0.005) (0.005)  |  (0.005) (0.005) (0.005) (0.005) (0.005)
Trans-Simp MA    0.103   0.224   0.415   0.685   1.013   |   0.092   0.204   0.388   0.646   0.966
                (0.018) (0.028) (0.041) (0.058) (0.085)  |  (0.016) (0.025) (0.039) (0.057) (0.080)
Trans-SBIC       0.044   0.172   0.406   0.757   1.204   |   0.043   0.164   0.391   0.729   1.171
                (0.009) (0.020) (0.042) (0.072) (0.116)  |  (0.008) (0.020) (0.041) (0.075) (0.110)
Trans-SAIC       0.043   0.169   0.398   0.741   1.179   |   0.042   0.161   0.383   0.714   1.147
                (0.008) (0.020) (0.042) (0.070) (0.114)  |  (0.008) (0.020) (0.040) (0.074) (0.108)
LSE-Tar          0.014   0.015   0.014   0.014   0.014   |   0.022   0.022   0.022   0.023   0.022
                (0.005) (0.005) (0.005) (0.005) (0.005)  |  (0.005) (0.005) (0.005) (0.006) (0.006)
LSE-All          0.146   0.305   0.555   0.908   1.336   |   0.132   0.279   0.521   0.861   1.296
                (0.027) (0.044) (0.071) (0.116) (0.171)  |  (0.026) (0.038) (0.068) (0.111) (0.178)
Trans-Lasso      0.109   0.110   0.109   0.109   0.110   |   0.117   0.117   0.117   0.118   0.117
                (0.008) (0.008) (0.008) (0.009) (0.009)  |  (0.009) (0.008) (0.009) (0.009) (0.008)
Trans-GLM        0.109   0.109   0.109   0.108   0.109   |   0.117   0.116   0.116   0.117   0.116
                (0.008) (0.008) (0.008) (0.008) (0.008)  |  (0.009) (0.008) (0.008) (0.009) (0.008)
Uplift Rate      7.69%  15.38%   7.69%   7.69%   7.69%   |   4.76%   4.76%  10.00%   9.52%   4.76%
n0 = 500
Trans-SMAP       0.008   0.008   0.008   0.008   0.008   |   0.016   0.015   0.015   0.015   0.015
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
Trans-Simp MA    0.097   0.221   0.408   0.675   1.008   |   0.085   0.198   0.383   0.639   0.965
                (0.014) (0.022) (0.038) (0.052) (0.078)  |  (0.012) (0.021) (0.035) (0.052) (0.072)
Trans-SBIC       0.039   0.165   0.392   0.728   1.168   |   0.038   0.156   0.379   0.710   1.137
                (0.006) (0.016) (0.036) (0.065) (0.104)  |  (0.006) (0.016) (0.033) (0.062) (0.094)
Trans-SAIC       0.039   0.165   0.391   0.726   1.165   |   0.037   0.155   0.378   0.707   1.134
                (0.006) (0.015) (0.036) (0.064) (0.103)  |  (0.006) (0.016) (0.033) (0.061) (0.094)
LSE-Tar          0.009   0.009   0.009   0.008   0.009   |   0.017   0.016   0.016   0.016   0.016
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
LSE-All          0.129   0.261   0.458   0.739   1.087   |   0.114   0.235   0.430   0.696   1.030
                (0.020) (0.031) (0.055) (0.084) (0.127)  |  (0.018) (0.029) (0.052) (0.085) (0.118)
Trans-Lasso      0.104   0.104   0.104   0.104   0.104   |   0.112   0.112   0.112   0.111   0.112
                (0.006) (0.006) (0.006) (0.006) (0.006)  |  (0.007) (0.007) (0.007) (0.006) (0.007)
Trans-GLM        0.103   0.103   0.103   0.103   0.104   |   0.111   0.111   0.111   0.110   0.111
                (0.006) (0.006) (0.006) (0.006) (0.006)  |  (0.006) (0.007) (0.006) (0.006) (0.006)
Uplift Rate     12.50%  12.50%  12.50%   0.00%  12.50%   |   6.25%   6.67%   6.67%   6.67%   6.67%

Table 6: The averaged MSE of out-of-sample prediction in homogeneous dimension settings for M = 3. The standard errors are given in parentheses.

                           Correct Target Model                          Misspecified Target Model
δ3, δ6            0.1     0.3     0.5     0.7     0.9    |    0.1     0.3     0.5     0.7     0.9
n0 = 150
Trans-SMAP       0.026   0.025   0.025   0.025   0.025   |   0.034   0.034   0.034   0.032   0.034
                (0.010) (0.009) (0.010) (0.010) (0.010)  |  (0.011) (0.011) (0.010) (0.010) (0.010)
Trans-Simp MA    0.077   0.211   0.437   0.756   1.169   |   0.069   0.195   0.411   0.719   1.117
                (0.015) (0.026) (0.039) (0.063) (0.089)  |  (0.014) (0.024) (0.039) (0.061) (0.088)
Trans-SBIC       0.052   0.192   0.447   0.824   1.316   |   0.051   0.184   0.432   0.794   1.269
                (0.011) (0.023) (0.043) (0.076) (0.126)  |  (0.012) (0.022) (0.045) (0.079) (0.119)
Trans-SAIC       0.051   0.186   0.433   0.797   1.271   |   0.050   0.179   0.418   0.768   1.226
                (0.011) (0.023) (0.042) (0.074) (0.122)  |  (0.012) (0.022) (0.043) (0.077) (0.116)
LSE-Tar          0.031   0.029   0.030   0.030   0.030   |   0.039   0.039   0.039   0.037   0.038
                (0.012) (0.011) (0.011) (0.011) (0.012)  |  (0.013) (0.013) (0.012) (0.011) (0.012)
LSE-All          0.141   0.335   0.669   1.146   1.748   |   0.127   0.312   0.637   1.083   1.677
                (0.025) (0.049) (0.081) (0.145) (0.233)  |  (0.023) (0.043) (0.085) (0.138) (0.221)
Trans-Lasso      0.122   0.121   0.123   0.124   0.124   |   0.130   0.131   0.132   0.132   0.133
                (0.013) (0.013) (0.014) (0.015) (0.015)  |  (0.015) (0.014) (0.015) (0.014) (0.015)
Trans-GLM        0.125   0.122   0.125   0.126   0.125   |   0.134   0.133   0.133   0.133   0.133
                (0.017) (0.015) (0.017) (0.018) (0.018)  |  (0.019) (0.018) (0.018) (0.018) (0.017)
Uplift Rate     19.23%  16.00%  20.00%  20.00%  20.00%   |  14.71%  14.71%  14.71%  15.63%  11.76%
n0 = 300
Trans-SMAP       0.012   0.013   0.012   0.012   0.013   |   0.020   0.020   0.020   0.020   0.020
                (0.005) (0.005) (0.004) (0.005) (0.005)  |  (0.005) (0.004) (0.005) (0.005) (0.005)
Trans-Simp MA    0.066   0.200   0.427   0.748   1.156   |   0.058   0.184   0.397   0.705   1.110
                (0.010) (0.020) (0.035) (0.056) (0.082)  |  (0.009) (0.018) (0.033) (0.052) (0.077)
Trans-SBIC       0.040   0.177   0.427   0.791   1.265   |   0.040   0.168   0.404   0.753   1.231
                (0.007) (0.017) (0.036) (0.063) (0.103)  |  (0.006) (0.016) (0.035) (0.058) (0.096)
Trans-SAIC       0.040   0.175   0.422   0.781   1.249   |   0.039   0.166   0.399   0.744   1.215
                (0.007) (0.017) (0.035) (0.063) (0.102)  |  (0.006) (0.015) (0.035) (0.057) (0.095)
LSE-Tar          0.014   0.015   0.014   0.014   0.015   |   0.022   0.022   0.022   0.022   0.022
                (0.005) (0.006) (0.005) (0.005) (0.005)  |  (0.006) (0.005) (0.006) (0.005) (0.006)
LSE-All          0.116   0.279   0.557   0.953   1.451   |   0.105   0.260   0.525   0.902   1.396
                (0.016) (0.033) (0.062) (0.106) (0.155)  |  (0.015) (0.031) (0.057) (0.097) (0.142)
Trans-Lasso      0.109   0.109   0.109   0.109   0.110   |   0.116   0.117   0.117   0.117   0.117
                (0.008) (0.008) (0.008) (0.008) (0.008)  |  (0.009) (0.008) (0.008) (0.009) (0.009)
Trans-GLM        0.108   0.109   0.108   0.109   0.109   |   0.116   0.116   0.116   0.116   0.116
                (0.008) (0.008) (0.008) (0.008) (0.008)  |  (0.009) (0.008) (0.008) (0.009) (0.009)
Uplift Rate     16.67%  15.38%  16.67%  16.67%  15.38%   |  10.00%  10.00%  10.00%  10.00%  10.00%
n0 = 500
Trans-SMAP       0.008   0.008   0.008   0.008   0.007   |   0.016   0.015   0.015   0.015   0.015
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
Trans-Simp MA    0.061   0.197   0.424   0.739   1.143   |   0.054   0.178   0.395   0.702   1.112
                (0.008) (0.017) (0.032) (0.052) (0.071)  |  (0.007) (0.015) (0.030) (0.049) (0.073)
Trans-SBIC       0.036   0.173   0.420   0.776   1.240   |   0.036   0.162   0.398   0.746   1.217
                (0.005) (0.015) (0.032) (0.060) (0.085)  |  (0.005) (0.014) (0.031) (0.054) (0.090)
Trans-SAIC       0.036   0.172   0.418   0.772   1.233   |   0.036   0.161   0.395   0.742   1.210
                (0.005) (0.015) (0.031) (0.059) (0.085)  |  (0.005) (0.014) (0.031) (0.053) (0.090)
LSE-Tar          0.009   0.009   0.009   0.009   0.009   |   0.017   0.016   0.016   0.016   0.016
                (0.003) (0.003) (0.003) (0.003) (0.003)  |  (0.003) (0.003) (0.003) (0.003) (0.003)
LSE-All          0.106   0.260   0.518   0.873   1.334   |   0.097   0.239   0.485   0.835   1.299
                (0.012) (0.027) (0.050) (0.079) (0.118)  |  (0.011) (0.023) (0.044) (0.073) (0.112)
Trans-Lasso      0.103   0.104   0.104   0.104   0.104   |   0.111   0.112   0.111   0.112   0.112
                (0.006) (0.006) (0.006) (0.006) (0.006)  |  (0.007) (0.007) (0.007) (0.006) (0.007)
Trans-GLM        0.103   0.103   0.103   0.103   0.103   |   0.111   0.111   0.111   0.111   0.111
                (0.005) (0.005) (0.006) (0.006) (0.006)  |  (0.006) (0.007) (0.006) (0.006) (0.006)
Uplift Rate     12.50%  12.50%  12.50%  12.50%  28.57%   |   6.25%   6.67%   6.67%   6.67%   6.67%

Table 7: The averaged MSE of out-of-sample prediction in homogeneous dimension settings for M = 6. The values of δ3 and δ6 are set equal and vary over {0.1, 0.3, 0.5, 0.7, 0.9}. The standard errors are given in parentheses.

                     Correct Target Model             Misspecified Target Model
Method          n0 = 150  n0 = 300  n0 = 500  |  n0 = 150  n0 = 300  n0 = 500
M = 3
Trans-SMAP        0.086     0.043     0.027   |    0.091     0.051     0.035
                 (0.051)   (0.020)   (0.014)  |   (0.043)   (0.021)   (0.016)
Trans-Simp MA     0.282     0.246     0.234   |    0.261     0.228     0.212
                 (0.063)   (0.038)   (0.033)  |   (0.059)   (0.039)   (0.030)
Trans-SBIC        0.243     0.205     0.195   |    0.228     0.195     0.180
                 (0.075)   (0.050)   (0.039)  |   (0.073)   (0.047)   (0.038)
Trans-SAIC        0.239     0.202     0.194   |    0.224     0.192     0.179
                 (0.074)   (0.050)   (0.038)  |   (0.072)   (0.047)   (0.038)
LSE-Tar           0.104     0.049     0.031   |    0.113     0.058     0.040
                 (0.074)   (0.026)   (0.018)  |   (0.063)   (0.028)   (0.021)
LSE-All           0.368     0.311     0.259   |    0.336     0.287     0.234
                 (0.083)   (0.052)   (0.040)  |   (0.076)   (0.052)   (0.036)
Uplift Rate      20.93%    13.95%    14.81%   |   24.18%    13.73%    14.29%
M = 6
Trans-SMAP        0.076     0.039     0.025   |    0.088     0.047     0.032
                 (0.046)   (0.020)   (0.012)  |   (0.046)   (0.020)   (0.013)
Trans-Simp MA     0.246     0.221     0.208   |    0.233     0.203     0.190
                 (0.047)   (0.032)   (0.026)  |   (0.051)   (0.032)   (0.023)
Trans-SBIC        0.236     0.209     0.194   |    0.229     0.195     0.181
                 (0.059)   (0.041)   (0.031)  |   (0.061)   (0.041)   (0.030)
Trans-SAIC        0.226     0.204     0.191   |    0.219     0.190     0.178
                 (0.058)   (0.040)   (0.030)  |   (0.060)   (0.040)   (0.030)
LSE-Tar           0.101     0.050     0.031   |    0.113     0.059     0.038
                 (0.069)   (0.028)   (0.017)  |   (0.065)   (0.027)   (0.018)
LSE-All           0.335     0.267     0.239   |    0.313     0.247     0.218
                 (0.064)   (0.040)   (0.031)  |   (0.063)   (0.038)   (0.028)
Uplift Rate      32.89%    28.21%    24.00%   |   28.41%    25.53%    18.75%

Table 8: The averaged MSE of out-of-sample prediction in heterogeneous dimension and heteroscedastic settings. The standard errors are given in parentheses.
Table 9: The averaged MSE of out-of-sample prediction in homogeneous dimension and heteroscedastic settings. The standard errors are given in parentheses.

                 Correct Target Model          Misspecified Target Model
Method           n0 = 150  n0 = 300  n0 = 500  n0 = 150  n0 = 300  n0 = 500

M = 3
Trans-SMAP       0.085     0.045     0.026     0.094     0.055     0.034
                (0.050)   (0.023)   (0.013)   (0.048)   (0.027)   (0.016)
Trans-Simp MA    0.280     0.247     0.233     0.258     0.228     0.212
                (0.064)   (0.042)   (0.033)   (0.059)   (0.041)   (0.031)
Trans-SBIC       0.242     0.210     0.193     0.228     0.199     0.180
                (0.078)   (0.052)   (0.039)   (0.072)   (0.050)   (0.038)
Trans-SAIC       0.236     0.207     0.192     0.222     0.196     0.179
                (0.078)   (0.052)   (0.039)   (0.071)   (0.050)   (0.038)
LSE-Tar          0.102     0.051     0.031     0.112     0.064     0.039
                (0.068)   (0.031)   (0.016)   (0.065)   (0.034)   (0.022)
LSE-All          0.386     0.324     0.271     0.354     0.301     0.251
                (0.087)   (0.055)   (0.040)   (0.076)   (0.053)   (0.038)
Trans-Lasso      0.175     0.136     0.121     0.185     0.147     0.129
                (0.063)   (0.029)   (0.017)   (0.062)   (0.033)   (0.019)
Trans-GLM        0.173     0.135     0.120     0.184     0.146     0.128
                (0.057)   (0.027)   (0.017)   (0.059)   (0.032)   (0.019)
Uplift Rate      20.00%    13.33%    19.23%    19.15%    16.36%    14.71%

M = 6
Trans-SMAP       0.082     0.040     0.025     0.091     0.047     0.032
                (0.052)   (0.021)   (0.011)   (0.053)   (0.022)   (0.012)
Trans-Simp MA    0.255     0.218     0.208     0.236     0.203     0.189
                (0.057)   (0.032)   (0.025)   (0.050)   (0.029)   (0.022)
Trans-SBIC       0.237     0.199     0.190     0.227     0.191     0.175
                (0.066)   (0.040)   (0.030)   (0.058)   (0.036)   (0.027)
Trans-SAIC       0.232     0.197     0.189     0.222     0.189     0.174
                (0.065)   (0.040)   (0.030)   (0.058)   (0.036)   (0.027)
LSE-Tar          0.107     0.051     0.033     0.117     0.060     0.039
                (0.074)   (0.029)   (0.017)   (0.074)   (0.032)   (0.017)
LSE-All          0.377     0.297     0.269     0.351     0.278     0.247
                (0.075)   (0.041)   (0.031)   (0.067)   (0.040)   (0.029)
Trans-Lasso      0.174     0.134     0.122     0.186     0.143     0.129
                (0.058)   (0.026)   (0.018)   (0.074)   (0.029)   (0.019)
Trans-GLM        0.180     0.137     0.122     0.191     0.144     0.129
                (0.061)   (0.029)   (0.017)   (0.065)   (0.028)   (0.018)
Uplift Rate      30.49%    27.50%    32.00%    28.57%    27.66%    21.88%
Figure 6: Comparison of various CV criteria for M = 3. (a) MSE of out-of-sample prediction for all the methods. (b) MSE of out-of-sample prediction for different CV criteria. (c) Time consumption of our method based on different CV criteria. The numerical computation was executed on a regular PC with an Intel Core i7-10700 2.90 GHz CPU.

Table 10: Description of variables in the housing rental information data.

Variable   Description                                 Data Range
Y(m)       the natural logarithm of the monthly rent   [700, 110000]
X1(m)      the number of rooms                         {1, 2, 3, 4, 5, 6, 7, 9}
X2(m)      the number of restrooms                     {0, 1, 2, 3, 4, 5, 9}
X3(m)      the number of living rooms                  {0, 1, 2, 3, 4}
X4(m)      total area                                  [17, 600]
X5(m)      whether there is a bed                      0 = no, 1 = yes
X6(m)      whether there is a wardrobe                 0 = no, 1 = yes
X7(m)      whether there is an air conditioner         0 = no, 1 = yes
X8(m)      whether there is fuel gas                   0 = no, 1 = yes
Z1(m)      total floor                                 [1, 39]
Z2(m)      the number of schools within 3 km           [0, 132]

leave-one-out CV to reduce the computational burden. A more convincing procedure for selecting J adaptively with theoretical guarantees deserves deeper study in the future.

C.6 More Details of Real Data Analysis

Table 10 provides more details of the covariates in our data analysis. In Figure 7, we mark all the specific locations of the rental houses on the map to visualize our data source.
We can see that these areas are relatively far from downtown and have relatively low rental prices, so these houses are likely to be the first choice for young people who have just started working. Figures 8–12 visualize the marginal relationships between the natural logarithm of the monthly rent and the predictors for the different target data sets. For the data sets of Daxing, Fangshan, Mentougou, and Shijingshan, eight covariates (the number of rooms, the number of restrooms, the number of living rooms, total area, and whether there is a bed, a wardrobe, an air conditioner, and fuel gas) are confirmed to have linear effects on the dependent variable, and two covariates (total floor and the number of schools within 3 km) have nonlinear effects. For the data set of Fengtai, all the covariates have linear effects. To be more intuitive, we display the transfer network based on the weight assignments for different target domains in Figure 13 as a supplement to Table 2. In the transfer network, the nodes correspond to different data sources, and the directed edges indicate knowledge transfer from a source domain to the target domain.

C.7 Simulation Study in Real Data Settings

In this section, we conduct an additional simulation study to compare the performance of different methods. To mimic the real data structure, we consider M = 5 data sources, denoted as Domain 1, . . . , Domain 5, with ten covariates and sample sizes of (291, 247, 339, 263, 269).

Figure 7: Visualization of the locations of rental houses in Daxing, Fangshan, Fengtai, Mentougou, and Shijingshan in the empirical data analysis.
Figure 8: The marginal relationship between the natural logarithm of the monthly rent and ten predictors for the data set of Daxing.

Figure 9: The marginal relationship between the natural logarithm of the monthly rent and ten predictors for the data set of Fangshan.

Figure 10: The marginal relationship between the natural logarithm of the monthly rent and ten predictors for the data set of Fengtai.

Figure 11: The marginal relationship between the natural logarithm of the monthly rent and ten predictors for the data set of Mentougou.
Figure 12: The marginal relationship between the natural logarithm of the monthly rent and ten predictors for the data set of Shijingshan.

Figure 13: Transfer network based on our weight assignments for different target domains. The size of a node is proportional to the weight of the target model, and the thickness of an edge is proportional to the weight of the corresponding source model.

These dimensions and sample sizes are the same as in our real data structure. We set Domain 3 to be the data set generated from a linear regression model, while the other domains are generated from additive partial linear models. For Domain 3, the ten covariates are generated from a 10-dimensional multivariate normal distribution N(0, Σ) with Σ = (0.5^{|a−a′|})_{10×10}, and the parametric coefficient vector is set to be β(3) = (1.42, 1.18, 1.02, 0.78, 0.67, 0.32, 0.48, 1.02, 0.8, 0.7)^T. For the other domains, we set the dimensions of the parametric and nonparametric components of each model to eight and two, respectively.
The parametric covariates for each of these models are generated from the same 8-dimensional multivariate normal distribution, and the parametric coefficient vectors of the different models are set to be β(1) = (1.4, 1.2, 1, 0.8, 0.65, 0.3, 0.5, 1)^T, β(2) = (1.42, 1.18, 1.02, 0.78, 0.67, 0.32, 0.48, 1.02, 0.12)^T, β(4) = (1.7, 0.9, 1.3, 0.5, 0.95, 0.6, 0.2, 1.3)^T, and β(5) = (1.4, 1.2, 1, 0.8, 0.65, 0.3, 0.5, 1)^T. As before, we allow the parametric coefficient vectors to have varying degrees of similarity across domains. We further assume that the model of Domain 2 is misspecified, with the same form as that in Section 4.1. The nonparametric variables u are generated from a uniform distribution U(0, 1), and we consider the following nonlinear functions for the different models: g(1)(u) = 2(u1 − 0.5)^3 + sin(πu2), g(2)(u) = 2(u1 + 0.5)^3 + cos(πu2), g(4)(u) = (1.8u1 + 0.3)^3 + cos(πu2), and g(5)(u) = 1.5(u1 + 0.5)^3 + cos(2πu2). For simplicity, we only consider the homoscedastic setting and let the random error follow a normal distribution N(0, σε²) with σε ∈ {0.5, 1.5}. We let each data source take turns as the target domain, with the other data sources serving as source domains, and then evaluate the different procedures in the resulting five scenarios.
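To make the data-generating process above concrete, the following minimal sketch draws one sample from the Domain 1 model. The function name is hypothetical; the covariance of the parametric covariates is assumed to share the 0.5^{|a−a′|} structure used for Domain 3 (the text does not specify it for the other domains), and g(1) takes the form given above.

```python
import numpy as np

def gen_domain1(n, sigma_eps=0.5, rho=0.5, seed=0):
    """Draw n observations from the Domain 1 additive partial linear model:
    y = x^T beta^(1) + g^(1)(u) + eps (a sketch of the simulation setup)."""
    rng = np.random.default_rng(seed)
    p = 8
    # assumed covariance: same 0.5^{|a-a'|} structure as Domain 3
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta1 = np.array([1.4, 1.2, 1.0, 0.8, 0.65, 0.3, 0.5, 1.0])
    u = rng.uniform(0.0, 1.0, size=(n, 2))                   # nonparametric variables
    g = 2.0 * (u[:, 0] - 0.5) ** 3 + np.sin(np.pi * u[:, 1])  # g^(1)(u)
    eps = rng.normal(0.0, sigma_eps, size=n)                 # homoscedastic error
    y = x @ beta1 + g + eps
    return x, u, y

# Domain 1's sample size in the real data structure
x, u, y = gen_domain1(291)
```

The other domains follow the same pattern with their own coefficient vectors and nonlinear functions.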
Table 11: The averaged MSE of out-of-sample prediction for different target domains in real data settings. The standard errors are given in parentheses.

Method           Domain 1  Domain 2  Domain 3  Domain 4  Domain 5

σε = 0.5
Trans-SMAP       0.011     0.021     0.006     0.017     0.051
                (0.004)   (0.006)   (0.003)   (0.007)   (0.006)
Trans-Simp MA    0.104     0.058     0.052     1.081     0.143
                (0.012)   (0.009)   (0.008)   (0.075)   (0.014)
Trans-SBIC       0.105     0.059     0.053     1.077     0.145
                (0.014)   (0.010)   (0.009)   (0.079)   (0.016)
Trans-SAIC       0.107     0.060     0.054     1.072     0.146
                (0.014)   (0.010)   (0.009)   (0.079)   (0.016)
LSE-Tar          0.014     0.025     0.009     0.017     0.056
                (0.005)   (0.007)   (0.004)   (0.007)   (0.008)
LSE-All          0.102     0.099     0.052     1.168     0.225
                (0.017)   (0.013)   (0.012)   (0.097)   (0.021)
Trans-Lasso      0.111     0.241     0.009     0.829     0.649
                (0.009)   (0.017)   (0.004)   (0.052)   (0.034)
Trans-GLM        0.112     0.241     0.009     0.826     0.657
                (0.009)   (0.017)   (0.004)   (0.051)   (0.038)
Uplift Rate      27.27%    19.05%    50.00%    0.00%     9.80%

σε = 1.5
Trans-SMAP       0.089     0.110     0.048     0.134     0.132
                (0.036)   (0.047)   (0.027)   (0.050)   (0.041)
Trans-Simp MA    0.165     0.130     0.083     1.155     0.208
                (0.041)   (0.045)   (0.031)   (0.116)   (0.045)
Trans-SBIC       0.163     0.130     0.082     1.161     0.206
                (0.042)   (0.045)   (0.031)   (0.117)   (0.046)
Trans-SAIC       0.165     0.130     0.083     1.156     0.208
                (0.042)   (0.045)   (0.032)   (0.117)   (0.046)
LSE-Tar          0.123     0.156     0.077     0.135     0.174
                (0.045)   (0.060)   (0.036)   (0.051)   (0.052)
LSE-All          0.153     0.161     0.079     1.237     0.278
                (0.042)   (0.046)   (0.030)   (0.131)   (0.048)
Trans-Lasso      0.186     0.334     0.075     0.916     0.736
                (0.042)   (0.057)   (0.036)   (0.076)   (0.066)
Trans-GLM        0.213     0.351     0.095     0.920     0.771
                (0.059)   (0.065)   (0.057)   (0.080)   (0.088)
Uplift Rate      38.20%    18.18%    56.25%    0.75%     31.82%

To evaluate the prediction performance, we similarly generate 500 testing samples from the target model and calculate the corresponding prediction MSE based on 500 replications. The results are summarized in Table 11. From the results, we can observe that our Trans-SMAP outperforms all competitive methods in most cases, especially for Domain 1 and Domain 3.
When Domain 4 is the target domain, our procedure performs similarly to the best alternative method, LSE-Tar, which reflects the influence of parameter similarity between different models on the improvement achievable by a parameter-transfer approach. Overall, the proposed Trans-SMAP remains effective in simulation experiments with the real data structure.

C.8 Simulation Study in High-dimensional Settings

To allow a relatively fair comparison with Trans-Lasso, we conduct a high-dimensional simulation study. Since our framework cannot be directly applied to high-dimensional data, we simply replace the least squares estimation (2) in the original step of Algorithm 1 with the following Lasso estimation

β̂(m) = argmin_{β(m)} Σ_{i=1}^{n_m} {y_i^{(m)} − (x_i^{(m)})^T β(m)}² + λ(m) ‖β(m)‖₁,  m = 0, . . . , M,  (26)

and the other steps remain unchanged. Specifically, we set p = 300 and (n0, . . . , nM) = (100, 100, 150, 100, 150, 100, 100, 100, 150, 150, 150) for M = 10. The covariates in each model are generated from the same multivariate normal distribution with an identity covariance matrix, following Li et al. (2021). For the coefficient vector of the target model, we set β(0) = (β_1^{(0)}, . . . , β_p^{(0)})^T = (0.5·1_s^T, 0_{p−s}^T)^T with s = 5. For the coefficient vectors of the source models, we set β_j^{(m)} = β_j^{(0)} + ψ_j I(j ∈ {s + 1, . . . , 5s}) for m ∈ {1, 2, 5, . . . , 10}, β_j^{(3)} = β_j^{(0)}, and β_j^{(4)} = β_j^{(0)} + 0.2 I(j ∈ {1, . . . , 50}), where ψ_j is a binary variable taking the values −1 and 1, each with probability 0.5. In addition, we also consider the scenario where the target model and the second source model are misspecified. In this case, the corresponding coefficient vectors are (p + 1)-dimensional, and we set the (p + 1)th coefficients β_{p+1}^{(0)} and β_{p+1}^{(2)} to 0.5. Similarly, we exclude the (p + 1)th components of x_i^{(0)} and x_i^{(2)} when fitting the corresponding models.
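As a minimal sketch of the Lasso replacement step (26), the following plain coordinate-descent solver stands in for whatever Lasso implementation is actually used (the paper does not specify one); the function names are hypothetical, and λ is fixed here rather than chosen by cross-validation as in the experiments.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for argmin_b sum_i (y_i - x_i^T b)^2 + lam * ||b||_1,
    a minimal stand-in for the Lasso step (26), not the authors' implementation."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)   # per-coordinate curvature x_j^T x_j
    r = y - X @ b                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]     # remove coordinate j from the fit
            rho = X[:, j] @ r       # partial correlation with the residual
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            r -= X[:, j] * b[j]     # put the updated coordinate back
    return b
```

In the modified algorithm, this solver would be applied once per data source, `lasso_cd(X_m, y_m, lam_m)` for m = 0, . . . , M, with each penalty level tuned by cross-validation.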
The random error for all M + 1 models follows a standard normal distribution. To differentiate between the two proposed methods, we refer to the modified method as Transfer learning for High-dimensional Model Averaging Prediction (Trans-HMAP). The tuning parameters λ(m) for m = 0, . . . , M are chosen by 8-fold cross-validation, as suggested by Li et al. (2021). To evaluate the performance of our method and the competitive methods, we generate n = 100 testing samples from the target model and calculate the mean squared prediction error (MSPE). In addition, we report the sum of squared estimation errors (SSE), ‖β̂(0) − β(0)‖², for the different estimators. All results are based on 200 replications. To accommodate the high-dimensional settings, we use Lasso estimation instead of least squares estimation for LSE-Tar and LSE-All, denoted as Lasso-Tar and Lasso-All. The corresponding results are reported in Table 12. The results show that our Trans-HMAP still outperforms the other methods in both estimation and prediction, demonstrating the effectiveness of our framework in high-dimensional scenarios.

Table 12: The averaged SSE and MSPE for different methods in high-dimensional settings. The standard errors are given in parentheses.

                 Correct Specification     Misspecification
Method           SSE       MSPE            SSE       MSPE
Trans-HMAP       0.277     0.295           0.308     0.572
                (0.083)   (0.101)         (0.098)   (0.120)
Trans-Simp MA    1.090     1.096           1.093     1.357
                (0.236)   (0.285)         (0.236)   (0.277)
Trans-SBIC       0.910     0.919           0.938     1.202
                (0.192)   (0.236)         (0.202)   (0.233)
Trans-SAIC       1.240     1.247           1.264     1.526
                (0.264)   (0.316)         (0.287)   (0.314)
Lasso-Tar        0.427     0.446           0.516     0.772
                (0.172)   (0.191)         (0.203)   (0.232)
Lasso-All        1.557     1.583           1.562     1.815
                (0.379)   (0.445)         (0.356)   (0.416)
Trans-Lasso      0.388     0.404           0.385     0.641
                (0.264)   (0.267)         (0.251)   (0.276)
Trans-GLM        0.326     0.340           0.401     0.655
                (0.179)   (0.185)         (0.241)   (0.249)
Uplift Rate      17.69%    15.25%          25.00%    12.06%

References

Rie Kubota Ando, Tong Zhang, and Peter Bartlett.
A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(11):1817–1853, 2005.

Tomohiro Ando and Ker-Chau Li. A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 109(505):254–265, 2014.

Tomohiro Ando and Ker-Chau Li. A weight-relaxed model averaging approach for high-dimensional generalized linear models. The Annals of Statistics, 45(6):2654–2679, 2017.

Hamsa Bastani. Predicting with proxies: Transfer learning in high dimension. Management Science, 67(5):2964–2984, 2021.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Annals of Statistics, 46(3):1352–1382, 2018.

Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. Journal of Machine Learning Research, 22(1):46–100, 2021.

Steven T Buckland, Kenneth P Burnham, and Nicole H Augustin. Model selection: an integral part of inference. Biometrics, 53:603–618, 1997.

Bertrand Clarke. Comparing Bayes model averaging and stacking when model approximation error cannot be ignored. Journal of Machine Learning Research, 4(Oct):683–712, 2003.

Carl De Boor. A Practical Guide to Splines. Springer-Verlag New York, 2001.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, Seattle, Washington, 2004.

Yan Gao, Xinyu Zhang, Shouyang Wang, Terence Tai-leung Chong, and Guohua Zou. Frequentist model averaging for threshold models. Annals of the Institute of Statistical Mathematics, 71(2):275–306, 2019.

Nils Lid Hjort and Gerda Claeskens. Frequentist model average estimators. Journal of the American Statistical Association, 98(464):879–899, 2003.

Rongjie Jiang, Liming Wang, and Yang Bai. Optimal model averaging estimator for semifunctional partially linear models. Metrika, 84(2):167–194, 2021.

Mikhail K Kozlov, Sergei Pavlovich Tarasov, and Leonid Genrikhovich Khachiyan. Polynomial solvability of convex quadratic programming. In Doklady Akademii Nauk, volume 248, pages 1049–1051. Russian Academy of Sciences, 1979.

Prosenjit Kundu, Runlong Tang, and Nilanjan Chatterjee. Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika, 106(3):567–585, 2019.

Jialiang Li, Xiaochao Xia, Weng Kee Wong, and David Nott. Varying-coefficient semiparametric model averaging prediction. Biometrics, 74(4):1417–1426, 2018.

Jialiang Li, Jing Lv, Alan TK Wan, and Jun Liao. Adaboost semiparametric model averaging prediction for multiple categories. Journal of the American Statistical Association, 117(537):495–509, 2022a.

Sai Li, T. Tony Cai, and Hongzhe Li. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):149–173, 2021.

Sai Li, T. Tony Cai, and Hongzhe Li. Transfer learning in large-scale Gaussian graphical models with false discovery rate control. Journal of the American Statistical Association, Forthcoming, 2022b.

Yang Li, Rong Li, Cunjie Lin, Yichen Qin, and Shuangge Ma. Penalized integrative semiparametric interaction analysis for multiple genetic datasets. Statistics in Medicine, 38(17):3221–3242, 2019.

Jun Liao, Alan TK Wan, Shuyuan He, and Guohua Zou. Frequentist model averaging for the nonparametric additive model. Statistica Sinica, Forthcoming, 2021.

Chu-An Liu. Distribution theory of the least squares averaging estimator. Journal of Econometrics, 186(1):142–159, 2015.

Jin Liu, Shuangge Ma, and Jian Huang. Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics, 41(1):87–103, 2014.

Shishi Liu and Jingxiao Zhang. Model averaging by cross-validation for partially linear functional additive models. arXiv preprint arXiv:2105.00966, 2021.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, pages 97–105, Lille, France, 2015.

Shuangge Ma, Jian Huang, Fengrong Wei, Yang Xie, and Kuangnan Fang. Integrative analysis of multiple cancer prognosis studies with gene expression measurements. Statistics in Medicine, 30(28):3361–3371, 2011.

Yanyuan Ma and Liping Zhu. Doubly robust and efficient estimators for heteroscedastic partially linear single-index models allowing high dimensional covariates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(2):305–322, 2013.

Yanyuan Ma, Jeng-Min Chiou, and Naisyin Wang. Efficient semiparametric estimator for heteroscedastic partially linear models. Biometrika, 93(1):75–84, 2006.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.

Weike Pan, Evan W. Xiang, Nathan N. Liu, and Qiang Yang. Transfer learning in collaborative filtering for sparsity reduction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 230–235, Atlanta, Georgia, 2010.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298, 2016.

Charles J Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689–705, 1985.

Charles J Stone. The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 14(2):590–606, 1986.

Ye Tian and Yang Feng. Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association, Forthcoming, 2022.

Kazuo Toraichi, Kazuki Katagishi, Iwao Sekita, and Ryoichi Mori. Computational complexity of spline interpolation. International Journal of Systems Science, 18(5):945–954, 1987.

Alan TK Wan, Xinyu Zhang, and Guohua Zou. Least squares model averaging by Mallows criterion. Journal of Econometrics, 156(2):277–283, 2010.

Li Wang, Xiang Liu, Hua Liang, and Raymond J Carroll. Estimation and variable selection for generalized additive partial linear models. Annals of Statistics, 39(4):1827–1851, 2011.

Halbert White. Maximum likelihood estimation of misspecified models. Econometrica: Journal of The Econometric Society, 50(1):1–25, 1982.

Minge Xie, Kesar Singh, and William E Strawderman. Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106(493):320–333, 2011.

Xinyu Zhang. Model Averaging and Its Applications. PhD thesis, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 2010.

Xinyu Zhang and Hua Liang. Focused information criterion and model averaging for generalized additive partial linear models. The Annals of Statistics, 39(1):174–200, 2011.

Xinyu Zhang and Chu-An Liu. Inference after model averaging in linear regression models. Econometric Theory, 35(4):816–841, 2019.

Xinyu Zhang and Chu-An Liu. Model averaging prediction by K-fold cross-validation. Journal of Econometrics, 235(1):280–301, 2023.

Xinyu Zhang and Wendun Wang. Optimal model averaging estimation for partially linear models. Statistica Sinica, 29(2):693–718, 2019.

Xinyu Zhang, Dalei Yu, Guohua Zou, and Hua Liang. Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association, 111(516):1775–1790, 2016.

Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321–3363, 2013.

Rong Zhu, Alan TK Wan, Xinyu Zhang, and Guohua Zou. A Mallows-type model averaging estimator for the varying-coefficient partially linear model. Journal of the American Statistical Association, 114(526):882–892, 2019.