Journal of Machine Learning Research 25 (2024) 1-59. Submitted 7/23; Revised 10/24; Published 10/24.

Stability and L2-penalty in Model Averaging

Hengkun Zhu (hkzhu@cnu.edu.cn)
Guohua Zou (ghzou@amss.ac.cn)
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China

Editor: Aryeh Kontorovich

Abstract

Model averaging has received much attention in the past two decades, as it integrates available information by averaging over potential models. Although various model averaging methods have been developed, there is little literature on the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain the model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. We therefore define stability, asymptotic empirical risk minimization, generalization and consistency of model averaging, and study the relationships among them. In line with existing results in the literature, we find that stability can ensure that the model averaging estimator has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose an L2-penalty model averaging method that places no constraint on the model weights, and prove that it has stability and consistency. To overcome the selection uncertainty of the L2-penalty parameter, we use cross-validation to select a candidate set of L2-penalty parameters, and then perform a weighted average of the estimators of model weights based on cross-validation errors. We demonstrate the usefulness of the proposed method with a Monte Carlo simulation and an application to a prediction task on the wage1 dataset.

Keywords: Model averaging, Stability, Mean squared prediction error, L2-penalty

1. Introduction

In practical applications, data analysts usually specify multiple models, based on exploratory analysis of the data and empirical knowledge, to describe the relationship between the variables of interest and related variables. How to use these models to produce good results is then an important problem. It is very common to select one model using some data-driven criterion, such as AIC (Akaike, 1973), BIC (Schwarz, 1978), Cp (Mallows, 1973) and FIC (Hjort and Claeskens, 2003). An alternative to model selection is to make a compromise across a set of competing models. Statisticians find that they can usually obtain better and more stable results by combining information from different models. This process of combining the results of multiple models is known as model averaging. The problem of Bayesian and frequentist model averaging (BMA and FMA) has been well studied; Fragoso et al. (2018) reviewed the relevant literature on BMA. In this paper, we focus on FMA. In the past decades, model averaging has been applied in various fields. Wan and Zhang (2009) examined the applications of model averaging in tourism research. Zhang and Zou (2011) applied the model averaging method to grain production forecasting in China. Moral-Benito (2015) reviewed the literature on model averaging with special emphasis on its applications in economics.
The key to FMA lies in how to select the model weights. Common weight selection methods include: 1) methods based on information criteria, such as smoothed AIC and smoothed BIC in Buckland et al. (1997), and smoothed FIC (Hjort and Claeskens, 2003); 2) Mallows model averaging (MMA), proposed by Hansen (2007) (see also Wan et al., 2010), modified by Liu and Okui (2013) to accommodate heteroscedasticity, and improved for small sample sizes by Liao and Zou (2020); 3) adaptive methods (Yang, 2001; Yuan and Yang, 2005); 4) the OPT method (Liang et al., 2011); 5) cross validation methods, such as jackknife model averaging (JMA; Hansen and Racine, 2012; Zhang et al., 2013) and leave-subject-out cross validation model averaging (Gao et al., 2016; Liao et al., 2019).

In computational learning theory, stability is used to measure an algorithm's sensitivity to perturbations of the training set, and is an important tool for analyzing generalization and learnability. Bousquet and Elisseeff (2002) introduced four notions of stability (hypothesis stability, pointwise hypothesis stability, error stability and uniform stability), and showed how to use them to derive generalization error bounds based on the empirical error and the leave-one-out error. Kutin and Niyogi (2002) introduced several weaker variants of stability and showed how they suffice to obtain generalization bounds for algorithms. Rakhlin et al. (2005) and Mukherjee et al. (2006) discussed the necessity of stability for learnability under the assumption that uniform convergence is equivalent to learnability. As commented by Shalev-Shwartz et al. (2010), the long-accepted answer to the fundamental question of how to characterize learnability, at least for supervised classification and regression, was that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. However, Shalev-Shwartz et al. (2010) found that in the general learning setting, which includes most statistical learning problems as special cases, there are non-trivial learning problems where uniform convergence does not hold, so that empirical risk minimization fails, yet these problems are learnable using alternative mechanisms. Instead of uniform convergence, Shalev-Shwartz et al. (2010) identified stability as the key necessary and sufficient condition for learnability.

Although various model averaging methods have been proposed, there is little literature on their theoretical properties from the perspective of stability, and the majority of these methods are concerned only with whether the resulting estimator attains a good approximation to the minimum of a given target when the model weights are constrained to a simplex. Thus, our aim in this paper is to study stability in model averaging and to answer whether the resulting estimator can attain a good approximation to the minimum of the target function when the model weights are unrestricted.

Our first contribution is to introduce the concept of stability from statistical learning theory into model averaging. Stability concerns how much the algorithm's output varies when the sample changes a little. Shalev-Shwartz et al. (2010) discussed the relationships among asymptotic empirical risk minimization (AERM), stability, generalization and consistency. However, the relevant conclusions cannot be directly applied to model averaging.
Therefore, we explore the relevant definitions and conclusions of Shalev-Shwartz et al. (2010) under the model averaging framework. In line with the existing results of Bousquet and Elisseeff (2002) and Shalev-Shwartz et al. (2010), we find that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error (MSPE). Further, we find that for MMA and JMA, extreme weights tend to appear, due to the correlation between candidate models, when the model weights are unrestricted. This results in poor performance of the model averaging estimator. Therefore, we should not simply remove the weight constraint and directly use the existing model averaging methods. Similar to ridge regression in Hoerl and Kennard (1970), we introduce an L2-penalty for the weight vector in MMA and JMA, and call the resulting methods Ridge-Mallows model averaging (RMMA) and Ridge-jackknife model averaging (RJMA), respectively. This is our second contribution. Like Theorem 4.3 in Hoerl and Kennard (1970), we discuss the reasonability of introducing the L2-penalty.

In this regard, there are some related works in the literature. Skolkova (2023) proposed the ridge model averaging estimator (RMA). In order to reduce the instability of the regression estimators of candidate models caused by high correlation among covariates, RMA uses ridge regression in place of ordinary least squares. Liu (2023) proposed penalized Mallows model averaging (pMMA) to avoid over-selection of candidate models in forming a combined estimator. Unlike RMA and pMMA, we introduce an L2-penalty for the model weights in order to reduce the instability of the model weight estimator caused by high correlation among candidate models when the model weights are unrestricted. We also prove the stability and consistency of RMMA and RJMA. In the context of shrinkage estimation, Schomaker (2012) discussed the impact of tuning parameter selection and pointed out that a weighted average of shrinkage estimators with different tuning parameters can improve the overall stability, predictive performance and standard errors of shrinkage estimators. Hence, like Schomaker (2012), we use cross-validation to select a candidate set of L2-penalty parameters, and then perform a weighted average of the estimators of model weights based on cross-validation errors. Moreover, we provide empirical support for the usefulness of the proposed method with a Monte Carlo simulation and an application to a prediction task on the wage1 dataset, in which our approach outperforms MMA and JMA, as well as some commonly used model selection methods.

The remainder of this paper is organized as follows. In Section 2, we give the definitions of consistency and stability, and discuss their relationship. In Section 3, we propose the RMMA and RJMA methods and prove that they are stable and consistent. Section 4 conducts the Monte Carlo simulation experiment. Section 5 applies the proposed method to a real data set. Section 6 concludes. The proofs of the theorems are provided in Appendix B.

2. Consistency and Stability for Model Averaging

We assume that $S = \{z_i = (y_i, x_i')' \in \mathcal{Z}, i = 1, 2, \ldots, n\}$ is a simple random sample from distribution $D$, where $y_i$ is the $i$-th observation of the response variable and $x_i$ is the $i$-th observation of the covariates.
Let $z^* = (y^*, x^{*\prime})'$ be an observation drawn from distribution $D$ and independent of $S$.

2.1 Model Averaging

In model averaging, $M$ candidate models are selected first in order to describe the relationship between the response variable and the covariates. We assume that the hypothesis spaces of the $M$ candidate models are
\[
\mathcal{H}_m = \big\{h_m(x^*_{(m)}),\ h_m \in \mathcal{F}_m\big\}, \quad m = 1, 2, \ldots, M,
\]
where $x^*_{(m)}$ consists of some elements of $x^*$ and $\mathcal{F}_m$ is a given function set. For example, in MMA we take
\[
\mathcal{H}_m = \big\{x^{*\prime}_{(m)}\theta_m,\ \theta_m \in \mathbb{R}^{\dim(x^*_{(m)})}\big\}, \quad m = 1, 2, \ldots, M,
\]
in order to estimate $E(y^*|x^*)$, where $\dim(\cdot)$ denotes the dimension of a vector. For the $m$-th candidate model, a proper estimation method $A_m$ is selected, and then $\hat h_m$, the estimator of $h_m$, is obtained based on $S$ and $A_m$. The hypothesis space of model averaging is then defined as
\[
\mathcal{H} = \big\{\hat h(x^*, w) = H[w, \hat h_1(x^*_{(1)}), \hat h_2(x^*_{(2)}), \ldots, \hat h_M(x^*_{(M)})],\ w \in \mathcal{W}\big\},
\]
where $\mathcal{W}$ is a given weight space and $H(\cdot)$ is a given function of the weight vector and the estimators of the $M$ candidate models. In MMA, we take $H[w, \hat h_1(x^*_{(1)}), \ldots, \hat h_M(x^*_{(M)})] = \sum_{m=1}^M w_m \hat h_m(x^*_{(m)})$ as the model averaging estimator of $E(y^*|x^*)$.

An important problem for model averaging is the choice of model weights. Here, the estimator $\hat w$ of the weight vector is obtained based on $S$ and a proper weight selection criterion $A(w)$ that makes $\hat w$ optimal in a certain sense. The selection of $\{A_m, m = 1, 2, \ldots, M\}$ and $A(w)$ is closely related to the definition of the loss function. Let $L[\hat h(x^*, w), y^*]$ be a real-valued loss function defined on $\mathcal{H} \times \mathcal{Y}$, where $\mathcal{Y}$ is the value space of $y^*$. The risk function is then defined as
\[
F(w, S) = E_{z^*}\big\{L[\hat h(x^*, w), y^*]\big\},
\]
which is the MSPE given the sample $S$ and weight vector $w$.

2.2 Related Concepts

In this paper, we mainly discuss whether $F(\hat w, S)$ can approximate the smallest possible risk $\inf_{w\in\mathcal{W}} F(w, S)$. If $A(w)$ has such a property, we say that $A(w)$ is consistent. For fixed $m$, related concepts from Shalev-Shwartz et al. (2010) can be used to discuss the stability and consistency of $A_m$. Obviously, for model averaging, we need to pay more attention to the stability and consistency of the weight selection criterion. We note that the relevant conclusions of Shalev-Shwartz et al. (2010) cannot be directly applied to model averaging because $\mathcal{H}$ depends on $S$. Therefore, we extend the relevant definitions and conclusions to model averaging. The following is the definition of consistency:

Definition 1 (Consistency) If there is a sequence of constants $\{\epsilon_n, n \in \mathbb{N}^+\}$ such that $\epsilon_n = o(1)$ and $\hat w$ satisfies
\[
E_S\big[F(\hat w, S) - \inf_{w\in\mathcal{W}} F(w, S)\big] = O(\epsilon_n),
\]
then $A(w)$ is said to be consistent with rate $\epsilon_n$.

In statistical learning theory, stability concerns how much the algorithm's output varies when $S$ changes a little. Leave-one-out (Loo) and Replace-one (Ro) are two common tools used to evaluate stability. Loo considers the change in the algorithm's output after removing an observation from $S$, and Ro considers such a change after replacing an observation in $S$ with an observation that is independent of $S$. Accordingly, the stability is called Loo stability and Ro stability, respectively. Here, we give the formal definitions of Loo stability and Ro stability. To this end, we first give the definition of algorithm symmetry:

Definition 2 (Symmetry) If the algorithm's output is not affected by the order of the observations in $S$, then the algorithm is symmetric.
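Before turning to stability, the following minimal numerical sketch (ours, not the authors' code) makes the objects just defined concrete; squared error loss, two nested least-squares candidate models, and the particular coefficients are assumptions made purely for illustration, and $F(w, S)$ is approximated by Monte Carlo on fresh draws from $D$:

```python
# A minimal sketch of F(w, S) = E_{z*}[y* - hhat(x*, w)]^2 for a two-model average.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 5
X = rng.normal(size=(n, K))
theta = 1.0 / (1.0 + np.arange(K))           # hypothetical true coefficients
y = X @ theta + rng.normal(size=n)

# Two nested candidate models: the first 2 covariates, and all K covariates.
idx = [slice(0, 2), slice(0, K)]
fits = [np.linalg.lstsq(X[:, s], y, rcond=None)[0] for s in idx]

def F(w, n_test=100_000):
    """Estimate F(w, S): MSPE of the averaged estimator on fresh draws from D."""
    Xt = rng.normal(size=(n_test, K))
    yt = Xt @ theta + rng.normal(size=n_test)
    pred = sum(wm * (Xt[:, s] @ th) for wm, s, th in zip(w, idx, fits))
    return np.mean((yt - pred) ** 2)

print(F([0.3, 0.7]), F([0.0, 1.0]))          # risk of two weight choices
```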
Now let $S^{-i}$ be the sample after removing the $i$-th observation from $S$, $\hat h_m^{-i}$ be the estimator of $h_m$ based on $S^{-i}$ and $A_m$, $\hat w^{-i}$ be the estimator of the weight vector based on $S^{-i}$ and $A(w)$, and
\[
F(w, S^{-i}) = E_{z^*}\big\{L[\hat h^{-i}(x^*, w), y^*]\big\},
\]
where $\hat h^{-i}(x^*, w) = H[w, \hat h_1^{-i}(x^*_{(1)}), \hat h_2^{-i}(x^*_{(2)}), \ldots, \hat h_M^{-i}(x^*_{(M)})]$. We define Loo stability as follows:

Definition 3 (PLoo Stability) If there is a sequence of constants $\{\epsilon_n, n \in \mathbb{N}^+\}$ such that $\epsilon_n = o(1)$ and $A(w)$ satisfies
\[
\frac{1}{n}\sum_{i=1}^n E_S\big|F(\hat w, S) - F(\hat w^{-i}, S^{-i})\big| = O(\epsilon_n),
\]
then $A(w)$ is Predicted-Loo (PLoo) stable with rate $\epsilon_n$. If $\{A_m, m = 1, 2, \ldots, M\}$ and $A(w)$ are symmetric, then a PLoo stable $A(w)$ needs only to satisfy $E_S\big|F(\hat w, S) - F(\hat w^{-n}, S^{-n})\big| = O(\epsilon_n)$.

Definition 4 (FLoo Stability) If there is a sequence of constants $\{\epsilon_n, n \in \mathbb{N}^+\}$ such that $\epsilon_n = o(1)$ and $A(w)$ satisfies
\[
\frac{1}{n}\sum_{i=1}^n E_S\big|L[\hat h(x_i, \hat w), y_i] - L[\hat h^{-i}(x_i, \hat w^{-i}), y_i]\big| = O(\epsilon_n),
\]
then $A(w)$ is Fitted-Loo (FLoo) stable with rate $\epsilon_n$. If $\{A_m, m = 1, 2, \ldots, M\}$ and $A(w)$ are symmetric, then an FLoo stable $A(w)$ needs only to satisfy $E_S\big|L[\hat h(x_n, \hat w), y_n] - L[\hat h^{-n}(x_n, \hat w^{-n}), y_n]\big| = O(\epsilon_n)$.

Let $S^i$ be the sample $S$ with the $i$-th observation replaced by $z_i' = (y_i', x_i'^{\,\prime})'$, $\hat h_m^i$ be the estimator of $h_m$ based on $S^i$ and $A_m$, and $\hat w^i$ be the estimator of the weight vector based on $S^i$ and $A(w)$, where $z_i'$ is drawn from distribution $D$ and independent of $S$. Let
\[
F(w, S^i) = E_{z^*}\big\{L[\hat h^i(x^*, w), y^*]\big\},
\]
where $\hat h^i(x^*, w) = H[w, \hat h_1^i(x^*_{(1)}), \hat h_2^i(x^*_{(2)}), \ldots, \hat h_M^i(x^*_{(M)})]$. Note that
\[
\frac{1}{n}\sum_{i=1}^n E_{S,z_i'}\big[F(\hat w, S) - F(\hat w^i, S^i)\big] = 0.
\]
Therefore, we define Ro stability as follows:

Definition 5 (Ro Stability) If there is a sequence of constants $\{\epsilon_n, n \in \mathbb{N}^+\}$ such that $\epsilon_n = o(1)$ and $A(w)$ satisfies
\[
\frac{1}{n}\sum_{i=1}^n E_{S,z_i'}\big|L[\hat h(x_i, \hat w), y_i] - L[\hat h^i(x_i, \hat w^i), y_i]\big| = O(\epsilon_n),
\]
then $A(w)$ is Ro stable with rate $\epsilon_n$. If $\{A_m, m = 1, 2, \ldots, M\}$ and $A(w)$ are symmetric, then an Ro stable $A(w)$ needs only to satisfy $E_{S,z_n'}\big|L[\hat h(x_n, \hat w), y_n] - L[\hat h^n(x_n, \hat w^n), y_n]\big| = O(\epsilon_n)$.

Before discussing the relationship between stability and consistency, we give the definitions of AERM and generalization. The empirical risk function is defined as
\[
\hat F(w, S) = \frac{1}{n}\sum_{i=1}^n L[\hat h(x_i, w), y_i].
\]

Definition 6 (AERM) If there is a sequence of constants $\{\epsilon_n, n \in \mathbb{N}^+\}$ such that $\epsilon_n = o(1)$ and $A(w)$ satisfies
\[
E_S\big[\hat F(\hat w, S) - \inf_{w\in\mathcal{W}}\hat F(w, S)\big] = O(\epsilon_n),
\]
then $A(w)$ is an AERM with rate $\epsilon_n$.

Vapnik (1998) proved some theoretical properties of the empirical risk minimization principle. However, when the sample size is small, the empirical risk minimizer tends to over-fit. The structural risk minimization principle was therefore proposed in Vapnik (1998). Shalev-Shwartz et al. (2010) also discussed the deficiency of the empirical risk minimization principle and the importance of AERM.

Definition 7 (Generalization) If there is a sequence of constants $\{\epsilon_n, n \in \mathbb{N}^+\}$ such that $\epsilon_n = o(1)$ and $A(w)$ satisfies
\[
E_S\big|\hat F(\hat w, S) - F(\hat w, S)\big| = O(\epsilon_n),
\]
then $A(w)$ generalizes with rate $\epsilon_n$.

In statistical learning theory, generalization refers to the performance on unseen samples of the concept learned by a model. It can be seen from Definition 7 that the generalization of $A(w)$ describes the difference between using $\hat w$ to fit the training set $S$ and to predict unseen samples.

2.3 Relationship between Different Concepts

In this subsection, we study the relationships between the different concepts based on the findings of Bousquet and Elisseeff (2002) and Shalev-Shwartz et al. (2010). Bousquet and Elisseeff (2002) used the triangle inequality to illustrate that Loo stability implies Ro stability.
However, we note that for any $i \in \{1, 2, \ldots, n\}$,
\[
E_{S,z_i'}\big\{L[\hat h(x_i, \hat w), y_i] - L[\hat h^i(x_i, \hat w^i), y_i]\big\}
= E_{S,z_i'}\big\{L[\hat h(x_i, \hat w), y_i] - L[\hat h^{-i}(x_i, \hat w^{-i}), y_i]\big\} + E_{S,z_i'}\big\{L[\hat h^{-i}(x_i, \hat w^{-i}), y_i] - L[\hat h^i(x_i, \hat w^i), y_i]\big\}
\]
\[
= E_S\big\{L[\hat h(x_i, \hat w), y_i] - L[\hat h^{-i}(x_i, \hat w^{-i}), y_i]\big\} + E_S[F(\hat w^{-i}, S^{-i}) - F(\hat w, S)].
\]
From this, we give the following theorem to illustrate the relationship between Loo stability and Ro stability:

Theorem 8 If $A(w)$ has two of FLoo stability, PLoo stability and Ro stability with rate $\epsilon_n$, then it has all three stabilities with rate $\epsilon_n$.

Shalev-Shwartz et al. (2010) emphasized that Ro stability and Loo stability are in general incomparable notions, but Theorem 8 shows that they are closely related. By the definitions of generalization and Ro stability, we have
\[
E_S[\hat F(\hat w, S) - F(\hat w, S)] = E_{S,z_1',z_2',\ldots,z_n'}\Big\{\frac{1}{n}\sum_{i=1}^n L[\hat h(x_i, \hat w), y_i] - \frac{1}{n}\sum_{i=1}^n L[\hat h(x_i', \hat w), y_i']\Big\}
= \frac{1}{n}\sum_{i=1}^n E_{S,z_i'}\big\{L[\hat h(x_i, \hat w), y_i] - L[\hat h^i(x_i, \hat w^i), y_i]\big\}.
\]
From this, we can give the following theorem to illustrate the equivalence of Ro stability and generalization:

Theorem 9 $A(w)$ has Ro stability with rate $\epsilon_n$ if and only if $A(w)$ generalizes with rate $\epsilon_n$.

For symmetric algorithms, this result has been given by Lemma 7 of Bousquet and Elisseeff (2002), and extending that lemma to the asymmetric case is straightforward. For the theoretical completeness of this section, we still present the result here as a theorem. Theorem 9 shows that stability is an important property of weight selection criteria, which can ensure that the corresponding estimator has good generalization performance.

Let $\hat w^* \in \mathcal{W}$ satisfy $F(\hat w^*, S) = \inf_{w\in\mathcal{W}} F(w, S)$. Note that
\[
E_S[F(\hat w, S) - F(\hat w^*, S)] = E_S[F(\hat w, S) - \hat F(\hat w, S) + \hat F(\hat w, S) - \hat F(\hat w^*, S) + \hat F(\hat w^*, S) - F(\hat w^*, S)]
\]
\[
\le E_S\big[F(\hat w, S) - \hat F(\hat w, S) + \hat F(\hat w, S) - \inf_{w\in\mathcal{W}}\hat F(w, S) + \hat F(\hat w^*, S) - F(\hat w^*, S)\big].
\]
We give the following theorem to illustrate the relationship between stability and consistency:

Theorem 10 If $A(w)$ is an AERM and has Ro stability with rate $\epsilon_n$, and $\hat w^*$ satisfies $E_S\big|\hat F(\hat w^*, S) - F(\hat w^*, S)\big| = O(\epsilon_n)$, then $A(w)$ is consistent with rate $\epsilon_n$.

Remark 11 For general learning algorithms, Bousquet and Elisseeff (2002) studied how to use stability to derive generalization error bounds based on the empirical error and the leave-one-out error, while we study the relationships among AERM, stability, generalization and consistency in model averaging. Although the relationships among Loo stability, Ro stability and generalization demonstrated by Bousquet and Elisseeff (2002) are applicable to model averaging, no relationship between stability and consistency is explored there. From Theorem 10, we can see the close relationship between stability and consistency in model averaging.

Remark 12 If for any $v > 0$ there exists a $w_v$ independent of $S$ such that $F(w_v, S) \le F(\hat w^*, S) + v$, then Lemma 15 of Shalev-Shwartz et al. (2010) can be applied to model averaging. However, since $\hat w^*$ and $\mathcal{H}$ depend on $S$, it is difficult to guarantee that such a $w_v$ always exists, and so we cannot immediately obtain Theorem 10 from the proof of Lemma 15 in Shalev-Shwartz et al. (2010). Further, on the basis of Lemma 15 of Shalev-Shwartz et al. (2010), we find that this requirement on the existence of $\{w_v : v > 0\}$ can be replaced by the requirement that $\hat w^*$ generalizes with rate $\epsilon_n$.

In the next section, we will propose an L2-penalty model averaging method and prove that it has stability and consistency under certain reasonable conditions.
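Before moving to Section 3, a small simulation sketch (ours, not an experiment from the paper) of the quantity in Theorem 9: it estimates the generalization gap $E_S[\hat F(\hat w, S) - F(\hat w, S)]$ for the unrestricted empirical-risk-minimizing weights in the same toy setup as above; when the weight selection rule is stable, this gap is small relative to the risk itself.

```python
# Monte Carlo estimate of the generalization gap E_S[F_hat(w_hat,S) - F(w_hat,S)].
import numpy as np

rng = np.random.default_rng(1)
n, K, reps = 100, 5, 200
theta = 1.0 / (1.0 + np.arange(K))
idx = [slice(0, 2), slice(0, K)]

gaps = []
for _ in range(reps):
    X = rng.normal(size=(n, K)); y = X @ theta + rng.normal(size=n)
    fits = [np.linalg.lstsq(X[:, s], y, rcond=None)[0] for s in idx]
    Omega = np.column_stack([X[:, s] @ th for s, th in zip(idx, fits)])
    w = np.linalg.lstsq(Omega, y, rcond=None)[0]       # unrestricted ERM weights
    emp = np.mean((y - Omega @ w) ** 2)                # F_hat(w_hat, S)
    Xt = rng.normal(size=(20_000, K)); yt = Xt @ theta + rng.normal(size=20_000)
    pred = sum(wm * (Xt[:, s] @ th) for wm, s, th in zip(w, idx, fits))
    gaps.append(emp - np.mean((yt - pred) ** 2))       # minus F(w_hat, S)
print(np.mean(gaps))                                   # small when the rule is stable
```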
3. L2-penalty Model Averaging

In most of the existing literature on model averaging, theoretical properties are explored under the weight set $\mathcal{W}^0 = \{w \in [0, 1]^M : \sum_{m=1}^M w_m = 1\}$. From Definition 1, it is seen that even if the corresponding weight selection criterion is consistent, such consistency holds only on this subspace of $\mathbb{R}^M$. A natural question, therefore, is whether it is possible to leave the weight space unrestricted, and what happens when we do so. The unrestricted Granger-Ramanathan method obtains the estimator of the weight vector over $\mathbb{R}^M$ by minimizing the sum of squared forecast errors of the combination forecast. However, it performs poorly when compared with some other methods (Hansen, 2008). One possible reason for this is that it merely removes the weight constraints without addressing the resulting weight instability. In Section 3.2, we provide relevant explanations for the causes of this instability. On the other hand, in a prediction task, we are more concerned about whether the resulting estimator predicts well. Intuitively, the estimator that minimizes MSPE over the full space would most likely outperform the estimator that minimizes MSPE over a subspace. To demonstrate this point, some new ideas are needed.

3.1 Model Framework and Estimators

We assume that the response variable $y_i$ and covariates $x_{1i}, x_{2i}, \ldots$ satisfy the following data generating process:
\[
y_i = \mu_i + e_i = \sum_{k=1}^{\infty} x_{ki}\theta_k + e_i,
\]
and the $M$ candidate models are given by
\[
y_i = \sum_{k=1}^{k_m} x_{m(k)i}\theta_{m,(k)} + e_i, \quad m = 1, 2, \ldots, M.
\]
We assume that the $M$-th candidate model contains all of the considered covariates, and define $b_{mi} = \mu_i - \sum_{k=1}^{k_m} x_{m(k)i}\theta_{m,(k)}$ as the approximation error of the $m$-th candidate model. Let $x_i = (x_{(1)i}, x_{(2)i}, \ldots, x_{(k_M)i})'$, where $x_{(k)i} = x_{M(k)i}$, and let $S = \{z_i = (y_i, x_i')', i = 1, 2, \ldots, n\}$ be a simple random sample from distribution $D$ throughout the rest of this article. Let $y = (y_1, y_2, \ldots, y_n)'$, $\mu = (\mu_1, \mu_2, \ldots, \mu_n)'$, $e = (e_1, e_2, \ldots, e_n)'$ and $b_m = (b_{m1}, b_{m2}, \ldots, b_{mn})'$. Then the corresponding matrix form of the true model is
\[
y = X_m\theta_m + b_m + e,
\]
where $\theta_m = (\theta_{m,(1)}, \theta_{m,(2)}, \ldots, \theta_{m,(k_m)})'$ and $X_m$ is the design matrix of the $m$-th candidate model. When there is a candidate model such that $b_m = 0$, the true model is included in the $m$-th candidate model, i.e., the model is correctly specified. Unlike Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we do not require that $\inf_{w\in\mathcal{W}^0} R_n(w)$ (defined in Section 3.2) tends to infinity. Therefore, we allow the model to be correctly specified.

Let $\pi_m \in \mathbb{R}^{K\times k_m}$ be the variable selection matrix satisfying $X_M\pi_m = X_m$ and $\pi_m'\pi_m = I_{k_m}$, $m = 1, 2, \ldots, M$. Then the hypothesis spaces of the $M$ candidate models are $\mathcal{H}_m = \{x^{*\prime}\pi_m\theta_m, \theta_m \in \mathbb{R}^{k_m}\}$, $m = 1, 2, \ldots, M$. The least squares estimator of $\theta_m$ is $\hat\theta_m = (X_m'X_m)^{-1}X_m'y$, $m = 1, 2, \ldots, M$.

3.2 Weight Selection Criterion

Let $P_m = X_m(X_m'X_m)^{-1}X_m'$, $P(w) = \sum_{m=1}^M w_m P_m$, $L_n(w) = \|\mu - P(w)y\|_2^2$ and $R_n(w) = E_e[L_n(w)]$. When $\sigma_i^2 \equiv \sigma^2$, Hansen (2007) and Wan et al. (2010) used the Mallows criterion
\[
C_n(w) = \|y - \hat\Omega w\|_2^2 + 2\sigma^2 w'\kappa
\]
to select a model weight vector from $\mathcal{W}^0$ and proved that the estimator of the weight vector asymptotically minimizes $L_n(w)$, where $\hat\Omega = (P_1y, P_2y, \ldots, P_My)$ and $\kappa = (k_1, k_2, \ldots, k_M)'$. Hansen and Racine (2012) used the jackknife criterion
\[
J_n(w) = \|y - \tilde\Omega w\|_2^2
\]
to select a model weight vector from $\mathcal{W}^0$ and proved that the estimator of the weight vector asymptotically minimizes $L_n(w)$ and $R_n(w)$, where
\[
\tilde\Omega = \big[y - D_1(I - P_1)y,\ y - D_2(I - P_2)y,\ \ldots,\ y - D_M(I - P_M)y\big]
\]
with $D_m = \mathrm{diag}[(1 - h_{ii}^m)^{-1}]$ and $h_{ii}^m = x_i'\pi_m(X_m'X_m)^{-1}\pi_m'x_i$, $i = 1, 2, \ldots, n$.
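For concreteness, the following sketch (ours; plain numpy, with no claim to match the authors' implementation) builds $\hat\Omega$ and $\tilde\Omega$ for a list of nested OLS candidate models and evaluates $C_n(w)$ and $J_n(w)$ at a given weight vector:

```python
# A minimal sketch of the Mallows and jackknife criteria C_n(w) and J_n(w).
import numpy as np

def criteria(X_list, y, w, sigma2=1.0):
    cols_hat, cols_tilde, kappa = [], [], []
    for Xm in X_list:
        Pm = Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T)      # hat matrix P_m
        h = np.diag(Pm)                                  # leverages h_ii^m
        cols_hat.append(Pm @ y)                          # column of Omega_hat
        cols_tilde.append(y - (y - Pm @ y) / (1 - h))    # LOO fit: y - D_m(I-P_m)y
        kappa.append(Xm.shape[1])                        # k_m
    Omega_hat, Omega_tilde = map(np.column_stack, (cols_hat, cols_tilde))
    Cn = np.sum((y - Omega_hat @ w) ** 2) + 2 * sigma2 * np.dot(w, kappa)
    Jn = np.sum((y - Omega_tilde @ w) ** 2)
    return Cn, Jn

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)); y = X @ np.ones(4) + rng.normal(size=50)
print(criteria([X[:, :2], X], y, w=np.array([0.5, 0.5])))
```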
Different from Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we focus on whether the model averaging estimator can asymptotically minimize the MSPE when the model weights are not restricted. Let $\hat\gamma^* = (x^{*\prime}\pi_1\hat\theta_1, x^{*\prime}\pi_2\hat\theta_2, \ldots, x^{*\prime}\pi_M\hat\theta_M)$. Then the risk function and the empirical risk function are defined as
\[
F(w, S) = E_{z^*}\big\{[y^* - x^{*\prime}\hat\theta(w)]^2\big\} = E_{z^*}[(y^* - \hat\gamma^*w)^2]
\]
and
\[
\hat F(w, S) = \frac{1}{n}\sum_{i=1}^n [y_i - x_i'\hat\theta(w)]^2 = \frac{1}{n}\|y - \hat\Omega w\|_2^2,
\]
respectively, where $\hat\theta(w) = \sum_{m=1}^M w_m\pi_m\hat\theta_m$. Since Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012) restricted $w \in \mathcal{W}^0$, the corresponding estimators of the weight vector do not necessarily asymptotically minimize $F(w, S)$ on $\mathbb{R}^M$. An intuitive way to make the estimator of the weight vector asymptotically minimize $F(w, S)$ on $\mathbb{R}^M$ is to remove the restriction $w \in \mathcal{W}^0$ directly. Let $\hat P$ and $\tilde P$ be the orthogonal matrices satisfying $\hat P'\hat\Omega'\hat\Omega\hat P = \mathrm{diag}(\hat\zeta_1, \hat\zeta_2, \ldots, \hat\zeta_M)$ and $\tilde P'\tilde\Omega'\tilde\Omega\tilde P = \mathrm{diag}(\tilde\zeta_1, \tilde\zeta_2, \ldots, \tilde\zeta_M)$, where $\hat\zeta_1 \le \hat\zeta_2 \le \cdots \le \hat\zeta_M$ and $\tilde\zeta_1 \le \tilde\zeta_2 \le \cdots \le \tilde\zeta_M$ are the eigenvalues of $\hat\Omega'\hat\Omega$ and $\tilde\Omega'\tilde\Omega$, respectively. We assume that $E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)$, $\hat\Omega'\hat\Omega$ and $\tilde\Omega'\tilde\Omega$ are invertible (this is reasonable under Assumption 3); then
\[
\hat w^0 = \arg\min_{w\in\mathbb{R}^M} C_n(w) = (\hat\Omega'\hat\Omega)^{-1}(\hat\Omega'y - \sigma^2\kappa), \qquad
\tilde w^0 = \arg\min_{w\in\mathbb{R}^M} J_n(w) = (\tilde\Omega'\tilde\Omega)^{-1}\tilde\Omega'y,
\]
\[
\bar w = \arg\min_{w\in\mathbb{R}^M} \hat F(w, S) = (\hat\Omega'\hat\Omega)^{-1}\hat\Omega'y \qquad\text{and}\qquad
\hat w^* = \arg\min_{w\in\mathbb{R}^M} F(w, S) = [E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)]^{-1}E_{z^*}(\hat\gamma^{*\prime}y^*).
\]
In order to achieve consistency, $\hat w^0$ and $\tilde w^0$ should be good estimators of $\hat w^*$. However, when the candidate models are highly correlated, the minimum eigenvalues of $\hat\Omega'\hat\Omega$ and $\tilde\Omega'\tilde\Omega$ may be so small that
\[
\|\hat w^0\|_2^2 = \sum_{m=1}^M \frac{a_m^2}{\hat\zeta_m^2} \ge \frac{a_1^2}{\hat\zeta_1^2} \qquad\text{and}\qquad \|\tilde w^0\|_2^2 = \sum_{m=1}^M \frac{b_m^2}{\tilde\zeta_m^2} \ge \frac{b_1^2}{\tilde\zeta_1^2}
\]
become too large, which usually results in extreme weights, where $(a_1, a_2, \ldots, a_M)' = \hat P'(\hat\Omega'y - \sigma^2\kappa)$ and $(b_1, b_2, \ldots, b_M)' = \tilde P'\tilde\Omega'y$. Therefore, similar to ridge regression in Hoerl and Kennard (1970), we make the following corrections to $C_n(w)$ and $J_n(w)$:
\[
C(w, S) = C_n(w) + \lambda_n w'w \qquad\text{and}\qquad J(w, S) = J_n(w) + \lambda_n w'w,
\]
where $\lambda_n \ge 0$ is a tuning parameter. These corrections are in fact an L2-penalty on the weight vector. Let $\hat Z = (\hat\Omega'\hat\Omega + \lambda_nI)^{-1}\hat\Omega'\hat\Omega$ and $\tilde Z = (\tilde\Omega'\tilde\Omega + \lambda_nI)^{-1}\tilde\Omega'\tilde\Omega$. Then
\[
\hat w = \arg\min_{w\in\mathbb{R}^M} C(w, S) = (\hat\Omega'\hat\Omega + \lambda_nI)^{-1}(\hat\Omega'y - \sigma^2\kappa) = \hat Z\hat w^0
\]
and
\[
\tilde w = \arg\min_{w\in\mathbb{R}^M} J(w, S) = (\tilde\Omega'\tilde\Omega + \lambda_nI)^{-1}\tilde\Omega'y = \tilde Z\tilde w^0.
\]
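A direct sketch of the two penalized weight estimators (ours; `Omega_hat`, `Omega_tilde` and `kappa` are assumed to be built as in the previous sketch):

```python
# RMMA: w_hat = (Omega_hat'Omega_hat + lam I)^{-1}(Omega_hat'y - sigma^2 kappa).
# RJMA: w_tilde = (Omega_tilde'Omega_tilde + lam I)^{-1} Omega_tilde'y.
import numpy as np

def rmma_weights(Omega_hat, y, kappa, sigma2, lam):
    M = Omega_hat.shape[1]
    A = Omega_hat.T @ Omega_hat + lam * np.eye(M)
    return np.linalg.solve(A, Omega_hat.T @ y - sigma2 * np.asarray(kappa))

def rjma_weights(Omega_tilde, y, lam):
    M = Omega_tilde.shape[1]
    A = Omega_tilde.T @ Omega_tilde + lam * np.eye(M)
    return np.linalg.solve(A, Omega_tilde.T @ y)

# Setting lam = 0 recovers the unrestricted minimizers w_hat^0 and w_tilde^0.
```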
In the next subsection, we discuss the theoretical properties of $C(w, S)$ and $J(w, S)$.

3.3 Stability and Consistency

Let $\lambda_{\min}(\cdot)$ be the minimum eigenvalue of a square matrix, $X_m^{-n}$ be the matrix after removing the $n$-th row of $X_m$, $X_m^{-(n-1,n)}$ be the matrix after removing the $(n-1)$-th and $n$-th rows of $X_m$, $y^{-n}$ be the vector after removing the $n$-th element of $y$, and $\chi$ be the value space of the $K$ covariates, where $K = k_M$. Then we define
\[
\hat\Omega^{-n} = (X_1^{-n}\hat\theta_1^{-n}, X_2^{-n}\hat\theta_2^{-n}, \ldots, X_M^{-n}\hat\theta_M^{-n}),
\qquad
\tilde\Omega^{-n} = \big[(I_{n-1} - D_m^{-n})y^{-n} + D_m^{-n}X_m^{-n}\hat\theta_m^{-n} : m = 1, 2, \ldots, M\big]
\]
and $\hat\gamma^{*-n} = (x^{*\prime}\pi_1\hat\theta_1^{-n}, x^{*\prime}\pi_2\hat\theta_2^{-n}, \ldots, x^{*\prime}\pi_M\hat\theta_M^{-n})$, where $\hat\theta_m^{-n} = (X_m^{-n\prime}X_m^{-n})^{-1}X_m^{-n\prime}y^{-n}$ and $D_m^{-n} = \mathrm{diag}[(1 - h_{ii}^{m,-n})^{-1} : i \ne n]$ with $h_{ii}^{m,-n} = x_i'\pi_m(X_m^{-n\prime}X_m^{-n})^{-1}\pi_m'x_i$. In order to discuss the stability and consistency of the proposed method, we need the following assumptions:

Assumption 1 There is a constant $C_1 > 0$ such that $\lambda_{\min}(n^{-1}X_M^{-(n-1,n)\prime}X_M^{-(n-1,n)}) \ge C_1$, a.s..

Assumption 2 There is a constant $C_2 > 0$ such that $n^{-1}y'y \le C_2$, a.s.; there is a constant $C_3 > 0$ such that $\chi \subseteq B(0_K, C_3)$ a.s., where $B(0_K, C_3)$ is the sphere with center $0_K$ and radius $C_3$, and $0_K$ is the $K$-dimensional vector of zeros.

Assumption 3 There is a constant $C_4 > 0$ such that $\min[\lambda_{\min}(n^{-1}\hat\Omega'\hat\Omega), \lambda_{\min}(n^{-1}\hat\Omega^{-n\prime}\hat\Omega^{-n})] \ge C_4$, a.s., $\min[\lambda_{\min}(n^{-1}\tilde\Omega'\tilde\Omega), \lambda_{\min}(n^{-1}\tilde\Omega^{-n\prime}\tilde\Omega^{-n})] \ge C_4$, a.s., and $\min\{\lambda_{\min}[E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)], \lambda_{\min}[E_{z^*}(\hat\gamma^{*-n\prime}\hat\gamma^{*-n})]\} \ge C_4$, a.s..

Assumption 4 There is a constant $C_5 > 0$ such that $\max(\|\hat w^*\|_2^2, \|\hat w^{*-n}\|_2^2) \le C_5$, a.s., where $\hat w^{*-n} = \arg\min_{w\in\mathbb{R}^M} F(w, S^{-n})$ and $F(w, S^{-n}) = E_{z^*}\{[y^* - x^{*\prime}\hat\theta^{-n}(w)]^2\}$.

Assumption 5 $K^3M^2 = o(n)$ and $\lambda_n = O(K^2M)$.

Assumption 1 is weak, as $n^{-1}X_M'X_M$ is often positive definite. Similar assumptions are often made in the literature. For example, Condition (b) of Zou (2006) and Condition (C.1) of Zhang and Liu (2019) require $n^{-1}X_M'X_M$ to converge to a positive definite matrix. In particular, Zhang and Liu (2019) point out that if $y_i$ is a stationary and ergodic martingale difference sequence with finite fourth moments, then their Condition (C.1) holds. Here we provide an example where Assumption 1 holds. Let $x_1, \ldots, x_n$ be i.i.d. Gaussian random vectors with zero mean and unit covariance matrix, and let $K/n \to B \in (0, 1)$. Then, by Theorem 2 of Bai and Yin (1993), we see that $\lim_{n\to\infty}\lambda_{\min}(n^{-1}X_M'X_M) = (1 - \sqrt{B})^2 > 0$, a.s., which shows that Assumption 1 holds. Shalev-Shwartz et al. (2010) assumed that the loss function is bounded, which is usually not satisfied in traditional regression analysis. We replace this assumption with Assumption 2. Tong and Wu (2017) assumed that $\chi \times \mathcal{Y}$ is a compact subset of $\mathbb{R}^{K+1}$, under which Assumption 2 is obviously true. Assumption 3 requires that the matrices $n^{-1}\hat\Omega'\hat\Omega$ and $n^{-1}\hat\Omega^{-n\prime}\hat\Omega^{-n}$ are almost always positive definite, which is similar to Condition (C.4) of Liao and Zou (2020). Lemmas 22 and 23 justify the rationality of this assumption regarding the eigenvalues of $n^{-1}\hat\Omega'\hat\Omega$. Lemma 24 guarantees the rationality of this assumption about the eigenvalues of $\tilde\Omega'\tilde\Omega$, $\tilde\Omega^{-n\prime}\tilde\Omega^{-n}$, $E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)$ and $E_{z^*}(\hat\gamma^{*-n\prime}\hat\gamma^{*-n})$. Assumption 4 requires that the L2-norms of the optimal weights $\hat w^*$ and $\hat w^{*-n}$ are bounded. Lemma 25 shows that Assumption 4 has a certain rationality, and Lemma 29 provides a case where Assumption 4 holds. Assumption 5 limits the growth rates of the numbers of covariates and models, and also makes a mild assumption on $\lambda_n$ to avoid an excessive penalty.

Let $\hat V(\lambda_n) = \|\hat Z\hat w^0 - \hat Z\hat w^*\|_2^2$, $\hat B(\lambda_n) = \|\hat Z\hat w^* - \hat w^*\|_2^2$, $\tilde V(\lambda_n) = \|\tilde Z\tilde w^0 - \tilde Z\hat w^*\|_2^2$ and $\tilde B(\lambda_n) = \|\tilde Z\hat w^* - \hat w^*\|_2^2$. We define
\[
\hat{\mathcal{M}}(\lambda_n) = \|\hat Z\hat w^0 - \hat w^*\|_2^2 = \hat V(\lambda_n) + \hat B(\lambda_n) + 2(\hat Z\hat w^0 - \hat Z\hat w^*)'(\hat Z\hat w^* - \hat w^*)
\]
and
\[
\tilde{\mathcal{M}}(\lambda_n) = \|\tilde Z\tilde w^0 - \hat w^*\|_2^2 = \tilde V(\lambda_n) + \tilde B(\lambda_n) + 2(\tilde Z\tilde w^0 - \tilde Z\hat w^*)'(\tilde Z\hat w^* - \hat w^*).
\]
In order to make $F(\hat Z\hat w^0, S)$ and $F(\tilde Z\tilde w^0, S)$ better approximate $F(\hat w^*, S)$, we naturally hope that $E_S[\hat{\mathcal{M}}(\lambda_n)]$ and $E_S[\tilde{\mathcal{M}}(\lambda_n)]$ are as small as possible. In the following discussion, we refer to $E_S[\hat{\mathcal{M}}(\lambda_n)]$ and $E_S[\tilde{\mathcal{M}}(\lambda_n)]$, $E_S[\hat V(\lambda_n)]$ and $E_S[\tilde V(\lambda_n)]$, and $E_S[\hat B(\lambda_n)]$ and $E_S[\tilde B(\lambda_n)]$ as the corresponding mean squared errors, estimation variances and estimation biases of the model weight estimator, respectively. Obviously, when $\lambda_n = 0$, $\hat Z = \tilde Z = I_M$, which means that the estimation bias is equal to zero. From Assumption 3 and Lemma 25, we see that under Assumptions 1-5, $\hat B(\lambda_n)$ and $\tilde B(\lambda_n)$ are $O(n^{-2}\lambda_n^2)$ a.s.. On the other hand, the existence of extreme weights may make the performance of $\hat w^0$ and $\tilde w^0$ extremely unstable. So the purpose of introducing the L2-penalty is to reduce the estimation variance by introducing estimation bias, which results in a stable performance of the model averaging estimator. Further, we define $\hat{\mathcal{M}}_1(\lambda_n) = \hat V(\lambda_n) + \hat B(\lambda_n)$ and $\tilde{\mathcal{M}}_1(\lambda_n) = \tilde V(\lambda_n) + \tilde B(\lambda_n)$. Like Theorem 4.3 in Hoerl and Kennard (1970), we give the following theorem to illustrate the reasonability of introducing the L2-penalty:

Theorem 13 Let $\hat\lambda_n = \min\{\lambda_n : \frac{d}{d\lambda_n}\hat{\mathcal{M}}_1(\lambda_n) = 0\}$ and $\tilde\lambda_n = \min\{\lambda_n : \frac{d}{d\lambda_n}\tilde{\mathcal{M}}_1(\lambda_n) = 0\}$. Then, 1) when $\hat w^0 \ne \hat w^*$, $\hat\lambda_n > 0$ and $\hat{\mathcal{M}}_1(\hat\lambda_n) < \hat{\mathcal{M}}_1(0)$; 2) when $\tilde w^0 \ne \hat w^*$, $\tilde\lambda_n > 0$ and $\tilde{\mathcal{M}}_1(\tilde\lambda_n) < \tilde{\mathcal{M}}_1(0)$.
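To see the mechanism behind Theorem 13, the following short calculation (our sketch, in the spirit of Theorem 4.3 of Hoerl and Kennard (1970); the eigencoordinates $c$ and $d$ are shorthand introduced here, not the paper's notation) writes $\hat{\mathcal{M}}_1(\lambda_n)$ in the eigenbasis of $\hat\Omega'\hat\Omega$:

```latex
% Sketch: since \hat P'\hat\Omega'\hat\Omega\hat P = \mathrm{diag}(\hat\zeta_1,\ldots,\hat\zeta_M),
% \hat Z scales the j-th eigencoordinate by \hat\zeta_j/(\hat\zeta_j+\lambda_n).
% With d = \hat P'(\hat w^0 - \hat w^*) and c = \hat P'\hat w^*:
\hat V(\lambda_n) = \sum_{j=1}^{M}\frac{\hat\zeta_j^2 d_j^2}{(\hat\zeta_j+\lambda_n)^2},
\qquad
\hat B(\lambda_n) = \sum_{j=1}^{M}\frac{\lambda_n^2 c_j^2}{(\hat\zeta_j+\lambda_n)^2},
% so that
\frac{d}{d\lambda_n}\hat{\mathcal M}_1(\lambda_n)
  = \sum_{j=1}^{M}\frac{2\hat\zeta_j\,(\lambda_n c_j^2 - \hat\zeta_j d_j^2)}{(\hat\zeta_j+\lambda_n)^3},
\qquad
\left.\frac{d}{d\lambda_n}\hat{\mathcal M}_1(\lambda_n)\right|_{\lambda_n=0}
  = -2\sum_{j=1}^{M}\frac{d_j^2}{\hat\zeta_j} < 0 \ \text{when } \hat w^0 \ne \hat w^*.
```

So the mean squared error initially decreases as $\lambda_n$ moves away from zero, which is why a strictly positive $\hat\lambda_n$ beats $\lambda_n = 0$ in Theorem 13.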
Theorem 13 shows that the use of the L2-penalty reduces the estimation variance by introducing estimation bias. However, since $\hat w^*$ is unknown, $\hat\lambda_n$ and $\tilde\lambda_n$ are also unknown. In Section 3.4, we use cross validation to select the tuning parameter $\lambda_n$. The following theorem shows that $C(w, S)$ and $J(w, S)$ are AERMs.

Theorem 14 Under Assumptions 1-5, both $C(w, S)$ and $J(w, S)$ are AERMs with rate $n^{-1}K^2M$.

The following theorem shows that $C(w, S)$, $J(w, S)$ and $F(w, S)$ have FLoo stability and PLoo stability.

Theorem 15 Under Assumptions 1-5, $C(w, S)$, $J(w, S)$ and $F(w, S)$ all have FLoo stability and PLoo stability with rate $n^{-1}K^3M^2$.

It can be seen from Theorems 8, 9 and 15 that $C(w, S)$, $J(w, S)$ and $F(w, S)$ have Ro stability and generalization. The following theorem shows that $C(w, S)$ and $J(w, S)$ are consistent; it is a direct consequence of Theorems 8-10 and 14-15.

Theorem 16 Under Assumptions 1-5, both $C(w, S)$ and $J(w, S)$ have consistency with rate $n^{-1}K^3M^2$.

From Theorem 3 and Proposition 3 of Mourtada (2022), we see that the excess quadratic risk of the largest candidate model is of order $O(K/n)$ in linear regression with random design. However, Mourtada (2022) needs to assume $E(\|x^*\|_2^2) < \infty$, which is usually not true when $K$ diverges; for example, when $x^*_{(1)}, x^*_{(2)}, \ldots, x^*_{(K)}$ are independently and identically distributed as $U(-1, 1)$, $E(\|x^*\|_2^2) = K/3$ diverges with $K$. Under the assumption that the functions to be aggregated are bounded and independent of the current data, Theorem 4 of Tsybakov (2003) showed that the excess quadratic risk of linear aggregation is of order $O(M/n)$. This boundedness assumption is often not met in regression analysis. For $C(w, S)$, in order to obtain the faster rate $O(KM/n)$, we need the following additional assumptions:

Assumption 6 $E_S\{\Lambda^2_{\max}[E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*) - \hat\Omega'\hat\Omega/n]\} = O(n^{-2}K^4M^2)$, where $\Lambda_{\max}(\cdot)$ represents the maximum singular value of the corresponding matrix.

Assumption 7 $\max_{1\le k\le K}\mathrm{var}(x^*_{(k)}y^*) \le C_6$.

Theorem 17 Under Assumptions 1-3 and 5-7, $C(w, S)$ has consistency with rate $O(n^{-1}KM)$.

Remark 18 From the proof of Lemma 24 and Gershgorin's theorem, we have $\Lambda^2_{\max}\{E_S[E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*) - \hat\Omega'\hat\Omega/n]\} = O(n^{-2}K^4M^2)$; hence Assumption 6 possesses a certain degree of rationality. Assumption 7 is a common constraint on second-order moments.

Remark 19 Let $\hat\theta = (\pi_1\hat\theta_1, \pi_2\hat\theta_2, \ldots, \pi_M\hat\theta_M)$. When $\lambda_{\min}(\hat\theta\hat\theta') \ge C_7$, a.s. (this requires that the matrix $\hat\theta\hat\theta'$ is positive definite; Lemma 22 provides a case where this requirement holds), from the Marcinkiewicz-Zygmund-Burkholder inequality in Lin and Bai (2010), we have
\[
E_S\big\{\big\|\hat\theta'[X_M'y/n - E_{z^*}(x^*y^*)]\big\|_2^2\big\} \ge C_7 E_S\big\{[X_M'y/n - E_{z^*}(x^*y^*)]'[X_M'y/n - E_{z^*}(x^*y^*)]\big\}
\]
\[
= C_7\sum_{k=1}^{K} E_S\Big\{\Big[\frac{1}{n}\sum_{i=1}^n \big(x_{(k)i}y_i - E_{z^*}(x^*_{(k)}y^*)\big)\Big]^2\Big\}
\ge 4^{-1}n^{-2}KC_7\min_{1\le k\le K} E_S\Big\{\sum_{i=1}^n \big[x_{(k)i}y_i - E_{z^*}(x^*_{(k)}y^*)\big]^2\Big\}
\]
\[
= 4^{-1}n^{-1}KC_7\min_{1\le k\le K} E_S\Big\{\frac{1}{n}\sum_{i=1}^n \big[x_{(k)i}y_i - E_{z^*}(x^*_{(k)}y^*)\big]^2\Big\}
= 4^{-1}n^{-1}KC_7\min_{1\le k\le K}\mathrm{var}(x_{(k)i}y_i).
\]
Thus, from the proof of Theorem 17 and
\[
E_S\big[\|E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)(\hat w - \hat w^*)\|_2^2\big] \le E_S\big\{\lambda_{\max}[E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)]\,(\hat w - \hat w^*)'E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)(\hat w - \hat w^*)\big\},
\]
we see that when $\min_{1\le k\le K}\mathrm{var}(x^*_{(k)}y^*)$ has a non-zero lower bound and $\lambda_{\max}[E_{z^*}(\hat\gamma^{*\prime}\hat\gamma^*)] = O(1)$, a.s., the rate of consistency of $C(w, S)$ is not lower than $O(n^{-1}K)$. Thus, when $M$ is bounded, Theorem 17 indicates that the rate $O(n^{-1}K)$ is optimal. Further, from the proof of Theorem 17, we see that when $\lambda_{\max}(\hat\theta\hat\theta') = O(1)$, a.s., the rate $O(n^{-1}K)$ is also optimal even if $M$ diverges. For $J(w, S)$, a similar conclusion can be obtained by using (9) and Lemma 25.

3.4 Optimal Weighting Based on Cross Validation

Although Theorem 13 shows that there exist $\hat\lambda_n$ and $\tilde\lambda_n$ such that $\hat w$ and $\tilde w$ are better than $\hat w^0$ and $\tilde w^0$, $\hat\lambda_n$ and $\tilde\lambda_n$ cannot be obtained.
Therefore, like Schomaker (2012), we propose an algorithm based on cross validation to obtain the estimator of the weight vector, which is a weighted average of the weight estimators for different tuning parameters. Specifically, we first select $L$ equally spaced points on $[0, M\log n]$ as the candidates for $\lambda_n$. Then we compute the estimation error of each candidate using cross validation. Based on this, we remove the candidates with large estimation errors. Lastly, for the remaining candidates, the estimation errors are used to perform a weighted average of the estimators of the weight vector. We summarize our algorithm for RMMA below; a similar algorithm can be given for RJMA.

Algorithm 1 Optimal weighting based on cross validation
Require: $S$, $L \ge 1$, $B \ge 2$, $l \in [1, L]$, $b \in [1, B-1]$;
Ensure: $\hat w$;
1: $\hat E_\ell = 0$, $\ell = 1, 2, \ldots, L$;
2: randomly divide the sample $S$ into $B$ equal-sized subsets, and denote the sample index set of the $j$-th part by $S_j$, $j = 1, 2, \ldots, B$;
3: for each $j \in \{1, 2, \ldots, B\}$ do
4:   if $j + b \le B$ then
5:     let $B_{\mathrm{idx}} = (j, j+1, \ldots, j+b)$;
6:   else
7:     let $B_{\mathrm{idx}} = (j, j+1, \ldots, B, 1, 2, \ldots, j+b-B)$;
8:   let $S_{\mathrm{train}} = \{S_i, i \in B_{\mathrm{idx}}\}$ and $S_{\mathrm{test}} = \{S_i, i \notin B_{\mathrm{idx}}\}$;
9:   obtain $\hat\theta_m^{(j)}$ based on $S_{\mathrm{train}}$, $m = 1, 2, \ldots, M$;
10:  for each $\ell \in \{1, 2, \ldots, L\}$ do
11:    obtain $\hat w^{(j)}_\ell$ based on $\lambda_n = \frac{(\ell-1)M\log n}{L-1}$ and $C(w, S_{\mathrm{train}})$;
12:    let $\hat\theta^{(j)}(\hat w^{(j)}_\ell) = \sum_{m=1}^M \hat w^{(j)}_{\ell m}\pi_m\hat\theta_m^{(j)}$;
13:    compute the estimation error of $\hat w^{(j)}_\ell$ on $S_{\mathrm{test}}$: $\hat E(\hat w^{(j)}_\ell) = \sum_{z_i\in S_{\mathrm{test}}}[y_i - x_i'\hat\theta^{(j)}(\hat w^{(j)}_\ell)]^2$;
14:    $\hat E_\ell = \hat E_\ell + \hat E(\hat w^{(j)}_\ell)$;
15: let $S_\lambda$ be the index set of the $l$ smallest numbers in $\{\hat E_\ell, \ell = 1, 2, \ldots, L\}$;
16: obtain $\hat w_\ell$ based on $\lambda_n = \frac{(\ell-1)M\log n}{L-1}$, $S$ and $C(w, S)$, for $\ell \in S_\lambda$;
17: $\hat w = \sum_{\ell\in S_\lambda}\exp(-0.5\hat E_\ell)\hat w_\ell \big/ \sum_{\ell\in S_\lambda}\exp(-0.5\hat E_\ell)$.
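A compact sketch of Algorithm 1 follows (ours, not the authors' implementation; `fit_weights` and `predict` are hypothetical closures the caller supplies, e.g., solving $C(w, S_{\mathrm{train}})$ with the explicit formula of Section 3.2 and forming $x_i'\hat\theta(\hat w)$ on the test indices):

```python
# A minimal sketch of Algorithm 1 for RMMA: CV screening of L candidate
# penalties on [0, M log n], then exponential weighting by CV errors.
import numpy as np

def algorithm1(M, n, fit_weights, predict, L=100, B=10, l=50, b=5, seed=0):
    """fit_weights(train_idx, lam) -> w_hat;  predict(w, test_idx) -> yhat, y.
    Both are user-supplied closures over the data (hypothetical interface)."""
    lams = np.linspace(0.0, M * np.log(n), L)          # lambda = (ell-1) M log n / (L-1)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), B)
    E = np.zeros(L)
    for j in range(B):
        # training block: b+1 consecutive folds (wrapping around); test: the rest
        tr = np.concatenate([folds[(j + t) % B] for t in range(b + 1)])
        te = np.setdiff1d(np.arange(n), tr)
        for li, lam in enumerate(lams):
            w = fit_weights(tr, lam)
            yhat, ytest = predict(w, te)
            E[li] += np.sum((ytest - yhat) ** 2)       # accumulate E_ell
    keep = np.argsort(E)[:l]                           # the l smallest CV errors
    W = np.array([fit_weights(np.arange(n), lams[li]) for li in keep])
    a = np.exp(-0.5 * (E[keep] - E[keep].min()))       # stabilized exp(-0.5 E_ell)
    return (a[:, None] * W).sum(axis=0) / a.sum()      # step 17
```

Subtracting the minimum CV error before exponentiating rescales the numerator and denominator of step 17 by the same constant, so the result is unchanged while underflow is avoided.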
4. Simulation Study

In this section, we conduct simulation experiments to examine the finite sample performance of the proposed method. Similar to Hansen (2007), we consider the following data generating process:
\[
y_i = \mu_i + e_i = \sum_{k=1}^{K} x_{ki}\theta_k + e_i, \quad i = 1, 2, \ldots, n,
\]
where $\theta_k$, $k = 1, 2, \ldots, K$, are the model parameters, $x_{1i} \equiv 1$, $(x_{2i}, x_{3i}, \ldots, x_{Ki})' \sim N(0, \Sigma)$ and $(e_1, e_2, \ldots, e_n)' \sim N[0, \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)]$. We set $n = 100, 300, 500, 700$, $\Sigma = (\sigma_{kl})$ with $\sigma_{kl} = \rho^{|k-l|}$, $\rho = 0.3, 0.6$, and $R^2 = 0.1, 0.2, \ldots, 0.9$, where the population $R^2 = \mathrm{var}(\mu_i)/\mathrm{var}(\mu_i + e_i)$. For the homoskedastic simulation, we set $\sigma_i^2 \equiv 1$, while for the heteroskedastic simulation, we set $\sigma_i^2 = x_{2i}^2$. We compare the following model selection/averaging methods: 1) model selection with AIC (AI), BIC (BI) and Cp (Cp); 2) model averaging with smoothed AIC (SA) and smoothed BIC (SB); 3) Mallows model averaging (MM), jackknife model averaging (JM) and least squares model averaging based on generalized cross validation (GM; Li et al., 2021); 4) Ridge-Mallows model averaging (RM) and Ridge-jackknife model averaging (RJ). To evaluate these methods, we generate a test set $\{(y_i^*, x_i^{*\prime})', i = 1, 2, \ldots, n_t\}$ from the above data generating process, and
\[
\mathrm{MSE} = n_t^{-1}\sum_{i=1}^{n_t}[\mu_i^* - x_i^{*\prime}\hat\theta(\hat w)]^2
\]
is calculated as a measure of consistency, where $\mu_i^* = \sum_{k=1}^K x_{ki}^*\theta_k$. In the simulation, we set $n_t = 1000$ and repeat each experiment 400 times. For each parameterization, we normalize the MSE by dividing by the infeasible MSE (the mean, over the 400 replications, of the smallest MSE among the $M$ candidate models). We consider two candidate model settings. In the first setting, like Hansen (2007), all candidate models are misspecified and the candidate models are strictly nested. In the second setting, the true model is one of the candidate models and the candidate models are non-nested. For Algorithm 1, we set $L = 100$, $B = 10$, $l = 50$ and $b = 5$.

4.1 Nested Setting and Results

We set $\theta_k = c\sqrt{2\alpha}\,k^{-\alpha-1/2}$ with $\alpha = 0.5, 1.0, 1.5$, and $K = \log_4^2 n$ (i.e., $K = 11, 17, 20, 22$ for the four sample sizes), where $R^2$ is controlled by the parameter $c$. For $\rho = 0.3$, the mean of the normalized MSEs over the 400 replications is shown in Figures 1-6. The results for $\rho = 0.6$ are similar and omitted to save space.

For the homoskedastic case, we can draw the following conclusions from Figures 1-3. When $\alpha = 0.5$ (Figure 1): 1) RM and RJ perform better than the other methods if $R^2 \le 0.5$ and $n = 300, 500, 700$, and comparably with the best method in the other cases; 2) RM performs better than RJ in most cases if $n = 100$. When $\alpha = 1.0$ (Figure 2): 1) RM and RJ perform better than the other methods if $R^2 \le 0.8$, and comparably with the best method if $R^2 = 0.9$; 2) RM performs better than RJ in most cases if $n = 100$. When $\alpha = 1.5$ (Figure 3): 1) RM and RJ always perform better than the other methods; 2) RM performs better than RJ in most cases if $n = 100$.

For the heteroskedastic case, we can draw the following conclusions from Figures 4-6. When $\alpha = 0.5$ (Figure 4): 1) RM and RJ perform better than the other methods if $R^2 \le 0.4$, and comparably with the best method in the other cases; 2) RM performs better than RJ. When $\alpha = 1.0$ (Figure 5): 1) RM and RJ perform better than the other methods if $R^2 \le 0.7$, and comparably with the best method in the other cases; 2) RM performs better than RJ. When $\alpha = 1.5$ (Figure 6): 1) RM and RJ always perform better than the other methods; 2) RM performs better than RJ.

To sum up: 1) RM and RJ are the best in most cases, and even when they are not the best, their performance is close to that of the best method; 2) when $\alpha$ is small and $R^2$ is large, GM performs best, and RM and RJ are the best in the other cases; 3) RM performs better than RJ.

4.2 Non-nested Setting and Results

We set $K = 12$, and $\theta_k = c\sqrt{2\alpha}\,k^{-\alpha-1/2}$ with $\alpha = 0.5, 1.0, 1.5$ for $1 \le k \le 10$ and $\theta_k = 0$ for $k = 11, 12$, where $R^2$ is controlled by the parameter $c$. Each candidate model contains the first 6 covariates, and the last 6 covariates are combined to obtain $2^6 = 64$ candidate models. For $\rho = 0.3$, the mean of the normalized MSEs over the 400 replications is shown in Figures 7-12. As in the nested case, the results for $\rho = 0.6$ are similar and omitted.

For this setting, we can draw the following conclusions from Figures 7-12. When $\alpha = 0.5$ (Figures 7 and 10): 1) RM and RJ perform better than the other methods if $R^2 \le 0.5$, and have performance close to the best method if $R^2 > 0.5$; 2) RM performs better than RJ in most cases. When $\alpha = 1.0$ (Figures 8 and 11): 1) RM and RJ perform better than the other methods except SB if $R^2 \le 0.7$, but the performance of SB is very unstable; 2) RM and RJ have performance close to the best method if $R^2 > 0.7$; 3) RM performs better than RJ in most cases. When $\alpha = 1.5$ (Figures 9 and 12): 1) RM and RJ perform better than the other methods except BI and SB, but the performance of SB and BI is very unstable; 2) RM performs better than RJ in most cases.

To sum up: 1) RM and RJ are the best in most cases and have stable performance; 2) one of SB, SA, BI and AI may perform the best when $R^2$ is small or large, but their performance is unstable compared with RM and RJ; 3) on the whole, RM performs better than RJ.
5. Real Data Analysis

In this section, we apply the proposed method to the wage1 dataset in Wooldridge (2003), drawn from the US Current Population Survey for the year 1976. There are 526 observations in this dataset. The response variable is the log of average hourly earnings, and the covariates include: 1) the dummy variables nonwhite, female, married, numdep, smsa, northcen, south, west, construc, ndurman, trcommpu, trade, services, profserv, profocc, clerocc and servocc; 2) the non-dummy variables educ, exper and tenure; 3) the interaction variables nonwhite×educ, nonwhite×exper, nonwhite×tenure, female×educ, female×exper, female×tenure, married×educ, married×exper and married×tenure. We consider the following two cases: 1) we rank the covariates according to their linear correlations with the response variable, and then consider the strictly nested model averaging setting (the intercept term is included and ranked first); 2) 100 models are selected by the function regsubsets in the leaps package of R, where the parameters nvmax and nbest are set to 20 and 5, respectively, and the other parameters use their default values. For Algorithm 1, we set $b = 9$ and keep the rest of the settings the same as in the simulation study.

We randomly divide the data into two parts: a training sample $S$ of $n$ observations for estimating the models and a test sample $S_t$ of $n_t = 526 - n$ observations for validating the results. We consider $n = 110, 210, 320, 420$, and
\[
\mathrm{MSE} = n_t^{-1}\sum_{z_i\in S_t}[y_i - x_i'\hat\theta(\hat w)]^2
\]
is calculated as a measure of consistency. We replicate the process 400 times. The box plots of the MSEs over the 400 replications are shown in Figures 13-14. From these figures, we see that the performance of RM and RJ is good and stable. We also compute the mean and median of the MSEs, as well as the best performance rate (BPR), which is the frequency of achieving the lowest risk across the replications. The results are shown in Tables 1-2. From these tables, we can draw the following conclusions: 1) RM and RJ are superior to the other methods in terms of the mean and median of the MSEs, and BPR; 2) the performance of RM and RJ is basically the same in terms of the mean and median of the MSEs; 3) for BPR, on the whole, RM outperforms RJ.

6. Concluding Remarks

In this paper, we study the relationships among AERM, stability, generalization and consistency in model averaging. The results indicate that stability is an important property of model averaging, which can ensure that the model averaging estimator has good generalization performance and consistency under reasonable conditions. When the model weights are not restricted, extreme weights tend to appear due to the correlation between candidate models, which results in poor performance of the corresponding model averaging estimator. Thus, similar to ridge regression in Hoerl and Kennard (1970), we propose an L2-penalty model averaging method and prove that it has stability and consistency. In order to overcome the selection uncertainty of the tuning parameter, we use cross-validation to select a candidate set of tuning parameters, and then perform a weighted average of the estimators of model weights based on cross-validation errors. We show the superiority of the proposed method with a Monte Carlo simulation and an application to a prediction task on the wage1 dataset.

Many issues deserve further investigation. We apply the methods of Section 2 to the generalization of MMA and JMA only for linear regression.
It is worth investigating whether the proposed method can be extended to more complex settings, such as generalized linear models, quantile regression and dependent data. Further, it is also interesting to develop a model averaging framework with stability and consistency in the online setting. In addition, for RMMA and RJMA, the estimators of the weight vector have explicit expressions, so how to study their asymptotic behavior based on these explicit expressions is a meaningful but challenging topic.

7. Acknowledgements

The authors thank the editor and two referees for their insightful suggestions and comments, which have substantially improved an earlier version of this article. This work was partially supported by the National Natural Science Foundation of China (Grant nos. 12031016 and 11971323) and the Beijing Natural Science Foundation (Grant no. Z210003).

References

Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267-281. Akademiai Kiado, Budapest, 1973.

Donald W. K. Andrews. Generic uniform convergence. Econometric Theory, 8(2):241-257, 1992.

Zhi-Dong Bai and Yong-Qua Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3):1275-1294, 1993.

Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499-526, 2002.

Steven T. Buckland, Kenneth P. Burnham, and Nicole H. Augustin. Model selection: An integral part of inference. Biometrics, 53(2):603-618, 1997.

Jean-Marie Dufour. Recursive stability analysis of linear regression relationships: An exploratory methodology. Journal of Econometrics, 19(1):31-76, 1982.

Tiago M. Fragoso, Wesley Bertoli, and Francisco Louzada. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review, 86(1):1-28, 2018.

Yan Gao, Xinyu Zhang, Shouyang Wang, and Guohua Zou. Model averaging based on leave-subject-out cross-validation. Journal of Econometrics, 192(1):139-151, 2016.

Bruce E. Hansen. Least squares model averaging. Econometrica, 75(4):1175-1189, 2007.

Bruce E. Hansen. Least-squares forecast averaging. Journal of Econometrics, 146(2):342-350, 2008.

Bruce E. Hansen and Jeffrey S. Racine. Jackknife model averaging. Journal of Econometrics, 167(1):38-46, 2012.

Nils L. Hjort and Gerda Claeskens. Frequentist model average estimators. Journal of the American Statistical Association, 98(464):879-899, 2003.

Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.

Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 275-282, 2002.

Xinmin Li, Guohua Zou, Xinyu Zhang, and Shangwei Zhao. Least squares model averaging based on generalized cross validation. Acta Mathematicae Applicatae Sinica, English Series, 37(3):495-509, 2021.

Hua Liang, Guohua Zou, Alan T. K. Wan, and Xinyu Zhang. Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association, 106(495):1053-1066, 2011.

Jun Liao and Guohua Zou. Corrected Mallows criterion for model averaging. Computational Statistics and Data Analysis, 144:106902, 2020.

Jun Liao, Xianpeng Zong, Xinyu Zhang, and Guohua Zou.
Model averaging based on leave-subject-out cross-validation for vector autoregressions. Journal of Econometrics, 209:35-60, 2019.

Zhengyan Lin and Zhidong Bai. Probability Inequalities. Springer, Berlin Heidelberg, 2010.

Qingfeng Liu and Ryo Okui. Heteroskedasticity-robust Cp model averaging. The Econometrics Journal, 16(3):463-472, 2013.

Yifan Liu. Penalized Mallows model averaging. Communications in Statistics - Theory and Methods, doi: 10.1080/03610926.2023.2264995, 2023.

Colin L. Mallows. Some comments on Cp. Technometrics, 15:661-675, 1973.

Enrique Moral-Benito. Model averaging in economics: An overview. Journal of Economic Surveys, 29(1):46-75, 2015.

Jaouad Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. The Annals of Statistics, 50(4):2157-2178, 2022.

Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161-193, 2006.

Alexander Rakhlin, Sayan Mukherjee, and Tomaso Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397-417, 2005.

Michael Schomaker. Shrinkage averaging estimation. Statistical Papers, 53(4):1015-1034, 2012.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461-464, 1978.

Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635-2670, 2010.

Alena Skolkova. Model averaging with ridge regularization. CERGE-EI Working Papers wp758, The Center for Economic Research and Graduate Education - Economics Institute, Prague, 2023. URL https://EconPapers.repec.org/RePEc:cer:papers:wp758.

Hongzhi Tong and Qiang Wu. Learning performance of regularized moving least square regression. Journal of Computational and Applied Mathematics, 325:42-55, 2017.

Alexandre B. Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel Machines, pages 303-313. Springer Berlin Heidelberg, 2003.

Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

Alan T. K. Wan and Xinyu Zhang. On the use of model averaging in tourism research. Annals of Tourism Research, 36:525-532, 2009.

Alan T. K. Wan, Xinyu Zhang, and Guohua Zou. Least squares model averaging by Mallows criterion. Journal of Econometrics, 156(2):277-283, 2010.

Jeffrey M. Wooldridge. Introductory Econometrics. Thomson South-Western, 2003.

Yuhong Yang. Adaptive regression by mixing. Journal of the American Statistical Association, 96(454):574-588, 2001.

Zheng Yuan and Yuhong Yang. Combining linear regression models: When and how? Journal of the American Statistical Association, 100(472):1202-1214, 2005.

Xinyu Zhang and Chu-An Liu. Inference after model averaging in linear regression models. Econometric Theory, 35(4):816-841, 2019.

Xinyu Zhang and Guohua Zou. Model averaging method and its application in forecast. Statistical Research, 28(6), 2011.

Xinyu Zhang, Alan T. K. Wan, and Guohua Zou. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics, 174(2):82-94, 2013.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006.
Appendix A. Lemmas and Their Proofs

Let $\hat\theta^{-(n-1,n)}(w)$ be the model averaging estimator based on $S^{-(n-1,n)}$, where $S^{-(n-1,n)}$ is the set of observations after removing the $(n-1)$-th and $n$-th observations from $S$. We give the following lemmas in order to prove Theorems 14-17. Lemma 20 shows that the L2-norms of the parameter estimators of the $M$ candidate models are uniformly bounded; it will be used repeatedly in subsequent proofs.

Lemma 20 Under Assumptions 1 and 2, there exists a constant $B_1 > 0$ such that
\[
\max_{1\le m\le M}\max\big(\|\hat\theta_m\|_2^2,\ \|\hat\theta_m^{-n}\|_2^2,\ \|\hat\theta_m^{-(n-1,n)}\|_2^2\big) \le B_1, \ \text{a.s.}.
\]

Proof It follows from Assumption 1 that $P[\lambda_{\min}(X_M^{-(n-1,n)\prime}X_M^{-(n-1,n)}/n) \ge C_1] = 1$. Thus, we have
\[
P[\lambda_{\min}(X_M^{-n\prime}X_M^{-n}/n) \ge C_1] = P\big\{\lambda_{\min}[(X_M^{-(n-1,n)\prime}X_M^{-(n-1,n)} + x_{n-1}x_{n-1}')/n] \ge C_1\big\} \ge P[\lambda_{\min}(X_M^{-(n-1,n)\prime}X_M^{-(n-1,n)}/n) \ge C_1] = 1,
\]
which means that
\[
\lambda_{\min}(X_M^{-n\prime}X_M^{-n}/n) \ge C_1, \ \text{a.s.}. \tag{1}
\]
Similarly, we have
\[
P[\lambda_{\min}(X_M'X_M/n) \ge C_1] = P\big\{\lambda_{\min}[(X_M^{-n\prime}X_M^{-n} + x_nx_n')/n] \ge C_1\big\} \ge P[\lambda_{\min}(X_M^{-n\prime}X_M^{-n}/n) \ge C_1] = 1,
\]
which means that
\[
\lambda_{\min}(X_M'X_M/n) \ge C_1, \ \text{a.s.}. \tag{2}
\]
From $X_m'X_m = \pi_m'X_M'X_M\pi_m$, the definition of $\pi_m$ and (2), we have $\lambda_{\min}(n^{-1}X_m'X_m) \ge \lambda_{\min}(n^{-1}X_M'X_M) \ge C_1$, a.s.. Note that
\[
X_m(X_m'X_m)^{-1}(X_m'X_m)^{-1}X_m' = X_m(X_m'X_m)^{-1/2}(X_m'X_m)^{-1}(X_m'X_m)^{-1/2}X_m'.
\]
Hence, we have
\[
\lambda_{\max}[X_m(X_m'X_m)^{-1}(X_m'X_m)^{-1}X_m'] \le \frac{1}{\lambda_{\min}(X_m'X_m)} \le \frac{1}{C_1n}, \ \text{a.s.}. \tag{3}
\]
Thus, it follows from Assumption 2 that
\[
\max_{1\le m\le M}\|\hat\theta_m\|_2^2 = \max_{1\le m\le M}\|(X_m'X_m)^{-1}X_m'y\|_2^2 = \max_{1\le m\le M} y'X_m(X_m'X_m)^{-1}(X_m'X_m)^{-1}X_m'y \le C_1^{-1}n^{-1}y'y \le C_1^{-1}C_2, \ \text{a.s.}.
\]
Note that from Assumption 2, we have
\[
n^{-1}y^{-n\prime}y^{-n} \le n^{-1}y'y \le C_2, \ \text{a.s.} \tag{4}
\]
and $n^{-1}y^{-(n-1,n)\prime}y^{-(n-1,n)} \le n^{-1}y'y \le C_2$, a.s.. So in a similar way, we obtain $\max_{1\le m\le M}\|\hat\theta_m^{-n}\|_2^2 \le C_1^{-1}C_2$, a.s. and $\max_{1\le m\le M}\|\hat\theta_m^{-(n-1,n)}\|_2^2 \le C_1^{-1}C_2$, a.s.. We complete the proof by taking $B_1 = C_1^{-1}C_2$.

The following lemma characterizes the degree of impact of removing an observation on the parameter estimators of the candidate models. From the proofs of Lemmas 24-28, we can see that it is crucial for proving the stability of our method.

Lemma 21 Under Assumptions 1 and 2, we have
\[
E_S\Big[\max_{1\le m\le M}\|\hat\theta_m - \hat\theta_m^{-n}\|_2^2\Big] = O(n^{-2}K^2) \quad\text{and}\quad E_S\Big[\max_{1\le m\le M}\|\hat\theta_m^{-n} - \hat\theta_m^{-(n-1,n)}\|_2^2\Big] = O(n^{-2}K^2).
\]

Proof By Dufour (1982), for any $m \in \{1, 2, \ldots, M\}$, we have
\[
\hat\theta_m = \hat\theta_m^{-n} + (X_m'X_m)^{-1}\pi_m'x_n(y_n - x_n'\pi_m\hat\theta_m^{-n}).
\]
It follows from (2) that
\[
E_S\Big[\max_{1\le m\le M}\|\hat\theta_m - \hat\theta_m^{-n}\|_2^2\Big] = E_S\Big[\max_{1\le m\le M} x_n'\pi_m(X_m'X_m)^{-1}(X_m'X_m)^{-1}\pi_m'x_n(y_n - x_n'\pi_m\hat\theta_m^{-n})^2\Big]
\]
\[
\le E_S\Big\{\max_{1\le m\le M}\lambda_{\max}^2[(X_m'X_m)^{-1}]\,\|\pi_m'x_n(y_n - x_n'\pi_m\hat\theta_m^{-n})\|_2^2\Big\} \le E_S\Big[C_1^{-2}n^{-2}\max_{1\le m\le M}\|\pi_m'x_n(y_n - x_n'\pi_m\hat\theta_m^{-n})\|_2^2\Big]. \tag{5}
\]
From Assumption 2, we obtain $E[(y^*)^2] = E[n^{-1}y'y] \le C_2$ and
\[
E_S\Big[\max_{1\le m\le M}\|\pi_m'x_n(y_n - x_n'\pi_m\hat\theta_m^{-n})\|_2^2\Big] \le E_S\Big[\max_{1\le m\le M}\sum_{k=1}^{K} x_{(k)n}^2(y_n - x_n'\pi_m\hat\theta_m^{-n})^2\Big] \le C_3^2K\,E_S\Big[\max_{1\le m\le M}(y_n - x_n'\pi_m\hat\theta_m^{-n})^2\Big]
\]
\[
\le C_3^2K\,E_S\Big\{\max_{1\le m\le M}[2y_n^2 + 2(x_n'\pi_m\hat\theta_m^{-n})^2]\Big\} \le C_3^2K\Big\{2C_2 + 2E_S\Big[\max_{1\le m\le M}(x_n'\pi_m\hat\theta_m^{-n})^2\Big]\Big\}. \tag{6}
\]
Further, from Assumption 2 and Lemma 20, we have
\[
E_S\Big[\max_{1\le m\le M}(x_n'\pi_m\hat\theta_m^{-n})^2\Big] \le E_S\Big[\max_{1\le m\le M}\|x_n\|_2^2\|\hat\theta_m^{-n}\|_2^2\Big] \le C_3^2K\,E_S\Big[\max_{1\le m\le M}\|\hat\theta_m^{-n}\|_2^2\Big] \le B_1C_3^2K. \tag{7}
\]
Combining (5)-(7), it is seen that $E_S[\max_{1\le m\le M}\|\hat\theta_m - \hat\theta_m^{-n}\|_2^2] = O(n^{-2}K^2)$. In a similar way, it can be shown that $E_S[\max_{1\le m\le M}\|\hat\theta_m^{-n} - \hat\theta_m^{-(n-1,n)}\|_2^2] = O(n^{-2}K^2)$ by using (1).

The following lemma shows that, under certain conditions, $n^{-1}\hat\Omega'\hat\Omega$ and $\hat\theta\hat\theta'$ are positive definite when the $m$-th model contains the first $m$ covariates (a common nested model setting in the model averaging literature, such as Hansen (2007) and Zhang and Liu (2019)). Let $\bar y = n^{-1}\sum_{i=1}^n y_i$, $\hat y_m = P_my$ and $\hat\sigma_m^2 = n^{-1}\|y - \hat y_m\|_2^2$.
Specifically, this lemma requires the following two conditions: 1) $n^{-1}\hat y_1'\hat y_1 > 0$, a.s.; 2) $\hat\sigma_1^2 > \hat\sigma_2^2 > \cdots > \hat\sigma_M^2$, a.s.. Assume that the first model contains only the constant term; then $n^{-1}\hat y_1'\hat y_1 = \bar y^2$, and therefore the first condition only requires $\bar y \ne 0$, a.s.. It follows from
\[
\hat\sigma_1^2 = \|y - X_M\pi_1\hat\theta_1\|_2^2/n = \min_{\theta:\theta_2=\theta_3=\cdots=\theta_M=0}\|y - X_M\theta\|_2^2/n,
\qquad
\hat\sigma_2^2 = \|y - X_M\pi_2\hat\theta_2\|_2^2/n = \min_{\theta:\theta_3=\theta_4=\cdots=\theta_M=0}\|y - X_M\theta\|_2^2/n,
\]
\[
\ldots, \qquad
\hat\sigma_M^2 = \|y - X_M\pi_M\hat\theta_M\|_2^2/n = \min_{\theta}\|y - X_M\theta\|_2^2/n
\]
that $\hat\sigma_1^2 \ge \hat\sigma_2^2 \ge \cdots \ge \hat\sigma_M^2$. On the other hand, if $x_2, x_3, \ldots, x_M$ are all significant covariates, then the second condition is reasonable. Additionally, when $M = n$, we know that $\hat\sigma_M^2 = 0$. Therefore, even if $x_2, x_3, \ldots, x_M$ contain insignificant covariates, the second condition may still be satisfied due to the ever-expanding parameter space.

Lemma 22 When the $m$-th model contains the first $m$ covariates, if $n^{-1}\hat y_1'\hat y_1 > 0$ and $\hat\sigma_1^2 > \hat\sigma_2^2 > \cdots > \hat\sigma_M^2$, a.s., then $n^{-1}\hat\Omega'\hat\Omega$ and $\hat\theta\hat\theta'$ are positive definite, a.s..

Proof From the proof of Lemma 2 of Hansen (2007), we have $P_mP_t = P_{\min(m,t)}$, so that
\[
\hat\Omega'\hat\Omega = (y'P_mP_ty)_{M\times M} = \begin{pmatrix}
y'P_1y & y'P_1y & y'P_1y & \cdots & y'P_1y \\
y'P_1y & y'P_2y & y'P_2y & \cdots & y'P_2y \\
y'P_1y & y'P_2y & y'P_3y & \cdots & y'P_3y \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
y'P_1y & y'P_2y & y'P_3y & \cdots & y'P_My
\end{pmatrix}.
\]
After a series of elementary row transformations (subtracting, in turn, each row from the rows below it), we see that
\[
|\hat\Omega'\hat\Omega| = \begin{vmatrix}
y'P_1y & y'P_1y & \cdots & y'P_1y \\
0 & y'(P_2-P_1)y & \cdots & y'(P_2-P_1)y \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & y'(P_M-P_{M-1})y
\end{vmatrix}
= \begin{vmatrix}
\hat y_1'\hat y_1 & \hat y_1'\hat y_1 & \cdots & \hat y_1'\hat y_1 \\
0 & \hat y_2'\hat y_2 - \hat y_1'\hat y_1 & \cdots & \hat y_2'\hat y_2 - \hat y_1'\hat y_1 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat y_M'\hat y_M - \hat y_{M-1}'\hat y_{M-1}
\end{vmatrix}.
\]
It follows from
\[
\hat\sigma_m^2 = (y'y + \hat y_m'\hat y_m - 2y'\hat y_m)/n = (y'y + \hat y_m'\hat y_m - 2y'P_my)/n = (y'y - \hat y_m'\hat y_m)/n
\]
that $\hat\sigma_{m-1}^2 - \hat\sigma_m^2 > 0$ is equivalent to $(\hat y_m'\hat y_m - \hat y_{m-1}'\hat y_{m-1})/n > 0$. Thus, $n^{-1}\hat\Omega'\hat\Omega$ is positive definite, a.s.. Further, it follows from $\hat\Omega'\hat\Omega = \hat\theta'X_M'X_M\hat\theta$ and $M = K$ that $\hat\theta\hat\theta'$ is positive definite, a.s..

In the following, we provide an example to illustrate the rationality of Assumption 3 based on Lemma 22.

Lemma 23 Consider the nested setting of Lemma 22 and suppose that Assumption 2 holds, $x_{i1} \equiv 1$ and $|\bar y| \ge c_0 > 0$, a.s., $E(e_i|x_i) = 0$ and $E(|e_i|\,|x_i) < \infty$, $b_M'b_M = o(n)$ and $\mu'\mu = O(n)$, $0 < c_1 \le \min_{1\le k\le K}|\theta_{M,(k)}| \le \max_{1\le k\le K}|\theta_{M,(k)}| \le c_2 < \infty$, and the design is orthogonal, i.e., $n^{-1}X_M'X_M = I_K$. Then there is a constant $c_3 > 0$ such that $|n^{-1}\hat\Omega'\hat\Omega| \ge c_3^M$, a.s., and when $M$ is bounded, there is a constant $c_4 > 0$ such that $\lambda_{\min}(n^{-1}\hat\Omega'\hat\Omega) \ge c_4$, a.s., where $a_n \sim b_n$ means $a_n/b_n \to 1$ and $b_n/a_n \to 1$.

Proof Denote by $X_m^c$ the matrix consisting of the columns of $X_M$ not contained in $X_m$, and by $\pi_m^c$ the selection matrix such that $X_m^c = X_M\pi_m^c$. Let $\theta_m^c = \pi_m^{c\prime}\theta_M$ and $\xi_i(w) = \sum_{m=1}^M w_mx_{(m)i}e_i$, where $\bar{\mathcal{W}} = \{w \in [0,1]^M : \sum_{m=1}^M w_m = 1\}$. Since $\bar{\mathcal{W}}$ is a totally bounded metric space under the $L_1$-norm, for any $w \in \bar{\mathcal{W}}$, $n^{-1}\sum_{i=1}^n\xi_i(w) = o(1)$, a.s., and for any $w_1, w_2 \in \bar{\mathcal{W}}$, $|\xi_i(w_1) - \xi_i(w_2)| \le \max_{1\le m\le M}|x_{(m)i}e_i|\,\|w_1 - w_2\|_1$, it follows from Theorem 3 of Andrews (1992) that $\sup_{w\in\bar{\mathcal{W}}}|n^{-1}\sum_{i=1}^n\xi_i(w)| = o(1)$, a.s., and then
\[
e'(P_m - P_{m-1})e = n^{-1}e'(X_mX_m' - X_{m-1}X_{m-1}')e = n^{-1}\sum_{i=1}^n\sum_{j=1}^n x_{(m)i}x_{(m)j}e_ie_j = n\Big(n^{-1}\sum_{i=1}^n x_{(m)i}e_i\Big)^2 = o(n), \ \text{a.s.}
\]
uniformly for $m \in \{1, 2, \ldots, M\}$.
In the following, we provide an example to illustrate the rationality of Assumption 3 based on Lemma 22.

Lemma 23 Consider the nested setting of Lemma 22 and assume that: Assumption 2 holds; x_{i1} ≡ 1 and |ȳ| ≥ c₀ > 0, a.s.; E(e_i|x_i) = 0 and E(|e_i| | x_i) < ∞; b_M′b_M = o(n) and μ′μ = O(n); 0 < c₁ ≤ min_{1≤k≤K} |θ_{M,(k)}| ≤ max_{1≤k≤K} |θ_{M,(k)}| ≤ c₂ < ∞; and the design is orthogonal, i.e., n⁻¹X_M′X_M = I_K. Then there is a constant c₃ > 0 such that |n⁻¹Ω̂′Ω̂| ≥ c₃^M, a.s., and when M is bounded, there is a constant c₄ > 0 such that λ_min(n⁻¹Ω̂′Ω̂) ≥ c₄, a.s., where a_n ≍ b_n below means a_n/b_n → 1.

Proof Denote X^c_m as the matrix consisting of the columns of X_M not contained in X_m, and π^c_m as the selection matrix such that X^c_m = X_Mπ^c_m. Let θ^c_m = π^c_m′θ_M and ξ_i(w) = Σ_{m=1}^M w_m x_{(m)i}e_i, where W = {w ∈ [0, 1]^M : Σ_{m=1}^M w_m = 1}. Since W is a totally bounded metric space under the L₁-norm metric, for any w ∈ W, n⁻¹Σ_{i=1}^n ξ_i(w) = o(1), a.s., and for any w₁, w₂ ∈ W, |ξ_i(w₁) − ξ_i(w₂)| ≤ max_{1≤m≤M} |x_{(m)i}e_i| ‖w₁ − w₂‖₁, it follows from Theorem 3 of Andrews (1992) that sup_{w∈W} |n⁻¹Σ_{i=1}^n ξ_i(w)| = o(1), a.s. Then, under the orthogonal design,

n⁻¹e′(X_mX_m′ − X_{m−1}X_{m−1}′)e = n⁻¹ Σ_{i=1}^n Σ_{j=1}^n x_{(m)i}x_{(m)j}e_ie_j = n(n⁻¹Σ_{i=1}^n x_{(m)i}e_i)² = o(n), a.s.

uniformly for m ∈ {1, 2, …, M}. Next, it is seen that

nσ̂²_m = (X_mθ_m + X^c_mθ^c_m + b_M)′(I − P_m)(X_mθ_m + X^c_mθ^c_m + b_M) + e′(I − P_m)e + 2μ′(I − P_m)e
= (X^c_mθ^c_m + b_M)′(I − P_m)(X^c_mθ^c_m + b_M) + e′(I − P_m)e + 2μ′(I − P_m)e
= nθ^c_m′θ^c_m + b_M′(I − P_m)b_M + 2b_M′X^c_mθ^c_m + e′(I − P_m)e + 2μ′(I − P_m)e,

so that, with Λ_{m,m−1} collecting the difference of the remainder terms,

n(σ̂²_{m−1} − σ̂²_m) = n(θ^c_{m−1}′θ^c_{m−1} − θ^c_m′θ^c_m) + Λ_{m,m−1} = nθ²_{M,(m)} + Λ_{m,m−1}.

From (P_m − P_{m−1})(P_m − P_{m−1}) = P_m − P_{m−1}, the Cauchy-Schwarz inequality and the uniform bound above, we obtain

|Λ_{m,m−1}| = |b_M′(P_m − P_{m−1})b_M + 2b_M′(X^c_{m−1}θ^c_{m−1} − X^c_mθ^c_m) + e′(P_m − P_{m−1})e + 2μ′(P_m − P_{m−1})e|
= |b_M′(P_m − P_{m−1})b_M + 2b_M′X_M(π^c_{m−1}θ^c_{m−1} − π^c_mθ^c_m) + e′(P_m − P_{m−1})e + 2μ′(P_m − P_{m−1})e|
≤ b_M′(P_m − P_{m−1})b_M + 2(nθ²_{M,(m)} b_M′b_M)^{1/2} + n⁻¹e′(X_mX_m′ − X_{m−1}X_{m−1}′)e + 2[μ′μ · e′(P_m − P_{m−1})e]^{1/2}
≤ b_M′b_M + 2(nc₂² b_M′b_M)^{1/2} + n⁻¹e′(X_mX_m′ − X_{m−1}X_{m−1}′)e + 2[μ′μ · n⁻¹e′(X_mX_m′ − X_{m−1}X_{m−1}′)e]^{1/2}
= o(n), a.s.

uniformly for m ∈ {2, 3, …, M}. Hence n(σ̂²_{m−1} − σ̂²_m) ≥ nc₁² + o(n), a.s. Thus, for large n, there is a constant c₃ > 0 such that

min{ȳ², min_{2≤m≤M}(σ̂²_{m−1} − σ̂²_m)} ≥ c₃ + o(1), a.s.

From the triangular reduction in the proof of Lemma 22, n⁻¹Ω̂′Ω̂ admits the factorization

[ 1 0 0 ⋯ 0 ]   [ ȳ²   ȳ²          ȳ²          ⋯   ȳ²              ]
[ 1 1 0 ⋯ 0 ]   [ 0    σ̂²₁ − σ̂²₂   σ̂²₁ − σ̂²₂   ⋯   σ̂²₁ − σ̂²₂       ]
[ 1 1 1 ⋯ 0 ] × [ 0    0           σ̂²₂ − σ̂²₃   ⋯   σ̂²₂ − σ̂²₃       ]
[ ⋮         ]   [ ⋮                              ⋱   ⋮              ]
[ 1 1 1 ⋯ 1 ]   [ 0    0           0           ⋯   σ̂²_{M−1} − σ̂²_M ],

and we obtain

|n⁻¹Ω̂′Ω̂| = ȳ² ∏_{m=2}^M (σ̂²_{m−1} − σ̂²_m) ≥ [c₃ + o(1)]^M ≍ c₃^M, a.s.,

that is, n⁻¹Ω̂′Ω̂ is positive definite, a.s. When M is bounded, it then follows immediately that there is a constant c₄ > 0 such that λ_min(n⁻¹Ω̂′Ω̂) ≥ c₄, a.s.

The following lemma is also used to illustrate the rationality of Assumption 3, and understanding its proof is helpful for understanding the proof of Lemma 27.

Lemma 24 Under Assumptions 1 and 2, we have

E_{S,z*}(γ̂*γ̂*′) = n⁻¹E_S(Ω̂′Ω̂) + O(n⁻¹K²) and n⁻¹E_S(Ω̃′Ω̃) = n⁻¹E_S(Ω̂′Ω̂) + O[n(C₁n − C₃²K)⁻²K],

where the O(·) terms hold elementwise.

Proof Note that γ̂*γ̂*′ = (x*′π_mθ̂_mθ̂_t′π_t′x*)_{M×M}, Ω̃′Ω̃ = ([y − D_m(y − P_my)]′[y − D_t(y − P_ty)])_{M×M}, where D_m = diag{(1 − h^m_{11})⁻¹, …, (1 − h^m_{nn})⁻¹}, and

Ω̂′Ω̂ = (y′P_m′P_ty)_{M×M} = (Σ_{i=1}^n x_i′π_mθ̂_mθ̂_t′π_t′x_i)_{M×M}.

It follows from Assumption 2, Lemmas 20 and 21, and the Cauchy-Schwarz inequality that

|E_S(x_n′π_mθ̂_mθ̂_t′π_t′x_n − x_n′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_n)|
≤ |E_S[x_n′π_m(θ̂_m − θ̂_m^(n))θ̂_t′π_t′x_n]| + |E_S[x_n′π_mθ̂_m^(n)(θ̂_t − θ̂_t^(n))′π_t′x_n]|
≤ [E_S(‖θ̂_m − θ̂_m^(n)‖₂²)]^{1/2}[E_S(‖π_m′x_n x_n′π_tθ̂_t‖₂²)]^{1/2} + [E_S(‖π_t′x_n x_n′π_mθ̂_m^(n)‖₂²)]^{1/2}[E_S(‖θ̂_t − θ̂_t^(n)‖₂²)]^{1/2}
≤ [E_S(‖θ̂_m − θ̂_m^(n)‖₂²)]^{1/2}[E_S(‖x_n‖₂² ‖θ̂_t‖₂² ‖x_n‖₂²)]^{1/2} + [E_S(‖x_n‖₂² ‖θ̂_m^(n)‖₂² ‖x_n‖₂²)]^{1/2}[E_S(‖θ̂_t − θ̂_t^(n)‖₂²)]^{1/2}
= O(n⁻¹K²).

In a similar way, we obtain |E_{S,z*}(x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x* − x*′π_mθ̂_mθ̂_t′π_t′x*)| = O(n⁻¹K²). Further, it is readily seen that

E_{S,z*}(n⁻¹Σ_{i=1}^n x_i′π_mθ̂_mθ̂_t′π_t′x_i − x*′π_mθ̂_mθ̂_t′π_t′x*)
= E_{S,z*}(x_n′π_mθ̂_mθ̂_t′π_t′x_n − x*′π_mθ̂_mθ̂_t′π_t′x*)
= E_{S,z*}(x_n′π_mθ̂_mθ̂_t′π_t′x_n − x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x*) + E_{S,z*}(x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x* − x*′π_mθ̂_mθ̂_t′π_t′x*)
= E_S(x_n′π_mθ̂_mθ̂_t′π_t′x_n − x_n′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_n) + E_{S,z*}(x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x* − x*′π_mθ̂_mθ̂_t′π_t′x*),

where the last equality holds because z* and z_n are identically distributed and both are independent of θ̂^(n). So we have E_{S,z*}(γ̂*γ̂*′) = n⁻¹E_S(Ω̂′Ω̂) + O(n⁻¹K²). On the other hand, it follows from (2) and Assumption 2 that

max_{1≤i≤n} max_{1≤m≤M} h^m_{ii} = max_{1≤i≤n} max_{1≤m≤M} x_i′π_m(X_m′X_m)⁻¹π_m′x_i ≤ C₃²K/(nC₁), a.s. (8)

Hence, from Assumption 2, we have

|y′P_m′P_ty − [y − D_m(y − P_my)]′[y − D_t(y − P_ty)]|
= |y′P_m′P_ty − [(I_n − D_m)y + D_mP_my]′[(I_n − D_t)y + D_tP_ty]|
≤ |y′P_m′(D_mD_t − I_n)P_ty| + |y′(I_n − D_m)(I_n − D_t)y| + |y′(I_n − D_m)D_tP_ty| + |y′P_m′D_m(I_n − D_t)y|
≤ max_{m,i}[(1 − h^m_{ii})⁻² − 1] y′P_m′P_my + max_{m,i}[1 − (1 − h^m_{ii})⁻¹]² y′y + 2 max_{m,i}(1 − h^m_{ii})⁻¹[(1 − h^m_{ii})⁻¹ − 1](y′y · y′P_m′P_my)^{1/2}
≤ max_{m,i}{[(1 − h^m_{ii})⁻² − 1] + [(1 − h^m_{ii})⁻¹ − 1]² + 2(1 − h^m_{ii})⁻¹[(1 − h^m_{ii})⁻¹ − 1]} y′y
≤ C₂n{[(1 − C₃²K/(nC₁))⁻² − 1] + [(1 − C₃²K/(nC₁))⁻¹ − 1]² + 2(1 − C₃²K/(nC₁))⁻¹[(1 − C₃²K/(nC₁))⁻¹ − 1]}
= 4C₁C₂C₃²n²K(C₁n − C₃²K)⁻², a.s. (9)

Thus, we obtain n⁻¹E_S(Ω̃′Ω̃) = n⁻¹E_S(Ω̂′Ω̂) + O[n(C₁n − C₃²K)⁻²K].

Let w̄^(n) = argmin_{w∈R^M} F̂(w, S^(n)), where F̂(w, S^(n)) = (n−1)⁻¹ Σ_{i=1}^{n−1} [y_i − x_i′θ̂^(n)(w)]².
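Lemma 25 below bounds the norms of the weight estimators ŵ, w̃ and w̄ without any simplex constraint. For concreteness, here is a minimal sketch of the first of them, ŵ = (Ω̂′Ω̂ + λ_nI_M)⁻¹(Ω̂′y − σ²κ), as it appears in the proof of Lemma 26 (Python with NumPy; taking κ to be the vector of candidate-model sizes as in Mallows-type criteria, plugging in the largest model's residual variance for σ², and the toy data and penalty level are assumptions of this sketch, not the paper's prescriptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, M = 200, 6, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ rng.normal(size=K) + rng.normal(size=n)

# Column m of Omega-hat holds the fitted values of nested model m.
Omega = np.column_stack([
    X[:, :m] @ np.linalg.lstsq(X[:, :m], y, rcond=None)[0] for m in range(1, M + 1)
])

sigma2 = np.sum((y - Omega[:, -1]) ** 2) / (n - K)   # residual variance, largest model
kappa = np.arange(1, M + 1)                          # model sizes (sketch assumption)
lam = 1.0                                            # illustrative penalty level

# Unrestricted L2-penalized weights: no simplex constraint is imposed.
w_hat = np.linalg.solve(Omega.T @ Omega + lam * np.eye(M), Omega.T @ y - sigma2 * kappa)
print(w_hat, w_hat.sum())   # weights need not be nonnegative or sum to one
```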
The following lemma shows that the L2-norms of the estimators of the model weight vector are bounded, which makes Assumption 4 reasonable. This conclusion is used many times in subsequent proofs.

Lemma 25 Under Assumptions 2-3, there is a constant B₂ > 0 such that max(‖ŵ‖₂², ‖ŵ^(n)‖₂²) ≤ B₂(1 + n⁻²K²M), a.s., max(‖w̃‖₂², ‖w̃^(n)‖₂²) ≤ B₂, a.s., and max(‖w̄‖₂², ‖w̄^(n)‖₂²) ≤ B₂, a.s.

Proof It is straightforward to check that

λ_max[Ω̂(Ω̂′Ω̂ + λ_nI_M)⁻¹(Ω̂′Ω̂ + λ_nI_M)⁻¹Ω̂′] ≤ [λ_min(Ω̂′Ω̂ + λ_nI_M)]⁻¹.

Similarly, we can prove that λ_max[Ω̃(Ω̃′Ω̃ + λ_nI_M)⁻¹(Ω̃′Ω̃ + λ_nI_M)⁻¹Ω̃′] ≤ [λ_min(Ω̃′Ω̃ + λ_nI_M)]⁻¹. It follows from Assumptions 2-3 that

‖ŵ‖₂² = ‖(Ω̂′Ω̂ + λ_nI_M)⁻¹Ω̂′y − σ²(Ω̂′Ω̂ + λ_nI_M)⁻¹κ‖₂²
≤ 2y′Ω̂(Ω̂′Ω̂ + λ_nI_M)⁻¹(Ω̂′Ω̂ + λ_nI_M)⁻¹Ω̂′y + 2σ⁴κ′(Ω̂′Ω̂ + λ_nI_M)⁻¹(Ω̂′Ω̂ + λ_nI_M)⁻¹κ
≤ 2C₄⁻¹n⁻¹y′y + 2σ⁴C₄⁻²n⁻²κ′κ
≤ 2C₄⁻¹C₂ + 2σ⁴C₄⁻²n⁻²K²M, a.s.,

and

‖w̃‖₂² = ‖(Ω̃′Ω̃ + λ_nI_M)⁻¹Ω̃′y‖₂² ≤ C₅⁻¹n⁻¹y′y ≤ C₄⁻¹C₂, a.s.

In a similar way, we obtain ‖w̄‖₂² ≤ C₄⁻¹C₂, a.s. From Assumption 3 and (4), the same bounds hold for ŵ^(n), w̃^(n) and w̄^(n), and we complete the proof by taking B₂ = max{2C₄⁻¹C₂, 2σ⁴C₄⁻²}.

The following three lemmas characterize the degree of impact of removing an observation on ŵ, w̃ and ŵ*, and are used to prove Theorem 15.

Lemma 26 Under Assumptions 1-5, we have E_S[‖ŵ − ŵ^(n)‖₂²] = O(n⁻²K⁴M²).

Proof Let θ̂^(n) = (π₁θ̂₁^(n), π₂θ̂₂^(n), …, π_Mθ̂_M^(n)). Then we have

ŵ = (Ω̂′Ω̂ + λ_nI_M)⁻¹(Ω̂′y − σ²κ) = (θ̂′X_M′X_Mθ̂ + λ_nI_M)⁻¹(θ̂′X_M′y − σ²κ)

and

ŵ^(n) = (Ω̂^(n)′Ω̂^(n) + λ_{n−1}I_M)⁻¹(Ω̂^(n)′y^(n) − σ²κ) = (θ̂^(n)′X_M^(n)′X_M^(n)θ̂^(n) + λ_{n−1}I_M)⁻¹(θ̂^(n)′X_M^(n)′y^(n) − σ²κ).

Thus, under Assumption 3, we obtain

(nC₄ + λ_n)² E_S[‖ŵ − ŵ^(n)‖₂²] ≤ E_S[λ²_min(Ω̂′Ω̂ + λ_nI_M) ‖ŵ − ŵ^(n)‖₂²]
≤ E_S[‖(Ω̂′Ω̂ + λ_nI_M)(ŵ − ŵ^(n))‖₂²]
≤ 2E_S[‖(Ω̂′Ω̂ + λ_nI_M)ŵ − (θ̂^(n)′X_M^(n)′X_M^(n)θ̂^(n) + λ_{n−1}I_M)ŵ^(n)‖₂²] + 2E_S[‖(θ̂^(n)′X_M^(n)′X_M^(n)θ̂^(n) + λ_{n−1}I_M)ŵ^(n) − (Ω̂′Ω̂ + λ_nI_M)ŵ^(n)‖₂²]
= 2E_S[‖θ̂′X_M′y − θ̂^(n)′X_M^(n)′y^(n)‖₂²] + 2E_S[‖(Ω̂^(n)′Ω̂^(n) + λ_{n−1}I_M − Ω̂′Ω̂ − λ_nI_M)ŵ^(n)‖₂²]
≤ 2E_S[‖θ̂′X_M′y − θ̂^(n)′X_M^(n)′y^(n)‖₂²] + 4E_S[‖(Ω̂^(n)′Ω̂^(n) − Ω̂′Ω̂)ŵ^(n)‖₂²] + 4(λ_{n−1} − λ_n)² E_S(‖ŵ^(n)‖₂²). (10)

We now consider the first term on the right-hand side of (10). From (2), Assumption 2, Lemma 20 and (3), we have

E_S[max_{1≤m≤M} |(θ̂_m − θ̂_m^(n))′X_m′y|²]
= E_S[max_{1≤m≤M} |x_n′π_m(X_m′X_m)⁻¹X_m′y(y_n − x_n′π_mθ̂_m^(n))|²]
≤ E_S{max_{1≤m≤M} λ_max(π_m′x_nx_n′π_m) λ_max[X_m(X_m′X_m)⁻¹(X_m′X_m)⁻¹X_m′] ‖y‖₂² |y_n − x_n′π_mθ̂_m^(n)|²}
≤ C₁⁻¹C₂C₃²K E_S(max_{1≤m≤M} |y_n − x_n′π_mθ̂_m^(n)|²)
≤ 2C₁⁻¹C₂C₃²K [E_S(y_n²) + E_S(max_{1≤m≤M} |x_n′π_mθ̂_m^(n)|²)]
≤ 2C₁⁻¹C₂C₃²K (C₂ + C₃²B₁K)

and

E_S[max_{1≤m≤M} |θ̂_m′π_m′X_M′y − θ̂_m^(n)′π_m′X_M^(n)′y^(n)|²]
= E_S[max_{1≤m≤M} |(θ̂_m − θ̂_m^(n))′X_m′y + θ̂_m^(n)′(X_m′y − X_m^(n)′y^(n))|²]
≤ 2E_S[max_{1≤m≤M} |(θ̂_m − θ̂_m^(n))′X_m′y|²] + 2E_S(max_{1≤m≤M} ‖θ̂_m^(n)‖₂² max_{1≤m≤M} ‖X_m′y − X_m^(n)′y^(n)‖₂²)
≤ 2E_S[max_{1≤m≤M} |(θ̂_m − θ̂_m^(n))′X_m′y|²] + 2B₁ Σ_{k=1}^K E_S(x²_{(k)n}y_n²)
≤ 2E_S[max_{1≤m≤M} |(θ̂_m − θ̂_m^(n))′X_m′y|²] + 2C₂C₃²B₁K.

Hence, we obtain

E_S[‖θ̂′X_M′y − θ̂^(n)′X_M^(n)′y^(n)‖₂²] ≤ Σ_{m=1}^M E_S(|θ̂_m′π_m′X_M′y − θ̂_m^(n)′π_m′X_M^(n)′y^(n)|²) = O(K²M). (11)

We then consider the second term on the right-hand side of (10).
It follows from Assumption 2 and Lemmas 20-21 that

E_S(max_{m,t} |x_{n−1}′π_mθ̂_mθ̂_t′π_t′x_{n−1} − x_{n−1}′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_{n−1}|²)
≤ 2E_S[max_{m,t} |x_{n−1}′π_m(θ̂_m − θ̂_m^(n))θ̂_t′π_t′x_{n−1}|²] + 2E_S[max_{m,t} |x_{n−1}′π_mθ̂_m^(n)(θ̂_t − θ̂_t^(n))′π_t′x_{n−1}|²]
≤ 2E_S(max_m ‖θ̂_m − θ̂_m^(n)‖₂² max_t ‖x_{n−1}‖₂² ‖θ̂_t‖₂² ‖x_{n−1}‖₂²) + 2E_S(max_m ‖x_{n−1}‖₂² ‖θ̂_m^(n)‖₂² ‖x_{n−1}‖₂² max_t ‖θ̂_t − θ̂_t^(n)‖₂²)
≤ 4C₃⁴B₁K² E_S(max_m ‖θ̂_m − θ̂_m^(n)‖₂²) = O(n⁻²K⁴).

Since the summands below are identically distributed across i, the Cauchy-Schwarz inequality then gives

E_S[max_{m,t} |Σ_{i=1}^{n−1}(x_i′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_i − x_i′π_mθ̂_mθ̂_t′π_t′x_i) − x_n′π_mθ̂_mθ̂_t′π_t′x_n|²]
≤ 2E_S[max_{m,t} |Σ_{i=1}^{n−1}(x_i′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_i − x_i′π_mθ̂_mθ̂_t′π_t′x_i)|²] + 2E_S(max_{m,t} |x_n′π_mθ̂_m|² |θ̂_t′π_t′x_n|²)
≤ 2(n−1)² E_S(max_{m,t} |x_{n−1}′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_{n−1} − x_{n−1}′π_mθ̂_mθ̂_t′π_t′x_{n−1}|²) + 2E_S(max_m ‖x_n‖₂⁴ ‖θ̂_m‖₂⁴)
≤ 2(n−1)² E_S(max_{m,t} |x_{n−1}′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_{n−1} − x_{n−1}′π_mθ̂_mθ̂_t′π_t′x_{n−1}|²) + 2C₃⁴B₁²K² = O(K⁴).

So from Lemma 25,

Ω̂′Ω̂ = (y′P_m′P_ty)_{M×M} = (θ̂_m′X_m′X_tθ̂_t)_{M×M} = (Σ_{i=1}^n x_i′π_mθ̂_mθ̂_t′π_t′x_i)_{M×M}

and

Ω̂^(n)′Ω̂^(n) = (θ̂_m^(n)′X_m^(n)′X_t^(n)θ̂_t^(n))_{M×M} = (Σ_{i=1}^{n−1} x_i′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_i)_{M×M},

we have

E_S[‖(Ω̂^(n)′Ω̂^(n) − Ω̂′Ω̂)ŵ^(n)‖₂²]
= E_S[Σ_{m,t} ŵ_m^(n)ŵ_t^(n) Σ_s (Σ_{i=1}^{n−1} x_i′π_mθ̂_m^(n)θ̂_s^(n)′π_s′x_i − Σ_{i=1}^n x_i′π_mθ̂_mθ̂_s′π_s′x_i)(Σ_{i=1}^{n−1} x_i′π_sθ̂_s^(n)θ̂_t^(n)′π_t′x_i − Σ_{i=1}^n x_i′π_sθ̂_sθ̂_t′π_t′x_i)]
≤ E_S[Σ_{m,t} |ŵ_m^(n)ŵ_t^(n)| Σ_s max_{m,t} |Σ_{i=1}^{n−1} x_i′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_i − Σ_{i=1}^n x_i′π_mθ̂_mθ̂_t′π_t′x_i|²]
≤ M² E_S[Σ_{m=1}^M |ŵ_m^(n)|² max_{m,t} |Σ_{i=1}^{n−1} x_i′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x_i − Σ_{i=1}^n x_i′π_mθ̂_mθ̂_t′π_t′x_i|²]
= O(K⁴M²). (12)

Finally, combining (10)-(12) and using Assumption 5, we see that Lemma 26 is true.

Lemma 27 Under Assumptions 1-5, we have E_S[‖w̃ − w̃^(n)‖₂²] = O(n⁻²K⁴M²).

Proof Note that w̃ = (Ω̃′Ω̃ + λ_nI_M)⁻¹Ω̃′y and w̃^(n) = (Ω̃^(n)′Ω̃^(n) + λ_{n−1}I_M)⁻¹Ω̃^(n)′y^(n). So under Assumption 3, we obtain

(nC₄ + λ_n)² E_S[‖w̃ − w̃^(n)‖₂²] ≤ E_S[λ²_min(Ω̃′Ω̃ + λ_nI_M) ‖w̃ − w̃^(n)‖₂²]
≤ E_S[‖(Ω̃′Ω̃ + λ_nI_M)(w̃ − w̃^(n))‖₂²]
≤ 2E_S[‖(Ω̃′Ω̃ + λ_nI_M)w̃ − (Ω̃^(n)′Ω̃^(n) + λ_{n−1}I_M)w̃^(n)‖₂²] + 2E_S[‖(Ω̃^(n)′Ω̃^(n) + λ_{n−1}I_M)w̃^(n) − (Ω̃′Ω̃ + λ_nI_M)w̃^(n)‖₂²]
= 2E_S[‖Ω̃′y − Ω̃^(n)′y^(n)‖₂²] + 2E_S[‖(Ω̃^(n)′Ω̃^(n) + λ_{n−1}I_M − Ω̃′Ω̃ − λ_nI_M)w̃^(n)‖₂²]
≤ 2E_S[‖Ω̃′y − Ω̃^(n)′y^(n)‖₂²] + 4E_S[‖(Ω̃^(n)′Ω̃^(n) − Ω̃′Ω̃)w̃^(n)‖₂²] + 4(λ_{n−1} − λ_n)² E_S(‖w̃^(n)‖₂²). (13)

We now consider the first term on the right-hand side of (13). From Assumption 2, (8) and the definition of D_m, we have

‖Ω̃′y − Ω̂′y‖₂² = Σ_{m=1}^M {[y − D_m(I − P_m)y]′y − y′P_my}²
= Σ_{m=1}^M {y′(I_n − D_m)y + y′P_m(D_m − I_n)y}²
≤ 2M max_m {|y′(D_m − I_n)y|² + |y′P_m(D_m − I_n)y|²}
≤ 2M max_m {max_i [(1 − h^m_{ii})⁻¹ − 1]² (y′y)² + y′(D_m − I_n)P_my · y′P_m(D_m − I_n)y}
≤ 4M max_m max_i [(1 − h^m_{ii})⁻¹ − 1]² (y′y)²
≤ 4C₂²n²M[(1 − C₃²K/(nC₁))⁻¹ − 1]²
= 4C₂²C₃⁴n²MK²(nC₁ − C₃²K)⁻², a.s.

Similarly, it follows from (1) and (4) that ‖Ω̂^(n)′y^(n) − Ω̃^(n)′y^(n)‖₂² ≤ 4C₂²C₃⁴n²MK²(nC₁ − C₃²K)⁻², a.s. Hence, from Assumption 5 and (11), we obtain

E_S[‖Ω̃′y − Ω̃^(n)′y^(n)‖₂²] = E_S[‖Ω̃′y − Ω̂′y + Ω̂′y − Ω̂^(n)′y^(n) + Ω̂^(n)′y^(n) − Ω̃^(n)′y^(n)‖₂²]
≤ 3E_S[‖Ω̃′y − Ω̂′y‖₂²] + 3E_S[‖Ω̂′y − Ω̂^(n)′y^(n)‖₂²] + 3E_S[‖Ω̂^(n)′y^(n) − Ω̃^(n)′y^(n)‖₂²] = O(K²M). (14)

We then consider the second term on the right-hand side of (13).
From Assumption 5, Lemma 25, (1), (4), (9) and (12), we have

E_S[‖(Ω̃^(n)′Ω̃^(n) − Ω̃′Ω̃)w̃^(n)‖₂²]
= E_S[‖(Ω̃^(n)′Ω̃^(n) − Ω̂^(n)′Ω̂^(n) + Ω̂^(n)′Ω̂^(n) − Ω̂′Ω̂ + Ω̂′Ω̂ − Ω̃′Ω̃)w̃^(n)‖₂²]
≤ 3E_S[‖(Ω̃^(n)′Ω̃^(n) − Ω̂^(n)′Ω̂^(n))w̃^(n)‖₂²] + 3E_S[‖(Ω̂^(n)′Ω̂^(n) − Ω̂′Ω̂)w̃^(n)‖₂²] + 3E_S[‖(Ω̂′Ω̂ − Ω̃′Ω̃)w̃^(n)‖₂²]
= O(K⁴M²). (15)

Thus, under Assumption 5, the proof can be completed by combining (13)-(15).

Lemma 28 Under Assumptions 1-5, we have E_S[‖ŵ* − ŵ*^(n)‖₂²] = O(n⁻²K⁴M²).

Proof Note that

ŵ* = [E_{z*}(γ̂*γ̂*′)]⁻¹E_{z*}(γ̂*y*) = [θ̂′E_{z*}(x*x*′)θ̂]⁻¹θ̂′E_{z*}(x*y*)

and

ŵ*^(n) = [θ̂^(n)′E_{z*}(x*x*′)θ̂^(n)]⁻¹θ̂^(n)′E_{z*}(x*y*).

So under Assumption 3, we obtain

C₄² E_S(‖ŵ* − ŵ*^(n)‖₂²) ≤ E_S{λ²_min[θ̂′E_{z*}(x*x*′)θ̂] ‖ŵ* − ŵ*^(n)‖₂²}
≤ E_S[‖θ̂′E_{z*}(x*x*′)θ̂(ŵ* − ŵ*^(n))‖₂²]
≤ 2E_S[‖θ̂′E_{z*}(x*x*′)θ̂ŵ* − θ̂^(n)′E_{z*}(x*x*′)θ̂^(n)ŵ*^(n)‖₂²] + 2E_S[‖θ̂^(n)′E_{z*}(x*x*′)θ̂^(n)ŵ*^(n) − θ̂′E_{z*}(x*x*′)θ̂ŵ*^(n)‖₂²]
= 2E_S[‖θ̂′E_{z*}(x*y*) − θ̂^(n)′E_{z*}(x*y*)‖₂²] + 2E_S[‖θ̂^(n)′E_{z*}(x*x*′)θ̂^(n)ŵ*^(n) − θ̂′E_{z*}(x*x*′)θ̂ŵ*^(n)‖₂²]. (16)

We now consider the first term on the right-hand side of (16). From (2), Assumption 2 and Lemma 20, we have

E_S[‖θ̂′E_{z*}(x*y*) − θ̂^(n)′E_{z*}(x*y*)‖₂²] ≤ Σ_{m=1}^M E_S[|(θ̂_m − θ̂_m^(n))′π_m′E_{z*}(x*y*)|²]
= Σ_{m=1}^M E_S[|x_n′π_m(X_m′X_m)⁻¹π_m′E_{z*}(x*y*)(y_n − x_n′π_mθ̂_m^(n))|²]
≤ C₁⁻²C₃²n⁻²K Σ_{m=1}^M E_S[‖π_m′E_{z*}(x*y*)‖₂² (y_n − x_n′π_mθ̂_m^(n))²]
≤ C₁⁻²C₂C₃⁴n⁻²K² Σ_{m=1}^M E_S[(y_n − x_n′π_mθ̂_m^(n))²]
≤ 2C₁⁻²C₂C₃⁴n⁻²K² Σ_{m=1}^M [E_S(y_n²) + E_S((x_n′π_mθ̂_m^(n))²)] = O(n⁻²K³M), (17)

where we used Σ_{k=1}^K [E_{z*}(x*_{(k)}y*)]² ≤ C₂C₃²K. It follows from Assumption 2 and Lemmas 20-21 that

E_S[max_{m,t} E_{z*}(|x*′π_mθ̂_mθ̂_t′π_t′x* − x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x*|²)]
≤ 2E_S{max_{m,t} E_{z*}[|x*′π_m(θ̂_m − θ̂_m^(n))θ̂_t′π_t′x*|²]} + 2E_S{max_{m,t} E_{z*}[|x*′π_mθ̂_m^(n)(θ̂_t − θ̂_t^(n))′π_t′x*|²]}
≤ 2E_S[max_m ‖θ̂_m − θ̂_m^(n)‖₂² max_t E_{z*}(‖x*‖₂² ‖θ̂_t‖₂² ‖x*‖₂²)] + 2E_S[max_m E_{z*}(‖x*‖₂² ‖θ̂_m^(n)‖₂² ‖x*‖₂²) max_t ‖θ̂_t − θ̂_t^(n)‖₂²]
≤ 4C₃⁴B₁K² E_S(max_m ‖θ̂_m − θ̂_m^(n)‖₂²) = O(n⁻²K⁴).

We then consider the second term on the right-hand side of (16). From Assumption 4, γ̂*γ̂*′ = (x*′π_mθ̂_mθ̂_t′π_t′x*)_{M×M} and γ̂*^(n)γ̂*^(n)′ = (x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x*)_{M×M}, we have

E_S[‖θ̂^(n)′E_{z*}(x*x*′)θ̂^(n)ŵ*^(n) − θ̂′E_{z*}(x*x*′)θ̂ŵ*^(n)‖₂²]
= E_S[Σ_{m,t} ŵ*^(n)_m ŵ*^(n)_t Σ_s E_{z*}(x*′π_mθ̂_mθ̂_s′π_s′x* − x*′π_mθ̂_m^(n)θ̂_s^(n)′π_s′x*) E_{z*}(x*′π_sθ̂_sθ̂_t′π_t′x* − x*′π_sθ̂_s^(n)θ̂_t^(n)′π_t′x*)]
≤ M² E_S[Σ_{m=1}^M |ŵ*^(n)_m|² max_{m,t} E_{z*}(|x*′π_mθ̂_mθ̂_t′π_t′x* − x*′π_mθ̂_m^(n)θ̂_t^(n)′π_t′x*|²)]
= O(n⁻²K⁴M²). (18)

Thus, under Assumption 5, the proof can be completed by combining (16)-(18).

The following lemma gives a case where Assumption 4 holds.

Lemma 29 Assume y* = Σ_{k=1}^K x*_{(k)}θ*_k + e* with x* = (x*_{(1)}, x*_{(2)}, …, x*_{(K)})′ ~ N(0, I_K) and E(e*|x*) = 0. If there is a constant C₁₀ > 0 such that Σ_{k=1}^K θ*²_k ≤ C₁₀, then under Assumption 3, Assumption 4 holds.

Proof Similar to the proof of (3), it follows from Assumption 3, γ̂* = θ̂′x* and E_{z*}(x*x*′) = I_K that

λ_max[θ̂(θ̂′θ̂)⁻²θ̂′] ≤ C₄⁻¹, a.s.

Then, from the definition of ŵ*, we have

‖ŵ*‖₂² = ‖[E_{z*}(γ̂*γ̂*′)]⁻¹E_{z*}(γ̂*y*)‖₂² = ‖[θ̂′E_{z*}(x*x*′)θ̂]⁻¹θ̂′E_{z*}(x*y*)‖₂²
= E_{z*}(y*x*′)θ̂(θ̂′θ̂)⁻²θ̂′E_{z*}(x*y*)
≤ C₄⁻¹ Σ_{k=1}^K [E_{z*}(x*_{(k)}y*)]²
= C₄⁻¹ Σ_{k=1}^K [Σ_{j=1}^K θ*_j E_{z*}(x*_{(k)}x*_{(j)}) + E_{z*}(x*_{(k)}e*)]²
= C₄⁻¹ Σ_{k=1}^K θ*²_k ≤ C₄⁻¹C₁₀, a.s.

Similarly, we have ‖ŵ*^(n)‖₂² ≤ C₄⁻¹C₁₀, a.s.
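Before turning to the proofs of the theorems, the O(n⁻²K⁴M²) rates of Lemmas 26-28 can be illustrated by simulation: recomputing the weight estimator after deleting the last observation, the average of ‖ŵ − ŵ^(n)‖₂² should shrink roughly by a factor of four each time n doubles. A rough sketch under assumptions not taken from the paper (fixed K and M, nested candidate models, a crude λ_n = log n schedule, and the σ²κ correction omitted for brevity):

```python
import numpy as np

def w_hat(X, y, M, lam):
    """Unrestricted L2-penalized model averaging weights over nested models."""
    Omega = np.column_stack([
        X[:, :m] @ np.linalg.lstsq(X[:, :m], y, rcond=None)[0] for m in range(1, M + 1)
    ])
    return np.linalg.solve(Omega.T @ Omega + lam * np.eye(M), Omega.T @ y)

rng = np.random.default_rng(3)
K = M = 4
for n in (100, 200, 400, 800):
    gaps = []
    for _ in range(200):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
        y = X @ np.array([1.0, 0.8, -0.6, 0.4]) + rng.normal(size=n)
        d = w_hat(X, y, M, np.log(n)) - w_hat(X[:-1], y[:-1], M, np.log(n - 1))
        gaps.append(d @ d)
    print(n, np.mean(gaps))   # should fall by about 4x as n doubles
```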
Appendix B. Proofs of Theorems

Proof of Theorem 13: We first prove 1). Let (ĉ₁, ĉ₂, …, ĉ_M)′ = P̂′ŵ₀ and (d̂₁, d̂₂, …, d̂_M)′ = P̂′ŵ*. Then, we have

M̂₁(λ_n) = ‖Ẑŵ₀ − Ẑŵ‖₂² + ‖Ẑŵ − Ẑŵ*‖₂²
= Σ_{m=1}^M (ĉ_m − d̂_m)²ζ̂²_m/(λ_n + ζ̂_m)² + Σ_{m=1}^M d̂²_m[ζ̂_m/(λ_n + ζ̂_m) − 1]²
= Σ_{m=1}^M (ĉ_m − d̂_m)²ζ̂²_m/(λ_n + ζ̂_m)² + Σ_{m=1}^M d̂²_mλ²_n/(λ_n + ζ̂_m)²

and

dM̂₁(λ_n)/dλ_n = Σ_{m=1}^M [−2(ĉ_m − d̂_m)²ζ̂²_m/(λ_n + ζ̂_m)³ + 2d̂²_mλ_nζ̂_m/(λ_n + ζ̂_m)³].

From

Σ_{m=1}^M (ĉ_m − d̂_m)²ζ̂²_m/(λ_n + ζ̂_m)³ ≥ (λ_n + ζ̂_M)⁻¹ Σ_{m=1}^M (ĉ_m − d̂_m)²ζ̂²_m/(λ_n + ζ̂_m)² = (λ_n + ζ̂_M)⁻¹ ‖Ẑ(ŵ₀ − ŵ)‖₂²,

we see that when ŵ₀ ≠ ŵ*, Σ_{m=1}^M (ĉ_m − d̂_m)²ζ̂²_m/ζ̂³_m > 0, so that dM̂₁(λ_n)/dλ_n < 0 at λ_n = 0. So we have λ̂_n > 0 and M̂₁(λ̂_n) < M̂₁(0) when ŵ₀ ≠ ŵ*.

Below we prove 2). Let (c̃₁, c̃₂, …, c̃_M)′ = P̃′w̃₀ and (d̃₁, d̃₂, …, d̃_M)′ = P̃′ŵ*. Then, we have

M̃₁(λ_n) = ‖Z̃w̃₀ − Z̃w̃‖₂² + ‖Z̃w̃ − Z̃ŵ*‖₂²
= Σ_{m=1}^M (c̃_m − d̃_m)²ζ̃²_m/(λ_n + ζ̃_m)² + Σ_{m=1}^M d̃²_m[ζ̃_m/(λ_n + ζ̃_m) − 1]²
= Σ_{m=1}^M (c̃_m − d̃_m)²ζ̃²_m/(λ_n + ζ̃_m)² + Σ_{m=1}^M d̃²_mλ²_n/(λ_n + ζ̃_m)²

and

dM̃₁(λ_n)/dλ_n = Σ_{m=1}^M [−2(c̃_m − d̃_m)²ζ̃²_m/(λ_n + ζ̃_m)³ + 2d̃²_mλ_nζ̃_m/(λ_n + ζ̃_m)³].

Similarly, from

Σ_{m=1}^M (c̃_m − d̃_m)²ζ̃²_m/(λ_n + ζ̃_m)³ ≥ (λ_n + ζ̃_M)⁻¹ Σ_{m=1}^M (c̃_m − d̃_m)²ζ̃²_m/(λ_n + ζ̃_m)² = (λ_n + ζ̃_M)⁻¹ ‖Z̃(w̃₀ − w̃)‖₂²,

we see that when w̃₀ ≠ ŵ*, Σ_{m=1}^M (c̃_m − d̃_m)²ζ̃²_m/ζ̃³_m > 0. So we have λ̃_n > 0 and M̃₁(λ̃_n) < M̃₁(0) when w̃₀ ≠ ŵ*.
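In the notation of the proof of Theorem 13, M̂₁(λ_n) = Σ_m (ĉ_m − d̂_m)²ζ̂²_m/(λ_n + ζ̂_m)² + Σ_m d̂²_mλ²_n/(λ_n + ζ̂_m)², whose derivative at λ_n = 0 is strictly negative whenever ŵ₀ ≠ ŵ*. A toy evaluation (NumPy; the spectrum ζ̂ and the coordinates ĉ, d̂ are invented numbers, not quantities from the paper) showing that the minimizing penalty is strictly positive:

```python
import numpy as np

zeta = np.array([0.5, 1.0, 2.0, 5.0])   # illustrative eigenvalues zeta_m
c = np.array([1.0, -0.5, 0.3, 0.2])     # coordinates of w0 in the eigenbasis
d = np.array([0.8, -0.2, 0.1, 0.4])     # coordinates of w* in the eigenbasis

def M1(lam):
    # Decomposition used in the proof of Theorem 13.
    return np.sum((c - d) ** 2 * zeta**2 / (lam + zeta) ** 2
                  + d**2 * lam**2 / (lam + zeta) ** 2)

lams = np.linspace(0.0, 5.0, 2001)
vals = np.array([M1(l) for l in lams])
lam_hat = lams[vals.argmin()]
print(lam_hat > 0.0, M1(lam_hat) < M1(0.0))   # both True since c != d
```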
Proof of Theorem 14: We first prove that C(w, S) is an AERM. It follows from Assumption 5 and Lemma 25 that

E_S[F̂(ŵ, S) − F̂(w̄, S)] = E_S[F̂(ŵ, S) − n⁻¹C(ŵ, S) + n⁻¹C(ŵ, S) − F̂(w̄, S)]
≤ E_S[F̂(ŵ, S) − n⁻¹C(ŵ, S) + n⁻¹C(w̄, S) − F̂(w̄, S)]
= E_S[2σ²w̄′κ/n + λ_n‖w̄‖₂²/n − 2σ²ŵ′κ/n − λ_n‖ŵ‖₂²/n]
≤ n⁻¹[4σ²B₂^{1/2}KM^{1/2}(1 + n⁻²K²M)^{1/2} + 2B₂λ_n(1 + n⁻²K²M)]
= O[n⁻¹max(λ_n, KM^{1/2})],

where the second step uses C(ŵ, S) ≤ C(w̄, S). So C(w, S) is an AERM with rate n⁻¹max(λ_n, KM^{1/2}).

Next, we prove that J(w, S) is an AERM. It follows from Lemmas 20, 21 and 25 that

‖θ̂(w̃)‖₂² = Σ_{m,t} w̃_mw̃_t θ̂_m′π_m′π_tθ̂_t ≤ Σ_{m,t} |w̃_mw̃_t| |θ̂_m′π_m′π_tθ̂_t|
≤ max_{m,t} |θ̂_m′π_m′π_tθ̂_t| (Σ_{m=1}^M |w̃_m|)²
≤ max_{m,t} ‖θ̂_m‖₂ ‖θ̂_t‖₂ · M Σ_{m=1}^M |w̃_m|²
= M max_m ‖θ̂_m‖₂² ‖w̃‖₂² ≤ B₁B₂M, a.s., (19)

and

E_S[‖θ̂(w̃) − θ̂^(n)(w̃)‖₂²] = E_S[Σ_{m,t} w̃_mw̃_t (θ̂_m − θ̂_m^(n))′π_m′π_t(θ̂_t − θ̂_t^(n))]
≤ E_S[Σ_{m,t} |w̃_mw̃_t| ‖θ̂_m − θ̂_m^(n)‖₂ ‖θ̂_t − θ̂_t^(n)‖₂]
≤ M E_S[max_{m,t} ‖θ̂_m − θ̂_m^(n)‖₂ ‖θ̂_t − θ̂_t^(n)‖₂ Σ_{m=1}^M |w̃_m|²]
≤ B₂M E_S[max_m ‖θ̂_m − θ̂_m^(n)‖₂²] = O(n⁻²K²M). (20)

In a similar way to (19), we obtain ‖θ̂^(n)(w̃)‖₂² ≤ B₁B₂M, a.s. Further, from Assumption 2, we have

E_S{‖x_n[2y_n − x_n′θ̂(w̃) − x_n′θ̂^(n)(w̃)]‖₂²} ≤ Σ_{k=1}^K E_S{x²_{(k)n}[2y_n − x_n′θ̂(w̃) − x_n′θ̂^(n)(w̃)]²}
≤ C₃²K E_S{[2y_n − x_n′θ̂(w̃) − x_n′θ̂^(n)(w̃)]²}
≤ C₃²K E_S{12y_n² + 3[x_n′θ̂(w̃)]² + 3[x_n′θ̂^(n)(w̃)]²}
≤ C₃²K E_S[12y_n² + 3‖x_n‖₂² ‖θ̂(w̃)‖₂² + 3‖x_n‖₂² ‖θ̂^(n)(w̃)‖₂²]
≤ 12C₂C₃²K + 6C₃⁴B₁B₂K²M.

Therefore, since the summands below are identically distributed across i, we obtain

|E_S[F̂(w̃, S) − n⁻¹J(w̃, S)]| = |E_S{n⁻¹ Σ_{i=1}^n ([y_i − x_i′θ̂(w̃)]² − [y_i − x_i′θ̂^(i)(w̃)]²) − λ_n‖w̃‖₂²/n}|
≤ |E_S{[y_n − x_n′θ̂(w̃)]² − [y_n − x_n′θ̂^(n)(w̃)]²}| + E_S(λ_n‖w̃‖₂²/n)
= |E_S{[2y_n − x_n′θ̂(w̃) − x_n′θ̂^(n)(w̃)][x_n′θ̂^(n)(w̃) − x_n′θ̂(w̃)]}| + E_S(λ_n‖w̃‖₂²/n)
≤ {E_S[‖x_n(2y_n − x_n′θ̂(w̃) − x_n′θ̂^(n)(w̃))‖₂²]}^{1/2}{E_S[‖θ̂(w̃) − θ̂^(n)(w̃)‖₂²]}^{1/2} + B₂n⁻¹λ_n
= O[n⁻¹max(λ_n, K²M)]. (21)

In a similar way, it is seen that |E_S[n⁻¹J(w̄, S) − F̂(w̄, S)]| = O[n⁻¹max(λ_n, K²M)]. Thus, we have

E_S[F̂(w̃, S) − F̂(w̄, S)] = E_S[F̂(w̃, S) − n⁻¹J(w̃, S) + n⁻¹J(w̃, S) − F̂(w̄, S)]
≤ E_S[F̂(w̃, S) − n⁻¹J(w̃, S) + n⁻¹J(w̄, S) − F̂(w̄, S)]
= O[n⁻¹max(λ_n, K²M)].

So J(w, S) is an AERM with rate n⁻¹max(λ_n, K²M).

Proof of Theorem 15: We first prove that C(w, S) has PLoo and FLoo stability. From Lemma 26, we have

E_S[‖θ̂(ŵ) − θ̂(ŵ^(n))‖₂²] = E_S[Σ_{m,t} (ŵ_m − ŵ_m^(n))(ŵ_t − ŵ_t^(n)) θ̂_m′π_m′π_tθ̂_t]
≤ E_S[Σ_{m,t} |(ŵ_m − ŵ_m^(n))(ŵ_t − ŵ_t^(n))| ‖θ̂_m‖₂ ‖θ̂_t‖₂]
≤ M E_S[max_{m,t} ‖θ̂_m‖₂ ‖θ̂_t‖₂ Σ_{m=1}^M (ŵ_m − ŵ_m^(n))²]
≤ B₁M E_S[Σ_{m=1}^M (ŵ_m − ŵ_m^(n))²] = O(n⁻²K⁴M³). (22)

Similar to (21), we see that

E_S{[2y_n − x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ)][x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ^(n))]} = O(n⁻¹K²M)

and

E_S{[2y_n − x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ)][x_n′θ̂(ŵ^(n)) − x_n′θ̂(ŵ)]} = O(n⁻¹K³M²).

Noting that

|E_S{[y_n − x_n′θ̂^(n)(ŵ^(n))]² − [y_n − x_n′θ̂(ŵ)]²}| = |E_S{[2y_n − x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ)][x_n′θ̂(ŵ) − x_n′θ̂^(n)(ŵ^(n))]}|
≤ |E_S{[2y_n − x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ)][x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ^(n))]}| + |E_S{[2y_n − x_n′θ̂^(n)(ŵ^(n)) − x_n′θ̂(ŵ)][x_n′θ̂(ŵ^(n)) − x_n′θ̂(ŵ)]}|,

we have E_S{[y_n − x_n′θ̂^(n)(ŵ^(n))]² − [y_n − x_n′θ̂(ŵ)]²} = O(n⁻¹K³M²). In a similar way, we obtain

E_{S,z*}{[y* − x*′θ̂^(n)(ŵ^(n))]² − [y* − x*′θ̂(ŵ)]²} = O(n⁻¹K³M²).

Therefore, from Definitions 3-4, C(w, S) has PLoo and FLoo stability with rate n⁻¹K³M².

Next, we prove that J(w, S) has PLoo and FLoo stability. From Lemma 27, we have

|E_S{[2y_n − x_n′θ̂^(n)(w̃^(n)) − x_n′θ̂(w̃)][x_n′θ̂^(n)(w̃^(n)) − x_n′θ̂(w̃^(n))]}| = O(n⁻¹K²M)

and

|E_S{[2y_n − x_n′θ̂^(n)(w̃^(n)) − x_n′θ̂(w̃)][x_n′θ̂(w̃^(n)) − x_n′θ̂(w̃)]}| = O(n⁻¹K³M²).

Further, since

|E_S{[y_n − x_n′θ̂^(n)(w̃^(n))]² − [y_n − x_n′θ̂(w̃)]²}| ≤ |E_S{[2y_n − x_n′θ̂^(n)(w̃^(n)) − x_n′θ̂(w̃)][x_n′θ̂^(n)(w̃^(n)) − x_n′θ̂(w̃^(n))]}| + |E_S{[2y_n − x_n′θ̂^(n)(w̃^(n)) − x_n′θ̂(w̃)][x_n′θ̂(w̃^(n)) − x_n′θ̂(w̃)]}|,

it is seen that E_S{[y_n − x_n′θ̂^(n)(w̃^(n))]² − [y_n − x_n′θ̂(w̃)]²} = O(n⁻¹K³M²). In a similar way, we obtain E_{S,z*}{[y* − x*′θ̂^(n)(w̃^(n))]² − [y* − x*′θ̂(w̃)]²} = O(n⁻¹K³M²). Thus, from Definitions 3-4, J(w, S) has PLoo and FLoo stability with rate n⁻¹K³M².

Finally, we prove that F(w, S) has PLoo and FLoo stability. Similarly, it follows from Lemma 28 that

|E_S{[2y_n − x_n′θ̂^(n)(ŵ*^(n)) − x_n′θ̂(ŵ*)][x_n′θ̂^(n)(ŵ*^(n)) − x_n′θ̂(ŵ*^(n))]}| = O(n⁻¹K²M)

and

|E_S{[2y_n − x_n′θ̂^(n)(ŵ*^(n)) − x_n′θ̂(ŵ*)][x_n′θ̂(ŵ*^(n)) − x_n′θ̂(ŵ*)]}| = O(n⁻¹K³M²).

Since

|E_S{[y_n − x_n′θ̂^(n)(ŵ*^(n))]² − [y_n − x_n′θ̂(ŵ*)]²}| ≤ |E_S{[2y_n − x_n′θ̂^(n)(ŵ*^(n)) − x_n′θ̂(ŵ*)][x_n′θ̂^(n)(ŵ*^(n)) − x_n′θ̂(ŵ*^(n))]}| + |E_S{[2y_n − x_n′θ̂^(n)(ŵ*^(n)) − x_n′θ̂(ŵ*)][x_n′θ̂(ŵ*^(n)) − x_n′θ̂(ŵ*)]}|,

we see that E_S{[y_n − x_n′θ̂^(n)(ŵ*^(n))]² − [y_n − x_n′θ̂(ŵ*)]²} = O(n⁻¹K³M²). In a similar way, we obtain E_{S,z*}{[y* − x*′θ̂^(n)(ŵ*^(n))]² − [y* − x*′θ̂(ŵ*)]²} = O(n⁻¹K³M²). So from Definitions 3-4, F(w, S) has PLoo and FLoo stability with rate n⁻¹K³M².
Proof of Theorem 17: From Assumption 5, Assumption 6 and Lemma 25, we know that

E_S{λ²_max[E_{z*}(γ̂*γ̂*′) − Ω̂′Ω̂/n] ‖ŵ‖₂²} = o(n⁻¹K), (λ_n/n)² E_S(‖ŵ‖₂²) = o(n⁻¹K) and E_S(‖σ²κ/n‖₂²) = o(n⁻¹K).

Since

E_S[F(ŵ, S) − F(ŵ*, S)] = E_S[(ŵ − ŵ*)′E_{z*}(γ̂*γ̂*′)(ŵ − ŵ*)]

and

C₄ E_S[(ŵ − ŵ*)′E_{z*}(γ̂*γ̂*′)(ŵ − ŵ*)] ≤ E_S{λ_min[E_{z*}(γ̂*γ̂*′)](ŵ − ŵ*)′E_{z*}(γ̂*γ̂*′)(ŵ − ŵ*)}
≤ E_S[‖E_{z*}(γ̂*γ̂*′)(ŵ − ŵ*)‖₂²]
= E_S[‖E_{z*}(γ̂*γ̂*′)ŵ − (Ω̂′Ω̂ + λ_nI_M)ŵ/n + (Ω̂′Ω̂ + λ_nI_M)ŵ/n − E_{z*}(γ̂*γ̂*′)ŵ*‖₂²]
= E_S{‖[E_{z*}(γ̂*γ̂*′) − (Ω̂′Ω̂/n + λ_nI_M/n)]ŵ + (θ̂′X_M′y − σ²κ)/n − θ̂′E_{z*}(x*y*)‖₂²}
= E_S{‖[E_{z*}(γ̂*γ̂*′) − (Ω̂′Ω̂/n + λ_nI_M/n)]ŵ + θ̂′[X_M′y/n − E_{z*}(x*y*)] − σ²κ/n‖₂²}
≤ 4E_S{‖[E_{z*}(γ̂*γ̂*′) − Ω̂′Ω̂/n]ŵ‖₂²} + 4(λ_n/n)² E_S(‖ŵ‖₂²) + 4E_S{‖θ̂′[X_M′y/n − E_{z*}(x*y*)]‖₂²} + 4E_S(‖σ²κ/n‖₂²)
≤ 4E_S{λ²_max[E_{z*}(γ̂*γ̂*′) − Ω̂′Ω̂/n] ‖ŵ‖₂²} + 4(λ_n/n)² E_S(‖ŵ‖₂²) + 4E_S{‖θ̂′[X_M′y/n − E_{z*}(x*y*)]‖₂²} + 4E_S(‖σ²κ/n‖₂²),

we see that proving E_S{‖θ̂′[X_M′y/n − E_{z*}(x*y*)]‖₂²} = O(n⁻¹KM) is sufficient to demonstrate that C(w, S) is consistent with rate O(n⁻¹KM). It follows from Lemma 20 (via λ_max(θ̂θ̂′) ≤ tr(θ̂θ̂′) = Σ_{m=1}^M ‖θ̂_m‖₂² ≤ B₁M) and the Marcinkiewicz-Zygmund-Burkholder inequality in Lin and Bai (2010) that

E_S{‖θ̂′[X_M′y/n − E_{z*}(x*y*)]‖₂²} ≤ E_S[λ_max(θ̂θ̂′) ‖X_M′y/n − E_{z*}(x*y*)‖₂²]
≤ B₁M E_S[‖X_M′y/n − E_{z*}(x*y*)‖₂²]
= B₁M Σ_{k=1}^K E_S{[n⁻¹Σ_{i=1}^n (x_{(k)i}y_i − E_{z*}(x*_{(k)}y*))]²}
≤ 4B₁n⁻²KM max_{1≤k≤K} E_S{Σ_{i=1}^n [x_{(k)i}y_i − E_{z*}(x*_{(k)}y*)]²}
= 4B₁n⁻¹KM max_{1≤k≤K} var(x_{(k)i}y_i)
≤ 4B₁C₆n⁻¹KM.

Thus, we have completed the proof of Theorem 17.
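The last display in the proof of Theorem 17 is, at its core, the observation that each coordinate of X_M′y/n − E_{z*}(x*y*) is a centered sample mean whose variance is O(n⁻¹), so the squared norm is O(n⁻¹K). A quick Monte Carlo check (NumPy; an i.i.d. Gaussian toy design for which E_{z*}(x*y*) is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5
theta = rng.normal(size=K)

def sq_gap(n):
    X = rng.normal(size=(n, K))
    y = X @ theta + rng.normal(size=n)
    # Population moment E(x* y*) = theta under this design (x* ~ N(0, I_K)).
    return np.sum((X.T @ y / n - theta) ** 2)

for n in (200, 400, 800, 1600):
    print(n, np.mean([sq_gap(n) for _ in range(500)]))   # roughly halves as n doubles
```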
Appendix C. Figures and Tables

[Figures 1-14 are omitted here; only their captions are retained. Each of Figures 1-12 is a panel grid over n = 100, 300, 500, 700 plotting the mean of MSEs against the square of R, with one line per method (legend "Method": AI, Cp, ...); Figures 13-14 are box plots of MSEs over n = 110, 210, 320, 420 for the methods AI, Cp, BI, SA, SB, MM, RM, GM, JM and RJ.]

Figure 1: The mean of MSEs under homoskedastic errors with α = 0.5 for the nested setting of the simulation study.
Figure 2: The mean of MSEs under homoskedastic errors with α = 1.0 for the nested setting of the simulation study.
Figure 3: The mean of MSEs under homoskedastic errors with α = 1.5 for the nested setting of the simulation study.
Figure 4: The mean of MSEs under heteroskedastic errors with α = 0.5 for the nested setting of the simulation study.
Figure 5: The mean of MSEs under heteroskedastic errors with α = 1.0 for the nested setting of the simulation study.
Figure 6: The mean of MSEs under heteroskedastic errors with α = 1.5 for the nested setting of the simulation study.
Figure 7: The mean of MSEs under homoskedastic errors with α = 0.5 for the non-nested setting of the simulation study.
Figure 8: The mean of MSEs under homoskedastic errors with α = 1.0 for the non-nested setting of the simulation study.
Figure 9: The mean of MSEs under homoskedastic errors with α = 1.5 for the non-nested setting of the simulation study.
Figure 10: The mean of MSEs under heteroskedastic errors with α = 0.5 for the non-nested setting of the simulation study.
Figure 11: The mean of MSEs under heteroskedastic errors with α = 1.0 for the non-nested setting of the simulation study.
Figure 12: The mean of MSEs under heteroskedastic errors with α = 1.5 for the non-nested setting of the simulation study.
Figure 13: The box plot of MSEs in Case 1 of the real data analysis.
Figure 14: The box plot of MSEs in Case 2 of the real data analysis.

Table 1: The mean, median and BPR of MSEs in Case 1 of the real data analysis (row groups correspond to the sample sizes n = 110, 210, 320, 420 shown in Figure 13).

n     Statistic  AI     Cp     BI     SA     SB     MM     RM     GM     JM     RJ
110   Mean       0.190  0.183  0.184  0.181  0.178  0.170  0.163  0.179  0.169  0.163
110   Median     0.182  0.177  0.182  0.175  0.177  0.168  0.162  0.174  0.167  0.162
110   BPR        0.009  0.012  0.016  0.037  0.043  0.040  0.345  0.034  0.053  0.410
210   Mean       0.161  0.160  0.170  0.158  0.166  0.155  0.152  0.156  0.155  0.152
210   Median     0.161  0.160  0.169  0.158  0.165  0.155  0.152  0.156  0.155  0.152
210   BPR        0.008  0.008  0.000  0.118  0.020  0.033  0.383  0.078  0.035  0.320
320   Mean       0.152  0.152  0.157  0.150  0.155  0.149  0.147  0.149  0.149  0.147
320   Median     0.152  0.152  0.156  0.151  0.155  0.149  0.147  0.149  0.149  0.148
320   BPR        0.033  0.005  0.045  0.143  0.033  0.030  0.328  0.083  0.015  0.288
420   Mean       0.151  0.151  0.150  0.149  0.151  0.148  0.147  0.148  0.148  0.147
420   Median     0.149  0.149  0.149  0.147  0.150  0.146  0.145  0.146  0.146  0.145
420   BPR        0.018  0.008  0.130  0.188  0.050  0.013  0.223  0.068  0.045  0.260

Table 2: The mean, median and BPR of MSEs in Case 2 of the real data analysis (same layout as Table 1).

n     Statistic  AI     Cp     BI     SA     SB     MM     RM     GM     JM     RJ
110   Mean       0.169  0.168  0.172  0.163  0.164  0.159  0.153  0.161  0.159  0.153
110   Median     0.165  0.164  0.170  0.160  0.163  0.158  0.152  0.159  0.158  0.152
110   BPR        0.019  0.000  0.000  0.082  0.036  0.011  0.477  0.014  0.027  0.334
210   Mean       0.151  0.151  0.157  0.148  0.152  0.148  0.145  0.148  0.148  0.145
210   Median     0.152  0.151  0.157  0.149  0.152  0.149  0.144  0.149  0.148  0.144
210   BPR        0.010  0.000  0.003  0.145  0.018  0.003  0.568  0.003  0.010  0.243
320   Mean       0.147  0.147  0.152  0.144  0.148  0.145  0.142  0.145  0.145  0.142
320   Median     0.147  0.147  0.152  0.145  0.147  0.145  0.143  0.145  0.145  0.143
320   BPR        0.013  0.000  0.000  0.283  0.025  0.000  0.428  0.003  0.013  0.238
420   Mean       0.144  0.144  0.149  0.141  0.145  0.142  0.140  0.142  0.142  0.140
420   Median     0.143  0.143  0.148  0.141  0.143  0.142  0.140  0.143  0.142  0.140
420   BPR        0.015  0.000  0.003  0.338  0.040  0.000  0.310  0.000  0.015  0.280