# Mutual Transfer Learning for Massive Data

Ching-Wei Cheng (1), Xingye Qiao (2), Guang Cheng (1)

(1) Department of Statistics, Purdue University. (2) Department of Mathematical Sciences, Binghamton University. Correspondence to: Xingye Qiao.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

In the transfer learning problem, the target and the source data domains are typically known. In this article, we study a new paradigm called mutual transfer learning where, among many heterogeneous data domains, every data domain could potentially be the target of interest, and it could also be a useful source to help the learning in other data domains. However, given a target, not every data domain can be a successful source; only data sets that are similar enough to be thought of as from the same population can be useful sources for each other. Under this mutual learnability assumption, a confidence distribution fusion approach is proposed to recover the mutual learnability relation in the transfer learning regime. Our proposed method achieves the same oracle statistical inferential accuracy as if the true learnability structure were known. It can be implemented in an efficient parallel fashion to deal with large-scale data. Simulated and real examples are analyzed to illustrate the usefulness of the proposed method.

1. Introduction

Transfer learning (TL) aims to leverage knowledge from a related domain (called the source domain) to improve the predictive and inferential performance in a target domain, for which the available training data are limited. Source data are drawn from a source distribution P on S_P, and a relatively small quantity of labeled or unlabeled data are drawn from the target distribution Q on S_Q. We focus on the homogeneous TL setting with S_P = S_Q, in which P and Q are different but related distributions. See the surveys of TL in Pan & Yang (2009) and Zhuang et al. (2019).

In the TL problem, it is typically known a priori which data domain is the target and which data will be used as the source. However, this is not necessarily the case in many contemporary applications. For example, one of the 4Vs of Big Data is variety. Big Data are often collections of multiple data sets (called data units hereafter) from different data domains. These data units are of the same nature but are distributed differently because, for example, they were collected in different time periods, at different locations, or with different data collection devices. It may hence be desirable to leverage TL techniques to transfer knowledge between these related data domains so as to improve the predictive performance or statistical inference in a target domain. However, in the applications we consider in this article and in many other cases, literally every data domain could potentially be a target of interest, and every data domain could potentially be useful as a source to help the learning performance in other domains. In other words, the target and source domains are not known a priori. Moreover, due to the variety among the data units, it is quite possible that, given a target domain, some data units are less useful as source data than others, even though all of them could be useful to some extent.
Hence, a second feature of the problem considered here is that one must identify the source data units that are more useful for each possible target. While the existing TL literature is fairly diverse and extensive, fewer efforts have been made to make successful transfers in this setting.

In this article, we consider a mutual transfer learning (MTL) framework in which every data unit could potentially be the target. The goal of MTL is to simultaneously identify useful source data units for each target domain and use such information to improve learning performance. We propose a mutual learnability assumption, which posits that only data sets that are similar enough to be thought of as from the same population can be useful sources for each other. Under this assumption, we propose a confidence distribution fusion approach to recover the mutual learnability structure in the transfer learning regime, using the regression problem as a working example. Incorporating the recovered learnability structure, we fit a statistical model for both predictive and inferential purposes. Our proposed method achieves the same oracle statistical inferential accuracy as if the true learnability structure were known.

Figure 1. Illustration of the two-layer learnability. The data units from the same subgroup have more learnability between them.

Suppose there are M data units from M related domains. For each potential target domain, there exist M − 1 source data units to transfer from. We consider a regression problem with two sets of features: the global features and the heterogeneous features. While all the data units could be useful to learn the role of the global features in explaining the response variable, it is unreasonable to assume that all the M − 1 source data units are equally useful to learn the heterogeneous features. Specifically, given the target, there may be a subgroup of data units that are more useful than the others. This can be explained by a two-layer learnability model, shown in Figure 1, in which different behaviors of the data units are coded by colors. The data units bounded together are assumed to be from the same subgroup, within which it is easier to transfer useful information to learn the heterogeneous features, hence more learnability. Data from different subgroups are significantly different, and the learnability between subgroups is limited to only the global features.

To motivate our statistical model, we analyzed NOAA's nClimDiv database. It consists of monthly temperature, precipitation, and several indices for drought severity, and contains a total of N = 503,616 observations from M = 344 climate divisions. Each climate division is viewed as a data domain. The monthly average temperature is the response of interest, and there are 8 features, of which we designate 5 as global features and 3 as heterogeneous features. Our proposed method has identified 5 subgroups, shown in different colors in Figure 2. Knowledge about the global features can be learned from all climate divisions, while that about the heterogeneous features can only be learned from identified source data in the same subgroup. Whether a given feature is global or heterogeneous is typically suggested by domain knowledge or a preliminary inspection of the data. In this real data example, we checked whether the kernel density of the unit-level coefficient estimates of a feature is multimodal to determine whether that feature is global or heterogeneous.
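As an illustration of this screening step, a minimal sketch is given below: it fits a separate least squares model within each data unit and flags a feature as heterogeneous when a kernel density estimate of its M unit-level coefficient estimates shows more than one well-separated mode. The container `units`, the peak-counting rule, and its threshold are illustrative assumptions rather than the exact procedure used for the nClimDiv analysis.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LinearRegression

def unitwise_coefficients(units):
    """Fit a separate OLS model in each data unit and stack the coefficients.
    `units` is a list of (X_i, y_i) pairs, one pair per data unit."""
    coefs = [LinearRegression().fit(X_i, y_i).coef_ for X_i, y_i in units]
    return np.asarray(coefs)              # shape: (M, number of features)

def looks_heterogeneous(coef_column, grid_size=512, rel_height=0.1):
    """Crude multimodality check: count well-separated local maxima of a kernel
    density estimate of the M unit-level estimates for a single feature."""
    kde = gaussian_kde(coef_column)
    grid = np.linspace(coef_column.min(), coef_column.max(), grid_size)
    dens = kde(grid)
    peaks = [i for i in range(1, grid_size - 1)
             if dens[i] > dens[i - 1] and dens[i] > dens[i + 1]
             and dens[i] > rel_height * dens.max()]
    return len(peaks) > 1                 # more than one mode -> heterogeneous
```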
We use the following statistical model to characterize this two-layer structure. Suppose that the M data units come from S exclusive subgroups, and that the i-th data unit consists of $n_i$ observations. The response $Y_i$ from the i-th data domain follows
$$Y_i = X^\top\beta_0 + Z^\top\vartheta_i + \varepsilon = X^\top\beta_0 + Z^\top\theta_{i0} + Z^\top u_i + \varepsilon, \quad i = 1, \ldots, M, \qquad (1)$$
where the global features $X \in \mathbb{R}^p$ and the heterogeneous features $Z \in \mathbb{R}^q$ correspond to coefficients $\beta_0$ and $\vartheta_i$, which have different learnability. $\beta_0$ is the same among all data domains, and all data units can be used to learn it. $\vartheta_i = \theta_{i0} + u_i \in \mathbb{R}^q$ represents a unit-specific effect. Its first component $\theta_{i0}$ is equal among data units in the same subgroup as the i-th data unit; that is, $\theta_{i0} = \theta_{j0}$ if and only if the i-th and j-th data units are from the same subgroup. Knowledge about $\theta_{i0}$ can only be transferred between data units in the same subgroup. The second term $u_i$ explains the subtle difference between data units in the same subgroup, as seen from Figure 1. This random effect $u_i$ cannot be transferred. We assume that $\{u_i\}_{i=1}^M$ are independent and identically distributed with $E(u_i) = 0$ and $\mathrm{Var}(u_i) = \Psi$, whose minimum eigenvalue is $\tau = \lambda_{\min}(\Psi) > 0$.

Figure 2. Learnability recovered for the 344 climate divisions.

The distinct values of $\{\theta_{i0}\}_{i=1}^M$ are denoted as $\{\alpha_{s0}\}_{s=1}^S$. With a membership label $L_i \in \{1, \ldots, S\}$ indicating which subgroup unit i belongs to, i.e., $\theta_{i0} = \alpha_{L_i 0}$, model (1) is re-formulated as
$$Y_i = X^\top\beta_0 + Z^\top\alpha_{L_i 0} + Z^\top u_i + \varepsilon, \quad i = 1, \ldots, M. \qquad (2)$$

Our goal in this article is two-fold. First, we aim to recover the learnability structure, that is, the subgroup membership of each data unit. Second, we aim to transfer knowledge to improve learning performance for all the data domains by incorporating this revealed structure. We develop a confidence distribution (CD) fusion approach. Specifically, we first obtain individual unit estimates, and then combine their CD densities, as defined in Liu et al. (2015), using a pairwise concave fusion penalty on the $\theta_i$'s. We prove that the resulting estimate is exactly the same as that based on the full-data method, which is often considered the gold standard in meta-analysis (e.g., Debray et al., 2015). However, the full-data method has much larger computational costs, as discussed in Section 3.2.

The proposed estimator is proven to asymptotically achieve the highest statistical inferential accuracy, as if the true learnability structure were known. Such an oracle result holds when the minimum subsample size diverges fast enough and a minimal signal condition is met. The above learnability recovery issue was not considered in Liu et al. (2015). An alternating direction method of multipliers (ADMM; Boyd et al., 2010) algorithm is derived for the proposed method. The algorithm is particularly suitable for large-scale data based on parallel computing. The empirical performance of the proposed approach is examined through simulation studies, which demonstrate that the most reliable results are produced under the minimax concave penalty (MCP; Zhang, 2010).

The rest of the article is organized as follows. In Section 2, we formalize the proposed statistical model (2) and develop a CD fusion-based MTL approach to recover the learnability structure and mutually transfer knowledge to estimate the coefficients. We derive an ADMM algorithm to implement the proposed MTL approach and then analyze some computational considerations in Section 3. Theoretical properties of the proposed estimator are established in Section 4. The finite-sample properties of the proposed approach are evaluated in Section 5 via simulation experiments. Section 6 investigates the nClimDiv database to illustrate the practical usefulness of the proposed method. Detailed proofs of all theoretical results are provided in the supplementary material.
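To make model (2) concrete before turning to notation, the following is a minimal sketch of how data obeying the two-layer structure could be generated. All sizes, coefficient distributions, and variance values below are illustrative assumptions and are not the simulation design used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: M units, S subgroups, p global and q heterogeneous features,
# and n_i observations per unit.
M, S, p, q, n_i = 12, 3, 5, 3, 200
sigma_eps, psi = 1.0, 0.3                      # noise sd and random-effect variance (Psi = psi * I here)

beta0  = rng.uniform(-2, 2, size=p)            # global coefficients shared by all units
alpha0 = rng.normal(0, 2, size=(S, q))         # subgroup-level heterogeneous coefficients
labels = rng.integers(0, S, size=M)            # latent subgroup membership L_i

units = []
for i in range(M):
    x_i = rng.normal(size=(n_i, p))            # global features
    z_i = rng.normal(size=(n_i, q))            # heterogeneous features
    u_i = rng.normal(0, np.sqrt(psi), size=q)  # unit-specific random effect, not transferable
    eps = rng.normal(0, sigma_eps, size=n_i)
    y_i = x_i @ beta0 + z_i @ (alpha0[labels[i]] + u_i) + eps   # model (2)
    units.append((x_i, z_i, y_i))
```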
Notations. Let $\|v\| \equiv \sqrt{v^\top v}$ and $\|v\|_\infty \equiv \max_{1 \le l \le r} |v_l|$ be the $L_2$ and $L_\infty$ norms of a vector $v = (v_1, \ldots, v_r)^\top$. For a symmetric matrix $M$, let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote its minimum and maximum eigenvalues. For a matrix $M \in \mathbb{R}^{r \times d}$, let $[M]_{jk}$ be its $(j,k)$-th element, $[M]_{j\cdot}$ be its $j$-th row vector, and $[M]_{\cdot k}$ be its $k$-th column vector. Let $\|M\| \equiv \max_{v \in \mathbb{R}^d, \|v\| = 1} \|Mv\|$, $\|M\|_\infty \equiv \max_{1 \le j \le r} \sum_{k=1}^d |[M]_{jk}|$, and $\|M\|_{\max} \equiv \max_{1 \le j \le r, 1 \le k \le d} |[M]_{jk}|$. If $M$ has full column rank, define $P_M \equiv M(M^\top M)^{-1} M^\top$ and $P_M^\perp \equiv I - P_M$. For positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \gg b_n$ if $a_n^{-1} b_n = o(1)$. Let $a \wedge b \equiv \min(a, b)$ for any $a, b \in \mathbb{R}$.

2. Methodology

We formalize the two-layer learnability framework, and introduce a confidence distribution fusion approach for learnability recovery and mutual transfer learning.

2.1. Two-Layer Learnability Structure

The $n_i$ data points in the $i$-th unit can be expressed as
$$y_i = x_i\beta_0 + z_i\theta_{i0} + z_i u_i + \varepsilon_i, \quad i = 1, \ldots, M, \qquad (3)$$
where $y_i = (y_{i1}, \ldots, y_{i n_i})^\top$, $x_i$ ($z_i$) is an $n_i \times p$ ($n_i \times q$) data matrix for the global (heterogeneous) features, and $\varepsilon_i$ is the vector of noises with zero mean and covariance $\sigma_\varepsilon^2 I_{n_i}$. Let $N = \sum_{i=1}^M n_i$ be the total sample size. The full data can be expressed as
$$Y = X\beta_0 + Z\Theta_0 + ZU + E, \qquad (4)$$
where $Y = (y_i^\top)^\top_{i=1,\ldots,M} \in \mathbb{R}^N$ is the stacked response, $E = (\varepsilon_i^\top)^\top_{i=1,\ldots,M} \in \mathbb{R}^N$ is the stacked error, $X = (x_i^\top)^\top_{i=1,\ldots,M}$ and $Z = \mathrm{diag}(z_i)_{i=1,\ldots,M}$ are stacked $N \times p$ and $N \times Mq$ data matrices, and $\Theta_0 = (\theta_{i0}^\top)^\top_{i=1,\ldots,M}$ and $U = (u_i^\top)^\top_{i=1,\ldots,M}$ are $Mq$-dimensional vectors of coefficients for the heterogeneous features and random effects. Recall that $\beta_0$ is a global coefficient vector shared by all data units.

Ignoring the mutual learnability assumption, the above model is a linear mixed-effects model, which can be estimated by the generalized least squares method (GLS; Rothenberg, 1984; Anh & Chelliah, 1999), which amounts to minimizing
$$L_N^{\mathrm{GLS}}(\beta, \Theta) \equiv \sum_{i=1}^M (y_i - x_i\beta - z_i\theta_i)^\top W_i (y_i - x_i\beta - z_i\theta_i),$$
where $W_i \equiv \mathrm{Cov}(y_i \mid x_i, z_i)^{-1} = (\sigma_\varepsilon^2 I_{n_i} + z_i \Psi z_i^\top)^{-1}$. For simplicity, we assume that $\Psi$ and $\sigma_\varepsilon^2$ are known. Otherwise, these variance components can be consistently estimated through the restricted maximum likelihood (REML) method (Richardson & Welsh, 1994; Jiang, 1996).

However, the standard GLS approach does not take into account the different learnability and does not conduct learnability recovery. To fuse the $\theta_i$'s together, we may add a pairwise concave fusion penalty to the objective function:
$$Q_N(\beta, \Theta) \equiv L_N^{\mathrm{GLS}}(\beta, \Theta) + \sum_{1 \le i < j \le M} p_\gamma(\|\theta_i - \theta_j\|; \lambda), \qquad (5)$$
where $p_\gamma(\cdot; \lambda)$ is a concave penalty function with a tuning parameter $\lambda > 0$ and a concavity parameter $\gamma > 0$. A natural estimate is thus defined as
$$\big(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top\big)^\top = \arg\min_{\beta \in \mathbb{R}^p,\, \Theta \in \mathbb{R}^{Mq}} Q_N(\beta, \Theta). \qquad (6)$$
The above full-data estimation is referred to as the individual participant data (IPD) method, the gold standard in the meta-analysis context (e.g., Debray et al., 2015). However, for large-scale data, the IPD method requires very expensive computation; see Section 3.2 for comparisons. This motivates the proposed CD fusion approach below.

In this article, we consider four popular concave fusion penalty functions: the $L_1$ or Lasso penalty (Tibshirani, 1996), the minimax concave penalty (MCP; Zhang, 2010), the smoothly clipped absolute deviation penalty (SCAD; Fan & Li, 2001), and the truncated Lasso penalty (TLP; Shen et al., 2012).
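As a sketch of how objective (5) could be evaluated, the code below implements the GLS loss with known variance components and a pairwise MCP fusion penalty in its standard form. The function names and the `units` container are illustrative assumptions; actually minimizing (6) would additionally require the ADMM algorithm discussed in Section 3, which is not shown here.

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty p_gamma(t; lambda) in its standard form (Zhang, 2010)."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def gls_fusion_objective(beta, Theta, units, Psi, sigma2, lam, gamma):
    """Penalized GLS objective (5): the GLS loss plus a pairwise MCP fusion penalty
    on the theta_i's. `units` is a list of (x_i, z_i, y_i); `Theta` has one row per unit.
    Psi and sigma2 are treated as known, as in the paper."""
    loss = 0.0
    for (x_i, z_i, y_i), theta_i in zip(units, Theta):
        W_i = np.linalg.inv(sigma2 * np.eye(len(y_i)) + z_i @ Psi @ z_i.T)  # W_i = Cov(y_i | x_i, z_i)^{-1}
        r_i = y_i - x_i @ beta - z_i @ theta_i
        loss += r_i @ W_i @ r_i
    M = len(units)
    penalty = sum(mcp(np.linalg.norm(Theta[i] - Theta[j]), lam, gamma)
                  for i in range(M) for j in range(i + 1, M))
    return loss + penalty
```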
The $L_1$ penalty is known to produce biased estimates (Zhao & Yu, 2006). In our simulations, it tends to produce either many subgroups or no subgroup at all. In contrast, MCP, SCAD, and TLP are more appropriate for recovering the learnability structure since they enjoy selection consistency. See Section S.1 for the definitions of these penalties.

2.2. MTL Estimator using Confidence Distribution Fusion

We propose a CD fusion approach which produces the MTL estimator $(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top)^\top$ at a much lower computational cost than the IPD method, while obtaining the same result (see Theorem 2.1). We start by obtaining individual unit estimates, and then merge them through their CD densities as defined in Liu et al. (2015).

Given the data in each unit, the first step is to obtain individual unit estimates of $(\beta^\top, \theta_i^\top)^\top$ via the GLS method, for each $i$, as
$$\hat b_i = \big[(x_i, z_i)^\top W_i (x_i, z_i)\big]^{-1} (x_i, z_i)^\top W_i\, y_i. \qquad (7)$$
Recall that $\Psi$ and $\sigma_\varepsilon^2$ are assumed known. Let $V_i$ denote the square root matrix of $W_i$ such that $W_i = V_i^2$; then (7) is equivalent to the ordinary least squares solution for the transformed unit data $V_i y_i = V_i x_i\beta + V_i z_i\theta_i + V_i\varepsilon_i$ with $\mathrm{Var}(V_i y_i) = I_{n_i}$. According to He & Shao (2000), as $n_i$ grows, we have, conditional on $(x_i, z_i)$,
$$\big[(x_i, z_i)^\top W_i (x_i, z_i)\big]^{1/2}\big(\hat b_i - (\beta_0^\top, \theta_{i0}^\top)^\top\big) \xrightarrow{D} N(0, I).$$
Following Liu et al. (2015), we define the combined CD density as $h(\beta, \Theta) = \prod_{i=1}^M h_i(\beta, \theta_i)$, where $h_i(\beta, \theta_i)$ is the CD density for the $i$-th unit. We then define an objective function which consists of $-\log h(\beta, \Theta)$ (with additive constant terms omitted) and a pairwise concave fusion penalty, i.e.,
$$Q_N^{\mathrm{CD}}(\beta, \Theta) \equiv \sum_{i=1}^M \big(\hat b_i - (\beta^\top, \theta_i^\top)^\top\big)^\top (x_i, z_i)^\top W_i (x_i, z_i)\, \big(\hat b_i - (\beta^\top, \theta_i^\top)^\top\big) + \sum_{1 \le i < j \le M} p_\gamma(\|\theta_i - \theta_j\|; \lambda). \qquad (8)$$

Assumption 4.1 (Design). (i) The rows of the design matrices have sub-Gaussian tails in the sense that $P(|a^\top f_{ik}| > \|a\|\, t) \le 2\exp(-c_f t^2)$ for all $i = 1, \ldots, M$, $k = 1, \ldots, n_i$, and some absolute constant $c_f > 0$, where $f_{ik}$ denotes the $k$-th row of $(x_i, z_i)$. (ii) Let $F = (X, Z)$. There exists some absolute constant $0 < C_f < 1$ such that $C_f \le \lambda_{\min}(E\, FF^\top) \le \lambda_{\max}(E\, FF^\top) \le C_f^{-1}$.

Figure 3. Computation time (upper panel) and peak memory usage (lower panel, in gigabytes) for the CD-based MTL and the IPD methods.

Assumption 4.2 (U and E). The coefficient vector $U$ and the noise vector $E$ have sub-Gaussian tails in the sense that $P(|a^\top U| > \|a\|\, t) \le 2\exp(-c_e t^2)$ and $P(|b^\top E| > \|b\|\, t) \le 2\exp(-c_e t^2)$ for any vectors $a \in \mathbb{R}^{Mq}$, $b \in \mathbb{R}^N$, any $t > 0$, and some constant $c_e > 0$.

Let $n_{\min} \equiv \min_{1 \le i \le M} n_i$, $g_{\min} \equiv \min_{1 \le s \le S} \sum_{i: L_i = s} n_i$, and $g_{\max} \equiv \max_{1 \le s \le S} \sum_{i: L_i = s} n_i$ be the minimum unit size and the minimum and maximum total sample sizes in a subgroup. Theorem 4.1 gives a non-asymptotic upper bound and the limiting distribution for the oracle estimator $(\hat\beta_{\mathrm{OR}}^\top, \hat\alpha_{\mathrm{OR}}^\top)^\top$, which is computed as if the true label matrix $A$ (under which $\Theta_0 = A\alpha_0$) were known.

Theorem 4.1 (Properties of the Oracle Estimator). Given Assumptions 4.1 and 4.2, suppose $g_{\min} \gg (p + Sq)^{1/2} N^{3/4}$, $M = o(\exp(n_{\min}))$, and $p + Sq = o(\exp(g_{\max}))$.

(i) With probability at least $1 - (p + Sq)(4N^{-1} + e^{-g_{\max}}) - 2Me^{-n_{\min}}$, we have
$$\left\| \begin{pmatrix} \hat\beta_{\mathrm{OR}} - \beta_0 \\ \hat\alpha_{\mathrm{OR}} - \alpha_0 \end{pmatrix} \right\| \le 5\, c_e^{-1/2} c_f^{1/2} C_f^{-1} (\tau \wedge \sigma_\varepsilon^2)^{-1/2} \sigma_\varepsilon\, g_{\min}^{-1} \sqrt{(p + Sq)\, N \log N} \;\equiv\; \phi_N, \qquad (11)$$
where $\tau = \lambda_{\min}(\Psi)$.

(ii) For any sequence of $(p + Sq)$-vectors $\{a_N\}$ with $\|a_N\| = 1$, we have, as $N \to \infty$,
$$\sigma_N^{-1}(a_N)\, a_N^\top \begin{pmatrix} \hat\beta_{\mathrm{OR}} - \beta_0 \\ \hat\alpha_{\mathrm{OR}} - \alpha_0 \end{pmatrix} \xrightarrow{D} N(0, 1), \qquad (12)$$
with $\sigma_N(a_N) \equiv \sqrt{a_N^\top \big[(X, ZA)^\top W (X, ZA)\big]^{-1} a_N}$.

It is important to note that $g_{\min} \to \infty$ under these conditions. Since $g_{\min} \le N/S$, the condition $g_{\min} \gg (p + Sq)^{1/2} N^{3/4}$ implies that $S\sqrt{p + Sq} = o(N^{1/4})$. This imposes an upper bound on how fast $(S, p, q)$ can grow with $N$. In particular, if $p$ and $q$ are assumed to be fixed, we have $S = o(N^{1/6})$. To appreciate the upper bound (11), we make the following simplifications: $n_i \equiv n$ (thus $N = Mn$), $g_{\min} \asymp g_{\max} \asymp N/S$, and fixed $(p, q)$. In this case, the upper bound (11) admits the rate $S^{3/2}\sqrt{(\log M + \log n)/(Mn)}$. It becomes $o_P(1)$ if $S^3 = o\big(Mn/(\log M + \log n)\big)$, which suggests that $S$ cannot grow faster than $M^{1/3}$.
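As a quick sanity check on the rate just stated, substituting the simplifications ($p + Sq \asymp S$, $g_{\min} \asymp N/S$, $N = Mn$, and $\log N = \log M + \log n$) directly into the bound (11) gives
$$g_{\min}^{-1}\sqrt{(p + Sq)\, N \log N} \;\asymp\; \frac{S}{N}\sqrt{S\, N \log N} \;=\; S^{3/2}\sqrt{\frac{\log N}{N}} \;=\; S^{3/2}\sqrt{\frac{\log M + \log n}{Mn}},$$
which is $o(1)$ precisely when $S^3 = o\big(Mn/(\log M + \log n)\big)$.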
4.2. Theoretical Properties of the MTL Estimator

We prove that the proposed MTL estimator is asymptotically equivalent to the oracle estimator, and hence achieves the highest level of statistical inferential accuracy. More importantly, we also show that, through MTL, the proposed estimator is asymptotically more efficient than the individual unit estimator in (7), which is based only on the data from a single unit. These results require two additional conditions: one is on the minimal signal, defined as $\delta_N \equiv \min_{s \ne s'} \|\alpha_{s0} - \alpha_{s'0}\|$; the other is on the penalty function $p_\gamma(t; \lambda)$.

Assumption 4.3 (Penalty Function). $p_\gamma(t; \lambda)$ is symmetric with $p_\gamma(0; \lambda) = 0$, and is non-decreasing and concave for $t \in [0, \infty)$. There exists a constant $a > 0$ such that $p_\gamma(t; \lambda)$ is constant for all $t \ge a\lambda$. The derivative $\partial_t p_\gamma(t; \lambda)$ exists and is continuous except at a finite number of $t$, and $\partial_t p_\gamma(0^+; \lambda) = \lambda$.

This assumption is satisfied by such non-convex penalties as MCP, SCAD, and TLP with $a = \gamma$, and is also considered in Ma & Huang (2016; 2017). Along with a proper minimal signal condition, it ensures that the penalty term does not push $\theta_i$ and $\theta_j$ from different subgroups toward each other.

Theorem 4.2 (Oracle Property). Suppose the conditions in Theorem 4.1 and Assumption 4.3 hold and $S \ge 2$. If $\phi_N \ll \lambda \ll a^{-1}\delta_N$, where $a$ is defined in Assumption 4.3 and $\phi_N$ is given in Theorem 4.1, then there exists a local minimizer $(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top)^\top$ of (8) satisfying
$$P\Big(\big(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top\big)^\top = \big(\hat\beta_{\mathrm{OR}}^\top, \hat\Theta_{\mathrm{OR}}^\top\big)^\top\Big) \to 1.$$

Let $\hat\alpha(\lambda)$ and $\hat\alpha_{\mathrm{OR}}$ be the distinct values of $\hat\Theta(\lambda)$ and $\hat\Theta_{\mathrm{OR}}$, respectively. Theorem 4.2 implies that the proposed estimator has the same limiting distribution as the oracle estimator, given in (12). Moreover, an interesting efficiency boosting phenomenon appears in part (ii) of Corollary 4.1 below.

Corollary 4.1. Suppose the conditions in Theorem 4.2 hold.

(i) (Asymptotic Normality) For any sequence of $(p + Sq)$-vectors $\{a_N\}$ with $\|a_N\| = 1$, we have, as $N \to \infty$,
$$\sigma_N^{-1}(a_N)\, a_N^\top \begin{pmatrix} \hat\beta(\lambda) - \beta_0 \\ \hat\alpha(\lambda) - \alpha_0 \end{pmatrix} \xrightarrow{D} N(0, 1),$$
where $\sigma_N(a_N)$ is given in Theorem 4.1.

(ii) (Efficiency Boosting) For any $p$-vector $v_p$ and $q$-vector $v_q$, we have, for all $i = 1, \ldots, M$,
$$v_p^\top \mathrm{Cov}\big(\hat\beta(\lambda)\big)\, v_p \le v_p^\top \mathrm{Cov}\big(\hat b_{i,\beta}\big)\, v_p \quad\text{and}\quad v_q^\top \mathrm{Cov}\big(\hat\theta_i(\lambda)\big)\, v_q \le v_q^\top \mathrm{Cov}\big(\hat b_{i,\theta}\big)\, v_q,$$
where $\hat b_{i,\beta}$ and $\hat b_{i,\theta}$ denote the $\beta$- and $\theta_i$-components of the individual unit estimate $\hat b_i$ in (7).

To estimate the limiting standard deviation $\sigma_N(a_N)$, we need to replace the unknown label matrix $A$ (under which $\Theta_0 = A\alpha_0$) by its estimate $\hat A$. This can be trivially done after $\hat\alpha(\lambda)$ and $\hat\Theta(\lambda)$ are computed. Another challenge is that the direct computation of $(X, Z\hat A)^\top W (X, Z\hat A)$, which requires $O\big(N^2(p + \hat S(\lambda)q)\big)$ operations, is infeasible when $N$ is large. A more efficient way is to take the multivariate GLS approach of Becker & Wu (2007), as follows. Let $\hat G_s$, $s = 1, \ldots, \hat S(\lambda)$, be $(p + q) \times (p + \hat S(\lambda)q)$ matrices based on the estimated label matrix $\hat A$ such that
$$\begin{pmatrix} \hat\beta(\lambda) \\ \hat\theta_i(\lambda) \end{pmatrix} = \begin{pmatrix} \hat\beta(\lambda) \\ \hat\alpha_{\hat L_i}(\lambda) \end{pmatrix} = \hat G_{\hat L_i} \begin{pmatrix} \hat\beta(\lambda) \\ \hat\alpha(\lambda) \end{pmatrix}, \quad i = 1, \ldots, M.$$
After straightforward algebra, we have
$$(X, Z\hat A)^\top W (X, Z\hat A) = \sum_{i=1}^M \hat G_{\hat L_i}^\top (x_i, z_i)^\top W_i (x_i, z_i)\, \hat G_{\hat L_i}.$$
Note that the $(x_i, z_i)^\top W_i (x_i, z_i)$'s are the $(p + q) \times (p + q)$ precision matrices supplied by the data units under the MTL approach. Hence, the right-hand side of the equation above only requires $O\big(M(p + \hat S(\lambda)q)^2(p + q)\big)$ operations.
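A sketch of this block-wise assembly is given below. The argument names and containers are illustrative assumptions, and the construction of the $\hat G_s$ matrices from the estimated label matrix is taken as given.

```python
import numpy as np

def mtl_precision(units, W_list, G_list, labels):
    """Assemble (X, Z A_hat)^T W (X, Z A_hat) block-wise, as in the multivariate GLS
    approach of Becker & Wu (2007): sum_i G_{L_i}^T (x_i, z_i)^T W_i (x_i, z_i) G_{L_i}.

    units:  list of (x_i, z_i) design-matrix pairs, one per unit
    W_list: list of the n_i x n_i weight matrices W_i
    G_list: list of the (p + q) x (p + S_hat * q) mapping matrices G_s, one per subgroup
    labels: estimated subgroup label L_hat_i (0-based) for each unit
    """
    dim = G_list[0].shape[1]
    out = np.zeros((dim, dim))
    for (x_i, z_i), W_i, L_i in zip(units, W_list, labels):
        F_i = np.hstack([x_i, z_i])                # n_i x (p + q) unit design matrix
        G_i = G_list[L_i]
        out += G_i.T @ (F_i.T @ W_i @ F_i) @ G_i   # map each (p+q) x (p+q) unit precision
    return out
```

Because each summand involves only unit-level quantities, the computation parallelizes naturally over data units.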
5. Simulation Experiments

We now evaluate the MTL method using simulations.

5.1. Simulation Setup

Table 1. Simulation Settings.

| Simulation case | n | M | S | SNR |
| --- | --- | --- | --- | --- |
| Case 1 | 1024 | 50 | 2 | 4.399 |
| Case 2 | 1024 | 50 | 3 | 8.838 |
| Case 3 | 2048 | 50 | 2 | 4.399 |
| Case 4 | 2048 | 50 | 3 | 8.838 |
| Case 5 | 2048 | 100 | 2 | 4.399 |
| Case 6 | 2048 | 100 | 3 | 8.838 |
| Case 7 | 2048 | 100 | 5 | 8.734 |
| Case 8 | 2048 | 150 | 5 | 8.734 |
| Case 9 | 2048 | 150 | 7 | 9.260 |

Table 1 summarizes nine simulation settings (with p = 5 global features and q = 3 heterogeneous features), each with 100 replications; the signal-to-noise ratio (SNR) is defined in Section S.10. The largest total sample size is 307,200. For simplicity, we consider equal unit sizes $n_i \equiv n$ for $i = 1, \ldots, M$. We let the number of units in each subgroup be $(M_1, \ldots, M_S) = \mathbf{1}_S + \mathrm{Multinomial}(M - S, \mathbf{1}_S/S)$. The coordinates of $\beta_0$ were generated independently from $\mathrm{Uniform}(-2, 2)$. To mimic the different coefficient values of the heterogeneous features between subgroups, we generated $\alpha_0 = (\alpha_{1,0}^\top, \ldots, \alpha_{S,0}^\top)^\top$, where $\alpha_{s,0} = (\alpha_{s,0,1}, \alpha_{s,0,2}, \alpha_{s,0,3})^\top$, in a way that guarantees the minimal signal condition. The values of $(\alpha_{1,0,1}, \ldots, \alpha_{S,0,1})$ were assigned to evenly spaced grid points over $[-S^{1.4}/2,\, S^{1.4}/2]$, denoted as $(\alpha^*_1, \ldots, \alpha^*_S)$, and then $\alpha_0$ was generated by
$$\alpha_0 = \mathrm{vec}\begin{pmatrix} \alpha^*_1 & \alpha^*_2 & \alpha^*_3 & \cdots & \alpha^*_{S-1} & \alpha^*_S \\ \alpha^*_S & \alpha^*_1 & \alpha^*_2 & \cdots & \alpha^*_{S-2} & \alpha^*_{S-1} \\ \alpha^*_{S-1} & \alpha^*_S & \alpha^*_1 & \cdots & \alpha^*_{S-3} & \alpha^*_{S-2} \end{pmatrix},$$
where $\mathrm{vec}(\cdot)$ denotes the vectorization operator. If $S$ is odd, we further add 1 to all coordinates of $\alpha_0$ to avoid subgroup effect coordinates being 0. The data matrices $f_i = (x_i, z_i)$ consist of independent rows, each generated from the 8-dimensional normal $N(0, \Sigma_F)$, where the diagonal elements of $\Sigma_F$ are 1 and the off-diagonal elements are 0.3. Moreover, $u_i$ and $\varepsilon_i$ follow $N(0, 0.3I)$ and $N(0, I)$, respectively. Finally, $Y$ was generated from the oracle model.

Figure 4. Solution paths of $(\theta_{1,1}, \ldots, \theta_{M,1})$ for the first dataset in Case 5 (n = 256, M = 100, S = 2). Colors stand for true subgroups, horizontal dashed lines represent the true values of the first coordinate of the $\alpha_s$'s, and the vertical line displays the best $\lambda$ selected by the modified BIC, whose values are displayed by gray bars.

Figure 4 shows the solution paths of $(\theta_{1,1}, \ldots, \theta_{M,1})$, the first coordinate of the $\theta_i$'s. It can be seen that, under MCP, SCAD, and TLP, the learnability structure can be well recovered using the modified BIC given in (10) to choose the tuning parameter $\lambda$; the selected value is denoted as $\hat\lambda$. On the other hand, although the $L_1$ penalty obtains the correct number of subgroups here, the coefficient estimates are far from the truth.

5.2. Learnability Structure Recovery

We use the Normalized Mutual Information (NMI; Fred & Jain, 2003) as the performance measure for learnability structure recovery. NMI takes values in $[0, 1]$, and larger values imply more similar groupings. Let $C = \{C_1, C_2, \ldots\}$ and $D = \{D_1, D_2, \ldots\}$ denote two partitions of $\{1, \ldots, M\}$. NMI is defined as
$$\mathrm{NMI}(C, D) \equiv \frac{2\, I(C; D)}{H(C) + H(D)},$$
where $I(C; D) \equiv \sum_{k,l} (|C_k \cap D_l|/M) \log\big(M|C_k \cap D_l| / (|C_k|\,|D_l|)\big)$ is the mutual information between $C$ and $D$, and $H(C) \equiv -\sum_k (|C_k|/M) \log(|C_k|/M)$ is the entropy of $C$. The percentage of perfect recoveries among the 100 replications is also reported.
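A direct implementation of this NMI definition, taking label vectors rather than set partitions as input (an illustrative choice), might look as follows.

```python
import numpy as np
from collections import Counter

def nmi(labels_c, labels_d):
    """Normalized mutual information between two partitions of {1, ..., M},
    encoded as equal-length label vectors: NMI = 2 I(C; D) / (H(C) + H(D))."""
    M = len(labels_c)
    counts_c, counts_d = Counter(labels_c), Counter(labels_d)
    counts_cd = Counter(zip(labels_c, labels_d))
    mutual_info = sum((n_kl / M) * np.log(M * n_kl / (counts_c[k] * counts_d[l]))
                      for (k, l), n_kl in counts_cd.items())
    entropy = lambda counts: -sum((n / M) * np.log(n / M) for n in counts.values())
    return 2 * mutual_info / (entropy(counts_c) + entropy(counts_d))
```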
We also consider an ad hoc approach, to compare with our method, based on performing K-means on the unit estimates $\hat\theta_i$ with the modified BIC used to select the number of subgroups. Specifically, for any given subgroup size $S = K$, the K-means algorithm is performed to obtain estimated subgroup labels, and the oracle estimate is then computed using these estimated labels and the variance components. Subsequently, the modified BIC values are calculated by plugging in these K-means-based estimates. In our analysis, we run through $K = 1, \ldots, 10$ and choose the value of $K$ that minimizes the modified BIC.

Table 2. Learnability structure recovery (short version; the full table is in the supplementary material).

| Case | Method | Mean, Median (Min, Max) of $\hat S$ | NMI | Perfect Recovery |
| --- | --- | --- | --- | --- |
| Case 1 (S = 2) | MCP | 2, 2 (2, 2) | 0.998 | 0.99 |
| | SCAD | 2, 2 (2, 2) | 0.998 | 0.99 |
| | TLP | 2, 2 (2, 2) | 0.998 | 0.99 |
| | L1 | 2, 2 (1, 4) | 0.947 | 0.62 |
| | K-Mns | 2, 2 (2, 3) | 0.989 | 0.95 |
| Case 7 (S = 5) | MCP | 5, 5 (5, 6) | 0.999 | 0.98 |
| | SCAD | 5, 5 (5, 5) | 0.999 | 0.99 |
| | TLP | 5, 5 (5, 6) | 0.999 | 0.99 |
| | L1 | 63.2, 100 (1, 100) | 0.587 | 0.00 |
| | K-Mns | 5, 5 (5, 5) | 0.976 | 0.37 |
| Case 9 (S = 7) | MCP | 7, 7 (7, 7) | 0.999 | 0.99 |
| | SCAD | 7.3, 7 (7, 9) | 0.998 | 0.77 |
| | TLP | 7.2, 7 (7, 9) | 0.998 | 0.81 |
| | L1 | 150, 150 (150, 150) | 0.620 | 0.00 |
| | K-Mns | 7, 7 (7, 7) | 0.978 | 0.20 |

Table 2 includes results for selected cases (Cases 1, 7, and 9; see the full table in the supplementary material). It can be seen that MCP, SCAD, and TLP result in desirable recovery performance, while the $L_1$ penalty does not. MCP works very well in all cases. Although the high NMI values suggest that SCAD and TLP still capture the true subgroup structure well, their performance gets worse when $S$ is large (e.g., Case 9) in terms of the standard deviation of $\hat S$ and the perfect recovery rate. This echoes the discussion following Theorem 4.1 that the theoretical properties hold only when $S$ does not grow too fast. Due to its poor performance, the $L_1$ penalty is no longer considered in the subsequent analysis. K-means performs well for the smaller $M = 50$, but poorly for the larger $M = 100$ and $150$, even though the number of subgroups is correctly obtained (e.g., Case 7).

5.3. Parameter Estimation

To evaluate the parameter estimation, we define the root mean squared error (RMSE) between two vectors $v_1, v_2 \in \mathbb{R}^d$ as $\mathrm{RMSE}(v_1, v_2) = \sqrt{1/d}\;\|v_1 - v_2\|$. In Figure 5, we empirically examine how close the proposed estimate is to both the true value and the oracle estimate. Memory shortages occurred in some cases when computing the oracle estimates directly; to resolve this issue, the multivariate GLS approach of Becker & Wu (2007) is applied to compute the oracle estimates.

Figure 5. Evaluation of parameter estimation. Panels (top to bottom): $\mathrm{RMSE}(\hat\alpha, \alpha_0)$; $\mathrm{RMSE}(\hat\alpha, \hat\alpha_{\mathrm{OR}})$; $\mathrm{RMSE}(\hat\beta, \beta_0)$; $M^{-1}\sum_{i=1}^M \mathrm{RMSE}(\hat\theta_i, \theta_{0i})$, for MCP, SCAD, TLP, and the unit estimates.

First, from the top panel in Figure 5, we can see that $\mathrm{RMSE}(\hat\alpha, \alpha_0)$ reasonably increases with $S$ and decreases with $M$. This confirms the intuition that a larger $S$ implies more challenges in mutual transfer. On the other hand, increasing $M$ allows the heterogeneous features to be estimated using more data units, which improves their estimation accuracy. The third panel in Figure 5 indicates that $\mathrm{RMSE}(\hat\beta, \beta_0)$ depends on $N = Mn$ but is independent of $S$. Note that $\beta$ is the global parameter shared over all data units. This result numerically verifies that the proposed method effectively merges the common information provided by the data units, regardless of how many different subgroups they come from.

The results in Theorem 4.2 and part (ii) of Corollary 4.1 can be verified as well. First, the small values of $\mathrm{RMSE}(\hat\alpha, \hat\alpha_{\mathrm{OR}})$ (the second panel) and $\mathrm{RMSE}(\hat\beta, \hat\beta_{\mathrm{OR}})$ (which are all close to 0 and omitted here) imply that $(\hat\beta^\top, \hat\alpha^\top)^\top$ is fairly close to its oracle counterpart $(\hat\beta_{\mathrm{OR}}^\top, \hat\alpha_{\mathrm{OR}}^\top)^\top$.
The third panel in Figure 5 also indicates that the estimation accuracy of $\hat\beta$ is better than that of the best unit estimate of the common effect. A similar observation applies to the estimation of the $\theta_i$'s, as demonstrated in the bottom panel of Figure 5. Both panels support the efficiency boosting result in Corollary 4.1. We also studied the asymptotic covariance approximation and the inferential accuracy; see Section S.11.

6. Real Data Example

NOAA's nClimDiv database (available at ftp://ftp.ncdc.noaa.gov/pub/data/cirs/climdiv/) was analyzed to demonstrate the MTL method's practical usefulness. The monthly average temperature is the response of interest. We categorize the 8 features into global ones and heterogeneous ones by inspecting the kernel densities. Intuitively, the distribution of the unit-level coefficient estimates for a heterogeneous feature is more likely to be multimodal or widespread (as opposed to unimodal or concentrated) across units. In this way we have chosen the intercept, PCPN, and ZNDX as the heterogeneous features (i.e., q = 3 and p = 5) and the rest as global features; see details in Section S.11.2.

Figure 2 displays the recovered learnability structure, where the five resulting subgroups generally follow a geographical pattern, even though we do not use any spatial information about the climate divisions. It may appear strange that the northeastern parts of New England are in the same subgroup as Texas and Southern California. However, this pattern actually coincides with NOAA's outlook for temperature, precipitation, and drought: the blue and red groups together follow the temperature outlook (see, e.g., the first figure at https://www.climate.gov/news-features/videos/noaas-2019-20-winter-outlook-temperature-precipitation-and-drought), and the red subgroup stands out due to severe drought conditions (see the United States Drought Monitor animation at https://droughtmonitor.unl.edu/Maps/Animations.aspx).

Table 3. Parameter estimates using the proposed MTL method with MCP. The average asymptotic standard deviation (ASD), multiplied by 100, is shown in parentheses.

Heterogeneous features (ASD × 100):

| Color [#(units)] | $\hat\alpha_{\text{Intercept}}$ | $\hat\alpha_{\text{PCPN}}$ | $\hat\alpha_{\text{ZNDX}}$ |
| --- | --- | --- | --- |
| Red [41] | 64.97 (13.2) | 0.37 (9.5) | 0.07 (9.5) |
| Blue [132] | 49.53 (7.1) | 0.85 (5.3) | 1.51 (5.3) |
| Green [79] | 35.32 (8.9) | 5.44 (6.9) | 4.05 (6.8) |
| Purple [81] | 24.74 (9.2) | 7.28 (6.8) | 5.16 (6.7) |
| Orange [11] | 9.90 (32.3) | 9.14 (19.3) | 6.54 (18.6) |

Coefficients for global features (ASD × 100):

| $\hat\beta_{\text{Summer}}$ | $\hat\beta_{\text{Fall}}$ | $\hat\beta_{\text{Winter}}$ | $\hat\beta_{\text{PDSI}}$ | $\hat\beta_{\text{PHDI}}$ |
| --- | --- | --- | --- | --- |
| 18.26 (2) | 4.06 (2) | 15.12 (2) | 0.18 (1) | 0.20 (1) |

From the parameter estimates tabulated in Table 3, we can see the general pattern that the average temperature tends to drop from the southwestern areas to the northeastern areas. ZNDX has a tendency similar to that of the intercept, while PCPN behaves in the opposite direction. For the global feature estimates, the seasonal effects are unsurprising, and PDSI and PHDI have a positive but, compared to the heterogeneous features, relatively small influence on the temperature.

An additional prediction-driven analysis was conducted by partitioning the nClimDiv dataset into two parts: data from 1895-2000 were used for training and data from 2001-2016 for testing. Our model did not perform as well as the baseline unit-level regression model in terms of the RMSE on the test data. This may be caused by a sub-optimal choice of the tuning parameter $\lambda$, which leads to fewer subgroups than would be needed for better prediction performance. Recall that we did not choose $\lambda$ to minimize the cross-validation prediction error, but used the BIC in order to obtain a parsimonious model.
On the other hand, in light of the simulation results and the optimality theory presented in this paper, our model dominates the baseline model in terms of statistical inference accuracy. This may suggest a fundamental difference between inferential analysis and predictive analysis.

Acknowledgements

This work was completed while Guang Cheng was a member of the Institute for Advanced Study (IAS) at Princeton University in Fall 2019. Guang Cheng would like to acknowledge the hospitality of IAS and the financial support from the National Science Foundation (DMS-1712907, DMS-1811812, and DMS-1821183), the Office of Naval Research (ONR N00014-18-2759), and the Adobe Data Science Fund.

References

Anh, V. V. and Chelliah, T. Estimated generalized least squares for random coefficient regression models. Scandinavian Journal of Statistics, 26(1):31-46, 1999.

Becker, B. J. and Wu, M.-J. The synthesis of regression slopes in meta-analysis. Statistical Science, 22(3):414-429, 2007.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.

Craven, P. and Wahba, G. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377-403, 1978.

Debray, T. P. A., Moons, K. G. M., van Valkenhoef, G., Efthimiou, O., Hummel, N., Groenwold, R. H. H., Reitsma, J. B., and the GetReal methods review group. Get real in individual participant data (IPD) meta-analysis: A review of the methodology. Research Synthesis Methods, 6(4):293-309, 2015.

Eckstein, J. A practical test for univariate and multivariate normality. Technical Report RRR 32-2012, RUTCOR, December 2012.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

Fred, A. L. and Jain, A. K. Robust data clustering. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pp. 128-136, 2003.

Ghadimi, E., Teixeira, A., Shames, I., and Johansson, M. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644-658, 2015.

Harville, D. A. Matrix Algebra From a Statistician's Perspective. Springer, corrected edition, 2000. ISBN 978-0387949789.

He, X. and Shao, Q.-M. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73(1):120-135, 2000.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569-600, 2014.

Jiang, J. REML estimation: Asymptotic behavior and related topics. The Annals of Statistics, 24(1):255-286, 1996.

Liu, D., Liu, R. Y., and Xie, M. Multivariate meta-analysis of heterogeneous studies using only summary statistics: Efficiency and robustness. Journal of the American Statistical Association, 110(509):326-340, 2015.

Ma, S. and Huang, J. Estimating subgroup-specific treatment effects via concave fusion. arXiv preprint, 2016.
Ma, S. and Huang, J. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410-423, 2017.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2009.

Richardson, A. and Welsh, A. Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Australian Journal of Statistics, 36(1):31-43, 1994.

Rothenberg, T. J. Approximate normality of generalized least squares estimates. Econometrica, 52(4):811-825, 1984.

Schwarz, G. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.

Shen, X., Pan, W., and Zhu, Y. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107(497):223-232, 2012.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, 2001.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Eldar, Y. C. and Kutyniok, G. (eds.), Compressed Sensing, chapter 5, pp. 210-268. Cambridge University Press, Cambridge, 2012.

Wang, H., Li, B., and Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671-683, 2009.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.

Zhao, P. and Yu, B. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. A comprehensive survey on transfer learning. arXiv preprint arXiv:1911.02685, 2019.