# Mutual Transfer Learning for Massive Data

Ching-Wei Cheng (1), Xingye Qiao (2), Guang Cheng (1)

(1) Department of Statistics, Purdue University. (2) Department of Mathematical Sciences, Binghamton University. Correspondence to: Xingye Qiao.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

In the transfer learning problem, the target and the source data domains are typically known. In this article, we study a new paradigm called mutual transfer learning where, among many heterogeneous data domains, every data domain could potentially be the target of interest, and it could also be a useful source to help the learning in other data domains. However, given a target, not every data domain can be a successful source; only data sets that are similar enough to be thought of as from the same population can be useful sources for each other. Under this mutual learnability assumption, a confidence distribution fusion approach is proposed to recover the mutual learnability relation in the transfer learning regime. Our proposed method achieves the same oracle statistical inferential accuracy as if the true learnability structure were known. It can be implemented in an efficient parallel fashion to deal with large-scale data. Simulated and real examples are analyzed to illustrate the usefulness of the proposed method.

1. Introduction

Transfer learning (TL) aims to leverage knowledge from a related domain (called the source domain) to improve the predictive and inferential performance in a target domain, for which the available training data are limited. Source data are drawn from a source distribution P on S_P, and a relatively small quantity of labeled or unlabeled data are drawn from the target distribution Q on S_Q. We focus on the homogeneous TL setting with S_P = S_Q, in which P and Q are different but related distributions. See the surveys of TL in Pan & Yang (2009) and Zhuang et al. (2019).

In the TL problem, it is typically known a priori which data domain is the target and which data will be used as the source. However, this is not necessarily the case in many contemporary applications. For example, one of the 4Vs of Big Data is variety. Big Data are often collections of multiple data sets (called data units hereafter) from different data domains. These data units are of the same nature but are distributed differently because, for example, they were collected in different time periods, at different locations, or with different data collection devices. It may hence be desirable to leverage TL techniques to transfer knowledge between these related data domains so as to improve the predictive performance or statistical inference in a target domain. However, in the applications we consider in this article and in many other cases, literally every data domain could potentially be a target of interest, and every data domain could potentially be useful as a source to help the learning performance in other domains. In other words, the target and source domains are not known a priori. Moreover, due to the variety among the data units, it is quite possible that, given a target domain, some data units are less useful as source data than others, even though all of them could be useful to some extent.
Hence, a second feature of the problem considered here is that one must identify the source data units that are more useful for each possible target. While the existing TL literature is fairly diverse and extensive, fewer efforts have been made to make successful transfers in this setting.

In this article, we consider a mutual transfer learning (MTL) framework in which every data unit could potentially be the target. The goal of MTL is to simultaneously identify useful source data units for each target domain and use such information to improve learning performance. We propose a mutual learnability assumption, which posits that only data sets that are similar enough to be thought of as from the same population can be useful sources for each other. Under this assumption, we propose a confidence distribution fusion approach to recover the mutual learnability structure in the transfer learning regime, using the regression problem as a working example. Incorporating the recovered learnability structure, we fit a statistical model for both predictive and inferential purposes. Our proposed method achieves the same oracle statistical inferential accuracy as if the true learnability structure were known.

Figure 1. Illustration of the two-layer learnability. The data units from the same subgroup have more learnability between them.

Suppose there are M data units from M related domains. For each potential target domain, there exist M − 1 source data units to transfer from. We consider a regression problem with two sets of features: the global features and the heterogeneous features. While all the data units could be useful to learn the role of the global features in explaining the response variable, it is unreasonable to assume that all the M − 1 source data units are equally useful to learn the heterogeneous features. Specifically, given the target, there may be a subgroup of data units that are more useful than the others. This can be explained by a two-layer learnability model, shown in Figure 1, in which different behaviors of the data units are coded by colors. The data units bounded together are assumed to be from the same subgroup, within which it is easier to transfer useful information to learn the heterogeneous features, hence more learnability. Data from different subgroups are significantly different, and the learnability between subgroups is limited to only the global features.

To motivate our statistical model, we analyzed NOAA's nClimDiv database. It consists of monthly temperature, precipitation, and several indices for drought severity, and contains a total of N = 503,616 observations from M = 344 climate divisions. Each climate division is viewed as a data domain. The monthly average temperature is the response of interest, and there are 8 features, of which we designate 5 as global features and 3 as heterogeneous features. Our proposed method has identified 5 subgroups, shown in different colors in Figure 2. Knowledge about the global features can be learned from all climate divisions, while that about the heterogeneous features can only be learned from identified source data in the same subgroup. Whether a given feature is global or heterogeneous is typically suggested by domain knowledge or a preliminary inspection of the data. In this real data example, we checked whether the kernel density of the unit-level coefficient estimates of a feature is multimodal to determine whether that feature is global or heterogeneous.
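As an illustration of this screening step, a minimal sketch is given below: it fits a separate least squares model within each data unit and flags a feature as heterogeneous when a kernel density estimate of its M unit-level coefficient estimates shows more than one well-separated mode. The container `units`, the peak-counting rule, and its threshold are illustrative assumptions rather than the exact procedure used for the nClimDiv analysis.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LinearRegression

def unitwise_coefficients(units):
    """Fit a separate OLS model in each data unit and stack the coefficients.
    `units` is a list of (X_i, y_i) pairs, one pair per data unit."""
    coefs = [LinearRegression().fit(X_i, y_i).coef_ for X_i, y_i in units]
    return np.asarray(coefs)              # shape: (M, number of features)

def looks_heterogeneous(coef_column, grid_size=512, rel_height=0.1):
    """Crude multimodality check: count well-separated local maxima of a kernel
    density estimate of the M unit-level estimates for a single feature."""
    kde = gaussian_kde(coef_column)
    grid = np.linspace(coef_column.min(), coef_column.max(), grid_size)
    dens = kde(grid)
    peaks = [i for i in range(1, grid_size - 1)
             if dens[i] > dens[i - 1] and dens[i] > dens[i + 1]
             and dens[i] > rel_height * dens.max()]
    return len(peaks) > 1                 # more than one mode -> heterogeneous
```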
We use the following statistical model to characterize this two-layer structure. Suppose that the M data units come from S exclusive subgroups, and that the i-th data unit consists of $n_i$ observations. The response $Y_i$ from the i-th data domain follows
$$Y_i = X^\top\beta_0 + Z^\top\vartheta_i + \varepsilon = X^\top\beta_0 + Z^\top\theta_{i0} + Z^\top u_i + \varepsilon, \quad i = 1, \ldots, M, \qquad (1)$$
where the global features $X \in \mathbb{R}^p$ and the heterogeneous features $Z \in \mathbb{R}^q$ correspond to coefficients $\beta_0$ and $\vartheta_i$, which have different learnability. $\beta_0$ is the same among all data domains, and all data units can be used to learn it. $\vartheta_i = \theta_{i0} + u_i \in \mathbb{R}^q$ represents a unit-specific effect. Its first component $\theta_{i0}$ is equal among data units in the same subgroup as the i-th data unit; that is, $\theta_{i0} = \theta_{j0}$ if and only if the i-th and j-th data units are from the same subgroup. Knowledge about $\theta_{i0}$ can only be transferred between data units in the same subgroup. The second term $u_i$ explains the subtle difference between data units in the same subgroup, as seen from Figure 1. This random effect $u_i$ cannot be transferred. We assume that $\{u_i\}_{i=1}^M$ are independent and identically distributed with $E(u_i) = 0$ and $\mathrm{Var}(u_i) = \Psi$, whose minimum eigenvalue is $\tau = \lambda_{\min}(\Psi) > 0$.

Figure 2. Learnability recovered for the 344 climate divisions.

The distinct values of $\{\theta_{i0}\}_{i=1}^M$ are denoted as $\{\alpha_{s0}\}_{s=1}^S$. With a membership label $L_i \in \{1, \ldots, S\}$ indicating which subgroup unit i belongs to, i.e., $\theta_{i0} = \alpha_{L_i 0}$, model (1) is re-formulated as
$$Y_i = X^\top\beta_0 + Z^\top\alpha_{L_i 0} + Z^\top u_i + \varepsilon, \quad i = 1, \ldots, M. \qquad (2)$$

Our goal in this article is two-fold. First, we aim to recover the learnability structure, that is, the subgroup membership of each data unit. Second, we aim to transfer knowledge to improve learning performance for all the data domains by incorporating this revealed structure. We develop a confidence distribution (CD) fusion approach. Specifically, we first obtain individual unit estimates, and then combine their CD densities, as defined in Liu et al. (2015), using a pairwise concave fusion penalty on the $\theta_i$'s. We prove that the resulting estimate is exactly the same as that based on the full-data method, which is often considered the gold standard in meta-analysis (e.g., Debray et al., 2015). However, the full-data method has much larger computational costs, as discussed in Section 3.2.

The proposed estimator is proven to asymptotically achieve the highest statistical inferential accuracy, as if the true learnability structure were known. Such an oracle result holds when the minimum subsample size diverges fast enough and a minimal signal condition is met. The above learnability recovery issue was not considered in Liu et al. (2015). An alternating direction method of multipliers (ADMM; Boyd et al., 2010) algorithm is derived for the proposed method. The algorithm is particularly suitable for large-scale data based on parallel computing. The empirical performance of the proposed approach is examined through simulation studies, which demonstrate that the most reliable results are produced under the minimax concave penalty (MCP; Zhang, 2010).

The rest of the article is organized as follows. In Section 2, we formalize the proposed statistical model (2) and develop a CD fusion-based MTL approach to recover the learnability structure and mutually transfer knowledge to estimate the coefficients. We derive an ADMM algorithm to implement the proposed MTL approach and then analyze some computational considerations in Section 3. Theoretical properties of the proposed estimator are established in Section 4. The finite-sample properties of the proposed approach are evaluated in Section 5 via simulation experiments. Section 6 investigates the nClimDiv database to illustrate the practical usefulness of the proposed method. Detailed proofs of all theoretical results are provided in the supplementary material.
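To make model (2) concrete before turning to notation, the following is a minimal sketch of how data obeying the two-layer structure could be generated. All sizes, coefficient distributions, and variance values below are illustrative assumptions and are not the simulation design used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: M units, S subgroups, p global and q heterogeneous features,
# and n_i observations per unit.
M, S, p, q, n_i = 12, 3, 5, 3, 200
sigma_eps, psi = 1.0, 0.3                      # noise sd and random-effect variance (Psi = psi * I here)

beta0  = rng.uniform(-2, 2, size=p)            # global coefficients shared by all units
alpha0 = rng.normal(0, 2, size=(S, q))         # subgroup-level heterogeneous coefficients
labels = rng.integers(0, S, size=M)            # latent subgroup membership L_i

units = []
for i in range(M):
    x_i = rng.normal(size=(n_i, p))            # global features
    z_i = rng.normal(size=(n_i, q))            # heterogeneous features
    u_i = rng.normal(0, np.sqrt(psi), size=q)  # unit-specific random effect, not transferable
    eps = rng.normal(0, sigma_eps, size=n_i)
    y_i = x_i @ beta0 + z_i @ (alpha0[labels[i]] + u_i) + eps   # model (2)
    units.append((x_i, z_i, y_i))
```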
Notations. Let $\|v\| \equiv \sqrt{v^\top v}$ and $\|v\|_\infty \equiv \max_{1 \le l \le r} |v_l|$ be the $L_2$ and $L_\infty$ norms of a vector $v = (v_1, \ldots, v_r)^\top$. For a symmetric matrix $M$, let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote its minimum and maximum eigenvalues. For a matrix $M \in \mathbb{R}^{r \times d}$, let $[M]_{jk}$ be its $(j,k)$-th element, $[M]_{j\cdot}$ be its $j$-th row vector, and $[M]_{\cdot k}$ be its $k$-th column vector. Let $\|M\| \equiv \max_{v \in \mathbb{R}^d, \|v\| = 1} \|Mv\|$, $\|M\|_\infty \equiv \max_{1 \le j \le r} \sum_{k=1}^d |[M]_{jk}|$, and $\|M\|_{\max} \equiv \max_{1 \le j \le r, 1 \le k \le d} |[M]_{jk}|$. If $M$ has full column rank, define $P_M \equiv M(M^\top M)^{-1} M^\top$ and $P_M^\perp \equiv I - P_M$. For positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \gg b_n$ if $a_n^{-1} b_n = o(1)$. Let $a \wedge b \equiv \min(a, b)$ for any $a, b \in \mathbb{R}$.

2. Methodology

We formalize the two-layer learnability framework, and introduce a confidence distribution fusion approach for learnability recovery and mutual transfer learning.

2.1. Two-Layer Learnability Structure

The $n_i$ data points in the $i$-th unit can be expressed as
$$y_i = x_i\beta_0 + z_i\theta_{i0} + z_i u_i + \varepsilon_i, \quad i = 1, \ldots, M, \qquad (3)$$
where $y_i = (y_{i1}, \ldots, y_{i n_i})^\top$, $x_i$ ($z_i$) is an $n_i \times p$ ($n_i \times q$) data matrix for the global (heterogeneous) features, and $\varepsilon_i$ is the vector of noises with zero mean and covariance $\sigma_\varepsilon^2 I_{n_i}$. Let $N = \sum_{i=1}^M n_i$ be the total sample size. The full data can be expressed as
$$Y = X\beta_0 + Z\Theta_0 + ZU + E, \qquad (4)$$
where $Y = (y_i^\top)^\top_{i=1,\ldots,M} \in \mathbb{R}^N$ is the stacked response, $E = (\varepsilon_i^\top)^\top_{i=1,\ldots,M} \in \mathbb{R}^N$ is the stacked error, $X = (x_i^\top)^\top_{i=1,\ldots,M}$ and $Z = \mathrm{diag}(z_i)_{i=1,\ldots,M}$ are stacked $N \times p$ and $N \times Mq$ data matrices, and $\Theta_0 = (\theta_{i0}^\top)^\top_{i=1,\ldots,M}$ and $U = (u_i^\top)^\top_{i=1,\ldots,M}$ are $Mq$-dimensional vectors of coefficients for the heterogeneous features and random effects. Recall that $\beta_0$ is a global coefficient vector shared by all data units.

Ignoring the mutual learnability assumption, the above model is a linear mixed-effects model, which can be estimated by the generalized least squares method (GLS; Rothenberg, 1984; Anh & Chelliah, 1999), which amounts to minimizing
$$L_N^{\mathrm{GLS}}(\beta, \Theta) \equiv \sum_{i=1}^M (y_i - x_i\beta - z_i\theta_i)^\top W_i (y_i - x_i\beta - z_i\theta_i),$$
where $W_i \equiv \mathrm{Cov}(y_i \mid x_i, z_i)^{-1} = (\sigma_\varepsilon^2 I_{n_i} + z_i \Psi z_i^\top)^{-1}$. For simplicity, we assume that $\Psi$ and $\sigma_\varepsilon^2$ are known. Otherwise, these variance components can be consistently estimated through the restricted maximum likelihood (REML) method (Richardson & Welsh, 1994; Jiang, 1996).

However, the standard GLS approach does not take into account the different learnability and does not conduct learnability recovery. To fuse the $\theta_i$'s together, we may add a pairwise concave fusion penalty to the objective function:
$$Q_N(\beta, \Theta) \equiv L_N^{\mathrm{GLS}}(\beta, \Theta) + \sum_{1 \le i < j \le M} p_\gamma(\|\theta_i - \theta_j\|; \lambda), \qquad (5)$$
where $p_\gamma(\cdot; \lambda)$ is a concave penalty function with a tuning parameter $\lambda > 0$ and a concavity parameter $\gamma > 0$. A natural estimate is thus defined as
$$\big(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top\big)^\top = \arg\min_{\beta \in \mathbb{R}^p,\, \Theta \in \mathbb{R}^{Mq}} Q_N(\beta, \Theta). \qquad (6)$$
The above full-data estimation is referred to as the individual participant data (IPD) method, the gold standard in the meta-analysis context (e.g., Debray et al., 2015). However, for large-scale data, the IPD method requires very expensive computation; see Section 3.2 for comparisons. This motivates the proposed CD fusion approach below.

In this article, we consider four popular concave fusion penalty functions: the $L_1$ or Lasso penalty (Tibshirani, 1996), the minimax concave penalty (MCP; Zhang, 2010), the smoothly clipped absolute deviation penalty (SCAD; Fan & Li, 2001), and the truncated Lasso penalty (TLP; Shen et al., 2012).
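As a sketch of how objective (5) could be evaluated, the code below implements the GLS loss with known variance components and a pairwise MCP fusion penalty in its standard form. The function names and the `units` container are illustrative assumptions; actually minimizing (6) would additionally require the ADMM algorithm discussed in Section 3, which is not shown here.

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty p_gamma(t; lambda) in its standard form (Zhang, 2010)."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    0.5 * gamma * lam ** 2)

def gls_fusion_objective(beta, Theta, units, Psi, sigma2, lam, gamma):
    """Penalized GLS objective (5): the GLS loss plus a pairwise MCP fusion penalty
    on the theta_i's. `units` is a list of (x_i, z_i, y_i); `Theta` has one row per unit.
    Psi and sigma2 are treated as known, as in the paper."""
    loss = 0.0
    for (x_i, z_i, y_i), theta_i in zip(units, Theta):
        W_i = np.linalg.inv(sigma2 * np.eye(len(y_i)) + z_i @ Psi @ z_i.T)  # W_i = Cov(y_i | x_i, z_i)^{-1}
        r_i = y_i - x_i @ beta - z_i @ theta_i
        loss += r_i @ W_i @ r_i
    M = len(units)
    penalty = sum(mcp(np.linalg.norm(Theta[i] - Theta[j]), lam, gamma)
                  for i in range(M) for j in range(i + 1, M))
    return loss + penalty
```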
The $L_1$ penalty is known to produce biased estimates (Zhao & Yu, 2006). In our simulations, it tends to produce either many subgroups or no subgroup at all. In contrast, MCP, SCAD, and TLP are more appropriate for recovering the learnability structure since they enjoy selection consistency. See Section S.1 for the definitions of these penalties.

2.2. MTL Estimator using Confidence Distribution Fusion

We propose a CD fusion approach which produces the MTL estimator $(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top)^\top$ at a much lower computational cost than the IPD method, while obtaining the same result (see Theorem 2.1). We start by obtaining individual unit estimates, and then merge them through their CD densities as defined in Liu et al. (2015).

Given the data in each unit, the first step is to obtain individual unit estimates of $(\beta^\top, \theta_i^\top)^\top$ via the GLS method, for each $i$, as
$$\hat b_i = \big[(x_i, z_i)^\top W_i (x_i, z_i)\big]^{-1} (x_i, z_i)^\top W_i\, y_i. \qquad (7)$$
Recall that $\Psi$ and $\sigma_\varepsilon^2$ are assumed known. Let $V_i$ denote the square root matrix of $W_i$ such that $W_i = V_i^2$; then (7) is equivalent to the ordinary least squares solution for the transformed unit data $V_i y_i = V_i x_i\beta + V_i z_i\theta_i + V_i\varepsilon_i$ with $\mathrm{Var}(V_i y_i) = I_{n_i}$. According to He & Shao (2000), as $n_i$ grows, we have, conditional on $(x_i, z_i)$,
$$\big[(x_i, z_i)^\top W_i (x_i, z_i)\big]^{1/2}\big(\hat b_i - (\beta_0^\top, \theta_{i0}^\top)^\top\big) \xrightarrow{D} N(0, I).$$
Following Liu et al. (2015), we define the combined CD density as $h(\beta, \Theta) = \prod_{i=1}^M h_i(\beta, \theta_i)$, where $h_i(\beta, \theta_i)$ is the CD density for the $i$-th unit. We then define an objective function which consists of $-\log h(\beta, \Theta)$ (with additive constant terms omitted) and a pairwise concave fusion penalty, i.e.,
$$Q_N^{\mathrm{CD}}(\beta, \Theta) \equiv \sum_{i=1}^M \big(\hat b_i - (\beta^\top, \theta_i^\top)^\top\big)^\top (x_i, z_i)^\top W_i (x_i, z_i)\, \big(\hat b_i - (\beta^\top, \theta_i^\top)^\top\big) + \sum_{1 \le i < j \le M} p_\gamma(\|\theta_i - \theta_j\|; \lambda). \qquad (8)$$

Assumption 4.1 (Design). (i) The rows of the design matrices have sub-Gaussian tails in the sense that $P(|a^\top f_{ik}| > \|a\|\, t) \le 2\exp(-c_f t^2)$ for all $i = 1, \ldots, M$, $k = 1, \ldots, n_i$, and some absolute constant $c_f > 0$, where $f_{ik}$ denotes the $k$-th row of $(x_i, z_i)$. (ii) Let $F = (X, Z)$. There exists some absolute constant $0 < C_f < 1$ such that $C_f \le \lambda_{\min}(E\, FF^\top) \le \lambda_{\max}(E\, FF^\top) \le C_f^{-1}$.

Figure 3. Computation time (upper panel) and peak memory usage (lower panel, in gigabytes) for the CD-based MTL and the IPD methods.

Assumption 4.2 (U and E). The coefficient vector $U$ and the noise vector $E$ have sub-Gaussian tails in the sense that $P(|a^\top U| > \|a\|\, t) \le 2\exp(-c_e t^2)$ and $P(|b^\top E| > \|b\|\, t) \le 2\exp(-c_e t^2)$ for any vectors $a \in \mathbb{R}^{Mq}$, $b \in \mathbb{R}^N$, any $t > 0$, and some constant $c_e > 0$.

Let $n_{\min} \equiv \min_{1 \le i \le M} n_i$, $g_{\min} \equiv \min_{1 \le s \le S} \sum_{i: L_i = s} n_i$, and $g_{\max} \equiv \max_{1 \le s \le S} \sum_{i: L_i = s} n_i$ be the minimum unit size and the minimum and maximum total sample sizes in a subgroup. Theorem 4.1 gives a non-asymptotic upper bound and the limiting distribution for the oracle estimator $(\hat\beta_{\mathrm{OR}}^\top, \hat\alpha_{\mathrm{OR}}^\top)^\top$, which is computed as if the true label matrix $A$ (under which $\Theta_0 = A\alpha_0$) were known.

Theorem 4.1 (Properties of the Oracle Estimator). Given Assumptions 4.1 and 4.2, suppose $g_{\min} \gg (p + Sq)^{1/2} N^{3/4}$, $M = o(\exp(n_{\min}))$, and $p + Sq = o(\exp(g_{\max}))$.

(i) With probability at least $1 - (p + Sq)(4N^{-1} + e^{-g_{\max}}) - 2Me^{-n_{\min}}$, we have
$$\left\| \begin{pmatrix} \hat\beta_{\mathrm{OR}} - \beta_0 \\ \hat\alpha_{\mathrm{OR}} - \alpha_0 \end{pmatrix} \right\| \le 5\, c_e^{-1/2} c_f^{1/2} C_f^{-1} (\tau \wedge \sigma_\varepsilon^2)^{-1/2} \sigma_\varepsilon\, g_{\min}^{-1} \sqrt{(p + Sq)\, N \log N} \;\equiv\; \phi_N, \qquad (11)$$
where $\tau = \lambda_{\min}(\Psi)$.

(ii) For any sequence of $(p + Sq)$-vectors $\{a_N\}$ with $\|a_N\| = 1$, we have, as $N \to \infty$,
$$\sigma_N^{-1}(a_N)\, a_N^\top \begin{pmatrix} \hat\beta_{\mathrm{OR}} - \beta_0 \\ \hat\alpha_{\mathrm{OR}} - \alpha_0 \end{pmatrix} \xrightarrow{D} N(0, 1), \qquad (12)$$
with $\sigma_N(a_N) \equiv \sqrt{a_N^\top \big[(X, ZA)^\top W (X, ZA)\big]^{-1} a_N}$.

It is important to note that $g_{\min} \to \infty$ under these conditions. Since $g_{\min} \le N/S$, the condition $g_{\min} \gg (p + Sq)^{1/2} N^{3/4}$ implies that $S\sqrt{p + Sq} = o(N^{1/4})$. This imposes an upper bound on how fast $(S, p, q)$ can grow with $N$. In particular, if $p$ and $q$ are assumed to be fixed, we have $S = o(N^{1/6})$. To appreciate the upper bound (11), we make the following simplifications: $n_i \equiv n$ (thus $N = Mn$), $g_{\min} \asymp g_{\max} \asymp N/S$, and fixed $(p, q)$. In this case, the upper bound (11) admits the rate $S^{3/2}\sqrt{(\log M + \log n)/(Mn)}$. It becomes $o_P(1)$ if $S^3 = o\big(Mn/(\log M + \log n)\big)$, which suggests that $S$ cannot grow faster than $M^{1/3}$.
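As a quick sanity check on the rate just stated, substituting the simplifications ($p + Sq \asymp S$, $g_{\min} \asymp N/S$, $N = Mn$, and $\log N = \log M + \log n$) directly into the bound (11) gives
$$g_{\min}^{-1}\sqrt{(p + Sq)\, N \log N} \;\asymp\; \frac{S}{N}\sqrt{S\, N \log N} \;=\; S^{3/2}\sqrt{\frac{\log N}{N}} \;=\; S^{3/2}\sqrt{\frac{\log M + \log n}{Mn}},$$
which is $o(1)$ precisely when $S^3 = o\big(Mn/(\log M + \log n)\big)$.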
4.2. Theoretical Properties of the MTL Estimator

We prove that the proposed MTL estimator is asymptotically equivalent to the oracle estimator, and hence achieves the highest level of statistical inferential accuracy. More importantly, we also show that, through MTL, the proposed estimator is asymptotically more efficient than the individual unit estimator in (7), which is based only on the data from a single unit. These results require two additional conditions: one is on the minimal signal, defined as $\delta_N \equiv \min_{s \ne s'} \|\alpha_{s0} - \alpha_{s'0}\|$; the other is on the penalty function $p_\gamma(t; \lambda)$.

Assumption 4.3 (Penalty Function). $p_\gamma(t; \lambda)$ is symmetric with $p_\gamma(0; \lambda) = 0$, and is non-decreasing and concave for $t \in [0, \infty)$. There exists a constant $a > 0$ such that $p_\gamma(t; \lambda)$ is constant for all $t \ge a\lambda$. The derivative $\partial_t p_\gamma(t; \lambda)$ exists and is continuous except at a finite number of $t$, and $\partial_t p_\gamma(0^+; \lambda) = \lambda$.

This assumption is satisfied by such non-convex penalties as MCP, SCAD, and TLP with $a = \gamma$, and is also considered in Ma & Huang (2016; 2017). Along with a proper minimal signal condition, it ensures that the penalty term does not push $\theta_i$ and $\theta_j$ from different subgroups toward each other.

Theorem 4.2 (Oracle Property). Suppose the conditions in Theorem 4.1 and Assumption 4.3 hold and $S \ge 2$. If $\phi_N \ll \lambda \ll a^{-1}\delta_N$, where $a$ is defined in Assumption 4.3 and $\phi_N$ is given in Theorem 4.1, then there exists a local minimizer $(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top)^\top$ of (8) satisfying
$$P\Big(\big(\hat\beta(\lambda)^\top, \hat\Theta(\lambda)^\top\big)^\top = \big(\hat\beta_{\mathrm{OR}}^\top, \hat\Theta_{\mathrm{OR}}^\top\big)^\top\Big) \to 1.$$

Let $\hat\alpha(\lambda)$ and $\hat\alpha_{\mathrm{OR}}$ be the distinct values of $\hat\Theta(\lambda)$ and $\hat\Theta_{\mathrm{OR}}$, respectively. Theorem 4.2 implies that the proposed estimator has the same limiting distribution as the oracle estimator, given in (12). Moreover, an interesting efficiency boosting phenomenon appears in part (ii) of Corollary 4.1 below.

Corollary 4.1. Suppose the conditions in Theorem 4.2 hold.

(i) (Asymptotic Normality) For any sequence of $(p + Sq)$-vectors $\{a_N\}$ with $\|a_N\| = 1$, we have, as $N \to \infty$,
$$\sigma_N^{-1}(a_N)\, a_N^\top \begin{pmatrix} \hat\beta(\lambda) - \beta_0 \\ \hat\alpha(\lambda) - \alpha_0 \end{pmatrix} \xrightarrow{D} N(0, 1),$$
where $\sigma_N(a_N)$ is given in Theorem 4.1.

(ii) (Efficiency Boosting) For any $p$-vector $v_p$ and $q$-vector $v_q$, we have, for all $i = 1, \ldots, M$,
$$v_p^\top \mathrm{Cov}\big(\hat\beta(\lambda)\big)\, v_p \le v_p^\top \mathrm{Cov}\big(\hat b_{i,\beta}\big)\, v_p \quad\text{and}\quad v_q^\top \mathrm{Cov}\big(\hat\theta_i(\lambda)\big)\, v_q \le v_q^\top \mathrm{Cov}\big(\hat b_{i,\theta}\big)\, v_q,$$
where $\hat b_{i,\beta}$ and $\hat b_{i,\theta}$ denote the $\beta$- and $\theta_i$-components of the individual unit estimate $\hat b_i$ in (7).

To estimate the limiting standard deviation $\sigma_N(a_N)$, we need to replace the unknown label matrix $A$ (under which $\Theta_0 = A\alpha_0$) by its estimate $\hat A$. This can be trivially done after $\hat\alpha(\lambda)$ and $\hat\Theta(\lambda)$ are computed. Another challenge is that the direct computation of $(X, Z\hat A)^\top W (X, Z\hat A)$, which requires $O\big(N^2(p + \hat S(\lambda)q)\big)$ operations, is infeasible when $N$ is large. A more efficient way is to take the multivariate GLS approach of Becker & Wu (2007), as follows. Let $\hat G_s$, $s = 1, \ldots, \hat S(\lambda)$, be $(p + q) \times (p + \hat S(\lambda)q)$ matrices based on the estimated label matrix $\hat A$ such that
$$\begin{pmatrix} \hat\beta(\lambda) \\ \hat\theta_i(\lambda) \end{pmatrix} = \begin{pmatrix} \hat\beta(\lambda) \\ \hat\alpha_{\hat L_i}(\lambda) \end{pmatrix} = \hat G_{\hat L_i} \begin{pmatrix} \hat\beta(\lambda) \\ \hat\alpha(\lambda) \end{pmatrix}, \quad i = 1, \ldots, M.$$
After straightforward algebra, we have
$$(X, Z\hat A)^\top W (X, Z\hat A) = \sum_{i=1}^M \hat G_{\hat L_i}^\top (x_i, z_i)^\top W_i (x_i, z_i)\, \hat G_{\hat L_i}.$$
Note that the $(x_i, z_i)^\top W_i (x_i, z_i)$'s are the $(p + q) \times (p + q)$ precision matrices supplied by the data units under the MTL approach. Hence, the right-hand side of the equation above only requires $O\big(M(p + \hat S(\lambda)q)^2(p + q)\big)$ operations.
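A sketch of this block-wise assembly is given below. The argument names and containers are illustrative assumptions, and the construction of the $\hat G_s$ matrices from the estimated label matrix is taken as given.

```python
import numpy as np

def mtl_precision(units, W_list, G_list, labels):
    """Assemble (X, Z A_hat)^T W (X, Z A_hat) block-wise, as in the multivariate GLS
    approach of Becker & Wu (2007): sum_i G_{L_i}^T (x_i, z_i)^T W_i (x_i, z_i) G_{L_i}.

    units:  list of (x_i, z_i) design-matrix pairs, one per unit
    W_list: list of the n_i x n_i weight matrices W_i
    G_list: list of the (p + q) x (p + S_hat * q) mapping matrices G_s, one per subgroup
    labels: estimated subgroup label L_hat_i (0-based) for each unit
    """
    dim = G_list[0].shape[1]
    out = np.zeros((dim, dim))
    for (x_i, z_i), W_i, L_i in zip(units, W_list, labels):
        F_i = np.hstack([x_i, z_i])                # n_i x (p + q) unit design matrix
        G_i = G_list[L_i]
        out += G_i.T @ (F_i.T @ W_i @ F_i) @ G_i   # map each (p+q) x (p+q) unit precision
    return out
```

Because each summand involves only unit-level quantities, the computation parallelizes naturally over data units.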
5. Simulation Experiments

We now evaluate the MTL method using simulations.

5.1. Simulation Setup

Table 1. Simulation Settings.

| Simulation case | n | M | S | SNR |
| --- | --- | --- | --- | --- |
| Case 1 | 1024 | 50 | 2 | 4.399 |
| Case 2 | 1024 | 50 | 3 | 8.838 |
| Case 3 | 2048 | 50 | 2 | 4.399 |
| Case 4 | 2048 | 50 | 3 | 8.838 |
| Case 5 | 2048 | 100 | 2 | 4.399 |
| Case 6 | 2048 | 100 | 3 | 8.838 |
| Case 7 | 2048 | 100 | 5 | 8.734 |
| Case 8 | 2048 | 150 | 5 | 8.734 |
| Case 9 | 2048 | 150 | 7 | 9.260 |

Table 1 summarizes nine simulation settings (with p = 5 global features and q = 3 heterogeneous features), each with 100 replications; the signal-to-noise ratio (SNR) is defined in Section S.10. The largest total sample size is 307,200. For simplicity, we consider equal unit sizes $n_i \equiv n$ for $i = 1, \ldots, M$. We let the number of units in each subgroup be $(M_1, \ldots, M_S) = \mathbf{1}_S + \mathrm{Multinomial}(M - S, \mathbf{1}_S/S)$. The coordinates of $\beta_0$ were generated independently from $\mathrm{Uniform}(-2, 2)$. To mimic the different coefficient values of the heterogeneous features between subgroups, we generated $\alpha_0 = (\alpha_{1,0}^\top, \ldots, \alpha_{S,0}^\top)^\top$, where $\alpha_{s,0} = (\alpha_{s,0,1}, \alpha_{s,0,2}, \alpha_{s,0,3})^\top$, in a way that guarantees the minimal signal condition. The values of $(\alpha_{1,0,1}, \ldots, \alpha_{S,0,1})$ were assigned to evenly spaced grid points over $[-S^{1.4}/2,\, S^{1.4}/2]$, denoted as $(\alpha^*_1, \ldots, \alpha^*_S)$, and then $\alpha_0$ was generated by
$$\alpha_0 = \mathrm{vec}\begin{pmatrix} \alpha^*_1 & \alpha^*_2 & \alpha^*_3 & \cdots & \alpha^*_{S-1} & \alpha^*_S \\ \alpha^*_S & \alpha^*_1 & \alpha^*_2 & \cdots & \alpha^*_{S-2} & \alpha^*_{S-1} \\ \alpha^*_{S-1} & \alpha^*_S & \alpha^*_1 & \cdots & \alpha^*_{S-3} & \alpha^*_{S-2} \end{pmatrix},$$
where $\mathrm{vec}(\cdot)$ denotes the vectorization operator. If $S$ is odd, we further add 1 to all coordinates of $\alpha_0$ to avoid subgroup effect coordinates being 0. The data matrices $f_i = (x_i, z_i)$ consist of independent rows, each generated from the 8-dimensional normal $N(0, \Sigma_F)$, where the diagonal elements of $\Sigma_F$ are 1 and the off-diagonal elements are 0.3. Moreover, $u_i$ and $\varepsilon_i$ follow $N(0, 0.3I)$ and $N(0, I)$, respectively. Finally, $Y$ was generated from the oracle model.

Figure 4. Solution paths of $(\theta_{1,1}, \ldots, \theta_{M,1})$ for the first dataset in Case 5 (n = 256, M = 100, S = 2). Colors stand for true subgroups, horizontal dashed lines represent the true values of the first coordinate of the $\alpha_s$'s, and the vertical line displays the best $\lambda$ selected by the modified BIC, whose values are displayed by gray bars.

Figure 4 shows the solution paths of $(\theta_{1,1}, \ldots, \theta_{M,1})$, the first coordinate of the $\theta_i$'s. It can be seen that, under MCP, SCAD, and TLP, the learnability structure can be well recovered using the modified BIC given in (10) to choose the tuning parameter $\lambda$; the selected value is denoted as $\hat\lambda$. On the other hand, although the $L_1$ penalty obtains the correct number of subgroups here, the coefficient estimates are far from the truth.

5.2. Learnability Structure Recovery

We use the Normalized Mutual Information (NMI; Fred & Jain, 2003) as the performance measure for learnability structure recovery. NMI takes values in $[0, 1]$, and larger values imply more similar groupings. Let $C = \{C_1, C_2, \ldots\}$ and $D = \{D_1, D_2, \ldots\}$ denote two partitions of $\{1, \ldots, M\}$. NMI is defined as
$$\mathrm{NMI}(C, D) \equiv \frac{2\, I(C; D)}{H(C) + H(D)},$$
where $I(C; D) \equiv \sum_{k,l} (|C_k \cap D_l|/M) \log\big(M|C_k \cap D_l| / (|C_k|\,|D_l|)\big)$ is the mutual information between $C$ and $D$, and $H(C) \equiv -\sum_k (|C_k|/M) \log(|C_k|/M)$ is the entropy of $C$. The percentage of perfect recoveries among the 100 replications is also reported.
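A direct implementation of this NMI definition, taking label vectors rather than set partitions as input (an illustrative choice), might look as follows.

```python
import numpy as np
from collections import Counter

def nmi(labels_c, labels_d):
    """Normalized mutual information between two partitions of {1, ..., M},
    encoded as equal-length label vectors: NMI = 2 I(C; D) / (H(C) + H(D))."""
    M = len(labels_c)
    counts_c, counts_d = Counter(labels_c), Counter(labels_d)
    counts_cd = Counter(zip(labels_c, labels_d))
    mutual_info = sum((n_kl / M) * np.log(M * n_kl / (counts_c[k] * counts_d[l]))
                      for (k, l), n_kl in counts_cd.items())
    entropy = lambda counts: -sum((n / M) * np.log(n / M) for n in counts.values())
    return 2 * mutual_info / (entropy(counts_c) + entropy(counts_d))
```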
We also consider an ad hoc approach, to compare with our method, based on performing K-means on the unit estimates $\hat\theta_i$ with the modified BIC used to select the number of subgroups. Specifically, for any given subgroup size $S = K$, the K-means algorithm is performed to obtain estimated subgroup labels, and the oracle estimate is then computed using these estimated labels and the variance components. Subsequently, the modified BIC values are calculated by plugging in these K-means-based estimates. In our analysis, we run through $K = 1, \ldots, 10$ and choose the value of $K$ that minimizes the modified BIC.

Table 2. Learnability structure recovery (short version; the full table is in the supplementary material).

| Case | Method | Mean, Median (Min, Max) of $\hat S$ | NMI | Perfect Recovery |
| --- | --- | --- | --- | --- |
| Case 1 (S = 2) | MCP | 2, 2 (2, 2) | 0.998 | 0.99 |
| | SCAD | 2, 2 (2, 2) | 0.998 | 0.99 |
| | TLP | 2, 2 (2, 2) | 0.998 | 0.99 |
| | L1 | 2, 2 (1, 4) | 0.947 | 0.62 |
| | K-Mns | 2, 2 (2, 3) | 0.989 | 0.95 |
| Case 7 (S = 5) | MCP | 5, 5 (5, 6) | 0.999 | 0.98 |
| | SCAD | 5, 5 (5, 5) | 0.999 | 0.99 |
| | TLP | 5, 5 (5, 6) | 0.999 | 0.99 |
| | L1 | 63.2, 100 (1, 100) | 0.587 | 0.00 |
| | K-Mns | 5, 5 (5, 5) | 0.976 | 0.37 |
| Case 9 (S = 7) | MCP | 7, 7 (7, 7) | 0.999 | 0.99 |
| | SCAD | 7.3, 7 (7, 9) | 0.998 | 0.77 |
| | TLP | 7.2, 7 (7, 9) | 0.998 | 0.81 |
| | L1 | 150, 150 (150, 150) | 0.620 | 0.00 |
| | K-Mns | 7, 7 (7, 7) | 0.978 | 0.20 |

Table 2 includes results for selected cases (Cases 1, 7, and 9; see the full table in the supplementary material). It can be seen that MCP, SCAD, and TLP result in desirable recovery performance, while the $L_1$ penalty does not. MCP works very well in all cases. Although the high NMI values suggest that SCAD and TLP still capture the true subgroup structure well, their performance gets worse when $S$ is large (e.g., Case 9) in terms of the standard deviation of $\hat S$ and the perfect recovery rate. This echoes the discussion following Theorem 4.1 that the theoretical properties hold only when $S$ does not grow too fast. Due to its poor performance, the $L_1$ penalty is no longer considered in the subsequent analysis. K-means performs well for the smaller $M = 50$, but poorly for the larger $M = 100$ and $150$, even though the number of subgroups is correctly obtained (e.g., Case 7).

5.3. Parameter Estimation

To evaluate the parameter estimation, we define the root mean squared error (RMSE) between two vectors $v_1, v_2 \in \mathbb{R}^d$ as $\mathrm{RMSE}(v_1, v_2) = \sqrt{1/d}\;\|v_1 - v_2\|$. In Figure 5, we empirically examine how close the proposed estimate is to both the true value and the oracle estimate. Memory shortages occurred in some cases when computing the oracle estimates directly; to resolve this issue, the multivariate GLS approach of Becker & Wu (2007) is applied to compute the oracle estimates.

Figure 5. Evaluation of parameter estimation. Panels (top to bottom): $\mathrm{RMSE}(\hat\alpha, \alpha_0)$; $\mathrm{RMSE}(\hat\alpha, \hat\alpha_{\mathrm{OR}})$; $\mathrm{RMSE}(\hat\beta, \beta_0)$; $M^{-1}\sum_{i=1}^M \mathrm{RMSE}(\hat\theta_i, \theta_{0i})$, for MCP, SCAD, TLP, and the unit estimates.

First, from the top panel in Figure 5, we can see that $\mathrm{RMSE}(\hat\alpha, \alpha_0)$ reasonably increases with $S$ and decreases with $M$. This confirms the intuition that a larger $S$ implies more challenges in mutual transfer. On the other hand, increasing $M$ allows the heterogeneous features to be estimated using more data units, which improves their estimation accuracy. The third panel in Figure 5 indicates that $\mathrm{RMSE}(\hat\beta, \beta_0)$ depends on $N = Mn$ but is independent of $S$. Note that $\beta$ is the global parameter shared over all data units. This result numerically verifies that the proposed method effectively merges the common information provided by the data units, regardless of how many different subgroups they come from.

The results in Theorem 4.2 and part (ii) of Corollary 4.1 can be verified as well. First, the small values of $\mathrm{RMSE}(\hat\alpha, \hat\alpha_{\mathrm{OR}})$ (the second panel) and $\mathrm{RMSE}(\hat\beta, \hat\beta_{\mathrm{OR}})$ (which are all close to 0 and omitted here) imply that $(\hat\beta^\top, \hat\alpha^\top)^\top$ is fairly close to its oracle counterpart $(\hat\beta_{\mathrm{OR}}^\top, \hat\alpha_{\mathrm{OR}}^\top)^\top$.
The third panel in Figure 5 also indicates that the estimation accuracy of $\hat\beta$ is better than that of the best unit estimate of the common effect. A similar observation applies to the estimation of the $\theta_i$'s, as demonstrated in the bottom panel of Figure 5. Both panels support the efficiency boosting result in Corollary 4.1. We also studied the asymptotic covariance approximation and the inferential accuracy; see Section S.11.

6. Real Data Example

NOAA's nClimDiv database (available at ftp://ftp.ncdc.noaa.gov/pub/data/cirs/climdiv/) was analyzed to demonstrate the MTL method's practical usefulness. The monthly average temperature is the response of interest. We categorize the 8 features into global ones and heterogeneous ones by inspecting the kernel densities. Intuitively, the distribution of the unit-level coefficient estimates for a heterogeneous feature is more likely to be multimodal or widespread (as opposed to unimodal or concentrated) across units. In this way we have chosen the intercept, PCPN, and ZNDX as the heterogeneous features (i.e., q = 3 and p = 5) and the rest as global features; see details in Section S.11.2.

Figure 2 displays the recovered learnability structure, where the five resulting subgroups generally follow a geographical pattern, even though we do not use any spatial information about the climate divisions. It may appear strange that the northeastern parts of New England are in the same subgroup as Texas and Southern California. However, this pattern actually coincides with NOAA's outlook for temperature, precipitation, and drought: the blue and red groups together follow the temperature outlook (see, e.g., the first figure at https://www.climate.gov/news-features/videos/noaas-2019-20-winter-outlook-temperature-precipitation-and-drought), and the red subgroup stands out due to severe drought conditions (see the United States Drought Monitor animation at https://droughtmonitor.unl.edu/Maps/Animations.aspx).

Table 3. Parameter estimates using the proposed MTL method with MCP. The average asymptotic standard deviation (ASD), multiplied by 100, is shown in parentheses.

Heterogeneous features (ASD × 100):

| Color [#(units)] | $\hat\alpha_{\text{Intercept}}$ | $\hat\alpha_{\text{PCPN}}$ | $\hat\alpha_{\text{ZNDX}}$ |
| --- | --- | --- | --- |
| Red [41] | 64.97 (13.2) | 0.37 (9.5) | 0.07 (9.5) |
| Blue [132] | 49.53 (7.1) | 0.85 (5.3) | 1.51 (5.3) |
| Green [79] | 35.32 (8.9) | 5.44 (6.9) | 4.05 (6.8) |
| Purple [81] | 24.74 (9.2) | 7.28 (6.8) | 5.16 (6.7) |
| Orange [11] | 9.90 (32.3) | 9.14 (19.3) | 6.54 (18.6) |

Coefficients for global features (ASD × 100):

| $\hat\beta_{\text{Summer}}$ | $\hat\beta_{\text{Fall}}$ | $\hat\beta_{\text{Winter}}$ | $\hat\beta_{\text{PDSI}}$ | $\hat\beta_{\text{PHDI}}$ |
| --- | --- | --- | --- | --- |
| 18.26 (2) | 4.06 (2) | 15.12 (2) | 0.18 (1) | 0.20 (1) |

From the parameter estimates tabulated in Table 3, we can see the general pattern that the average temperature tends to drop from the southwestern areas to the northeastern areas. ZNDX has a tendency similar to that of the intercept, while PCPN behaves in the opposite direction. For the global feature estimates, the seasonal effects are unsurprising, and PDSI and PHDI have a positive but, compared to the heterogeneous features, relatively small influence on the temperature.

An additional prediction-driven analysis was conducted by partitioning the nClimDiv dataset into two parts: data from 1895-2000 were used for training and data from 2001-2016 for testing. Our model did not perform as well as the baseline unit-level regression model in terms of the RMSE on the test data. This may be caused by a sub-optimal choice of the tuning parameter $\lambda$, which leads to fewer subgroups than would be needed for better prediction performance. Recall that we did not choose $\lambda$ to minimize the cross-validation prediction error, but used the BIC in order to obtain a parsimonious model.
On the other hand, in light of the simulation results and the optimality theory presented in this paper, our model dominates the baseline model in terms of statistical inference accuracy. This may suggest a fundamental difference between inferential analysis and predictive analysis.

Acknowledgements

This work was completed while Guang Cheng was a member of the Institute for Advanced Study (IAS) at Princeton University in Fall 2019. Guang Cheng would like to acknowledge the hospitality of IAS and the financial support from the National Science Foundation (DMS-1712907, DMS-1811812, and DMS-1821183), the Office of Naval Research (ONR N00014-18-2759), and the Adobe Data Science Fund.

References

Anh, V. V. and Chelliah, T. Estimated generalized least squares for random coefficient regression models. Scandinavian Journal of Statistics, 26(1):31-46, 1999.

Becker, B. J. and Wu, M.-J. The synthesis of regression slopes in meta-analysis. Statistical Science, 22(3):414-429, 2007.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.

Craven, P. and Wahba, G. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377-403, 1978.

Debray, T. P. A., Moons, K. G. M., van Valkenhoef, G., Efthimiou, O., Hummel, N., Groenwold, R. H. H., Reitsma, J. B., and the GetReal methods review group. Get real in individual participant data (IPD) meta-analysis: A review of the methodology. Research Synthesis Methods, 6(4):293-309, 2015.

Eckstein, J. A practical test for univariate and multivariate normality. Technical Report RRR 32-2012, RUTCOR, December 2012.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

Fred, A. L. and Jain, A. K. Robust data clustering. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pp. 128-136, 2003.

Ghadimi, E., Teixeira, A., Shames, I., and Johansson, M. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644-658, 2015.

Harville, D. A. Matrix Algebra From a Statistician's Perspective. Springer, corrected edition, 2000. ISBN 978-0387949789.

He, X. and Shao, Q.-M. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73(1):120-135, 2000.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14(3):569-600, 2014.

Jiang, J. REML estimation: Asymptotic behavior and related topics. The Annals of Statistics, 24(1):255-286, 1996.

Liu, D., Liu, R. Y., and Xie, M. Multivariate meta-analysis of heterogeneous studies using only summary statistics: Efficiency and robustness. Journal of the American Statistical Association, 110(509):326-340, 2015.

Ma, S. and Huang, J. Estimating subgroup-specific treatment effects via concave fusion. arXiv preprint, 2016.
Ma, S. and Huang, J. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410-423, 2017.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2009.

Richardson, A. and Welsh, A. Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Australian Journal of Statistics, 36(1):31-43, 1994.

Rothenberg, T. J. Approximate normality of generalized least squares estimates. Econometrica, 52(4):811-825, 1984.

Schwarz, G. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.

Shen, X., Pan, W., and Zhu, Y. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107(497):223-232, 2012.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, 2001.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Eldar, Y. C. and Kutyniok, G. (eds.), Compressed Sensing, chapter 5, pp. 210-268. Cambridge University Press, Cambridge, 2012.

Wang, H., Li, B., and Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):671-683, 2009.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.

Zhao, P. and Yu, B. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. A comprehensive survey on transfer learning. arXiv preprint arXiv:1911.02685, 2019.