# model_linkage_selection_for_cooperative_learning__88f79339.pdf

Journal of Machine Learning Research 22 (2021) 1-44 Submitted 09/20; Revised 09/21; Published 10/21

Model Linkage Selection for Cooperative Learning

Jiaying Zhou zhou1054@umn.edu School of Statistics University of Minnesota Minneapolis MN, 55455

Jie Ding dingj@umn.edu School of Statistics University of Minnesota Minneapolis MN, 55455

Kean Ming Tan keanming@umich.edu Department of Statistics University of Michigan Ann Arbor MI, 48109

Vahid Tarokh vahid.tarokh@duke.edu Department of Electrical Engineering and Engineering Duke University Durham NC, 27708

Editor: Bert Huang

We consider the distributed learning setting where each agent or learner holds a speciﬁc parametric model and a data source. The goal is to integrate information across a set of learners and data sources to enhance the prediction accuracy of a given learner. A natural way to integrate information is to build a joint model across a group of learners that shares common parameters of interest. However, the underlying parameter sharing patterns across a set of learners may not be known a priori. Misspecifying the parameter sharing patterns or the parametric model for each learner often yields a biased estimator that degrades the prediction accuracy. We propose a general method to integrate information across a set of learners that is robust against misspeciﬁcation of both models and parameter sharing patterns. The main crux of our proposed method is to sequentially incorporate additional learners that can enhance the prediction accuracy of an existing joint model based on userspeciﬁed parameter sharing patterns across a set of learners. Theoretically, we show that the proposed method can data-adaptively select a parameter sharing pattern that enhances the predictive performance of a given learner. Extensive numerical studies are conducted to assess the performance of the proposed method.

Keywords: Data integration; decentralized learning; federated learning; model linkage selection; prediction eﬃciency.

1. Introduction

In recent years, there has been a growing interest in statistical learning problems with a set of decentralized learners, where each learner encompasses a speciﬁc data modality and a

c 2021 Jiaying Zhou, Jie Ding, Kean Ming Tan, and Vahid Tarokh.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-1078.html.

Zhou, Ding, Tan, Tarokh

statistical model built using domain-speciﬁc knowledge. The goal is to integrate information across decentralized learners to achieve higher statistical eﬃciency and predictive accuracy. Integrating information from diﬀerent data sources is crucial in many scientiﬁc domains such as environmental science (Blangiardo et al., 2011; Xingjian et al., 2015), epidemiology (Yang et al., 2015; Guo et al., 2017), statistical machine learning problems (Ngiam et al., 2011; Kong et al., 2016; Cao and Jin, 2007; Ye et al., 2020; Xian et al., 2020), and computational biology (Simmonds and Higgins, 2007; Liu et al., 2009, 2015; Wen and Stephens, 2014). For instance, in the context of epidemiology, a considerable amount of online search data from diﬀerent platforms are integrated and used to form accurate predictions of inﬂuenza epidemics (Yang et al., 2015; Guo et al., 2017). The aforementioned applications raise a critical question: how to reliably integrate information from diﬀerent data sources to enhance statistical eﬃciency in a robust manner? We consider the setting in which there are a set of learners, each consisting of a data set and a parametric statistical model. The learners may or may not share common parameters among themselves. The goal is to develop a framework to enhance the statistical eﬃciency of any particular learner, say L1, by integrating information from the other learners through parameter sharing. In general, learner L1 can be assisted explicitly or implicitly by building a joint model with the other learners with potentially diﬀerent statistical models and diﬀerent data sources. Explicit assistance could be achieved by joint modeling with a set of learners that share at least one parameter. On the other hand, implicit assistance could be achieved by joint modeling with learners whose parameters are not directly related to L1, but are related to learners who could explicitly assist L1. In principle, if the true underlying parameter sharing patterns among all learners are known a priori and that the parametric statistical model for each learner is correctly speciﬁed, then one can build a joint model with constraints on the shared parameters based on the joint likelihood function. Many existing modeling methods can be formulated as special instances of the above setting. For example, when multiple learners employ the same parametric model across diﬀerent data sources, it is usually studied in the context of data integration (Jensen et al., 2007; Vonesh et al., 2006; Liu et al., 2015; Lee et al., 2017b; Jordan et al., 2019; Danaher et al., 2014; Ma and Michailidis, 2016; Tang and Song, 2016; Li and Li, 2018; Tang et al., 2019; Maity et al., 2019; Shen et al., 2019), or in the context of distributed optimization (Boyd et al., 2011; Shi et al., 2014; Lee et al., 2017a; Li et al., 2019). A recent topic, referred to as federated learning, is also related to the above setting, where a central server (interpreted as the main learner) sends the current global model to a set of clients (interpreted as other learners), each client updates the model parameters with the local data source, and then returns them to the central server (Shokri and Shmatikov, 2015; Koneˇcn y et al., 2016; Mc Mahan et al., 2016). Though it is often helpful to establish a joint model from multiple learners, naively combining data sources and performing joint modeling can lead to severely degraded statistical performance due to four possible reasons: (i) misspeciﬁed statistical models for some learners; (ii) misspeciﬁed parameter sharing patterns among learners; (iii) heterogeneity from diﬀerent data sources (Simmonds and Higgins, 2007; Wen and Stephens, 2014; Liu et al., 2015); and (iv) distinct learning objectives for diﬀerent learners (Zuech et al., 2015; Sivarajah et al., 2017; Xian et al., 2020; Wang et al., 2021). Most existing data integration and distributed computing methods are based on the assumption that the statistical model

Model Linkage Selection for Cooperative Learning

for each learner is correctly speciﬁed and that the parameter sharing patterns are known a priori. A systematic statistical approach for decentralized learning robust against the four aspects mentioned above is relatively lacking. In this paper, we propose a general approach to enhance the predictive performance of a speciﬁc learner L1 by integrating information from the other learners. We consider the setting where there are M learners and that the learners may have diﬀerent statistical models and heterogeneous data sources. To characterize parameter sharing patterns among all learners, we introduce the notion of a model linkage graph. A model linkage graph G = (V, E) consists of a set of M vertices V and a set of edges E, where each vertex represents a learner, and an edge between a pair of vertices encodes a parameter sharing pattern between the pair of learners. In addition, a pair of learners do not share any common parameters if there is no edge between them. A joint model that enhances the predictive performance of L1 can then be ﬁt given a model linkage graph. However, the ground truth or the most suitable model linkage graph is not known a priori. Due to model misspeciﬁcation within each learner and misspeciﬁed model linkages between pairs of learners, the prediction performance of L1 may degrade after incorporating information from other learners. To enhance robustness against a misspeciﬁed model linkage graph, one could exhaustively build joint models for all possible sets of learners that are connected to L1 within a given model linkage graph. Then, the set of learners that yields the largest conditional marginal likelihood of L1 is selected. However, such an approach is computationally infeasible since the number of possible sets of learners grows exponentially with the number of learners. To address this challenge, we propose a greedy algorithm that is robust against a misspeciﬁed model linkage graph. Our proposed algorithm sequentially incorporates additional learners based on the user-speciﬁed model linkage graph, starting from learner L1. In each iteration of the algorithm, we utilize the joint model built with a group of learners from the previous iteration and search for the next learner to improve the marginal likelihood of the current group of learners. This process is continued until no more learners are included in the joint model. This approach approximately reduces the number of possible sets of model linkages from exponential to quadratic in the number of learners. Compared with the joint modeling approach, our proposed method is of distributed nature and does not require data sharing across learners. To quantify the theoretical aspects of the proposed method, we introduce the notion of model linkage selection consistency and asymptotic prediction eﬃciency. They are diﬀerent but conceptually parallel to asymptotic eﬃciency and selection consistency in the classical model selection literature (see, e.g., Ding et al., 2018). We show that the proposed algorithm achieves linkage selection consistency as long as the user-speciﬁed model linkage graph is a superset of the underlying model linkage graph. In other words, the proposed method selects data sources that are truly useful for enhancing the predictive performance of L1 in a dataadaptive manner as the sample size grows. In addition, we show that the proposed method achieves asymptotic prediction eﬃciency, meaning that the predictive performance of the selected model is asymptotically equivalent to that of the best joint model in hindsight. The paper is outlined as follows. In Section 2, we provide the motivation, problem description, and deﬁnitions related to the model linkage graph. In Section 3, we propose a general method for integrating information from diﬀerent learners to enhance the predictive performance of L1. The theoretical results for the proposed framework are provided in

Zhou, Ding, Tan, Tarokh

Section 3.2.2. In Section 4, we perform numerical studies to evaluate the performance of the proposed method under diﬀerent scenarios such as data contamination and model misspeciﬁcation. We close with a discussion in Section 5, where we highlight some related literature on data integration and federated learning. The technical proofs and the regularity conditions needed for the main results are included in the Appendix.

2. Background and Motivation

Suppose that there are M learners, L1, L2, . . . , LM. Each learner is a pair Lκ = (D(κ), Pκ), where D(κ) is a set of nκ observations from the sample space D(κ), and Pκ is a user-speciﬁed class of parametric model for modeling D(κ), parameterized by a pκ-dimensional parameter θκ Θκ. Our goal is to develop a framework for enhancing the predictive performance of a particular learner, say L1, by integrating information from other learners, L2, . . . , LM. We will focus on the regression setting where D(κ) = (Y, X (κ)), with Y = R and X (κ) = Rkκ. The covariates are allowed to have diﬀerent dimensions across the M learners due to diﬀerent data sources. While we focus on the regression setting, the proposed approach can be applied more generally to data of diﬀerent forms. Let D(κ) = (y(κ), X(κ)) D(κ), where y(κ) Rnκ is an nκ-dimensional vector of response and X(κ) Rnκ kκ is an nκ kκ matrix of covariates. Let Pκ = n p(κ) θκ ( |θk, X(κ)) : θκ Θκ, X(κ) X (κ)o be a class of user-

speciﬁed parametric model for modeling y(κ) given X(κ). For each learner, assume that the underlying response variable is generated independently according to the probability law P κ described by a conditional density p κ( |x(κ)), given the covariates x(κ) Rkκ. We start with providing several deﬁnitions that will serve as a foundation of the proposed framework: model misspeciﬁcation, model linkage, and model linkage misspeciﬁcation. A class of user-speciﬁed model Pκ is said to be misspeciﬁed when it does not contain the underlying conditional density p κ( |x(κ)). In practice, model misspeciﬁcation often occurs due to an inappropriate functional form between the response and covariates, such as underﬁtting the model or neglecting dependent random noise (see, e.g., Domowitz and White, 1982; Clarke, 2005; Cawley and Talbot, 2010). We now provide a formal deﬁnition of model misspeciﬁcation.

Deﬁnition 1 A model P = {pθ : θ Θ} is well-speciﬁed if there exists a θ Θ such that pθ = p almost everywhere. A model P = {pθ : θ Θ} is misspeciﬁed if supθ Θ P {y : pθ(y|θ, x) = p (y|x)} < 1 for some covariates x, where P denotes the probability measure corresponding to p .

Figure 1 provides several examples of well-speciﬁed and misspeciﬁed models in the context of regression. The left panel of Figure 1 indicates that the learner L1 is well-speciﬁed, since the underlying linear model p 1 P1, where P1 is a class of user-speciﬁed linear model with two covariates. On the other hand, the middle panel of Figure 1 presents a case when the learner L2 is misspeciﬁed since p 2 / P2. Model misspeciﬁcation also occurs when the user-speciﬁed model class is diﬀerent from the underlying data-generating process, as illustrated in the right panel of Figure 1. Next, we provide a new deﬁnition of model linkage. Suppose that there are two learners Li and Lj. A model linkage occurs between two learners Li and Lj if they are restricted to share some common parameters. A formal deﬁnition is given in Deﬁnition 2.

Model Linkage Selection for Cooperative Learning

1,i + β2x(3)

i Bernoulli{logistic(β1x(3)

1,i + β2x(3)

Figure 1: Three examples of well-speciﬁed and misspeciﬁed models. Left panel: wellspeciﬁed model. Middle panel: misspeciﬁed model. Right panel: misspeciﬁed model.

Deﬁnition 2 (Model linkage) Suppose that two learners Li and Lj are well-speciﬁed. Let θi,Si and θj,Sj be subvectors of θi and θj, indexed by the subsets Si {1, . . . , pi} and Sj {1, . . . , pj}, respectively. There exists a model linkage between Li and Lj if θi,Si = θj,Sj, also denoted by θSi,Sj for notational convenience. We also refer to θSi,Sj as the shared common parameter between Li and Lj.

To put the idea of model linkage into perspective, we consider an example in the context of an epidemiological study with two learners, illustrated in Figure 2. A similar example was considered in (Plummer, 2015). Suppose that both learners are well-speciﬁed. Learner L1 concerns estimating the human papilomavirus (HPV) prevalence with data D(1) = (y(1), x(1)), where y(1) and x(1) are both n-dimensional vectors recording the number of women infected with high-risk HPV and the population size in diﬀerent states, respectively. A binomial model is speciﬁed for y(1) i with a population size x(1) i and HPV prevalence parameter θ1,i, with i = 1, 2, . . . , n. Learner L2 models the relationship between the HPV prevalence θ1,i and cancer incidence in the form of Poisson regression. In L2, let D2 = (y(2), x(2)), where y(2) is the number of cancer incidents and x(2) is the women-years of follow up. Both L1 and L2 are restricted to share the same HPV prevalence parameters θ1,i, i = 1, 2, . . . , n, and thus there is a model linkage between L1 and L2.

i Poisson( 1,i 2x(2)

Figure 2: Learners L1 and L2 share common parameters θ1,i, i = 1, . . . , n.

Zhou, Ding, Tan, Tarokh

The deﬁnition and example above focus mainly on whether there is a model linkage between two learners. Such an idea can be generalized to a set of model linkages among a group of learners, which we refer to as model linkage graph in the following deﬁnition.

Deﬁnition 3 (Model linkage graph) Let G = (V, E) be an undirected model linkage graph, where V is a set of M vertices representing the M learners L1, . . . , LM, and E is an edge set encoding model linkages between pairs of learners.

In practice, the model linkage graph and linkages between pairs of learners are pre-speciﬁed by the user, usually based on domain-speciﬁc knowledge, before model ﬁtting. Therefore, the prediction performance of a learner may not be improved after incorporating information from the model linkage graph due to the potential misspeciﬁcation of models and model linkages. A concept in parallel with model misspeciﬁcation in the context of model linkage misspeciﬁcation is provided in Deﬁnition 4.

Deﬁnition 4 (Model linkage misspeciﬁcation) Suppose that there is a model linkage between two learners Li and Lj. In other words, a pair of subsets of parameters in two learners are restricted to be the same, say θi,Si = θj,Sj. A model linkage is misspeciﬁed if either Li or Lj is misspeciﬁed, or that θ i,Si = θ j,Sj. More generally, a model linkage graph G = (V, E) is misspeciﬁed if there exists a misspeciﬁed model linkage between a pair of learners.

Recall that we are interested in enhancing the predictive performance of L1 by integrating information from other learners L2, . . . , LM. Thus far, it is clear that a model linkage should exist between L1 and Lκ if the pair of learners shares common parameters. We now introduce the notion of information ﬂow in which learners that do not share common parameters directly with L1 can also enhance its predictive performance by sharing common parameters with learners that can, in turn, assist L1. Consider an example with six learners as illustrated in Figure 3. For simplicity, assume that the statistical models for all learners are all well-speciﬁed and that the true model linkage graph is known. Learners L2 and L4 share common parameters with L1, and thus there exist model linkages between L1 and L2, and L1 and L4. In addition, learner L3 has a shared common parameter with L2, and thus in principle, L3 can help enhance the predictive performance of L1 implicitly. This process of positive feedback transmission is an information ﬂow that enables implicit assistance to L1. In Figure 3, there are two such information ﬂows, namely L3 L2 L1 and L5 L4 L1. Thus, in the model linkage graph, paths exist from L3 and L5 to L1. Learner L6 does not share common parameters with learners related to L1, and thus there is no model linkage between L6 and the others. However, in practice, the true model linkage graph is not known a priori and needs to be speciﬁed. A misspeciﬁed model linkage graph may hamper the predictive performance of L1. The following section proposes a data-adaptive framework to identify an appropriate model linkage graph for prediction. The same idea can be used to improve parameter estimation accuracy.

Model Linkage Selection for Cooperative Learning

Figure 3: A model linkage graph with six learners. Learner L1 shares common parameters with L2 and L4, and is implicitly connected to L3 and L5 through L2 and L4.

3. General Bayesian Framework for Model Linkage Selection

3.1 Proposed Method

We propose a Bayesian framework to enhance the predictive performance of learner L1 by leveraging data from the other learners L2, . . . , LM according to a user-speciﬁed model linkage graph, G = (V, E). Suppose that there exists a model linkage between two learners, say learners Li and Lj. Some of the elements in θi and θj are restricted to be the same, say θi,Si = θj,Sj = θSi,Sj. From the Bayesian perspective, a natural way to integrate information is to compute the posterior distribution of the parameters θ = (θT i, Si, θT Si,Sj, θT j, Sj)T, where θi, Si and θj, Sj are obtained by removing the elements from θi and θj, indexed by the sets Si and Sj, respectively. Then, the posterior distribution of θ can be computed as

π(θ | D(i), D(j)) = p(i) θi (y(i)|θi, X(i))p(j) θj (y(j)|θj, X(j))π(θ)

p(y(i), y(j)|X(i), X(j))

where π(θ) is a prior distribution on θ, and p(y(i), y(j)|X(i), X(j)) is the marginal likelihood obtained by integrating the numerator of the above equation with respect to θ. More generally, one can compute the posterior distribution of the parameters according to the user-speciﬁed model linkage graph. Let C(G) be a set of indices recording the vertices that form a connected component with learner L1 in the user-speciﬁed graph G, including learner L1. That is, C(G) is a set containing all indices of learners that have at least a path to L1 and learner L1. For brevity, throughout the paper, let θ be a vector obtained by concatenating the entries of θκ for all κ C(G) without duplication. That is, the shared parameters between pairs of learners appear only once in θ. The posterior distribution for θ under a speciﬁc model linkage can then be computed by

π(θ | κ C(G)D(κ)) = π(θ) Q κ C(G) p(κ) θκ (y(κ)|θκ, X(κ))

p( κ C(G)y(κ)| κ C(G) X(κ))

Then, the posterior predictive distribution for a new observation in learner L1 can be computed as

p(ey| κ C(G) D(κ), ex) = Z

Θ p(1) θ1 (ey|θ1, ex)π(θ | κ C(G)D(κ))dθ, (1)

Zhou, Ding, Tan, Tarokh

Algorithm 1 Greedy Algorithm for Model Linkage Selection.

Input: User-speciﬁed graph G, data D(κ), parameter vector θκ, parametric distribution p(κ) θκ ( | θκ, X(κ)), prior distribution πκ( ) on parameters θκ, for κ = 1, . . . , M. 1: Initialize the index ℓ= 1, linkage set ζ(1) = {1}. 2: for ℓ= 2, . . . , M do 3: Let NG(ζ(ℓ 1)) denote the neighboring set of ζ(ℓ 1) within G, namely the set of learners in G\ζ(ℓ 1)

that have a model linkage with at least one learner in ζ(ℓ 1). 4: Calculate jopt = argmaxj NG(ζ(ℓ 1))p( κ ζ(ℓ 1)y(κ) | κ ζ(ℓ 1)X(κ), D(j)). 5: if p( κ ζ(ℓ 1)y(κ)| κ ζ(ℓ 1) X(κ)) p( κ ζ(ℓ 1)y(κ) | κ ζ(ℓ 1)X(κ), D(jopt)) break 6: Let ζ(ℓ) = {jopt} ζ(ℓ 1). 7: if ζ(ℓ) = C(G) break 8: end for 9: For a new predictor ex, let bp(ℓ) = p( | κ ζ(ℓ) D(κ), ex) and bπ(ℓ) = π( | κ ζ(ℓ) D(κ)). Output: Predictive distribution bp = bp(ℓ), posterior distribution bπ = bπ(ℓ), and model linkage graph b G.

where ex denotes the future observed covariates and Θ is the parameter space. Note that the posterior predictive distribution has integrated information from learners in C(G) through the parameters in θ. In practice, the true model linkage graph is not known a priori, and the user-speciﬁed model linkage graph G may be misspeciﬁed, as deﬁned in Deﬁnition 4. A joint model as in (1) with a misspeciﬁed model linkage graph G can lead to severely biased parameter estimation, which aﬀects the predictive quality of L1. Let |C(G)| be the cardinality of the set C(G). One way to address the aforementioned challenge is to exhaustively search all possible sets of model linkages over a graph with |C(G)| learners, and pick the set of model linkages that yields the largest marginal likelihood for L1 conditioned on other learners. However, the number of possible sets of model linkages grows exponentially with |C(G)|, and it is computationally infeasible to evaluate all possible subgraphs of G. To address the above challenge, we propose a greedy algorithm that is computationally feasible with theoretical guarantees. The proposed algorithm reduces the possible sets of model linkages from exponential in |C(G)| to quadratic in |C(G)|. The main idea is to successively search for the next learner that will improve the conditional marginal likelihood of the current group of learners, starting from a singleton set L1. The algorithm will terminate and output an estimated model linkage graph b G = (V, b E) when adding any further learner no longer increases the marginal likelihood of the maintained learners. The greedy algorithm is outlined in Algorithm 1. When the algorithm terminates at the ℓth iteration, L1 will integrate information from learners in ζ(ℓ), leading to the following posterior distribution of θ:

π(θ| κ ζ(ℓ) D(κ)) = π(θ) Q κ ζ(ℓ) p(κ) θk (y(κ)|θk, X(κ))

p( κ ζ(ℓ)y(κ)| κ ζ(ℓ) X(κ)) .

Recall that the above θ is the vector obtained by concatenating free parameters of θκ for all κ ζ(ℓ) without duplication. The posterior predictive distribution for a new observation ex in learner L1 obtained from Algorithm 1 is computed as

p(ey| κ ζ(ℓ) D(κ), ex) = Z

Θ p(1) θ1 (ey|θ1, ex)π(θ| κ ζ(ℓ) D(κ))dθ.

Model Linkage Selection for Cooperative Learning

Input Input graph

" = ($, &) *!, *", *#, *$

Iteration 1

Iteration 2 Output graph

E8Ehghn+e ABl4y Cm Gh Cq OR6V0x HRBIKOs WKDs Fe PHm Zd Oo1u1Gz7dt Gt Xl V5FGR+g Yn SIb Xa Amuk Et1EYUPa EX9Ibej Wfj1fgw Pmet Ja OYOUR/YHz9Amtonbc=</latexit> S2,S4

E8Ehghn+e ABl4y Cm Gh Cq OR6V0x HRBIKOs WKDs Fe PHm Zd Oo1u1Gz7dt Gt Xl V5FGR+g Yn SIb Xa Amuk Et1EYUPa EX9Ibej Wfj1fgw Pmet Ja OYOUR/YHz9Amtonbc=</latexit> S2,S4

Algorithm terminates

Figure 4: A schematic diagram illustrating Algorithm 1 with four learners. Each learner Lκ = (D(κ), Pκ) consists of the data and a user-speciﬁed class of parametric model. The algorithm starts with a user-speciﬁed graph that may or may not be misspeciﬁed. At each iteration, the algorithm selects a learner that maximizes the marginal likelihood of the current group of learners. The algorithm terminates when no such learners can be found. Finally, the algorithm outputs an estimated linkage graph b G, predictive distribution bp, and the posterior distribution bπ for θ.

We now illustrate Algorithm 1 with a toy example in Figure 4. In this example, there are four learners, and the user-speciﬁed model linkage forms a connected component among all learners, i.e., there is a path from one learner to another learner. Suppose that the goal is to assist learner L1. It can be seen from the user-speciﬁed model linkage graph G that L1 is directly connected to L2 and L3, and implicitly connected with L4 through L2. At the ﬁrst iteration of Algorithm 1, L1 computes its marginal likelihood p(D(1)), and conditional marginal likelihoods p(D(1) | D(2)) and p(D(1) | D(3)). If none of the conditional marginal likelihoods is larger than the marginal likelihood, Algorithm 1 will be terminated; otherwise, it includes the learner that produces the larger conditional marginal likelihood into the model linkage set. In this case, learner L2 is included in the model linkage set.

At the second iteration, the linkage set {L1, L2} is treated as one learner since they are linked together during the ﬁrst iteration. On the other hand, the candidate learners to establish a linkage are L3 and L4. In Figure 4, L4 is included in the linkage set following a similar argument as in the ﬁrst iteration. Finally, L3 is not included in the linkage set based on our criterion and the algorithm terminates. Consequently, L1 will obtain a parameter estimation and predictive model that are trained from the union of L1, L2, and L4.

Zhou, Ding, Tan, Tarokh

Remark 5 Computation and Communication: In the above example, we note that L1 does not need to access the raw data of the other learners during the ﬁrst iteration. In particular, L1 only requires L2 to share its likelihood function (e.g., in the form of an API) to calculate the (conditional) marginal likelihoods required for decision-making. Moreover, one could design more sophisticated protocols to increase the computation and communication eﬃciency in the proposed algorithm. For instance, when L1 decides whom to collaborate with, it only needs L2 and L3 to transmit their posterior distribution (in the form of, e.g., Monte Carlo samples) computed locally, to calculate the conditional marginal likelihood. Once a collaborator, say L2, is determined, the likelihood function of L2 will be sent to L1 to calculate the latest marginal likelihood and posterior.

We brieﬂy comment on the computation complexity of L1 in a general scenario. Suppose that at each iteration ℓ, each learner in NG(ζ(ℓ 1)) (following the notation in Algorithm 1) will send their likelihood functions to the linkage set ζ(ℓ 1). Let us consider the complexity of evaluating each candidate learner j in line 4 of Algorithm 1 as one unit. On the one hand, if learner L1 performs the computation alone, its total complexity units will be O(P ℓ=2,...,M(M ℓ+ 1)) = O(M2) for large M. On the other hand, if everyone in the current collaboration set shares the computation cost, the complexity of L1 will be reduced to O(P ℓ=2,...,M(M ℓ+ 1)/(ℓ 1)) = O(M log M). The second setting is reasonable, because learners in ζ(ℓ 1) are already collaborators of L1 in the joint modeling.

3.2 Theoretical Results

We ﬁrst provide deﬁnitions on asymptotic prediction eﬃciency and model linkage selection consistency in Section 3.2.1. Theoretical results for the proposed framework are presented in Section 3.2.2. Throughout the section, let G = (V, E) and b G = (V, b E) be the user-speciﬁed and estimated model linkage graphs, respectively. The user-speciﬁed G = (V, E) is usually speciﬁed based on prior scientiﬁc knowledge of practitioners, which may or may not be wellspeciﬁed. Let G = (V, E ) be the largest subgraph of G, whose underlying model linkages are all well-speciﬁed after the statistical model for each learner is speciﬁed. That is, for any pair of learners Li and Lj connected on G, their linkage is well-speciﬁed if and only if there exists an edge between them in G . Recall that more linkages imply a fewer number of free parameters in the joint model from G. Thus, intuitively speaking, G represents the most parsimonious parameterization of the underlying data-generating process. Ideally, the proposed algorithm can data-adaptively select b G = G , thus identifying the correct model linkages and ﬁltering out misspeciﬁed ones within G.

3.2.1 Definitions

Recall that the goal of the proposed framework is to enhance the predictive performance of L1 by borrowing information from other learners, L2, . . . , LM. To evaluate the predictive performance of L1, we consider a general class of proper scoring rules (Gneiting and Raftery, 2007; Parry et al., 2012; Shao et al., 2019). Examples of proper scoring functions are the logarithmic score s(p, y, x) = log p(y|x) and the Brier score s(p, y, x) = p(y|x) + 0.5 R

R p(ey|x)2dey (Parry, 2016).

Model Linkage Selection for Cooperative Learning

Deﬁnition 6 (Proper scoring function) Let p be the true data-generating density function. A scoring function s : (p, y, x) 7 s(p, y, x) is proper if for any conditional density function p, we have R

Y s(p, y, x)p (y|x)dy R

Y s(p , y, x)p (y|x)dy almost surely.

Let E be the expectation with respect to the data-generating distribution of y conditional on x, denoted by p . Let (ey, ex) be a new observation. The value E[s(p , ey, ex) | ex] is referred to as the oracle score, and E[s(bp, ey, ex) s(p , ey, ex) | ex] is a non-negative expected prediction loss since s( ) is a proper scoring rule. It can be seen that the non-negativity of the Kullback-Leibler divergence from p ( |x) to bp( |x), deﬁned as Dkl{p ( |x)||bp( |x)} = R

Y p (y|x)[log{p (y|x)} log{bp(y|x)}]dy, implies that the logarithmic score is proper. Recall that C(G) is a set of indices recording a set of learners that forms a connected component with L1 in a model linkage graph G. Next, we deﬁne linkage selection consistency.

Deﬁnition 7 (Linkage selection consistency) Given a pre-speciﬁed model linkage graph G, suppose that ψ : {D(κ) : κ C(G)} 7 b G is a linkage selection criterion in order to assist L1. Then, the linkage selection criterion ψ achieves linkage selection consistency if Pr( b E = E ) 1 as n .

Here, the probability is deﬁned over the observed data. In other words, a consistent linkage selection criterion ψ selects all the correct model linkages present in the user-speciﬁed linkage graphs. Next, we introduce the notion of asymptotic prediction eﬃciency. Suppose that C denotes a generic set of learners other than L1. Let bp C denote the predictive distribution of L1 conditional all the learners in C.

Deﬁnition 8 (Asymptotic prediction eﬃciency) Let bp be a constructed marginal predictive distribution for L1, and let s( ) be a proper scoring function. Then, bp is asymptotically prediction eﬃcient if

E[s(bp C(G ), ey, ex) s(p , ey, ex)]

E[s(bp, ey, ex) s(p , ey, ex)] (2)

converges in probability to one as the number of observations nκ for κ = 1, . . . , M. Here, the expectation is taken over a new observation (ey, ex). If bp is the posterior predictive distribution for L1 under a certain model linkage b G, then b G is also referred to as an asymptotically eﬃcient model linkage graph.

The ratio (2) contrasts the expected prediction loss of the constructed predictive density function bp and that of the predictive density function induced by G .

3.2.2 Theoretical Properties of Algorithm 1

We now proceed to study the theoretical properties of Algorithm 1. For technical convenience, we assume that learner L1 is well-speciﬁed. Note that, more generally, learner L1 may or may not be well-speciﬁed. In either case, the predictive performance can be evaluated by a proper scoring rule, e.g., the logarithmic rule. Throughout the theoretical studies, we consider the regime in which the number of learners M is ﬁxed and the number of observations nκ satiﬁes nκ/n cκ, where n = PM κ=1 nκ and cκ is a positive constant.

Zhou, Ding, Tan, Tarokh

Recall that G = (V, E) is a user-speciﬁed graph. Moreover, recall that G = (V, E ) is the true model linkage graph, deﬁned as the largest subgraph of G with correct model linkages between pairs of learners (given that each learner s model is speciﬁed). We denote the estimated model linkage graph from Algorithm 1 as b G = (V, b E).

Theorem 9 Under some regularity conditions in Appendix A and given a user-speciﬁed graph G, the estimated model linkage edge set b E from Algorithm 1 achieves linkage selection consistency, namely Pr( b E = E ) 1 as n .

Theorem 9 indicates that the estimated model linkage graph b E from Algorithm 1 is consistent in linkage selection, only selecting all the well-speciﬁed model linkages from G. Thus, the proposed approach is asymptotically robust against model linkage misspeciﬁcation. We will verify the ﬁnite sample performance of Algorithm 1 using numerical studies in Section 4, by considering various settings such as model misspeciﬁcation, model linkage misspeciﬁcation, and data contamination. The following theorem guarantees that the predictive distribution constructed from Algorithm 1 is asymptotically prediction eﬃcient.

Theorem 10 Let bp be the constructed predictive distribution for L1 via Algorithm 1 based on its selected model linkage graph b G. Let s( ) be a proper scoring function. Under the same conditions as in Theorem 9, bp is asymptotically prediction eﬃcient.

Note that Theorem 10 holds even when the statistical models for some learners are misspeciﬁed due to the deﬁnition of G . In other words, the proposed method yields a predictive distribution that is robust to model misspeciﬁcation and model linkage graph misspeciﬁcation.

4. Numerical Studies

4.1 Linear Regression Example

We consider a regression setting with six learners L1, . . . , L6. The goal is to enhance the predictive performance of L1 by incorporating information from other learners. The data for the six learners are generated as follows:

P7 j=1 β(κ) j x(κ) ij + ϵ(κ) i for κ = 1, 3, 4, P15 j=1 β(κ) j x(κ) ij + ϵ(κ) i for κ = 2, P7 j=1 β(κ) j (x(κ) ij + 5)2 + ϵ(κ) i for κ = 5, P15 j=8 β(κ) j x(κ) ij + ϵ(κ) i for κ = 6,

where all regression coeﬃcients are set to equal 0.3 except that β(4) j = 0.6 for j = 1, . . . , 7.

Each covariate x(κ) ij and the random noise are generated from a standard Gaussian distribution for all the learners. Since learners L1, L2, and L3 share common parameters, and L2 shares common parameters with L6, there are model linkages among learners L1, L2, L3, and L6. The true underlying model linkage graph G is illustrated on the right panel of Figure 5. Note that there is a model linkage between L2 and L3 since both share the same

Model Linkage Selection for Cooperative Learning

Figure 5: The user-speciﬁed model linkage graph G and the largest well-speciﬁed model linkage graph G within G are shown on the left and right panels, respectively.

common parameters with L1. For simplicity, we set the sample size for each learner to be n.

In practice, the user needs to specify a model linkage graph for the six learners and a statistical model for each learner. For the numerical studies, we specify correct statistical models for κ = {1, 2, 3, 4, 6}, and we misspecify L5 by assuming that the covariates are linearly related to the response y. We further restrict β(κ) j to be the same for κ = 1, . . . , 5

and j = 1, . . . , 7. Moreover, we restrict β(2) j = β(6) j to be the same for j = 8, . . . , 15. Hence, the model linkage graph is misspeciﬁed in the sense that we assume that there exist model linkages between L4 and {L1, L2, L3, L5}, and between L5 and {L1, L2, L3}. The userspeciﬁed graph G is illustrated on the left panel of Figure 5. We apply Algorithm 1 with the aforementioned user-speciﬁed graph and impose a multivariate Gaussian distribution, Np(0, 4Ip), as the prior distribution for the regression coeﬃcients for all learners.

We will compare the proposed Algorithm 1 to ﬁtting the model using data only from L1, and using the combined data from L1 and L4. Recall that the actual regression coeﬃcients in L4 are diﬀerent from that of L1, and thus combining data in L1 and L4 can lead to severely biased estimates of regression coeﬃcients.

To assess the model linkage selection accuracy, we calculate the selection accuracy as the proportion of times when the estimated model linkage graph b G from Algorithm 1 is equal to G . To evaluate the performance across diﬀerent models, we generate 50 test data for L1, and calculate the mean squared error between the predicted response and the actual response in the test data. Note that the mean squared error is a surrogate of the predictive logarithmic score under the Gaussian noise assumption. The predicted response is obtained by taking the mean of the posterior predictive distribution for each model. In addition, we calculate the length of the 95% prediction interval obtained from the posterior predictive distribution. The results for a range of sample sizes n {50, 75, . . . , 150}, averaged over 200 replications, are shown in Figure 6.

Zhou, Ding, Tan, Tarokh

From the left panel of Figure 6, the selection accuracy from Algorithm 1 approaches one as the sample size increases. This is in line with the selection consistency result of Theorem 9. Notably, Algorithm 1 yields a consistent model linkage graph even though the user-speciﬁed model linkage graph G is misspeciﬁed as shown in Figure 5. From the middle panel of Figure 6, we see that the proposed greedy algorithm yields the lowest prediction mean squared error across a range of sample sizes n. The results indicate that combining data from L1 and L4 imprudently can lead to higher prediction mean squared errors. Meanwhile, properly integrating data can improve the predictive performance of a single learner L1. Finally, the average length of the 95% prediction interval for the diﬀerent models is presented in the right panel of Figure 6. We see that the proposed algorithm yields the narrowest prediction interval across the range of n.

Figure 6: The results for modeling based on the proposed algorithm ( greedy algorithm ), modeling of the single agent L1 ( L1 ), and combined modeling of L1 and L4 ( L1, L4 ) in the regression experiment. The evaluation uses the selection accuracy (left), prediction mean squared error (middle), and length of the 95% prediction interval (right), averaged over 200 replications.

4.2 Logistic Regression Example

In this section, we illustrate that the proposed framework can be employed for classiﬁcation problems. We perform a numerical study with ﬁve learners L1, . . . , L5 to enhance the prediction accuracy of L1. For each learner, we generate the covariates independently from a uniform distribution on the closed interval [ 1, 1]. Then, the response variable is generated as the following:

logit(Pr(y(κ) i | x(κ) i )) =

( (x(κ) i )Tβ for κ = 1, 2, 3, 4, 0 for κ = 5,

where β = { 0.8, 0.5, 0.2, 0.1, 0.4, 0.7, 1.0, 1.3, 1.6}T. Learners L1, L2, L3, and L4 share common parameters β , and there are model linkages among L1, L2, L3, and L4. Learner L5 indicates that y(5) i follows a Bernoulli distribution with probability 0.5 and is independent of the covariates. Thus, there are no model linkages between L5 and the other learners. We again set the sample sizes for all learners to the same n for simplicity.

Model Linkage Selection for Cooperative Learning

Figure 7: The results for modeling based on the proposed algorithm ( greedy algorithm ), modeling of the single agent L1 ( L1 ), and combined modeling of L1 and L5 ( L1, L5 ) in the logistic regression experiment. The evaluation uses selection accuracy and prediction mean squared error, averaged over 200 replications.

We compare Algorithm 1 to models that are based on the data from L1, and combined data from L1 and L5. To evaluate the performance across diﬀerent methods, we calculate the selection accuracy and prediction error. For Algorithm 1, we specify an incorrect model linkage graph where all learners are connected and ﬁt the same logistic regression model for all learners. We set a multivariate normal distribution, N(0, 4I), as the prior distribution for the regression coeﬃcients for all learners. The results for a range of sample sizes n = {100, 150, . . . , 350}, averaged over 200 replications, are shown in Figure 7. From the left panel of Figure 7, we see that the selection accuracy converges to one as we increase the sample size n for each learner. That is, the greedy algorithm chooses not to include information from L5, even when the user-speciﬁed model linkage graph contains model linkages between L5 and the other learners. In addition, the proposed greedy algorithm yields the highest prediction error, whereas the modeling based on the joint data from L1 and L5 yields the lowest prediction error.

4.3 Logistic Regression with Data Contamination using Breast Cancer Data

Data contamination is an important issue when one decides whether to incorporate information from other learners. In practice, when several data sources are collected to have the same covariates, users tend to analyze the combined dataset to leverage more information. However, if some data sources are corrupted or contaminated, it is crucial to discriminate against them and avoid incorporating information from the contaminated learners. We illustrate that Algorithm 1 is robust against data contamination on some data sources. We consider the Wisconsin Breast Cancer database (Mangasarian et al., 1995). The data consist of a response variable recording whether a cancer tissue is benign or malignant with 9 covariates from a total of 699 subjects. We randomly choose 100 samples as the test data

Zhou, Ding, Tan, Tarokh

for evaluating prediction accuracy. Then, the data are randomly divided into 10 learners, in which each learner has n samples. Since the data from the 10 learners L1, L2, . . . , L10 are subsamples of the original data set, we assume that the regression coeﬃcients are the same across all learners. We then contaminate the data in L10 such that the binary response is ﬂipped. We apply Algorithm 1 with a misspeciﬁed model linkage graph by assuming that all learners are linked among each other. For each learner, we assume a logistic regression model with an intercept and 9 covariates. For simplicity, we impose the prior distribution N(0, 42) on all regression coeﬃcients. The prediction accuracy for the proposed method, the method that uses L1, and the method that uses the combined data L1 and L10, averaged over 200 replications, are reported in Table 1. From Table 1, we see that naively combining data from L1 and L10 will lead to a much lower prediction accuracy than the model using only data from L1. Our proposed method, on the other hand, chooses not to incorporate information from L10. Also, by combining data sources adaptively, our proposed method yields a prediction accuracy that is much higher than the model ﬁt using data from L1 alone, for both cases of n = {25, 50}.

Table 1: Performance results of the proposed approach, model of L1 alone, and joint model of L1 and L10, as evaluated by the prediction accuracy. The results are averaged over 200 replications, with n = {25, 50}.

number of samples proposed method L1 L1 and L10

n = 25 0.928 0.844 0.500 n = 50 0.952 0.894 0.498

4.4 Integrating Information on Kidney Cancer Data

In this section, we analyze the kidney cancer data considered in Maity et al. (2019). The kidney cancer data consist of 33 diﬀerent types of tumors, with a total of n = 8108 samples and up to p = 198 proteins for the diﬀerent types of tumors. Each tumor type may have a diﬀerent number of proteins. To study the association between patients survival time and proteins, Maity et al. (2019) ﬁt an accelerated failure time model with a log-normal assumption, from which they identiﬁed eight proteins that are most related to patients survival time. We now illustrate that integrating information from related cancer tumors using the proposed method can improve the prediction accuracy of patients survival time. For simplicity, we consider only patients that are not alive at the observed survival time and ﬁt a linear regression on the log-transformed survival time. We consider three types of tumors: (i) kidney renal clear cell carcinoma (KIRC), L1, (ii) kidney renal papillary cell carcinoma (KIRP), L2, and (iii) uterine corpus endometrial carcinoma (UCEC), L3, each of which has 146, 24, and 34 samples, respectively. Moreover, we pick up three proteins PCADHERIN , GAB2 , and HER3 p Y1298 as the covariates, following (Maity et al., 2019) . The three proteins have been well studied and are all well-known for kidney tumor growth and in-

Model Linkage Selection for Cooperative Learning

vasion (Blaschke et al., 2002; Duckworth et al., 2016; Akbani et al., 2014). In particular, the PCADHERIN has been considered as one of the most important proteins for kidney cancer (Maity et al., 2019). Both KIRC and KIRP originate from cells in the proximal convoluted tubules of the nephron (Chen et al., 2016). Thus, it is reasonable to assume that PCADHERIN has a similar eﬀect on the log-transformed survival time of patients with KIRC and KIRP. On the other hand, PCADHERIN is expected to have a diﬀerent eﬀect on UCEC since UCEC is a type of uterine cancer. Inspired by the above domain knowledge, we set up a model linkage graph in the following way. We assume that there are linkages among KIRP, KIRC, and UCEC by sharing the eﬀect of PCADHERIN on survival time across three tumor types.

We ﬁt a linear regression model with a log-transformed response, namely log(y(κ) i ) = P3 j=1 x(κ) ij β(κ) j +ϵ(κ) i , where the indices i, j, and κ denote the ith subject, the jth protein, and the κth tumor type. For simplicity, we assume that the random noise is normally distributed with diﬀerent variances to account for heterogeneity across diﬀerent tumor types, namely ϵ(κ) i N(0, σ2 κ). Let β(1) 1 , β(2) 1 , β(3) 1 be the regression coeﬃcient for PCADHERIN across the three cancer types, which are linked on the speciﬁed linkage graph.

We compare the prediction accuracy of the proposed greedy algorithm with the model using only data from L1 (namely KIRC). To that end, we sample 20 data points from L1 such that the sample sizes across three tumor types are approximately the same. We treat the remaining data points for test purposes. The prior distributions of the regression coeﬃcients are assumed to be standard normal, and the prior distributions for the three intercepts are assumed to follow a normal distribution with a mean of 10 and variance one. Moreover, we assume that σ2 κ Inv Gamma(2, 1) for κ = 1, 2, 3, where Inv Gamma denotes the inverse gamma distribution. We replicate the experiments above 100 times by sub-sampling 20 data observations from L1 for training the models.

The results indicate that Algorithm 1 connects L2 to L1 in 65% of the replications, and L2 to L3 20% of the replications. In this example, Algorithm 1 selects diﬀerent linkages across diﬀerent replications due to randomness in the samples. The average prediction mean squared error (with its standard error) of the proposed method is 1.69(0.027). In comparison, the values are 1.74(0.031) if using data only from L1, and 1.66(0.024) if using the joint data from L1 and L2 throughout the 100 replications. The results imply that collaborating with L2 increases the predictive performance of L1, and Algorithm 1 tends to favor such a collaboration.

4.5 Epidemiological Data Study

In this experiment, we revisit the example illustrated in Figure 2 of Section 2. Recall that L1 employs n Binomial models y(1) i Binomial(x(1) i , θ1,i), and learner L2 has poisson models y(2) i Poisson(θ1,iθ2x(2) i ), with i = 1, 2, . . . , n. The datasets for the two learners are denoted by D(1) = (y(1), x(1)) and D(2) = (y(2), x(2)). Note that the two learners are heterogeneous, and the parameters of L2 are not identiﬁable since θ1,iθ2 = cθ1,i c 1θ2 for any positive constant c. With a joint modeling of L1 and L2, the parameters become identiﬁable. We ﬁrst use the data in (Maucort-Boulch et al., 2008), where each learner has n = 13 data. For L1, the population size x(1) i ranges from 37 to 700 and the empirical infection rate y(1) i /x(1) i

Zhou, Ding, Tan, Tarokh

Table 2: Predictive performance of using a joint modeling of L1 and L2 ( Joint ), the proposed method ( Proposed ), and a single-agent modeling ( L1 or L2 alone ), evaluated for each learner. The evaluation is based on the expected log-predictive distribution, numerically computed from 1000 out-sample data and 100 replications. Standard errors are reported in the parentheses.

L1 (Binomial) L2 (Poisson) Joint Proposed L1 alone Joint Proposed L2 alone

Case 1 -3.76(0.006) -3.86(0.016) -4.05(0.009) -3.76(0.002) -3.76(0.002) -3.76(0.002) Case 2 -4.29(0.033) -4.38(0.041) -4.87(0.019) -4.03(0.011) -4.03(0.011) -4.03(0.010) Case 3 -8.37(0.069) -4.58(0.135) -4.04(0.016) -3.62(0.009) -3.63(0.009) -3.63(0.009) Case 4 -3.56(0.033) -3.56(0.033) -4.05(0.016) -3.26(0.008) -3.26(0.008) -3.31(0.008)

ranges from 0 to 0.2. For L2, the women-years follow up x(2) i ranges from 20000 to 550000, and the number of cancer incidences y(2) i ranges from 10 to 700. In this experiment, we divided each x(2) i by 1000 for computation eﬃciency, and note that such a scaling does not make an essential diﬀerence for parameter estimation. We used Uniform(0, 1) as the prior distribution for each θ1,i and Gamma(5, 1) for θ2. Algorithm 1 outputs that there is no linkage established between L1 and L2. To develop more insights into the nature of the above models and data size, we perform four cases of simulated data experiments. We use n = 13 and the identical prior distributions as in the real-data experiment. In Case 1, we generated simulated data

with x(1) = (200,

12 z }| { 1000, ..., 1000) and x(2) = (

13 z }| { 1000, ..., 1000), θ2 = 10, and θ1,i = 0.1 for i = 1, 2, . . . , n. It serves as a case with large data, which tends to produce an accurate

parameter estimation. Case 2 is similar to Case 1, except that x(1) = (20,

12 z }| { 100, ..., 100) and

13 z }| { 100, ..., 100). The latter two experimental cases are based on parameters estimated from the real data. Case 3 simulates the setting where there exists no underlying linkage between two learners. In particular, we simulate y(1) i from the Binomial model with the original population size x(1) i and the empirical rate θ1,i that is calculated from the original data observations. We simulate y(2) i from the Poisson model with the expectation that equals the original count observation. In contrast, Case 4 simulates the setting where there exists an underlying linkage between two learners. In particular, we estimate a joint model of the two learners, and use the posterior of θ1,i s and θ2 to generate y(1) in Binomial models and y(2) in Poisson models, with the original population sizes x(1) and x(2). To calculate the out-sample test performance of L1 or L2, whomever is being assisted, we generate 1000 test data based on the underlying data distributions. In particular, we numerically calculate E[log p(yf | xf)] where p is the predictive distribution and (yf, xf) denotes the test data. The results are shown in Table 2. Recall that in Cases 1, 2, and 4, there exists a model linkage between L1 and L2. In Case 3, there exists no underlying

Model Linkage Selection for Cooperative Learning

model linkage. From the results, the test performance of the proposed method can dataadaptively approach the better performance of two options, namely with and without a linkage through the parameters θ1,i. Also, though L2 alone is not identiﬁable, it can improve L1 s performance in collaborative cases.

5. Conclusion and Further Remarks

With the rapid growth of low-cost data collection devices and decentralized learners, data analysts are faced with an important challenge to integrate information across a set of learners with diverse data sources. However, prudently combining data sources and ﬁtting a joint model on all data sources can lead to biased estimates with a low prediction accuracy due to misspeciﬁed models or model linkages. We proposed a general approach to enhance the predictive performance of learner L1 by robustly integrating information across a set of learners. Since information are integrated through parameter linkages by sharing parameters, the data sources do not need to be transmitted across learners. As such, the proposed method is naturally compatible with decentralized learning that involves internetof-things requiring low-energy consumption (Da Xu et al., 2014), smart sensors with limited hardware capacities (Zhou et al., 2016), and decentralized networks with limited communication bandwidths (Xiao and Luo, 2005). We showed that the proposed method is linkage selection-consistent and asymptotically prediction-eﬃcient. The theoretical properties are established under the regime in which the number of learners M and the parameter dimensions are ﬁxed. An interesting future problem is to study the theoretical properties of the proposed framework under the regime in which the number of learners or the dimension of parameters is allowed to diverge with the sample size.

In the following, we brieﬂy describe the connection between the proposed framework and existing methods on data integration and distributed learning.

Data Integration. Integrating information from diﬀerent data sources has been studied in the context of data integration (see, for instance, Tang and Song, 2016; Li and Li, 2018, and the references therein). When there is a uniﬁed model across multiple data sources, it is possible to improve statistical eﬃciency through parameter sharing or ﬁtting one model using the combined data. For instance, Tang and Song (2016) employed a fused lasso approach to encourage the regression coeﬃcients for diﬀerent data sources to be similar. Li and Li (2018) developed an integrative linear discriminant analysis method by combining diﬀerent data sources and showed that the classiﬁcation accuracy could be improved compared with using a single data source. Some other work pre-speciﬁed constraints on latent variables to utilize heterogeneous data sources to address multiple parametric models. For example, in the study of gene regulatory networks, Jensen et al. (2007) proposed a Bayesian hierarchical model to integrate gene expression data, Ch IP binding data, and promoter sequence data to infer statistical relationships between transcription factors and genes. The uniqueness of our work compared with existing methods is that our proposed method allows for a set of learners with diverse learning objectives and distinct statistical models. Furthermore, the set of learners share information only through linked parameters of interest. Therefore, the method in Section 3 can be used to help any learner eﬃciently identify cooperative learners when prior information is lacking.

Zhou, Ding, Tan, Tarokh

Lunn et al. (2000) and Plummer (2015) also studied data integrations in the context of cut distributions, which can be seen as a probabilistic version of a two-step estimator. The main idea is to cut the propagation from uncertain models to precise models during joint learning to reduce biases propagated from incorrectly speciﬁed models (Lunn et al., 2009; Ogle et al., 2013). Jacob et al. (2017) proposed a predictive score principle for choosing the most appropriate joint modeling approach among the cut, full posterior, prior, and two-step approaches over a set of learners. It is possible to incorporate cut distribution into our proposed method to improve the ﬁnite-sample regime s performance. Nevertheless, the number of possible candidates exponentially increases with the number of learners, and thus the search space of each greedy selection step can be computationally prohibitive. Also, a systematic theoretical study of the cut distribution remains a challenging problem.

Distributed Learning. Data privacy has gained much attention recently, especially in distributed learning, where data curators do not wish to share the original data. This motivates some recent advances in distributed learning such as federated learning, where a central server sends the current global statistical model to a set of selected clients, and each client updates the model parameter with local data and returns the updates to the central server (Shokri and Shmatikov, 2015; Koneˇcn y et al., 2016; Diao et al., 2020b). The objective function for federated learning is typically formulated as

min θ F(θ) :=

κ=1 Fκ(θ), where Fκ(θ) = X

(x,y) D(κ) f(θ; x, y), (3)

where f denotes a global loss function, θ parameterizes a global model to learn, and D(κ) is a labeled dataset of the κth client. To optimize over θ, each client locally takes several iterations of (stochastic) gradient descent on the current parameter using its local data. Then the server takes a weighted average of the resulting parameters. Within a similar context, Jordan et al. (2019) proposed a communication-eﬃcient surrogate likelihood framework for solving distributed statistical estimation problems, which provably improves upon simple averaging schemes.

In the context of our proposed framework, consider a set of learners each holding a data source D(κ) and personal objective function fκ, and a set of optimization constraints C. A frequentist counterpart of our Bayesian approach is to minimize a proper scoring function (Dawid and Musio, 2015), e.g., the negative log-likelihood function, added with some form of regularization. The unknown parameters can be estimated by solving the following optimization problem

min θ1,...,θM F(θ) :=

κ=1 Fκ(θκ) + R(θ1, . . . , θM), subject to C, (4)

where Fκ(θ) = X

(x,y) D(κ) fκ(θ; x, y),

where R is a suitably chosen regularization function. The federated learning falls into the above formulation when the models are restricted to be the same among diﬀerent learners, namely fκ = f, θκ = θ for all κ. Without the constraint C and regularization R, (4) is equivalent to optimizing M individual objectives separately.

Model Linkage Selection for Cooperative Learning

Both the developed Bayesian formulation or the frequentist counterpart in (4) can also be regarded as forms of personalized federated learning, where decentralized and heterogeneous agents participate in joint learning with peer agents to boost local learning performance. A key characteristic of personalization is that each agent has a speciﬁc local task, loss, model, and data. As such, the order of establishing linkages and the selected set of collaborative learners may depend on whom to assist. This is reﬂected through the asymmetric nature of the proposed Algorithm 1 in ﬁnite-sample regimes. To illustrate this point, we provide a toy example that involves three learners, each with a Gaussian model y(κ) N(µκ, 1), κ = 1, 2, 3. It can be regarded as a regression with x(κ) = 1 and unknown parameters µκ s. Suppose that G is a fully connected graph, and the prior of µκ is N(0, 102). The observed data are y(1) = 2, y(2) = 0.3, y(3) = 2, respectively. One can verify that if L1 is the one to be assisted, Algorithm 1 will link it with L2 at the ﬁrst step; Then, L1 and L2 are not linked with L3, terminating the algorithm. On the other hand, if L2 will be assisted, the algorithm ﬁrst links it with L3 and then stops. Consequently, L1 and L2 will not establish a linkage in that case. The above indicates the asymmetric nature of collaboration: L1 being assisted by L2 doest not mean that L1 can assist L2 in the presence of other learners. Nevertheless, it can be veriﬁed that there is no such asymmetry in the case of two learners, meaning that L1 selecting L2 is equivalent to L2 selecting L1 in Algorithm 1.

Even in the context of vanilla federated learning, considerations of user misspeciﬁcation or adversarial attacks are relatively new (Bhagoji et al., 2019; Jere et al., 2020), and the proposed notion of prediction eﬃciency and selection consistency are readily applicable. Moreover, in a general learning scenario where parameters may lose interpretability, it is still possible to build linkages among learners to reduce the overall model complexity and generalization errors. For example, Diao et al. (2019, 2020a) recently showed that appropriately restricting deep neural network parameters can signiﬁcantly improve the performance of multi-modal image generation, compared with state-of-the-art methods that train an image generator separately from each data modality. From a theoretical perspective, the risk bound can be reduced by restricting the size of the function spaces through C. When each learner s predictive performance is not severely biased by other learners compared with its reduced variance, it is worth establishing a joint optimization in the form of (4).

Acknowledgments

The authors thank the action editor and two anonymous referees for their constructive comments. The authors thank Dr. Veera Baladandayuthapani for sharing the kidney cancer data in Maity et al. (2019). Jiaying Zhou was supported by the Army Research Laboratory and the Army Research Oﬃce under grant number W911NF-20-1-0222 and National Science Foundation under grant number ECCS-2038603. Jie Ding and Vahid Tarokh were supported by the Oﬃce of Naval Research under grant number N00014-18-1-2244. Kean Ming Tan was supported by National Science Foundation under grant numbers DMS-1949730 and DMS-2113346, and National Institutes of Health under grant number RF1-MH122833.

Zhou, Ding, Tan, Tarokh

Appendix A. Notation and Regularity Conditions

We start with introducing some regularity conditions needed for the theoretical development. These regularity conditions are generalizations of those in Walker (1969) from scalar to multidimensional vector. Suppose that each observation (yi, xi), 1 i n, is modeled via a joint distribution with a density function p(y, x | θ) with respect to a σ-ﬁnite measure µ. Moreover, x Rk is modeled using the density function h(x) that is independent of the parameter θ = (θ1, θ2, . . . , θp)T Rp. The joint density can thus be written as p(y, x | θ) = p(y | x, θ)h(x). Let the true conditional density function of y given x be p (y | x), and thus the true joint density of (y, x) can be written as p (y, x) = p (y | x)h (x). Note that h(x) and h (x) are not necessarily the same due to potential model misspeciﬁcation when modeling x. Let θ be an interior value in the parameter space Θ deﬁned as the minimizer of the Kullback-Leibler divergence between p(y | x, θ ) and p (y | x):

θ = argmax θ Rp

{Y,X} log p(y, x | θ)

p (y, x) p (y, x)dµ

= argmax θ Rp

{Y,X} [log p(y | x, θ) + log{h(x)}]p (y | x)h (x)dxdy

= argmax θ Rp

{X} h (x) Z

{Y} log p(y | x, θ)p (y | x)dydx

= argmax θ Rp

{Y} log p(y | x, θ)p (y | x)dy. (5)

Note that when the parametric model p(y | x, θ) is well-speciﬁed, p(y | x, θ ) = p (y | x). Let ℓ(θ) = Pn i=1 log p(yi, xi | θ) be the log-likelihood function for the n observations and let bθ = argmax θ Rp ℓ(θ) be the maximum likelihood estimator (MLE) of θ. Let In(θ)

be the observed Fisher information matrix with {In(θ)}i,j = 2ℓ(θ)

θi θj , and let I(θ) be the expected Fisher information matrix for a single observation (y, x) with {I(θ)}i,j =

E n 2 log p(y,x|θ)

o . Let {J(θ)}i,j = E{ log p(y,x|θ)

θi log p(y,x|θ)

θj }. We deﬁne T(n) = Θ{f(n)}, namely when n is large, there exist some ﬁxed constants c1 and c2 such that c1f(n) |T(n)| c2f(n). Some regularity conditions that are needed in the theoretical development are listed in the following.

(I) The parameter space Θ Rp is compact.

(II) The set of points {Y, X} = {(y, x) : p(y, x | θ) > 0} is independent of θ.

(III) If θα = θβ, then µ {(y, x) : p(y, x | θα) = p(y, x | θβ)} > 0.

(IV) For all (y, x) {Y, X} and δ > 0, we have |log p(y, x|θ) log p(y, x|θ )| < Hδ(y, x, θ ) as long as θ θ 2 < δ. Here, function Hδ has the property that lim δ 0 Hδ(y, x, θ ) = 0

and that lim δ 0

{Y,X} Hδ(y, x, θ )p (y, x)dµ = 0.

Model Linkage Selection for Cooperative Learning

(V) If Θ is not bounded, then for any θM Θ, and suﬃciently large , we have

log p(y, x|θ) log p(y, x|θM) < K (y, x, θM),

where θ 2 > and K has the property that

{Y,X} K (y, x, θM)p (y, x)dµ < 0.

(VI) The maximum likelihood estimator (MLE), denoted by bθ, exists, and the matrix In(bθ) is positive deﬁnite almost surely.

(VII) The log-likelihood function log p(y, x|θ) is twice continuously diﬀerentiable with respect to θ in some neighborhood of θ .

(VIII) The ﬁrst and second derivatives with respect to θ, and the integral of log p(y, x|θ), are exchangeable.

(IX) There exists a δ > 0 such that

2 log p(y, x | θ)

θi θj 2 log p(y, x | θ )

< Mδ(y, x, θ )

for any pair (i, j) and θ θ 2 < δ, where the function Mδ satisﬁes

{Y,X} Mδ(y, x, θ )p (y, x)dµ = 0.

(X) The prior density function is continuous at θ = θ and π(θ ) > 0.

(XI) If L1 has a well-speciﬁed model pθ1, and L2 has misspeciﬁed model pθ2, the model linkage (deﬁned in Deﬁnition 2) between L1 and L2 is misspeciﬁed (Deﬁnition 4).

Conditions (I) (V) ensure that when θ is an interior value of Θ, ℓ(θ) ℓ(θ ) is suﬃciently small for values of θ that are not in the vicinity of θ , with probability tending to one as n . We note that Condition (I) is a stronger than necessary assumption made to simplify the technical arguments. Alternatively, one may remove this assumption and show that the parameter estimate falls into a compact set with the probability increasing to one as the sample size increases. Conditions (VI) (IX) ensure that when θ = θ , n1/2(bθ θ ) has a limiting distribution N 0, I(θ ) 1J(θ )I(θ ) 1 . Conditions (VII) (IX) assume that In(θ) is smooth in the vicinity of θ . Condition (XI) is needed to guarantee that learners with misspeciﬁed models are not included in the joint model to enhance the statistical performance of L1.

Zhou, Ding, Tan, Tarokh

Appendix B. Assumptions on the Scoring Rule

Let y = {y1, y2, . . . , yn}T be an n-dimensional vector of the response and X = {x1, x2, . . . , xn}T

Rn k be the design matrix of covariates. In the following, we state some assumptions on the score function s deﬁned in Deﬁnition 6.

(A1) sup p,y,x E[s(p, y, x)] (0, ).

(A2) Assume that the model p(y|x, θ) is well-speciﬁed. Let p(ey|ex, y, X) be the Bayesian predictive distribution of ey given the new predictor ex and the data y, X. As n ,

E s{p(ey|ex, y, X), ey, ex} s{p(ey|ex, θ ), ey, ex} = Θ(n 1

The ﬁrst assumption indicates that the expected loss is bounded by some positive constant for any given density function and prediction point. The second assumption indicates that the diﬀerence between the prediction score and oracle score is in the order of n 1/2. Many commonly used scoring rules such as the Kullback-Leibler divergence and cross-entropy satisfy the aforementioned assumptions.

Appendix C. Proof of Theorems 9 10

We start with some technical lemmas that will be helpful for the proof of Theorems 9 10. Let L(θ|y, X) = p(y, X|θ) = Qn i=1 p(yi, xi|θ) be the likelihood function for the n observations and let ℓ(θ) be the log-likelihood function. Let p(y, X) = R p(y, X|θ)π(θ)dθ be the marginal likelihood of (y, X). The following lemma is a multidimensional counterpart of Walker (1969) that provides limiting properties for the maximum likelihood estimator and marginal likelihood.

Lemma 11 Assume that the regularity conditions in Appendix A hold. Let θ Rp be an interior value in the parameter space Θ, and assume that the data pair (yi, xi) Rk+1 has density p(yi, xi|θ) for i = 1, 2, . . . , n. Let bθ be the maximum likelihood estimator of θ. As n , the following results hold:

(i) Let N(δ) = {θ : θ θ 2 < δ} be a neighborhood of θ contained in Θ. For any positive δ, there exists a positive number k(δ) depending on δ such that

sup θ Θ\N(δ) n 1 {ℓ(θ) ℓ(θ )} < k(δ)

(ii) Let (bξ2) 1 = det|In(bθ)|, we have lim n n p n (bξ2) 1o = det|I(θ )|.

(iii) ℓ(θ ) ℓ(bθ) = Θ(1).

n p(y, X|bθ)bξ o 1 p(y, X) = (2π) p 2 π(θ ).

Model Linkage Selection for Cooperative Learning

Lemma 11(i) indicates that the diﬀerence between ℓ(θ) and ℓ(θ ) will be large when θ is not in the δ-neighborhood of θ . Lemma 11(ii) establishes that the determinant of the observed Fisher information matrix converges to the determinant of the expected Fisher information matrix. Lemma 11(iii) shows that the log-likelihood function evaluated at θ and bθ are at the same order. The proof of Lemma 11 is provided in Appendix D.1. Next, we present a key lemma that provides similar results as those of Lemma 11, but under the setting with non-identically distributed data that arise from two diﬀerent data sources from two diﬀerent learners. To this end, we deﬁne some notation. Without loss of generality, we consider two learners L1 and L2 with sample size n1 and n2, respectively. In particular, for each learner Lκ with κ = 1, 2, the user-speciﬁed density function of the data pair (y(κ) i , x(κ) i ) Rkκ+1 is denoted as p(κ) θκ (y(κ) i , x(κ) i |θκ). Let p κ(y(κ) i , x(κ) i ) be the true underlying density function. Similar to Lemma 11, let θ κ Rpκ be an interior value in the parameter space Θκ. As deﬁned in Deﬁnition 2, let θ1,S1 = θ2,S2 = θS1,S2 Rps be the shared parameter between L1 and L2. Moreover, let θC = (θT 1, S1, θT S1,S2, θT 2, S2)T Rp C be the vector obtained by concatenating entries of θ1 and θ2 without duplication, which has a dimension of p C = p1 + p2 ps. Let θ C Rp C be its corresponding interior value in ΘC. We denote eθ1 = (θT 1, S1, θT S1,S2)T and eθ2 = (θT 2, S2, θT S1,S2)T as the parameters for L1 and L2 after incorporating information from the model linkage between the two learners. We note that eθ1 and eθ2 are equivalent to θ1 and θ2, respectively, after some reordering of the elements. The notation are simply deﬁned to facilitate the proof of the theoretical results. Let y(κ) = (y1, y2, . . . , ynκ)T and X(κ) = (x(κ) 1 , x(κ) 2 , . . . , x(κ) nκ )T. Denote L(eθ1|y(1), X(1)), L(eθ2|y(2), X(2)), and L(θC|y(1), y(2), X(1), X(2)) as the likelihood functions of L1, L2, and (L1, L2), where L(eθ1|y(1), X(1)) = p(1) eθ1 (y(1), X(1)|eθ1), L(eθ2|y(2), X(2)) =

p(2) eθ2 (y(2), X(2)|eθ2), and

L(θC|y(1), y(2), X(1), X(2)) = p(C) θC (y(1), y(2), X(1), X(2)|θC)

= p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2).

Let ℓ1, ℓ2, and ℓC be their corresponding log-likelihood functions and let bθ1, bθ2, and bθC be the MLEs obtained from maximizing ℓ1, ℓ2, and ℓC, respectively. Let I(κ) nκ (eθκ) and I(κ)(eθκ) be the observed Fisher information matrix and expected Fisher information matrix on single observation, respectively, for κ = 1, 2. Let In(θC) be the matrix with element {In(θC)}i,j = 2ℓC(θC)

θC,i θC,j , where θC,i is the ith element of θC. Next, let π(eθ1), π(eθ2), and π(θC) be the prior

density of eθ1, eθ2, and θC, respectively. Let p(y(κ), X(κ)) = R p(κ) eθκ (y(κ), X(κ)|eθκ)π(eθκ)deθκ be

the marginal likelihood of (y(κ), X(κ)), for κ = 1, 2. Moreover, the marginal likelihood of (y(1), y(2), X(1), X(2)) is expressed as

p(y(1), y(2), X(1), X(2)) = Z p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC.

We now present the results in the following lemma.

Zhou, Ding, Tan, Tarokh

Lemma 12 Assume that the regularity conditions in Appendix A hold. Suppose that n = n1 + n2, and n1/n c1, n2/n c2, as n , where c1, c2 (0, 1). Then, the following results hold:

(i) Let NC(δ) = {θ : θ θ C 2 < δ} be a neighborhood of θ C contained in ΘC. For any δ > 0, there exists a positive number k(δ) depending on δ such that

sup θ ΘC\NC(δ) n 1 {ℓC(θ) ℓC(θ C)} < k(δ)

(ii) Let (bξ2 C) 1 = det |In(bθC)|, we have lim n n p C n (bξ2 C) 1o = Θ(1).

(iii) ℓC(θ C) ℓC(bθC) = Θ(1).

n p(C) θC (y(1), y(2), X(1), X(2)|bθC)bξC o 1 p(y(1), y(2), X(1), X(2)) = (2π) p C

Lemmas 13 14 concern the magnitudes of the joint marginal likelihood and marginal likelihood for each learner, under the case when the model linkage is well-speciﬁed or misspeciﬁed. Similar results were established in the context of change point detection (Du et al., 2016). Both lemmas will be used to prove that Algorithm 1 will select the correct model linkage in each iteration of the algorithm.

Lemma 13 Assume that the regularity conditions in Appendix A hold. Let θS1,S2 Rps be the shared parameter between L1 and L2 as deﬁned in Deﬁnition 2. Let the interior value of θS1,S2 in L1 and L2 be θ 1,S1 and θ 2,S2, respectively. If θ 1,S1 = θ 2,S2, and n1/n2 = Θ(1), as m = min {n1, n2} , we have

p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2))

Lemma 14 Under the same conditions as in Lemma 13, if θ 1,S1 = θ 2,S2, as min{n1, n2} , we have

p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2))

C.1 Proof of Theorem 9

Proof The main idea of the proof is to show that in each iteration, the proposed algorithm will integrate information from one additional learner based on the linkage set E , and avoid incorporating information from the learner that is not in the linkage set E . Consequently, the ﬁnal output of the algorithm b E = E . We start the proof at the ℓth iteration. At the ℓth iteration, let ζ(ℓ 1) be the set of learners that are already included in the joint model built from the previous ℓ 1 iterations. Let ζ(ℓ 1) C(G) {2, . . . , M}\ζ(ℓ 1) be a non-empty set of learners with at least a path to L1 as deﬁned in the user-speciﬁed model

Model Linkage Selection for Cooperative Learning

linkage graph G. Note that ζ(ℓ 1) C(G) is non-empty, otherwise, the algorithm would have been terminated at the (ℓ 1)th iteration. There are two cases: (i) there is at least a j ζ(ℓ 1) C(G) and an i ζ(ℓ 1) such that the model linkage between Lj and Li is well-speciﬁed; and (ii) there are no well-speciﬁed linkages between pairs of learners in ζ(ℓ 1) C(G) and ζ(ℓ 1).

Case (i): Suppose that there is a well-speciﬁed model linkage between Lj for a j ζ(ℓ 1) C(G) and Li for an i ζ(ℓ 1). By Lemma 13, Step 2 of Algorithm 1 will hold, and Lj will form a new set with ζ(ℓ 1), namely ζ(ℓ) = ζ(ℓ 1) {j}. More generally, if there are more than one learner in ζ(ℓ 1) C(G) that have well-speciﬁed linkages with learners in ζ(ℓ 1), the algorithm will

select jopt = argmaxj ζ(ℓ 1) C(G) p( κ ζ(ℓ 1)y(κ)|y(j)) and form a new set ζ(ℓ) = ζ(ℓ 1) {jopt}.

Case (ii): On the other hand, if the model linkages between Lj for j ζ(ℓ 1) C(G) and Li for i ζ(ℓ 1) are misspeciﬁed for all i and j, Lemma 14 ensures that Algorithm 1 will be terminated, and hence, the misspeciﬁed model linkages are not included into ζ(ℓ 1). As a result, the algorithm terminates when all well-speciﬁed linkages are included, and no misspeciﬁed linkages and misspeciﬁed models will be included, namely b E = E . This concludes the proof.

C.2 Proof of Theorem 10

Proof Recall that C(G) is the set of indices recording the vertices that form a connected component with learner L1 in the user-speciﬁed graph G, and G is the true underlying graph after the statistical models for all learners and G are speciﬁed. Let G = { G = (V, E) : E E, E = E } be a set of graphs that is a subgraph of G, but E = E . Let bp C(G) be the posterior predictive distribution constructed based on learners in C(G) as deﬁned in (1), and let Gi = (V, Ei) be a graph with edge set Ei. We consider the prediction eﬃciency ratio deﬁned as follows:

E{s(bp C(G ), ey, ex) s(p , ey, ex)}

Pr( b E = E )E{s(bp C(G ), ey, ex) s(p , ey, ex)} + P Gi G Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)} .

To prove Theorem 10, it suﬃces to show that

Pr( b E = E ) 1 (8)

and Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)}

E{s(bp C(G ), ey, ex) s(p , ey, ex)} 0 (9)

for all Gi G. Equation (8) is a direct consequence of the result in Theorem 1. In the remaining of the proof, we focus on establishing (9). To show (9), we consider two cases: (1) all learners in C(Gi) are well-speciﬁed and that the model linkages that form a connected component with L1 are well-speciﬁed; (2) there exist at least one learner with misspeciﬁed models or misspeciﬁed model linkages in C(Gi).

Zhou, Ding, Tan, Tarokh

For case (1), Assumption (A2) in Section B indicates that E{s(bp C(Gi), ey, ex) s(p , ey, ex)} =

Θ(n 1/2 C(Gi)), and E{s(bp C(G ), ey, ex) s(p , ey, ex)} = Θ(n 1/2 C(G )), where n C(Gi) and n C(G ) are the sample sizes of C(Gi) and C(G ), respectively. Combining the above and the result from Theorem 9 that Pr( b E = Ei) 0 leads to (9). For case (2), consider a learner Lw C(Gi) that has misspeciﬁed model linkages with some other learners in C(Gi), denoted as C(Gi)miss. By deﬁnition, C(Gi)miss C(Gi). If b E = Ei, then Gi must include the path between Lw and the learners in C(Gi)miss, which further indicates that

Pr( b E = Ei) Pr p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ))

p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) > 1 , (10)

where the right side of the equation can be interpreted as the probability of including Lw in Step 2 of Algorithm 1 to build a joint model. From the proof of Lemma 14, we have

p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ))

p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) = Θ{np /2 w exp( nw Cw)} (11)

for some positive ﬁnite constants Cw and p , where nw is the number of samples in Lw. By an application of the Markov s inequality and (11), we have

Pr p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ))

p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) > 1 (12)

E p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ))

p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ))

< np /2 w exp( 1

2nw Cw), (14)

implying Pr( b E = Ei) < np /2 w exp( 0.5nw Cw). Due to the existence of misspeciﬁed linkages in C(Gi), Assumption (A2) is no longer applicable to bound E{s(bp C(Gi), ey, ex) s(p , ey, ex)}. We instead employ Assumption (A1) to bound E{s(bp C(Gi), ey, ex) s(p , ey, ex)} by a ﬁnite constant C . Combining the above, we have

Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)} < C np /2 w exp( 1

2nw Cw). (15)

By Assumption (A2) in Section B and (15), we conclude that

Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)}

E{s(bp C(G ), ey, ex) s(p , ey, ex)} C n(1+p )/2 w exp( 1

2nw Cw) 0 (16)

as nw , where C is some positive ﬁnite constant. This concludes the proof.

Model Linkage Selection for Cooperative Learning

Appendix D. Proof of Lemmas 11 14

D.1 Proof of Lemma 11

Proof Proof of Lemma 11(i): Let θ Θ \ θ and let Zi = log {p(yi, xi|θ)/p(yi, xi|θ )} be the log ratio between two joint densities evaluated under θ and θ . Let E(Zi) be the expectation of Zi with respect to the density function p(yi, xi | θ ). We start with proving the following intermediate result that is helpful for the proof of Lemma 11(i):

n {ℓ(θ) ℓ(θ )} < c(θ) = 1, (17)

where c(θ) is a positive ﬁnite number that may depend on θ. We consider two cases when E(Zi) is ﬁnite and inﬁnite, respectively. When E(Zi) is ﬁnite, it follows from the Jensen s inequality that

E(Zi) < log E {exp(Zi)} = 0. (18)

Thus, by the law of large number, we have

i=1 Zi p E(Zi) < 0,

implying Pr{n 1 Pn i=1 Zi 0.5E(Zi)} 0. That is, Pr{n 1 Pn i=1 Zi 0.5E(Zi)} 1. Pick c(θ) = 0.5E(Zi), and (17) is satisﬁed. If E(Zi) is not ﬁnite, then we have E(Zi) = . Let Z i = max{Zi, k}, where k < 0. Then E(|Z i |) < . By the strong law of large number, we obtain 1 n

i=1 Z i a.s. E(Z i ). (19)

As k , by the monotone convergence theorem, we obtain E(Z i ) E(Zi) = . Moreover, (19) implies

lim sup n 1 n

i=1 Zi = lim n sup n m

i=1 Zi lim n 1 n

i=1 Z i = E(Z i ) = (20)

almost surely . Thus, we obtain lim sup n n 1 Pn i=1 Zi = almost surely. In other words,

n 1 Pn i=1 Zi a.s. . Any positive ﬁnite number c(θ) guarantees that (17) will hold. We now apply (17) to prove Lemma 11(i) holds in some open balls, where the union of these ﬁnite number of open balls covers Θ \ N(δ). Consider θj Θ and let Nj(δj) = {θ : θ θj 2 < δj} be a ball of size δj centered at θj. By the regularity condition (IV) in Appendix A, we have

sup θ Nj(δj)

1 n {ℓ(θ) ℓ(θ )} = sup θ Nj(δj)

n {ℓ(θ) ℓ(θj)} + 1

n {ℓ(θj) ℓ(θ )}

i=1 Hδ(yi, xi, θj) + 1

n {ℓ(θj) ℓ(θ )} . (21)

Zhou, Ding, Tan, Tarokh

By the weak law of large number, as n , we have

lim δ 0 1 n

i=1 Hδ(yi, xi, θj) p E{Hδ(yi, xi, θj)} = 0 (22)

Applying (17) with θ = θj, we obtain

n {ℓ(θj) ℓ(θ )} < cj

where cj is a positive constant that depends on θj. Applying (22) and (23), we get the upper bound in (21), which is shown below:

sup θ Nj(δj)

1 n{ℓ(θ) ℓ(θ )} < 1

n{ℓ(θj) ℓ(θ )} + 1

i=1 Hδ(yi, xi, θj) < cj

Then we get

sup θ Nj(δj)

1 n{ℓ(θ) ℓ(θ )} < cj

thereafter. If Θ is bounded, the compact set Θ \ N(δ) can be covered by a ﬁnite number of balls, namely N1(δ1), N2(δ2), . . . , Nm(δm), centered at θ1, θ2, . . . , θm, respectively. Then Lemma 11(i) holds by (24) with

k(δ) = min{c1, c2, . . . , cm}.

If Θ is unbounded, we apply the same argument to the bounded compact set Θ \ {N(δ) S( )}, where S( ) = {θ : θ 2 > } for suﬃciently large , from (V) in Appendix A, we have

1 n{ℓ(θ) ℓ(θ )} < 1

i=1 K (yi, xi, θ ). (25)

If E{K (yi, xi, θ )} > , by the weak law of large number, we get n 1 Pn i=1 K (yi, xi, θ ) p E{K (yi, xi, θ )} < 0, thus we have

1 n {ℓ(θ) ℓ(θ )} < E {K (yi, xi, θ )}

Under this case, Lemma 11(i) holds with k(δ) = min [c1, . . . , cm, E{K (yi, xi, θ )}] . If E{K (yi, xi, θ )} = , using the similar argument in (19) and (20), we can derive the conclusion that

i=1 K (yi, xi, θ ) a.s. E{K (yi, xi, θ )} = .

Model Linkage Selection for Cooperative Learning

Lemma 11(i) still holds with k(δ) = min {c1, . . . , cm} .

Proof of Lemma 11(ii): Since δ can be suﬃciently small and lim n Pr n bθ θ 2 < δ o =

1, we have n 1|{(In(bθ) In(θ )}i,j| < n 1 Pn i=1 Mδ(yi, xi, θ ) from (IX) in Appendix A, the limiting property

i=1 Mδ(yi, xi, θ ) = E {Mδ(yi, xi, θ )} 0

and the weak law of large number imply that

i,j = lim n 1 n {In(θ )}i,j p {I(θ )}i,j , (27)

Finally, by continuous mapping theorem, we obtain n pdet|In(bθ)| p det|I(θ )|.

Proof of Lemma 11(iii): Recall that bθ is the maximum likelihood estimator of θ. Thus, ℓ(bθ) = 0. By a second-order Taylor expansion, for θ Θ, there exists a t [0, 1] such that

ℓ(θ ) = ℓ(bθ) 1

2(θ bθ)T h In{bθ + t(θ bθ)} i (θ bθ). (28)

It suﬃces to show (θ bθ)T h In{bθ + t(θ bθ)} i (θ bθ) = Θ(1). Since bθ p θ , by (27) in

the proof of Lemma 11(ii), we have n 1[In{bθ + t(θ bθ)}]i,j p {I(θ )}i,j. Also, we have bθ = θ + Θ(n 1/2). Consequently, we obtain

(θ bθ)T h In{bθ + t(θ bθ)} i (θ bθ) p n n 1 2 (bθ θ ) o T I(θ ) n n 1 2 (bθ θ ) o

= Θ(1)I(θ )Θ(1)

Proof of Lemma 11(iv): Recall that p(y, X|θ) = exp{ℓ(θ)}. Thus, the marginal likelihood can be written as

p(y, X) = Z

θ Θ π(θ) exp{ℓ(θ)}dθ

= p(y, X|bθ) Z

θ Θ π(θ) exp{ℓ(θ) ℓ(bθ)}dθ

= p(y, X|bθ) Z

θ Θ\N(δ) π(θ) exp{ℓ(θ) ℓ(bθ)}dθ + p(y, X|bθ) Z

θ N(δ) π(θ) exp{ℓ(θ) ℓ(bθ)}dθ

:= I1 + I2. (30)

It suﬃces to show that n p(y, X|bθ)bξ o 1 I1 p 0 and n p(y, X|bθ)bξ o 1 I2 p (2π)p/2π(θ ), respectively.

Zhou, Ding, Tan, Tarokh

We ﬁrst show n p(y, X|bθ)bξ o 1 I1 p 0. Note that

I1 = p(y, X|bθ) exp n ℓ(θ ) ℓ(bθ) o Z

θ Θ\N(δ) π(θ) exp {ℓ(θ) ℓ(θ )} dθ,

We start the proof by conditioning on the event exp {ℓ(θ) ℓ(θ )} exp{ nk(δ)}. Thus, Z

θ Θ\N(δ) π(θ) exp {ℓ(θ) ℓ(θ )} dθ exp { nk(δ)} Z

θ Θ\N(δ) π(θ)dθ exp { nk(δ)} .

Multiplying I1 with n p(y, X|bθ)bξ o 1 and by (31), we obtain n p(y, X|bθ)bξ o 1 I1 exp{ℓ(θ )

ℓ(bθ)}bξ 1 exp { nk(δ)}. By Lemma 11(iii), we have ℓ(θ ) ℓ(bθ) = Θ(1). Moreover, by Lemma 11(ii) and Slutsky s Theorem, we obtain

lim n bξ 1 exp { nk(δ)} = lim n n p/2(bξ 2)1/2np/2 exp { nk(δ)}

= (det|I(θ )|)1/2 lim n [exp { nk(δ) + p log(n)/2}]

Finally, by Lemma 11(i), the event exp {ℓ(θ) ℓ(θ )} exp{ nk(δ)} holds with probability one as n for any θ Θ \ N(δ). Combining the above, we have n p(y, X|bθ)bξ o 1 I1 p 0.

Next, we show that n p(y, X|bθ)bξ o 1 I2 p (2π)p/2π(θ ). By a second-order Taylor expansion, we have

ℓ(θ) = ℓ(bθ) 1

2(θ bθ)TIn{bθ + t(θ bθ)}(θ bθ)

2(θ bθ)TIn(bθ)(θ bθ) 1

2(θ bθ)T h In{bθ + t(θ bθ)} In(bθ) i (θ bθ).

For notational simplicity, let Rn = 0.5(θ bθ)T h In{bθ + t(θ bθ)} In(bθ) i (θ bθ). By (33), I2 can be rewritten as

I2 =p(y, X|bθ) Z

θ N(δ) π(θ) exp n ℓ(θ) ℓ(bθ) o dθ

=p(y, X|bθ) Z

θ N(δ) π(θ) exp 1

2(θ bθ)TIn(bθ)(θ bθ) Rn

Under (X) in Appendix A, given any ϵ > 0, let ϵ = 2ϵ /π(θ ), since the prior function π(θ) is continuous around θ , thus, we can choose a δ such that

|π(θ) π(θ )| < ϵ < ϵπ(θ ) if θ N(δ). (35)

Model Linkage Selection for Cooperative Learning

θ N(δ) exp 1

2(θ bθ)TIn(bθ)(θ bθ) Rn

By (35), we obtain

(1 ϵ)π(θ )I3 < n p(y, X|bθ) o 1 I2 < (1 + ϵ)π(θ )I3. (37)

It suﬃces to obtain lower and upper bounds for bξ 1I3. We divide the derivation of upper and lower bounds for bξ 1I3 into two parts, we ﬁrst show that bξ 1 R

θ N(δ) exp n 1

2(θ bθ)TIn(bθ)(θ bθ) o dθ (2π)p/2, and prove later that |Rn| < ϵ for some proper δ. Recall that bξ 1 = det |In(bθ)|1/2, and by the regularity condition (VI) in Appendix A, {In(bθ)} 1 exists since In(bθ) is non-singular. We have

θ N(δ) exp 1

2(θ bθ)TIn(bθ)(θ bθ) dθ

2(θ bθ)TIn(bθ)(θ bθ) o

(2π)pdet|{In(bθ)} 1| dθ. (38)

Because the part inside the integral is the density function of multivariate normal distribution θ N(bθ, {In(bθ)} 1), we have that (38) will be less than

(2π)p Z exp n 1

2(θ bθ)TIn(bθ)(θ bθ) o

(2π)pdet|{In(bθ)} 1| dθ = (2π) p 2 . (39)

Moreover, the symmetry property of In(bθ) indicates that there exists a p p matrix V such that In(bθ) = V TV . Change variable θ = V θ, (38) will be greater than

2(θ V bθ)T(θ V bθ) o

(2π)pdet|{In(bθ)} 1| det|V 1|dθ

2(θ V bθ)T(θ V bθ) o

where N (δ ) = {θ : θ V θ 2 < δ }, and δ is determined by δ = min θ: θ θ 2=δ V θ

V θ 2. We show that δ = min θ: θ θ 2=δ V θ V θ 2 . Let δ(θ) = V θ V θ 2

under the condition that θ θ 2 = δ, if δ(θ) < , then V θ V θ and θ θ are both elementwise ﬁnite, and their lengths are both ﬁnite number p, we conclude from above that det |(V θ V θ )(θ θ )T| < . However, det |V | = (det |In(bθ)|) 1 2 from Lemma 11(ii), thus det |(V θ V θ )(θ θ )T| = det |V | det |(θ θ )(θ θ )T| , which is a contradiction. All the above indicates that δ(θ) for any θ θ 2 = δ, which

Zhou, Ding, Tan, Tarokh

concludes δ = min θ: θ θ 2=δ V θ V θ 2 . Therefore, we have (40) converging to (2π)p/2

in probability. The conclusion bξ 1 R

θ N(δ) exp n 1

2(θ bθ)TIn(bθ)(θ bθ) o dθ (2π)p/2 is then derived. We next derive the upper bound of |Rn|. By the triangle inequality, we have

1 n|Rn| = 1

1 2(θ bθ)T[In{bθ + t(θ bθ)} In(bθ)](θ bθ)

(θ bθ)T[In{bθ + t(θ bθ)} In(θ )](θ bθ)

(θ bθ)T n In(θ ) In(bθ) o (θ bθ) .

To further derive the upper bound of (41), consider the length p vector b = (b1, b2, . . . , bp)T

and p p matrix A with {A}i,j = ai,j, i, j {1, 2, . . . , p}. Let g(A, b) = tr(b TAb), this function can be formalized as g(A, b) = tr(bb TA) = Pp j=1 Pp i=1 bibjai,j. Let b = θ bθ and

A = In{bθ+t(θ bθ)} In(θ ), we have b 2 = (θ θ )+(θ bθ) 2 θ θ 2+ θ bθ 2 δ in probability, because θ N(δ) and bθ θ = Θ(n 1/2). Thus, |bi| δ for i = 1, 2, . . . , p. We can get the inequality that |g(A, b)| δ2 Pp j=1 Pp i=1 |ai,j|. From triangle inequality we

have bθ +t(θ bθ) θ 2 t θ θ 2 +(1 t) bθ θ 2 p tδ δ, thus bθ +t(θ bθ) N(δ) in probability. From (IX) in Appendix A, we have |ai,j| = |[In{bθ + t(θ bθ)} In(θ )]i,j| Pn i=1 Mδ(yi, xi, θ ). Thus we have b TAb δ2 Pp j=1 Pp k=1 Pn i=1 Mδ(yi, xi, θ ). Note that

this inequality also holds when A = In(θ ) In(bθ), since bθ θ = Θ(n 1/2), which indicates bθ N(δ) almost surely, thus (IX) in Appendix A can be applied and get the same conclusion as well. Based on (41) and the weak law of large number, n 1|Rn| is less than

i=1 Mδ(yi, xi, θ ) p p2δ2E {Mδ(yi, xi, θ )} when n . (42)

Under (IX) in Appendix A, lim δ 0 E {Mδ(yi, xi, θ )} = 0, given the condition that δ is chosen

to make (35) hold, then for any ϵ > 0, if δ is also chosen such that

E {Mδ(yi, xi, θ )} < ϵ 2np2δ2 .

sup θ N(δ) |Rn| < ϵ

Hence, we get the conclusion

lim n Pr n (2π) p 2 exp( ϵ) < bξ 1I3 < (2π) p 2 exp(ϵ) o = 1. (44)

Since δ can be chosen so that (44) and (37) both hold for arbitrary small ϵ, we deduce the result

lim n Pr (2π) p 2 π(θ )(1 ϵ) exp( ϵ) < n p(y, X|bθ)bξ o 1 I2 < (2π) p 2 π(θ )(1 + ϵ) exp(ϵ) = 1,

Model Linkage Selection for Cooperative Learning

which leads to n p(y, X|bθ)bξ o 1 I2 p (2π)p/2π(θ ) when n . Then we get the conclusion

n (y, X|bθ)bξ o 1 p(y, X) = (2π) p 2 π(θ ). (45)

D.2 Proof of Lemma 12

Proof We start with the proof of Lemma 12(i). Recall that ΘC is the parameter space of θC, and eθ1 = (θT 1, S1, θT S1,S2)T and eθ2 = (θT 2, S2, θT S1,S2)T as the parameters for L1 and L2 after incorporating information from the model linkage between the two learners. Let Nκ(δκ) = {θ : θ θ κ 2 < δκ} be the neighborhood of θ κ, and let Θ1,δ1 = {θC : eθ1 N1(δ1), θ2, S2 = θ 2, S2} and Θ2,δ2 = {θC : eθ2 N2(δ2), θ1, S1 = θ 1, S1}. The parameter space ΘC can be divided into Θ1,δ1, Θ2,δ2, and ΘC \ (Θ1,δ1 Θ2,δ2). For a ﬁxed δ, there exist δ1 and δ2, such that Θκ,δκ NC(δ), κ = 1, 2. Lemma 11(i) indicates that

sup eθκ Θκ\Nκ(δκ) n 1 κ n ℓκ(eθκ) ℓκ(θ κ) o < kκ(δκ) for some positive functions kκ(δκ), κ = 1, 2.

Let k(δ) = c1 k1(δ1) + c2 k2(δ2). Then,

sup θC ΘC\NC(δ) n 1 {ℓC(θC) ℓC(θ C)} (46)

sup θC ΘC\(Θ1,δ1 Θ2,δ2) n 1 {ℓC(θC) ℓC(θ C)} (47)

sup eθ1 Θ1\N1(δ1) n 1 n ℓ1(eθ1) ℓ1(θ 1) o + sup eθ2 Θ2\N2(δ2) n 1 n ℓ2(eθ2) ℓ2(θ 2) o (48)

= c1 sup eθ1 Θ1\N1(δ1) n 1 1 n ℓ1(eθ1) ℓ1(θ 1) o + c2 sup eθ2 Θ2\N2(δ2) n 1 2 n ℓ2(eθ2) ℓ2(θ 2) o (49)

c1 { k1(δ1)} + c2 { k2(δ2)} (50)

= k(δ), (51)

where (48) holds by the fact that ℓC(θC) ℓC(θ C) = {ℓ1(eθ1) ℓ1(θ 1)} + {ℓ2(eθ2) ℓ2(θ 2)} and (50) holds by applications of Lemma 11(i). Therefore, Lemma 12(i) is proved. For Lemma 12(ii), we prove that n 1In(bθC) is positive deﬁnite by showing that n 1In(bθC) can be rewritten as a sum of two matrices, namely n 1In(bθC) = M1 + M2, where M1, M2 are positive deﬁnite matrices. The proof can then be concluded by the fact that all elements in M1 and M2 are bounded by some ﬁnite constants. Recall that θC = (θT 1, S1, θT S1,S2, θT 2, S2)T Rp C. Let bθC,κ be the MLE of eθκ obtained by

maximizing ℓC, κ = 1, 2. Let λκ be the smallest eigenvalue in n 1 κ I(κ) nκ (bθC,κ). By Condition VI in Appendix A (VI, the positive deﬁnite property of I(κ) nκ (bθC,κ) indicates that λκ > 0, κ = 1, 2. In the following proof, we will focus on constructing M1. Construction of M2 is similar as M1 and is omitted. We now deﬁne the elements in M1. For i, j p1, let

{M1}i,j = c1 1 n1 {I(1) n1 (bθC,1)}i,j c1

Zhou, Ding, Tan, Tarokh

for i = j and i p1 ps, and let

{M1}i,j = c1 1 n1 {I(1) n1 (bθC,1)}i,j (53)

for the other elements. When p1 + 1 i p1 + p2 ps or p1 + 1 j p1 + p2 ps, all the elements are zeros except when i = j, we set

{M1}i,j = c2

By construction, M1 is a 2 2 block diagonal matrix, where each block matrix is positive deﬁnite. The upper left diagonal block is the diﬀerence between c1n 1 1 {I(1) n1 (bθC,1)} and a diagonal matrix, with the ﬁrst p1 ps diagonal entries are set to equal 0.5c1λ1. The bottom right block matrix is a diagonal matrix with all diagonal entries equaling 0.5c2λ2. Thus M1 is positive deﬁnite. The matrix M2 is constructed in a similar fashion and can be shown to be positive deﬁnite. We now proceed to prove the limiting property of n 1 1 I(1) n1 (bθC,1). We have

1 n1 |{I(1) n1 (bθC,1) I(1) n1 (θ 1)}i,j| < 1

i=1 Mδ1(y(1) i , x(1) i , θ 1) (55)

from (IX) in Appendix A. Moreover, the limiting property

lim n1 1 n1

i=1 Mδ1(y(1) i , x(1) i , θ 1) = E n Mδ1(y(1) i , x(1) i , θ 1) o 0 (56)

implies that

lim n1 1 n1

n I(1) n1 (bθC,1) o

i,j = lim n1 1 n1

n I(1) n1 (θ 1) o

i,j = n I(1)(θ 1) o

Equations (52) (57) imply that M1 and M2 are positive deﬁnite and elementwise ﬁnite. Combining this with the continuous mapping theorem, we obtain the conclusion that det |M1 + M2| = det |n 1In(bθC)| = Θ(1), which concludes Lemma 12(ii). Lemma 12(iii) is implied by Lemma 11(iii). The proof of Lemma 12(iv) is similar to the proof of Lemma 11(iv), and is omitted.

D.3 Proof of Lemma 13

Proof Recall from Lemma 11(iv) and Lemma 12(iv) that (bξ2 1) 1 = det|I(1) n1 (bθ1)|, (bξ2 2) 1 = det|I(2) n2 (bθ2)|, and (bξ2 C) 1 = det|In(bθC)| with n = n1 + n2. Let h1(X(1)) and h2(X(2)) be the density function of X(1) and X(2), respectively. Since the covariates between two learners

Model Linkage Selection for Cooperative Learning

are independent, as m , we have

p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2)) = p(y(1), y(2), X(1), X(2))/h1(X(1))h2(X(2)) p(y(1), X(1))/h1(X(1)) p(y(2), X(2))/h2(X(2))

= p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2))

p(C) θC (y(1), y(2), X(1), X(2))|bθC)

p(1) eθ1 (y(1), X(1)|bθ1)p(2) eθ2 (y(2), X(2)|bθ2) bξC bξ1bξ2

where the third equality holds by an application of Lemma 11(iv) and Lemma 12(iv). Since the model linkages are well-speciﬁed, the joint density can be factored as

p(C) θC (y(1), y(2), X(1), X(2)|θ C) = p(1) eθ1 (y(1), X(1)|θ 1)p(2) eθ2 (y(2), X(2)|θ 2).

Thus, we have

p(C) θC (y(1), y(2), X(1), X(2)|bθC)

p(1) eθ1 (y(1), X(1)|bθ1)p(2) eθ2 (y(2), X(2)|bθ2)

= p(C) θC (y(1), y(2), X(1), X(2)|bθC) n p(C) θC (y(1), y(2), X(1), X(2)|θ C) o 1

p(1) eθ1 (y(1), X(1)|bθ1) n p(1) eθ1 (y(1), X(1)|θ 1) o 1 p(2) eθ2 (y(2), X(2)|bθ2) n p(2) eθ2 (y(2), X(2)|θ 2) o 1

= Θ(1), (59)

where the second equality holds by Lemma 11(iii). Also, Lemma 11(ii) and Lemma 12(ii) imply that

bξC bξ1bξ2 = bξC(n1 + n2) p1+p2 ps

2 2 (n1 + n2) p1+p2 ps

2 2 (n1 + n2) p1+p2 ps

Substituting (59) and (60) into (58) concludes the proof.

D.4 Proof of Lemma 14

Proof From the proof of Lemma 13, we have

p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2)) = p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2)),

and it remains to show that

p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2))

Zhou, Ding, Tan, Tarokh

By deﬁnition, p(y(1), y(2), X(1), X(2)) = R

ΘC p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC. Choose a δ > 0 such that N1(δ) and N2(δ) are non-overlapping neighborhoods of θ 1 and θ 2, respectively. We split p(y(1), y(2), X(1), X(2)) into three integrals, I1, I2, and I3, taken on sets Θ1,δ, Θ2,δ, and ΘC \ (Θ1,δ Θ2,δ), where Θ1,δ = {θC : eθ1 N1(δ)} and Θ2,δ = {θC : eθ2 N2(δ)}. Note that eθ1 θC and eθ2 θC For the ﬁrst integral, we have

Θ1,δ p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC

= p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ2(θ 2) ℓ2(bθ2) o

Θ1,δ bξ 1 2 exp n ℓ2(eθ2) ℓ2(θ 2) o p(1) eθ1 (y(1), X(1)|eθ1)π(θC)dθC.

Since eθ2 / N1(δ) in (62), according to Lemma 11(i), the integral on the right hand side in (62) is less than

Θ1,δ bξ 1 2 exp n ℓ2(eθ2) ℓ2(θ 2) o p(1) eθ1 (y(1), X(1)|eθ1)π(θC)dθC

bξ 1 2 exp{ n2k2(δ)} Z

Θ1,δ p(1) eθ1 (y(1), X(1)|eθ1)π(θC)dθC

bξ 1 2 exp{ n2k2(δ)} Z

Θ1 p(1) eθ1 (y(1), X(1)|eθ1)π(eθ1)deθ1

= {np2 2 bξ2 2} 1

2 2 exp{ n2k2(δ)}p(y(1), X(1))

with probability tending to 1 as n2 . From Lemmas 11(ii) (iv), as n2 , we have

{np2 2 bξ2 2} 1

2 p (det |I(2)(θ 2)|) 1 2 ;

exp n ℓ2(θ 2) ℓ2(bθ2) o Θ(1);

{p(2) eθ2 (y(2), X(2)|bθ2)bξ2} 1p(y(2), X(2)) p (2π) p2

It follows that

I1 p(y(1), X(1))p(y(2), X(2)) = Θ(n

2 2 exp{ n2k2(δ)}) p 0. (65)

Using a similar argument for I2, as n1 , we have

I2 p(y(1), X(1))p(y(2), X(2)) = Θ(n

2 1 exp{ n1k1(δ)}) p 0. (66)

Model Linkage Selection for Cooperative Learning

For the integral I3, we apply a similar argument as in the proof of I1 and I2. Speciﬁcally,

ΘC\{Θ1,δ Θ2,δ} p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC

= p(1) eθ1 (y(1), X(1)|bθ1)bξ1p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ1(θ 1) ℓ1(bθ1) o exp n ℓ2(θ 2) ℓ2(bθ2) o

ΘC\{Θ1,δ Θ2,δ} bξ 1 1 bξ 1 2 exp n ℓ1(eθ1) ℓ1(θ 1) o exp n ℓ2(eθ2) ℓ2(θ 2) o π(θC)dθC,

p(1) eθ1 (y(1), X(1)|bθ1)bξ1p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ1(θ 1) ℓ1(bθ1) o exp n ℓ2(θ 2) ℓ2(bθ2) o

bξ 1 1 bξ 1 2

ΘC exp n ℓ1(eθ1) ℓ1(θ 1) o exp n ℓ2(eθ2) ℓ2(θ 2) o π(θC)dθC.

Since the region ΘC \ {Θ1,δ Θ2,δ} contains neither the neighborhood of θ 1 nor the neighborhood of θ 2, by an application of Lemma 11(i), we have

I3 p(1) eθ1 (y(1), X(1)|bθ1)bξ1p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ1(θ 1) ℓ1(bθ1) o exp n ℓ2(θ 2) ℓ2(bθ2) o

bξ 1 1 bξ 1 2 exp { n1k1(δ)} exp { n2k2(δ)} .

Using an argument similar to that of I1 and Lemmas 12(ii) (iv), we have

I3 p(y(1), X(1))p(y(2), X(2)) = Θ(n

2 2 exp{ n1k1(δ) n2k2(δ)}) p 0. (67)

Combining the above, we have

p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2)) = I1 + I2 + I3 p(y(1), X(1))p(y(2), X(2))

This concludes the proof of Lemma 14.

Rehan Akbani, Kwok shing Ng, Henrica Werner, Maria Shahmoradgoli, Fan Zhang, Zhenlin Ju, Wenbin Liu, Ji-Yeon Yang, Kosuke Yoshihara, Jun Li, Shiyun Ling, Elena Seviour, Prahlad Ram, John Minna, Lixia Diao, Pan Tong, John Heymach, Steven Hill, Frank Dondelinger, and Gordon Mills. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature Communication, 5(1):1 15, 05 2014. doi: 10.17863/CAM.25567.

Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, pages 634 643. PMLR, 2019.

Marta Blangiardo, Anna Hansell, and Sylvia Richardson. A Bayesian model of time activity data to investigate health eﬀect of air pollution in time series studies. Atmospheric Environment, 45(2):379 386, 2011.

Zhou, Ding, Tan, Tarokh

Sabine Blaschke, Claudia M uller, Jasmina Markovic-Lipkovski, Sabine Puch, Nicolai Miosge, Volker Becker, Gerhard Mueller, and Gerd Klein. Expression of cadherin-8 in renal cell carcinoma and fetal kidney. International Journal of Cancer, 101(4):327 334, 10 2002. doi: 10.1002/ijc.10623.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2011.

Guangzhen Cao and Ya Qiu Jin. A hybrid algorithm of the BP-ANN/GA for classiﬁcation of urban terrain surfaces with fused data of Landsat ETM+ and ERS-2 SAR. International Journal of Remote Sensing, 28(2):293 305, 2007.

GC Cawley and NLC Talbot. On over-ﬁtting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079 2107, July 2010.

Fengju Chen, Yiqun Zhang, Yasin S enbabao glu, Giovanni Ciriello, Lixing Yang, Ed Reznik, Brian Shuch, Goran Micevic, Guillermo De Velasco, and Eve Shinbrot. Multilevel genomics-based taxonomy of renal cell carcinoma. Cell Reports, 14(10):2476 2489, 2016.

Kevin A. Clarke. The phantom menace: omitted variable bias in econometric research. Conﬂict Management and Peace Science, 22(4):341 352, 2005. doi: 10.1080/ 07388940500339183.

Li Da Xu, Wu He, and Shancang Li. Internet of things in industries: a survey. IEEE Transactions on Industrial Informatics, 10(4):2233 2243, 2014.

Patrick Danaher, Pei Wang, and Daniela M Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B, 76(2):373 397, 2014.

A Philip Dawid and Monica Musio. Bayesian model selection based on proper scoring rules. Bayesian Analysis, 10(2):479 499, 2015.

Enmao Diao, Jie Ding, and Vahid Tarokh. Restricted recurrent neural networks. 2019 IEEE International Conference on Big Data, 2019.

Enmao Diao, Jie Ding, and Vahid Tarokh. Multimodal controller for generative models. ar Xiv preprint ar Xiv:2002.02572, 2020a.

Enmao Diao, Jie Ding, and Vahid Tarokh. Hetero FL: Computation and communication eﬃcient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2020b.

Jie Ding, Vahid Tarokh, and Yuhong Yang. Model selection techniques an overview. IEEE Signal Processing Magazine , 35(6):16 34, 2018.

Ian Domowitz and Halbert White. Misspeciﬁed models with dependent observations. Journal of Econometrics, 20(1):35 58, October 1982.

Model Linkage Selection for Cooperative Learning

Chao Du, Chu-Lan Michael Kao, and S. C. Kou. Stepwise signal extraction via marginal likelihood. Journal of the American Statistical Association, 111(513):314 330, 2016.

C Duckworth, L Zhang, Steven Carroll, S Ethier, and H Cheung. Overexpression of GAB2 in ovarian cancer cells promotes tumor growth and angiogenesis by upregulating chemokine expression. Oncogene, 35(31):4036 4047, 2016.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007.

Pi Guo, Jianjun Zhang, Li Wang, Shaoyi Yang, Ganfeng Luo, Changyu Deng, Ye Wen, and Qingying Zhang. Monitoring seasonal inﬂuenza epidemics by using internet search data with an ensemble penalized regression model. Scientiﬁc Reports, 7(1):46469, 2017.

Pierre E Jacob, Lawrence M Murray, Chris C Holmes, and Christian P Robert. Better together? statistical learning in models made of modules. ar Xiv preprint ar Xiv:1708.08719, 2017.

Shane T Jensen, Guang Chen, and Christian J Stoeckert Jr. Bayesian variable selection and data integration for biological regulatory networks. The Annals of Applied Statistics, 1(2):612 633, 2007.

Malhar S Jere, Tyler Farnan, and Farinaz Koushanfar. A taxonomy of attacks on federated learning. IEEE Security & Privacy, 19(2):20 28, 2020.

Michael I Jordan, Jason D Lee, and Yun Yang. Communication-eﬃcient distributed statistical inference. Journal of the American Statistical Association, 114(526):668 681, 2019.

Jakub Koneˇcn y, H Brendan Mc Mahan, Felix X Yu, Peter Richt arik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: strategies for improving communication eﬃciency. ar Xiv preprint ar Xiv:1610.05492, 2016.

Fanjie Kong, Xiaobing Li, Hong Wang, Dengfeng Xie, Xiang Li, and Yunxiao Bai. Land cover classiﬁcation based on fused data from GF-1 and MODIS NDVI time series. Remote Sensing, 8(9):741, 2016.

Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18(1):4404 4446, 2017a.

Jason D Lee, Qiang Liu, Yuekai Sun, and Jonathan E Taylor. Communication-eﬃcient sparse regression. Journal of Machine Learning Research, 18(1):115 144, 2017b.

Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-eﬃcient distributed optimization in networks with gradient tracking. ar Xiv preprint ar Xiv:1909.05844, 2019.

Quefeng Li and Lexin Li. Integrative linear discriminant analysis with guaranteed error rate improvement. Biometrika, 105(4):917 930, 2018.

Zhou, Ding, Tan, Tarokh

Dungang Liu, Regina Y Liu, and Minge Xie. Multivariate meta-analysis of heterogeneous studies using only summary statistics: eﬃciency and robustness. Journal of the American Statistical Association, 110(509):326 340, 2015.

Fei Liu, MJ Bayarri, and JO Berger. Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1):119 150, 2009.

David Lunn, Nicky Best, David Spiegelhalter, Gordon Graham, and Beat Neuenschwander. Combining MCMC with sequential PKPD modeling. Journal of Pharmacokinetics and Pharmacodynamics, 36(1):19, 2009.

David J Lunn, Andrew Thomas, Nicky Best, and David Spiegelhalter. Win BUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10(4):325 337, 2000.

Jing Ma and George Michailidis. Joint structural estimation of multiple graphical models. Journal of Machine Learning Research, 17(1):5777 5824, 2016.

Arnab Maity, Anirban Bhattacharya, Bani Mallick, and Veerabhadran Baladandayuthapani. Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. Biometrics, 76(1):316 325, 08 2019. doi: 10.1111/biom.13132.

Olvi L. Mangasarian, W. Nick Street, and William H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570 577, 1995. doi: 10.1287/opre.43.4.570.

Delphine Maucort-Boulch, Silvia Franceschi, and Martyn Plummer. International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology and Prevention Biomarkers, 17(3):717 720, 2008.

H Brendan Mc Mahan, Eider Moore, Daniel Ramage, and Seth Hampson. Communicationeﬃcient learning of deep networks from decentralized data. ar Xiv preprint ar Xiv:1602.05629, 2016.

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In International Conference on Machine Learning, pages 689 696, 2011.

Kiona Ogle, Jarrett Barber, and Karla Sartor. Feedback and modularization in a Bayesian meta analysis of tree traits aﬀecting forest dynamics. Bayesian Analysis, 8(1):133 168, 2013.

Matthew Parry. Linear scoring rules for probabilistic binary classiﬁcation. Electronic Journal of Statistics, 10(1):1596 1607, 2016.

Matthew Parry, A Philip Dawid, and Steﬀen Lauritzen. Proper local scoring rules. Annals of Statistics, 40(1):561 592, 2012.

Model Linkage Selection for Cooperative Learning

Martyn Plummer. Cuts in Bayesian graphical models. Statistics and Computing, 25(1): 37 43, Jan 2015. ISSN 1573-1375. doi: 10.1007/s11222-014-9503-z.

Stephane Shao, Pierre E Jacob, Jie Ding, and Vahid Tarokh. Bayesian model comparison with the hyv arinen score: Computation and consistency. Journal of the American Statistical Association, 114(528):1826 1837, 2019.

Jieli Shen, Regina Y Liu, and Min-Ge Xie. i Fusion: Individualized fusion learning. Journal of the American Statistical Association, 115(531):1251 1267, 2019.

Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750 1761, 2014.

Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310 1321. ACM, 2015.

MC Simmonds and JPT Higgins. Covariate heterogeneity in meta-analysis: criteria for deciding between meta-regression and individual patient data. Statistics in Medicine, 26 (15):2982 2999, 2007.

Uthayasankar Sivarajah, Muhammad Mustafa Kamal, Zahir Irani, and Vishanth Weerakkody. Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70:263 286, 2017.

Lu Tang and Peter XK Song. Fused lasso approach in regression coeﬃcients clustering: learning parameter heterogeneity in data integration. Journal of Machine Learning Research, 17(1):3915 3937, 2016.

Lu Tang, Ling Zhou, and Peter XK Song. Fusion learning algorithm to combine partially heterogeneous cox models. Computational Statistics, 34(1):395 414, 2019.

Edward F Vonesh, Tom Greene, and Mark D Schluchter. Shared parameter models for the joint analysis of longitudinal data and event times. Statistics in Medicine, 25(1):143 163, 2006.

AM Walker. On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society: Series B, 31(1):80 88, 1969.

Xinran Wang, Yu Xiang, Jun Gao, and Jie Ding. Information laundering for model privacy. In International Conference on Learning Representations, 2021.

Xiaoquan Wen and Matthew Stephens. Bayesian methods for genetic association analysis with heterogeneous subgroups: from meta-analyses to gene-environment interactions. The Annals of Applied Statistics, 8(1):176, 2014.

Xun Xian, Xinran Wang, Jie Ding, and Reza Ghanadan. Assisted learning: A framework for multi-organization learning. In Advances in Neural Information Processing Systems, pages 14580 14591, 2020.

Zhou, Ding, Tan, Tarokh

Jin-Jun Xiao and Zhi-Quan Luo. Universal decentralized detection in a bandwidthconstrained sensor network. IEEE Transactions on Signal Processing, 53(8):2617 2624, 2005.

Shi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang Chun Woo. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802 810, 2015.

Shihao Yang, Mauricio Santillana, and SC Kou. Accurate estimation of inﬂuenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences of the United States of America, 112(47):14473 14478, 2015.

Chenglong Ye, Jie Ding, and Reza Ghanadan. Meta clustering for collaborative learning. ar Xiv preprint ar Xiv:2006.00082, 2020.

Qing Zhou, Di Li, Soummya Kar, Lauren M Huie, H Vincent Poor, and Shuguang Cui. Learning-based distributed detection-estimation in sensor networks with unknown sensor defects. IEEE Transactions on Signal Processing, 65(1):130 145, 2016.

Richard Zuech, Taghi M Khoshgoftaar, and Randall Wald. Intrusion detection and big heterogeneous data: a survey. Journal of Big Data, 2(1):3, 2015.