# model_linkage_selection_for_cooperative_learning__88f79339.pdf Journal of Machine Learning Research 22 (2021) 1-44 Submitted 09/20; Revised 09/21; Published 10/21 Model Linkage Selection for Cooperative Learning Jiaying Zhou zhou1054@umn.edu School of Statistics University of Minnesota Minneapolis MN, 55455 Jie Ding dingj@umn.edu School of Statistics University of Minnesota Minneapolis MN, 55455 Kean Ming Tan keanming@umich.edu Department of Statistics University of Michigan Ann Arbor MI, 48109 Vahid Tarokh vahid.tarokh@duke.edu Department of Electrical Engineering and Engineering Duke University Durham NC, 27708 Editor: Bert Huang We consider the distributed learning setting where each agent or learner holds a specific parametric model and a data source. The goal is to integrate information across a set of learners and data sources to enhance the prediction accuracy of a given learner. A natural way to integrate information is to build a joint model across a group of learners that shares common parameters of interest. However, the underlying parameter sharing patterns across a set of learners may not be known a priori. Misspecifying the parameter sharing patterns or the parametric model for each learner often yields a biased estimator that degrades the prediction accuracy. We propose a general method to integrate information across a set of learners that is robust against misspecification of both models and parameter sharing patterns. The main crux of our proposed method is to sequentially incorporate additional learners that can enhance the prediction accuracy of an existing joint model based on userspecified parameter sharing patterns across a set of learners. Theoretically, we show that the proposed method can data-adaptively select a parameter sharing pattern that enhances the predictive performance of a given learner. Extensive numerical studies are conducted to assess the performance of the proposed method. Keywords: Data integration; decentralized learning; federated learning; model linkage selection; prediction efficiency. 1. Introduction In recent years, there has been a growing interest in statistical learning problems with a set of decentralized learners, where each learner encompasses a specific data modality and a c 2021 Jiaying Zhou, Jie Ding, Kean Ming Tan, and Vahid Tarokh. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-1078.html. Zhou, Ding, Tan, Tarokh statistical model built using domain-specific knowledge. The goal is to integrate information across decentralized learners to achieve higher statistical efficiency and predictive accuracy. Integrating information from different data sources is crucial in many scientific domains such as environmental science (Blangiardo et al., 2011; Xingjian et al., 2015), epidemiology (Yang et al., 2015; Guo et al., 2017), statistical machine learning problems (Ngiam et al., 2011; Kong et al., 2016; Cao and Jin, 2007; Ye et al., 2020; Xian et al., 2020), and computational biology (Simmonds and Higgins, 2007; Liu et al., 2009, 2015; Wen and Stephens, 2014). For instance, in the context of epidemiology, a considerable amount of online search data from different platforms are integrated and used to form accurate predictions of influenza epidemics (Yang et al., 2015; Guo et al., 2017). The aforementioned applications raise a critical question: how to reliably integrate information from different data sources to enhance statistical efficiency in a robust manner? We consider the setting in which there are a set of learners, each consisting of a data set and a parametric statistical model. The learners may or may not share common parameters among themselves. The goal is to develop a framework to enhance the statistical efficiency of any particular learner, say L1, by integrating information from the other learners through parameter sharing. In general, learner L1 can be assisted explicitly or implicitly by building a joint model with the other learners with potentially different statistical models and different data sources. Explicit assistance could be achieved by joint modeling with a set of learners that share at least one parameter. On the other hand, implicit assistance could be achieved by joint modeling with learners whose parameters are not directly related to L1, but are related to learners who could explicitly assist L1. In principle, if the true underlying parameter sharing patterns among all learners are known a priori and that the parametric statistical model for each learner is correctly specified, then one can build a joint model with constraints on the shared parameters based on the joint likelihood function. Many existing modeling methods can be formulated as special instances of the above setting. For example, when multiple learners employ the same parametric model across different data sources, it is usually studied in the context of data integration (Jensen et al., 2007; Vonesh et al., 2006; Liu et al., 2015; Lee et al., 2017b; Jordan et al., 2019; Danaher et al., 2014; Ma and Michailidis, 2016; Tang and Song, 2016; Li and Li, 2018; Tang et al., 2019; Maity et al., 2019; Shen et al., 2019), or in the context of distributed optimization (Boyd et al., 2011; Shi et al., 2014; Lee et al., 2017a; Li et al., 2019). A recent topic, referred to as federated learning, is also related to the above setting, where a central server (interpreted as the main learner) sends the current global model to a set of clients (interpreted as other learners), each client updates the model parameters with the local data source, and then returns them to the central server (Shokri and Shmatikov, 2015; Koneˇcn y et al., 2016; Mc Mahan et al., 2016). Though it is often helpful to establish a joint model from multiple learners, naively combining data sources and performing joint modeling can lead to severely degraded statistical performance due to four possible reasons: (i) misspecified statistical models for some learners; (ii) misspecified parameter sharing patterns among learners; (iii) heterogeneity from different data sources (Simmonds and Higgins, 2007; Wen and Stephens, 2014; Liu et al., 2015); and (iv) distinct learning objectives for different learners (Zuech et al., 2015; Sivarajah et al., 2017; Xian et al., 2020; Wang et al., 2021). Most existing data integration and distributed computing methods are based on the assumption that the statistical model Model Linkage Selection for Cooperative Learning for each learner is correctly specified and that the parameter sharing patterns are known a priori. A systematic statistical approach for decentralized learning robust against the four aspects mentioned above is relatively lacking. In this paper, we propose a general approach to enhance the predictive performance of a specific learner L1 by integrating information from the other learners. We consider the setting where there are M learners and that the learners may have different statistical models and heterogeneous data sources. To characterize parameter sharing patterns among all learners, we introduce the notion of a model linkage graph. A model linkage graph G = (V, E) consists of a set of M vertices V and a set of edges E, where each vertex represents a learner, and an edge between a pair of vertices encodes a parameter sharing pattern between the pair of learners. In addition, a pair of learners do not share any common parameters if there is no edge between them. A joint model that enhances the predictive performance of L1 can then be fit given a model linkage graph. However, the ground truth or the most suitable model linkage graph is not known a priori. Due to model misspecification within each learner and misspecified model linkages between pairs of learners, the prediction performance of L1 may degrade after incorporating information from other learners. To enhance robustness against a misspecified model linkage graph, one could exhaustively build joint models for all possible sets of learners that are connected to L1 within a given model linkage graph. Then, the set of learners that yields the largest conditional marginal likelihood of L1 is selected. However, such an approach is computationally infeasible since the number of possible sets of learners grows exponentially with the number of learners. To address this challenge, we propose a greedy algorithm that is robust against a misspecified model linkage graph. Our proposed algorithm sequentially incorporates additional learners based on the user-specified model linkage graph, starting from learner L1. In each iteration of the algorithm, we utilize the joint model built with a group of learners from the previous iteration and search for the next learner to improve the marginal likelihood of the current group of learners. This process is continued until no more learners are included in the joint model. This approach approximately reduces the number of possible sets of model linkages from exponential to quadratic in the number of learners. Compared with the joint modeling approach, our proposed method is of distributed nature and does not require data sharing across learners. To quantify the theoretical aspects of the proposed method, we introduce the notion of model linkage selection consistency and asymptotic prediction efficiency. They are different but conceptually parallel to asymptotic efficiency and selection consistency in the classical model selection literature (see, e.g., Ding et al., 2018). We show that the proposed algorithm achieves linkage selection consistency as long as the user-specified model linkage graph is a superset of the underlying model linkage graph. In other words, the proposed method selects data sources that are truly useful for enhancing the predictive performance of L1 in a dataadaptive manner as the sample size grows. In addition, we show that the proposed method achieves asymptotic prediction efficiency, meaning that the predictive performance of the selected model is asymptotically equivalent to that of the best joint model in hindsight. The paper is outlined as follows. In Section 2, we provide the motivation, problem description, and definitions related to the model linkage graph. In Section 3, we propose a general method for integrating information from different learners to enhance the predictive performance of L1. The theoretical results for the proposed framework are provided in Zhou, Ding, Tan, Tarokh Section 3.2.2. In Section 4, we perform numerical studies to evaluate the performance of the proposed method under different scenarios such as data contamination and model misspecification. We close with a discussion in Section 5, where we highlight some related literature on data integration and federated learning. The technical proofs and the regularity conditions needed for the main results are included in the Appendix. 2. Background and Motivation Suppose that there are M learners, L1, L2, . . . , LM. Each learner is a pair Lκ = (D(κ), Pκ), where D(κ) is a set of nκ observations from the sample space D(κ), and Pκ is a user-specified class of parametric model for modeling D(κ), parameterized by a pκ-dimensional parameter θκ Θκ. Our goal is to develop a framework for enhancing the predictive performance of a particular learner, say L1, by integrating information from other learners, L2, . . . , LM. We will focus on the regression setting where D(κ) = (Y, X (κ)), with Y = R and X (κ) = Rkκ. The covariates are allowed to have different dimensions across the M learners due to different data sources. While we focus on the regression setting, the proposed approach can be applied more generally to data of different forms. Let D(κ) = (y(κ), X(κ)) D(κ), where y(κ) Rnκ is an nκ-dimensional vector of response and X(κ) Rnκ kκ is an nκ kκ matrix of covariates. Let Pκ = n p(κ) θκ ( |θk, X(κ)) : θκ Θκ, X(κ) X (κ)o be a class of user- specified parametric model for modeling y(κ) given X(κ). For each learner, assume that the underlying response variable is generated independently according to the probability law P κ described by a conditional density p κ( |x(κ)), given the covariates x(κ) Rkκ. We start with providing several definitions that will serve as a foundation of the proposed framework: model misspecification, model linkage, and model linkage misspecification. A class of user-specified model Pκ is said to be misspecified when it does not contain the underlying conditional density p κ( |x(κ)). In practice, model misspecification often occurs due to an inappropriate functional form between the response and covariates, such as underfitting the model or neglecting dependent random noise (see, e.g., Domowitz and White, 1982; Clarke, 2005; Cawley and Talbot, 2010). We now provide a formal definition of model misspecification. Definition 1 A model P = {pθ : θ Θ} is well-specified if there exists a θ Θ such that pθ = p almost everywhere. A model P = {pθ : θ Θ} is misspecified if supθ Θ P {y : pθ(y|θ, x) = p (y|x)} < 1 for some covariates x, where P denotes the probability measure corresponding to p . Figure 1 provides several examples of well-specified and misspecified models in the context of regression. The left panel of Figure 1 indicates that the learner L1 is well-specified, since the underlying linear model p 1 P1, where P1 is a class of user-specified linear model with two covariates. On the other hand, the middle panel of Figure 1 presents a case when the learner L2 is misspecified since p 2 / P2. Model misspecification also occurs when the user-specified model class is different from the underlying data-generating process, as illustrated in the right panel of Figure 1. Next, we provide a new definition of model linkage. Suppose that there are two learners Li and Lj. A model linkage occurs between two learners Li and Lj if they are restricted to share some common parameters. A formal definition is given in Definition 2. Model Linkage Selection for Cooperative Learning 1,i + β2x(3) i Bernoulli{logistic(β1x(3) 1,i + β2x(3) Figure 1: Three examples of well-specified and misspecified models. Left panel: wellspecified model. Middle panel: misspecified model. Right panel: misspecified model. Definition 2 (Model linkage) Suppose that two learners Li and Lj are well-specified. Let θi,Si and θj,Sj be subvectors of θi and θj, indexed by the subsets Si {1, . . . , pi} and Sj {1, . . . , pj}, respectively. There exists a model linkage between Li and Lj if θi,Si = θj,Sj, also denoted by θSi,Sj for notational convenience. We also refer to θSi,Sj as the shared common parameter between Li and Lj. To put the idea of model linkage into perspective, we consider an example in the context of an epidemiological study with two learners, illustrated in Figure 2. A similar example was considered in (Plummer, 2015). Suppose that both learners are well-specified. Learner L1 concerns estimating the human papilomavirus (HPV) prevalence with data D(1) = (y(1), x(1)), where y(1) and x(1) are both n-dimensional vectors recording the number of women infected with high-risk HPV and the population size in different states, respectively. A binomial model is specified for y(1) i with a population size x(1) i and HPV prevalence parameter θ1,i, with i = 1, 2, . . . , n. Learner L2 models the relationship between the HPV prevalence θ1,i and cancer incidence in the form of Poisson regression. In L2, let D2 = (y(2), x(2)), where y(2) is the number of cancer incidents and x(2) is the women-years of follow up. Both L1 and L2 are restricted to share the same HPV prevalence parameters θ1,i, i = 1, 2, . . . , n, and thus there is a model linkage between L1 and L2. i Poisson( 1,i 2x(2) Figure 2: Learners L1 and L2 share common parameters θ1,i, i = 1, . . . , n. Zhou, Ding, Tan, Tarokh The definition and example above focus mainly on whether there is a model linkage between two learners. Such an idea can be generalized to a set of model linkages among a group of learners, which we refer to as model linkage graph in the following definition. Definition 3 (Model linkage graph) Let G = (V, E) be an undirected model linkage graph, where V is a set of M vertices representing the M learners L1, . . . , LM, and E is an edge set encoding model linkages between pairs of learners. In practice, the model linkage graph and linkages between pairs of learners are pre-specified by the user, usually based on domain-specific knowledge, before model fitting. Therefore, the prediction performance of a learner may not be improved after incorporating information from the model linkage graph due to the potential misspecification of models and model linkages. A concept in parallel with model misspecification in the context of model linkage misspecification is provided in Definition 4. Definition 4 (Model linkage misspecification) Suppose that there is a model linkage between two learners Li and Lj. In other words, a pair of subsets of parameters in two learners are restricted to be the same, say θi,Si = θj,Sj. A model linkage is misspecified if either Li or Lj is misspecified, or that θ i,Si = θ j,Sj. More generally, a model linkage graph G = (V, E) is misspecified if there exists a misspecified model linkage between a pair of learners. Recall that we are interested in enhancing the predictive performance of L1 by integrating information from other learners L2, . . . , LM. Thus far, it is clear that a model linkage should exist between L1 and Lκ if the pair of learners shares common parameters. We now introduce the notion of information flow in which learners that do not share common parameters directly with L1 can also enhance its predictive performance by sharing common parameters with learners that can, in turn, assist L1. Consider an example with six learners as illustrated in Figure 3. For simplicity, assume that the statistical models for all learners are all well-specified and that the true model linkage graph is known. Learners L2 and L4 share common parameters with L1, and thus there exist model linkages between L1 and L2, and L1 and L4. In addition, learner L3 has a shared common parameter with L2, and thus in principle, L3 can help enhance the predictive performance of L1 implicitly. This process of positive feedback transmission is an information flow that enables implicit assistance to L1. In Figure 3, there are two such information flows, namely L3 L2 L1 and L5 L4 L1. Thus, in the model linkage graph, paths exist from L3 and L5 to L1. Learner L6 does not share common parameters with learners related to L1, and thus there is no model linkage between L6 and the others. However, in practice, the true model linkage graph is not known a priori and needs to be specified. A misspecified model linkage graph may hamper the predictive performance of L1. The following section proposes a data-adaptive framework to identify an appropriate model linkage graph for prediction. The same idea can be used to improve parameter estimation accuracy. Model Linkage Selection for Cooperative Learning Figure 3: A model linkage graph with six learners. Learner L1 shares common parameters with L2 and L4, and is implicitly connected to L3 and L5 through L2 and L4. 3. General Bayesian Framework for Model Linkage Selection 3.1 Proposed Method We propose a Bayesian framework to enhance the predictive performance of learner L1 by leveraging data from the other learners L2, . . . , LM according to a user-specified model linkage graph, G = (V, E). Suppose that there exists a model linkage between two learners, say learners Li and Lj. Some of the elements in θi and θj are restricted to be the same, say θi,Si = θj,Sj = θSi,Sj. From the Bayesian perspective, a natural way to integrate information is to compute the posterior distribution of the parameters θ = (θT i, Si, θT Si,Sj, θT j, Sj)T, where θi, Si and θj, Sj are obtained by removing the elements from θi and θj, indexed by the sets Si and Sj, respectively. Then, the posterior distribution of θ can be computed as π(θ | D(i), D(j)) = p(i) θi (y(i)|θi, X(i))p(j) θj (y(j)|θj, X(j))π(θ) p(y(i), y(j)|X(i), X(j)) where π(θ) is a prior distribution on θ, and p(y(i), y(j)|X(i), X(j)) is the marginal likelihood obtained by integrating the numerator of the above equation with respect to θ. More generally, one can compute the posterior distribution of the parameters according to the user-specified model linkage graph. Let C(G) be a set of indices recording the vertices that form a connected component with learner L1 in the user-specified graph G, including learner L1. That is, C(G) is a set containing all indices of learners that have at least a path to L1 and learner L1. For brevity, throughout the paper, let θ be a vector obtained by concatenating the entries of θκ for all κ C(G) without duplication. That is, the shared parameters between pairs of learners appear only once in θ. The posterior distribution for θ under a specific model linkage can then be computed by π(θ | κ C(G)D(κ)) = π(θ) Q κ C(G) p(κ) θκ (y(κ)|θκ, X(κ)) p( κ C(G)y(κ)| κ C(G) X(κ)) Then, the posterior predictive distribution for a new observation in learner L1 can be computed as p(ey| κ C(G) D(κ), ex) = Z Θ p(1) θ1 (ey|θ1, ex)π(θ | κ C(G)D(κ))dθ, (1) Zhou, Ding, Tan, Tarokh Algorithm 1 Greedy Algorithm for Model Linkage Selection. Input: User-specified graph G, data D(κ), parameter vector θκ, parametric distribution p(κ) θκ ( | θκ, X(κ)), prior distribution πκ( ) on parameters θκ, for κ = 1, . . . , M. 1: Initialize the index ℓ= 1, linkage set ζ(1) = {1}. 2: for ℓ= 2, . . . , M do 3: Let NG(ζ(ℓ 1)) denote the neighboring set of ζ(ℓ 1) within G, namely the set of learners in G\ζ(ℓ 1) that have a model linkage with at least one learner in ζ(ℓ 1). 4: Calculate jopt = argmaxj NG(ζ(ℓ 1))p( κ ζ(ℓ 1)y(κ) | κ ζ(ℓ 1)X(κ), D(j)). 5: if p( κ ζ(ℓ 1)y(κ)| κ ζ(ℓ 1) X(κ)) p( κ ζ(ℓ 1)y(κ) | κ ζ(ℓ 1)X(κ), D(jopt)) break 6: Let ζ(ℓ) = {jopt} ζ(ℓ 1). 7: if ζ(ℓ) = C(G) break 8: end for 9: For a new predictor ex, let bp(ℓ) = p( | κ ζ(ℓ) D(κ), ex) and bπ(ℓ) = π( | κ ζ(ℓ) D(κ)). Output: Predictive distribution bp = bp(ℓ), posterior distribution bπ = bπ(ℓ), and model linkage graph b G. where ex denotes the future observed covariates and Θ is the parameter space. Note that the posterior predictive distribution has integrated information from learners in C(G) through the parameters in θ. In practice, the true model linkage graph is not known a priori, and the user-specified model linkage graph G may be misspecified, as defined in Definition 4. A joint model as in (1) with a misspecified model linkage graph G can lead to severely biased parameter estimation, which affects the predictive quality of L1. Let |C(G)| be the cardinality of the set C(G). One way to address the aforementioned challenge is to exhaustively search all possible sets of model linkages over a graph with |C(G)| learners, and pick the set of model linkages that yields the largest marginal likelihood for L1 conditioned on other learners. However, the number of possible sets of model linkages grows exponentially with |C(G)|, and it is computationally infeasible to evaluate all possible subgraphs of G. To address the above challenge, we propose a greedy algorithm that is computationally feasible with theoretical guarantees. The proposed algorithm reduces the possible sets of model linkages from exponential in |C(G)| to quadratic in |C(G)|. The main idea is to successively search for the next learner that will improve the conditional marginal likelihood of the current group of learners, starting from a singleton set L1. The algorithm will terminate and output an estimated model linkage graph b G = (V, b E) when adding any further learner no longer increases the marginal likelihood of the maintained learners. The greedy algorithm is outlined in Algorithm 1. When the algorithm terminates at the ℓth iteration, L1 will integrate information from learners in ζ(ℓ), leading to the following posterior distribution of θ: π(θ| κ ζ(ℓ) D(κ)) = π(θ) Q κ ζ(ℓ) p(κ) θk (y(κ)|θk, X(κ)) p( κ ζ(ℓ)y(κ)| κ ζ(ℓ) X(κ)) . Recall that the above θ is the vector obtained by concatenating free parameters of θκ for all κ ζ(ℓ) without duplication. The posterior predictive distribution for a new observation ex in learner L1 obtained from Algorithm 1 is computed as p(ey| κ ζ(ℓ) D(κ), ex) = Z Θ p(1) θ1 (ey|θ1, ex)π(θ| κ ζ(ℓ) D(κ))dθ. Model Linkage Selection for Cooperative Learning Input Input graph " = ($, &) *!, *", *#, *$ Iteration 1 Iteration 2 Output graph E8Ehghn+e ABl4y Cm Gh Cq OR6V0x HRBIKOs WKDs Fe PHm Zd Oo1u1Gz7dt Gt Xl V5FGR+g Yn SIb Xa Amuk Et1EYUPa EX9Ibej Wfj1fgw Pmet Ja OYOUR/YHz9Amtonbc= S2,S4 E8Ehghn+e ABl4y Cm Gh Cq OR6V0x HRBIKOs WKDs Fe PHm Zd Oo1u1Gz7dt Gt Xl V5FGR+g Yn SIb Xa Amuk Et1EYUPa EX9Ibej Wfj1fgw Pmet Ja OYOUR/YHz9Amtonbc= S2,S4 Algorithm terminates Figure 4: A schematic diagram illustrating Algorithm 1 with four learners. Each learner Lκ = (D(κ), Pκ) consists of the data and a user-specified class of parametric model. The algorithm starts with a user-specified graph that may or may not be misspecified. At each iteration, the algorithm selects a learner that maximizes the marginal likelihood of the current group of learners. The algorithm terminates when no such learners can be found. Finally, the algorithm outputs an estimated linkage graph b G, predictive distribution bp, and the posterior distribution bπ for θ. We now illustrate Algorithm 1 with a toy example in Figure 4. In this example, there are four learners, and the user-specified model linkage forms a connected component among all learners, i.e., there is a path from one learner to another learner. Suppose that the goal is to assist learner L1. It can be seen from the user-specified model linkage graph G that L1 is directly connected to L2 and L3, and implicitly connected with L4 through L2. At the first iteration of Algorithm 1, L1 computes its marginal likelihood p(D(1)), and conditional marginal likelihoods p(D(1) | D(2)) and p(D(1) | D(3)). If none of the conditional marginal likelihoods is larger than the marginal likelihood, Algorithm 1 will be terminated; otherwise, it includes the learner that produces the larger conditional marginal likelihood into the model linkage set. In this case, learner L2 is included in the model linkage set. At the second iteration, the linkage set {L1, L2} is treated as one learner since they are linked together during the first iteration. On the other hand, the candidate learners to establish a linkage are L3 and L4. In Figure 4, L4 is included in the linkage set following a similar argument as in the first iteration. Finally, L3 is not included in the linkage set based on our criterion and the algorithm terminates. Consequently, L1 will obtain a parameter estimation and predictive model that are trained from the union of L1, L2, and L4. Zhou, Ding, Tan, Tarokh Remark 5 Computation and Communication: In the above example, we note that L1 does not need to access the raw data of the other learners during the first iteration. In particular, L1 only requires L2 to share its likelihood function (e.g., in the form of an API) to calculate the (conditional) marginal likelihoods required for decision-making. Moreover, one could design more sophisticated protocols to increase the computation and communication efficiency in the proposed algorithm. For instance, when L1 decides whom to collaborate with, it only needs L2 and L3 to transmit their posterior distribution (in the form of, e.g., Monte Carlo samples) computed locally, to calculate the conditional marginal likelihood. Once a collaborator, say L2, is determined, the likelihood function of L2 will be sent to L1 to calculate the latest marginal likelihood and posterior. We briefly comment on the computation complexity of L1 in a general scenario. Suppose that at each iteration ℓ, each learner in NG(ζ(ℓ 1)) (following the notation in Algorithm 1) will send their likelihood functions to the linkage set ζ(ℓ 1). Let us consider the complexity of evaluating each candidate learner j in line 4 of Algorithm 1 as one unit. On the one hand, if learner L1 performs the computation alone, its total complexity units will be O(P ℓ=2,...,M(M ℓ+ 1)) = O(M2) for large M. On the other hand, if everyone in the current collaboration set shares the computation cost, the complexity of L1 will be reduced to O(P ℓ=2,...,M(M ℓ+ 1)/(ℓ 1)) = O(M log M). The second setting is reasonable, because learners in ζ(ℓ 1) are already collaborators of L1 in the joint modeling. 3.2 Theoretical Results We first provide definitions on asymptotic prediction efficiency and model linkage selection consistency in Section 3.2.1. Theoretical results for the proposed framework are presented in Section 3.2.2. Throughout the section, let G = (V, E) and b G = (V, b E) be the user-specified and estimated model linkage graphs, respectively. The user-specified G = (V, E) is usually specified based on prior scientific knowledge of practitioners, which may or may not be wellspecified. Let G = (V, E ) be the largest subgraph of G, whose underlying model linkages are all well-specified after the statistical model for each learner is specified. That is, for any pair of learners Li and Lj connected on G, their linkage is well-specified if and only if there exists an edge between them in G . Recall that more linkages imply a fewer number of free parameters in the joint model from G. Thus, intuitively speaking, G represents the most parsimonious parameterization of the underlying data-generating process. Ideally, the proposed algorithm can data-adaptively select b G = G , thus identifying the correct model linkages and filtering out misspecified ones within G. 3.2.1 Definitions Recall that the goal of the proposed framework is to enhance the predictive performance of L1 by borrowing information from other learners, L2, . . . , LM. To evaluate the predictive performance of L1, we consider a general class of proper scoring rules (Gneiting and Raftery, 2007; Parry et al., 2012; Shao et al., 2019). Examples of proper scoring functions are the logarithmic score s(p, y, x) = log p(y|x) and the Brier score s(p, y, x) = p(y|x) + 0.5 R R p(ey|x)2dey (Parry, 2016). Model Linkage Selection for Cooperative Learning Definition 6 (Proper scoring function) Let p be the true data-generating density function. A scoring function s : (p, y, x) 7 s(p, y, x) is proper if for any conditional density function p, we have R Y s(p, y, x)p (y|x)dy R Y s(p , y, x)p (y|x)dy almost surely. Let E be the expectation with respect to the data-generating distribution of y conditional on x, denoted by p . Let (ey, ex) be a new observation. The value E[s(p , ey, ex) | ex] is referred to as the oracle score, and E[s(bp, ey, ex) s(p , ey, ex) | ex] is a non-negative expected prediction loss since s( ) is a proper scoring rule. It can be seen that the non-negativity of the Kullback-Leibler divergence from p ( |x) to bp( |x), defined as Dkl{p ( |x)||bp( |x)} = R Y p (y|x)[log{p (y|x)} log{bp(y|x)}]dy, implies that the logarithmic score is proper. Recall that C(G) is a set of indices recording a set of learners that forms a connected component with L1 in a model linkage graph G. Next, we define linkage selection consistency. Definition 7 (Linkage selection consistency) Given a pre-specified model linkage graph G, suppose that ψ : {D(κ) : κ C(G)} 7 b G is a linkage selection criterion in order to assist L1. Then, the linkage selection criterion ψ achieves linkage selection consistency if Pr( b E = E ) 1 as n . Here, the probability is defined over the observed data. In other words, a consistent linkage selection criterion ψ selects all the correct model linkages present in the user-specified linkage graphs. Next, we introduce the notion of asymptotic prediction efficiency. Suppose that C denotes a generic set of learners other than L1. Let bp C denote the predictive distribution of L1 conditional all the learners in C. Definition 8 (Asymptotic prediction efficiency) Let bp be a constructed marginal predictive distribution for L1, and let s( ) be a proper scoring function. Then, bp is asymptotically prediction efficient if E[s(bp C(G ), ey, ex) s(p , ey, ex)] E[s(bp, ey, ex) s(p , ey, ex)] (2) converges in probability to one as the number of observations nκ for κ = 1, . . . , M. Here, the expectation is taken over a new observation (ey, ex). If bp is the posterior predictive distribution for L1 under a certain model linkage b G, then b G is also referred to as an asymptotically efficient model linkage graph. The ratio (2) contrasts the expected prediction loss of the constructed predictive density function bp and that of the predictive density function induced by G . 3.2.2 Theoretical Properties of Algorithm 1 We now proceed to study the theoretical properties of Algorithm 1. For technical convenience, we assume that learner L1 is well-specified. Note that, more generally, learner L1 may or may not be well-specified. In either case, the predictive performance can be evaluated by a proper scoring rule, e.g., the logarithmic rule. Throughout the theoretical studies, we consider the regime in which the number of learners M is fixed and the number of observations nκ satifies nκ/n cκ, where n = PM κ=1 nκ and cκ is a positive constant. Zhou, Ding, Tan, Tarokh Recall that G = (V, E) is a user-specified graph. Moreover, recall that G = (V, E ) is the true model linkage graph, defined as the largest subgraph of G with correct model linkages between pairs of learners (given that each learner s model is specified). We denote the estimated model linkage graph from Algorithm 1 as b G = (V, b E). Theorem 9 Under some regularity conditions in Appendix A and given a user-specified graph G, the estimated model linkage edge set b E from Algorithm 1 achieves linkage selection consistency, namely Pr( b E = E ) 1 as n . Theorem 9 indicates that the estimated model linkage graph b E from Algorithm 1 is consistent in linkage selection, only selecting all the well-specified model linkages from G. Thus, the proposed approach is asymptotically robust against model linkage misspecification. We will verify the finite sample performance of Algorithm 1 using numerical studies in Section 4, by considering various settings such as model misspecification, model linkage misspecification, and data contamination. The following theorem guarantees that the predictive distribution constructed from Algorithm 1 is asymptotically prediction efficient. Theorem 10 Let bp be the constructed predictive distribution for L1 via Algorithm 1 based on its selected model linkage graph b G. Let s( ) be a proper scoring function. Under the same conditions as in Theorem 9, bp is asymptotically prediction efficient. Note that Theorem 10 holds even when the statistical models for some learners are misspecified due to the definition of G . In other words, the proposed method yields a predictive distribution that is robust to model misspecification and model linkage graph misspecification. 4. Numerical Studies 4.1 Linear Regression Example We consider a regression setting with six learners L1, . . . , L6. The goal is to enhance the predictive performance of L1 by incorporating information from other learners. The data for the six learners are generated as follows: P7 j=1 β(κ) j x(κ) ij + ϵ(κ) i for κ = 1, 3, 4, P15 j=1 β(κ) j x(κ) ij + ϵ(κ) i for κ = 2, P7 j=1 β(κ) j (x(κ) ij + 5)2 + ϵ(κ) i for κ = 5, P15 j=8 β(κ) j x(κ) ij + ϵ(κ) i for κ = 6, where all regression coefficients are set to equal 0.3 except that β(4) j = 0.6 for j = 1, . . . , 7. Each covariate x(κ) ij and the random noise are generated from a standard Gaussian distribution for all the learners. Since learners L1, L2, and L3 share common parameters, and L2 shares common parameters with L6, there are model linkages among learners L1, L2, L3, and L6. The true underlying model linkage graph G is illustrated on the right panel of Figure 5. Note that there is a model linkage between L2 and L3 since both share the same Model Linkage Selection for Cooperative Learning Figure 5: The user-specified model linkage graph G and the largest well-specified model linkage graph G within G are shown on the left and right panels, respectively. common parameters with L1. For simplicity, we set the sample size for each learner to be n. In practice, the user needs to specify a model linkage graph for the six learners and a statistical model for each learner. For the numerical studies, we specify correct statistical models for κ = {1, 2, 3, 4, 6}, and we misspecify L5 by assuming that the covariates are linearly related to the response y. We further restrict β(κ) j to be the same for κ = 1, . . . , 5 and j = 1, . . . , 7. Moreover, we restrict β(2) j = β(6) j to be the same for j = 8, . . . , 15. Hence, the model linkage graph is misspecified in the sense that we assume that there exist model linkages between L4 and {L1, L2, L3, L5}, and between L5 and {L1, L2, L3}. The userspecified graph G is illustrated on the left panel of Figure 5. We apply Algorithm 1 with the aforementioned user-specified graph and impose a multivariate Gaussian distribution, Np(0, 4Ip), as the prior distribution for the regression coefficients for all learners. We will compare the proposed Algorithm 1 to fitting the model using data only from L1, and using the combined data from L1 and L4. Recall that the actual regression coefficients in L4 are different from that of L1, and thus combining data in L1 and L4 can lead to severely biased estimates of regression coefficients. To assess the model linkage selection accuracy, we calculate the selection accuracy as the proportion of times when the estimated model linkage graph b G from Algorithm 1 is equal to G . To evaluate the performance across different models, we generate 50 test data for L1, and calculate the mean squared error between the predicted response and the actual response in the test data. Note that the mean squared error is a surrogate of the predictive logarithmic score under the Gaussian noise assumption. The predicted response is obtained by taking the mean of the posterior predictive distribution for each model. In addition, we calculate the length of the 95% prediction interval obtained from the posterior predictive distribution. The results for a range of sample sizes n {50, 75, . . . , 150}, averaged over 200 replications, are shown in Figure 6. Zhou, Ding, Tan, Tarokh From the left panel of Figure 6, the selection accuracy from Algorithm 1 approaches one as the sample size increases. This is in line with the selection consistency result of Theorem 9. Notably, Algorithm 1 yields a consistent model linkage graph even though the user-specified model linkage graph G is misspecified as shown in Figure 5. From the middle panel of Figure 6, we see that the proposed greedy algorithm yields the lowest prediction mean squared error across a range of sample sizes n. The results indicate that combining data from L1 and L4 imprudently can lead to higher prediction mean squared errors. Meanwhile, properly integrating data can improve the predictive performance of a single learner L1. Finally, the average length of the 95% prediction interval for the different models is presented in the right panel of Figure 6. We see that the proposed algorithm yields the narrowest prediction interval across the range of n. Figure 6: The results for modeling based on the proposed algorithm ( greedy algorithm ), modeling of the single agent L1 ( L1 ), and combined modeling of L1 and L4 ( L1, L4 ) in the regression experiment. The evaluation uses the selection accuracy (left), prediction mean squared error (middle), and length of the 95% prediction interval (right), averaged over 200 replications. 4.2 Logistic Regression Example In this section, we illustrate that the proposed framework can be employed for classification problems. We perform a numerical study with five learners L1, . . . , L5 to enhance the prediction accuracy of L1. For each learner, we generate the covariates independently from a uniform distribution on the closed interval [ 1, 1]. Then, the response variable is generated as the following: logit(Pr(y(κ) i | x(κ) i )) = ( (x(κ) i )Tβ for κ = 1, 2, 3, 4, 0 for κ = 5, where β = { 0.8, 0.5, 0.2, 0.1, 0.4, 0.7, 1.0, 1.3, 1.6}T. Learners L1, L2, L3, and L4 share common parameters β , and there are model linkages among L1, L2, L3, and L4. Learner L5 indicates that y(5) i follows a Bernoulli distribution with probability 0.5 and is independent of the covariates. Thus, there are no model linkages between L5 and the other learners. We again set the sample sizes for all learners to the same n for simplicity. Model Linkage Selection for Cooperative Learning Figure 7: The results for modeling based on the proposed algorithm ( greedy algorithm ), modeling of the single agent L1 ( L1 ), and combined modeling of L1 and L5 ( L1, L5 ) in the logistic regression experiment. The evaluation uses selection accuracy and prediction mean squared error, averaged over 200 replications. We compare Algorithm 1 to models that are based on the data from L1, and combined data from L1 and L5. To evaluate the performance across different methods, we calculate the selection accuracy and prediction error. For Algorithm 1, we specify an incorrect model linkage graph where all learners are connected and fit the same logistic regression model for all learners. We set a multivariate normal distribution, N(0, 4I), as the prior distribution for the regression coefficients for all learners. The results for a range of sample sizes n = {100, 150, . . . , 350}, averaged over 200 replications, are shown in Figure 7. From the left panel of Figure 7, we see that the selection accuracy converges to one as we increase the sample size n for each learner. That is, the greedy algorithm chooses not to include information from L5, even when the user-specified model linkage graph contains model linkages between L5 and the other learners. In addition, the proposed greedy algorithm yields the highest prediction error, whereas the modeling based on the joint data from L1 and L5 yields the lowest prediction error. 4.3 Logistic Regression with Data Contamination using Breast Cancer Data Data contamination is an important issue when one decides whether to incorporate information from other learners. In practice, when several data sources are collected to have the same covariates, users tend to analyze the combined dataset to leverage more information. However, if some data sources are corrupted or contaminated, it is crucial to discriminate against them and avoid incorporating information from the contaminated learners. We illustrate that Algorithm 1 is robust against data contamination on some data sources. We consider the Wisconsin Breast Cancer database (Mangasarian et al., 1995). The data consist of a response variable recording whether a cancer tissue is benign or malignant with 9 covariates from a total of 699 subjects. We randomly choose 100 samples as the test data Zhou, Ding, Tan, Tarokh for evaluating prediction accuracy. Then, the data are randomly divided into 10 learners, in which each learner has n samples. Since the data from the 10 learners L1, L2, . . . , L10 are subsamples of the original data set, we assume that the regression coefficients are the same across all learners. We then contaminate the data in L10 such that the binary response is flipped. We apply Algorithm 1 with a misspecified model linkage graph by assuming that all learners are linked among each other. For each learner, we assume a logistic regression model with an intercept and 9 covariates. For simplicity, we impose the prior distribution N(0, 42) on all regression coefficients. The prediction accuracy for the proposed method, the method that uses L1, and the method that uses the combined data L1 and L10, averaged over 200 replications, are reported in Table 1. From Table 1, we see that naively combining data from L1 and L10 will lead to a much lower prediction accuracy than the model using only data from L1. Our proposed method, on the other hand, chooses not to incorporate information from L10. Also, by combining data sources adaptively, our proposed method yields a prediction accuracy that is much higher than the model fit using data from L1 alone, for both cases of n = {25, 50}. Table 1: Performance results of the proposed approach, model of L1 alone, and joint model of L1 and L10, as evaluated by the prediction accuracy. The results are averaged over 200 replications, with n = {25, 50}. number of samples proposed method L1 L1 and L10 n = 25 0.928 0.844 0.500 n = 50 0.952 0.894 0.498 4.4 Integrating Information on Kidney Cancer Data In this section, we analyze the kidney cancer data considered in Maity et al. (2019). The kidney cancer data consist of 33 different types of tumors, with a total of n = 8108 samples and up to p = 198 proteins for the different types of tumors. Each tumor type may have a different number of proteins. To study the association between patients survival time and proteins, Maity et al. (2019) fit an accelerated failure time model with a log-normal assumption, from which they identified eight proteins that are most related to patients survival time. We now illustrate that integrating information from related cancer tumors using the proposed method can improve the prediction accuracy of patients survival time. For simplicity, we consider only patients that are not alive at the observed survival time and fit a linear regression on the log-transformed survival time. We consider three types of tumors: (i) kidney renal clear cell carcinoma (KIRC), L1, (ii) kidney renal papillary cell carcinoma (KIRP), L2, and (iii) uterine corpus endometrial carcinoma (UCEC), L3, each of which has 146, 24, and 34 samples, respectively. Moreover, we pick up three proteins PCADHERIN , GAB2 , and HER3 p Y1298 as the covariates, following (Maity et al., 2019) . The three proteins have been well studied and are all well-known for kidney tumor growth and in- Model Linkage Selection for Cooperative Learning vasion (Blaschke et al., 2002; Duckworth et al., 2016; Akbani et al., 2014). In particular, the PCADHERIN has been considered as one of the most important proteins for kidney cancer (Maity et al., 2019). Both KIRC and KIRP originate from cells in the proximal convoluted tubules of the nephron (Chen et al., 2016). Thus, it is reasonable to assume that PCADHERIN has a similar effect on the log-transformed survival time of patients with KIRC and KIRP. On the other hand, PCADHERIN is expected to have a different effect on UCEC since UCEC is a type of uterine cancer. Inspired by the above domain knowledge, we set up a model linkage graph in the following way. We assume that there are linkages among KIRP, KIRC, and UCEC by sharing the effect of PCADHERIN on survival time across three tumor types. We fit a linear regression model with a log-transformed response, namely log(y(κ) i ) = P3 j=1 x(κ) ij β(κ) j +ϵ(κ) i , where the indices i, j, and κ denote the ith subject, the jth protein, and the κth tumor type. For simplicity, we assume that the random noise is normally distributed with different variances to account for heterogeneity across different tumor types, namely ϵ(κ) i N(0, σ2 κ). Let β(1) 1 , β(2) 1 , β(3) 1 be the regression coefficient for PCADHERIN across the three cancer types, which are linked on the specified linkage graph. We compare the prediction accuracy of the proposed greedy algorithm with the model using only data from L1 (namely KIRC). To that end, we sample 20 data points from L1 such that the sample sizes across three tumor types are approximately the same. We treat the remaining data points for test purposes. The prior distributions of the regression coefficients are assumed to be standard normal, and the prior distributions for the three intercepts are assumed to follow a normal distribution with a mean of 10 and variance one. Moreover, we assume that σ2 κ Inv Gamma(2, 1) for κ = 1, 2, 3, where Inv Gamma denotes the inverse gamma distribution. We replicate the experiments above 100 times by sub-sampling 20 data observations from L1 for training the models. The results indicate that Algorithm 1 connects L2 to L1 in 65% of the replications, and L2 to L3 20% of the replications. In this example, Algorithm 1 selects different linkages across different replications due to randomness in the samples. The average prediction mean squared error (with its standard error) of the proposed method is 1.69(0.027). In comparison, the values are 1.74(0.031) if using data only from L1, and 1.66(0.024) if using the joint data from L1 and L2 throughout the 100 replications. The results imply that collaborating with L2 increases the predictive performance of L1, and Algorithm 1 tends to favor such a collaboration. 4.5 Epidemiological Data Study In this experiment, we revisit the example illustrated in Figure 2 of Section 2. Recall that L1 employs n Binomial models y(1) i Binomial(x(1) i , θ1,i), and learner L2 has poisson models y(2) i Poisson(θ1,iθ2x(2) i ), with i = 1, 2, . . . , n. The datasets for the two learners are denoted by D(1) = (y(1), x(1)) and D(2) = (y(2), x(2)). Note that the two learners are heterogeneous, and the parameters of L2 are not identifiable since θ1,iθ2 = cθ1,i c 1θ2 for any positive constant c. With a joint modeling of L1 and L2, the parameters become identifiable. We first use the data in (Maucort-Boulch et al., 2008), where each learner has n = 13 data. For L1, the population size x(1) i ranges from 37 to 700 and the empirical infection rate y(1) i /x(1) i Zhou, Ding, Tan, Tarokh Table 2: Predictive performance of using a joint modeling of L1 and L2 ( Joint ), the proposed method ( Proposed ), and a single-agent modeling ( L1 or L2 alone ), evaluated for each learner. The evaluation is based on the expected log-predictive distribution, numerically computed from 1000 out-sample data and 100 replications. Standard errors are reported in the parentheses. L1 (Binomial) L2 (Poisson) Joint Proposed L1 alone Joint Proposed L2 alone Case 1 -3.76(0.006) -3.86(0.016) -4.05(0.009) -3.76(0.002) -3.76(0.002) -3.76(0.002) Case 2 -4.29(0.033) -4.38(0.041) -4.87(0.019) -4.03(0.011) -4.03(0.011) -4.03(0.010) Case 3 -8.37(0.069) -4.58(0.135) -4.04(0.016) -3.62(0.009) -3.63(0.009) -3.63(0.009) Case 4 -3.56(0.033) -3.56(0.033) -4.05(0.016) -3.26(0.008) -3.26(0.008) -3.31(0.008) ranges from 0 to 0.2. For L2, the women-years follow up x(2) i ranges from 20000 to 550000, and the number of cancer incidences y(2) i ranges from 10 to 700. In this experiment, we divided each x(2) i by 1000 for computation efficiency, and note that such a scaling does not make an essential difference for parameter estimation. We used Uniform(0, 1) as the prior distribution for each θ1,i and Gamma(5, 1) for θ2. Algorithm 1 outputs that there is no linkage established between L1 and L2. To develop more insights into the nature of the above models and data size, we perform four cases of simulated data experiments. We use n = 13 and the identical prior distributions as in the real-data experiment. In Case 1, we generated simulated data with x(1) = (200, 12 z }| { 1000, ..., 1000) and x(2) = ( 13 z }| { 1000, ..., 1000), θ2 = 10, and θ1,i = 0.1 for i = 1, 2, . . . , n. It serves as a case with large data, which tends to produce an accurate parameter estimation. Case 2 is similar to Case 1, except that x(1) = (20, 12 z }| { 100, ..., 100) and 13 z }| { 100, ..., 100). The latter two experimental cases are based on parameters estimated from the real data. Case 3 simulates the setting where there exists no underlying linkage between two learners. In particular, we simulate y(1) i from the Binomial model with the original population size x(1) i and the empirical rate θ1,i that is calculated from the original data observations. We simulate y(2) i from the Poisson model with the expectation that equals the original count observation. In contrast, Case 4 simulates the setting where there exists an underlying linkage between two learners. In particular, we estimate a joint model of the two learners, and use the posterior of θ1,i s and θ2 to generate y(1) in Binomial models and y(2) in Poisson models, with the original population sizes x(1) and x(2). To calculate the out-sample test performance of L1 or L2, whomever is being assisted, we generate 1000 test data based on the underlying data distributions. In particular, we numerically calculate E[log p(yf | xf)] where p is the predictive distribution and (yf, xf) denotes the test data. The results are shown in Table 2. Recall that in Cases 1, 2, and 4, there exists a model linkage between L1 and L2. In Case 3, there exists no underlying Model Linkage Selection for Cooperative Learning model linkage. From the results, the test performance of the proposed method can dataadaptively approach the better performance of two options, namely with and without a linkage through the parameters θ1,i. Also, though L2 alone is not identifiable, it can improve L1 s performance in collaborative cases. 5. Conclusion and Further Remarks With the rapid growth of low-cost data collection devices and decentralized learners, data analysts are faced with an important challenge to integrate information across a set of learners with diverse data sources. However, prudently combining data sources and fitting a joint model on all data sources can lead to biased estimates with a low prediction accuracy due to misspecified models or model linkages. We proposed a general approach to enhance the predictive performance of learner L1 by robustly integrating information across a set of learners. Since information are integrated through parameter linkages by sharing parameters, the data sources do not need to be transmitted across learners. As such, the proposed method is naturally compatible with decentralized learning that involves internetof-things requiring low-energy consumption (Da Xu et al., 2014), smart sensors with limited hardware capacities (Zhou et al., 2016), and decentralized networks with limited communication bandwidths (Xiao and Luo, 2005). We showed that the proposed method is linkage selection-consistent and asymptotically prediction-efficient. The theoretical properties are established under the regime in which the number of learners M and the parameter dimensions are fixed. An interesting future problem is to study the theoretical properties of the proposed framework under the regime in which the number of learners or the dimension of parameters is allowed to diverge with the sample size. In the following, we briefly describe the connection between the proposed framework and existing methods on data integration and distributed learning. Data Integration. Integrating information from different data sources has been studied in the context of data integration (see, for instance, Tang and Song, 2016; Li and Li, 2018, and the references therein). When there is a unified model across multiple data sources, it is possible to improve statistical efficiency through parameter sharing or fitting one model using the combined data. For instance, Tang and Song (2016) employed a fused lasso approach to encourage the regression coefficients for different data sources to be similar. Li and Li (2018) developed an integrative linear discriminant analysis method by combining different data sources and showed that the classification accuracy could be improved compared with using a single data source. Some other work pre-specified constraints on latent variables to utilize heterogeneous data sources to address multiple parametric models. For example, in the study of gene regulatory networks, Jensen et al. (2007) proposed a Bayesian hierarchical model to integrate gene expression data, Ch IP binding data, and promoter sequence data to infer statistical relationships between transcription factors and genes. The uniqueness of our work compared with existing methods is that our proposed method allows for a set of learners with diverse learning objectives and distinct statistical models. Furthermore, the set of learners share information only through linked parameters of interest. Therefore, the method in Section 3 can be used to help any learner efficiently identify cooperative learners when prior information is lacking. Zhou, Ding, Tan, Tarokh Lunn et al. (2000) and Plummer (2015) also studied data integrations in the context of cut distributions, which can be seen as a probabilistic version of a two-step estimator. The main idea is to cut the propagation from uncertain models to precise models during joint learning to reduce biases propagated from incorrectly specified models (Lunn et al., 2009; Ogle et al., 2013). Jacob et al. (2017) proposed a predictive score principle for choosing the most appropriate joint modeling approach among the cut, full posterior, prior, and two-step approaches over a set of learners. It is possible to incorporate cut distribution into our proposed method to improve the finite-sample regime s performance. Nevertheless, the number of possible candidates exponentially increases with the number of learners, and thus the search space of each greedy selection step can be computationally prohibitive. Also, a systematic theoretical study of the cut distribution remains a challenging problem. Distributed Learning. Data privacy has gained much attention recently, especially in distributed learning, where data curators do not wish to share the original data. This motivates some recent advances in distributed learning such as federated learning, where a central server sends the current global statistical model to a set of selected clients, and each client updates the model parameter with local data and returns the updates to the central server (Shokri and Shmatikov, 2015; Koneˇcn y et al., 2016; Diao et al., 2020b). The objective function for federated learning is typically formulated as min θ F(θ) := κ=1 Fκ(θ), where Fκ(θ) = X (x,y) D(κ) f(θ; x, y), (3) where f denotes a global loss function, θ parameterizes a global model to learn, and D(κ) is a labeled dataset of the κth client. To optimize over θ, each client locally takes several iterations of (stochastic) gradient descent on the current parameter using its local data. Then the server takes a weighted average of the resulting parameters. Within a similar context, Jordan et al. (2019) proposed a communication-efficient surrogate likelihood framework for solving distributed statistical estimation problems, which provably improves upon simple averaging schemes. In the context of our proposed framework, consider a set of learners each holding a data source D(κ) and personal objective function fκ, and a set of optimization constraints C. A frequentist counterpart of our Bayesian approach is to minimize a proper scoring function (Dawid and Musio, 2015), e.g., the negative log-likelihood function, added with some form of regularization. The unknown parameters can be estimated by solving the following optimization problem min θ1,...,θM F(θ) := κ=1 Fκ(θκ) + R(θ1, . . . , θM), subject to C, (4) where Fκ(θ) = X (x,y) D(κ) fκ(θ; x, y), where R is a suitably chosen regularization function. The federated learning falls into the above formulation when the models are restricted to be the same among different learners, namely fκ = f, θκ = θ for all κ. Without the constraint C and regularization R, (4) is equivalent to optimizing M individual objectives separately. Model Linkage Selection for Cooperative Learning Both the developed Bayesian formulation or the frequentist counterpart in (4) can also be regarded as forms of personalized federated learning, where decentralized and heterogeneous agents participate in joint learning with peer agents to boost local learning performance. A key characteristic of personalization is that each agent has a specific local task, loss, model, and data. As such, the order of establishing linkages and the selected set of collaborative learners may depend on whom to assist. This is reflected through the asymmetric nature of the proposed Algorithm 1 in finite-sample regimes. To illustrate this point, we provide a toy example that involves three learners, each with a Gaussian model y(κ) N(µκ, 1), κ = 1, 2, 3. It can be regarded as a regression with x(κ) = 1 and unknown parameters µκ s. Suppose that G is a fully connected graph, and the prior of µκ is N(0, 102). The observed data are y(1) = 2, y(2) = 0.3, y(3) = 2, respectively. One can verify that if L1 is the one to be assisted, Algorithm 1 will link it with L2 at the first step; Then, L1 and L2 are not linked with L3, terminating the algorithm. On the other hand, if L2 will be assisted, the algorithm first links it with L3 and then stops. Consequently, L1 and L2 will not establish a linkage in that case. The above indicates the asymmetric nature of collaboration: L1 being assisted by L2 doest not mean that L1 can assist L2 in the presence of other learners. Nevertheless, it can be verified that there is no such asymmetry in the case of two learners, meaning that L1 selecting L2 is equivalent to L2 selecting L1 in Algorithm 1. Even in the context of vanilla federated learning, considerations of user misspecification or adversarial attacks are relatively new (Bhagoji et al., 2019; Jere et al., 2020), and the proposed notion of prediction efficiency and selection consistency are readily applicable. Moreover, in a general learning scenario where parameters may lose interpretability, it is still possible to build linkages among learners to reduce the overall model complexity and generalization errors. For example, Diao et al. (2019, 2020a) recently showed that appropriately restricting deep neural network parameters can significantly improve the performance of multi-modal image generation, compared with state-of-the-art methods that train an image generator separately from each data modality. From a theoretical perspective, the risk bound can be reduced by restricting the size of the function spaces through C. When each learner s predictive performance is not severely biased by other learners compared with its reduced variance, it is worth establishing a joint optimization in the form of (4). Acknowledgments The authors thank the action editor and two anonymous referees for their constructive comments. The authors thank Dr. Veera Baladandayuthapani for sharing the kidney cancer data in Maity et al. (2019). Jiaying Zhou was supported by the Army Research Laboratory and the Army Research Office under grant number W911NF-20-1-0222 and National Science Foundation under grant number ECCS-2038603. Jie Ding and Vahid Tarokh were supported by the Office of Naval Research under grant number N00014-18-1-2244. Kean Ming Tan was supported by National Science Foundation under grant numbers DMS-1949730 and DMS-2113346, and National Institutes of Health under grant number RF1-MH122833. Zhou, Ding, Tan, Tarokh Appendix A. Notation and Regularity Conditions We start with introducing some regularity conditions needed for the theoretical development. These regularity conditions are generalizations of those in Walker (1969) from scalar to multidimensional vector. Suppose that each observation (yi, xi), 1 i n, is modeled via a joint distribution with a density function p(y, x | θ) with respect to a σ-finite measure µ. Moreover, x Rk is modeled using the density function h(x) that is independent of the parameter θ = (θ1, θ2, . . . , θp)T Rp. The joint density can thus be written as p(y, x | θ) = p(y | x, θ)h(x). Let the true conditional density function of y given x be p (y | x), and thus the true joint density of (y, x) can be written as p (y, x) = p (y | x)h (x). Note that h(x) and h (x) are not necessarily the same due to potential model misspecification when modeling x. Let θ be an interior value in the parameter space Θ defined as the minimizer of the Kullback-Leibler divergence between p(y | x, θ ) and p (y | x): θ = argmax θ Rp {Y,X} log p(y, x | θ) p (y, x) p (y, x)dµ = argmax θ Rp {Y,X} [log p(y | x, θ) + log{h(x)}]p (y | x)h (x)dxdy = argmax θ Rp {X} h (x) Z {Y} log p(y | x, θ)p (y | x)dydx = argmax θ Rp {Y} log p(y | x, θ)p (y | x)dy. (5) Note that when the parametric model p(y | x, θ) is well-specified, p(y | x, θ ) = p (y | x). Let ℓ(θ) = Pn i=1 log p(yi, xi | θ) be the log-likelihood function for the n observations and let bθ = argmax θ Rp ℓ(θ) be the maximum likelihood estimator (MLE) of θ. Let In(θ) be the observed Fisher information matrix with {In(θ)}i,j = 2ℓ(θ) θi θj , and let I(θ) be the expected Fisher information matrix for a single observation (y, x) with {I(θ)}i,j = E n 2 log p(y,x|θ) o . Let {J(θ)}i,j = E{ log p(y,x|θ) θi log p(y,x|θ) θj }. We define T(n) = Θ{f(n)}, namely when n is large, there exist some fixed constants c1 and c2 such that c1f(n) |T(n)| c2f(n). Some regularity conditions that are needed in the theoretical development are listed in the following. (I) The parameter space Θ Rp is compact. (II) The set of points {Y, X} = {(y, x) : p(y, x | θ) > 0} is independent of θ. (III) If θα = θβ, then µ {(y, x) : p(y, x | θα) = p(y, x | θβ)} > 0. (IV) For all (y, x) {Y, X} and δ > 0, we have |log p(y, x|θ) log p(y, x|θ )| < Hδ(y, x, θ ) as long as θ θ 2 < δ. Here, function Hδ has the property that lim δ 0 Hδ(y, x, θ ) = 0 and that lim δ 0 {Y,X} Hδ(y, x, θ )p (y, x)dµ = 0. Model Linkage Selection for Cooperative Learning (V) If Θ is not bounded, then for any θM Θ, and sufficiently large , we have log p(y, x|θ) log p(y, x|θM) < K (y, x, θM), where θ 2 > and K has the property that {Y,X} K (y, x, θM)p (y, x)dµ < 0. (VI) The maximum likelihood estimator (MLE), denoted by bθ, exists, and the matrix In(bθ) is positive definite almost surely. (VII) The log-likelihood function log p(y, x|θ) is twice continuously differentiable with respect to θ in some neighborhood of θ . (VIII) The first and second derivatives with respect to θ, and the integral of log p(y, x|θ), are exchangeable. (IX) There exists a δ > 0 such that 2 log p(y, x | θ) θi θj 2 log p(y, x | θ ) < Mδ(y, x, θ ) for any pair (i, j) and θ θ 2 < δ, where the function Mδ satisfies {Y,X} Mδ(y, x, θ )p (y, x)dµ = 0. (X) The prior density function is continuous at θ = θ and π(θ ) > 0. (XI) If L1 has a well-specified model pθ1, and L2 has misspecified model pθ2, the model linkage (defined in Definition 2) between L1 and L2 is misspecified (Definition 4). Conditions (I) (V) ensure that when θ is an interior value of Θ, ℓ(θ) ℓ(θ ) is sufficiently small for values of θ that are not in the vicinity of θ , with probability tending to one as n . We note that Condition (I) is a stronger than necessary assumption made to simplify the technical arguments. Alternatively, one may remove this assumption and show that the parameter estimate falls into a compact set with the probability increasing to one as the sample size increases. Conditions (VI) (IX) ensure that when θ = θ , n1/2(bθ θ ) has a limiting distribution N 0, I(θ ) 1J(θ )I(θ ) 1 . Conditions (VII) (IX) assume that In(θ) is smooth in the vicinity of θ . Condition (XI) is needed to guarantee that learners with misspecified models are not included in the joint model to enhance the statistical performance of L1. Zhou, Ding, Tan, Tarokh Appendix B. Assumptions on the Scoring Rule Let y = {y1, y2, . . . , yn}T be an n-dimensional vector of the response and X = {x1, x2, . . . , xn}T Rn k be the design matrix of covariates. In the following, we state some assumptions on the score function s defined in Definition 6. (A1) sup p,y,x E[s(p, y, x)] (0, ). (A2) Assume that the model p(y|x, θ) is well-specified. Let p(ey|ex, y, X) be the Bayesian predictive distribution of ey given the new predictor ex and the data y, X. As n , E s{p(ey|ex, y, X), ey, ex} s{p(ey|ex, θ ), ey, ex} = Θ(n 1 The first assumption indicates that the expected loss is bounded by some positive constant for any given density function and prediction point. The second assumption indicates that the difference between the prediction score and oracle score is in the order of n 1/2. Many commonly used scoring rules such as the Kullback-Leibler divergence and cross-entropy satisfy the aforementioned assumptions. Appendix C. Proof of Theorems 9 10 We start with some technical lemmas that will be helpful for the proof of Theorems 9 10. Let L(θ|y, X) = p(y, X|θ) = Qn i=1 p(yi, xi|θ) be the likelihood function for the n observations and let ℓ(θ) be the log-likelihood function. Let p(y, X) = R p(y, X|θ)π(θ)dθ be the marginal likelihood of (y, X). The following lemma is a multidimensional counterpart of Walker (1969) that provides limiting properties for the maximum likelihood estimator and marginal likelihood. Lemma 11 Assume that the regularity conditions in Appendix A hold. Let θ Rp be an interior value in the parameter space Θ, and assume that the data pair (yi, xi) Rk+1 has density p(yi, xi|θ) for i = 1, 2, . . . , n. Let bθ be the maximum likelihood estimator of θ. As n , the following results hold: (i) Let N(δ) = {θ : θ θ 2 < δ} be a neighborhood of θ contained in Θ. For any positive δ, there exists a positive number k(δ) depending on δ such that sup θ Θ\N(δ) n 1 {ℓ(θ) ℓ(θ )} < k(δ) (ii) Let (bξ2) 1 = det|In(bθ)|, we have lim n n p n (bξ2) 1o = det|I(θ )|. (iii) ℓ(θ ) ℓ(bθ) = Θ(1). n p(y, X|bθ)bξ o 1 p(y, X) = (2π) p 2 π(θ ). Model Linkage Selection for Cooperative Learning Lemma 11(i) indicates that the difference between ℓ(θ) and ℓ(θ ) will be large when θ is not in the δ-neighborhood of θ . Lemma 11(ii) establishes that the determinant of the observed Fisher information matrix converges to the determinant of the expected Fisher information matrix. Lemma 11(iii) shows that the log-likelihood function evaluated at θ and bθ are at the same order. The proof of Lemma 11 is provided in Appendix D.1. Next, we present a key lemma that provides similar results as those of Lemma 11, but under the setting with non-identically distributed data that arise from two different data sources from two different learners. To this end, we define some notation. Without loss of generality, we consider two learners L1 and L2 with sample size n1 and n2, respectively. In particular, for each learner Lκ with κ = 1, 2, the user-specified density function of the data pair (y(κ) i , x(κ) i ) Rkκ+1 is denoted as p(κ) θκ (y(κ) i , x(κ) i |θκ). Let p κ(y(κ) i , x(κ) i ) be the true underlying density function. Similar to Lemma 11, let θ κ Rpκ be an interior value in the parameter space Θκ. As defined in Definition 2, let θ1,S1 = θ2,S2 = θS1,S2 Rps be the shared parameter between L1 and L2. Moreover, let θC = (θT 1, S1, θT S1,S2, θT 2, S2)T Rp C be the vector obtained by concatenating entries of θ1 and θ2 without duplication, which has a dimension of p C = p1 + p2 ps. Let θ C Rp C be its corresponding interior value in ΘC. We denote eθ1 = (θT 1, S1, θT S1,S2)T and eθ2 = (θT 2, S2, θT S1,S2)T as the parameters for L1 and L2 after incorporating information from the model linkage between the two learners. We note that eθ1 and eθ2 are equivalent to θ1 and θ2, respectively, after some reordering of the elements. The notation are simply defined to facilitate the proof of the theoretical results. Let y(κ) = (y1, y2, . . . , ynκ)T and X(κ) = (x(κ) 1 , x(κ) 2 , . . . , x(κ) nκ )T. Denote L(eθ1|y(1), X(1)), L(eθ2|y(2), X(2)), and L(θC|y(1), y(2), X(1), X(2)) as the likelihood functions of L1, L2, and (L1, L2), where L(eθ1|y(1), X(1)) = p(1) eθ1 (y(1), X(1)|eθ1), L(eθ2|y(2), X(2)) = p(2) eθ2 (y(2), X(2)|eθ2), and L(θC|y(1), y(2), X(1), X(2)) = p(C) θC (y(1), y(2), X(1), X(2)|θC) = p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2). Let ℓ1, ℓ2, and ℓC be their corresponding log-likelihood functions and let bθ1, bθ2, and bθC be the MLEs obtained from maximizing ℓ1, ℓ2, and ℓC, respectively. Let I(κ) nκ (eθκ) and I(κ)(eθκ) be the observed Fisher information matrix and expected Fisher information matrix on single observation, respectively, for κ = 1, 2. Let In(θC) be the matrix with element {In(θC)}i,j = 2ℓC(θC) θC,i θC,j , where θC,i is the ith element of θC. Next, let π(eθ1), π(eθ2), and π(θC) be the prior density of eθ1, eθ2, and θC, respectively. Let p(y(κ), X(κ)) = R p(κ) eθκ (y(κ), X(κ)|eθκ)π(eθκ)deθκ be the marginal likelihood of (y(κ), X(κ)), for κ = 1, 2. Moreover, the marginal likelihood of (y(1), y(2), X(1), X(2)) is expressed as p(y(1), y(2), X(1), X(2)) = Z p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC. We now present the results in the following lemma. Zhou, Ding, Tan, Tarokh Lemma 12 Assume that the regularity conditions in Appendix A hold. Suppose that n = n1 + n2, and n1/n c1, n2/n c2, as n , where c1, c2 (0, 1). Then, the following results hold: (i) Let NC(δ) = {θ : θ θ C 2 < δ} be a neighborhood of θ C contained in ΘC. For any δ > 0, there exists a positive number k(δ) depending on δ such that sup θ ΘC\NC(δ) n 1 {ℓC(θ) ℓC(θ C)} < k(δ) (ii) Let (bξ2 C) 1 = det |In(bθC)|, we have lim n n p C n (bξ2 C) 1o = Θ(1). (iii) ℓC(θ C) ℓC(bθC) = Θ(1). n p(C) θC (y(1), y(2), X(1), X(2)|bθC)bξC o 1 p(y(1), y(2), X(1), X(2)) = (2π) p C Lemmas 13 14 concern the magnitudes of the joint marginal likelihood and marginal likelihood for each learner, under the case when the model linkage is well-specified or misspecified. Similar results were established in the context of change point detection (Du et al., 2016). Both lemmas will be used to prove that Algorithm 1 will select the correct model linkage in each iteration of the algorithm. Lemma 13 Assume that the regularity conditions in Appendix A hold. Let θS1,S2 Rps be the shared parameter between L1 and L2 as defined in Definition 2. Let the interior value of θS1,S2 in L1 and L2 be θ 1,S1 and θ 2,S2, respectively. If θ 1,S1 = θ 2,S2, and n1/n2 = Θ(1), as m = min {n1, n2} , we have p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2)) Lemma 14 Under the same conditions as in Lemma 13, if θ 1,S1 = θ 2,S2, as min{n1, n2} , we have p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2)) C.1 Proof of Theorem 9 Proof The main idea of the proof is to show that in each iteration, the proposed algorithm will integrate information from one additional learner based on the linkage set E , and avoid incorporating information from the learner that is not in the linkage set E . Consequently, the final output of the algorithm b E = E . We start the proof at the ℓth iteration. At the ℓth iteration, let ζ(ℓ 1) be the set of learners that are already included in the joint model built from the previous ℓ 1 iterations. Let ζ(ℓ 1) C(G) {2, . . . , M}\ζ(ℓ 1) be a non-empty set of learners with at least a path to L1 as defined in the user-specified model Model Linkage Selection for Cooperative Learning linkage graph G. Note that ζ(ℓ 1) C(G) is non-empty, otherwise, the algorithm would have been terminated at the (ℓ 1)th iteration. There are two cases: (i) there is at least a j ζ(ℓ 1) C(G) and an i ζ(ℓ 1) such that the model linkage between Lj and Li is well-specified; and (ii) there are no well-specified linkages between pairs of learners in ζ(ℓ 1) C(G) and ζ(ℓ 1). Case (i): Suppose that there is a well-specified model linkage between Lj for a j ζ(ℓ 1) C(G) and Li for an i ζ(ℓ 1). By Lemma 13, Step 2 of Algorithm 1 will hold, and Lj will form a new set with ζ(ℓ 1), namely ζ(ℓ) = ζ(ℓ 1) {j}. More generally, if there are more than one learner in ζ(ℓ 1) C(G) that have well-specified linkages with learners in ζ(ℓ 1), the algorithm will select jopt = argmaxj ζ(ℓ 1) C(G) p( κ ζ(ℓ 1)y(κ)|y(j)) and form a new set ζ(ℓ) = ζ(ℓ 1) {jopt}. Case (ii): On the other hand, if the model linkages between Lj for j ζ(ℓ 1) C(G) and Li for i ζ(ℓ 1) are misspecified for all i and j, Lemma 14 ensures that Algorithm 1 will be terminated, and hence, the misspecified model linkages are not included into ζ(ℓ 1). As a result, the algorithm terminates when all well-specified linkages are included, and no misspecified linkages and misspecified models will be included, namely b E = E . This concludes the proof. C.2 Proof of Theorem 10 Proof Recall that C(G) is the set of indices recording the vertices that form a connected component with learner L1 in the user-specified graph G, and G is the true underlying graph after the statistical models for all learners and G are specified. Let G = { G = (V, E) : E E, E = E } be a set of graphs that is a subgraph of G, but E = E . Let bp C(G) be the posterior predictive distribution constructed based on learners in C(G) as defined in (1), and let Gi = (V, Ei) be a graph with edge set Ei. We consider the prediction efficiency ratio defined as follows: E{s(bp C(G ), ey, ex) s(p , ey, ex)} Pr( b E = E )E{s(bp C(G ), ey, ex) s(p , ey, ex)} + P Gi G Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)} . To prove Theorem 10, it suffices to show that Pr( b E = E ) 1 (8) and Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)} E{s(bp C(G ), ey, ex) s(p , ey, ex)} 0 (9) for all Gi G. Equation (8) is a direct consequence of the result in Theorem 1. In the remaining of the proof, we focus on establishing (9). To show (9), we consider two cases: (1) all learners in C(Gi) are well-specified and that the model linkages that form a connected component with L1 are well-specified; (2) there exist at least one learner with misspecified models or misspecified model linkages in C(Gi). Zhou, Ding, Tan, Tarokh For case (1), Assumption (A2) in Section B indicates that E{s(bp C(Gi), ey, ex) s(p , ey, ex)} = Θ(n 1/2 C(Gi)), and E{s(bp C(G ), ey, ex) s(p , ey, ex)} = Θ(n 1/2 C(G )), where n C(Gi) and n C(G ) are the sample sizes of C(Gi) and C(G ), respectively. Combining the above and the result from Theorem 9 that Pr( b E = Ei) 0 leads to (9). For case (2), consider a learner Lw C(Gi) that has misspecified model linkages with some other learners in C(Gi), denoted as C(Gi)miss. By definition, C(Gi)miss C(Gi). If b E = Ei, then Gi must include the path between Lw and the learners in C(Gi)miss, which further indicates that Pr( b E = Ei) Pr p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ)) p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) > 1 , (10) where the right side of the equation can be interpreted as the probability of including Lw in Step 2 of Algorithm 1 to build a joint model. From the proof of Lemma 14, we have p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ)) p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) = Θ{np /2 w exp( nw Cw)} (11) for some positive finite constants Cw and p , where nw is the number of samples in Lw. By an application of the Markov s inequality and (11), we have Pr p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ)) p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) > 1 (12) E p(y(w), κ C(Gi)missy(κ)|X(w), κ C(Gi)miss X(κ)) p(y(w)|X(w))p( κ C(Gi)missy(κ)| κ C(Gi)miss X(κ)) < np /2 w exp( 1 2nw Cw), (14) implying Pr( b E = Ei) < np /2 w exp( 0.5nw Cw). Due to the existence of misspecified linkages in C(Gi), Assumption (A2) is no longer applicable to bound E{s(bp C(Gi), ey, ex) s(p , ey, ex)}. We instead employ Assumption (A1) to bound E{s(bp C(Gi), ey, ex) s(p , ey, ex)} by a finite constant C . Combining the above, we have Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)} < C np /2 w exp( 1 2nw Cw). (15) By Assumption (A2) in Section B and (15), we conclude that Pr( b E = Ei)E{s(bp C(Gi), ey, ex) s(p , ey, ex)} E{s(bp C(G ), ey, ex) s(p , ey, ex)} C n(1+p )/2 w exp( 1 2nw Cw) 0 (16) as nw , where C is some positive finite constant. This concludes the proof. Model Linkage Selection for Cooperative Learning Appendix D. Proof of Lemmas 11 14 D.1 Proof of Lemma 11 Proof Proof of Lemma 11(i): Let θ Θ \ θ and let Zi = log {p(yi, xi|θ)/p(yi, xi|θ )} be the log ratio between two joint densities evaluated under θ and θ . Let E(Zi) be the expectation of Zi with respect to the density function p(yi, xi | θ ). We start with proving the following intermediate result that is helpful for the proof of Lemma 11(i): n {ℓ(θ) ℓ(θ )} < c(θ) = 1, (17) where c(θ) is a positive finite number that may depend on θ. We consider two cases when E(Zi) is finite and infinite, respectively. When E(Zi) is finite, it follows from the Jensen s inequality that E(Zi) < log E {exp(Zi)} = 0. (18) Thus, by the law of large number, we have i=1 Zi p E(Zi) < 0, implying Pr{n 1 Pn i=1 Zi 0.5E(Zi)} 0. That is, Pr{n 1 Pn i=1 Zi 0.5E(Zi)} 1. Pick c(θ) = 0.5E(Zi), and (17) is satisfied. If E(Zi) is not finite, then we have E(Zi) = . Let Z i = max{Zi, k}, where k < 0. Then E(|Z i |) < . By the strong law of large number, we obtain 1 n i=1 Z i a.s. E(Z i ). (19) As k , by the monotone convergence theorem, we obtain E(Z i ) E(Zi) = . Moreover, (19) implies lim sup n 1 n i=1 Zi = lim n sup n m i=1 Zi lim n 1 n i=1 Z i = E(Z i ) = (20) almost surely . Thus, we obtain lim sup n n 1 Pn i=1 Zi = almost surely. In other words, n 1 Pn i=1 Zi a.s. . Any positive finite number c(θ) guarantees that (17) will hold. We now apply (17) to prove Lemma 11(i) holds in some open balls, where the union of these finite number of open balls covers Θ \ N(δ). Consider θj Θ and let Nj(δj) = {θ : θ θj 2 < δj} be a ball of size δj centered at θj. By the regularity condition (IV) in Appendix A, we have sup θ Nj(δj) 1 n {ℓ(θ) ℓ(θ )} = sup θ Nj(δj) n {ℓ(θ) ℓ(θj)} + 1 n {ℓ(θj) ℓ(θ )} i=1 Hδ(yi, xi, θj) + 1 n {ℓ(θj) ℓ(θ )} . (21) Zhou, Ding, Tan, Tarokh By the weak law of large number, as n , we have lim δ 0 1 n i=1 Hδ(yi, xi, θj) p E{Hδ(yi, xi, θj)} = 0 (22) Applying (17) with θ = θj, we obtain n {ℓ(θj) ℓ(θ )} < cj where cj is a positive constant that depends on θj. Applying (22) and (23), we get the upper bound in (21), which is shown below: sup θ Nj(δj) 1 n{ℓ(θ) ℓ(θ )} < 1 n{ℓ(θj) ℓ(θ )} + 1 i=1 Hδ(yi, xi, θj) < cj Then we get sup θ Nj(δj) 1 n{ℓ(θ) ℓ(θ )} < cj thereafter. If Θ is bounded, the compact set Θ \ N(δ) can be covered by a finite number of balls, namely N1(δ1), N2(δ2), . . . , Nm(δm), centered at θ1, θ2, . . . , θm, respectively. Then Lemma 11(i) holds by (24) with k(δ) = min{c1, c2, . . . , cm}. If Θ is unbounded, we apply the same argument to the bounded compact set Θ \ {N(δ) S( )}, where S( ) = {θ : θ 2 > } for sufficiently large , from (V) in Appendix A, we have 1 n{ℓ(θ) ℓ(θ )} < 1 i=1 K (yi, xi, θ ). (25) If E{K (yi, xi, θ )} > , by the weak law of large number, we get n 1 Pn i=1 K (yi, xi, θ ) p E{K (yi, xi, θ )} < 0, thus we have 1 n {ℓ(θ) ℓ(θ )} < E {K (yi, xi, θ )} Under this case, Lemma 11(i) holds with k(δ) = min [c1, . . . , cm, E{K (yi, xi, θ )}] . If E{K (yi, xi, θ )} = , using the similar argument in (19) and (20), we can derive the conclusion that i=1 K (yi, xi, θ ) a.s. E{K (yi, xi, θ )} = . Model Linkage Selection for Cooperative Learning Lemma 11(i) still holds with k(δ) = min {c1, . . . , cm} . Proof of Lemma 11(ii): Since δ can be sufficiently small and lim n Pr n bθ θ 2 < δ o = 1, we have n 1|{(In(bθ) In(θ )}i,j| < n 1 Pn i=1 Mδ(yi, xi, θ ) from (IX) in Appendix A, the limiting property i=1 Mδ(yi, xi, θ ) = E {Mδ(yi, xi, θ )} 0 and the weak law of large number imply that i,j = lim n 1 n {In(θ )}i,j p {I(θ )}i,j , (27) Finally, by continuous mapping theorem, we obtain n pdet|In(bθ)| p det|I(θ )|. Proof of Lemma 11(iii): Recall that bθ is the maximum likelihood estimator of θ. Thus, ℓ(bθ) = 0. By a second-order Taylor expansion, for θ Θ, there exists a t [0, 1] such that ℓ(θ ) = ℓ(bθ) 1 2(θ bθ)T h In{bθ + t(θ bθ)} i (θ bθ). (28) It suffices to show (θ bθ)T h In{bθ + t(θ bθ)} i (θ bθ) = Θ(1). Since bθ p θ , by (27) in the proof of Lemma 11(ii), we have n 1[In{bθ + t(θ bθ)}]i,j p {I(θ )}i,j. Also, we have bθ = θ + Θ(n 1/2). Consequently, we obtain (θ bθ)T h In{bθ + t(θ bθ)} i (θ bθ) p n n 1 2 (bθ θ ) o T I(θ ) n n 1 2 (bθ θ ) o = Θ(1)I(θ )Θ(1) Proof of Lemma 11(iv): Recall that p(y, X|θ) = exp{ℓ(θ)}. Thus, the marginal likelihood can be written as p(y, X) = Z θ Θ π(θ) exp{ℓ(θ)}dθ = p(y, X|bθ) Z θ Θ π(θ) exp{ℓ(θ) ℓ(bθ)}dθ = p(y, X|bθ) Z θ Θ\N(δ) π(θ) exp{ℓ(θ) ℓ(bθ)}dθ + p(y, X|bθ) Z θ N(δ) π(θ) exp{ℓ(θ) ℓ(bθ)}dθ := I1 + I2. (30) It suffices to show that n p(y, X|bθ)bξ o 1 I1 p 0 and n p(y, X|bθ)bξ o 1 I2 p (2π)p/2π(θ ), respectively. Zhou, Ding, Tan, Tarokh We first show n p(y, X|bθ)bξ o 1 I1 p 0. Note that I1 = p(y, X|bθ) exp n ℓ(θ ) ℓ(bθ) o Z θ Θ\N(δ) π(θ) exp {ℓ(θ) ℓ(θ )} dθ, We start the proof by conditioning on the event exp {ℓ(θ) ℓ(θ )} exp{ nk(δ)}. Thus, Z θ Θ\N(δ) π(θ) exp {ℓ(θ) ℓ(θ )} dθ exp { nk(δ)} Z θ Θ\N(δ) π(θ)dθ exp { nk(δ)} . Multiplying I1 with n p(y, X|bθ)bξ o 1 and by (31), we obtain n p(y, X|bθ)bξ o 1 I1 exp{ℓ(θ ) ℓ(bθ)}bξ 1 exp { nk(δ)}. By Lemma 11(iii), we have ℓ(θ ) ℓ(bθ) = Θ(1). Moreover, by Lemma 11(ii) and Slutsky s Theorem, we obtain lim n bξ 1 exp { nk(δ)} = lim n n p/2(bξ 2)1/2np/2 exp { nk(δ)} = (det|I(θ )|)1/2 lim n [exp { nk(δ) + p log(n)/2}] Finally, by Lemma 11(i), the event exp {ℓ(θ) ℓ(θ )} exp{ nk(δ)} holds with probability one as n for any θ Θ \ N(δ). Combining the above, we have n p(y, X|bθ)bξ o 1 I1 p 0. Next, we show that n p(y, X|bθ)bξ o 1 I2 p (2π)p/2π(θ ). By a second-order Taylor expansion, we have ℓ(θ) = ℓ(bθ) 1 2(θ bθ)TIn{bθ + t(θ bθ)}(θ bθ) 2(θ bθ)TIn(bθ)(θ bθ) 1 2(θ bθ)T h In{bθ + t(θ bθ)} In(bθ) i (θ bθ). For notational simplicity, let Rn = 0.5(θ bθ)T h In{bθ + t(θ bθ)} In(bθ) i (θ bθ). By (33), I2 can be rewritten as I2 =p(y, X|bθ) Z θ N(δ) π(θ) exp n ℓ(θ) ℓ(bθ) o dθ =p(y, X|bθ) Z θ N(δ) π(θ) exp 1 2(θ bθ)TIn(bθ)(θ bθ) Rn Under (X) in Appendix A, given any ϵ > 0, let ϵ = 2ϵ /π(θ ), since the prior function π(θ) is continuous around θ , thus, we can choose a δ such that |π(θ) π(θ )| < ϵ < ϵπ(θ ) if θ N(δ). (35) Model Linkage Selection for Cooperative Learning θ N(δ) exp 1 2(θ bθ)TIn(bθ)(θ bθ) Rn By (35), we obtain (1 ϵ)π(θ )I3 < n p(y, X|bθ) o 1 I2 < (1 + ϵ)π(θ )I3. (37) It suffices to obtain lower and upper bounds for bξ 1I3. We divide the derivation of upper and lower bounds for bξ 1I3 into two parts, we first show that bξ 1 R θ N(δ) exp n 1 2(θ bθ)TIn(bθ)(θ bθ) o dθ (2π)p/2, and prove later that |Rn| < ϵ for some proper δ. Recall that bξ 1 = det |In(bθ)|1/2, and by the regularity condition (VI) in Appendix A, {In(bθ)} 1 exists since In(bθ) is non-singular. We have θ N(δ) exp 1 2(θ bθ)TIn(bθ)(θ bθ) dθ 2(θ bθ)TIn(bθ)(θ bθ) o (2π)pdet|{In(bθ)} 1| dθ. (38) Because the part inside the integral is the density function of multivariate normal distribution θ N(bθ, {In(bθ)} 1), we have that (38) will be less than (2π)p Z exp n 1 2(θ bθ)TIn(bθ)(θ bθ) o (2π)pdet|{In(bθ)} 1| dθ = (2π) p 2 . (39) Moreover, the symmetry property of In(bθ) indicates that there exists a p p matrix V such that In(bθ) = V TV . Change variable θ = V θ, (38) will be greater than 2(θ V bθ)T(θ V bθ) o (2π)pdet|{In(bθ)} 1| det|V 1|dθ 2(θ V bθ)T(θ V bθ) o where N (δ ) = {θ : θ V θ 2 < δ }, and δ is determined by δ = min θ: θ θ 2=δ V θ V θ 2. We show that δ = min θ: θ θ 2=δ V θ V θ 2 . Let δ(θ) = V θ V θ 2 under the condition that θ θ 2 = δ, if δ(θ) < , then V θ V θ and θ θ are both elementwise finite, and their lengths are both finite number p, we conclude from above that det |(V θ V θ )(θ θ )T| < . However, det |V | = (det |In(bθ)|) 1 2 from Lemma 11(ii), thus det |(V θ V θ )(θ θ )T| = det |V | det |(θ θ )(θ θ )T| , which is a contradiction. All the above indicates that δ(θ) for any θ θ 2 = δ, which Zhou, Ding, Tan, Tarokh concludes δ = min θ: θ θ 2=δ V θ V θ 2 . Therefore, we have (40) converging to (2π)p/2 in probability. The conclusion bξ 1 R θ N(δ) exp n 1 2(θ bθ)TIn(bθ)(θ bθ) o dθ (2π)p/2 is then derived. We next derive the upper bound of |Rn|. By the triangle inequality, we have 1 n|Rn| = 1 1 2(θ bθ)T[In{bθ + t(θ bθ)} In(bθ)](θ bθ) (θ bθ)T[In{bθ + t(θ bθ)} In(θ )](θ bθ) (θ bθ)T n In(θ ) In(bθ) o (θ bθ) . To further derive the upper bound of (41), consider the length p vector b = (b1, b2, . . . , bp)T and p p matrix A with {A}i,j = ai,j, i, j {1, 2, . . . , p}. Let g(A, b) = tr(b TAb), this function can be formalized as g(A, b) = tr(bb TA) = Pp j=1 Pp i=1 bibjai,j. Let b = θ bθ and A = In{bθ+t(θ bθ)} In(θ ), we have b 2 = (θ θ )+(θ bθ) 2 θ θ 2+ θ bθ 2 δ in probability, because θ N(δ) and bθ θ = Θ(n 1/2). Thus, |bi| δ for i = 1, 2, . . . , p. We can get the inequality that |g(A, b)| δ2 Pp j=1 Pp i=1 |ai,j|. From triangle inequality we have bθ +t(θ bθ) θ 2 t θ θ 2 +(1 t) bθ θ 2 p tδ δ, thus bθ +t(θ bθ) N(δ) in probability. From (IX) in Appendix A, we have |ai,j| = |[In{bθ + t(θ bθ)} In(θ )]i,j| Pn i=1 Mδ(yi, xi, θ ). Thus we have b TAb δ2 Pp j=1 Pp k=1 Pn i=1 Mδ(yi, xi, θ ). Note that this inequality also holds when A = In(θ ) In(bθ), since bθ θ = Θ(n 1/2), which indicates bθ N(δ) almost surely, thus (IX) in Appendix A can be applied and get the same conclusion as well. Based on (41) and the weak law of large number, n 1|Rn| is less than i=1 Mδ(yi, xi, θ ) p p2δ2E {Mδ(yi, xi, θ )} when n . (42) Under (IX) in Appendix A, lim δ 0 E {Mδ(yi, xi, θ )} = 0, given the condition that δ is chosen to make (35) hold, then for any ϵ > 0, if δ is also chosen such that E {Mδ(yi, xi, θ )} < ϵ 2np2δ2 . sup θ N(δ) |Rn| < ϵ Hence, we get the conclusion lim n Pr n (2π) p 2 exp( ϵ) < bξ 1I3 < (2π) p 2 exp(ϵ) o = 1. (44) Since δ can be chosen so that (44) and (37) both hold for arbitrary small ϵ, we deduce the result lim n Pr (2π) p 2 π(θ )(1 ϵ) exp( ϵ) < n p(y, X|bθ)bξ o 1 I2 < (2π) p 2 π(θ )(1 + ϵ) exp(ϵ) = 1, Model Linkage Selection for Cooperative Learning which leads to n p(y, X|bθ)bξ o 1 I2 p (2π)p/2π(θ ) when n . Then we get the conclusion n (y, X|bθ)bξ o 1 p(y, X) = (2π) p 2 π(θ ). (45) D.2 Proof of Lemma 12 Proof We start with the proof of Lemma 12(i). Recall that ΘC is the parameter space of θC, and eθ1 = (θT 1, S1, θT S1,S2)T and eθ2 = (θT 2, S2, θT S1,S2)T as the parameters for L1 and L2 after incorporating information from the model linkage between the two learners. Let Nκ(δκ) = {θ : θ θ κ 2 < δκ} be the neighborhood of θ κ, and let Θ1,δ1 = {θC : eθ1 N1(δ1), θ2, S2 = θ 2, S2} and Θ2,δ2 = {θC : eθ2 N2(δ2), θ1, S1 = θ 1, S1}. The parameter space ΘC can be divided into Θ1,δ1, Θ2,δ2, and ΘC \ (Θ1,δ1 Θ2,δ2). For a fixed δ, there exist δ1 and δ2, such that Θκ,δκ NC(δ), κ = 1, 2. Lemma 11(i) indicates that sup eθκ Θκ\Nκ(δκ) n 1 κ n ℓκ(eθκ) ℓκ(θ κ) o < kκ(δκ) for some positive functions kκ(δκ), κ = 1, 2. Let k(δ) = c1 k1(δ1) + c2 k2(δ2). Then, sup θC ΘC\NC(δ) n 1 {ℓC(θC) ℓC(θ C)} (46) sup θC ΘC\(Θ1,δ1 Θ2,δ2) n 1 {ℓC(θC) ℓC(θ C)} (47) sup eθ1 Θ1\N1(δ1) n 1 n ℓ1(eθ1) ℓ1(θ 1) o + sup eθ2 Θ2\N2(δ2) n 1 n ℓ2(eθ2) ℓ2(θ 2) o (48) = c1 sup eθ1 Θ1\N1(δ1) n 1 1 n ℓ1(eθ1) ℓ1(θ 1) o + c2 sup eθ2 Θ2\N2(δ2) n 1 2 n ℓ2(eθ2) ℓ2(θ 2) o (49) c1 { k1(δ1)} + c2 { k2(δ2)} (50) = k(δ), (51) where (48) holds by the fact that ℓC(θC) ℓC(θ C) = {ℓ1(eθ1) ℓ1(θ 1)} + {ℓ2(eθ2) ℓ2(θ 2)} and (50) holds by applications of Lemma 11(i). Therefore, Lemma 12(i) is proved. For Lemma 12(ii), we prove that n 1In(bθC) is positive definite by showing that n 1In(bθC) can be rewritten as a sum of two matrices, namely n 1In(bθC) = M1 + M2, where M1, M2 are positive definite matrices. The proof can then be concluded by the fact that all elements in M1 and M2 are bounded by some finite constants. Recall that θC = (θT 1, S1, θT S1,S2, θT 2, S2)T Rp C. Let bθC,κ be the MLE of eθκ obtained by maximizing ℓC, κ = 1, 2. Let λκ be the smallest eigenvalue in n 1 κ I(κ) nκ (bθC,κ). By Condition VI in Appendix A (VI, the positive definite property of I(κ) nκ (bθC,κ) indicates that λκ > 0, κ = 1, 2. In the following proof, we will focus on constructing M1. Construction of M2 is similar as M1 and is omitted. We now define the elements in M1. For i, j p1, let {M1}i,j = c1 1 n1 {I(1) n1 (bθC,1)}i,j c1 Zhou, Ding, Tan, Tarokh for i = j and i p1 ps, and let {M1}i,j = c1 1 n1 {I(1) n1 (bθC,1)}i,j (53) for the other elements. When p1 + 1 i p1 + p2 ps or p1 + 1 j p1 + p2 ps, all the elements are zeros except when i = j, we set {M1}i,j = c2 By construction, M1 is a 2 2 block diagonal matrix, where each block matrix is positive definite. The upper left diagonal block is the difference between c1n 1 1 {I(1) n1 (bθC,1)} and a diagonal matrix, with the first p1 ps diagonal entries are set to equal 0.5c1λ1. The bottom right block matrix is a diagonal matrix with all diagonal entries equaling 0.5c2λ2. Thus M1 is positive definite. The matrix M2 is constructed in a similar fashion and can be shown to be positive definite. We now proceed to prove the limiting property of n 1 1 I(1) n1 (bθC,1). We have 1 n1 |{I(1) n1 (bθC,1) I(1) n1 (θ 1)}i,j| < 1 i=1 Mδ1(y(1) i , x(1) i , θ 1) (55) from (IX) in Appendix A. Moreover, the limiting property lim n1 1 n1 i=1 Mδ1(y(1) i , x(1) i , θ 1) = E n Mδ1(y(1) i , x(1) i , θ 1) o 0 (56) implies that lim n1 1 n1 n I(1) n1 (bθC,1) o i,j = lim n1 1 n1 n I(1) n1 (θ 1) o i,j = n I(1)(θ 1) o Equations (52) (57) imply that M1 and M2 are positive definite and elementwise finite. Combining this with the continuous mapping theorem, we obtain the conclusion that det |M1 + M2| = det |n 1In(bθC)| = Θ(1), which concludes Lemma 12(ii). Lemma 12(iii) is implied by Lemma 11(iii). The proof of Lemma 12(iv) is similar to the proof of Lemma 11(iv), and is omitted. D.3 Proof of Lemma 13 Proof Recall from Lemma 11(iv) and Lemma 12(iv) that (bξ2 1) 1 = det|I(1) n1 (bθ1)|, (bξ2 2) 1 = det|I(2) n2 (bθ2)|, and (bξ2 C) 1 = det|In(bθC)| with n = n1 + n2. Let h1(X(1)) and h2(X(2)) be the density function of X(1) and X(2), respectively. Since the covariates between two learners Model Linkage Selection for Cooperative Learning are independent, as m , we have p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2)) = p(y(1), y(2), X(1), X(2))/h1(X(1))h2(X(2)) p(y(1), X(1))/h1(X(1)) p(y(2), X(2))/h2(X(2)) = p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2)) p(C) θC (y(1), y(2), X(1), X(2))|bθC) p(1) eθ1 (y(1), X(1)|bθ1)p(2) eθ2 (y(2), X(2)|bθ2) bξC bξ1bξ2 where the third equality holds by an application of Lemma 11(iv) and Lemma 12(iv). Since the model linkages are well-specified, the joint density can be factored as p(C) θC (y(1), y(2), X(1), X(2)|θ C) = p(1) eθ1 (y(1), X(1)|θ 1)p(2) eθ2 (y(2), X(2)|θ 2). Thus, we have p(C) θC (y(1), y(2), X(1), X(2)|bθC) p(1) eθ1 (y(1), X(1)|bθ1)p(2) eθ2 (y(2), X(2)|bθ2) = p(C) θC (y(1), y(2), X(1), X(2)|bθC) n p(C) θC (y(1), y(2), X(1), X(2)|θ C) o 1 p(1) eθ1 (y(1), X(1)|bθ1) n p(1) eθ1 (y(1), X(1)|θ 1) o 1 p(2) eθ2 (y(2), X(2)|bθ2) n p(2) eθ2 (y(2), X(2)|θ 2) o 1 = Θ(1), (59) where the second equality holds by Lemma 11(iii). Also, Lemma 11(ii) and Lemma 12(ii) imply that bξC bξ1bξ2 = bξC(n1 + n2) p1+p2 ps 2 2 (n1 + n2) p1+p2 ps 2 2 (n1 + n2) p1+p2 ps Substituting (59) and (60) into (58) concludes the proof. D.4 Proof of Lemma 14 Proof From the proof of Lemma 13, we have p(y(1), y(2)|X(1), X(2)) p(y(1)|X(1))p(y(2)|X(2)) = p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2)), and it remains to show that p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2)) Zhou, Ding, Tan, Tarokh By definition, p(y(1), y(2), X(1), X(2)) = R ΘC p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC. Choose a δ > 0 such that N1(δ) and N2(δ) are non-overlapping neighborhoods of θ 1 and θ 2, respectively. We split p(y(1), y(2), X(1), X(2)) into three integrals, I1, I2, and I3, taken on sets Θ1,δ, Θ2,δ, and ΘC \ (Θ1,δ Θ2,δ), where Θ1,δ = {θC : eθ1 N1(δ)} and Θ2,δ = {θC : eθ2 N2(δ)}. Note that eθ1 θC and eθ2 θC For the first integral, we have Θ1,δ p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC = p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ2(θ 2) ℓ2(bθ2) o Θ1,δ bξ 1 2 exp n ℓ2(eθ2) ℓ2(θ 2) o p(1) eθ1 (y(1), X(1)|eθ1)π(θC)dθC. Since eθ2 / N1(δ) in (62), according to Lemma 11(i), the integral on the right hand side in (62) is less than Θ1,δ bξ 1 2 exp n ℓ2(eθ2) ℓ2(θ 2) o p(1) eθ1 (y(1), X(1)|eθ1)π(θC)dθC bξ 1 2 exp{ n2k2(δ)} Z Θ1,δ p(1) eθ1 (y(1), X(1)|eθ1)π(θC)dθC bξ 1 2 exp{ n2k2(δ)} Z Θ1 p(1) eθ1 (y(1), X(1)|eθ1)π(eθ1)deθ1 = {np2 2 bξ2 2} 1 2 2 exp{ n2k2(δ)}p(y(1), X(1)) with probability tending to 1 as n2 . From Lemmas 11(ii) (iv), as n2 , we have {np2 2 bξ2 2} 1 2 p (det |I(2)(θ 2)|) 1 2 ; exp n ℓ2(θ 2) ℓ2(bθ2) o Θ(1); {p(2) eθ2 (y(2), X(2)|bθ2)bξ2} 1p(y(2), X(2)) p (2π) p2 It follows that I1 p(y(1), X(1))p(y(2), X(2)) = Θ(n 2 2 exp{ n2k2(δ)}) p 0. (65) Using a similar argument for I2, as n1 , we have I2 p(y(1), X(1))p(y(2), X(2)) = Θ(n 2 1 exp{ n1k1(δ)}) p 0. (66) Model Linkage Selection for Cooperative Learning For the integral I3, we apply a similar argument as in the proof of I1 and I2. Specifically, ΘC\{Θ1,δ Θ2,δ} p(1) eθ1 (y(1), X(1)|eθ1)p(2) eθ2 (y(2), X(2)|eθ2)π(θC)dθC = p(1) eθ1 (y(1), X(1)|bθ1)bξ1p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ1(θ 1) ℓ1(bθ1) o exp n ℓ2(θ 2) ℓ2(bθ2) o ΘC\{Θ1,δ Θ2,δ} bξ 1 1 bξ 1 2 exp n ℓ1(eθ1) ℓ1(θ 1) o exp n ℓ2(eθ2) ℓ2(θ 2) o π(θC)dθC, p(1) eθ1 (y(1), X(1)|bθ1)bξ1p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ1(θ 1) ℓ1(bθ1) o exp n ℓ2(θ 2) ℓ2(bθ2) o bξ 1 1 bξ 1 2 ΘC exp n ℓ1(eθ1) ℓ1(θ 1) o exp n ℓ2(eθ2) ℓ2(θ 2) o π(θC)dθC. Since the region ΘC \ {Θ1,δ Θ2,δ} contains neither the neighborhood of θ 1 nor the neighborhood of θ 2, by an application of Lemma 11(i), we have I3 p(1) eθ1 (y(1), X(1)|bθ1)bξ1p(2) eθ2 (y(2), X(2)|bθ2)bξ2 exp n ℓ1(θ 1) ℓ1(bθ1) o exp n ℓ2(θ 2) ℓ2(bθ2) o bξ 1 1 bξ 1 2 exp { n1k1(δ)} exp { n2k2(δ)} . Using an argument similar to that of I1 and Lemmas 12(ii) (iv), we have I3 p(y(1), X(1))p(y(2), X(2)) = Θ(n 2 2 exp{ n1k1(δ) n2k2(δ)}) p 0. (67) Combining the above, we have p(y(1), y(2), X(1), X(2)) p(y(1), X(1))p(y(2), X(2)) = I1 + I2 + I3 p(y(1), X(1))p(y(2), X(2)) This concludes the proof of Lemma 14. Rehan Akbani, Kwok shing Ng, Henrica Werner, Maria Shahmoradgoli, Fan Zhang, Zhenlin Ju, Wenbin Liu, Ji-Yeon Yang, Kosuke Yoshihara, Jun Li, Shiyun Ling, Elena Seviour, Prahlad Ram, John Minna, Lixia Diao, Pan Tong, John Heymach, Steven Hill, Frank Dondelinger, and Gordon Mills. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature Communication, 5(1):1 15, 05 2014. doi: 10.17863/CAM.25567. Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, pages 634 643. PMLR, 2019. Marta Blangiardo, Anna Hansell, and Sylvia Richardson. A Bayesian model of time activity data to investigate health effect of air pollution in time series studies. Atmospheric Environment, 45(2):379 386, 2011. Zhou, Ding, Tan, Tarokh Sabine Blaschke, Claudia M uller, Jasmina Markovic-Lipkovski, Sabine Puch, Nicolai Miosge, Volker Becker, Gerhard Mueller, and Gerd Klein. Expression of cadherin-8 in renal cell carcinoma and fetal kidney. International Journal of Cancer, 101(4):327 334, 10 2002. doi: 10.1002/ijc.10623. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2011. Guangzhen Cao and Ya Qiu Jin. A hybrid algorithm of the BP-ANN/GA for classification of urban terrain surfaces with fused data of Landsat ETM+ and ERS-2 SAR. International Journal of Remote Sensing, 28(2):293 305, 2007. GC Cawley and NLC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079 2107, July 2010. Fengju Chen, Yiqun Zhang, Yasin S enbabao glu, Giovanni Ciriello, Lixing Yang, Ed Reznik, Brian Shuch, Goran Micevic, Guillermo De Velasco, and Eve Shinbrot. Multilevel genomics-based taxonomy of renal cell carcinoma. Cell Reports, 14(10):2476 2489, 2016. Kevin A. Clarke. The phantom menace: omitted variable bias in econometric research. Conflict Management and Peace Science, 22(4):341 352, 2005. doi: 10.1080/ 07388940500339183. Li Da Xu, Wu He, and Shancang Li. Internet of things in industries: a survey. IEEE Transactions on Industrial Informatics, 10(4):2233 2243, 2014. Patrick Danaher, Pei Wang, and Daniela M Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B, 76(2):373 397, 2014. A Philip Dawid and Monica Musio. Bayesian model selection based on proper scoring rules. Bayesian Analysis, 10(2):479 499, 2015. Enmao Diao, Jie Ding, and Vahid Tarokh. Restricted recurrent neural networks. 2019 IEEE International Conference on Big Data, 2019. Enmao Diao, Jie Ding, and Vahid Tarokh. Multimodal controller for generative models. ar Xiv preprint ar Xiv:2002.02572, 2020a. Enmao Diao, Jie Ding, and Vahid Tarokh. Hetero FL: Computation and communication efficient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2020b. Jie Ding, Vahid Tarokh, and Yuhong Yang. Model selection techniques an overview. IEEE Signal Processing Magazine , 35(6):16 34, 2018. Ian Domowitz and Halbert White. Misspecified models with dependent observations. Journal of Econometrics, 20(1):35 58, October 1982. Model Linkage Selection for Cooperative Learning Chao Du, Chu-Lan Michael Kao, and S. C. Kou. Stepwise signal extraction via marginal likelihood. Journal of the American Statistical Association, 111(513):314 330, 2016. C Duckworth, L Zhang, Steven Carroll, S Ethier, and H Cheung. Overexpression of GAB2 in ovarian cancer cells promotes tumor growth and angiogenesis by upregulating chemokine expression. Oncogene, 35(31):4036 4047, 2016. Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359 378, 2007. Pi Guo, Jianjun Zhang, Li Wang, Shaoyi Yang, Ganfeng Luo, Changyu Deng, Ye Wen, and Qingying Zhang. Monitoring seasonal influenza epidemics by using internet search data with an ensemble penalized regression model. Scientific Reports, 7(1):46469, 2017. Pierre E Jacob, Lawrence M Murray, Chris C Holmes, and Christian P Robert. Better together? statistical learning in models made of modules. ar Xiv preprint ar Xiv:1708.08719, 2017. Shane T Jensen, Guang Chen, and Christian J Stoeckert Jr. Bayesian variable selection and data integration for biological regulatory networks. The Annals of Applied Statistics, 1(2):612 633, 2007. Malhar S Jere, Tyler Farnan, and Farinaz Koushanfar. A taxonomy of attacks on federated learning. IEEE Security & Privacy, 19(2):20 28, 2020. Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668 681, 2019. Jakub Koneˇcn y, H Brendan Mc Mahan, Felix X Yu, Peter Richt arik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: strategies for improving communication efficiency. ar Xiv preprint ar Xiv:1610.05492, 2016. Fanjie Kong, Xiaobing Li, Hong Wang, Dengfeng Xie, Xiang Li, and Yunxiao Bai. Land cover classification based on fused data from GF-1 and MODIS NDVI time series. Remote Sensing, 8(9):741, 2016. Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18(1):4404 4446, 2017a. Jason D Lee, Qiang Liu, Yuekai Sun, and Jonathan E Taylor. Communication-efficient sparse regression. Journal of Machine Learning Research, 18(1):115 144, 2017b. Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient distributed optimization in networks with gradient tracking. ar Xiv preprint ar Xiv:1909.05844, 2019. Quefeng Li and Lexin Li. Integrative linear discriminant analysis with guaranteed error rate improvement. Biometrika, 105(4):917 930, 2018. Zhou, Ding, Tan, Tarokh Dungang Liu, Regina Y Liu, and Minge Xie. Multivariate meta-analysis of heterogeneous studies using only summary statistics: efficiency and robustness. Journal of the American Statistical Association, 110(509):326 340, 2015. Fei Liu, MJ Bayarri, and JO Berger. Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1):119 150, 2009. David Lunn, Nicky Best, David Spiegelhalter, Gordon Graham, and Beat Neuenschwander. Combining MCMC with sequential PKPD modeling. Journal of Pharmacokinetics and Pharmacodynamics, 36(1):19, 2009. David J Lunn, Andrew Thomas, Nicky Best, and David Spiegelhalter. Win BUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10(4):325 337, 2000. Jing Ma and George Michailidis. Joint structural estimation of multiple graphical models. Journal of Machine Learning Research, 17(1):5777 5824, 2016. Arnab Maity, Anirban Bhattacharya, Bani Mallick, and Veerabhadran Baladandayuthapani. Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. Biometrics, 76(1):316 325, 08 2019. doi: 10.1111/biom.13132. Olvi L. Mangasarian, W. Nick Street, and William H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570 577, 1995. doi: 10.1287/opre.43.4.570. Delphine Maucort-Boulch, Silvia Franceschi, and Martyn Plummer. International correlation between human papillomavirus prevalence and cervical cancer incidence. Cancer Epidemiology and Prevention Biomarkers, 17(3):717 720, 2008. H Brendan Mc Mahan, Eider Moore, Daniel Ramage, and Seth Hampson. Communicationefficient learning of deep networks from decentralized data. ar Xiv preprint ar Xiv:1602.05629, 2016. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In International Conference on Machine Learning, pages 689 696, 2011. Kiona Ogle, Jarrett Barber, and Karla Sartor. Feedback and modularization in a Bayesian meta analysis of tree traits affecting forest dynamics. Bayesian Analysis, 8(1):133 168, 2013. Matthew Parry. Linear scoring rules for probabilistic binary classification. Electronic Journal of Statistics, 10(1):1596 1607, 2016. Matthew Parry, A Philip Dawid, and Steffen Lauritzen. Proper local scoring rules. Annals of Statistics, 40(1):561 592, 2012. Model Linkage Selection for Cooperative Learning Martyn Plummer. Cuts in Bayesian graphical models. Statistics and Computing, 25(1): 37 43, Jan 2015. ISSN 1573-1375. doi: 10.1007/s11222-014-9503-z. Stephane Shao, Pierre E Jacob, Jie Ding, and Vahid Tarokh. Bayesian model comparison with the hyv arinen score: Computation and consistency. Journal of the American Statistical Association, 114(528):1826 1837, 2019. Jieli Shen, Regina Y Liu, and Min-Ge Xie. i Fusion: Individualized fusion learning. Journal of the American Statistical Association, 115(531):1251 1267, 2019. Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750 1761, 2014. Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310 1321. ACM, 2015. MC Simmonds and JPT Higgins. Covariate heterogeneity in meta-analysis: criteria for deciding between meta-regression and individual patient data. Statistics in Medicine, 26 (15):2982 2999, 2007. Uthayasankar Sivarajah, Muhammad Mustafa Kamal, Zahir Irani, and Vishanth Weerakkody. Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70:263 286, 2017. Lu Tang and Peter XK Song. Fused lasso approach in regression coefficients clustering: learning parameter heterogeneity in data integration. Journal of Machine Learning Research, 17(1):3915 3937, 2016. Lu Tang, Ling Zhou, and Peter XK Song. Fusion learning algorithm to combine partially heterogeneous cox models. Computational Statistics, 34(1):395 414, 2019. Edward F Vonesh, Tom Greene, and Mark D Schluchter. Shared parameter models for the joint analysis of longitudinal data and event times. Statistics in Medicine, 25(1):143 163, 2006. AM Walker. On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society: Series B, 31(1):80 88, 1969. Xinran Wang, Yu Xiang, Jun Gao, and Jie Ding. Information laundering for model privacy. In International Conference on Learning Representations, 2021. Xiaoquan Wen and Matthew Stephens. Bayesian methods for genetic association analysis with heterogeneous subgroups: from meta-analyses to gene-environment interactions. The Annals of Applied Statistics, 8(1):176, 2014. Xun Xian, Xinran Wang, Jie Ding, and Reza Ghanadan. Assisted learning: A framework for multi-organization learning. In Advances in Neural Information Processing Systems, pages 14580 14591, 2020. Zhou, Ding, Tan, Tarokh Jin-Jun Xiao and Zhi-Quan Luo. Universal decentralized detection in a bandwidthconstrained sensor network. IEEE Transactions on Signal Processing, 53(8):2617 2624, 2005. Shi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang Chun Woo. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802 810, 2015. Shihao Yang, Mauricio Santillana, and SC Kou. Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences of the United States of America, 112(47):14473 14478, 2015. Chenglong Ye, Jie Ding, and Reza Ghanadan. Meta clustering for collaborative learning. ar Xiv preprint ar Xiv:2006.00082, 2020. Qing Zhou, Di Li, Soummya Kar, Lauren M Huie, H Vincent Poor, and Shuguang Cui. Learning-based distributed detection-estimation in sensor networks with unknown sensor defects. IEEE Transactions on Signal Processing, 65(1):130 145, 2016. Richard Zuech, Taghi M Khoshgoftaar, and Randall Wald. Intrusion detection and big heterogeneous data: a survey. Journal of Big Data, 2(1):3, 2015.