# multitask_learning_with_summary_statistics__a22b1683.pdf Multi-Task Learning with Summary Statistics Parker Knight Department of Biostatistics Harvard University Boston, MA pknight@g.harvard.edu Rui Duan Department of Biostatistics Harvard University Boston, MA rduan@hsph.harvard.edu Multi-task learning has emerged as a powerful machine learning paradigm for integrating data from multiple sources, leveraging similarities between tasks to improve overall model performance. However, the application of multi-task learning to real-world settings is hindered by data-sharing constraints, especially in healthcare settings. To address this challenge, we propose a flexible multi-task learning framework utilizing summary statistics from various sources. Additionally, we present an adaptive parameter selection approach based on a variant of Lepski s method, allowing for data-driven tuning parameter selection when only summary statistics are available. Our systematic non-asymptotic analysis characterizes the performance of the proposed methods under various regimes of the sample complexity and overlap. We demonstrate our theoretical findings and the performance of the method through extensive simulations. This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction and many other fields. 1 Introduction The growing availability of extensive and intricate datasets presents an opportunity to integrate data from multiple sources. Multi-task learning has emerged as a promising machine learning approach that enables the simultaneous learning of multiple related models, leveraging shared structure between tasks to enhance the performance on each task individually [36, 35]. In healthcare and biomedical research, the practical application of multi-task learning is often hindered by data-sharing constraints, which stem from concerns about the ownership and privacy of individual-level data [32, 8]. Patient data in these domains is typically sensitive and less likely to be publicly available or shared across study sites, limiting researchers access to individual-level data from different domains. To overcome this limitation, researchers have increasingly integrated summary statistics into analysis pipelines as a substitute for individual-level data [22, 13, 12]. Summary statistics are straightforward, interpretable measures derived from raw data that can offer insights into data distribution, variability, and relationships among variables. Furthermore, they can be aggregated across studies to facilitate data integration and reused in various research projects. Recently, the use of summary statistics has garnered interest in healthcare and biomedical research. For example, many genetic risk prediction methods rely on summary-level statistics such as associations from Genome-wide Association Studies (GWAS), Linkage Disequilibrium estimations (LD), and minor allele frequencies (MAFs) [6]. These summary statistics can help predict an individual s likelihood of developing specific diseases based on their genetic profile. Inspired by a potential use case in genetic risk prediction, we propose a multi-task learning framework that enables simultaneous learning of multiple genetic risk prediction models using only publicly available summary statistics. Our proposed framework can be used in the context of predicting genetic risks for multiple traits leveraging potentially shared genetic pathways, and can also be used 37th Conference on Neural Information Processing Systems (Neur IPS 2023). to develop trans-ethnic genetic risk prediction models that account for potential heterogeneity across populations, improving generalizability and real-world applicability. Beyond genetic risk prediction, the ability to learn from summary statistics offers a versatile tool for developing models across a wide range of domains, including healthcare, finance, and marketing. To summarize, the contributions of this work are threefold: First, we propose a flexible multi-task learning framework which allows training multiple models simultaneously using basic summary statistics characterizing marginal relationship between outcomes and features, which are often publicly available. We allow summary statistics corresponding to each task to be generated from distinct or potentially overlapping samples. Secondly, we conducted a systematic non-asymptotic analysis which characterizes how the performance of the proposed methods are influenced by the characteristics of summary statistics. In particular, we show that there are multiple regimes of performance depending on the sample complexity of the source datasets and their overlap. The theoretical results are supported with extensive simulations. Lastly, We propose an adaptive scheme for tuning parameter selection based on the variant of Lepski s method [14] given in [5]. This allows us to select a data-driven tuning parameter when only summary statistics are available and cross-validation is not feasible. We prove that tuning parameters chosen by this method satisfy an oracle inequality with high probability, and demonstrate the effectiveness of the method via simulations. 1.1 Related work The use of summary statistics for regression modeling has been considered in the statistical genetics literature [6]. The lassosum method for polygenic risk prediction was introduced by [22], which considered fitting a L1 penalized linear regression with summary statistics, and its theoretical properties were studied in depth by [15]. In [4], the authors extend these ideas to polygenic risk prediction with binary traits. The summary statistics used in these methods include the marginal associations between genetic variants and phenotypes, and statistics summarizing the covariance structures among all genetic variants oftentimes derived from a reference genotype dataset. Empirical studies have demonstrated that the efficacy of such models is significantly influenced by the choices of the GWAS summary statistics and the reference dataset [30]. However, there s still limited theoretical understanding regarding how the overlap of samples and the inherent heterogeneity between datasets impact the model performance. Moreover, most current approaches devise models for a single trait within a single ancestral population. Considering shared genetic architectures could potentially enhance performance by employing a multi-task learning strategy [23].The authors of [24] take this approach, and describe a multi-task estimator for multi-ancestry p QTL analysis. However, they do not consider the setting when only summary statistics are available for each task. Our methods build upon classical multi-task learning techniques, and enable fitting models only using basic summary statistics which are often made publicly available. The sparse regularized estimator extends the group-sparse estimators studied in [19, 18], while the nuclear norm estimator expands on the low-rank regression model described in [25]. The nuclear norm approach is closely related to the linear representation learning problem [7, 31], which constrains the regression coefficients to a shared low-dimensional subspace. Another recent line of work studies the multi-task learning problem under data-sharing constraints. In [17], the authors describe a federated multi-task learning linear regression model for privacypreserving data analysis. Similarly, [2] presents a computational framework for multi-task learning under Data SHIELD [11] constraints. The formulation of these methods is conceptually similar to ours, but they do not provide theoretical guarantees for their estimators, and we consider a more flexible setting where the summary statistics can be derived from different sources. Finally, our methods are closely related to the one-shot federated learning paradigm, in which only one round of communication is permitted between the primary local research site and additional sites. [9] presents an algorithm for fitting logistic regression models using summary statistics from different research sites. The works of [20, 34] extend these ideas to linear mixed effects models and generalized mixed effects models, respectively. [21] presents a federated algorithm for fitting the Cox proportional hazards model, and [16] studies federated transfer learning methods for fitting generalized linear models. Nevertheless, the summary statistics addressed in our research are frequently reported in existing studies and can be employed across various models. This is in contrast to the one-shot federated algorithm, where the summary statistics are model specific and the implementation relies on the infrastructure of a collaborative environment. 2 Problem setup and methods Consider the setting where we are interested in learning a total of Q tasks simultaneously. For each q [Q], we posit the linear model Y(q) = X(q)β(q) + ε(q) where Y(q) Rnq, X(q) Rnq p, and ε(q) is mean-zero random noise. Each index q corresponds to the qth task. The dataset D(q) = (Y(q), X(q)) contains the individual-level observations of the outcome and features respectively for the qth task. We consider the generic setting where the features D(q) might be collected from either overlapping or non-overlapping samples across tasks. Our estimand of interest is the matrix B = [β(1), ..., β(Q)] Rp Q, where the qth column of B is β(q). Furthermore, let ei denote the ith standard basis vector, so that β(q) = B eq. If all the individual-level observations D(q) are available, a natural estimator of B is the regularized multi-task least-squares estimator b B = arg min B Y(q) X(q)Beq 2 where P is a suitable penalty, chosen to enforce similarity structure between tasks, with tuning parameter λ > 0. However, in many applications, we are less likely to observe D(q). Rather, summary statistics which contains information of the feature-outcome and feature-feature relationships may be more likely to be made publicly available. Motivated by the use case in genetic risk prediction, we assume that only summary statistics b S(q) and eΣ(q) are observable, where b S(q) = 1 nq X(q)T Y(q) are derived from {X(q), Y(q)}, which we termed as the discovery data, and eΣ(q) = 1 nq e X(q)T e X(q) is a sample covariance matrix computed from the proxy data e X(q) R nq p, which may or may not have overlap with X(q). Our goal is to estimate B using two sets of summary statistics b S(q) and eΣ(q). We note that e X(q) is not necessarily equal to X(q). In practice, the studies which report b S(q) may not be the same as the ones reporting eΣ(q). Intuitively, we hope that e X(q) is generated from a similar population as X(q), but this may not hold in general. In Section 3, our theoretical analysis reveals how the overlap between e X(q) and X(q) and their distributional shift can influence the accuracy of multi-task learning. To construct an estimator that uses only the information provided by e D(q), we notice that the leastsquares loss can be written as L(β) = Y Xβ 2 2 = YT Y 2 β, XT Y + βT XT Xβ By dropping the constant term, we arrive at a loss function that can be computed using only summarylevel information, namely the matrices XT Y and XT X. This motivates our general strategy for constructing an estimator only using summary statistics: we substitute b S(q) and eΣ(q) where appropriate in each least-square loss function in Equation 1 and arrive at the following optimization problem. b B = arg min B eΣ(q)1/2Beq 2 2 b S(q), Beq + λP(B) There are many possible choices of P for enforcing structure similarities across tasks. For instance, the recent works of [29] and [12] study low-rank and angle-based penalties for enforcing a shared orientation among the task-specific parameters. In this work, we study two estimators obtained under the ℓ2,1 norm penalty, denoted . 2,1, and the nuclear norm penalty, denoted . . These penalties are chosen for their intuitive interpretation: the ℓ2,1 penalty is more likely to be effective if a common set of variables are active across the tasks. If the task-specific parameters tend to be correlated , in the sense that they lie in a low-dimensional subspace, the nuclear norm penalty is preferred. In practice, certain domain knowledge can be incorporated to determine the penalty structure, or it can be chosen in a data-driven way in the existence of a validation dataset. The corresponding estimators are expressed as follows: b B(sp) = arg min B eΣ(q)1/2Beq 2 2 b S(q), Beq + λ B 2,1 b B(lr) = arg min B eΣ(q)1/2Beq 2 2 b S(q), Beq + λ B The superscripts (sp) and (lr) stand for sparse and low-rank respectively. 3 Theoretical guarantees Before presenting our theoretical results, we first introduce the relevant notation. Let N denote the total size of discovery observations and proxy observation across all Q tasks. Formally, q=1 (nq + nq) We note that N may double-count individuals who are part of both the proxy data and the discovery data. Define the subset Iq [N] as the index set for the discovery data points in the qth task; in other words i Iq implies Xi Rp is a row of X(q). We define e Iq analogously for the proxy data; i e Iq implies Xi is a row of e X(q). Let ρq = |Iq e Iq|/ nq denote the proportion of proxy samples which are also in the discovery dataset for the qth task. In the results that follow, let γq = 1 + β(q) 2 nq + 1 2 ρq) and take γ = maxq γq. Additionally, let Ξ Rp Q be the matrix with its qth column equal to (Σ(q) 1 Σ(q) 2 )β(q), where Σ(q) 1 and Σ(q) 2 are the population-level covariance matrices of X(q) and e X(q) respectively. The quantities γ and Ξ play important roles in our results that follow. In particular, γ is a multiplicative factor in our bounds that represents the cost of using proxy data rather than individual-level data. Similarly, Ξ will represent the cost of using a proxy dataset with a distributional shift from the discovery data. Finally, we will let nmin and nmin denote the smallest sample size of discovery and proxy data, respectively. All proofs are given in the supplement. 3.1 Guarantees for ℓ2,1-norm estimator In this section, we formally state our assumptions and results for the b B(sp) estimator. The assumptions are standard for high-dimensional regularized estimators, see [26] for a deeper discussion of these conditions. Assumption 3.1 (Sub-gaussian design and noise). The following holds for each q [Q]: The rows of X(q) are independent and identically distributed according to a sub-Gaussian distribution with covariance matrix Σ(q) 1 Rp p. Similarly, the rows of e X(q) are independent and identically distributed according to a sub-Gaussian distribution with covariance Σ(q) 2 Rp p. The matrices Σ(q) 1 and Σ(q) 2 have bounded eigenvalues. The entries of ε(q) are independent and identically distributed according to a sub-Gaussian distribution with parameter σ2. The X(q) and ε(q) are independent of one another. Assumption 3.2 (Shared support). There exists a subset S [p] such that supp(β(q)) = S for each q. Definition 3.1 (Sparse cone). For any S [p], let Cα(S) = n Rp Q : Sc 2,1 α S 2,1 o Assumption 3.3 (Restricted strong convexity). There exists a constant κ > 0 and a sequence a N 0 as N such that the following inequality holds for each C3(S ) with probability at least 1 a N: eΣ(q)1/2 eq 2 Theorem 3.1. Under assumptions 3.1, 3.2, and 3.3, there exist constants c1 and c2 depending only on the σ2 and the eigenvalues of Σ(q) 1 and Σ(q) 2 such that if nmin nmin c1 B , (Q + log p) and λ = O( p γ(Q + log p)/nmin + Ξ 2, ), the following inequality holds with probability at least 1 e log p a N: b B(sp) B F c2 γs(Q + log p) nmin + s Ξ 2, In the subsequent discussion, we take q = arg maxq [Q] γq and (n, n, ρ) = (nq , nq , ρq ) so that the triplet (n, n, ρ) corresponds to the same sizes and overlap factor used to compute γ. There are three main quantities in this upper bound that are of novel interest: the ratio of discovery data size to proxy data size n/ n, the proportion of overlap between the discovery and proxy data ρ, and the error in specifying the proxy data distribution Ξ = (Σ(1) Σ(2))β. The first two of these are captured by the factor γ. Our results show that with fixed n and n, the larger proportion of overlap leads to better estimation accuracy. When the proxy data and discovery data are precisely the same, meaning that Iq = Iq for all q, we recover the minimax rate of estimation for the ℓ2,1 penalized multi-task learning problem established by Theorem 6.1 of [18]. If the proxy data and the discovery data are disjoint, meaning that Iq Iq = for all q, the error is increased relative to the minimax rate by a factor of (1 + n/ n) β 2 2. This recovers the result of Theorem 2.1 in [15] up to a constant factor, assuming that Ξ = 0. The novelty of Theorem 3.1 is that we are able to characterize the convergence rate of b B(sp) for any values of n/ n, ρ, and Ξ. Additionally, we emphasize that the form of the γ term implies that a price is paid anytime when X(q) is not fully contained in e X(q). Indeed, if ρ < 1/2 and we take n we still have that γ > 1 as long as the signal is nonzero. Counter-intuitively, this indicates that an oracle model which has full access to the population-level covariance matrix of the covariates will perform worse in terms of estimation error than an estimator which has access to individual-level data. Furthermore, if ρ > 1/2, our theorem predicts that the estimator will out-perform the oracle estimator that uses the population covariance matrix. These phenomena are validated in our simulation studies in Section 5. 3.2 Guarantees for the nuclear norm estimator Now we state our results for the low-rank proxy data estimator, when the penalty is taken to be the nuclear norm. Once again, these assumptions are standard for high-dimensional regression problems with the nuclear norm [26, 33]. Assumption 3.4 (Low rank). The matrix B has rank r << p Q. Let U and V denote the column space and row space of B respectively. Note that U and V each have dimension r. Definition 3.2 (Subspaces). Let U denote a dimension k p Q subspace of Rp, and let V denote a dimension k p Q subspace of RQ. Define M = M(U, V) := Rp Q : row( ) = V, col( ) = U M = M (U, V) = Rp Q : row( ) V, col( ) U Furthermore, for any subspace Ωof Rp Q, let Ωdenote the projection of onto Ω. We will denote M = M(U , V ). Definition 3.3 (Low rank cone). For any set M as defined above, let Cα(M) = Rp Q : M α M Assumption 3.5 (Restricted strong convexity). There exists a constant κ > 0 and a sequence b N 0 as N such that the following inequality holds for each C3(M ) with probability at least 1 b N: eΣ(q)1/2 eq 2 Theorem 3.2. Under assumptions 3.1, 3.4, and 3.5, there exist constants c1 and c2 depending only on σ2 and the eigenvalues of Σ(q) 1 and Σ(q) 2 such that if nmin nmin c1 B , (Q + p) and λ = O( p γ(Q + p)/nmin + Ξ op), the following inequality holds with probability at least 1 e p b N: b B(lr) B F c2 nmin + r Ξ op This theorem recovers precisely the same behavior with respect to γ and Ξ as Theorem 3.1. As γ 1, we achieve the minimax rate of estimation for low-rank regression as derived in [27] as long as Ξ = 0. 4 Tuning parameter selection with Lepski s method A key challenge of applying penalized regression models to summary statistics is that model tuning based on data splitting (e.g., training and validation) is no longer an option. Model selection methods based on information criteria require knowing the log squared loss log Y(q) X(q)β 2 2, which cannot be recovered from b S(q) and eΣ(q) [1]. To address this, we propose to use a tuning scheme based on Lepski s method [14], a classical tool of nonparametric statistics for adaptive estimation with unknown tuning parameters. The authors of [5] apply the ideas of Lepski to the LASSO, providing a fast algorithm for model tuning with non-asymptotic guarantees. In this section, we extend the methods in [5] to tune the multi-task estimators described in the present work. The results and ideas in this section apply to both b B(sp) and b B(lr), so without loss of generality, let (b Bλ, P) denote a generic estimator-regularizer pair with tuning paramter λ, which may refer to either (b B(sp) λ , . 2,1) or (b B(lr) λ , . ). Additionally, let P denote the dual of P, meaning that P (X) = sup Y :P(Y ) 1 X, Y Finally, we let L denote the loss function for both estimators and let L denote its gradient. The intuition behind the adaptive tuning procedure is that the tuning parameter should be chosen large enough to control fluctuations in the gradient of the loss function, but not too large such that too much bias is incurred. [26] articulates that the performance of regression estimator with a convex penalty is contingent on the following event occurring with high probability: A(λ) = P ( L(B )) λ where we use our problem s notation for continuity. The proofs of Theorem 3.1 and 3.2 involve showing that A(λ) holds with high probability under our stated conditions. It is straightforward to prove the following proposition, which states that conditional on A, the gradient of the loss function at our generic estimator b B is close to the gradient at the true parameter B . Proposition 4.1. Let (b Bλ, P) denote a generic estimator-regularizer pair. Conditional on the event A(λ), there exists a constant C > 0 such that the following inequality is satisfied almost surely: P ( L(b Bλ) L(B )) Cλ This proposition motivates the following definition, which we adopt from [5]. Definition 4.1. Let Λ = {λ1, λ2, ..., λM} denote a grid of potential tuning parameters ordered such that 0 < λ1 < λ2 < ... < λM < . Fix δ (0, 1). The oracle tuning parameter λ δ is defined as λ δ = arg min λ Λ {P {A(λ)} 1 δ} The oracle tuning parameter provides the tightest upper bound in Proposition 4.1, but is unknowable in practice, since we do not observe B and hence cannot verify A. The aim of our Lepski-type method is to mimic the performance of λ δ in an entirely data-driven fashion. Letting Λ denote our ordered grid of potential tuning parameters, we follow [5] and choose the tuning parameter ˆλ Λ that satisfies ˆλ = arg min λ Λ max λ ,λ Λ,λ ,λ λ P ( L(b Bλ ) L(b Bλ )) C(λ + λ ) (5) where C is a constant chosen by the statistician. The following theorem states that ˆλ recovers the behavior of λ δ with high probability, as long as C is sufficiently large. Theorem 4.1. Let C denote the constant in Proposition 4.1. If ˆλ is chosen as in Equation 5 with C C, then the following inequalities hold simultaneously with probability at least 1 δ: 2. P ( L(b Bˆλ) L(B )) C λ δ This theorem is a generalization of Theorem 3 in [5], adapted to our setting. The primary advantage of this Lepski-style tuning scheme is that it can be performed using only the gradient of the loss function, which in our setting consists only of summary-level statistics. This is a marked improvement over other summary statistic-based estimators, which typically require an additional set of individual-level data for tuning. The adaptive tuning scheme does hold some disadvantages. First of all, it requires a choice of constant C, which should be taken to be as close to the constant in Proposition 4.1 as possible. Remark 10 in [5] offers some guidance as to how to choose C but unfortunately their analysis corresponds only to the LASSO. Deriving the exact constant in Proposition 4.1 may be possible under stronger assumptions on the data-generating process (i.e. Gaussianity), and we view this as an area of future work. Furthermore, Theorem 4.1 offers only a bound on L(b B) L(B ). Translating this to a bound on b B B will require strong element-wise conditions on each of the matrices eΣ(q), which we do not explore in the present work. Nevertheless, our simulations in Section 5 indicate that the adaptive tuning method performs well in terms of the MSE of b B, suggesting that adaptive tuning is a good option for model selection when only summary statistics are available. 5 Numerical experiments and real data application We validate our theory and demonstrate the effectiveness of multi-task learning in proxy data settings via extensive experiments. In each experiment, we take the proxy dataset to be well-specified; in other words, we assume that Ξ = 0. When Ξ is nonzero, this predictably leads to worse performance, which we demonstrate in the supplement. The code, further implementation details, and additional simulations which explore the use of our adaptive tuning procedure are also available in the supplement. First, we consider the effect of varying proxy data size on empirical MSE per task. We generate synthetic Gaussian data with nmin = 100, p = 100, nmin = τnmin for τ {0.5, 1, 2, 5, 10}, and ρq = 0 for each q. The number of tasks was fixed at 8. Furthermore, we generate a row-sparse B matrix with 10 nonzero rows and a B with rank 2 for the sparse and low-rank multi-task estimators, respectively. We then fit the proxy data multi-task learning estimator and compare the prediction MSE per task to the estimator that has access to all of the individual level data and to the estimator that uses the true covariance matrix Σ. The results of this simulation are given in Figure 1. Figure 1: Average prediction MSE per task after 100 repetitions plotted against τ = n/n. The left hand side corresponds to the sparse estimator, and the right hand side is the low-rank estimator. The red boxes indicate the estimator that uses all of the individual-level data (IL_Cov), the blue boxes indicate the estimator that uses the true covariance matrix of the features (true_Cov), and the green boxes correspond to the estimator that uses just the proxy data (Proxy_Cov). We observe a performance gap between the estimators that use the true covariance matrix and the individual level estimators, as predicted by our theory in Section 3. The performance of the proxy data estimators increase with increasing proxy sample size, but are unable to match the performance of the individual level estimator, as expected. Next we study the effect of varying the proportion of overlapping samples between the discovery and proxy datasets. Simiarly, we generate synthetic data with n = n = 100, and vary ρ, which indicates the proportion of proxy data points that are also in the discovery dataset. With Q = 8, we generate B in the same way as in the previous simulation. These results are given in Figure 2. Once again, we observe the expected performance gap between the estimators with the true covariance and the individual-level estimators. As the proportion of overlap between the proxy dataset and the discovery dataset grows, we see that the performance of the proxy data estimator converges to that of the individual level estimator. This is anticipated by Theorems 3.1 and 3.2. Figure 2: Average MSE per task after 100 repetitions plotted against ρ. The orientation and colors are the same as in Figure 1. Finally, we have applied our method to analyze real genetic data to demonstrate the real-world applicability of our method. We use a multi-site data obtained from the electronic Medical Records and Genomics (e MERGE) network [28], which includes individual-level genotype data from multiple research sites in the United States. Our goal is to predict levels of low-density lipoprotein (LDL) across five adult sites, treating the data from each site as a separate task. We split the data (with sample sizes n1 = 3813, n2 = 546, n3 = 2666, n4 = 1435, n5 = 525) at each task into a training and test set (with a test set data size of 100 for each task) and evaluate the performance of our method using the prediction MSE on the test set. The training data from each site is used to construct the discovery summary statistics b S(q) for each task. For approximating Σ(q), we choose two different approaches: one is to use the half of the genotype data from each site (this approach is labeled as Proxy_MTL1); the other approach is to use X1 (genotype data from site 1) to approximate Σ(q) for all the sites. This approach is labeled as Proxy_MTL2. We use these two approaches to demonstrate a potential trade-off in the construction of the reference panel: Proxy_MTL1 uses a well-specified reference dataset with a smaller sample size; Proxy_MTL2 uses a larger reference dataset that may suffer from a distribution shift. For comparison, we also fit a multi-task learning estimator that uses all of the individual level training data for each task, which is labeled Individual_MTL , and we fit a ridge regression estimator that models each task separately, for comparison to our multi-task learning approach. The ridge estimator uses the proxy sample covariance instead of the individual level covariance matrix for a fair comparison with our method. We repeat the train-test split process 10 times, which admits the distribution of prediction MSE values that we report in the figure. We use the nuclear norm penalized multi-task learning estimators in this application, because we believe that the genetic effects in this dataset are dense. Figure 3: Prediction MSE per task after 10 splits of the e MERGE data. Our results in Figure demonstrate that our method is highly practical when only summary-level information is available, as the prediction MSE of our method is nearly the same as the estimator which uses the individual-level data, despite a slight cost in performance. Furthermore, all multi-task learning estimators outperform the ridge-estimator, confirming that multi-task learning is a strong approach when there is shared structure between tasks. 6 Discussion, Limitations, and Broader Impacts We have described a flexible multi-task framework incorporating summary statistics from distinct sources with a general data-driven tuning scheme for selecting tuning parameters. Our theoretical analysis sheds light on the intrinsic price of using summary-level information from distinct sources for statistical analysis, and suggests that more overlap between the sources, less distributional shift, and larger proxy data sample sizes can alleviate this cost. Our data-driven tuning scheme allows models to be trained without sample splitting, making it more applicable to real-world settings with only summary statistics available. The limitations of our work are summarized as follows. First of all, our methods depend on a linear relationship between the covariates and the outcomes. This assumption is often satisfied in our target application of genetic risk prediction [3], but it does limit the applicability of our method to other domains. To extend our framework to non-linear models, we may use the second-order Taylor approximation of the loss function as in [16]. However, the summary statistics used by such an algorithm are not found in existing literature or publicly available databases. Additionally, our theoretical results provide only upper bounds on the estimation error of the two estimators that we consider in this work. To fully characterize the cost of using summary statistics for multi-task learning, lower bounds resembling Theorem 2.2 of [15] are needed. We conjecture that our estimators converge at a minimax optimal rate, and we view the proof of this conjecture as an important future direction. Finally, we may also extend the framework of [10] to our summary statistic based setting to adjust for potential differences between tasks. Nevertheless, our results have important implications beyond high-dimensional statistical theory. The trade-off between proxy data sample size and discovery-proxy overlap may inform how polygenic risk models are built in real-world applications: Practitioners should prioritise alignment between the sources of summary statistics that they use to build these models, rather than optimizing for large sample sizes. This guidance may lead to more accurate polygenic scores, which have emerged as an important predictive tool in the field of precision medicine. We recognize that the development of polygenic risk scores, if done without care, may worsen existing health disparities [23]. This is a potential negative societal impact of our work. We hope that our multi-task learning framework may be used to incorporate data from diverse populations to improve generalizability and transportability of genetic risk predictions to overcome these negative impacts. Acknowledgements Rui Duan is supported by National Institute of General Medical Sciences (NIGM) R01GM148494. Parker Knight is supported by an NSF Graduate Research Fellowship. We would like to thank Rajarshi Mukherjee for an insightful discussion of Lepski s method, and thank the reviewers for their comments and feedback which greatly improved the paper. [1] Kenneth P. Burnham and David R. Anderson. Multimodel Inference: Understanding AIC and BIC in Model Selection . In: Sociological Methods & Research 33.2 (Nov. 2004), pp. 261 304. ISSN: 0049-1241, 1552-8294. DOI: 10.1177/0049124104268644. [2] Han Cao et al. ds MTL: a computational framework for privacy-preserving, distributed multitask machine learning . In: Bioinformatics 38.21 (Oct. 31, 2022). Ed. by Jonathan Wren, pp. 4919 4926. ISSN: 1367-4803, 1367-4811. DOI: 10.1093/bioinformatics/btac616. [3] Nilanjan Chatterjee, Jianxin Shi, and Montserrat García-Closas. Developing and evaluating polygenic risk prediction models for stratified disease prevention . In: Nature Reviews Genetics 17.7 (July 2016), pp. 392 406. ISSN: 1471-0056, 1471-0064. DOI: 10.1038/nrg.2016.27. [4] Ting-Huei Chen et al. A Penalized Regression Framework for Building Polygenic Risk Models Based on Summary Statistics From Genome-Wide Association Studies and Incorporating External Information . In: Journal of the American Statistical Association 116.533 (Jan. 2, 2021), pp. 133 143. ISSN: 0162-1459, 1537-274X. DOI: 10.1080/01621459.2020.1764849. [5] Michaël Chichignoud, Johannes Lederer, and Martin Wainwright. A Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees. ar Xiv:1410.0247. type: article. ar Xiv, Nov. 8, 2016. ar Xiv: 1410.0247[math,stat]. [6] Shing Wan Choi, Timothy Shin-Heng Mak, and Paul F. O Reilly. Tutorial: a guide to performing polygenic risk score analyses . In: Nature Protocols 15.9 (Sept. 1, 2020), pp. 2759 2772. ISSN: 1754-2189, 1750-2799. DOI: 10.1038/s41596-020-0353-1. [7] Simon S. Du et al. Few-Shot Learning via Learning the Representation, Provably. ar Xiv:2002.09434. type: article. ar Xiv, Mar. 30, 2021. ar Xiv: 2002.09434[cs,math,stat]. [8] Rui Duan, Yang Ning, and Yong Chen. Heterogeneity-aware and communication-efficient distributed statistical inference . In: Biometrika (Feb. 12, 2021), asab007. ISSN: 0006-3444, 1464-3510. DOI: 10.1093/biomet/asab007. [9] Rui Duan et al. Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm . In: Journal of the American Medical Informatics Association 27.3 (Mar. 1, 2020), pp. 376 385. ISSN: 1527974X. DOI: 10.1093/jamia/ocz199. [10] Yaqi Duan and Kaizheng Wang. Adaptive and Robust Multi-task Learning. ar Xiv:2202.05250. type: article. ar Xiv, Apr. 1, 2022. ar Xiv: 2202.05250[cs,math,stat]. [11] Amadou Gaye et al. Data SHIELD: taking the analysis to the data, not the data to the analysis . In: International Journal of Epidemiology 43.6 (Dec. 2014), pp. 1929 1944. ISSN: 1464-3685, 0300-5771. DOI: 10.1093/ije/dyu188. [12] Tian Gu, Yi Han, and Rui Duan. Robust angle-based transfer learning in high dimensions. Apr. 7, 2023. ar Xiv: 2210.12759[stat]. [13] Tian Gu, Phil H. Lee, and Rui Duan. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction . In: Journal of Biomedical Informatics 137 (Jan. 2023), p. 104243. ISSN: 15320464. DOI: 10.1016/j.jbi.2022.104243. [14] O. V. Lepski and V. G. Spokoiny. Optimal pointwise adaptive methods in nonparametric estimation . In: The Annals of Statistics 25.6 (Dec. 1, 1997). ISSN: 0090-5364. DOI: 10.1214/ aos/1030741083. [15] Sai Li, T. Tony Cai, and Hongzhe Li. Estimation and Inference with Proxy Data and its Genetic Applications . In: ar Xiv:2201.03727 [math, stat] (Jan. 10, 2022). ar Xiv: 2201.03727. [16] Sai Li, Tianxi Cai, and Rui Duan. Targeting Underrepresented Populations in Precision Medicine: A Federated Transfer Learning Approach. Aug. 27, 2021. ar Xiv: 2108.12112[cs, stat]. [17] Kunpeng Liu et al. Privacy-Preserving Multi-task Learning . In: 2018 IEEE International Conference on Data Mining (ICDM). 2018 IEEE International Conference on Data Mining (ICDM). Singapore: IEEE, Nov. 2018, pp. 1128 1133. ISBN: 978-1-5386-9159-5. DOI: 10. 1109/ICDM.2018.00147. [18] Karim Lounici et al. Oracle inequalities and optimal inference under group sparsity . In: The Annals of Statistics 39.4 (Aug. 2011). Publisher: Institute of Mathematical Statistics, pp. 2164 2204. ISSN: 0090-5364, 2168-8966. DOI: 10.1214/11-AOS896. [19] Karim Lounici et al. Taking Advantage of Sparsity in Multi-Task Learning. ar Xiv:0903.1468. type: article. ar Xiv, Mar. 8, 2009. ar Xiv: 0903.1468[math,stat]. [20] Chongliang Luo et al. DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models . In: Nature Communications 13.1 (Mar. 30, 2022), p. 1678. ISSN: 2041-1723. DOI: 10.1038/s41467-022-29160-4. [21] Chongliang Luo et al. ODACH: a one-shot distributed algorithm for Cox model with heterogeneous multi-center data . In: Scientific Reports 12.1 (Apr. 22, 2022), p. 6627. ISSN: 2045-2322. DOI: 10.1038/s41598-022-09069-0. [22] Timothy Shin Heng Mak et al. Polygenic scores via penalized regression on summary statistics: MAK et al. In: Genetic Epidemiology 41.6 (Sept. 2017), pp. 469 480. ISSN: 07410395. DOI: 10.1002/gepi.22050. [23] Alicia R. Martin et al. Clinical use of current polygenic risk scores may exacerbate health disparities . In: Nature Genetics 51.4 (Apr. 2019), pp. 584 591. ISSN: 1061-4036, 1546-1718. DOI: 10.1038/s41588-019-0379-x. [24] Aaron J. Molstad et al. Heterogeneity-aware integrative analyses for ancestry-specific association studies. June 8, 2023. ar Xiv: 2306.05571[stat]. [25] Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling . In: The Annals of Statistics 39.2 (Apr. 1, 2011). ISSN: 0090-5364. DOI: 10.1214/10-AOS850. [26] Sahand N. Negahban et al. A Unified Framework for High-Dimensional Analysis of $M$- Estimators with Decomposable Regularizers . In: Statistical Science 27.4 (Nov. 1, 2012). ISSN: 0883-4237. DOI: 10.1214/12-STS400. [27] Angelika Rohde and Alexandre B. Tsybakov. Estimation of high-dimensional low-rank matrices . In: The Annals of Statistics 39.2 (Apr. 1, 2011). ISSN: 0090-5364. DOI: 10.1214/ 10-AOS860. [28] the e MERGE Team et al. The e MERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies . In: BMC Medical Genomics 4.1 (Dec. 2011), p. 13. ISSN: 1755-8794. DOI: 10.1186/1755-8794-4-13. [29] Ye Tian, Yuqi Gu, and Yang Feng. Learning from Similar Linear Representations: Adaptivity, Minimaxity, and Robustness. ar Xiv:2303.17765. type: article. ar Xiv, Mar. 30, 2023. ar Xiv: 2303.17765[cs,stat]. [30] Ali Torkamani, Nathan E. Wineinger, and Eric J. Topol. The personal and clinical utility of polygenic risk scores . In: Nature Reviews Genetics 19.9 (Sept. 2018), pp. 581 590. ISSN: 1471-0056, 1471-0064. DOI: 10.1038/s41576-018-0018-x. [31] Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable Meta-Learning of Linear Representations. ar Xiv:2002.11684. type: article. ar Xiv, Dec. 31, 2021. ar Xiv: 2002.11684[cs, stat]. [32] Joshua R Vest and Larry D Gamm. Health information exchange: persistent challenges and new strategies: Table 1 . In: Journal of the American Medical Informatics Association 17.3 (May 2010), pp. 288 294. ISSN: 1067-5027, 1527-974X. DOI: 10.1136/jamia.2010. 003673. [33] Martin Wainwright. High-dimensional statistics: a non-asymptotic viewpoint. Cambridge series in statistical and probabilistic mathematics 48. Cambridge ; New York, NY: Cambridge University Press, 2019. ISBN: 978-1-108-49802-9. [34] Zhiyu Yan et al. A privacy-preserving and computation-efficient federated algorithm for generalized linear mixed models to analyze correlated electronic health records data . In: PLOS ONE 18.1 (Jan. 17, 2023). Ed. by Sathishkumar V. E., e0280192. ISSN: 1932-6203. DOI: 10.1371/journal.pone.0280192. [35] Yu Zhang and Qiang Yang. A Survey on Multi-Task Learning . In: IEEE Transactions on Knowledge and Data Engineering 34.12 (Dec. 1, 2022), pp. 5586 5609. ISSN: 1041-4347, 1558-2191, 2326-3865. DOI: 10.1109/TKDE.2021.3070203. [36] Yu Zhang and Qiang Yang. An overview of multi-task learning . In: National Science Review 5.1 (Jan. 1, 2018), pp. 30 43. ISSN: 2095-5138, 2053-714X. DOI: 10.1093/nsr/nwx105.