Journal of Machine Learning Research 26 (2025) 1-31    Submitted 1/24; Revised 6/25; Published 6/25

Assumption-lean and data-adaptive post-prediction inference

Jiacheng Miao* (jiacheng.miao@wisc.edu), Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
Xinran Miao* (xinran.miao@wisc.edu), Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA
Yixuan Wu (wu638@wisc.edu), University of Wisconsin-Madison, Madison, WI 53726, USA
Jiwei Zhao (jiwei.zhao@wisc.edu), Department of Biostatistics and Medical Informatics, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53726, USA
Qiongshi Lu (qlu@biostat.wisc.edu), Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA

Editor: Edo Airoldi

Abstract

A primary challenge facing modern scientific research is the limited availability of gold-standard data, which can be costly, labor-intensive, or invasive to obtain. With the rapid development of machine learning (ML), scientists can now employ ML algorithms to predict gold-standard outcomes using variables that are easier to obtain. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring the imprecision and heterogeneity introduced by the prediction procedure, which is likely to result in false positive findings and invalid scientific conclusions. In this work, we introduce PoSt-Prediction Adaptive inference (PSPA), which allows valid and powerful inference based on ML-predicted data. Its assumption-lean property guarantees reliable statistical inference without assumptions on the ML prediction. Its data-adaptive feature guarantees an efficiency gain over existing methods, regardless of the accuracy of the ML prediction. We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications.

Keywords: Inference with predicted data; Missing data; Semiparametric efficiency

*These authors contributed equally to this work.

(c) 2025 Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, and Qiongshi Lu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0056.html.

1. Introduction

Gene expression data obtained from various tissues and cell types can provide crucial insights into the coordinated biological mechanisms that underlie disease etiology and characterize homeostasis (Lonsdale et al., 2013). However, some important tissues are difficult to collect, which limits the sample size of gene expression data in those tissues. For example, the Genotype-Tissue Expression (GTEx) project is a comprehensive study of gene expression regulation in many human tissues (GTEx Consortium et al., 2015). The percentage of individuals with missing gene expression ranges from 5% to 90% (median 47%) across all tissues in GTEx (GTEx Consortium, 2020). This limits the scientific understanding of transcription regulation across tissue contexts. The difficulty of obtaining gold-standard data extends well beyond human gene expression applications (Wang et al., 2023). While gold-standard data with high reliability are essential to the validity of scientific discoveries, obtaining them is often costly and labor-intensive.
Fortunately, the advent and rapid development of machine learning (ML) have enabled the prediction of outcomes using more accessible variables (LeCun et al., 2015; He et al., 2016), showing great potential to reduce the need for gold-standard data. Despite these benefits, replacing gold-standard data with ML predictions poses new challenges, particularly for maintaining the validity of downstream statistical analyses. This issue is exemplified by statistical analyses that use imputed gene expression in the GTEx project. To address insufficient sample sizes in rare tissues, several approaches have been proposed to impute gene expression in these tissues using expression data from tissues that are easier to acquire (Wang et al., 2016; Viñas et al., 2023; Basu et al., 2021). However, imputed gene expression is often treated as if it were observed and used directly in subsequent statistical analyses for scientific discovery (Viñas et al., 2023), such as exploring sex differences and the genetic basis of gene regulation. As demonstrated by Wang et al. (2020) and Angelopoulos et al. (2023a), using ML predictions directly in statistical inference is likely to produce false positive findings, leading to invalid scientific conclusions. The challenge of making valid inferences from ML predictions extends beyond this specific example, affecting numerous scientific disciplines where ML is applied (Bullock et al., 2020).

Recent work has introduced prediction-powered inference (PPI), which leverages a small set of labeled data with gold-standard outcomes and a large amount of unlabeled data with ML predictions to perform valid statistical inference (Angelopoulos et al., 2023a). Although validity is guaranteed, PPI may be less efficient than inference relying solely on labeled data. For instance, when applied to GTEx data, PPI identifies fewer sex-biased genes than the classical approach based on labeled data alone, highlighting a loss of statistical efficiency (Figure 1). This limitation reduces the broader applicability of PPI and similar methods.

To address this problem, we introduce a valid, powerful, and widely applicable approach named PoSt-Prediction Adaptive inference (PSPA). PSPA integrates ML predictions with the observed gold-standard data to ensure valid and efficient statistical inference. We highlight two key features of PSPA:

Assumption-lean: The PSPA estimator is consistent and asymptotically normal, ensuring valid inference with no assumptions on the ML model or its prediction accuracy. PSPA can therefore be used with arbitrarily misspecified ML predictions.

Data-adaptive: The PSPA estimator adapts to the accuracy of the ML prediction. It utilizes more information from a better ML prediction to reduce variance, and it avoids variance inflation otherwise. Below, we show that PSPA achieves element-wise variance reduction compared to existing methods, regardless of the quality of the ML prediction.

Our contributions are threefold: (i) We present a simple and data-adaptive method for statistical inference with predicted data, for estimands defined through estimating equations, without any assumptions on the ML prediction or parameterization of the true data-generating process. (ii) We establish the consistency and asymptotic normality of the proposed estimator.
We further demonstrate the optimality of our estimator in terms of asymptotic variance over a class of estimators that includes existing methods for post-prediction inference. (iii) In contrast to current methods, which focus only on incorporating ML-predicted labels in statistical inference, our approach extends to ML-predicted covariates as well as labels, enhancing the versatility and applicability of post-prediction inference.

The rest of this paper is organized as follows. In Section 2, we formulate the problem and illustrate our method with mean estimation to build intuition. In Section 3, we introduce a general protocol for applying PSPA to estimands defined through estimating equations, establish the asymptotic properties of the PSPA estimator, and describe the estimation procedure. We connect our method with semiparametric efficiency theory in Section 4 and describe extensions in Section 5. Through extensive simulations (Section 6) and real data analysis (Section 7), we validate our theoretical claims and demonstrate the practical utility of the PSPA estimator. We compare our method with the recent method PPI++ (Angelopoulos et al., 2023b) in Section 8 and conclude with a discussion in Section 9.

Figure 1: Comparison of PPI, PSPA, and the classical method in identifying genes with sex-biased expression using GTEx data. (a) Number of sex-biased genes identified by each approach. (b) x-axis: absolute value of the imputation correlation; y-axis: relative ratio of the estimated standard error between PPI and the classical method. (c) Same as (b) but for our method PSPA. The dashed lines represent y = 1 in (b)-(c).

2. Problem Formulation

Let $Y$ be a scalar outcome and $X$ be $d$-dimensional covariates. The scientific interest is to estimate a $q$-dimensional parameter $\theta$ defined through an estimating equation,
$$\mathbb{E}\{\psi(Y, X; \theta)\} = 0, \qquad (1)$$
where $\psi(\cdot, \cdot; \theta)$ is a user-defined function. This definition of $\theta$ is very general and includes the outcome mean, outcome quantiles, least squares coefficients, and other quantities of interest involving both $X$ and $Y$, such as the unique minimizer of a loss function or the maximizer of a criterion function (Van der Vaart, 2000).

We observe a random sample in which only a small subset is labeled with the outcome $Y$. In addition, we observe an auxiliary variable $Z$ that is predictive of the outcome $Y$. Note that $Z$ could be a subset of $X$ and vice versa; see Remark 1. Denote this sample as $\mathcal{L} \cup \mathcal{U}$, where $\mathcal{L} = \{(y_i, x_i, z_i),\, i = 1, \dots, n\}$ and $\mathcal{U} = \{(x_i, z_i),\, i = n+1, \dots, n+N\}$. Assume that $\mathcal{L}$ and $\mathcal{U}$ are independent and share the same marginal distribution of $(X, Z)$. Let $n/N \to \rho$ as $n, N \to \infty$, with $\rho = \infty$ if there is no unlabeled data.

We consider the availability of an external and independent prediction algorithm $\hat{f}(\cdot)$ that, applied to $z$, produces a prediction $\hat{f}(z)$ of the outcome variable $Y$. We assume the operating characteristics of $\hat{f}$ are unknown to the user and that the data used to fit $\hat{f}$ are unavailable; $\hat{f}$ is treated as a black-box function and may be incorrect in predicting $y$. With access to this black box, our primary goal is to improve efficiency while making inference on the parameter $\theta$. We refer to this problem as post-prediction inference, where inference occurs after predictions are made independently (Wang et al., 2020; Angelopoulos et al., 2023a). Specifically, we make no assumptions on $\hat{f}$.
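To make the estimating-equation formulation concrete, here is a minimal sketch of $\psi$ for two estimands named in the text (outcome mean and least squares); the function names are ours, for illustration only.

```python
import numpy as np

# Illustrative estimating functions psi(y, x; theta); names are ours.

def psi_mean(y, x, theta):
    # Outcome mean: E[Y - theta] = 0  =>  theta = E[Y].
    return y - theta

def psi_ols(y, x, theta):
    # Least squares: E[X (Y - X' theta)] = 0  =>  theta = E[XX']^{-1} E[XY].
    return x * (y - x @ theta)
```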
This protocol can benefit researchers for whom training a prediction algorithm is unrealistic, and it offers guidelines for scientists lacking computational resources or predictive-modeling expertise.

Example 1 (Sex-differentiated gene expression). We present the scientific problem that motivates our setup: identifying sex-biased gene expression in brain cortex tissue using data from the GTEx project. For this analysis, we have labeled data $\mathcal{L} = \{(y_i, \hat{f}_i, x_i, z_i),\, i = 1, \dots, 205\}$, where $y_i$ is the gene expression in brain cortex tissue; $x_i$ includes a binary indicator for biological sex in $\{0, 1\}$, along with technical factors such as surrogate variables for batch effects, age, RNA integrity number, and total ischemic time; $z_i$ is the whole-transcriptome gene expression profile in whole blood, a tissue that is easier to access; and $\hat{f}_i$ is the gene expression in brain cortex imputed from $z_i$. In addition, our unlabeled data $\mathcal{U} = \{(\hat{f}_i, x_i, z_i),\, i = 206, \dots, 670\}$ contain 465 samples whose gene expression was measured in whole blood but not in brain cortex. Our interest lies in the linear regression coefficient $\theta$ solving $\mathbb{E}[X(Y - X^\top\theta)] = 0$, focusing on the coefficient for biological sex.

Remark 1 (The consideration of X and Z). We split the covariates into two (possibly overlapping) parts, $Z$ and $X$: the former is predictive of the outcome $Y$, and the latter appears in the estimating equation (1). They could be the same, but a wide range of applications suggests they are different. When $Z$ is not a subset of $X$, the information it carries about the outcome $Y$ may be the most informative for the estimation problem (1). Hence, a better prediction algorithm $\hat{f}$ does not necessarily correspond to a better estimate of $\theta$. As justified above, this work does not attempt to improve the prediction algorithm, but to empower inference on the parameter of interest given an arbitrary black-box $\hat{f}$.

In the following, Section 2.1 presents the key idea of the proposed method using mean estimation as an example, and Section 2.2 reviews related work. In Section 3, we introduce the proposed framework for the estimation problem defined in equation (1).

2.1 Example: estimating the outcome mean

We illustrate the intuition behind our approach using mean estimation. Consider $\psi(y, x; \theta) = y - \theta$, so that the estimand is $\theta = \mathbb{E}(Y)$, the outcome mean. The classical approach estimates $\theta$ by the sample average of the observed outcomes,
$$\hat{\theta}_{C} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
This estimator is unbiased, but it can have high variance because of the small sample size of the labeled data. To increase efficiency using the ML-predicted outcomes in the unlabeled data, we propose to add an unbiased estimator of zero scaled by a weighting scalar $\omega$:
$$\hat{\theta}_{\mathrm{PSPA}}(\omega) = \frac{1}{n}\sum_{i=1}^{n} y_i + \omega\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \hat{f}(z_i) - \frac{1}{n}\sum_{i=1}^{n} \hat{f}(z_i)\Big\}.$$
We refer to this estimator as the PSPA estimator. Both the classical and the PPI estimator (Angelopoulos et al., 2023a) are special cases of PSPA: the classical estimator is the PSPA estimator with $\omega = 0$, and the PPI estimator is the PSPA estimator with $\omega = 1$. As detailed in Section 4, another candidate estimator is guided by the efficient influence function (EIF). We refer to this estimator as EIF*-based, where * indicates that the nuisance function is plugged in as an estimate:
$$\hat{\theta}_{\mathrm{EIF}^*} = \frac{1}{n}\sum_{i=1}^{n} y_i + \frac{N}{N+n}\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \hat{f}(z_i) - \frac{1}{n}\sum_{i=1}^{n} \hat{f}(z_i)\Big\}.$$
This is the PSPA estimator with $\omega = N/(N+n)$; it achieves the semiparametric efficiency bound when $\hat{f}$ approximates its true counterpart at a sufficiently fast rate.
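The following minimal numpy sketch, under the definitions above, shows the PSPA mean estimator as a one-parameter family that nests all three special cases; the function and argument names are ours.

```python
import numpy as np

def pspa_mean(y_lab, fhat_lab, fhat_unlab, omega):
    """PSPA estimate of E[Y] for a given scalar weight omega.

    y_lab:      observed outcomes on the n labeled samples
    fhat_lab:   ML predictions on the same labeled samples
    fhat_unlab: ML predictions on the N unlabeled samples
    """
    # Classical average plus a mean-zero augmentation scaled by omega.
    return y_lab.mean() + omega * (fhat_unlab.mean() - fhat_lab.mean())

# Special cases from the text, with n labeled and N unlabeled samples:
#   omega = 0           -> classical estimator
#   omega = 1           -> PPI estimator
#   omega = N / (N + n) -> EIF*-based estimator
# The data-adaptive choice of omega is derived next in the text.
```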
Since these estimators are sums of a consistent estimator of the population mean and a consistent estimator of zero, they are all consistent for the population mean, and Wald-type confidence intervals can be constructed using a consistent estimator of the asymptotic variance. The variance of the PSPA estimator is
$$\mathrm{Var}\{\hat{\theta}_{\mathrm{PSPA}}(\omega)\} = \underbrace{\frac{\mathrm{Var}[Y]}{n}}_{\text{classical variance}} + \underbrace{\omega^2\Big(\frac{\mathrm{Var}[\hat{f}]}{n} + \frac{\mathrm{Var}[\hat{f}]}{N}\Big) - \frac{2\omega\,\mathrm{Cov}[Y, \hat{f}]}{n}}_{\text{additional terms}}. \qquad (2)$$
To guarantee that the asymptotic variance (2) is no greater than that of the classical approach, regardless of the quality of $\hat{f}$, the additional terms must be no larger than zero. A key insight is that these additional terms form a quadratic function of $\omega$:
$$q(\omega) = \omega^2\Big(\frac{\mathrm{Var}[\hat{f}]}{n} + \frac{\mathrm{Var}[\hat{f}]}{N}\Big) - \frac{2\omega\,\mathrm{Cov}[Y, \hat{f}]}{n},$$
which achieves its minimum at $\omega^{\mathrm{opt}} = \mathrm{Cov}[Y, \hat{f}]/\{\mathrm{Var}[\hat{f}] + n\,\mathrm{Var}[\hat{f}]/N\}$, with minimum value
$$q(\omega^{\mathrm{opt}}) = -\frac{\mathrm{Cov}[Y, \hat{f}]^2}{n\,\mathrm{Var}[\hat{f}] + n^2\,\mathrm{Var}[\hat{f}]/N} \le 0,$$
where equality holds if and only if $\mathrm{Cov}[Y, \hat{f}] = 0$. In this case, our PSPA estimator is at least as efficient as the classical estimator. In contrast, both the PPI and EIF*-based estimators can be less efficient than the classical estimator when the ML predictions are inaccurate. As we shall see in Section 3, similar patterns appear when estimating the general parameter defined in (1). In addition, the additional terms are below zero if and only if $\mathrm{Cov}[Y, \hat{f}] > \omega(N+n)\,\mathrm{Var}[\hat{f}]/(2N)$. This indicates that for PPI (where $\omega = 1$) to achieve an efficiency improvement, the prediction accuracy of the ML algorithm cannot be low.

Remark 2 (Data-adaptive feature). We provide intuition for the data-adaptive feature of PSPA. When the ML algorithm is purely random, $\mathrm{Cov}[Y, \hat{f}] = 0$ and hence the optimal weight is $\omega^{\mathrm{opt}} = 0$, so PSPA degenerates to the classical estimator that uses only labeled data. This aligns with the intuition that such predictions should not contribute to inference. On the other hand, if the ML prediction is perfect, then $\mathrm{Cov}[Y, \hat{f}] = \mathrm{Var}[\hat{f}] = \mathrm{Var}[Y]$ and hence the optimal weight is $\omega^{\mathrm{opt}} = N/(N+n)$. In this case, PSPA degenerates to the EIF*-based estimator, which achieves the semiparametric efficiency bound. This is also intuitive: perfect ML predictions can be treated as gold-standard data, so the estimator should utilize all the measured and predicted information and is thus efficient.

2.2 Related work

Our setting is closely related to prediction-powered inference (PPI) (Angelopoulos et al., 2023a), which ensures valid inference with gold-standard labels and arbitrary ML predictions. It corresponds to a special case of our PSPA estimator with $\omega = 1$ (or a vector of all ones). Both theoretical and numerical comparisons suggest that the PPI estimator is less efficient than the proposed PSPA estimator. Our approach builds upon and generalizes PPI++ (Angelopoulos et al., 2023b), which enhances the efficiency of PPI. Unlike PSPA, however, PPI++ cannot guarantee element-wise variance reduction, and it cannot be applied to statistical inference with ML-predicted covariates. PSPA handles this type of application while guaranteeing element-wise efficiency improvement. We present a comprehensive comparison of PSPA and PPI++ in Section 8.

Our proposal of augmenting a consistent estimator with a consistent estimator of zero is a long-standing idea.
One famous example is the augmented inverse probability weighting estimator in the literature on missing data (Robins et al., 1994) and causal inference (Robins, 2000), in which a consistent inverse probability weighting estimator is augmented by a weighted residual term with mean zero, resulting in an efficiency gain (Hahn, 1998). In fact, it reaches the semiparametric efficiency bound (Bickel et al., 1993; Hahn, 1998) when the nuisance functions are correctly specified or approximated sufficiently fast. Although the pursuit of efficiency under correct specification differs from our main focus, we establish the efficiency bound for our estimation problem (1) in Section 4 and illustrate the connection between the PSPA estimator and the efficient influence function.

The idea of using unlabeled data fits into the broader context of improving efficiency with auxiliary data, which has been widely applied in survey estimation (Breidt and Opsomer, 2017), missing data (Robins et al., 1994), measurement error models (Chen et al., 2005), causal inference with surrogate outcomes (Kallus and Mao, 2020), and semi-supervised learning (Wang and Shen, 2007; Chakrabortty and Cai, 2018). The semi-supervised learning literature adopts a data structure similar to ours, in which labeled data are accompanied by unlabeled data (possibly with surrogate outcomes). Chakrabortty and Cai (2018) considered efficiency improvement in linear models under possible model misspecification; subsequent estimation methods have been proposed for mean estimation (Zhang et al., 2019), best linear predictor estimation (Azriel et al., 2022), and general M-estimation problems (Song et al., 2023). Our work is distinct from this line of research in that we consider post-prediction inference with any black-box ML predictions for efficiency improvement.

3. PSPA for assumption-lean and data-adaptive post-prediction inference

In this section, we introduce our method, PSPA, for assumption-lean and data-adaptive post-prediction inference. For the estimand defined by (1) and any ML prediction, PSPA guarantees the validity of the inference results and element-wise variance reduction compared with the classical method that uses labeled data alone. We present our estimator and algorithm in Section 3.1 and establish the theoretical guarantees in Section 3.2. Examples applying our method to regression tasks are provided in Section 3.3.

3.1 PSPA estimator

Bearing in mind that $\hat{f}$ could be incorrect, one typical approach to estimating $\theta$ is to ignore the unlabeled data and use the labeled data only. This classical estimator $\hat{\theta}_{C}$ solves
$$\frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) = 0.$$
It is always consistent, but it may have low efficiency because it ignores the unlabeled data. In this work, we aim to provide an estimator that is consistent and more efficient than the classical estimator, regardless of the quality of $\hat{f}$. As in the mean estimation example, we add an augmentation term with mean zero, indexed by a vector $\omega = [\omega_1, \dots, \omega_q]^\top$, to the estimating equation of the classical approach:
$$\mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi(\hat{f}(z_i), x_i; \theta) - \frac{1}{n}\sum_{i=1}^{n} \psi(\hat{f}(z_i), x_i; \theta)\Big\},$$
where $\mathrm{diag}(\omega)$ is the diagonal matrix with diagonal elements $\omega_1, \dots, \omega_q$. We refer to $\omega$ as a weighting vector, as it controls the contributions of the labeled outcomes and the arbitrary prediction $\hat{f}$ in a data-adaptive way.
Together, we propose the PSPA estimator $\hat{\theta}_{\mathrm{PSPA}}(\omega)$, which solves
$$\Psi^{\omega}_{\mathrm{PSPA}}(\theta) := \frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi(\hat{f}(z_i), x_i; \theta) - \frac{1}{n}\sum_{i=1}^{n} \psi(\hat{f}(z_i), x_i; \theta)\Big\} = 0 \qquad (3)$$
for a given weighting vector $\omega$. This equation defines a class of estimators $\hat{\theta}_{\mathrm{PSPA}}(\omega)$ indexed by fixed values of $\omega$. We next establish the consistency and asymptotic normality of these estimators under the following regularity conditions. Condition (C1) is a regularity assumption on the parameter of interest; Conditions (C2)-(C4) guarantee consistency; Condition (C5) is needed for asymptotic normality. All conditions are reasonable and standard (Van der Vaart, 2000).

(C1) The parameter space $\Theta$ is a compact subset of $\mathbb{R}^q$ containing the true parameter $\theta_0$ as the unique solution to (1).

(C2) The function in (3) is continuous, and the equation has a unique solution.

(C3) $\sup_{\theta \in \Theta} \mathbb{E}[\|\psi(Y, X; \theta)\|^2] < \infty$ and $\sup_{\theta \in \Theta} \mathbb{E}[\|\mathrm{diag}(\omega)\,\psi(\hat{f}(Z), X; \theta)\|^2] < \infty$, where $\|\cdot\|$ is the $L_2$-norm taken over the true data-generating process.

(C4) Labeled and unlabeled data are i.i.d. draws from the population of interest.

(C5) There exists a function $\phi$ with $\mathbb{E}[\sup_{\theta \in \Theta} \phi^2] < \infty$ such that $\mathbb{E}\|\psi(Y, X; \theta') - \psi(Y, X; \theta'')\|_2^2 \le \phi\,\|\theta' - \theta''\|^2$ for any $\theta'$ and $\theta''$ in a neighborhood of $\theta$. Furthermore, $\mathbb{E}\{\psi(Y, X; \theta)\}$ is differentiable at $\theta$ with a nonsingular derivative matrix.

Theorem 3. Under Conditions (C1)-(C4), and assuming $n/N \to \rho$ as $n, N \to \infty$, the proposed estimator $\hat{\theta}_{\mathrm{PSPA}}(\omega)$ converges to $\theta$ in probability. Under the additional Condition (C5),
$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{PSPA}}(\omega) - \theta\big) \xrightarrow{D} N(0, \Sigma(\omega)),$$
where $\Sigma(\omega) = A^{-1}V(\omega)A^{-1}$, $A = \mathbb{E}[\partial\psi(Y, X; \theta)/\partial\theta]$, $V(\omega) = M_1 + \mathrm{diag}(\omega)(M_2 + \rho M_3)\mathrm{diag}(\omega) - 2\,\mathrm{diag}(\omega)M_4$, with $M_1 = \mathrm{Var}[\psi(Y, X; \theta)]$, $M_2 = M_3 = \mathrm{Var}[\psi(\hat{f}(Z), X; \theta)]$, and $M_4 = \mathrm{Cov}[\psi(Y, X; \theta), \psi(\hat{f}(Z), X; \theta)]$. ($M_2$ and $M_3$ coincide as population quantities since $\mathcal{L}$ and $\mathcal{U}$ share the same distribution of $(X, Z)$; we keep separate symbols because their sample analogs are computed on the unlabeled and labeled data, respectively.)

The proof is contained in Appendix B.1. Theorem 3 shows that the proposed estimator $\hat{\theta}_{\mathrm{PSPA}}(\omega)$ is asymptotically normal for a fixed weighting vector $\omega$. In practice, this weighting vector should be estimated from the data in pursuit of efficiency. We provide an estimation procedure aimed at element-wise variance reduction. This is made possible by the key observation that the $j$-th diagonal element of the asymptotic covariance matrix $\Sigma(\omega)$ is a quadratic function of $\omega_j$ and does not depend on the other components of $\omega$:
$$\Sigma_{jj}(\omega) = \omega_j^2\,[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj} - 2\omega_j\,[A^{-1}M_4A^{-1}]_{jj} + [A^{-1}M_1A^{-1}]_{jj},$$
where $[M]_{jj}$ denotes the $j$-th diagonal element of the matrix $M$. We define the optimal weighting vector $\omega^{\mathrm{opt}} = [\omega^{\mathrm{opt}}_1, \dots, \omega^{\mathrm{opt}}_q]^\top$ so that each coordinate minimizes the asymptotic variance of the corresponding coordinate:
$$\omega^{\mathrm{opt}}_j = \arg\min_{\omega_j} \Sigma_{jj}(\omega_j) = \frac{[A^{-1}M_4A^{-1}]_{jj}}{[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj}} \quad \text{for all } j \in \{1, \dots, q\}.$$
The resulting asymptotic variance for the $j$-th coordinate is
$$\Sigma_{jj}(\omega^{\mathrm{opt}}) = [A^{-1}M_1A^{-1}]_{jj} - \frac{[A^{-1}M_4A^{-1}]_{jj}^2}{[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj}}.$$
This construction of $\omega^{\mathrm{opt}}$ guarantees element-wise optimality of the asymptotic variance across the class of estimators with any weighting vector $\omega$. Next, we present our algorithm for PSPA estimation and inference.

Algorithm 1: PSPA estimation with ML-predicted labels

Input: Data $\mathcal{L} \cup \mathcal{U}$, pre-trained ML model $\hat{f}$, error rate $\alpha \in (0, 1)$.

1: Obtain the classical estimator $\hat{\theta}_C$ by solving $\frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) = 0$.

2: Obtain the optimal weighting vector $\hat{\omega}^{\mathrm{opt}} = [\hat{\omega}^{\mathrm{opt}}_1, \dots, \hat{\omega}^{\mathrm{opt}}_q]^\top$ by
$$\hat{\omega}^{\mathrm{opt}}_j = \min\Big(\frac{[\hat{A}_C^{-1}\hat{M}_{4,C}\hat{A}_C^{-1}]_{jj}}{[\hat{A}_C^{-1}(\hat{M}_{2,C} + \rho\hat{M}_{3,C})\hat{A}_C^{-1}]_{jj}},\; 1\Big) \quad \text{for all } j \in \{1, \dots, q\},$$
where $\hat{A}_C, \hat{M}_{1,C}, \hat{M}_{2,C}, \hat{M}_{3,C}, \hat{M}_{4,C}$ are sample analogs of $A, M_1, M_2, M_3, M_4$ with $\hat{\theta}_C$ plugged in.
3: Obtain the PSPA estimator with a one-step update:
$$\hat{\theta}_{\mathrm{PSPA}} = \hat{\theta}_C - \Big[\nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\hat{\theta}_C)\Big]^{-1}\,\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\hat{\theta}_C), \quad \text{where } \nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\hat{\theta}_C) = \partial\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\theta)/\partial\theta\,\big|_{\theta = \hat{\theta}_C}.$$

4: Obtain the asymptotic variance of $\hat{\theta}_{\mathrm{PSPA}}$:
$$\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}}) = \hat{A}^{-1}_{\mathrm{PSPA}}\big[\hat{M}_{1,\mathrm{PSPA}} + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})(\hat{M}_{2,\mathrm{PSPA}} + \rho\hat{M}_{3,\mathrm{PSPA}})\mathrm{diag}(\hat{\omega}^{\mathrm{opt}}) - 2\,\mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\hat{M}_{4,\mathrm{PSPA}}\big]\hat{A}^{-1}_{\mathrm{PSPA}},$$
where $\hat{A}_{\mathrm{PSPA}}, \hat{M}_{1,\mathrm{PSPA}}, \dots, \hat{M}_{4,\mathrm{PSPA}}$ are sample analogs of $A, M_1, \dots, M_4$ with $\hat{\theta}_{\mathrm{PSPA}}$ plugged in.

Output: The PSPA estimator $\hat{\theta}_{\mathrm{PSPA}}$; standard error $\sqrt{\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}})_{jj}/n}$; $\alpha$-level confidence interval $\mathcal{C}^{\mathrm{PSPA}}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA},j} \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)$; and two-sided p-value $2\big(1 - \Phi\big(|\hat{\theta}_{\mathrm{PSPA},j}|\big/\sqrt{\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)\big)$ for the $j$-th coordinate, where $\Phi$ is the CDF of the standard normal distribution.

3.2 Theoretical guarantees

Next, we establish the theoretical guarantees for our proposed estimator and algorithm. We modify Theorem 3 to reflect that $\hat{\omega}$ in the algorithm is estimated from the data.

Corollary 4. Suppose $\hat{\omega} \xrightarrow{P} \omega$ and the conditions of Theorem 3 hold. Then $\hat{\theta}_{\mathrm{PSPA}}(\hat{\omega}) \xrightarrow{P} \theta$ and $\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}}(\hat{\omega}) - \theta) \xrightarrow{D} N(0, \Sigma(\omega))$.

The proof of Corollary 4 is contained in Appendix B.2. In our algorithm, we substitute sample analogs for $A, M_1, M_2, M_3, M_4$ in $\omega^{\mathrm{opt}}_j$. Since these sample analogs are consistent estimators, Slutsky's theorem implies $\hat{\omega}^{\mathrm{opt}} \xrightarrow{P} \omega^{\mathrm{opt}}$, so the additional condition is satisfied by our algorithm. The asymptotic normality in Corollary 4 guarantees the validity of inference for our algorithm via the following Corollary 5.

Corollary 5. Suppose $\hat{\Sigma}(\hat{\omega}) \xrightarrow{P} \Sigma(\omega)$ and the conditions of Theorem 3 hold. Given confidence level $1 - \alpha \in (0, 1)$,
$$\lim_{n \to \infty} P\big(\theta_j \in \mathcal{C}^{\hat{\omega}}_{\alpha,j}\big) = 1 - \alpha, \quad j = 1, \dots, q,$$
where $\theta_j$ is the $j$-th coordinate of the parameter and $\mathcal{C}^{\hat{\omega}}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA},j}(\hat{\omega}) \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}(\hat{\omega})_{jj}/n}\big)$.

The proof of Corollary 5 follows from the asymptotic normality of the PSPA estimator in Corollary 4. Since we plug in a consistent estimator for each element of $\Sigma(\omega^{\mathrm{opt}})$, Slutsky's theorem implies $\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}}) \xrightarrow{P} \Sigma(\omega^{\mathrm{opt}})$, so the condition in Corollary 5 is also satisfied by our algorithm. The above two corollaries ensure the validity of inference for our algorithm. We next demonstrate that our method achieves element-wise variance reduction compared to all baseline methods.

Proposition 6. Suppose $\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}}(\omega) - \theta) \xrightarrow{D} N(0, \Sigma(\omega))$. With $\omega^{\mathrm{opt}}$ defined in Section 3.1, $\Sigma(\omega^{\mathrm{opt}})_{jj} \le \Sigma(\omega)_{jj}$ for all $j \in \{1, \dots, q\}$.

The proof of Proposition 6 is contained in Appendix B.3. Setting the weighting vector $\omega$ to $\omega_C = [0, \dots, 0]^\top$ or $\omega_{\mathrm{PPI}} = [1, \dots, 1]^\top$ recovers the classical method and PPI, respectively. Therefore, PSPA guarantees an element-wise reduction in asymptotic variance compared to both the classical method and PPI, for arbitrary ML models.

3.3 Examples: linear and logistic regression

Here we present two examples to illustrate the intuition behind the PSPA estimation procedure. We begin with the estimation of coefficients in a linear regression model, where the resulting estimator has a closed form.

Example 2 (Linear regression). We consider the ordinary least squares problem, where $\psi(y, x; \theta) = x(y - x^\top\theta)$ and the estimand is $\theta_0 = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]$. From equation (3), the PSPA estimator solves
$$\frac{1}{n}\sum_{i=1}^{n} x_i(y_i - x_i^\top\theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} x_i\big(\hat{f}(z_i) - x_i^\top\theta\big) - \frac{1}{n}\sum_{i=1}^{n} x_i\big(\hat{f}(z_i) - x_i^\top\theta\big)\Big\} = 0,$$
yielding the closed-form PSPA linear regression estimator
$$\hat{\theta}_{\mathrm{PSPA}} = \Big[\frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\Big(\frac{1}{N}\sum_{i=n+1}^{n+N} x_ix_i^\top - \frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top\Big)\Big]^{-1}\Big[\frac{1}{n}\sum_{i=1}^{n} x_iy_i + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\Big(\frac{1}{N}\sum_{i=n+1}^{n+N} x_i\hat{f}(z_i) - \frac{1}{n}\sum_{i=1}^{n} x_i\hat{f}(z_i)\Big)\Big],$$
where $\hat{\omega}^{\mathrm{opt}}$ is a consistent estimator of the optimal $\omega^{\mathrm{opt}}$, obtainable through Algorithm 1.
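The closed form above translates directly into a few lines of numpy; this is a minimal sketch of Example 2 under the definitions in the text, with names of our choosing.

```python
import numpy as np

def pspa_ols(X_lab, y_lab, fhat_lab, X_unlab, fhat_unlab, omega):
    """Closed-form PSPA least-squares estimate for a weight vector omega."""
    n, N = X_lab.shape[0], X_unlab.shape[0]
    W = np.diag(omega)
    # Gram matrices on labeled and unlabeled data.
    G_lab = X_lab.T @ X_lab / n
    G_unlab = X_unlab.T @ X_unlab / N
    # Left-hand side: labeled Gram matrix plus weighted Gram correction.
    lhs = G_lab + W @ (G_unlab - G_lab)
    # Right-hand side: labeled cross-moment plus weighted prediction correction.
    rhs = (X_lab.T @ y_lab / n
           + W @ (X_unlab.T @ fhat_unlab / N - X_lab.T @ fhat_lab / n))
    return np.linalg.solve(lhs, rhs)
```

With `omega` set to the zero vector this reduces to ordinary least squares on the labeled data, matching the classical special case.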
The second example is logistic regression, which lacks a closed-form solution but can be solved using Algorithm 1.

Example 3 (Logistic regression). For logistic regression, $\psi(y, x; \theta) = -xy + x\gamma_\theta(x)$, where $\gamma_\theta(x) = 1/\{1 + \exp(-x^\top\theta)\}$. Using estimating equation (3), the PSPA estimator $\hat{\theta}_{\mathrm{PSPA}}$ solves
$$\frac{1}{n}\sum_{i=1}^{n}\{-x_iy_i + x_i\gamma_\theta(x_i)\} + \mathrm{diag}(\omega)\Big[\frac{1}{N}\sum_{i=n+1}^{n+N}\{-x_i\hat{f}(z_i) + x_i\gamma_\theta(x_i)\} - \frac{1}{n}\sum_{i=1}^{n}\{-x_i\hat{f}(z_i) + x_i\gamma_\theta(x_i)\}\Big] = 0.$$

4. Relationship with the Efficient Influence Function

To connect our method with the literature on semiparametric efficiency (Bickel et al., 1993; Tsiatis, 2006), in this section we state the efficient influence function (EIF) and the semiparametric efficiency bound for the estimation problem (1) (Proposition 7) and connect them with our proposed PSPA estimator. To ease notation, we introduce a random variable $R$ indicating labeling: $r_1 = \dots = r_n = 1$ and $r_{n+1} = \dots = r_{n+N} = 0$. Throughout this section, suppose $\pi \equiv \mathrm{pr}(R = 1) = n/(n+N)$ is a fixed constant between zero and one. The introduction of $R$ is for notational simplicity only, and we make no inference on it.

Proposition 7 (Efficient influence function). The EIF for estimating $\theta$ is
$$\phi(r, y, x, z; \theta) = \frac{r}{\pi}A^{-1}\big[\psi(y, x; \theta) - \mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}\big] + A^{-1}\mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}.$$

The proof is contained in Appendix B.4. Guided by Proposition 7, an EIF-based estimator solves the empirical equation
$$0 = \frac{1}{n+N}\sum_{i=1}^{n+N} \phi(r_i, y_i, x_i, z_i; \hat{\theta}_{\mathrm{EIF}}). \qquad (4)$$
With the derived EIF, the EIF-based estimator is asymptotically normal and achieves the semiparametric efficiency bound by Section 8.4 of Molenberghs et al. (2014).

Corollary 8 (Semiparametric efficiency bound). Under Conditions (C1) and (C3), assume $\mathbb{E}\{\sup_{\theta \in \Theta} \|\phi(Y, X; \theta)\|\} < \infty$ and that equation (4) has a unique solution. Then, as $n, N \to \infty$,
$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{EIF}} - \theta_0\big) \xrightarrow{D} N\big(0,\; \pi\,\mathbb{E}\{\phi(Y, X; \theta_0)\phi(Y, X; \theta_0)^\top\}\big).$$

If the nuisance function $\mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}$ appearing in the EIF can be correctly specified or well approximated, an EIF-based estimator achieves the semiparametric efficiency bound. However, this is nearly impossible in practice with limited knowledge of the data-generating process or a small computational budget. Instead, it is typical to construct an estimator with the nuisance functions plugged in as estimates. We refer to such estimators as EIF*-based, because the influence function of the resulting estimator, denoted $\hat{\theta}_{\mathrm{EIF}^*}$, may or may not be the efficient influence function. In our setting, the nuisance function $\mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}$ can be estimated with $\psi(\hat{f}(z), x; \theta)$. A direct calculation shows that $\hat{\theta}_{\mathrm{EIF}^*}$ solves
$$\frac{1}{n}\sum_{i=1}^{n} A^{-1}\psi(y_i, x_i; \theta) + (1-\pi)A^{-1}\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi(\hat{f}(z_i), x_i; \theta) - \frac{1}{n}\sum_{i=1}^{n} \psi(\hat{f}(z_i), x_i; \theta)\Big\} = 0. \qquad (5)$$
Like equation (3), which the PSPA estimator solves, equation (5) has two parts: a term that alone yields a consistent estimator from the labeled data, and a term that utilizes the predictions $\hat{f}$ and always has mean zero. If one multiplies both sides of (3) by the matrix $A^{-1}$, then (5) corresponds to that equation with the weighting vector equal to $(1-\pi)$ in every coordinate. In estimating the outcome mean (Section 2.1), where $A$ is essentially the scalar one, the EIF*-based estimator is the PSPA estimator with weight $1 - \pi = N/(n+N)$. The difference between (3) and (5) clarifies one of the main distinctions between the proposed PSPA estimator and an EIF-based estimator: the former seeks variance reduction in a data-adaptive way, while the latter attains minimum variance only under correct specification.
In real-world situations, the proposed PSPA estimator can be much more efficient because it utilizes information in a data-adaptive fashion. Numeric comparisons are provided in Section 6.

5. Extensions

Next, we extend PSPA to more general scenarios. Section 5.1 outlines an inference procedure for fully labeled outcomes with a partially labeled covariate, and Section 5.2 discusses cases where both $Y$ and $X$ are partially labeled.

5.1 ML-predicted covariates

We modify the previously described setup so that ML is used to predict the covariate $X$ instead of the outcome $Y$. Our data again consist of two parts, $\mathcal{L} \cup \mathcal{U}$: the labeled part $\mathcal{L}$ includes data points $(y_i, x_i, z_i)$ for $i = 1, \dots, n$, and the unlabeled part $\mathcal{U}$ includes data points $(y_i, z_i)$ for $i = n+1, \dots, n+N$. In addition, we use an external prediction algorithm $\hat{q}(\cdot)$ that, applied to $z$, generates the predicted value $\hat{q}(z)$ of the covariate $X$; note that $\hat{q}$ may produce inaccurate or biased predictions of $X$. The estimand is again defined by the estimating equation (1). Analogously to our proposal for ML-predicted outcomes, we propose the PSPA estimator $\hat{\theta}_{\mathrm{PSPA}^\star}(\omega)$ with ML-predicted covariates, which solves
$$\Psi^{\omega}_{\mathrm{PSPA}^\star}(\theta) := \frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi\big(y_i, \hat{q}(z_i); \theta\big) - \frac{1}{n}\sum_{i=1}^{n} \psi\big(y_i, \hat{q}(z_i); \theta\big)\Big\} = 0.$$
Denote $A = \mathbb{E}[\partial\psi(Y, X; \theta)/\partial\theta]$, $M_1 = \mathrm{Var}[\psi(Y, X; \theta_0)]$, $M^\star_2 = M^\star_3 = \mathrm{Var}[\psi(Y, \hat{q}(Z); \theta_0)]$, and $M^\star_4 = \mathrm{Cov}[\psi(Y, X; \theta_0), \psi(Y, \hat{q}(Z); \theta_0)]$. By calculations similar to Theorem 3, the $j$-th diagonal element of the asymptotic variance of $\hat{\theta}_{\mathrm{PSPA}^\star}(\omega)$ is
$$\Sigma^\star_{jj}(\omega) = \omega_j^2\,[A^{-1}(M^\star_2 + \rho M^\star_3)A^{-1}]_{jj} - 2\omega_j\,[A^{-1}M^\star_4A^{-1}]_{jj} + [A^{-1}M_1A^{-1}]_{jj},$$
which achieves its minimum at $\omega^{\mathrm{opt}}_j = [A^{-1}M^\star_4A^{-1}]_{jj}\big/[A^{-1}(M^\star_2 + \rho M^\star_3)A^{-1}]_{jj}$. We modify Algorithm 1 as follows to incorporate ML-predicted covariates.

Algorithm 2: PSPA estimation with ML-predicted covariates

Input: Data $\mathcal{L} \cup \mathcal{U}$, pre-trained ML model $\hat{q}$, error rate $\alpha \in (0, 1)$.

1: Obtain the classical estimator $\hat{\theta}_C$ by solving $\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \theta) = 0$.

2: Let $r_j = [\hat{A}_C^{-1}\hat{M}^\star_{4,C}\hat{A}_C^{-1}]_{jj}\big/[\hat{A}_C^{-1}(\hat{M}^\star_{2,C} + \rho\hat{M}^\star_{3,C})\hat{A}_C^{-1}]_{jj}$ and obtain the optimal weighting vector $\hat{\omega}^{\mathrm{opt}} = [\hat{\omega}^{\mathrm{opt}}_1, \dots, \hat{\omega}^{\mathrm{opt}}_q]^\top$ by
$$\hat{\omega}^{\mathrm{opt}}_j = \begin{cases} \min\Big(r_j,\; \dfrac{\lambda_{\min,+}\big[\frac{1}{n}\sum_{i=1}^n \partial\psi(y_i, x_i; \theta)\big]}{\lambda_{\max}\big[\frac{1}{n}\sum_{i=1}^n \partial\psi(y_i, \hat{q}(z_i); \theta)\big]}\Big) & \text{if } r_j > 0, \\[4pt] \max(r_j, 0) & \text{if } r_j \le 0, \end{cases}$$
for all $j \in \{1, \dots, q\}$. Here $\hat{A}_C, \hat{M}^\star_{1,C}, \dots, \hat{M}^\star_{4,C}$ are sample analogs of $A, M_1, M^\star_2, M^\star_3, M^\star_4$ for $\Psi^{\omega}_{\mathrm{PSPA}^\star}(\theta)$ with $\hat{\theta}_C$ plugged in, and $\lambda_{\min,+}[M]$ and $\lambda_{\max}[M]$ denote the smallest positive and largest eigenvalues of the matrix $M$, respectively.

3: Obtain the PSPA estimator with a one-step update:
$$\hat{\theta}_{\mathrm{PSPA}^\star} = \hat{\theta}_C - \Big[\nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\hat{\theta}_C)\Big]^{-1}\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\hat{\theta}_C), \quad \text{where } \nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\hat{\theta}_C) = \partial\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\theta)/\partial\theta\,\big|_{\theta=\hat{\theta}_C}.$$

4: Obtain the asymptotic variance of $\hat{\theta}_{\mathrm{PSPA}^\star}$:
$$\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}}) = \hat{A}^{-1}_{\mathrm{PSPA}^\star}\big[\hat{M}^\star_{1} + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})(\hat{M}^\star_{2} + \rho\hat{M}^\star_{3})\mathrm{diag}(\hat{\omega}^{\mathrm{opt}}) - 2\,\mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\hat{M}^\star_{4}\big]\hat{A}^{-1}_{\mathrm{PSPA}^\star},$$
where the sample analogs are evaluated for $\Psi^{\omega}_{\mathrm{PSPA}^\star}(\theta)$ with $\hat{\theta}_{\mathrm{PSPA}^\star}$ plugged in.

Output: The PSPA estimator $\hat{\theta}_{\mathrm{PSPA}^\star}$; standard error $\sqrt{\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}})_{jj}/n}$; $\alpha$-level confidence interval $\mathcal{C}^{\star}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA}^\star,j} \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)$; and two-sided p-value $2\big(1 - \Phi\big(|\hat{\theta}_{\mathrm{PSPA}^\star,j}|\big/\sqrt{\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)\big)$ for the $j$-th coordinate, where $\Phi$ is the CDF of the standard normal distribution.
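To make the weight-clipping rule of step 2 concrete, here is a minimal numpy sketch assuming the sample derivative matrices are symmetric (as the Gram matrices in Lemma 13 are); the function and argument names are ours.

```python
import numpy as np

def clipped_weights(ratio, A_lab, B_pred):
    """Clip per-coordinate optimal weights as in step 2 of Algorithm 2 (a sketch).

    ratio:  q-vector of unconstrained minimizers
            [A^-1 M4* A^-1]_jj / [A^-1 (M2* + rho M3*) A^-1]_jj
    A_lab:  sample derivative (1/n) sum_i d psi(y_i, x_i; theta) / d theta
    B_pred: sample derivative (1/n) sum_i d psi(y_i, qhat_i; theta) / d theta
    """
    eig_A = np.linalg.eigvalsh(A_lab)     # requires a symmetric matrix
    eig_B = np.linalg.eigvalsh(B_pred)
    lam_min_pos = eig_A[eig_A > 0].min()  # smallest positive eigenvalue
    lam_max = eig_B.max()                 # largest eigenvalue
    # Negative ratios clip to 0; positive ratios cap at the eigenvalue bound,
    # keeping the estimating equation's derivative positive semi-definite
    # so that the equation has a unique solution (see Remark 10).
    return np.clip(ratio, 0.0, lam_min_pos / lam_max)
```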
Corollary 9 states the asymptotic distribution that enables interval estimation; its proof follows from the proofs of Theorem 3 and Corollaries 4 and 5 and is thus omitted.

Corollary 9. Under Conditions (C1), (C2) applied to the PSPA$^\star$ estimating equation in Section 5.1, and (C3)-(C4), assuming $n/N \to \rho$ as $n, N \to \infty$ and $\hat{\omega} \xrightarrow{P} \omega$, the proposed estimator $\hat{\theta}_{\mathrm{PSPA}^\star}(\hat{\omega})$ converges to $\theta_0$ in probability. Under the additional Condition (C5), as $n, N \to \infty$,
$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{PSPA}^\star}(\hat{\omega}) - \theta_0\big) \xrightarrow{D} N(0, \Sigma^\star(\omega)),$$
where $\Sigma^\star(\omega) = A^{-1}V^\star(\omega)A^{-1}$, $A = \mathbb{E}[\partial\psi(Y, X; \theta)/\partial\theta]$, $V^\star(\omega) = M_1 + \mathrm{diag}(\omega)(M^\star_2 + \rho M^\star_3)\mathrm{diag}(\omega) - 2\,\mathrm{diag}(\omega)M^\star_4$, with $M_1 = \mathrm{Var}[\psi(Y, X; \theta_0)]$, $M^\star_2 = M^\star_3 = \mathrm{Var}[\psi(Y, \hat{q}(Z); \theta_0)]$, and $M^\star_4 = \mathrm{Cov}[\psi(Y, X; \theta_0), \psi(Y, \hat{q}(Z); \theta_0)]$. With $\hat{\Sigma}^\star(\hat{\omega}) \xrightarrow{P} \Sigma^\star(\omega)$,
$$\lim_{n\to\infty} P\big(\theta_j \in \mathcal{C}^{\hat{\omega}}_{\alpha,j}\big) = 1 - \alpha, \quad j = 1, \dots, q,$$
where $\mathcal{C}^{\hat{\omega}}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA}^\star,j}(\hat{\omega}) \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}^\star(\hat{\omega})_{jj}/n}\big)$, $1 - \alpha \in (0, 1)$ is the confidence level, and $\theta_j$ is the $j$-th coordinate of the parameter.

Remark 10. Condition (C2) requires the PSPA$^\star$ estimating equation to have a unique solution. To ensure this, we impose a constraint on $\omega$. The equation has a unique solution if its derivative,
$$\frac{1}{n}\sum_{i=1}^{n}\big[\partial\psi(y_i, x_i; \theta) - \mathrm{diag}(\omega)\,\partial\psi\big(y_i, \hat{q}(z_i); \theta\big)\big] + \frac{1}{N}\sum_{i=n+1}^{n+N} \mathrm{diag}(\omega)\,\partial\psi\big(y_i, \hat{q}(z_i); \theta\big),$$
is positive semi-definite. The second term is positive semi-definite for non-negative $\omega_j$, so we only need to ensure the first term is positive semi-definite. According to Lemma 13 in the appendix, a sufficient condition for this is
$$0 \le \omega_j \le \frac{\lambda_{\min,+}\big[\frac{1}{n}\sum_{i=1}^{n} \partial\psi(y_i, x_i; \theta)\big]}{\lambda_{\max}\big[\frac{1}{n}\sum_{i=1}^{n} \partial\psi(y_i, \hat{q}(z_i); \theta)\big]} \quad \text{for all } j = 1, \dots, q,$$
where $\lambda_{\min,+}[M]$ and $\lambda_{\max}[M]$ denote the smallest positive and largest eigenvalues of the matrix $M$, respectively. Imposing this constraint on the weighting vector in the algorithm ensures that Condition (C2) holds. In practice, one can relax this constraint by substituting the prefixed $\omega$ into the derivative and confirming its positive semi-definiteness; if necessary, one can iteratively reduce the value of each positive $\omega_j$ (or increase each negative $\omega_j$) until the derivative achieves positive semi-definiteness.

Next, we state the theoretical guarantee that PSPA achieves element-wise variance reduction compared with the classical approach with weighting vector $\omega_C = [0, \dots, 0]^\top$.

Corollary 11. Suppose $\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}^\star}(\omega_C) - \theta) \xrightarrow{D} N(0, \Sigma^\star(\omega_C))$. Define $\hat{\omega}^{\mathrm{opt}} = [\hat{\omega}^{\mathrm{opt}}_1, \dots, \hat{\omega}^{\mathrm{opt}}_q]^\top$ as in step 2 of Algorithm 2. If $\hat{\omega}^{\mathrm{opt}} \xrightarrow{P} \omega^{\mathrm{opt}}$, then $\Sigma^\star(\omega^{\mathrm{opt}})_{jj} \le \Sigma^\star(\omega_C)_{jj}$.

The proof is contained in Appendix B.6.

5.2 ML-predicted outcome and covariates

Next, we explore the situation where neither the outcome nor the covariates are directly observed in the unlabeled data; instead, we employ machine learning to predict both. Additionally, we possess a much smaller data set with both the outcome and the covariates measured. The full data consist of two segments, $\mathcal{L} \cup \mathcal{U}$: the first, $\mathcal{L}$, includes the data points $(y_i, x_i, z_i)$ for $i = 1, \dots, n$; the second, $\mathcal{U}$, comprises the instances $z_i$ for $i = n+1, \dots, n+N$, with $N \gg n$. The variable $z$ is used to predict both the outcome and the covariates. The ML-predicted outcome and covariates are $\hat{f}(z)$ and $\hat{q}(z)$, respectively, where $\hat{f}$ and $\hat{q}$ are obtained independently.
The PSPA estimator for this situation solves
$$\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N}\psi\big(\hat{f}(z_i), \hat{q}(z_i); \theta\big) - \frac{1}{n}\sum_{i=1}^{n}\psi\big(\hat{f}(z_i), \hat{q}(z_i); \theta\big)\Big\} = 0.$$
Replacing the estimating equation in Algorithm 2 with this equation gives an algorithm for estimation and statistical inference for this task. The consistency and asymptotic normality of the resulting estimator can be established following the proof of Theorem 3.

6. Simulations

We conduct simulations to evaluate the performance of PSPA using linear and logistic regression in two settings: one with ML-predicted labels and another with ML-predicted covariates. In the ML-predicted labels setting, we consider the classical, PPI, and EIF*-based methods. The EIF*-based method employs the weighting vector $\omega_{\mathrm{EIF}^*} = [N/(N+n), \dots, N/(N+n)]^\top$, where $N$ and $n$ are the sample sizes of the unlabeled and labeled data, respectively. In the ML-predicted covariates setting, we use the classical and imputation-based methods as baselines. We present the implementation details in Section 6.1 and the simulation results in Section 6.2.

6.1 Implementation details

In all simulations, the ground-truth coefficients are obtained by Monte Carlo approximation with $5 \times 10^4$ samples. The labeled data contain 500 samples, and the unlabeled data contain 500, 1500, 2500, 5000, or 10000 samples depending on the setting. A pre-trained random forest with 100 trees is obtained from a hold-out dataset with 1000 samples. All simulations are repeated 1000 times.

6.1.1 ML-predicted labels

We simulate the labels $Y_i$ and covariates $X_{1i}, \dots, X_{50i}$ for the linear regression setting by
$$X_{1i}, \dots, X_{50i} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad Z_i \sim N(0, 1), \quad \theta_1 = \dots = \theta_{10} = 0.1, \quad \theta_{11} = \dots = \theta_{50} = 0,$$
$$Y_i = \sum_{k=1}^{50} X_{ki}\theta_k + r\,Z_i + \epsilon_i, \quad \epsilon_i \sim N(0, \tau_\epsilon),$$
where $\tau_\epsilon$ is chosen so that $\mathrm{Var}(Y_i) = 1$, and $r$ is set to 0.8 in the settings with varying unlabeled sample sizes and to 0, 0.2, 0.4, 0.6, or 0.8 in the settings with varying imputation accuracy. We use $X_{1i}, \dots, X_{50i}$ and $Z_i$ as inputs to train the random forest model that predicts $Y_i$ in the unlabeled data. For logistic regression, we adopt the same data-generating process except that we generate the label as $Y_i = \mathbb{1}(Y_i^* > \mathrm{median}(Y_i^*))$, where $Y_i^*$ denotes the continuous outcome above. Our parameter of interest is the regression coefficient of $X_{1i}$.

6.1.2 ML-predicted covariates

We simulate the labels $Y_i$ and covariates $X_{1i}, \dots, X_{10i}$ for the linear regression setting by
$$X_{1i}, \dots, X_{10i} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad \theta_1 = 0.1, \quad \theta_2 = \dots = \theta_{10} = 0, \quad Y_i = \sum_{k=1}^{10} X_{ki}\theta_k + \epsilon_i, \quad \epsilon_i \sim N(0, \tau_\epsilon),$$
where $\tau_\epsilon$ is chosen so that $\mathrm{Var}(Y_i) = 1$, and
$$Z_i = 0.1\,Y_i + r\,X_{1i} + \delta_i, \quad \delta_i \sim N(0, \tau_\delta),$$
where $\tau_\delta$ is chosen so that $\mathrm{Var}(Z_i) = 1$, and $r$ is set to 0.8 in the settings with varying unlabeled sample sizes and to 0, 0.2, 0.4, 0.6, or 0.8 in the settings with varying imputation accuracy. We use $Z_i$ as the input to train the random forest model that predicts $X_{1i}$ in the unlabeled data. For logistic regression, we use the same data-generating process except that we generate the label as $Y_i = \mathbb{1}(Y_i^* > \mathrm{median}(Y_i^*))$. We are interested in the regression coefficient of $X_{1i}$.
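For reference, the following minimal numpy sketch reproduces the linear-regression data-generating process of Section 6.1.1; the function name and seed handling are ours.

```python
import numpy as np

def simulate_ml_labels(n, r=0.8, p=50, seed=0):
    """Sketch of the Section 6.1.1 data-generating process (linear case)."""
    rng = np.random.default_rng(seed)
    theta = np.r_[np.full(10, 0.1), np.zeros(p - 10)]
    X = rng.standard_normal((n, p))
    Z = rng.standard_normal(n)
    signal = X @ theta + r * Z
    # Noise variance chosen so that Var(Y) = 1:
    # Var(signal) = sum(theta^2) + r^2.
    tau_eps = 1.0 - ((theta ** 2).sum() + r ** 2)
    Y = signal + rng.standard_normal(n) * np.sqrt(tau_eps)
    return X, Z, Y
```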
Figure 2: Coverage of the confidence interval and relative ratio of its width compared to the classical method for linear and logistic regression, when ML is used to predict the labels. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the confidence interval width in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data; panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent y = 0.95 in (a)-(d) and y = 1 in (e)-(h). Methods: Classical, EIF*-based, PSPA, PPI.

6.2 Results

6.2.1 ML-predicted labels

Figure 2 shows the simulation results for the ML-predicted labels setting. All evaluated methods achieve confidence interval coverage rates close to 95%, indicating valid inference across different unlabeled sample sizes and levels of ML prediction accuracy. The PPI and EIF*-based methods can be less efficient than the classical method when the ML predictions are poor or the unlabeled data are limited. PSPA has narrower confidence intervals than the classical method and the other baselines, especially as the unlabeled sample size and the ML prediction performance increase.

6.2.2 ML-predicted covariates

Figure 3 shows the simulation results for the ML-predicted covariates setting. The imputation-based method fails to achieve the correct confidence interval coverage. Both the classical and PSPA methods achieve coverage rates around 95%, indicating valid inference. PSPA has narrower confidence intervals than the classical method, regardless of the unlabeled sample size and the accuracy of the ML model.

Figure 3: Coverage of the confidence interval and relative ratio of its width compared to the classical method for linear and logistic regression, when ML is used to predict the covariates. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the confidence interval width in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data; panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent y = 0.95 in (a)-(d) and y = 1 in (e)-(h). Methods: Classical, PSPA, Imputation-based.
7. Data Applications

7.1 Sex-differentiated gene expression

We used the PSPA method to assess the effect of biological sex on gene expression across 44 human tissues using GTEx data. We use an ML algorithm called hypergraph factorization (HYFA) to impute missing gene expression in uncollected tissues (Viñas et al., 2023); the imputed gene expression is then used in PSPA to increase the statistical power to identify sex-biased genes. We compare PSPA with the classical, PPI, and EIF*-based approaches. In particular, we processed the GTEx v8 data following the GTEx pipeline (GTEx Consortium et al., 2015). We then used HYFA, a parameter-efficient graph representation learning approach for multi-tissue prediction of gene expression, to predict transcript levels in each uncollected tissue for each individual in the data, resulting in a gene expression dataset of 834 individuals across 44 tissues. We used cross-validation to predict expression levels in the labeled data to avoid over-fitting. Prediction accuracy is measured by the correlation between measured and imputed gene expression in the labeled data. We then used linear regression to assess the effect of sex on gene expression while controlling for technical factors, including surrogate variables, age, RNA integrity number, and total ischemic time for a sample.

Figure 4: Comparison of PSPA, classical, PPI, and EIF*-based approaches in identifying sex-biased gene expression using GTEx data. Each panel illustrates a different aspect of comparison on the y- and x-axes: point estimates between the (a) classical and PPI approaches, (b) classical and EIF*-based approaches, and (c) classical and PSPA approaches; estimated standard errors between the (d) classical and PPI approaches, (e) classical and EIF*-based approaches, and (f) classical and PSPA approaches; (g) number of sex-biased genes identified by each of the four approaches. The dashed lines represent y = x in (a)-(c) and y = 1 in (d)-(f).

Results are shown in Figure 4; each dot in panels (a)-(f) represents the inference result for one gene in one tissue. The point estimates of all methods are close to the classical approach (Figure 4 (a)-(c)). However, both PPI and EIF*-based approaches are less efficient than the classical approach when the prediction accuracy is low (Figure 4 (d)-(e)). In contrast, the PSPA estimator is always at least as efficient as the classical approach, regardless of the prediction accuracy (Figure 4 (f)). Moreover, PSPA identifies more sex-biased genes than the other approaches, demonstrating its improved efficiency over alternatives (Figure 4 (g)).

7.2 Risk factors for bone mineral density

Next, we applied PSPA to a multi-dimensional linear regression task. The goal is to identify associations between dual-energy x-ray absorptiometry (DXA)-derived total bone mineral density (DXA-BMD) and several covariates: biological sex, age, physical activity (PA), sedentary behavior (SB), smoking status (current smoker or not), and frequency of alcohol intake. DXA-BMD serves as the primary diagnostic marker for osteoporosis and fracture risk in clinical settings. For PA, individuals were assigned scores at three levels (low, medium, and high) according to the International Physical Activity Questionnaire guidelines. SB was quantified as an integer value representing the combined hours spent driving, using a computer, and watching television.
Table 1: Comparison of different methods in identifying risk factors for bone mineral density. "Estimate" and "SE" denote the estimated linear regression coefficient and its corresponding standard error for each covariate; "SE ratio" is the ratio of the standard error of each method to that of the classical method. PA denotes physical activity; SB denotes sedentary behavior. PSPA gives the smallest SE ratio for every covariate. (Method labels were flattened in extraction; the block order Classical, PPI, EIF*-based, PSPA is inferred from the SE ratios and Table 2.)

Method       Quantity   Biological sex   Age        PA         SB         Smoking    Alcohol
Classical    Estimate   -0.616           -0.190     0.019      0.040      0.006      -0.008
             SE         4.20E-03         4.12E-03   4.07E-03   4.18E-03   4.13E-03   4.20E-03
             SE ratio   1                1          1          1          1          1
PPI          Estimate   -0.604           -0.201     0.009      0.043      -0.026     -0.002
             SE         5.73E-03         6.26E-03   6.35E-03   6.48E-03   6.39E-03   6.47E-03
             SE ratio   1.363            1.517      1.560      1.553      1.549      1.541
EIF*-based   Estimate   -0.605           -0.200     0.010      0.042      -0.023     -0.003
             SE         5.28E-03         5.74E-03   5.83E-03   5.95E-03   5.87E-03   5.93E-03
             SE ratio   1.257            1.391      1.433      1.424      1.423      1.412
PSPA         Estimate   -0.614           -0.181     0.010      0.040      -0.008     -5.15E-07
             SE         3.74E-03         3.74E-03   3.74E-03   3.82E-03   3.80E-03   3.81E-03
             SE ratio   0.892            0.907      0.919      0.914      0.920      0.908

We regressed DXA-BMD on these variables using data from the UK Biobank (UKB). In the UKB, DXA-BMD measurements are available for only 10% of the participants, so we employed the SoftImpute algorithm to impute DXA-BMD values for the remaining 90% of individuals in the unlabeled dataset. To prevent overfitting, cross-validation was applied for the imputation in the labeled data. We consider the classical, PPI, and EIF*-based methods as baselines. Table 1 presents the inference results. Both the PPI and EIF*-based methods show larger standard errors than the classical method across all covariates. In contrast, PSPA shows smaller standard errors than the classical method for all coordinates, demonstrating its element-wise variance reduction. In conclusion, the real-data results are consistent with our theoretical and simulation results, demonstrating that our method PSPA provides more efficient inference in real-world post-prediction inference applications.

8. Comparison with PPI++

Finally, we present a comparison between PPI++ (Angelopoulos et al., 2023b) and our method PSPA, both theoretically and empirically. First, PPI++ can be viewed as a special case of PSPA in which every element of the weighting vector $\omega$ is constrained to the same value; PSPA degenerates to PPI++ for one-dimensional estimation tasks such as mean estimation. Moreover, PPI++ cannot be applied to the case where the covariates, rather than the labels, are predicted by ML. Theoretically, since PPI++ is a special case of PSPA, it is no more efficient than PSPA by Proposition 6; moreover, because it uses a single scalar weight, PPI++ cannot guarantee element-wise variance reduction. Empirically, we compared PPI++ with PSPA on all experiments detailed in this paper; the simulation results are shown in Figure 5, and the real-data applications in Figure 6 and Table 2. The simulations show that both PPI++ and PSPA attain correct confidence interval coverage, but PSPA has narrower confidence intervals than PPI++. In the real-data applications, PSPA identifies more sex-biased genes than PPI++ in the GTEx example and has a smaller estimated standard error than PPI++ for every covariate in the DXA-BMD example.
These results are consistent with Proposition 6, showing that PSPA is more efficient than PPI++ in post-prediction inference applications.

9. Discussion

We have provided a simple yet powerful method, PSPA, to improve the efficiency of statistical inference with arbitrary ML predictions. To enable inference, we establish the consistency and asymptotic normality of the proposed estimator and prove its superiority over baseline methods. Through extensive simulations and real-data applications, we demonstrate the superiority of the PSPA estimator over alternatives. This work fits into a variety of scientific problems and unlocks several directions for future research. The same principle can be applied to more complicated data structures and missingness mechanisms, for example in cases where a multi-dimensional outcome or covariate is subject to missingness. Another promising direction is to carefully characterize the relationship between $X$ and $Z$ and how it affects the magnitude of the efficiency improvement.

The R code implementing PSPA and the benchmark methods, and replicating the simulations and real-data analyses, is available at https://github.com/qlu-lab/pspa.

Acknowledgments

We thank the participants of the UW-Madison Statistics Graduate Student Association Student Seminar and the Lu Group Lab Meeting for their valuable comments and questions. This work was supported by the NIH (U01 HG012039, R01 DC021431) and the NSF (DMS-1953526, DMS-2122074, DMS-2310942).

Appendix A. Supplementary Figures and Tables

Figure 5: Comparison of PPI++ and PSPA in simulation. This figure shows the coverage of the confidence interval and the relative ratio of the confidence interval width compared to the classical method for linear and logistic regression, when ML is used to predict the labels. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the confidence interval width in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data; panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent y = 0.95 in (a)-(d) and y = 1 in (e)-(h).

Figure 6: Comparison of PPI++ and PSPA in identifying sex-biased genes in GTEx.

Table 2: Comparison of PPI++ and PSPA in identifying risk factors for bone mineral density. "Estimate" and "SE" denote the estimated linear regression coefficient and its corresponding standard error for each covariate; "SE ratio" is the ratio of the standard error of each method to that of the classical method. PA denotes physical activity; SB denotes sedentary behavior. PSPA gives the smallest SE ratio for each covariate.
Method   Quantity   Biological sex   Age        PA         SB         Smoking    Alcohol
PPI++    Estimate   -0.613           -0.193     0.016      0.041      -0.002     -0.006
         SE         3.86E-03         3.79E-03   3.84E-03   3.91E-03   3.87E-03   3.89E-03
         SE ratio   0.920            0.919      0.943      0.935      0.939      0.926
PSPA     Estimate   -0.614           -0.181     0.010      0.040      -0.008     -5.15E-07
         SE         3.74E-03         3.74E-03   3.74E-03   3.82E-03   3.80E-03   3.81E-03
         SE ratio   0.892            0.907      0.919      0.914      0.920      0.908

Appendix B. Proofs

B.1 Proof of Theorem 3

We first present a lemma that will be used later.

Lemma 12. Suppose $X_1, \dots, X_n$ are i.i.d., the parameter space $\Theta$ is compact, and $L(X, \theta)$ is continuous in $\theta \in \Theta$ almost everywhere. Moreover, suppose there exists a function $h(X)$ satisfying $\|L(X, \theta)\| \le h(X)$ for arbitrary $\theta \in \Theta$ and $\mathbb{E}\{h(X)\} < \infty$. Then $\mathbb{E}\{L(X, \theta)\}$ is continuous in $\theta$ and
$$\sup_{\theta \in \Theta}\Big\|\frac{1}{n}\sum_{i=1}^{n} L(X_i, \theta) - \mathbb{E}\{L(X, \theta)\}\Big\| \xrightarrow{P} 0.$$
Lemma 12 follows from Tauchen (1985, Lemma 1), so its proof is omitted.

We use Condition (C5) in the following form: there exists a small neighborhood $\mathcal{B}$ of $\theta$ in which $\psi(y, x; \theta)$ is continuously differentiable with respect to $\theta$ almost everywhere, and the partial derivatives satisfy $\mathbb{E}\sup_{\theta \in \mathcal{B}}\|\partial\psi(Y, X; \theta)/\partial\theta\| < \infty$ and $\mathbb{E}\sup_{\theta \in \mathcal{B}}\|\partial\psi(\hat{f}(Z), X; \theta)/\partial\theta\| < \infty$.

For simplicity of notation, rewrite the estimating equation (3) as
$$\Psi(\hat{\theta}_{\mathrm{PSPA}}, \omega) = \Psi^Y_n + \mathrm{diag}(\omega)\big(\Psi^{\hat{f}}_N - \Psi^{\hat{f}}_n\big) = 0,$$
where
$$\Psi^Y_n = \frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \hat{\theta}_{\mathrm{PSPA}}), \quad \Psi^{\hat{f}}_n = \frac{1}{n}\sum_{i=1}^{n}\psi\big(\hat{f}(z_i), x_i; \hat{\theta}_{\mathrm{PSPA}}\big), \quad \Psi^{\hat{f}}_N = \frac{1}{N}\sum_{i=n+1}^{n+N}\psi\big(\hat{f}(z_i), x_i; \hat{\theta}_{\mathrm{PSPA}}\big).$$

We first establish consistency following Van der Vaart (2000, Theorem 5.9) by checking its two conditions. The deterministic condition is implied by Conditions (C1) and (C2). It remains to verify the uniform convergence condition $\|\Psi(\hat{\theta}_{\mathrm{PSPA}}) - \mathbb{E}(\Psi^Y_n)\| = o_p(1)$. By Condition (C3) and Lemma 12,
$$\|\mathrm{diag}(\omega)\Psi^{\hat{f}}_n - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_n)\| = o_p(1), \quad \|\mathrm{diag}(\omega)\Psi^{\hat{f}}_N - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_N)\| = o_p(1), \quad \|\Psi^Y_n - \mathbb{E}(\Psi^Y_n)\| = o_p(1).$$
Hence, by the triangle inequality and the fact that $\mathbb{E}(\Psi^{\hat{f}}_N) = \mathbb{E}(\Psi^{\hat{f}}_n)$,
$$\|\Psi(\hat{\theta}_{\mathrm{PSPA}}) - \mathbb{E}(\Psi^Y_n)\| \le \|\mathrm{diag}(\omega)\Psi^{\hat{f}}_N - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_N)\| + \|\mathrm{diag}(\omega)\Psi^{\hat{f}}_n - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_n)\| + \|\Psi^Y_n - \mathbb{E}(\Psi^Y_n)\| = o_p(1),$$
which completes the proof of consistency, i.e., $\hat{\theta}_{\mathrm{PSPA}} \xrightarrow{P} \theta$.

Next, we establish asymptotic normality. Expanding $\Psi(\hat{\theta}_{\mathrm{PSPA}}, \omega)$ at the true value $\theta$,
$$0 = \Psi(\theta, \omega) + (\hat{\theta}_{\mathrm{PSPA}} - \theta)^\top\,\partial\Psi(\tilde{\theta})/\partial\theta$$
for some $\tilde{\theta}$ between $\theta$ and $\hat{\theta}_{\mathrm{PSPA}}$. Multiplying both sides by $\sqrt{n}$ and rearranging,
$$\big\{\partial\Psi(\tilde{\theta})/\partial\theta\big\}\,\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}} - \theta) = -\sqrt{n}\,\Psi(\theta, \omega).$$
By Slutsky's theorem, it suffices to verify that (i) $\|\partial\Psi(\tilde{\theta}, \omega)/\partial\theta - \mathbb{E}\{\partial\psi(\theta)/\partial\theta\}\| = o_p(1)$ and (ii) $\sqrt{n}\,\Psi(\theta, \omega) \xrightarrow{D} N(0, V(\omega))$.

To show (i), note that within the neighborhood $\mathcal{B}$ of Condition (C5),
$$\Big\|\frac{\partial\Psi(\theta, \omega)}{\partial\theta} - \frac{\partial\Psi^Y_n(\theta)}{\partial\theta}\Big\| \le \Big\|\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_n}{\partial\theta} - \mathbb{E}\Big[\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_n}{\partial\theta}\Big]\Big\| + \Big\|\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_N}{\partial\theta} - \mathbb{E}\Big[\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_N}{\partial\theta}\Big]\Big\|.$$
Together with Condition (C4) and Lemma 12, this gives $\|\partial\Psi(\tilde{\theta})/\partial\tilde{\theta} - \partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta}\| = o_p(1)$. By Conditions (C1), (C4) and Lemma 12, $\|\partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta} - \mathbb{E}[\partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta}]\| = o_p(1)$. Combining the two displays yields $\sup_{\tilde{\theta} \in \mathcal{B}}\|\partial\Psi(\tilde{\theta})/\partial\tilde{\theta} - \mathbb{E}[\partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta}]\| = o_p(1)$. The consistency of $\hat{\theta}_{\mathrm{PSPA}}$ and the continuous mapping theorem then imply (i). Meanwhile, (ii) holds by the central limit theorem. This completes the proof of asymptotic normality.

B.2 Proof of Corollary 4

We denote the corresponding estimator by $\hat{\theta}(\hat{\omega})$. The consistency of $\hat{\theta}(\hat{\omega})$ follows from the consistency of $\hat{\omega}$ and the proof in Section B.1.
B.2 Proof of Corollary 4

We denote the corresponding estimator as $\hat\theta(\hat\omega)$. The consistency of $\hat\theta(\hat\omega)$ follows from the consistency of $\hat\omega$ and the proof in Section B.1. For asymptotic normality of $\hat\theta(\hat\omega)$, applying a Taylor expansion to $0 = \Psi(\hat\theta, \hat\omega)$ yields
$$0 = \Psi(\hat\theta, \hat\omega) = \Psi(\theta, \omega^{\mathrm{opt}}) + \mathrm{diag}(\hat\omega - \omega^{\mathrm{opt}})\big({-\Psi_n^{\hat f}} + \Psi_N^{\hat f}\big) + \big\{\partial \Psi(\bar\theta)/\partial\theta^\top\big\}(\hat\theta - \theta)$$
for some $\bar\theta$ between $\theta$ and $\hat\theta(\hat\omega)$. With the same proof technique as in Section B.1, we have $\sqrt{n}(\hat\theta(\hat\omega) - \theta) \xrightarrow{D} N(0, \Sigma(\omega^{\mathrm{opt}}))$, which implies Corollary 4.

B.3 Proof of Proposition 6

The $j$-th diagonal element of the asymptotic covariance matrix $\Sigma(\omega)$ of $\hat\theta_{\mathrm{PSPA}}(\omega)$, given the weighting vector $\omega$, is
$$\Sigma_{jj}(\omega) = \omega_j^2 [A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj} - 2\omega_j [A^{-1}M_4 A^{-1}]_{jj} + [A^{-1}M_1 A^{-1}]_{jj}.$$
It is a quadratic function of $\omega_j$ with the unique minimizer
$$\omega_j^{\mathrm{opt}} = \frac{[A^{-1}M_4 A^{-1}]_{jj}}{[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj}}.$$
Therefore, given any weighting vector $\omega^* = [\omega_1^*, \ldots, \omega_q^*]^\top$, $\Sigma_{jj}(\omega^{\mathrm{opt}}) \le \Sigma_{jj}(\omega^*)$, which completes the proof.

B.4 Proof of Proposition 7

The joint log-likelihood for a generic observation $(r, x, z, y)$ is
$$\ell = r\log p(y \mid x, z) + \log p(x \mid z) + \log p(z) + r\log(\pi) + (1-r)\log(1-\pi).$$
Taking the semiparametric approach (Bickel et al., 1993; Tsiatis, 2006), we consider the Hilbert space $\mathcal{H}$ that contains all one-dimensional zero-mean measurable functions of the observed data with finite variance, equipped with the inner product $\langle h_1, h_2 \rangle = E\{h_1(\cdot)h_2(\cdot)\}$, where $h_1, h_2 \in \mathcal{H}$. To estimate $\theta$, we regard $p(y \mid x, z)$, $p(x \mid z)$, and $p(z)$ as nuisance functions. Denote their nuisance tangent spaces by $\mathcal{T}_y$, $\mathcal{T}_x$, and $\mathcal{T}_z$, defined as the mean-squared closure of the nuisance tangent spaces of parametric submodels spanned by the nuisance score vectors. We have $\mathcal{T} = \mathcal{T}_y \oplus \mathcal{T}_x \oplus \mathcal{T}_z$, where
$$\mathcal{T}_y = \{r\,a_1(y, x, z) : E\{a_1(Y, x, z) \mid x, z\} = 0\}, \quad \mathcal{T}_x = \{a_2(x, z) : E\{a_2(X, z) \mid z\} = 0\}, \quad \mathcal{T}_z = \{a_3(z) : E\{a_3(Z)\} = 0\},$$
and $\oplus$ represents the direct sum of spaces that are orthogonal to each other. We introduce parametric submodels $p_{\eta_y}(y \mid x, z)$, $p_{\eta_x}(x \mid z)$, and $p_{\eta_z}(z)$, where $\eta = [\eta_y^\top, \eta_x^\top, \eta_z^\top, \eta_r^\top]^\top$ is a vector of nuisance parameters. Let
$$S_{\eta_y}(y, x, z) = \frac{\partial \log p_{\eta_y}(y \mid x, z)}{\partial \eta_y}, \quad S_{\eta_x}(x, z) = \frac{\partial \log p_{\eta_x}(x \mid z)}{\partial \eta_x}, \quad S_{\eta_z}(z) = \frac{\partial \log p_{\eta_z}(z)}{\partial \eta_z};$$
then these score functions satisfy $r\,S_{\eta_y}(y, x, z) \in \mathcal{T}_y$, $S_{\eta_x}(x, z) \in \mathcal{T}_x$, and $S_{\eta_z}(z) \in \mathcal{T}_z$. Recall the definition of the parameter of interest, i.e.,
$$\iiint \psi(y, x; \theta)\, p(y \mid x, z)\, p(x \mid z)\, p(z)\, dz\, dx\, dy = 0.$$
Taking derivatives of $\theta$ with respect to the nuisance parameters,
$$\frac{\partial \theta}{\partial \eta_y^\top} = A^{-1}\frac{\partial E\{\psi(Y, X; \theta)\}}{\partial \eta_y^\top} = A^{-1}E\{\psi(Y, X; \theta)S_{\eta_y}^\top(Y, X, Z)\},$$
$$\frac{\partial \theta}{\partial \eta_x^\top} = A^{-1}\frac{\partial E\{\psi(Y, X; \theta)\}}{\partial \eta_x^\top} = A^{-1}E\{\psi(Y, X; \theta)S_{\eta_x}^\top(X, Z)\},$$
$$\frac{\partial \theta}{\partial \eta_z^\top} = A^{-1}\frac{\partial E\{\psi(Y, X; \theta)\}}{\partial \eta_z^\top} = A^{-1}E\{\psi(Y, X; \theta)S_{\eta_z}^\top(Z)\}.$$
Let $\phi(y, x; \theta) = r\,a_1(y, x, z) + a_2(x, z) + a_3(z)$, where
$$a_1(y, x, z) = \frac{1}{\pi(z)}A^{-1}\big[\psi(y, x; \theta) - E\{\psi(Y, x; \theta) \mid x, z\}\big],$$
$$a_2(x, z) = A^{-1}\big[E\{\psi(Y, x; \theta) \mid x, z\} - E\{\psi(Y, X; \theta) \mid z\}\big],$$
$$a_3(z) = A^{-1}E\{\psi(Y, X; \theta) \mid z\}.$$
Then it can be easily verified that $r\,a_1(y, x, z) \in \mathcal{T}_y$, $a_2(x, z) \in \mathcal{T}_x$, $a_3(z) \in \mathcal{T}_z$, and
$$E\{\phi(Y, X; \theta)R\,S_{\eta_y}^\top(Y, X, Z)\} = \frac{\partial \theta}{\partial \eta_y^\top}, \quad E\{\phi(Y, X; \theta)S_{\eta_x}^\top(X, Z)\} = \frac{\partial \theta}{\partial \eta_x^\top}, \quad E\{\phi(Y, X; \theta)S_{\eta_z}^\top(Z)\} = \frac{\partial \theta}{\partial \eta_z^\top}.$$
Therefore, $\phi(y, x; \theta)$ is the efficient influence function. The proof of Proposition 7 is completed.

B.5 Lemma for PSPA when applied to ML-predicted covariates

Lemma 13 Let $A$ and $B$ be two $q \times q$ Gram matrices and let $\omega$ be a $q$-dimensional vector. If $0 \le \omega_j \le \lambda_{\min,+}(A)/\lambda_{\max}(B)$ for all $j \in \{1, \ldots, q\}$, then $A - \mathrm{diag}(\omega)B$ is positive semi-definite.

To prove the lemma, we verify that $A - \mathrm{diag}(\omega)B$ is positive semi-definite by showing that the quadratic form satisfies $x^\top(A - \mathrm{diag}(\omega)B)x \ge 0$ for any vector $x \in \mathbb{R}^q$. Since
$$x^\top A x \ge \lambda_{\min,+}(A)\|x\|^2, \quad x^\top \mathrm{diag}(\omega)Bx \le \lambda_{\max}(\mathrm{diag}(\omega)B)\|x\|^2 \le \big(\max_j \omega_j\big)\lambda_{\max}(B)\|x\|^2,$$
we have
$$x^\top A x - x^\top \mathrm{diag}(\omega)Bx \ge \big\{\lambda_{\min,+}(A) - \max_j \omega_j\,\lambda_{\max}(B)\big\}\|x\|^2.$$
To ensure $x^\top A x - x^\top \mathrm{diag}(\omega)Bx \ge 0$, it suffices that $\lambda_{\min,+}(A) - \max_j \omega_j\,\lambda_{\max}(B) \ge 0$, that is,
$$\max_j \omega_j \le \frac{\lambda_{\min,+}(A)}{\lambda_{\max}(B)}.$$
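The bound in Lemma 13 translates directly into a computable cap on the weights. Below is a minimal R sketch, assuming A and B are the symmetric $q \times q$ Gram matrices of the lemma; the helper name clip_omega and the numerical tolerance are our own illustrative choices, not part of the released pspa package.

```r
# Illustrative sketch of the Lemma 13 cap (helper name and tolerance are ours):
# clip each omega_j into [0, lambda_min+(A) / lambda_max(B)] so that
# A - diag(omega) %*% B remains positive semi-definite.
clip_omega <- function(omega, A, B, tol = 1e-12) {
  eval_A <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
  lambda_min_pos <- min(eval_A[eval_A > tol])  # smallest positive eigenvalue of A
  lambda_max_B <- max(eigen(B, symmetric = TRUE, only.values = TRUE)$values)
  pmin(pmax(omega, 0), lambda_min_pos / lambda_max_B)
}
```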
B.6 Proof of Corollary 11

The $j$-th diagonal element of the asymptotic covariance matrix $\Sigma^*(\omega)$ of $\hat\theta_{\mathrm{PSPA}}$, given the weighting vector $\omega$, is
$$\Sigma^*_{jj}(\omega) = \omega_j^2 [A^{-1}(M_2^* + \rho M_3^*)A^{-1}]_{jj} - 2\omega_j [A^{-1}M_4^* A^{-1}]_{jj} + [A^{-1}M_1^* A^{-1}]_{jj}.$$
It is a quadratic function of $\omega_j$ with the unique minimizer
$$\omega_j^{\mathrm{opt},*} = \frac{[A^{-1}M_4^* A^{-1}]_{jj}}{[A^{-1}(M_2^* + \rho M_3^*)A^{-1}]_{jj}}.$$
If $\omega_j^{\mathrm{opt},*} > 0$, $\Sigma^*_{jj}(\omega)$ is a decreasing function of $\omega_j$ on $[0, \omega_j^{\mathrm{opt},*}]$. If $\omega_j^{\mathrm{opt},*} \le 0$, $\Sigma^*_{jj}(\omega)$ is an increasing function of $\omega_j$ on $[\omega_j^{\mathrm{opt},*}, 0]$. We also have
$$\frac{\lambda_{\min,+}\big\{\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \theta)\big\}}{\lambda_{\max}\big\{\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, \hat q_i; \theta)\big\}} > 0.$$
Therefore, given $\hat\omega^{\mathrm{opt}} = [\hat\omega_1^{\mathrm{opt}}, \ldots, \hat\omega_q^{\mathrm{opt}}]^\top$, where
$$\hat\omega_j^{\mathrm{opt}} = \begin{cases}
\min\left(\dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}},\; \dfrac{\lambda_{\min,+}\{\frac{1}{n}\sum_i \psi(y_i, x_i; \theta)\}}{\lambda_{\max}\{\frac{1}{n}\sum_i \psi(y_i, \hat q_i; \theta)\}}\right) & \text{if } \dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}} > 0, \\[2ex]
\max\left(\dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}},\; 0\right) & \text{if } \dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}} \le 0,
\end{cases}$$
for all $j \in \{1, \ldots, q\}$, and the weighting vector $\omega_C = [0, \ldots, 0]^\top$ that corresponds to the classical estimator, we have $\Sigma^*_{jj}(\hat\omega^{\mathrm{opt}}) \le \Sigma^*_{jj}(\omega_C)$, which completes the proof.
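Combining B.3, B.5, and B.6, the data-adaptive weight is the element-wise quadratic minimizer truncated to the safe range. Below is a schematic R sketch under the plug-in quantities above, where Ainv, M2, M3, M4 denote estimates of $\hat A^{-1}$, $\hat M_{2,C}$, $\hat M_{3,C}$, $\hat M_{4,C}$, rho is the sample-size ratio as defined in the paper, and cap is the Lemma 13 eigenvalue bound; the function name omega_opt_clipped is ours and not part of the pspa package.

```r
# Schematic sketch (names are ours, not the pspa package API): element-wise
# optimal weight from the quadratic in B.6, truncated to [0, cap] per Lemma 13.
# Ainv: estimate of A^{-1}; M2, M3, M4: plug-in covariance pieces;
# rho: the labeled/unlabeled sample-size ratio as defined in the paper.
omega_opt_clipped <- function(Ainv, M2, M3, M4, rho, cap) {
  num <- diag(Ainv %*% M4 %*% Ainv)
  den <- diag(Ainv %*% (M2 + rho * M3) %*% Ainv)
  ratio <- num / den
  ifelse(ratio > 0, pmin(ratio, cap), 0)  # the two cases in B.6
}
```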
References

Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023a. doi: 10.1126/science.adi6000. URL https://www.science.org/doi/abs/10.1126/science.adi6000.

Anastasios N. Angelopoulos, John C. Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453, 2023b.

David Azriel, Lawrence D. Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022.

Mahashweta Basu, Kun Wang, Eytan Ruppin, and Sridhar Hannenhalli. Predicting tissue-specific gene expression from whole blood transcriptome. Science Advances, 7(14):eabd6991, 2021.

Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Springer, 1993.

F. Jay Breidt and Jean D. Opsomer. Model-assisted survey estimation with modern prediction techniques. Statistical Science, 2017.

Eric L. Bullock, Curtis E. Woodcock, Carlos Souza Jr., and Pontus Olofsson. Satellite-based estimates reveal widespread forest degradation in the Amazon. Global Change Biology, 26(5):2956–2969, 2020.

Abhishek Chakrabortty and Tianxi Cai. Efficient and adaptive linear regression in semi-supervised settings. The Annals of Statistics, 2018.

Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies, 72(2):343–366, 2005.

GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020.

GTEx Consortium, Kristin G. Ardlie, David S. Deluca, Ayellet V. Segrè, Timothy J. Sullivan, Taylor R. Young, Ellen T. Gelfand, Casandra A. Trowbridge, Julian B. Maller, Taru Tukiainen, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 348(6235):648–660, 2015.

Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331, 1998.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Nathan Kallus and Xiaojie Mao. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408, 2020.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 45(6):580–585, 2013.

Geert Molenberghs, Garrett Fitzmaurice, Michael G. Kenward, Anastasios Tsiatis, and Geert Verbeke. Handbook of Missing Data Methodology. CRC Press, 2014.

James M. Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, volume 1999, pages 6–10, Indianapolis, IN, 2000.

James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

Shanshan Song, Yuanyuan Lin, and Yong Zhou. A general M-estimation theory in semi-supervised framework. Journal of the American Statistical Association, pages 1–11, 2023.

George Tauchen. Diagnostic testing and evaluation of maximum likelihood models. Journal of Econometrics, 30(1):415–443, 1985. ISSN 0304-4076. doi: https://doi.org/10.1016/0304-4076(85)90149-6. URL https://www.sciencedirect.com/science/article/pii/0304407685901496.

Anastasios A. Tsiatis. Semiparametric Theory and Missing Data. Springer, 2006.

Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Ramon Viñas, Chaitanya K. Joshi, Dobrik Georgiev, Phillip Lin, Bianca Dumitrascu, Eric R. Gamazon, and Pietro Liò. Hypergraph factorization for multi-tissue gene expression imputation. Nature Machine Intelligence, 5(7):739–753, 2023.

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.

Jiebiao Wang, Eric R. Gamazon, Brandon L. Pierce, Barbara E. Stranger, Hae Kyung Im, Robert D. Gibbons, Nancy J. Cox, Dan L. Nicolae, and Lin S. Chen. Imputing gene expression in uncollected tissues within and beyond GTEx. The American Journal of Human Genetics, 98(4):697–708, 2016.

Junhui Wang and Xiaotong Shen. Large margin semi-supervised learning. Journal of Machine Learning Research, 8(8), 2007.

Siruo Wang, Tyler H. McCormick, and Jeffrey T. Leek. Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48):30266–30275, 2020.

Anru Zhang, Lawrence D. Brown, and T. Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics, 2019.