Journal of Machine Learning Research 26 (2025) 1-31    Submitted 1/24; Revised 6/25; Published 6/25

Assumption-lean and data-adaptive post-prediction inference

Jiacheng Miao* (jiacheng.miao@wisc.edu), Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
Xinran Miao* (xinran.miao@wisc.edu), Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA
Yixuan Wu (wu638@wisc.edu), University of Wisconsin-Madison, Madison, WI 53726, USA
Jiwei Zhao (jiwei.zhao@wisc.edu), Department of Biostatistics and Medical Informatics, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53726, USA
Qiongshi Lu (qlu@biostat.wisc.edu), Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA

Editor: Edo Airoldi

Abstract

A primary challenge facing modern scientific research is the limited availability of gold-standard data, which can be costly, labor-intensive, or invasive to obtain. With the rapid development of machine learning (ML), scientists can now employ ML algorithms to predict gold-standard outcomes using variables that are easier to obtain. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring the imprecision and heterogeneity introduced by the prediction procedure, which is likely to result in false positive findings and invalid scientific conclusions. In this work, we introduce PoSt-Prediction Adaptive inference (PSPA), which allows valid and powerful inference based on ML-predicted data. Its assumption-lean property guarantees reliable statistical inference without assumptions on the ML prediction. Its data-adaptive feature guarantees an efficiency gain over existing methods, regardless of the accuracy of the ML prediction. We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications.

Keywords: Inference with predicted data; Missing data; Semiparametric efficiency

*These authors contributed equally to this work.

(c) 2025 Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, and Qiongshi Lu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0056.html.

1. Introduction

Gene expression data obtained from various tissues and cell types can provide crucial insights into the coordinated biological mechanisms that underlie disease etiology and characterize homeostasis (Lonsdale et al., 2013). However, some important tissues are difficult to collect, which limits the sample size of gene expression data in those tissues. For example, the Genotype-Tissue Expression (GTEx) project is a comprehensive study of gene expression regulation in many human tissues (GTEx Consortium et al., 2015). The percentage of individuals with missing gene expression ranges from 5% to 90% (median 47%) across all tissues in GTEx (GTEx Consortium, 2020). This limits the scientific understanding of transcription regulation across tissue contexts. The difficulty of obtaining gold-standard data extends well beyond human gene expression applications (Wang et al., 2023). While gold-standard data with high reliability are essential to the validity of scientific discoveries, obtaining them is often costly and labor-intensive.
Fortunately, the advent and rapid development of machine learning (ML) have enabled the prediction of outcomes using more accessible variables (LeCun et al., 2015; He et al., 2016), showing great potential to reduce the need for gold-standard data. Despite these benefits, replacing gold-standard data with ML predictions poses new challenges, particularly for maintaining the validity of downstream statistical analyses. This issue is exemplified by statistical analyses that use imputed gene expression in the GTEx project. To address insufficient sample sizes in rare tissues, several approaches have been proposed to impute gene expression in these tissues using expression data from tissues that are easier to acquire (Wang et al., 2016; Viñas et al., 2023; Basu et al., 2021). However, imputed gene expression is often treated as if it were observed and used directly in subsequent statistical analyses for scientific discovery (Viñas et al., 2023), such as exploring sex differences and the genetic basis of gene regulation. As demonstrated by Wang et al. (2020) and Angelopoulos et al. (2023a), using ML predictions directly in statistical inference is likely to produce false positive findings, leading to invalid scientific conclusions. The challenge of making valid inferences from ML predictions extends beyond this specific example, affecting numerous scientific disciplines where ML is applied (Bullock et al., 2020).

Recent work has introduced prediction-powered inference (PPI), which leverages a small set of labeled data with gold-standard outcomes and a large amount of unlabeled data with ML predictions to perform valid statistical inference (Angelopoulos et al., 2023a). Although validity is guaranteed, PPI may be less efficient than inference relying solely on labeled data. For instance, when applied to GTEx data, PPI identifies fewer sex-biased genes than the classical approach based on labeled data alone, highlighting a loss of statistical efficiency (Figure 1). This limitation reduces the broader applicability of PPI and similar methods.

To address this problem, we introduce a valid, powerful, and widely applicable approach named PoSt-Prediction Adaptive inference (PSPA). PSPA integrates ML predictions with the observed gold-standard data to ensure valid and efficient statistical inference. We highlight two key features of PSPA:

Assumption-lean: The PSPA estimator is consistent and asymptotically normal, ensuring valid inference with no assumptions on the ML model or its prediction accuracy. PSPA can therefore be used with arbitrarily misspecified ML predictions.

Data-adaptive: The PSPA estimator adapts to the accuracy of the ML prediction. It utilizes more information from a better ML prediction to reduce variance, and it avoids variance inflation otherwise. Below, we show that PSPA achieves element-wise variance reduction compared to existing methods, regardless of the quality of the ML prediction.

Our contributions are threefold: (i) We present a simple and data-adaptive method for statistical inference with predicted data, for estimands defined through estimating equations, without any assumptions on the ML prediction or parameterization of the true data-generating process. (ii) We establish the consistency and asymptotic normality of the proposed estimator.
We further demonstrate the optimality of our estimator in terms of asymptotic variance over a class of estimators that includes existing methods for post-prediction inference. (iii) In contrast to current methods, which focus only on incorporating ML-predicted labels in statistical inference, our approach extends to ML-predicted covariates as well as labels, enhancing the versatility and applicability of post-prediction inference.

The rest of this paper is organized as follows. In Section 2, we formulate the problem and illustrate our method with mean estimation to build intuition. In Section 3, we introduce a general protocol for applying PSPA to estimands defined through estimating equations, establish the asymptotic properties of the PSPA estimator, and describe the estimation procedure. We connect our method with semiparametric efficiency theory in Section 4 and describe extensions in Section 5. Through extensive simulations (Section 6) and real data analysis (Section 7), we validate our theoretical claims and demonstrate the practical utility of the PSPA estimator. We compare our method with the recent method PPI++ (Angelopoulos et al., 2023b) in Section 8 and conclude with a discussion in Section 9.

Figure 1: Comparison of PPI, PSPA, and the classical method in identifying genes with sex-biased expression using GTEx data. (a) Number of sex-biased genes identified by each approach. (b) x-axis: absolute value of the imputation correlation; y-axis: relative ratio of the estimated standard error between PPI and the classical method. (c) Same as (b) but for our method PSPA. The dashed lines represent y = 1 in (b)-(c).

2. Problem Formulation

Let $Y$ be a scalar outcome and $X$ be $d$-dimensional covariates. The scientific interest is to estimate a $q$-dimensional parameter $\theta$ defined through an estimating equation,
$$\mathbb{E}\{\psi(Y, X; \theta)\} = 0, \qquad (1)$$
where $\psi(\cdot, \cdot; \theta)$ is a user-defined function. This definition of $\theta$ is very general and includes the outcome mean, outcome quantiles, least squares coefficients, and other quantities of interest involving both $X$ and $Y$, such as the unique minimizer of a loss function or the maximizer of a criterion function (Van der Vaart, 2000).

We observe a random sample in which only a small subset is labeled with the outcome $Y$. In addition, we observe an auxiliary variable $Z$ that is predictive of the outcome $Y$. Note that $Z$ could be a subset of $X$ and vice versa; see Remark 1. Denote this sample as $\mathcal{L} \cup \mathcal{U}$, where $\mathcal{L} = \{(y_i, x_i, z_i),\, i = 1, \dots, n\}$ and $\mathcal{U} = \{(x_i, z_i),\, i = n+1, \dots, n+N\}$. Assume that $\mathcal{L}$ and $\mathcal{U}$ are independent and share the same marginal distribution of $(X, Z)$. Let $n/N \to \rho$ as $n, N \to \infty$, with $\rho = \infty$ if there is no unlabeled data.

We consider the availability of an external and independent prediction algorithm $\hat{f}(\cdot)$ that, applied to $z$, produces a prediction $\hat{f}(z)$ of the outcome variable $Y$. We assume the operating characteristics of $\hat{f}$ are unknown to the user and that the data used to fit $\hat{f}$ are unavailable; $\hat{f}$ is treated as a black-box function and may be incorrect in predicting $y$. With access to this black box, our primary goal is to improve efficiency while making inference on the parameter $\theta$. We refer to this problem as post-prediction inference, where inference occurs after predictions are made independently (Wang et al., 2020; Angelopoulos et al., 2023a). Specifically, we make no assumptions on $\hat{f}$.
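To make the estimating-equation formulation concrete, here is a minimal sketch of $\psi$ for two estimands named in the text (outcome mean and least squares); the function names are ours, for illustration only.

```python
import numpy as np

# Illustrative estimating functions psi(y, x; theta); names are ours.

def psi_mean(y, x, theta):
    # Outcome mean: E[Y - theta] = 0  =>  theta = E[Y].
    return y - theta

def psi_ols(y, x, theta):
    # Least squares: E[X (Y - X' theta)] = 0  =>  theta = E[XX']^{-1} E[XY].
    return x * (y - x @ theta)
```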
This protocol can benefit researchers for whom training a prediction algorithm is unrealistic, and it offers guidelines for scientists lacking computational resources or predictive-modeling expertise.

Example 1 (Sex-differentiated gene expression). We present the scientific problem that motivates our setup: identifying sex-biased gene expression in brain cortex tissue using data from the GTEx project. For this analysis, we have labeled data $\mathcal{L} = \{(y_i, \hat{f}_i, x_i, z_i),\, i = 1, \dots, 205\}$, where $y_i$ is the gene expression in brain cortex tissue; $x_i$ includes a binary indicator for biological sex in $\{0, 1\}$, along with technical factors such as surrogate variables for batch effects, age, RNA integrity number, and total ischemic time; $z_i$ is the whole-transcriptome gene expression profile in whole blood, a tissue that is easier to access; and $\hat{f}_i$ is the gene expression in brain cortex imputed from $z_i$. In addition, our unlabeled data $\mathcal{U} = \{(\hat{f}_i, x_i, z_i),\, i = 206, \dots, 670\}$ contain 465 samples whose gene expression was measured in whole blood but not in brain cortex. Our interest lies in the linear regression coefficient $\theta$ solving $\mathbb{E}[X(Y - X^\top\theta)] = 0$, focusing on the coefficient for biological sex.

Remark 1 (The consideration of X and Z). We split the covariates into two (possibly overlapping) parts, $Z$ and $X$: the former is predictive of the outcome $Y$, and the latter appears in the estimating equation (1). They could be the same, but a wide range of applications suggests they are different. When $Z$ is not a subset of $X$, the information it carries about the outcome $Y$ may be the most informative for the estimation problem (1). Hence, a better prediction algorithm $\hat{f}$ does not necessarily correspond to a better estimate of $\theta$. As justified above, this work does not attempt to improve the prediction algorithm, but to empower inference on the parameter of interest given an arbitrary black-box $\hat{f}$.

In the following, Section 2.1 presents the key idea of the proposed method using mean estimation as an example, and Section 2.2 reviews related work. In Section 3, we introduce the proposed framework for the estimation problem defined in equation (1).

2.1 Example: estimating the outcome mean

We illustrate the intuition behind our approach using mean estimation. Consider $\psi(y, x; \theta) = y - \theta$, so that the estimand is $\theta = \mathbb{E}(Y)$, the outcome mean. The classical approach estimates $\theta$ by the sample average of the observed outcomes,
$$\hat{\theta}_{C} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
This estimator is unbiased, but it can have high variance because of the small sample size of the labeled data. To increase efficiency using the ML-predicted outcomes in the unlabeled data, we propose to add an unbiased estimator of zero scaled by a weighting scalar $\omega$:
$$\hat{\theta}_{\mathrm{PSPA}}(\omega) = \frac{1}{n}\sum_{i=1}^{n} y_i + \omega\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \hat{f}(z_i) - \frac{1}{n}\sum_{i=1}^{n} \hat{f}(z_i)\Big\}.$$
We refer to this estimator as the PSPA estimator. Both the classical and the PPI estimator (Angelopoulos et al., 2023a) are special cases of PSPA: the classical estimator is the PSPA estimator with $\omega = 0$, and the PPI estimator is the PSPA estimator with $\omega = 1$. As detailed in Section 4, another candidate estimator is guided by the efficient influence function (EIF). We refer to this estimator as EIF*-based, where * indicates that the nuisance function is plugged in as an estimate:
$$\hat{\theta}_{\mathrm{EIF}^*} = \frac{1}{n}\sum_{i=1}^{n} y_i + \frac{N}{N+n}\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \hat{f}(z_i) - \frac{1}{n}\sum_{i=1}^{n} \hat{f}(z_i)\Big\}.$$
This is the PSPA estimator with $\omega = N/(N+n)$; it achieves the semiparametric efficiency bound when $\hat{f}$ approximates its true counterpart at a sufficiently fast rate.
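The following minimal numpy sketch, under the definitions above, shows the PSPA mean estimator as a one-parameter family that nests all three special cases; the function and argument names are ours.

```python
import numpy as np

def pspa_mean(y_lab, fhat_lab, fhat_unlab, omega):
    """PSPA estimate of E[Y] for a given scalar weight omega.

    y_lab:      observed outcomes on the n labeled samples
    fhat_lab:   ML predictions on the same labeled samples
    fhat_unlab: ML predictions on the N unlabeled samples
    """
    # Classical average plus a mean-zero augmentation scaled by omega.
    return y_lab.mean() + omega * (fhat_unlab.mean() - fhat_lab.mean())

# Special cases from the text, with n labeled and N unlabeled samples:
#   omega = 0           -> classical estimator
#   omega = 1           -> PPI estimator
#   omega = N / (N + n) -> EIF*-based estimator
# The data-adaptive choice of omega is derived next in the text.
```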
Since these estimators are sums of a consistent estimator of the population mean and a consistent estimator of zero, they are all consistent for the population mean, and Wald-type confidence intervals can be constructed using a consistent estimator of the asymptotic variance. The variance of the PSPA estimator is
$$\mathrm{Var}\{\hat{\theta}_{\mathrm{PSPA}}(\omega)\} = \underbrace{\frac{\mathrm{Var}[Y]}{n}}_{\text{classical variance}} + \underbrace{\omega^2\Big(\frac{\mathrm{Var}[\hat{f}]}{n} + \frac{\mathrm{Var}[\hat{f}]}{N}\Big) - \frac{2\omega\,\mathrm{Cov}[Y, \hat{f}]}{n}}_{\text{additional terms}}. \qquad (2)$$
To guarantee that the asymptotic variance (2) is no greater than that of the classical approach, regardless of the quality of $\hat{f}$, the additional terms must be no larger than zero. A key insight is that these additional terms form a quadratic function of $\omega$:
$$q(\omega) = \omega^2\Big(\frac{\mathrm{Var}[\hat{f}]}{n} + \frac{\mathrm{Var}[\hat{f}]}{N}\Big) - \frac{2\omega\,\mathrm{Cov}[Y, \hat{f}]}{n},$$
which achieves its minimum at $\omega^{\mathrm{opt}} = \mathrm{Cov}[Y, \hat{f}]/\{\mathrm{Var}[\hat{f}] + n\,\mathrm{Var}[\hat{f}]/N\}$, with minimum value
$$q(\omega^{\mathrm{opt}}) = -\frac{\mathrm{Cov}[Y, \hat{f}]^2}{n\,\mathrm{Var}[\hat{f}] + n^2\,\mathrm{Var}[\hat{f}]/N} \le 0,$$
where equality holds if and only if $\mathrm{Cov}[Y, \hat{f}] = 0$. In this case, our PSPA estimator is at least as efficient as the classical estimator. In contrast, both the PPI and EIF*-based estimators can be less efficient than the classical estimator when the ML predictions are inaccurate. As we shall see in Section 3, similar patterns appear when estimating the general parameter defined in (1). In addition, the additional terms are below zero if and only if $\mathrm{Cov}[Y, \hat{f}] > \omega(N+n)\,\mathrm{Var}[\hat{f}]/(2N)$. This indicates that for PPI (where $\omega = 1$) to achieve an efficiency improvement, the prediction accuracy of the ML algorithm cannot be low.

Remark 2 (Data-adaptive feature). We provide intuition for the data-adaptive feature of PSPA. When the ML algorithm is purely random, $\mathrm{Cov}[Y, \hat{f}] = 0$ and hence the optimal weight is $\omega^{\mathrm{opt}} = 0$, so PSPA degenerates to the classical estimator that uses only labeled data. This aligns with the intuition that such predictions should not contribute to inference. On the other hand, if the ML prediction is perfect, then $\mathrm{Cov}[Y, \hat{f}] = \mathrm{Var}[\hat{f}] = \mathrm{Var}[Y]$ and hence the optimal weight is $\omega^{\mathrm{opt}} = N/(N+n)$. In this case, PSPA degenerates to the EIF*-based estimator, which achieves the semiparametric efficiency bound. This is also intuitive: perfect ML predictions can be treated as gold-standard data, so the estimator should utilize all the measured and predicted information and is thus efficient.

2.2 Related work

Our setting is closely related to prediction-powered inference (PPI) (Angelopoulos et al., 2023a), which ensures valid inference with gold-standard labels and arbitrary ML predictions. It corresponds to a special case of our PSPA estimator with $\omega = 1$ (or a vector of all ones). Both theoretical and numerical comparisons suggest that the PPI estimator is less efficient than the proposed PSPA estimator. Our approach builds upon and generalizes PPI++ (Angelopoulos et al., 2023b), which enhances the efficiency of PPI. Unlike PSPA, however, PPI++ cannot guarantee element-wise variance reduction, and it cannot be applied to statistical inference with ML-predicted covariates. PSPA handles this type of application while guaranteeing element-wise efficiency improvement. We present a comprehensive comparison of PSPA and PPI++ in Section 8.

Our proposal of augmenting a consistent estimator with a consistent estimator of zero is a long-standing idea.
One famous example is the augmented inverse probability weighting estimator in the literature on missing data (Robins et al., 1994) and causal inference (Robins, 2000), in which a consistent inverse probability weighting estimator is augmented by a weighted residual term with mean zero, resulting in an efficiency gain (Hahn, 1998). In fact, it reaches the semiparametric efficiency bound (Bickel et al., 1993; Hahn, 1998) when the nuisance functions are correctly specified or approximated sufficiently fast. Although the pursuit of efficiency under correct specification differs from our main focus, we establish the efficiency bound for our estimation problem (1) in Section 4 and illustrate the connection between the PSPA estimator and the efficient influence function.

The idea of using unlabeled data fits into the broader context of improving efficiency with auxiliary data, which has been widely applied in survey estimation (Breidt and Opsomer, 2017), missing data (Robins et al., 1994), measurement error models (Chen et al., 2005), causal inference with surrogate outcomes (Kallus and Mao, 2020), and semi-supervised learning (Wang and Shen, 2007; Chakrabortty and Cai, 2018). The semi-supervised learning literature adopts a data structure similar to ours, in which labeled data are accompanied by unlabeled data (possibly with surrogate outcomes). Chakrabortty and Cai (2018) considered efficiency improvement in linear models under possible model misspecification; subsequent estimation methods have been proposed for mean estimation (Zhang et al., 2019), best linear predictor estimation (Azriel et al., 2022), and general M-estimation problems (Song et al., 2023). Our work is distinct from this line of research in that we consider post-prediction inference with any black-box ML predictions for efficiency improvement.

3. PSPA for assumption-lean and data-adaptive post-prediction inference

In this section, we introduce our method, PSPA, for assumption-lean and data-adaptive post-prediction inference. For the estimand defined by (1) and any ML prediction, PSPA guarantees the validity of the inference results and element-wise variance reduction compared with the classical method that uses labeled data alone. We present our estimator and algorithm in Section 3.1 and establish the theoretical guarantees in Section 3.2. Examples applying our method to regression tasks are provided in Section 3.3.

3.1 PSPA estimator

Bearing in mind that $\hat{f}$ could be incorrect, one typical approach to estimating $\theta$ is to ignore the unlabeled data and use the labeled data only. This classical estimator $\hat{\theta}_{C}$ solves
$$\frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) = 0.$$
It is always consistent, but it may have low efficiency because it ignores the unlabeled data. In this work, we aim to provide an estimator that is consistent and more efficient than the classical estimator, regardless of the quality of $\hat{f}$. As in the mean estimation example, we add an augmentation term with mean zero, indexed by a vector $\omega = [\omega_1, \dots, \omega_q]^\top$, to the estimating equation of the classical approach:
$$\mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi(\hat{f}(z_i), x_i; \theta) - \frac{1}{n}\sum_{i=1}^{n} \psi(\hat{f}(z_i), x_i; \theta)\Big\},$$
where $\mathrm{diag}(\omega)$ is the diagonal matrix with diagonal elements $\omega_1, \dots, \omega_q$. We refer to $\omega$ as a weighting vector, as it controls the contributions of the labeled outcomes and the arbitrary prediction $\hat{f}$ in a data-adaptive way.
Together, we propose the PSPA estimator $\hat{\theta}_{\mathrm{PSPA}}(\omega)$, which solves
$$\Psi^{\omega}_{\mathrm{PSPA}}(\theta) := \frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi(\hat{f}(z_i), x_i; \theta) - \frac{1}{n}\sum_{i=1}^{n} \psi(\hat{f}(z_i), x_i; \theta)\Big\} = 0 \qquad (3)$$
for a given weighting vector $\omega$. This equation defines a class of estimators $\hat{\theta}_{\mathrm{PSPA}}(\omega)$ indexed by fixed values of $\omega$. We next establish the consistency and asymptotic normality of these estimators under the following regularity conditions. Condition (C1) is a regularity assumption on the parameter of interest; Conditions (C2)-(C4) guarantee consistency; Condition (C5) is needed for asymptotic normality. All conditions are reasonable and standard (Van der Vaart, 2000).

(C1) The parameter space $\Theta$ is a compact subset of $\mathbb{R}^q$ containing the true parameter $\theta_0$ as the unique solution to (1).

(C2) The function in (3) is continuous, and the equation has a unique solution.

(C3) $\sup_{\theta \in \Theta} \mathbb{E}[\|\psi(Y, X; \theta)\|^2] < \infty$ and $\sup_{\theta \in \Theta} \mathbb{E}[\|\mathrm{diag}(\omega)\,\psi(\hat{f}(Z), X; \theta)\|^2] < \infty$, where $\|\cdot\|$ is the $L_2$-norm taken over the true data-generating process.

(C4) Labeled and unlabeled data are i.i.d. draws from the population of interest.

(C5) There exists a function $\phi$ with $\mathbb{E}[\sup_{\theta \in \Theta} \phi^2] < \infty$ such that $\mathbb{E}\|\psi(Y, X; \theta') - \psi(Y, X; \theta'')\|_2^2 \le \phi\,\|\theta' - \theta''\|^2$ for any $\theta'$ and $\theta''$ in a neighborhood of $\theta$. Furthermore, $\mathbb{E}\{\psi(Y, X; \theta)\}$ is differentiable at $\theta$ with a nonsingular derivative matrix.

Theorem 3. Under Conditions (C1)-(C4), and assuming $n/N \to \rho$ as $n, N \to \infty$, the proposed estimator $\hat{\theta}_{\mathrm{PSPA}}(\omega)$ converges to $\theta$ in probability. Under the additional Condition (C5),
$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{PSPA}}(\omega) - \theta\big) \xrightarrow{D} N(0, \Sigma(\omega)),$$
where $\Sigma(\omega) = A^{-1}V(\omega)A^{-1}$, $A = \mathbb{E}[\partial\psi(Y, X; \theta)/\partial\theta]$, $V(\omega) = M_1 + \mathrm{diag}(\omega)(M_2 + \rho M_3)\mathrm{diag}(\omega) - 2\,\mathrm{diag}(\omega)M_4$, with $M_1 = \mathrm{Var}[\psi(Y, X; \theta)]$, $M_2 = M_3 = \mathrm{Var}[\psi(\hat{f}(Z), X; \theta)]$, and $M_4 = \mathrm{Cov}[\psi(Y, X; \theta), \psi(\hat{f}(Z), X; \theta)]$. ($M_2$ and $M_3$ coincide as population quantities since $\mathcal{L}$ and $\mathcal{U}$ share the same distribution of $(X, Z)$; we keep separate symbols because their sample analogs are computed on the unlabeled and labeled data, respectively.)

The proof is contained in Appendix B.1. Theorem 3 shows that the proposed estimator $\hat{\theta}_{\mathrm{PSPA}}(\omega)$ is asymptotically normal for a fixed weighting vector $\omega$. In practice, this weighting vector should be estimated from the data in pursuit of efficiency. We provide an estimation procedure aimed at element-wise variance reduction. This is made possible by the key observation that the $j$-th diagonal element of the asymptotic covariance matrix $\Sigma(\omega)$ is a quadratic function of $\omega_j$ and does not depend on the other components of $\omega$:
$$\Sigma_{jj}(\omega) = \omega_j^2\,[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj} - 2\omega_j\,[A^{-1}M_4A^{-1}]_{jj} + [A^{-1}M_1A^{-1}]_{jj},$$
where $[M]_{jj}$ denotes the $j$-th diagonal element of the matrix $M$. We define the optimal weighting vector $\omega^{\mathrm{opt}} = [\omega^{\mathrm{opt}}_1, \dots, \omega^{\mathrm{opt}}_q]^\top$ so that each coordinate minimizes the asymptotic variance of the corresponding coordinate:
$$\omega^{\mathrm{opt}}_j = \arg\min_{\omega_j} \Sigma_{jj}(\omega_j) = \frac{[A^{-1}M_4A^{-1}]_{jj}}{[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj}} \quad \text{for all } j \in \{1, \dots, q\}.$$
The resulting asymptotic variance for the $j$-th coordinate is
$$\Sigma_{jj}(\omega^{\mathrm{opt}}) = [A^{-1}M_1A^{-1}]_{jj} - \frac{[A^{-1}M_4A^{-1}]_{jj}^2}{[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj}}.$$
This construction of $\omega^{\mathrm{opt}}$ guarantees element-wise optimality of the asymptotic variance across the class of estimators with any weighting vector $\omega$. Next, we present our algorithm for PSPA estimation and inference.

Algorithm 1: PSPA estimation with ML-predicted labels

Input: Data $\mathcal{L} \cup \mathcal{U}$, pre-trained ML model $\hat{f}$, error rate $\alpha \in (0, 1)$.

1: Obtain the classical estimator $\hat{\theta}_C$ by solving $\frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) = 0$.

2: Obtain the optimal weighting vector $\hat{\omega}^{\mathrm{opt}} = [\hat{\omega}^{\mathrm{opt}}_1, \dots, \hat{\omega}^{\mathrm{opt}}_q]^\top$ by
$$\hat{\omega}^{\mathrm{opt}}_j = \min\Big(\frac{[\hat{A}_C^{-1}\hat{M}_{4,C}\hat{A}_C^{-1}]_{jj}}{[\hat{A}_C^{-1}(\hat{M}_{2,C} + \rho\hat{M}_{3,C})\hat{A}_C^{-1}]_{jj}},\; 1\Big) \quad \text{for all } j \in \{1, \dots, q\},$$
where $\hat{A}_C, \hat{M}_{1,C}, \hat{M}_{2,C}, \hat{M}_{3,C}, \hat{M}_{4,C}$ are sample analogs of $A, M_1, M_2, M_3, M_4$ with $\hat{\theta}_C$ plugged in.
3: Obtain the PSPA estimator with a one-step update:
$$\hat{\theta}_{\mathrm{PSPA}} = \hat{\theta}_C - \Big[\nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\hat{\theta}_C)\Big]^{-1}\,\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\hat{\theta}_C), \quad \text{where } \nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\hat{\theta}_C) = \partial\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}}(\theta)/\partial\theta\,\big|_{\theta = \hat{\theta}_C}.$$

4: Obtain the asymptotic variance of $\hat{\theta}_{\mathrm{PSPA}}$:
$$\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}}) = \hat{A}^{-1}_{\mathrm{PSPA}}\big[\hat{M}_{1,\mathrm{PSPA}} + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})(\hat{M}_{2,\mathrm{PSPA}} + \rho\hat{M}_{3,\mathrm{PSPA}})\mathrm{diag}(\hat{\omega}^{\mathrm{opt}}) - 2\,\mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\hat{M}_{4,\mathrm{PSPA}}\big]\hat{A}^{-1}_{\mathrm{PSPA}},$$
where $\hat{A}_{\mathrm{PSPA}}, \hat{M}_{1,\mathrm{PSPA}}, \dots, \hat{M}_{4,\mathrm{PSPA}}$ are sample analogs of $A, M_1, \dots, M_4$ with $\hat{\theta}_{\mathrm{PSPA}}$ plugged in.

Output: The PSPA estimator $\hat{\theta}_{\mathrm{PSPA}}$; standard error $\sqrt{\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}})_{jj}/n}$; $\alpha$-level confidence interval $\mathcal{C}^{\mathrm{PSPA}}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA},j} \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)$; and two-sided p-value $2\big(1 - \Phi\big(|\hat{\theta}_{\mathrm{PSPA},j}|\big/\sqrt{\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)\big)$ for the $j$-th coordinate, where $\Phi$ is the CDF of the standard normal distribution.

3.2 Theoretical guarantees

Next, we establish the theoretical guarantees for our proposed estimator and algorithm. We modify Theorem 3 to reflect that $\hat{\omega}$ in the algorithm is estimated from the data.

Corollary 4. Suppose $\hat{\omega} \xrightarrow{P} \omega$ and the conditions of Theorem 3 hold. Then $\hat{\theta}_{\mathrm{PSPA}}(\hat{\omega}) \xrightarrow{P} \theta$ and $\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}}(\hat{\omega}) - \theta) \xrightarrow{D} N(0, \Sigma(\omega))$.

The proof of Corollary 4 is contained in Appendix B.2. In our algorithm, we substitute sample analogs for $A, M_1, M_2, M_3, M_4$ in $\omega^{\mathrm{opt}}_j$. Since these sample analogs are consistent estimators, Slutsky's theorem implies $\hat{\omega}^{\mathrm{opt}} \xrightarrow{P} \omega^{\mathrm{opt}}$, so the additional condition is satisfied by our algorithm. The asymptotic normality in Corollary 4 guarantees the validity of inference for our algorithm via the following Corollary 5.

Corollary 5. Suppose $\hat{\Sigma}(\hat{\omega}) \xrightarrow{P} \Sigma(\omega)$ and the conditions of Theorem 3 hold. Given confidence level $1 - \alpha \in (0, 1)$,
$$\lim_{n \to \infty} P\big(\theta_j \in \mathcal{C}^{\hat{\omega}}_{\alpha,j}\big) = 1 - \alpha, \quad j = 1, \dots, q,$$
where $\theta_j$ is the $j$-th coordinate of the parameter and $\mathcal{C}^{\hat{\omega}}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA},j}(\hat{\omega}) \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}(\hat{\omega})_{jj}/n}\big)$.

The proof of Corollary 5 follows from the asymptotic normality of the PSPA estimator in Corollary 4. Since we plug in a consistent estimator for each element of $\Sigma(\omega^{\mathrm{opt}})$, Slutsky's theorem implies $\hat{\Sigma}(\hat{\omega}^{\mathrm{opt}}) \xrightarrow{P} \Sigma(\omega^{\mathrm{opt}})$, so the condition in Corollary 5 is also satisfied by our algorithm. The above two corollaries ensure the validity of inference for our algorithm. We next demonstrate that our method achieves element-wise variance reduction compared to all baseline methods.

Proposition 6. Suppose $\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}}(\omega) - \theta) \xrightarrow{D} N(0, \Sigma(\omega))$. With $\omega^{\mathrm{opt}}$ defined in Section 3.1, $\Sigma(\omega^{\mathrm{opt}})_{jj} \le \Sigma(\omega)_{jj}$ for all $j \in \{1, \dots, q\}$.

The proof of Proposition 6 is contained in Appendix B.3. Setting the weighting vector $\omega$ to $\omega_C = [0, \dots, 0]^\top$ or $\omega_{\mathrm{PPI}} = [1, \dots, 1]^\top$ recovers the classical method and PPI, respectively. Therefore, PSPA guarantees an element-wise reduction in asymptotic variance compared to both the classical method and PPI, for arbitrary ML models.

3.3 Examples: linear and logistic regression

Here we present two examples to illustrate the intuition behind the PSPA estimation procedure. We begin with the estimation of coefficients in a linear regression model, where the resulting estimator has a closed form.

Example 2 (Linear regression). We consider the ordinary least squares problem, where $\psi(y, x; \theta) = x(y - x^\top\theta)$ and the estimand is $\theta_0 = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]$. From equation (3), the PSPA estimator solves
$$\frac{1}{n}\sum_{i=1}^{n} x_i(y_i - x_i^\top\theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} x_i\big(\hat{f}(z_i) - x_i^\top\theta\big) - \frac{1}{n}\sum_{i=1}^{n} x_i\big(\hat{f}(z_i) - x_i^\top\theta\big)\Big\} = 0,$$
yielding the closed-form PSPA linear regression estimator
$$\hat{\theta}_{\mathrm{PSPA}} = \Big[\frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\Big(\frac{1}{N}\sum_{i=n+1}^{n+N} x_ix_i^\top - \frac{1}{n}\sum_{i=1}^{n} x_ix_i^\top\Big)\Big]^{-1}\Big[\frac{1}{n}\sum_{i=1}^{n} x_iy_i + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\Big(\frac{1}{N}\sum_{i=n+1}^{n+N} x_i\hat{f}(z_i) - \frac{1}{n}\sum_{i=1}^{n} x_i\hat{f}(z_i)\Big)\Big],$$
where $\hat{\omega}^{\mathrm{opt}}$ is a consistent estimator of the optimal $\omega^{\mathrm{opt}}$, obtainable through Algorithm 1.
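The closed form above translates directly into a few lines of numpy; this is a minimal sketch of Example 2 under the definitions in the text, with names of our choosing.

```python
import numpy as np

def pspa_ols(X_lab, y_lab, fhat_lab, X_unlab, fhat_unlab, omega):
    """Closed-form PSPA least-squares estimate for a weight vector omega."""
    n, N = X_lab.shape[0], X_unlab.shape[0]
    W = np.diag(omega)
    # Gram matrices on labeled and unlabeled data.
    G_lab = X_lab.T @ X_lab / n
    G_unlab = X_unlab.T @ X_unlab / N
    # Left-hand side: labeled Gram matrix plus weighted Gram correction.
    lhs = G_lab + W @ (G_unlab - G_lab)
    # Right-hand side: labeled cross-moment plus weighted prediction correction.
    rhs = (X_lab.T @ y_lab / n
           + W @ (X_unlab.T @ fhat_unlab / N - X_lab.T @ fhat_lab / n))
    return np.linalg.solve(lhs, rhs)
```

With `omega` set to the zero vector this reduces to ordinary least squares on the labeled data, matching the classical special case.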
The second example is logistic regression, which lacks a closed-form solution but can be solved using Algorithm 1.

Example 3 (Logistic regression). For logistic regression, $\psi(y, x; \theta) = -xy + x\gamma_\theta(x)$, where $\gamma_\theta(x) = 1/\{1 + \exp(-x^\top\theta)\}$. Using estimating equation (3), the PSPA estimator $\hat{\theta}_{\mathrm{PSPA}}$ solves
$$\frac{1}{n}\sum_{i=1}^{n}\{-x_iy_i + x_i\gamma_\theta(x_i)\} + \mathrm{diag}(\omega)\Big[\frac{1}{N}\sum_{i=n+1}^{n+N}\{-x_i\hat{f}(z_i) + x_i\gamma_\theta(x_i)\} - \frac{1}{n}\sum_{i=1}^{n}\{-x_i\hat{f}(z_i) + x_i\gamma_\theta(x_i)\}\Big] = 0.$$

4. Relationship with the Efficient Influence Function

To connect our method with the literature on semiparametric efficiency (Bickel et al., 1993; Tsiatis, 2006), in this section we state the efficient influence function (EIF) and the semiparametric efficiency bound for the estimation problem (1) (Proposition 7) and connect them with our proposed PSPA estimator. To ease notation, we introduce a random variable $R$ indicating labeling: $r_1 = \dots = r_n = 1$ and $r_{n+1} = \dots = r_{n+N} = 0$. Throughout this section, suppose $\pi \equiv \mathrm{pr}(R = 1) = n/(n+N)$ is a fixed constant between zero and one. The introduction of $R$ is for notational simplicity only, and we make no inference on it.

Proposition 7 (Efficient influence function). The EIF for estimating $\theta$ is
$$\phi(r, y, x, z; \theta) = \frac{r}{\pi}A^{-1}\big[\psi(y, x; \theta) - \mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}\big] + A^{-1}\mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}.$$

The proof is contained in Appendix B.4. Guided by Proposition 7, an EIF-based estimator solves the empirical equation
$$0 = \frac{1}{n+N}\sum_{i=1}^{n+N} \phi(r_i, y_i, x_i, z_i; \hat{\theta}_{\mathrm{EIF}}). \qquad (4)$$
With the derived EIF, the EIF-based estimator is asymptotically normal and achieves the semiparametric efficiency bound by Section 8.4 of Molenberghs et al. (2014).

Corollary 8 (Semiparametric efficiency bound). Under Conditions (C1) and (C3), assume $\mathbb{E}\{\sup_{\theta \in \Theta} \|\phi(Y, X; \theta)\|\} < \infty$ and that equation (4) has a unique solution. Then, as $n, N \to \infty$,
$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{EIF}} - \theta_0\big) \xrightarrow{D} N\big(0,\; \pi\,\mathbb{E}\{\phi(Y, X; \theta_0)\phi(Y, X; \theta_0)^\top\}\big).$$

If the nuisance function $\mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}$ appearing in the EIF can be correctly specified or well approximated, an EIF-based estimator achieves the semiparametric efficiency bound. However, this is nearly impossible in practice with limited knowledge of the data-generating process or a small computational budget. Instead, it is typical to construct an estimator with the nuisance functions plugged in as estimates. We refer to such estimators as EIF*-based, because the influence function of the resulting estimator, denoted $\hat{\theta}_{\mathrm{EIF}^*}$, may or may not be the efficient influence function. In our setting, the nuisance function $\mathbb{E}\{\psi(Y, x; \theta) \mid x, z\}$ can be estimated with $\psi(\hat{f}(z), x; \theta)$. A direct calculation shows that $\hat{\theta}_{\mathrm{EIF}^*}$ solves
$$\frac{1}{n}\sum_{i=1}^{n} A^{-1}\psi(y_i, x_i; \theta) + (1-\pi)A^{-1}\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi(\hat{f}(z_i), x_i; \theta) - \frac{1}{n}\sum_{i=1}^{n} \psi(\hat{f}(z_i), x_i; \theta)\Big\} = 0. \qquad (5)$$
Like equation (3), which the PSPA estimator solves, equation (5) has two parts: a term that alone yields a consistent estimator from the labeled data, and a term that utilizes the predictions $\hat{f}$ and always has mean zero. If one multiplies both sides of (3) by the matrix $A^{-1}$, then (5) corresponds to that equation with the weighting vector equal to $(1-\pi)$ in every coordinate. In estimating the outcome mean (Section 2.1), where $A$ is essentially the scalar one, the EIF*-based estimator is the PSPA estimator with weight $1 - \pi = N/(n+N)$. The difference between (3) and (5) clarifies one of the main distinctions between the proposed PSPA estimator and an EIF-based estimator: the former seeks variance reduction in a data-adaptive way, while the latter attains minimum variance only under correct specification.
In real-world situations, the proposed PSPA estimator can be much more efficient because it utilizes information in a data-adaptive fashion. Numeric comparisons are provided in Section 6.

5. Extensions

Next, we extend PSPA to more general scenarios. Section 5.1 outlines an inference procedure for fully labeled outcomes with a partially labeled covariate, and Section 5.2 discusses cases where both $Y$ and $X$ are partially labeled.

5.1 ML-predicted covariates

We modify the previously described setup so that ML is used to predict the covariate $X$ instead of the outcome $Y$. Our data again consist of two parts, $\mathcal{L} \cup \mathcal{U}$: the labeled part $\mathcal{L}$ includes data points $(y_i, x_i, z_i)$ for $i = 1, \dots, n$, and the unlabeled part $\mathcal{U}$ includes data points $(y_i, z_i)$ for $i = n+1, \dots, n+N$. In addition, we use an external prediction algorithm $\hat{q}(\cdot)$ that, applied to $z$, generates the predicted value $\hat{q}(z)$ of the covariate $X$; note that $\hat{q}$ may produce inaccurate or biased predictions of $X$. The estimand is again defined by the estimating equation (1). Analogously to our proposal for ML-predicted outcomes, we propose the PSPA estimator $\hat{\theta}_{\mathrm{PSPA}^\star}(\omega)$ with ML-predicted covariates, which solves
$$\Psi^{\omega}_{\mathrm{PSPA}^\star}(\theta) := \frac{1}{n}\sum_{i=1}^{n} \psi(y_i, x_i; \theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N} \psi\big(y_i, \hat{q}(z_i); \theta\big) - \frac{1}{n}\sum_{i=1}^{n} \psi\big(y_i, \hat{q}(z_i); \theta\big)\Big\} = 0.$$
Denote $A = \mathbb{E}[\partial\psi(Y, X; \theta)/\partial\theta]$, $M_1 = \mathrm{Var}[\psi(Y, X; \theta_0)]$, $M^\star_2 = M^\star_3 = \mathrm{Var}[\psi(Y, \hat{q}(Z); \theta_0)]$, and $M^\star_4 = \mathrm{Cov}[\psi(Y, X; \theta_0), \psi(Y, \hat{q}(Z); \theta_0)]$. By calculations similar to Theorem 3, the $j$-th diagonal element of the asymptotic variance of $\hat{\theta}_{\mathrm{PSPA}^\star}(\omega)$ is
$$\Sigma^\star_{jj}(\omega) = \omega_j^2\,[A^{-1}(M^\star_2 + \rho M^\star_3)A^{-1}]_{jj} - 2\omega_j\,[A^{-1}M^\star_4A^{-1}]_{jj} + [A^{-1}M_1A^{-1}]_{jj},$$
which achieves its minimum at $\omega^{\mathrm{opt}}_j = [A^{-1}M^\star_4A^{-1}]_{jj}\big/[A^{-1}(M^\star_2 + \rho M^\star_3)A^{-1}]_{jj}$. We modify Algorithm 1 as follows to incorporate ML-predicted covariates.

Algorithm 2: PSPA estimation with ML-predicted covariates

Input: Data $\mathcal{L} \cup \mathcal{U}$, pre-trained ML model $\hat{q}$, error rate $\alpha \in (0, 1)$.

1: Obtain the classical estimator $\hat{\theta}_C$ by solving $\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \theta) = 0$.

2: Let $r_j = [\hat{A}_C^{-1}\hat{M}^\star_{4,C}\hat{A}_C^{-1}]_{jj}\big/[\hat{A}_C^{-1}(\hat{M}^\star_{2,C} + \rho\hat{M}^\star_{3,C})\hat{A}_C^{-1}]_{jj}$ and obtain the optimal weighting vector $\hat{\omega}^{\mathrm{opt}} = [\hat{\omega}^{\mathrm{opt}}_1, \dots, \hat{\omega}^{\mathrm{opt}}_q]^\top$ by
$$\hat{\omega}^{\mathrm{opt}}_j = \begin{cases} \min\Big(r_j,\; \dfrac{\lambda_{\min,+}\big[\frac{1}{n}\sum_{i=1}^n \partial\psi(y_i, x_i; \theta)\big]}{\lambda_{\max}\big[\frac{1}{n}\sum_{i=1}^n \partial\psi(y_i, \hat{q}(z_i); \theta)\big]}\Big) & \text{if } r_j > 0, \\[4pt] \max(r_j, 0) & \text{if } r_j \le 0, \end{cases}$$
for all $j \in \{1, \dots, q\}$. Here $\hat{A}_C, \hat{M}^\star_{1,C}, \dots, \hat{M}^\star_{4,C}$ are sample analogs of $A, M_1, M^\star_2, M^\star_3, M^\star_4$ for $\Psi^{\omega}_{\mathrm{PSPA}^\star}(\theta)$ with $\hat{\theta}_C$ plugged in, and $\lambda_{\min,+}[M]$ and $\lambda_{\max}[M]$ denote the smallest positive and largest eigenvalues of the matrix $M$, respectively.

3: Obtain the PSPA estimator with a one-step update:
$$\hat{\theta}_{\mathrm{PSPA}^\star} = \hat{\theta}_C - \Big[\nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\hat{\theta}_C)\Big]^{-1}\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\hat{\theta}_C), \quad \text{where } \nabla\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\hat{\theta}_C) = \partial\Psi^{\hat{\omega}^{\mathrm{opt}}}_{\mathrm{PSPA}^\star}(\theta)/\partial\theta\,\big|_{\theta=\hat{\theta}_C}.$$

4: Obtain the asymptotic variance of $\hat{\theta}_{\mathrm{PSPA}^\star}$:
$$\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}}) = \hat{A}^{-1}_{\mathrm{PSPA}^\star}\big[\hat{M}^\star_{1} + \mathrm{diag}(\hat{\omega}^{\mathrm{opt}})(\hat{M}^\star_{2} + \rho\hat{M}^\star_{3})\mathrm{diag}(\hat{\omega}^{\mathrm{opt}}) - 2\,\mathrm{diag}(\hat{\omega}^{\mathrm{opt}})\hat{M}^\star_{4}\big]\hat{A}^{-1}_{\mathrm{PSPA}^\star},$$
where the sample analogs are evaluated for $\Psi^{\omega}_{\mathrm{PSPA}^\star}(\theta)$ with $\hat{\theta}_{\mathrm{PSPA}^\star}$ plugged in.

Output: The PSPA estimator $\hat{\theta}_{\mathrm{PSPA}^\star}$; standard error $\sqrt{\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}})_{jj}/n}$; $\alpha$-level confidence interval $\mathcal{C}^{\star}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA}^\star,j} \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)$; and two-sided p-value $2\big(1 - \Phi\big(|\hat{\theta}_{\mathrm{PSPA}^\star,j}|\big/\sqrt{\hat{\Sigma}^\star(\hat{\omega}^{\mathrm{opt}})_{jj}/n}\big)\big)$ for the $j$-th coordinate, where $\Phi$ is the CDF of the standard normal distribution.
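To make the weight-clipping rule of step 2 concrete, here is a minimal numpy sketch assuming the sample derivative matrices are symmetric (as the Gram matrices in Lemma 13 are); the function and argument names are ours.

```python
import numpy as np

def clipped_weights(ratio, A_lab, B_pred):
    """Clip per-coordinate optimal weights as in step 2 of Algorithm 2 (a sketch).

    ratio:  q-vector of unconstrained minimizers
            [A^-1 M4* A^-1]_jj / [A^-1 (M2* + rho M3*) A^-1]_jj
    A_lab:  sample derivative (1/n) sum_i d psi(y_i, x_i; theta) / d theta
    B_pred: sample derivative (1/n) sum_i d psi(y_i, qhat_i; theta) / d theta
    """
    eig_A = np.linalg.eigvalsh(A_lab)     # requires a symmetric matrix
    eig_B = np.linalg.eigvalsh(B_pred)
    lam_min_pos = eig_A[eig_A > 0].min()  # smallest positive eigenvalue
    lam_max = eig_B.max()                 # largest eigenvalue
    # Negative ratios clip to 0; positive ratios cap at the eigenvalue bound,
    # keeping the estimating equation's derivative positive semi-definite
    # so that the equation has a unique solution (see Remark 10).
    return np.clip(ratio, 0.0, lam_min_pos / lam_max)
```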
Corollary 9 states the asymptotic distribution that enables interval estimation; its proof follows from the proofs of Theorem 3 and Corollaries 4 and 5 and is thus omitted.

Corollary 9. Under Conditions (C1), (C2) applied to the PSPA$^\star$ estimating equation in Section 5.1, and (C3)-(C4), assuming $n/N \to \rho$ as $n, N \to \infty$ and $\hat{\omega} \xrightarrow{P} \omega$, the proposed estimator $\hat{\theta}_{\mathrm{PSPA}^\star}(\hat{\omega})$ converges to $\theta_0$ in probability. Under the additional Condition (C5), as $n, N \to \infty$,
$$\sqrt{n}\,\big(\hat{\theta}_{\mathrm{PSPA}^\star}(\hat{\omega}) - \theta_0\big) \xrightarrow{D} N(0, \Sigma^\star(\omega)),$$
where $\Sigma^\star(\omega) = A^{-1}V^\star(\omega)A^{-1}$, $A = \mathbb{E}[\partial\psi(Y, X; \theta)/\partial\theta]$, $V^\star(\omega) = M_1 + \mathrm{diag}(\omega)(M^\star_2 + \rho M^\star_3)\mathrm{diag}(\omega) - 2\,\mathrm{diag}(\omega)M^\star_4$, with $M_1 = \mathrm{Var}[\psi(Y, X; \theta_0)]$, $M^\star_2 = M^\star_3 = \mathrm{Var}[\psi(Y, \hat{q}(Z); \theta_0)]$, and $M^\star_4 = \mathrm{Cov}[\psi(Y, X; \theta_0), \psi(Y, \hat{q}(Z); \theta_0)]$. With $\hat{\Sigma}^\star(\hat{\omega}) \xrightarrow{P} \Sigma^\star(\omega)$,
$$\lim_{n\to\infty} P\big(\theta_j \in \mathcal{C}^{\hat{\omega}}_{\alpha,j}\big) = 1 - \alpha, \quad j = 1, \dots, q,$$
where $\mathcal{C}^{\hat{\omega}}_{\alpha,j} = \big(\hat{\theta}_{\mathrm{PSPA}^\star,j}(\hat{\omega}) \pm z_{1-\alpha/2}\sqrt{\hat{\Sigma}^\star(\hat{\omega})_{jj}/n}\big)$, $1 - \alpha \in (0, 1)$ is the confidence level, and $\theta_j$ is the $j$-th coordinate of the parameter.

Remark 10. Condition (C2) requires the PSPA$^\star$ estimating equation to have a unique solution. To ensure this, we impose a constraint on $\omega$. The equation has a unique solution if its derivative,
$$\frac{1}{n}\sum_{i=1}^{n}\big[\partial\psi(y_i, x_i; \theta) - \mathrm{diag}(\omega)\,\partial\psi\big(y_i, \hat{q}(z_i); \theta\big)\big] + \frac{1}{N}\sum_{i=n+1}^{n+N} \mathrm{diag}(\omega)\,\partial\psi\big(y_i, \hat{q}(z_i); \theta\big),$$
is positive semi-definite. The second term is positive semi-definite for non-negative $\omega_j$, so we only need to ensure the first term is positive semi-definite. According to Lemma 13 in the appendix, a sufficient condition for this is
$$0 \le \omega_j \le \frac{\lambda_{\min,+}\big[\frac{1}{n}\sum_{i=1}^{n} \partial\psi(y_i, x_i; \theta)\big]}{\lambda_{\max}\big[\frac{1}{n}\sum_{i=1}^{n} \partial\psi(y_i, \hat{q}(z_i); \theta)\big]} \quad \text{for all } j = 1, \dots, q,$$
where $\lambda_{\min,+}[M]$ and $\lambda_{\max}[M]$ denote the smallest positive and largest eigenvalues of the matrix $M$, respectively. Imposing this constraint on the weighting vector in the algorithm ensures that Condition (C2) holds. In practice, one can relax this constraint by substituting the prefixed $\omega$ into the derivative and confirming its positive semi-definiteness; if necessary, one can iteratively reduce the value of each positive $\omega_j$ (or increase each negative $\omega_j$) until the derivative achieves positive semi-definiteness.

Next, we state the theoretical guarantee that PSPA achieves element-wise variance reduction compared with the classical approach with weighting vector $\omega_C = [0, \dots, 0]^\top$.

Corollary 11. Suppose $\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}^\star}(\omega_C) - \theta) \xrightarrow{D} N(0, \Sigma^\star(\omega_C))$. Define $\hat{\omega}^{\mathrm{opt}} = [\hat{\omega}^{\mathrm{opt}}_1, \dots, \hat{\omega}^{\mathrm{opt}}_q]^\top$ as in step 2 of Algorithm 2. If $\hat{\omega}^{\mathrm{opt}} \xrightarrow{P} \omega^{\mathrm{opt}}$, then $\Sigma^\star(\omega^{\mathrm{opt}})_{jj} \le \Sigma^\star(\omega_C)_{jj}$.

The proof is contained in Appendix B.6.

5.2 ML-predicted outcome and covariates

Next, we explore the situation where neither the outcome nor the covariates are directly observed in the unlabeled data; instead, we employ machine learning to predict both. Additionally, we possess a much smaller data set with both the outcome and the covariates measured. The full data consist of two segments, $\mathcal{L} \cup \mathcal{U}$: the first, $\mathcal{L}$, includes the data points $(y_i, x_i, z_i)$ for $i = 1, \dots, n$; the second, $\mathcal{U}$, comprises the instances $z_i$ for $i = n+1, \dots, n+N$, with $N \gg n$. The variable $z$ is used to predict both the outcome and the covariates. The ML-predicted outcome and covariates are $\hat{f}(z)$ and $\hat{q}(z)$, respectively, where $\hat{f}$ and $\hat{q}$ are obtained independently.
The PSPA estimator for this situation solves
$$\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \theta) + \mathrm{diag}(\omega)\Big\{\frac{1}{N}\sum_{i=n+1}^{n+N}\psi\big(\hat{f}(z_i), \hat{q}(z_i); \theta\big) - \frac{1}{n}\sum_{i=1}^{n}\psi\big(\hat{f}(z_i), \hat{q}(z_i); \theta\big)\Big\} = 0.$$
Replacing the estimating equation in Algorithm 2 with this equation gives an algorithm for estimation and statistical inference for this task. The consistency and asymptotic normality of the resulting estimator can be established following the proof of Theorem 3.

6. Simulations

We conduct simulations to evaluate the performance of PSPA using linear and logistic regression in two settings: one with ML-predicted labels and another with ML-predicted covariates. In the ML-predicted labels setting, we consider the classical, PPI, and EIF*-based methods. The EIF*-based method employs the weighting vector $\omega_{\mathrm{EIF}^*} = [N/(N+n), \dots, N/(N+n)]^\top$, where $N$ and $n$ are the sample sizes of the unlabeled and labeled data, respectively. In the ML-predicted covariates setting, we use the classical and imputation-based methods as baselines. We present the implementation details in Section 6.1 and the simulation results in Section 6.2.

6.1 Implementation details

In all simulations, the ground-truth coefficients are obtained by Monte Carlo approximation with $5 \times 10^4$ samples. The labeled data contain 500 samples, and the unlabeled data contain 500, 1500, 2500, 5000, or 10000 samples depending on the setting. A pre-trained random forest with 100 trees is obtained from a hold-out dataset with 1000 samples. All simulations are repeated 1000 times.

6.1.1 ML-predicted labels

We simulate the labels $Y_i$ and covariates $X_{1i}, \dots, X_{50i}$ for the linear regression setting by
$$X_{1i}, \dots, X_{50i} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad Z_i \sim N(0, 1), \quad \theta_1 = \dots = \theta_{10} = 0.1, \quad \theta_{11} = \dots = \theta_{50} = 0,$$
$$Y_i = \sum_{k=1}^{50} X_{ki}\theta_k + r\,Z_i + \epsilon_i, \quad \epsilon_i \sim N(0, \tau_\epsilon),$$
where $\tau_\epsilon$ is chosen so that $\mathrm{Var}(Y_i) = 1$, and $r$ is set to 0.8 in the settings with varying unlabeled sample sizes and to 0, 0.2, 0.4, 0.6, or 0.8 in the settings with varying imputation accuracy. We use $X_{1i}, \dots, X_{50i}$ and $Z_i$ as inputs to train the random forest model that predicts $Y_i$ in the unlabeled data. For logistic regression, we adopt the same data-generating process except that we generate the label as $Y_i = \mathbb{1}(Y_i^* > \mathrm{median}(Y_i^*))$, where $Y_i^*$ denotes the continuous outcome above. Our parameter of interest is the regression coefficient of $X_{1i}$.

6.1.2 ML-predicted covariates

We simulate the labels $Y_i$ and covariates $X_{1i}, \dots, X_{10i}$ for the linear regression setting by
$$X_{1i}, \dots, X_{10i} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad \theta_1 = 0.1, \quad \theta_2 = \dots = \theta_{10} = 0, \quad Y_i = \sum_{k=1}^{10} X_{ki}\theta_k + \epsilon_i, \quad \epsilon_i \sim N(0, \tau_\epsilon),$$
where $\tau_\epsilon$ is chosen so that $\mathrm{Var}(Y_i) = 1$, and
$$Z_i = 0.1\,Y_i + r\,X_{1i} + \delta_i, \quad \delta_i \sim N(0, \tau_\delta),$$
where $\tau_\delta$ is chosen so that $\mathrm{Var}(Z_i) = 1$, and $r$ is set to 0.8 in the settings with varying unlabeled sample sizes and to 0, 0.2, 0.4, 0.6, or 0.8 in the settings with varying imputation accuracy. We use $Z_i$ as the input to train the random forest model that predicts $X_{1i}$ in the unlabeled data. For logistic regression, we use the same data-generating process except that we generate the label as $Y_i = \mathbb{1}(Y_i^* > \mathrm{median}(Y_i^*))$. We are interested in the regression coefficient of $X_{1i}$.
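For reference, the following minimal numpy sketch reproduces the linear-regression data-generating process of Section 6.1.1; the function name and seed handling are ours.

```python
import numpy as np

def simulate_ml_labels(n, r=0.8, p=50, seed=0):
    """Sketch of the Section 6.1.1 data-generating process (linear case)."""
    rng = np.random.default_rng(seed)
    theta = np.r_[np.full(10, 0.1), np.zeros(p - 10)]
    X = rng.standard_normal((n, p))
    Z = rng.standard_normal(n)
    signal = X @ theta + r * Z
    # Noise variance chosen so that Var(Y) = 1:
    # Var(signal) = sum(theta^2) + r^2.
    tau_eps = 1.0 - ((theta ** 2).sum() + r ** 2)
    Y = signal + rng.standard_normal(n) * np.sqrt(tau_eps)
    return X, Z, Y
```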
Figure 2: Coverage of the confidence interval and relative ratio of its width compared to the classical method for linear and logistic regression, when ML is used to predict the labels. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the confidence interval width in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data; panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent y = 0.95 in (a)-(d) and y = 1 in (e)-(h). Methods: Classical, EIF*-based, PSPA, PPI.

6.2 Results

6.2.1 ML-predicted labels

Figure 2 shows the simulation results for the ML-predicted labels setting. All evaluated methods achieve confidence interval coverage rates close to 95%, indicating valid inference across different unlabeled sample sizes and levels of ML prediction accuracy. The PPI and EIF*-based methods can be less efficient than the classical method when the ML predictions are poor or the unlabeled data are limited. PSPA has narrower confidence intervals than the classical method and the other baselines, especially as the unlabeled sample size and the ML prediction performance increase.

6.2.2 ML-predicted covariates

Figure 3 shows the simulation results for the ML-predicted covariates setting. The imputation-based method fails to achieve the correct confidence interval coverage. Both the classical and PSPA methods achieve coverage rates around 95%, indicating valid inference. PSPA has narrower confidence intervals than the classical method, regardless of the unlabeled sample size and the accuracy of the ML model.

Figure 3: Coverage of the confidence interval and relative ratio of its width compared to the classical method for linear and logistic regression, when ML is used to predict the covariates. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the confidence interval width in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data; panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent y = 0.95 in (a)-(d) and y = 1 in (e)-(h). Methods: Classical, PSPA, Imputation-based.
7. Data Applications

7.1 Sex-differentiated gene expression

We used the PSPA method to assess the effect of biological sex on gene expression across 44 human tissues using GTEx data. We use an ML algorithm called hypergraph factorization (HYFA) to impute missing gene expression in uncollected tissues (Viñas et al., 2023); the imputed gene expression is then used in PSPA to increase the statistical power to identify sex-biased genes. We compare PSPA with the classical, PPI, and EIF*-based approaches. In particular, we processed the GTEx v8 data following the GTEx pipeline (GTEx Consortium et al., 2015). We then used HYFA, a parameter-efficient graph representation learning approach for multi-tissue prediction of gene expression, to predict transcript levels in each uncollected tissue for each individual in the data, resulting in a gene expression dataset of 834 individuals across 44 tissues. We used cross-validation to predict expression levels in the labeled data to avoid over-fitting. Prediction accuracy is measured by the correlation between measured and imputed gene expression in the labeled data. We then used linear regression to assess the effect of sex on gene expression while controlling for technical factors, including surrogate variables, age, RNA integrity number, and total ischemic time for a sample.

Figure 4: Comparison of PSPA, classical, PPI, and EIF*-based approaches in identifying sex-biased gene expression using GTEx data. Each panel illustrates a different aspect of comparison on the y- and x-axes: point estimates between the (a) classical and PPI approaches, (b) classical and EIF*-based approaches, and (c) classical and PSPA approaches; estimated standard errors between the (d) classical and PPI approaches, (e) classical and EIF*-based approaches, and (f) classical and PSPA approaches; (g) number of sex-biased genes identified by each of the four approaches. The dashed lines represent y = x in (a)-(c) and y = 1 in (d)-(f).

Results are shown in Figure 4; each dot in panels (a)-(f) represents the inference result for one gene in one tissue. The point estimates of all methods are close to the classical approach (Figure 4 (a)-(c)). However, both PPI and EIF*-based approaches are less efficient than the classical approach when the prediction accuracy is low (Figure 4 (d)-(e)). In contrast, the PSPA estimator is always at least as efficient as the classical approach, regardless of the prediction accuracy (Figure 4 (f)). Moreover, PSPA identifies more sex-biased genes than the other approaches, demonstrating its improved efficiency over alternatives (Figure 4 (g)).

7.2 Risk factors for bone mineral density

Next, we applied PSPA to a multi-dimensional linear regression task. The goal is to identify associations between dual-energy x-ray absorptiometry (DXA)-derived total bone mineral density (DXA-BMD) and several covariates: biological sex, age, physical activity (PA), sedentary behavior (SB), smoking status (current smoker or not), and frequency of alcohol intake. DXA-BMD serves as the primary diagnostic marker for osteoporosis and fracture risk in clinical settings. For PA, individuals were assigned scores at three levels (low, medium, and high) according to the International Physical Activity Questionnaire guidelines. SB was quantified as an integer value representing the combined hours spent driving, using a computer, and watching television.
Table 1: Comparison of different methods in identifying risk factors for bone mineral density. "Estimate" and "SE" denote the estimated linear regression coefficient and its corresponding standard error for each covariate; "SE ratio" is the ratio of the standard error of each method to that of the classical method. PA denotes physical activity; SB denotes sedentary behavior. PSPA gives the smallest SE ratio for every covariate. (Method labels were flattened in extraction; the block order Classical, PPI, EIF*-based, PSPA is inferred from the SE ratios and Table 2.)

Method       Quantity   Biological sex   Age        PA         SB         Smoking    Alcohol
Classical    Estimate   -0.616           -0.190     0.019      0.040      0.006      -0.008
             SE         4.20E-03         4.12E-03   4.07E-03   4.18E-03   4.13E-03   4.20E-03
             SE ratio   1                1          1          1          1          1
PPI          Estimate   -0.604           -0.201     0.009      0.043      -0.026     -0.002
             SE         5.73E-03         6.26E-03   6.35E-03   6.48E-03   6.39E-03   6.47E-03
             SE ratio   1.363            1.517      1.560      1.553      1.549      1.541
EIF*-based   Estimate   -0.605           -0.200     0.010      0.042      -0.023     -0.003
             SE         5.28E-03         5.74E-03   5.83E-03   5.95E-03   5.87E-03   5.93E-03
             SE ratio   1.257            1.391      1.433      1.424      1.423      1.412
PSPA         Estimate   -0.614           -0.181     0.010      0.040      -0.008     -5.15E-07
             SE         3.74E-03         3.74E-03   3.74E-03   3.82E-03   3.80E-03   3.81E-03
             SE ratio   0.892            0.907      0.919      0.914      0.920      0.908

We regressed DXA-BMD on these variables using data from the UK Biobank (UKB). In the UKB, DXA-BMD measurements are available for only 10% of the participants, so we employed the SoftImpute algorithm to impute DXA-BMD values for the remaining 90% of individuals in the unlabeled dataset. To prevent overfitting, cross-validation was applied for the imputation in the labeled data. We consider the classical, PPI, and EIF*-based methods as baselines. Table 1 presents the inference results. Both the PPI and EIF*-based methods show larger standard errors than the classical method across all covariates. In contrast, PSPA shows smaller standard errors than the classical method for all coordinates, demonstrating its element-wise variance reduction. In conclusion, the real-data results are consistent with our theoretical and simulation results, demonstrating that our method PSPA provides more efficient inference in real-world post-prediction inference applications.

8. Comparison with PPI++

Finally, we present a comparison between PPI++ (Angelopoulos et al., 2023b) and our method PSPA, both theoretically and empirically. First, PPI++ can be viewed as a special case of PSPA in which every element of the weighting vector $\omega$ is constrained to the same value; PSPA degenerates to PPI++ for one-dimensional estimation tasks such as mean estimation. Moreover, PPI++ cannot be applied to the case where the covariates, rather than the labels, are predicted by ML. Theoretically, since PPI++ is a special case of PSPA, it is no more efficient than PSPA by Proposition 6; moreover, because it uses a single scalar weight, PPI++ cannot guarantee element-wise variance reduction. Empirically, we compared PPI++ with PSPA on all experiments detailed in this paper; the simulation results are shown in Figure 5, and the real-data applications in Figure 6 and Table 2. The simulations show that both PPI++ and PSPA attain correct confidence interval coverage, but PSPA has narrower confidence intervals than PPI++. In the real-data applications, PSPA identifies more sex-biased genes than PPI++ in the GTEx example and has a smaller estimated standard error than PPI++ for every covariate in the DXA-BMD example.
These results are consistent with Proposition 6, showing that PSPA is more efficient than PPI++ in post-prediction inference applications.

9. Discussion

We have provided a simple yet powerful method, PSPA, to improve the efficiency of statistical inference with arbitrary ML predictions. To enable inference, we establish the consistency and asymptotic normality of the proposed estimator and prove its superiority over baseline methods. Through extensive simulations and real-data applications, we demonstrate the superiority of the PSPA estimator over alternatives. This work fits into a variety of scientific problems and unlocks several directions for future research. The same principle can be applied to more complicated data structures and missingness mechanisms, for example in cases where a multi-dimensional outcome or covariate is subject to missingness. Another promising direction is to carefully characterize the relationship between $X$ and $Z$ and how it affects the magnitude of the efficiency improvement.

The R code implementing PSPA and the benchmark methods, and replicating the simulations and real-data analyses, is available at https://github.com/qlu-lab/pspa.

Acknowledgments

We thank the participants of the UW-Madison Statistics Graduate Student Association Student Seminar and the Lu Group Lab Meeting for their valuable comments and questions. This work was supported by the NIH (U01 HG012039, R01 DC021431) and the NSF (DMS-1953526, DMS-2122074, DMS-2310942).

Appendix A. Supplementary Figures and Tables

Figure 5: Comparison of PPI++ and PSPA in simulation. This figure shows the coverage of the confidence interval and the relative ratio of the confidence interval width compared to the classical method for linear and logistic regression, when ML is used to predict the labels. Panels (a)-(d) show the coverage of the confidence interval. Panels (e)-(h) show the relative ratio of the confidence interval width in comparison with the classical method. Panels (a), (b), (e), and (f) correspond to settings with varying sample sizes of unlabeled data; panels (c), (d), (g), and (h) correspond to settings with different levels of imputation accuracy. The dashed lines represent y = 0.95 in (a)-(d) and y = 1 in (e)-(h).

Figure 6: Comparison of PPI++ and PSPA in identifying sex-biased genes in GTEx.

Table 2: Comparison of PPI++ and PSPA in identifying risk factors for bone mineral density. "Estimate" and "SE" denote the estimated linear regression coefficient and its corresponding standard error for each covariate; "SE ratio" is the ratio of the standard error of each method to that of the classical method. PA denotes physical activity; SB denotes sedentary behavior. PSPA gives the smallest SE ratio for each covariate.
Method   Quantity   Biological sex   Age        PA         SB         Smoking    Alcohol
PPI++    Estimate   -0.613           -0.193     0.016      0.041      -0.002     -0.006
         SE         3.86E-03         3.79E-03   3.84E-03   3.91E-03   3.87E-03   3.89E-03
         SE ratio   0.920            0.919      0.943      0.935      0.939      0.926
PSPA     Estimate   -0.614           -0.181     0.010      0.040      -0.008     -5.15E-07
         SE         3.74E-03         3.74E-03   3.74E-03   3.82E-03   3.80E-03   3.81E-03
         SE ratio   0.892            0.907      0.919      0.914      0.920      0.908

Appendix B. Proofs

B.1 Proof of Theorem 3

We first present a lemma that will be used later.

Lemma 12. Suppose $X_1, \dots, X_n$ are i.i.d., the parameter space $\Theta$ is compact, and $L(X, \theta)$ is continuous in $\theta \in \Theta$ almost everywhere. Moreover, suppose there exists a function $h(X)$ satisfying $\|L(X, \theta)\| \le h(X)$ for arbitrary $\theta \in \Theta$ and $\mathbb{E}\{h(X)\} < \infty$. Then $\mathbb{E}\{L(X, \theta)\}$ is continuous in $\theta$ and
$$\sup_{\theta \in \Theta}\Big\|\frac{1}{n}\sum_{i=1}^{n} L(X_i, \theta) - \mathbb{E}\{L(X, \theta)\}\Big\| \xrightarrow{P} 0.$$
Lemma 12 follows from Tauchen (1985, Lemma 1), so its proof is omitted.

We use Condition (C5) in the following form: there exists a small neighborhood $\mathcal{B}$ of $\theta$ in which $\psi(y, x; \theta)$ is continuously differentiable with respect to $\theta$ almost everywhere, and the partial derivatives satisfy $\mathbb{E}\sup_{\theta \in \mathcal{B}}\|\partial\psi(Y, X; \theta)/\partial\theta\| < \infty$ and $\mathbb{E}\sup_{\theta \in \mathcal{B}}\|\partial\psi(\hat{f}(Z), X; \theta)/\partial\theta\| < \infty$.

For simplicity of notation, rewrite the estimating equation (3) as
$$\Psi(\hat{\theta}_{\mathrm{PSPA}}, \omega) = \Psi^Y_n + \mathrm{diag}(\omega)\big(\Psi^{\hat{f}}_N - \Psi^{\hat{f}}_n\big) = 0,$$
where
$$\Psi^Y_n = \frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \hat{\theta}_{\mathrm{PSPA}}), \quad \Psi^{\hat{f}}_n = \frac{1}{n}\sum_{i=1}^{n}\psi\big(\hat{f}(z_i), x_i; \hat{\theta}_{\mathrm{PSPA}}\big), \quad \Psi^{\hat{f}}_N = \frac{1}{N}\sum_{i=n+1}^{n+N}\psi\big(\hat{f}(z_i), x_i; \hat{\theta}_{\mathrm{PSPA}}\big).$$

We first establish consistency following Van der Vaart (2000, Theorem 5.9) by checking its two conditions. The deterministic condition is implied by Conditions (C1) and (C2). It remains to verify the uniform convergence condition $\|\Psi(\hat{\theta}_{\mathrm{PSPA}}) - \mathbb{E}(\Psi^Y_n)\| = o_p(1)$. By Condition (C3) and Lemma 12,
$$\|\mathrm{diag}(\omega)\Psi^{\hat{f}}_n - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_n)\| = o_p(1), \quad \|\mathrm{diag}(\omega)\Psi^{\hat{f}}_N - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_N)\| = o_p(1), \quad \|\Psi^Y_n - \mathbb{E}(\Psi^Y_n)\| = o_p(1).$$
Hence, by the triangle inequality and the fact that $\mathbb{E}(\Psi^{\hat{f}}_N) = \mathbb{E}(\Psi^{\hat{f}}_n)$,
$$\|\Psi(\hat{\theta}_{\mathrm{PSPA}}) - \mathbb{E}(\Psi^Y_n)\| \le \|\mathrm{diag}(\omega)\Psi^{\hat{f}}_N - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_N)\| + \|\mathrm{diag}(\omega)\Psi^{\hat{f}}_n - \mathbb{E}(\mathrm{diag}(\omega)\Psi^{\hat{f}}_n)\| + \|\Psi^Y_n - \mathbb{E}(\Psi^Y_n)\| = o_p(1),$$
which completes the proof of consistency, i.e., $\hat{\theta}_{\mathrm{PSPA}} \xrightarrow{P} \theta$.

Next, we establish asymptotic normality. Expanding $\Psi(\hat{\theta}_{\mathrm{PSPA}}, \omega)$ at the true value $\theta$,
$$0 = \Psi(\theta, \omega) + (\hat{\theta}_{\mathrm{PSPA}} - \theta)^\top\,\partial\Psi(\tilde{\theta})/\partial\theta$$
for some $\tilde{\theta}$ between $\theta$ and $\hat{\theta}_{\mathrm{PSPA}}$. Multiplying both sides by $\sqrt{n}$ and rearranging,
$$\big\{\partial\Psi(\tilde{\theta})/\partial\theta\big\}\,\sqrt{n}\,(\hat{\theta}_{\mathrm{PSPA}} - \theta) = -\sqrt{n}\,\Psi(\theta, \omega).$$
By Slutsky's theorem, it suffices to verify that (i) $\|\partial\Psi(\tilde{\theta}, \omega)/\partial\theta - \mathbb{E}\{\partial\psi(\theta)/\partial\theta\}\| = o_p(1)$ and (ii) $\sqrt{n}\,\Psi(\theta, \omega) \xrightarrow{D} N(0, V(\omega))$.

To show (i), note that within the neighborhood $\mathcal{B}$ of Condition (C5),
$$\Big\|\frac{\partial\Psi(\theta, \omega)}{\partial\theta} - \frac{\partial\Psi^Y_n(\theta)}{\partial\theta}\Big\| \le \Big\|\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_n}{\partial\theta} - \mathbb{E}\Big[\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_n}{\partial\theta}\Big]\Big\| + \Big\|\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_N}{\partial\theta} - \mathbb{E}\Big[\mathrm{diag}(\omega)\frac{\partial\Psi^{\hat{f}}_N}{\partial\theta}\Big]\Big\|.$$
Together with Condition (C4) and Lemma 12, this gives $\|\partial\Psi(\tilde{\theta})/\partial\tilde{\theta} - \partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta}\| = o_p(1)$. By Conditions (C1), (C4) and Lemma 12, $\|\partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta} - \mathbb{E}[\partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta}]\| = o_p(1)$. Combining the two displays yields $\sup_{\tilde{\theta} \in \mathcal{B}}\|\partial\Psi(\tilde{\theta})/\partial\tilde{\theta} - \mathbb{E}[\partial\Psi^Y_n(\tilde{\theta})/\partial\tilde{\theta}]\| = o_p(1)$. The consistency of $\hat{\theta}_{\mathrm{PSPA}}$ and the continuous mapping theorem then imply (i). Meanwhile, (ii) holds by the central limit theorem. This completes the proof of asymptotic normality.

B.2 Proof of Corollary 4

We denote the corresponding estimator by $\hat{\theta}(\hat{\omega})$. The consistency of $\hat{\theta}(\hat{\omega})$ follows from the consistency of $\hat{\omega}$ and the proof in Section B.1.
B.2 Proof of Corollary 4

We denote the corresponding estimator as $\hat\theta(\hat\omega)$. The consistency of $\hat\theta(\hat\omega)$ follows from the consistency of $\hat\omega$ and the proof in Section B.1. For asymptotic normality of $\hat\theta(\hat\omega)$, applying a Taylor expansion to $0 = \Psi(\hat\theta, \hat\omega)$ yields
$$0 = \Psi(\hat\theta, \hat\omega) = \Psi(\theta, \omega^{\mathrm{opt}}) + \mathrm{diag}(\hat\omega - \omega^{\mathrm{opt}})\big({-\Psi_n^{\hat f}} + \Psi_N^{\hat f}\big) + \big\{\partial \Psi(\bar\theta)/\partial\theta^\top\big\}(\hat\theta - \theta)$$
for some $\bar\theta$ between $\theta$ and $\hat\theta(\hat\omega)$. With the same proof technique as in Section B.1, we have $\sqrt{n}(\hat\theta(\hat\omega) - \theta) \xrightarrow{D} N(0, \Sigma(\omega^{\mathrm{opt}}))$, which implies Corollary 4.

B.3 Proof of Proposition 6

The $j$-th diagonal element of the asymptotic covariance matrix $\Sigma(\omega)$ of $\hat\theta_{\mathrm{PSPA}}(\omega)$, given the weighting vector $\omega$, is
$$\Sigma_{jj}(\omega) = \omega_j^2 [A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj} - 2\omega_j [A^{-1}M_4 A^{-1}]_{jj} + [A^{-1}M_1 A^{-1}]_{jj}.$$
It is a quadratic function of $\omega_j$ with the unique minimizer
$$\omega_j^{\mathrm{opt}} = \frac{[A^{-1}M_4 A^{-1}]_{jj}}{[A^{-1}(M_2 + \rho M_3)A^{-1}]_{jj}}.$$
Therefore, given any weighting vector $\omega^* = [\omega_1^*, \ldots, \omega_q^*]^\top$, $\Sigma_{jj}(\omega^{\mathrm{opt}}) \le \Sigma_{jj}(\omega^*)$, which completes the proof.

B.4 Proof of Proposition 7

The joint log-likelihood for a generic observation $(r, x, z, y)$ is
$$\ell = r\log p(y \mid x, z) + \log p(x \mid z) + \log p(z) + r\log(\pi) + (1-r)\log(1-\pi).$$
Taking the semiparametric approach (Bickel et al., 1993; Tsiatis, 2006), we consider the Hilbert space $\mathcal{H}$ that contains all one-dimensional zero-mean measurable functions of the observed data with finite variance, equipped with the inner product $\langle h_1, h_2 \rangle = E\{h_1(\cdot)h_2(\cdot)\}$, where $h_1, h_2 \in \mathcal{H}$. To estimate $\theta$, we regard $p(y \mid x, z)$, $p(x \mid z)$, and $p(z)$ as nuisance functions. Denote their nuisance tangent spaces by $\mathcal{T}_y$, $\mathcal{T}_x$, and $\mathcal{T}_z$, defined as the mean-squared closure of the nuisance tangent spaces of parametric submodels spanned by the nuisance score vectors. We have $\mathcal{T} = \mathcal{T}_y \oplus \mathcal{T}_x \oplus \mathcal{T}_z$, where
$$\mathcal{T}_y = \{r\,a_1(y, x, z) : E\{a_1(Y, x, z) \mid x, z\} = 0\}, \quad \mathcal{T}_x = \{a_2(x, z) : E\{a_2(X, z) \mid z\} = 0\}, \quad \mathcal{T}_z = \{a_3(z) : E\{a_3(Z)\} = 0\},$$
and $\oplus$ represents the direct sum of spaces that are orthogonal to each other. We introduce parametric submodels $p_{\eta_y}(y \mid x, z)$, $p_{\eta_x}(x \mid z)$, and $p_{\eta_z}(z)$, where $\eta = [\eta_y^\top, \eta_x^\top, \eta_z^\top, \eta_r^\top]^\top$ is a vector of nuisance parameters. Let
$$S_{\eta_y}(y, x, z) = \frac{\partial \log p_{\eta_y}(y \mid x, z)}{\partial \eta_y}, \quad S_{\eta_x}(x, z) = \frac{\partial \log p_{\eta_x}(x \mid z)}{\partial \eta_x}, \quad S_{\eta_z}(z) = \frac{\partial \log p_{\eta_z}(z)}{\partial \eta_z};$$
then these score functions satisfy $r\,S_{\eta_y}(y, x, z) \in \mathcal{T}_y$, $S_{\eta_x}(x, z) \in \mathcal{T}_x$, and $S_{\eta_z}(z) \in \mathcal{T}_z$. Recall the definition of the parameter of interest, i.e.,
$$\iiint \psi(y, x; \theta)\, p(y \mid x, z)\, p(x \mid z)\, p(z)\, dz\, dx\, dy = 0.$$
Taking derivatives of $\theta$ with respect to the nuisance parameters,
$$\frac{\partial \theta}{\partial \eta_y^\top} = A^{-1}\frac{\partial E\{\psi(Y, X; \theta)\}}{\partial \eta_y^\top} = A^{-1}E\{\psi(Y, X; \theta)S_{\eta_y}^\top(Y, X, Z)\},$$
$$\frac{\partial \theta}{\partial \eta_x^\top} = A^{-1}\frac{\partial E\{\psi(Y, X; \theta)\}}{\partial \eta_x^\top} = A^{-1}E\{\psi(Y, X; \theta)S_{\eta_x}^\top(X, Z)\},$$
$$\frac{\partial \theta}{\partial \eta_z^\top} = A^{-1}\frac{\partial E\{\psi(Y, X; \theta)\}}{\partial \eta_z^\top} = A^{-1}E\{\psi(Y, X; \theta)S_{\eta_z}^\top(Z)\}.$$
Let $\phi(y, x; \theta) = r\,a_1(y, x, z) + a_2(x, z) + a_3(z)$, where
$$a_1(y, x, z) = \frac{1}{\pi(z)}A^{-1}\big[\psi(y, x; \theta) - E\{\psi(Y, x; \theta) \mid x, z\}\big],$$
$$a_2(x, z) = A^{-1}\big[E\{\psi(Y, x; \theta) \mid x, z\} - E\{\psi(Y, X; \theta) \mid z\}\big],$$
$$a_3(z) = A^{-1}E\{\psi(Y, X; \theta) \mid z\}.$$
Then it can be easily verified that $r\,a_1(y, x, z) \in \mathcal{T}_y$, $a_2(x, z) \in \mathcal{T}_x$, $a_3(z) \in \mathcal{T}_z$, and
$$E\{\phi(Y, X; \theta)R\,S_{\eta_y}^\top(Y, X, Z)\} = \frac{\partial \theta}{\partial \eta_y^\top}, \quad E\{\phi(Y, X; \theta)S_{\eta_x}^\top(X, Z)\} = \frac{\partial \theta}{\partial \eta_x^\top}, \quad E\{\phi(Y, X; \theta)S_{\eta_z}^\top(Z)\} = \frac{\partial \theta}{\partial \eta_z^\top}.$$
Therefore, $\phi(y, x; \theta)$ is the efficient influence function. The proof of Proposition 7 is completed.

B.5 Lemma for PSPA when applied to ML-predicted covariates

Lemma 13 Let $A$ and $B$ be two $q \times q$ Gram matrices and let $\omega$ be a $q$-dimensional vector. If $0 \le \omega_j \le \lambda_{\min,+}(A)/\lambda_{\max}(B)$ for all $j \in \{1, \ldots, q\}$, then $A - \mathrm{diag}(\omega)B$ is positive semi-definite.

To prove the lemma, we verify that $A - \mathrm{diag}(\omega)B$ is positive semi-definite by showing that the quadratic form satisfies $x^\top(A - \mathrm{diag}(\omega)B)x \ge 0$ for any vector $x \in \mathbb{R}^q$. Since
$$x^\top A x \ge \lambda_{\min,+}(A)\|x\|^2, \quad x^\top \mathrm{diag}(\omega)Bx \le \lambda_{\max}(\mathrm{diag}(\omega)B)\|x\|^2 \le \big(\max_j \omega_j\big)\lambda_{\max}(B)\|x\|^2,$$
we have
$$x^\top A x - x^\top \mathrm{diag}(\omega)Bx \ge \big\{\lambda_{\min,+}(A) - \max_j \omega_j\,\lambda_{\max}(B)\big\}\|x\|^2.$$
To ensure $x^\top A x - x^\top \mathrm{diag}(\omega)Bx \ge 0$, it suffices that $\lambda_{\min,+}(A) - \max_j \omega_j\,\lambda_{\max}(B) \ge 0$, that is,
$$\max_j \omega_j \le \frac{\lambda_{\min,+}(A)}{\lambda_{\max}(B)}.$$
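The bound in Lemma 13 translates directly into a computable cap on the weights. Below is a minimal R sketch, assuming A and B are the symmetric $q \times q$ Gram matrices of the lemma; the helper name clip_omega and the numerical tolerance are our own illustrative choices, not part of the released pspa package.

```r
# Illustrative sketch of the Lemma 13 cap (helper name and tolerance are ours):
# clip each omega_j into [0, lambda_min+(A) / lambda_max(B)] so that
# A - diag(omega) %*% B remains positive semi-definite.
clip_omega <- function(omega, A, B, tol = 1e-12) {
  eval_A <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
  lambda_min_pos <- min(eval_A[eval_A > tol])  # smallest positive eigenvalue of A
  lambda_max_B <- max(eigen(B, symmetric = TRUE, only.values = TRUE)$values)
  pmin(pmax(omega, 0), lambda_min_pos / lambda_max_B)
}
```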
B.6 Proof of Corollary 11

The $j$-th diagonal element of the asymptotic covariance matrix $\Sigma^*(\omega)$ of $\hat\theta_{\mathrm{PSPA}}$, given the weighting vector $\omega$, is
$$\Sigma^*_{jj}(\omega) = \omega_j^2 [A^{-1}(M_2^* + \rho M_3^*)A^{-1}]_{jj} - 2\omega_j [A^{-1}M_4^* A^{-1}]_{jj} + [A^{-1}M_1^* A^{-1}]_{jj}.$$
It is a quadratic function of $\omega_j$ with the unique minimizer
$$\omega_j^{\mathrm{opt},*} = \frac{[A^{-1}M_4^* A^{-1}]_{jj}}{[A^{-1}(M_2^* + \rho M_3^*)A^{-1}]_{jj}}.$$
If $\omega_j^{\mathrm{opt},*} > 0$, $\Sigma^*_{jj}(\omega)$ is a decreasing function of $\omega_j$ on $[0, \omega_j^{\mathrm{opt},*}]$. If $\omega_j^{\mathrm{opt},*} \le 0$, $\Sigma^*_{jj}(\omega)$ is an increasing function of $\omega_j$ on $[\omega_j^{\mathrm{opt},*}, 0]$. We also have
$$\frac{\lambda_{\min,+}\big\{\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, x_i; \theta)\big\}}{\lambda_{\max}\big\{\frac{1}{n}\sum_{i=1}^{n}\psi(y_i, \hat q_i; \theta)\big\}} > 0.$$
Therefore, given $\hat\omega^{\mathrm{opt}} = [\hat\omega_1^{\mathrm{opt}}, \ldots, \hat\omega_q^{\mathrm{opt}}]^\top$, where
$$\hat\omega_j^{\mathrm{opt}} = \begin{cases}
\min\left(\dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}},\; \dfrac{\lambda_{\min,+}\{\frac{1}{n}\sum_i \psi(y_i, x_i; \theta)\}}{\lambda_{\max}\{\frac{1}{n}\sum_i \psi(y_i, \hat q_i; \theta)\}}\right) & \text{if } \dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}} > 0, \\[2ex]
\max\left(\dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}},\; 0\right) & \text{if } \dfrac{[\hat A^{-1}\hat M_{4,C}\hat A^{-1}]_{jj}}{[\hat A^{-1}(\hat M_{2,C} + \rho \hat M_{3,C})\hat A^{-1}]_{jj}} \le 0,
\end{cases}$$
for all $j \in \{1, \ldots, q\}$, and the weighting vector $\omega_C = [0, \ldots, 0]^\top$ that corresponds to the classical estimator, we have $\Sigma^*_{jj}(\hat\omega^{\mathrm{opt}}) \le \Sigma^*_{jj}(\omega_C)$, which completes the proof.
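Combining B.3, B.5, and B.6, the data-adaptive weight is the element-wise quadratic minimizer truncated to the safe range. Below is a schematic R sketch under the plug-in quantities above, where Ainv, M2, M3, M4 denote estimates of $\hat A^{-1}$, $\hat M_{2,C}$, $\hat M_{3,C}$, $\hat M_{4,C}$, rho is the sample-size ratio as defined in the paper, and cap is the Lemma 13 eigenvalue bound; the function name omega_opt_clipped is ours and not part of the pspa package.

```r
# Schematic sketch (names are ours, not the pspa package API): element-wise
# optimal weight from the quadratic in B.6, truncated to [0, cap] per Lemma 13.
# Ainv: estimate of A^{-1}; M2, M3, M4: plug-in covariance pieces;
# rho: the labeled/unlabeled sample-size ratio as defined in the paper.
omega_opt_clipped <- function(Ainv, M2, M3, M4, rho, cap) {
  num <- diag(Ainv %*% M4 %*% Ainv)
  den <- diag(Ainv %*% (M2 + rho * M3) %*% Ainv)
  ratio <- num / den
  ifelse(ratio > 0, pmin(ratio, cap), 0)  # the two cases in B.6
}
```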
References

Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023a. doi: 10.1126/science.adi6000. URL https://www.science.org/doi/abs/10.1126/science.adi6000.

Anastasios N. Angelopoulos, John C. Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453, 2023b.

David Azriel, Lawrence D. Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251, 2022.

Mahashweta Basu, Kun Wang, Eytan Ruppin, and Sridhar Hannenhalli. Predicting tissue-specific gene expression from whole blood transcriptome. Science Advances, 7(14):eabd6991, 2021.

Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Springer, 1993.

F. Jay Breidt and Jean D. Opsomer. Model-assisted survey estimation with modern prediction techniques. Statistical Science, 2017.

Eric L. Bullock, Curtis E. Woodcock, Carlos Souza Jr., and Pontus Olofsson. Satellite-based estimates reveal widespread forest degradation in the Amazon. Global Change Biology, 26(5):2956–2969, 2020.

Abhishek Chakrabortty and Tianxi Cai. Efficient and adaptive linear regression in semi-supervised settings. The Annals of Statistics, 2018.

Xiaohong Chen, Han Hong, and Elie Tamer. Measurement error models with auxiliary data. The Review of Economic Studies, 72(2):343–366, 2005.

GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509):1318–1330, 2020.

GTEx Consortium, Kristin G. Ardlie, David S. Deluca, Ayellet V. Segrè, Timothy J. Sullivan, Taylor R. Young, Ellen T. Gelfand, Casandra A. Trowbridge, Julian B. Maller, Taru Tukiainen, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 348(6235):648–660, 2015.

Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331, 1998.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Nathan Kallus and Xiaojie Mao. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408, 2020.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 45(6):580–585, 2013.

Geert Molenberghs, Garrett Fitzmaurice, Michael G. Kenward, Anastasios Tsiatis, and Geert Verbeke. Handbook of Missing Data Methodology. CRC Press, 2014.

James M. Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, volume 1999, pages 6–10, Indianapolis, IN, 2000.

James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

Shanshan Song, Yuanyuan Lin, and Yong Zhou. A general M-estimation theory in semi-supervised framework. Journal of the American Statistical Association, pages 1–11, 2023.

George Tauchen. Diagnostic testing and evaluation of maximum likelihood models. Journal of Econometrics, 30(1):415–443, 1985. ISSN 0304-4076. doi: https://doi.org/10.1016/0304-4076(85)90149-6. URL https://www.sciencedirect.com/science/article/pii/0304407685901496.

Anastasios A. Tsiatis. Semiparametric Theory and Missing Data. Springer, 2006.

Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Ramon Viñas, Chaitanya K. Joshi, Dobrik Georgiev, Phillip Lin, Bianca Dumitrascu, Eric R. Gamazon, and Pietro Liò. Hypergraph factorization for multi-tissue gene expression imputation. Nature Machine Intelligence, 5(7):739–753, 2023.

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.

Jiebiao Wang, Eric R. Gamazon, Brandon L. Pierce, Barbara E. Stranger, Hae Kyung Im, Robert D. Gibbons, Nancy J. Cox, Dan L. Nicolae, and Lin S. Chen. Imputing gene expression in uncollected tissues within and beyond GTEx. The American Journal of Human Genetics, 98(4):697–708, 2016.

Junhui Wang and Xiaotong Shen. Large margin semi-supervised learning. Journal of Machine Learning Research, 8(8), 2007.

Siruo Wang, Tyler H. McCormick, and Jeffrey T. Leek. Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48):30266–30275, 2020.

Anru Zhang, Lawrence D. Brown, and T. Tony Cai. Semi-supervised inference: General theory and estimation of means. The Annals of Statistics, 2019.