# Revisiting inference after prediction

Journal of Machine Learning Research 24 (2023) 1-18. Submitted 7/23; Revised 11/23; Published 11/23.

Keshav Motwani (kmotwani@uw.edu), Department of Biostatistics, University of Washington, Seattle, WA

Daniela Witten (dwitten@uw.edu), Departments of Biostatistics and Statistics, University of Washington, Seattle, WA

Editor: Mladen Kolar

Abstract

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.

©2023 Keshav Motwani and Daniela Witten. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/23-0896.html.

1. Introduction

Rapid recent progress in the field of machine learning has enabled the development of complex and high-quality machine learning models to predict a response variable of interest. This is particularly attractive in settings where future measurement of this response variable is prohibitively expensive or impossible. For example, instead of performing expensive experiments to determine a protein's structure, it is possible to obtain high-quality structural predictions using AlphaFold (Jumper et al., 2021). Similarly, in developing countries, determining the true cause of death may be impossible; instead, one might predict the cause of death on the basis of a verbal autopsy (Clark et al., 2015; Khoury et al., 1999). In the context of gene expression data, it is infeasible to conduct experiments in every possible tissue type, and so instead a machine learning model can be applied to predict gene expression in a tissue type of interest (Ellis et al., 2018; Gamazon et al., 2015; Gusev et al., 2019).

In this paper, we will use the notation $\hat{f}: \mathcal{Z} \to \mathcal{Y}$ to denote a pre-trained machine learning model that maps from $\mathcal{Z}$, the space of the predictors $Z$, to $\mathcal{Y}$, the space of the response variable $Y$. We assume that the operating characteristics of $\hat{f}(\cdot)$ in the study population of interest are unknown to the end-user, and that the data used to fit $\hat{f}(\cdot)$ are unavailable. Thus, in what follows, we will treat $\hat{f}(\cdot)$ as a black-box function.
In important recent papers, Wang et al. (2020) and Angelopoulos et al. (2023) consider the practice of prediction-based inference¹: that is, of quantifying the association between the response $Y$ and some covariates $X$ using realizations not of $(Y, X)$, but rather of $(\hat{f}(Z), X)$. Such an approach is attractive in cases where the association between $Y$ and $X$ is of interest, and both $\hat{f}(\cdot)$ and a large sample of realizations of predictors $Z$ and covariates $X$ are available, but realizations of $Y$ are expensive or otherwise unavailable. Throughout, we will use $Z$ to denote the predictors in the machine learning model $\hat{f}(\cdot)$, and $X$ to denote the covariates whose association with $Y$ is of interest. In many settings, the covariate variables $X$ may be a subset of the predictor variables $Z$, or may be identical to $Z$, but this is not necessarily the case.

¹ Wang et al. (2020) and Angelopoulos et al. (2023) refer to this practice as post-prediction inference and prediction-powered inference, respectively; to unify terminology, here we refer to it as prediction-based inference.

Wang et al. (2020) point out that a naive approach to prediction-based inference, which simplistically interprets the association between $(\hat{f}(Z), X)$ as the association between $(Y, X)$, is problematic from a statistical perspective. For instance, regressing $\hat{f}(Z)$ onto $X$ using least squares does not lead to valid inference on the association between $Y$ and $X$: e.g. it leads to hypothesis tests that fail to control type 1 error, and confidence intervals that do not achieve the nominal coverage. See Box 1.

Box 1: Naive approach to prediction-based inference. We are given a (pre-trained) prediction function $\hat{f}(\cdot): \mathcal{Z} \to \mathcal{Y}$, and an unlabeled dataset $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$ of realizations of $(Z, X)$. As pointed out by Wang et al. (2020), the naive approach displayed here does not allow for valid inference on the association between $Y$ and $X$.
Step 1: Compute $\hat{\mathbf{y}}_{\rm unlab} = \hat{f}(\mathbf{z}_{\rm unlab})$.
Step 2: Conduct inference on the association between $Y$ and $X$, using $\hat{\mathbf{y}}_{\rm unlab}$ and $\mathbf{x}_{\rm unlab}$ as data, without accounting for the fact that $(\hat{\mathbf{y}}_{\rm unlab}, \mathbf{x}_{\rm unlab})$ is not a sample from the distribution of $(Y, X)$.

Wang et al. (2020) and Angelopoulos et al. (2023) propose creative solutions to overcome this issue. They assume that in addition to a large unlabeled dataset $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$, they also have access to a (relatively small) labeled dataset $(\mathbf{y}_{\rm lab}, \mathbf{z}_{\rm lab}, \mathbf{x}_{\rm lab})$. We focus on the case where the labeled and unlabeled data are independent and identically distributed samples from the same study population of interest, though Angelopoulos et al. (2023) also extend beyond this setting. See Box 2. We emphasize that the goal is to quantify the association between $Y$ and $X$.

Box 2: Setting of prediction-based inference.
Given: A pre-trained machine learning model $\hat{f}: \mathcal{Z} \to \mathcal{Y}$.
Goal: To quantify the association between a response $Y \in \mathcal{Y}$ and covariates $X \in \mathcal{X}$.
Data: A (relatively small) labeled dataset $(\mathbf{y}_{\rm lab}, \mathbf{z}_{\rm lab}, \mathbf{x}_{\rm lab})$ consisting of i.i.d. realizations of $(Y, Z, X)$, and a (large) unlabeled dataset $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$ consisting of i.i.d. realizations of $(Z, X)$. Both are drawn from the same study population.

One simple (and valid) option is to quantify the association between $Y$ and $X$ using only the labeled data $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$. However, this approach entirely discards the vast amount of unlabeled data $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$. Intuitively, if the prediction function $\hat{f}(\cdot)$ is nearly perfect on our population of interest, then using $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ in addition to $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$ will aid our efforts to quantify the association between $Y$ and $X$. By contrast, if $\hat{f}(Z)$ is a very poor prediction of $Y$ on our population of interest, then using $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ in addition to $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$ may hinder our efforts. Our goal is valid quantification of the association between $Y$ and $X$, regardless of the quality of $\hat{f}(\cdot)$ on the population of interest.
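To make the setting and the naive approach of Box 1 concrete, the following is a minimal Python sketch. The data-generating model, the deliberately crude pre-trained model, and all numbers are our own illustrative assumptions, not the setup of either paper; the point is only to show the classical labeled-data estimate next to the naive estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two predictors; the covariate of interest X is the first predictor, Z_1.
    z = rng.normal(size=(n, 2))
    y = 1.0 + 0.5 * z[:, 0] + np.sin(z[:, 1]) + rng.normal(size=n)
    x = np.column_stack([np.ones(n), z[:, 0]])  # design matrix: intercept and X
    return y, z, x

def ols(y, x):
    # Least-squares coefficients with classical (model-based) standard errors.
    beta = np.linalg.solve(x.T @ x, x.T @ y)
    resid = y - x @ beta
    sigma2 = resid @ resid / (len(y) - x.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(x.T @ x)))
    return beta, se

# A pre-trained, black-box prediction model f_hat. Here it is a deliberately crude
# stand-in (a linear fit that ignores Z_1), since in the setting of Box 2 its
# quality on the study population is unknown to the analyst.
y_tr, z_tr, _ = make_data(300)
coef_tr = np.polyfit(z_tr[:, 1], y_tr, 1)
f_hat = lambda z: np.polyval(coef_tr, z[:, 1])

y_lab, _, x_lab = make_data(300)            # small labeled sample
_, z_unlab, x_unlab = make_data(10_000)     # large unlabeled sample; Y is never observed

beta_classical, se_classical = ols(y_lab, x_lab)        # valid: labeled data only
beta_naive, se_naive = ols(f_hat(z_unlab), x_unlab)     # Box 1: treats f_hat(Z) as if it were Y
print("classical:", beta_classical[1], "+/-", se_classical[1])
print("naive:    ", beta_naive[1], "+/-", se_naive[1])  # biased toward 0 here, with a tiny SE
```

In this toy example the true slope is 0.5; the naive estimate is pulled toward zero and is reported with an overconfident standard error, which is the kind of failure the rest of the paper analyzes.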
To achieve this goal, Wang et al. (2020) propose to (Step 1) model the association between $Y$ and $\hat{f}(Z)$ using the labeled dataset $(\mathbf{y}_{\rm lab}, \hat{f}(\mathbf{z}_{\rm lab}))$, and then (Step 2) incorporate the model from Step 1 to conduct inference on the association between $Y$ and $X$ using the unlabeled data $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$. See Box 3.

Box 3: Wang et al. (2020)'s proposal to correct prediction-based inference.
Step 1: Model the association between $Y$ and $\hat{f}(Z)$ using the labeled data $(\mathbf{y}_{\rm lab}, \hat{f}(\mathbf{z}_{\rm lab}))$.
Step 2: Incorporate the model from Step 1 to conduct inference on the association between $Y$ and $X$ using the unlabeled data $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$. To do this, bootstrap and analytical approaches are proposed.

By contrast, Angelopoulos et al. (2023) propose to de-bias the estimates obtained using the unlabeled data using information from the labeled data. In the case of estimands that are linear in $Y$, they (Step 1) compute the difference between the estimate of the parameter of interest obtained using $(\hat{f}(\mathbf{z}_{\rm lab}), \mathbf{x}_{\rm lab})$ and the estimate obtained using $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$. They then (Step 2) correct the estimate obtained using the unlabeled dataset $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ by this amount. See Box 4. Angelopoulos et al. (2023) also propose a more general framework for estimands that minimize the expectation of a general loss function, though we focus on the linear case in this paper for simplicity.

Box 4: Angelopoulos et al. (2023)'s proposal to correct prediction-based inference (in the special case of estimands that are linear in $Y$).
Step 1: Compute the difference between the estimate of the parameter of interest obtained using $(\hat{f}(\mathbf{z}_{\rm lab}), \mathbf{x}_{\rm lab})$ and the estimate obtained using $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$.
Step 2: Correct the parameter estimate obtained using the unlabeled dataset $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ by the difference computed in Step 1.

In this paper, we investigate these two proposals. In Section 2, we ask a fundamental question: what parameter is each proposal targeting? We see that the proposal of Angelopoulos et al. (2023) targets the parameter of interest, whereas that of Wang et al. (2020) does not. In Sections 3 and 4, we investigate the empirical consequences of our findings from Section 2. These empirical investigations paint a clear picture: namely, that failure to target the correct parameter has substantial statistical consequences for the proposal of Wang et al. (2020), in the form of hypothesis tests that fail to control the type 1 error, and confidence intervals that fail to attain the nominal coverage. The proposal of Angelopoulos et al. (2023) does not suffer these consequences, as it targets the correct parameter. We close with a discussion in Section 5.

In this paper, we use capitals to represent a random variable and lower case to represent its realization. Vectors of length equal to the number of observations, or matrices whose rows correspond to the observations, are in bold.

2. What parameter is each method targeting?

For concreteness, suppose that we would have fit a linear regression model on realizations of $(Y, X)$ using least squares, had a large number of realizations of $(Y, X)$ been available. Therefore, our goal is to conduct inference on the population parameter

$$\beta^* = \arg\min_{\beta} \mathbb{E}\left[(Y - \beta^\top X)^2\right] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]. \tag{1}$$
The naive method (Box 1) and the proposals of Wang et al. (2020) and Angelopoulos et al. (2023) use a test statistic of the form

$$T_j = \frac{\hat\beta_j - \beta^*_j}{\widehat{\mathrm{SE}}(\hat\beta_j)}$$

for testing or constructing confidence intervals for $\beta^*_j$. They rely on it having a known distribution, or converging in distribution to a known distribution with increasing sample size. However, if $\hat\beta_j$ does not converge in probability to $\beta^*_j$ and $\widehat{\mathrm{SE}}(\hat\beta_j)$ goes to 0 as $n_{\rm lab}$ and $n_{\rm unlab}$ increase, then $T_j$ does not converge in distribution (see Appendix A for a formal statement of this). Therefore, consistency of $\hat\beta_j$ for $\beta^*_j$ is a necessary condition for valid inference using this approach. We now investigate whether this is the case.

2.1 The general case for an arbitrary prediction model $\hat f$

We first consider the naive approach, as defined in Box 1. Fitting a linear model with least squares would result in $\hat\beta_{\rm naive} = (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \hat f(\mathbf{z}_{\rm unlab})$. We can see that $\hat\beta_{\rm naive} \to_p \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)]$ as $n_{\rm unlab}$ increases, which is not equal to $\beta^* = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]$ in general. In fact, viewing $\hat f(\cdot)$ as a black-box function with unknown operating characteristics in the study population, we see that the quantity $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)]$ does not even involve the response, $Y$; therefore, the parameter targeted by the naive method is not of any scientific interest.

Now, we consider the bootstrap variant of the proposal of Wang et al. (2020), which is introduced in Box 3. A detailed description of this proposal is presented in Algorithm 1. We take Step 2.3 to involve a least squares regression.

Algorithm 1: Bootstrap correction of Wang et al. (2020). The goal is to conduct inference on the association between $Y$ and $X$.
1. Use $(\mathbf{y}_{\rm lab}, \hat f(\mathbf{z}_{\rm lab}))$ to fit the relationship model $Y \mid \hat f(Z) \sim K(\hat f(Z), \phi)$, yielding $\hat\phi$.
2. For $b = 1, \ldots, B$:
   2.1. Sample unlabeled observations with replacement to obtain $\mathbf{z}^b_{\rm unlab}$ and $\mathbf{x}^b_{\rm unlab}$.
   2.2. Sample outcomes $\tilde{\mathbf{y}}^b \mid \hat f(\mathbf{z}^b_{\rm unlab})$ from the relationship model $K(\hat f(\mathbf{z}^b_{\rm unlab}), \hat\phi)$.
   2.3. Use $(\tilde{\mathbf{y}}^b, \mathbf{x}^b_{\rm unlab})$ to fit a regression model for the relationship between $Y$ and $X$, and record the coefficient estimate $\hat\beta^b$ and model-based standard error $\hat s^b$.
3. Compute the point estimate $\hat\beta = \mathrm{median}\{\hat\beta^1, \ldots, \hat\beta^B\}$.
4. Compute the nonparametric standard error $\widehat{\mathrm{SE}}(\hat\beta) = \mathrm{SD}\{\hat\beta^1, \ldots, \hat\beta^B\}$.
5. Compute the parametric standard error $\widehat{\mathrm{SE}}(\hat\beta) = \mathrm{median}\{\hat s^1, \ldots, \hat s^B\}$.

Note that Step 2.2 of Algorithm 1 involves sampling observations $\tilde{\mathbf{y}}^b$ from $K(\hat f(\mathbf{z}^b_{\rm unlab}), \hat\phi)$ for use in fitting a regression model in Step 2.3, giving us an estimate $\hat\beta^{\mathrm{bootstrap},b}_{\rm Wang} = (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \tilde{\mathbf{y}}^b$. Thus $\hat\beta^{\mathrm{bootstrap},b}_{\rm Wang} \to_p \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)]$ as $n_{\rm unlab}$ increases, where we slightly abuse notation by letting $K(\hat f(Z), \hat\phi)$ denote a random variable with distribution $K(\hat f(Z), \hat\phi)$. In general, $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)]$ is not equal to the parameter of interest $\beta^* = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]$. Note that both the parametric and non-parametric bootstrap corrections suffer from this issue.
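To make the structure of Algorithm 1 concrete, here is a minimal, self-contained sketch of the parametric bootstrap correction. It substitutes a simple linear relationship model for the GAM used by Wang et al. (2020), and the function and argument names (e.g. `fz_lab` for $\hat f(\mathbf{z}_{\rm lab})$) are ours, so it illustrates the shape of the algorithm rather than reproducing the authors' implementation.

```python
import numpy as np

def ols_slope(y, x):
    """Least-squares fit of y on [1, x]; returns the slope estimate and its classical SE."""
    design = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.solve(design.T @ design, design.T @ y)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    return beta[1], se

def wang_parametric_bootstrap(y_lab, fz_lab, x_unlab, fz_unlab, B=200, seed=1):
    """Sketch of Algorithm 1 with a linear relationship model standing in for the GAM."""
    rng = np.random.default_rng(seed)
    # Step 1: fit the relationship model Y | f_hat(Z) on the labeled data.
    d_lab = np.column_stack([np.ones(len(fz_lab)), fz_lab])
    phi = np.linalg.lstsq(d_lab, y_lab, rcond=None)[0]
    resid_sd = np.std(y_lab - d_lab @ phi)
    betas, ses = [], []
    n = len(fz_unlab)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # Step 2.1: resample unlabeled rows
        y_sim = (phi[0] + phi[1] * fz_unlab[idx]
                 + rng.normal(scale=resid_sd, size=n))    # Step 2.2: simulate outcomes
        b, s = ols_slope(y_sim, x_unlab[idx])             # Step 2.3: regress simulated Y on X
        betas.append(b)
        ses.append(s)
    return (np.median(betas),          # Step 3: point estimate
            np.std(betas, ddof=1),     # Step 4: nonparametric standard error
            np.median(ses))            # Step 5: parametric standard error
```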
The analytic variant of the proposal of Wang et al. (2020) adjusts the naive estimate by the coefficient of $\hat f(Z)$ in a regression of $Y$ onto $\hat f(Z)$, with an intercept, using the labeled data²; that is, the estimate takes the form

$$\hat\beta^{\rm analytical}_{\rm Wang} = \frac{\widehat{\mathrm{Cov}}(\mathbf{y}_{\rm lab}, \hat f(\mathbf{z}_{\rm lab}))}{\widehat{\mathrm{Var}}(\hat f(\mathbf{z}_{\rm lab}))}\,\hat\beta_{\rm naive}. \tag{2}$$

Thus, as $n_{\rm lab}$ and $n_{\rm unlab}$ increase,

$$\hat\beta^{\rm analytical}_{\rm Wang} \to_p \frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))}\,\mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)],$$

which again is not equal to $\beta^*$ in general.

² The expression for $\hat\beta^{\rm analytical}_{\rm Wang}$ given in (2) is implemented in Wang et al. (2020)'s code. Their publication involves a slightly different expression for $\hat\beta^{\rm analytical}_{\rm Wang}$, which also does not converge to the parameter of interest $\beta^*$, for a similar reason; we show this in Appendix B.

This highlights a cause for concern about the proposals of Wang et al. (2020): namely, that the wrong parameter is being targeted. This calls into question whether the inference that they propose will achieve the desired statistical guarantees; we investigate this issue further in the next two sections.

Finally, we turn to the proposal of Angelopoulos et al. (2023), which is introduced in Box 4. In the case of linear regression, their estimate takes the form

$$\hat\beta_{\rm Angelopoulos} = (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \hat f(\mathbf{z}_{\rm unlab}) + \left\{ (\mathbf{x}_{\rm lab}^\top \mathbf{x}_{\rm lab})^{-1}\mathbf{x}_{\rm lab}^\top \mathbf{y}_{\rm lab} - (\mathbf{x}_{\rm lab}^\top \mathbf{x}_{\rm lab})^{-1}\mathbf{x}_{\rm lab}^\top \hat f(\mathbf{z}_{\rm lab}) \right\}.$$

As $n_{\rm lab}$ and $n_{\rm unlab}$ increase, we see that

$$\hat\beta_{\rm Angelopoulos} \to_p \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)] + \left\{ \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY] - \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)] \right\} = \beta^*,$$

so the proposal of Angelopoulos et al. (2023) correctly targets the desired quantity.
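For comparison, a sketch of the de-biasing step in Box 4 for the linear case is below. It mirrors the displayed formula for $\hat\beta_{\rm Angelopoulos}$; the confidence-interval construction of Angelopoulos et al. (2023) is more involved, so only the point estimate is shown, and the function and argument names are ours.

```python
import numpy as np

def ols_coef(y, x):
    """Least-squares coefficient vector for the design matrix x."""
    return np.linalg.solve(x.T @ x, x.T @ y)

def angelopoulos_debiased(y_lab, fz_lab, x_lab, fz_unlab, x_unlab):
    """Point estimate from Box 4: unlabeled-data estimate plus a labeled-data correction."""
    est_unlab = ols_coef(fz_unlab, x_unlab)   # regress f_hat(Z) on X, unlabeled data
    est_lab_y = ols_coef(y_lab, x_lab)        # regress Y on X, labeled data
    est_lab_f = ols_coef(fz_lab, x_lab)       # regress f_hat(Z) on X, labeled data
    # Step 1: the correction is the labeled-data estimate using Y minus the
    #         labeled-data estimate using f_hat(Z).
    # Step 2: add the correction to the unlabeled-data estimate.
    return est_unlab + (est_lab_y - est_lab_f)
```

Because all three regressions use least squares with the same covariates, the correction term estimates exactly the bias $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY] - \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)]$ of the unlabeled-data estimate, which is why the combined estimator targets $\beta^*$.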
2.2 An extreme setting where all methods target the correct quantity

We now consider an extreme setting where the prediction model exactly equals the true regression function: that is, $\hat f(z) = \mathbb{E}[Y \mid Z = z]$. We also assume that $X$ is contained within $Z$: that is, $Z = (X, Z^{(2)})$ for some $Z^{(2)}$. This is a reasonable assumption in practice, since the covariate of interest is likely also a predictor in the machine learning model.

In this extreme setting, the naive method targets

$$\mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}\big[X\,\mathbb{E}[Y\mid Z]\big] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY].$$

In other words, it targets the correct parameter of interest.

We will now show that the bootstrap methods of Wang et al. (2020) can similarly target the correct parameter in this extreme setting, for example, if $K(\hat f(Z), \hat\phi)$ is defined by fitting a generalized additive model (GAM) to $(Y, \hat f(Z))$ and adding mean-zero noise, as in Wang et al. (2020). That is,

$$K(\hat f(Z), \hat\phi) \stackrel{d}{=} \hat g(\hat f(Z)) + \epsilon, \tag{3}$$

where $\epsilon$ is mean-zero noise and $\hat g(\cdot)$ is the fitted GAM. The fitted GAM $\hat g(\cdot)$ takes the form

$$\hat g(\cdot) \approx \arg\min_g \mathbb{E}\left[\big(Y - g(\hat f(Z))\big)^2\right] \tag{4}$$
$$= \arg\min_g \mathbb{E}\left[\big(Y - g(\mathbb{E}[Y\mid Z])\big)^2\right] \tag{5}$$
$$= \arg\min_g \mathbb{E}\left[\mathbb{E}\left[\big(Y - g(\mathbb{E}[Y\mid Z])\big)^2 \,\Big|\, Z\right]\right]. \tag{6}$$

The approximation in (4) holds if the labeled sample size is sufficiently large. Equation 5 is a consequence of the extreme assumption that $\hat f(z) = \mathbb{E}[Y\mid Z = z]$. Equation 6 follows from iterated expectations.

It is not hard to see that (6) is minimized when $g(\mathbb{E}[Y\mid Z]) = \mathbb{E}[Y\mid Z]$, i.e., $\hat g(\cdot)$ is approximately the identity function. Combining this with the extreme assumption that $\hat f(z) = \mathbb{E}[Y\mid Z = z]$, (3) leads to

$$K(\hat f(Z), \hat\phi) \stackrel{d}{\approx} \mathbb{E}[Y\mid Z] + \epsilon. \tag{7}$$

Now, recall from Section 2.1 that the parameter targeted by Wang et al. (2020) is $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)]$. We observe that

$$\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}\big[X\,\mathbb{E}[K(\hat f(Z), \hat\phi)\mid Z]\big] \tag{8}$$
$$\approx \mathbb{E}[XX^\top]^{-1}\mathbb{E}\big[X\,\mathbb{E}[Y\mid Z]\big] \tag{9}$$
$$= \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]. \tag{10}$$

Here, (8) and (10) follow from iterated expectations since $Z = (X, Z^{(2)})$, and (9) follows from (7). With a large enough labeled dataset, this approximation will hold almost exactly.

The analytical method of Wang et al. (2020) also targets the correct parameter in this setting, since the naive method targets the correct parameter and

$$\frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))} = \frac{\mathrm{Cov}(Y, \mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = \frac{\mathrm{Var}(\mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = 1.$$

Therefore, we have seen that under the very extreme assumption that $\hat f(z) = \mathbb{E}[Y\mid Z = z]$, the methods of Wang et al. (2020) will target (nearly) the correct parameter. However, in general, this assumption is not reasonable. And in fact, our goal is valid inference on the parameter $\beta^*$ regardless of (the quality of) $\hat f(\cdot)$.

3. An empirical investigation of the distribution of the test statistic

We consider a simple simulation setting, inspired by the "Simulated data: continuous case" section of Wang et al. (2020). They generate three datasets: a training dataset consisting of realizations of $(Z, X, Y)$ used to train a machine learning model $\hat f(\cdot)$, a labeled dataset consisting of realizations of $(Z, X, Y)$, and an unlabeled dataset consisting only of realizations of $(Z, X)$; both the labeled and unlabeled datasets are used for inference³. They consider predictors $Z \in \mathbb{R}^4$ and response $Y \in \mathbb{R}$, and define the covariate $X \equiv Z_1$. In Wang et al. (2020)'s paper, the training, labeled, and unlabeled datasets each consist of 300 observations. Throughout this section, we keep the training sample size fixed at 300 observations, but vary the size of the labeled and unlabeled datasets.

³ Wang et al. (2020) refer to the labeled dataset as test data, and to the unlabeled data as validation data.

As in Wang et al. (2020), we generate the training, labeled, and unlabeled datasets from the same partially linear additive model $Y = \beta_0 + \beta_1 Z_1 + \sum_{j=2}^4 \beta_j g_j(Z_j) + \epsilon$. Their goal is to conduct inference on the marginal association between $X = Z_1$ and $Y$ in a linear model. That is, their parameter of interest is

$$\beta^*_1 = \arg\min_{\beta_1} \min_{\beta_0} \mathbb{E}\left[(Y - \beta_0 - \beta_1 Z_1)^2\right]. \tag{11}$$

Because the features are independent, we have that

$$\beta^*_1 = \arg\min_{\beta_1} \min_{\beta_0, \beta_2, \beta_3, \beta_4} \mathbb{E}\Big[\Big(Y - \beta_0 - \beta_1 Z_1 - \sum_{j=2}^4 \beta_j g_j(Z_j)\Big)^2\Big] = \beta_1. \tag{12}$$

Thus, $\beta^*_1$ is the marginal regression coefficient of $Y$ onto $X = Z_1$, as well as the coefficient associated with $X = Z_1$ in the partially linear additive model used to generate the data.

We consider two settings: one under the null ($\beta^*_1 = 0$) and one under the alternative ($\beta^*_1 = 1$). We generate 3 training sets and fit a GAM to each training set, to obtain three fitted models $\hat f_1, \hat f_2, \hat f_3$. In each replicate of the simulation study, we generate a new labeled and unlabeled dataset as described above. Note that this differs from the simulation in Wang et al. (2020), which generates a new training set (and thus a different $\hat f$) in each replicate of the simulation study. We do this to focus on the properties of estimation and inference under a fixed $\hat f$, which reflects how a model such as AlphaFold would be used in practice. We perform a total of 1,000 simulation replicates.
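The specific functions $g_j$, coefficients, and error distribution come from Wang et al. (2020)'s code and are not restated in this paper, so the sketch below uses placeholder choices for them (the sine, quadratic, and tanh terms and the standard normal noise are our own assumptions). It is meant only to show the structure of the data-generating process and of the three datasets.

```python
import numpy as np

def simulate_dataset(n, beta1, rng):
    """Partially linear additive model with four independent predictors and X = Z_1.

    The nonlinear terms g_j and the noise scale are illustrative placeholders,
    not the exact choices used by Wang et al. (2020).
    """
    z = rng.normal(size=(n, 4))
    g2 = np.sin(z[:, 1])
    g3 = z[:, 2] ** 2 - 1.0
    g4 = np.tanh(z[:, 3])
    y = 1.0 + beta1 * z[:, 0] + g2 + g3 + g4 + rng.normal(size=n)
    x = z[:, 0]  # the covariate of interest is the first predictor
    return y, z, x

rng = np.random.default_rng(2023)
beta1_true = 0.0                                                # null setting; use 1.0 for the alternative
y_train, z_train, _ = simulate_dataset(300, beta1_true, rng)    # used only to train f_hat
y_lab, z_lab, x_lab = simulate_dataset(1000, beta1_true, rng)   # labeled data for inference
_, z_unlab, x_unlab = simulate_dataset(10000, beta1_true, rng)  # unlabeled data (Y discarded)
```

Because the predictors are independent and the nonlinear terms are centered, the marginal coefficient of $Y$ on $Z_1$ in this sketch equals `beta1_true`, mirroring (12).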
To conduct prediction-based inference on $\beta^*_1$, both Wang et al. (2020) and Angelopoulos et al. (2023) rely on the claim that $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1) \to_d N(0, 1)$. We consider the following versions of Wang et al. (2020):

1. Proposal of Wang et al. (2020), with an analytical correction. Apply Box 3 with the analytical correction (2), using a linear model for the regression model and a linear model for the relationship model.
2. Proposal of Wang et al. (2020), with a parametric bootstrap correction. Apply Box 3 with the parametric bootstrap correction presented in Algorithm 1, using a linear model for the regression model and a GAM for the relationship model.
3. Proposal of Wang et al. (2020), with a non-parametric bootstrap correction. Apply Box 3 with the non-parametric bootstrap correction presented in Algorithm 1, using a linear model for the regression model and a GAM for the relationship model.

We additionally consider the proposal of Angelopoulos et al. (2023):

4. Proposal of Angelopoulos et al. (2023). Apply Box 4 using a linear model.

Finally, we consider the following two approaches:

5. Classical approach using only the labeled data. Fit a linear model to $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$.
6. Naive approach. Apply Box 1 using a linear model.

In Figure 1 we show the empirical distribution of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ under $H_0: \beta^*_1 = 0$, for increasing sample sizes. In the first three panels, we can see that the asymptotic distribution of this test statistic for an arbitrary $\hat f$ does not follow a $N(0, 1)$ for the naive and Wang et al. (2020) methods. This is in line with our findings in Section 2. This is also true under the alternative ($\beta^*_1 = 1$), as shown in Figure 2. On the other hand, for the method of Angelopoulos et al. (2023), this statistic converges in distribution to a $N(0, 1)$ regardless of the choice of $\hat f$.

Figure 1: An examination of the distribution of $\hat\beta_1/\widehat{\mathrm{SE}}(\hat\beta_1)$ under $H_0: \beta^*_1 = 0$. For each of four different prediction models $\hat f(\cdot)$ (three trained GAMs and one true regression function), we display the empirical distribution of $\hat\beta_1/\widehat{\mathrm{SE}}(\hat\beta_1)$ as the sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The $N(0, 1)$ distribution is shown in black. The dashed black lines show the 0.025 and 0.975 quantiles of this distribution. The distributions of Wang et al. (2020)'s test statistics increasingly diverge from the $N(0, 1)$ distribution as the sample sizes increase. The methods and simulation setup are described in Section 3.

Figure 2: An examination of the distribution of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ when $\beta^*_1 = 1$. For each of four different prediction models $\hat f(\cdot)$ (three trained GAMs and one true regression function), we display the empirical distribution of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ as the sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The $N(0, 1)$ distribution is shown in black. The dashed black lines show the 0.025 and 0.975 quantiles of this distribution. The distributions of Wang et al. (2020)'s test statistics increasingly diverge from the $N(0, 1)$ distribution as the sample sizes increase. The methods and simulation setup are described in Section 3.

In the last panel of Figures 1 and 2, we consider the extreme setting described in Section 2.2, in which $\hat f(z) = \mathbb{E}[Y\mid Z = z]$. Here, the empirical distribution of the test statistic is approximately $N(0, 1)$ for all methods.

We next performed testing and constructed confidence intervals under the assumption made by Wang et al. (2020) and Angelopoulos et al. (2023) that the test statistic asymptotically follows $N(0, 1)$. We examine how the violation of this distributional assumption impacts type 1 error control and coverage in Figures 5 and 6 in the Appendix. We see that the naive and Wang et al. (2020) methods do not control the type 1 error rate or attain nominal coverage for an arbitrary $\hat f$, and they become increasingly anti-conservative as the sample sizes increase. This can be explained by the increasing discrepancy between the assumed and true distributions of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ as the sample sizes increase, as previously seen in Figures 1 and 2. In fact, we can directly read off the type 1 error rate at level 0.05 as the proportion of points falling outside the dashed lines in Figure 1. Similarly, coverage can be read off as the proportion of points inside the dashed lines in Figure 2. Because the true distribution of the test statistic matches the assumed one for the method of Angelopoulos et al. (2023) for any $\hat f$, this method gives correct type 1 error control and coverage in general. For the same reason, under the extreme setting, all methods have correct type 1 error control and coverage.
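Given replicate-level estimates $\hat\beta_1$ and standard errors from any one of the six approaches above, the comparison to the $N(0,1)$ reference and the empirical type 1 error rate (or coverage) can be computed in a few lines. This is a sketch; the function and array names are placeholders for whatever a simulation loop records, and it is not tied to any particular method's implementation.

```python
import numpy as np
from scipy import stats

def calibration_summary(beta_hats, ses, beta_star, level=0.05):
    """Compare the test statistics to N(0, 1): rejection rate, coverage, and tail quantiles."""
    t = (np.asarray(beta_hats) - beta_star) / np.asarray(ses)
    z = stats.norm.ppf(1 - level / 2)                        # 1.96 for level = 0.05
    return {
        "rejection_rate": float(np.mean(np.abs(t) > z)),     # type 1 error when beta_star is the truth
        "coverage": float(np.mean(np.abs(t) <= z)),          # coverage of the corresponding CIs
        "empirical_quantiles": np.quantile(t, [0.025, 0.5, 0.975]),
    }
```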
4. A direct replication of the simulation study of Wang et al. (2020)

In the previous section, we considered a simulation setting that was very similar to that in Wang et al. (2020), but differed in that we considered the same prediction models $\hat f$ across all simulation replicates. In this section, we instead directly replicate their simulation setting ("Simulated data: continuous case"), by generating a new training dataset in each simulation replicate (resulting in a different $\hat f$ in each replicate). They considered a sample size of 300 for the training, labeled, and unlabeled datasets. In addition to replicating these results, we explore increasing the labeled and unlabeled dataset sizes.

Figure 3: For data generated under $H_0: \beta^*_1 = 0$, quantile-quantile plots of the p-values across simulation replicates are displayed. The methods are described in Section 3 and the simulation setup is described in Section 4. Each panel corresponds to different sample sizes of the labeled and unlabeled datasets used for inference (from $n_{\rm lab} = n_{\rm unlab} = 300$ up to $n_{\rm lab} = 1600$, $n_{\rm unlab} = 16000$). The bootstrap and analytical corrections considered by Wang et al. (2020) become increasingly anticonservative as the sample sizes increase. The classical approach, and that of Angelopoulos et al. (2023), are well-calibrated.

We first examine the type 1 error rate under the null hypothesis $H_0: \beta^*_1 = 0$ using each of the approaches described in Section 3. Quantile-quantile plots of the resulting 1,000 p-values are shown in Figure 3. We see that, in agreement with the findings in Wang et al. (2020), the naive approach does not control the type 1 error rate regardless of sample size. While it appears that when $n_{\rm lab} = n_{\rm unlab} = 300$ (first panel of Figure 3) the methods of Wang et al. (2020) have uniform p-values under the null (as also reported in Wang et al. (2020)), with other sample sizes this no longer holds and the methods of Wang et al. (2020) fail to control the type 1 error rate. As expected based on the previous section, Angelopoulos et al. (2023) controls type 1 error. This finding is in agreement with the theoretical results in Angelopoulos et al. (2023), which hold for an arbitrary prediction function $\hat f(\cdot)$. Also as expected, the classical method controls type 1 error.
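The quantile-quantile comparison underlying Figure 3 can be reproduced from replicate-level test statistics with a few lines; the sketch below assumes two-sided p-values computed against the $N(0,1)$ reference and uses matplotlib only for the plot.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def pvalue_qq(t_stats):
    """Two-sided p-values under the N(0, 1) reference, plotted against Uniform(0, 1) quantiles."""
    p = 2 * stats.norm.sf(np.abs(np.asarray(t_stats)))   # p-values under the null reference
    p_sorted = np.sort(p)
    u = (np.arange(1, len(p) + 1) - 0.5) / len(p)        # Uniform(0, 1) plotting positions
    plt.plot(u, p_sorted, marker=".", linestyle="none")
    plt.plot([0, 1], [0, 1], color="black")              # well-calibrated p-values lie on this line
    plt.xlabel("Uniform(0, 1) quantiles")
    plt.ylabel("empirical p-value quantiles")
    plt.show()
```

Points falling below the diagonal near the origin correspond to anticonservative p-values, i.e. a type 1 error rate above the nominal level.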
Figure 4: For data generated with $\beta^*_1 = 1$, empirical coverage of 95% confidence intervals for each method across the simulation replicates, as the labeled and unlabeled sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The methods are described in Section 3 and the simulation setup is described in Section 4.

Next we examine coverage of 95% confidence intervals under $\beta^*_1 = 1$. We see from Figure 4 that the naive approach has coverage well below the nominal level. Again, the corrected proposals of Wang et al. (2020) also fail to achieve the nominal coverage. The problem becomes increasingly pronounced as the sample sizes of the data used for inference increase. By contrast, the classical method and the proposal of Angelopoulos et al. (2023) do achieve the nominal coverage, supporting the theory of Angelopoulos et al. (2023). As explained in Section 2, the lack of inferential guarantees for the naive approach and the approach of Wang et al. (2020) is a direct consequence of the fact that $\hat\beta_1$ is not consistent for $\beta^*_1$.

5. Discussion

In this paper, we found that the method of Angelopoulos et al. (2023) provides valid type 1 error control and coverage. By contrast, the methods proposed by Wang et al. (2020) do not provide appropriate inferential guarantees in the absence of very strong assumptions: for instance, they are valid under the extreme (and unrealistic) scenario where the prediction model is the true regression function. Under this additional assumption, the naive approach also provides valid inference. Furthermore, we see in our simulation study that simply assuming that the prediction model $\hat f(\cdot)$ was trained on data from the same population as the labeled and unlabeled data (an assumption made by Wang et al. (2020)) is not sufficient to achieve valid inference using their proposed methods.

Throughout, for simplicity we have assumed that we are conducting inference (i.e. Step 2 of Box 1) using a linear model. However, our conclusion that the naive method and the methods of Wang et al. (2020) result in invalid inference, because they target the incorrect parameter, applies much more generally. The theory in Angelopoulos et al. (2023) shows that their approach applies to a wide variety of settings beyond linear regression, and has valid inferential properties.

Code Availability

Scripts to reproduce the results in this manuscript are available at https://github.com/keshav-motwani/PredictionBasedInference/. Our code is based on the code from Wang et al. (2020); we thank the authors for making it publicly accessible.

Acknowledgements

We thank Jeff Leek and Tyler McCormick for helpful conversations. The authors gratefully acknowledge funding from NIH R01 EB026908, NIH R01 GM123993, NIH R01 DA047869, ONR N00014-23-1-2589, a Simons Investigator Award in Mathematical Modeling of Living Systems, and the NSF Graduate Research Fellowship Program.

References

Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669-674, 2023. doi: 10.1126/science.adi6000. URL https://www.science.org/doi/abs/10.1126/science.adi6000.

Samuel J. Clark, Tyler McCormick, Zehang Li, and Jon Wakefield. InSilicoVA: A method to automate cause of death assignment for verbal autopsy. arXiv preprint arXiv:1504.02129, 2015.

Shannon E. Ellis, Leonardo Collado-Torres, Andrew Jaffe, and Jeffrey T. Leek. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Research, 46(9):e54-e54, 2018.
Eric R. Gamazon, Heather E. Wheeler, Kaanan P. Shah, Sahar V. Mozaffari, Keston Aquino-Michaels, Robert J. Carroll, Anne E. Eyler, Joshua C. Denny, GTEx Consortium, Dan L. Nicolae, et al. A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47(9):1091-1098, 2015.

Alexander Gusev, Kate Lawrenson, Xianzhi Lin, Paulo C. Lyra Jr., Siddhartha Kar, Kevin C. Vavra, Felipe Segato, Marcos A. S. Fonseca, Janet M. Lee, Tanya Pejovic, et al. A transcriptome-wide association study of high-grade serous epithelial ovarian cancer identifies new susceptibility genes and splice variants. Nature Genetics, 51(5):815-823, 2019.

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583-589, 2021.

S. A. Khoury, D. Massad, and T. Fardous. Mortality and causes of death in Jordan 1995-96: assessment by verbal autopsy. Bulletin of the World Health Organization, 77(8):641, 1999.

Siruo Wang, Tyler H. McCormick, and Jeffrey T. Leek. Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48):30266-30275, 2020.

Appendix A. Necessity of consistency for $\beta^*$

We formally state the necessity of targeting the correct parameter in order to use the test statistic $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j)$ for inference.

Lemma 1. Suppose $\widehat{\mathrm{SE}}(\hat\beta_j) = o_p(1)$ and $\hat\beta_j$ does not converge in probability to $\beta^*_j$. Then $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j)$ does not converge in distribution.

Proof. Suppose, for the sake of contradiction, that $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j)$ converges in distribution. Then $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j) = O_p(1)$. Thus

$$\hat\beta_j - \beta^*_j = \frac{\hat\beta_j - \beta^*_j}{\widehat{\mathrm{SE}}(\hat\beta_j)} \cdot \widehat{\mathrm{SE}}(\hat\beta_j) = O_p(1)\, o_p(1) = o_p(1),$$

so $\hat\beta_j \to_p \beta^*_j$, which is a contradiction.

Appendix B. Lack of consistency of the analytical method of Wang et al. (2020)

In Sections 2.1 and 2.2, we analyzed the analytical correction as implemented in the code by Wang et al. (2020). We now analyze the analytical correction as described in their publication, which is also not consistent for the parameter of interest in general. In the publication, the estimate obtained from the analytical correction is defined as

$$\hat\beta^{\rm analytical}_{\rm Wang} = \hat\gamma_0 (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \mathbf{1}_{n_{\rm unlab}} + \hat\gamma_1 \hat\beta_{\rm naive},$$

where

$$\hat\gamma_1 = \frac{\widehat{\mathrm{Cov}}(\mathbf{y}_{\rm lab}, \hat f(\mathbf{z}_{\rm lab}))}{\widehat{\mathrm{Var}}(\hat f(\mathbf{z}_{\rm lab}))} \quad \text{and} \quad \hat\gamma_0 = \hat{\mathbb{E}}[\mathbf{y}_{\rm lab}] - \hat\gamma_1 \hat{\mathbb{E}}[\hat f(\mathbf{z}_{\rm lab})].$$

Thus, as $n_{\rm lab}$ increases,

$$\hat\gamma_1 \to_p \frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))} \equiv \gamma_1 \quad \text{and} \quad \hat\gamma_0 \to_p \mathbb{E}[Y] - \gamma_1 \mathbb{E}[\hat f(Z)] \equiv \gamma_0.$$

Thus, as $n_{\rm lab}$ and $n_{\rm unlab}$ increase,

$$\hat\beta^{\rm analytical}_{\rm Wang} \to_p \gamma_0\, \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X] + \gamma_1\, \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)],$$

which is not equal to $\beta^*$ in general.

In the extreme setting described in Section 2.2, in which $\hat f(z) = \mathbb{E}[Y\mid Z = z]$, we have that

$$\gamma_1 = \frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))} = \frac{\mathrm{Cov}(Y, \mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = \frac{\mathrm{Var}(\mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = 1$$

and

$$\gamma_0 = \mathbb{E}[Y] - \gamma_1 \mathbb{E}[\hat f(Z)] = \mathbb{E}[Y] - \mathbb{E}[\mathbb{E}[Y\mid Z]] = 0.$$

Thus, using the same argument as for the naive estimator, this analytical correction is consistent for $\beta^*$ in this extreme setting.

Appendix C. Inferential consequences of the wrong distribution

In this section, we examine how violation of the distributional assumption impacts type 1 error control and coverage in the simulations in Section 3. In Figure 5, we see that the methods proposed by Wang et al. (2020) do not control the type 1 error rate for an arbitrary $\hat f(\cdot)$; the situation gets worse as the sample size increases.
However, the method proposed by Angelopoulos et al. (2023) does control the type 1 error rate. Wang et al. (2020) controls the type 1 error rate if the machine learning model is the true regression function, $\hat f(z) = \mathbb{E}[Y\mid Z = z]$; of course, such a perfect machine learning model is unattainable in practice. In Figure 6, we see that the methods proposed by Wang et al. (2020) do not attain the nominal coverage for an arbitrary $\hat f(\cdot)$, whereas the proposal of Angelopoulos et al. (2023) does attain the nominal coverage. The naive method and the proposals of Wang et al. (2020) have appropriate coverage when $\hat f(z) = \mathbb{E}[Y\mid Z = z]$; again, this is unrealistic in practice.

Figure 5: For labeled and unlabeled datasets generated under $H_0: \beta^*_1 = 0$, quantile-quantile plots of the p-values across replicates of the modified simulation study are displayed for each of the four $\hat f(\cdot)$'s considered ($\hat f_1$, $\hat f_2$, $\hat f_3$, and $\hat f(z) = \mathbb{E}[Y\mid Z = z]$). The methods and simulation setup are described in Section 3. Each panel corresponds to different sample sizes of the labeled and unlabeled datasets used for inference (from $n_{\rm lab} = 100$, $n_{\rm unlab} = 1000$ up to $n_{\rm lab} = 1600$, $n_{\rm unlab} = 16000$). The naive method and the bootstrap and analytical corrections considered by Wang et al. (2020) become increasingly anticonservative as the sample sizes increase, unless the machine learning model is perfect, i.e. $\hat f(z) = \mathbb{E}[Y\mid Z = z]$.

Figure 6: For labeled and unlabeled datasets generated with $\beta^*_1 = 1$, empirical coverage of 95% confidence intervals for each method across the simulation replicates, for each of the four $\hat f(\cdot)$'s considered, as the labeled and unlabeled sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The methods and simulation setup are described in Section 3.