# Revisiting inference after prediction

Journal of Machine Learning Research 24 (2023) 1-18. Submitted 7/23; Revised 11/23; Published 11/23.

Keshav Motwani (kmotwani@uw.edu), Department of Biostatistics, University of Washington, Seattle, WA

Daniela Witten (dwitten@uw.edu), Departments of Biostatistics and Statistics, University of Washington, Seattle, WA

Editor: Mladen Kolar

Abstract

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest.

©2023 Keshav Motwani and Daniela Witten. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/23-0896.html.

1. Introduction

Rapid recent progress in the field of machine learning has enabled the development of complex and high-quality machine learning models to predict a response variable of interest. This is particularly attractive in settings where future measurement of this response variable is prohibitively expensive or impossible. For example, instead of performing expensive experiments to determine a protein's structure, it is possible to obtain high-quality structural predictions using AlphaFold (Jumper et al., 2021). Similarly, in developing countries, determining the true cause of death may be impossible; instead, one might predict the cause of death on the basis of a verbal autopsy (Clark et al., 2015; Khoury et al., 1999). In the context of gene expression data, it is infeasible to conduct experiments in every possible tissue type, and so instead a machine learning model can be applied to predict gene expression in a tissue type of interest (Ellis et al., 2018; Gamazon et al., 2015; Gusev et al., 2019).

In this paper, we will use the notation $\hat{f}: \mathcal{Z} \to \mathcal{Y}$ to denote a pre-trained machine learning model that maps from $\mathcal{Z}$, the space of the predictors $Z$, to $\mathcal{Y}$, the space of the response variable $Y$. We assume that the operating characteristics of $\hat{f}(\cdot)$ in the study population of interest are unknown to the end-user, and that the data used to fit $\hat{f}(\cdot)$ are unavailable. Thus, in what follows, we will treat $\hat{f}(\cdot)$ as a black-box function.
In important recent papers, Wang et al. (2020) and Angelopoulos et al. (2023) consider the practice of prediction-based inference¹: that is, of quantifying the association between the response $Y$ and some covariates $X$ using realizations not of $(Y, X)$, but rather of $(\hat{f}(Z), X)$. Such an approach is attractive in cases where the association between $Y$ and $X$ is of interest, and both $\hat{f}(\cdot)$ and a large sample of realizations of predictors $Z$ and covariates $X$ are available, but realizations of $Y$ are expensive or otherwise unavailable. Throughout, we will use $Z$ to denote the predictors in the machine learning model $\hat{f}(\cdot)$, and $X$ to denote the covariates whose association with $Y$ is of interest. In many settings, the covariate variables $X$ may be a subset of the predictor variables $Z$, or may be identical to $Z$, but this is not necessarily the case.

¹ Wang et al. (2020) and Angelopoulos et al. (2023) refer to this practice as post-prediction inference and prediction-powered inference, respectively; to unify terminology, here we refer to it as prediction-based inference.

Wang et al. (2020) point out that a naive approach to prediction-based inference, which simplistically interprets the association between $(\hat{f}(Z), X)$ as the association between $(Y, X)$, is problematic from a statistical perspective. For instance, regressing $\hat{f}(Z)$ onto $X$ using least squares does not lead to valid inference on the association between $Y$ and $X$: e.g. it leads to hypothesis tests that fail to control type 1 error, and confidence intervals that do not achieve the nominal coverage. See Box 1.

Box 1: Naive approach to prediction-based inference. We are given a (pre-trained) prediction function $\hat{f}(\cdot): \mathcal{Z} \to \mathcal{Y}$, and an unlabeled dataset $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$ of realizations of $(Z, X)$. As pointed out by Wang et al. (2020), the naive approach displayed here does not allow for valid inference on the association between $Y$ and $X$.
Step 1: Compute $\hat{\mathbf{y}}_{\rm unlab} = \hat{f}(\mathbf{z}_{\rm unlab})$.
Step 2: Conduct inference on the association between $Y$ and $X$, using $\hat{\mathbf{y}}_{\rm unlab}$ and $\mathbf{x}_{\rm unlab}$ as data, without accounting for the fact that $(\hat{\mathbf{y}}_{\rm unlab}, \mathbf{x}_{\rm unlab})$ is not a sample from the distribution of $(Y, X)$.

Wang et al. (2020) and Angelopoulos et al. (2023) propose creative solutions to overcome this issue. They assume that in addition to a large unlabeled dataset $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$, they also have access to a (relatively small) labeled dataset $(\mathbf{y}_{\rm lab}, \mathbf{z}_{\rm lab}, \mathbf{x}_{\rm lab})$. We focus on the case where the labeled and unlabeled data are independent and identically distributed samples from the same study population of interest, though Angelopoulos et al. (2023) also extend beyond this setting. See Box 2. We emphasize that the goal is to quantify the association between $Y$ and $X$.

Box 2: Setting of prediction-based inference.
Given: A pre-trained machine learning model $\hat{f}: \mathcal{Z} \to \mathcal{Y}$.
Goal: To quantify the association between a response $Y \in \mathcal{Y}$ and covariates $X \in \mathcal{X}$.
Data: A (relatively small) labeled dataset $(\mathbf{y}_{\rm lab}, \mathbf{z}_{\rm lab}, \mathbf{x}_{\rm lab})$ consisting of i.i.d. realizations of $(Y, Z, X)$, and a (large) unlabeled dataset $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$ consisting of i.i.d. realizations of $(Z, X)$. Both are drawn from the same study population.

One simple (and valid) option is to quantify the association between $Y$ and $X$ using only the labeled data $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$. However, this approach entirely discards the vast amount of unlabeled data $(\mathbf{z}_{\rm unlab}, \mathbf{x}_{\rm unlab})$. Intuitively, if the prediction function $\hat{f}(\cdot)$ is nearly perfect on our population of interest, then using $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ in addition to $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$ will aid our efforts to quantify the association between $Y$ and $X$. By contrast, if $\hat{f}(Z)$ is a very poor prediction of $Y$ on our population of interest, then using $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ in addition to $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$ may hinder our efforts. Our goal is valid quantification of the association between $Y$ and $X$, regardless of the quality of $\hat{f}(\cdot)$ on the population of interest.
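To make the setting and the naive approach of Box 1 concrete, the following is a minimal Python sketch. The data-generating model, the deliberately crude pre-trained model, and all numbers are our own illustrative assumptions, not the setup of either paper; the point is only to show the classical labeled-data estimate next to the naive estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two predictors; the covariate of interest X is the first predictor, Z_1.
    z = rng.normal(size=(n, 2))
    y = 1.0 + 0.5 * z[:, 0] + np.sin(z[:, 1]) + rng.normal(size=n)
    x = np.column_stack([np.ones(n), z[:, 0]])  # design matrix: intercept and X
    return y, z, x

def ols(y, x):
    # Least-squares coefficients with classical (model-based) standard errors.
    beta = np.linalg.solve(x.T @ x, x.T @ y)
    resid = y - x @ beta
    sigma2 = resid @ resid / (len(y) - x.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(x.T @ x)))
    return beta, se

# A pre-trained, black-box prediction model f_hat. Here it is a deliberately crude
# stand-in (a linear fit that ignores Z_1), since in the setting of Box 2 its
# quality on the study population is unknown to the analyst.
y_tr, z_tr, _ = make_data(300)
coef_tr = np.polyfit(z_tr[:, 1], y_tr, 1)
f_hat = lambda z: np.polyval(coef_tr, z[:, 1])

y_lab, _, x_lab = make_data(300)            # small labeled sample
_, z_unlab, x_unlab = make_data(10_000)     # large unlabeled sample; Y is never observed

beta_classical, se_classical = ols(y_lab, x_lab)        # valid: labeled data only
beta_naive, se_naive = ols(f_hat(z_unlab), x_unlab)     # Box 1: treats f_hat(Z) as if it were Y
print("classical:", beta_classical[1], "+/-", se_classical[1])
print("naive:    ", beta_naive[1], "+/-", se_naive[1])  # biased toward 0 here, with a tiny SE
```

In this toy example the true slope is 0.5; the naive estimate is pulled toward zero and is reported with an overconfident standard error, which is the kind of failure the rest of the paper analyzes.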
To achieve this goal, Wang et al. (2020) propose to (Step 1) model the association between $Y$ and $\hat{f}(Z)$ using the labeled dataset $(\mathbf{y}_{\rm lab}, \hat{f}(\mathbf{z}_{\rm lab}))$, and then (Step 2) incorporate the model from Step 1 to conduct inference on the association between $Y$ and $X$ using the unlabeled data $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$. See Box 3.

Box 3: Wang et al. (2020)'s proposal to correct prediction-based inference.
Step 1: Model the association between $Y$ and $\hat{f}(Z)$ using the labeled data $(\mathbf{y}_{\rm lab}, \hat{f}(\mathbf{z}_{\rm lab}))$.
Step 2: Incorporate the model from Step 1 to conduct inference on the association between $Y$ and $X$ using the unlabeled data $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$. To do this, bootstrap and analytical approaches are proposed.

By contrast, Angelopoulos et al. (2023) propose to de-bias the estimates obtained using the unlabeled data using information from the labeled data. In the case of estimands that are linear in $Y$, they (Step 1) compute the difference between the estimate of the parameter of interest obtained using $(\hat{f}(\mathbf{z}_{\rm lab}), \mathbf{x}_{\rm lab})$ and the estimate obtained using $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$. They then (Step 2) correct the estimate obtained using the unlabeled dataset $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ by this amount. See Box 4. Angelopoulos et al. (2023) also propose a more general framework for estimands that minimize the expectation of a general loss function, though we focus on the linear case in this paper for simplicity.

Box 4: Angelopoulos et al. (2023)'s proposal to correct prediction-based inference (in the special case of estimands that are linear in $Y$).
Step 1: Compute the difference between the estimate of the parameter of interest obtained using $(\hat{f}(\mathbf{z}_{\rm lab}), \mathbf{x}_{\rm lab})$ and the estimate obtained using $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$.
Step 2: Correct the parameter estimate obtained using the unlabeled dataset $(\hat{f}(\mathbf{z}_{\rm unlab}), \mathbf{x}_{\rm unlab})$ by the difference computed in Step 1.

In this paper, we investigate these two proposals. In Section 2, we ask a fundamental question: what parameter is each proposal targeting? We see that the proposal of Angelopoulos et al. (2023) targets the parameter of interest, whereas that of Wang et al. (2020) does not. In Sections 3 and 4, we investigate the empirical consequences of our findings from Section 2. These empirical investigations paint a clear picture: namely, that failure to target the correct parameter has substantial statistical consequences for the proposal of Wang et al. (2020), in the form of hypothesis tests that fail to control the type 1 error, and confidence intervals that fail to attain the nominal coverage. The proposal of Angelopoulos et al. (2023) does not suffer these consequences, as it targets the correct parameter. We close with a discussion in Section 5.

In this paper, we use capitals to represent a random variable and lower case to represent its realization. Vectors of length equal to the number of observations, or matrices whose rows correspond to the observations, are in bold.

2. What parameter is each method targeting?

For concreteness, suppose that we would have fit a linear regression model on realizations of $(Y, X)$ using least squares, had a large number of realizations of $(Y, X)$ been available. Therefore, our goal is to conduct inference on the population parameter

$$\beta^* = \arg\min_{\beta} \mathbb{E}\left[(Y - \beta^\top X)^2\right] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]. \tag{1}$$
The naive method (Box 1) and the proposals of Wang et al. (2020) and Angelopoulos et al. (2023) use a test statistic of the form

$$T_j = \frac{\hat\beta_j - \beta^*_j}{\widehat{\mathrm{SE}}(\hat\beta_j)}$$

for testing or constructing confidence intervals for $\beta^*_j$. They rely on it having a known distribution, or converging in distribution to a known distribution with increasing sample size. However, if $\hat\beta_j$ does not converge in probability to $\beta^*_j$ and $\widehat{\mathrm{SE}}(\hat\beta_j)$ goes to 0 as $n_{\rm lab}$ and $n_{\rm unlab}$ increase, then $T_j$ does not converge in distribution (see Appendix A for a formal statement of this). Therefore, consistency of $\hat\beta_j$ for $\beta^*_j$ is a necessary condition for valid inference using this approach. We now investigate whether this is the case.

2.1 The general case for an arbitrary prediction model $\hat f$

We first consider the naive approach, as defined in Box 1. Fitting a linear model with least squares would result in $\hat\beta_{\rm naive} = (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \hat f(\mathbf{z}_{\rm unlab})$. We can see that $\hat\beta_{\rm naive} \to_p \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)]$ as $n_{\rm unlab}$ increases, which is not equal to $\beta^* = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]$ in general. In fact, viewing $\hat f(\cdot)$ as a black-box function with unknown operating characteristics in the study population, we see that the quantity $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)]$ does not even involve the response, $Y$; therefore, the parameter targeted by the naive method is not of any scientific interest.

Now, we consider the bootstrap variant of the proposal of Wang et al. (2020), which is introduced in Box 3. A detailed description of this proposal is presented in Algorithm 1. We take Step 2.3 to involve a least squares regression.

Algorithm 1: Bootstrap correction of Wang et al. (2020). The goal is to conduct inference on the association between $Y$ and $X$.
1. Use $(\mathbf{y}_{\rm lab}, \hat f(\mathbf{z}_{\rm lab}))$ to fit the relationship model $Y \mid \hat f(Z) \sim K(\hat f(Z), \phi)$, yielding $\hat\phi$.
2. For $b = 1, \ldots, B$:
   2.1. Sample unlabeled observations with replacement to obtain $\mathbf{z}^b_{\rm unlab}$ and $\mathbf{x}^b_{\rm unlab}$.
   2.2. Sample outcomes $\tilde{\mathbf{y}}^b \mid \hat f(\mathbf{z}^b_{\rm unlab})$ from the relationship model $K(\hat f(\mathbf{z}^b_{\rm unlab}), \hat\phi)$.
   2.3. Use $(\tilde{\mathbf{y}}^b, \mathbf{x}^b_{\rm unlab})$ to fit a regression model for the relationship between $Y$ and $X$, and record the coefficient estimate $\hat\beta^b$ and model-based standard error $\hat s^b$.
3. Compute the point estimate $\hat\beta = \mathrm{median}\{\hat\beta^1, \ldots, \hat\beta^B\}$.
4. Compute the nonparametric standard error $\widehat{\mathrm{SE}}(\hat\beta) = \mathrm{SD}\{\hat\beta^1, \ldots, \hat\beta^B\}$.
5. Compute the parametric standard error $\widehat{\mathrm{SE}}(\hat\beta) = \mathrm{median}\{\hat s^1, \ldots, \hat s^B\}$.

Note that Step 2.2 of Algorithm 1 involves sampling observations $\tilde{\mathbf{y}}^b$ from $K(\hat f(\mathbf{z}^b_{\rm unlab}), \hat\phi)$ for use in fitting a regression model in Step 2.3, giving us an estimate $\hat\beta^{\mathrm{bootstrap},b}_{\rm Wang} = (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \tilde{\mathbf{y}}^b$. Thus $\hat\beta^{\mathrm{bootstrap},b}_{\rm Wang} \to_p \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)]$ as $n_{\rm unlab}$ increases, where we slightly abuse notation by letting $K(\hat f(Z), \hat\phi)$ denote a random variable with distribution $K(\hat f(Z), \hat\phi)$. In general, $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)]$ is not equal to the parameter of interest $\beta^* = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]$. Note that both the parametric and non-parametric bootstrap corrections suffer from this issue.
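To make the structure of Algorithm 1 concrete, here is a minimal, self-contained sketch of the parametric bootstrap correction. It substitutes a simple linear relationship model for the GAM used by Wang et al. (2020), and the function and argument names (e.g. `fz_lab` for $\hat f(\mathbf{z}_{\rm lab})$) are ours, so it illustrates the shape of the algorithm rather than reproducing the authors' implementation.

```python
import numpy as np

def ols_slope(y, x):
    """Least-squares fit of y on [1, x]; returns the slope estimate and its classical SE."""
    design = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.solve(design.T @ design, design.T @ y)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    return beta[1], se

def wang_parametric_bootstrap(y_lab, fz_lab, x_unlab, fz_unlab, B=200, seed=1):
    """Sketch of Algorithm 1 with a linear relationship model standing in for the GAM."""
    rng = np.random.default_rng(seed)
    # Step 1: fit the relationship model Y | f_hat(Z) on the labeled data.
    d_lab = np.column_stack([np.ones(len(fz_lab)), fz_lab])
    phi = np.linalg.lstsq(d_lab, y_lab, rcond=None)[0]
    resid_sd = np.std(y_lab - d_lab @ phi)
    betas, ses = [], []
    n = len(fz_unlab)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # Step 2.1: resample unlabeled rows
        y_sim = (phi[0] + phi[1] * fz_unlab[idx]
                 + rng.normal(scale=resid_sd, size=n))    # Step 2.2: simulate outcomes
        b, s = ols_slope(y_sim, x_unlab[idx])             # Step 2.3: regress simulated Y on X
        betas.append(b)
        ses.append(s)
    return (np.median(betas),          # Step 3: point estimate
            np.std(betas, ddof=1),     # Step 4: nonparametric standard error
            np.median(ses))            # Step 5: parametric standard error
```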
The analytic variant of the proposal of Wang et al. (2020) adjusts the naive estimate by the coefficient of $\hat f(Z)$ in a regression of $Y$ onto $\hat f(Z)$, with an intercept, using the labeled data²; that is, the estimate takes the form

$$\hat\beta^{\rm analytical}_{\rm Wang} = \frac{\widehat{\mathrm{Cov}}(\mathbf{y}_{\rm lab}, \hat f(\mathbf{z}_{\rm lab}))}{\widehat{\mathrm{Var}}(\hat f(\mathbf{z}_{\rm lab}))}\,\hat\beta_{\rm naive}. \tag{2}$$

Thus, as $n_{\rm lab}$ and $n_{\rm unlab}$ increase,

$$\hat\beta^{\rm analytical}_{\rm Wang} \to_p \frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))}\,\mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)],$$

which again is not equal to $\beta^*$ in general.

² The expression for $\hat\beta^{\rm analytical}_{\rm Wang}$ given in (2) is implemented in Wang et al. (2020)'s code. Their publication involves a slightly different expression for $\hat\beta^{\rm analytical}_{\rm Wang}$, which also does not converge to the parameter of interest $\beta^*$, for a similar reason; we show this in Appendix B.

This highlights a cause for concern about the proposals of Wang et al. (2020): namely, that the wrong parameter is being targeted. This calls into question whether the inference that they propose will achieve the desired statistical guarantees; we investigate this issue further in the next two sections.

Finally, we turn to the proposal of Angelopoulos et al. (2023), which is introduced in Box 4. In the case of linear regression, their estimate takes the form

$$\hat\beta_{\rm Angelopoulos} = (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \hat f(\mathbf{z}_{\rm unlab}) + \left\{ (\mathbf{x}_{\rm lab}^\top \mathbf{x}_{\rm lab})^{-1}\mathbf{x}_{\rm lab}^\top \mathbf{y}_{\rm lab} - (\mathbf{x}_{\rm lab}^\top \mathbf{x}_{\rm lab})^{-1}\mathbf{x}_{\rm lab}^\top \hat f(\mathbf{z}_{\rm lab}) \right\}.$$

As $n_{\rm lab}$ and $n_{\rm unlab}$ increase, we see that

$$\hat\beta_{\rm Angelopoulos} \to_p \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)] + \left\{ \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY] - \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)] \right\} = \beta^*,$$

so the proposal of Angelopoulos et al. (2023) correctly targets the desired quantity.
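For comparison, a sketch of the de-biasing step in Box 4 for the linear case is below. It mirrors the displayed formula for $\hat\beta_{\rm Angelopoulos}$; the confidence-interval construction of Angelopoulos et al. (2023) is more involved, so only the point estimate is shown, and the function and argument names are ours.

```python
import numpy as np

def ols_coef(y, x):
    """Least-squares coefficient vector for the design matrix x."""
    return np.linalg.solve(x.T @ x, x.T @ y)

def angelopoulos_debiased(y_lab, fz_lab, x_lab, fz_unlab, x_unlab):
    """Point estimate from Box 4: unlabeled-data estimate plus a labeled-data correction."""
    est_unlab = ols_coef(fz_unlab, x_unlab)   # regress f_hat(Z) on X, unlabeled data
    est_lab_y = ols_coef(y_lab, x_lab)        # regress Y on X, labeled data
    est_lab_f = ols_coef(fz_lab, x_lab)       # regress f_hat(Z) on X, labeled data
    # Step 1: the correction is the labeled-data estimate using Y minus the
    #         labeled-data estimate using f_hat(Z).
    # Step 2: add the correction to the unlabeled-data estimate.
    return est_unlab + (est_lab_y - est_lab_f)
```

Because all three regressions use least squares with the same covariates, the correction term estimates exactly the bias $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY] - \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)]$ of the unlabeled-data estimate, which is why the combined estimator targets $\beta^*$.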
2.2 An extreme setting where all methods target the correct quantity

We now consider an extreme setting where the prediction model exactly equals the true regression function: that is, $\hat f(z) = \mathbb{E}[Y \mid Z = z]$. We also assume that $X$ is contained within $Z$: that is, $Z = (X, Z^{(2)})$ for some $Z^{(2)}$. This is a reasonable assumption in practice, since the covariate of interest is likely also a predictor in the machine learning model.

In this extreme setting, the naive method targets

$$\mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}\big[X\,\mathbb{E}[Y\mid Z]\big] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY].$$

In other words, it targets the correct parameter of interest.

We will now show that the bootstrap methods of Wang et al. (2020) can similarly target the correct parameter in this extreme setting, for example, if $K(\hat f(Z), \hat\phi)$ is defined by fitting a generalized additive model (GAM) to $(Y, \hat f(Z))$ and adding mean-zero noise, as in Wang et al. (2020). That is,

$$K(\hat f(Z), \hat\phi) \stackrel{d}{=} \hat g(\hat f(Z)) + \epsilon, \tag{3}$$

where $\epsilon$ is mean-zero noise and $\hat g(\cdot)$ is the fitted GAM. The fitted GAM $\hat g(\cdot)$ takes the form

$$\hat g(\cdot) \approx \arg\min_g \mathbb{E}\left[\big(Y - g(\hat f(Z))\big)^2\right] \tag{4}$$
$$= \arg\min_g \mathbb{E}\left[\big(Y - g(\mathbb{E}[Y\mid Z])\big)^2\right] \tag{5}$$
$$= \arg\min_g \mathbb{E}\left[\mathbb{E}\left[\big(Y - g(\mathbb{E}[Y\mid Z])\big)^2 \,\Big|\, Z\right]\right]. \tag{6}$$

The approximation in (4) holds if the labeled sample size is sufficiently large. Equation 5 is a consequence of the extreme assumption that $\hat f(z) = \mathbb{E}[Y\mid Z = z]$. Equation 6 follows from iterated expectations.

It is not hard to see that (6) is minimized when $g(\mathbb{E}[Y\mid Z]) = \mathbb{E}[Y\mid Z]$, i.e., $\hat g(\cdot)$ is approximately the identity function. Combining this with the extreme assumption that $\hat f(z) = \mathbb{E}[Y\mid Z = z]$, (3) leads to

$$K(\hat f(Z), \hat\phi) \stackrel{d}{\approx} \mathbb{E}[Y\mid Z] + \epsilon. \tag{7}$$

Now, recall from Section 2.1 that the parameter targeted by Wang et al. (2020) is $\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)]$. We observe that

$$\mathbb{E}[XX^\top]^{-1}\mathbb{E}[XK(\hat f(Z), \hat\phi)] = \mathbb{E}[XX^\top]^{-1}\mathbb{E}\big[X\,\mathbb{E}[K(\hat f(Z), \hat\phi)\mid Z]\big] \tag{8}$$
$$\approx \mathbb{E}[XX^\top]^{-1}\mathbb{E}\big[X\,\mathbb{E}[Y\mid Z]\big] \tag{9}$$
$$= \mathbb{E}[XX^\top]^{-1}\mathbb{E}[XY]. \tag{10}$$

Here, (8) and (10) follow from iterated expectations since $Z = (X, Z^{(2)})$, and (9) follows from (7). With a large enough labeled dataset, this approximation will hold almost exactly.

The analytical method of Wang et al. (2020) also targets the correct parameter in this setting, since the naive method targets the correct parameter and

$$\frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))} = \frac{\mathrm{Cov}(Y, \mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = \frac{\mathrm{Var}(\mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = 1.$$

Therefore, we have seen that under the very extreme assumption that $\hat f(z) = \mathbb{E}[Y\mid Z = z]$, the methods of Wang et al. (2020) will target (nearly) the correct parameter. However, in general, this assumption is not reasonable. And in fact, our goal is valid inference on the parameter $\beta^*$ regardless of (the quality of) $\hat f(\cdot)$.

3. An empirical investigation of the distribution of the test statistic

We consider a simple simulation setting, inspired by the "Simulated data: continuous case" section of Wang et al. (2020). They generate three datasets: a training dataset consisting of realizations of $(Z, X, Y)$ used to train a machine learning model $\hat f(\cdot)$, a labeled dataset consisting of realizations of $(Z, X, Y)$, and an unlabeled dataset consisting only of realizations of $(Z, X)$; both the labeled and unlabeled datasets are used for inference³. They consider predictors $Z \in \mathbb{R}^4$ and response $Y \in \mathbb{R}$, and define the covariate $X \equiv Z_1$. In Wang et al. (2020)'s paper, the training, labeled, and unlabeled datasets each consist of 300 observations. Throughout this section, we keep the training sample size fixed at 300 observations, but vary the size of the labeled and unlabeled datasets.

³ Wang et al. (2020) refer to the labeled dataset as test data, and to the unlabeled data as validation data.

As in Wang et al. (2020), we generate the training, labeled, and unlabeled datasets from the same partially linear additive model $Y = \beta_0 + \beta_1 Z_1 + \sum_{j=2}^4 \beta_j g_j(Z_j) + \epsilon$. Their goal is to conduct inference on the marginal association between $X = Z_1$ and $Y$ in a linear model. That is, their parameter of interest is

$$\beta^*_1 = \arg\min_{\beta_1} \min_{\beta_0} \mathbb{E}\left[(Y - \beta_0 - \beta_1 Z_1)^2\right]. \tag{11}$$

Because the features are independent, we have that

$$\beta^*_1 = \arg\min_{\beta_1} \min_{\beta_0, \beta_2, \beta_3, \beta_4} \mathbb{E}\Big[\Big(Y - \beta_0 - \beta_1 Z_1 - \sum_{j=2}^4 \beta_j g_j(Z_j)\Big)^2\Big] = \beta_1. \tag{12}$$

Thus, $\beta^*_1$ is the marginal regression coefficient of $Y$ onto $X = Z_1$, as well as the coefficient associated with $X = Z_1$ in the partially linear additive model used to generate the data.

We consider two settings: one under the null ($\beta^*_1 = 0$) and one under the alternative ($\beta^*_1 = 1$). We generate 3 training sets and fit a GAM to each training set, to obtain three fitted models $\hat f_1, \hat f_2, \hat f_3$. In each replicate of the simulation study, we generate a new labeled and unlabeled dataset as described above. Note that this differs from the simulation in Wang et al. (2020), which generates a new training set (and thus a different $\hat f$) in each replicate of the simulation study. We do this to focus on the properties of estimation and inference under a fixed $\hat f$, which reflects how a model such as AlphaFold would be used in practice. We perform a total of 1,000 simulation replicates.
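The specific functions $g_j$, coefficients, and error distribution come from Wang et al. (2020)'s code and are not restated in this paper, so the sketch below uses placeholder choices for them (the sine, quadratic, and tanh terms and the standard normal noise are our own assumptions). It is meant only to show the structure of the data-generating process and of the three datasets.

```python
import numpy as np

def simulate_dataset(n, beta1, rng):
    """Partially linear additive model with four independent predictors and X = Z_1.

    The nonlinear terms g_j and the noise scale are illustrative placeholders,
    not the exact choices used by Wang et al. (2020).
    """
    z = rng.normal(size=(n, 4))
    g2 = np.sin(z[:, 1])
    g3 = z[:, 2] ** 2 - 1.0
    g4 = np.tanh(z[:, 3])
    y = 1.0 + beta1 * z[:, 0] + g2 + g3 + g4 + rng.normal(size=n)
    x = z[:, 0]  # the covariate of interest is the first predictor
    return y, z, x

rng = np.random.default_rng(2023)
beta1_true = 0.0                                                # null setting; use 1.0 for the alternative
y_train, z_train, _ = simulate_dataset(300, beta1_true, rng)    # used only to train f_hat
y_lab, z_lab, x_lab = simulate_dataset(1000, beta1_true, rng)   # labeled data for inference
_, z_unlab, x_unlab = simulate_dataset(10000, beta1_true, rng)  # unlabeled data (Y discarded)
```

Because the predictors are independent and the nonlinear terms are centered, the marginal coefficient of $Y$ on $Z_1$ in this sketch equals `beta1_true`, mirroring (12).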
To conduct prediction-based inference on $\beta^*_1$, both Wang et al. (2020) and Angelopoulos et al. (2023) rely on the claim that $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1) \to_d N(0, 1)$. We consider the following versions of Wang et al. (2020):

1. Proposal of Wang et al. (2020), with an analytical correction. Apply Box 3 with the analytical correction (2), using a linear model for the regression model and a linear model for the relationship model.
2. Proposal of Wang et al. (2020), with a parametric bootstrap correction. Apply Box 3 with the parametric bootstrap correction presented in Algorithm 1, using a linear model for the regression model and a GAM for the relationship model.
3. Proposal of Wang et al. (2020), with a non-parametric bootstrap correction. Apply Box 3 with the non-parametric bootstrap correction presented in Algorithm 1, using a linear model for the regression model and a GAM for the relationship model.

We additionally consider the proposal of Angelopoulos et al. (2023):

4. Proposal of Angelopoulos et al. (2023). Apply Box 4 using a linear model.

Finally, we consider the following two approaches:

5. Classical approach using only the labeled data. Fit a linear model to $(\mathbf{y}_{\rm lab}, \mathbf{x}_{\rm lab})$.
6. Naive approach. Apply Box 1 using a linear model.

In Figure 1 we show the empirical distribution of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ under $H_0: \beta^*_1 = 0$, for increasing sample sizes. In the first three panels, we can see that the asymptotic distribution of this test statistic for an arbitrary $\hat f$ does not follow a $N(0, 1)$ for the naive and Wang et al. (2020) methods. This is in line with our findings in Section 2. This is also true under the alternative ($\beta^*_1 = 1$), as shown in Figure 2. On the other hand, for the method of Angelopoulos et al. (2023), this statistic converges in distribution to a $N(0, 1)$ regardless of the choice of $\hat f$.

Figure 1: An examination of the distribution of $\hat\beta_1/\widehat{\mathrm{SE}}(\hat\beta_1)$ under $H_0: \beta^*_1 = 0$. For each of four different prediction models $\hat f(\cdot)$ (three trained GAMs and one true regression function), we display the empirical distribution of $\hat\beta_1/\widehat{\mathrm{SE}}(\hat\beta_1)$ as the sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The $N(0, 1)$ distribution is shown in black. The dashed black lines show the 0.025 and 0.975 quantiles of this distribution. The distributions of Wang et al. (2020)'s test statistics increasingly diverge from the $N(0, 1)$ distribution as the sample sizes increase. The methods and simulation setup are described in Section 3.

Figure 2: An examination of the distribution of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ when $\beta^*_1 = 1$. For each of four different prediction models $\hat f(\cdot)$ (three trained GAMs and one true regression function), we display the empirical distribution of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ as the sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The $N(0, 1)$ distribution is shown in black. The dashed black lines show the 0.025 and 0.975 quantiles of this distribution. The distributions of Wang et al. (2020)'s test statistics increasingly diverge from the $N(0, 1)$ distribution as the sample sizes increase. The methods and simulation setup are described in Section 3.

In the last panel of Figures 1 and 2, we consider the extreme setting described in Section 2.2, in which $\hat f(z) = \mathbb{E}[Y\mid Z = z]$. Here, the empirical distribution of the test statistic is approximately $N(0, 1)$ for all methods.

We next performed testing and constructed confidence intervals under the assumption made by Wang et al. (2020) and Angelopoulos et al. (2023) that the test statistic asymptotically follows $N(0, 1)$. We examine how the violation of this distributional assumption impacts type 1 error control and coverage in Figures 5 and 6 in the Appendix. We see that the naive and Wang et al. (2020) methods do not control the type 1 error rate or attain nominal coverage for an arbitrary $\hat f$, and they become increasingly anti-conservative as the sample sizes increase. This can be explained by the increasing discrepancy between the assumed and true distributions of $(\hat\beta_1 - \beta^*_1)/\widehat{\mathrm{SE}}(\hat\beta_1)$ as the sample sizes increase, as previously seen in Figures 1 and 2. In fact, we can directly read off the type 1 error rate at level 0.05 as the proportion of points falling outside the dashed lines in Figure 1. Similarly, coverage can be read off as the proportion of points inside the dashed lines in Figure 2. Because the true distribution of the test statistic matches the assumed one for the method of Angelopoulos et al. (2023) for any $\hat f$, this method gives correct type 1 error control and coverage in general. For the same reason, under the extreme setting, all methods have correct type 1 error control and coverage.
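Given replicate-level estimates $\hat\beta_1$ and standard errors from any one of the six approaches above, the comparison to the $N(0,1)$ reference and the empirical type 1 error rate (or coverage) can be computed in a few lines. This is a sketch; the function and array names are placeholders for whatever a simulation loop records, and it is not tied to any particular method's implementation.

```python
import numpy as np
from scipy import stats

def calibration_summary(beta_hats, ses, beta_star, level=0.05):
    """Compare the test statistics to N(0, 1): rejection rate, coverage, and tail quantiles."""
    t = (np.asarray(beta_hats) - beta_star) / np.asarray(ses)
    z = stats.norm.ppf(1 - level / 2)                        # 1.96 for level = 0.05
    return {
        "rejection_rate": float(np.mean(np.abs(t) > z)),     # type 1 error when beta_star is the truth
        "coverage": float(np.mean(np.abs(t) <= z)),          # coverage of the corresponding CIs
        "empirical_quantiles": np.quantile(t, [0.025, 0.5, 0.975]),
    }
```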
4. A direct replication of the simulation study of Wang et al. (2020)

In the previous section, we considered a simulation setting that was very similar to that in Wang et al. (2020), but differed in that we considered the same prediction models $\hat f$ across all simulation replicates. In this section, we instead directly replicate their simulation setting ("Simulated data: continuous case"), by generating a new training dataset in each simulation replicate (resulting in a different $\hat f$ in each replicate). They considered a sample size of 300 for the training, labeled, and unlabeled datasets. In addition to replicating these results, we explore increasing the labeled and unlabeled dataset sizes.

Figure 3: For data generated under $H_0: \beta^*_1 = 0$, quantile-quantile plots of the p-values across simulation replicates are displayed. The methods are described in Section 3 and the simulation setup is described in Section 4. Each panel corresponds to different sample sizes of the labeled and unlabeled datasets used for inference (from $n_{\rm lab} = n_{\rm unlab} = 300$ up to $n_{\rm lab} = 1600$, $n_{\rm unlab} = 16000$). The bootstrap and analytical corrections considered by Wang et al. (2020) become increasingly anticonservative as the sample sizes increase. The classical approach, and that of Angelopoulos et al. (2023), are well-calibrated.

We first examine the type 1 error rate under the null hypothesis $H_0: \beta^*_1 = 0$ using each of the approaches described in Section 3. Quantile-quantile plots of the resulting 1,000 p-values are shown in Figure 3. We see that, in agreement with the findings in Wang et al. (2020), the naive approach does not control the type 1 error rate regardless of sample size. While it appears that when $n_{\rm lab} = n_{\rm unlab} = 300$ (first panel of Figure 3) the methods of Wang et al. (2020) have uniform p-values under the null (as also reported in Wang et al. (2020)), with other sample sizes this no longer holds and the methods of Wang et al. (2020) fail to control the type 1 error rate. As expected based on the previous section, Angelopoulos et al. (2023) controls type 1 error. This finding is in agreement with the theoretical results in Angelopoulos et al. (2023), which hold for an arbitrary prediction function $\hat f(\cdot)$. Also as expected, the classical method controls type 1 error.
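The quantile-quantile comparison underlying Figure 3 can be reproduced from replicate-level test statistics with a few lines; the sketch below assumes two-sided p-values computed against the $N(0,1)$ reference and uses matplotlib only for the plot.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def pvalue_qq(t_stats):
    """Two-sided p-values under the N(0, 1) reference, plotted against Uniform(0, 1) quantiles."""
    p = 2 * stats.norm.sf(np.abs(np.asarray(t_stats)))   # p-values under the null reference
    p_sorted = np.sort(p)
    u = (np.arange(1, len(p) + 1) - 0.5) / len(p)        # Uniform(0, 1) plotting positions
    plt.plot(u, p_sorted, marker=".", linestyle="none")
    plt.plot([0, 1], [0, 1], color="black")              # well-calibrated p-values lie on this line
    plt.xlabel("Uniform(0, 1) quantiles")
    plt.ylabel("empirical p-value quantiles")
    plt.show()
```

Points falling below the diagonal near the origin correspond to anticonservative p-values, i.e. a type 1 error rate above the nominal level.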
Figure 4: For data generated with $\beta^*_1 = 1$, empirical coverage of 95% confidence intervals for each method across the simulation replicates, as the labeled and unlabeled sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The methods are described in Section 3 and the simulation setup is described in Section 4.

Next we examine coverage of 95% confidence intervals under $\beta^*_1 = 1$. We see from Figure 4 that the naive approach has coverage well below the nominal level. Again, the corrected proposals of Wang et al. (2020) also fail to achieve the nominal coverage. The problem becomes increasingly pronounced as the sample sizes of the data used for inference increase. By contrast, the classical method and the proposal of Angelopoulos et al. (2023) do achieve the nominal coverage, supporting the theory of Angelopoulos et al. (2023). As explained in Section 2, the lack of inferential guarantees for the naive approach and the approach of Wang et al. (2020) is a direct consequence of the fact that $\hat\beta_1$ is not consistent for $\beta^*_1$.

5. Discussion

In this paper, we found that the method of Angelopoulos et al. (2023) provides valid type 1 error control and coverage. By contrast, the methods proposed by Wang et al. (2020) do not provide appropriate inferential guarantees in the absence of very strong assumptions: for instance, they are valid under the extreme (and unrealistic) scenario where the prediction model is the true regression function. Under this additional assumption, the naive approach also provides valid inference. Furthermore, we see in our simulation study that simply assuming that the prediction model $\hat f(\cdot)$ was trained on data from the same population as the labeled and unlabeled data (an assumption made by Wang et al. (2020)) is not sufficient to achieve valid inference using their proposed methods.

Throughout, for simplicity we have assumed that we are conducting inference (i.e. Step 2 of Box 1) using a linear model. However, our conclusion that the naive method and the methods of Wang et al. (2020) result in invalid inference, because they target the incorrect parameter, applies much more generally. The theory in Angelopoulos et al. (2023) shows that their approach applies to a wide variety of settings beyond linear regression, and has valid inferential properties.

Code Availability

Scripts to reproduce the results in this manuscript are available at https://github.com/keshav-motwani/PredictionBasedInference/. Our code is based on the code from Wang et al. (2020); we thank the authors for making it publicly accessible.

Acknowledgements

We thank Jeff Leek and Tyler McCormick for helpful conversations. The authors gratefully acknowledge funding from NIH R01 EB026908, NIH R01 GM123993, NIH R01 DA047869, ONR N00014-23-1-2589, a Simons Investigator Award in Mathematical Modeling of Living Systems, and the NSF Graduate Research Fellowship Program.

References

Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. Prediction-powered inference. Science, 382(6671):669-674, 2023. doi: 10.1126/science.adi6000. URL https://www.science.org/doi/abs/10.1126/science.adi6000.

Samuel J. Clark, Tyler McCormick, Zehang Li, and Jon Wakefield. InSilicoVA: A method to automate cause of death assignment for verbal autopsy. arXiv preprint arXiv:1504.02129, 2015.

Shannon E. Ellis, Leonardo Collado-Torres, Andrew Jaffe, and Jeffrey T. Leek. Improving the value of public RNA-seq expression data by phenotype prediction. Nucleic Acids Research, 46(9):e54-e54, 2018.
Eric R. Gamazon, Heather E. Wheeler, Kaanan P. Shah, Sahar V. Mozaffari, Keston Aquino-Michaels, Robert J. Carroll, Anne E. Eyler, Joshua C. Denny, GTEx Consortium, Dan L. Nicolae, et al. A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47(9):1091-1098, 2015.

Alexander Gusev, Kate Lawrenson, Xianzhi Lin, Paulo C. Lyra Jr., Siddhartha Kar, Kevin C. Vavra, Felipe Segato, Marcos A. S. Fonseca, Janet M. Lee, Tanya Pejovic, et al. A transcriptome-wide association study of high-grade serous epithelial ovarian cancer identifies new susceptibility genes and splice variants. Nature Genetics, 51(5):815-823, 2019.

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583-589, 2021.

S. A. Khoury, D. Massad, and T. Fardous. Mortality and causes of death in Jordan 1995-96: assessment by verbal autopsy. Bulletin of the World Health Organization, 77(8):641, 1999.

Siruo Wang, Tyler H. McCormick, and Jeffrey T. Leek. Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48):30266-30275, 2020.

Appendix A. Necessity of consistency for $\beta^*$

We formally state the necessity of targeting the correct parameter in order to use the test statistic $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j)$ for inference.

Lemma 1. Suppose $\widehat{\mathrm{SE}}(\hat\beta_j) = o_p(1)$ and $\hat\beta_j$ does not converge in probability to $\beta^*_j$. Then $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j)$ does not converge in distribution.

Proof. Suppose, for the sake of contradiction, that $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j)$ converges in distribution. Then $(\hat\beta_j - \beta^*_j)/\widehat{\mathrm{SE}}(\hat\beta_j) = O_p(1)$. Thus

$$\hat\beta_j - \beta^*_j = \frac{\hat\beta_j - \beta^*_j}{\widehat{\mathrm{SE}}(\hat\beta_j)} \cdot \widehat{\mathrm{SE}}(\hat\beta_j) = O_p(1)\, o_p(1) = o_p(1),$$

so $\hat\beta_j \to_p \beta^*_j$, which is a contradiction.

Appendix B. Lack of consistency of the analytical method of Wang et al. (2020)

In Sections 2.1 and 2.2, we analyzed the analytical correction as implemented in the code by Wang et al. (2020). We now analyze the analytical correction as described in their publication, which is also not consistent for the parameter of interest in general. In the publication, the estimate obtained from the analytical correction is defined as

$$\hat\beta^{\rm analytical}_{\rm Wang} = \hat\gamma_0 (\mathbf{x}_{\rm unlab}^\top \mathbf{x}_{\rm unlab})^{-1}\mathbf{x}_{\rm unlab}^\top \mathbf{1}_{n_{\rm unlab}} + \hat\gamma_1 \hat\beta_{\rm naive},$$

where

$$\hat\gamma_1 = \frac{\widehat{\mathrm{Cov}}(\mathbf{y}_{\rm lab}, \hat f(\mathbf{z}_{\rm lab}))}{\widehat{\mathrm{Var}}(\hat f(\mathbf{z}_{\rm lab}))} \quad \text{and} \quad \hat\gamma_0 = \hat{\mathbb{E}}[\mathbf{y}_{\rm lab}] - \hat\gamma_1 \hat{\mathbb{E}}[\hat f(\mathbf{z}_{\rm lab})].$$

Thus, as $n_{\rm lab}$ increases,

$$\hat\gamma_1 \to_p \frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))} \equiv \gamma_1 \quad \text{and} \quad \hat\gamma_0 \to_p \mathbb{E}[Y] - \gamma_1 \mathbb{E}[\hat f(Z)] \equiv \gamma_0.$$

Thus, as $n_{\rm lab}$ and $n_{\rm unlab}$ increase,

$$\hat\beta^{\rm analytical}_{\rm Wang} \to_p \gamma_0\, \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X] + \gamma_1\, \mathbb{E}[XX^\top]^{-1}\mathbb{E}[X\hat f(Z)],$$

which is not equal to $\beta^*$ in general.

In the extreme setting described in Section 2.2, in which $\hat f(z) = \mathbb{E}[Y\mid Z = z]$, we have that

$$\gamma_1 = \frac{\mathrm{Cov}(Y, \hat f(Z))}{\mathrm{Var}(\hat f(Z))} = \frac{\mathrm{Cov}(Y, \mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = \frac{\mathrm{Var}(\mathbb{E}[Y\mid Z])}{\mathrm{Var}(\mathbb{E}[Y\mid Z])} = 1$$

and

$$\gamma_0 = \mathbb{E}[Y] - \gamma_1 \mathbb{E}[\hat f(Z)] = \mathbb{E}[Y] - \mathbb{E}[\mathbb{E}[Y\mid Z]] = 0.$$

Thus, using the same argument as for the naive estimator, this analytical correction is consistent for $\beta^*$ in this extreme setting.

Appendix C. Inferential consequences of the wrong distribution

In this section, we examine how violation of the distributional assumption impacts type 1 error control and coverage in the simulations in Section 3. In Figure 5, we see that the methods proposed by Wang et al. (2020) do not control the type 1 error rate for an arbitrary $\hat f(\cdot)$; the situation gets worse as the sample size increases.
However, the method proposed by Angelopoulos et al. (2023) does control the type 1 error rate. Wang et al. (2020) controls the type 1 error rate if the machine learning model is the true regression function, $\hat f(z) = \mathbb{E}[Y\mid Z = z]$; of course, such a perfect machine learning model is unattainable in practice. In Figure 6, we see that the methods proposed by Wang et al. (2020) do not attain the nominal coverage for an arbitrary $\hat f(\cdot)$, whereas the proposal of Angelopoulos et al. (2023) does attain the nominal coverage. The naive method and the proposals of Wang et al. (2020) have appropriate coverage when $\hat f(z) = \mathbb{E}[Y\mid Z = z]$; again, this is unrealistic in practice.

Figure 5: For labeled and unlabeled datasets generated under $H_0: \beta^*_1 = 0$, quantile-quantile plots of the p-values across replicates of the modified simulation study are displayed for each of the four $\hat f(\cdot)$'s considered ($\hat f_1$, $\hat f_2$, $\hat f_3$, and $\hat f(z) = \mathbb{E}[Y\mid Z = z]$). The methods and simulation setup are described in Section 3. Each panel corresponds to different sample sizes of the labeled and unlabeled datasets used for inference (from $n_{\rm lab} = 100$, $n_{\rm unlab} = 1000$ up to $n_{\rm lab} = 1600$, $n_{\rm unlab} = 16000$). The naive method and the bootstrap and analytical corrections considered by Wang et al. (2020) become increasingly anticonservative as the sample sizes increase, unless the machine learning model is perfect, i.e. $\hat f(z) = \mathbb{E}[Y\mid Z = z]$.

Figure 6: For labeled and unlabeled datasets generated with $\beta^*_1 = 1$, empirical coverage of 95% confidence intervals for each method across the simulation replicates, for each of the four $\hat f(\cdot)$'s considered, as the labeled and unlabeled sample sizes increase, with $n_{\rm lab} = 0.1\, n_{\rm unlab}$. The methods and simulation setup are described in Section 3.