
Prediction-Powered E-Values

Daniel Csillag¹, Claudio José Struchiner¹, Guilherme Tegoni Goedert¹

Quality statistical inference requires a sufficient amount of data, which can be missing or hard to obtain. To this end, prediction-powered inference has risen as a promising methodology, but existing approaches are largely limited to Z-estimation problems such as inference of means and quantiles. In this paper, we apply ideas of prediction-powered inference to e-values. By doing so, we inherit all the usual benefits of e-values, such as anytime-validity, post-hoc validity and versatile sequential inference, as well as greatly expand the set of inferences achievable in a prediction-powered manner. In particular, we show that every inference procedure that can be framed in terms of e-values has a prediction-powered counterpart, given by our method. We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. Our approach is modular and easily integrable into existing algorithms, making it a compelling choice for practical applications.

1. Introduction

Statistical inference is ubiquitous in many critical areas of application, such as medicine and economics. Central to their use is the availability of moderate amounts of data to empower our inferences. However, such data can be expensive to obtain, which complicates matters.

A common strategy is to simply collect a smaller amount of data, in order to minimize costs. Unfortunately, this generally leads to more uncertain inferences. Alternatively, there are methods that leverage auxiliary cheap-to-obtain data to compensate for the missing expensive data. Classical works in this direction include single imputation and multiple imputation methods (Little & Rubin, 2019), but they generally lack any strong guarantee of correctness. More recently, Angelopoulos et al. (2023a) proposed prediction-powered inference, which allows for versatile procedures that benefit from strong correctness guarantees, notably including unbiasedness and type-I error control under very light assumptions.

*Equal contribution. ¹School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil. Correspondence to: Daniel Csillag <daniel.csillag@fgv.br>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

At its heart, the idea of prediction-powered inference is simple: we leverage a predictive model (which can be arbitrarily complex, e.g., large neural networks) to predict the expensive data from the cheap data. We can then use our whole dataset to perform our inference by imputing missing expensive data with predictions from our model, while leveraging the available expensive data to quantify our model's inaccuracies, debiasing our inference.

Prediction-powered inference has already inspired a large amount of literature, both methodology-wise (e.g., (Zrnic & Candès, 2024; Angelopoulos et al., 2023b; Zrnic & Candès, 2023; Gu & Xia, 2024)), as well as in applications such as language model evaluations (Chatzi et al., 2024; Boyeau et al., 2024), genome-wide association studies (Miao et al., 2024) and more. However, throughout, the inference tasks considered are fairly limited; previous works are essentially restricted to problems that can be framed in terms of Z-estimation,¹ which includes many common tasks such as inference of means, quantiles and regression coefficients, but not much more. In this paper, we significantly expand this frontier by applying prediction-powered inference to e-values.

¹ A Z-estimation problem is one in which we seek to infer a parameter $\theta^\star \in \Theta$ such that $\mathbb{E}_Z[\psi(Z; \theta^\star)] = 0$, for some known function $\psi$.

E-values are a recent enticing alternative to p-values. Formally, an e-value for a null hypothesis $H_0$ is a nonnegative real random variable $E$ such that, if $H_0$ holds, then $\mathbb{E}[E] \le 1$; by Markov's inequality, it is then unlikely that the e-value $E$ is high under the null, and thus a high e-value ($\gg 1$) provides evidence against the null hypothesis. Though simple, this is a very powerful notion: e-values allow for powerful procedures under very lax assumptions (e.g., not even i.i.d., nonparametric and nonasymptotic) (Howard et al., 2018), naturally handle sequential and anytime-valid inference (Ramdas et al., 2022), naturally fit into multiple testing and post-selective inference (Wang & Ramdas, 2020; Xu et al., 2022) and allow for significance levels to be chosen a posteriori (Koning, 2023; Grünwald, 2022). These properties are notoriously challenging to obtain with the more standard p-values, if not outright impossible, especially in conjunction. Furthermore, e-values are rather universal: any e-value can be converted to a p-value by simply taking its reciprocal, and any p-value can be converted to an e-value by a process termed calibration (Vovk & Wang, 2019), albeit at a slight loss of power.
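The Markov's inequality step can be written out explicitly. The following is the standard argument, stated here for convenience; the significance level $\alpha \in (0, 1)$ is notation introduced only for this illustration:

```latex
\[
\mathbb{P}_{H_0}\!\left[E \ge \tfrac{1}{\alpha}\right]
\;\le\; \alpha\,\mathbb{E}_{H_0}[E]
\;\le\; \alpha .
\]
```

In other words, rejecting $H_0$ whenever $E \ge 1/\alpha$ controls the type-I error at level $\alpha$, which is also why $\min\{1, 1/E\}$ behaves as a valid p-value.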

By working atop e-values, our procedure gains a great amount of versatility. We show that any inference procedure that operates in terms of e-values has a prediction-powered counterpart, given by our method. Moreover, our procedure naturally inherits all of the usual virtues of e-values, in particular including anytime-validity and post-hoc validity. In fact, the sequential nature of our procedure further empowers prediction-powered inference methods, allowing us to arbitrarily improve our predictive model and data collection policy over the course of the inference, whereas previous methods require us to fix it a priori, or learn it from a separate data split.

We illustrate our procedure in four case studies. First, we use it for a simple problem of estimating the prevalence of diabetes in a population from readily available survey data. Secondly, we apply our method to a problem of anytime-valid testing of the hypothesis that a deployed model's risk does not exceed a certain safety level, for the purpose of continuous risk monitoring. We then turn to more involved inference tasks. In the same context of continuous risk monitoring, we apply our method to the detection of change-points, in which we seek to identify points in time where some aspect of the time series has changed. Finally, we consider how our method enables powerful procedures for causal discovery under missing (costly) data.

Our contributions

1. We present a new method for prediction-powered inference based on e-values. Besides being applicable to a much more general setting than the ones previously considered in the literature, it inherits all the usual benefits of e-values, including sequential inference that is valid under arbitrary optional stopping and post-hoc validity. Moreover, it allows for the underlying predictive model to be updated over the course of the inference, yielding much better data efficiency compared to prior work (which requires the model to be fit on a separate data split);

2. We show how the base method can be extended from simple hypothesis testing with e-values to more involved procedures, first considering confidence intervals/sequences and then general algorithms based on e-values. In particular, we show that simply substituting the base e-values by our prediction-powered e-values yields valid prediction-powered procedures that are statistically powerful, leading to a modular and widely applicable technique;

3. We showcase our method in four case studies ranging from simple mean estimation and hypothesis testing to change-point detection and causal discovery. This highlights the wide applicability of our approach, and we consistently note its much improved performance compared to baselines in spite of substantial (often 100x-200x) reductions in data acquisition costs.

Related work. Our method, much like most of the prediction-powered inference literature, is fundamentally connected to the literature on semiparametric inference. In particular, our prediction-powered e-values can be seen as a use of the AIPW estimator (Robins et al., 1994), but atop the e-values rather than the data. Specifically connecting to e-values, prior work by Xu et al. (2024) has explored using an AIPW estimator atop e-values for the mean in order to construct risk-controlling prediction sets (RCPSs); their variance-reduced method turns out to be a special case of our approach.

2. A General Method

We will first present how we can transform a standard e-value into a prediction-powered one in the context of hypothesis testing. This mechanism can then be leveraged to transform more complex procedures powered by e-values into prediction-powered ones; we first thoroughly instantiate this for confidence sequences, and then more generally in the context of general e-value-powered algorithms. Throughout, we consider an active data collection setting.

2.1. Hypothesis testing

Our goal is to test some null hypothesis $H_0$, and for this purpose we have a stream of data $(X_i, Y_i)_{i=1}^{\infty}$. The $X_i$ correspond to cheap data that we will always have access to, while the $Y_i$ correspond to data that is expensive to obtain, and to which we therefore have little access; ultimately, however, the hypothesis we want to test is over the distribution of the $Y_i$.

Data acquisition costs aside, a sound approach to perform such a hypothesis test is to leverage an e-value $E_n$, i.e., a nonnegative random variable that is a function of the first $n$ data points, such that under the null $H_0$ it holds that $\mathbb{E}[E_n] \le 1$. In particular, we consider e-values of the form

$$E_n := \prod_{i=1}^{n} e_i(Y_i), \tag{1}$$


where $(e_i)_{i=1}^{\infty}$ is a predictable sequence of the components of the e-value, i.e., each $e_i$ can be arbitrarily dependent on the samples before time $i$ (but nothing else). We will further require that the e-value's components be predictably bounded: for all $i$, $e_i(\cdot) \in [a_i, b_i]$ for some predictable sequences $(a_i)_{i=1}^{\infty}$ and $(b_i)_{i=1}^{\infty}$, with $a_i > 0$ for all $i$.

Most e-values in the literature are already of this form (e.g., (Waudby-Smith & Ramdas, 2020; Podkopaev & Ramdas, 2023a;b; Waudby-Smith et al., 2022; Bar et al., 2024)), or can be factored into it. The boundedness assumption can be enforced by simple rescaling and clipping, albeit at a slight loss of power.

Should we have access to perfect models $\mu_i^\star : \mathcal{X} \to \mathbb{R}$, i.e., such that $\mu_i^\star(X_i) = Y_i$ almost surely, then we could instead only use the predictions atop the cheaper data, $\mu_i^\star(X_i)$, to construct the e-value by its components:

$$E_n^{\mathrm{imputed}} := \prod_{i=1}^{n} e_i(\mu_i^\star(X_i)).$$

However, in the much more realistic scenario that the model is not perfect, $E_n^{\mathrm{imputed}}$ will not be a valid e-value.

We can, however, debias $E_n^{\mathrm{imputed}}$ as per prediction-powered inference (Angelopoulos et al., 2023a) and active statistical inference (Zrnic & Candes, 2024). First, endow the data stream with additional random variables $\xi_i \sim \mathrm{Bern}(\pi_i(X_i))$ denoting whether we should collect (and thus have access to) the more expensive data $Y_i$, where $\pi_1, \pi_2, \ldots : \mathcal{X} \to [1 - a_i/b_i, 1]$ is a predictable (i.e., possibly arbitrarily dependent on data prior to $i$, but independent of all from $i$ onwards) sequence of functions that produce the probability of data collection.

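With this augmented data stream $(X_i, Y_i, \pi_i, \xi_i)_{i=1}^{\infty}$, we can form a new prediction-powered sequence of e-values, with form similar to that of the active prediction-powered estimators of (Zrnic & Candes, 2024):

$$e_i^{\mathrm{ppi}} := e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big)\,\frac{\xi_i}{\pi_i(X_i)}, \qquad E_n^{\mathrm{ppi}} := \prod_{i=1}^{n} e_i^{\mathrm{ppi}}, \qquad \big(\xi_i \sim \mathrm{Bern}(\pi_i(X_i))\big).$$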

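This construction is motivated by the fact that, conditional on all data prior to the time point $i$, the prediction-powered e-value components $e_i^{\mathrm{ppi}}$ match the non-prediction-powered ones $e_i$ in expectation:

$$\begin{aligned}
\mathbb{E}_i\big[e_i^{\mathrm{ppi}}\big]
&= \mathbb{E}_i\!\left[e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big)\frac{\xi_i}{\pi_i(X_i)}\right] \\
&= \mathbb{E}_i[e_i(\mu_i(X_i))] + \mathbb{E}_i\!\left[\big(e_i(Y_i) - e_i(\mu_i(X_i))\big)\frac{\xi_i}{\pi_i(X_i)}\right] \\
&= \mathbb{E}_i[e_i(\mu_i(X_i))]
 + \mathbb{E}_i\!\left[\big(e_i(Y_i) - e_i(\mu_i(X_i))\big)\frac{\xi_i}{\pi_i(X_i)} \,\middle|\, \xi_i = 1\right]\mathbb{P}_i[\xi_i = 1] \\
&\qquad + \mathbb{E}_i\!\left[\big(e_i(Y_i) - e_i(\mu_i(X_i))\big)\frac{\xi_i}{\pi_i(X_i)} \,\middle|\, \xi_i = 0\right]\mathbb{P}_i[\xi_i = 0] \\
&= \mathbb{E}_i[e_i(\mu_i(X_i))] + \mathbb{E}_i[e_i(Y_i) - e_i(\mu_i(X_i))] = \mathbb{E}_i[e_i(Y_i)].
\end{aligned}$$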

Furthermore, the boundedness of the e-value components, together with the lower bound $\pi_i(X_i) \ge 1 - a_i/b_i$ on the collection probabilities, ensures that each $e_i^{\mathrm{ppi}}$ is always nonnegative. Using these facts along with a backward induction argument, one can prove:
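Both facts can be checked exactly on a toy case. The sketch below is only an illustration under assumptions of our own choosing: a binary $Y$, a fixed $X$, a single component $e(y) = 1 + \tfrac12 (y - \tfrac12)$ (so $a = 3/4$, $b = 5/4$ and $1 - a/b = 2/5$), and hypothetical helper names `ppi_component` and `exact_check`. It enumerates the four outcomes of $(Y, \xi)$ with exact rational arithmetic.

```python
from fractions import Fraction


def ppi_component(e_y, e_mu, xi, pi):
    """e_ppi = e(mu(X)) + (e(Y) - e(mu(X))) * xi / pi."""
    return e_mu + (e_y - e_mu) * xi / pi


def exact_check(pi, p_y1, e, mu_pred):
    """Enumerate Y in {0, 1} and xi in {0, 1} (independent given X).

    Returns (E[e_ppi], E[e(Y)], min e_ppi) as exact fractions.
    """
    e_mu = e(mu_pred)
    mean_ppi, min_ppi = Fraction(0), None
    for y, p_y in ((0, 1 - p_y1), (1, p_y1)):
        for xi, p_xi in ((0, 1 - pi), (1, pi)):
            val = ppi_component(e(y), e_mu, xi, pi)
            mean_ppi += p_y * p_xi * val
            min_ppi = val if min_ppi is None else min(min_ppi, val)
    mean_base = (1 - p_y1) * e(0) + p_y1 * e(1)
    return mean_ppi, mean_base, min_ppi


def e_toy(y):
    # Illustrative component with values in [3/4, 5/4] for y in {0, 1}.
    return 1 + Fraction(1, 2) * (y - Fraction(1, 2))


# Collection probability exactly at the lower bound 1 - a/b = 2/5,
# with a model that always (wrongly or rightly) predicts 1.
mean_ppi, mean_base, min_ppi = exact_check(Fraction(2, 5), Fraction(3, 10), e_toy, 1)
print(mean_ppi, mean_base, min_ppi)
```

In this toy case the two expectations coincide and the smallest attainable component is exactly zero at $\pi = 1 - a/b$; choosing a smaller $\pi$ makes the worst case negative, which is why the lower bound on $\pi_i$ is needed.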

Theorem 2.1. $E_n^{\mathrm{ppi}}$ is a valid e-value for the null $H_0$. Additionally:

(i) If $(E_0, E_1, \ldots)$ form a test supermartingale (i.e., a nonnegative supermartingale with $\mathbb{E}[E_0] \le 1$ under the null $H_0$), then so is $(E_0^{\mathrm{ppi}}, E_1^{\mathrm{ppi}}, \ldots)$;

(ii) More generally, if $(E_0, E_1, \ldots)$ form an e-process (i.e., a nonnegative stochastic process such that, for all stopping times $\tau$, the null $H_0$ implies that $\mathbb{E}[E_\tau] \le 1$), then so is $(E_0^{\mathrm{ppi}}, E_1^{\mathrm{ppi}}, \ldots)$ for all finite stopping times.

Besides having valid e-values, which assures us of type-I error control, one should check whether they are efficient/powerful. We can check that, under mild assumptions, our e-process has good power in terms of the expected growth rate (Kelly, 1956) as long as the models $\mu_i$ match the true data $Y_i$ sufficiently well:

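Theorem 2.2. Suppose that the $e_i(\cdot)$ are each $L_i$-Lipschitz, and that $\pi_i(X_i) \ge 1 - a_i/b_i + \epsilon_i$ for some $\epsilon_i > 0$, for all $i$. Then there exists some constant $c > 0$ independent of $n$ such that

$$\mathbb{E}\!\left[\frac{1}{n} \log E_n^{\mathrm{ppi}}\right] \ge \mathbb{E}\!\left[\frac{1}{n} \log E_n\right] - \frac{c}{n} \sum_{i=1}^{n} \mathbb{E}\big[\,|\mu_i(X_i) - Y_i|\,\big].$$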

More general and precise statements are also possible, but less compact; see Theorems A.6 in the appendix.

The sequential nature of the prediction-powered e-values (which holds regardless of whether the original e-values were of sequential nature) allows for an extremely versatile procedure. For instance, in contrast to most existing prediction-powered inference procedures, we are able to update both our underlying prediction model and our data collection rule over the course of our inference process, with no restrictions other than not using future information and having to satisfy the boundedness assumptions.

The resulting algorithm for hypothesis testing is remarkably simple to implement, given its generality. The pseudocode can be found in Algorithm 1.


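Algorithm 1 Prediction-Powered E-Values

Input: base e-value components $(e_1(\cdot), e_2(\cdot), \ldots)$
Output: prediction-powered e-values $(E_0^{\mathrm{ppi}}, E_1^{\mathrm{ppi}}, \ldots)$

$E_0^{\mathrm{ppi}} \leftarrow 1$
Initialize $\mu : \mathcal{X} \to \mathcal{Y}$ and $\pi : \mathcal{X} \to [1 - a_1/b_1, 1]$
for each $i = 1, 2, \ldots$ do
    Get cheap data $X_i$
    Sample $\xi_i \sim \mathrm{Bern}(\pi(X_i))$
    if $\xi_i = 1$ then
        Collect expensive data $Y_i$
        $E_i^{\mathrm{ppi}} \leftarrow E_{i-1}^{\mathrm{ppi}} \cdot \dfrac{e_i(Y_i) - (1 - \pi(X_i))\, e_i(\mu(X_i))}{\pi(X_i)}$
    else
        $E_i^{\mathrm{ppi}} \leftarrow E_{i-1}^{\mathrm{ppi}} \cdot e_i(\mu(X_i))$
    end if
    Optionally update $\pi$ and $\mu$
end for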
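As a rough illustration, the recursion of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the authors' implementation: the component `e`, the model `mu` and the collection rule `pi` are held fixed (the optional update step is omitted), and the function name `ppi_evalue_process` and the `get_label` callback are hypothetical.

```python
import random


def ppi_evalue_process(xs, get_label, e, mu, pi, rng):
    """Run the prediction-powered e-value recursion over cheap inputs xs.

    get_label(i) returns the expensive Y_i and is only called when xi_i = 1.
    e(y) is a base e-value component, mu(x) the predictive model and
    pi(x) the label collection probability (assumed >= 1 - a/b).
    Returns the path [E_0, E_1, ..., E_n] and the number of labels collected.
    """
    value, path, n_labels = 1.0, [1.0], 0
    for i, x in enumerate(xs):
        p = pi(x)
        e_pred = e(mu(x))
        if rng.random() < p:  # xi_i = 1: pay for the label
            y = get_label(i)
            n_labels += 1
            value *= (e(y) - (1.0 - p) * e_pred) / p
        else:  # xi_i = 0: rely on the prediction alone
            value *= e_pred
        path.append(value)
    return path, n_labels


# Toy usage: binary labels, component with values in [0.9, 1.1],
# so any collection probability >= 1 - 0.9/1.1 (about 0.18) is admissible.
labels = [i % 2 for i in range(50)]
path, n_labels = ppi_evalue_process(
    xs=labels,
    get_label=lambda i: labels[i],
    e=lambda y: 1.0 + 0.2 * (y - 0.5),
    mu=lambda x: x,  # a "perfect" model, purely for illustration
    pi=lambda x: 0.25,
    rng=random.Random(0),
)
```

With a perfect model, as in this toy usage, both branches multiply by $e_i(Y_i)$ (up to floating-point error), so the prediction-powered path coincides with the fully labelled one while querying only a fraction of the labels.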

2.2. From hypothesis testing to confidence intervals

With prediction-powered e-values in hand, we can easily produce prediction-powered confidence intervals/sequences by considering a family of e-values indexed by the parameter in question.

Suppose we want to produce a confidence interval/sequence for a parameter $\theta^\star \in \Theta$ of the data generating process, and consider the family of nulls $H_0^{(\theta)} : \theta^\star = \theta$, indexed by $\theta$. For each such null, we can construct a corresponding prediction-powered e-value $E_n^{\mathrm{ppi}\,(\theta)}$ and then consider the set $C_n^{\mathrm{ppi}\,(\alpha)} := \big\{\theta \in \Theta : E_n^{\mathrm{ppi}\,(\theta)} < 1/\alpha\big\}$.

By the standard duality between hypothesis tests and confidence sets, it then holds that:

Proposition 2.3. $C_n^{\mathrm{ppi}\,(\alpha)}$ is a valid confidence interval, i.e., $\mathbb{P}[\theta^\star \in C_n^{\mathrm{ppi}\,(\alpha)}] \ge 1 - \alpha$. Moreover:

(i) If the underlying e-values form a nonnegative supermartingale, then the prediction-powered intervals are anytime-valid (also known as confidence sequences): $\mathbb{P}[\forall n \in \mathbb{N},\ \theta^\star \in C_n^{\mathrm{ppi}\,(\alpha)}] \ge 1 - \alpha$;

(ii) More generally, if the underlying e-values form e-processes, then the prediction-powered intervals are valid at arbitrary stopping times: $\mathbb{P}[\theta^\star \in C_\tau^{\mathrm{ppi}\,(\alpha)}] \ge 1 - \alpha$ for any stopping time $\tau$.
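The duality argument behind the fixed-$n$ claim and claim (i) is short enough to spell out. The sketch below writes $\theta^\star$ for the true parameter and uses only Markov's and Ville's inequalities; it is the standard reasoning rather than a reproduction of the paper's appendix proof:

```latex
\begin{align*}
\mathbb{P}\big[\theta^\star \notin C_n^{\mathrm{ppi}\,(\alpha)}\big]
  &= \mathbb{P}\big[E_n^{\mathrm{ppi}\,(\theta^\star)} \ge 1/\alpha\big]
   \le \alpha\, \mathbb{E}\big[E_n^{\mathrm{ppi}\,(\theta^\star)}\big] \le \alpha
   && \text{(Markov's inequality)} \\
\mathbb{P}\big[\exists n \in \mathbb{N} : \theta^\star \notin C_n^{\mathrm{ppi}\,(\alpha)}\big]
  &= \mathbb{P}\big[\exists n \in \mathbb{N} : E_n^{\mathrm{ppi}\,(\theta^\star)} \ge 1/\alpha\big]
   \le \alpha
   && \text{(Ville's inequality)}
\end{align*}
```

The first line only needs $E_n^{\mathrm{ppi}\,(\theta^\star)}$ to be an e-value under the true null; the second needs the test supermartingale property from Theorem 2.1 (i).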

Again, we are also interested in how efficient these confidence sequences are. Just like before, as long as our predictive models are good, we get more concentrated intervals, as measured by the area under the log-p-landscape:

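Proposition 2.4. Under the assumptions of Theorem 2.2, let $\nu$ be a measure over the parameter space $\Theta$. Then there exists some $c$ for which

$$\mathbb{E}\!\left[\int \log \frac{1}{E_n^{\mathrm{ppi}\,(\theta)}}\, d\nu(\theta)\right] \le \mathbb{E}\!\left[\int \log \frac{1}{E_n^{(\theta)}}\, d\nu(\theta)\right] + c \sum_{i=1}^{n} \mathbb{E}\big[\,|\mu_i(X_i) - Y_i|\,\big].$$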

These results may be mapped to the actual measure of the confidence interval, but this is nontrivial; see the Appendix.

2.3. General e-value-powered algorithms

Beyond simple hypothesis testing and confidence sequences, e-values can also be used as components of more elaborate inference procedures, for example in causal discovery (e.g. (Peters et al., 2017)), change point detection (Shin et al., 2022; Shekhar & Ramdas, 2023a;b) and test-time adaptation (Bar et al., 2024). Our prediction-powered e-values can also be seamlessly integrated into such procedures.

Consider that we have a family of e-values $E_n^{(\gamma)}$ for respective nulls $H_0^{(\gamma)}$, indexed by $\gamma \in \Gamma$, and our overall algorithm is of the form $\mathcal{A}\big((E_n^{(\gamma)})_{\gamma \in \Gamma}\big)$. Moreover, our algorithm comes endowed with some validity property, and is such that this validity depends only on the inputted e-values being valid:

Assumption 2.5. If $E_n^{(\gamma)}$ is a valid e-value for the null $H_0^{(\gamma)}$ for every $\gamma \in \Gamma$, then $\mathcal{A}\big((E_n^{(\gamma)})_{\gamma \in \Gamma}\big)$ is valid.

It is then easy to show that by simply replacing the input e-values by their prediction-powered counterparts, the validity property is maintained:

Proposition 2.6. Under Assumption 2.5, it holds that $\mathcal{A}\big((E_n^{\mathrm{ppi}\,(\gamma)})_{\gamma \in \Gamma}\big)$ is also valid. If the underlying e-values are e-processes, then it further holds that $\mathcal{A}\big((E_\tau^{\mathrm{ppi}\,(\gamma)})_{\gamma \in \Gamma}\big)$ is valid for any finite stopping time $\tau$.

It is also of interest to consider some notion of power or efficiency of the resulting prediction-powered procedure. However, such an analysis needs to take into account more of the particular algorithm, and so should be done on a case-by-case basis. Similarly, the appropriate notion of anytime-validity (which would be implied by the underlying e-values forming test supermartingales) depends on the particular definition of validity for the algorithm in question and so should be considered on a case-by-case basis. Nevertheless, the case of an e-process still holds generally.

2.4. The Asymptotic Case

Though typically considered in the context of nonasymptotic statistics, e-values also have asymptotic analogues (Waudby-Smith et al., 2021; Ramdas & Wang, 2024). In the main text we focus only on nonasymptotic e-values, but our ideas map directly to the asymptotic setting just as well; see Appendix B.1.

3. Experiments and Case Studies

In this section we present four case studies where we use our method, highlighting the modifications made to the base methods in the process of prediction-empowerment.

3.1. Estimation of a Mean: Prevalence of Diabetes from Survey Data

In this first case study we seek to estimate the prevalence of diabetes in a cohort, working with the dataset of (CDC, 2015). This estimation is key to the scaling of resources in health systems, as this medical condition can be very common and very costly to treat in many populations.

Actually assessing the presence of diabetes can be somewhat costly, requiring thorough analysis of individual medical records. On the other hand, we have readily available data in the form of short survey responses, consisting of simple questions such as "do you have high blood pressure?", "do you have high cholesterol?", "have you smoked at least 100 cigarettes in your entire life?", and so on (see Appendix C for the full list). Considering that these questions capture health indicators that are fairly predictive of diabetes, it is appealing to leverage them in a prediction-powered manner.

More formally, we have a data stream $(X_i, Y_i)_{i=1}^{\infty}$ where the $X_i$ correspond to the responses to our survey questions, and the $Y_i$ correspond to a binary indicator of whether the individual is diabetic. For the sake of evaluation, our dataset includes all $Y_i$, but in a real-world setting it would be expected that they would be largely missing; we will simulate this missingness. Our goal is to infer the mean

$$\text{prevalence of diabetes} = \mathbb{E}\big[\mathbb{1}[Y_i = \text{diabetic}]\big].$$

This is the mean of a random variable bounded in [0, 1], and so we can use the e-value-based method for inference of bounded means of (Waudby-Smith & Ramdas, 2020). Our confidence interval/sequence is thus given by the set

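$$C_n^{(\alpha)} = \big\{\theta \in [0, 1] : E_n^{(\theta)} < 1/\alpha\big\}, \qquad E_n^{(\theta)} := \prod_{i=1}^{n} \Big(1 + \lambda_i \big(\mathbb{1}[Y_i = \text{diabetic}] - \theta\big)\Big),$$

where $(\lambda_i)_{i=1}^{\infty}$ is a predictable sequence of bets bounded in $\left(-\frac{1}{1-\theta}, \frac{1}{\theta}\right)$. In particular, each $E_n^{(\theta)}$ is a test supermartingale and thus a sequence of e-values for a corresponding null $H_0^{(\theta)}$: prevalence of diabetes $= \theta$ (Waudby-Smith & Ramdas, 2020).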
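To make the construction concrete, the following sketch evaluates the base (non-prediction-powered) betting e-value on a grid of candidate means and keeps those that are not rejected. It is a simplified illustration: the plug-in bet rule (a multiple of the running mean minus $\theta$, clipped to a scaled-down bet range) is a stand-in chosen for brevity and is not the betting strategy used in the experiments, and the names `betting_evalue` and `confidence_set` are hypothetical.

```python
def betting_evalue(ys, theta, c=0.5):
    """E_n^(theta) = prod_i (1 + lam_i * (y_i - theta)) for binary ys, 0 < theta < 1.

    The bets lam_i depend only on past observations (predictable) and are
    clipped to (-c/(1-theta), c/theta), so every factor is at least 1 - c.
    """
    lo, hi = -c / (1.0 - theta), c / theta
    value, total, count = 1.0, 0.5, 1  # one pseudo-observation at 0.5
    for y in ys:
        m_hat = total / count  # running mean of the past data only
        lam = min(max(4.0 * (m_hat - theta), lo), hi)
        value *= 1.0 + lam * (y - theta)
        total += y
        count += 1
    return value


def confidence_set(ys, alpha=0.05, grid_size=99):
    """Grid approximation of {theta : E_n^(theta) < 1/alpha}."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return [th for th in grid if betting_evalue(ys, th) < 1.0 / alpha]


# Toy data with 30% positives, arranged in a fixed repeating pattern.
ys = [1 if i % 10 < 3 else 0 for i in range(200)]
cs = confidence_set(ys)
print(min(cs), max(cs))
```

A prediction-powered version would replace each factor by its $e_i^{\mathrm{ppi}}$ counterpart, separately for every $\theta$ on the grid.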

Figure 1. Prediction-powered confidence sequences. The plot shows the p-landscape (i.e., parameter on the x-axis, reciprocal of the e-value on the y-axis) for the confidence sequence generated by our method (green), along with those for inference using only labelled samples (purple) and by using an imputation approach. The 95% confidence interval for each p-landscape (i.e., the region where the p-landscape is above 0.05) is shaded. Our method provides the tightest valid nonasymptotic intervals, comparable to the active asymptotic method of (Zrnic & Candes, 2024); using only the labelled samples or vanilla PPI (Angelopoulos et al., 2023a) yields weaker inferences, and using imputation fails to cover the true mean.

These e-values are already in our required form of Equation (1), but additional care needs to be taken with regard to the bounds of the e-value components. As-is, the components are bounded just in $[0, 1 + \max\{\theta/(1-\theta), (1-\theta)/\theta\}]$. This means that we would require the data collection probabilities $\pi_i(X_i)$ to be bounded in $[1, 1]$, i.e., we would always need to collect data; this is clearly insufficient for our purposes.

Fortunately, we have a direct way of controlling these bounds by means of the bets $(\lambda_i)$. If, instead of requiring them to be bounded in $\left(-\frac{1}{1-\theta}, \frac{1}{\theta}\right)$, we require them to be bounded in $\left(-\frac{c}{1-\theta}, \frac{c}{\theta}\right)$ for some $0 < c < 1$, then the components are bounded in $[1 - c, 1 + c \max\{\theta/(1-\theta), (1-\theta)/\theta\}]$, which now leads to nontrivial bounds on the $\pi_i(X_i)$. In particular, for any desired lower bound $\pi_{\inf}$ for $\pi_i(X_i)$, we can now solve for some $c$ for which

$$1 - \frac{1 - c}{1 + c \max\{\theta/(1-\theta), (1-\theta)/\theta\}} \le \pi_{\inf}, \tag{2}$$

satisfying our requirements; we use $\pi_{\inf} = 1\%$.
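Equation (2) can be solved in closed form. Writing $M = \max\{\theta/(1-\theta), (1-\theta)/\theta\}$, the left-hand side is increasing in $c$, and rearranging gives $c \le \pi_{\inf} / (1 + M(1 - \pi_{\inf}))$. The sketch below encodes this rearrangement; the function names are hypothetical and the algebra is ours, not taken from the paper.

```python
def max_bet_scale(theta, pi_inf):
    """Largest c with 1 - (1 - c) / (1 + c * M) <= pi_inf.

    M = max{theta / (1 - theta), (1 - theta) / theta}; requires 0 < theta < 1
    and 0 < pi_inf < 1.
    """
    m = max(theta / (1.0 - theta), (1.0 - theta) / theta)
    return pi_inf / (1.0 + m * (1.0 - pi_inf))


def min_collection_prob(theta, c):
    """Lower bound 1 - a/b on the collection probability for bet scale c."""
    m = max(theta / (1.0 - theta), (1.0 - theta) / theta)
    return 1.0 - (1.0 - c) / (1.0 + c * m)


c = max_bet_scale(theta=0.5, pi_inf=0.01)
print(c, min_collection_prob(0.5, c))
```

Since $M$ grows as $\theta$ approaches 0 or 1, the admissible $c$ shrinks for extreme candidate means, so a small labelling budget translates into more conservative bets there.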

We then have the following methods for doing inference with a fixed labelling budget πinf:

Only labelled samples: collect $\pi_{\inf} \cdot n$ labelled samples, and use the standard, non-prediction-powered e-values of (Waudby-Smith & Ramdas, 2020) to estimate the mean. For the bets $\lambda_i$, we use the aGRAPA method proposed by (Waudby-Smith & Ramdas, 2020), bounded to $\left(-\frac{1}{1-\theta}, \frac{1}{\theta}\right)$.

Prediction-powered (ours): use our prediction-powered e-values method atop the e-value with bets truncated as per Equation (2). The predictive model is updated over the course of the inference by retraining on the augmented dataset whenever we get a new data label. For the collection probabilities $\pi_i$, we always use $\pi_{\inf}$, the lowest possible value, in an effort to minimize data collection costs.

Active prediction-powered (ours): same as the previous prediction-powered method, but with a different choice of collection probabilities πi. This time, rather than opting for constant, always as low as possible, probabilities, we follow an approximately optimal choice which takes into consideration the Xi, as delineated in Appendix B.2. This gives an active inference / active learning flavor to our method.

Imputation: we simply learn a predictive model to predict the missing Yi from the available Xi, and impute the missing Yi with it without any care to use some prediction-powered inference method. This will often yield invalid inferences, but is very common in practice and thus a relevant baseline.

Vanilla prediction-powered: for the sake of comparison to prior work, we also consider the method of (Angelopoulos et al., 2023a). This method requires the prediction model to be fixed a priori, so we first split the collected labels in a training set to train it, and use the remaining labels for their prediction-powered inference method. For confidence intervals, we use CLT-based ones as proposed by the authors.

Vanilla active prediction-powered: we also compare to the method of (Zrnic & Candes, 2024), which does prediction-powered inference in an active setting, with a similar AIPW approach. We use CLT-based CIs for their estimator, as they propose, and update the underlying predictive model as we do for our method.

Figure 1 shows the result of our experiment. Our prediction-powered methods provide valid confidence intervals that are tighter and more concentrated around the true mean in comparison to only using labelled samples, while the imputation approach is strongly concentrated away from the true mean, and would lead to invalid conclusions. In comparison to the method of (Angelopoulos et al., 2023a), our method provides tighter intervals in spite of its nonasymptotic nature, likely due to its ability to train the predictive model without a data split; indeed, when compared to the baseline of (Zrnic & Candes, 2024), which also updates the models, we see a similar performance obtained by our method, but now with the stronger guarantees of e-values.

Figure 2. Prediction-powered anytime-valid hypothesis testing. The plot shows the e-values over time for testing two null hypotheses: one on the bottom, which should be rejected, and one on top, which should not be rejected. Our prediction-powered e-values provide the strongest valid signal for rejection ($E \ge 20$ for a confidence level of 95%, marked by the dashed lines), as the imputation approach rejects before the null is actually violated; for non-rejection ($E < 20$), all the methods appear valid, but ours still attains the highest e-value.

3.2. Testing the Online Risk: Online Monitoring of a Deployed Model for Forest Cover Prediction

For our second experiment, we consider the task of monitoring the risk of a predictive model for forest cover types online. Forest cover prediction is of wide use in remote sensing tasks and particularly for tracking of deforestation and land use, which is, in turn, very useful for climate research. Moreover, online risk monitoring is ubiquitous and applicable to any setting where a predictive model is involved.

Figure 3. Prediction-powered change-point detection via e-values. The plot shows the exponential moving average of a time series (in blue), with the few collected labels denoted by the scattered Xs. Our prediction-powered methods detect the change-point accurately, while the base method that only considers the labelled data points does not detect any change-point.

Again we have a data stream $(X_i, Y_i)_{i=1}^{\infty}$, where the $X_i$ indicate input variables to our predictive model (in this case, corresponding to data from satellite images) and the $Y_i$ are the labels denoting the corresponding cover type (which is a categorical variable). Naturally, $Y_i$ is generally missing; after all, if it weren't, then we would have no need to predict it. In our experiment, we work on the dataset of (Blackard, 1998). For the sake of evaluation, we have access to all $Y_i$, but will simulate the missingness. The notion of risk in which we are interested is given by the 0-1 loss: $\mathrm{Risk}_i(f) = \mathbb{E}\big[\mathbb{1}[f(X_i) \ne Y_i]\big]$. We have already trained the predictive model $f$ independently of our data stream (in the case of our experiment, on a separate training split) and have similarly evaluated it on a separate validation set, also independent of our data stream, on which we obtained a validation 0-1 loss of $\mathrm{ValRisk}$. For continuous risk monitoring, we want to test the null hypothesis that

$$H_0 : \mathrm{Risk}_i(f) \le \mathrm{ValRisk} + \epsilon_{\mathrm{tol}}, \quad \text{for all } i = 1, 2, \ldots,$$

for some tolerance level ϵtol, for example equal to 0.05. In particular, we would like for this hypothesis test to be anytime-valid, so that at any point we can reach safe conclusions from it.

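Inspired by the work of (Podkopaev & Ramdas, 2021), we consider the following e-value:

$$E_n := \prod_{i=1}^{n} \Big(1 + \lambda_i \big(\mathbb{1}[f(X_i) \ne Y_i] - (\mathrm{ValRisk} + \epsilon_{\mathrm{tol}})\big)\Big), \tag{3}$$

where $(\lambda_i)_{i=1}^{\infty}$ is a predictable sequence of bets bounded in $\big[0, 1/(\mathrm{ValRisk} + \epsilon_{\mathrm{tol}})\big]$. This forms a test supermartingale for the null $H_0$.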
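A minimal sketch of how such a test supermartingale can drive an alarm is given below. It assumes fully observed 0-1 losses and a constant bet equal to a fixed fraction of the maximum allowed value; both are simplifications made for illustration, and the function name `monitor_risk` and its parameters are hypothetical.

```python
def monitor_risk(losses, threshold, alpha=0.05, bet_fraction=0.5):
    """Sequentially test H0: risk <= threshold using Equation (3)-style factors.

    losses: iterable of 0/1 losses 1[f(X_i) != Y_i].
    threshold: ValRisk + eps_tol, assumed to lie in (0, 1).
    Returns the first time n (1-based) at which E_n >= 1/alpha, or None
    if the null is never rejected on this stream.
    """
    lam = bet_fraction / threshold  # constant bet inside [0, 1/threshold]
    value = 1.0
    for n, loss in enumerate(losses, start=1):
        value *= 1.0 + lam * (loss - threshold)
        if value >= 1.0 / alpha:
            return n
    return None


# A stream of constant errors triggers the alarm quickly,
# while an error-free stream never does.
print(monitor_risk([1] * 10, threshold=0.2))
print(monitor_risk([0] * 100, threshold=0.2))
```

Because the bets are nonnegative, the process only grows when the observed loss exceeds the threshold, which matches the one-sided nature of the null.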

Much like in the example of inference of the prevalence of diabetes in Section 3.1, the e-values are already of the desired form, but additional care must be taken with regard to the limits of the components. As-is, they are bounded in $[0, 1 + \max\{1/(\mathrm{ValRisk} + \epsilon_{\mathrm{tol}}) - 1, 0\}]$, meaning that our collection probabilities would have to be within $[1, 1]$. Similarly to what we did in Section 3.1, we tweak the bounds for the bets $\lambda_i$ to make them bounded within $\big[0, c/(\mathrm{ValRisk} + \epsilon_{\mathrm{tol}})\big]$ for some $0 < c < 1$, leading to components bounded in $[1 - c, 1 + c \max\{1/(\mathrm{ValRisk} + \epsilon_{\mathrm{tol}}) - 1, 0\}]$. We can then solve for the $c$ that satisfies

$$1 - \frac{1 - c}{1 + c \max\{1/(\mathrm{ValRisk} + \epsilon_{\mathrm{tol}}) - 1, 0\}} \le \pi_{\inf}, \tag{4}$$

for a desired labelling budget $\pi_{\inf}$, which we take to be equal to 0.5%.

The methods we consider for our experiment are akin to those of Section 3.1:

Only labelled samples: at every data point $i$, we sample $\xi_i \sim \mathrm{Bern}(\pi_{\inf})$. If $\xi_i = 1$, then we collect that data point and update the non-prediction-powered e-value in Equation (3). Since the data collection is sampled independently of all else, this is a valid e-value, and forms a test supermartingale; moreover, only about $\pi_{\inf} \cdot n$ samples will be collected. However, only data points where $\xi_i = 1$ are used for inference.

Prediction-powered (ours): we compute the prediction-powered e-value atop the base e-value in Equation (3), tweaked to satisfy the boundedness conditions as per Equation (4). We then have two predictive models: one is the predictive model $f$ whose risk we want to monitor, and the other is used for prediction-powered inference; it receives $X_i$ and predicts the 0-1 loss for that point, $\mathbb{1}[f(X_i) \ne Y_i]$. The first model $f$ is held static over the course of the inference, while the one for prediction-powered inference is updated whenever we collect a new label. Collection probabilities $\pi_i(X_i)$ are held constant at $\pi_{\inf}$, leading to label collection matching the baseline of only using labelled samples.

Active prediction-powered (ours): the same as our non-active prediction-powered method, but label collection probabilities are given by the approximately optimal choice presented in Appendix B.2.

Imputation: as a final baseline, we consider simply imputing the 0-1 loss at points where we have not collected the true label, with no regard to prediction-powered inference. This is invalid in general, but commonly used in practice.

Note that standard prediction-powered inference methods (e.g., from (Angelopoulos et al., 2023a)) are no longer directly applicable due to the requirement of anytime-validity, as well as the fact that our hypothesis test does not come from a two-sided test for a mean (which would then be an instance of simple Z-estimation).

To fully assess the hypothesis test, we consider two settings here. In the first setting, there is no change in distribution: the data stream for the inference follows the exact same distribution as training and validation, and thus the null hypothesis should hold. For the second setting, we increasingly poison the labels over the course of time to simulate distribution drift.

The results can be seen in Figure 2. Without data poisoning, none of the methods reject the hypothesis, which is appropriate; though it is interesting to note that our prediction-powered methods were the ones with the highest e-values, managing to stay at around 1. Under data poisoning, both of our prediction-powered methods detect the distribution drift much quicker than the method that only uses labelled samples, despite both having access to the same labelled samples. The active prediction-powered method seems to reject the hypothesis a tiny bit earlier and yields larger e-values (i.e., with more evidence towards rejection), at the cost of just a tiny bit more data. The imputation method seemingly detects the shift even earlier, but does so before the null hypothesis is actually falsified; thus, it produces a false alarm with extremely high confidence.

3.3. Change-Point Detection: Detecting Changes in the Quality of a Deployed Model

Still in the context of testing the cover prediction model of Section 3.2, we now consider not just detecting when the risk goes below a certain level, but detecting any change. E-values have seen good use in the change-point detection literature (Shin et al., 2022; Shekhar & Ramdas, 2023a;b); we opt here for the method proposed in (Shekhar & Ramdas, 2023a), where change-point detection is reduced to a simple algorithm atop confidence sequences initialized at each time step. For the underlying confidence sequences, we use the same ones as in Section 3.1. Compared to Section 3.2, the only change we make to the data is the introduction of a crisp change-point for better visualization.
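To make the reduction concrete, here is a minimal sketch of change detection atop confidence sequences, assuming a stream with values in [0, 1]. It is not the implementation used in the experiments: the confidence sequence is a simple union-bound Hoeffding one (swapped in for those of Section 3.1), the stopping rule (declare a change once the confidence sequences started at every earlier time step have an empty intersection) is our reading of the reduction, and the names `hoeffding_radius` and `detect_change` are hypothetical.

```python
import math

def hoeffding_radius(k, alpha):
    # Hoeffding radius for the mean of k observations in [0, 1]; spending
    # alpha / (k (k + 1)) at sample size k makes the bound time-uniform.
    return math.sqrt(math.log(2.0 * k * (k + 1) / alpha) / (2.0 * k))

def detect_change(stream, alpha=0.05):
    """Return the first (1-indexed) time at which the confidence sequences
    started at every time step so far have an empty intersection, else None."""
    prefix = [0.0]  # prefix[t] = sum of the first t observations
    for t, y in enumerate(stream, start=1):
        prefix.append(prefix[-1] + y)
        lower, upper = -math.inf, math.inf
        for s in range(1, t + 1):  # confidence sequence started at time s
            k = t - s + 1
            mean = (prefix[t] - prefix[s - 1]) / k
            r = hoeffding_radius(k, alpha)
            lower = max(lower, mean - r)
            upper = min(upper, mean + r)
        if lower > upper:  # no parameter value is compatible with all of them
            return t
    return None
```

In a prediction-powered version, each confidence sequence would instead be obtained by inverting prediction-powered e-values, leaving the outer loop unchanged.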

Figure 3 displays a high-frequency exponential moving average of our data (to give a notion of the underlying data stream) that uses data at all points, regardless of whether they are accessible to the analyst; scattered throughout are the few data points that were labelled and that the analyst does have access to. Our prediction-powered method detects the change-point accurately while retaining the strong guarantees of (Shekhar & Ramdas, 2023a), whereas the non-prediction-powered baseline that uses only available labelled data fails to detect any change-point.

Figure 4. Prediction-powered causal discovery with e-values. We compare our prediction-powered causal discovery method with one that uses only labelled data. The lighter nodes correspond to the costly variables, while the darker nodes correspond to cheaper readily-available ones. The standard base method does not detect any edges in the causal graph (denoted by the dashed edges), while ours detects as many edges as the best possible method, which uses all the data points regardless of data acquisition costs.

3.4. Causal Discovery: Constraint-Based Structure Learning with Costly Covariates

Causal inference is essential to any area where one plans interventions, but the usual methods require knowledge of a DAG describing (a simplified version of) the data-generating process. Causal discovery (a.k.a. causal structure learning) methods seek to learn this from data. Some particularly common methods for causal discovery include the PC (Burr, 2003) and FCI (Spirtes et al., 1995) algorithms; both belong to the class of so-called constraint-based structure learning methods, where the DAG is inferred by means of many hypothesis tests for conditional independence. In spite of potential multiple comparison concerns, these algorithms are generally said to be valid as long as the underlying hypothesis tests are valid (i.e., control type-I error).

In this section we consider the problem of causal discovery with the PC algorithm (Burr, 2003) with some costly covariates, which will be generally missing. As is usual in the causal discovery literature, we evaluate on synthetic data generated with a randomly generated DAG, in order to have access to the true DAG. Our DAG features 6 variables, of which 3 are considered costly. Overall, our cheap data $X_i$ consists of the 3 always-available variables, whereas the costly data $Y_i$ consists of the 6 full variables. For constraint-based causal learning, we need to be able to test hypotheses of the form

$$H_0^{(A,B,C)} : A \perp\!\!\!\perp B \mid C,$$

where $A$, $B$ and $C$ consist of subsets of our 6 variables, possibly empty.

There do exist sequential e-value tests for conditional independence (Shaer et al., 2022; Grünwald et al., 2022), but they work under the Model-X framework, which requires knowledge of conditionals that are typically inaccessible in the context of causal discovery. We thus opt instead for the partial correlation test based on Fisher's z-transformation, which is commonly used in causal discovery implementations (e.g., (Markus Kalisch et al., 2012; Zheng et al., 2024)). However, this test is based on p-values, is not sequential, is asymptotic, and relies on rather heavy normality assumptions.
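For reference, the following is a minimal pure-Python sketch of the Fisher z partial correlation test in its textbook form. It is not the code of pcalg or causal-learn, and the function names are hypothetical. Partial correlations are computed with the standard recursive formula, and the p-value is the two-sided normal tail of $\sqrt{n - |C| - 3}\,|z|$.

```python
import math

def corr(x, y):
    # Pearson correlation of two equally long sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(a, b, cond):
    # Recursive formula: remove one conditioning variable at a time.
    if not cond:
        return corr(a, b)
    c, rest = cond[-1], cond[:-1]
    r_ab = partial_corr(a, b, rest)
    r_ac = partial_corr(a, c, rest)
    r_bc = partial_corr(b, c, rest)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def fisher_z_pvalue(a, b, cond=()):
    """Asymptotic p-value for H0: a is independent of b given cond."""
    cond = list(cond)
    n = len(a)
    r = partial_corr(a, b, cond)
    r = max(min(r, 1 - 1e-12), -1 + 1e-12)   # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r))    # Fisher z-transformation
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))    # equals 2 * (1 - Phi(stat))
```

The recursion is exponential in the size of the conditioning set, which is acceptable for a handful of variables; a precision-matrix computation would be the usual choice for larger sets.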

We first need to adapt it to our required form, following Equation (1). To do so, we first rearrange our data stream $(X_i, Y_i)_{i=1}^{\infty}$ to arrive in batches of $B$ samples, $(X^{\mathrm{batch}}_j, Y^{\mathrm{batch}}_j)_{j=1}^{\infty}$; these batches will be the unit of data for our prediction-powered procedure. We can then compute the test's p-value for each batch, and calibrate this p-value into an e-value by means of the following PToE calibrator (Vovk & Wang, 2019):

$$\mathrm{PToE}(p) = \frac{1 - p + p \log p}{p\,(-\log p)^2}.$$

To ensure that our e-value components are appropriately bounded, we first clip the p-values (prior to calibration) to lie within $(10^{-7}, 1]$ (so that they are bounded at all; this clipping preserves the validity of the p-values), and then rescale the calibrated e-values by means of a rescaling function

$$\mathrm{rescale}_\eta(e) := \eta\,(e - 1) + 1,$$

with $\eta$ chosen so as to satisfy a labelling budget of $\pi_{\inf} = 10\%$ (as in the previous sections). Because the p-values are only valid asymptotically, the batch size $B$ cannot be too small; we use $B = 100$.
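The clipping, calibration and rescaling steps can be sketched as follows. The helper names are hypothetical, and `eta_for_budget` is our own derivation rather than a formula from the paper: it assumes the budget is met by setting $\pi_{\inf} = 1 - a/b$, where $a$ and $b$ are the smallest and largest attainable rescaled e-values (reached at $p = 1$ and at the clipping level, respectively, since the calibrator is decreasing in $p$).

```python
import math

P_MIN = 1e-7  # lower clipping level for the p-values, as in the text

def p_to_e(p):
    """PToE calibrator: (1 - p + p log p) / (p (-log p)^2), for p in (0, 1]."""
    t = -math.log(p)
    if t < 1e-8:  # p is (numerically) 1: the ratio is 0/0, with limit 1/2
        return 0.5
    return (1.0 - p + p * math.log(p)) / (p * t * t)

def clipped_rescaled_e(p, eta):
    p = min(max(p, P_MIN), 1.0)    # clip the p-value; raising it keeps validity
    return eta * (p_to_e(p) - 1.0) + 1.0   # rescale_eta(e) = eta (e - 1) + 1

def eta_for_budget(pi_inf):
    # Assumption: choose eta so that 1 - a/b = pi_inf, with
    # a = rescale_eta(p_to_e(1)) and b = rescale_eta(p_to_e(P_MIN)).
    e_max = p_to_e(P_MIN)
    return pi_inf / (0.5 + (1.0 - pi_inf) * (e_max - 1.0))
```

Because the calibrated e-value at the clipping level is very large, the resulting $\eta$ is tiny under this assumption, which illustrates the trade-off between the labelling budget and the size of each individual e-value component.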

The results can be seen in Figure 4. When using only labelled data according to our data collection budget, the causal discovery method identifies no edges at all. By using our prediction-powered e-values, we detect over half of the edges, matching the best possible scenario (i.e., what would happen if we had access to the whole dataset). In terms of the average structural Hamming distance over various sampled graphs, using only labelled data we obtain an average 12.5; our method halves this to 6.7, and the best possible scenario would obtain that of 6.4.

Acknowledgements

This research is funded by Canada's International Development Research Centre (IDRC) (Grant No. 109981) and UK International Development. CJS acknowledges financial support from CNPq and FAPERJ.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning and Statistics. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., and Zrnic, T. Prediction-powered inference. Science, 382:669–674, 2023a. URL https://api.semanticscholar.org/CorpusID:256105365.

Angelopoulos, A. N., Duchi, J. C., and Zrnic, T. PPI++: Efficient prediction-powered inference. ArXiv, abs/2311.01453, 2023b. URL https://api.semanticscholar.org/CorpusID:264935590.

Bar, Y., Shaer, S., and Romano, Y. Protected test-time adaptation via online entropy matching: A betting approach. ArXiv, abs/2408.07511, 2024. URL https://api.semanticscholar.org/CorpusID:271865850.

Blackard, J. Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N.

Boyeau, P., Angelopoulos, A. N., Yosef, N., Malik, J., and Jordan, M. I. Autoeval done right: Using synthetic data for model evaluation. ArXiv, abs/2403.07008, 2024. URL https://api.semanticscholar.org/CorpusID:268363495.

Burr, T. L. Causation, prediction, and search. Technometrics, 45:272–273, 2003. URL https://api.semanticscholar.org/CorpusID:10562706.

CDC. CDC 2014 BRFSS survey data and documentation, 2015. URL https://www.cdc.gov/brfss/annual_data/annual_2014.html. Last accessed 30 January 2025.

Chatzi, I., Straitouri, E., Thejaswi, S., and Rodriguez, M. G. Prediction-powered ranking of large language models. ArXiv, abs/2402.17826, 2024. URL https://api.semanticscholar.org/CorpusID:268041436.

Grünwald, P. Beyond Neyman-Pearson: E-values enable hypothesis testing with a data-driven alpha. Proceedings of the National Academy of Sciences of the United States of America, 121(39):e2302098121, 2022. URL https://api.semanticscholar.org/CorpusID:248496494.

Grünwald, P., Henzi, A., and Lardy, T. Anytime-valid tests of conditional independence under model-X. Journal of the American Statistical Association, 119:1554–1565, 2022. URL https://api.semanticscholar.org/CorpusID:252531771.

Gu, Y. and Xia, D. Local prediction-powered inference. ArXiv, abs/2409.18321, 2024. URL https://api.semanticscholar.org/CorpusID:272968866.

Howard, S. R., Ramdas, A., McAuliffe, J. D., and Sekhon, J. S. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 2018. URL https://api.semanticscholar.org/CorpusID:219767477.

Kelly, J. L. A new interpretation of information rate. IRE Trans. Inf. Theory, 2:185–189, 1956. URL https://api.semanticscholar.org/CorpusID:16143351.

Koning, N. W. Post-hoc α hypothesis testing and the post-hoc p-value. 2023. URL https://api.semanticscholar.org/CorpusID:266191165.

Little, R. J. A. and Rubin, D. B. Statistical analysis with missing data, third edition. Wiley Series in Probability and Statistics, 2019. URL https://api.semanticscholar.org/CorpusID:60779615.

Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012. doi: 10.18637/jss.v047.i11.

Miao, J., Wu, Y., Sun, Z., Miao, X., Lu, T., Zhao, J., and Lu, Q. Valid inference for machine learning-assisted genome-wide association studies. Nature Genetics, 2024. URL https://api.semanticscholar.org/CorpusID:272989144.

Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: Foundations and learning algorithms. 2017. URL https://api.semanticscholar.org/CorpusID:86533208.

Podkopaev, A. and Ramdas, A. Tracking the risk of a deployed model and detecting harmful distribution shifts. ArXiv, abs/2110.06177, 2021. URL https://api.semanticscholar.org/CorpusID:238634210.

Podkopaev, A. and Ramdas, A. Sequential predictive two-sample and independence testing. ArXiv, abs/2305.00143, 2023a. URL https://api.semanticscholar.org/CorpusID:258426601.

Podkopaev, A. and Ramdas, A. Sequential predictive two-sample and independence testing. ArXiv, abs/2305.00143, 2023b. URL https://api.semanticscholar.org/CorpusID:258426601.

Ramdas, A. Proof of Ville's inequality, 2018.

Ramdas, A. and Wang, R. Hypothesis testing with e-values. 2024. URL https://api.semanticscholar.org/CorpusID:273707651.

Ramdas, A., Grünwald, P. D., Vovk, V., and Shafer, G. Game-theoretic statistics and safe anytime-valid inference. ArXiv, abs/2210.01948, 2022. URL https://api.semanticscholar.org/CorpusID:252715629.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89:846–866, 1994. URL https://api.semanticscholar.org/CorpusID:120769390.

Shaer, S., Maman, G., and Romano, Y. Model-X sequential testing for conditional independence via testing by betting. In International Conference on Artificial Intelligence and Statistics, 2022. URL https://api.semanticscholar.org/CorpusID:252683086.

Shekhar, S. and Ramdas, A. Reducing sequential change detection to sequential estimation. ArXiv, abs/2309.09111, 2023a. URL https://api.semanticscholar.org/CorpusID:262043770.

Shekhar, S. and Ramdas, A. Sequential change detection via backward confidence sequences. ArXiv, abs/2302.02544, 2023b. URL https://api.semanticscholar.org/CorpusID:256615301.

Shin, J., Ramdas, A., and Rinaldo, A. E-detectors: A nonparametric framework for sequential change detection. The New England Journal of Statistics in Data Science, 2022. URL https://api.semanticscholar.org/CorpusID:258426776.

Spirtes, P., Meek, C., and Richardson, T. S. Causal inference in the presence of latent variables and selection bias. In Conference on Uncertainty in Artificial Intelligence, 1995. URL https://api.semanticscholar.org/CorpusID:11987717.

Ville, J.-L. Etude critique de la notion de collectif. 1939. URL https://api.semanticscholar.org/CorpusID:123425777.

Vovk, V. and Wang, R. E-values: Calibration, combination, and applications. Political Methods: Quantitative Methods eJournal, 2019. URL https://api.semanticscholar.org/CorpusID:221834569.

Wang, R. and Ramdas, A. False discovery rate control with e-values. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84:822–852, 2020. URL https://api.semanticscholar.org/CorpusID:221516157.

Waudby-Smith, I. and Ramdas, A. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2020. URL https://api.semanticscholar.org/CorpusID:240070804.

Waudby-Smith, I., Arbour, D. T., Sinha, R., Kennedy, E. H., and Ramdas, A. Time-uniform central limit theory and asymptotic confidence sequences. The Annals of Statistics, 2021. URL https://api.semanticscholar.org/CorpusID:257901246.

Waudby-Smith, I., Wu, L., Ramdas, A., Karampatziakis, N., and Mineiro, P. Anytime-valid off-policy inference for contextual bandits. ArXiv, abs/2210.10768, 2022. URL https://api.semanticscholar.org/CorpusID:252992535.

Xu, Z., Wang, R., and Ramdas, A. Post-selection inference for e-value based confidence intervals. Electronic Journal of Statistics, 2022. URL https://api.semanticscholar.org/CorpusID:247619119.

Xu, Z., Karampatziakis, N., and Mineiro, P. Active, anytime-valid risk controlling prediction sets. ArXiv, abs/2406.10490, 2024. URL https://api.semanticscholar.org/CorpusID:270559841.

Zheng, Y., Huang, B., Chen, W., Ramsey, J., Gong, M., Cai, R., Shimizu, S., Spirtes, P., and Zhang, K. Causal-learn: Causal discovery in Python. Journal of Machine Learning Research, 25(60):1–8, 2024.

Zrnic, T. and Candès, E. J. Cross-prediction-powered inference. Proceedings of the National Academy of Sciences of the United States of America, 121, 2023. URL https://api.semanticscholar.org/CorpusID:263134612.

Zrnic, T. and Candès, E. J. Active statistical inference. ArXiv, abs/2403.03208, 2024. URL https://api.semanticscholar.org/CorpusID:268248530.


Throughout, we denote by $\mathcal{F}_i$ the $i$-th element of the underlying data filtration.

Theorem A.1 (Theorem 2.1 in the main text). $E^{\mathrm{ppi}}_n$ is a valid e-value for the null $H_0$. Additionally:

(i) If $(E_0, E_1, \ldots)$ form a test supermartingale (i.e., a nonnegative supermartingale with $\mathbb{E}[E_0] \le 1$ under the null $H_0$), then so is $(E^{\mathrm{ppi}}_0, E^{\mathrm{ppi}}_1, \ldots)$;

(ii) More generally, if $(E_0, E_1, \ldots)$ form an e-process (i.e., a nonnegative stochastic process such that, for all stopping times $\tau$, the null $H_0$ implies that $\mathbb{E}[E_\tau] \le 1$), then so is $(E^{\mathrm{ppi}}_0, E^{\mathrm{ppi}}_1, \ldots)$ for all finite stopping times.

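Proof. First, note that $E^{\mathrm{ppi}}_n$ is always nonnegative for all $n \in \mathbb{N}$: by induction, it holds for $n = 0$ (where $E^{\mathrm{ppi}}_n = E^{\mathrm{ppi}}_0 = 1$), and, for the inductive step,

$$E^{\mathrm{ppi}}_{n+1} = E^{\mathrm{ppi}}_n \cdot \left( e_{n+1}(\mu_{n+1}(X_{n+1})) + \big(e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1}))\big) \frac{\xi_{n+1}}{\pi_{n+1}(X_{n+1})} \right) \ge 0$$

$$\Longleftarrow \quad e_{n+1}(\mu_{n+1}(X_{n+1})) + \big(e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1}))\big) \frac{\xi_{n+1}}{\pi_{n+1}(X_{n+1})} \ge 0;$$

If $\xi_{n+1} = 0$, then the left-hand side equals $e_{n+1}(\mu_{n+1}(X_{n+1})) \ge a_{n+1} > 0$. Otherwise, it equals

$$\frac{e_{n+1}(Y_{n+1}) - (1 - \pi_{n+1}(X_{n+1}))\, e_{n+1}(\mu_{n+1}(X_{n+1}))}{\pi_{n+1}(X_{n+1})} \ge \frac{a_{n+1} - (1 - \pi_{n+1}(X_{n+1}))\, b_{n+1}}{\pi_{n+1}(X_{n+1})} \ge 0$$

$$\iff a_{n+1} - (1 - \pi_{n+1}(X_{n+1}))\, b_{n+1} \ge 0 \iff a_{n+1} \ge (1 - \pi_{n+1}(X_{n+1}))\, b_{n+1} \iff 1 - a_{n+1}/b_{n+1} \le \pi_{n+1}(X_{n+1}),$$

which holds by construction.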
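As a quick numerical illustration of this argument, the sketch below implements a single prediction-powered factor and the smallest admissible labelling probability. The function names are hypothetical; `e_pred` stands for $e_i(\mu_i(X_i))$, `e_true` for $e_i(Y_i)$, and both are assumed to lie in $[a, b]$ with $0 < a \le b$.

```python
def ppi_factor(e_pred, e_true, xi, pi):
    """One prediction-powered factor: e(mu(X)) + (xi / pi) * (e(Y) - e(mu(X))).

    When xi == 0 the label is not collected and e_true is ignored."""
    return e_pred + (xi / pi) * (e_true - e_pred)

def min_label_probability(a, b):
    # Smallest labelling probability that keeps every factor nonnegative
    # when all e-value components lie in [a, b].
    return 1.0 - a / b
```

Averaging the factor over the labelling indicator, with weights `pi` and `1 - pi`, recovers `e_true` exactly, which is the identity that drives the validity proof below.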

So all that remains is to show that its properties under the null hold. Hence, from here on out, we assume that the null $H_0$ is true.

We will first show that, for any $n \in \mathbb{N}$, $\mathbb{E}[E^{\mathrm{ppi}}_n] \le 1$. To do so, we will first prove the following lemma by backward induction:

Lemma A.2. Let $n \in \mathbb{N}$ and let $A$ denote an event. Then, for any $1 \le k \le n$, it holds that

$$\mathbb{E}\Big[\prod_{i=k}^{n} e^{\mathrm{ppi}}_i \,\Big|\, A, \mathcal{F}_{k-1}\Big] = \mathbb{E}\Big[\prod_{i=k}^{n} e_i(Y_i) \,\Big|\, A, \mathcal{F}_{k-1}\Big].$$

Proof. The base case is when k = n. Then

i=k eppi i | A, Fk 1

= E eppi n | A, Fn 1 = E en(µn(Xn)) + ξn πn(Xn) (en(Yn) en(µn(Xn))) | A, Fn 1

= E [en(µn(Xn)) | A, Fn 1] + E ξn πn(Xn) (en(Yn) en(µn(Xn))) | ξn = 1, A, Fn 1

P[ξn = 1 | A, Fn 1]

+ E ξn πn(Xn) (en(Yn) en(µn(Xn))) | ξn = 0, A, Fn 1

P[ξn = 0 | A, Fn 1]

= E [en(µn(Xn)) | A, Fn 1] + E 1 πn(Xn) (en(Yn) en(µn(Xn))) | A, Fn 1

= E [en(µn(Xn)) | A, Fn 1] + E [en(Yn) en(µn(Xn)) | A, Fn 1]

= E [en(Yn) | A, Fn 1] = E

i=k ei(Yi) | A, Fk 1

Prediction-Powered E-Values

For the induction step, given that the hypothesis holds for k + 1 n, we want to show that it holds for k. It follows, using the law of total expectation:

i=k eppi i | A, Fk 1

i=k+1 eppi i | A, Fk 1

i=k+1 eppi i | A, Fk

i=k+1 eppi i | A, Fk

i=k+1 ei(Yi) | A, Fk

" ek(µk(Xk)) + ξk πk(Xk) (ek(Yk) ek(µk(Xk))) E

i=k+1 ei(Yi) | A, Fk

ek(µk(Xk)) E

i=k+1 ei(Yi) | A, Fk

" ξk πk(Xk) (ek(Yk) ek(µk(Xk))) E

i=k+1 ei(Yi) | A, Fk

| ξk = 1, A, Fk 1

P[ξk = 1 | A, Fk 1]

" ξk πk(Xk) (ek(Yk) ek(µk(Xk))) E

i=k+1 ei(Yi) | A, Fk

| ξk = 0, A, Fk 1

P[ξk = 0 | A, Fk 1]

ek(µk(Xk)) E

i=k+1 ei(Yi) | A, Fk

" 1 πk(Xk) (ek(Yk) ek(µk(Xk))) E

i=k+1 ei(Yi) | A, Fk

ek(µk(Xk)) E

i=k+1 ei(Yi) | A, Fk

(ek(Yk) ek(µk(Xk))) E

i=k+1 ei(Yi) | A, Fk

i=k+1 ei(Yi) | A, Fk

i=k+1 ei(Yi) | A, Fk

i=k ei(Yi) | A, Fk

i=k ei(Yi) | A, Fk 1

as we desired.

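By picking $k = 1$ and $A$ to be a trivial event in Lemma A.2, we conclude that $\mathbb{E}[E^{\mathrm{ppi}}_n] = \mathbb{E}[\prod_{i=1}^{n} e^{\mathrm{ppi}}_i \mid \mathcal{F}_0] = \mathbb{E}[\prod_{i=1}^{n} e_i(Y_i) \mid \mathcal{F}_0] = \mathbb{E}[E_n] \le 1$, and so $E^{\mathrm{ppi}}_n$ is a valid e-value.

Now let us show that, if the underlying e-values form a test supermartingale, then so does the prediction-powered process. By definition $E^{\mathrm{ppi}}_0 = E_0 = 1$, and so all we need to show is that $\mathbb{E}[E^{\mathrm{ppi}}_{n+1} \mid \mathcal{F}_n] \le E^{\mathrm{ppi}}_n$. It follows:

$$\begin{aligned}
\mathbb{E}[E^{\mathrm{ppi}}_{n+1} \mid \mathcal{F}_n]
&= \mathbb{E}[e^{\mathrm{ppi}}_{n+1} E^{\mathrm{ppi}}_n \mid \mathcal{F}_n] = \mathbb{E}[e^{\mathrm{ppi}}_{n+1} \mid \mathcal{F}_n]\, E^{\mathrm{ppi}}_n \\
&= \mathbb{E}\Big[e_{n+1}(\mu_{n+1}(X_{n+1})) + \frac{\xi_{n+1}}{\pi_{n+1}(X_{n+1})} \big(e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1}))\big) \,\Big|\, \mathcal{F}_n\Big]\, E^{\mathrm{ppi}}_n \\
&= \mathbb{E}\big[e_{n+1}(\mu_{n+1}(X_{n+1})) \mid \mathcal{F}_n\big]\, E^{\mathrm{ppi}}_n \\
&\quad + \mathbb{E}\Big[\frac{\xi_{n+1}}{\pi_{n+1}(X_{n+1})} \big(e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1}))\big) \,\Big|\, \xi_{n+1} = 1, \mathcal{F}_n\Big]\, \mathbb{P}[\xi_{n+1} = 1 \mid \mathcal{F}_n]\, E^{\mathrm{ppi}}_n \\
&\quad + \mathbb{E}\Big[\frac{\xi_{n+1}}{\pi_{n+1}(X_{n+1})} \big(e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1}))\big) \,\Big|\, \xi_{n+1} = 0, \mathcal{F}_n\Big]\, \mathbb{P}[\xi_{n+1} = 0 \mid \mathcal{F}_n]\, E^{\mathrm{ppi}}_n \\
&= \mathbb{E}\big[e_{n+1}(\mu_{n+1}(X_{n+1})) \mid \mathcal{F}_n\big]\, E^{\mathrm{ppi}}_n \\
&\quad + \mathbb{E}\Big[\frac{1}{\pi_{n+1}(X_{n+1})} \big(e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1}))\big) \,\Big|\, \xi_{n+1} = 1, \mathcal{F}_n\Big]\, \pi_{n+1}(X_{n+1})\, E^{\mathrm{ppi}}_n \\
&= \mathbb{E}\big[e_{n+1}(\mu_{n+1}(X_{n+1})) \mid \mathcal{F}_n\big]\, E^{\mathrm{ppi}}_n + \mathbb{E}\big[e_{n+1}(Y_{n+1}) - e_{n+1}(\mu_{n+1}(X_{n+1})) \mid \xi_{n+1} = 1, \mathcal{F}_n\big]\, E^{\mathrm{ppi}}_n \\
&= \mathbb{E}\big[e_{n+1}(Y_{n+1}) \mid \mathcal{F}_n\big]\, E^{\mathrm{ppi}}_n \\
&= \mathbb{E}\big[e_{n+1}(Y_{n+1}) \mid \mathcal{F}_n\big]\, E_n\, \frac{E^{\mathrm{ppi}}_n}{E_n} \\
&\le E_n\, \frac{E^{\mathrm{ppi}}_n}{E_n} = E^{\mathrm{ppi}}_n.
\end{aligned}$$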

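Finally, we assume that the underlying e-values form an e-process for finite stopping times. We want to show that, for any finite stopping time $\tau$, $\mathbb{E}[E^{\mathrm{ppi}}_\tau] \le 1$. By the law of total expectation,

$$\mathbb{E}[E^{\mathrm{ppi}}_\tau] = \mathbb{E}\big[\mathbb{E}[E^{\mathrm{ppi}}_\tau \mid \tau]\big];$$

When $\tau = n$, for each $n \in \mathbb{N}$, by Lemma A.2 with $k = 1$ and $A = \{\tau = n\}$, it holds that

$$\mathbb{E}[E^{\mathrm{ppi}}_n \mid \tau = n] = \mathbb{E}\Big[\prod_{i=1}^{n} e^{\mathrm{ppi}}_i \,\Big|\, \tau = n, \mathcal{F}_0\Big] = \mathbb{E}\Big[\prod_{i=1}^{n} e_i(Y_i) \,\Big|\, \tau = n, \mathcal{F}_0\Big] = \mathbb{E}[E_n \mid \tau = n].$$

Thus

$$\mathbb{E}[E^{\mathrm{ppi}}_\tau] = \mathbb{E}\big[\mathbb{E}[E^{\mathrm{ppi}}_\tau \mid \tau]\big] = \mathbb{E}\big[\mathbb{E}[E_\tau \mid \tau]\big] = \mathbb{E}[E_\tau] \le 1,$$

and so $E^{\mathrm{ppi}}$ is an e-process.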

To prove the main theorem about power of our prediction-powered e-values, we will use the following change-of-measure lemma based on the Wasserstein distance:

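Lemma A.3. For any distributions $P$ and $Q$ over some space $\mathcal{Z}$ and any $L$-Lipschitz function $\phi : \mathcal{Z} \to \mathbb{R}$,

$$|\mathbb{E}_P[\phi] - \mathbb{E}_Q[\phi]| \le L \cdot W(P \,\|\, Q),$$

where $W(P \,\|\, Q)$ is the Wasserstein distance between $P$ and $Q$.

Proof. The proof follows immediately from the representation of the Wasserstein distance as an integral probability metric (IPM). The Wasserstein distance, written as an IPM, is

$$W(P \,\|\, Q) = \sup_{\|f\|_{\mathrm{Lip}} = 1} |\mathbb{E}_P[f] - \mathbb{E}_Q[f]|.$$

If $\phi$ is $L$-Lipschitz, then $\phi/L$ is 1-Lipschitz, and so

$$|\mathbb{E}_P[\phi] - \mathbb{E}_Q[\phi]| = |L\,\mathbb{E}_P[\phi/L] - L\,\mathbb{E}_Q[\phi/L]| = L\,|\mathbb{E}_P[\phi/L] - \mathbb{E}_Q[\phi/L]| \le L \sup_{\|f\|_{\mathrm{Lip}} = 1} |\mathbb{E}_P[f] - \mathbb{E}_Q[f]| = L \cdot W(P \,\|\, Q).$$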


We will also use a simple upper bound on the Wasserstein distance, showing that it is upper bounded by the MAE.

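Lemma A.4. For any distributions $P$ and $Q$ over some normed space $(\mathcal{Z}, \|\cdot\|)$,

$$W(P \,\|\, Q) \le \mathbb{E}_{Z_P \sim P,\, Z_Q \sim Q}\big[\|Z_P - Z_Q\|\big].$$

Proof. By the dual representation of the Wasserstein distance,

$$\begin{aligned}
W(P \,\|\, Q) &= \sup_{\|f\|_{\mathrm{Lip}} = 1} \big|\mathbb{E}_{Z_P \sim P}[f(Z_P)] - \mathbb{E}_{Z_Q \sim Q}[f(Z_Q)]\big| = \sup_{\|f\|_{\mathrm{Lip}} = 1} \big|\mathbb{E}_{Z_P \sim P,\, Z_Q \sim Q}[f(Z_P) - f(Z_Q)]\big| \\
&\le \sup_{\|f\|_{\mathrm{Lip}} = 1} \mathbb{E}_{Z_P \sim P,\, Z_Q \sim Q}\big[|f(Z_P) - f(Z_Q)|\big] \le \sup_{\|f\|_{\mathrm{Lip}} = 1} \mathbb{E}_{Z_P \sim P,\, Z_Q \sim Q}\big[\|Z_P - Z_Q\|\big] \\
&= \mathbb{E}_{Z_P \sim P,\, Z_Q \sim Q}\big[\|Z_P - Z_Q\|\big].
\end{aligned}$$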
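The bound can be checked numerically on the real line. The sketch below is a sanity check rather than part of the proof, and its function names are hypothetical: it compares the Wasserstein-1 distance between two empirical distributions with the mean absolute error of a given pairing of predictions and labels, which is one particular coupling of the two distributions.

```python
def wasserstein_1d(xs, ys):
    # W1 between two empirical distributions on the real line with equally
    # many atoms: the optimal coupling matches order statistics.
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def mean_absolute_error(preds, labels):
    # Cost of the coupling that pairs each prediction with its own label.
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(preds)
```

A predictor can be far off pointwise and still match the label distribution well, in which case the Wasserstein distance is much smaller than the MAE.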

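Theorem A.5 (Theorem 2.2 in the main text). Suppose that the $e_i(\cdot)$ are each $L_i$-Lipschitz, and that $\pi_i(X_i) \ge 1 - a_i/b_i + \epsilon_i$ for some $\epsilon_i > 0$, for all $i$. Then there exists some constant $c > 0$ independent of $n$ such that

$$\mathbb{E}\Big[\frac{1}{n} \log E^{\mathrm{ppi}}_n\Big] \ge \mathbb{E}\Big[\frac{1}{n} \log E_n\Big] - c \cdot \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\|\mu_i(X_i) - Y_i\|\big].$$

Proof. First, note that

$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{n} \log E^{\mathrm{ppi}}_n\Big]
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\log e^{\mathrm{ppi}}_i\big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(\mu_i(X_i))\big]\Big)\Big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(\mu_i(X_i))\big]\Big) \,\Big|\, Y_i, \xi_i, \pi_i(X_i), \mathcal{F}_{i-1}\Big]\Big].
\end{aligned}$$

The inner expectation in the last line is random only over $\mu_i(X_i)$. Moreover, thanks to our assumptions, the value we are taking the expectation over is Lipschitz as a function of $\mu_i(X_i)$: because of the lower bound on the $\pi_i(X_i)$ with positive margins $\epsilon_i$, the value within the log is bounded away from zero, and so the log becomes Lipschitz with some constant $u > 0$:

$$\Big\|\log\Big(e_i(\cdot) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(\cdot)\big]\Big)\Big\|_{\mathrm{Lip}} \le u\, \Big\|e_i(\cdot) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(\cdot)\big]\Big\|_{\mathrm{Lip}};$$

If $\xi_i = 0$, then this equals $u\, \|e_i(\cdot)\|_{\mathrm{Lip}} = u L_i$. Otherwise, this equals

$$\begin{aligned}
u\, \Big\|e_i(\cdot) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(\cdot)\big]\Big\|_{\mathrm{Lip}}
&= u\, \Big\|\frac{e_i(Y_i) - (1 - \pi_i(X_i))\, e_i(\cdot)}{\pi_i(X_i)}\Big\|_{\mathrm{Lip}} \\
&= \frac{u}{\pi_i(X_i)} \big\|e_i(Y_i) - (1 - \pi_i(X_i))\, e_i(\cdot)\big\|_{\mathrm{Lip}} \\
&= \frac{u}{\pi_i(X_i)} \big\|(1 - \pi_i(X_i))\, e_i(\cdot)\big\|_{\mathrm{Lip}} \\
&= \frac{u\,(1 - \pi_i(X_i))}{\pi_i(X_i)} \|e_i(\cdot)\|_{\mathrm{Lip}} = u L_i\, \frac{1 - \pi_i(X_i)}{\pi_i(X_i)}.
\end{aligned}$$

In either case, this Lipschitz constant is upper bounded by $c := u L_i \max\big\{\frac{1 - \pi_i(X_i)}{\pi_i(X_i)}, 1\big\}$ (which does not depend on $n$).

Hence, by Lemma A.3,

$$\begin{aligned}
&\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(\mu_i(X_i))\big]\Big) \,\Big|\, Y_i, \xi_i, \pi_i(X_i), \mathcal{F}_{i-1}\Big]\Big] \\
&\quad \ge \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\Big[\log\Big(e_i(Y_i) + \frac{\xi_i}{\pi_i(X_i)} \big[e_i(Y_i) - e_i(Y_i)\big]\Big) \,\Big|\, Y_i, \xi_i, \pi_i(X_i), \mathcal{F}_{i-1}\Big] - c\, W(\mu_i(X_i) \,\|\, Y_i)\Big] \\
&\quad = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\big[\log e_i(Y_i) \mid Y_i, \xi_i, \pi_i(X_i), \mathcal{F}_{i-1}\big] - c\, W(\mu_i(X_i) \,\|\, Y_i)\Big] \\
&\quad = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\log e_i(Y_i)\big] - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[c\, W(\mu_i(X_i) \,\|\, Y_i)\big] \\
&\quad = \mathbb{E}\Big[\frac{1}{n} \log E_n\Big] - c \cdot \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[W(\mu_i(X_i) \,\|\, Y_i)\big].
\end{aligned}$$

Apply Lemma A.4 and conclude.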

The following is a more precise statement about the growth rate of our prediction-powered e-values, albeit less directly interpretable:

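Theorem A.6. It holds that

$$\mathbb{E}\Big[\frac{1}{n} \log E^{\mathrm{ppi}}_n\Big] = \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} (1 - \pi_i(X_i)) \log e_i(\mu_i(X_i))\Big] + \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} \pi_i(X_i) \log\Big(e_i(\mu_i(X_i)) + \frac{e_i(Y_i) - e_i(\mu_i(X_i))}{\pi_i(X_i)}\Big)\Big].$$

Proof. It follows that

$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{n} \log E^{\mathrm{ppi}}_n\Big]
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\log e^{\mathrm{ppi}}_i\big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big) \frac{\xi_i}{\pi_i(X_i)}\Big)\Big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big) \frac{\xi_i}{\pi_i(X_i)}\Big) \,\Big|\, \mathcal{F}_{i-1}\Big]\Big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big) \frac{\xi_i}{\pi_i(X_i)}\Big) \,\Big|\, \xi_i = 1, \mathcal{F}_{i-1}\Big]\, \mathbb{P}[\xi_i = 1 \mid \mathcal{F}_{i-1}] \\
&\qquad\qquad + \mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big) \frac{\xi_i}{\pi_i(X_i)}\Big) \,\Big|\, \xi_i = 0, \mathcal{F}_{i-1}\Big]\, \mathbb{P}[\xi_i = 0 \mid \mathcal{F}_{i-1}]\Big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\mathbb{E}\Big[\log\Big(e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big) \frac{1}{\pi_i(X_i)}\Big) \,\Big|\, \mathcal{F}_{i-1}\Big]\, \pi_i(X_i) \\
&\qquad\qquad + \mathbb{E}\big[\log e_i(\mu_i(X_i)) \mid \mathcal{F}_{i-1}\big]\, (1 - \pi_i(X_i))\Big] \\
&= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\pi_i(X_i) \log\Big(e_i(\mu_i(X_i)) + \big(e_i(Y_i) - e_i(\mu_i(X_i))\big) \frac{1}{\pi_i(X_i)}\Big) + (1 - \pi_i(X_i)) \log e_i(\mu_i(X_i))\Big] \\
&= \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} (1 - \pi_i(X_i)) \log e_i(\mu_i(X_i))\Big] + \mathbb{E}\Big[\frac{1}{n} \sum_{i=1}^{n} \pi_i(X_i) \log\Big(e_i(\mu_i(X_i)) + \frac{e_i(Y_i) - e_i(\mu_i(X_i))}{\pi_i(X_i)}\Big)\Big].
\end{aligned}$$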

To prove the next result we will make use of Ville's inequality:

Theorem A.7 (Ville's inequality (Ville, 1939; Ramdas, 2018)). For any nonnegative supermartingale $(L_t)$ and any $x > 1$, define the (possibly infinite) stopping time $N := \inf\{t \ge 1 : L_t \ge x\}$. Then

$$\mathbb{P}[\exists t : L_t \ge x] \le \frac{\mathbb{E}[L_0]}{x}.$$

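Proposition A.8 (Proposition 2.3 in the main text). $C^{\mathrm{ppi}(\alpha)}_n$ is a valid confidence interval, i.e., $\mathbb{P}[\theta^\star \in C^{\mathrm{ppi}(\alpha)}_n] \ge 1 - \alpha$. Moreover:

(i) If the underlying e-values form a nonnegative supermartingale, then the prediction-powered intervals are anytime-valid (also known as confidence sequences): $\mathbb{P}[\forall n \in \mathbb{N},\ \theta^\star \in C^{\mathrm{ppi}(\alpha)}_n] \ge 1 - \alpha$;

(ii) More generally, if the underlying e-values form e-processes, then the prediction-powered intervals are valid at arbitrary stopping times: $\mathbb{P}[\theta^\star \in C^{\mathrm{ppi}(\alpha)}_\tau] \ge 1 - \alpha$ for any stopping time $\tau$.

Proof. By construction, $\mathbb{P}[\theta^\star \in C^{\mathrm{ppi}(\alpha)}_n] = \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_n \le 1/\alpha] = 1 - \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_n > 1/\alpha]$. By Markov's inequality, considering that the null $H^{\theta^\star}_0$ holds and using Theorem 2.1,

$$1 - \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_n > 1/\alpha] \ge 1 - \frac{\mathbb{E}[E^{\mathrm{ppi}(\theta^\star)}_n]}{1/\alpha} \ge 1 - \frac{1}{1/\alpha} = 1 - \alpha.$$

If the underlying e-values form a test supermartingale, then by Theorem 2.1 so do the prediction-powered e-values; then, using Ville's inequality,

$$\begin{aligned}
\mathbb{P}[\forall n \in \mathbb{N},\ \theta^\star \in C^{\mathrm{ppi}(\alpha)}_n] &= \mathbb{P}[\forall n \in \mathbb{N},\ E^{\mathrm{ppi}(\theta^\star)}_n \le 1/\alpha] = \mathbb{P}\big[\sup_n E^{\mathrm{ppi}(\theta^\star)}_n \le 1/\alpha\big] \\
&= 1 - \mathbb{P}\big[\sup_n E^{\mathrm{ppi}(\theta^\star)}_n > 1/\alpha\big] \ge 1 - \frac{\mathbb{E}[E^{\mathrm{ppi}(\theta^\star)}_0]}{1/\alpha} = 1 - \frac{1}{1/\alpha} = 1 - \alpha.
\end{aligned}$$

Finally, if the underlying e-values form an e-process, then by Theorem 2.1 so do the prediction-powered e-values (for finite stopping times), and so, by Markov's inequality,

$$\mathbb{P}[\theta^\star \in C^{\mathrm{ppi}(\alpha)}_\tau] = \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_\tau \le 1/\alpha] = 1 - \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_\tau > 1/\alpha] \ge 1 - \frac{\mathbb{E}[E^{\mathrm{ppi}(\theta^\star)}_\tau]}{1/\alpha} \ge 1 - \frac{1}{1/\alpha} = 1 - \alpha.$$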
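The inversion of e-values into a confidence set can be illustrated on a grid of candidate parameters. The sketch below is only illustrative and makes several assumptions: the parameter is the mean of observations in $[0, 1]$, the e-value is a plain (non-prediction-powered) two-sided betting e-value with a fixed bet `lam`, and the function names are hypothetical. A prediction-powered version would replace each factor with its prediction-powered counterpart and keep the inversion step unchanged.

```python
def betting_e_value(ys, theta, lam=0.5):
    # Two-sided betting e-value for H0: E[Y] = theta, with Y in [0, 1].
    # Each product has nonnegative factors and expectation 1 under H0, and
    # so does their average.
    up, down = 1.0, 1.0
    for y in ys:
        up *= 1.0 + lam * (y - theta)
        down *= 1.0 - lam * (y - theta)
    return 0.5 * (up + down)

def confidence_set(ys, alpha=0.05, grid_size=101):
    # Keep every grid point whose e-value has not yet reached 1 / alpha.
    grid = [j / (grid_size - 1) for j in range(grid_size)]
    return [theta for theta in grid if betting_e_value(ys, theta) < 1.0 / alpha]
```

The returned list approximates the set $\{\theta : E^{(\theta)}_n < 1/\alpha\}$ up to the grid resolution.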

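Proposition A.9 (Proposition 2.4 in the main text). Under the assumptions of Theorem 2.2, let $\nu$ be a measure over the parameter space $\Theta$. Then there exists some $c$ for which

$$\mathbb{E}\Big[\int \frac{1}{n} \log \frac{1}{E^{\mathrm{ppi}(\theta)}_n}\, d\nu(\theta)\Big] \le \mathbb{E}\Big[\int \frac{1}{n} \log \frac{1}{E^{(\theta)}_n}\, d\nu(\theta)\Big] + \nu(\Theta)\, c\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\|\mu_i(X_i) - Y_i\|\big].$$

Proof. By Fubini,

$$\mathbb{E}\Big[\int \frac{1}{n} \log \frac{1}{E^{\mathrm{ppi}(\theta)}_n}\, d\nu(\theta)\Big] = \int \mathbb{E}\Big[\frac{1}{n} \log \frac{1}{E^{\mathrm{ppi}(\theta)}_n}\Big]\, d\nu(\theta).$$

And now we apply Theorem 2.2:

$$\begin{aligned}
\int \mathbb{E}\Big[\frac{1}{n} \log \frac{1}{E^{\mathrm{ppi}(\theta)}_n}\Big]\, d\nu(\theta)
&= \int -\mathbb{E}\Big[\frac{1}{n} \log E^{\mathrm{ppi}(\theta)}_n\Big]\, d\nu(\theta) \\
&\le \int -\Big(\mathbb{E}\Big[\frac{1}{n} \log E^{(\theta)}_n\Big] - c\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[W(\mu_i(X_i) \,\|\, Y_i)\big]\Big)\, d\nu(\theta) \\
&= \int \Big(\mathbb{E}\Big[\frac{1}{n} \log \frac{1}{E^{(\theta)}_n}\Big] + c\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[W(\mu_i(X_i) \,\|\, Y_i)\big]\Big)\, d\nu(\theta) \\
&= \int \mathbb{E}\Big[\frac{1}{n} \log \frac{1}{E^{(\theta)}_n}\Big]\, d\nu(\theta) + \nu(\Theta)\, c\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[W(\mu_i(X_i) \,\|\, Y_i)\big] \\
&= \mathbb{E}\Big[\int \frac{1}{n} \log \frac{1}{E^{(\theta)}_n}\, d\nu(\theta)\Big] + \nu(\Theta)\, c\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[W(\mu_i(X_i) \,\|\, Y_i)\big] \\
&\le \mathbb{E}\Big[\int \frac{1}{n} \log \frac{1}{E^{(\theta)}_n}\, d\nu(\theta)\Big] + \nu(\Theta)\, c\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\|\mu_i(X_i) - Y_i\|\big],
\end{aligned}$$

where the last step holds by Lemma A.4.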

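Considering that the object of interest is a confidence interval, it is desirable to further bound the measure of the interval. We were unable to prove any sufficiently general result that was (i) nonvacuous, and (ii) decayed reasonably fast as $n$ increased, and imagine that heavy assumptions are necessary; this may be best done on a case-by-case basis. Nevertheless, here is one possible somewhat straightforward result.

Proposition A.10. Under the same conditions of Proposition 2.4, suppose that the prediction-powered e-values are bounded from above by $M^{\mathrm{ppi}}$ (i.e., for all $\theta \in \Theta$, $E^{\mathrm{ppi}(\theta)}_n < M^{\mathrm{ppi}}$ almost surely), and similarly for the non-prediction-powered e-values by $M$ (i.e., for all $\theta \in \Theta$, $E^{(\theta)}_n < M$ almost surely). Then

$$\mathbb{E}[\nu(C^{\mathrm{ppi}})] \le \frac{\mathbb{E}\big[\int \log \frac{1}{E^{\mathrm{ppi}(\theta)}}\, d\nu(\theta)\big] + \nu(\Theta) \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}}, \qquad \mathbb{E}[\nu(C)] \le \frac{\mathbb{E}\big[\int \log \frac{1}{E^{(\theta)}}\, d\nu(\theta)\big] + \nu(\Theta) \log M}{\log \alpha + \log M}.$$

Proof. Consider the measure $\bar\nu(A) = \nu(A)/\nu(\Theta)$; it is a probability measure. Then:

$$\bar\nu(C^{\mathrm{ppi}}) = \mathbb{P}_{\theta \sim \bar\nu}\big[E^{\mathrm{ppi}(\theta)} < 1/\alpha\big] = \mathbb{P}_{\theta \sim \bar\nu}\big[1/E^{\mathrm{ppi}(\theta)} > \alpha\big] = \mathbb{P}_{\theta \sim \bar\nu}\big[\log (1/E^{\mathrm{ppi}(\theta)}) > \log \alpha\big];$$

We want to apply Markov's inequality. To do that, we need the left-hand side to be nonnegative; to this end, we add $\log M^{\mathrm{ppi}}$ to both sides, which yields

$$\begin{aligned}
\mathbb{P}_{\theta \sim \bar\nu}\big[\log (1/E^{\mathrm{ppi}(\theta)}) > \log \alpha\big]
&= \mathbb{P}_{\theta \sim \bar\nu}\big[\log (1/E^{\mathrm{ppi}(\theta)}) + \log M^{\mathrm{ppi}} > \log \alpha + \log M^{\mathrm{ppi}}\big] \\
&\le \frac{\mathbb{E}_{\theta \sim \bar\nu}\big[\log (1/E^{\mathrm{ppi}(\theta)}) + \log M^{\mathrm{ppi}}\big]}{\log \alpha + \log M^{\mathrm{ppi}}} \\
&= \frac{\int \log (1/E^{\mathrm{ppi}(\theta)})\, d\bar\nu(\theta) + \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}} \\
&= \frac{[\nu(\Theta)]^{-1} \int \log (1/E^{\mathrm{ppi}(\theta)})\, d\nu(\theta) + \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}}.
\end{aligned}$$

So, multiplying everything by $\nu(\Theta)$, we get that

$$\nu(C^{\mathrm{ppi}}) = \bar\nu(C^{\mathrm{ppi}})\, \nu(\Theta) \le \nu(\Theta)\, \frac{[\nu(\Theta)]^{-1} \int \log (1/E^{\mathrm{ppi}(\theta)})\, d\nu(\theta) + \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}} = \frac{\int \log (1/E^{\mathrm{ppi}(\theta)})\, d\nu(\theta) + \nu(\Theta) \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}}.$$

Finally, taking the expectation on both sides, we get that

$$\mathbb{E}[\nu(C^{\mathrm{ppi}})] \le \mathbb{E}\left[\frac{\int \log (1/E^{\mathrm{ppi}(\theta)})\, d\nu(\theta) + \nu(\Theta) \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}}\right] = \frac{\mathbb{E}\big[\int \log (1/E^{\mathrm{ppi}(\theta)})\, d\nu(\theta)\big] + \nu(\Theta) \log M^{\mathrm{ppi}}}{\log \alpha + \log M^{\mathrm{ppi}}},$$

as we desired.

The same can be done for the non-prediction-powered e-values, replacing $E^{\mathrm{ppi}}$ with $E$ and $M^{\mathrm{ppi}}$ with $M$.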


Most terms in the inequality depend on $n$, so it is somewhat hard to interpret. However, if the dependence on $n$ of the expectation of the log is good enough, then the bound should at least be nonvacuous.

Proposition A.11 (Proposition 2.6 in the main text). Under Assumption 2.5, it holds that $\mathcal{A}((E^{\mathrm{ppi}(\gamma)}_n)_{\gamma \in \Gamma})$ is also valid. If the underlying e-values are e-processes, then it further holds that $\mathcal{A}((E^{\mathrm{ppi}(\gamma)}_\tau)_{\gamma \in \Gamma})$ is valid for any finite stopping time $\tau$.

Proof. To prove that $\mathcal{A}((E^{\mathrm{ppi}(\gamma)}_n)_{\gamma \in \Gamma})$ is valid, by Assumption 2.5, it suffices to show that $E^{\mathrm{ppi}(\gamma)}_n$ is valid for every $\gamma \in \Gamma$; and by Theorem 2.1, this is indeed the case.

Now suppose that the underlying e-values $(E^{(\gamma)}_n)_{\gamma \in \Gamma}$ form e-processes; then so do the prediction-powered e-values $(E^{\mathrm{ppi}(\gamma)}_n)_{\gamma \in \Gamma}$ for finite stopping times, by Theorem 2.1. Then, to prove that $\mathcal{A}((E^{\mathrm{ppi}(\gamma)}_\tau)_{\gamma \in \Gamma})$ is valid for any finite stopping time $\tau$, again by Assumption 2.5 it suffices to show that $E^{\mathrm{ppi}(\gamma)}_\tau$ is valid, which is indeed the case since they form e-processes for finite stopping times.

B. Additional Results

B.1. The Asymptotic Setting

E-values, though usually defined in non-asymptotic terms, have asymptotic analogues. In particular, a (sequential) asymptotic e-value is defined as a (sequence of) nonnegative random variable(s) $E_n$ such that, under the null $H_0$, it holds that $\limsup_{n \to \infty} \mathbb{E}[E_n] \le 1$ (Ramdas & Wang, 2024). We briefly show here that the core points of the theory we build in the main text can be directly applied here. Most results whose analogues we do not prove still hold, and are just omitted for conciseness.

Proposition B.1. If $E_n$ is an asymptotic e-value, then so is its prediction-powered analogue $E^{\mathrm{ppi}}_n$.

Proof. We want to prove that $E^{\mathrm{ppi}}_n$ is an asymptotic e-value. It follows, by Theorem 2.1:

$$\limsup_{n \to \infty} \mathbb{E}[E^{\mathrm{ppi}}_n] = \limsup_{n \to \infty} \mathbb{E}[E_n] \le 1.$$

Proposition B.2. If $E^{(\theta)}_n$ is an asymptotic e-value for each $\theta \in \Theta$, then $C^{\mathrm{ppi}(\alpha)}_n := \{\theta \in \Theta : E^{\mathrm{ppi}(\theta)}_n < 1/\alpha\}$ is an asymptotic confidence interval, i.e., $\limsup_{n \to \infty} \mathbb{P}[\theta^\star \notin C^{\mathrm{ppi}(\alpha)}_n] \le \alpha$.

Proof. It holds that

$$\limsup_{n \to \infty} \mathbb{P}[\theta^\star \notin C^{\mathrm{ppi}(\alpha)}_n] = \limsup_{n \to \infty} \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_n \ge 1/\alpha];$$

$$\limsup_{n \to \infty} \mathbb{P}[E^{\mathrm{ppi}(\theta^\star)}_n \ge 1/\alpha] \le \limsup_{n \to \infty} \frac{\mathbb{E}[E^{\mathrm{ppi}(\theta^\star)}_n]}{1/\alpha} = \alpha \limsup_{n \to \infty} \mathbb{E}[E^{\mathrm{ppi}(\theta^\star)}_n] \le \alpha.$$

The results related to power (e.g., Theorem 2.2) apply to asymptotic e-values without any modification necessary.

B.2. An approximately optimal choice for πi

Our prediction-powered e-values have, at their core, the customizable choice of data collection probabilities $\pi_i(X_i)$. Selecting a constant $\pi_i(X_i) = \pi_{\inf}$, where $\pi_{\inf}$ is the lowest possible value (so as to minimize data collection costs), is a reasonable approach, but it ignores the fact that the probability can take into account the cheap data $X_i$, which could significantly improve statistical power when used correctly. In an effort to seek a better strategy, we try to identify an approximately optimal choice of $\pi_i$.

The optimality is in the sense that, at point $i$ in time, the data collection probability function $\pi_i(\cdot)$ should be chosen so as to maximize the expected log of the e-value, as per (Kelly, 1956); this is also similar, e.g., to the GRAPA and aGRAPA strategies of (Waudby-Smith & Ramdas, 2020). However, $\pi_i$ is also subject to additional constraints:

(i) Its image must be bounded: $\pi_i : \mathcal{X} \to [1 - a_i/b_i, 1]$. I.e., for all $x \in \mathcal{X}$, $1 - a_i/b_i \le \pi_i(x) \le 1$.

(ii) It must respect some particular maximal data collection budget: $\mathbb{E}[\pi_i(X_i)] \le \mathrm{Budget}$.

So we seek to solve the following constrained functional optimization problem:

$$\begin{aligned}
\pi^\star_i = \operatorname*{argmax}_{\pi \in L^2}\ \mathbb{E}[\log E^{\mathrm{ppi}}_n \mid \mathcal{F}_{i-1}] = \operatorname*{argmin}_{\pi \in L^2}\ & \mathbb{E}[-\log e^{\mathrm{ppi}}_n \mid \mathcal{F}_{i-1}] \\
\text{s.t.}\quad & 1 - a_i/b_i \le \pi(x) \le 1 \ \text{ for (almost) all } x \in \mathcal{X}, \\
& \mathbb{E}[\pi(X_i) \mid \mathcal{F}_{i-1}] \le \mathrm{Budget},
\end{aligned}$$

where we assume that the domain of π is bounded (so that there are functions that satisfy the first domain, since π is always positive).

Our approximate solution to this is as follows: the functional gradient of our (unconstrained) loss is given by

π 7 E h ei(Yi) (1 π(Xi))ei(µi(Xi))

ei(µi(Xi)) π(Xi) log 1 | Xi, Fi 1

where $h(t) = 1/t - \log(1/t) = 1/t + \log t$. The $h$ function is a bit inconvenient for solving this problem in closed form, so, inspired by (Waudby-Smith & Ramdas, 2020), we do a Taylor approximation around some point $a$ (which turns out to later combine with the parameter to control the budget constraint); this leads to the following approximate functional gradient:

π 7 αa + βa/πinf E ei(Yi) ei(µi(Xi)) | Xi, Fi 1

where $\alpha_a = \log a + 2/a - 2$ and $\beta_a = (a - 1)/a^2$.

The unconstrained solution is then given by

π (Xi) E ei(Yi) ei(µi(Xi)) | Xi, Fi 1

1 /(αa/βa + 1),

and KKT conditions give that:

If the unconstrained optimum above satisfies the boundedness constraint, then that is the optimal choice;

If αa + βa(E h ei(Yi) ei(µi(Xi)) | Xi, Fi 1 i /πinf (1 πinf)/πinf) 0, then π (Xi) = πinf;

Otherwise, π (Xi) = 1.

C. Datasets

C.1. For Section 3.1

We use the dataset of (CDC, 2015). It is a tabular dataset, where each row corresponds to an individual; the targets $Y_i$ in the original dataset denote whether the individual was (i) diabetic, (ii) pre-diabetic, or (iii) neither. For the purposes of our experiment, we only look at whether they were diabetic or not. The covariates are effectively responses to the following simple survey questions:

do you have high blood pressure?

do you have high cholesterol?

how long has it been since the last time you have checked your cholesterol levels?

what is your body mass index (BMI)?


have you smoked at least 100 cigarettes in your entire life?

have you ever been told you had a stroke?

have you been diagnosed with coronary heart disease (CHD) or myocardial infarction (MI)?

how much physical activity have you done in the past 30 days (excluding job)?

how often do you consume fruit?

how often do you consume vegetables?

how often do you consume alcohol?

do you have health care coverage, including health insurance, prepaid plans such as HMO, etc.?

Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?

Would you say that in general your health is: [excellent / very good / good / fair / poor]

Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?

Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?

Do you have serious difficulty walking or climbing stairs?

What is your age?

What is your highest level of education?

What is your level of income?

C.2. For Sections 3.2 and 3.3

We use the dataset of Blackard (1998). On a training split of this dataset, we train a simple random forest classification model. We also set aside a validation split to compute the validation loss in Section 3.2. At evaluation time:

For the non-poisoned data stream in Section 3.2, where the null should not be rejected, we just use the data remaining after the training and validation splits.

For the poisoned data stream in Section 3.2, we switch the label with a probability of

for time $t \in [0, 1]$.

For the data stream in Section 3.3, we switch the label with a probability of

1[t 0.23] clamp[0,1]

5 + 0.35 2!

for time $t \in [0, 1]$. The indicator causes a visible change in the time series, which is convenient for visualization. The remaining factor is chosen differently from the previous section so that the change in the distribution is not too drastic.
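The label-switching procedure can be sketched as follows. This is only an illustration under several assumptions: the function `poison_stream` and its parameters are hypothetical, time is taken to be the normalized position of an example in the stream, a switched label is assumed to be replaced by a different class chosen uniformly at random, and the step-shaped flip probability in the usage example is a placeholder rather than the schedule used in the experiments.

```python
import random
from typing import Callable, List


def poison_stream(
    labels: List[int],
    flip_probability: Callable[[float], float],
    num_classes: int,
    seed: int = 0,
) -> List[int]:
    """Switch each label with a time-dependent probability.

    labels are integers in {0, ..., num_classes - 1}; the i-th example is
    assigned time t = i / (n - 1) in [0, 1].
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to switch labels")
    rng = random.Random(seed)
    n = len(labels)
    poisoned = []
    for i, y in enumerate(labels):
        t = i / (n - 1) if n > 1 else 0.0
        if rng.random() < flip_probability(t):
            # Draw uniformly among the other num_classes - 1 classes.
            other = rng.randrange(num_classes - 1)
            y = other if other < y else other + 1
        poisoned.append(y)
    return poisoned


if __name__ == "__main__":
    clean = [i % 7 for i in range(20)]
    # Placeholder schedule: no switching early on, frequent switching later.
    noisy = poison_stream(clean, lambda t: 0.8 if t >= 0.5 else 0.0, num_classes=7)
    print(sum(a != b for a, b in zip(clean, noisy)), "labels switched")
```

Passing the schedule as a function keeps the stream-generation logic independent of the particular probability curve.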

C.3. For Section 3.4

We generate a random DAG with 6 nodes using the Erdős–Rényi procedure, and mark the last three of these nodes as "costly". Relations between the nodes are given by linear functions, whose weights and biases are sampled randomly, with additional independent Gaussian noise with a standard deviation of 0.4.
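A data-generating process of this kind can be sketched as below. The number of nodes, the noise standard deviation, and the choice of the last three nodes as costly follow the description above; the edge probability, the uniform range for weights and biases, and all names are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple


def sample_linear_sem(
    num_nodes: int = 6,
    edge_prob: float = 0.5,   # assumed; not specified in the text
    noise_std: float = 0.4,
    seed: int = 0,
) -> Tuple[Dict[Tuple[int, int], float], List[float], Callable[[], List[float]]]:
    """Sample an Erdos-Renyi DAG with linear relations and Gaussian noise."""
    rng = random.Random(seed)
    # Only allow edges i -> j with i < j, which guarantees acyclicity.
    weights = {
        (i, j): rng.uniform(-1.0, 1.0)  # assumed weight distribution
        for j in range(num_nodes)
        for i in range(j)
        if rng.random() < edge_prob
    }
    biases = [rng.uniform(-1.0, 1.0) for _ in range(num_nodes)]

    def draw() -> List[float]:
        """Draw one joint sample by visiting nodes in topological order."""
        values: List[float] = []
        for j in range(num_nodes):
            mean = biases[j] + sum(
                w * values[i] for (i, k), w in weights.items() if k == j
            )
            values.append(mean + rng.gauss(0.0, noise_std))
        return values

    return weights, biases, draw


if __name__ == "__main__":
    weights, biases, draw = sample_linear_sem()
    costly = list(range(6))[-3:]  # the last three nodes are marked as costly
    print("edges:", sorted(weights))
    print("costly nodes:", costly)
    print("one sample:", [round(v, 3) for v in draw()])
```

Ordering the nodes and only drawing forward edges is a common way to obtain a random DAG without an explicit cycle check.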


The source code to reproduce the experiments in the paper, as well as additional experiments (e.g., varying seed, varying sampling budget, and underpowered settings, i.e., with worse predictive models), is available at https://github.com/dccsillag/experiments-prediction-powered-evalues.