# Prediction-Powered Adaptive Shrinkage Estimation

Sida Li¹, Nikolaos Ignatiadis¹²

Prediction-Powered Inference (PPI) is a powerful framework for enhancing statistical estimates by combining limited gold-standard data with machine learning (ML) predictions. While prior work has demonstrated PPI's benefits for individual statistical problems, modern applications require answering numerous parallel statistical questions. We introduce Prediction-Powered Adaptive Shrinkage (PAS), a method that bridges PPI with empirical Bayes shrinkage to improve estimation of multiple means. PAS debiases noisy ML predictions within each problem and then borrows strength across problems by using those same predictions as a reference point for shrinkage. The amount of shrinkage is determined by minimizing an unbiased estimate of risk, and we prove that this tuning strategy is asymptotically optimal. Experiments on both synthetic and real-world datasets show that PAS adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications.

1. Introduction

A major obstacle in answering modern scientific questions is the scarcity of gold-standard data (Miao et al., 2024b). While advancements in data collection, such as large-scale astronomical surveys (York et al., 2000) and web crawling (Penedo et al., 2024), have led to an abundance of covariates (or features), scientific conclusions often rely on outcomes (or labels), which are often expensive and labor-intensive to obtain. The rapid development of machine learning (ML) algorithms has offered a path forward, with ML predictions increasingly used to supplement gold-standard outcomes and increase the statistical efficiency of subsequent analyses (Liang et al., 2007; Wang et al., 2020).

¹Data Science Institute, The University of Chicago. ²Department of Statistics, The University of Chicago. Correspondence to: Sida Li.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Prediction-Powered Inference (PPI) (Angelopoulos et al., 2023) addresses the scarcity issue by providing a framework for valid statistical analysis using predictions from black-box ML models. By combining ML-predicted and gold-standard outcomes, PPI and its variants (Angelopoulos et al., 2024; Zrnic & Candès, 2024; Zrnic, 2025) use the abundance of predictions to reduce variance while relying on the accuracy of labeled¹ data to control bias. In this work, we adapt PPI to the estimation of multiple outcome means in compound estimation settings.

Many applications of PPI naturally involve parallel statistical problems that can be solved simultaneously. For instance, several PPI methods (Angelopoulos et al., 2024; Fisch et al., 2024) have shown improvements in estimating the fraction of spiral galaxies using predictions on images from the Galaxy Zoo 2 dataset (Willett et al., 2013). While these methods focus on estimating a single overall fraction, a richer analysis emerges from partitioning galaxies based on metadata (such as celestial coordinates or pre-defined bins) and estimating the fraction of galaxies within each partition. This compound estimation approach enables more granular scientific inquiries that account for heterogeneity across galaxy clusters and spatial locations (Nair & Abraham, 2010). We demonstrate, both theoretically and empirically, the benefits of solving multiple mean estimation problems simultaneously. Our approach builds on the empirical Bayes (EB) principle of sharing information across problems (Robbins, 1956; Efron, 2010), as exemplified by James-Stein shrinkage (James & Stein, 1961; Xie et al., 2012).
The connection between modern and classical statistical ideas allows us to perform within-problem PPI estimation in the first place, followed by a shrinkage step that reuses the ML predictions in an adaptive way, which becomes possible through borrowing information across problems. Our contributions are:

1. We propose Prediction-Powered Adaptive Shrinkage (PAS) for compound mean estimation. PAS inherits the flexibility of PPI in working with any black-box predictive model and makes minimal distributional assumptions about the data. Its two-stage estimation process makes efficient use of the ML predictions as both a variance-reduction device and a shrinkage target.

2. We develop a Correlation-Aware Unbiased Risk Estimate (CURE) for tuning the PAS estimator, establish asymptotic optimality of this tuning strategy, and derive an interpretation in terms of a Bayes oracle risk upper bound.

3. We conduct extensive experiments on both synthetic and real-world datasets. Our experiments demonstrate PAS's applicability to large-scale problems with deep learning models, showing improved estimation accuracy compared to other classical and modern baselines.

¹Throughout the paper, we use the terms labeled and gold-standard interchangeably.

2. Preliminaries and Notation

2.1. Prediction-Powered Inference (PPI)

The PPI framework considers a setting where we have access to a small number of labeled data points $(X_i, Y_i)_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^n$ and a large number of unlabeled covariates $(\tilde{X}_i)_{i=1}^{N} \in \mathcal{X}^N$, where $\mathcal{X}$ and $\mathcal{Y}$ represent the covariate and outcome space, respectively. The data points are drawn iid from a joint distribution $P_{XY}$.² We are also given a black-box predictive model $f : \mathcal{X} \to \mathcal{Y}$ that is independent of the datasets (e.g., pre-trained on similar but unseen data). For mean estimation with $\mathcal{Y} \subseteq \mathbb{R}$, the goal is to leverage the predicted outcomes $f(\tilde{X}_i)$ to improve the estimation of $\theta := \mathbb{E}[Y_i]$.
Some simple estimators take the form of the following aggregated (summary) statistics:

$$\bar{Y} := \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad \tilde{Y} := \frac{1}{N}\sum_{i=1}^{N} \tilde{Y}_i, \qquad \bar{Z}^f := \frac{1}{n}\sum_{i=1}^{n} f(X_i), \qquad \tilde{Z}^f := \frac{1}{N}\sum_{i=1}^{N} f(\tilde{X}_i). \quad (1)$$

Above, $\bar{Y}$ is the classical estimator,³ $\bar{Z}^f, \tilde{Z}^f$ are the prediction means on the labeled and unlabeled data, and $\tilde{Y}$ (grayed out) is unobserved. The vanilla PPI estimator is defined as

$$\hat{\theta}^{\mathrm{PPI}} := \underbrace{\bar{Y}}_{\text{Baseline}} + \underbrace{\big(\tilde{Z}^f - \bar{Z}^f\big)}_{\text{Variance Reduction}} = \underbrace{\tilde{Z}^f}_{\text{Baseline}} + \underbrace{\big(\bar{Y} - \bar{Z}^f\big)}_{\text{Debiasing}}. \quad (2)$$

Both definitions represent $\hat{\theta}^{\mathrm{PPI}}$ in the form of a baseline estimator plus a correction term. In the first representation, the baseline estimator is the unbiased classical estimator $\bar{Y}$, while the correction term has expectation 0 and attempts to reduce the variance of $\bar{Y}$. In the second representation, the baseline estimator is the prediction mean on unlabeled data $\tilde{Z}^f$ (which in general may be biased for $\theta$), while the correction term removes the bias of $\tilde{Z}^f$ by estimating the bias of the ML model $f$ on the labeled dataset.

Writing $\hat{\theta}^{\mathrm{PPI}} = \frac{1}{N}\sum_{i=1}^{N} f(\tilde{X}_i) + \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - f(X_i)\big)$, we find that $\mathbb{E}[\hat{\theta}^{\mathrm{PPI}}] = \mathbb{E}[Y_i] = \theta$ and

$$\mathrm{Var}\big[\hat{\theta}^{\mathrm{PPI}}\big] = \frac{1}{N}\mathrm{Var}\big[f(\tilde{X}_i)\big] + \frac{1}{n}\mathrm{Var}\big[Y_i - f(X_i)\big], \quad (3)$$

that is, $\hat{\theta}^{\mathrm{PPI}}$ is unbiased for $\theta$ and its variance becomes smaller when the model predicts the true outcomes well. The mean squared error (MSE) of $\hat{\theta}^{\mathrm{PPI}}$ is equal to $\mathrm{Var}[\hat{\theta}^{\mathrm{PPI}}]$. Although we motivated $\hat{\theta}^{\mathrm{PPI}}$ in (2) as implementing a correction step on two possible baseline estimators ($\bar{Y}$ and $\tilde{Z}^f$), $\hat{\theta}^{\mathrm{PPI}}$ may have MSE for estimating $\theta$ that is arbitrarily worse than either of these baselines.

Comparison to the classical estimator $\bar{Y}$. The classical estimator $\bar{Y}$, which only uses labeled data, is unbiased for $\theta$ and has variance (and MSE) equal to $\frac{1}{n}\mathrm{Var}[Y_i]$; when the predictions are inaccurate, the variance of $\hat{\theta}^{\mathrm{PPI}}$ in (3) can exceed this, so vanilla PPI can be worse than the classical estimator.

²To be concrete: $(X_i, Y_i) \overset{\text{iid}}{\sim} P_{XY}$ and $(\tilde{X}_i, \tilde{Y}_i) \overset{\text{iid}}{\sim} P_{XY}$ independently, but $\tilde{Y}_i$ is unobserved.
³From now on, we will use the term classical estimator to refer to the sample average of the labeled outcomes.

Power-Tuned PPI (PPI++). To overcome the above limitation, Angelopoulos et al.
(2024) introduce a power-tuning parameter $\lambda$ and define

$$\hat{\theta}^{\mathrm{PPI}}_\lambda := \bar{Y} + \lambda\big(\tilde{Z}^f - \bar{Z}^f\big), \quad (4)$$

which recovers the classical estimator when $\lambda = 0$ and the vanilla PPI estimator when $\lambda = 1$. For all values of $\lambda$, $\hat{\theta}^{\mathrm{PPI}}_\lambda$ is unbiased, so if we select the $\lambda$ that minimizes $\mathrm{Var}[\hat{\theta}^{\mathrm{PPI}}_\lambda]$, we can improve our estimator over both the classical estimator and vanilla PPI. Such an estimator is defined as the Power-Tuned (PT) PPI⁴ estimator $\hat{\theta}^{\mathrm{PT}} := \hat{\theta}^{\mathrm{PPI}}_{\lambda^*}$, where we pick the $\lambda^*$ that minimizes the variance (and thus the MSE) of $\hat{\theta}^{\mathrm{PPI}}_\lambda$. We will revisit PT as one of the building blocks of our proposed PAS estimator in Section 4.

Comparison to $\tilde{Z}^f$. Consider the ideal scenario for PPI with $N = \infty$ (that is, the unlabeled dataset is much larger than the labeled dataset), so that $\tilde{Z}^f \approx \mathbb{E}[f(\tilde{X}_i)]$. Even then, the MSE of $\hat{\theta}^{\mathrm{PPI}}$ in (3) is always lower bounded⁵ by $\frac{1}{n}\mathbb{E}[\mathrm{Var}[Y_i \mid X_i]]$, and the lower bound is attained by the perfect ML predictor $f(\cdot) \equiv \mathbb{E}[Y_i \mid X_i = \cdot]$. In words, if $Y_i$ is not perfectly predictable from $X_i$, then PPI applied to a labeled dataset of fixed size $n$ must have non-negligible MSE. By contrast, for $N = \infty$, the prediction mean of unlabeled data $\tilde{Z}^f$ has zero variance and MSE equal to the squared bias $(\mathbb{E}[f(X_i)] - \theta)^2$. Thus if the predictor satisfies a calibration-type property that $\mathbb{E}[f(X_i)] \approx \mathbb{E}[Y_i]$ (which is implied by, but much weaker than, the requirement $f(X_i) \approx Y_i$), then the MSE of $\tilde{Z}^f$ could be nearly 0. By contrast, PPI (and PPI++) can only partially capitalize on such a predictor $f(\cdot)$.

While PPI and PPI++ are constrained by their reliance on unbiased estimators, we show that the compound estimation setting (Section 2.2) enables a different approach. By carefully navigating the bias-variance tradeoff through information sharing across parallel estimation problems, we can provably match the performance of both $\bar{Y}$ and $\tilde{Z}^f$.

⁴We use the term PPI++ for the broader framework, while PT refers to the specific estimator.
⁵The same lower bound also applies to power-tuned PPI $\hat{\theta}^{\mathrm{PT}}$.

2.2. The Compound Mean Estimation Setting

In this section, we introduce the problem setting that PAS is designed to address: estimating the means of $m > 1$ parallel problems with a single black-box predictive model $f$.⁶ For the $j$-th problem, where $j \in [m] := \{1, \ldots, m\}$, we observe a labeled dataset $(X_{ij}, Y_{ij})_{i=1}^{n_j}$ with $n_j \in \mathbb{N}$ samples and an unlabeled dataset $(\tilde{X}_{ij})_{i=1}^{N_j}$ with $N_j \in \mathbb{N}$ samples. We start with modeling heterogeneity across problems.

Assumption 2.1 (Prior). There exist problem-specific unobserved latent variables $\eta_j$ with

$$\eta_j \overset{\text{iid}}{\sim} P_\eta,\ j \in [m], \quad \text{and} \quad \eta := (\eta_1, \ldots, \eta_m)^\top, \quad (5)$$

where $P_\eta$ is an unknown probability measure. The latent variable $\eta_j$ fully specifies the distribution of the $j$-th labeled and unlabeled dataset.

We use the notation $\mathbb{E}_{\eta_j}[\cdot]$ (resp. $\mathbb{E}_{\eta}[\cdot]$) to denote the expectation conditional on $\eta_j$ (resp. $\eta$), while $\mathbb{E}_{P_\eta}[\cdot]$ denotes an expectation also integrating out $P_\eta$. We do not place any restriction on the unknown prior $P_\eta$. Assumption 2.1 posits exchangeability across problems, which enables information sharing without restricting heterogeneity (Ignatiadis et al., 2023). In our setting, we are specifically interested in the means

$$\theta_j := \mathbb{E}_{\eta_j}[Y_{ij}],\ j \in [m], \quad \text{and} \quad \theta := (\theta_1, \ldots, \theta_m)^\top. \quad (6)$$

Our next assumption specifies that we only model the first two moments of the joint distribution between the outcomes and the predictions. The upshots of such modeling are that the exact form of the observation distribution is neither assumed nor required in our arguments, and that our approach is directly applicable to settings where the covariate space $\mathcal{X}$ is high-dimensional or structured.

Assumption 2.2 (Sampling).
For each problem $j \in [m]$, we assume that the joint distribution of $(f(X_{ij}), Y_{ij})$ has finite second moments conditional on $\eta_j$, and for $i \in [n_j]$,

$$\begin{pmatrix} f(X_{ij}) \\ Y_{ij} \end{pmatrix} \,\Big|\, \eta_j \;\overset{\text{iid}}{\sim}\; F_j \quad \text{with mean } \begin{pmatrix} \mu_j \\ \theta_j \end{pmatrix} \text{ and covariance } \begin{pmatrix} \tau_j^2 & \rho_j \tau_j \sigma_j \\ \rho_j \tau_j \sigma_j & \sigma_j^2 \end{pmatrix}, \quad (7)$$

where $F_j, \mu_j, \theta_j, \rho_j, \sigma_j^2, \tau_j^2$ are functions of $\eta_j$. Conditional on $\eta_j$, the unlabeled predictions $f(\tilde{X}_{ij})$, $i \in [N_j]$, are also iid, independent of the labeled dataset and identically distributed with $f(X_{ij})$.

In the notation of (7), $F_j$ represents an unspecified distribution satisfying the moment constraints in (7):

$$\mathbb{E}_{\eta_j}[f(X_{ij})] = \mu_j, \quad \mathrm{Var}_{\eta_j}[f(X_{ij})] = \tau_j^2, \quad \mathrm{Corr}_{\eta_j}[f(X_{ij}), Y_{ij}] = \rho_j, \quad \mathrm{Var}_{\eta_j}[Y_{ij}] = \sigma_j^2.$$

We further denote $\gamma_j := \mathrm{Cov}_{\eta_j}[f(X_{ij}), Y_{ij}] = \rho_j \tau_j \sigma_j$. Similar to Eq. (1), we define the aggregated statistics $\bar{Y}_j, \bar{Z}^f_j, \tilde{Z}^f_j$ for each $j \in [m]$. To facilitate exposition, following prior work,⁷ we treat the second moments $\tau_j^2, \sigma_j^2, \gamma_j$ as known until the end of Section 5. In Section 6, we extend our methods to allow for unknown $\tau_j^2, \sigma_j^2$, and $\gamma_j$.

⁶Our proposal also accommodates using separate predictors $\{f_j\}_{j=1}^m$ for each problem. To streamline exposition, we focus on the practical scenario where a single (large) model (e.g., a large language or vision model) can handle multiple tasks simultaneously (Radford et al., 2019; He et al., 2022).

Figure 1. We instantiate the model described in Example 2.3 with m = 10 problems, each with $n_j = 10$ labeled and $N_j = 20$ unlabeled data points (we use different colors for the 10 problems). (Top) Labeled data $(X_{ij}, Y_{ij})_{i=1}^{n_j}$ with the classical estimator $\bar{Y}_j$ shown for each problem. (Bottom) We apply a flawed predictor $f(x) = |x|$ to the unlabeled covariates and visualize $(\tilde{X}_{ij}, f(\tilde{X}_{ij}))_{i=1}^{N_j}$ as well as the prediction mean $\tilde{Z}^f_j$.

We next introduce a synthetic model that will serve both as a running example and as part of our numerical study.

Example 2.3 (Synthetic model).
For each problem $j$, let $\eta_j \sim U[-1, 1]$. We think of $\eta_j$ as both indexing the problems and generating heterogeneity across problems. The $j$-th dataset is generated via (with constants set to $c = 0.05$, $\psi = 0.1$)

$$X_{ij} \overset{\text{iid}}{\sim} N(\eta_j, \psi^2), \qquad Y_{ij} \mid X_{ij} \overset{\text{ind}}{\sim} N\big(2\eta_j X_{ij} - \eta_j^2,\ c\big). \quad (8)$$

In Figure 1, we visualize realizations from this model with $m = 10$ problems, $n_j = 10$ labeled observations, and $N_j = 20$ unlabeled observations for each problem. We apply a flawed predictor $f(x) = |x|$. The classical estimator $\bar{Y}_j$ and the prediction mean $\tilde{Z}^f_j$ deviate from each other. Nevertheless, $\tilde{Z}^f_j$ contains information that can help us improve upon $\bar{Y}_j$ as an estimator of $\theta_j$ by learning from within-problem (PPI, PPI++, this work) and across-problem (this work) structure.

We emphasize that, as specified in (7), our approach only requires modeling the first and second moments of the joint distribution of $(f(X_{ij}), Y_{ij})$. For instance, in this synthetic model, $\theta_j = \eta_j^2$ and $\sigma_j^2 = 4\eta_j^2\psi^2 + c$, while $\mu_j, \tau_j^2$ and $\gamma_j$ also admit closed-form expressions in terms of $\eta_j$ when the predictor takes the form $f(x) = |x|$ or $f(x) = x^2$ (see Appendix E.1).

⁷For EB, examples include Xie et al. (2012); Soloff et al. (2024); for PPI, see recent works like Fisch et al. (2024).

To conclude this section, we define the compound risk (Robbins, 1951; Jiang & Zhang, 2009) for any estimator $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_m)^\top$ as the expected squared error loss averaged over problems,

$$R_m(\hat{\theta}, \theta) := \mathbb{E}_{\eta}\big[\ell_m(\hat{\theta}, \theta)\big], \quad (9) \qquad \text{with} \quad \ell_m(\hat{\theta}, \theta) := \frac{1}{m}\sum_{j=1}^{m}(\hat{\theta}_j - \theta_j)^2. \quad (10)$$

The Bayes risk, which we also refer to simply as mean squared error (MSE), further integrates over randomness in the unknown prior $P_\eta$ in (5),

$$B^{P_\eta}_m(\hat{\theta}) := \mathbb{E}_{P_\eta}\big[R_m(\hat{\theta}, \theta)\big]. \quad (11)$$

3. Statistical Guiding Principles & Prior Work

In this section, we illustrate both the statistical guiding principles of our approach and some connections to prior work⁸ through the following stylized Gaussian model:

Sampling: $\hat{\theta}^{\mathrm{cl}} = \theta + (\xi + \varepsilon)$, $\xi \sim N(0, \sigma_\xi^2)$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$.
Prior: $\theta \sim N(0, \sigma_\theta^2)$, $\phi \sim N(0, \sigma_\phi^2)$, $\mathrm{Corr}[\theta, \phi] = \rho$.

In our stylized model, we assume that $(\theta, \phi, \varepsilon, \xi)$ are jointly normal and that all their pairwise correlations are zero, with the exception of $\mathrm{Corr}[\theta, \phi] = \rho \neq 0$. We write $\sigma^2_{\theta|\phi} := \mathrm{Var}[\theta \mid \phi] = (1 - \rho^2)\sigma_\theta^2 < \sigma_\theta^2$. We think of $\hat{\theta}^{\mathrm{cl}}$ as the baseline classical statistical estimator of a quantity $\theta$ that we seek to estimate with small MSE. In our stylized Gaussian model, $\hat{\theta}^{\mathrm{cl}}$ is unbiased for $\theta$ and has noise contribution $\xi + \varepsilon$, so that $\mathbb{E}[(\hat{\theta}^{\mathrm{cl}} - \theta)^2] = \mathrm{Var}_\theta[\hat{\theta}^{\mathrm{cl}}] = \sigma_\xi^2 + \sigma_\varepsilon^2$. We describe three high-level strategies used to improve the MSE of $\hat{\theta}^{\mathrm{cl}}$. These strategies are not tied in any way to the stylized model; nevertheless, the stylized model enables us to give precise expressions for the possible risk reductions, see Table 1.

Table 1. Estimator comparison in the stylized model of Section 3. VR: Variance Reduction, P: Prior Information, CP: Contextual Prior Information.

| Estimator | MSE | VR | P | CP |
|---|---|---|---|---|
| $\hat{\theta}^{\mathrm{cl}}$ | $\sigma_\xi^2 + \sigma_\varepsilon^2$ | | | |
| $\hat{\theta}^{\mathrm{cl}} - \xi$ | $\sigma_\varepsilon^2$ | ✓ | | |
| $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}}]$ | $\dfrac{(\sigma_\xi^2 + \sigma_\varepsilon^2)\,\sigma_\theta^2}{(\sigma_\xi^2 + \sigma_\varepsilon^2) + \sigma_\theta^2}$ | | ✓ | |
| $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}}, \phi]$ | $\dfrac{(\sigma_\xi^2 + \sigma_\varepsilon^2)\,\sigma_{\theta|\phi}^2}{(\sigma_\xi^2 + \sigma_\varepsilon^2) + \sigma_{\theta|\phi}^2}$ | | ✓ | ✓ |
| $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}} - \xi]$ | $\dfrac{\sigma_\varepsilon^2 \sigma_\theta^2}{\sigma_\varepsilon^2 + \sigma_\theta^2}$ | ✓ | ✓ | |
| $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}} - \xi, \phi]$ | $\dfrac{\sigma_\varepsilon^2 \sigma_{\theta|\phi}^2}{\sigma_\varepsilon^2 + \sigma_{\theta|\phi}^2}$ | ✓ | ✓ | ✓ |

Variance reduction (VR). An important statistical idea is to improve $\hat{\theta}^{\mathrm{cl}}$ by obtaining further information to intercept some of its noise, say $\xi$, and replacing $\hat{\theta}^{\mathrm{cl}}$ by $\hat{\theta}^{\mathrm{cl}} - \xi$, which has MSE $\sigma_\varepsilon^2$ and remains unbiased for $\theta$. This idea lies at the heart of approaches such as control variates in simulation (Lavenberg & Welch, 1981; Hickernell et al., 2005), variance reduction in randomized controlled experiments via covariate adjustment (Lin, 2013) and by utilizing pre-experiment data (Deng et al., 2013, CUPED), as well as model-assisted estimation in survey sampling (Cochran, 1977; Breidt & Opsomer, 2017).

⁸We provide further connections in Appendix A.1.
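As a numerical illustration (our own sketch, not from the paper), the VR row of Table 1 can be checked by simulating the stylized Gaussian model and comparing the empirical MSEs of $\hat{\theta}^{\mathrm{cl}}$ and $\hat{\theta}^{\mathrm{cl}} - \xi$:

```python
import numpy as np

# Monte Carlo check of the variance-reduction row of Table 1:
# theta_cl = theta + xi + eps has MSE sigma_xi^2 + sigma_eps^2,
# while theta_cl - xi has MSE sigma_eps^2.
rng = np.random.default_rng(0)
n_rep, sigma_xi, sigma_eps, theta = 200_000, 1.0, 0.5, 2.0

xi = rng.normal(0.0, sigma_xi, n_rep)
eps = rng.normal(0.0, sigma_eps, n_rep)
theta_cl = theta + xi + eps

mse_cl = np.mean((theta_cl - theta) ** 2)       # approx sigma_xi^2 + sigma_eps^2 = 1.25
mse_vr = np.mean((theta_cl - xi - theta) ** 2)  # approx sigma_eps^2 = 0.25
```

The intercepted noise `xi` is removed exactly, so the variance-reduced estimator stays unbiased while its MSE drops from $\sigma_\xi^2 + \sigma_\varepsilon^2$ to $\sigma_\varepsilon^2$.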
It is also the idea powering PPI and related methods: the unlabeled dataset and the predictive model are used to intercept some of the noise in the classical statistical estimator $\hat{\theta}^{\mathrm{cl}} \equiv \bar{Y}$; compare to Eq. (2) with $\xi \equiv \bar{Z}^f - \tilde{Z}^f$. We refer to Ji et al. (2025) and Gronsbell et al. (2025) for informative discussions of how PPI relates to traditional ideas in semi-parametric inference, as in e.g., Robins et al. (1994).

Prior information (P) via empirical Bayes (EB). In the Bayesian approach we seek to improve upon $\hat{\theta}^{\mathrm{cl}}$ by using the prior information that $\theta \sim N(0, \sigma_\theta^2)$. The Bayes estimator,

$$\mathbb{E}\big[\theta \mid \hat{\theta}^{\mathrm{cl}}\big] = \frac{\sigma_\theta^2}{\sigma_\xi^2 + \sigma_\varepsilon^2 + \sigma_\theta^2}\,\hat{\theta}^{\mathrm{cl}},$$

reduces variance by shrinking $\hat{\theta}^{\mathrm{cl}}$ toward 0 (at the cost of introducing some bias). When $\sigma_\theta^2$ is small, the MSE of $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}}]$ can be substantially smaller than that of $\hat{\theta}^{\mathrm{cl}}$. Now suppose that the variance of the prior, $\sigma_\theta^2$, is unknown, but we observe data from multiple related problems generated from the same model and indexed by $j \in [m]$, say, $\theta_j \overset{\text{iid}}{\sim} N(0, \sigma_\theta^2)$ and $\hat{\theta}^{\mathrm{cl}}_j \overset{\text{ind}}{\sim} N(\theta_j, \sigma_\xi^2 + \sigma_\varepsilon^2)$. Then an EB analysis can mimic the MSE of the oracle Bayesian that has full knowledge of the prior. To wit, we can estimate $\sigma_\theta^2$ as

$$\hat{\sigma}_\theta^2 = \frac{1}{m-2}\sum_{j=1}^{m}\big(\hat{\theta}^{\mathrm{cl}}_j\big)^2 - (\sigma_\xi^2 + \sigma_\varepsilon^2),$$

and then consider a plug-in approximation of the Bayes rule, $\hat{\theta}^{\mathrm{JS}}_j = \hat{\mathbb{E}}[\theta_j \mid \hat{\theta}^{\mathrm{cl}}_j] := \{\hat{\sigma}_\theta^2/(\sigma_\xi^2 + \sigma_\varepsilon^2 + \hat{\sigma}_\theta^2)\}\,\hat{\theta}^{\mathrm{cl}}_j$. The resulting estimator is the celebrated James-Stein estimator (James & Stein, 1961; Efron & Morris, 1973), whose risk is very close to the Bayes risk under the hierarchical model for large $m$ (Efron, 2010, equation (1.25)). The James-Stein estimator also always dominates the classical estimator under a frequentist evaluation of the compound risk in (9), under the assumption that $\hat{\theta}^{\mathrm{cl}}_j \overset{\text{ind}}{\sim} N(\theta_j, \sigma_\xi^2 + \sigma_\varepsilon^2)$ and $m \geq 3$: $R_m(\hat{\theta}^{\mathrm{JS}}, \theta) < R_m(\hat{\theta}^{\mathrm{cl}}, \theta)$ for all $\theta \in \mathbb{R}^m$.

Contextual prior information (CP) via EB.
Instead of using the same prior for each problem, we may try to sharpen the prior and increase its relevance (Efron, 2011) by using further contextual information $\phi$. In the stylized example, as seen in Table 1, such an approach reduces the variance of the prior from $\sigma_\theta^2$ to $\sigma^2_{\theta|\phi} < \sigma_\theta^2$, with a corresponding MSE reduction for the Bayes estimator $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}}, \phi]$. With multiple related problems, such a strategy can be instantiated via EB shrinkage toward an informative but biased predictor (Fay III & Herriot, 1979; Green & Strawderman, 1991; Mukhopadhyay & Maiti, 2004; Kou & Yang, 2017; Rosenman et al., 2023). The strategy of this form that is closest to our proposal is the covariate-powered EB approach of Ignatiadis & Wager (2019), recently applied to large language model evaluation by Fogliato et al. (2024). Therein (following the notation of Section 2.2), the analyst has access to classical estimators $\bar{Y}_j$, $j \in [m]$, and problem-specific covariates $W_j$, and seeks to shrink $\bar{Y}_j$ toward ML models that predict $\bar{Y}_j$ from $W_j$. By contrast, in our setting we have observation-level covariates $X_{ij}$, and the ML model operates on these covariates. In principle, one could simultaneously use both types of covariates: problem-specific and observation-specific.

Combine variance reduction (VR) and prior information (P). One can shrink the variance-reduced estimator $\hat{\theta}^{\mathrm{cl}} - \xi$ toward 0 via $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}} - \xi] = \{\sigma_\theta^2/(\sigma_\varepsilon^2 + \sigma_\theta^2)\}(\hat{\theta}^{\mathrm{cl}} - \xi)$. In the context of PPI, variance reduction and prior information (with a more heavy-tailed prior) are used by Cortinovis & Caron (2025) within the Frequentist-Assisted by Bayes (FAB) framework of Yu & Hoff (2018). Cortinovis & Caron (2025) only consider a single problem and do not pursue an empirical Bayes approach.

Combine P, CP, and VR together. Finally, in our stylized example, we can get the smallest MSE (last row of Table 1) by using variance reduction, shrinkage, and a contextual prior together.
In that case, the Bayes estimator $\mathbb{E}[\theta \mid \hat{\theta}^{\mathrm{cl}} - \xi, \phi]$ takes the form

$$\frac{\sigma^2_{\theta|\phi}}{\sigma_\varepsilon^2 + \sigma^2_{\theta|\phi}}\,\big(\hat{\theta}^{\mathrm{cl}} - \xi\big) + \frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + \sigma^2_{\theta|\phi}}\,\mathbb{E}[\theta \mid \phi]. \quad (12)$$

EB ideas can be used to mimic the estimator above and provide the starting point for the proposal we describe next.

4. Prediction-Powered Adaptive Shrinkage

At a high level, PAS aims to provide a lightweight approach that outperforms both baselines in (2) and PPI/PPI++ in terms of MSE when estimating multiple means. PAS also aims for minimal modeling requirements and assumptions. The stylized example from Section 3 serves as a guiding analogy. We seek to benefit from ML predictions in two ways: first by variance reduction (acting akin to $\xi$ in the stylized example), and second by increasing prior relevance (acting as a proxy for $\phi$). We implement both steps to adapt to the unknown data-generating process in an assumption-lean way, using within-problem information for the first step (Section 4.1) and across-problem information for the second step (Section 4.2), drawing on ideas from the EB literature.

4.1. The Within-Problem Power-Tuning Stage

Extending the notation from (4) to each problem $j$ provides us with a class of unbiased estimators $\hat{\theta}^{\mathrm{PPI}}_{j,\lambda} := \bar{Y}_j + \lambda(\tilde{Z}^f_j - \bar{Z}^f_j)$, $\lambda \in \mathbb{R}$. Calculating the variance gives

$$\mathrm{Var}_{\eta_j}\big[\hat{\theta}^{\mathrm{PPI}}_{j,\lambda}\big] = \frac{\sigma_j^2}{n_j} + \underbrace{\frac{n_j + N_j}{n_j N_j}\lambda^2\tau_j^2 - \frac{2\lambda\gamma_j}{n_j}}_{=:\ \delta_j(\lambda)}.$$

Note that the classical estimator has risk $\sigma_j^2/n_j$ and gets outperformed whenever $\delta_j(\lambda) < 0$. We can further analytically solve for the optimal $\lambda$, which yields

$$\lambda_j^* := \arg\min_{\lambda}\ \delta_j(\lambda) = \frac{N_j}{n_j + N_j}\cdot\frac{\gamma_j}{\tau_j^2}, \quad (13)$$

and the Power-Tuned (PT) estimator $\hat{\theta}^{\mathrm{PT}}_j := \hat{\theta}^{\mathrm{PPI}}_{j,\lambda_j^*}$ with

$$\bar{\sigma}_j^2 := \mathrm{Var}_{\eta_j}\big[\hat{\theta}^{\mathrm{PT}}_j\big] = \frac{\sigma_j^2}{n_j} - \frac{N_j}{n_j(n_j + N_j)}\cdot\frac{\gamma_j^2}{\tau_j^2}. \quad (14)$$

The formulation of the above PT estimators is well understood in the single-problem setting (Angelopoulos et al., 2024; Miao et al., 2024a). In PAS, we execute this stage separately for each problem, as the optimal power-tuning parameter is problem-dependent and varies case by case.

4.2.
The Across-Problem Adaptive Shrinkage Stage

The PT estimator derived in Section 4.1 already possesses many appealing properties: it is unbiased and has lower variance than both the classical estimator and vanilla PPI. However, as our setting involves working with many parallel problems together, there is the opportunity for further MSE reduction by introducing bias in a targeted way.⁹ Concretely, based on the PT estimator obtained in Section 4.1, we consider a class of shrinkage estimators $\hat{\theta}^{\mathrm{PAS}}_\omega := (\hat{\theta}^{\mathrm{PAS}}_{1,\omega}, \ldots, \hat{\theta}^{\mathrm{PAS}}_{m,\omega})^\top$, where for any $\omega \geq 0$,

$$\hat{\theta}^{\mathrm{PAS}}_{j,\omega} := \omega_j\,\hat{\theta}^{\mathrm{PT}}_j + (1 - \omega_j)\,\tilde{Z}^f_j, \quad \text{with } \omega_j \equiv \omega_j(\omega) := \frac{\omega}{\omega + \bar{\sigma}_j^2}. \quad (15)$$

The motivation is to formally match the form of the Bayes estimator with variance reduction and contextual prior information in (12) with the following (approximate) analogies:¹⁰

$$\hat{\theta}^{\mathrm{cl}} - \xi \;\leftrightarrow\; \hat{\theta}^{\mathrm{PT}}_j, \qquad \mathbb{E}[\theta \mid \phi] \;\leftrightarrow\; \tilde{Z}^f_j, \qquad \sigma_\varepsilon^2 \;\leftrightarrow\; \bar{\sigma}_j^2, \qquad \sigma^2_{\theta|\phi} \;\leftrightarrow\; \omega. \quad (16)$$

The highlighted $\omega$ is a global shrinkage parameter that acts as follows: (i) Fixing $\omega$, any problem whose PT estimator has higher variance possesses a smaller $\omega_j$ and shrinks more toward $\tilde{Z}^f_j$; a smaller variance increases $\omega_j$ and makes the final estimator closer to $\hat{\theta}^{\mathrm{PT}}_j$. (ii) Fixing all the problems, increasing $\omega$ has the overall effect of recovering $\hat{\theta}^{\mathrm{PT}}_j$ for all $j$ (full recovery when $\omega \to \infty$), while setting $\omega = 0$ recovers $\tilde{Z}^f_j$.

Points (i) and (ii) establish the conceptual importance of $\omega$. If we could choose $\omega$ in an optimal way, that is, $\omega^* \in \arg\min_{\omega \geq 0}\{R_m(\hat{\theta}^{\mathrm{PAS}}_\omega, \theta)\}$, then the resulting estimator $\hat{\theta}^{\mathrm{PAS}}_{\omega^*}$ would satisfy all our desiderata. While this construction is not feasible, since the compound risk function in (9) depends on the unknown $\eta, \theta$, we can make progress by pursuing a classical statistical idea: we can develop an unbiased estimate of the compound risk (Mallows, 1973; Stein, 1981; Efron, 2004) and then use it as a surrogate for tuning $\omega$.

⁹See Appendix A.2.1 for some explanations of why we can improve MSE by borrowing information across problems.
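The within-problem power-tuning stage can be sketched as follows (a minimal implementation under our own naming; the second moments $\gamma_j, \tau_j^2, \sigma_j^2$ are treated as known, as in Section 4.1):

```python
import numpy as np

def power_tuned(y_lab, f_lab, f_unlab, gamma, tau2, sigma2):
    """Per-problem power-tuned PPI estimate and its variance, Eqs. (13)-(14).

    y_lab:   labeled outcomes Y_ij              (length n_j)
    f_lab:   predictions f(X_ij), labeled set   (length n_j)
    f_unlab: predictions f(X~_ij), unlabeled    (length N_j)
    gamma, tau2, sigma2: Cov[f(X), Y], Var[f(X)], Var[Y], assumed known.
    """
    n, N = len(y_lab), len(f_unlab)
    lam = (N / (n + N)) * gamma / tau2                         # optimal lambda*_j
    est = y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean()) # PT estimate
    var = sigma2 / n - (N / (n * (n + N))) * gamma ** 2 / tau2 # Eq. (14)
    return est, var
```

Setting `lam = 1` in the combination recovers vanilla PPI and `lam = 0` the classical estimator; when `gamma = 0` (predictions uncorrelated with outcomes), the PT estimate falls back to the classical mean with variance `sigma2 / n`.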
To this end, we define the Correlation-Aware Unbiased Risk Estimate (CURE),

$$\mathrm{CURE}\big(\hat{\theta}^{\mathrm{PAS}}_\omega\big) := \frac{1}{m}\sum_{j=1}^{m}\Big[(2\omega_j - 1)\,\bar{\sigma}_j^2 + 2(1 - \omega_j)\,\bar{\gamma}_j + (1 - \omega_j)^2\big(\hat{\theta}^{\mathrm{PT}}_j - \tilde{Z}^f_j\big)^2\Big].$$

Both the formula and our nomenclature ("correlation-aware") highlight the fact that we must account for the potentially non-zero covariance between the shrinkage source $\hat{\theta}^{\mathrm{PT}}_j$ and target $\tilde{Z}^f_j$, which can be explicitly written down as

$$\bar{\gamma}_j := \mathrm{Cov}_{\eta_j}\big[\hat{\theta}^{\mathrm{PT}}_j, \tilde{Z}^f_j\big] = \lambda_j^*\,\mathrm{Var}_{\eta_j}\big[\tilde{Z}^f_j\big] = \frac{\gamma_j}{n_j + N_j}. \quad (17)$$

¹⁰We comment more on these analogies in Appendix A.2.2.

Figure 2. A flowchart illustration of the PAS method. See Algorithm 1 for a pseudo-code implementation.

Algorithm 1 Prediction-Powered Adaptive Shrinkage
Require: $(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $\gamma_j, \tau_j, \sigma_j$ for $j \in [m]$, predictive model $f$
1: for $j = 1$ to $m$ do
2:   Step 1: Apply predictor (Eq. (1))
3:   $\bar{Y}_j, \bar{Z}^f_j, \tilde{Z}^f_j$ = get_means($(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $f$)
4:   Step 2: Power tuning (Eq. (13))
5:   $\lambda_j^*$ = get_pt_param($\gamma_j, \tau_j, n_j, N_j$)
6:   $\hat{\theta}^{\mathrm{PT}}_j = \bar{Y}_j + \lambda_j^*(\tilde{Z}^f_j - \bar{Z}^f_j)$
7:   $\bar{\sigma}_j^2$ = get_pt_var($\hat{\theta}^{\mathrm{PT}}_j$) (Eq. (14))
8: end for
9: Step 3: Adaptive shrinkage (Eq. (18))
10: $\hat{\omega}$ = get_shrink_param($(\hat{\theta}^{\mathrm{PT}}_j)_{j=1}^m$, $(\tilde{Z}^f_j)_{j=1}^m$, $(\bar{\sigma}_j^2)_{j=1}^m$)
11: for $j = 1$ to $m$ do
12:   $\hat{\omega}_j = \hat{\omega}/(\hat{\omega} + \bar{\sigma}_j^2)$
13:   $\hat{\theta}^{\mathrm{PAS}}_j = \hat{\omega}_j\hat{\theta}^{\mathrm{PT}}_j + (1 - \hat{\omega}_j)\tilde{Z}^f_j$
14: end for
15: return $\{\hat{\theta}^{\mathrm{PAS}}_j\}_{j=1}^m$

Theorem 4.1. Under Assumption 2.2, CURE is an unbiased estimator of the compound risk defined in (9); that is, for all $\omega \geq 0$ and all $\eta$,

$$\mathbb{E}_{\eta}\big[\mathrm{CURE}\big(\hat{\theta}^{\mathrm{PAS}}_\omega\big)\big] = R_m\big(\hat{\theta}^{\mathrm{PAS}}_\omega, \theta\big).$$

See Appendices B and F.1 for the proof and motivation. With Theorem 4.1 in hand, we now have a systematic strategy for picking $\omega$ by minimizing CURE, following the paradigm of tuning-parameter selection via minimization of an unbiased risk estimate (as advocated by, e.g., Li (1985); Donoho & Johnstone (1995); Xie et al. (2012); Candès et al. (2013); Ignatiadis & Wager (2019); Ghosh et al. (2025)):¹¹

$$\hat{\omega} \in \arg\min_{\omega \geq 0}\ \mathrm{CURE}\big(\hat{\theta}^{\mathrm{PAS}}_\omega\big). \quad (18)$$

Even though $\hat{\omega}$ does not admit a closed-form expression, the one-dimensional minimization can be efficiently carried out numerically (e.g., grid search). The final PAS estimator is:

$$\hat{\theta}^{\mathrm{PAS}}_j := \hat{\theta}^{\mathrm{PAS}}_{j,\hat{\omega}} = \frac{\hat{\omega}}{\hat{\omega} + \bar{\sigma}_j^2}\,\hat{\theta}^{\mathrm{PT}}_j + \frac{\bar{\sigma}_j^2}{\hat{\omega} + \bar{\sigma}_j^2}\,\tilde{Z}^f_j.$$

¹¹The connection to EB is the following. Xie et al. (2012) and Tibshirani & Rosset (2019) explain that James-Stein-type estimators may be derived by tuning $\sigma_\theta^2$ (in Section 3) via minimization of Stein's (1981) unbiased risk estimate (SURE).

Figure 2 visualizes the full method for constructing the PAS estimator, from applying the predictor and obtaining aggregated statistics to going through the two stages described in Section 4.1 and this section. A pseudo-code implementation is also presented in Algorithm 1.

To illustrate the flexibility and adaptivity of PAS, we briefly revisit the synthetic model in Example 2.3, whose special structure allows us to visualize how the power-tuning and adaptive shrinkage parameters vary across problems and different predictors. In Figure 3, we consider $m = 200$ problems and two predictors: a good predictor $f_1(x) = x^2$ and a flawed predictor $f_2(x) = |x|$. The model setup in (8) is such that the magnitude of $\mathrm{Cov}_{\eta_j}[X_j, Y_j]$ relative to $\mathrm{Var}_{\eta_j}[Y_j]$ is much larger for problems with $\eta_j$ closer to the origin. Therefore, for both predictors, we see a dip in $\lambda_j^*$ near the middle (top panel), which shows that PAS adapts to the level of difficulty of each problem when deciding how much power-tuning to apply. On the other hand (bottom panel), the overall shrinkage effect is much stronger (smaller $\hat{\omega}_j$ for all $j$) with $f_1$ than with $f_2$, which demonstrates PAS's ability to adapt to the predictor's quality across problems while still allowing each problem to have its own shrinkage level. Numerical results are postponed to Section 7.

5. Asymptotic Optimality

In (18), we proposed selecting $\hat{\omega}$ by optimizing an unbiased surrogate of the true risk.
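As a sketch of the across-problem stage (names are ours; the paper's Algorithm 1 wraps this together with the PT stage), $\hat{\omega}$ can be found by a simple grid search over CURE, after which the PAS combination is applied problem by problem:

```python
import numpy as np

def pas_from_pt(theta_pt, z_unlab, pt_var, gamma_bar, grid=None):
    """Adaptive shrinkage stage: minimize CURE over a grid of omega >= 0,
    then shrink each PT estimate toward its prediction mean, Eq. (15).

    theta_pt:  PT estimates, one per problem
    z_unlab:   prediction means on the unlabeled data, one per problem
    pt_var:    variances of the PT estimates (Eq. (14))
    gamma_bar: Cov[PT_j, prediction mean_j] = gamma_j / (n_j + N_j), Eq. (17)
    """
    if grid is None:
        grid = np.concatenate([[0.0], np.geomspace(1e-8, 1e4, 600)])
    d2 = (theta_pt - z_unlab) ** 2
    best_omega, best_cure = 0.0, np.inf
    for omega in grid:
        w = omega / (omega + pt_var)                 # omega_j(omega)
        cure = np.mean((2 * w - 1) * pt_var
                       + 2 * (1 - w) * gamma_bar
                       + (1 - w) ** 2 * d2)          # CURE at this omega
        if cure < best_cure:
            best_omega, best_cure = omega, cure
    w = best_omega / (best_omega + pt_var)
    return w * theta_pt + (1 - w) * z_unlab, best_omega
```

When the PT estimates and prediction means agree, CURE is minimized at $\omega = 0$ (full shrinkage toward the prediction means); when they disagree strongly, the selected $\omega$ grows and the PAS estimates stay close to the PT estimates.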
In this section, we justify this procedure theoretically. Our first result establishes that CURE approximates the true loss (whose expectation is the compound risk in (9)) uniformly in $\omega$ as we consider more and more problems.

Proposition 5.1. Suppose the datasets are generated according to Assumptions 2.1 and 2.2, and further assume that $\mathbb{E}_{P_\eta}[Y_{ij}^4] < \infty$ and $\mathbb{E}_{P_\eta}[f(X_{ij})^4] < \infty$. Then,

$$\sup_{\omega \geq 0}\,\Big|\mathrm{CURE}\big(\hat{\theta}^{\mathrm{PAS}}_\omega\big) - \ell_m\big(\hat{\theta}^{\mathrm{PAS}}_\omega, \theta\big)\Big| = o(1),$$

where $o(1)$ denotes a term that converges to 0 as $m \to \infty$.

A principal consequence of Proposition 5.1 is that PAS with the data-driven choice of $\hat{\omega}$ in (18) has asymptotically smaller Bayes MSE (defined in (11)) than any of the estimators in (15), i.e., it has smaller MSE than both baselines in (2) as well as the PPI and PT estimators.

Theorem 5.2. Under the assumptions of Proposition 5.1,

$$B^{P_\eta}_m\big(\hat{\theta}^{\mathrm{PAS}}_{\hat{\omega}}\big) \leq \inf_{\omega \geq 0}\Big\{B^{P_\eta}_m\big(\hat{\theta}^{\mathrm{PAS}}_\omega\big)\Big\} + o(1) \quad \text{as } m \to \infty,$$

so

$$B^{P_\eta}_m\big(\hat{\theta}^{\mathrm{PAS}}_{\hat{\omega}}\big) \leq \min\Big\{B^{P_\eta}_m\big(\tilde{Z}^f\big),\ B^{P_\eta}_m\big(\hat{\theta}^{\mathrm{PT}}\big)\Big\} + o(1).$$

Figure 3. The power-tuning and adaptive shrinkage parameters, $\lambda_j^*$ (top panel) and $\hat{\omega}_j$ (bottom panel), across $m = 200$ problems in Example 2.3, shown for $f_1(x) = x^2$ and $f_2(x) = |x|$. On the x-axis, we identify each problem by its $\eta_j = \mathbb{E}[X_{ij}]$ so the trend is more visible.

Our next proposition connects Theorem 5.2 with the lowest possible MSE in the stylized Gaussian example of Section 3 (last row of Table 1).

Proposition 5.3. In addition to the assumptions of Proposition 5.1, further assume that $N_j = \infty$ and that there exist $n \in \mathbb{N}$, $\bar{\sigma}^2 > 0$ such that $n_j = n$ and $\bar{\sigma}_j^2 = \bar{\sigma}^2$ for all $j$ almost surely. Let $\beta^2 := \mathbb{E}_{P_\eta}[(\tilde{Z}^f_j - \theta_j)^2]$ (which does not depend on $j$, as we are integrating over $P_\eta$). Then,

$$B^{P_\eta}_m\big(\hat{\theta}^{\mathrm{PAS}}_{\hat{\omega}}\big) \leq \frac{\bar{\sigma}^2\beta^2}{\bar{\sigma}^2 + \beta^2} + o(1) \quad \text{as } m \to \infty.$$

To interpret the result, it is instructive to compare the asymptotic upper bound on the MSE of PAS with the MSE in the last line of Table 1, i.e., with $(\sigma_\varepsilon^2\sigma^2_{\theta|\phi})/(\sigma_\varepsilon^2 + \sigma^2_{\theta|\phi})$.
Observe that $\bar{\sigma}^2$ plays the role of $\sigma_\varepsilon^2$ (as already anticipated in (16)), which is smaller than the variance of the classical estimator (due to power tuning). Meanwhile, $\beta^2$ plays the role of $\sigma^2_{\theta|\phi}$. If the baseline $\tilde{Z}^f_j$ (that is, the mean of the ML predictions on the unlabeled datasets) does a good job of predicting $\theta_j$, then $\beta^2$ will be small, and so PAS may have MSE substantially smaller than that of PT. On the other hand, even if $\beta^2$ is large (that is, even if the ML model is very biased), PAS asymptotically still has MSE less than or equal to $\bar{\sigma}^2$, the MSE of PT. We emphasize that the role of Proposition 5.3 is to provide intuition at the expense of strong assumptions.¹² By contrast, the statement of Theorem 5.2 does not restrict heterogeneity (e.g., heteroscedasticity) across problems and allows for varying, finite unlabeled and labeled sample sizes.

¹²We further elaborate on this proposition in Appendix A.2.3.

6. PAS with Unknown Second Moments

So far we have assumed that the second moments $\sigma_j^2, \tau_j^2, \gamma_j$ in (7) are known. In practice, e.g., in our numerical experiments with real-world datasets, we use sample-based estimates $\hat{\sigma}_j^2, \hat{\tau}_j^2$, and $\hat{\gamma}_j$ instead (Appendix C.1). This approach works well empirically, even for small $n_j$, but is not covered by the theory above. To address this shortcoming, we develop the variants Uni-PT (Appendix C.2) and Uni-PAS (Appendix C.3), which provide asymptotic guarantees (as $m \to \infty$) without requiring known or consistently estimable second moments ($n_j, N_j$ remain bounded). Briefly, Uni-PT targets an optimal single power-tuning parameter $\lambda$ across all problems, and does so consistently as $m \to \infty$. Uni-PAS then builds upon Uni-PT, applying adaptive shrinkage similar to PAS, but with mechanisms to handle the unknown moments. These extensions are justified by theoretical results akin to those for PAS (cf. Theorem 5.2). In our experiments below, Uni-PAS is competitive with PAS.

7.
Experiments

We apply the proposed PAS estimator and conduct extensive experiments on both the synthetic model from Example 2.3 and two real-world datasets.

Estimators. We compare the PAS estimator against both classical and modern baseline estimators: (i.) the classical estimator; (ii.) the prediction mean on unlabeled data; (iii.) the vanilla PPI estimator (Angelopoulos et al., 2023); (iv.) the PT estimator (Angelopoulos et al., 2024, PPI++); (v.) the shrink-classical estimator, which directly shrinks the classical estimator toward the prediction mean; (vi.) the shrink-average estimator, which shrinks $\hat\theta^{\mathrm{PT}}_j$ toward the PT group mean across all problems, i.e., toward $\bar\theta^{\mathrm{PT}} := \frac{1}{m}\sum_{j=1}^m \hat\theta^{\mathrm{PT}}_j$; (vii.) the UniPT and UniPAS estimators introduced in Section 6 and detailed in Appendix C. We include the estimators in (vii.) only for the real-world experiments, as they are specifically designed for settings where the second moments are unknown. See Appendix D for detailed formulations and implementations of (v.) and (vi.) (the latter of which is inspired by the SURE-grand-mean estimator of Xie et al. (2012)). The comparisons with (iv.) and (v.) also directly serve as ablation studies of the two stages in constructing the PAS estimator.

Metrics. We report the mean squared error (MSE) (± 1 standard error) of each estimator $\hat\theta$ by averaging $\frac{1}{m}\sum_{j=1}^m (\hat\theta_j - \theta_j)^2$ across $K = 200$ Monte Carlo replicates. In the synthetic model, we sample $\eta_j$ (and thus $\theta_j$) from the known prior $P_\eta$.

Table 2. MSE (± standard error) of different estimators under the synthetic model with predictors $f_1(x) = x^2$ and $f_2(x) = |x|$.

Estimator          MSE $f_1$ (×10⁻³)   MSE $f_2$ (×10⁻³)
Classical          3.142 ± 0.033        3.142 ± 0.033
Prediction Avg     0.273 ± 0.004       34.335 ± 0.147
PPI                2.689 ± 0.027        2.756 ± 0.027
PT                 2.642 ± 0.027        2.659 ± 0.026
Shrink Classical   0.273 ± 0.003        3.817 ± 0.042
Shrink Avg         2.486 ± 0.026        2.575 ± 0.026
PAS (ours)         0.272 ± 0.003        2.496 ± 0.025
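The benchmarking metric above is straightforward to compute; the following minimal sketch (array and function names are hypothetical, not from the paper's code) mirrors the averaging over problems and over Monte Carlo replicates:

```python
import numpy as np

def compound_mse(theta_hat, theta):
    """Compound loss (1/m) * sum_j (theta_hat_j - theta_j)^2 for one replicate."""
    theta_hat, theta = np.asarray(theta_hat), np.asarray(theta)
    return float(np.mean((theta_hat - theta) ** 2))

def mc_mse(estimates_by_rep, truths_by_rep):
    """Average the compound MSE over K Monte Carlo replicates;
    returns (mean, standard error of the mean)."""
    per_rep = np.array([compound_mse(est, th)
                        for est, th in zip(estimates_by_rep, truths_by_rep)])
    return per_rep.mean(), per_rep.std(ddof=1) / np.sqrt(len(per_rep))
```

The reported "MSE ± 1 standard error" entries in the tables correspond to the pair returned by `mc_mse`.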
For the real-world datasets, since $P_\eta$ is unknown, we follow the standard evaluation strategy in the PPI literature: we start with a large labeled dataset and use it to compute a pseudo-ground truth for each mean $\theta_j$. Then, in each Monte Carlo replicate, we randomly split the data points of each problem into labeled/unlabeled partitions (where we choose a 20/80 split ratio).¹³

For the real-world datasets, we also introduce a second metric to assess whether improvements in MSE are driven by a few difficult problems rather than consistent performance gains: the percentage of problems improved relative to the classical estimator (abbreviated as "Improved %"). This metric is defined as
$$\frac{1}{m}\sum_{j=1}^m \mathbf{1}\big[(\hat\theta_j - \theta_j)^2 < (\hat\theta^{\mathrm{Classical}}_j - \theta_j)^2\big] \times 100\%.$$
Larger values of this metric are preferable.

7.1. Synthetic Model

This is the synthetic model from Example 2.3, where we choose $m = 200$, $n_j = 20$, and $N_j = 80$ for all $j$. Since we have already visualized the model and the parameters of the PAS estimator in previous sections, we simply report the numerical results (for the good predictor $f_1$ and the flawed predictor $f_2$) in Table 2. For both predictors, we see that PAS outperforms all the baselines. With a good predictor $f_1$, both the prediction mean and the shrinkage estimator closely track PAS; in contrast, the PPI and PT estimators fail to fully leverage the accurate predictions, as their design enforces unbiasedness. The situation reverses for the less reliable predictor $f_2$: the prediction mean and the shrinkage estimator have high MSE, while estimators with built-in debiasing mechanisms demonstrate greater resilience. PAS adapts effectively across these extremes, making it a dependable choice for a wide range of problems and predictors.

7.2. Real-World Datasets

We next evaluate PAS on two large-scale real-world datasets, highlighting its ability to leverage state-of-the-art deep learning models in different settings.¹⁴

Table 3. Results aggregated over K = 200 replicates on the real-world tasks: Amazon review ratings with BERT-base and BERT-tuned predictors, and spiral galaxy fractions with a ResNet50 predictor. The bottom two rows are the UniPT and UniPAS estimators detailed in Appendix C as variants of PT and PAS. Metrics are reported with ± 1 standard error.

                   Amazon (base f)                 Amazon (tuned f)                Galaxy
Estimator          MSE (×10⁻³)     % Improved      MSE (×10⁻³)     % Improved      MSE (×10⁻³)    % Improved
Classical          24.305 ± 0.189  baseline        24.305 ± 0.189  baseline        2.073 ± 0.028  baseline
Prediction Avg     41.332 ± 0.050  30.7 ± 0.2       3.945 ± 0.011  75.4 ± 0.2      7.195 ± 0.008  17.0 ± 0.2
PPI                11.063 ± 0.085  62.4 ± 0.2       7.565 ± 0.066  70.4 ± 0.2      1.149 ± 0.017  59.4 ± 0.3
PT                 10.633 ± 0.089  70.3 ± 0.2       6.289 ± 0.050  76.0 ± 0.2      1.026 ± 0.015  67.7 ± 0.3
Shrink Classical   15.995 ± 0.121  56.4 ± 0.3       3.828 ± 0.039  78.9 ± 0.2      1.522 ± 0.016  48.8 ± 0.4
Shrink Avg          9.276 ± 0.078  70.4 ± 0.2       6.280 ± 0.058  77.1 ± 0.2      0.976 ± 0.014  68.9 ± 0.3
PAS (ours)          8.517 ± 0.071  71.4 ± 0.2       3.287 ± 0.024  80.8 ± 0.2      0.893 ± 0.011  67.3 ± 0.4
UniPT (ours)       10.272 ± 0.084  70.0 ± 0.2       6.489 ± 0.053  76.2 ± 0.2      1.017 ± 0.015  67.7 ± 0.3
UniPAS (ours)       8.879 ± 0.073  69.5 ± 0.2       3.356 ± 0.031  77.6 ± 0.2      0.909 ± 0.011  66.7 ± 0.3

Amazon Review Ratings (SNAP, 2014). Many commercial and scientific studies involve collecting a large corpus of text and estimating an average score (rating, polarity, etc.) from it. A practitioner will often combine limited expert annotations with massive automatic evaluations from ML models (Baly et al., 2020; Egami et al., 2023; Fan et al., 2024; Mozer & Miratrix, 2025). To emulate this setup, we consider mean rating estimation problems using the Amazon Fine Food Reviews dataset from Kaggle, where we artificially hide the labels in a random subset of the full data to serve as the unlabeled partition. Concretely, we estimate the average rating for the top $m = 200$ products with the most reviews (from 200 to 900 reviews per product).

¹³In Appendix E, we vary this ratio from 1% to 40% and provide more details on the benchmarking procedure for MSE.
For the $i$-th review of the $j$-th product, the covariate $X_{ij}$ consists of the review's title and text concatenated, while the outcome $Y_{ij}$ is the star rating in $\{1, \ldots, 5\}$. We employ two black-box predictors: (1) BERT-base, a language model without fine-tuning (Devlin et al., 2019), and (2) BERT-tuned, the same model fine-tuned on a held-out set of reviews from other products. Neither model is trained on the reviews of the 200 products we evaluate.

Spiral Galaxy Fractions (Willett et al., 2013). The Galaxy Zoo 2 project contains the classification results of galaxy images from the Sloan Digital Sky Survey (York et al., 2000, SDSS). We are interested in estimating the fraction of galaxies that are classified as spiral, i.e., that have at least one spiral arm. The covariates $X_{ij}$ in this application are images (we provide some examples in Figure 5 of the appendix). Existing PPI papers have focused on estimating the overall fraction; we demonstrate how this dataset's metadata structure enables compound estimation of spiral fractions across distinct galaxy subgroups. We first use a pre-defined partition of the galaxies into 100 subgroups based on the metadata attribute WVT BIN. Second, we estimate the fraction of spiral galaxies in all of the galaxy subgroups simultaneously. The predictor is a ResNet50 network trained on a held-out set of around 50k images.

For both datasets, we randomly split the data for each problem (a food product or galaxy subgroup) into a labeled and an unlabeled partition with a 20/80 ratio. We repeat the random splitting $K = 200$ times and report metrics averaged over all splits.

¹⁴We include only the essential setup below and defer additional data and model details (e.g., hyper-parameters, preprocessing) to Appendix E.
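The repeated labeled/unlabeled splitting can be sketched as follows (a minimal illustration; the function and argument names are hypothetical, not from the paper's code):

```python
import numpy as np

def split_problem(outcomes, predictions, labeled_frac=0.2, seed=None):
    """Randomly split one problem's data into a labeled partition
    (gold-standard outcomes plus ML predictions) and an unlabeled
    partition (ML predictions only), mirroring the 20/80 split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(outcomes))
    n_lab = int(round(labeled_frac * len(outcomes)))
    lab, unlab = idx[:n_lab], idx[n_lab:]
    return (outcomes[lab], predictions[lab]), predictions[unlab]
```

Repeating this split with $K$ different seeds and re-running all estimators each time yields the Monte Carlo averages reported in Table 3.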
We summarize the results in Table 3:

- Amazon Review: similar to the trend in the synthetic model, the more accurate BERT-tuned model enables stronger shrinkage for PAS, while the biased BERT-base predictions necessitate less shrinkage. Our PAS estimator adapts to both predictors and outperforms the other baselines: PAS has the lowest MSE and the highest "% Improved".

- Galaxy Zoo 2: the predictions from ResNet50 are suboptimal, so the variance reduction from power tuning dominates any benefit from shrinkage. PAS achieves the lowest MSE among all estimators and improves individual estimates at a level on par with the shrink-average estimator.

8. Conclusion

This paper introduces PAS, a novel method for compound mean estimation that effectively combines PPI and EB principles. We motivate the problem through the lens of variance reduction and contextual prior information, and then demonstrate how PAS achieves both goals, in theory and in practice. Our paper differs from many other PPI-related works in its focus on estimation, so a natural next step is to develop average-coverage-controlling intervals for the means, centered around PAS. To this end, it may be fruitful to build on the robust empirical Bayes confidence intervals of Armstrong et al. (2022). Modern scientific inquiries increasingly demand the simultaneous analysis of multiple related problems. The framework developed in this paper represents a promising direction for such settings.

Acknowledgments

We thank Dan Kluger and Qingqi Zhang for thoughtful feedback on an earlier version of our manuscript. We also thank Claire Donnat, Yichen Ji, Lihua Lei, Aaron Schein, and Tijana Zrnic for helpful discussions. We are grateful to Bennett Hunter and the compute cluster at the Data Science Institute for supporting our computational resources. We thank all four reviewers for their constructive feedback. N.I. gratefully acknowledges support from NSF (DMS 2443410).
Impact Statement

By developing a method to improve the efficiency and accuracy of mean estimation using machine learning predictions, this research has the potential to enhance data analysis across various domains where labeled data is limited but predictive models are available. We acknowledge that advancements in machine learning can have broader societal consequences. However, the ethical considerations directly arising from this methodological contribution are those typically associated with the general advancement of statistical methods and machine learning. We do not foresee specific negative ethical impacts unique to this work that require further detailed discussion.

References

Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I., and Zrnic, T. Prediction-powered inference. Science, 382(6671):669–674, 2023.

Angelopoulos, A. N., Duchi, J. C., and Zrnic, T. PPI++: Efficient Prediction-Powered Inference. arXiv preprint, arXiv:2311.01453, 2024.

Armstrong, T. B., Kolesár, M., and Plagborg-Møller, M. Robust empirical Bayes confidence intervals. Econometrica, 90(6):2567–2602, 2022.

Baly, R., Da San Martino, G., Glass, J., and Nakov, P. We can detect your bias: Predicting the political ideology of news articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4982–4991. Association for Computational Linguistics, 2020.

Breidt, F. J. and Opsomer, J. D. Model-assisted survey estimation with modern prediction techniques. Statistical Science, 32(2), 2017.

Candès, E. J., Sing-Long, C. A., and Trzasko, J. D. Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19):4643–4657, 2013.

Cochran, W. G. Sampling Techniques. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 3rd edition, 1977.

Cortinovis, S. and Caron, F. FAB-PPI: Frequentist, Assisted by Bayes, Prediction-Powered Inference.
arXiv preprint, arXiv:2502.02363, 2025.

Deng, A., Xu, Y., Kohavi, R., and Walker, T. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 123–132, Rome, Italy, 2013. ACM.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pp. 4171–4186, 2019.

Donoho, D. L. and Johnstone, I. M. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, 1995.

Durrett, R. Probability: Theory and Examples, volume 49 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5th edition, 2019.

Efron, B. The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632, 2004.

Efron, B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press, Cambridge, 2010.

Efron, B. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

Efron, B. and Morris, C. Stein's estimation rule and its competitors: an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.

Egami, N., Hinck, M., Stewart, B., and Wei, H.
Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models. In Advances in Neural Information Processing Systems, volume 36, pp. 68589–68601, 2023.

Fan, S., Visokay, A., Hoffman, K., Salerno, S., Liu, L., Leek, J. T., and McCormick, T. H. From narratives to numbers: Valid inference using language model predictions from verbal autopsy narratives. arXiv preprint, arXiv:2404.02438, 2024.

Fay III, R. E. and Herriot, R. A. Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a):269–277, 1979.

Fisch, A., Maynez, J., Hofer, R., Dhingra, B., Globerson, A., and Cohen, W. Stratified prediction-powered inference for effective hybrid evaluation of language models. In Advances in Neural Information Processing Systems, volume 37, 2024.

Fogliato, R., Patil, P., Akpinar, N.-J., and Monfort, M. Precise model benchmarking with only a few observations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 9563–9575. Association for Computational Linguistics, 2024.

Ghosh, S., Ignatiadis, N., Koehler, F., and Lee, A. Stein's unbiased risk estimate and Hyvärinen's score matching. arXiv preprint, arXiv:2502.20123, 2025.

Green, E. J. and Strawderman, W. E. A James-Stein type estimator for combining unbiased and possibly biased estimators. Journal of the American Statistical Association, 86(416):1001–1006, 1991.

Gronsbell, J., Gao, J., Shi, Y., McCaw, Z. R., and Cheng, D. Another look at inference after prediction. arXiv preprint, arXiv:2411.19908, 2025.

Hart, R. E., Bamford, S. P., Willett, K. W., Masters, K. L., Cardamone, C., Lintott, C. J., Mackay, R. J., Nichol, R. C., Rosslowe, C. K., Simmons, B. D., et al.
Galaxy Zoo: comparing the demographics of spiral arm number and a new method for correcting redshift bias. Monthly Notices of the Royal Astronomical Society, 461(4):3663–3682, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.

Hickernell, F. J., Lemieux, C., and Owen, A. B. Control variates for Quasi-Monte Carlo. Statistical Science, 20(1):1–31, 2005.

Ignatiadis, N. and Wager, S. Covariate-powered empirical Bayes estimation. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Ignatiadis, N., Saha, S., Sun, D. L., and Muralidharan, O. Empirical Bayes mean estimation with nonparametric errors via order statistic regression on replicated data. Journal of the American Statistical Association, 118(542):987–999, 2023.

James, W. and Stein, C. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 361–379, 1961.

Ji, W., Lei, L., and Zrnic, T. Predictions as surrogates: Revisiting surrogate outcomes in the age of AI. arXiv preprint, arXiv:2501.09731, 2025.

Jiang, W. and Zhang, C.-H. General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684, 2009.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Kou, S. C. and Yang, J. J. Optimal shrinkage estimation in heteroscedastic hierarchical linear models. In Ahmed, S. E.
(ed.), Big and Complex Data Analysis, pp. 249–284. Springer International Publishing, Cham, 2017.

Lavenberg, S. S. and Welch, P. D. A perspective on the use of control variables to increase the efficiency of Monte Carlo simulations. Management Science, 27(3):322–335, 1981.

Li, K.-C. From Stein's Unbiased Risk Estimates to the method of Generalized Cross Validation. The Annals of Statistics, 13(4):1352–1377, 1985.

Liang, F., Mukherjee, S., and West, M. The use of unlabeled data in predictive modeling. Statistical Science, 22(2):189–205, 2007.

Lin, J. Y.-Y., Liao, S.-M., Huang, H.-J., Kuo, W.-T., and Ou, O. H.-M. Galaxy morphological classification with efficient vision transformer. In Machine Learning and the Physical Sciences Workshop at the 35th Conference on Neural Information Processing Systems, 2021.

Lin, W. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics, 7(1):295–318, 2013.

Mallows, C. L. Some comments on CP. Technometrics, 15(4):661, 1973.

Miao, J., Miao, X., Wu, Y., Zhao, J., and Lu, Q. Assumption-lean and data-adaptive post-prediction inference. arXiv preprint, arXiv:2311.14220, 2024a.

Miao, J., Wu, Y., Sun, Z., Miao, X., Lu, T., Zhao, J., and Lu, Q. Valid inference for machine learning-assisted genome-wide association studies. Nature Genetics, pp. 1–9, 2024b.

Mozer, R. and Miratrix, L. More power to you: Using machine learning to augment human coding for more efficient inference in text-based randomized trials. The Annals of Applied Statistics, 19(1):440–464, 2025.

Mukhopadhyay, P. and Maiti, T. Two stage non-parametric approach for small area estimation. Proceedings of ASA Section on Survey Research Methods, pp. 4058–4065, 2004.

Nair, P. B. and Abraham, R. G. On the fraction of barred spiral galaxies. The Astrophysical Journal Letters, 714(2):L260, 2010.

Penedo, G., Kydlíček, H., Allal, L.
B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Robbins, H. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pp. 131–149. University of California Press, 1951.

Robbins, H. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp. 157–163. The Regents of the University of California, 1956.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

Rosenman, E. T., Basse, G., Owen, A. B., and Baiocchi, M. Combining observational and experimental datasets using shrinkage estimators. Biometrics, pp. biom.13827, 2023.

SNAP. Amazon Fine Food Reviews (Stanford Network Analysis Project). Kaggle dataset, 2014. URL https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews/data. Accessed: 2024-01-01.

Soloff, J. A., Guntuboyina, A., and Sen, B. Multivariate, heteroscedastic empirical Bayes via nonparametric maximum likelihood. Journal of the Royal Statistical Society Series B: Statistical Methodology, pp. qkae040, 2024.

Stein, C. M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151, 1981.

Tibshirani, R. J. and Rosset, S. Excess optimism: How biased is the apparent error of an estimator tuned by SURE?
Journal of the American Statistical Association, 114(526):697–712, 2019.

Town, N. bert-base-multilingual-uncased-sentiment, 2023. URL https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment.

Wang, S., McCormick, T. H., and Leek, J. T. Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48):30266–30275, 2020.

Willett, K. W., Lintott, C. J., Bamford, S. P., Masters, K. L., Simmons, B. D., Casteels, K. R., Edmondson, E. M., Fortson, L. F., Kaviraj, S., Keel, W. C., et al. Galaxy Zoo 2: detailed morphological classifications for 304,122 galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, 435(4):2835–2860, 2013.

Wolf, T. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint, arXiv:1910.03771, 2019.

Xie, X., Kou, S., and Brown, L. D. SURE estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500):1465–1479, 2012.

York, D. G., Adelman, J., Anderson Jr., J. E., Anderson, S. F., Annis, J., Bahcall, N. A., Bakken, J., Barkhouser, R., Bastian, S., Berman, E., et al. The Sloan Digital Sky Survey: Technical summary. The Astronomical Journal, 120(3):1579, 2000.

Yu, C. and Hoff, P. D. Adaptive multigroup confidence intervals with constant coverage. Biometrika, 105(2):319–335, 2018.

Zrnic, T. A note on the prediction-powered bootstrap. arXiv preprint, arXiv:2405.18379, 2025.

Zrnic, T. and Candès, E. J. Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15):e2322083121, 2024.

Appendix: Table of Contents

A. Further Connections and Intuitions Behind PAS
B. The Correlation-Aware Unbiased Risk Estimate
C. Details on PAS with Unknown Second Moments
D. Baseline Shrinkage Estimators
E. Experiment Details
F. Proofs of Theoretical Results

A. Further Connections and Intuitions Behind PAS

A.1. Further Connections to Related Works

In this part, we elaborate on two important connections to existing work.

Stratified PPI (StratPPI; Fisch et al. (2024)). Fisch et al. (2024) also consider a setting with stratification of observations into subproblems. However, their overall aim is still to estimate a single parameter, rather than multiple parameters, as we do. Thus, in solving individual problems, for instance, they still adhere to the requirement of unbiasedness. As an example, let us revisit the Galaxy Zoo 2 application. Therein, we mentioned the following distinction between our work and previous methods in the PPI literature:

- Estimate the fraction $\theta$ of spiral galaxies across the whole universe (Angelopoulos et al., 2023).
- (Our work) Estimate the fraction $\theta_j$ of spiral galaxies within the $j$-th cluster of galaxies for $j \in [m]$.

Now suppose that galaxies in the whole universe can be partitioned into the $m$ clusters above. Then the goal of StratPPI is:

- Estimate the fraction $\theta$ of spiral galaxies across the whole universe by proceeding with an intermediate step that involves estimating $\theta_1, \ldots, \theta_m$.
We continue to make this connection explicit in our mean estimation setting. Suppose (dropping subscripts for convenience) that we start with covariate-outcome pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ drawn i.i.d. from $\mathbb{P}$. Moreover, suppose that there exist pairwise disjoint strata $A_1, \ldots, A_m$ whose union is the full covariate space $\mathcal{X}$, that is, $\mathcal{X} = \bigcup_{j=1}^m A_j$. Then define
$$w_j := \mathbb{P}[X \in A_j], \qquad \theta_j := \mathbb{E}[Y \mid X \in A_j], \qquad j \in [m],$$
and observe the key equality for $\theta := \mathbb{E}[Y]$:
$$\theta = \sum_{j=1}^m w_j \theta_j.$$
StratPPI assumes that the probabilities $w_j$ are known exactly and then samples $n_j$ labeled as well as $N_j$ unlabeled observations from the conditional distribution $\mathbb{P}[\,\cdot \mid X \in A_j]$.¹⁵ These samples are then used alongside PPI++ to estimate $\theta_j$ by $\hat\theta^{\mathrm{PT}}_j$, almost verbatim to the approach we described in Section 4.1. Finally, StratPPI estimates $\theta$ via
$$\sum_{j=1}^m w_j \hat\theta^{\mathrm{PT}}_j.$$
To summarize, the settings of StratPPI and our paper are similar, but the goals and methods differ. StratPPI seeks an unbiased estimate of a single parameter $\theta$, while we seek to estimate a parameter vector $(\theta_1, \ldots, \theta_m)$ as well as possible in mean squared error, while allowing for the possibility of bias. Moreover, the asymptotic regimes in the two papers are different: StratPPI keeps $m$ fixed and takes $n_j, N_j \to \infty$, while we assume that the number of problems grows ($m \to \infty$) while $n_j, N_j$ are allowed to remain bounded.

¹⁵Part of the contribution of StratPPI includes a strategy for allocating resources and choosing $n_j$ and $N_j$ for different problems with a view toward minimizing the variance of the final estimator of $\theta$.

PPI++ (Angelopoulos et al., 2024) with multivariate mean estimand. Angelopoulos et al. (2024) (and other papers in the PPI literature) consider statistical problems going beyond the estimation of a univariate mean. For instance, the results of Angelopoulos et al. (2024) also accommodate multivariate mean estimands.
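To make the contrast with StratPPI concrete: its final step collapses the per-stratum PT estimates into a single number, whereas PAS reports the entire vector. A minimal sketch of that aggregation step (input names are hypothetical, not from the paper's code):

```python
import numpy as np

def stratppi_estimate(theta_pt, weights):
    """StratPPI final step: collapse per-stratum PT estimates into one
    estimate of the overall mean, theta = sum_j w_j * theta_j, using
    known stratum probabilities w_j. PAS, by contrast, reports the
    entire vector of per-problem estimates (after shrinkage)."""
    theta_pt, weights = np.asarray(theta_pt), np.asarray(weights)
    assert np.isclose(weights.sum(), 1.0), "stratum probabilities must sum to 1"
    return float(weights @ theta_pt)
```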
Here we explain how one can frame our compound mean estimation setting as the estimation of a single but multivariate mean estimand, which then fits into the more general formulation for power tuning considered in PPI++. If we assume $n_j = n$ and $N_j = N$ for all $j$, we can collate the responses and predictions from all problems into $(f(X_{i\cdot}), Y_{i\cdot})_{i=1}^{n}$ and $(f(\tilde X_{i\cdot}))_{i=1}^{N}$, where
$$Y_{i\cdot} = (Y_{i1}, \ldots, Y_{im})^\top, \quad f(X_{i\cdot}) = (f(X_{i1}), \ldots, f(X_{im}))^\top, \quad f(\tilde X_{i\cdot}) = (f(\tilde X_{i1}), \ldots, f(\tilde X_{im}))^\top.$$
By independence across problems, we have the covariance structures $\mathrm{Var}[f(\tilde X_{i\cdot})] = \mathrm{diag}((\tau^2_j)_{j=1}^m)$ and $\mathrm{Cov}[f(X_{i\cdot}), Y_{i\cdot}] = \mathrm{diag}((\gamma_j)_{j=1}^m)$, where $\mathrm{diag}(\cdot)$ constructs a diagonal matrix with its arguments on the diagonal. Angelopoulos et al. (2024) then consider a class of estimators indexed by a single weighting parameter $\lambda$ that is applied to the entire vector:
$$\hat\theta^{\mathrm{PPI}}_\lambda := \frac{1}{n}\sum_{i=1}^n Y_{i\cdot} + \lambda\left(\frac{1}{N}\sum_{i=1}^N f(\tilde X_{i\cdot}) - \frac{1}{n}\sum_{i=1}^n f(X_{i\cdot})\right).$$
The optimal $\lambda$ is chosen by minimizing the trace of an asymptotic covariance $\Sigma_\lambda$. For the mean estimation problem, $\Sigma_\lambda$ simplifies to the covariance of $\hat\theta^{\mathrm{PPI}}_\lambda$ and can be exactly calculated in the finite-sample setting (without relying on $n, N \to \infty$). Minimizing $\mathrm{Tr}(\Sigma_\lambda)$ over $\lambda$ thus admits a simple closed-form solution:
$$\lambda^* := \frac{n^{-1}\,\mathrm{Tr}(\mathrm{Cov}[f(X_{i\cdot}), Y_{i\cdot}])}{\frac{n+N}{nN}\,\mathrm{Tr}(\mathrm{Var}[f(\tilde X_{i\cdot})])} = \frac{n^{-1}\sum_{j=1}^m \gamma_j}{\frac{n+N}{nN}\sum_{j=1}^m \tau^2_j}. \tag{19}$$
Going back to our original formulation with multiple problems, what we are doing here is to power tune across problems with a single (univariate) $\lambda$ so as to minimize the sum of MSEs, which equals $\mathrm{Tr}(\Sigma_\lambda)$ in the PPI++ formulation. Finally, if $n_j$ and $N_j$ vary across $j$, the above PPI++ formulation fails to hold, but the idea of power tuning all problems together carries over to our multiple-problem setting. The univariate optimal tuning parameter takes the same form as Equation (19):
$$\lambda^* = \frac{\sum_{j=1}^m n_j^{-1}\gamma_j}{\sum_{j=1}^m \frac{n_j + N_j}{n_j N_j}\tau^2_j}.$$
In fact, this choice of $\lambda$ leads to a new estimator, denoted UniPT, that will be further explored in Appendix C.

A.2.
Further Intuitions Behind the Design of PAS

A.2.1. WHY DOES SHARING INFORMATION ACROSS PROBLEMS HELP?

Here we provide an explanation of the fundamental statistical difference between estimating a single mean versus estimating many means ($m \to \infty$), with an eye toward providing intuition for how we can share information across problems. We emphasize that this is only meant for intuition; the empirical Bayes literature, starting with James & Stein (1961), provides a more nuanced picture.

Imagine a simplified setting with only a single problem ($m = 1$, as in the standard PPI setting). We further assume that $N_j = \infty$, so that $\bar Z^f_1 \equiv \mathbb{E}_{\eta_1}[f(\tilde X_1)]$, and we also assume that $\sigma^2_1$ is known. Moreover, suppose we want to choose between the PT estimator $\hat\theta^{\mathrm{PT}}_1$ and the prediction mean $\bar Z^f_1$ by comparing their MSEs in a data-driven way. We know that the PT estimator is unbiased and its MSE is $\sigma^2_1$, but we cannot estimate $\mathbb{E}[(\theta_1 - \bar Z^f_1)^2]$ accurately since we only have a single $\theta_1$ (a single problem). At best, we can use $(\hat\theta^{\mathrm{PT}}_1 - \bar Z^f_1)^2 - \sigma^2_1$ as an unbiased estimate of this quantity, since $\mathbb{E}_{\eta_1}[(\hat\theta^{\mathrm{PT}}_1 - \bar Z^f_1)^2] - \sigma^2_1 = \mathbb{E}_{\eta_1}[(\theta_1 - \bar Z^f_1)^2]$.

Now suppose we have multiple problems. Then we can more precisely learn how good the prediction mean $\bar Z^f_j$ is for estimating $\theta_j$ by repeating the above estimation procedure for each problem and averaging the results. This average turns out to be not only unbiased but also consistent. To wit, as $m \to \infty$, by the law of large numbers,
$$\frac{1}{m}\sum_{j=1}^m \Big[(\hat\theta^{\mathrm{PT}}_j - \bar Z^f_j)^2 - \sigma^2_j\Big] \;\xrightarrow{P}\; \mathbb{E}_{P_\eta}\big[(\theta_j - \bar Z^f_j)^2\big], \tag{20}$$
where we emphasize that the mean squared error on the right-hand side also integrates with respect to the meta-distribution $P_\eta$ that determines the distribution of the $\theta_j$ (cf. Assumption 2.1).
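The consistent estimate in (20) is a one-line computation. The sketch below (variable names are hypothetical) also checks it against a simulated ground truth:

```python
import numpy as np

def avg_prediction_sq_error(theta_pt, z_f, sigma2):
    """Plug-in estimate of E[(theta_j - Zbar^f_j)^2] as in Equation (20):
    average, over the m problems, of (theta_pt_j - z_f_j)^2 - sigma2_j.
    Each summand is unbiased; the average is consistent as m grows."""
    theta_pt, z_f, sigma2 = map(np.asarray, (theta_pt, z_f, sigma2))
    return float(np.mean((theta_pt - z_f) ** 2 - sigma2))
```

For instance, if each $\hat\theta^{\mathrm{PT}}_j = \theta_j + \varepsilon_j$ with $\varepsilon_j \sim \mathcal{N}(0, 1)$ and each prediction mean carries a constant bias of 0.5, the estimate concentrates around 0.25 for large $m$.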
This illustrates a mechanism for sharing information across problems: in a first step, we can consistently estimate the average squared difference between the true parameters $\theta_j$ and the prediction means $\bar Z^f_j$ (i.e., how good the ML predictor is on average) by aggregating information from many parallel problems. Then, based on the first step, we can choose whether to use $\bar Z^f$ or $\hat\theta^{\mathrm{PT}}$ according to their MSE. Our actual implementation is more involved but embodies the same fundamental principle: we construct a consistent estimate of the MSE of a family of shrinkage estimators. While other factors such as heterogeneity in sample sizes ($n_j, N_j$) introduce further complexities, the core idea of sharing information carries over to the overall design and justification of PAS.

A.2.2. ANALOGY BETWEEN $\bar Z^f_j$ AND $\mathbb{E}[\theta \mid \phi]$

In Equation (16), we informally draw an analogy between the prediction mean $\bar Z^f_j$ in the compound PPI problem and the posterior mean $\mathbb{E}[\theta \mid \phi]$ in the stylized Gaussian model. Our goal here is to provide a heuristic motivation for the one-dimensional parameterized family of weights $\omega_j$ in Equation (15). We emphasize that the ultimate success of this parameterization choice is judged by the empirical results. Our heuristic motivation, building on the stylized example of Section 3, is as follows. We seek to estimate $\theta$ by taking a convex combination of the classical estimator $\hat\theta^{\mathrm{cl}} = \xi$ and a shrinkage target $s$, namely $w\,\hat\theta^{\mathrm{cl}} + (1 - w)s$ for $w \in [0, 1]$. Our main point is that for several choices of shrinkage target $s$, the best weight $w$ can be written in the form
$$w = \frac{\omega}{\omega + \sigma^2_\varepsilon}, \tag{21}$$
for some $\omega \ge 0$.¹⁶

The Bayes estimator with contextual prior in Equation (12) uses $s = \mathbb{E}[\theta \mid \phi]$ and weight $w$ as in (21) with $\omega = \mathbb{E}[(\theta - \mathbb{E}[\theta \mid \phi])^2] = \mathbb{E}[\mathrm{Var}[\theta \mid \phi]] = \sigma^2_{\theta|\phi}$. The Bayes estimator without contextual prior uses $s = \mathbb{E}[\theta] = 0$ and weight $w$ as in (21) with $\omega = \mathbb{E}[(\theta - \mathbb{E}[\theta])^2] = \mathrm{Var}(\theta) = \sigma^2_\theta$.

¹⁶At this stage, there is a one-to-one mapping between $\omega \ge 0$ and $w \in [0, 1]$.
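The form (21) can be verified numerically in the simplified case of a noiseless shrinkage target: the MSE of $w\,\hat\theta + (1-w)s$ is then $w^2\sigma^2_\varepsilon + (1-w)^2\omega$, and a grid search recovers the closed form (a minimal illustration with hypothetical names and arbitrary constants):

```python
import numpy as np

def combo_mse(w, sigma2_eps, omega):
    """MSE of w*theta_hat + (1-w)*s when theta_hat is unbiased with
    variance sigma2_eps and the (noiseless) target s satisfies
    E[(theta - s)^2] = omega."""
    return w**2 * sigma2_eps + (1 - w)**2 * omega

def optimal_weight(omega, sigma2_eps):
    """Closed form (21): w = omega / (omega + sigma2_eps)."""
    return omega / (omega + sigma2_eps)
```

For example, with $\sigma^2_\varepsilon = 2$ and $\omega = 0.5$, both the grid minimizer and the closed form give $w = 0.2$: the better the target (smaller $\omega$), the more we shrink toward it.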
However, for PAS, $\sigma^2_\varepsilon$ corresponds to the variance of the $j$-th power-tuning estimator and will differ from problem to problem. It is then convenient to parameterize all the problems by the same parameter $\omega \geq 0$.

Suppose now that we ask for the best convex combination (not necessarily a Bayes predictor) between $\hat{\theta}^{\mathrm{cl}}_\xi$ and $s = h(\phi)$, where $h(\cdot)$ is some fixed function. Then the weight $w^\star$ minimizing the MSE can be shown to take the form in (21) with

$$\omega^\star = E[(\theta - h(\phi))^2] = E[\mathrm{Var}[\theta \mid \phi]] + E\left[\left(h(\phi) - E[\theta \mid \phi]\right)^2\right].$$

The above expression is interesting as it forces us to inflate $\omega^\star$, i.e., to shrink less toward $h(\phi)$, in a way that depends on how close $h(\phi)$ is to $E[\theta \mid \phi]$. See Ignatiadis & Wager (2019, last paragraph of Section 3) and Appendix A.2.3 below for further discussion of this point. The takeaway is that for many possible predictors, the optimal weights have the same parameterized form up to the single parameter $\omega^\star$ that varies according to the quality of the predictor. This motivates our one-dimensional family of weights. Once this family has been motivated, we learn $\omega$ in Equation (15) in a way that does not depend on the above analogy at all, by minimizing CURE.

A.2.3. ELABORATION ON PROPOSITION 5.3

Suppose (as in Proposition 5.3) that $n_j$ is the same across all problems, $N_j = \infty$, and that the second moments $\rho_j, \tau_j, \sigma_j$ are identical across all problems. Then we could ask: what is the best convex combination between $\hat{\theta}^{\mathrm{PT}}_j$ and $\tilde{Z}^f_j$ in the following sense:

$$\omega^\star_j \in \arg\min_{\omega_j \geq 0} E_{P_\eta}\left[\left\{\theta_j - \left(\omega_j \hat{\theta}^{\mathrm{PT}}_j + (1 - \omega_j)\tilde{Z}^f_j\right)\right\}^2\right].$$

By direct calculation (note that the right-hand side is a convex quadratic in $\omega_j$) we find that:

$$\omega^\star_j = \frac{E_{P_\eta}\left[(\theta_j - \tilde{Z}^f_j)^2\right]}{E_{P_\eta}\left[(\theta_j - \tilde{Z}^f_j)^2\right] + \sigma^2}.$$

This implies the following intuitive result: the larger the MSE $E_{P_\eta}[(\theta_j - \tilde{Z}^f_j)^2]$, the less weight we should assign to $\tilde{Z}^f_j$. If we evaluate the MSE at this optimal $\omega^\star_j$, we recover precisely the upper bound of Proposition 5.3.

B.
The Correlation-Aware Unbiased Risk Estimate

Theorem B.1. Let $X, Y$ be two random variables satisfying $E_\theta[X] = \theta$, $\mathrm{Var}_\theta[X] = \sigma^2$, $\mathrm{Cov}_\theta[X, Y] = \gamma$, and such that the second moment of $Y$ exists (see footnote 17). Consider estimating $\theta$ with the shrinkage estimator $\hat{\theta}_c = cX + (1-c)Y$ with $c \in [0,1]$. Assuming that $\sigma^2$ and $\gamma$ are known, the following estimator,

$$\mathrm{CURE}(\hat{\theta}_c) := (2c - 1)\sigma^2 + 2(1-c)\gamma + \{(1-c)(X - Y)\}^2, \qquad (22)$$

defined as the Correlation-aware Unbiased Risk Estimate, is an unbiased estimator of the risk of $\hat{\theta}_c$ under quadratic loss. That is, letting $R(\hat{\theta}_c, \theta) := E_\theta[(\hat{\theta}_c - \theta)^2]$, it holds that $E_\theta[\mathrm{CURE}(\hat{\theta}_c)] = R(\hat{\theta}_c, \theta)$.

Proof. First, expand the risk:

$$R(\hat{\theta}_c, \theta) = E_\theta\left[(cX + (1-c)Y - \theta)^2\right] = \mathrm{Var}_\theta[cX + (1-c)Y] + \left(E_\theta[cX + (1-c)Y] - \theta\right)^2 = c^2\sigma^2 + (1-c)^2\mathrm{Var}_\theta[Y] + 2c(1-c)\gamma + \left[(1-c)(E_\theta[Y] - \theta)\right]^2.$$

Then, taking the expectation of $\mathrm{CURE}(\hat{\theta}_c)$:

$$E_\theta\left[\mathrm{CURE}(\hat{\theta}_c)\right] = \underbrace{(2c-1)\sigma^2 + 2(1-c)\gamma}_{\mathrm{I}} + E_\theta\left[\{(1-c)(X - Y)\}^2\right], \qquad (23)$$

where the last term is

$$E_\theta\left[\{(1-c)(X - Y)\}^2\right] = (1-c)^2\left[(E_\theta[X - Y])^2 + \mathrm{Var}_\theta[X - Y]\right] = (1-c)^2\left[(E_\theta[Y] - \theta)^2 + \sigma^2 + \mathrm{Var}_\theta[Y] - 2\gamma\right] = \left[(1-c)(E_\theta[Y] - \theta)\right]^2 + \underbrace{(1-c)^2\left(\sigma^2 + \mathrm{Var}_\theta[Y] - 2\gamma\right)}_{\mathrm{II}}.$$

With a little algebra, we observe

$$\mathrm{I} + \mathrm{II} = (2c-1)\sigma^2 + 2(1-c)\gamma + (1-c)^2\left(\sigma^2 + \mathrm{Var}_\theta[Y] - 2\gamma\right) = c^2\sigma^2 + (1-c)^2\mathrm{Var}_\theta[Y] + 2c(1-c)\gamma.$$

Thus, a term-by-term matching confirms $E_\theta[\mathrm{CURE}(\hat{\theta}_c)] = R(\hat{\theta}_c, \theta)$.

17 We redefine certain variables for generality of this result beyond the setting in Assumption 2.2. In this theorem, $\theta$ plays the role of $\eta$ in the main text, i.e., all the other parameters are deterministic given $\theta$.

Remark B.2 (Connection to SURE). Stein's Unbiased Risk Estimate (SURE) was proposed in Charles Stein's seminal work (1981) to study quadratic risk in Gaussian sequence models. As a simple special case of SURE, let $Z \sim N(\theta, \sigma^2)$ and let $h : \mathbb{R} \to \mathbb{R}$ be an absolutely continuous function with $E_\theta[|h'(Z)|] < \infty$; then SURE is defined as

$$\mathrm{SURE}(h) := (h(Z) - Z)^2 + 2\sigma^2 h'(Z) - \sigma^2,$$

with the property that $E_\theta[\mathrm{SURE}(h)] = R(h(Z), \theta) = E_\theta[(h(Z) - \theta)^2]$.
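Theorem B.1 is straightforward to check numerically. The sketch below builds an arbitrary correlated pair $(X, Y)$ (all constants are illustrative) and verifies by Monte Carlo that CURE averages to the true quadratic risk:

```python
import numpy as np

def cure(c, x, y, sigma2, gamma):
    """Correlation-aware Unbiased Risk Estimate, Eq. (22)."""
    return (2 * c - 1) * sigma2 + 2 * (1 - c) * gamma + ((1 - c) * (x - y)) ** 2

rng = np.random.default_rng(2)
theta, sigma2, c = 1.0, 0.4, 0.7
reps = 500_000
eps = rng.normal(size=reps)
x = theta + np.sqrt(sigma2) * eps                 # E[X] = theta, Var[X] = sigma2
y = 1.3 + 0.5 * np.sqrt(sigma2) * eps + 0.1 * rng.normal(size=reps)
gamma = 0.5 * sigma2                              # Cov[X, Y] by construction

est = c * x + (1 - c) * y
true_risk = np.mean((est - theta) ** 2)           # Monte Carlo quadratic risk
avg_cure = np.mean(cure(c, x, y, sigma2, gamma))  # average of CURE draws
print(avg_cure, true_risk)  # agree up to Monte Carlo error
```

Note that $Y$ here is biased for $\theta$ and correlated with $X$, which is exactly the regime where the correlation term $2(1-c)\gamma$ matters.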
A proof of this identity relies on Stein's lemma, an identity specific to Gaussian random variables (Stein, 1981). Now consider the specific linear shrinkage estimator $h_c(Z) := cZ + (1-c)Y$, with $c \in [0,1]$ and $Y \in \mathbb{R}$ fixed (that is, $Y$ is a constant, or $Y$ is independent of $Z$ and we condition on $Y$). Then SURE takes the following form:

$$\mathrm{SURE}(h_c) = (h_c(Z) - Z)^2 + 2\sigma^2 h_c'(Z) - \sigma^2 = (cZ + (1-c)Y - Z)^2 + 2c\sigma^2 - \sigma^2 = \left[(1-c)(Y - Z)\right]^2 + (2c - 1)\sigma^2 \overset{(\star)}{=} \mathrm{CURE}(h_c(Z)),$$

where in $(\star)$ we used the fact that in this case (with $Y$ fixed or $Y$ independent of $Z$) we have $\gamma = 0$, so that the definition of CURE in (22) simplifies. This explains how CURE, as defined in Theorem B.1, is connected to SURE.

We make one last remark: the derivation of SURE itself requires Gaussianity. However, for linear shrinkage rules such as $h_c(Z)$, SURE depends only on the first two moments of the distribution of $Z$ and is thus an unbiased estimator of quadratic risk in substantial generality, as long as $E_\theta[Z] = \theta$ and $\mathrm{Var}_\theta[Z] = \sigma^2$. This observation has been made by previous authors, e.g., Kou & Yang (2017) and Ignatiadis & Wager (2019), and is important for the assumption-lean validity of PAS.

C. Details on PAS with Unknown Second Moments

In this appendix we explain how to apply PAS when second moments are unknown. In Appendix C.1 we describe sample-based estimators of the second moments. We also develop Uni PT (Appendix C.2) and Uni PAS (Appendix C.3), two new estimators that extend our framework to scenarios where second moments are unknown and must be estimated from data. We derive these methods and their theoretical guarantees within the PAS asymptotic regime, where the number of problems $m \to \infty$ while the individual sample sizes $n_j, N_j$ remain bounded. Specifically, Uni PT introduces a global power-tuning strategy, and Uni PAS builds upon it to perform adaptive shrinkage, both without requiring knowledge of the true second-moment parameters.

Notations.
Throughout this section and the proof details in Appendix F, we write $\xrightarrow{P}$ for convergence in probability and $\xrightarrow{L_p}$ for $L_p$ convergence of random variables. We use the little-o notation $o(1)$ for any term that vanishes as $m \to \infty$. Similarly, for a sequence of random variables $X_m$, we write $X_m = o_P(1)$ if $X_m$ converges to zero in probability, and $X_m = O_P(1)$ if $X_m$ is bounded in probability. All stochastic order relations are understood to hold as $m \to \infty$.

C.1. Sample-based Estimators for $\sigma^2_j, \tau^2_j, \gamma_j$

We first write down the unbiased sample-based estimators for $\sigma^2_j$, $\tau^2_j$, and $\gamma_j$, assuming that $n_j \geq 2$:

$$\hat{\sigma}^2_j := \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(Y_{ij} - \bar{Y}_j\right)^2,$$
$$\hat{\tau}^2_j := \frac{1}{n_j + N_j - 1}\left[\sum_{i=1}^{n_j}\left(f(X_{ij}) - \bar{Z}^{N+n}_j\right)^2 + \sum_{i=1}^{N_j}\left(f(\tilde{X}_{ij}) - \bar{Z}^{N+n}_j\right)^2\right],$$
$$\hat{\gamma}_j := \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(Y_{ij} - \bar{Y}_j\right)\left(f(X_{ij}) - \bar{Z}^{N+n}_j\right),$$

where

$$\bar{Z}^{N+n}_j := \frac{1}{n_j + N_j}\left[\sum_{i=1}^{n_j} f(X_{ij}) + \sum_{i=1}^{N_j} f(\tilde{X}_{ij})\right].$$

These sample-based estimators serve a dual role. First, they are used in the practical implementation of the PT and PAS estimators for our numerical experiments on real-world datasets, where true second moments are unavailable. Second, they form the basis for defining the Uni PT and Uni PAS estimators, which we introduce next.

C.2. Uni PT: Power-tuning Across Problems with Estimated Second Moments

Under this new setup, the first step is to derive a variant of the PT estimator that does not require known second moments. It turns out that performing power tuning across problems using the same $\lambda$ for all problems (which we show in Appendix A.1 to be closely related to the multivariate estimation problem in PPI++) leads to a promising alternative.

Definition C.1 (Univariate Power Tuning (Uni PT)). We consider a family of estimators for each problem $j \in [m]$:

$$\hat{\theta}_{j,\lambda} = \bar{Y}_j + \lambda\left(\tilde{Z}^f_j - Z^f_j\right),$$

where all problems are controlled by a single, global power-tuning parameter $\lambda \in \mathbb{R}$. Our goal is to find the $\lambda$ that minimizes the sum of variances across all problems, $\sum_{j=1}^m \mathrm{Var}_{\eta_j}[\hat{\theta}_{j,\lambda}]$.
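The estimators of Appendix C.1 take only a few lines of code; the helper name and toy data below are our own illustrative choices:

```python
import numpy as np

def second_moment_estimates(y, fx, fx_unl):
    """Sample-based estimators of sigma_j^2, tau_j^2, gamma_j (Appendix C.1).
    y: labeled outcomes; fx: predictions on labeled X; fx_unl: on unlabeled X."""
    n, N = len(y), len(fx_unl)
    z_pool = (fx.sum() + fx_unl.sum()) / (n + N)   # pooled prediction mean
    sigma2 = np.sum((y - y.mean()) ** 2) / (n - 1)
    tau2 = (np.sum((fx - z_pool) ** 2) + np.sum((fx_unl - z_pool) ** 2)) / (n + N - 1)
    gamma = np.sum((y - y.mean()) * (fx - z_pool)) / (n - 1)
    return sigma2, tau2, gamma

rng = np.random.default_rng(3)
y = rng.normal(size=50)
fx = y + 0.1 * rng.normal(size=50)   # labeled predictions, correlated with y
fx_unl = rng.normal(size=200)        # unlabeled predictions
sigma2, tau2, gamma = second_moment_estimates(y, fx, fx_unl)
# sigma2 is the usual ddof=1 sample variance of y; tau2 pools labeled and
# unlabeled predictions around their common mean.
print(sigma2, tau2, gamma)
```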
The variance for each problem $j$ is:

$$\mathrm{Var}_{\eta_j}[\hat{\theta}_{j,\lambda}] = \frac{\sigma^2_j}{n_j} + \lambda^2\left(\frac{1}{n_j} + \frac{1}{N_j}\right)\tau^2_j - \frac{2\lambda}{n_j}\gamma_j,$$

where $\sigma^2_j = \mathrm{Var}_{\eta_j}[Y_{ij}]$, $\tau^2_j = \mathrm{Var}_{\eta_j}[f(X_{ij})]$, and $\gamma_j = \mathrm{Cov}_{\eta_j}[Y_{ij}, f(X_{ij})]$. Minimizing $\sum_{j=1}^m \mathrm{Var}_{\eta_j}[\hat{\theta}_{j,\lambda}]$ with respect to $\lambda$ yields the theoretically optimal global parameter for the given set of $m$ problems:

$$\lambda^\star_m = \frac{\sum_{j=1}^m n_j^{-1}\gamma_j}{\sum_{j=1}^m \left(\frac{1}{n_j} + \frac{1}{N_j}\right)\tau^2_j} = \frac{\sum_{j=1}^m n_j^{-1}\gamma_j}{\sum_{j=1}^m \frac{n_j + N_j}{n_j N_j}\tau^2_j}.$$

In practice, since $\gamma_j$ and $\tau^2_j$ are now assumed unknown, we replace them with their sample-based unbiased estimators $\hat{\gamma}_j$ and $\hat{\tau}^2_j$ (computed in Appendix C.1) to obtain the estimated global power-tuning parameter:

$$\hat{\lambda} = \frac{\sum_{j=1}^m n_j^{-1}\hat{\gamma}_j}{\sum_{j=1}^m \frac{n_j + N_j}{n_j N_j}\hat{\tau}^2_j}.$$

We clip $\hat{\lambda}$ to the interval $[0, 1]$ and write $\hat{\lambda}_{\mathrm{clip}} = \mathrm{clip}(\hat{\lambda}, [0, 1])$. Similarly, we denote the clipped version of $\lambda^\star_m$ by $\lambda^\star_{\mathrm{clip},m} := \mathrm{clip}(\lambda^\star_m, [0, 1])$. The Uni PT estimator for problem $j$ is then:

$$\hat{\theta}^{\mathrm{UPT}}_j = \bar{Y}_j + \hat{\lambda}_{\mathrm{clip}}\left(\tilde{Z}^f_j - Z^f_j\right).$$

The Uni PT estimator, by using a single data-driven $\hat{\lambda}$, offers a practical way to perform power tuning when per-problem moments are unknown. This approach is justified by the following theoretical results, which hold as $m \to \infty$.

Proposition C.2 (Asymptotic Consistency of the Clipped Global Tuning Parameter). Assume that:

1. Sample sizes $n_j, N_j$ are bounded: $2 \leq n_{\min} \leq n_j \leq n_{\max} < \infty$ and $1 \leq N_{\min} \leq N_j \leq N_{\max} < \infty$.
2. $E_{P_\eta}[\mathrm{Var}_{\eta_j}[\hat{\gamma}_j]] < \infty$ and $E_{P_\eta}[\mathrm{Var}_{\eta_j}[\hat{\tau}^2_j]] < \infty$.
3. $E_{P_\eta}[\gamma^2_j] < \infty$, $E_{P_\eta}[(\tau^2_j)^2] < \infty$, $E_{P_\eta}[(\sigma^2_j)^2] < \infty$ (see footnote 18).
4. (Denominator bounded away from 0) There exists some $\varepsilon > 0$ such that $\lim_{m\to\infty} \mathbb{P}\left(\frac{1}{m}\sum_{j=1}^m \frac{n_j + N_j}{n_j N_j}\tau^2_j > \varepsilon\right) = 1$.

Then the clipped estimated global tuning parameter $\hat{\lambda}_{\mathrm{clip}}$ converges in $L_2$ to the clipped theoretically optimal global tuning parameter $\lambda^\star_{\mathrm{clip},m}$:

$$\hat{\lambda}_{\mathrm{clip}} - \lambda^\star_{\mathrm{clip},m} \xrightarrow{L_2} 0 \quad \text{as } m \to \infty.$$

The proof is deferred to Appendix F.5. This consistency ensures that $\hat{\lambda}_{\mathrm{clip}}$ effectively targets the best single power-tuning parameter for the collection of $m$ problems. Building on this, we can state a result on the asymptotic variance of the Uni PT estimator.

Theorem C.3 (Asymptotic Variance Optimality of Uni PT).
Under the assumptions of Proposition C.2, the sum of variances of the Uni PT estimators asymptotically achieves the minimum possible sum of variances within the class of estimators $\mathcal{C} = \{(\hat{\theta}_{j,\lambda})_{j=1}^m \mid \hat{\theta}_{j,\lambda} = \bar{Y}_j + \lambda(\tilde{Z}^f_j - Z^f_j),\ \lambda \in [0,1]\}$, in the sense that:

$$\frac{1}{m}\sum_{j=1}^m \mathrm{Var}_{\eta_j}\left[\hat{\theta}^{\mathrm{UPT}}_j\right] - \min_{\lambda \in [0,1]} \frac{1}{m}\sum_{j=1}^m \mathrm{Var}_{\eta_j}\left[\hat{\theta}_{j,\lambda}\right] \xrightarrow{P} 0 \quad \text{as } m \to \infty,$$

where the left-hand side is still a random variable, with randomness from drawing $\eta_j \overset{\mathrm{iid}}{\sim} P_\eta$. The proof is deferred to Appendix F.6.

C.3. Uni PAS: CURE and Adaptive Shrinkage with Estimated Second Moments

Once we obtain the Uni PT estimator, which is asymptotically optimal within a class of unbiased estimators, our next goal is to imitate the steps in Section 4.2 to apply shrinkage across problems. To do so, we must first revisit the formulation of CURE and see how it depends on the now-unknown second-moment parameters. By definition:

$$\mathrm{CURE}\left(\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega\right) := \frac{1}{m}\sum_{j=1}^m\left[(2\omega_j - 1)\tilde{\sigma}^2_j + 2(1 - \omega_j)\tilde{\gamma}_j + (1 - \omega_j)^2\left(\hat{\theta}^{\mathrm{PT}}_j - \tilde{Z}^f_j\right)^2\right], \qquad \omega_j = \frac{\omega}{\omega + \tilde{\sigma}^2_j}. \qquad (24)$$

So there are three places where CURE makes use of $\tilde{\sigma}^2_j$ and $\tilde{\gamma}_j$, which in turn depend on the second moments of the data-generating process: (1) in the definition of the PT estimator, (2) in the definition of CURE itself, and (3) in determining the localized shrinkage level $\omega_j$. We thus make the following modifications to CURE using the sample-based estimators.

1. We first replace the PT estimator (the shrinkage source) with the Uni PT estimator. Accordingly, $\tilde{\sigma}^2_j$ is now the variance of $\hat{\theta}_{j,\lambda^\star_{\mathrm{clip},m}}$ (as defined in Definition C.1) and $\tilde{\gamma}_j$ is its covariance with $\tilde{Z}^f_j$. We can write them down explicitly as

$$\tilde{\sigma}^2_j := \frac{\sigma^2_j}{n_j} + \frac{N_j + n_j}{N_j n_j}\left(\lambda^\star_{\mathrm{clip},m}\right)^2\tau^2_j - \frac{2}{n_j}\lambda^\star_{\mathrm{clip},m}\gamma_j, \qquad \tilde{\gamma}_j := \lambda^\star_{\mathrm{clip},m}\frac{\tau^2_j}{N_j}.$$

2. For $\tilde{\sigma}^2_j$ and $\tilde{\gamma}_j$ in the definition of CURE, we replace them directly with the sample-based estimators

$$\hat{\tilde{\sigma}}^2_j := \frac{\hat{\sigma}^2_j}{n_j} + \frac{N_j + n_j}{N_j n_j}\hat{\lambda}^2_{\mathrm{clip}}\hat{\tau}^2_j - \frac{2}{n_j}\hat{\lambda}_{\mathrm{clip}}\hat{\gamma}_j, \qquad \hat{\tilde{\gamma}}_j := \hat{\lambda}_{\mathrm{clip}}\frac{\hat{\tau}^2_j}{N_j},$$

where $\hat{\sigma}^2_j, \hat{\tau}^2_j, \hat{\gamma}_j$ are defined in Appendix C.1.
18 Note that this condition is implied by the finite fourth-moment assumption in Proposition 5.1.

3. For $\tilde{\sigma}^2_j$ in the definition of $\omega_j$, we replace it with the following averaging estimator:

$$\check{\sigma}^2_j := \frac{\bar{\sigma}^2}{n_j} + \frac{N_j + n_j}{N_j n_j}\hat{\lambda}^2_{\mathrm{clip}}\bar{\tau}^2 - \frac{2}{n_j}\hat{\lambda}_{\mathrm{clip}}\bar{\gamma}, \qquad (25)$$

where

$$\bar{\sigma}^2 := \frac{1}{m}\sum_{j=1}^m \hat{\sigma}^2_j, \qquad \bar{\tau}^2 := \frac{1}{m}\sum_{j=1}^m \hat{\tau}^2_j, \qquad \bar{\gamma} := \frac{1}{m}\sum_{j=1}^m \hat{\gamma}_j$$

are the average sample (co)variances across all $m$ problems (see footnote 19).

With these modifications, we define a new class of shrinkage estimators based on $\hat{\theta}^{\mathrm{UPT}}_j$. For any global shrinkage parameter $\omega \geq 0$, the problem-specific shrinkage weight is now defined as:

$$\hat{\omega}_j := \hat{\omega}_j(\omega) = \frac{\omega}{\omega + \check{\sigma}^2_j}.$$

The corresponding family of shrinkage estimators for problem $j$ is:

$$\hat{\theta}^{\mathrm{UPAS}}_{j,\omega} := \hat{\omega}_j\hat{\theta}^{\mathrm{UPT}}_j + (1 - \hat{\omega}_j)\tilde{Z}^f_j.$$

We then define the modified CURE, denoted $\widehat{\mathrm{CURE}}$, by taking into account all the changes above (see footnote 20):

$$\widehat{\mathrm{CURE}}\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega\right) := \frac{1}{m}\sum_{j=1}^m\left[(2\hat{\omega}_j - 1)\hat{\tilde{\sigma}}^2_j + 2(1 - \hat{\omega}_j)\hat{\tilde{\gamma}}_j + (1 - \hat{\omega}_j)^2\left(\hat{\theta}^{\mathrm{UPT}}_j - \tilde{Z}^f_j\right)^2\right]. \qquad (26)$$

Finally, the Uni PAS estimator is obtained by selecting the $\omega$ that minimizes this $\widehat{\mathrm{CURE}}$:

Definition C.4 (Univariate Prediction-Powered Adaptive Shrinkage (Uni PAS)). The Uni PAS estimator for problem $j$ is $\hat{\theta}^{\mathrm{UPAS}}_j := \hat{\theta}^{\mathrm{UPAS}}_{j,\hat{\omega}}$, where $\hat{\omega} := \arg\min_{\omega \geq 0} \widehat{\mathrm{CURE}}(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega)$.

This Uni PAS estimator is fully data-driven and does not rely on knowledge of the true second-moment parameters. The pseudo-code for the full Uni PAS algorithm is given below. The fully data-driven construction of Uni PAS is supported by the following theoretical guarantee:

Proposition C.5 (Asymptotic Consistency of $\widehat{\mathrm{CURE}}$ for Uni PAS). On top of the assumptions in Proposition C.2 and Assumption 2.2, suppose further that $\inf_{j \in [m],\, m \in \mathbb{N}} \bar{\sigma}^2_{j,m} \geq \delta > 0$ for some fixed $\delta$, where

$$\bar{\sigma}^2_j \equiv \bar{\sigma}^2_{j,m} := \frac{\mu_{\sigma^2}}{n_j} + \frac{N_j + n_j}{N_j n_j}\left(\lambda^\star_{\mathrm{clip},m}\right)^2\mu_{\tau^2} - \frac{2}{n_j}\lambda^\star_{\mathrm{clip},m}\mu_\gamma \qquad (27)$$

with $\mu_{\sigma^2} := E_{P_\eta}[\sigma^2_j]$, $\mu_{\tau^2} := E_{P_\eta}[\tau^2_j]$, $\mu_\gamma := E_{P_\eta}[\gamma_j]$.
Then $\widehat{\mathrm{CURE}}(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega)$ is an asymptotically consistent estimator of the true loss $\ell_m(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega, \boldsymbol{\theta})$ of the Uni PAS estimator (which uses weights $\hat{\omega}_j = \omega/(\omega + \check{\sigma}^2_j)$), in the sense that:

$$\sup_{\omega \geq 0}\left|\widehat{\mathrm{CURE}}\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega\right) - \ell_m\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega, \boldsymbol{\theta}\right)\right| \xrightarrow{P} 0 \quad \text{as } m \to \infty.$$

We defer the proof to Appendix F.7. This consistency result ensures that minimizing $\widehat{\mathrm{CURE}}$ is asymptotically equivalent to minimizing the true MSE of the Uni PAS estimator, which leads to the following optimality guarantee.

19 The rationale is as follows: in defining $\hat{\omega}_j$ we pretend that $\sigma^2_j$, $\tau^2_j$, and $\gamma_j$ are the same across all problems and so can be estimated consistently. We emphasize that our theoretical results do not require that these second moments be identical (i.e., it is only a working modeling assumption).
20 We use the hat notation for $\widehat{\mathrm{CURE}}$ to make explicit that $\widehat{\mathrm{CURE}}$ is not an unbiased estimator of risk in finite samples. However, we will show that it is a consistent estimator of risk asymptotically as $m \to \infty$.

Algorithm 2: Uni PAS
Require: $(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$ for $j \in [m]$; predictive model $f$
1: for $j = 1$ to $m$ do
2:   Step 1: Apply the predictor (Eq. (1)) to get aggregated statistics and sample-based estimators of the second moments
3:   $\bar{Y}_j, Z^f_j, \tilde{Z}^f_j$ = get_means($(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $f$)
4:   $\hat{\sigma}^2_j, \hat{\gamma}_j, \hat{\tau}^2_j$ = get_sample_variances($(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $f$)
5:   Step 2: Univariate power tuning (Appendix C.2)
6:   $\hat{\lambda}_{\mathrm{clip}} = \mathrm{clip}\left(\sum_{j=1}^m n_j^{-1}\hat{\gamma}_j \,\big/\, \sum_{j=1}^m \frac{N_j + n_j}{N_j n_j}\hat{\tau}^2_j,\ [0, 1]\right)$
7:   $\hat{\theta}^{\mathrm{UPT}}_j = \bar{Y}_j + \hat{\lambda}_{\mathrm{clip}}(\tilde{Z}^f_j - Z^f_j)$
8:   Step 3: Construct sample-based estimators of $\mathrm{Var}[\hat{\theta}^{\mathrm{UPT}}_j]$ and $\mathrm{Cov}[\hat{\theta}^{\mathrm{UPT}}_j, \tilde{Z}^f_j]$
9:   $\hat{\tilde{\sigma}}^2_j = \frac{\hat{\sigma}^2_j}{n_j} + \frac{N_j + n_j}{N_j n_j}\hat{\lambda}^2_{\mathrm{clip}}\hat{\tau}^2_j - \frac{2}{n_j}\hat{\lambda}_{\mathrm{clip}}\hat{\gamma}_j$
10:  $\hat{\tilde{\gamma}}_j = \hat{\lambda}_{\mathrm{clip}}\hat{\tau}^2_j / N_j$
11: end for
12: Step 4: Adaptive shrinkage
13: $\bar{\sigma}^2 = \frac{1}{m}\sum_{j=1}^m \hat{\sigma}^2_j$, $\bar{\tau}^2 = \frac{1}{m}\sum_{j=1}^m \hat{\tau}^2_j$, $\bar{\gamma} = \frac{1}{m}\sum_{j=1}^m \hat{\gamma}_j$
14: for $j = 1$ to $m$ do
15:   Calculate the averaging variance estimator in Equation (25):
16:   $\check{\sigma}^2_j = \frac{\bar{\sigma}^2}{n_j} + \frac{N_j + n_j}{N_j n_j}\hat{\lambda}^2_{\mathrm{clip}}\bar{\tau}^2 - \frac{2}{n_j}\hat{\lambda}_{\mathrm{clip}}\bar{\gamma}$
17: end for
18: For each choice of $\omega \geq 0$, the shrinkage coefficient is $\hat{\omega}_j = \omega/(\omega + \check{\sigma}^2_j)$; the $\widehat{\mathrm{CURE}}$ calculation in Equation (26) still depends on the sample-based estimators $\hat{\tilde{\sigma}}^2_j, \hat{\tilde{\gamma}}_j$
19: $\hat{\omega}$ = get_shrink_param($(\hat{\theta}^{\mathrm{UPT}}_j)_{j=1}^m$, $(\tilde{Z}^f_j)_{j=1}^m$, $(\check{\sigma}^2_j)_{j=1}^m$, $(\hat{\tilde{\sigma}}^2_j)_{j=1}^m$, $(\hat{\tilde{\gamma}}_j)_{j=1}^m$)
20: for $j = 1$ to $m$ do
21:   $\hat{\omega}_j = \hat{\omega}/(\hat{\omega} + \check{\sigma}^2_j)$; $\hat{\theta}^{\mathrm{UPAS}}_j = \hat{\omega}_j\hat{\theta}^{\mathrm{UPT}}_j + (1 - \hat{\omega}_j)\tilde{Z}^f_j$
22: end for
23: return $\{\hat{\theta}^{\mathrm{UPAS}}_j\}_{j=1}^m$

Theorem C.6 (Asymptotic Bayes Risk Optimality of Uni PAS). Under the assumptions of Proposition C.5, the Uni PAS estimator $\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}} = \hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_{\hat{\omega}}$ satisfies:

$$B^{P_\eta}_m\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}\right) \leq \inf_{\omega \geq 0}\left\{B^{P_\eta}_m\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}_\omega\right)\right\} + o(1) \quad \text{as } m \to \infty,$$

and consequently

$$B^{P_\eta}_m\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPAS}}\right) \leq \min\left\{B^{P_\eta}_m\left(\tilde{\boldsymbol{Z}}^f\right),\ B^{P_\eta}_m\left(\hat{\boldsymbol{\theta}}^{\mathrm{UPT}}\right)\right\} + o(1) \quad \text{as } m \to \infty.$$

The proof of Theorem C.6, which follows directly from Proposition C.5, mirrors the argument in Appendix F.3 used to derive Theorem 5.2 from Proposition 5.1. We note that the conclusion here is slightly weaker than that of Theorem 5.2 for PAS: Theorem 5.2 guarantees that PAS asymptotically has risk less than or equal to that of PT with the optimal per-problem choice of the power-tuning parameter.
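Steps 2-5 of Algorithm 2 can be sketched as follows; the function name, the grid for $\omega$, and the toy inputs are our own illustrative choices rather than the released implementation:

```python
import numpy as np

def uni_pas(ybar, z_lab, z_unl, sig2_hat, tau2_hat, gam_hat, ns, Ns,
            omega_grid=None):
    """Sketch of Algorithm 2: global power tuning, plug-in variance
    estimates, and adaptive shrinkage via the modified CURE of Eq. (26).
    Each array argument holds one entry per problem j."""
    # Step 2: clipped global power-tuning parameter (Appendix C.2).
    lam = np.sum(gam_hat / ns) / np.sum((ns + Ns) / (ns * Ns) * tau2_hat)
    lam = float(np.clip(lam, 0.0, 1.0))
    theta_upt = ybar + lam * (z_unl - z_lab)

    # Step 3: sample-based Var[theta_upt_j] and Cov[theta_upt_j, z_unl_j].
    sig2_tilde = sig2_hat / ns + (Ns + ns) / (Ns * ns) * lam**2 * tau2_hat \
        - 2 / ns * lam * gam_hat
    gam_tilde = lam * tau2_hat / Ns

    # Step 4: averaging variance estimator of Eq. (25).
    sig2_check = sig2_hat.mean() / ns \
        + (Ns + ns) / (Ns * ns) * lam**2 * tau2_hat.mean() \
        - 2 / ns * lam * gam_hat.mean()

    # Step 5: pick omega minimizing the modified CURE over a grid, then shrink.
    if omega_grid is None:
        omega_grid = np.geomspace(1e-6, 1e3, 400)
    best_omega, best_val = omega_grid[0], np.inf
    for omega in omega_grid:
        w = omega / (omega + sig2_check)
        val = np.mean((2 * w - 1) * sig2_tilde + 2 * (1 - w) * gam_tilde
                      + (1 - w) ** 2 * (theta_upt - z_unl) ** 2)
        if val < best_val:
            best_omega, best_val = omega, val
    w = best_omega / (best_omega + sig2_check)
    return w * theta_upt + (1 - w) * z_unl, lam, best_omega

# Illustrative inputs for m = 3 problems.
ybar = np.array([0.9, 1.1, 0.2])
z_lab = np.array([1.0, 1.0, 0.1])
z_unl = np.array([1.0, 1.05, 0.0])
est, lam, omega_hat = uni_pas(ybar, z_lab, z_unl,
                              sig2_hat=np.array([0.8, 1.0, 0.9]),
                              tau2_hat=np.array([1.0, 1.1, 0.9]),
                              gam_hat=np.array([0.5, 0.6, 0.4]),
                              ns=np.array([20, 30, 25]),
                              Ns=np.array([100, 150, 120]))
print(est, lam, omega_hat)
```

The grid search stands in for the one-dimensional minimization over $\omega$; any scalar optimizer could be used instead.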
By contrast, Theorem C.6 together with Theorem C.3 shows that Uni PAS always has risk less than or equal to that of power tuning that uses the same power-tuning parameter for all problems. The main upshot of Theorem C.6 is that Uni PAS does not require knowledge of second moments.

D. Other Baseline Shrinkage Estimators

D.1. Shrink-classical (Shrinkage) Baseline

The shrink-classical estimator applies shrinkage directly to the classical estimator $\bar{Y}_j$, using the prediction mean $\tilde{Z}^f_j$ as a shrinkage target, without first applying power-tuned PPI. We include this baseline as an ablation study, to isolate the benefits of power tuning from the PAS estimator.

Formulation. The shrink-classical estimator for problem $j$ takes the form:

$$\hat{\theta}^{\mathrm{Shrink}}_{j,\omega} := \omega_j\bar{Y}_j + (1 - \omega_j)\tilde{Z}^f_j, \qquad \omega_j := \frac{\omega}{\omega + \bar{\sigma}^2_j}, \qquad \bar{\sigma}^2_j := \mathrm{Var}_{\eta_j}[\bar{Y}_j] = \sigma^2_j/n_j.$$

Here $\omega \geq 0$ is a global shrinkage parameter analogous to Section 4.2. The key difference from PAS is that we shrink the classical estimator $\bar{Y}_j$ (which is independent of $\tilde{Z}^f_j$) rather than the power-tuned estimator $\hat{\theta}^{\mathrm{PT}}_j$ (which is correlated with $\tilde{Z}^f_j$).

Optimizing $\omega$ via CURE. Since $\bar{Y}_j$ and $\tilde{Z}^f_j$ are independent, Theorem 4.1 simplifies: setting $\tilde{\gamma}_j = \mathrm{Cov}[\bar{Y}_j, \tilde{Z}^f_j] = 0$ and using $\bar{\sigma}^2_j = \sigma^2_j/n_j$, CURE becomes

$$\mathrm{CURE}\left(\hat{\theta}^{\mathrm{Shrink}}_{j,\omega}\right) = (2\omega_j - 1)\bar{\sigma}^2_j + \left[(1 - \omega_j)\left(\bar{Y}_j - \tilde{Z}^f_j\right)\right]^2.$$

This follows from Theorem 4.1 by setting $\tilde{\gamma}_j = 0$. The global shrinkage parameter $\omega$ is selected by minimizing CURE across all $m$ problems:

$$\hat{\theta}^{\mathrm{Shrink}}_j := \hat{\omega}_j\bar{Y}_j + (1 - \hat{\omega}_j)\tilde{Z}^f_j, \qquad \hat{\omega}_j = \hat{\omega}/(\hat{\omega} + \bar{\sigma}^2_j), \qquad \text{where } \hat{\omega} \in \arg\min_{\omega \geq 0} \frac{1}{m}\sum_{j=1}^m \mathrm{CURE}\left(\hat{\theta}^{\mathrm{Shrink}}_{j,\omega}\right). \qquad (28)$$

The optimal $\hat{\omega}$ does not admit a closed-form expression, but we can compute it numerically by grid search. Below we provide the pseudo-code for implementing the shrink-classical estimator.
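A compact sketch of the shrink-classical baseline, selecting $\omega$ by grid search over the simplified CURE of Equation (28); names and numbers are illustrative:

```python
import numpy as np

def shrink_classical(ybar, z_unl, sig2_bar, omega_grid=None):
    """Sketch of the shrink-classical baseline (Appendix D.1): shrink the
    classical mean toward the unlabeled prediction mean. Since gamma_j = 0,
    CURE needs no correlation term; omega is chosen by grid search, Eq. (28)."""
    if omega_grid is None:
        omega_grid = np.geomspace(1e-6, 1e3, 400)

    def avg_cure(omega):
        w = omega / (omega + sig2_bar)
        return np.mean((2 * w - 1) * sig2_bar
                       + ((1 - w) * (ybar - z_unl)) ** 2)

    omega_hat = min(omega_grid, key=avg_cure)
    w = omega_hat / (omega_hat + sig2_bar)
    return w * ybar + (1 - w) * z_unl

ybar = np.array([0.5, 1.5, -0.2])
z_unl = np.array([0.6, 1.4, 0.0])
est = shrink_classical(ybar, z_unl, sig2_bar=np.array([0.02, 0.03, 0.05]))
print(est)  # each entry is a convex combination of ybar[j] and z_unl[j]
```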
Algorithm 3: Shrink-classical Estimator
Require: $\{(X_{ij}, Y_{ij})_{i=1}^{n_j}\}$, $\{\tilde{X}_{ij}\}_{i=1}^{N_j}$ for $j \in [m]$; variance parameters $\{\sigma^2_j\}_{j=1}^m$; predictive model $f$
1: for $j = 1$ to $m$ do
2:   $\bar{Y}_j, \tilde{Z}^f_j$ = get_means($(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $f$)
3:   $\bar{\sigma}^2_j = \sigma^2_j/n_j$  {variance of $\bar{Y}_j$}
4: end for
5: $\hat{\omega}$ = get_shrink_param($(\bar{Y}_j)_{j=1}^m$, $(\tilde{Z}^f_j)_{j=1}^m$, $(\bar{\sigma}^2_j)_{j=1}^m$)  {use Eq. (28)}
6: for $j = 1$ to $m$ do
7:   $\hat{\omega}_j = \hat{\omega}/(\hat{\omega} + \bar{\sigma}^2_j)$
8:   $\hat{\theta}^{\mathrm{Shrink}}_j = \hat{\omega}_j\bar{Y}_j + (1 - \hat{\omega}_j)\tilde{Z}^f_j$
9: end for
10: return $\{\hat{\theta}^{\mathrm{Shrink}}_j\}_{j=1}^m$

D.2. Shrink-average Baseline

The shrink-average estimator represents an alternative, perhaps more classical, shrinkage approach that attempts to further improve upon the unbiased PT estimators. While PAS reuses the prediction means on unlabeled data as shrinkage targets, here we consider shrinking the PT estimators across all problems toward a shared location, namely their group mean $\bar{\theta}^{\mathrm{PT}} := \frac{1}{m}\sum_{j=1}^m \hat{\theta}^{\mathrm{PT}}_j$.

Formulation. The shrink-average estimator for problem $j$ takes the form:

$$\hat{\theta}^{\mathrm{Avg}}_{j,\omega} := \omega_j\hat{\theta}^{\mathrm{PT}}_j + (1 - \omega_j)\bar{\theta}^{\mathrm{PT}}, \qquad \omega_j := \frac{\omega}{\omega + \tilde{\sigma}^2_j}, \qquad \tilde{\sigma}^2_j := \mathrm{Var}_{\eta_j}\left[\hat{\theta}^{\mathrm{PT}}_j\right].$$

Algorithm 4: Shrink-average Estimator
Require: $(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $\gamma_j, \tau_j, \sigma_j$ for $j \in [m]$; predictive model $f$
1: for $j = 1$ to $m$ do
2:   Step 1: Apply the predictor (Eq. (1))
3:   $\bar{Y}_j, Z^f_j, \tilde{Z}^f_j$ = get_means($(X_{ij}, Y_{ij})_{i=1}^{n_j}$, $(\tilde{X}_{ij})_{i=1}^{N_j}$, $f$)
4:   Step 2: Power tuning (Eq. (13))
5:   $\lambda^\star_j$ = get_pt_param($\gamma_j, \tau_j, n_j, N_j$)
6:   $\hat{\theta}^{\mathrm{PT}}_j = \bar{Y}_j + \lambda^\star_j(\tilde{Z}^f_j - Z^f_j)$
7:   $\tilde{\sigma}^2_j$ = get_pt_var($\hat{\theta}^{\mathrm{PT}}_j$)  {Eq. (14)}
8: end for
9: $\bar{\theta}^{\mathrm{PT}} = m^{-1}\sum_{j=1}^m \hat{\theta}^{\mathrm{PT}}_j$
10: Step 3: Adaptive shrinkage toward the group mean (Eq. (29))
11: $\hat{\omega}$ = get_shrink_param($(\hat{\theta}^{\mathrm{PT}}_j)_{j=1}^m$, $\bar{\theta}^{\mathrm{PT}}$, $(\tilde{\sigma}^2_j)_{j=1}^m$)
12: for $j = 1$ to $m$ do
13:   $\hat{\omega}_j = \hat{\omega}/(\hat{\omega} + \tilde{\sigma}^2_j)$
14:   $\hat{\theta}^{\mathrm{Avg}}_j = \hat{\omega}_j\hat{\theta}^{\mathrm{PT}}_j + (1 - \hat{\omega}_j)\bar{\theta}^{\mathrm{PT}}$
15: end for
16: return $\{\hat{\theta}^{\mathrm{Avg}}_j\}_{j=1}^m$

Optimizing $\omega$ via SURE. Xie et al. (2012) proposed the following unbiased risk estimate to optimize $\omega$ for this estimator.
Note that even though the group mean is also correlated with each PT estimator, we still call the following criterion SURE rather than CURE, following the nomenclature in Xie et al. (2012):

$$\hat{\omega} \in \arg\min_{\omega \geq 0} \frac{1}{m}\sum_{j=1}^m \mathrm{SURE}\left(\hat{\theta}^{\mathrm{Avg}}_{j,\omega}\right), \qquad (29)$$

$$\mathrm{SURE}\left(\hat{\theta}^{\mathrm{Avg}}_{j,\omega}\right) := \left[(1 - \omega_j)\left(\hat{\theta}^{\mathrm{PT}}_j - \bar{\theta}^{\mathrm{PT}}\right)\right]^2 + (1 - \omega_j)\left(\omega + (2/m - 1)\tilde{\sigma}^2_j\right).$$

We refer to Algorithm 4 for a full pseudo-code implementation of the shrink-average estimator.

E. Experiment Details

E.1. Synthetic Model

Motivation. In Example 2.3, we described the following data-generating process (copied from Eq. (8)):

$$\eta_j \sim U[-1, 1], \quad j = 1, \ldots, m, \qquad X_{ij} \sim N(\eta_j, \psi^2), \qquad Y_{ij} \mid X_{ij} \sim N\left(2\eta_j X_{ij} - \eta^2_j,\ c\right), \quad i = 1, \ldots, n_j,$$

and the same for $(\tilde{X}_{ij}, \tilde{Y}_{ij})$. Here $\psi$ and $c$ are two hyperparameters that we chose to be $0.1$ and $0.05$, respectively. The (marginal) mean and variance of $Y_{ij}$ are

$$\theta_j := E_{\eta_j}[Y_{ij}] = \eta^2_j, \qquad \sigma^2_j := \mathrm{Var}_{\eta_j}[Y_{ij}] = 4\eta^2_j\psi^2 + c.$$

To understand the motivation behind this setup, we can further inspect the covariance between $X_{ij}$ and $Y_{ij}$, which can be verified to be $\mathrm{Cov}_{\eta_j}[X_{ij}, Y_{ij}] = 2\eta_j\psi^2$. Therefore, if we consider the ratio between the absolute covariance and the variance of $Y_{ij}$ as a characterization of the inherent predictability of a problem, we see that

$$\frac{\left|\mathrm{Cov}_{\eta_j}[X_{ij}, Y_{ij}]\right|}{\mathrm{Var}_{\eta_j}[Y_{ij}]} = \frac{2|\eta_j|\psi^2}{4\eta^2_j\psi^2 + c},$$

which attains its minimum at $\eta_j = 0$ and increases monotonically in $|\eta_j|$ for $|\eta_j| \in [0, 1]$ given our specific choices of $\psi$ and $c$ (see Figure 4). In other words, problems with $\eta_j$ close to the origin have lower predictability, whereas as $\eta_j$ moves away from zero the problems become easier to solve. This quantitatively reflects the pattern we see in Figure 3, where we display the power-tuning parameters as a function of $\eta_j$.

Figure 4. The ratio between $|\mathrm{Cov}_{\eta_j}[X_{ij}, Y_{ij}]|$ and $\mathrm{Var}_{\eta_j}[Y_{ij}]$ as a function of $\eta_j$. The constants are set to $\psi = 0.1$ and $c = 0.05$.

Expressions for $\theta_j, \mu_j, \sigma^2_j, \tau^2_j, \gamma_j$ when $f(x) = |x|$.
When we work with the synthetic model using the flawed predictor $f(x) = |x|$, we can match the form of our dataset to the general setting in Assumption 2.2 by identifying closed-form expressions for the model parameters $\theta_j, \mu_j, \sigma^2_j, \tau^2_j, \gamma_j$ (via the moments of the folded normal distribution):

$$\theta_j = \eta^2_j, \qquad \sigma^2_j = 4\eta^2_j\psi^2 + c, \qquad \gamma_j = 2\eta_j\psi^2\left(1 - 2\Phi(-\eta_j/\psi)\right),$$
$$\mu_j = \sqrt{\tfrac{2}{\pi}}\,\psi\exp\left(-\tfrac{\eta^2_j}{2\psi^2}\right) + \eta_j\left(1 - 2\Phi(-\eta_j/\psi)\right), \qquad \tau^2_j = \eta^2_j + \psi^2 - \mu^2_j,$$

where $\Phi(\cdot)$ denotes the standard normal distribution function.

Expressions for $\theta_j, \mu_j, \sigma^2_j, \tau^2_j, \gamma_j$ when $f(x) = x^2$. Similar closed-form expressions can be derived for the other predictor $f(x) = x^2$. Note that $\theta_j$ and $\sigma^2_j$ remain the same:

$$\theta_j = \eta^2_j, \qquad \sigma^2_j = 4\eta^2_j\psi^2 + c, \qquad \gamma_j = 4\eta^2_j\psi^2, \qquad \mu_j = \eta^2_j + \psi^2, \qquad \tau^2_j = 2\psi^4 + 4\eta^2_j\psi^2.$$

In experiments involving the synthetic model with both predictors, we are able to leverage these closed-form expressions and supply the ground-truth parameters to our datasets.

Interpretation of MSE. In the synthetic experiments, since we have access to the true prior for $\eta_j$ (and therefore for $\theta_j$) and resample them for each problem across the $K$ trials, the MSE obtained in Section 7.1 is an unbiased estimate of the Bayes risk defined in Eq. (11).

E.2. Amazon Review Ratings Dataset

Dataset & Preprocessing. The Amazon Fine Food Reviews dataset, provided by the Stanford Network Analysis Project (SNAP; SNAP (2014)) on Kaggle (footnote 21), comes in a clean format. We group reviews by their Product ID. For each review, we concatenate the title and body text to form the covariate, while the response is the reviewer's score/rating (1 to 5 stars). Here is a sample review:

Score: 4
Product: BBQ Pop Chips
Title: Delicious!
Text: BBQ Pop Chips are a delicious tasting healthier chip than many on the market. They are light and full of flavor. The 3 oz bags are a great size to have. I would recommend them to anyone.
21 https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

We focus on the top $m = 200$ products with the most reviews for the compound mean estimation of average ratings. This choice mitigates extreme heteroscedasticity across estimators for different problems, which could unduly favor shrinkage-based methods when considering unweighted compound risk. There are a total of 74,913 reviews for the 200 products.

Fine-tuning BERT. The Bidirectional Encoder Representations from Transformers (BERT) model is a widely adopted language model for many NLP tasks, including text classification (Devlin et al., 2019). However, pretraining BERT from scratch is time-consuming and requires large amounts of data. We therefore use the bert-base-multilingual-uncased-sentiment model (footnote 22) from Town (2023) as the base model, denoted BERT-base. BERT-base is pre-trained on general product reviews (not exclusive to Amazon) in six languages. It achieves 67.5% prediction accuracy on a validation set of 100 products (roughly 46k reviews). We then further fine-tune it on the held-out review data, that is, reviews outside the top 200 products, for 2 full epochs. The fine-tuning is done using Hugging Face's transformers library (Wolf, 2019). After fine-tuning, the BERT-tuned model achieves 78.8% accuracy on the same validation set.

E.3. Spiral Galaxy Fractions (Galaxy Zoo 2)

Dataset & Preprocessing. The Galaxy Zoo 2 (GZ2) project (footnote 23) contains a large collection of human-annotated classifications of galaxy images from SDSS. However, instead of a single dataframe, GZ2 comprises many different tables, each covering a subset of the SDSS raw data. We begin with a particular subset of 239,696 images with metadata drawn from Hart et al. (2016). Our data-cleaning pipeline is inspired by Lin et al.
(2021), which removes missing data and relabels the class name of each galaxy image to a more readable format:

Class Names: Round Elliptical, In-between Elliptical, Cigar-shaped Elliptical, Edge-on Spiral, Barred Spiral, Unbarred Spiral, Irregular, Merger

In the downstream estimation problems, we consider a galaxy to be spiral if it is classified as one of the three classes ending with "Spiral", and non-spiral otherwise. Below we display a few examples of galaxy images. Each image has dimensions $424 \times 424 \times 3$, where the third dimension represents the three filter channels: g (green), r (red), and i (infrared). The cleaned dataset has 155,951 images in total.

Figure 5. Examples of spiral and non-spiral galaxy images from Galaxy Zoo 2.

22 https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
23 https://data.galaxyzoo.org/

The additional SDSS metadata for GZ2 (footnote 24) contains valuable information that directly partitions the galaxies based on certain attributes, e.g., REDSHIFT SIMPLE BIN based on galaxy redshift measurements, and WVT BIN calculated by weighted Voronoi tessellation. These partitions naturally motivate fine-grained compound mean estimation on this dataset. After partitioning the images based on WVT BIN (footnote 25), we consider only the top $m = 100$ partitions, based on the cutoff that each problem should have at least 150 images (many partitions contain very few galaxy images), for the same reason as in the Amazon Review dataset. In the end, we have a total of 100k images as covariates (either $X_{ij}$ or $\tilde{X}_{ij}$) for our problem.

Training the Predictor. We employ the ResNet50 architecture (He et al., 2016), utilizing the pre-trained model from torchvision initially trained on ImageNet (Deng et al., 2009). To tailor the model to our task, we fine-tune it on 50k images excluded from the top $m$ problems.
The model is trained to classify galaxies into eight categories, later condensed into a binary spiral/non-spiral classification for prediction. We use a batch size of 256 and the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-3. After 20 epochs, the model achieves 87% training accuracy and 83% test accuracy. Despite these promising results, Table 3 indicates that the predictions still require debiasing for accurate estimation.

E.4. Benchmarking in Real-world Datasets

In this appendix we describe the steps to obtain the MSEs and their standard errors for the real-world datasets shown in Table 3. Let $K$ be the number of experiment trials and $T_j$ the total number of data points for problem $j$, i.e., $\{\tilde{X}_{ij}, \tilde{Y}_{ij}\}_{i=1}^{T_j}$ represents the raw data we have. Let $n_j, N_j$ be the desired numbers of labeled/unlabeled data points to simulate, usually determined by a hyperparameter splitting ratio (e.g., $N_j = \lfloor r\,T_j \rfloor$, $n_j = T_j - N_j$ for $r = 0.8$ in our case).

1. Following the evaluation methodology in the existing PPI literature, e.g., Angelopoulos et al. (2023), we first calculate the mean of all responses for each problem and treat it as the pseudo ground truth, i.e., $\theta_j := \frac{1}{T_j}\sum_{i=1}^{T_j} \tilde{Y}_{ij}$.

2. For each trial $k \in [K]$, we create a random permutation of the raw data, with indices permuted by $\kappa : \mathbb{N} \to \mathbb{N}$, and obtain the labeled and unlabeled datasets for problem $j$ as $\{X_{ij}, Y_{ij}\}_{i=1}^{n_j} = \{\tilde{X}_{\kappa(i)j}, \tilde{Y}_{\kappa(i)j}\}_{i=1}^{n_j}$ and $\{\tilde{X}_{ij}\}_{i=1}^{N_j} = \{\tilde{X}_{\kappa(i)j}\}_{i=n_j+1}^{T_j}$.

3. We then use these datasets to obtain the baseline and PAS estimators. Let $\hat{\theta}^k_j$ be an estimator for the $j$-th problem at trial $k$; our final reported MSE and standard error are calculated as

$$\widehat{\mathrm{MSE}}_K(\hat{\theta}) := \frac{1}{K}\sum_{k=1}^K \frac{1}{m}\sum_{j=1}^m \left(\hat{\theta}^k_j - \theta_j\right)^2, \qquad \mathrm{SE}_K(\hat{\theta}) := \frac{1}{\sqrt{K}}\sqrt{\frac{1}{K-1}\sum_{k=1}^K\left(\frac{1}{m}\sum_{j=1}^m \left(\hat{\theta}^k_j - \theta_j\right)^2 - \widehat{\mathrm{MSE}}_K(\hat{\theta})\right)^2}.$$

Note that the standard error only accounts for uncertainty due to the random splits into labeled and unlabeled datasets.

E.5.
Additional Experiment with Varying Labeled/Total Data Ratio

In Table 3, we report our experimental results on different tasks and predictors while fixing the ratio between labeled and total amounts of data at $n_j/(n_j + N_j) = 0.2$ for all problems. To verify the broader applicability of our method, we repeat our experiments across a much wider range of ratios, from 1% to 40%, and report the results in Figure 6. For each ratio, we follow exactly the same data-splitting and benchmarking procedures specified in Appendix E.4.

Figure 6. Average MSEs for the three real-world datasets (Amazon Review with base $f$, Amazon Review with tuned $f$, and Galaxy Zoo) when the labeled/unlabeled split ratio varies from 1% to 40%, comparing the Classical, Prediction Avg, PPI, PT, Shrinkage, PAS, Uni PT, and Uni PAS estimators. Uni PT and Uni PAS are the two newly added variants of the PT and PAS estimators, respectively.

E.6. Computational Resources

All experiments were conducted on a compute cluster with an Intel Xeon Silver 4514Y (16 cores) CPU, an Nvidia A100 (80GB) GPU, and 64GB of memory. Fine-tuning the BERT-tuned model took 2 hours, and training the ResNet50 model took 1 hour. All inferences (predictions) can be done within 10 minutes. The nature of our research problem requires running the prediction only once per dataset, making it fast to benchmark all estimators for $K = 200$ trials using the existing predictions.

24 The column names and their meanings are available at https://data.galaxyzoo.org/data/gz2/gz2sample.txt.
25 In the Galaxy Zoo 2 dataset, WVT BIN denotes Voronoi bins constructed based on each galaxy's intrinsic size and absolute magnitude. The motivation and implementation of this binning strategy are detailed in Hart et al. (2016), who justify such a partitioning, aligned with our compound mean estimation setup, by noting that spiral arm morphology exhibits systematic dependencies on stellar mass and related intrinsic properties.

E.7. Code Availability

The code for reproducing the experiments is available at https://github.com/listar2000/predictionpowered-adaptive-shrinkage.

F. Proofs of Theoretical Results

F.1. Proof of Theorem 4.1

For each problem $j \in [m]$, we shrink the PT estimator $\hat{\theta}^{\mathrm{PT}}_j$ obtained from the first stage toward $\tilde{Z}^f_j$, the prediction mean on the unlabeled data. Conditioning on $\eta_j$, we denote

$$\tilde{\sigma}^2_j := \mathrm{Var}_{\eta_j}\left[\hat{\theta}^{\mathrm{PT}}_j\right] = \mathrm{Var}_{\eta_j}\left[\hat{\theta}^{\mathrm{PPI}}_{j,\lambda^\star_j}\right], \qquad \tilde{\gamma}_j := \mathrm{Cov}_{\eta_j}\left[\hat{\theta}^{\mathrm{PT}}_j, \tilde{Z}^f_j\right] = \lambda^\star_j\,\mathrm{Var}_{\eta_j}\left[\tilde{Z}^f_j\right],$$

where all first and second moments of $\hat{\theta}^{\mathrm{PT}}_j$ and $\tilde{Z}^f_j$ exist under the conditions of Assumption 2.2. For each global $\omega \geq 0$, the shrinkage parameter for the $j$-th problem is defined as $\omega_j := \omega/(\omega + \tilde{\sigma}^2_j)$. Then, following the result in Theorem B.1, CURE for $\hat{\theta}^{\mathrm{PAS}}_{j,\omega_j} := \omega_j\hat{\theta}^{\mathrm{PT}}_j + (1 - \omega_j)\tilde{Z}^f_j$,

$$\mathrm{CURE}\left(\hat{\theta}^{\mathrm{PAS}}_{j,\omega}\right) = (2\omega_j - 1)\tilde{\sigma}^2_j + 2(1 - \omega_j)\tilde{\gamma}_j + \left[(1 - \omega_j)\left(\hat{\theta}^{\mathrm{PT}}_j - \tilde{Z}^f_j\right)\right]^2,$$

is an unbiased estimator of the risk, i.e., $E_{\eta_j}[\mathrm{CURE}(\hat{\theta}^{\mathrm{PAS}}_{j,\omega})] = R(\hat{\theta}^{\mathrm{PAS}}_{j,\omega_j}, \theta_j)$. Finally, the CURE for the collection of estimators $\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega := (\hat{\theta}^{\mathrm{PAS}}_{1,\omega}, \ldots, \hat{\theta}^{\mathrm{PAS}}_{m,\omega})$ is

$$\mathrm{CURE}\left(\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega\right) := \frac{1}{m}\sum_{j=1}^m \mathrm{CURE}\left(\hat{\theta}^{\mathrm{PAS}}_{j,\omega}\right),$$

which is an unbiased estimator of the compound risk $R_m(\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega, \boldsymbol{\theta})$ by linearity of expectation.

F.2. Formal Conditions and Proof of Proposition 5.1

We aim to prove that CURE converges uniformly to the true squared-error loss $\ell_m(\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega, \boldsymbol{\theta})$ as $m \to \infty$. Specifically, our goal is to establish

$$\sup_{\omega \geq 0}\left|\mathrm{CURE}\left(\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega\right) - \ell_m\left(\hat{\boldsymbol{\theta}}^{\mathrm{PAS}}_\omega, \boldsymbol{\theta}\right)\right| \xrightarrow{L_1} 0 \quad \text{as } m \to \infty.$$

For this proposition, all expectation and variance terms without a subscript are conditional on $\eta$.
We keep using the notation $\theta_j = \mathbb{E}[\hat\theta^{\mathrm{PT}}_j]$, $\mu_j = \mathbb{E}[\bar Z^f_j]$, $\sigma^2_j = \mathrm{Var}[\hat\theta^{\mathrm{PT}}_j]$, and $\gamma_j = \mathrm{Cov}[\hat\theta^{\mathrm{PT}}_j, \bar Z^f_j]$. For this proposition, additional assumptions are placed on the data-generating process (integrated over $P_\eta$). We first show how they translate to moment conditions on the estimators $\hat\theta^{\mathrm{PT}}_j$ and $\bar Z^f_j$.

Lemma F.1. Under the assumptions of Proposition 5.1, and specifically $\mathbb{E}_{P_\eta}[f(X_{ij})^4] < \infty$ and $\mathbb{E}_{P_\eta}[Y_{ij}^4] < \infty$, it holds that
$$\sup_{j\ge1} \mathbb{E}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^4\big] < \infty, \qquad \sup_{j\ge1} \mathbb{E}_{P_\eta}\big[(\bar Z^f_j)^4\big] < \infty, \qquad \mathbb{E}_{P_\eta}[\theta_j^4] < \infty, \qquad \mathbb{E}_{P_\eta}[\mu_j^4] < \infty.$$

Proof. By Minkowski's inequality, we have
$$\mathbb{E}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^4\big] = \mathbb{E}_{P_\eta}\Big[\big(\bar Y_j + \lambda^*_j(\bar Z^f_j - \bar Z_j)\big)^4\Big] \le \Big(\mathbb{E}_{P_\eta}[\bar Y_j^4]^{1/4} + \lambda^*_j\,\mathbb{E}_{P_\eta}\big[(\bar Z^f_j)^4\big]^{1/4} + \lambda^*_j\,\mathbb{E}_{P_\eta}[\bar Z_j^4]^{1/4}\Big)^4,$$
so it suffices to bound the fourth moments of $\bar Y_j$, $\bar Z_j$, and $\bar Z^f_j$. We proceed with $\bar Y_j$ first. Again by Minkowski's inequality,
$$\mathbb{E}_{P_\eta}[\bar Y_j^4]^{1/4} = \mathbb{E}_{P_\eta}\Big[\Big(\sum_{i=1}^{n_j} n_j^{-1} Y_{ij}\Big)^4\Big]^{1/4} \le \sum_{i=1}^{n_j} \mathbb{E}_{P_\eta}\big[(n_j^{-1}Y_{ij})^4\big]^{1/4} = \sum_{i=1}^{n_j} n_j^{-1}\,\mathbb{E}_{P_\eta}[Y_{ij}^4]^{1/4} = \mathbb{E}_{P_\eta}[Y_{ij}^4]^{1/4} < \infty.$$
The given assumption on the data-generating process is used in the last step. Although the left-hand side of the inequality may depend on $j$ (via the deterministic $n_j$), the right-hand side does not (since we are integrating over $P_\eta$), so we may take a supremum over all $j$ on the left-hand side. The arguments for $\bar Z_j$ and $\bar Z^f_j$, based on the finiteness of $\mathbb{E}_{P_\eta}[f(X_{ij})^4]$, are analogous. Next, by Jensen's inequality we have
$$\theta_j^4 = \mathbb{E}_{\eta_j}\big[\hat\theta^{\mathrm{PT}}_j\big]^4 \le \mathbb{E}_{\eta_j}\big[(\hat\theta^{\mathrm{PT}}_j)^4\big], \quad \text{so that} \quad \mathbb{E}_{P_\eta}[\theta_j^4] \le \mathbb{E}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^4\big] < \infty,$$
and similarly we obtain $\mathbb{E}_{P_\eta}[\mu_j^4] < \infty$.

With Lemma F.1 in hand, we prove Proposition 5.1 via the following steps.

Step 1: Decompose the difference.
We first decompose CURE and the loss separately as
$$\mathrm{CURE}\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) = \frac{1}{m}\sum_{j=1}^m \Big[(2\omega_j-1)\sigma^2_j + \big[(1-\omega_j)(\hat\theta^{\mathrm{PT}}_j - \bar Z^f_j)\big]^2 + 2(1-\omega_j)\gamma_j\Big]$$
$$= \underbrace{\frac{1}{m}\sum_{j=1}^m \Big[(2\omega_j-1)\sigma^2_j + (1-\omega_j)^2\big(\hat\theta^{\mathrm{PT}}_j - \mu_j\big)^2\Big]}_{I(\omega)} + \underbrace{\frac{1}{m}\sum_{j=1}^m \Big[2(1-\omega_j)\gamma_j + 2(1-\omega_j)^2(\mu_j - \bar Z^f_j)(\hat\theta^{\mathrm{PT}}_j - \mu_j) + (1-\omega_j)^2(\mu_j - \bar Z^f_j)^2\Big]}_{II(\omega)},$$
$$\ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega, \boldsymbol\theta\big) = \frac{1}{m}\sum_{j=1}^m \big(\omega_j\hat\theta^{\mathrm{PT}}_j + (1-\omega_j)\bar Z^f_j - \theta_j\big)^2$$
$$= \underbrace{\frac{1}{m}\sum_{j=1}^m \big(\omega_j\hat\theta^{\mathrm{PT}}_j + (1-\omega_j)\mu_j - \theta_j\big)^2}_{I'(\omega)} + \underbrace{\frac{1}{m}\sum_{j=1}^m \Big[2(1-\omega_j)(\bar Z^f_j - \mu_j)(\hat\theta^{\mathrm{PT}}_j - \theta_j) + 2(1-\omega_j)^2(\mu_j - \bar Z^f_j)(\hat\theta^{\mathrm{PT}}_j - \mu_j) + (1-\omega_j)^2(\mu_j - \bar Z^f_j)^2\Big]}_{II'(\omega)},$$
and we are interested in bounding
$$\sup_{\omega\ge0}\Big|\mathrm{CURE}\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega, \boldsymbol\theta\big)\Big| \le \sup_{\omega\ge0}|I(\omega) - I'(\omega)| + \sup_{\omega\ge0}|II(\omega) - II'(\omega)|. \tag{30}$$

Step 2: Bounding the first difference $\Delta_1(\omega) := I(\omega) - I'(\omega)$. The proof in this step is directly adapted from Theorem 5.1 of Xie et al. (2012) and generalizes to non-Gaussian data. With some algebraic manipulation, we can further decompose
$$\frac{1}{m}\sum_{j=1}^m\Big[(2\omega_j-1)\sigma^2_j + (1-\omega_j)^2\big(\hat\theta^{\mathrm{PT}}_j - \mu_j\big)^2\Big] - \frac{1}{m}\sum_{j=1}^m \big(\omega_j\hat\theta^{\mathrm{PT}}_j + (1-\omega_j)\mu_j - \theta_j\big)^2 = \mathrm{CURE}\big(\hat{\boldsymbol\theta}^0_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^0_\omega, \boldsymbol\theta\big) - \frac{2}{m}\sum_{j=1}^m \mu_j(1-\omega_j)\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big),$$
where
$$\mathrm{CURE}\big(\hat{\boldsymbol\theta}^0_\omega\big) = \frac{1}{m}\sum_{j=1}^m\Big[(2\omega_j-1)\sigma^2_j + (1-\omega_j)^2\big(\hat\theta^{\mathrm{PT}}_j\big)^2\Big], \qquad \ell_m\big(\hat{\boldsymbol\theta}^0_\omega, \boldsymbol\theta\big) = \frac{1}{m}\sum_{j=1}^m \big(\omega_j\hat\theta^{\mathrm{PT}}_j - \theta_j\big)^2,$$
correspond to the CURE and the loss of the shrink-toward-zero estimator $\hat\theta^0_{j,\omega} := \omega_j\hat\theta^{\mathrm{PT}}_j$. We thus have
$$\sup_{\omega\ge0}|\Delta_1(\omega)| \le \sup_{\omega\ge0}\Big|\mathrm{CURE}\big(\hat{\boldsymbol\theta}^0_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^0_\omega, \boldsymbol\theta\big)\Big| + \frac{2}{m}\sup_{\omega\ge0}\Big|\sum_{j=1}^m \mu_j(1-\omega_j)\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big)\Big|. \tag{31}$$
Now, a rearrangement of terms gives
$$\sup_{\omega\ge0}\Big|\mathrm{CURE}\big(\hat{\boldsymbol\theta}^0_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^0_\omega, \boldsymbol\theta\big)\Big| \le \underbrace{\Big|\frac{1}{m}\sum_{j=1}^m\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \sigma^2_j - \theta_j^2\Big]\Big|}_{(*)} + \underbrace{\sup_{\omega\ge0}\Big|\frac{2}{m}\sum_{j=1}^m \omega_j\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big|}_{(**)}.$$
For the first term $(*)$, since the summands have conditional mean zero and are conditionally independent across $j$,
$$\mathbb{E}_{P_\eta}\Big[\Big(\frac{1}{m}\sum_{j=1}^m\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \sigma^2_j - \theta_j^2\Big]\Big)^2\Big] \le \frac{1}{m^2}\sum_{j=1}^m \mathbb{E}_{P_\eta}\Big[\mathrm{Var}_{\eta_j}\big[(\hat\theta^{\mathrm{PT}}_j)^2\big]\Big] \le \frac{1}{m}\sup_j \mathrm{Var}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^2\big].$$
Thus, by Jensen's inequality and iterated expectation,
$$\mathbb{E}_{P_\eta}\Big|\frac{1}{m}\sum_{j=1}^m\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \sigma^2_j - \theta_j^2\Big]\Big| \le \Big(\frac{1}{m}\sup_j \mathrm{Var}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^2\big]\Big)^{1/2}. \tag{32}$$
For the second term $(**)$, we start by arguing conditionally on $\eta$, which in particular means we may treat all the $\sigma^2_j$ as fixed. It is thus without loss of generality to assume that $\sigma^2_1 \le \cdots \le \sigma^2_m$ (by first sorting the problems according to the value of $\sigma^2_j$).
Then, since $\omega_j$ is a monotone (decreasing) function of $\sigma^2_j$ for any fixed $\omega \ge 0$, we have $1 \ge \omega_1 \ge \cdots \ge \omega_m \ge 0$. The following inequality follows:
$$\sup_{\omega\ge0}\Big|\frac{2}{m}\sum_{j=1}^m \omega_j\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big| \le \frac{2}{m}\max_{1\ge c_1\ge\cdots\ge c_m\ge0}\Big|\sum_{j=1}^m c_j\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big|. \tag{33}$$
The following lemma helps us handle the right-hand side of (33) (the same structural form will appear repeatedly in subsequent parts of the proof).

Lemma F.2. Let $A_1, \ldots, A_n$ be real numbers and $S_k = \sum_{i=1}^k A_i$. Then
$$\max_{1\ge c_1\ge\cdots\ge c_n\ge0}\Big|\sum_{i=1}^n c_iA_i\Big| = \max_{1\le k\le n}|S_k|.$$

Proof. Let $c_1, \ldots, c_n$ be real numbers satisfying $1 \ge c_1 \ge \cdots \ge c_n \ge 0$, and set $c_{n+1} = 0$. By Abel summation we can rewrite
$$\sum_{i=1}^n c_iA_i = \sum_{k=1}^n (c_k - c_{k+1})\,S_k.$$
Since $c_k \ge c_{k+1}$, each $\alpha_k := c_k - c_{k+1}$ is nonnegative, and $\sum_{k=1}^n \alpha_k = c_1 - c_{n+1} \le 1$. Hence
$$\Big|\sum_{i=1}^n c_iA_i\Big| \le \sum_{k=1}^n \alpha_k|S_k| \le \max_{1\le k\le n}|S_k|\sum_{k=1}^n\alpha_k \le \max_{1\le k\le n}|S_k|.$$
To see that this upper bound is attained, consider for each $k$ the choice $c_1 = \cdots = c_k = 1$, $c_{k+1} = \cdots = c_n = 0$, which satisfies $1 \ge c_1 \ge \cdots \ge c_n \ge 0$ and yields $\sum_i c_iA_i = S_k$. Taking the maximum over all such $k \in \{1, \ldots, n\}$ attains $\max_{1\le k\le n}|S_k|$. Thus
$$\max_{1\ge c_1\ge\cdots\ge c_n\ge0}\Big|\sum_{i=1}^n c_iA_i\Big| = \max_{1\le k\le n}|S_k|,$$
as claimed.

With Lemma F.2 in hand, we have
$$\max_{1\ge c_1\ge\cdots\ge c_m\ge0}\Big|\sum_{j=1}^m c_j\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big| = \max_{1\le k\le m}\Big|\sum_{j=1}^k \Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big|.$$
Let $M_k := \sum_{j=1}^k \big[(\hat\theta^{\mathrm{PT}}_j)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\big]$; it is easy to see that $\{M_k\}_{k=1}^m$ forms a martingale conditional on $\eta$. Therefore, by a standard $L^2$ maximal inequality (cf. Theorem 4.4.6 in Durrett (2019)), we have
$$\mathbb{E}_\eta\Big[\max_{1\le k\le m} M_k^2\Big] \le 4\,\mathbb{E}_\eta\big[M_m^2\big] = 4\sum_{j=1}^m \mathrm{Var}_{\eta_j}\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j\Big], \tag{34}$$
which then implies
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\frac{2}{m}\sum_{j=1}^m\omega_j\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big|^2\Big] \le \frac{4}{m^2}\,\mathbb{E}_{P_\eta}\Big[\max_{1\le k\le m}M_k^2\Big] \le \frac{16}{m}\sup_j \mathrm{Var}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j\Big],$$
and hence, by Jensen's inequality,
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\frac{2}{m}\sum_{j=1}^m\omega_j\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j - \sigma^2_j\Big]\Big|\Big] \le 4\Big(\frac{1}{m}\sup_j \mathrm{Var}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PT}}_j\big)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j\Big]\Big)^{1/2}. \tag{35}$$
Next, we bound the last expression in (31), $\frac{2}{m}\sup_{\omega\ge0}\big|\sum_{j=1}^m (1-\omega_j)\mu_j(\hat\theta^{\mathrm{PT}}_j - \theta_j)\big|$. Note that $(1-\omega_j)$ is also monotone in $\sigma^2_j$, and if we define $M'_k := \sum_{j=1}^k \mu_j(\hat\theta^{\mathrm{PT}}_j - \theta_j)$, then $\{M'_k\}_{k=1}^m$ forms another martingale conditional on $\eta$.
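The identity of Lemma F.2, which is reused for each of the martingales above, can be checked numerically. The sketch below (our own illustration with made-up numbers) brute-forces monotone weight vectors on a grid and confirms that none exceeds the maximal prefix sum, which is attained by a 0/1 step-weight vector:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=6)
S = np.cumsum(A)                        # prefix sums S_1, ..., S_n
lemma_value = np.max(np.abs(S))         # right-hand side of Lemma F.2

# Grid search over monotone weights 1 >= c_1 >= ... >= c_n >= 0.
best = 0.0
grid = np.linspace(0.0, 1.0, 6)
for c in itertools.product(grid, repeat=len(A)):
    if all(c[i] >= c[i + 1] for i in range(len(A) - 1)):
        best = max(best, abs(np.dot(c, A)))
assert best <= lemma_value + 1e-12      # no monotone weighting exceeds it

# The step weights (1,...,1,0,...,0) attain the value exactly.
k = int(np.argmax(np.abs(S)))
c_step = np.array([1.0] * (k + 1) + [0.0] * (len(A) - k - 1))
assert abs(abs(np.dot(c_step, A)) - lemma_value) < 1e-12
```

This is exactly the Abel-summation argument in the proof: any monotone weighting is a sub-convex combination of the step weightings.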
Therefore, following the same argument as in (33)–(34),
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\frac{2}{m}\sum_{j=1}^m (1-\omega_j)\mu_j\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big)\Big|^2\Big] \le \frac{4}{m^2}\,\mathbb{E}_{P_\eta}\Big[4\max_{1\le k\le m}(M'_k)^2\Big] \le \frac{16}{m^2}\,\mathbb{E}_{P_\eta}\big[(M'_m)^2\big]\cdot 4 \le \frac{16}{m}\sup_j \mathbb{E}_{P_\eta}\Big[\mathrm{Var}_{\eta_j}\big[\hat\theta^{\mathrm{PT}}_j\big]\,\mu_j^2\Big],$$
so that
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\frac{2}{m}\sum_{j=1}^m (1-\omega_j)\mu_j\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big)\Big|\Big] \le 4\Big(\frac{1}{m}\sup_j \mathbb{E}_{P_\eta}\Big[\mathrm{Var}_{\eta_j}\big[\hat\theta^{\mathrm{PT}}_j\big]\,\mu_j^2\Big]\Big)^{1/2}. \tag{36}$$
The upper bounds derived in (32), (35), and (36) establish control on $\sup_{\omega\ge0}|\Delta_1(\omega)|$:
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}|\Delta_1(\omega)|\Big] \le \frac{4}{\sqrt m}\bigg(\sup_j \mathrm{Var}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^2\big]^{1/2} + \sup_j \mathrm{Var}_{P_\eta}\big[(\hat\theta^{\mathrm{PT}}_j)^2 - \hat\theta^{\mathrm{PT}}_j\theta_j\big]^{1/2} + \sup_j \mathbb{E}_{P_\eta}\big[\mathrm{Var}_{\eta_j}[\hat\theta^{\mathrm{PT}}_j]\,\mu_j^2\big]^{1/2}\bigg);$$
since each term on the right-hand side is finite under the fourth-moment conditions established in Lemma F.1, $\sup_{\omega\ge0}|\Delta_1(\omega)|$ converges to zero in $L^1$.

Step 3: Bounding the second difference $\Delta_2(\omega) := II(\omega) - II'(\omega)$. We next cancel identical terms in the second difference in (30) and get
$$\Delta_2(\omega) = \frac{2}{m}\sum_{j=1}^m (1-\omega_j)\Big[\gamma_j - \big(\bar Z^f_j - \mu_j\big)\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big)\Big]. \tag{37}$$
By the same proof logic applied twice above, we now have a function $(1-\omega_j)$ monotone in $\sigma^2_j$, and a martingale $Q_k := \sum_{j=1}^k \big[\gamma_j - (\bar Z^f_j - \mu_j)(\hat\theta^{\mathrm{PT}}_j - \theta_j)\big]$ for $k = 1, \ldots, m$ (recall that $\gamma_j = \mathrm{Cov}_{\eta_j}[\hat\theta^{\mathrm{PT}}_j, \bar Z^f_j]$). The steps (33)–(34) go through, and we have
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}|\Delta_2(\omega)|^2\Big] \le \frac{16}{m}\sup_j \mathbb{E}_{P_\eta}\Big[\mathrm{Var}_{\eta_j}\big[(\bar Z^f_j - \mu_j)(\hat\theta^{\mathrm{PT}}_j - \theta_j)\big]\Big],$$
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}|\Delta_2(\omega)|\Big] \le 4\Big(\frac{1}{m}\sup_j \mathbb{E}_{P_\eta}\Big[\mathrm{Var}_{\eta_j}\big[(\bar Z^f_j - \mu_j)(\hat\theta^{\mathrm{PT}}_j - \theta_j)\big]\Big]\Big)^{1/2}.$$
Again, the (fourth-)moment conditions from Lemma F.1 suffice to ensure that
$$\sup_j \mathbb{E}_{P_\eta}\Big[\mathrm{Var}_{\eta_j}\big[(\bar Z^f_j - \mu_j)(\hat\theta^{\mathrm{PT}}_j - \theta_j)\big]\Big] < \infty,$$
and hence to establish control of $\sup_{\omega\ge0}|\Delta_2(\omega)|$.

Step 4: Concluding the argument. Finally, based on Steps 1–3, we have
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega, \boldsymbol\theta\big)\big|\Big] \le \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}|\Delta_1(\omega)|\Big] + \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}|\Delta_2(\omega)|\Big],$$
and both terms on the right-hand side converge to zero by the preceding bounds and the moment assumptions in the statement of the proposition.

F.3. Proof of Theorem 5.2

We apply a standard argument used to prove consistency of M-estimators.
Let $\omega^*$ be the oracle choice of $\omega \ge 0$ that minimizes the Bayes risk $B^{P_\eta}_m(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega)$.²⁶ Notice that by the definition of $\hat\omega$ as the minimizer of CURE, $\mathrm{CURE}(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\hat\omega}) \le \mathrm{CURE}(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\omega^*})$. Then, writing $\ell_m(\hat\omega) - \ell_m(\omega^*) = [\ell_m(\hat\omega) - \mathrm{CURE}(\hat\omega)] + [\mathrm{CURE}(\hat\omega) - \mathrm{CURE}(\omega^*)] + [\mathrm{CURE}(\omega^*) - \ell_m(\omega^*)]$ and dropping the middle (nonpositive) term,
$$\ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\hat\omega}, \boldsymbol\theta\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\omega^*}, \boldsymbol\theta\big) \le 2\sup_{\omega\ge0}\Big|\mathrm{CURE}\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega, \boldsymbol\theta\big)\Big|.$$

²⁶To streamline the proof, we assume that the infimum is attained at some value $\omega^*$. If the infimum is not attained, the proof still goes through using approximate minimizers.

Taking expectations,
$$B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\hat\omega}\big) - B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\omega^*}\big) \le 2\,\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\mathrm{CURE}\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega, \boldsymbol\theta\big)\Big|\Big].$$
Noting that the right-hand side converges to 0 as $m \to \infty$, and recalling the definition of $\omega^*$, we obtain the desired result
$$B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\hat\omega}\big) \le \inf_{\omega\ge0} B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) + o(1).$$

F.4. Proof of Proposition 5.3

We start from the result of Theorem 5.2,
$$B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\hat\omega}\big) \le \inf_{\omega\ge0} B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) + o(1).$$
Now, since we integrate over $\eta \sim P_\eta$ for all problems,
$$B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_\omega\big) = \frac{1}{m}\sum_{j=1}^m \mathbb{E}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PAS}}_{j,\omega} - \theta_j\big)^2\Big] = \mathbb{E}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PAS}}_{j,\omega} - \theta_j\big)^2\Big],$$
where by definition $\hat\theta^{\mathrm{PAS}}_{j,\omega} = \omega_j\hat\theta^{\mathrm{PT}}_j + (1-\omega_j)\bar Z^f_j$ with $\omega_j = \omega/(\omega + \sigma^2)$. Therefore
$$\mathbb{E}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PAS}}_{j,\omega} - \theta_j\big)^2\Big] = \frac{\omega^2}{(\omega+\sigma^2)^2}\,\mathbb{E}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big)^2\Big] + \frac{\sigma^4}{(\omega+\sigma^2)^2}\,\mathbb{E}_{P_\eta}\Big[\big(\bar Z^f_j - \theta_j\big)^2\Big] + \frac{2\sigma^2\omega}{(\omega+\sigma^2)^2}\,\mathbb{E}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PT}}_j - \theta_j\big)\big(\bar Z^f_j - \theta_j\big)\Big].$$
By our assumption, second-moment quantities such as $\sigma^2$ and $\gamma$ are now fixed, so (by iterated expectation) $\mathbb{E}_{P_\eta}[(\hat\theta^{\mathrm{PT}}_j - \theta_j)^2] = \sigma^2$. Noting that $\gamma_j = 0$ since $N_j = \infty$, the cross term vanishes and
$$\mathbb{E}_{P_\eta}\Big[\big(\hat\theta^{\mathrm{PAS}}_{j,\omega} - \theta_j\big)^2\Big] = \frac{\omega^2\sigma^2}{(\omega+\sigma^2)^2} + \frac{\sigma^4}{(\omega+\sigma^2)^2}\,\mathbb{E}_{P_\eta}\Big[\big(\bar Z^f_j - \theta_j\big)^2\Big].$$
Plugging in $\omega = \mathbb{E}_{P_\eta}[(\bar Z^f_j - \theta_j)^2]$ gives
$$\frac{\omega^2\sigma^2}{(\omega+\sigma^2)^2} + \frac{\sigma^4}{(\omega+\sigma^2)^2}\,\mathbb{E}_{P_\eta}\big[(\bar Z^f_j - \theta_j)^2\big] = \frac{\sigma^2\,\mathbb{E}_{P_\eta}\big[(\bar Z^f_j - \theta_j)^2\big]}{\sigma^2 + \mathbb{E}_{P_\eta}\big[(\bar Z^f_j - \theta_j)^2\big]}.$$
We finally have
$$B^{P_\eta}_m\big(\hat{\boldsymbol\theta}^{\mathrm{PAS}}_{\hat\omega}\big) \le \frac{\sigma^2\,\mathbb{E}_{P_\eta}\big[(\bar Z^f_j - \theta_j)^2\big]}{\sigma^2 + \mathbb{E}_{P_\eta}\big[(\bar Z^f_j - \theta_j)^2\big]} + o(1).$$

F.5. Proof of Lemma C.2

We aim to prove that $\mathbb{E}_{P_\eta}\big[(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m})^2\big] \to 0$ as $m \to \infty$.
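The clipped global parameter analyzed in this argument is a ratio of averaged covariance and variance terms, clipped to $[0,1]$. A minimal sketch (our own helper, with made-up toy inputs; names are not from the paper's codebase):

```python
import numpy as np

def lambda_clip_hat(gamma_hat, tau2_hat, n, N):
    """Unclipped ratio sum_j gamma_hat_j / n_j over
    sum_j (n_j + N_j) / (n_j * N_j) * tau2_hat_j, then clipped to [0, 1]."""
    num = np.sum(gamma_hat / n)
    den = np.sum((n + N) / (n * N) * tau2_hat)
    return float(np.clip(num / den, 0.0, 1.0))

# Toy inputs for illustration only.
gamma_hat = np.array([0.02, 0.05, 0.03])
tau2_hat = np.array([0.10, 0.12, 0.08])
n = np.array([50, 40, 60])
N = np.array([500, 400, 600])
lam = lambda_clip_hat(gamma_hat, tau2_hat, n, N)
assert 0.0 <= lam <= 1.0
```

The clipping step is what upgrades convergence in probability to $L^2$ convergence below, since it forces the estimation error into $[-1, 1]$.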
Let $\lambda^*_m$ be the unclipped theoretically optimal global parameter for $m$ problems, and $\hat\lambda$ its unclipped sample-based estimator:
$$\lambda^*_m = \frac{\sum_{j=1}^m n_j^{-1}\gamma_j}{\sum_{j=1}^m \frac{n_j+N_j}{n_jN_j}\tau^2_j} \qquad \text{and} \qquad \hat\lambda = \frac{\sum_{j=1}^m n_j^{-1}\hat\gamma_j}{\sum_{j=1}^m \frac{n_j+N_j}{n_jN_j}\hat\tau^2_j}.$$
The clipped versions are $\lambda^*_{\mathrm{clip},m} = \mathrm{clip}(\lambda^*_m, [0,1])$ and $\hat\lambda_{\mathrm{clip}} = \mathrm{clip}(\hat\lambda, [0,1])$.

Step 1: Convergence in probability of the numerator and the denominator. Let
$$N^*_m = \frac{1}{m}\sum_{j=1}^m n_j^{-1}\gamma_j, \qquad D^*_m = \frac{1}{m}\sum_{j=1}^m \frac{n_j+N_j}{n_jN_j}\tau^2_j, \tag{38}$$
$$\hat N_m = \frac{1}{m}\sum_{j=1}^m n_j^{-1}\hat\gamma_j, \qquad \hat D_m = \frac{1}{m}\sum_{j=1}^m \frac{n_j+N_j}{n_jN_j}\hat\tau^2_j, \tag{39}$$
so that $\lambda^*_m = N^*_m/D^*_m$ and $\hat\lambda = \hat N_m/\hat D_m$. We write $\Delta N_m := \hat N_m - N^*_m = m^{-1}\sum_{j=1}^m U_j$, where $U_j = n_j^{-1}(\hat\gamma_j - \gamma_j)$. Note that $\mathbb{E}_{P_\eta}[U_j] = 0$, and by Assumptions 2 and 3 we know that $\mathrm{Var}_{P_\eta}[U_j] = \mathbb{E}_{P_\eta}\big[n_j^{-2}\mathrm{Var}_{\eta_j}[\hat\gamma_j]\big]$ is bounded (say by $V_U < \infty$). We then have $\mathbb{E}_{P_\eta}[\Delta N_m] = 0$ and $\mathrm{Var}_{P_\eta}[\Delta N_m] \to 0$ as $m \to \infty$, so $\Delta N_m \xrightarrow{L^2} 0$, which implies $\Delta N_m \xrightarrow{P} 0$. Thus $\hat N_m - N^*_m \xrightarrow{P} 0$. An identical argument gives $\hat D_m - D^*_m \xrightarrow{P} 0$ for the denominator terms.

Step 2: Convergence in probability of the unclipped ratio. By Assumption 5, $D^*_m$ (and hence $\hat D_m$, since $\hat D_m = D^*_m + o_P(1)$) is bounded away from zero in probability. Thus, by a standard argument using the Continuous Mapping Theorem,
$$\hat\lambda - \lambda^*_m = \frac{N^*_m + o_P(1)}{D^*_m + o_P(1)} - \frac{N^*_m}{D^*_m} \xrightarrow{P} 0$$
for the unclipped ratio.

Step 3: $L^2$ convergence of the clipped ratio. The clipping function $g(x) = \mathrm{clip}(x, [0,1])$ is continuous, so re-applying the Continuous Mapping Theorem gives
$$\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m} = \mathrm{clip}(\hat\lambda, [0,1]) - \mathrm{clip}(\lambda^*_m, [0,1]) \xrightarrow{P} 0.$$
Finally, clipping also ensures that $|\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}| \le 1$, so by bounded convergence this strengthens to $\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m} \xrightarrow{L^2} 0$, completing the proof.

F.6. Proof of Theorem C.3

Denote $V_j(\lambda) = \mathrm{Var}_{\eta_j}[\hat\theta_{j,\lambda}] = \frac{\sigma^2_j}{n_j} - \frac{2\lambda}{n_j}\gamma_j + \lambda^2\frac{n_j+N_j}{n_jN_j}\tau^2_j$, and let $S_m(\lambda) = \sum_{j=1}^m V_j(\lambda)$. The average sum of variances is $\bar V_m(\lambda) = \frac{1}{m}S_m(\lambda)$.
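Since each $V_j(\lambda)$ is quadratic in $\lambda$ with a nonnegative leading coefficient, minimizing an average of such quadratics over $[0,1]$ reduces to clipping the ratio of its linear and quadratic coefficients. A small numerical check of this reduction (illustrative coefficients of our own choosing):

```python
import numpy as np

# Quadratic of the form Vbar(lmb) = C - 2*B*lmb + A*lmb**2 with A > 0.
A, B, C = 0.04, 0.05, 0.10   # illustrative; B / A > 1, so the clip binds at 1

def vbar(lmb):
    return C - 2 * B * lmb + A * lmb ** 2

lam_star = float(np.clip(B / A, 0.0, 1.0))   # clipped minimizer over [0, 1]
grid = np.linspace(0.0, 1.0, 10_001)
assert np.all(vbar(grid) >= vbar(lam_star) - 1e-12)
assert lam_star == 1.0
```

With $B/A > 1$ the unconstrained minimizer lies outside $[0,1]$ and the constrained minimum sits at the boundary, exactly as in the definition of $\lambda^*_{\mathrm{clip},m}$.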
We can express $\bar V_m(\lambda)$ as a quadratic function of $\lambda$: $\bar V_m(\lambda) = \bar C_m - 2\bar B_m\lambda + \bar A_m\lambda^2$, where
$$\bar A_m = \frac{1}{m}\sum_{j=1}^m \frac{n_j+N_j}{n_jN_j}\tau^2_j \;(= D^*_m \text{ from Equation (38)}), \qquad \bar B_m = \frac{1}{m}\sum_{j=1}^m \frac{\gamma_j}{n_j} \;(= N^*_m \text{ from Equation (38)}).$$
The minimizer of $\bar V_m(\lambda)$ over $\lambda \in \mathbb{R}$ is $\lambda^u = \bar B_m/\bar A_m$ (assuming $\bar A_m > 0$, which holds if not all $\tau^2_j = 0$). Since $\bar V_m(\lambda)$ is a quadratic function of $\lambda$ with a positive leading coefficient $\bar A_m$ (i.e., an upward-opening parabola), its minimum over the closed interval $[0,1]$ is achieved at $\lambda^*_{\mathrm{clip},m} = \mathrm{clip}(\lambda^u, [0,1]) = \mathrm{clip}(\bar B_m/\bar A_m, [0,1])$.

We want to show that $\bar V_m(\hat\lambda_{\mathrm{clip}}) - \bar V_m(\lambda^*_{\mathrm{clip},m}) \xrightarrow{P} 0$. The derivative of $\bar V_m$ is $\bar V'_m(\lambda) = 2\bar A_m\lambda - 2\bar B_m$. By the Mean Value Theorem, for some $\xi_m$ between $\hat\lambda_{\mathrm{clip}}$ and $\lambda^*_{\mathrm{clip},m}$,
$$\bar V_m(\hat\lambda_{\mathrm{clip}}) - \bar V_m(\lambda^*_{\mathrm{clip},m}) = \bar V'_m(\xi_m)\big(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}\big) = \big(2\bar A_m\xi_m - 2\bar B_m\big)\big(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}\big).$$
Since $\hat\lambda_{\mathrm{clip}}$ and $\lambda^*_{\mathrm{clip},m}$ are both in $[0,1]$, so is $\xi_m$. The terms $\bar A_m = D^*_m$ and $\bar B_m = N^*_m$ are averages of $m$ independent terms with uniformly bounded variances (under Assumption 4 of Lemma C.2); thus, by Chebyshev's inequality, $\bar A_m = O_P(1)$ and $\bar B_m = O_P(1)$ (i.e., they are bounded in probability). Since $\xi_m \in [0,1]$, the factor $(2\bar A_m\xi_m - 2\bar B_m)$ is also $O_P(1)$. Let $\Delta\lambda_m = \hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}$. From Lemma C.2, we have $\Delta\lambda_m \xrightarrow{P} 0$. Therefore,
$$\bar V_m(\hat\lambda_{\mathrm{clip}}) - \bar V_m(\lambda^*_{\mathrm{clip},m}) = \big(2\bar A_m\xi_m - 2\bar B_m\big)\,\Delta\lambda_m \xrightarrow{P} 0.$$
The $L^1$ convergence then follows from the fact that the left-hand side is uniformly integrable: since $|\Delta\lambda_m| \le 1$, it is bounded by $2(\bar A_m + |\bar B_m|)$, whose second moments are uniformly bounded. This establishes the asymptotic variance optimality of the Uni PT estimator.

F.7. Proof of Proposition C.5

Organization of the proof. We provide a high-level sketch of the proof in Figure 7. Our main goal is to establish that $\sup_{\omega\ge0}|\widehat{\mathrm{CURE}} - \ell_m| \xrightarrow{L^1} 0$ as $m \to \infty$. To achieve this, we introduce two intermediate estimators, $\mathrm{CURE}^{*}$ and $\mathrm{CURE}^{**}$, that serve as bridges between $\widehat{\mathrm{CURE}}$ and $\ell_m$. The proof proceeds by showing that each consecutive pair of estimators is asymptotically close, which allows us to conclude the overall result via repeated applications of the triangle inequality.

[Figure 7: the chain $\widehat{\mathrm{CURE}}$ (Eq. 26) $\leftrightarrow$ $\mathrm{CURE}^{*}$ $\leftrightarrow$ $\mathrm{CURE}^{**}$ $\leftrightarrow$ $\ell_m$ (Eq. 9), with the three arrows established by Lemma F.7, Lemma F.5, and Theorem F.4, respectively.]

Figure 7. A visual sketch of the proof.
Each node represents a variant of the original CURE (i.e., a risk estimator conditioned on $\omega$); each arrow represents an asymptotic ($m \to \infty$) closeness result between the corresponding estimators.

We first define two intermediate forms of CURE, denoted $\mathrm{CURE}^{*}$ and $\mathrm{CURE}^{**}$ respectively, sitting between the original CURE (Eq. 24), which requires full knowledge of the second moments, and the sample-estimate-based $\widehat{\mathrm{CURE}}$ defined in Equation (26):
$$\mathrm{CURE}^{*}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) = \frac{1}{m}\sum_{j=1}^m \Big[(2\omega^*_j-1)\,\tilde\sigma^2_j + 2(1-\omega^*_j)\,\tilde\gamma_j + (1-\omega^*_j)^2\big(\hat\theta^{\mathrm{UPT}}_j - \bar Z^f_j\big)^2\Big], \qquad \omega^*_j := \frac{\omega}{\omega + \sigma^{*2}_j}, \tag{40}$$
with $\sigma^{*2}_j$ the variance target defined in Equation (27), and $\tilde\sigma^2_j, \tilde\gamma_j$ the plug-in second-moment estimates, which in turn depend on the estimated Uni PT parameter $\hat\lambda_{\mathrm{clip}}$.²⁷ In other words, we treat $\sigma^{*2}_j$ as a fixed (but unknown) plug-in value, so that $\omega^*_j$ is non-random for all $\omega > 0$. Next we define $\mathrm{CURE}^{**}$ by replacing the $\tilde\sigma^2_j$ and $\tilde\gamma_j$ terms in $\mathrm{CURE}^{*}$ with variants that depend on $\lambda^*_{\mathrm{clip},m}$ instead:
$$\mathrm{CURE}^{**}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) = \frac{1}{m}\sum_{j=1}^m \Big[(2\omega^*_j-1)\,\bar\sigma^2_j + 2(1-\omega^*_j)\,\bar\gamma_j + (1-\omega^*_j)^2\big(\hat\theta^{\mathrm{UPT}}_j - \bar Z^f_j\big)^2\Big], \tag{41}$$
where
$$\bar\sigma^2_j = \frac{\hat\sigma^2_j}{n_j} + \frac{N_j+n_j}{N_jn_j}\big(\lambda^*_{\mathrm{clip},m}\big)^2\hat\tau^2_j - \frac{2}{n_j}\lambda^*_{\mathrm{clip},m}\hat\gamma_j, \qquad \bar\gamma_j = \frac{\lambda^*_{\mathrm{clip},m}\,\hat\tau^2_j}{N_j}.$$
$\mathrm{CURE}^{**}$ possesses the following desirable properties.

²⁷We omit the dependency on $m$ by using the shorthand $\sigma^{*2}_j \equiv \sigma^{*2}_{j,m}$ introduced in (27).

Lemma F.3. $\mathrm{CURE}^{**}(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega)$ is an unbiased estimator of $R_m(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega)$ conditional on $\eta_j$.

Proof. This is a direct consequence of Theorem B.1 and of $\bar\sigma^2_j, \bar\gamma_j$ being unbiased estimators by construction.

Theorem F.4 (Asymptotic Consistency of $\mathrm{CURE}^{**}$). Under the assumptions of Proposition C.5,
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega, \boldsymbol\theta\big)\big|\Big] \xrightarrow{m\to\infty} 0.$$

Proof. For simplicity, we hereafter drop the dependence of the notation on $\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega$.
By the triangle inequality (here $\mathrm{CURE}^{**}$ is the $\lambda^*_{\mathrm{clip},m}$-based intermediate estimator of Eq. (41), and CURE the oracle version with known second moments),
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**} - \ell_m\big|\Big] \le \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE} - \ell_m\big|\Big] + \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**} - \mathrm{CURE}\big|\Big], \tag{42}$$
and we remark here that the CURE referred to in Equation (42) differs slightly from the CURE for PAS in Theorem 4.1, as we replace $\sigma^2_j$ with $\sigma^{*2}_j$ in the definition of its $\omega_j$ (making it identical to $\omega^*_j$).²⁸ But since $\sigma^{*2}_j$ is a fixed plug-in value, we can still use Theorem 4.1 to bound the first term on the right-hand side of (42), so the only remaining task is to bound the second term.

Let $A_j = \bar\sigma^2_j - \sigma^2_j$ and $B_j = \bar\gamma_j - \gamma_j$. By assumption, we have $\mathbb{E}_{\eta_j}[A_j] = 0$, $\mathbb{E}_{\eta_j}[B_j] = 0$, $\mathbb{E}_{\eta_j}[A_j^2] \le V_\sigma < \infty$, and $\mathbb{E}_{\eta_j}[B_j^2] \le V_\gamma < \infty$. Writing $c_j(\omega) = 2\omega^*_j - 1$ and $d_j(\omega) = 2(1-\omega^*_j)$, we have
$$\mathrm{CURE}^{**} - \mathrm{CURE} = \frac{1}{m}\sum_{j=1}^m\big(\mathrm{CURE}^{**}_j - \mathrm{CURE}_j\big) = \frac{1}{m}\sum_{j=1}^m\big(c_j(\omega)A_j + d_j(\omega)B_j\big).$$
Now we can apply the triangle inequality again. Since both $c_j(\omega)$ and $d_j(\omega)$ are monotone functions of $\omega$ (with $|c_j(\omega)| \le 1$ and $0 \le d_j(\omega) \le 2$), we can use the same strategy as in the proof of Proposition 5.1 (constructing a martingale and applying a maximal inequality). Let $M'_k = \sum_{j=1}^k A_j$; then
$$\Big|\sum_{j=1}^m c_j(\omega)A_j\Big| = \Big|\sum_{k=1}^{m-1}\big(c_k(\omega) - c_{k+1}(\omega)\big)M'_k + c_m(\omega)M'_m\Big| \le \big(1 + |c_m(\omega)|\big)\max_{1\le k\le m}|M'_k| \le 2\max_{1\le k\le m}|M'_k|. \tag{43}$$
Taking expectations over $P_\eta$ and using the maximal inequality gives
$$\mathbb{E}_{P_\eta}\Big[\max_{1\le k\le m}|M'_k|^2\Big] \le 4\sum_{j=1}^m \mathbb{E}_{P_\eta}[A_j^2] \le 4mV_\sigma.$$
Finally, by Jensen's inequality, we have
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\frac{1}{m}\sum_{j=1}^m c_j(\omega)A_j\Big|\Big] \le \frac{2}{m}\,\mathbb{E}_{P_\eta}\Big[\max_{1\le k\le m}|M'_k|\Big] \le \frac{2}{m}\sqrt{4mV_\sigma} = 4\sqrt{\frac{V_\sigma}{m}} \xrightarrow{m\to\infty} 0.$$
Similarly to Equation (43), we can show that
$$\Big|\sum_{j=1}^m d_j(\omega)B_j\Big| \le 2\max_{1\le k\le m}|N'_k|,$$
where $N'_k = \sum_{j=1}^k B_j$. Again applying the maximal inequality,
$$\mathbb{E}_{P_\eta}\Big[\max_{1\le k\le m}|N'_k|^2\Big] \le 4mV_\gamma, \qquad \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\Big|\frac{1}{m}\sum_{j=1}^m d_j(\omega)B_j\Big|\Big] \le 4\sqrt{\frac{V_\gamma}{m}} \to 0$$
as $m \to \infty$. Combining everything gives
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**} - \mathrm{CURE}\big|\Big] \le O(1/\sqrt m) + O(1/\sqrt m) \to 0.$$

²⁸In other parts of this CURE, however, we keep using $\sigma^2_j := \mathrm{Var}\big[\hat\theta^{\mathrm{UPT}}_{j,\lambda^*_{\mathrm{clip},m}}\big]$ and $\gamma_j := \mathrm{Cov}\big[\hat\theta^{\mathrm{UPT}}_{j,\lambda^*_{\mathrm{clip},m}}, \bar Z^f_j\big]$, which are also the limits of $\bar\sigma^2_j$ and $\bar\gamma_j$ as $m \to \infty$.

$\mathrm{CURE}^{**}$ retains the nice asymptotic properties of CURE, but it involves $\bar\sigma^2_j$ and $\bar\gamma_j$, which we cannot evaluate from data (due to the unknown $\lambda^*_{\mathrm{clip},m}$).
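The workhorse of the argument above is the $L^2$ maximal inequality $\mathbb{E}[\max_k M_k^2] \le 4\,\mathbb{E}[M_m^2]$ for martingales. A quick Monte Carlo illustration of the inequality (our own sketch, for the simplest case of a sum of i.i.d. mean-zero increments):

```python
import numpy as np

# Martingale M_k = A_1 + ... + A_k with iid mean-zero Gaussian increments.
rng = np.random.default_rng(2)
m, reps = 50, 20_000
A = rng.normal(size=(reps, m))           # increments A_j
M = np.cumsum(A, axis=1)                 # martingale paths M_1, ..., M_m
lhs = np.mean(np.max(M**2, axis=1))      # Monte Carlo E[max_k M_k^2]
rhs = 4 * np.mean(M[:, -1]**2)           # 4 * Monte Carlo E[M_m^2]
assert lhs <= rhs
```

The gap between the two sides is what makes the resulting $O(1/\sqrt m)$ rates in the bounds above conservative but sufficient.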
Our next lemma derives the asymptotic closeness between $\mathrm{CURE}^{**}$ and $\mathrm{CURE}^{*}$, the latter (Eq. 40, with $\hat\lambda_{\mathrm{clip}}$-based plug-ins) being one step closer to the fully data-driven $\widehat{\mathrm{CURE}}$.

Lemma F.5 (Closeness between $\mathrm{CURE}^{**}$ and $\mathrm{CURE}^{*}$). Under the assumptions of Proposition C.5, we have
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) - \mathrm{CURE}^{*}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big)\big|\Big] \xrightarrow{m\to\infty} 0.$$

Proof. By construction, $\mathrm{CURE}^{**}$ and $\mathrm{CURE}^{*}$ differ only in their use of $\bar\sigma^2_j$ versus $\tilde\sigma^2_j$ (and similarly $\bar\gamma_j$ versus $\tilde\gamma_j$). We can thus decompose their difference as
$$\sup_{\omega>0}\big|\mathrm{CURE}^{**} - \mathrm{CURE}^{*}\big| = \sup_{\omega>0}\Big|\frac{1}{m}\sum_{j=1}^m\Big[(2\omega^*_j-1)\big(\bar\sigma^2_j - \tilde\sigma^2_j\big) + 2(1-\omega^*_j)\big(\bar\gamma_j - \tilde\gamma_j\big)\Big]\Big|$$
$$\le \frac{1}{m}\sum_{j=1}^m\sup_{\omega>0}\Big|(2\omega^*_j-1)\big(\bar\sigma^2_j - \tilde\sigma^2_j\big)\Big| + \frac{1}{m}\sum_{j=1}^m\sup_{\omega>0}\Big|2(1-\omega^*_j)\big(\bar\gamma_j - \tilde\gamma_j\big)\Big| \le \frac{1}{m}\sum_{j=1}^m\big|\bar\sigma^2_j - \tilde\sigma^2_j\big| + \frac{2}{m}\sum_{j=1}^m\big|\bar\gamma_j - \tilde\gamma_j\big|, \tag{44}$$
since $\sup_\omega|2\omega^*_j - 1| \le 1$ and $\sup_\omega|2(1-\omega^*_j)| \le 2$. Since $\hat\lambda_{\mathrm{clip}} \to \lambda^*_{\mathrm{clip},m}$ (and $\hat\lambda^2_{\mathrm{clip}} \to (\lambda^*_{\mathrm{clip},m})^2$)²⁹ in $L^2$, and since by our fourth-moment assumptions $\mathbb{E}_{P_\eta}[\hat\sigma^2_j]$, $\mathbb{E}_{P_\eta}[\hat\tau^2_j]$, and $\mathbb{E}_{P_\eta}[\hat\gamma_j]$ are all finite, by construction we also have
$$\sup_j \mathbb{E}_{P_\eta}\big[\big(\bar\sigma^2_j - \tilde\sigma^2_j\big)^2\big] \xrightarrow{m\to\infty} 0 \qquad \text{and} \qquad \sup_j \mathbb{E}_{P_\eta}\big[\big(\bar\gamma_j - \tilde\gamma_j\big)^2\big] \xrightarrow{m\to\infty} 0.$$
These uniform $L^2$ convergence conditions are sufficient to ensure that the expectation of Equation (44) converges to 0, which proves the lemma.

At this point, the only intractable piece in $\mathrm{CURE}^{*}$ is the unknown variance target $\sigma^{*2}_j$, which is used to construct the weights $\omega^*_j$. $\widehat{\mathrm{CURE}}$ then operationalizes $\mathrm{CURE}^{*}$ by replacing $\sigma^{*2}_j$ with the sample-based $\check\sigma^2_j$. Our next lemma shows that $\check\sigma^2_j \to \sigma^{*2}_j$ uniformly in $L^2$ as $m \to \infty$, which then leads to our final closeness result between $\widehat{\mathrm{CURE}}$ and $\mathrm{CURE}^{*}$.

²⁹We defer the proof of this convergence result to Lemma F.6.

Lemma F.6 (Uniform $L^2$ convergence of $\check\sigma^2_j$). Under the assumptions of Proposition C.5, we have
$$\sup_j \mathbb{E}_{P_\eta}\big[\big(\check\sigma^2_j - \sigma^{*2}_j\big)^2\big] \xrightarrow{m\to\infty} 0.$$

Proof.
We can decompose the difference as
$$\check\sigma^2_j - \sigma^{*2}_j = \underbrace{\frac{1}{n_j}\big(\overline{\sigma^2} - \mu_{\sigma^2}\big) + \frac{N_j+n_j}{N_jn_j}\hat\lambda^2_{\mathrm{clip}}\big(\overline{\tau^2} - \mu_{\tau^2}\big) - \frac{2}{n_j}\hat\lambda_{\mathrm{clip}}\big(\overline{\gamma} - \mu_\gamma\big)}_{(*)} + \underbrace{\frac{N_j+n_j}{N_jn_j}\mu_{\tau^2}\Big(\hat\lambda^2_{\mathrm{clip}} - \big(\lambda^*_{\mathrm{clip},m}\big)^2\Big) - \frac{2}{n_j}\mu_\gamma\big(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}\big)}_{(**)},$$
where $\overline{\sigma^2}$, $\overline{\tau^2}$, and $\overline{\gamma}$ denote across-problem averages of $\hat\sigma^2_j$, $\hat\tau^2_j$, and $\hat\gamma_j$ (not the per-problem $\bar\sigma^2_j$, $\bar\gamma_j$ from (41)).

Bounding $(*)$: using the inequality $(a+b+c)^2 \le 3(a^2+b^2+c^2)$,
$$\mathbb{E}_{P_\eta}\big[(*)^2\big] \le 3\Big(c_1^2\,\mathbb{E}_{P_\eta}\big[\big(\overline{\sigma^2} - \mu_{\sigma^2}\big)^2\big] + c_2^2\,\mathbb{E}_{P_\eta}\big[\big(\overline{\tau^2} - \mu_{\tau^2}\big)^2\big] + c_3^2\,\mathbb{E}_{P_\eta}\big[\big(\overline{\gamma} - \mu_\gamma\big)^2\big]\Big), \tag{45}$$
where $c_1, c_2, c_3$ are bounded terms involving $n_j$, $N_j$, and $\hat\lambda_{\mathrm{clip}}$ (note that $\hat\lambda_{\mathrm{clip}}$ is clipped and hence bounded). Now, by our assumptions (especially the finiteness of fourth moments) and the Weak Law of Large Numbers, as $m \to \infty$ we have
$$\overline{\sigma^2} \xrightarrow{P} \mathbb{E}_{P_\eta}[\sigma^2_j] =: \mu_{\sigma^2}, \qquad \overline{\tau^2} \xrightarrow{P} \mathbb{E}_{P_\eta}[\tau^2_j] =: \mu_{\tau^2}, \qquad \overline{\gamma} \xrightarrow{P} \mathbb{E}_{P_\eta}[\gamma_j] =: \mu_\gamma,$$
and, again by the finite fourth moments of the data-generating process,
$$\mathbb{E}_{P_\eta}\big[\big(\overline{\sigma^2} - \mu_{\sigma^2}\big)^2\big] = \mathrm{Var}_{P_\eta}\big[\overline{\sigma^2}\big] = \frac{1}{m}\mathrm{Var}_{P_\eta}[\hat\sigma^2_j] = O(1/m);$$
similarly, the other two terms on the right-hand side of Equation (45) are also $O(1/m)$. Moreover, since all the problems are i.i.d., these rates remain valid even after taking the supremum over $j \in [m]$. We thus have $\sup_j \mathbb{E}_{P_\eta}[(*)^2] = O(1/m) = o(1)$.

Bounding $(**)$: similarly, we use the inequality $(a+b)^2 \le 2a^2 + 2b^2$ to get
$$\mathbb{E}_{P_\eta}\big[(**)^2\big] \le 2\Big(k_1^2\,\mathbb{E}_{P_\eta}\Big[\Big(\hat\lambda^2_{\mathrm{clip}} - \big(\lambda^*_{\mathrm{clip},m}\big)^2\Big)^2\Big] + k_2^2\,\mathbb{E}_{P_\eta}\big[\big(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}\big)^2\big]\Big), \tag{46}$$
where $k_1, k_2$ are bounded terms involving $n_j$, $N_j$, $\mu_{\tau^2}$, and $\mu_\gamma$. By Lemma C.2, we already know that $\hat\lambda_{\mathrm{clip}} \to \lambda^*_{\mathrm{clip},m}$ in $L^2$. Further, since both $\hat\lambda_{\mathrm{clip}}$ and $\lambda^*_{\mathrm{clip},m}$ are bounded within $[0,1]$,
$$\Big(\hat\lambda^2_{\mathrm{clip}} - \big(\lambda^*_{\mathrm{clip},m}\big)^2\Big)^2 = \big(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}\big)^2\big(\hat\lambda_{\mathrm{clip}} + \lambda^*_{\mathrm{clip},m}\big)^2 \le 4\big(\hat\lambda_{\mathrm{clip}} - \lambda^*_{\mathrm{clip},m}\big)^2,$$
so we know that $\hat\lambda^2_{\mathrm{clip}} \to (\lambda^*_{\mathrm{clip},m})^2$ in $L^2$ as well. We thus have $\sup_j \mathbb{E}_{P_\eta}[(**)^2] = o(1)$.

Putting $(*)$ and $(**)$ together: since $(\check\sigma^2_j - \sigma^{*2}_j)^2 \le 2(*)^2 + 2(**)^2$, the above results show that
$$\sup_j \mathbb{E}_{P_\eta}\big[\big(\check\sigma^2_j - \sigma^{*2}_j\big)^2\big] \le 2\sup_j \mathbb{E}_{P_\eta}\big[(*)^2\big] + 2\sup_j \mathbb{E}_{P_\eta}\big[(**)^2\big] = o(1),$$
and we obtain the uniform convergence in $L^2$.

With Lemma F.6, we can derive an asymptotic closeness result between $\mathrm{CURE}^{*}$ and the new $\widehat{\mathrm{CURE}}$ that we can actually calculate from data.

Lemma F.7 (Closeness between $\widehat{\mathrm{CURE}}$ and $\mathrm{CURE}^{*}$).
Under the assumptions of Proposition C.5, we have
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\widehat{\mathrm{CURE}}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) - \mathrm{CURE}^{*}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big)\big|\Big] \xrightarrow{m\to\infty} 0.$$

Proof. Denote by $D_j(\omega) = \widehat{\mathrm{CURE}}_j - \mathrm{CURE}^{*}_j$ the difference on the $j$-th problem. With
$$\Delta\omega_j := \hat\omega_j - \omega^*_j = \frac{\omega\big(\sigma^{*2}_j - \check\sigma^2_j\big)}{\big(\omega + \check\sigma^2_j\big)\big(\omega + \sigma^{*2}_j\big)},$$
we can further decompose
$$D_j(\omega) = \underbrace{2\,\Delta\omega_j\,\tilde\sigma^2_j}_{Q_j(\omega)} - \underbrace{2\,\Delta\omega_j\,\tilde\gamma_j}_{R_j(\omega)} - \underbrace{\Delta\omega_j\big(2 - \hat\omega_j - \omega^*_j\big)\big(\hat\theta_j - \bar Z^f_j\big)^2}_{S_j(\omega)}.$$
Going term by term, for $Q_j(\omega)$ we have, for all $\omega > 0$,
$$|Q_j(\omega)| = \Big|\frac{2\omega\big(\sigma^{*2}_j - \check\sigma^2_j\big)}{\big(\omega + \check\sigma^2_j\big)\big(\omega + \sigma^{*2}_j\big)}\,\tilde\sigma^2_j\Big| \le \frac{2}{\delta}\,\big|\sigma^{*2}_j - \check\sigma^2_j\big|\,\big|\tilde\sigma^2_j\big|,$$
since $\frac{\omega}{\omega + \check\sigma^2_j} \le 1$ and $\omega + \sigma^{*2}_j \ge \delta > 0$ (by our assumption in Proposition C.5). Therefore,
$$\mathbb{E}_{P_\eta}\Big[\frac{1}{m}\sum_{j=1}^m \sup_{\omega\ge0}|Q_j(\omega)|\Big] \le \frac{2}{\delta}\cdot\frac{1}{m}\sum_{j=1}^m \mathbb{E}_{P_\eta}\Big[\big|\sigma^{*2}_j - \check\sigma^2_j\big|\,\big|\tilde\sigma^2_j\big|\Big] \le \frac{2}{\delta}\sup_j\sqrt{\mathbb{E}_{P_\eta}\big[\big(\sigma^{*2}_j - \check\sigma^2_j\big)^2\big]\,\mathbb{E}_{P_\eta}\big[\big(\tilde\sigma^2_j\big)^2\big]}, \tag{47}$$
where the last inequality follows from Cauchy–Schwarz. Handling $R_j(\omega)$ similarly, we have
$$\mathbb{E}_{P_\eta}\Big[\frac{1}{m}\sum_{j=1}^m \sup_{\omega\ge0}|R_j(\omega)|\Big] \le \frac{2}{\delta}\sup_j\sqrt{\mathbb{E}_{P_\eta}\big[\big(\sigma^{*2}_j - \check\sigma^2_j\big)^2\big]\,\mathbb{E}_{P_\eta}\big[\tilde\gamma_j^2\big]}. \tag{48}$$
Finally, we have
$$|S_j(\omega)| = \frac{\omega\,\big|\sigma^{*2}_j - \check\sigma^2_j\big|}{\big(\omega + \check\sigma^2_j\big)\big(\omega + \sigma^{*2}_j\big)}\,\big|2 - \hat\omega_j - \omega^*_j\big|\,\big(\hat\theta_j - \bar Z^f_j\big)^2.$$
Using $\frac{\omega}{\omega + \check\sigma^2_j} \le 1$, $\frac{1}{\omega + \sigma^{*2}_j} \le \frac{1}{\delta}$, and $|2 - \hat\omega_j - \omega^*_j| \le 2$,
$$\sup_{\omega}|S_j(\omega)| \le \frac{2}{\delta}\,\big|\sigma^{*2}_j - \check\sigma^2_j\big|\,\big(\hat\theta_j - \bar Z^f_j\big)^2,$$
which gives
$$\mathbb{E}_{P_\eta}\Big[\frac{1}{m}\sum_{j=1}^m \sup_{\omega\ge0}|S_j(\omega)|\Big] \le \frac{2}{\delta}\sup_j\sqrt{\mathbb{E}_{P_\eta}\big[\big(\sigma^{*2}_j - \check\sigma^2_j\big)^2\big]\,\mathbb{E}_{P_\eta}\big[\big(\hat\theta_j - \bar Z^f_j\big)^4\big]}. \tag{49}$$
Now, by the uniform $L^2$ convergence of $\check\sigma^2_j$ given by Lemma F.6, together with the fourth-moment assumptions that guarantee $\mathbb{E}_{P_\eta}[(\tilde\sigma^2_j)^2] < \infty$, $\mathbb{E}_{P_\eta}[\tilde\gamma_j^2] < \infty$, and $\mathbb{E}_{P_\eta}[(\hat\theta_j - \bar Z^f_j)^4] < \infty$, we immediately see that (47), (48), and (49) are all $o(1)$ terms. Hence
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\widehat{\mathrm{CURE}}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) - \mathrm{CURE}^{*}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big)\big|\Big] \le \mathbb{E}_{P_\eta}\Big[\frac{1}{m}\sum_{j=1}^m\sup_{\omega\ge0}\big(|Q_j(\omega)| + |R_j(\omega)| + |S_j(\omega)|\big)\Big] \to 0.$$

Finally, combining the results from Theorem F.4, Lemma F.5, and Lemma F.7 via the triangle inequality, we obtain the final result through
$$\mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\widehat{\mathrm{CURE}}\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega\big) - \ell_m\big(\hat{\boldsymbol\theta}^{\mathrm{UPAS}}_\omega, \boldsymbol\theta\big)\big|\Big] \le \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**} - \ell_m\big|\Big] + \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\mathrm{CURE}^{**} - \mathrm{CURE}^{*}\big|\Big] + \mathbb{E}_{P_\eta}\Big[\sup_{\omega\ge0}\big|\widehat{\mathrm{CURE}} - \mathrm{CURE}^{*}\big|\Big] \xrightarrow{m\to\infty} 0.$$