Active, anytime-valid risk controlling prediction sets

Ziyu Xu (Department of Statistics and Data Science, Carnegie Mellon University; xzy@cmu.edu)*
Nikos Karampatziakis (Microsoft; nikosk@microsoft.com)
Paul Mineiro (Microsoft; pmineiro@microsoft.com)

*Part of this work was done while interning at Microsoft.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Rigorously establishing the safety of black-box machine learning models with respect to critical risk measures is important for providing guarantees about model behavior. Recently, Bates et al. (JACM '24) introduced the notion of a risk controlling prediction set (RCPS) for producing prediction sets from machine learning models with statistically guaranteed low risk. Our method extends this notion to the sequential setting, where we provide guarantees even when the data is collected adaptively, and ensures that the risk guarantee is anytime-valid, i.e., holds simultaneously at all time steps. Further, we propose a framework for constructing RCPSes under active labeling, i.e., allowing one to use a labeling policy that chooses whether to query the true label for each received data point, while ensuring that the expected proportion of data points whose labels are queried stays below a predetermined label budget. We also describe how to use predictors (i.e., the machine learning model for which we provide risk control guarantees) to further improve the utility of our RCPSes by estimating the expected risk conditioned on the covariates. We characterize the optimal choices of labeling policy and predictor under a fixed label budget, and show a regret result that relates the estimation error of the optimal labeling policy and predictor to the wealth process that underlies our RCPSes. Lastly, we present practical ways of formulating labeling policies and empirically show that our labeling policies use fewer labels to reach higher utility than naive baseline labeling strategies on both simulations and real data.

1 Introduction

One of the core problems of modern deep learning systems is the lack of rigorous statistical guarantees one can ensure about the performance of a model in practice. In particular, we are interested in ensuring the safety of a deep learning system, so that it does not incur undue risk while optimizing for an objective of interest. This type of guarantee arises in many applications. For example, a deep learning based medical imaging segmentation system that detects lesions [36, 10, 29] should guarantee that it does not miss most of the lesion tissue while remaining precise and minimizing the total amount of tissue that is highlighted. Hence, it is crucial to provide a statistical guarantee about the safety of any machine learning system to be deployed. Bates et al. [4] introduced the notion of a risk controlling prediction set as a method to derive such guarantees on top of the outputs of a wide range of black-box models. They consider the setting where all the calibration data for verifying statistical safety guarantees is available before deployment and the model only needs to be calibrated once, i.e., the batch setting. However, this is often unrealistic in a production setup, where we have no data concerning the performance of a model on the distribution of interest before we deploy it, and we wish to update our calibration each time a new data point (or group of data points) arrives.
Consequently, it is natural to calibrate the machine learning model in an online fashion while receiving new data sequentially. Unfortunately, methods for obtaining statistical guarantees in the batch setting of Bates et al. [4] do not ensure risk control guarantees in the sequential regime. Further, when one uses data from production, the raw data is unlabeled, and one must expend resources (either paying experts or utilizing a more powerful model) to create gold labels for these data points. Hence, we generalize the sequential setup to an active setting in which we see the covariate X (e.g., an image, a natural language query from the user, etc.) and choose whether to query the true label Y. Concretely, consider the following scenarios where an active and sequential method is relevant.

Reduce query cost in medical imaging. A medical imaging system that outputs, for each pixel of an image, a score determining whether there is a lesion or not would want to utilize labels given by medical experts for unlabeled images from new patients. Since the cost of asking experts to label these images is quite high, one would want to query experts efficiently, and only on data that would be most helpful for reducing the number of highlighted pixels.

Domain adaptation for behavior prediction. Often, the post-deployment distribution is different from anything that has been seen before. For example, during a navigation task for a robot, we may want to predict the actions of other agents and avoid colliding with them when traveling between two points [20]. Since agents may behave differently in every environment, it makes sense to collect behavior data in the test environment and update the behavior prediction in an online fashion to get accurate predictions calibrated for that specific environment.

Safe outputs for large language models (LLMs). One of the goals with large language models is to ensure their responses are not harmful in some fashion (e.g., factually wrong, toxic, etc.). One can view this as producing a prediction set for the binary label set $Y \in \{\text{harmful}, \text{not harmful}\}$. Many pipelines for modern LLMs include some form of safety classifier, which scores the risk level of an output and determines whether it should be shown to the user [21, 15] or whether a default backup response should be used instead. One would want to label production data acquired from user interactions with the LLM and use it to calibrate a cutoff for the scores that are considered low enough for the response to be allowed through.

Example: image classification. Suppose we wish to classify an image $X \in \mathcal{X}$, and we have access to a probabilistic classifier $s : \mathcal{X} \to \Delta^{\mathcal{Y}}$, where $\Delta^{\mathcal{Y}}$ is the probability simplex over the set of all possible classes $\mathcal{Y}$. Let $s_y(x)$ denote the probability of class y in the distribution s(x). Based on the probabilities from s(X), we can define $C(X, \beta)$ to contain the labels with the largest probabilities that sum to at least $\beta \in [0, 1]$, in the following fashion:

$$\gamma(X, \beta) := \max\Big\{\gamma \in [0, 1] : \sum_{y \in \mathcal{Y}} \mathbb{1}\{s_y(X) \ge \gamma\}\, s_y(X) \ge \beta\Big\}, \qquad C(X, \beta) := \{y \in \mathcal{Y} : s_y(X) \ge \gamma(X, \beta)\}.$$

Now, we can define the miscoverage error of our label set $C(X, \beta)$ as follows:

$$r(X, Y, \beta) := \mathbb{1}\{Y \notin C(X, \beta)\}. \tag{1}$$

Now, assume that $(X, Y) \sim P$, i.e., the images and classes are jointly drawn from a fixed distribution. We want to find a choice of β such that $\rho(\beta) := \mathbb{E}[r(X, Y, \beta)]$ is guaranteed to be at most $\theta \in [0, 1]$, i.e., the expected miscoverage over the population of images and labels is at most θ.
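To make this concrete, here is a minimal Python sketch (ours, not from the paper's released code; the function names are illustrative) of the prediction set $C(X, \beta)$ and the miscoverage risk (1), computed from a vector of class probabilities $s(x)$; ties in the probabilities are ignored for simplicity.

```python
import numpy as np

def prediction_set(probs: np.ndarray, beta: float) -> np.ndarray:
    """C(X, beta): the top-probability classes whose mass reaches beta.

    Sorting by descending probability and taking the shortest prefix with
    cumulative mass >= beta realizes the threshold gamma(X, beta) from the
    definition above (ties in probabilities are ignored for simplicity)."""
    order = np.argsort(probs)[::-1]          # classes by descending s_y(x)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, beta)) + 1  # smallest prefix with mass >= beta
    return order[:min(k, len(probs))]

def miscoverage(probs: np.ndarray, y: int, beta: float) -> float:
    """r(X, Y, beta) = 1{Y not in C(X, beta)}, the risk in (1)."""
    return float(y not in prediction_set(probs, beta))

# Larger beta gives larger sets and thus lower miscoverage risk.
probs = np.array([0.5, 0.3, 0.15, 0.05])
assert set(prediction_set(probs, 0.9)) == {0, 1, 2}
assert miscoverage(probs, y=3, beta=0.9) == 1.0
```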
In the above image classification example, we do not simply wish to find any β that ensures $\rho(\beta) \le \theta$: setting β = 1 would trivially ensure this guarantee for any $\theta \in [0, 1]$. We also want to minimize the size of our uncertainty set $C(X, \beta)$. To present this formulation in more general terms, we are interested in solving the following problem for a fixed level of risk control $\theta \in [0, 1]$:

$$\max_{\beta}\ g(\beta) \quad \text{subject to} \quad \rho(\beta) \le \theta, \tag{2}$$

where g is the utility of our choice. We make the following natural assumption about r, ρ, and g.

Assumption 1. g and ρ are monotonically decreasing w.r.t. β, and we assume ρ(1) = 0. In addition, ρ is right-continuous.

Our image classification example has an expected risk and utility that satisfy the respective monotonicity assumptions, and such risk measures arise in many applications such as natural language question answering [26], image segmentation [1], and behavior control for robotics [20, 16]. Assumption 1 implies that maximizing g(β) is equivalent to minimizing β, as g is decreasing in β, and the right-continuity of ρ allows us to define the notion of an optimal calibration parameter that is the solution to (2): $\beta^* := \min\{\beta \in [0, 1] : \rho(\beta) \le \theta\}$. Our goal in this paper is to derive a sequence of upper bounds on $\beta^*$ that quickly approach the true $\beta^*$ but provide anytime-valid risk control, in the sense that they are never smaller than the smallest safe parameter $\beta^*$ and hence induce risk under θ, i.e., $\beta \ge \beta^*$ implies that $\rho(\beta) \le \rho(\beta^*) \le \theta$. Since we are guaranteed by Assumption 1 that $\rho(1) = 0 \le \theta$, we always have the safe option of β = 1 to start with as our upper bound.

Our contributions. The primary contributions of this paper are as follows.

1. Extensions of RCPS to anytime-valid and active settings. We extend the notion of RCPS in two ways: (1) to enable anytime-valid RCPS, which allows one to refine the set as one receives more samples in a stream while maintaining risk control throughout the entire stream, and (2) to define an RCPS that is valid under active learning, i.e., enabling us to decide whether to label each example based on the covariates. We also define a way to incorporate risk predictions from the machine learning model to further decrease variance and reduce the number of labels needed to estimate $\beta^*$. We formulate this betting framework in Section 2.

2. Deriving powerful labeling policies and predictors. We show in Section 3 that our active, anytime-valid RCPS methods are practically powerful and converge to $\beta^*$ in a label-efficient manner by deriving formulations for the optimal labeling policy and predictor under the standard log-optimality criterion used for evaluating anytime-valid methods [13, 34, 19]. We derive explicit regret bounds w.r.t. a lower bound on the growth rate of the wealth processes that underlie our RCPS methods. These bounds characterize how the deviation of any labeling policy and predictor from the log-optimal policy and predictor affects the growth rate of our wealth processes (and hence the earliest time at which a candidate β is removed from consideration as $\beta^*$). In Section 4 we also show through experiments that machine learning model based estimators of the optimal policy and predictor are label efficient in practice.

Related work. Most relevant to this paper is the recent work of Zrnic and Candès [37], which provides a rigorous framework for statistical inference with active labeling policies and leverages machine learning predictions through prediction-powered inference [2].
However, their focus is on M-estimation and deriving asymptotic, martingale central limit theorem based results for a parameter. In contrast, we provide finite-sample, anytime-valid results that are also valid at adaptive stopping times and that directly utilize the e-process [28] construction of sequential tests. Further, our goal is to provide a time-uniform statistical guarantee in the RCPS framework rather than directly estimating a parameter with adaptively collected data. We discuss additional related work in depth in Section 5.

2 Anytime-valid risk control through betting

We use $(X_t)_{t \in I}$ to denote a sequence indexed by t with index set I. If the index set or indexing variable is apparent from context, we drop it for brevity. In our setup, we assume that our data points arrive in a stream $(X_1, Y_1), (X_2, Y_2), \ldots$ that proceeds indefinitely. Let $(\mathcal{F}_t)$ be the canonical filtration on the data, i.e., $\mathcal{F}_t := \sigma(\{(X_i, Y_i)\}_{i \in [t]})$ is the sigma-algebra generated by the first t points. Recall that we assumed $(X_t, Y_t) \sim P$ are i.i.d. draws for each $t \in \mathbb{N} := \{1, 2, 3, \ldots\}$, and we want to control the risk $\rho(\beta) = \mathbb{E}_{(X,Y) \sim P}[r(X, Y, \beta)]$, where the expectation is taken only over (X, Y). We illustrate an overview of our methodology (which we describe in the sequel) in Figure 1. We wish to output a sequence of calibration parameters, $(\hat\beta_t)$, such that every $\hat\beta_t$ is "safe", i.e., ensures that the resulting risk of the output is provably controlled under a fixed level.

Definition 1. A sequence of calibration parameters $(\hat\beta_t)$ is said to have (θ, α)-anytime-valid risk control if it possesses the following property:

$$\mathbb{P}\big(\rho(\hat\beta_t) \le \theta \text{ for all } t \in \mathbb{N}\big) \ge 1 - \alpha. \tag{3}$$

[Figure 1: Diagram of the active labeling setup for ensuring anytime-valid risk control. For each arriving data point (e.g., a user prompt and LLM response), a coin is flipped with the labeling probability: with $L_t = 1$ the gold label (e.g., toxicity of the response) is queried from a label oracle (e.g., a human labeler or a more powerful LLM), and with $L_t = 0$ the risk is treated as 0. An optional risk estimate $\hat r_t(X_t, \beta)$ decreases the variance of the incurred risk $r(X_t, Y_t, \beta)$, the threshold is updated from $\hat\beta_{t-1}$ to $\hat\beta_t$, and the deployed thresholds satisfy $\rho(\hat\beta_i) \le \theta$ for all $i \in \{1, 2, \ldots\}$ with probability $1 - \alpha$.]

We call this anytime-valid since the risk control condition (i.e., $\rho(\hat\beta_t) \le \theta$) is guaranteed to hold simultaneously at all $t \in \mathbb{N}$. Hence, this allows the user to process a continuous stream of data while controlling the probability that an "unsafe" $\hat\beta_t$, i.e., one with $\rho(\hat\beta_t) > \theta$, is ever chosen. We build on recent work that develops a framework for hypothesis testing and parameter estimation with sequential data collection, based on martingales and gambling with virtual wealth, known as testing by betting [30]. In this framework, the goal is to design an e-process, $(E_t)$, w.r.t. a null hypothesis $H_0$, which satisfies the following properties when the null is true:

Definition 2. An e-process $(E_t)_{t \in \mathbb{N}_0}$ w.r.t. a hypothesis $H_0$ is a nonnegative process for which there exists another nonnegative process $(M_t)_{t \in \mathbb{N}_0}$ such that the following hold when $H_0$ is true: (1) $\mathbb{E}[M_0] \le 1$, (2) $M_t \ge E_t$ for all $t \in \mathbb{N}$ almost surely, and (3) $\mathbb{E}[M_t \mid \mathcal{F}_{t-1}] \le M_{t-1}$ for all $t \in \mathbb{N}$, i.e., $(M_t)$ is a supermartingale.

E-processes are the main tool we use to construct $(\hat\beta_t)$. We leverage the probabilistic bound on e-processes provided by Ville's inequality to prove our anytime-valid risk control guarantee.

Fact 1 (Ville's inequality [33]). For any e-process $(E_t)$ with initial expectation bounded by 1, i.e., $\mathbb{E}[E_0] \le 1$, we have that $\mathbb{P}(\exists t \in \mathbb{N} : E_t \ge 1/\alpha) \le \alpha$ for each $\alpha \in [0, 1]$.
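As a quick empirical sanity check on Fact 1 (our own illustration, assuming a constant bet $\lambda = 0.5$), the simulation below tracks a betting supermartingale of the product form used throughout this paper under a true null; the frequency of ever crossing $1/\alpha$ should stay below α.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, theta, lam, T, n_trials = 0.05, 0.5, 0.5, 2000, 2000
crossed = 0
for _ in range(n_trials):
    # Payoffs bounded in [0, 1] with mean exactly theta, so the product
    # M_t = prod_i (1 + lam * (theta - R_i)) is a nonnegative martingale
    # with M_0 = 1, and Ville's inequality applies.
    R = rng.uniform(0, 1, size=T)
    M = np.cumprod(1 + lam * (theta - R))
    crossed += bool((M >= 1 / alpha).any())
print(f"crossing rate: {crossed / n_trials:.3f} (Ville bound: {alpha})")
```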
In this paper, all our e-processes will also be nonnegative supermartingales, so we denote them as $(M_t)$. Now, we specify the null hypotheses in our risk control setting. For each $\beta \in [0, 1]$, we test the null hypothesis $H_0^\beta : \rho(\beta) \ge \theta$ for a fixed risk control level $\theta \in [0, 1]$. Note that we include equality with θ in the null hypothesis since we do not wish to reject $H_0^{\beta^*}$ (it may be that $\rho(\beta^*) = \theta$). Let $\{(M_t(\beta))\}_{\beta \in [0,1]}$ be a family of e-processes where $(M_t(\beta))$ is an e-process for $H_0^\beta$. Then, we can derive $\hat\beta_t$ for each $t \in \mathbb{N}$ as follows:

$$\hat\beta_t := \min\{\beta \in [0, 1] : M_t(\beta') \ge 1/\alpha \text{ for all } \beta' > \beta\}. \tag{4}$$

Theorem 1. The sequence of estimates $(\hat\beta_t)$ in (4) satisfies the anytime-valid risk control guarantee (3), i.e., $\mathbb{P}(\rho(\hat\beta_t) \le \theta \text{ for all } t \in \mathbb{N}) \ge 1 - \alpha$.

Proof. First, we note that $\{\exists t \in \mathbb{N} : \rho(\hat\beta_t) > \theta\} \subseteq \{\exists t \in \mathbb{N} : \hat\beta_t < \beta^*\} \subseteq \{\exists t \in \mathbb{N} : M_t(\beta^*) \ge 1/\alpha\}$. Since $H_0^{\beta^*}$ is always true by definition of $\beta^*$, we get that $(M_t(\beta^*))$ is an e-process by Proposition 1. Thus, by applying Ville's inequality, we get that $\mathbb{P}(\exists t \in \mathbb{N} : \rho(\hat\beta_t) > \theta) \le \mathbb{P}(\exists t \in \mathbb{N} : M_t(\beta^*) \ge 1/\alpha) \le \alpha$.

Now, we will present a concrete example of an e-process. Denote $R_t(\beta) := r(X_t, Y_t, \beta)$. We test $H_0^\beta$ using the betting e-process from Waudby-Smith and Ramdas [34]:

$$M_t(\beta) := \prod_{i=1}^{t} \big(1 + \lambda_i (\theta - R_i(\beta))\big), \tag{5}$$

where $(\lambda_t)$ is predictable w.r.t. $(\mathcal{F}_t)$, i.e., $\lambda_t$ can be determined by $\mathcal{F}_{t-1}$ for each $t \in \mathbb{N}$, and $\lambda_t \in [0, (1 - \theta)^{-1}]$.

Proposition 1. $(M_t(\beta))$ in (5) is an e-process for all β where $H_0^\beta$ is true, i.e., where $\rho(\beta) \ge \theta$.

Proof. We note that $(M_t(\beta))$ is nonnegative by the support of $\lambda_t$ being limited, i.e., $1 + \lambda_t(\theta - r(X_t, Y_t, \beta)) \ge 1 + \lambda_t(\theta - 1) \ge 0$. Now, we will also show that $(M_t(\beta))$ is a supermartingale when $H_0^\beta$ is true:

$$\mathbb{E}[M_t(\beta) \mid \mathcal{F}_{t-1}] = \mathbb{E}[1 + \lambda_t(\theta - R_t(\beta)) \mid \mathcal{F}_{t-1}]\, M_{t-1}(\beta) = \big(1 + \lambda_t(\theta - \mathbb{E}[R_t(\beta) \mid \mathcal{F}_{t-1}])\big)\, M_{t-1}(\beta) \le M_{t-1}(\beta).$$

The second equality is because $\lambda_t$ is measurable w.r.t. $\mathcal{F}_{t-1}$, and the last inequality is by $R_t(\beta)$ being independent of $\mathcal{F}_{t-1}$ and $\mathbb{E}[R_t(\beta)] \ge \theta$ being true under $H_0^\beta$. Thus, we have our desired result.

Now, we have a concrete way to derive $(\hat\beta_t)$ that ensures the risk $\rho(\hat\beta_t)$ is controlled at every time step $t \in \mathbb{N}$. However, this requires one to label every example that arrives, i.e., it requires access to the entire stream of labels $(Y_t)$. We will now derive a more label-efficient way of constructing $(\hat\beta_t)$.

Remark 1. Ramdas et al. [27] show that e-processes of the form in (5) characterize the set of admissible e-processes (and hence anytime-valid sequential tests) for testing the mean of bounded random variables. Hence, it is an optimal choice of e-process for our setting, and such e-processes have been shown to perform better both theoretically and empirically than other sequential tests for bounded random variables (e.g., Hoeffding and empirical-Bernstein based tests [34]).

2.1 Active sampling for risk control

Now, we describe active learning for risk control, where an algorithm sees $X_t$ and decides whether the label $Y_t$ for the current point should be queried or not. At each step $t \in \mathbb{N}$, the algorithm produces a labeling policy $q_t : \mathcal{X} \to [q_t^{\min}, 1]$ based on the observed data (i.e., $\mathcal{F}_{t-1}$). It then queries the label $Y_t$ with probability $q_t(X_t)$, which is lower bounded by a constant $q_t^{\min}$. Let $L_t$ be the indicator random variable for whether the tth label is queried, i.e., $L_t \sim \text{Bern}(q_t(X_t))$. To produce a label-efficient method, one would hope to label the most impactful data points, i.e., those that result in the largest growth of $M_t(\beta)$ for choices of $\beta \in (0, \hat\beta_t)$ that are still in consideration for the next $\hat\beta_{t+1}$. For the labeling policies we consider in this paper, we let $q_t^{\min} \in [0, 1]$ be a lower bound on the labeling probability, i.e., $q_t(X_t) \ge q_t^{\min}$ almost surely.
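The following minimal sketch (ours) maintains the e-processes (5) over a finite grid of candidate β and outputs $\hat\beta_t$ via (4); a constant bet stands in for a predictable betting strategy such as ONS, and this is the fully-labeled base case (every $Y_t$ is observed).

```python
import numpy as np

class AnytimeValidRCPS:
    """Betting e-processes (5) on a grid of candidate betas, combined with
    the threshold rule (4). A constant bet `lam` stands in for a predictable
    betting strategy such as ONS; every data point is assumed labeled."""

    def __init__(self, theta: float, alpha: float, n_grid: int = 100, lam: float = 0.5):
        self.theta, self.alpha = theta, alpha
        self.betas = np.linspace(0.0, 1.0, n_grid)
        self.lam = min(lam, 1.0 / (1.0 - theta))  # keep lam in [0, (1 - theta)^{-1}]
        self.log_wealth = np.zeros(n_grid)        # log M_t(beta) per candidate

    def update(self, risks: np.ndarray) -> float:
        """risks[j] = r(X_t, Y_t, betas[j]); returns beta_hat_t per (4)."""
        self.log_wealth += np.log1p(self.lam * (self.theta - risks))
        # Rejecting H_0^beta (wealth >= 1/alpha) certifies beta as safe.
        rejected = self.log_wealth >= np.log(1.0 / self.alpha)
        # suffix_all[j] is True iff every candidate at index >= j is rejected.
        suffix_all = np.concatenate([np.cumprod(rejected[::-1])[::-1], [1]]).astype(bool)
        # beta_j qualifies in (4) iff all strictly larger candidates are rejected.
        return float(self.betas[suffix_all[1:]].min())
```

The per-point risks can be computed with, e.g., the `miscoverage` function sketched in the introduction.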
Thus, we can derive the following e-process for any sequence of labeling policies $(q_t)$:

$$M_t(\beta) := \prod_{i=1}^{t} \left(1 + \lambda_i \left(\theta - \frac{L_i}{q_i(X_i)} R_i(\beta)\right)\right). \tag{6}$$

Proposition 2. Let $(q_t)$ be a sequence of labeling policies and $(\lambda_t)$ be a sequence of betting parameters with $\lambda_t \in [0, ((q_t^{\min})^{-1} - \theta)^{-1}]$, and let both sequences be predictable w.r.t. $(\mathcal{F}_t)$ (i.e., $q_t$ and $\lambda_t$ are measurable w.r.t. $\mathcal{F}_{t-1}$ for each $t \in \mathbb{N}$). Then, $(M_t(\beta))$ in (6) is an e-process for all β where $H_0^\beta$ is true.

Proof. The proof is similar to that of the inverse propensity weighted e-processes derived in Waudby-Smith et al. [35]. We know that $(M_t(\beta))$ is nonnegative by the support of $\lambda_t$ being limited:

$$1 + \lambda_t \left(\theta - \frac{L_t}{q_t(X_t)} R_t(\beta)\right) \ge 1 + \lambda_t \left(\theta - (q_t^{\min})^{-1}\right) \ge 0,$$

where the first inequality is by $R_t(\beta) \le 1$ and $q_t(X_t) \ge q_t^{\min}$, and the second inequality is by $\lambda_t \le ((q_t^{\min})^{-1} - \theta)^{-1}$. Now, we will also show that $(M_t(\beta))$ is a supermartingale. We first show the following bound:

$$\mathbb{E}\left[\frac{L_t}{q_t(X_t)} R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right] = \mathbb{E}\left[\frac{\mathbb{E}[L_t \mid X_t, \mathcal{F}_{t-1}]}{q_t(X_t)} R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right] = \mathbb{E}[R_t(\beta) \mid \mathcal{F}_{t-1}] \ge \theta. \tag{7}$$

The first equality is by further conditioning on $X_t$, and the second equality is because $L_t$ is defined to be a Bernoulli random variable with parameter $q_t(X_t)$ that is independent of all other randomness when conditioned on $X_t$ and $\mathcal{F}_{t-1}$. The last inequality is by $H_0^\beta$ being true. Now, we have that

$$\mathbb{E}[M_t(\beta) \mid \mathcal{F}_{t-1}] = \left(1 + \lambda_t \left(\theta - \mathbb{E}\left[\frac{L_t}{q_t(X_t)} R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right]\right)\right) M_{t-1}(\beta) \le M_{t-1}(\beta),$$

where the inequality is by (7). Hence, we have shown that $(M_t(\beta))$ is a nonnegative supermartingale, and hence also an e-process, under $H_0^\beta$.

Theorem 2. $(\hat\beta_t)$ defined w.r.t. (6) satisfies the anytime-valid risk control guarantee (3).

This is a result of Proposition 2 and Ville's inequality, similar to the proof of Theorem 1. Theorem 2 essentially shows that we can still design e-processes when allowing for a probabilistic labeling policy.

2.2 Variance reduction through prediction

Often, we also have an estimate of the risk we incur; e.g., in the example given for classification, we have an estimated probability distribution over possible outcomes. As a result, we also have an empirical estimate of $\mathbb{E}[r(X, Y, \beta) \mid X = x]$ for each $\beta \in [0, 1]$ that we can use to reduce the variance of our estimate. This is similar to the usage of control variates for improving Monte Carlo estimation [3, V.2], and of predictors in the recently formulated prediction-powered inference framework [2]. Let $\hat r_t : \mathcal{X} \times [0, 1] \to [0, 1]$ be an estimator of the risk incurred by parameter β conditional on $x \in \mathcal{X}$ for each time step $t \in \mathbb{N}$; $(\hat r_t)$ is predictable w.r.t. $(\mathcal{F}_t)$.

Where does $\hat r$ come from? We note that machine learning models often have some estimate $\hat P(X)$ of the conditional distribution of $Y \mid X$ (e.g., class probabilities, conditional diffusion models, LLMs, etc.). Thus, for any realized covariate x, we can use $\mathbb{E}_{Y \sim \hat P(x)}[r(x, Y, \beta)]$ from the machine learning model as our choice of $\hat r(x, \beta)$. This expectation can either be calculated analytically (as we do in our classification examples in the experiments) or derived using Monte Carlo approximation (for generative models such as LLMs, one can sample from the conditional distribution). In essence, we can obtain a predictor from the very model we are calibrating. We may also update our predictor using new $(X_t, Y_t)$ pairs we receive while calibrating $(\hat\beta_t)$. Now, we define our e-process that utilizes our predictor as follows:

$$M_t(\beta) := \prod_{i=1}^{t} \left(1 + \lambda_i \left(\theta - \hat r_i(X_i, \beta) - \frac{L_i}{q_i(X_i)} \tilde R_i(\beta)\right)\right), \tag{8}$$

where $\tilde R_t(\beta) := R_t(\beta) - \hat r_t(X_t, \beta)$, and we restrict $\lambda_t \in [0, ((q_t^{\min})^{-1} - \theta)^{-1}]$ (or, in other words, $q_t^{\min} \ge \lambda_t / (1 + \lambda_t \theta)$). Note that this e-process recovers the active e-process defined in (6) if we set $\hat r_t(\cdot, \beta) = 0$ for all $t \in \mathbb{N}$.
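A single-step sketch (ours, with illustrative names) of the wealth update in (8): setting `r_hat` to zero recovers the purely active e-process (6), and additionally setting `q = 1` with every point labeled recovers (5).

```python
import numpy as np

def wealth_update(log_wealth, lam, theta, q, labeled, risk, r_hat):
    """One step of the predictor-augmented e-process (8) over a beta grid.

    log_wealth : array of log M_{t-1}(beta) per candidate beta
    lam        : predictable bet in [0, ((q_min)^{-1} - theta)^{-1}]
    q          : labeling probability q_t(X_t), bounded below by q_min
    labeled    : L_t ~ Bernoulli(q), whether the label Y_t was queried
    risk       : array of r(X_t, Y_t, beta) (only meaningful when labeled)
    r_hat      : array of predicted risks r_hat_t(X_t, beta)
    """
    # Inverse-propensity-weighted residual; zero when the label is not
    # queried, matching L_t = 0 in (8).
    ipw = (risk - r_hat) / q if labeled else np.zeros_like(r_hat)
    return log_wealth + np.log1p(lam * (theta - r_hat - ipw))
```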
Proposition 3. $(M_t(\beta))$ as defined in (8) is an e-process for $H_0^\beta$.

Proof. Since the restriction on $(\lambda_t)$ ensures that $M_t(\beta)$ is nonnegative, to show that $(M_t(\beta))$ is an e-process it is sufficient to show:

$$\mathbb{E}\left[\lambda_t\left(\theta - \hat r_t(X_t, \beta) - \frac{L_t}{q_t(X_t)} \tilde R_t(\beta)\right) \,\middle|\, \mathcal{F}_{t-1}\right] = \lambda_t\left(\theta - \mathbb{E}\left[\hat r_t(X_t, \beta) + \frac{L_t}{q_t(X_t)} \tilde R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right]\right)$$
$$= \lambda_t\left(\theta - \mathbb{E}\left[\hat r_t(X_t, \beta) + \frac{\mathbb{E}[L_t \mid X_t, \mathcal{F}_{t-1}]}{q_t(X_t)} \tilde R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right]\right) = \lambda_t\big(\theta - \mathbb{E}[\hat r_t(X_t, \beta) + \tilde R_t(\beta) \mid \mathcal{F}_{t-1}]\big) = \lambda_t\big(\theta - \mathbb{E}[R_t(\beta) \mid \mathcal{F}_{t-1}]\big) \le 0.$$

The second equality is by the definition of $L_t$, the last equality is by the definition of $\tilde R_t$, and the last inequality is due to $H_0^\beta$ being true.

The role of $\hat r_t(X_t, \beta)$ is to accurately predict $R_t(\beta)$. Bad predictions can increase the variance of $\tilde R_t(\beta)$ and lead to slower growth of $M_t(\beta)$, but they do not compromise the risk control guarantee. On the other hand, accurate predictions, which can come from pretrained models, decrease variance and improve the growth of $M_t(\beta)$. We characterize the optimal predictor (Proposition 4) and relate the accuracy of a predictor to its effect on the e-process (Theorem 3) in the next section.

3 Optimal labeling policies

Since the goal of having an active labeling policy is to label fewer data points, one reasonable way of doing this is to maximize the growth rate of our e-process $(M_t(\beta))$ defined in (8). Define the following function, for some $\beta \in [0, 1]$, of a labeling policy q, predictor $\hat r$, and betting parameter λ, where we let $L \mid X \sim \text{Bern}(q(X))$ and $(X, Y) \sim P$:

$$G^\beta(q, \hat r, \lambda) := \log\left(1 + \lambda\left(\theta - \hat r(X, \beta) - \frac{L}{q(X)} \tilde R(\beta)\right)\right), \quad \text{where } \tilde R(\beta) := r(X, Y, \beta) - \hat r(X, \beta).$$

Define the growth rate at the tth step of $(M_t(\beta))$ as $G^\beta_t := \mathbb{E}[G^\beta_t(q_t, \hat r_t, \lambda_t)]$, where we let $G^\beta_t$ be identical to $G^\beta$ but with X and Y replaced by $X_t$ and $Y_t$, respectively. It is a standard notion of power or sample efficiency for e-processes. Typically, our goal when designing an e-process based test is to maximize such a metric, i.e., we want our e-process to be log-optimal [13, 34, 19]. Log-optimality is also called the Kelly criterion in finance [18], and it is known that maximizing the growth rate of a process is equivalent to minimizing the expected time for the process to exceed a threshold (i.e., for our sequential test to reject a value of β) in the limit as the threshold approaches infinity [5]. Thus, in an asymptotic sense, maximizing the growth rate is equivalent to minimizing the expected time to rejection.

Our goal is to maximize the growth rate while respecting a constraint on the number of labels we can produce. Let $B \in [0, 1]$ be the constraint on our labeling budget, i.e., we label, in expectation, a B fraction of all data points that we receive. To achieve both of these goals, we wish to choose $q_t$, $\hat r_t$, and $\lambda_t$ that are the solutions to the following optimization problem:

$$\max_{q, \hat r, \lambda}\ \mathbb{E}_{L \sim q(X)}[G^\beta(q, \hat r, \lambda)] \quad \text{s.t.} \quad \mathbb{E}[q(X)] \le B.$$

Since solving the above optimization problem is analytically difficult, one can instead maximize a lower bound on the expected growth [32, 25, 24]:

$$\hat G^\beta(q, \hat r, \lambda) := \lambda\left(\theta - \hat r(X, \beta) - \frac{L}{q(X)} \tilde R(\beta)\right) - \lambda^2\left(\theta - \hat r(X, \beta) - \frac{L}{q(X)} \tilde R(\beta)\right)^2 \le G^\beta(q, \hat r, \lambda), \tag{9}$$

which holds when $\lambda \in [0, (2(q^{\min})^{-1} - 2\theta)^{-1}]$, where $q^{\min} := \inf_{x \in \mathcal{X}} q(x)$. We can use the lower bound in (9) to formulate the following optimization problem:

$$\max_{q, \hat r, \lambda}\ \mathbb{E}\big[\hat G^\beta(q, \hat r, \lambda)\big] \quad \text{s.t.} \quad \mathbb{E}[q(X)] \le B. \tag{10}$$

Let $(q^*, r^*, \lambda^*)$ be the tuple that is the solution to (10). We can analytically characterize $r^*$.

Proposition 4. The optimal predictor $r^*$ in the solution to (10) is $r^*(x, \beta) = \mathbb{E}[r(X, Y, \beta) \mid X = x]$ for each $x \in \mathcal{X}$.

We defer the proof to Appendix A.1.
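Stepping back briefly: the lower bound (9) rests on the elementary inequality $\log(1+x) \ge x - x^2$ for $x \ge -\tfrac12$, a step the text leaves implicit; here is our short verification of why the stated restriction on λ suffices. Since $\hat r(X, \beta) \in [0, 1]$ and $r(X, Y, \beta) \le 1$, we have $\hat r(X, \beta) + \frac{L}{q(X)}\tilde R(\beta) \le (q^{\min})^{-1}$, and therefore

$$\lambda \le \frac{1}{2(q^{\min})^{-1} - 2\theta} \quad \Longrightarrow \quad \lambda\left(\theta - \hat r(X, \beta) - \frac{L}{q(X)}\tilde R(\beta)\right) \ge \lambda\big(\theta - (q^{\min})^{-1}\big) \ge -\frac12,$$

so applying $\log(1+x) \ge x - x^2$ to the wealth increment yields $G^\beta(q, \hat r, \lambda) \ge \hat G^\beta(q, \hat r, \lambda)$.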
The optimal choice of q has the following formulation.

Proposition 5. If we fix $\hat r$ and λ, the solution to the optimization problem in (10) is given by $q^*_\beta$, where $q^*_\beta(x) \propto \sqrt{\mathbb{E}[\tilde R(\beta)^2 \mid X = x]}$ for each $x \in \mathcal{X}$, if such a $q^*_\beta$ exists.

We defer the proof to Appendix A.2. Let $\sigma_\beta(x) := \sqrt{\mathbb{V}[r(X, Y, \beta) \mid X = x]}$ be the conditional standard deviation of $r(X, Y, \beta)$. Now, we can characterize the solution to the optimization problem on the growth rate lower bound in (10) as follows.

Corollary 1. The optimal choice of $q^*_\beta$ and $\lambda^*$ that solves (10) is

$$q^*_\beta(x) := \frac{\sigma_\beta(x)}{\mathbb{E}[\sigma_\beta(X)]}\, B, \qquad \lambda^* := \frac{1}{2} \cdot \frac{\theta - \rho(\beta)}{(\theta - \rho(\beta))^2 + \mathbb{E}[\sigma_\beta(X)]^2 B^{-1} + \mathbb{V}[r(X, Y, \beta)]},$$

provided $q^*_\beta(x) \in [0, 1]$ for all $x \in \mathcal{X}$ and $\lambda^* \le (2(\inf_{x \in \mathcal{X}} q^*_\beta(x))^{-1} - 2\theta)^{-1}$. The resulting growth rate has the following lower bound:

$$\mathbb{E}[G^\beta_t(q^*, r^*, \lambda^*)] \ge \mathbb{E}[\hat G^\beta_t(q^*, r^*, \lambda^*)] = \frac{1}{4} \cdot \frac{(\theta - \rho(\beta))^2}{(\theta - \rho(\beta))^2 + \mathbb{E}[\sigma_\beta(X)]^2 B^{-1} + \mathbb{V}[r(X, Y, \beta)]}.$$

This is a consequence of Proposition 4, Proposition 5, and solving the quadratic equation that arises in the growth rate to derive the optimal choice of $\lambda^*$. Further, we can define the regret of a sequence $(\lambda_t)$ compared to $\lambda^*$ on $\hat G^\beta$ as follows.

Definition 3. The $\hat G^\beta$-regret at the tth step of a sequence of betting parameters $(\lambda_t)$, for a risk upper bound $\theta \in [0, 1]$ and sequences of labeling policies $(q_t)$ and predictors $(\hat r_t)$ with $q_t(x) \ge \varepsilon > 0$ for all $x \in \mathcal{X}$ and $t \in \mathbb{N}$ almost surely, is defined as:

$$\text{Reg}_t := \max_{\lambda \in [0, (2\varepsilon^{-1} - 2\theta)^{-1}]}\ \sum_{i=1}^{t} \Big(\mathbb{E}[\hat G^\beta_i(q_i, \hat r_i, \lambda) \mid \mathcal{F}_{i-1}] - \mathbb{E}[\hat G^\beta_i(q_i, \hat r_i, \lambda_i) \mid \mathcal{F}_{i-1}]\Big).$$

Since $\hat G^\beta_t(q, \hat r, \lambda)$ is exp-concave in λ, existing online learning algorithms such as Online Newton Step (ONS) [8] achieve o(T) regret guarantees, which means that the growth rate of $(\lambda_t)$ averaged over time will approach (or exceed) the optimal growth rate under $\lambda^*$. For simplicity of analysis, we make the following assumption about the labeling probability of the optimal policy, $q^*_\beta$.

Assumption 2. Let ε > 0 be a positive constant. Assume that $q^*_\beta(x) \ge \varepsilon$ for each $x \in \mathcal{X}$.

The lower bound in the above assumption is an analog of the propensity score lower bound on optimal policies that is needed for proving valid inference in adaptive experimentation [17, 7]. Further, we do not need this assumption to hold for every β, since we are not necessarily interested in log-optimality w.r.t. fringe β that are far from $\beta^*$; in practice, having this assumption hold for values of β near $\beta^*$ suffices to develop an estimator $\hat\beta$ that shrinks toward $\beta^*$ quickly. Now, we describe how much the growth rate deviates based on how well $q^*$ and $r^*$ are estimated.

Theorem 3. Let $(\lambda_t)$ be a sequence with $\hat G^\beta$-regret $(\text{Reg}_t)$, and let $(q_t)$ and $(\hat r_t)$ be sequences of labeling policies and predictors, all predictable w.r.t. $(\mathcal{F}_t)$. For a positive constant ε > 0, let $q_t(x) \ge \varepsilon > 0$ for each $t \in \mathbb{N}$ and $x \in \mathcal{X}$ almost surely. Under Assumption 2 for the same ε, the following bound holds:

$$\sum_{i=1}^{t} \mathbb{E}\big[\hat G^\beta_i(q^*, r^*, \lambda^*) - \hat G^\beta_i(q_i, \hat r_i, \lambda_i)\big] \le \text{Reg}_t + \sum_{i=1}^{t} O\big(\mathbb{E}[|q_i(X_i) - q^*_\beta(X_i)|] + \mathbb{E}[(\hat r_i(X_i, \beta) - r^*(X_i, \beta))^2]\big).$$

We defer the proof to Appendix A.3. The proof follows an idea similar to that of the regret bound in Kato et al. [17] for deriving an estimator that is close to the optimal estimator of the average treatment effect in an adaptive experimentation setup. Theorem 3 relates the estimation error of $q^*_\beta(x)$ and $r^*(x, \beta)$ to how quickly β will be deemed "safe". Hence, if we have good estimates of those quantities, then we can produce estimates $(\hat\beta_t)$ that are small and close to $\beta^*$ while remaining safe. We will now describe some practical methods for calculating $q_t$ and $\hat r_t$, and demonstrate their empirical performance in experiments.
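A small plug-in sketch (ours; all names illustrative) of Corollary 1's closed forms, given estimates of the conditional standard deviation $\sigma_\beta$ on a sample of covariates:

```python
import numpy as np

def optimal_policy_and_bet(sigma, gap, risk_var, budget):
    """Plug-in version of Corollary 1's closed forms.

    sigma    : sigma_beta(x_i) over a sample of covariates x_1, ..., x_n
    gap      : theta - rho(beta), the slack in the risk constraint
    risk_var : V[r(X, Y, beta)]
    budget   : expected labeling fraction B
    """
    q = np.clip(sigma / sigma.mean() * budget, 1e-6, 1.0)  # q*_beta(x) ~ sigma_beta(x)
    denom = gap**2 + sigma.mean()**2 / budget + risk_var
    lam = 0.5 * gap / denom                                # lambda* from Corollary 1
    growth_lb = 0.25 * gap**2 / denom                      # growth-rate lower bound
    return q, lam, growth_lb
```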
4 Experiments

We use PyTorch to model our $(q_t)$ and $(\hat r_t)$, and we consider the following formulations.²

1. Baseline labeling policies. We use the baseline labeling policies of labeling all data that arrives and of randomly sampling a B proportion of data points to label; these are denoted "all" and "oblivious", respectively.

2. Pretrain: We derive an estimate of $r^*$ from a pretrained machine learning model, $\hat r^{\text{pretr}}$, to be our choice of predictor for all time steps. We also derive an estimate of $\sigma_\beta(x)$, $\hat\sigma^{\text{pretr}}(x, \beta)$, from the pretrained model, and we learn a sequence of normalizing constants $(C_t)$ such that the budget is satisfied. Our labeling policy in this case is $q_t(x) = \hat\sigma^{\text{pretr}}(x, \hat\beta_{t-1}) / C_t$, where we optimize our policy for the previous best bound on $\beta^*$, $\hat\beta_{t-1}$. We denote this method "pretrain".

3. Estimating $q^*_\beta$ and $r^*$: We learn sequences of models $(\hat\sigma^{\text{plugin}}_t)$ and $(\hat r^{\text{plugin}}_t)$ using the labeled data points. We preprocess the outputs from $\hat\sigma^{\text{pretr}}$ and $\hat r^{\text{pretr}}$ to use as the input features to these models, respectively. Each of these sequences of models is then updated at every step. We similarly learn a sequence of normalization constants $(C_t)$ for deriving the final labeling policy $(q_t)$. We denote this method "learned".

²Code at github.com/neilzxu/active-rcps

We provide more details on how our methods are formulated in Appendix B. We run all our experiments on a 48-core CPU on the Azure platform, after using a GPU to precompute the predictions made by neural network models. We set θ = 0.1, α = 0.05, and B = 0.3 for all our experiments.

4.1 Numerical simulations

We use a simple data generating process: we sample $P_t \sim \text{Uniform}[0, 1]$ and let $Y_t \mid X_t \sim \text{Bern}(X_t)$, where we let our covariates be $X_t = P_t$. This simulates the setting we have with our real data, where an accurate pretrained classifier provides a probability estimate of $Y_t$ being 0 or 1. Our risk function is the false positive rate $r^{\text{FPR}}(X, Y, \beta) := \mathbb{1}\{X \ge \beta, Y = 0\}$. We run 100 trials, where each trial runs until 2500 labels are queried. We compare our methods based on their label efficiency, i.e., how close $\hat\beta_t$ is to $\beta^*$ after a set number of queried labels. In Figure 2, we plot the average $\hat\beta_t$ reached after a given number of labels queried across trials. The shaded areas denote pointwise 95% confidence intervals on the uncertainty of the average estimate. We can see that the pretrain and learned methods outperform both the "all" and "oblivious" strategies uniformly across the number of labels queried. In Figure 2a, we show the average rate of safety violations, i.e., the average proportion of trials in which $\hat\beta_t$ was unsafe ($\rho(\hat\beta_t) > \theta$) at any time step. We can see that all methods control the safety violation rate at the predetermined level α.

[Figure 2: Experimental results for the different methods in our numerical simulation setup: (a) average rate of safety violations of $\hat\beta_t$; (b) average final value of $\hat\beta_t$ (lower is better); (c) average $\hat\beta_t$ vs. labels queried (lower is better). Pretrain and learned perform better by attaining lower average $\hat\beta_t$ uniformly across the number of labels queried; the dotted line in (b) and (c) is $\beta^* = 0.5578$. Each method also has a low safety violation rate, i.e., stays below the dotted line of α = 0.05 in (a).]
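For concreteness, here is a compact sketch (ours) of this data generating process, reusing the `AnytimeValidRCPS` sketch from Section 2 with every point labeled; the released code implements the full active pipeline and differs in detail.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, alpha = 0.1, 0.05
rcps = AnytimeValidRCPS(theta=theta, alpha=alpha, n_grid=200, lam=0.5)

beta_hat = 1.0
for t in range(5000):
    x = rng.uniform()                  # covariate X_t = P_t ~ Uniform[0, 1]
    y = rng.binomial(1, x)             # label Y_t | X_t ~ Bern(X_t)
    # False positive rate risk: r_FPR(X, Y, beta) = 1{X >= beta, Y = 0}.
    risks = ((x >= rcps.betas) & (y == 0)).astype(float)
    beta_hat = rcps.update(risks)
print(f"beta_hat after 5000 labeled points: {beta_hat:.3f}")
```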
4.2 Imagenet

We also evaluate our methods on the Imagenet dataset [9], using the pretrained neural network classifiers from Bates et al. [4] to provide estimates of the class probabilities. Since Imagenet is a classification task with label support $\mathcal{Y} = [1000]$, our goal is to ensure that the miscoverage rate of the true class is controlled. We follow the same setup as described in the introduction, i.e., with our risk measure r specified according to (1). For Imagenet, we reshuffle the dataset for each trial, and run each method until 3000 labels have been queried.

[Figure 3: Experimental results for the different methods on Imagenet: (a) average rate of safety violations of $\hat\beta_t$; (b) average final value of $\hat\beta_t$ (lower is better); (c) average $\hat\beta_t$ vs. labels queried (lower is better). Again, pretrain and learned are the best performing; they have very similar performance and hence overlap in (c). Here, $\beta^* = 0.8349$, delineated by the dotted line in (b) and (c). Each method also has a low safety violation rate, i.e., stays below the dotted line of α = 0.05 in (a).]

In Figure 3, we plot the average $\hat\beta_t$ across trials. Once again, we can see that the pretrain and learned methods outperform both the "all" and "oblivious" strategies here as well. On Imagenet, the average safety violation rate is also controlled under the predetermined level of α = 0.05.

5 Additional related work

Casgrain et al. [6] provide anytime-valid sequential tests for identifiable functionals, which result in hypotheses similar to those tested in this paper, albeit with equality instead of inequality. They, in addition to other recent work [25, 24, 31], use regret bounds for betting-based e-processes to bound the growth rate of a betting strategy relative to the optimal growth rate. However, none of these settings incorporates the ability to perform adaptive sampling or inverse propensity weighting. Prior work in anytime-valid inference that includes inverse propensity weights has targeted off-policy evaluation [35], adaptive experimentation [7], and estimating the weighted mean of a finite population [32]. However, none of these works explicitly characterizes how deviation of the sampling policy away from the optimal sampling policy ultimately affects the growth rate, as we do in Theorem 3. Our analysis of power and regret for our algorithm is quite similar to methods in adaptive experimentation for average treatment effect estimation [14, 17] that attempt to derive a no-regret treatment policy and outcome regressor producing an estimator whose variance approaches that of the optimal estimator. Unlike the adaptive experimentation setting, however, we have an additional label budget constraint in our formulation, which results in a different optimal policy.

6 Conclusion, limitations, and future work

We have shown that we can extend the RCPS formulation to be anytime-valid, and to retain validity while increasing label efficiency in an active learning setting. We use the theory of betting and e-processes to develop this framework and show that it is provably safe, and we verified this with our experimental results. We have primarily considered the i.i.d. setting for anytime-valid calibration, and one key direction in which to extend this line of work is to account for distribution shift at test time. The empirical-Bernstein supermartingales in Waudby-Smith et al. [35] can likely be used to extend our framework to control risk in an average sense, but stronger guarantees could be made about the provided risk control if more realistic assumptions are made about the nature of the distribution shift (e.g., covariate shift, label shift, etc.).
It may also be possible to extend the notion of adaptive conformal inference (ACI) [11, 12] to anytime-valid risk control. Another limitation of this work is the bounded labeling policy assumption (i.e., Assumption 2) and the existence assumption in Proposition 5. We believe that a more careful analysis can remove these assumptions in future work.

References

[1] A. N. Angelopoulos, A. P. Kohli, S. Bates, M. I. Jordan, J. Malik, T. Alshaabi, S. Upadhyayula, and Y. Romano. Image-to-Image Regression with Distribution-Free Uncertainty Quantification and Applications in Imaging. In International Conference on Machine Learning, 2022.
[2] A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, and T. Zrnic. Prediction-powered inference. Science, 382(6671):669–674, 2023.
[3] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and Analysis. Stochastic Modelling and Applied Probability. Springer, New York, 1st edition, 2007.
[4] S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. Jordan. Distribution-free, Risk-controlling Prediction Sets. Journal of the ACM, 68(6):43:1–43:34, 2024.
[5] L. Breiman. Optimal Gambling Systems for Favorable Games. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 65–79, 1961.
[6] P. Casgrain, M. Larsson, and J. Ziegel. Sequential testing for elicitable functionals via supermartingales. Bernoulli, 30(2):1347–1374, 2024.
[7] T. Cook, A. Mishler, and A. Ramdas. Semiparametric Efficient Inference in Adaptive Experiments. In Conference on Causal Learning and Reasoning, 2024.
[8] A. Cutkosky and F. Orabona. Black-Box Reductions for Parameter-free Online Learning in Banach Spaces. In Conference on Learning Theory, 2018.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[10] A. M. Flores, F. Demsas, N. J. Leeper, and E. G. Ross. Leveraging Machine Learning and Artificial Intelligence to Improve Peripheral Artery Disease Detection, Treatment, and Outcomes. Circulation Research, 128(12):1833–1850, 2021.
[11] I. Gibbs and E. Candès. Adaptive Conformal Inference Under Distribution Shift. In Neural Information Processing Systems, 2021.
[12] I. Gibbs and E. Candès. Conformal Inference for Online Prediction with Arbitrary Distribution Shifts. Journal of Machine Learning Research, 25(162):1–36, 2024.
[13] P. Grünwald, R. de Heide, and W. Koolen. Safe Testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2024.
[14] J. Hahn, K. Hirano, and D. Karlan. Adaptive Experimental Design Using the Propensity Score. Journal of Business & Economic Statistics, 29(1):96–108, 2011.
[15] L. Hanu and Unitary team. Detoxify. GitHub. https://github.com/unitaryai/detoxify, 2020.
[16] H. Huang, S. Sharma, A. Loquercio, A. Angelopoulos, K. Goldberg, and J. Malik. Conformal Policy Learning for Sensorimotor Control Under Distribution Shifts. In IEEE International Conference on Robotics and Automation, 2024.
[17] M. Kato, T. Ishihara, J. Honda, and Y. Narita. Efficient Adaptive Experimental Design for Average Treatment Effect Estimation. arXiv:2002.05308, 2021.
[18] J. L. Kelly. A New Interpretation of Information Rate. The Bell System Technical Journal, 1956.
[19] M. Larsson, A. Ramdas, and J. Ruf. The numeraire e-variable and reverse information projection. arXiv:2402.18810, 2024.
[20] J. Lekeufack, A. N. Angelopoulos, A. Bajcsy, M. I. Jordan, and J. Malik. Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions. In IEEE International Conference on Robotics and Automation, 2024.
[21] T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng. A Holistic Approach to Undesired Content Detection in the Real World. In AAAI Conference on Artificial Intelligence, 2023.
[22] F. Orabona and T. Tommasi. Training Deep Networks without Learning Rates Through Coin Betting. In Neural Information Processing Systems, 2017.
[23] A. B. Owen. Monte Carlo theory, methods and examples. https://artowen.su.domains/mc/, 2013.
[24] A. Podkopaev and A. Ramdas. Sequential Predictive Two-Sample and Independence Testing. In Neural Information Processing Systems, 2023.
[25] A. Podkopaev, P. Blöbaum, S. Kasiviswanathan, and A. Ramdas. Sequential Kernelized Independence Testing. In International Conference on Machine Learning, 2023.
[26] V. Quach, A. Fisch, T. Schuster, A. Yala, J. H. Sohn, T. S. Jaakkola, and R. Barzilay. Conformal Language Modeling. In International Conference on Learning Representations, 2024.
[27] A. Ramdas, J. Ruf, M. Larsson, and W. Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. arXiv:2009.03167, 2022.
[28] A. Ramdas, P. Grünwald, V. Vovk, and G. Shafer. Game-Theoretic Statistics and Safe Anytime-Valid Inference. Statistical Science, 38(4):576–601, 2023.
[29] L. Saba, S. S. Sanagala, S. K. Gupta, V. K. Koppula, A. M. Johri, N. N. Khanna, S. Mavrogeni, J. R. Laird, G. Pareek, M. Miner, P. P. Sfikakis, A. Protogerou, D. P. Misra, V. Agarwal, A. M. Sharma, V. Viswanathan, V. S. Rathore, M. Turk, R. Kolluri, K. Viskovic, E. Cuadrado-Godia, G. D. Kitas, N. Sharma, A. Nicolaides, and J. S. Suri. Multimodality carotid plaque tissue characterization and classification in the artificial intelligence paradigm: A narrative review for stroke application. Annals of Translational Medicine, 9(14):1206, 2021.
[30] G. Shafer. Testing by betting: A strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(2):407–431, 2021.
[31] S. Shekhar and A. Ramdas. Nonparametric Two-Sample Testing by Betting. IEEE Transactions on Information Theory, 70(2):1178–1203, 2024.
[32] S. Shekhar, Z. Xu, Z. Lipton, P. Liang, and A. Ramdas. Risk-limiting financial audits via weighted sampling without replacement. In Conference on Uncertainty in Artificial Intelligence, 2023.
[33] J. Ville. Étude critique de la notion de collectif. PhD thesis, University of Paris, Paris, 1939.
[34] I. Waudby-Smith and A. Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B (Statistical Methodology), 2023.
[35] I. Waudby-Smith, L. Wu, A. Ramdas, N. Karampatziakis, and P. Mineiro. Anytime-valid off-policy inference for contextual bandits. ACM/IMS Journal of Data Science, 2024.
[36] J. Wu, J. Xin, X. Yang, J. Sun, D. Xu, N. Zheng, and C. Yuan. Deep morphology aided diagnosis network for segmentation of carotid artery vessel wall and diagnosis of carotid atherosclerosis on black-blood vessel wall MRI. Medical Physics, 46(12):5544–5561, 2019.
[37] T. Zrnic and E. J. Candès. Active Statistical Inference. In International Conference on Machine Learning, 2024.

A Omitted proofs

Proofs that have been deferred from the main body of the paper are contained here.
A.1 Proof of Proposition 4

We can rewrite the objective in the following way:

$$\mathbb{E}[\hat G^\beta(q, \hat r, \lambda)] = \mathbb{E}\left[\lambda\left(\theta - \hat r(X, \beta) - \frac{L}{q(X)} \tilde R(\beta)\right) - \lambda^2\left(\theta - \hat r(X, \beta) - \frac{L}{q(X)} \tilde R(\beta)\right)^2\right]$$
$$= \lambda(\theta - \rho(\beta)) - \lambda^2\left((\theta - \rho(\beta))^2 + \mathbb{V}\left[\hat r(X, \beta) + \frac{L}{q(X)} \tilde R(\beta)\right]\right). \tag{11}$$

Maximizing (11) is the same as minimizing the following equivalent expressions:

$$\mathbb{V}\left[\hat r(X, \beta) + \frac{L}{q(X)} \tilde R(\beta)\right] = \mathbb{E}\left[\mathbb{V}\left[\hat r(X, \beta) + \frac{L}{q(X)} \tilde R(\beta) \,\middle|\, X\right]\right] + \mathbb{V}\left[\mathbb{E}\left[\hat r(X, \beta) + \frac{L}{q(X)} \tilde R(\beta) \,\middle|\, X\right]\right]$$
$$= \mathbb{E}\left[\mathbb{V}\left[\frac{L}{q(X)} \tilde R(\beta) \,\middle|\, X\right]\right] + \mathbb{V}[R(\beta)] = \mathbb{E}\left[\frac{\mathbb{E}[\tilde R(\beta)^2 \mid X]}{q(X)} - \mathbb{E}[\tilde R(\beta) \mid X]^2\right] + \mathbb{V}[R(\beta)]$$
$$= \int \left(\frac{\mathbb{E}[\tilde R(\beta)^2 \mid X = x]}{q(x)} - \mathbb{E}[\tilde R(\beta) \mid X = x]^2\right) p(x)\, dx + \mathbb{V}[R(\beta)], \tag{12}$$

where the first equality follows from the law of total variance, and the second from the fact that $\hat r(X, \beta)$ is fixed given X. Since the only term that $\hat r$ affects is the integral term in (12), we can choose $\hat r(x, \beta)$ for each $x \in \mathcal{X}$ to minimize:

$$\frac{\mathbb{E}[\tilde R(\beta)^2 \mid X = x]}{q(x)} - \mathbb{E}[\tilde R(\beta) \mid X = x]^2 = \frac{\mathbb{E}[(r(X, Y, \beta) - \hat r(x, \beta))^2 \mid X = x]}{q(x)} - \big(\mathbb{E}[r(X, Y, \beta) \mid X = x] - \hat r(x, \beta)\big)^2$$
$$= \frac{\mathbb{E}[r(X, Y, \beta)^2 \mid X = x]}{q(x)} - \mathbb{E}[r(X, Y, \beta) \mid X = x]^2 - (q(x)^{-1} - 1)\big(2\,\mathbb{E}[r(X, Y, \beta) \mid X = x]\,\hat r(x, \beta) - \hat r(x, \beta)^2\big). \tag{13}$$

If we remove the constants (i.e., the terms unaffected by $\hat r(x, \beta)$) and note that $q(x)^{-1} - 1 \ge 0$, we get that minimizing (13) is equivalent to maximizing $2\,\mathbb{E}[r(X, Y, \beta) \mid X = x]\,\hat r(x, \beta) - \hat r(x, \beta)^2$. This is equivalent to minimizing the squared error $(\mathbb{E}[r(X, Y, \beta) \mid X = x] - \hat r(x, \beta))^2$, which means that $r^*(x, \beta) = \mathbb{E}[r(X, Y, \beta) \mid X = x]$, which gets us our desired result.

A.2 Proof of Proposition 5

Since we have shown that maximizing (10) is equivalent to minimizing (12), we can isolate the terms that change with q and see that we are looking for the solution to the following optimization problem:

$$\min_q \int \frac{\mathbb{E}[\tilde R(\beta)^2 \mid X = x]}{q(x)}\, p(x)\, dx \quad \text{s.t.} \quad \int p(x) q(x)\, dx \le B.$$

We can define $\varphi(x) := p(x) q(x)$ and rewrite this as

$$\min_\varphi \int \frac{p(x)^2\, \mathbb{E}[\tilde R(\beta)^2 \mid X = x]}{\varphi(x)}\, dx \quad \text{s.t.} \quad \int \varphi(x)\, dx \le B.$$

Assume we can define a valid $q^*_\beta$, with $q^*_\beta(x) \in [0, 1]$ for each $x \in \mathcal{X}$, that satisfies $\varphi^*(x) := p(x) q^*_\beta(x) \propto p(x)\sqrt{\mathbb{E}[\tilde R(\beta)^2 \mid X = x]}$. Explicitly, we define $q^*_\beta$ as follows:

$$q^*_\beta(x) := \frac{\sqrt{\mathbb{E}[\tilde R(\beta)^2 \mid X = x]}}{\mathbb{E}\left[\sqrt{\mathbb{E}[\tilde R(\beta)^2 \mid X]}\right]}\, B.$$

One can show its optimality by considering any other labeling policy q′ with $\mathbb{E}[q'(X)] = B$ (we use a proof technique similar to that for importance sampling; see Owen [23, §9.1]). Write $v(x) := \mathbb{E}[\tilde R(\beta)^2 \mid X = x]$, and let $\varphi^*(x) := p(x) q^*_\beta(x)$ and $\varphi'(x) := p(x) q'(x)$. Then

$$\int \frac{p(x)^2 v(x)}{\varphi^*(x)}\, dx = \frac{1}{B}\left(\int p(x)\sqrt{v(x)}\, dx\right)^2 = \frac{1}{B}\left(\int \frac{p(x)\sqrt{v(x)}}{\sqrt{\varphi'(x)}} \cdot \sqrt{\varphi'(x)}\, dx\right)^2 \le \frac{1}{B}\left(\int \frac{p(x)^2 v(x)}{\varphi'(x)}\, dx\right)\left(\int \varphi'(x)\, dx\right) = \int \frac{p(x)^2 v(x)}{\varphi'(x)}\, dx,$$

where the inequality is by Cauchy–Schwarz, noting that $\varphi'(x)/B$ is a valid p.d.f. Hence, we have shown our desired result.

A.3 Proof of Theorem 3

By definition of $\text{Reg}_t$, we know that

$$\sum_{i=1}^{t} \mathbb{E}[\hat G^\beta_i(q_i, \hat r_i, \lambda^*)] - \mathbb{E}[\hat G^\beta_i(q_i, \hat r_i, \lambda_i)] \le \text{Reg}_t$$

by taking an expectation over $\mathcal{F}_{i-1}$ for each term in the summation. Hence, what remains to be shown is the following:

$$\sum_{i=1}^{t} \mathbb{E}[\hat G^\beta_i(q^*, r^*, \lambda^*)] - \mathbb{E}[\hat G^\beta_i(q_i, \hat r_i, \lambda^*)] \le \sum_{i=1}^{t} O\big(\mathbb{E}[|q_i(X_i) - q^*_\beta(X_i)|]\big) + O\big(\mathbb{E}[(\hat r_i(X_i, \beta) - r^*(X_i, \beta))^2]\big). \tag{14}$$

Let $\tilde R_t(\beta) := r(X, Y, \beta) - \hat r_t(X, \beta)$ and $\tilde R^*(\beta) := r(X, Y, \beta) - r^*(X, \beta)$. We first note the following identity using (12):

$$\mathbb{E}[\hat G^\beta(q^*, r^*, \lambda^*)] - \mathbb{E}[\hat G^\beta(q_t, \hat r_t, \lambda^*)] = (\lambda^*)^2\left(\mathbb{V}\left[\hat r_t(X, \beta) + \frac{L}{q_t(X)} \tilde R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right] - \mathbb{V}\left[r^*(X, \beta) + \frac{L}{q^*_\beta(X)} \tilde R^*(\beta)\right]\right).$$

Now, we make the following derivations for the difference between the variance terms:

$$\mathbb{V}\left[\hat r_t(X, \beta) + \frac{L}{q_t(X)} \tilde R_t(\beta) \,\middle|\, \mathcal{F}_{t-1}\right] - \mathbb{V}\left[r^*(X, \beta) + \frac{L}{q^*_\beta(X)} \tilde R^*(\beta)\right]$$
$$= \int \left(\frac{\mathbb{E}[\tilde R_t(\beta)^2 \mid X = x]}{q_t(x)} - \frac{\mathbb{E}[\tilde R^*(\beta)^2 \mid X = x]}{q^*_\beta(x)} - \mathbb{E}[\tilde R_t(\beta) \mid X = x]^2\right) p(x)\, dx$$
$$= \int \frac{(q^*_\beta(x) - q_t(x))\, \mathbb{V}[r(X, Y, \beta) \mid X = x] + q^*_\beta(x)(1 - q_t(x))\, (\hat r_t(x, \beta) - r^*(x, \beta))^2}{q_t(x)\, q^*_\beta(x)}\, p(x)\, dx$$
$$\le O\left(\int \big((q^*_\beta(x) - q_t(x)) + (\hat r_t(x, \beta) - r^*(x, \beta))^2\big)\, p(x)\, dx\right)$$
$$\le O\big(\mathbb{E}[|q^*_\beta(X_t) - q_t(X_t)| \mid \mathcal{F}_{t-1}] + \mathbb{E}[(\hat r_t(X_t, \beta) - r^*(X_t, \beta))^2 \mid \mathcal{F}_{t-1}]\big). \tag{15}$$
The first equality is by substituting in the identity from (12). The first inequality is a result of $\mathbb{V}[r(X, Y, \beta) \mid X = x] \le \tfrac14$ (since $r(X, Y, \beta) \in [0, 1]$) together with $q^*_\beta(x), q_t(x) \in [\varepsilon, 1]$ almost surely. The second inequality is by upper bounding $q^*_\beta(x) - q_t(x)$ by its absolute value. Now, if we plug (15) into (14), take the expectation over $\mathcal{F}_{t-1}$, and sum over t, we get our desired result.

B Experiment details

In this section, we discuss additional details about how we implement the methods described in Section 4.

B.1 Formulation of the labeling policy

For the pretrain policy, we use estimates of the conditional mean and standard deviation derived from s:

$$\hat r^{\text{pretr}}(x, \beta) := \sum_{y \notin C(x, \beta)} s_y(x), \qquad \hat\sigma^{\text{pretr}}(x, \beta) := \sqrt{\hat r^{\text{pretr}}(x, \beta)\,(1 - \hat r^{\text{pretr}}(x, \beta))}.$$

These estimates may not be accurate, but they might still represent a reasonable partitioning of the feature space into regions where $\sigma_\beta(x)$ and $r^*(x, \beta)$ are similar. Hence, we model $\hat\sigma^{\text{plugin}}_t$ and $\hat r^{\text{plugin}}_t$ as linear regression models whose inputs are binnings of $\hat\sigma^{\text{pretr}}(x, \beta)$ and $\hat r^{\text{pretr}}(x, \beta)$, respectively. We then learn the regression model parameters on the labeled training data.

B.2 Optimization to maintain the budget constraint

For any predictor $\hat\sigma$, we optimize the Lagrangian corresponding to (10), which is defined as follows for a fixed $\lambda_t$:

$$\mathbb{E}[\hat G^\beta(q_t, \hat r_t, \lambda_t)] - \nu_t\big(\mathbb{E}[q_t(X)] - B\big) = \lambda_t(\theta - \rho(\beta)) - \lambda_t^2\, \mathbb{E}\left[\left(\theta - \hat r_t(X, \beta) - \frac{L_t}{q_t(X)} \tilde R(\beta)\right)^2\right] - \nu_t\big(\mathbb{E}[q_t(X)] - B\big).$$

Since we know the optimal form of $\hat r_t$, we optimize it separately by taking an optimization step on the squared-error loss $(r(X, Y, \beta) - \hat r_t(X, \beta))^2$ for each labeled example, over a grid of β values. To derive the solution, we reduce the minimax game over the above Lagrangian to the following objective:

$$\max_{q_t} \min_{\nu_t}\ -\mathbb{E}\left[\frac{1}{q_t(X)}\, (r(X, Y, \beta) - \hat r_t(X, \beta))^2\right] - \nu_t\big(\mathbb{E}[q_t(X)] - B\big).$$

For both the pretrain and learned methods, we parameterize $q_t$ as

$$q_t(x) = \frac{\hat\sigma_t(x, \hat\beta_{t-1})}{C_t},$$

where the normalization constant is $C_t = \exp(c_t)$ for some $c_t \in \mathbb{R}$, which ensures it is positive. $\hat\sigma_t$ is updated separately: for pretrain, it is fixed from the beginning, and for learned, we take an optimization step to minimize the squared loss against the squared residual, i.e., we update to minimize $\big((r(X, Y, \beta) - \hat r_t(X, \beta))^2 - \hat\sigma_t(X, \beta)^2\big)^2$. Hence, the only remaining task is to optimize $c_t$, which means we need to solve the following problem (where we treat $\hat\beta_{t-1}$ as fixed):

$$\max_{c_t} \min_{\nu_t}\ c_t + \nu_t\left(1 - \frac{\exp(-c_t)\, \mathbb{E}[\hat\sigma_t(X, \hat\beta_{t-1})]}{B}\right).$$

The actual game payoff we play is a stochastic approximation of this Lagrangian, of the form

$$L(\nu_t, c_t) = c_t + \nu_t\left(1 - \frac{\exp(-c_t)\, \hat\sigma_t(X_t, \hat\beta_{t-1})}{B}\right).$$

At each step, we take an optimization step on $c_t$ toward maximizing the above payoff, and determine $\nu_t$ by playing either the best response or a windowed best response that averages best responses over recent rounds. We use the COCOB optimizer [22] for all learning and optimization, as it requires no hyperparameter or learning-rate selection.
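A minimal sketch (ours) of how the normalization constant could be adjusted online so that the realized labeling rate tracks the budget; a plain gradient step stands in for the COCOB optimizer, and all names are illustrative.

```python
import numpy as np

class BudgetedPolicy:
    """Labeling policy q_t(x) = sigma_hat(x, beta_prev) / exp(c_t), with c_t
    adjusted online so the expected labeling rate tracks the budget B.
    A plain gradient step stands in for the COCOB optimizer in the paper."""

    def __init__(self, budget: float, q_min: float = 1e-3, lr: float = 0.05):
        self.B, self.q_min, self.lr, self.c = budget, q_min, lr, 0.0
        self.rng = np.random.default_rng(0)

    def prob(self, sigma_hat_x: float) -> float:
        """Labeling probability for a point with predicted s.d. sigma_hat_x."""
        return float(np.clip(sigma_hat_x * np.exp(-self.c), self.q_min, 1.0))

    def step(self, sigma_hat_x: float) -> bool:
        """Draw L_t ~ Bern(q_t(x)), then nudge c toward the budget: overspending
        (q above B on average) raises c and shrinks q, and vice versa."""
        q = self.prob(sigma_hat_x)
        self.c += self.lr * (q - self.B)
        return bool(self.rng.binomial(1, q))
```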
For each question in the checklist: You should answer [Yes] , [No] , or [NA] . [NA] means either that the question is Not Applicable for that particular paper or the relevant information is Not Available. Please provide a short (1 2 sentence) justification right after your answer (even for NA). The checklist answers are an integral part of your paper submission. They are visible to the reviewers, area chairs, senior area chairs, and ethics reviewers. You will be asked to also include it (after eventual revisions) with the final version of your paper, and its final version will be published with the paper. The reviewers of your paper will be asked to use the checklist as one of the factors in their evaluation. While [Yes] is generally preferable to [No] , it is perfectly acceptable to answer [No] provided a proper justification is given (e.g., error bars are not reported because it would be too computationally expensive or we were unable to find the license for the dataset we used ). In general, answering [No] or [NA] is not grounds for rejection. While the questions are phrased in a binary way, we acknowledge that the true answer is often more nuanced, so please just use your best judgment and write a justification to elaborate. All supporting evidence can appear either in the main paper or the supplemental material, provided in appendix. If you answer [Yes] to a question, in the justification please point to the section(s) where related material for the question can be found. IMPORTANT, please: Delete this instruction block, but keep the section heading Neur IPS paper checklist , Keep the checklist subsection headings, questions/answers and guidelines below. Do not modify the questions and only use the provided macros for your answers. Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: We provide exactly what we describe in the abstract/introduction, i.e., a method for constructing active, anytime-valid risk controlling prediction sets along with theoretical guarantees and experiments demonstrating its efficacy. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We elaborate on the limitations of the paper in Section 6 Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate Limitations section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). 
The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We provide proofs and assumptions for each result (clearly delineated) in the paper. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide the code for reproducing the experiments in a supplement. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. 
For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide all code for reproducing the experiments in the paper in a supplement (see above). Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. 
6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide an experimental overview in Section 4, and we provide additional details in both the code and the appendix (Appendix B).
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We show error bars (from 95% normal CIs) for all our experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer Yes if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed-form formula, call to a library function, bootstrap, etc.); a minimal illustrative sketch of one such computation is given after this checklist.
- The assumptions made should be given (e.g., normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We specify our compute resources in Section 4.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We have reviewed the ethics guidelines, and this paper conforms with them.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work is statistical methodology/theory for enabling active labeling for risk control: it may allow users of machine learning models to more cost-effectively calibrate their models to control measures of harmful risk when using them in practice, but there are no direct societal impacts as far as we can discern.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our paper does not release new data or models.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best-faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Yes, we properly cite the papers from which we use models, code, and data (e.g., ImageNet, RCPS, etc.).
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We provide our experimental code in the supplement and provide details about it in Appendix B.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We don't perform research with human subjects for this paper.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We don't perform research with human subjects in this paper.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
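The sketch referenced in item 7 above follows. It is a minimal illustration of how a 95% normal confidence interval can be computed from repeated experimental runs; it is not taken from the paper's supplement, and the metric values and run count are hypothetical.

```python
import numpy as np

# Hypothetical per-run values of a metric (e.g., empirical risk of the
# prediction sets) over independent repetitions of an experiment.
runs = np.array([0.048, 0.051, 0.047, 0.053, 0.049, 0.050, 0.046, 0.052])

mean = runs.mean()
# Standard error of the mean, using the unbiased sample standard deviation.
sem = runs.std(ddof=1) / np.sqrt(len(runs))

# 95% normal CI: mean +/- 1.96 * SEM. This assumes approximately normally
# distributed errors; the error bar is the standard error of the mean,
# not the standard deviation, and that distinction should be stated.
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```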