Published as a conference paper at ICLR 2024

CONFORMAL RISK CONTROL

Anastasios N. Angelopoulos^1, Stephen Bates^2, Adam Fisch^2, Lihua Lei^3, Tal Schuster^4
^1 UC Berkeley, ^2 MIT, ^3 Stanford, ^4 Google Research

We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes split conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an $O(1/n)$ factor. We also introduce extensions of the idea to distribution shift, quantile risk control, multiple and adversarial risk control, and expectations of U-statistics. Worked examples from computer vision and natural language processing demonstrate the usage of our algorithm to bound the false negative rate, graph distance, and token-level F1-score.

1 INTRODUCTION

We seek to endow a pre-trained machine learning model with guarantees on its performance so as to ensure its safe deployment. Suppose we have a base model $f$, a function mapping inputs $x \in \mathcal{X}$ to values in some other space, such as a probability distribution over classes. Our job is to design a procedure that post-processes the output of $f$ to give it a statistical safety property.

Split conformal prediction (Vovk et al., 2005; Papadopoulos et al., 2002), which we will refer to simply as conformal prediction, has been useful in areas such as computer vision (Angelopoulos et al., 2021b) and natural language processing (Fisch et al., 2021) to provide such a guarantee. By measuring the model's performance on a calibration dataset $\{(X_i, Y_i)\}_{i=1}^n$ of feature-response pairs, conformal prediction post-processes the model to construct prediction sets that bound the miscoverage,

$$\mathbb{P}\big(Y_{n+1} \notin \mathcal{C}(X_{n+1})\big) \le \alpha, \qquad (1)$$

where $(X_{n+1}, Y_{n+1})$ is a new test point, $\alpha$ is a user-specified error rate (e.g., 10%), and $\mathcal{C}$ is a function of the model and calibration data that outputs a prediction set. Note that $\mathcal{C}$ is formed using the first $n$ data points, and the probability in (1) is over the randomness in all $n + 1$ data points (i.e., the draw of both the calibration points $1, \ldots, n$ and the test point $n + 1$).

In this work, we extend conformal prediction to prediction tasks where the natural notion of error is not simply miscoverage. In particular, our main result is that a generalization of conformal prediction provides guarantees of the form

$$\mathbb{E}\big[\ell\big(\mathcal{C}_\lambda(X_{n+1}), Y_{n+1}\big)\big] \le \alpha, \qquad (2)$$

for any bounded loss function $\ell$ that shrinks as $\mathcal{C}_\lambda(X_{n+1})$ grows, where $\lambda$ is an input parameter that controls the growth of $\mathcal{C}_\lambda(X_{n+1})$. We call this conformal risk control. Note that (2) recovers the conformal miscoverage guarantee in (1) when using the miscoverage loss, $\ell\big(\mathcal{C}_\lambda(X_{n+1}), Y_{n+1}\big) = \mathbf{1}\{Y_{n+1} \notin \mathcal{C}_\lambda(X_{n+1})\}$. However, our algorithm also extends conformal prediction to situations where other loss functions, such as the false negative rate (FNR), are more appropriate.

As an example, consider multilabel classification, where the labels $Y_i \subseteq \{1, \ldots, K\}$ are sets comprising a subset of $K$ classes. Creating sets that contain all the classes may be too conservative if $K$ is massive; instead, given a trained multilabel classifier $f : \mathcal{X} \to [0, 1]^K$, we want to output sets that include a large fraction of the true classes in $Y_i$. To that end, we post-process the model's raw outputs into the set of classes with sufficiently high scores, $\mathcal{C}_\lambda(x) = \{k : f(x)_k \ge 1 - \lambda\}$, where the main parameter of the algorithm, $\lambda \in [0, 1]$, is a threshold. Note that as the threshold $\lambda$ grows, we include more classes in $\mathcal{C}_\lambda(x)$, i.e., it becomes more conservative. In this case, conformal risk control finds a threshold value $\hat\lambda$ that controls the fraction of missed classes, i.e., the expected value of $\ell\big(\mathcal{C}_{\hat\lambda}(X_{n+1}), Y_{n+1}\big) = 1 - |Y_{n+1} \cap \mathcal{C}_{\hat\lambda}(X_{n+1})| / |Y_{n+1}|$. Setting $\alpha = 0.1$ would ensure that our algorithm produces sets $\mathcal{C}_{\hat\lambda}(X_{n+1})$ containing 90% of the true classes in $Y_{n+1}$ on average.
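As a concrete illustration of this set construction and loss, the following Python sketch builds $\mathcal{C}_\lambda(x)$ from per-class scores and evaluates the fraction of missed classes for a few thresholds. The scores, class indices, and helper names are illustrative only and are not the paper's released code; the calibration of $\hat\lambda$ itself is described next.

```python
import numpy as np

def prediction_set(scores, lam):
    """C_lambda(x) = {k : f(x)_k >= 1 - lam}; larger lam gives a larger (more conservative) set."""
    return np.flatnonzero(scores >= 1.0 - lam)

def false_negative_proportion(pred_set, true_classes):
    """Fraction of true classes missed by the prediction set; 0 is perfect, 1 is worst."""
    true_classes = set(true_classes)
    missed = true_classes - set(pred_set.tolist())
    return len(missed) / len(true_classes)

# Toy illustration with made-up scores for K = 5 classes.
scores = np.array([0.95, 0.40, 0.70, 0.05, 0.60])  # hypothetical classifier output f(x)
for lam in (0.1, 0.4, 0.7):
    C = prediction_set(scores, lam)
    loss = false_negative_proportion(C, true_classes={0, 2, 4})
    print(f"lambda={lam:.1f}  set={C.tolist()}  FNR loss={loss:.2f}")
```

As expected, the loss is non-increasing in $\lambda$: growing the set can only recover more of the true classes.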
1.1 ALGORITHM AND PREVIEW OF MAIN RESULTS

Formally, we will consider post-processing the predictions of the model $f$ to create a function $\mathcal{C}_\lambda(\cdot)$. The function has a parameter $\lambda$ that encodes its conservativeness: larger $\lambda$ values yield more conservative outputs (e.g., larger prediction sets). To measure the quality of $\mathcal{C}_\lambda$, we consider a loss function $\ell(\mathcal{C}_\lambda(x), y) \in (-\infty, B]$ for some $B < \infty$.¹ We require this loss to be non-increasing in $\lambda$. Our goal is to choose $\hat\lambda$ based on the observed data $\{(X_i, Y_i)\}_{i=1}^n$ so that risk control as in (2) holds.

We now rewrite this same task in a more notationally convenient and abstract form. Consider an exchangeable collection of non-increasing, bounded, random functions $L_i : \Lambda \to (-\infty, B]$, $i = 1, \ldots, n + 1$, where $\Lambda$ is the space of all inputs (e.g., thresholds) to the function $L_i(\lambda)$. Throughout the paper, we assume $\lambda_{\max} := \sup \Lambda \in \Lambda$, so that $L_i(\lambda_{\max})$ is well-defined and satisfies $L_i(\lambda_{\max}) \le \alpha$ for any $\alpha$ by design (i.e., $\alpha$ is achievable). We seek to use the first $n$ functions to choose a value of the parameter, $\hat\lambda$, so that the risk on the unseen function is controlled:

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \le \alpha. \qquad (3)$$

We are primarily motivated by the case where $L_i(\lambda) = \ell(\mathcal{C}_\lambda(X_i), Y_i)$, in which case the guarantee in (3) coincides with risk control as in (2).

Now we describe the algorithm. Let $\widehat{R}_n(\lambda) = (L_1(\lambda) + \ldots + L_n(\lambda))/n$. Given any desired risk level upper bound $\alpha \in (-\infty, B)$, define

$$\hat\lambda = \inf\Big\{\lambda : \tfrac{n}{n+1}\widehat{R}_n(\lambda) + \tfrac{B}{n+1} \le \alpha\Big\} = \inf\Big\{\lambda : \widehat{R}_n(\lambda) \le \alpha - \tfrac{B - \alpha}{n}\Big\}. \qquad (4)$$

Since $\widehat{R}_n(\lambda)$ is monotone, we can efficiently search for $\hat\lambda$ using binary search to arbitrary precision. When the set is empty, we define $\hat\lambda = \lambda_{\max}$. Our proposed conformal risk control algorithm is to deploy $\hat\lambda$ on the forthcoming test point. Our main result is that this algorithm satisfies (3). Intuitively, the algorithm searches for a value of $\lambda$ whose empirical risk is slightly conservative; the amount of conservativeness shrinks when the gap between the worst-case risk ($B$) and the desired risk ($\alpha$) is smaller, or when the calibration set size ($n$) is larger. Moreover, when the $L_i$ are i.i.d. from a continuous distribution, we can show that the algorithm satisfies a tight lower bound saying it is not too conservative,

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \ge \alpha - \frac{2B}{n+1}.$$

We show the reduction from conformal risk control to conformal prediction in Appendix A. Furthermore, if the risk is non-monotone, then this algorithm does not control the risk; we discuss this in Section 2.3. Finally, we provide both practical examples using real-world data and several theoretical extensions of our procedure in Sections 3 and 4, respectively.

1.2 RELATED WORK

Conformal prediction was developed by Vladimir Vovk and collaborators beginning in the late 1990s (Vovk et al., 1999; 2005), and has recently become a popular uncertainty estimation tool in the machine learning community due to its favorable model-agnostic, distribution-free, finite-sample guarantees. See Angelopoulos & Bates (2021) for a modern introduction to the area, or Shafer & Vovk (2008) for a more classical alternative.
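To make the calibration step (4) concrete, here is a minimal Python sketch that evaluates the rule on a finite grid of candidate $\lambda$ values. It assumes the losses have been precomputed as an $n \times m$ matrix whose columns are non-increasing in $\lambda$; the grid, toy data, and function names are illustrative and are not the paper's released implementation.

```python
import numpy as np

def conformal_risk_control(losses, lambdas, alpha, B=1.0):
    """Pick lambda-hat per equation (4).

    losses:  (n, m) array, losses[i, j] = L_i(lambdas[j]); assumed non-increasing in j.
    lambdas: (m,) increasing grid of candidate lambda values, with lambdas[-1] = lambda_max.
    alpha:   target risk level; B: upper bound on the loss.
    """
    n = losses.shape[0]
    risk_hat = losses.mean(axis=0)                    # empirical risk R_hat_n(lambda) on the grid
    adjusted = n / (n + 1) * risk_hat + B / (n + 1)   # (n/(n+1)) R_hat_n(lambda) + B/(n+1)
    valid = np.nonzero(adjusted <= alpha)[0]
    # If no grid point qualifies, fall back to lambda_max (the most conservative choice).
    return lambdas[valid[0]] if valid.size > 0 else lambdas[-1]

# Toy example: indicator-style losses built from made-up conformal scores.
rng = np.random.default_rng(0)
scores = rng.uniform(size=200)                                 # hypothetical scores s(X_i, Y_i)
lambdas = np.linspace(0, 1, 1001)
losses = (scores[:, None] > lambdas[None, :]).astype(float)    # L_i(lambda) = 1{s_i > lambda}
print("lambda-hat =", conformal_risk_control(losses, lambdas, alpha=0.1))
```

Because the adjusted risk is non-increasing in $\lambda$, the first qualifying grid point is the infimum in (4) up to the grid resolution.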
As previously discussed, in this paper we primarily build on split conformal prediction (Papadopoulos et al., 2002); statistical properties of this algorithm including the coverage upper bound were studied in Lei et al. (2018). Recently there have been many extensions of the conformal algorithm, mainly targeting deviations from exchangeability (Tibshirani et al., 2019; Gibbs & Candes, 2021; Barber et al., 2022; Fannjiang et al., 2022) and 1Note that any unbounded loss can be transformed to a bounded loss. For example, any unbounded loss ℓ(λ) on the positive reals can be transformed by taking the inverse tangent, i.e., ℓ (λ) = arctan (ℓ(λ)). As long as the transformation is monotone, a loss of α on the original loss corresponds exactly to a loss of α on the new loss; controlling this transformed risk may be enough in practice. Published as a conference paper at ICLR 2024 improved conditional coverage (Barber et al., 2020; Romano et al., 2019; Guan, 2020; Romano et al., 2020; Angelopoulos et al., 2021b). Most relevant to us is recent work on risk control in high probability (Vovk, 2012; Bates et al., 2021; Angelopoulos et al., 2021a) and its applications (Park et al., 2020; Fisch et al., 2022; Schuster et al., 2021; 2022; Sankaranarayanan et al., 2022; Angelopoulos et al., 2022a;b, inter alia). However, while Bates et al. (2021) and Angelopoulos et al. (2021a) operate in similar mathematical settings and we reuse much of their notation, the algorithm presented herein differs greatly. It is far more sample-efficient, simpler, and provides a guarantee in expectation versus a guarantee in probability. Our algorithm is entirely different no existing algorithm gives guarantees in expectation for risk control and its validity proof is mathematically unrelated to these previous works. See Appendix B for a detailed discussion and comparison of these algorithms, including experiments. To further elaborate on the difference between our work and the broader existing literature, first consider conformal prediction. The purpose of conformal prediction is to provide coverage guarantees of the form in (1). The guarantee available through conformal risk control, (3), strictly subsumes that of conformal prediction; it is generally impossible to recast risk control as coverage control. As a second question, one might ask whether (3) can be achieved through standard statistical machinery, such as uniform concentration inequalities. Though it is possible to integrate a uniform concentration inequality to get a bound in expectation, this strategy tends to be excessively loose both in theory and in practice (see, e.g., the bound of Anthony & Shawe-Taylor (1993)). The technique herein avoids these complications; it is simpler than concentration-based approaches, practical to implement, and tight up to a factor of 1/n, which is comparatively faster than concentration would allow. In this section, we establish the core theoretical properties of conformal risk control. All proofs, unless otherwise specified, are deferred to Appendix E. 2.1 RISK CONTROL We first show that the proposed algorithm leads to risk control when the loss is monotone. Theorem 1. Consider a sequence of exchangeable random loss functions, {Li(λ)}n+1 i=1 , where Li : Λ R which are non-increasing in λ, right-continuous, and for λmax = sup Λ Λ, satisfy Li(λmax) α, sup λ Li(λ) B < almost surely. (5) Then E[Ln+1(ˆλ)] α. Proof. Let b Rn+1(λ) = (L1(λ) + . . . + Ln+1(λ))/(n + 1) and ˆλ = inf n λ Λ : b Rn+1(λ) α o . 
Here $\hat\lambda'$ denotes the quantity defined above, $\hat\lambda' = \inf\{\lambda \in \Lambda : \widehat{R}_{n+1}(\lambda) \le \alpha\}$. Since $\inf_\lambda L_i(\lambda) = L_i(\lambda_{\max}) \le \alpha$, $\hat\lambda'$ is well-defined almost surely. Since $L_{n+1}(\lambda) \le B$, we know

$$\widehat{R}_{n+1}(\lambda) = \frac{n}{n+1}\widehat{R}_n(\lambda) + \frac{L_{n+1}(\lambda)}{n+1} \le \frac{n}{n+1}\widehat{R}_n(\lambda) + \frac{B}{n+1}.$$

Thus,

$$\frac{n}{n+1}\widehat{R}_n(\lambda) + \frac{B}{n+1} \le \alpha \;\Longrightarrow\; \widehat{R}_{n+1}(\lambda) \le \alpha.$$

This implies $\hat\lambda' \le \hat\lambda$ when the left-hand side holds for some $\lambda \in \Lambda$. When the left-hand side is above $\alpha$ for all $\lambda \in \Lambda$, by definition $\hat\lambda = \lambda_{\max} \ge \hat\lambda'$. Thus, $\hat\lambda' \le \hat\lambda$ almost surely. Since $L_i(\lambda)$ is non-increasing in $\lambda$,

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \le \mathbb{E}\big[L_{n+1}(\hat\lambda')\big]. \qquad (6)$$

Let $E$ be the multiset of loss functions $\{L_1, \ldots, L_{n+1}\}$. Then $\hat\lambda'$ is a function of $E$, or, equivalently, $\hat\lambda'$ is a constant conditional on $E$. Additionally, $L_{n+1}(\lambda) \mid E \sim \mathrm{Uniform}(\{L_1, \ldots, L_{n+1}\})$ by exchangeability. These facts, combined with the right-continuity of the $L_i$, imply

$$\mathbb{E}\big[L_{n+1}(\hat\lambda') \mid E\big] = \frac{1}{n+1}\sum_{i=1}^{n+1} L_i(\hat\lambda') \le \alpha.$$

The proof is completed by the law of total expectation and (6).

2.2 A TIGHT RISK LOWER BOUND

Next we show that the conformal risk control procedure is tight up to a factor $2B/(n+1)$ that cannot be improved in general. The proof relies on a form of continuity that generalizes the assumption of continuous non-conformity scores used in the standard conformal proof. Define the jump function, which quantifies the size of the discontinuity in a right-continuous input function $l$ at point $\lambda$, as

$$J(l, \lambda) = \lim_{\epsilon \to 0^+} l(\lambda - \epsilon) - l(\lambda).$$

If the probability that $L_i$ has a discontinuity at any pre-specified $\lambda$ is zero, i.e., $\mathbb{P}(J(L_i, \lambda) > 0) = 0$, then the conformal risk control procedure is not too conservative.

Theorem 2. In the setting of Theorem 1, further assume that the $L_i$ are i.i.d., $L_i \ge 0$, and for any $\lambda$, $\mathbb{P}(J(L_i, \lambda) > 0) = 0$. Then

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \ge \alpha - \frac{2B}{n+1}.$$

This bound is tight for general monotone loss functions, as we show next.

Proposition 1. In the setting of Theorem 2, for any $\epsilon > 0$, there exists a loss function and $\alpha \in (0, 1)$ such that

$$\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \le \alpha - \frac{2B - \epsilon}{n+1}.$$

Since we can take $\epsilon$ arbitrarily close to zero, the factor $2B/(n+1)$ in Theorem 2 is required. Conformal prediction (both the algorithm and the guarantee) is exactly equivalent to conformal risk control when the loss function is an indicator, including the tighter lower bound (see Appendix A).

2.3 CONTROLLING GENERAL LOSS FUNCTIONS

We next show that the conformal risk control algorithm does not control the risk if the $L_i$ are not assumed to be monotone; in particular, (3) does not hold. We show this by example.

Proposition 2. For any $\epsilon > 0$, there exists a non-monotone loss function such that $\mathbb{E}\big[L_{n+1}(\hat\lambda)\big] \ge B - \epsilon$.

Notice that for any desired level $\alpha$, the expectation in (3) can be arbitrarily close to $B$. Since the function values here are in $[0, B]$, this means that even for bounded random variables, risk control can be violated by an arbitrary amount unless further assumptions are placed on the $L_i$. However, the algorithms developed here may still be appropriate for near-monotone loss functions. Simply monotonizing all loss functions $L_i$ and running conformal risk control will guarantee (3), but this strategy will only be powerful (i.e., not conservative) if the loss is near-monotone. For concreteness, we describe this procedure below as a corollary of Theorem 1.

Corollary 1. Let $L_i(\lambda)$ be any (possibly non-monotone) function of $\lambda$ satisfying (5). Take

$$\tilde L_i(\lambda) = \sup_{\lambda' \ge \lambda} L_i(\lambda'), \quad \widetilde{R}_n(\lambda) = \frac{1}{n}\sum_{i=1}^n \tilde L_i(\lambda), \quad \text{and} \quad \tilde\lambda = \inf\Big\{\lambda : \frac{n}{n+1}\widetilde{R}_n(\lambda) + \frac{B}{n+1} \le \alpha\Big\}.$$

Then $\mathbb{E}\big[\tilde L_{n+1}(\tilde\lambda)\big] \le \alpha$; since $L_{n+1} \le \tilde L_{n+1}$ pointwise, the risk of the original loss is controlled as well.

If the loss function is already monotone, then $\tilde\lambda$ reduces to $\hat\lambda$. We propose a further algorithm for picking $\lambda$ in Appendix C that provides an asymptotic risk-control guarantee for non-monotone loss functions. However, this algorithm again is only powerful when the risk $\mathbb{E}[L_{n+1}(\lambda)]$ is near-monotone, and it reduces to the standard conformal risk control algorithm when the loss is monotone.
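Returning to Corollary 1, the only change relative to (4) is that each loss curve is replaced by its running maximum from the right before thresholding. Below is a minimal sketch under the same grid representation used for (4); the toy non-monotone losses and variable names are illustrative, not the paper's released code.

```python
import numpy as np

def monotonize(losses):
    """L~_i(lambda) = sup over lambda' >= lambda of L_i(lambda'): running max from the right."""
    return np.maximum.accumulate(losses[:, ::-1], axis=1)[:, ::-1]

def monotonized_risk_control(losses, lambdas, alpha, B=1.0):
    """Corollary 1: apply the thresholding rule (4) to the monotonized losses."""
    n = losses.shape[0]
    risk_hat = monotonize(losses).mean(axis=0)
    adjusted = n / (n + 1) * risk_hat + B / (n + 1)
    valid = np.nonzero(adjusted <= alpha)[0]
    return lambdas[valid[0]] if valid.size > 0 else lambdas[-1]

# Toy non-monotone loss curves: bumps in the middle of the lambda range, decaying toward lambda_max.
rng = np.random.default_rng(1)
lambdas = np.linspace(0, 1, 501)
centers = rng.uniform(0.2, 0.6, size=100)
losses = np.exp(-((lambdas[None, :] - centers[:, None]) / 0.1) ** 2)  # values in (0, 1], not monotone
losses[:, -1] = 0.0  # make the requirement L_i(lambda_max) <= alpha explicit (already nearly zero here)
print("lambda-tilde =", monotonized_risk_control(losses, lambdas, alpha=0.1))
```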
3 EXAMPLES

To demonstrate the flexibility and empirical effectiveness of the proposed algorithm, we apply it to four tasks across computer vision and natural language processing. All four loss functions are non-binary, monotone losses bounded by 1, and are commonly used within their respective application domains. Our results validate that the procedure bounds the risk as desired and gives useful outputs to the end-user. We note that the choices of $\mathcal{C}_\lambda$ used herein are only for purposes of illustration; any nested family of sets will work. For each example use case, at a representative $\alpha$ (details provided for each task), we provide both qualitative results and quantitative histograms of the risk and set sizes over 1000 random data splits that demonstrate valid risk control (i.e., with mean at most $\alpha$).

Figure 1: FNR control in tumor segmentation. The top figure shows examples of our procedure with correct pixels in white, false positives in blue, and false negatives in red. The bottom plots report FNR and set size (as a fraction of polyp size) over 1000 independent random data splits. The dashed gray line marks $\alpha$.

Figure 2: FNR control on MS COCO. The top figure shows examples of our procedure with correct classes in black, false positives in blue, and false negatives in red. The bottom plots report FNR and set size over 1000 independent random data splits. The dashed gray line marks $\alpha$.

3.1 FNR CONTROL IN TUMOR SEGMENTATION

In tumor segmentation, our input is a $d \times d$ image and our label is a set of pixels $Y_i \in \mathcal{P}(\{(1, 1), (1, 2), \ldots, (d, d)\})$, where $\mathcal{P}$ denotes the power set. We use an image segmentation model $f : \mathcal{X} \to [0, 1]^{d \times d}$ outputting a probability for each pixel, and measure loss as the fraction of false negatives,

$$L^{\mathrm{FNR}}_i(\lambda) = 1 - \frac{|Y_i \cap \mathcal{C}_\lambda(X_i)|}{|Y_i|}, \quad \text{where } \mathcal{C}_\lambda(X_i) = \{y : f(X_i)_y \ge 1 - \lambda\}. \qquad (7)$$

The expected value of $L^{\mathrm{FNR}}_i$ is the FNR. Since $L^{\mathrm{FNR}}_i$ is monotone, so is the FNR. Thus, we use the technique in Section 2.1 to pick $\hat\lambda$ by (4), which controls the FNR on a new point and guarantees

$$\mathbb{E}\big[L^{\mathrm{FNR}}_{n+1}(\hat\lambda)\big] \le \alpha. \qquad (8)$$

To evaluate the proposed procedure we pool data from several online open-source gut polyp segmentation datasets: Kvasir, Hyper-Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. We choose a PraNet (Fan et al., 2020) as our base model $f$, use $n = 1000$ calibration points, and evaluate risk control with the 781 remaining validation data points. We report results with $\alpha = 0.1$ in Figure 1. The mean and standard deviation of the risk over 1000 trials are 0.0987 and 0.0114, respectively.

Figure 3: Control of graph distance on hierarchical ImageNet.
The top figure shows examples of our procedure with correct classes in black, false positives in blue, and false negatives in red. The bottom plots report our minimum hierarchical distance loss and set size over 1000 independent random data splits. The dashed gray line marks α. 3.2 FNR CONTROL IN MULTILABEL CLASSIFICATION In the multilabel classification setting, our input Xi is an image and our label is a set of classes Yi {1, . . . , K} for some number of classes K. Using a multiclass classification model f : X [0, 1]K, we form prediction sets and calculate the number of false positives exactly as in (7). By Theorem 1, picking ˆλ as in (4) again yields the FNR-control guarantee in (8). We evaluate on the Microsoft Common Objects in Context (MS COCO) dataset (Lin et al., 2014), a large-scale 80-class multiclass classification task commonly used in computer vision. We choose a TRes Net (Ridnik et al., 2020) as our model f and used n = 4000, and evaluated risk control with 1000 validation data points. We report results with α = 0.1 in Figure 2. The mean and standard deviation of the risk over 1000 trials are 0.0996 and 0.0052, respectively. The results indicate that the risk is almost exactly controlled, the spread is not too wide, and the set sizes are reasonable, not overly inflated. 3.3 CONTROL OF GRAPH DISTANCE IN HIERARCHICAL IMAGE CLASSIFICATION In the K-class hierarchical classification setting, our input Xi is an image and our label is a leaf node Yi {1, ..., K} on a tree with nodes V and edges E. Using a single-class classification model f : X K, we calibrate a loss in graph distance between the interior node we select and the closest ancestor of the true class. For any x X, let ˆy(x) = arg maxk f(x)k be the class with the highest estimated probability. Further, let d : V V Z be the function that returns the length of the shortest path between two nodes, let A : V 2V be the function that returns the ancestors of its argument, and let P : V 2V be the function that returns the set of leaf nodes that are descendants of its argument. We also let g(v, x) = P k P(v)f(x)k be the sum of scores of leaves descended from v. Further, define a hierarchical distance d H(v, u) = inf a A(v){d(a, u)}. For a set of nodes Cλ 2V, we then define the set-valued loss LGraph i (λ) = inf s Cλ(Xi){d H(y, s)}/D, where Cλ(x) = \ {a A(ˆy(x)) : g(a,x) λ} P(a). This loss returns zero if y is a child of any element in Cλ, and otherwise returns the minimum distance between any element of Cλ and any ancestor of y, scaled by the depth D. Thus, it is a monotone loss function and can be controlled by choosing ˆλ as in (4) to achieve the guarantee E h LGraph n+1 (ˆλ) i α. Published as a conference paper at ICLR 2024 when were cigarette ads banned from tv uk? who told the story of the prodigal son? {1 august 1965, 1965, 11 january 2006, } {robert wilkins, jesus, david, keith green, } Answers = {1 august 1965, } Answers = {Jesus Christ} 0.26 0.28 0.30 0.32 0.34 risk 20 22 24 26 28 30 set size Figure 4: F1-score control on Natural Questions. The top figure shows examples of our procedure with fully correct answers in green, partially correct answers in blue, and false positives in gray. Note that answers that are technically correct may still be down-graded if they do not match the reference. We treat this as part of the randomness in the task. The bottom plots report the F1 risk and average set size over 1000 independent random data splits. The dashed gray line marks α. 
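To make the hierarchical construction of Section 3.3 concrete, the sketch below selects, for a given $\lambda$, the deepest ancestor of the top-scoring leaf whose subtree probability mass reaches $\lambda$, and evaluates the scaled graph-distance loss. The tiny hierarchy, scores, and depth normalization are invented for illustration; they are unrelated to WordNet and are not the paper's released code.

```python
import numpy as np

# Tiny hypothetical hierarchy: parent map over node names; leaves carry classifier scores.
PARENT = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": "root", "root": None}
LEAVES = ["dog", "wolf", "cat"]

def ancestors(v):
    """A(v): v together with its ancestors, ordered from v up to the root."""
    out = []
    while v is not None:
        out.append(v)
        v = PARENT[v]
    return out

def subtree_score(v, scores):
    """g(v, x): total predicted probability of the leaves under v."""
    return sum(scores[leaf] for leaf in LEAVES if v in ancestors(leaf))

def predicted_node(scores, lam):
    """Deepest ancestor of the top-scoring leaf whose subtree mass is at least lambda.
    Its leaf descendants form C_lambda(x); larger lambda climbs higher (more conservative)."""
    top = max(LEAVES, key=lambda leaf: scores[leaf])
    for a in ancestors(top):                  # ordered from the leaf upward
        if subtree_score(a, scores) >= lam:
            return a
    return "root"

def graph_distance_loss(node, true_leaf, depth=3):
    """Shortest-path distance from the selected node to the closest ancestor of the true leaf,
    scaled by the maximum depth of this toy tree; zero if the node is itself such an ancestor."""
    chain = ancestors(true_leaf)
    if node in chain:
        return 0.0
    steps = next(i for i, a in enumerate(ancestors(node)) if a in chain)
    return steps / depth

scores = {"dog": 0.45, "wolf": 0.35, "cat": 0.20}   # hypothetical f(x)
for lam in (0.4, 0.6, 0.9):
    node = predicted_node(scores, lam)
    print(f"lambda={lam:.1f}  node={node}  loss={graph_distance_loss(node, 'wolf'):.2f}")
```

As $\lambda$ grows the selected node moves up the tree, so the loss is non-increasing in $\lambda$, which is what allows calibration by (4).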
We use the Image Net dataset (Deng et al., 2009), which comes with an existing label hierarchy, Word Net, of maximum depth D = 14. We choose a Res Net152 (He et al., 2016) for f and n = 30000, and evaluate risk with the remaining 20000. We report results with α = 0.05 in Figure 3. The mean and standard deviation of the risk over 1000 trials are 0.0499 and 0.0011, respectively. The results indicate that the risk is almost exactly controlled, and that the adaptively chosen resolution of the prediction appropriately encodes the model uncertainty (it is almost always a leaf node). 3.4 F1-SCORE CONTROL IN OPEN-DOMAIN QUESTION ANSWERING In the open-domain question answering setting, our input Xi is a question and our label Yi is a set of (possibly non-unique) correct answers. For example, the input Xn+1 = Where was Barack Obama Born? could have the answer set Yn+1 = { Hawaii , Honolulu, Hawaii , Kapo olani Medical Center } Formally, here we treat all questions and answers as being composed of sequences (up to size m) of tokens in a vocabulary V i.e., assuming k valid answers, we have Xi Z and Yi Zk, where Z := Vm. Using an open-domain question answering model that individually scores candidate output answers f : Z Z R, we calibrate the best token-based F1-score of the prediction set, taken over all pairs of predictions and answers: LF1 i (λ) = 1 max F1(a, c): c Cλ(Xi), a Yi , where Cλ(Xi) = {y Vm : f(Xi, y) λ} . We define the F1-score following popular QA evaluation metrics (Rajpurkar et al., 2016), where we treat predictions and ground truth answers as bags of tokens and compute the geometric average of their precision and recall (while ignoring punctuation and articles { a , an , the }). Since LF1 i , as defined in this way, is monotone and upper bounded by 1, it can be controlled by choosing ˆλ as in Section 2.1 to achieve the following guarantee: E h LF1 n+1(ˆλ) i α. We use the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019), a popular open-domain question answering baseline, to evaluate our method. We use the splits distributed as part of the Dense Passage Retrieval (DPR) package (Karpukhin et al., 2020). Our base model is the DPR Retriever Reader model (Karpukhin et al., 2020), which retrieves passages from Wikipedia that might contain the answer to the given query, and then uses a reader model to extract text sub-spans from the retrieved passages that serve as candidate answers. Instead of enumerating all possible answers to a given question, we retrieve the top several hundred candidate answers, extracted from the top 100 passages. We use n = 2500 calibration points, and evaluate risk control with the remaining 1110. We use α = 0.3 (chosen empirically as the lowest F1 score which reliably results in approximately correct answers by manual validation) in Figure 4. The mean and standard deviation of the risk over 1000 trials are 0.2996 and 0.0150, respectively. The results indicate that the risk is almost exactly controlled, and that the sets are reasonably sized, scaling appropriately with question difficulty. Published as a conference paper at ICLR 2024 4 EXTENSIONS We now discuss several extensions of conformal risk control to different settings and risks. 4.1 RISK CONTROL UNDER DISTRIBUTIONAL SHIFT Under a distribution shift, the goal in (3) can be redefined as E(X1,Y1),...,(Xn,Yn) Ptrain, (Xn+1,Yn+1) Ptest h Ln+1 ˆλ i α. 
(9) Assuming that Ptest is absolutely continuous with respect to Ptrain and defining w(x, y) = d Ptest(x,y) d Ptrain(x,y), the weighted objective (9) can be rewritten as E(X1,Y1),...,(Xn+1,Yn+1) Ptrain h w(Xn+1, Yn+1)Ln+1 ˆλ i α. (10) When w is known and bounded, we can apply our procedure on the loss function Ln+1(λ) = w(Xn+1, Yn+1)Ln+1(λ), which is non-decreasing, bounded, and right-continuous in λ whenever Ln+1 is. Thus, Theorem 1 guarantees that the resulting ˆλ satisfies (10). For example, in the covariate shift setting, w(Xn+1, Yn+1) = w(Xn+1) = d Ptest(Xn+1) d Ptrain(Xn+1). In this case, we can achieve risk control even when w is unbounded. In particular, assuming Li [0, B], for any potential value x of the covariate, we define ˆλ(x) = inf λ : Pn i=1 w(Xi)Li(λ) + w(x)B Pn i=1 w(Xi) + w(x) α . Proposition 3. In the setting of Theorem 1, with ˆλ as above, E(X1,Y1),...,(Xn,Yn) Ptrain,(Xn+1,Yn+1) Ptest[Ln+1(ˆλ(Xn+1))] α. This is an exact generalization of the procedure of Tibshirani et al. (2019) beyond indicator losses. As proposed therein, when unlabeled data in the test domain is available, w can be estimated by the probabilistic classification algorithm; this gives good practical results (see also our experiment in Appendix D). For arbitrary distribution shifts, we give a total variation bound analogous to that of Barber et al. (2022) for independent data (see their Section 4.1), though the proof is different. Here we will use the notation Zi = (Xi, Yi), and ˆλ(Z1, . . . , Zn) to refer to that chosen in (4). Proposition 4. Let Z = (Z1, . . . , Zn+1) be a sequence of random variables. Then, under the conditions in Theorem 1, E h Ln+1(ˆλ) i α + B Pn i=1 TV(Zi, Zn+1). If further the assumptions of Theorem 2 hold, E h Ln+1(ˆλ) i α B 2 n+1 + Pn i=1 TV(Zi, Zn+1) . 4.2 QUANTILE RISK CONTROL Snell et al. (2022) generalizes Bates et al. (2021) to control the quantile of a monotone loss function conditional on (Xi, Yi)n i=1 with probability 1 δ over the calibration dataset for any user-specified tolerance parameter δ. In some applications, it may be sufficient to control the unconditional quantile of the loss function, which alleviates the burden of the user to choose the tolerance parameter δ. For any random variable X, let Quantileβ(X) = inf{x : P(X x) β}. Analogous to (3), we want to find ˆλ based on (Xi, Yi)n i=1 such that Quantileβ Ln+1(ˆλβ) α. (11) By definition, Quantileβ Ln+1(ˆλβ) α E h 1 n Ln+1(ˆλβ) > α oi 1 β. As a consequence, quantile risk control is equivalent to expected risk control (3) with loss function Li(λ) = 1 {Li(λ) > α}. Let ˆλβ = inf n λ Λ : 1 n+1 Pn i=1 1 {Li(λ) > α} + 1 n+1 1 β o . Proposition 5. In the setting of Theorem 1, with ˆλ as above, (11) is achieved. It is unclear whether the wider class of quantile-based risks considered by Snell et al. (2022) (e.g. the CVa R) can be controlled unconditionally. Published as a conference paper at ICLR 2024 4.3 CONTROLLING MULTIPLE RISKS Let Li(λ; γ) be a family of loss functions indexed by γ Γ for some domain Γ that may have infinitely many elements. A researcher may want to control E[Li(λ; γ)] at level α(γ). Equivalently, we need to find an ˆλ based on (Xi, Yi)n i=1 such that supγ Γ E h Li(ˆλ;γ) α(γ) i 1. (12) Though the above worst-case risk is not an expectation, it can still be controlled. Towards this end, we define ˆλ = supγ Γ ˆλγ, where ˆλγ = inf{λ : 1 n+1 Pn i=1 Li(λ; γ) + B n+1 α(γ)}. Proposition 6. In the setting of Theorem 1, with ˆλ as above, (12) is satisfied. 
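As an illustration of this construction, the sketch below computes $\hat\lambda_\gamma$ for each member of a finite family and returns their maximum; the two toy loss families, levels, and names are purely illustrative (the result extends to infinite $\Gamma$ as stated in Proposition 6), and the code is not the paper's released implementation.

```python
import numpy as np

def per_gamma_threshold(losses, lambdas, alpha, B=1.0):
    """lambda-hat_gamma: the rule (4) applied to one family member's (n, m) loss matrix."""
    n = losses.shape[0]
    adjusted = losses.mean(axis=0) * n / (n + 1) + B / (n + 1)
    valid = np.nonzero(adjusted <= alpha)[0]
    return lambdas[valid[0]] if valid.size > 0 else lambdas[-1]

def multi_risk_threshold(loss_dict, alpha_dict, lambdas, B=1.0):
    """Section 4.3: lambda-hat = max over gamma of lambda-hat_gamma, so every E[L(.; gamma)]
    is controlled at its own level alpha(gamma) simultaneously."""
    return max(per_gamma_threshold(loss_dict[g], lambdas, alpha_dict[g], B) for g in loss_dict)

# Toy illustration with two hypothetical loss families over the same lambda grid.
rng = np.random.default_rng(2)
lambdas = np.linspace(0, 1, 501)
scores_a, scores_b = rng.uniform(size=300), rng.beta(2, 5, size=300)
loss_dict = {"gamma_a": (scores_a[:, None] > lambdas[None, :]).astype(float),
             "gamma_b": (scores_b[:, None] > lambdas[None, :]).astype(float)}
alpha_dict = {"gamma_a": 0.10, "gamma_b": 0.05}
print("lambda-hat =", multi_risk_threshold(loss_dict, alpha_dict, lambdas))
```

Taking the maximum of the per-$\gamma$ thresholds is valid because each loss is non-increasing in $\lambda$, so a $\lambda$ that is conservative enough for the most demanding $\gamma$ is conservative enough for all of them.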
4.4 ADVERSARIAL RISKS We next show how to control risks defined by adversarial perturbations. We adopt the same notation as Section 4.3. Bates et al. (2021) (Section 6.3) discusses the adversarial risk where Γ parametrizes a class of perturbations of Xn+1, e.g., Li(λ; γ) = L(Xi + γ, Yi) and Γ = {γ : γ ϵ}. A researcher may want to find an ˆλ based on (Xi, Yi)n i=1 such that E[supγ Γ Li(λ; γ)] α. (13) This can be recast as a conformal risk control problem by taking Li(λ) = supγ Γ Li(λ; γ). Then, the following choice of λ leads to risk control: ˆλ = inf{λ : 1 n+1 Pn i=1 Li(λ) + B n+1 α}. Proposition 7. In the setting of Theorem 1, with ˆλ as above, (13) is satisfied. 4.5 U-RISK CONTROL For ranking and metric learning, Bates et al. (2021) considered loss functions that depend on two test points. In general, for any k > 1 and subset S {1, . . . , n + k} with |S| = k, let LS(λ) be a loss function. Our goal is to find ˆλk based on (Xi, Yi)n i=1 such that E h L{n+1,...,n+k}(ˆλk) i α. (14) We call the LHS a U-risk since, for any fixed ˆλk, it is the expectation of an order-k U-statistic. As a natural extension, we can define ˆλk = inf n λ : k!n! (n+k)! P S {1,...,n},|S|=k LS(λ) + B 1 (n!)2 (n+k)!(n k)! α o . (15) Again, we define ˆλk = λmax when the RHS is empty. Then we can prove the following result. Proposition 8. Assume that LS(λ) is non-increasing in λ, right-continuous, and LS(λmax) α, supλ LS(λ) B < almost surely. Then (14) is achieved with ˆλ as above. 5 CONCLUSION This generalization of conformal prediction broadens its scope to new applications, as shown in Section 3. Still, two primary limitations of our technique remain: firstly, the requirement of a monotone loss is difficult to lift. Secondly, extensions to non-exchangeable data require knowledge about the form of the shift. This issue affects most statistical methods, including standard conformal prediction, and ours is no different in this regard. Finally, the mathematical tools developed in Sections 2 and 4 may be of independent technical interest, as they provide a new, and more general, language for studying conformal prediction, along with new results about its validity. REPRODUCIBILITY STATEMENT Code to reproduce our examples is available at https://github.com/aangelopoulos/ conformal-risk. Published as a conference paper at ICLR 2024 Anastasios N Angelopoulos, Stephen Bates, Emmanuel J Cand es, Michael I Jordan, and Lihua Lei. Learn then Test: Calibrating predictive algorithms to achieve risk control. ar Xiv preprint ar Xiv:2110.01052, 2021a. Anastasios N Angelopoulos, Amit Pal Kohli, Stephen Bates, Michael Jordan, Jitendra Malik, Thayer Alshaabi, Srigokul Upadhyayula, and Yaniv Romano. Image-to-image regression with distribution-free uncertainty quantification and applications in imaging. In International Conference on Machine Learning, pp. 717 730. PMLR, 2022a. Anastasios N Angelopoulos, Karl Krauth, Stephen Bates, Yixin Wang, and Michael I Jordan. Recommendation systems with distribution-free reliability guarantees. ar Xiv preprint ar Xiv:2207.01609, 2022b. Anastasios Nikolas Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification, 2021. URL https://arxiv.org/abs/ 2107.07511. Anastasios Nikolas Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I Jordan. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations (ICLR), 2021b. URL https://openreview.net/forum?id=e Ndi U_ Db M9. 
Martin Anthony and John Shawe-Taylor. A result of vapnik with applications. Discrete Applied Mathematics, 47(3):207 217, 1993. Rina Barber, Emmanuel Cand es, Aaditya Ramdas, and Ryan Tibshirani. The limits of distributionfree conditional predictive inference. Information and Inference, 10, 08 2020. doi: 10.1093/ imaiai/iaaa017. Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. ar Xiv preprint ar Xiv:2202.13415, 2022. Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6), September 2021. ISSN 0004-5411. doi: 10.1145/3478535. URL https://doi.org/10.1145/3478535. Vidmantas Bentkus. On Hoeffding s inequalities. The Annals of Probability, 32(2):1650 1673, 2004. doi: 10.1214/009117904000000360. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248 255, 2009. doi: 10.1109/CVPR.2009.5206848. Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 263 273, 2020. doi: 10.1007/978-3-030-59725-2 26. Clara Fannjiang, Stephen Bates, Anastasios N Angelopoulos, Jennifer Listgarten, and Michael I Jordan. Conformal prediction for the design problem. ar Xiv preprint ar Xiv:2202.03613, 2022. Adam Fisch, Tal Schuster, Tommi S. Jaakkola, and Regina Barzilay. Efficient conformal prediction via cascaded inference with expanded admission. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=tn So6VRLm T. Adam Fisch, Tal Schuster, Tommi Jaakkola, and Regina Barzilay. Conformal prediction sets with limited false positives. ar Xiv preprint ar Xiv:2202.07650, 2022. Alexandre Froda. Sur la distribution des propri et es de voisinage des fonctions de variables r eelles. Bulletin math ematique de la Soci et e Roumaine des Sciences, 32(2):105 202, 1931. Published as a conference paper at ICLR 2024 Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660 1672, 2021. Leying Guan. Conformal prediction with localization. ar Xiv:1908.08558, 2020. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. doi: 10.1109/CVPR.2016.90. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 2020. Association for Computational Linguistics. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452 466, 2019. 
Jing Lei, Max G Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distributionfree predictive inference for regression. Journal of the American Statistical Association, 113 (523):1094 1111, 2018. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pp. 740 755. Springer, 2014. doi: 10.1007/978-3-319-10602-1 48. Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: European Conference on Machine Learning, pp. 345 356, 2002. doi: https://doi.org/10.1007/3-540-36755-1 29. Sangdon Park, Osbert Bastani, Nikolai Matni, and Insup Lee. PAC confidence sets for deep neural networks via calibrated prediction. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=BJx VI04Yv B. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQu AD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383 2392, November 2016. Tal Ridnik, Hussam Lawen, Asaf Noy, and Itamar Friedman. TRes Net: High performance GPUdedicated architecture. ar Xiv:2003.13630, 2020. Yaniv Romano, Evan Patterson, and Emmanuel Cand es. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32, pp. 3543 3553. 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 5103c3584b063c431bd1268e9b5e76fb-Paper.pdf. Yaniv Romano, Matteo Sesia, and Emmanuel Cand es. Classification with valid and adaptive coverage. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3581 3591. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 244edd7e85dc81602b7615cd705545f5-Paper.pdf. Swami Sankaranarayanan, Anastasios N Angelopoulos, Stephen Bates, Yaniv Romano, and Phillip Isola. Semantic uncertainty intervals for disentangled latent spaces. ar Xiv preprint ar Xiv:2207.10074, 2022. Tal Schuster, Adam Fisch, Tommi Jaakkola, and Regina Barzilay. Consistent accelerated inference via confident adaptive transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. Published as a conference paper at ICLR 2024 Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. ar Xiv preprint ar Xiv:2207.07061, 2022. Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371 421, 2008. Jake C Snell, Thomas P Zollo, Zhun Deng, Toniann Pitassi, and Richard Zemel. Quantile risk control: A flexible framework for bounding the probability of high-loss predictions. ar Xiv preprint ar Xiv:2212.13629, 2022. Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems, volume 32, pp. 2530 2540. 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf. Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000. Vladimir Vovk. Conditional validity of inductive conformal predictors. 
In Proceedings of the Asian Conference on Machine Learning, volume 25, pp. 475 490, 2012. URL http:// proceedings.mlr.press/v25/vovk12.html. Vladimir Vovk, Alexander Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, pp. 444 453, 1999. Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, NY, USA, 2005. Ian Waudby-Smith and Aaditya Ramdas. Variance-adaptive confidence sequences by betting. ar Xiv preprinttt ar Xiv:2010.09686, 2020. A CONFORMAL PREDICTION REDUCES TO RISK CONTROL Conformal prediction can be thought of as controlling the expectation of an indicator loss function. Recall that the risk upper bound (2) specializes to the conformal coverage guarantee in (1) when the loss function is the indicator of a miscoverage event. The conformal risk control procedure specializes to conformal prediction under this loss function as well. However, the risk lower bound in Theorem 2 has a slightly worse constant than the usual conformal guarantee. We now describe these correspondences. First, we show the equivalence of the algorithms. In conformal prediction, we have conformal scores s(Xi, Yi) for some score function s : X Y R. Based on this score function, we create prediction sets for the test point Xn+1 as Cˆλ(Xn+1) = y : s(Xn+1, y) ˆλ , where ˆλ is the conformal quantile, a parameter that is set based on the calibration data. In particular, conformal prediction chooses ˆλ to be the (n + 1)(1 α) /n sample quantile of {s(Xi, Yi)}n i=1. To formulate this in the language of risk control, we consider a miscoverage loss LCvg i (λ) = 1 n Yi / b Cλ(Xi) o = 1 {s(Xi, Yi) > λ}. Direct calculation of ˆλ from (4) then shows the equivalence of the proposed procedure to conformal prediction: λ : 1 n + 1 i=1 1 {s(Xi, Yi) > λ} + 1 n + 1 α i=1 1 {s(Xi, Yi) λ} (n + 1)(1 α) | {z } conformal prediction algorithm Next, we discuss how the risk lower bound relates to its conformal prediction equivalent. In the setting of conformal prediction, Lei et al. (2018) proves that P(Yn+1 / Cˆλ(Xn+1)) α 1/(n + 1) when the conformal score function follows a continuous distribution. Theorem 2 recovers this guarantee with a slightly worse constant: P(Yn+1 / Cˆλ(Xn+1)) α 2/(n+1). First, note that our Published as a conference paper at ICLR 2024 assumption in Theorem 2 about the distribution of discontinuities specializes to the continuity of the score function when the miscoverage loss is used: P J LCvg i , λ > 0 = 0 P(s(Xi, Yi) = λ) = 0. However, the bound for the conformal case is better than the bound for the general case in Theorem 2 by a factor of two, which cannot be improved according to Proposition 1. This difference is an interesting oddity of the binary loss, but is not practically important. B COMPARISON WITH RCPS/LTT The setting of conformal risk control resembles the setting of previous works in high-probability risk control Bates et al. (2021); Angelopoulos et al. (2021a) (RCPS/LTT); however, the algorithms, guarantees, and proof strategies are entirely different. To summarize, the main differences are as follows: 1. Conformal risk control is substantially more sample-efficient than RCPS/LTT. On the scale of the risk, RCPS/LTT are far more conservative, converging to α at a 1 n rate, while conformal risk control converges at a 1 2. 
RCPS/LTT are high probability bounds of the form P(E[ℓ(Yn+1, Cˆλ(Xn+1)) | {Xi, Yi}n i=1] α) 1 δ Conformal risk control gives an expectation bound of the form E[ℓ(Yn+1, Cˆλ(Xn+1))] α. The former bound provides high-probability control over the sampling of the calibration data, which can be attractive if one needs to be below α with high probability, as opposed to centered at α as in the latter. However, the former bound is more difficult to interpret for nonstatisticians and requires the selection of two parameters α and δ which can introduce complexity. 3. RCPS/LTT are substantially more complicated procedures than conformal risk control especially LTT, which requires careful discretization and bespoke multiple testing correction. This careful construction is required because the LTT procedure allows for the highprobability control of non-monotone risks, while conformal risk control only applies to monotone ones. 4. The theory of conformal risk control is a basic advancement to the core theory of conformal prediction and exchangeability-based arguments, extending them to a broader mathematical setting. Meanwhile, LTT/RCPS use well-established concentration bounds, which are mathematically unrelated to conformal prediction. To underscore these points, we perform a careful evaluation against these procedures on the polyp segmentation dataset in Figure 5. We compare against two versions of LTT/RCPS: Baseline 1 has δ = 0.5, Baseline 2 integrates the tail bound of LTT/RCPS to achieve a bound in expectation. The former strategy is not statistically valid for the goal of expectation control, and the latter strategy is essentially only possible using fixed-width bounds such as Hoeffding s inequality, and is described in the below Appendix B.1. We use the same risk level α = 0.1 for all procedures and make plots with n = {25, 50, 100, 200, 300}. The takeaways are as follows: The statistical efficiency gains of conformal risk control are massive. On the scale of the risk gap, |E[R(ˆλ)] α|, conformal risk control is 50% closer to the desired risk than Baseline 1, and orders of magnitude closer than Baseline 2. This is expected given the theoretical convergence rate is quadratically faster than both approaches. The computational efficiency of conformal risk control is substantially better than all baselines. For example, it is 14x faster than Baseline 1 and 19% faster than Baseline 2. It is also substantially easier to implement than Baseline 1, only requiring 5 lines of code. Baseline 2 is equally easy to implement, although far less flexible. B.1 RECOVERING EXPECTATION BOUNDS FROM RCPS We introduce a technique for deriving bounds in expectation using RCPS Bates et al. (2021)/LTT Angelopoulos et al. (2021a). This technique was not described in the original papers; Published as a conference paper at ICLR 2024 50 100 150 200 250 300 n Integrate Tail Bound (Hoeffding) Conformal Risk Control RCPS ( = 0.5) 50 100 150 200 250 300 n Integrate Tail Bound (Hoeffding) Conformal Risk Control RCPS ( = 0.5) Figure 5: Comparison of RCPS/LTT with conformal risk control on the polyp dataset. we only derive it here for the purpose of the above comparison (i.e., to show that this approach should essentially never be used for monotone losses). Consider a pointwise upper-bound on the empirical risk, as constructed in RCPS Bates et al. (2021): P(R(λ) ˆRn(λ) > ϵ) < δ(ϵ, λ), for all ϵ [0, B]. 
The next proposition says that if the bound has equal width everywhere, like in the case of Hoeffding s inequality, then it can be integrated to control the risk. Unfortunately, this strategy is not possible for general concentration inequalities, such as that of Bentkus (2004) or Waudby-Smith & Ramdas (2020), as it would require knowing δ(ϵ, λ ) (for an oracle λ defined in the proof below). Proposition B.1. Let Li(λ) be right-continuous, monotone, i.i.d. functions in [0, B] satisfying P(J(Li, λ) > 0) = 0. Furthermore, define δ+(ϵ) = supλ δ(ϵ, λ) and ˆλ = inf n λ : b R(λ) α o , where α satisfies α α R B α 0 δ+(ϵ)dϵ. Then, Proof. Let λ = inf{λ : R(λ) α + ϵ}. By monotonicity and the definition of ˆλ, ˆλ λ = ˆRn(λ ) α . But also, by the bounded jump assumption, R(λ ) = α + ϵ. Therefore, on the event ˆλ λ , we have that R(λ ) ˆRn(λ ) α α + ϵ = ϵ. (16) Since the event in (16) happens with probability at most δ+(ϵ), we have that P(ˆλ λ ) δ+(ϵ), so P(R(ˆλ) α + ϵ) δ+(ϵ). Finally, we have that E[R(ˆλ)] = Z B 0 P(R(ˆλ) > r)dr α P(R(ˆλ) > α + ϵ)dϵ This can be applied to Hoeffding s inequality, giving the aforementioned Baseline 2. Published as a conference paper at ICLR 2024 Corollary 2. In the setting of Proposition B.1, let α = α πB erf( 2n . Then, R(ˆλ) α. Proof. Applying Hoeffding s inequality we have δ+(ϵ) = e 2nϵ2, and then B2 dϵ α Z 1 B2 dϵ = α πB erf( as required. C MONOTONIZING NON-MONOTONE RISKS We next show that the proposed algorithm leads to asymptotic risk control for non-monotone risk functions when applied to a monotonized version of the empirical risk. We set the monotonized empirical risk to be b R n(λ) = sup t λ b Rn(t), then define ˆλ n = inf n λ : b R n(λ) α o . Theorem C.1. Let the Li(λ) be right-continuous, i.i.d., bounded (both above and below) functions satisfying (5). Then, lim n E h Ln+1 ˆλ n i α. Theorem C.1 implies that an analogous procedure to 4 also controls the risk asymptotically. In particular, taking λ = inf λ : b R n(λ) + B n + 1 α also results in asymptotic risk control (to see this, plug λ into Theorem C.1 and see that the risk level is bounded above by α B n+1). Note that in the case of a monotone loss function, λ = ˆλ. However, the counterexample in Proposition 2 does not apply to λ , and it is currently unknown whether this procedure does or does not provide finite-sample risk control. D EXPERIMENTS ON COVARIATE SHIFT To validate the covariate shift extension from Section 4, we perform an experiment on a synthetic regression under covariate shift. The model is as follows: Xi N(0, 1) Yi = 2Xi + N(0, 0.5|Xi|) Xtest N(0, 2) Ytest = 2Xtest + N(0, 0.5|Xtest|). The model used for prediction was a standard linear regression pre-trained on 1000 data points independent and identically distributed with the calibration data. We estimated the likelihood ratio using a logistic regression model by solving a classification problem to classify between all the available training/calibration covariates and batch of 1000 unlabeled test datapoints from the shifted distribution. The probits from the logistic regression were then transformed to likelihood ratios via the function h(x) = x/1 x. We took n = 100 and α = 0.1, controlling the clipped projective distance of y onto the set C: ℓ(y, C) = min(proj C(y), B), where B = 2. The set construction is the standard Cλ(x) = h ˆY (x) λ i . Figure 6 shows results. 
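A minimal sketch of the weighted calibration rule $\hat\lambda(x)$ from Proposition 3, as used in this experiment, is given below. The loss matrix, the made-up likelihood ratios, and the helper names are illustrative; in the experiment above the ratios come from a fitted logistic regression rather than random numbers, and this is not the paper's released code.

```python
import numpy as np

def weighted_threshold(losses, lambdas, weights, w_test, alpha, B=1.0):
    """lambda-hat(x) from Proposition 3: a likelihood-ratio-weighted version of the rule (4).

    losses:  (n, m) array of calibration losses L_i(lambda_j), non-increasing in j.
    weights: (n,) estimated likelihood ratios w(X_i) for the calibration covariates.
    w_test:  estimated likelihood ratio w(x) at the test covariate.
    """
    numer = weights @ losses + w_test * B          # sum_i w(X_i) L_i(lambda) + w(x) * B
    denom = weights.sum() + w_test
    valid = np.nonzero(numer / denom <= alpha)[0]
    return lambdas[valid[0]] if valid.size > 0 else lambdas[-1]

def probit_to_ratio(p):
    """h(p) = p / (1 - p): classifier probability of 'test' turned into a likelihood ratio
    (up to the train/test class balance), as described in the experiment above."""
    return p / (1.0 - p)

# Toy usage with made-up ratios; a real pipeline would fit, e.g., a logistic regression.
rng = np.random.default_rng(3)
lambdas = np.linspace(0, 1, 501)
scores = rng.uniform(size=100)
losses = (scores[:, None] > lambdas[None, :]).astype(float)
weights = probit_to_ratio(rng.uniform(0.3, 0.7, size=100))
print("lambda-hat(x) =", weighted_threshold(losses, lambdas, weights, w_test=1.5, alpha=0.1))
```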
Published as a conference paper at ICLR 2024 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 risk method standard covariate shift (estimated) Figure 6: Risk control results on a synthetic covariate shift dataset. Running standard conformal risk control does not control the risk, while running the covariate shift algorithm from Section 4 does even though the likelihood ratios are estimated by a logistic regression model. We also computed the risk with the true likelihood ratios, and it overlaps entirely with the histogram for estimated likelihood ratios, so we do not show it here. The proof of Theorem 2 uses the following lemma on the approximate continuity of the empirical risk. Lemma 1 (Jump Lemma). In the setting of Theorem 2, any jumps in the empirical risk are bounded, i.e., sup λ J b Rn, λ a.s. B Proof of Jump Lemma, Lemma 1. By boundedness, the maximum contribution of any single point to the jump is B λ : J b Rn, λ > B n = λ : J(Li, λ) > 0 and J(Lj, λ) > 0 for some i = j. Call Di = {λ : J(Li, λ) > 0} the sets of discontinuities in Li. Since Li is bounded monotone, Di has countably many points. The union bound then implies that P λ : J( b Rn, λ) > B i =j P(Di Dj = ) Rewriting each term of the right-hand side using tower property and law of total probability gives P (Di Dj = ) = E P Di Dj = Dj λ Dj P λ Di Dj λ Dj P (λ Di) Where the second inequality is because the union of the events λ Dj is the entire sample space, but they are not disjoint, and the third equality is due to the independence between Di and Dj. Rewriting in terms of the jump function and applying the assumption P (J(Li, λ) > 0) = 0, λ Dj P (λ Di) λ Dj P (J(Li, λ) > 0) Published as a conference paper at ICLR 2024 Chaining the above inequalities yields P λ : J( b Rn, λ) > B P λ : J( b Rn, λ) > B Proof of Theorem 2. If Li(λmax) α 2B/(n+1), then E[Ln+1(ˆλ)] α 2B/(n+1). Throughout the rest of the proof, we assume that Li(λmax) < α 2B/(n + 1). Define the quantity ˆλ = inf λ : b Rn+1(λ) + B n + 1 α . Since Li(λmax) < α 2B/(n + 1) < α B/(n + 1), ˆλ exists almost surely. Deterministically, n n+1 b Rn(λ) b Rn+1(λ), which yields ˆλ ˆλ . Again since Li(λ) is non-increasing in λ, E h Ln+1 ˆλ i E h Ln+1 ˆλ i By exchangeability and the fact that ˆλ is a symmetric function of L1, . . . , Ln+1, E h Ln+1 ˆλ i = E h b Rn+1 ˆλ i For the remainder of the proof we focus on lower-bounding b Rn+1 ˆλ . We begin with the following identity: α = b Rn+1 ˆλ + B n + 1 b Rn+1 ˆλ + B n + 1 α . Rearranging the identity, b Rn+1 ˆλ = α B n + 1 + b Rn+1 ˆλ + B n + 1 α . Using the Jump Lemma to bound b Rn+1 ˆλ + B n+1 α below by B b Rn+1 ˆλ α 2B n + 1. Finally, chaining together the above inequalities, E Ln+1(ˆλ) E b Rn+1(ˆλ ) α 2B n + 1. Proof of Proposition 1. Without loss of generality, assume B = 1. Fix any ϵ > 0. Consider the following loss functions, which satisfy the conditions in Theorem 2: Li(λ) i.i.d. 1 λ [0, Zi) k k+1 λ [Zi, Wi) 0 else , where k N, the Zi i.i.d. Uniform(0, 0.5), the Wi i.i.d. Uniform(0.5, 1) for i {1, ..., n + 1} and α = k+1 ϵ n+1 . Then, by the definition of ˆλ, we know b Rn ˆλ k ϵ If n > k + 1, b R(λ) k k+1 > k n whenever λ 1 2. Thus, we must have ˆλ > 1 2. Since k is an integer and by (17), we know that {i {1, ..., n} : Li ˆλ > 0} (k + 1)(k ϵ )/k k. This immediately implies that ˆλ W(n k+1), Published as a conference paper at ICLR 2024 where W(j) denotes the j-th order statistic. Notice that for all λ > 1 R(λ) = E [Li(λ)] = k k + 1P Wi > λ = k k + 1 2(1 λ), so R ˆλ k k+1 2(1 W(n k+1)). 
Let U(k) be the k-th smallest order statistic of n i.i.d. uniform random variables on (0, 1). Then, by symmetry and rescaling, 2(1 W(n k+1)) d= U(k), R ˆλ k k + 1U(k), where denotes the stochastic dominance. It is well-known that U(k) Beta(k, n + 1 k) and hence E[R ˆλ ] k k + 1 k n + 1. α E h R ˆλ i k + 1 ϵ (n + 1)(k + 1) = 1 n + 1 (2 ϵ )k + 1 ϵ For any given ϵ > 0, let ϵ = ϵ/2 and k = 2 (2 ϵ )k + 1 ϵ implying that α E h R ˆλ i 2 ϵ Proof of Proposition 2. Without loss of generality, we assume B = 1. Assume ˆλ takes values in [0, 1] and α (1/(n + 1), 1). Let p (0, 1), N be any positive integer, and Li(λ) be i.i.d. rightcontinuous piecewise constant (random) functions with Li(N/N) = 0, (Li(0/N), Li(1/N), . . . , Li((N 1)/N)) i.i.d. Ber(p). By definition, ˆλ is independent of Ln+1. Thus, for any j = 0, 1, . . . , N 1, n Ln+1(ˆλ) | ˆλ = j/N o Ber(p), n Ln+1(ˆλ) | ˆλ = 1 o δ0. Then, E h Ln+1 ˆλ i = p P(ˆλ = 1) Note that ˆλ = 1 min j {0,...,N 1} 1 n + 1 i=1 Li(j/N) α 1 n + 1. Since α > 1/(n + 1), P(ˆλ = 1) = 1 P(ˆλ = 1) = 1 P for all j, we have 1 n + 1 i=1 Li(j/N) > α 1 n + 1 pk(1 p)(n k) = 1 1 Bino CDF n, p, (n + 1)α 1 N As a result, E h Ln+1 ˆλ i = p 1 1 Bino CDF n, p, (n + 1)α 1 N ! Published as a conference paper at ICLR 2024 Now let N be sufficiently large such that 1 1 Bino CDF n, p, (n + 1)α 1 N ! Then E h Ln+1 ˆλ i > p2 For any α > 0, we can take p close enough to 1 to render the claim false. Proof of Theorem C.1. Define the monotonized population risk as R (λ) = sup t λ E h Ln+1(t) i Note that the independence of Ln+1 and ˆλ n implies that for all n, E h Ln+1 ˆλ n i E h R ˆλ n i . Since R is bounded, monotone, and one-dimensional, a generalization of the Glivenko-Cantelli Theorem given in Lemma 2 gives that uniformly over λ, lim n sup λ | b Rn(λ) R(λ)| a.s. 0. As a result, lim n sup λ | b R n(λ) R (λ)| a.s. 0, which implies that lim n | b R n(ˆλ ) R (ˆλ )| a.s. 0. By definition, b R (ˆλ ) α almost surely and thus this directly implies lim sup n R ˆλ n α a.s.. Finally, since for all n, R ˆλ n B, by Fatou s lemma, lim n E h Ln+1 ˆλ n i lim sup n E h R ˆλ n i E h lim sup n R ˆλ n i α. Proposition 3. Let λ : Pn+1 i=1 w(Xi)Li(λ) Pn+1 i=1 w(Xi) α Since infλ Li(λ) α, ˆλ exists almost surely. Using the same argument as in the proof of Theorem 1, we can show that ˆλ ˆλ(Xn+1). Since Ln+1(λ) is non-increasing in λ, E[Ln+1(ˆλ(Xn+1))] E[Ln+1(ˆλ )]. Let E be the multiset of loss functions {(X1, Y1), . . . , (Xn+1, Yn+1)}. Then ˆλ is a function of E, or, equivalently, ˆλ is a constant conditional on E. Lemma 3 of Tibshirani et al. (2019) implies that (Xn+1, Yn+1) | E w(Xi) Pn+1 j=1 w(Xj) δ(Xj,Yj) = Ln+1 | E w(Xi) Pn+1 j=1 w(Xj) δLi where δz denotes the Dirac measure at z. Together with the right-continuity of Li, the above result implies E h Ln+1(ˆλ ) | E i = Pn+1 i=1 w(Xi)Li(ˆλ ) Pn+1 i=1 w(Xi) α. The proof is then completed by the law of total expectation. Published as a conference paper at ICLR 2024 Lemma 2 (Glivenko-Cantelli for Monotone Functions). For i.i.d., right-continuous, monotone loss functions L1, . . . , Ln, we have that lim n sup λ | b Rn(λ) R(λ)| a.s. = 0. Lemma 2. We generalize the proof of the Glivenko-Cantelli Theorem (see, e.g., Van der Vaart (2000)). Let the notation R(λ ) denote the limit of R from the left-hand side approaching λ. By the strong law of large numbers, b Rn(λ) a.s. R(λ) and b Rn(λ ) a.s. R(λ ) for all λ. Fix ϵ > 0. It is well-known that a monotone function cannot have more than a countable number of discontinuities (Froda, 1931). 
Thus, there exists a finite partition Λ(ϵ) = { λ1 . . . λk }, such that R(λi ) R(λi 1) < ϵ. (There can only be a finite number of jumps greater than ϵ in size, and these can be included in Λ(ϵ).) The uniform convergence on the finite set Λ(ϵ) is immediate by the law of large numbers; thus, we have that limn supλ Λ(ϵ) | b Rn(λ) R(λ)| a.s. = 0. For any λ Λ(ϵ), suppose λ (λj 1, λj). Then ˆRn(λ) R(λ) ˆRn(λj) R(λj 1) < ˆRn(λj) R(λj)+ϵ. Similarly, ˆRn(λ) R(λ) ˆRn(λj 1) R(λ j ) > ˆRn(λj 1) R(λj 1) ϵ. Combining two pieces, we conclude that supλ | ˆRn(λ) R(λ)| supλ Λ(ϵ) | b Rn(λ) R(λ)| + ϵ. As a result, limn supλ | b Rn(λ) R(λ)| a.s. ϵ. Taking ϵ 0 completes the proof. Proposition 4. Define the vector Z = (Z 1, . . . , Z n, Zn+1), where Z i i.i.d. L(Zn+1) for all i [n]. Let i=1 TV(Zi, Z i). By sublinearity, TV(Z, Z ) ϵ. (18) It is a standard fact that (18) implies sup f F1 |E[f(Z)] E[f(Z )]| ϵ, where F1 = {f : Z 7 [0, 1]}. Let ℓ: Z Λ [0, B] be a bounded loss function. Furthermore, let g(z) = ℓ(zn+1; ˆλ(z1, . . . , zn)). Since g(Z) [0, B], |E[g(Z)] E[g(Z )]| Bϵ. Furthermore, since Z 1, . . . , Z n+1 are exchangeable, we can apply Theorems 1 and 2 to E[g(Z )], recovering α 2B n + 1 E[g(Z )] α. A final step of triangle inequality implies the result: α 2B n + 1 Bϵ E[g(Z)] α + Bϵ. Proposition 5. It is left to prove that Li(λ) satisfies the conditions of Theorem 1. It is clear that Li(λ) 1 and Li(λ) is non-increasing in λ when Li(λ) is. Since Li(λ) is non-increasing and right-continuous, for any sequence λm λ, Li(λm) Li(λ) = 1 {Li(λm) > α} 1 {Li(λ) > α} . Thus, Li(λ) is right-continuous. Finally, Li(λmax) α implies Li(λmax) = 0 1 β. Proposition 6. Examining the form of ˆλ, for each γ Γ, we have E h L(ˆλ, γ) i E h L(ˆλγ, γ) i α(γ). Thus, dividing both sides by α(γ) and taking the supremum, we get that supγ Γ E h L(ˆλ,γ) α(γ) i 1, and the worst-case risk is controlled. Published as a conference paper at ICLR 2024 Proposition 7. Because Li(λ, γ) is bounded and monotone in λ for all choices of γ, it is also true that Li(λ) is bounded and monotone. Furthermore, the pointwise supremum of right-continuous functions is also right-continuous. Therefore, the Li satisfy the assumptions of Theorem 1. Proposition 8. Let λ : k!n! (n + k)! S {1,...,n+k},|S|=k LS(λ) α Since LS(λmax) α, ˆλ k exists almost surely. Since LS(λ) B, we have k!n! (n + k)! S {1,...,n+k},|S|=k LS(λ) k!n! (n + k)! S {1,...,n},|S|=k LS(λ) + B X S {n+1,...,n+k} = ,|S|=k 1 = k!n! (n + k)! S {1,...,n},|S|=k LS(λ) + B 1 k!n! (n + k)! S {1,...,n},|S|=k 1 = k!n! (n + k)! S {1,...,n},|S|=k LS(λ) + B 1 (n!)2 (n + k)!(n k)! Since LS(λ) is non-increasing in λ, we conclude that ˆλ k ˆλk if the right-hand side of (15) is not empty; otherwise, by definition, ˆλ k λmax = ˆλk. Thus, ˆλ k ˆλk almost surely. Let E be the multiset of loss functions {LS : S {1, . . . , n + k}, |S| = k}. Using the same argument in the end of the proof of Theorem 1 and the right-continuity of LS, we can show that E h L{n+1,...,n+k}(ˆλ k) | E i = k!n! (n + k)! S {1,...,n+k},|S|=k LS(λ) α. The proof is then completed by the law of iterated expectation.