BATCH MULTIVALID CONFORMAL PREDICTION

Anonymous authors
Paper under double-blind review

We develop fast distribution-free conformal prediction algorithms for obtaining multivalid coverage on exchangeable data in the batch setting. Multivalid coverage guarantees are stronger than marginal coverage guarantees in two ways: (1) They hold conditional on group membership: the target coverage level $1-\alpha$ holds conditionally on membership in each group of an arbitrary (potentially intersecting) finite collection $\mathcal{G}$ of regions in the feature space. (2) They hold conditional on the value of the threshold used to produce the prediction set on a given example. In fact, multivalid coverage guarantees hold even when conditioning on group membership and threshold value simultaneously. We give two algorithms: both take as input an arbitrary non-conformity score and an arbitrary collection of possibly intersecting groups $\mathcal{G}$, and can then equip arbitrary black-box predictors with prediction sets. Our first algorithm, BatchGCP, is a direct extension of quantile regression: it needs to solve only a single convex minimization problem, and produces an estimator with group-conditional guarantees for each group in $\mathcal{G}$. Our second algorithm, BatchMVP, is iterative and gives the full guarantees of multivalid conformal prediction: prediction sets that are valid conditionally both on group membership and on the non-conformity threshold. We evaluate the performance of both algorithms in an extensive set of experiments.

1 INTRODUCTION

Consider an arbitrary distribution $\mathcal{D}$ over a labeled data domain $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. A model is any function $h : \mathcal{X} \to \mathcal{Y}$ for making point predictions. The traditional goal of conformal prediction in the batch setting is to take a small calibration dataset consisting of labeled examples sampled from $\mathcal{D}$ and use it to endow an arbitrary model $h : \mathcal{X} \to \mathcal{Y}$ with prediction sets $T_h(x) \subseteq \mathcal{Y}$ that cover the true label with probability $1-\alpha$ marginally, for some target miscoverage rate $\alpha$:
$$\Pr_{(x,y) \sim \mathcal{D}}[y \in T_h(x)] = 1-\alpha.$$
This is a marginal coverage guarantee because the probability is taken over the randomness of both $x$ and $y$, without conditioning on anything. In the batch setting (unlike in the sequential setting), labels are not available when the prediction sets are deployed. Our goal in this paper is to give simple, practical algorithms in the batch setting that offer stronger-than-marginal guarantees: the kinds of multivalid guarantees introduced by Gupta et al. (2022); Bastani et al. (2022) in the sequential prediction setting.

Following the literature on conformal prediction (Shafer and Vovk, 2008), our prediction sets are parameterized by an arbitrary non-conformity score $s_h : \mathcal{Z} \to \mathbb{R}$ defined as a function of the model $h$. Informally, smaller values of $s_h(x, y)$ should mean that the label $y$ conforms more closely to the prediction $h(x)$ made by the model. For example, in a regression setting in which $\mathcal{Y} = \mathbb{R}$, the simplest non-conformity score is $s_h(x, y) = |h(x) - y|$. By now there is a large literature giving more sophisticated non-conformity scores for both regression and classification problems; see Angelopoulos and Bates (2021) for an excellent recent survey. A non-conformity score function $s_h(x, y)$ induces a distribution over non-conformity scores, and if $\tau$ is the $1-\alpha$ quantile of this distribution (i.e. $\Pr_{(x,y) \sim \mathcal{D}}[s_h(x, y) \le \tau] = 1-\alpha$), then defining prediction sets as $T_h^\tau(x) = \{y : s_h(x, y) \le \tau\}$ gives $1-\alpha$ marginal coverage.
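As a point of reference for what follows, this threshold can be computed from a calibration sample in a few lines. A minimal sketch, assuming a generic pretrained regression model `h` and the score $s_h(x, y) = |h(x) - y|$; the finite-sample quantile correction is the standard one for split conformal, not something specified in this paper.

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha):
    """Empirical (1 - alpha) quantile of calibration nonconformity scores,
    with the standard ceil((n + 1)(1 - alpha)) / n finite-sample correction."""
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, level)

def prediction_interval(h, x, tau):
    """T_h^tau(x) = {y : |h(x) - y| <= tau}, an interval in the regression case."""
    center = h(x)
    return center - tau, center + tau
```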
Split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018) simply finds a threshold $\tau$ that is an empirical $1-\alpha$ quantile on the calibration set, and then deploys the prediction sets $T_h^\tau(x)$ defined above. Our goal is to give stronger coverage guarantees. To do so, rather than learning a single threshold $\tau$ from the calibration set, we will learn a function $f : \mathcal{X} \to \mathbb{R}$ mapping unlabeled examples to thresholds. Such a mapping $f$ induces prediction sets defined as follows:
$$T_h^f(x) = \{y : s_h(x, y) \le f(x)\}.$$
Our goal is to find functions $f : \mathcal{X} \to \mathbb{R}$ that give valid conditional coverage guarantees of two sorts. Let $\mathcal{G}$ be an arbitrary collection of groups: each group $g \in \mathcal{G}$ is some subset of the feature domain, $g \in 2^{\mathcal{X}}$, about which we make no assumption, and we write $g(x) = 1$ to denote that $x$ is a member of group $g$. An example $x$ might be a member of multiple groups in $\mathcal{G}$. We want to learn a function $f$ that induces group conditional coverage guarantees, i.e. such that for every $g \in \mathcal{G}$:
$$\Pr_{(x,y) \sim \mathcal{D}}[y \in T_h^f(x) \mid g(x) = 1] = 1-\alpha.$$
Here we can think of the groups as representing e.g. demographic groups (broken down by race, age, gender, etc.) in settings in which we are concerned about fairness, or any other categories that we think might be relevant to the domain at hand. Since our functions $f$ now map different examples $x$ to different thresholds $f(x)$, we also want our guarantees to hold conditional on the chosen threshold, which we call a threshold calibrated guarantee. This rules out algorithms that achieve their target coverage rates by overcovering at some thresholds and undercovering at others, for example by randomizing between full and empty prediction sets. That is, we want:
$$\Pr_{(x,y) \sim \mathcal{D}}[y \in T_h^f(x) \mid g(x) = 1, f(x) = \tau] = 1-\alpha$$
simultaneously for every $g \in \mathcal{G}$ and every $\tau \in \mathbb{R}$. If $f$ is such that its corresponding prediction sets $T_h^f(x)$ satisfy both group and threshold conditional guarantees simultaneously, then we say that it promises full multivalid coverage.
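To make this target concrete, here is a sketch of how one might audit a threshold model $f$ on held-out data, estimating coverage conditional on a (group, threshold) pair simultaneously; the array names and layout are illustrative assumptions, not the paper's code.

```python
import numpy as np

def groupwise_threshold_coverage(scores, thresholds, group_mask):
    """Empirical Pr[s <= f(x) | g(x) = 1, f(x) = tau] for each threshold tau
    that the model predicts within group g; each should be near 1 - alpha."""
    covered = scores <= thresholds
    coverage = {}
    for tau in np.unique(thresholds[group_mask]):
        cell = group_mask & (thresholds == tau)
        coverage[tau] = covered[cell].mean()
    return coverage
```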
1.1 OUR RESULTS

We design, analyze, and empirically evaluate two algorithms: one giving group conditional guarantees for an arbitrary collection of groups $\mathcal{G}$, and the other giving full multivalid coverage guarantees for an arbitrary collection of groups $\mathcal{G}$. We give PAC-style guarantees (Park et al., 2019), which means that with high probability over the draw of the calibration set, our deployed prediction sets have the desired coverage properties on the underlying distribution. Thus our algorithms also offer training-conditional coverage in the sense of Bian and Barber (2022). We prove our generalization theorems under the assumption that the data is drawn i.i.d. from some distribution, but note that De Finetti's theorem (Ressel, 1985) implies that our analysis carries over to data drawn from any exchangeable distribution (see Remark C.1).

Group Conditional Coverage: BatchGCP. We first give an exceedingly simple algorithm, BatchGCP (Algorithm 1), to find a model $f$ that produces prediction sets $T^f_h$ with group conditional (but not threshold calibrated) coverage guarantees. We consider the class of functions $\mathcal{F} = \{f_\lambda : \lambda \in \mathbb{R}^{|\mathcal{G}|}\}$: each $f_\lambda \in \mathcal{F}$ is parameterized by a vector $\lambda \in \mathbb{R}^{|\mathcal{G}|}$ and takes the value
$$f_\lambda(x) = f_0(x) + \sum_{g \in \mathcal{G} : g(x) = 1} \lambda_g,$$
where $f_0$ is some arbitrary initial model. Our algorithm simply finds the parameters $\lambda$ that minimize the pinball loss of $f_\lambda(x)$. This is a $|\mathcal{G}|$-dimensional convex optimization problem, and so can be solved efficiently using off-the-shelf convex optimization methods. We prove that the resulting function $f_\lambda(x)$ guarantees group conditional coverage. This can be viewed as an extension of conformalized quantile regression (Romano et al., 2019), which is also based on minimizing pinball loss. It can also be viewed as an algorithm promising "quantile multiaccuracy", by analogy to (mean) multiaccuracy introduced in Hébert-Johnson et al. (2018); Kim et al. (2019), and is related to similar algorithms for guaranteeing multiaccuracy (Gopalan et al., 2022b). Here pinball loss takes the role that squared loss plays in (mean) multiaccuracy.

Multivalid Coverage: BatchMVP. We next give a simple iterative algorithm, BatchMVP (Algorithm 2), to find a model $f$ that produces prediction sets $T^f_h$ satisfying both group and threshold conditional guarantees simultaneously, i.e. full multivalid guarantees. It iteratively finds groups $g \in \mathcal{G}$ and thresholds $\tau$ such that the current model fails to have the target coverage conditional on $g(x) = 1$ and $f(x) = \tau$, and then patches the model so that it does. We show that each patch substantially improves the pinball loss of the model, which implies fast convergence. This can be viewed as an algorithm promising "quantile multicalibration", and is an extension of related algorithms for guaranteeing mean multicalibration (Hébert-Johnson et al., 2018), which offer similar guarantees for mean (rather than quantile) prediction. Once again, pinball loss takes the role that squared loss takes in the analysis of mean multicalibration.

Empirical Evaluation. We implement both algorithms and evaluate them on synthetic prediction tasks, and on 10 real datasets derived from US Census data from the 10 largest US states using the Folktables package of Ding et al. (2021). In our synthetic experiments, we measure group conditional coverage with respect to synthetically defined groups that are constructed to be correlated with label noise. On the Census datasets, we aim to ensure coverage on population groups defined by reported race and gender categories. We compare our algorithms to two other split conformal prediction methods: a naive baseline which simply ignores group membership and uses a single threshold, and the method of Foygel Barber et al. (2020), which calibrates a threshold $\tau_g$ for each group $g \in \mathcal{G}$ and, on a new example $x$, predicts the most conservative threshold among all groups $x$ belongs to. We find that both of our methods obtain significantly better group-wise coverage and threshold calibration than the baselines we compare to. Furthermore, both methods are very fast in practice, taking only a few seconds to calibrate on datasets containing tens of thousands of points. We have discussed the most closely related work; see Appendix A for an extended discussion of additional related work.

2 PRELIMINARIES

We study prediction tasks over a domain $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the feature domain and $\mathcal{Y}$ the label domain. We write $\mathcal{G} \subseteq 2^{\mathcal{X}}$ to denote a collection of subsets of $\mathcal{X}$, which we represent as indicator functions $g : \mathcal{X} \to \{0, 1\}$. The label domain might e.g. be real valued ($\mathcal{Y} = \mathbb{R}$) in the regression setting, or consist of some finite unordered set in the multiclass classification setting. Suppose there is a fixed distribution $\mathcal{D}$ over $\mathcal{Z}$. Given such a distribution, we write $\mathcal{D}_{\mathcal{X}}$ to denote the marginal distribution over features induced by $\mathcal{D}$.
We write $\mathcal{D}_{\mathcal{Y}}(x)$ to denote the conditional distribution over labels induced by $\mathcal{D}$ when we condition on a particular feature vector $x$; we sometimes overload notation and write $\mathcal{D}(x) = \mathcal{D}_{\mathcal{Y}}(x)$. Our uncertainty quantification is based on a bounded non-conformity score function $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Non-conformity score functions are generally defined with respect to some model $h$, which is why in the introduction we wrote $s_h$, but our development is entirely agnostic to the specifics of the non-conformity score function, so we write $s$ for simplicity. Without loss of generality, we assume that the scoring function takes values in the unit interval: $s(x, y) \in [0, 1]$ for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Given a distribution $\mathcal{D}$ over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and a non-conformity score function $s$, we write $S$ to denote the induced joint distribution over feature vectors $x$ and corresponding non-conformity scores $s(x, y)$. Analogously to our $\mathcal{D}(x)$ notation, we write $S(x)$ to denote the non-conformity score distribution conditional on a particular feature vector $x$. Given a subset of the feature space $B \subseteq \mathcal{X}$, we write $S|B$ to denote the distribution on non-conformity scores conditional on $x \in B$. Finally, we assume that all conditional non-conformity score distributions $S(x)$ are continuous, which simplifies our treatment of quantiles; if this is not already the case, it can be enforced by perturbing non-conformity scores with arbitrarily small amounts of noise from a continuous distribution.

Definition 2.1. For any $q \in [0, 1]$, we say that $\tau$ is a $q$-quantile of a (continuous) nonconformity score distribution $S$ if $\Pr_{(x,s) \sim S}[s \le \tau] = q$.

Our convergence results are parameterized by the Lipschitz parameter of the CDF of the underlying nonconformity score distribution. Informally speaking, a distribution with a Lipschitz CDF cannot concentrate too much of its mass on an interval of tiny width. Similar assumptions are commonly needed in related work, and can be guaranteed by perturbing non-conformity scores with noise; see e.g. the discussion in Gupta et al. (2022); Bastani et al. (2022).

Definition 2.2. A conditional nonconformity score distribution $S(x)$ is $\rho$-Lipschitz if
$$\Pr_{s \sim S(x)}[s \le \tau'] - \Pr_{s \sim S(x)}[s \le \tau] \le \rho(\tau' - \tau) \quad \text{for all } 0 \le \tau \le \tau' \le 1.$$
A nonconformity score distribution $S$ is $\rho$-Lipschitz if $S(x)$ is $\rho$-Lipschitz for each $x \in \mathcal{X}$.

If we could find a model $f : \mathcal{X} \to [0, 1]$ that on each input $x$ outputs a value $f(x)$ that is a $q$-quantile of the conditional nonconformity score distribution $S(x)$, this would guarantee true conditional coverage at rate $q$: for every $x$, $\Pr_y[y \in T^f(x) \mid x] = q$, where $T^f(x) = \{y : s(x, y) \le f(x)\}$. As this is generally impossible, our aim is to train models $f$ that allow us to make similar claims not conditional on every $x$, but conditional on membership of $x$ in some group $g$ and on the value of $f(x)$. To facilitate learning models $f$ with guarantees conditional on their output values, we learn models $f$ whose range $R(f) = \{f(x) : x \in \mathcal{X}\}$ has finite cardinality $m = |R(f)| < \infty$.

3 ALGORITHMS

3.1 ALGORITHMIC PRELIMINARIES

In this section we establish lemmas that are key to the analysis of both of the algorithms we give. Both of our algorithms rely on analyzing pinball loss as a potential function.

Definition 3.1. The pinball loss at target quantile $q$ for threshold $\tau$ and nonconformity score $s$ is
$$L_q(\tau, s) = (s - \tau)\,q\,\mathbb{1}[s > \tau] + (\tau - s)(1 - q)\,\mathbb{1}[s \le \tau].$$
We write $PB^S_q(f) = \mathbb{E}_{(x,s) \sim S}[L_q(f(x), s)]$ for a model $f : \mathcal{X} \to [0, 1]$ and nonconformity score distribution $S$. When the quantile $q$ and/or the distribution $S$ are clear from context, we write $PB$, $PB_q$, and/or $PB^S$.
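Definition 3.1 translates directly into code; the following minimal sketch (vectorized over a sample, which is our choice rather than anything the paper specifies) is used implicitly by both algorithms below.

```python
import numpy as np

def pinball_loss(tau, s, q):
    """L_q(tau, s) from Definition 3.1, elementwise over scores s."""
    return np.where(s > tau, (s - tau) * q, (tau - s) * (1 - q))

def empirical_pinball(f_vals, scores, q):
    """Empirical PB_q(f): average pinball loss of thresholds f(x_i) on scores s_i."""
    return pinball_loss(f_vals, scores, q).mean()
```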
It is well known that pinball loss is minimized by the function that predicts, for each $x$, the target quantile $q$ of the conditional score distribution given $x$, but we will need a more refined set of statements. First we define a model's marginal quantile consistency error:

Definition 3.2. A $q$-quantile predictor $f : \mathcal{X} \to \mathbb{R}$ has marginal quantile consistency error $\alpha$ if $\left|\Pr_{(x,s) \sim S}[s \le f(x)] - q\right| = \alpha$. If $\alpha = 0$, we say that $f$ satisfies marginal $q$-quantile consistency.

Informally, we need the pinball loss to have both a progress and an anti-progress property: if a model $f$ is far from satisfying marginal quantile consistency, then linearly shifting it so that it becomes marginally quantile consistent should reduce pinball loss substantially. Conversely, if it is marginally quantile consistent, then perturbing it slightly should not increase pinball loss substantially. The next lemma establishes both, under the assumption that the underlying distribution is Lipschitz.

Lemma 3.1. Fix any nonconformity score distribution $S$ that is $\rho$-Lipschitz. Fix any model $f : \mathcal{X} \to \mathbb{R}$ that has marginal quantile consistency error $\alpha$ with respect to target quantile $q$, and let $f'(x) = f(x) + \Delta$, with $\Delta$ chosen such that $f'$ is marginally quantile consistent at quantile $q$. Then
$$-|\Delta|\alpha + \frac{\alpha^2}{2\rho} \;\le\; PB^S_q(f') - PB^S_q(f) \;\le\; -\frac{\alpha^2}{2\rho}.$$

We now define what will be the basic building block of our iterative algorithm for obtaining multivalid coverage, and of the analysis of our algorithm for obtaining group conditional coverage. It is a simple patch operation on a model, which shifts the model's predictions by a constant only on those examples $x$ that lie in some subset $B$ of the feature space:

Definition 3.3 (Patch Operation). Given a model $f$, a subset $B \subseteq \mathcal{X}$, and a value $\Delta \in \mathbb{R}$, define the patched model $f' = \text{Patch}(f, B, \Delta)$ by $f'(x) = f(x) + \Delta$ if $x \in B$, and $f'(x) = f(x)$ otherwise.

We next show that if we have a model $f$ and can identify a large region $B$ on which it is far from satisfying marginal quantile consistency, then patching the model so that it satisfies marginal quantile consistency on $S|B$ substantially improves its pinball loss.

Lemma 3.2. Given some predictor $f : \mathcal{X} \to \mathbb{R}$, suppose we have a set of points $B \subseteq \mathcal{X}$ with
$$\Pr_{(x,s) \sim S}[x \in B] \left( \Pr_{(x,s) \sim S}[s \le f(x) \mid x \in B] - q \right)^2 \ge \alpha \quad \text{and} \quad \Delta = \operatorname*{argmin}_{\Delta' \in \mathbb{R}} \left| \Pr_{(x,s) \sim S}[s \le f(x) + \Delta' \mid x \in B] - q \right|.$$
Then, if $S|B$ is continuous and $\rho$-Lipschitz, $f' = \text{Patch}(f, B, \Delta)$ satisfies $PB^S_q(f') - PB^S_q(f) \le -\frac{\alpha}{2\rho}$.

3.2 BATCHGCP: OBTAINING GROUP CONDITIONAL GUARANTEES

We now give an extremely simple algorithm, BatchGCP (Batch Group-Conditional Predictor), that obtains group conditional (but not threshold calibrated) prediction sets. BatchGCP only needs to solve a single convex optimization problem. Specifically, Algorithm 1 takes as input an arbitrary threshold model $f$ and a collection of groups $\mathcal{G}$, and simply minimizes pinball loss over all linear combinations of $f$ and the group indicator functions $g \in \mathcal{G}$. This is a quantile analogue of a similar algorithm that obtains group conditional mean consistency (Gopalan et al., 2022b).

Algorithm 1: BatchGCP$(f, \mathcal{G}, q, \mathcal{D})$
  Let $\lambda^*$ be a solution to the optimization problem:
  $$\operatorname*{argmin}_{\lambda} \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[ L_q\left( \hat{f}(x; \lambda), s(x, y) \right) \right], \quad \text{where} \quad \hat{f}(x; \lambda) = f(x) + \sum_{g \in \mathcal{G}} \lambda_g \, g(x).$$
  Output $\hat{f}(\cdot; \lambda^*)$.

Theorem 3.1. For any input model $f$, groups $\mathcal{G}$, and $q \in [0, 1]$, Algorithm 1 returns $\hat{f}(x; \lambda^*)$ with
$$\Pr_{(x,y) \sim \mathcal{D}}\left[ y \in T^{\hat{f}}(x) \mid g(x) = 1 \right] = q \quad \text{for every } g \in \mathcal{G}, \quad \text{where } T^{\hat{f}}(x) = \{y : s(x, y) \le \hat{f}(x; \lambda^*)\}.$$
Furthermore, $PB^S_q(\hat{f}(\cdot; \lambda^*)) \le PB^S_q(f)$.

The analysis is simple: if the optimal model $\hat{f}(x; \lambda^*)$ did not satisfy marginal quantile consistency on some $g \in \mathcal{G}$, then by Lemma 3.2, a patch operation on the set $B(g) = \{x : g(x) = 1\}$ would further reduce its pinball loss. By definition, this patch would just shift the model parameter $\lambda^*_g$ and yield a model $\hat{f}(x; \lambda')$ for some $\lambda' \ne \lambda^*$, falsifying the optimality of $\hat{f}(x; \lambda^*)$ among such models.
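Since Algorithm 1 is a single $|\mathcal{G}|$-dimensional convex problem, a few lines of off-the-shelf optimization suffice. A sketch on an empirical sample, assuming a 0/1 group-membership matrix; the choice of scipy's Powell method (pinball loss is convex but nonsmooth) is our illustrative choice, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def batch_gcp(f0_vals, scores, groups, q):
    """BatchGCP sketch: fit offsets lambda_g minimizing empirical pinball loss.

    f0_vals: (n,) initial thresholds f0(x_i)
    scores:  (n,) nonconformity scores s(x_i, y_i)
    groups:  (n, k) 0/1 matrix with groups[i, j] = g_j(x_i)
    """
    def objective(lam):
        f_vals = f0_vals + groups @ lam
        return np.where(scores > f_vals,
                        (scores - f_vals) * q,
                        (f_vals - scores) * (1 - q)).mean()

    lam0 = np.zeros(groups.shape[1])
    lam_star = minimize(objective, lam0, method="Powell").x
    # the returned model maps (f0(x), group indicators g(x)) to a threshold
    return lambda f0_x, g_x: f0_x + g_x @ lam_star
```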
3.3 BATCHMVP: OBTAINING FULL MULTIVALID COVERAGE

In this section, we give a simple iterative algorithm, BatchMVP (Batch Multivalid Predictor), that trains a threshold model $f$ providing full multivalid coverage, i.e. producing prediction sets $T^f(x)$ that are both group conditionally valid and threshold calibrated. To do this, we start by defining quantile multicalibration, a quantile-prediction analogue of (mean) multicalibration as defined in Hébert-Johnson et al. (2018); Gopalan et al. (2022a).

Definition 3.4. The quantile calibration error of a $q$-quantile predictor $f : \mathcal{X} \to [0, 1]$ on group $g$ is
$$Q(f, g) = \sum_{v \in R(f)} \Pr_{(x,s) \sim S}[f(x) = v \mid g(x) = 1] \left( q - \Pr_{(x,s) \sim S}[s \le f(x) \mid f(x) = v, g(x) = 1] \right)^2.$$
We say that $f$ is $\alpha$-approximately $q$-quantile multicalibrated with respect to a group collection $\mathcal{G}$ if $Q(f, g) \le \frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1]}$ for every $g \in \mathcal{G}$.

Conditional on membership in each group $g$, the quantile multicalibration error bounds the expected coverage error when conditioning both on membership in $g$ and on a predicted threshold $f(x) = v$, in expectation over $v$. In particular, it implies (and is stronger than) the following simple worst-case (over $g$ and $v$) bound on coverage error conditional on both $g(x) = 1$ and $f(x) = v$:

Claim 3.1. If $f$ is $\alpha$-approximately quantile multicalibrated with respect to $\mathcal{G}$ and $q$, then for every $g \in \mathcal{G}$ and $v \in R(f)$:
$$\left| \Pr_{(x,y) \sim \mathcal{D}}\left[ y \in T^f(x) \mid g(x) = 1, f(x) = v \right] - q \right| \le \sqrt{\frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1, f(x) = v]}}.$$

Algorithm 2: BatchMVP$(f, \alpha, q, \mathcal{G}, \rho, S, m)$
  Initialize $t = 0$, and define $f_0$ by rounding $f$ onto the grid $[1/m] = \{0, \frac{1}{m}, \ldots, 1\}$: $f_0(x) = \operatorname*{argmin}_{v \in [1/m]} |v - f(x)|$ for $x \in \mathcal{X}$.
  While $f_t$ is not $\alpha$-approximately $q$-quantile multicalibrated with respect to $\mathcal{G}$:
    Let $B_t = \{x : f_t(x) = v_t, g_t(x) = 1\}$, where
    $$(v_t, g_t) \in \operatorname*{argmax}_{(v,g) \in [1/m] \times \mathcal{G}} \; \Pr_{(x,s) \sim S}[f_t(x) = v, g(x) = 1] \left( q - \Pr_{(x,s) \sim S}[s \le f_t(x) \mid f_t(x) = v, g(x) = 1] \right)^2.$$
    Let
    $$\Delta_t = \operatorname*{argmin}_{\Delta \in [1/m]} \left| \Pr_{(x,s) \sim S}[s \le f_t(x) + \Delta \mid x \in B_t] - q \right|.$$
    Update $f_{t+1} \leftarrow \text{Patch}(f_t, B_t, \Delta_t)$ and $t \leftarrow t + 1$.
  Output $f_t$.

Algorithm 2 is simple: given an initial threshold model $f$, a collection of groups $\mathcal{G}$, and a target quantile $q$, it repeatedly checks whether its current model $f_t$ satisfies $\alpha$-approximate quantile multicalibration. If not, it finds a group $g_t$ and a threshold $v_t$ such that $f_t$ predicts a substantially incorrect quantile conditional on both membership of $x$ in $g_t$ and on $f_t(x) = v_t$; such a pair is guaranteed to exist whenever $f_t$ is not approximately quantile multicalibrated. It then fixes this inconsistency and produces a new model $f_{t+1}$ with a patch operation, up to a target discretization parameter $m$ which ensures that the range of $f_{t+1}$ does not grow too large. Under the assumption that the non-conformity score distribution is Lipschitz, each patch operation substantially reduces the pinball loss of the current predictor, which ensures fast convergence to quantile multicalibration.

Theorem 3.2. Suppose $S$ is $\rho$-Lipschitz and continuous, and $m = \frac{8\rho^2}{\alpha}$. After $T \le \frac{32\rho^3}{\alpha^2}$ many rounds, Algorithm 2, BatchMVP$(f, \alpha, q, \mathcal{G}, \rho, S, m)$, returns $f_T$ such that:
1. $PB^S_q(f_T) \le PB^S_q(f_0) - T \frac{\alpha^2}{32\rho^3}$.
2. $f_T$ is $\alpha$-approximately quantile multicalibrated with respect to $\mathcal{G}$ and $q$.
In particular, via Claim 3.1, we have for every $g \in \mathcal{G}$ and $v \in R(f_T)$:
$$\left| \Pr_{(x,y) \sim \mathcal{D}}\left[ y \in T^{f_T}(x) \mid g(x) = 1, f_T(x) = v \right] - q \right| \le \sqrt{\frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1, f_T(x) = v]}}.$$
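A compact sketch of the BatchMVP loop on an empirical sample. The data layout and the simplified halting check (on the single worst cell rather than the per-group sums of Definition 3.4) are our assumptions; the paper's supplementary zip contains the actual implementation.

```python
import numpy as np

def batch_mvp(f0_vals, scores, groups, q, alpha, m, max_rounds=1000):
    """BatchMVP sketch: repeatedly patch the worst (group, threshold) cell.

    Thresholds live on the grid {0, 1/m, ..., 1}. Each round finds the cell
    B_t = {x : f(x) = v, g(x) = 1} with the largest weighted quantile error
    and shifts its thresholds toward the cell's empirical q-quantile.
    """
    grid = np.arange(m + 1) / m
    f = grid[np.argmin(np.abs(f0_vals[:, None] - grid[None, :]), axis=1)]
    for _ in range(max_rounds):
        worst_err, worst_cell = 0.0, None
        for j in range(groups.shape[1]):
            in_g = groups[:, j] == 1
            for v in np.unique(f[in_g]):
                cell = in_g & (f == v)
                err = cell.mean() * (q - (scores[cell] <= v).mean()) ** 2
                if err > worst_err:
                    worst_err, worst_cell = err, cell
        if worst_err <= alpha:            # simplified multicalibration check
            break
        tau = np.quantile(scores[worst_cell], q)             # ideal patch target
        f[worst_cell] = grid[np.argmin(np.abs(grid - tau))]  # snap to the grid
    return f
```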
4 OUT OF SAMPLE GENERALIZATION

We have presented BatchGCP (Algorithm 1) and BatchMVP (Algorithm 2) as if they have direct access to the true data distribution $\mathcal{D}$. In practice, rather than being able to directly access the distribution, we only have a finite calibration set $D = (x_i, y_i)_{i=1}^n$ of $n$ data points sampled i.i.d. from $\mathcal{D}$. In this section, we show that if we run our algorithms on the empirical distribution over the sample $D$, then their guarantees hold not only for the empirical distribution but also, with high probability, for the true distribution $\mathcal{D}$. We include the generalization guarantees for BatchGCP in Appendix C.3 and focus here on the generalization guarantees for BatchMVP.

At a high level, our argument proceeds as follows (although there are a number of technical complications). For any fixed model $f$, by standard concentration arguments its in- and out-of-sample quantile calibration errors are close with high probability for a large enough sample size. Our goal is to apply concentration bounds uniformly to every model $f_T$ that might be output by our algorithm, and then union bound over all of them to get a high probability bound for whichever model happens to be output. Towards this, we show how to bound the total number of distinct models that can be output as a function of $T$, the total number of iterations of the algorithm. This is possible because at each round $t$, the algorithm performs a patch operation parameterized by a group $g_t \in \mathcal{G}$ (where $|\mathcal{G}| < \infty$), a value $v_t \in [1/m]$, and an update value $\Delta_t \in [1/m]$, and thus only a finite number of models $f_{t+1}$ can result from patching the current model $f_t$. The difficulty is that our convergence analysis in Theorem 3.2 gives a convergence guarantee in terms of the smoothness parameter $\rho$ of the underlying distribution, which is not preserved by the empirical distribution over the sample $D$ drawn from $\mathcal{D}$. Hence, to upper bound the number of rounds our algorithm can run for, we need to interleave our generalization theorem with our convergence theorem, arguing that each step taken with respect to the empirical distribution over $D$ makes progress that can be bounded in terms of the smoothness parameter $\rho$ of the underlying distribution $\mathcal{D}$.

We first prove a high probability generalization bound for our algorithm as a function of the number of steps $T$ it takes to converge. This bound holds uniformly for all $T$, and so can be applied with the actual (empirical) convergence time $T$. We let $D = (x_i, y_i)_{i=1}^n$ denote our sample, $S = (x_i, s(x_i, y_i))_{i=1}^n$ our nonconformity score sample, and $\bar{S}_S$ the empirical distribution over $S$. When $S$ is clear from the context, we just write $\bar{S}$.

Figure 1: Performance comparisons across different conformal prediction methods. Group-wise coverage is on the left, and mean prediction-set size by group is on the right (averaged over 50 runs). Error bars show standard deviation.

Theorem 4.1. Suppose $S$ is $\rho$-Lipschitz and $S \sim S^n$. Suppose BatchMVP$(f, \alpha, q, \mathcal{G}, \rho, \bar{S}_S, m)$ (Algorithm 2) runs for $T$ rounds and outputs model $f_T$. Then $f_T$ is $\alpha'$-approximately $q$-quantile multicalibrated with respect to $\mathcal{G}$ on $S$ with probability $1 - \delta$, where
$$\alpha' = \alpha + \sqrt{\frac{3\rho^2 \left( \ln\frac{4\pi^2 T^2}{3\delta} + T \ln\frac{\rho^4 |\mathcal{G}|}{\alpha^2} \right)}{2\alpha n}} + \frac{12\rho^2 \left( \ln\frac{4\pi^2 T^2}{3\delta} + T \ln\frac{\rho^4 |\mathcal{G}|}{\alpha^2} \right)}{\alpha n}.$$
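The bound can be evaluated directly with the empirically observed number of rounds $T$. A sketch of the plug-in computation, taking the constants above (as reconstructed from the Appendix C.4 proof) at face value, with purely illustrative parameter values:

```python
import numpy as np

def theorem_4_1_bound(alpha, rho, n, T, n_groups, delta):
    """Plug-in alpha' from Theorem 4.1 (constants as reconstructed above)."""
    log_term = np.log(4 * np.pi**2 * T**2 / (3 * delta)) \
               + T * np.log(rho**4 * n_groups / alpha**2)
    return (alpha
            + np.sqrt(3 * rho**2 * log_term / (2 * alpha * n))
            + 12 * rho**2 * log_term / (alpha * n))

# e.g. after observing T = 46 rounds on an n = 15000-point calibration set
print(theorem_4_1_bound(alpha=0.05, rho=2.0, n=15_000, T=46, n_groups=20, delta=0.05))
```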
We then prove a worst-case upper bound on the convergence time $T$, which in combination with Theorem 4.1 establishes a worst-case PAC-style generalization bound. We remark that although the theorem contains a large constant, in our experiments our algorithm halts after a small number of iterations $T$ (see Sections 5 and D), and we can directly apply Theorem 4.1 with this empirical value of $T$.

Theorem 4.2. Suppose $S$ is $\rho$-Lipschitz and continuous, $m = \frac{\rho^2}{2\alpha}$, and our calibration set $S \sim S^n$ consists of $n$ i.i.d. samples drawn from $S$, where $n$ is sufficiently large:
$$n \ge 92928 \ln\left(\frac{128\rho^3}{\alpha^2}\right) \ln\left(\frac{\rho^4 |\mathcal{G}|}{\alpha^4}\right).$$
Then BatchMVP$(f, \alpha, q, \mathcal{G}, \rho, \bar{S}_S, m)$ (Algorithm 2) halts after $T \le \frac{8\rho^3}{\alpha^2}$ rounds with probability $1 - \delta$.

Formal generalization arguments are in Appendix C; Theorems 4.1 and 4.2 are proved in Appendix C.4.

5 EXPERIMENTS

5.1 A SYNTHETIC REGRESSION TASK

We first consider a linear regression problem on synthetic data drawn from a feature domain of 10 binary features and 90 continuous features, with each binary feature drawn uniformly at random and each continuous feature drawn from a normal distribution $N(0, \sigma_x^2)$. The 10 binary features are used to define 20 intersecting groups, each depending on the value of a single binary feature. An input's label is specified by an ordinary least squares model with group-dependent noise:
$$y = \langle \theta, x \rangle + N\left(0, \; \sigma^2 + \sum_{i=1}^{10} \sigma_i^2 x_i\right),$$
where each term $\sigma_i^2$ is associated with one binary feature. We set $\sigma_i^2 = i$ for all $i \in [10]$. So the more groups an input $x$ is a member of, the more label noise it has, with larger-index groups contributing larger amounts of label noise. We generate a dataset $\{(x_i, y_i)\}$ of size 40000 using the above-described model, and split it evenly into training and test data. The training data is further split into training data of size 5000 (with which to train a least squares regression model $f$) and calibration data of size 15000 (with which to calibrate our various conformal prediction methods). Given the trained predictor $f$, we use the non-conformity score $s(x, y) = |f(x) - y|$. Next, we define the set of groups $\mathcal{G} = \{g_1, g_2, \ldots, g_{20}\}$ where for each $j \in [20]$, $g_j = \{x \in \mathcal{X} \mid x_{\lfloor (j+1)/2 \rfloor} \equiv_2 j + 1\}$. We run Algorithm 1 (BatchGCP) and Algorithm 2 (BatchMVP with $m = 100$ buckets) to achieve group-conditional and full multivalid coverage guarantees respectively, with respect to $\mathcal{G}$ and with target coverage $q = 0.9$. We compare the performance of these methods to naive split conformal prediction (which, without knowledge of $\mathcal{G}$, uses the calibration data to predict a single threshold) and to the method of Foygel Barber et al. (2020), which predicts a threshold for each group in $\mathcal{G}$ that an input $x$ is part of, and chooses the most conservative one.
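A sketch of the synthetic generator just described; $\theta$ and the random seed are arbitrary choices of ours, while the feature counts, the noise schedule $\sigma_i^2 = i$, and the group definitions follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma_sq = 40_000, 1.0            # sigma^2 is the baseline noise variance
theta = rng.normal(size=100)         # OLS coefficients (arbitrary here)

x_bin = rng.integers(0, 2, size=(n, 10)).astype(float)  # 10 uniform binary features
x_cont = rng.normal(0.0, 1.0, size=(n, 90))             # 90 continuous features
X = np.hstack([x_bin, x_cont])

# group-dependent label noise: variance sigma^2 + sum_i sigma_i^2 x_i, sigma_i^2 = i
noise_var = sigma_sq + x_bin @ np.arange(1, 11, dtype=float)
y = X @ theta + rng.normal(0.0, np.sqrt(noise_var))

# 20 intersecting groups: g_j fixes binary feature floor((j+1)/2) to (j+1) mod 2
groups = np.stack(
    [x_bin[:, (j + 1) // 2 - 1] == (j + 1) % 2 for j in range(1, 21)], axis=1
)
```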
Figure 2: The left figure plots group-wise calibration error (averaged over 50 runs), weighted by group size. Error bars show standard deviation. The right figure is a scatterplot of the number of points associated with each threshold-group pair $(g, \tau_i)$ against the average coverage conditional on that pair, for all $g \in \mathcal{G}$ and all $\tau_i$ in a grid, over all tested conformal prediction methods (consolidating results over all 50 runs). Target coverage is $q = 0.9$.

Figures 1 and 2 present results over 50 runs of this experiment. Notice that both BatchGCP and BatchMVP achieve close to the desired coverage across all groups, with BatchGCP achieving nearly perfect coverage and BatchMVP sometimes deviating from the target coverage rate by as much as 1%. In contrast, the method of Foygel Barber et al. (2020) significantly overcovers on nearly all groups, particularly low-noise groups, and naive split conformal prediction increasingly undercovers on high-noise groups and overcovers on low-noise groups as the expected label noise increases. Group-wise calibration error is high across all groups but the last using the method of Foygel Barber et al. (2020), and naive split conformal has low calibration error on groups where inclusion/exclusion reflects little fluctuation in noise, and higher calibration error on groups where inclusion implies much higher fluctuation in noise. Both BatchGCP and BatchMVP have lower group calibration errors; interestingly, BatchGCP does nearly as well as BatchMVP in this regard despite having no theoretical guarantee on threshold calibration error. A quantile multicalibrated predictor must have low coverage error conditional on group-threshold pairs $(g, \tau_i)$ that appear frequently, and may have high coverage error for pairs that appear infrequently; this is the behavior we see for BatchMVP in Figure 2. On the other hand, for both naive split conformal prediction and the method of Foygel Barber et al. (2020), we see high miscoverage error even for pairs $(g, \tau_i)$ containing a large fraction of the probability mass. We fix the upper limit of allowed iterations $T$ for BatchMVP to 1000, but the algorithm typically converges and halts in many fewer iterations: across the 50 runs of this experiment, the average number of iterations taken to converge was $45.54 \pm 14.77$.

5.2 AN INCOME PREDICTION TASK ON CENSUS DATA

We also compare our methods to naive split conformal prediction and the method of Foygel Barber et al. (2020) on real data from the 2018 Census American Community Survey Public Use Microdata, compiled by the Folktables package (Ding et al., 2021). This dataset records information about individuals, including race, gender, and income. In this experiment, we generate prediction sets for a person's income while aiming for valid coverage on intersecting groups defined by race and gender. The Folktables package provides datasets for all 50 US states. We run experiments on state-wide datasets: for each state, we split the data into 60% training data $D_{\text{train}}$ for the income predictor, 20% calibration data $D_{\text{calib}}$ to calibrate the conformal predictors, and 20% test data $D_{\text{test}}$. After training the income predictor $f$ on $D_{\text{train}}$, we use the non-conformity score $s(x, y) = |f(x) - y|$. There are 9 provided codes for race¹ and 2 codes for sex (1. Male, 2. Female) in the Folktables data.

¹ 1. White alone; 2. Black or African American alone; 3. American Indian alone; 4. Alaska Native alone; 5. American Indian and Alaska Native tribes specified, or American Indian or Alaska Native not specified and no other races; 6. Asian alone; 7. Native Hawaiian and Other Pacific Islander alone; 8. Some Other Race alone; 9. Two or More Races.
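Loading the Census task is straightforward with the Folktables package. A sketch assuming the package's standard `ACSDataSource`/`ACSIncome` interface; note that `ACSIncome`'s default label is a binary income threshold, so taking the raw `PINCP` income column as a regression target is our assumption about the preprocessing.

```python
from folktables import ACSDataSource, ACSIncome
from sklearn.model_selection import train_test_split

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

df = acs_data[ACSIncome.features + [ACSIncome.target]].dropna()
X = df[ACSIncome.features].to_numpy(dtype=float)
y = df["PINCP"].to_numpy(dtype=float)        # raw income as regression target
race, sex = df["RAC1P"].to_numpy(), df["SEX"].to_numpy()

# 60% / 20% / 20% split into train / calibration / test, as in the experiments
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```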
We define groups for five of the nine race codes (disregarding the four with the least amount of data) and for both sex codes. We run all four algorithms (naive split conformal, the method of Foygel Barber et al. (2020), BatchGCP, and BatchMVP with $m = 300$ buckets) with target coverage $q = 0.9$. We ran this experiment on the 10 largest states. Figures 3 and 4 compare the performance of all four algorithms on data taken from California, averaged over 50 different runs; results for all remaining states are presented in Appendix D.2.

Figure 3: Performance across different conformal prediction methods on Folktables California data. Group-wise coverage is on the left, and mean prediction-set sizes for each group are on the right (averaged over 50 runs). Error bars show standard deviation.

Figure 4: The left figure plots group-wise calibration error, weighted by group size (averaged over 50 runs). Error bars show standard deviation. The right figure is a scatterplot of the number of points associated with each threshold-group pair $(g, \tau_i)$ against the average coverage conditional on that pair, for all $g \in \mathcal{G}$ and all $\tau_i$ in a grid, over all tested conformal prediction methods (consolidating results over all 50 runs). Target coverage is $q = 0.9$.

Just as in the synthetic experiments, both BatchGCP and BatchMVP achieve excellent coverage across all groups; in fact we now see nearly perfect coverage for both, with BatchGCP still obtaining slightly better group conditional coverage. In contrast, naive split conformal prediction undercovers on certain groups and overcovers on others, and the method of Foygel Barber et al. (2020) significantly overcovers on some groups (here, groups 4 and 7). The conservative approach also generally yields prediction sets of much larger size. We also see that BatchMVP achieves very low calibration error across all groups, outperforming BatchGCP in this regard. Calibration error is quite irregular across groups for both naive split conformal prediction and the method of Foygel Barber et al. (2020), being essentially zero on certain groups and comparatively much larger on others. The average number of iterations in which BatchMVP converged was $10.64 \pm 1.12$.

Reproducibility statement. Full Python implementations of BatchGCP and BatchMVP can be found in the supplementary zip. Jupyter notebooks that implement each of our experiments are also included in the supplementary zip. For details on our experiments, such as how we generate the data, what conformal scores we use, and how we instantiate BatchMVP and BatchGCP, please see Section 5 and Appendix D, as well as the corresponding Jupyter notebooks. Proofs of all theoretical results are included, either in the main body of the paper or in the designated appendices.

REFERENCES

Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations, 2020.

Osbert Bastani, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Practical adversarial multivalid conformal prediction. Advances in Neural Information Processing Systems, 2022.

Michael Bian and Rina Foygel Barber. Training-conditional coverage for distribution-free predictive inference. arXiv preprint arXiv:2205.03647, 2022.

Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring Adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34, 2021.

Cynthia Dwork, Michael P. Kim, Omer Reingold, Guy N. Rothblum, and Gal Yona. Learning from outcomes: Evidence-based rankings.
In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 106–125. IEEE, 2019.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 2020.

Ira Globus-Harris, Varun Gupta, Christopher Jung, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Multicalibrated regression for downstream fairness. arXiv preprint arXiv:2209.07312, 2022.

Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022a.

Parikshit Gopalan, Michael P. Kim, Mihir A. Singhal, and Shengjia Zhao. Low-degree multicalibration. In Conference on Learning Theory, pages 3193–3234. PMLR, 2022b.

Varun Gupta, Christopher Jung, Georgy Noarov, Mallesh M. Pai, and Aaron Roth. Online multivalid learning: Means, moments, and prediction intervals. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), pages 82:1–82:24, 2022.

Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948, 2018.

Peter Hoff. Bayes-optimal prediction with frequentist coverage control. arXiv preprint arXiv:2105.14045, 2021.

Christopher Jung, Changhwa Lee, Mallesh M. Pai, Aaron Roth, and Rakesh Vohra. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory. PMLR, 2021.

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pages 2564–2572, 2018.

Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. An empirical study of rich subgroup fairness for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 100–109, 2019.

Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.

Sangdon Park, Osbert Bastani, Nikolai Matni, and Insup Lee. PAC confidence sets for deep neural networks via calibrated prediction. In International Conference on Learning Representations, 2019.

Paul Ressel. De Finetti-type theorems: An analytical approach. The Annals of Probability, 13(3):898–922, 1985.

Yaniv Romano, Evan Patterson, and Emmanuel Candès. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.

Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, 2020a.

Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Classification with valid and adaptive coverage.
Advances in Neural Information Processing Systems, 33:3581–3591, 2020b.

Guy N. Rothblum and Gal Yona. Multi-group agnostic PAC learnability. In International Conference on Machine Learning, pages 9107–9115. PMLR, 2021.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.

A ADDITIONAL RELATED WORK

Conformal prediction (see Shafer and Vovk (2008); Angelopoulos and Bates (2021) for excellent surveys) has seen a surge of activity in recent years. One large category of recent work is the development of sophisticated non-conformity scores that yield compelling empirical coverage in various settings when paired with split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2018). This includes Romano et al. (2019), who give a nonconformity score based on quantile regression; Angelopoulos et al. (2020) and Romano et al. (2020b), who give nonconformity scores designed for classification; and Hoff (2021), who gives a nonconformity score that leads to Bayes-optimal coverage when data is drawn from the assumed prior distribution. This line of work is complementary to ours: we give algorithms that can be used as drop-in replacements for split conformal prediction, and can make use of any of these nonconformity scores.

Another line of work has focused on giving group conditional guarantees. Romano et al. (2020a) note the need for group-conditional guarantees with respect to demographic groups when fairness is a concern, and propose separately calibrating on each group in settings in which the groups are disjoint. Foygel Barber et al. (2020) consider the case of intersecting groups, and give an algorithm that separately calibrates on each group and then uses the most conservative group-wise threshold for examples that are members of multiple groups; the result is that this method systematically overcovers. The kind of multivalid prediction sets that we study here were first proposed by Jung et al. (2021) in the special case of prediction intervals, but the algorithm given by Jung et al. (2021), based on calibrating to moments of the label distribution and applying Chebyshev's inequality, also generally leads to overcoverage. Gupta et al. (2022) gave a theoretical derivation of an algorithm obtaining tight multivalid prediction intervals in the sequential adversarial setting, and Bastani et al. (2022) gave a practical algorithm obtaining tight multivalid coverage in the general case, also in the sequential adversarial setting. Although the sequential setting is more difficult in the sense that it makes no distributional assumptions, it also requires that labels be available after predictions are made at test time, in contrast to the batch setting that we study, in which labels are not available. Gupta et al. (2022) give an online-to-batch reduction that requires running the sequential algorithm on a large calibration set, saving the algorithm's internal state at each iteration, and then deploying a randomized predictor that randomizes over the internal states of the algorithm across all rounds of training. This is generally impractical for large datasets; in contrast, we give a direct, simple, deterministic predictor in the batch setting. Our algorithms can be viewed as learning quantile multiaccurate predictors and quantile multicalibrated predictors, respectively, by analogy to multiaccuracy (Kim et al., 2019) and multicalibration (Hébert-Johnson et al.,
2018), which are defined with respect to means instead of quantiles. The analysis is similar, but with pinball loss playing the role played by squared error in (mean) multicalibration. This requires analyzing the evolution of pinball loss under Lipschitzness assumptions on the underlying distribution, a complication that does not arise for means. More generally, our goal of obtaining group conditional guarantees for intersecting groups emerges from the literature on multigroup fairness; see e.g. Kearns et al. (2018); Hébert-Johnson et al. (2018); Kearns et al. (2019); Rothblum and Yona (2021); Dwork et al. (2019); Globus-Harris et al. (2022).

B MISSING DETAILS FROM SECTION 3

B.1 MISSING DETAILS FROM SECTION 3.1

Lemma B.1. Fix any $x \in \mathcal{X}$ and $\tau \in [0, 1]$. Then
$$\frac{d}{d\tau} \int L_q(f(x) + \tau, s) \, dS(x) = \Pr_{s \sim S(x)}[s \le f(x) + \tau] - q.$$

Proof.
$$\frac{d}{d\tau} \int L_q(f(x) + \tau, s) \, dS(x) = \int \frac{d}{d\tau} L_q(f(x) + \tau, s) \, dS(x) = \int_0^{f(x)+\tau} (1 - q) \, dS(x) + \int_{f(x)+\tau}^1 (-q) \, dS(x) = \Pr_{s \sim S(x)}[s \le f(x) + \tau] - q. \qquad \blacksquare$$

Lemma 3.1 (restated). Fix any nonconformity score distribution $S$ that is $\rho$-Lipschitz. Fix any model $f : \mathcal{X} \to \mathbb{R}$ that has marginal quantile consistency error $\alpha$ with respect to target quantile $q$, and let $f'(x) = f(x) + \Delta$, with $\Delta$ chosen such that $f'$ is marginally quantile consistent at quantile $q$. Then
$$-|\Delta|\alpha + \frac{\alpha^2}{2\rho} \le PB^S_q(f') - PB^S_q(f) \le -\frac{\alpha^2}{2\rho}.$$

Proof. We have
$$PB^S_q(f') - PB^S_q(f) = \int_{\mathcal{X} \times [0,1]} \left( L_q(f(x) + \Delta, s) - L_q(f(x), s) \right) S(dx, ds) = \int_0^\Delta \left( \Pr_{(x,s) \sim S}[s \le f(x) + \tau] - q \right) d\tau,$$
where we exchange the order of integration and apply Lemma B.1. For convenience, write $H_{S,f}(\tau) = \Pr_{(x,s) \sim S}[s \le f(x) + \tau]$, so that $PB^S_q(f') - PB^S_q(f) = \int_0^\Delta H_{S,f}(\tau) \, d\tau - \Delta q$. Note that
$$\int_0^\Delta H_{S,f}(\tau) \, d\tau = \int_{-\Delta}^0 H_{S,f'}(\tau) \, d\tau,$$
since $f'(x) = f(x) + \Delta$: instead of sweeping the area under the curve from $f(x)$ to $f(x) + \Delta$, we can sweep the area under the curve from $f'(x) - \Delta$ to $f'(x)$.

Lemma B.2. Fix any nonconformity score distribution $S$ that is $\rho$-Lipschitz, $\Delta > 0$, and $f : \mathcal{X} \to \mathbb{R}$. Then
$$\Delta \cdot H_{S,f}(0) + \frac{(H_{S,f}(\Delta) - H_{S,f}(0))^2}{2\rho} \le \int_0^\Delta H_{S,f}(\tau) \, d\tau \le \Delta \cdot H_{S,f}(\Delta) - \frac{(H_{S,f}(\Delta) - H_{S,f}(0))^2}{2\rho}.$$

Proof. For simplicity write $H(\tau) = H_{S,f}(\tau)$. Note that $H(\tau)$ is a non-negative function that is increasing in $\tau$, and by $\rho$-Lipschitzness it increases at rate at most $\rho$. First, we upper bound the area. The maximum area is achieved when $H$ increases linearly from $y = H(0)$ to $y = H(\Delta)$ between $x = 0$ and $x = (H(\Delta) - H(0))/\rho$, and stays flat afterwards, as depicted in Figure 5. This area equals the area of the rectangle from $x = 0$ to $x = \Delta$ and from $y = 0$ to $y = H(\Delta)$, minus the area of the triangle with legs $(H(\Delta) - H(0))/\rho$ and $H(\Delta) - H(0)$:
$$\int_0^\Delta H(\tau) \, d\tau \le \Delta \cdot H(\Delta) - \frac{(H(\Delta) - H(0))^2}{2\rho}.$$
Now we lower bound the area. The area under the curve is minimized when $H$ stays flat at $H(0)$ and then increases linearly from $H(0)$ to $H(\Delta)$ between $x = \Delta - (H(\Delta) - H(0))/\rho$ and $x = \Delta$. This area equals the area of the rectangle from $x = 0$ to $x = \Delta$ and from $y = 0$ to $y = H(0)$, plus the area of the corresponding triangle:
$$\int_0^\Delta H(\tau) \, d\tau \ge \Delta \cdot H(0) + \frac{(H(\Delta) - H(0))^2}{2\rho}. \qquad \blacksquare$$

Figure 5: Upper and lower bounding the local area under $H$. The max-area CDF curve is in dashed blue, the min-area CDF curve is in solid purple.

Returning to the proof of Lemma 3.1:

Case (i) $\Delta \ge 0$: We have $H_{S,f}(\Delta) = q$ and $H_{S,f}(0) = q - \alpha$.
By Lemma B.2,
$$PB_q(f') - PB_q(f) = \int_0^\Delta H_{S,f}(\tau)\,d\tau - \Delta q \le \Delta \cdot H_{S,f}(\Delta) - \frac{(H_{S,f}(\Delta) - H_{S,f}(0))^2}{2\rho} - \Delta q = -\frac{\alpha^2}{2\rho}.$$
On the other hand,
$$PB_q(f') - PB_q(f) \ge \Delta \cdot H_{S,f}(0) + \frac{(H_{S,f}(\Delta) - H_{S,f}(0))^2}{2\rho} - \Delta q = \Delta(q - \alpha) + \frac{\alpha^2}{2\rho} - \Delta q = -\Delta\alpha + \frac{\alpha^2}{2\rho}.$$

Case (ii) $\Delta < 0$: We have $H_{S,f}(\Delta) = H_{S,f'}(0) = q$ and $H_{S,f}(0) = H_{S,f'}(\Delta') = q + \alpha$, where $\Delta' = -\Delta$. We have
$$PB_q(f') - PB_q(f) = -\left( \int_0^{\Delta'} H_{S,f'}(\tau)\,d\tau - \Delta' q \right) \le -\left( \Delta' \cdot H_{S,f'}(0) + \frac{(H_{S,f'}(\Delta') - H_{S,f'}(0))^2}{2\rho} - \Delta' q \right) = -\frac{\alpha^2}{2\rho}.$$
On the other hand,
$$PB_q(f') - PB_q(f) \ge -\left( \Delta' \cdot H_{S,f'}(\Delta') - \frac{\alpha^2}{2\rho} - \Delta' q \right) = -\Delta'(q + \alpha) + \frac{\alpha^2}{2\rho} + \Delta' q = -\Delta'\alpha + \frac{\alpha^2}{2\rho}.$$
Combining the two cases yields the lemma. $\blacksquare$

Lemma 3.2 (restated). Given some predictor $f : \mathcal{X} \to \mathbb{R}$, suppose we have a set of points $B \subseteq \mathcal{X}$ with $\Pr_{(x,s) \sim S}[x \in B] \left( \Pr_{(x,s) \sim S}[s \le f(x) \mid x \in B] - q \right)^2 \ge \alpha$ and $\Delta = \operatorname*{argmin}_{\Delta' \in \mathbb{R}} \left| \Pr_{(x,s) \sim S}[s \le f(x) + \Delta' \mid x \in B] - q \right|$. Then, if $S|B$ is continuous and $\rho$-Lipschitz, $f' = \text{Patch}(f, B, \Delta)$ satisfies $PB^S_q(f') - PB^S_q(f) \le -\frac{\alpha}{2\rho}$.

Proof. Note that $f$'s marginal quantile consistency error with respect to target quantile $q$, as measured on $S|B$, is
$$\left| \Pr_{(x,s) \sim S|B}[s \le f(x)] - q \right| \ge \sqrt{\frac{\alpha}{\Pr_{(x,s) \sim S}[x \in B]}}.$$
Also, since $S|B$ is continuous, we have $q = \Pr_{(x,s) \sim S|B}[s \le f(x) + \Delta]$; in other words, $f'$ satisfies marginal quantile consistency for $q$ as measured on $S|B$. Applying Lemma 3.1 to $S|B$, we have
$$PB^S_q(f') - PB^S_q(f) = \Pr_{(x,s) \sim S}[x \in B] \left( PB^{S|B}_q(f') - PB^{S|B}_q(f) \right) \le -\Pr_{(x,s) \sim S}[x \in B] \cdot \frac{\alpha}{2\rho \Pr_{(x,s) \sim S}[x \in B]} = -\frac{\alpha}{2\rho}. \qquad \blacksquare$$

B.2 MISSING DETAILS FROM SECTION 3.2

Theorem 3.1 (restated). For any input model $f$, groups $\mathcal{G}$, and $q \in [0, 1]$, Algorithm 1 returns $\hat{f}(x; \lambda^*)$ with $\Pr_{(x,y) \sim \mathcal{D}}[y \in T^{\hat{f}}(x) \mid g(x) = 1] = q$ for every $g \in \mathcal{G}$, where $T^{\hat{f}}(x) = \{y : s(x, y) \le \hat{f}(x; \lambda^*)\}$. Furthermore, $PB^S_q(\hat{f}(\cdot; \lambda^*)) \le PB^S_q(f)$.

Proof. Suppose $\hat{f}(x; \lambda^*)$ is not marginally quantile consistent on some $g' \in \mathcal{G}$, i.e.
$$\Pr_{(x,s) \sim S}[g'(x) = 1] \left( \Pr_{(x,s) \sim S}[s \le \hat{f}(x; \lambda^*) \mid g'(x) = 1] - q \right)^2 > 0.$$
In other words, there exists some $\Delta \ne 0$ such that $\Pr_{(x,s) \sim S}[s \le \hat{f}(x; \lambda^*) + \Delta \mid g'(x) = 1] = q$. Suppose we patch $\hat{f}(\cdot; \lambda^*)$ to $f' = \text{Patch}(\hat{f}(\cdot; \lambda^*), \{x : g'(x) = 1\}, \Delta)$, which can be rewritten as
$$f'(x) = \hat{f}(x; \lambda^*) + \Delta \cdot \mathbb{1}[g'(x) = 1] = f(x) + \sum_{g \in \mathcal{G}} \lambda'_g \, g(x),$$
where $\lambda'_{g'} = \lambda^*_{g'} + \Delta$ and $\lambda'_g = \lambda^*_g$ for all other $g \ne g'$. Lemma 3.2 yields $PB^S_q(\hat{f}(\cdot; \lambda')) - PB^S_q(\hat{f}(\cdot; \lambda^*)) < 0$, which contradicts the fact that $\lambda^*$ is an optimal solution to $\min_\lambda \mathbb{E}_{(x,y) \sim \mathcal{D}}[L_q(\hat{f}(x; \lambda), s(x, y))]$. $\blacksquare$

B.3 MISSING DETAILS FROM SECTION 3.3

Claim 3.1 (restated). If $f$ is $\alpha$-approximately quantile multicalibrated with respect to $\mathcal{G}$ and $q$, then $\left| \Pr_{(x,y) \sim \mathcal{D}}[y \in T^f(x) \mid g(x) = 1, f(x) = v] - q \right| \le \sqrt{\alpha / \Pr_{(x,s) \sim S}[g(x) = 1, f(x) = v]}$ for every $g \in \mathcal{G}$ and $v \in R(f)$.

Proof. Consider any $g \in \mathcal{G}$. By assumption,
$$Q(f, g) = \sum_{v \in R(f)} \Pr_{(x,s) \sim S}[f(x) = v \mid g(x) = 1] \left( q - \Pr_{(x,s) \sim S}[s \le f(x) \mid f(x) = v, g(x) = 1] \right)^2 \le \frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1]}.$$
In particular, since each term in the sum is non-negative, we must have for every $v \in R(f)$:
$$\Pr_{(x,s) \sim S}[f(x) = v \mid g(x) = 1] \left( q - \Pr_{(x,s) \sim S}[s \le f(x) \mid f(x) = v, g(x) = 1] \right)^2 \le \frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1]}.$$
Dividing both sides by $\Pr_{(x,s) \sim S}[f(x) = v \mid g(x) = 1]$, this is equivalent to
$$\left( q - \Pr_{(x,s) \sim S}[s \le f(x) \mid f(x) = v, g(x) = 1] \right)^2 \le \frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1, f(x) = v]}.$$
Taking the square root yields our claim. $\blacksquare$

Theorem 3.2 (restated). Suppose $S$ is $\rho$-Lipschitz and continuous, and $m = \frac{8\rho^2}{\alpha}$. After $T \le \frac{32\rho^3}{\alpha^2}$ many rounds, Algorithm 2 BatchMVP$(f, \alpha, q, \mathcal{G}, \rho, S, m)$ returns $f_T$ such that:
1. $PB^S_q(f_T) \le PB^S_q(f_0) - T \frac{\alpha^2}{32\rho^3}$.
2. $f_T$ is $\alpha$-approximately quantile multicalibrated with respect to $\mathcal{G}$ and $q$.
In particular, via Claim 3.1, we have for every $g \in \mathcal{G}$ and $v \in R(f_T)$:
$$\left| \Pr_{(x,y) \sim \mathcal{D}}[y \in T^{f_T}(x) \mid g(x) = 1, f_T(x) = v] - q \right| \le \sqrt{\frac{\alpha}{\Pr_{(x,s) \sim S}[g(x) = 1, f_T(x) = v]}}.$$

Proof.

(1) Marginal quantile consistency error in each round $t$: At any round $t$, if $f_t$ is not $\alpha$-approximately quantile multicalibrated with respect to $\mathcal{G}$ and $q$, then
$$\Pr_{(x,s) \sim S}[x \in B_t] \left( q - \Pr_{(x,s) \sim S}[s \le f_t(x) \mid x \in B_t] \right)^2 \ge \frac{\alpha}{m + 1},$$
as the maximum over the $m + 1$ cells of the violated group is at least their average, which is greater than $\frac{\alpha}{m+1}$.

(2) Decomposing the patch operation into two patch operations: Write
$$\bar{\Delta}_t = \operatorname*{argmin}_{\Delta \in [0,1]} \left| \Pr_{(x,s) \sim S}[s \le f_t(x) + \Delta \mid x \in B_t] - q \right|$$
to denote how much we would have patched $f_t$ by if we actually optimized over the unit interval. Then we can divide the patch operation into two, where
$$\tilde{f}_{t+1} = \text{Patch}(f_t, B_t, \bar{\Delta}_t), \qquad f_{t+1} = \text{Patch}(\tilde{f}_{t+1}, B_t, \Delta_t - \bar{\Delta}_t),$$
and bound the change in pinball loss separately:
$$PB^S_q(f_{t+1}) - PB^S_q(f_t) = \underbrace{\left( PB^S_q(f_{t+1}) - PB^S_q(\tilde{f}_{t+1}) \right)}_{(*)} + \underbrace{\left( PB^S_q(\tilde{f}_{t+1}) - PB^S_q(f_t) \right)}_{(**)}.$$
Bounding $(**)$: Lemma 3.2 yields
$$(**) = PB^S_q(\tilde{f}_{t+1}) - PB^S_q(f_t) \le -\frac{\alpha}{2\rho(m+1)}.$$

(3) Bounding $(*)$: Note that $\bar{\Delta}_t \in \left[\frac{i}{m}, \frac{i+1}{m}\right]$ for some $i \in \{0, \ldots, m-1\}$. Because $\Pr_{(x,s) \sim S|B_t}[s \le f_t(x) + \Delta]$ is an increasing function of $\Delta$, we have
$$\left| \Pr_{(x,s) \sim S|B_t}[s \le f_t(x) + \bar{\Delta}_t] - q \right| \le \left| \Pr_{(x,s) \sim S|B_t}\left[s \le f_t(x) + \Delta\right] - q \right| \quad \text{for any } \Delta < \tfrac{i}{m} \text{ or } \Delta > \tfrac{i+1}{m}.$$
In other words, $\Delta_t \in \{\frac{i}{m}, \frac{i+1}{m}\}$, so $|\Delta_t - \bar{\Delta}_t| \le \frac{1}{m}$. Because $S|B_t$ is $\rho$-Lipschitz and $\tilde{f}_{t+1}(x)$ is a $q$-quantile for $S|B_t$, we can bound the marginal quantile consistency error of $f_{t+1}$ against $S|B_t$:
$$\left| \Pr_{(x,s) \sim S|B_t}[s \le f_{t+1}(x)] - q \right| = \left| \Pr_{(x,s) \sim S|B_t}[s \le \tilde{f}_{t+1}(x) + \Delta_t - \bar{\Delta}_t] - \Pr_{(x,s) \sim S|B_t}[s \le \tilde{f}_{t+1}(x)] \right| \le \frac{\rho}{m},$$
as $|\Delta_t - \bar{\Delta}_t| \le \frac{1}{m}$. Applying Lemma 3.1 with shift $\bar{\Delta}_t - \Delta_t$, error $\alpha \le \frac{\rho}{m}$, and $(f, f') = (f_{t+1}, \tilde{f}_{t+1})$, we have
$$(*) = PB^S_q(f_{t+1}) - PB^S_q(\tilde{f}_{t+1}) = \Pr_{(x,s) \sim S}[x \in B_t] \left( PB^{S|B_t}_q(f_{t+1}) - PB^{S|B_t}_q(\tilde{f}_{t+1}) \right) \le \Pr_{(x,s) \sim S}[x \in B_t] \cdot \frac{\rho}{m^2} \le \frac{\rho}{m^2}.$$

(4) Combining $(*)$ and $(**)$: With $m = \frac{8\rho^2}{\alpha}$ we get
$$PB^S_q(f_{t+1}) - PB^S_q(f_t) \le -\frac{\alpha}{2\rho(m+1)} + \frac{\rho}{m^2} \le -\frac{\alpha^2}{32\rho^3}.$$

(5) Guarantees: It directly follows that $PB^S_q(f_T) \le PB^S_q(f_0) - T\frac{\alpha^2}{32\rho^3}$. Since $0 \le PB^S_q(f) \le 1$ for any $f : \mathcal{X} \to [0, 1]$, we note that $f_T$ must be $\alpha$-approximately quantile multicalibrated with respect to $\mathcal{G}$ and $q$ after at most $\frac{32\rho^3}{\alpha^2}$ many rounds. $\blacksquare$

C MISSING DETAILS FROM SECTION 4

C.1 ON THE I.I.D. VERSUS EXCHANGEABILITY ASSUMPTION

Remark C.1. In this section, we prove PAC-like generalization theorems for our algorithms under the assumption that the data is drawn i.i.d. from some underlying distribution. Exchangeability is a weaker condition than independence, requiring only that the probability of observing any sequence of data is permutation invariant. De Finetti's representation theorem (Ressel, 1985) states that any infinite exchangeable sequence of data can be represented as a mixture of constituent distributions, each of which is i.i.d. Thus, our generalization theorems carry over from the i.i.d. setting to the more general exchangeable data setting: we can simply apply our theorems to each (i.i.d.) mixture component in the De Finetti representation of the exchangeable distribution. If for every mixture component the model we learn has low quantile multicalibration error with probability $1 - \delta$ when the data is drawn from that component, then our models will also have low quantile multicalibration error with probability $1 - \delta$ in expectation over the choice of the mixture component.

C.2 HELPFUL CONCENTRATION BOUNDS

Theorem C.1 (Additive Chernoff Bound). Let $\{X_i\}_{i=1}^n$ be independent random variables bounded such that $X_i \in [0, 1]$ for each $i \in [n]$.
Let $S_n = \sum_{i=1}^n X_i$ denote their sum. Then for all $\epsilon > 0$,
$$\Pr_{\{X_i\}_{i=1}^n}\left[ |S_n - \mathbb{E}[S_n]| \ge \epsilon \right] \le 2\exp\left( -\frac{2\epsilon^2}{n} \right).$$

Theorem C.2 (Multiplicative Chernoff Bound). Let $\{X_i\}_{i=1}^n$ be independent random variables bounded such that $X_i \in [0, 1]$ for each $i \in [n]$. Let $S_n = \sum_{i=1}^n X_i$ denote their sum. Then for all $\eta > 0$,
$$\Pr_{\{X_i\}_{i=1}^n}\left[ |S_n - \mathbb{E}[S_n]| \ge \eta \, \mathbb{E}[S_n] \right] \le 2\exp\left( -\frac{\mathbb{E}[S_n]\,\eta^2}{3} \right).$$

Lemma C.1. Fix any $B \subseteq \mathcal{X}$. Given $S = \{(x_i, s_i)\}_{i=1}^n \sim S^n$, with probability $1 - \delta$ we have
$$\left| \frac{1}{n}\sum_{i=1}^n \mathbb{1}[x_i \in B] - \Pr_{(x,s) \sim S}[x \in B] \right| \le \sqrt{\frac{3\ln(\frac{4}{\delta}) \Pr_S[x \in B]}{n}}.$$

Proof. This is a direct application of the multiplicative Chernoff bound (Theorem C.2) where we set $\eta = \sqrt{\frac{3\ln(4/\delta)}{n \Pr_S[x \in B]}}$. $\blacksquare$

Lemma C.2. Fix any $B \subseteq \mathcal{X}$ and $f : \mathcal{X} \to [0, 1]$. Given $S = \{(x_i, s_i)\}_{i=1}^n \sim S^n$, with probability $1 - \delta$ we have
$$\left| \frac{1}{n}\sum_{i=1}^n \mathbb{1}[s_i \le f(x_i), x_i \in B] - \Pr_{(x,s) \sim S}[s \le f(x), x \in B] \right| \le \sqrt{\frac{3\ln(\frac{4}{\delta}) \Pr_S[s \le f(x), x \in B]}{n}} \le \sqrt{\frac{3\ln(\frac{4}{\delta}) \Pr_S[x \in B]}{n}}.$$

Proof. This is a direct application of the multiplicative Chernoff bound (Theorem C.2) where we set $\eta = \sqrt{\frac{3\ln(4/\delta)}{n \Pr_S[s \le f(x), x \in B]}}$. $\blacksquare$

Lemma C.3. Fix any $B \subseteq \mathcal{X}$ and $f : \mathcal{X} \to [0, 1]$. Suppose $\sqrt{\frac{3\ln(4/\delta)}{n \Pr[x \in B]}} \le \frac{1}{2}$. Given $S \sim S^n$, with probability $1 - \delta$:
$$\left| \Pr_{(x,s) \sim \bar{S}|B}[s \le f(x)] - \Pr_{(x,s) \sim S|B}[s \le f(x)] \right| \le 5\sqrt{\frac{3\ln(\frac{4}{\delta})}{n \Pr_S[x \in B]}}.$$

Proof. Note that
$$\Pr_{(x,s) \sim \bar{S}|B}[s \le f(x)] = \frac{\frac{1}{n}\sum_{i=1}^n \mathbb{1}[s_i \le f(x_i), x_i \in B]}{\frac{1}{n}\sum_{i=1}^n \mathbb{1}[x_i \in B]}.$$
Lemmas C.1 and C.2 together give that with probability $1 - \delta$, writing $\epsilon = \sqrt{\frac{3\ln(4/\delta)\Pr_S[x \in B]}{n}}$, the numerator is within $\epsilon$ of $\Pr_S[s \le f(x), x \in B]$ and the denominator is within $\epsilon$ of $\Pr_S[x \in B]$. Therefore
$$\Pr_{(x,s) \sim \bar{S}|B}[s \le f(x)] \le \frac{\Pr_S[s \le f(x), x \in B] + \epsilon}{\Pr_S[x \in B] - \epsilon} \le \left( \Pr_{S|B}[s \le f(x)] + \frac{\epsilon}{\Pr_S[x \in B]} \right)\left( 1 + \frac{2\epsilon}{\Pr_S[x \in B]} \right) \le \Pr_{S|B}[s \le f(x)] + 5\sqrt{\frac{3\ln(\frac{4}{\delta})}{n \Pr_S[x \in B]}},$$
where the middle inequality relies on the assumption $\sqrt{\frac{3\ln(4/\delta)}{n\Pr[x \in B]}} \le \frac{1}{2}$ to apply $\frac{1}{1-z} \le 1 + 2z$ for $0 \le z \le \frac{1}{2}$, and the final inequality uses $\frac{3\ln(4/\delta)}{n\Pr_S[x \in B]} \le \sqrt{\frac{3\ln(4/\delta)}{n\Pr_S[x \in B]}}$. The opposite direction is symmetric. $\blacksquare$

Lemma C.4. Fix any $B \subseteq \mathcal{X}$ and $f : \mathcal{X} \to [0, 1]$. Suppose $\sqrt{\frac{3\ln(4/\delta)}{n \Pr[x \in B]}} \le \frac{1}{2}$. Given $S \sim S^n$, with probability $1 - \delta$, for any $q \in [0, 1]$:
$$\left| \left( q - \Pr_{(x,s) \sim S|B}[s \le f(x)] \right)^2 - \left( q - \Pr_{(x,s) \sim \bar{S}|B}[s \le f(x)] \right)^2 \right| \le 20\sqrt{\frac{3\ln(\frac{4}{\delta})}{n \Pr_S[x \in B]}}.$$

Proof. Factoring the difference of squares,
$$\left| \left( q - \Pr_{S|B}[s \le f(x)] \right)^2 - \left( q - \Pr_{\bar{S}|B}[s \le f(x)] \right)^2 \right| = \left| \Pr_{\bar{S}|B}[s \le f(x)] - \Pr_{S|B}[s \le f(x)] \right| \cdot \left| 2q - \Pr_{S|B}[s \le f(x)] - \Pr_{\bar{S}|B}[s \le f(x)] \right|,$$
and the claim follows by applying Lemma C.3 to the first factor and bounding the second factor by a constant. $\blacksquare$

Lemma C.5. Fix $B \subseteq \mathcal{X}$, $v \in [\frac{1}{m}]$, $g \in \mathcal{G}$, and $f : \mathcal{X} \to [0, 1]$. Given $S \sim S^n$, with probability $1 - \delta$:
$$\left| \Pr_{(x,s) \sim S}[x \in B] \left( q - \Pr_{(x,s) \sim S|B}[s \le f(x)] \right)^2 - \Pr_{(x,s) \sim \bar{S}}[x \in B] \left( q - \Pr_{(x,s) \sim \bar{S}|B}[s \le f(x)] \right)^2 \right| \le 21\sqrt{\frac{3\ln(\frac{8}{\delta}) \Pr_S[x \in B]}{n}} + \frac{12\ln(\frac{8}{\delta})}{n}.$$

Proof. First, suppose $\sqrt{\frac{3\ln(8/\delta)}{n \Pr_S[x \in B]}} > \frac{1}{2}$. Then $\Pr_S[x \in B] < \frac{12\ln(8/\delta)}{n}$, and by Lemma C.1 each of the two terms is at most $\Pr_S[x \in B] + \sqrt{\frac{3\ln(8/\delta)\Pr_S[x \in B]}{n}}$ with probability $1 - \delta$; since $|a - b| \le \max(a, b)$ for $a, b \in [0, 1]$, the claimed bound follows. Now suppose otherwise, i.e. $\sqrt{\frac{3\ln(8/\delta)}{n\Pr[x \in B]}} \le \frac{1}{2}$. Then Lemmas C.1 and C.4 (each applied with failure probability $\delta/2$) promise that with probability $1 - \delta$:
$$\Pr_{(x,s) \sim \bar{S}}[x \in B] \left( q - \Pr_{(x,s) \sim \bar{S}|B}[s \le f(x)] \right)^2 \le \left( \Pr_S[x \in B] + \epsilon_1 \right)\left( \left( q - \Pr_{S|B}[s \le f(x)] \right)^2 + \epsilon_2 \right) \le \Pr_S[x \in B] \left( q - \Pr_{S|B}[s \le f(x)] \right)^2 + \Pr_S[x \in B]\,\epsilon_2 + \epsilon_1 + \epsilon_1\epsilon_2,$$
where $\epsilon_1 = \sqrt{\frac{3\ln(\frac{8}{\delta}) \Pr_S[x \in B]}{n}}$
and $\epsilon_2 = 20\sqrt{\frac{3\ln(\frac{8}{\delta})}{n \Pr_S[x \in B]}}$. One can then show that
$$\Pr_{(x,s) \sim S}[x \in B]\,\epsilon_2 + \epsilon_1 + \epsilon_1\epsilon_2 \le 21\sqrt{\frac{3\ln(\frac{8}{\delta}) \Pr_S[x \in B]}{n}} + \frac{12\ln(\frac{8}{\delta})}{n}.$$
The opposite direction works the same way. $\blacksquare$

C.3 OUT OF SAMPLE GUARANTEES FOR BATCHGCP

Theorem C.3. Suppose $S$ is $\rho$-Lipschitz, and write $\hat{f}(\cdot; \lambda^*) = \text{BatchGCP}(f, \mathcal{G}, q, \mathcal{D})$ and $\hat{f}(\cdot; \lambda^*_D) = \text{BatchGCP}(f, \mathcal{G}, q, D)$ for some $D = \{(x_i, y_i)\}_{i=1}^n$ drawn from $\mathcal{D}$. Then with probability $1 - \delta$ over the randomness of drawing $D$ from $\mathcal{D}$, for every $g \in \mathcal{G}$:
$$\left| \Pr_{(x,y) \sim \mathcal{D}}\left[ y \in T^{\hat{f}(\cdot, \lambda^*_D)}(x) \mid g(x) = 1 \right] - q \right| \le \sqrt{\frac{\alpha}{\Pr[g(x) = 1]}},$$
where $T^{\hat{f}(\cdot, \lambda^*_D)}(x) = \{y : s(x, y) \le \hat{f}(x; \lambda^*_D)\}$, $q^* = \max(q, 1 - q)$, $b^* = \lceil \max(\|\lambda^*\|_2, \|\lambda^*_D\|_2) \rceil$, and
$$\alpha = 24\rho\, b^* q^* \sqrt{\frac{\ln(\frac{2\pi^2 (b^*)^2}{3\delta}) + |\mathcal{G}|\ln(1 + 2n)}{2n}}.$$

Proof. Fix some $B \in \mathbb{N}$. For any fixed $\lambda$ with $\|\lambda\|_2 \le B$, the additive Chernoff bound (Theorem C.1), after rescaling by the range of the loss, gives that with probability $1 - \delta$ over the randomness of drawing $D = \{(x_i, y_i)\}_{i=1}^n$ from $\mathcal{D}$,
$$\left| \frac{1}{n}\sum_{i=1}^n L_q(\hat{f}(x_i; \lambda), s(x_i, y_i)) - \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[ L_q(\hat{f}(x; \lambda), s(x, y)) \right] \right| \le 2q^* B \sqrt{\frac{\ln(\frac{4}{\delta})}{2n}}, \tag{1}$$
since $\lambda \mapsto L_q(\hat{f}(x; \lambda), s(x, y))$ is $q^*$-Lipschitz. We now show how to union bound the above concentration over all $\lambda$ with $\|\lambda\|_2 \le B$. To do so, we first create a finite $\epsilon$-net of the ball of radius $B$ and union bound over each $\lambda$ in the net; by the Lipschitzness of the pinball loss $L_q$, the empirical pinball loss of every $\lambda$ in the ball must then be concentrated around its distributional loss. We first provide the standard $\epsilon$-net argument:

Lemma C.6. Fix $\epsilon \in \mathbb{R}$ and $B \in \mathbb{N}$. There exists $R_B = \{\lambda_1, \ldots, \lambda_{k_B}\}$ with $k_B \le \left(1 + \frac{2B}{\epsilon}\right)^{|\mathcal{G}|}$ such that the following holds true: for every $\lambda$ with $\|\lambda\|_2 \le B$, there exists $j \in [k_B]$ such that $\|\lambda - \lambda_j\|_2 \le \epsilon$.
C.4 OUT-OF-SAMPLE GUARANTEES FOR BATCHMVP

C.4.1 OUT-OF-SAMPLE QUANTILE MULTICALIBRATION BOUND FOR FIXED T

Theorem 4.1. Suppose $\mathcal S$ is $\rho$-Lipschitz and $S\sim\mathcal S^n$. Suppose Batch MVP$(f, \alpha, q, \mathcal G, \rho, S, m)$ (Algorithm 2) runs for $T$ rounds and outputs model $f^T$. Then $f^T$ is $\alpha'$-approximately $q$-quantile multicalibrated with respect to $\mathcal G$ on $\mathcal S$ with probability $1-\delta$, where
$$\alpha' = \alpha + 30\sqrt{\frac{3\rho^2\left(\ln(\frac{4\pi^2T^2}{3\delta}) + T\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{2\alpha n}} + \frac{12\rho^2\left(\ln(\frac{4\pi^2T^2}{3\delta}) + T\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{\alpha n}.$$

Proof. For each $t\in\mathbb N$, define $\delta_t = \frac{6\delta}{\pi^2 t^2}$. Note that $\sum_{t=1}^\infty\delta_t = \delta$, as $\sum_{t=1}^\infty\frac1{t^2} = \frac{\pi^2}6$.

Fix any $t\in\mathbb N$ and $\delta$. Union-bounding Lemma C.5 over $g\in\mathcal G$, $v\in R(f_t)$, and $f\in C_t$ (the set of all models reachable at round $t$, defined in Section C.4.2, with $|C_t|\le(4m^2|\mathcal G|)^t$ by Lemma C.7), we have that for any $g\in\mathcal G$, with probability $1-\delta$,
$$\sum_{v\in R(f_t)}\left|\Pr_{(x,s)\sim\hat S}[f_t(x)=v, g(x)=1]\left(q-\Pr_{(x,s)\sim\hat S}[s\le f_t(x)\mid f_t(x)=v, g(x)=1]\right)^2 - \Pr_{(x,s)\sim\mathcal S}[f_t(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal S}[s\le f_t(x)\mid f_t(x)=v, g(x)=1]\right)^2\right|$$
$$\le \underbrace{\sum_{v\in R(f_t)}\left(21\sqrt{\frac{3\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)\Pr_{\mathcal S}[f_t(x)=v, g(x)=1]}{n}} + \frac{12\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{n}\right)}_{(\star)}.$$
We can further bound $(\star)$ as
$$(\star) \overset{(\star\star)}{\le} 21\sqrt{\frac{3\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)\Pr_{\mathcal S}[g(x)=1]\,|R(f_t)|}{n}} + \frac{2m\cdot12\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{n}$$
$$\le 30\sqrt{\frac{3\rho^2\left(\ln(\frac8\delta)+t\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{2\alpha n}} + \frac{12\rho^2\left(\ln(\frac8\delta)+t\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{\alpha n},$$
where we used $\Pr_{\mathcal S}[g(x)=1]\le1$, $|R(f_t)| = m+1\le 2m$, $m = \frac{\rho^2}{2\alpha}$ (so that $2m = \frac{\rho^2}\alpha$ and $4m^2|\mathcal G| = \frac{\rho^4|\mathcal G|}{\alpha^2}$), and $21\sqrt2\le30$. Inequality $(\star\star)$ follows from the fact that $h(z) = \sqrt{\frac{3(\ln(8/\delta)+t\ln(4m^2|\mathcal G|))\,z}{n}}$ is concave, and so the optimization problem $\max\sum_{i=1}^{|R(f_t)|} h(z_i)$ subject to $\sum_{i=1}^{|R(f_t)|} z_i = \Pr_{\mathcal S}[g(x)=1]$ is solved at $z_i = \frac{\Pr_{\mathcal S}[g(x)=1]}{|R(f_t)|}$ for each $i\in[|R(f_t)|]$.

In other words, applying the above with failure probability $\delta_t$ at each round and noting $\frac8{\delta_t} = \frac{4\pi^2t^2}{3\delta}$, with probability $1-\sum_{t=1}^\infty\delta_t = 1-\delta$ we simultaneously have for every $t\in\mathbb N$ and $g\in\mathcal G$:
$$\sum_{v\in R(f_t)}\left|\Pr_{(x,s)\sim\hat S}[f_t(x)=v, g(x)=1]\left(q-\Pr_{(x,s)\sim\hat S}[s\le f_t(x)\mid f_t(x)=v, g(x)=1]\right)^2 - \Pr_{(x,s)\sim\mathcal S}[f_t(x)=v,g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal S}[s\le f_t(x)\mid f_t(x)=v, g(x)=1]\right)^2\right|$$
$$\le 30\sqrt{\frac{3\rho^2\left(\ln(\frac{4\pi^2t^2}{3\delta})+t\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{2\alpha n}} + \frac{12\rho^2\left(\ln(\frac{4\pi^2t^2}{3\delta})+t\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{\alpha n}.$$
Finally, when our algorithm halts at round $T$, $f^T$ is $\alpha$-approximately multicalibrated with respect to its empirical distribution $\hat S$. Therefore, by the triangle inequality, for every $g\in\mathcal G$ we have
$$\sum_{v\in R(f^T)}\Pr_{(x,s)\sim\mathcal S}[f^T(x)=v, g(x)=1]\left(q-\Pr_{(x,s)\sim\mathcal S}[s\le f^T(x)\mid f^T(x)=v, g(x)=1]\right)^2 \le \alpha + 30\sqrt{\frac{3\rho^2\left(\ln(\frac{4\pi^2T^2}{3\delta})+T\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{2\alpha n}} + \frac{12\rho^2\left(\ln(\frac{4\pi^2T^2}{3\delta})+T\ln(\frac{\rho^4|\mathcal G|}{\alpha^2})\right)}{\alpha n} = \alpha'.$$
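The quantity that Theorem 4.1 bounds, the $q$-quantile multicalibration error of a bucketed model, can be computed directly from a sample. The sketch below (hypothetical names, not the paper's code) evaluates it for one group.

```python
# A minimal sketch of the empirical q-quantile multicalibration error:
# sum over buckets v of Pr[f=v, g=1] * (q - Pr[s <= f | f=v, g=1])^2.
# `f_vals` is assumed to take values on the grid {0, 1/m, ..., 1}.
import numpy as np

def multicalibration_error(f_vals, scores, group_mask, q):
    """f_vals: (n,) thresholds f(x_i); scores: (n,) non-conformity scores;
    group_mask: (n,) bool membership in group g."""
    n = len(f_vals)
    err = 0.0
    for v in np.unique(f_vals[group_mask]):
        cell = group_mask & (f_vals == v)          # bucket B(v, g)
        cov = np.mean(scores[cell] <= v)           # Pr[s <= f(x) | f=v, g=1]
        err += (cell.sum() / n) * (q - cov) ** 2   # probability-weighted squared error
    return err

# toy usage: a constant threshold at the q-quantile has error close to 0
rng = np.random.default_rng(2)
scores = rng.uniform(size=5000)
f_vals = np.full(5000, 0.9)
print(multicalibration_error(f_vals, scores, np.ones(5000, dtype=bool), q=0.9))
```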
C.4.2 SAMPLE COMPLEXITY OF MAINTAINING CONVERGENCE SPEED OF BATCHMVP

We write $C_0 = \{f_0\}$ and
$$C_{t+1} = \left\{f_{t+1} : f_{t+1}(x) = \mathrm{Patch}(f_t, B(v,g), \Delta),\ \text{where } B(v,g) = \{x : f_t(x)=v,\ g(x)=1\},\ \text{over all } f_t\in C_t,\ v\in[\tfrac1m],\ g\in\mathcal G,\ \Delta\in[\tfrac1m]\right\}$$
to denote the set of all possible models $f_t$ we could obtain at round $t$ of Algorithm 2, regardless of what dataset $S = \{(x_i,s_i)\}_{i=1}^n$ is used as input. A short code sketch of the Patch operation follows Lemma C.7 below.

Lemma C.7. Fixing the initial model $f_0$, the number of distinct models $f_t$ that can arise at round $t$ of Algorithm 2 (quantified over all possible input datasets $S$) is upper bounded by:
$$|C_t| \le \left((m+1)^2|\mathcal G|\right)^t \le \left(4m^2|\mathcal G|\right)^t.$$

Proof. The proof is by induction on $t\in\mathbb N$. By construction we have $|C_0| = 1$. Now, note that $|C_{t+1}| = (m+1)^2|\mathcal G|\,|C_t|$, because we consider all possible combinations of $f_t\in C_t$, $v\in[\frac1m]$, $g\in\mathcal G$, and $\Delta\in[\frac1m]$, and the grid $[\frac1m]$ contains $m+1$ values. Therefore, if $|C_t| = ((m+1)^2|\mathcal G|)^t$ for some $t$, then $|C_{t+1}| = ((m+1)^2|\mathcal G|)^{t+1}$.
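As referenced above, the Patch operation admits a very short implementation. The sketch below (hypothetical names) updates a bucketed threshold model only on the cell $B(v,g)$; per round there are $(m+1)^2|\mathcal G|$ possible $(v, \Delta, g)$ combinations, which is exactly the counting used in Lemma C.7.

```python
# A minimal sketch (not the paper's code) of Patch(f, B(v, g), Delta):
# shift the threshold by Delta only on the cell {x : f(x)=v, g(x)=1}.
import numpy as np

def patch(f_vals, group_mask, v, delta):
    """f_vals: (n,) current thresholds on the grid {0, 1/m, ..., 1};
    group_mask: (n,) bool membership in g. Returns a patched copy."""
    cell = group_mask & np.isclose(f_vals, v)     # the cell B(v, g)
    out = f_vals.copy()
    out[cell] = np.clip(out[cell] + delta, 0.0, 1.0)
    return out
```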
Theorem 4.2. Suppose $\mathcal S$ is $\rho$-Lipschitz and continuous, $m = \frac{\rho^2}{2\alpha}$, and our calibration set $S\sim\mathcal S^n$ consists of $n$ i.i.d. samples drawn from $\mathcal S$, where $n$ is sufficiently large:
$$n \ge \frac{92928}{\alpha^2}\left(\ln\frac{128\rho^3}{\alpha^2\delta} + \frac{8\rho^3}{\alpha^2}\ln\frac{\rho^4|\mathcal G|}{\alpha^2}\right)\max\left(m^2,\ \frac{\rho^6}{\alpha^2}\right).$$
Then Batch MVP$(f,\alpha,q,\mathcal G,\rho,S,m)$ (Algorithm 2) halts after $T\le\frac{8\rho^3}{\alpha^2}$ rounds with probability $1-\delta$.

Proof. Fix any round $t$ at which the algorithm has not yet halted. Because $f_t$ is not $\alpha$-approximately quantile multicalibrated with respect to $\mathcal G$ and $q$ on $\hat S$, the algorithm selects $B_t = B(v_t, g_t)$ with
$$\Pr_{(x,s)\sim\hat S}[x\in B_t]\left(q - \Pr_{(x,s)\sim\hat S}[s\le f_t(x)\mid x\in B_t]\right)^2 \ge \frac{\alpha}{2m},$$
since the total miscalibration of some $g$ exceeds $\alpha$ and is spread over $|R(f_t)|\le 2m$ values of $v$. Now, the patch operation in this round can be decomposed into the following steps:
$$v_t \to v_t + \Delta^*, \quad\text{where } q = \Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)+\Delta^*];$$
$$v_t + \Delta^* \to v_t + \bar\Delta, \quad\text{where } \bar\Delta = \arg\min_{v\in[\frac1m]}|v - \Delta^*|;$$
$$v_t + \bar\Delta \to v_t + \Delta_t.$$
For convenience, we write
$$f'_t(x) = f_t(x) + \Delta^*\,\mathbb 1[x\in B_t], \qquad \bar f_t(x) = f_t(x) + \bar\Delta\,\mathbb 1[x\in B_t], \qquad f_{t+1}(x) = f_t(x) + \Delta_t\,\mathbb 1[x\in B_t].$$
Now, we show that we decrease the empirical pinball loss $PB_{\hat S}$ as we go from $f_t$ to $f_{t+1}$ in each round $t$, with high probability.

(1) $f_t\to f'_t$: First, we can show progress in terms of the pinball loss on $\mathcal S$, and then by a Chernoff bound, we can show that we have made significant progress on $\hat S$ with high probability. More specifically, we must have decreased the pinball loss with respect to $\mathcal S|B_t$ by going from $v_t$ to $v_t+\Delta^*$: note that $\Delta^*$ is chosen to satisfy the target quantile $q$ with respect to $\mathcal S|B_t$, and $\mathcal S$ is $\rho$-Lipschitz. Because the empirical quantile error was significant for $(v_t, g_t)$, we can show that the true quantile error must have been significant. By union-bounding Lemma C.5 over all $f\in C_t$ and using Lemma C.7 to bound the cardinality of $C_t$, we have with probability $1-\delta$,
$$\left|\Pr_{(x,s)\sim\hat S}[B_t]\left(q-\Pr_{(x,s)\sim\hat S|B_t}[s\le f_t(x)]\right)^2 - \Pr_{(x,s)\sim\mathcal S}[B_t]\left(q-\Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)]\right)^2\right|$$
$$\le 21\sqrt{\frac{3\ln(\frac{8|C_t|}\delta)\Pr_{\mathcal S}[x\in B_t]}{n}} + \frac{12\ln(\frac{8|C_t|}\delta)}{n} \le 21\sqrt{\frac{3\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{n}} + \frac{12\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{n} \le 44\sqrt{\frac{3\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{n}},$$
as long as $n \ge 12\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)$. In other words, we have
$$\Pr_{(x,s)\sim\mathcal S}[B_t]\left(q-\Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)]\right)^2 \ge \frac{\alpha}{2m} - 44\sqrt{\frac{3\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{n}}.$$
If $n\ge\frac{92928\,m^2\left(\ln(\frac8\delta)+t\ln(4m^2|\mathcal G|)\right)}{\alpha^2}$, then we have
$$\Pr_{(x,s)\sim\mathcal S}[B_t]\left(q-\Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)]\right)^2 \ge \frac{\alpha}{4m}.$$
As $f'_t$ achieves the target quantile $q$ against $\mathcal S|B_t$ and the quantile error was at least $\frac\alpha{4m}$, applying Lemma 3.2 yields
$$PB_{\mathcal S}(f'_t) \le PB_{\mathcal S}(f_t) - \frac{\alpha}{8m\rho}.$$

(2) $f'_t\to\bar f_t$: Recall that $\bar\Delta$ results from rounding $\Delta^*$ to the nearest grid point in $[\frac1m]$. Because $\mathcal S|B_t$ is $\rho$-Lipschitz and $f_t(\cdot)+\Delta^*$ satisfies the target quantile $q$ for $\mathcal S|B_t$, we can bound the marginal quantile consistency error of $f_t(\cdot)+\bar\Delta$ against $\mathcal S|B_t$ as
$$\left|\Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)+\bar\Delta] - q\right| = \left|\Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)+\bar\Delta] - \Pr_{(x,s)\sim\mathcal S|B_t}[s\le f_t(x)+\Delta^*]\right| \le \frac{\rho}{2m},$$
as $|\bar\Delta-\Delta^*|\le\frac1{2m}$. Note that $\bar f_t(x) = f'_t(x)$ for $x\notin B_t$, and $\bar f_t(x) = f'_t(x) + (\bar\Delta-\Delta^*)$ for $x\in B_t$, where $|\bar\Delta-\Delta^*|\le\frac1{2m}$. Applying Lemma 3.1 with $\Delta = \bar\Delta-\Delta^*$, quantile error at most $\frac\rho{2m}$, and $(f, f') = (\bar f_t, f'_t)$, we have that
$$PB^q_{\mathcal S}(\bar f_t) - PB^q_{\mathcal S}(f'_t) = \Pr_{(x,s)\sim\mathcal S}[x\in B_t]\left(PB_{\mathcal S|B_t}(\bar f_t) - PB_{\mathcal S|B_t}(f'_t)\right) \le \Pr_{(x,s)\sim\mathcal S}[x\in B_t]\cdot\frac{\rho}{8m^2} \le \frac{\rho}{8m^2}.$$
We have so far shown that with probability $1-\delta$,
$$PB_{\mathcal S}(f_{t+1}) - PB_{\mathcal S}(f_t) = \left(PB_{\mathcal S}(f_{t+1}) - PB_{\mathcal S}(\bar f_t)\right) + \left(PB_{\mathcal S}(\bar f_t) - PB_{\mathcal S}(f'_t)\right) + \left(PB_{\mathcal S}(f'_t) - PB_{\mathcal S}(f_t)\right) \le \left(PB_{\mathcal S}(f_{t+1}) - PB_{\mathcal S}(\bar f_t)\right) + \frac{\rho}{8m^2} - \frac{\alpha}{8m\rho}.$$
By union-bounding the Chernoff bound (Theorem C.1) over $C_{t+1}$, we can show that $PB_{\hat S}(f)$ concentrates around $PB_{\mathcal S}(f)$ for every $f\in C_{t+1}$: with probability $1-\delta$, we simultaneously have
$$\left|PB_{\hat S}(f_t) - PB_{\mathcal S}(f_t)\right| \le \sqrt{\frac{\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)}{2n}}, \quad \left|PB_{\hat S}(\bar f_t) - PB_{\mathcal S}(\bar f_t)\right| \le \sqrt{\frac{\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)}{2n}}, \quad \left|PB_{\hat S}(f_{t+1}) - PB_{\mathcal S}(f_{t+1})\right| \le \sqrt{\frac{\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)}{2n}},$$
as $f_t, \bar f_t, f_{t+1}\in C_{t+1}$. In other words, we have with probability $1-2\delta$,
$$PB_{\hat S}(f_{t+1}) - PB_{\hat S}(f_t) \le PB_{\hat S}(f_{t+1}) - PB_{\hat S}(\bar f_t) + \frac{\rho}{8m^2} - \frac{\alpha}{8m\rho} + 4\sqrt{\frac{\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)}{2n}}.$$

(3) $\bar f_t\to f_{t+1}$: Because $\Delta_t$ is chosen to minimize the pinball loss with respect to the empirical distribution over the grid points $[\frac1m]$, the pinball loss against the empirical distribution $\hat S$ must be no larger for $f_{t+1}$ than for $\bar f_t$. We can calculate the derivative of the pinball loss with respect to the patch $\Delta$ as
$$\frac{d}{d\Delta}PB_{\hat S|B_t}(f_t(\cdot)+\Delta) = \frac{d}{d\Delta}\,\mathbb E_{(x,s)\sim\hat S|B_t}\left[L_q(f_t(x)+\Delta, s)\right] = \frac1{|B_t|}\frac{d}{d\Delta}\sum_{i: x_i\in B_t}L_q(f_t(x_i)+\Delta, s_i)$$
$$= \frac1{|B_t|}\left(\sum_{i:\ x_i\in B_t,\ s_i\le f_t(x_i)+\Delta}\frac{d}{d\Delta}L_q(f_t(x_i)+\Delta, s_i) + \sum_{i:\ x_i\in B_t,\ s_i> f_t(x_i)+\Delta}\frac{d}{d\Delta}L_q(f_t(x_i)+\Delta, s_i)\right)$$
$$= \frac1{|B_t|}\left(\sum_{i:\ x_i\in B_t,\ s_i\le f_t(x_i)+\Delta}(1-q)\ -\ \sum_{i:\ x_i\in B_t,\ s_i>f_t(x_i)+\Delta}q\right) = \frac{\left|\{i : x_i\in B_t,\ s_i\le f_t(x_i)+\Delta\}\right|}{|B_t|} - q = \Pr_{(x,s)\sim\hat S|B_t}[s\le f_t(x)+\Delta] - q.$$
Because $PB_{\hat S|B_t}(f_t(\cdot)+\Delta)$ is convex in $\Delta$, minimizing this function over the grid is equivalent to minimizing the absolute value of its derivative, which is how $\Delta_t$ is set. Therefore, $PB_{\hat S|B_t}(f_t(\cdot)+\Delta)$ over $\Delta\in[\frac1m]$ is minimized at $\Delta_t$, and so $PB_{\hat S}(f_{t+1}) \le PB_{\hat S}(\bar f_t)$.

Hence, we have with probability $1-2\delta$ ($\delta$ to argue that there is significant quantile consistency error on $B_t$ with respect to $\mathcal S$, and $\delta$ to argue that the empirical pinball loss concentrates around its expectation),
$$PB_{\hat S}(f_{t+1}) - PB_{\hat S}(f_t) \le PB_{\hat S}(f_{t+1}) - PB_{\hat S}(\bar f_t) + \frac{\rho}{8m^2} - \frac{\alpha}{8m\rho} + 4\sqrt{\frac{\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)}{2n}} \le \frac{\rho}{8m^2} - \frac{\alpha}{8m\rho} + 4\sqrt{\frac{\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)}{2n}}.$$
With our choice of $m = \frac{\rho^2}{2\alpha}$, if $n\ge\frac{512\,\rho^6\left(\ln(\frac2\delta)+t\ln(4m^2|\mathcal G|)\right)}{\alpha^4}$, we have
$$PB_{\hat S}(f_{t+1}) - PB_{\hat S}(f_t) \le -\frac{\alpha^2}{8\rho^3}.$$
Because we decrease the empirical pinball loss by at least $\frac{\alpha^2}{8\rho^3}$ in each round with probability $1-2\delta$, and the empirical pinball loss is bounded in $[0,1]$, we have that with probability $1-2\delta\cdot\frac{8\rho^3}{\alpha^2}$, the algorithm halts by round $T = \frac{8\rho^3}{\alpha^2}$. Taking
$$n \ge \frac{92928}{\alpha^2}\left(\ln\frac8\delta + T\ln(4m^2|\mathcal G|)\right)\max\left(m^2,\ \frac{\rho^6}{\alpha^2}\right),$$
we satisfy all the requirements that we stated previously for $n$ in each round $t\in[T]$ (note that $\frac{512\rho^6}{\alpha^4}\le\frac{92928}{\alpha^2}\cdot\frac{\rho^6}{\alpha^2}$).

If we apply the above with per-round failure probability $\delta' = \frac{\alpha^2}{16\rho^3}\delta$ in place of $\delta$, we have with probability $1-\delta$ that the algorithm halts in $T = \frac{8\rho^3}{\alpha^2}$ rounds, where we require
$$n \ge \frac{92928}{\alpha^2}\left(\ln\frac{128\rho^3}{\alpha^2\delta} + \frac{8\rho^3}{\alpha^2}\ln\frac{\rho^4|\mathcal G|}{\alpha^2}\right)\max\left(m^2,\ \frac{\rho^6}{\alpha^2}\right),$$
as in the theorem statement (using $\ln\frac8{\delta'} = \ln\frac{128\rho^3}{\alpha^2\delta}$ and $4m^2|\mathcal G| = \frac{\rho^4|\mathcal G|}{\alpha^2}$).
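Step (3) of the proof shows that, by convexity, the empirically optimal grid patch $\Delta_t$ is the candidate whose conditional coverage is closest to $q$. A minimal sketch of this selection rule follows (hypothetical names; the exact candidate grid used by Algorithm 2 is an assumption here).

```python
# A minimal sketch of step (3): since the empirical pinball loss on B_t is
# convex in Delta with derivative Pr_{S-hat|B_t}[s <= f(x)+Delta] - q, it
# suffices to pick the grid point whose conditional coverage is closest to q.
import numpy as np

def best_grid_patch(f_cell, s_cell, q, m):
    """f_cell, s_cell: thresholds/scores restricted to B_t. Returns Delta_t."""
    grid = np.arange(-m, m + 1) / m              # assumed candidate patches
    coverages = [(np.mean(s_cell <= f_cell + d), d) for d in grid]
    # minimize |coverage - q|, i.e. the absolute derivative of the pinball loss
    return min(coverages, key=lambda pair: abs(pair[0] - q))[1]
```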
D ADDITIONAL EXPERIMENTS AND DISCUSSION

D.1 A DIRECT TEST OF GROUP-CONDITIONAL QUANTILE CONSISTENCY AND QUANTILE MULTICALIBRATION

In this section we abstract away the non-conformity score and directly evaluate the ability of Batch GCP and Batch MVP to offer group-conditional and multivalid quantile consistency guarantees. We produce a synthetic regression dataset defined to have a set of intersecting groups that are all relevant to label uncertainty. Specifically, the data $\{(x_i, y_i)\}_{i=1}^{10000} \in (\mathbb Z^+\times\mathbb R)^{10000}$ is generated as follows. First, we define our group collection as $\mathcal G = \{g_1,\ldots,g_{15}\}$, where for each $j = 1,\ldots,15$, the group $g_j = \{j, 2j, 3j, \ldots\}\subseteq\mathbb Z^+$ contains all multiples of $j$. Note that group $g_1$ encompasses the entire covariate space, ensuring that our methods will explicitly enforce marginal coverage in addition to group-wise coverage. Each $x_i$ is a random integer sampled uniformly from the range $[1, 5000)$. We let the corresponding label $y_i = \frac{|y'_i|}{|y'_i|+1}\in[0,1]$, where $y'_i$ is distributed as the sum of $n(x_i)$ i.i.d. $N(0,1)$ random variables, and $n(x_i)$ is the number of groups that $x_i$ belongs to (a code sketch of this process is given at the end of this subsection).

After generating our data, we split it into 80% training data $D_{\text{train}}$ and 20% test data $D_{\text{test}}$. We then run both methods on the group collection $\mathcal G$, for target coverage $q = 0.9$, with $m = 100$ buckets. The group coverage obtained by both methods on a sample run can be seen in Figure 6. Generally, both methods achieve the target group-wise coverage level on all groups, with Batch MVP exhibiting less variance in the attained coverage levels across different runs. Meanwhile, the group-wise quantile calibration errors of both models are presented in Figure 7. Both Batch MVP and Batch GCP are very well calibrated on all groups, with calibration errors on the order of $10^{-3}$, even though, unlike Batch MVP, the definition of Batch GCP does not enforce this constraint explicitly.

Figure 6: Per-group coverage of Batch MVP (left) and Batch GCP (right) on a representative run.

Figure 7: Per-group calibration error of Batch MVP and Batch GCP on a representative run.
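The following is a sketch of the data-generating process just described. The only assumed detail is that $y'_i$ is sampled directly as $N(0, n(x_i))$, which has the same distribution as a sum of $n(x_i)$ i.i.d. $N(0,1)$ variables.

```python
# A sketch of the synthetic dataset of Section D.1 (seed and split mechanics
# are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(3)
n_points, n_groups = 10_000, 15

x = rng.integers(1, 5000, size=n_points)                  # x_i uniform in [1, 5000)
# group g_j contains all multiples of j, for j = 1..15; g_1 is the whole space
membership = np.stack([(x % j) == 0 for j in range(1, n_groups + 1)], axis=1)
n_x = membership.sum(axis=1)                              # number of groups containing x_i
y_prime = rng.normal(0.0, np.sqrt(n_x))                   # ~ sum of n(x_i) iid N(0,1)
y = np.abs(y_prime) / (np.abs(y_prime) + 1.0)             # labels in [0, 1]

# 80/20 train/test split as in the text
perm = rng.permutation(n_points)
train_idx, test_idx = perm[: int(0.8 * n_points)], perm[int(0.8 * n_points):]
```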
D.2 FURTHER COMPARISONS ON STATE CENSUS DATA

In Section 5.2 we performed an income-prediction task using census data (Ding et al. (2021)) from the state of California to compare the performance of Batch GCP and Batch MVP against each other, as well as against other commonly used conformal prediction methods. Here, we present results of the same experiment using data from other US states. We selected the ten largest states by population to work with, of which California is one. Results for every state are averaged over 50 runs of the experiment, taking different random splits of the data (to form training, calibration, and test sets) each time. The plots in Figure 8, Figure 9, Figure 10 and Figure 11 compare the performance of all four methods on metrics such as group-wise coverage, prediction-set size, and group-wise quantile calibration error.

Figure 8: Results for group-wise coverage for nine different states (averaged over 50 runs) using different methods of conformal prediction. Error bars show standard deviation. (Panels: Texas, New York, Florida, Pennsylvania, Illinois, Ohio, North Carolina, Georgia, Michigan; the same states appear in Figures 9 through 11.)

D.2.1 HALTING TIME

Recall that our generalization theorem for Batch MVP is in terms of the number of iterations $T$ it runs for before halting. We prove a worst-case upper bound on $T$, but note that empirically Batch MVP halts much sooner (and so enjoys improved theoretical generalization properties). Here we report the average number of iterations $T$ that Batch MVP runs for before halting on each state, averaged over the 50 runs, together with the empirical standard deviation:

Texas: 10.84 ± 1.2059
New York: 11.26 ± 1.2134
Florida: 12.06 ± 1.4752
Pennsylvania: 11.68 ± 2.3189
Illinois: 9.74 ± 1.4941
Ohio: 10.94 ± 1.7482
North Carolina: 13.22 ± 1.5138
Georgia: 12.18 ± 1.6454
Michigan: 11.24 ± 2.0353

Figure 9: Results for group-wise average prediction set size for nine different states (averaged over 50 runs) using different methods of conformal prediction. Error bars show standard deviation.

Figure 10: Results for group-wise calibration error (averaged over 50 runs). Error bars show standard deviation.

Figure 11: Scatterplots of the number of points associated with each threshold-group pair $(g, \tau_i)$ against the average coverage conditional on that pair, for all $g\in\mathcal G$ and all $\tau_i$ in a grid, over all tested conformal prediction methods (consolidating results over all 50 runs), for all nine states. Target coverage is $q = 0.9$.