# the_statistical_scope_of_multicalibration__3a32cacb.pdf

The Statistical Scope of Multicalibration

Georgy Noarov 1 Aaron Roth 1

We make a connection between multicalibration and property elicitation and show that (under mild technical conditions) it is possible to produce a multicalibrated predictor for a continuous scalar property Γ if and only if Γ is elicitable. On the negative side, we show that for non-elicitable continuous properties there exist simple data distributions on which even the true distributional predictor is not calibrated. On the positive side, for elicitable Γ, we give simple canonical algorithms for the batch and the online adversarial setting, that learn a Γ-multicalibrated predictor. This generalizes past work on multicalibrated means and quantiles, and in fact strengthens existing online quantile multicalibration results. To further counter-weigh our negative result, we show that if a property Γ1 is not elicitable by itself, but is elicitable conditionally on another elicitable property Γ0, then there is a canonical algorithm that jointly multicalibrates Γ1 and Γ0; this generalizes past work on mean-moment multicalibration. Finally, as applications of our theory, we provide novel algorithmic and impossibility results for fair (multicalibrated) risk assessment.

1. Introduction

Consider a distribution D over a labeled data domain Z = X R of examples with observable features x X and labels y R. A predictor f : X R is (mean) calibrated if, informally, it correctly estimates the mean label value even conditional on its own predictions: i.e., E(x,y) D[y|f(x) = v] = v for all predictions v. Calibration is a desirable property, but a weak one, since it only refers to the average value of the label, averaged over all examples such that f(x) = v; it might be, for example,

1Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Georgy Noarov <gnoarov@seas.upenn.edu>, Aaron Roth <aaroth@cis.upenn.edu>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

that there is a structured subset of examples G X such that f systematically under-estimates label means for examples x G such a predictor can still be calibrated if it compensates by over-estimating the mean labels for x G.

Multicalibration was introduced by H ebert-Johnson et al. (2018) to strengthen the notion of calibration. A multicalibrated predictor is parameterized by a collection of groups G 2X , and is calibrated not just overall, but also conditional on membership in G for all groups G G. That is, for all v, G, we must have: E(x,y) D[y|f(x) = v, x G] = v.

Multicalibration was generalized from means to moments by Jung et al. (2021). By way of an explicit counterexample, Jung et al. (2021) showed that the variance (and other higher moments) cannot be multicalibrated by themselves but can be multicalibrated jointly with the mean, i.e., as part of a (mean, moment) pair. Later, Gupta et al. (2022) and Jung et al. (2023) showed how to obtain a quantile analogue of multicalibration, which requires that for any target coverage level τ [0, 1], for any v in the range of a predictor f and for any G G: E[1[y f(x)]|f(x) = v, x G] = τ.

Thus, by now we have efficient batch (H ebert-Johnson et al., 2018; Jung et al., 2021; 2023) and online (Gupta et al., 2022; Bastani et al., 2022) multicalibration algorithms for several natural distributional properties (means and quantiles), an impossibility result for the variance and higher moments, and a result showing how to multicalibrate means and moments together despite moments not being multicalibratable on their own (Jung et al., 2021). But are these one-off results, or is there a more general theory of multicalibration for distributional properties i.e., arbitrary functionals Γ : P R mapping any distribution P P to a scalar statistic Γ(P)? To study this question, it is natural to define (cf. Dwork et al. (2022)) that a predictor f is (G, Γ)-multicalibrated for property Γ if for all groups G G and values γ in f s range, it holds that: Γ(D|(f(x) = γ, x G)) = γ, where D|(f(x) = γ, x G) is the data distribution conditioned on the event {x : f(x) = γ, x G}. (When G = {X}, we would simply say that f is Γ-calibrated.)

Our motivation In developing our theory of property multicalibration, we are guided by trying to answer several questions of special significance:

The Statistical Scope of Multicalibration

(1) Which distributional properties of interest are possible to multicalibrate, and which ones are not?

(2) For those properties that are multicalibratable in the batch setting, do we always also have a solution in the online adversarial setting, or is there an online-offline separation?

(3) Both in the batch and in the online setting, can one formulate a natural canonical algorithm with simple and generic performance guarantees, which takes in a description (in some simple format) of any multicalibratable property of interest and outputs a multicalibrated predictor for it?

(4) For practically important properties that are not multicalibratable per se, are there any reasonably general techniques for us to achieve some modified notion of multicalibration? (Cf. the case of the variance, which becomes multicalibratable when paired with the mean as per Jung et al. (2021).)

1.1. Our Results

We give an (almost) complete answer to all these questions by connecting property multicalibration to the well-studied theory of property elicitation.

The modern formulation of property elicitation theory is due to Lambert et al. (2008), but it has been extensively developed in both earlier and later works. In a nutshell, properties are called elicitable if their value on any data distribution can always be directly learned as the minimizer of some loss function over the dataset. For example, means and quantiles are elicitable as they can be solved for, respectively, via least squares and quantile regression. But, for instance, variance is not elicitable. An equivalent (subject to mild assumptions) notion is that of identifiable properties: an identification function (typically a first-order condition on the property s loss function ) tells us if we overor under-estimated the property value, in expectation over the dataset. For example, an expected identification function for a mean predictor fm is simply E(x,y)[fm(x) y], and an expected identification function for a τ-quantile predictor fτ is Pr(x,y)[y fτ(x)] τ, the average overcoverage of fτ. We give the following collection of results.

A Feasibility Criterion: Γ-Multicalibration Is Possible If and Only If Γ Is Elicitable. We provide a very general if-and-only-if characterization that categorizes various distributional properties of interest as possible or not possible to (multi)calibrate. We show (under mild assumptions) that a property Γ is sensible for calibration (see Definition 3.2) if and only if Γ is elicitable, and if and only if Γ is identifiable. See Theorem 3.7 in Section 3. A crucial tool we use is a central result of Steinwart et al. (2014): (under mild conditions) a property Γ is elicitable it is identifiable its level sets are convex. Our key insight, which allows us to invoke this result, is a tight relationship between sensibility for Γ-calibration and the convexity of the level sets of Γ.

Canonical Batch and Online Algorithms. We identify two canonical Γ-multicalibration algorithms for bounded elicitable properties Γ the batch Algorithm 1 and the online adversarial Algorithm 5 and prove convergence guarantees for them. See Sections 4 and F, respectively. Our batch Algorithm 1 naturally extends the known methods for means and quantiles (H ebert-Johnson et al., 2018; Jung et al., 2023). Our online Algorithm 5 generalizes and improves existing online algorithms for means and quantiles (Gupta et al., 2022; Bastani et al., 2022; Lee et al., 2022).

Joint Multicalibration for Conditionally Elicitable Properties. We show that if a property Γ0 is elicitable, and Γ1

is conditionally elicitable given Γ0 (meaning, informally, that conditional on knowing the exact value of Γ0, there is a regression procedure to learn Γ1), then (under technical conditions) the pair (Γ0, Γ1) is jointly multicalibratable using a canonical Algorithm 2. This generalizes (mean, moment) multicalibration of Jung et al. (2021). See Section 5.

Positive and Negative Results on Fair Risk Assessment. Previously, nothing was known about multicalibrating any of the large collection of risk measures beyond quantiles and variances. In Section 6, we begin to fill this gap by applying our theory to derive results about a host of risk measures of central significance in financial risk assessment. We show a general negative result that the large family of distortion risk measures are not multicalibratable, except for means and quantiles (and two other technical quantile variants). On the positive side, we establish that so-called Bayes risks are multicalibratable jointly with the elicitable property whose risk they measure. This is exemplified by Conditional Value at Risk (CVa R), also known as Expected Shortfall (ES) a risk assessment measure of central theoretical and practical significance which, as we show, is not multicalibratable on its own but is multicalibratable jointly with quantiles.

We discuss further related work in Appendix A.

2. Preliminaries

We study prediction problems over labeled datapoints in Z = X Y, where X is a space of feature vectors and Y R is a space of labels. We study both batch (offline) and sequential (online) prediction problems. The online setting will be discussed in Appendix F, and we defer sequential prediction definitions and preliminaries to that section.

In the batch (offline) setting, there is a distribution D Z over labeled examples (x, y) Z. Such a distribution induces a distribution X over features and Y over labels. Keeping D implicit, we let Yx := (Y |{X = x}) be the conditional label distribution given a feature vector x X. Similarly, for any subset G X, we write YG := (Y |{x G}) for the conditional label distribution given x G. A predictor or model is a real-valued mapping f : X R.

The Statistical Scope of Multicalibration

2.1. Distributional Properties

A one-dimensional (distributional) property is a functional Γ : P R, where P is some space of probability distributions. We write RangeΓ R to denote the range of Γ. For our algorithmic Γ-multicalibration results in this paper, we will assume (without further mention) that the property Γ is bounded, and w.l.o.g. rescale Γ so that RangeΓ [0, 1].

Examples of properties include the mean, the median, a τ-quantile, and the variance of a distribution (for further examples see Section 6). What all of these notions have in common is that each of them puts a single real number in correspondence with a given distribution.

Formally Defining P In this paper, the basic assumption we place on any distribution space P is that P is a convex subset of some vector space, so that taking convex combinations over distributions in P is a well-defined operation.

If all distributions P P are defined over the label space Y, we can impose extra structure on P in one of two ways, depending on whether Y is a finite set or not. If |Y| = d < , all P P can be viewed as elements of the d-dimensional simplex (d) Rd, so we view P as a subset of (d) equipped with the Euclidean norm. If Y is infinite, we assume that P WTV, where WTV is a Banach space (i.e. a complete normed vector space; see Diestel & Uhl (1977) for an introduction to Banach spaces) of probability distributions over Y with almost everywhere bounded densities, equipped with the TV norm || ||TV. In particular, WTV is a metric space where the distance between any P1, P2 P is the total variation distance ||P1 P2||TV.

An assumption we will often place on a property Γ is continuity: that is, Γ cannot take drastically different values on very similar distributions. Formally, a continuous property is a functional Γ : P R that is continuous relative to the metric topology on P and the standard topology on R. (A great many properties, including means, variances, quantiles, entropies, etc. are indeed continuous.)

Batch Property Prediction In our batch setting, we consider datasets over X Y, and are interested in training predictors fΓ : X R for various properties Γ of the conditional label distributions Yx Y. Informally, a good predictor fΓ would satisfy fΓ(x) Γ(Yx) for every x X.

Since we study properties Γ : P R of distributions over the dataset s label space Y, we restrict attention to those dataset distributions D (X Y) whose induced label distributions Yx belong to P for all x X, so that the property Γ is at least well-defined on the label distributions Yx of all x X. We formalize this as follows.

Definition 2.1 (P-Compatible Dataset Distribution). Given a family of distributions P Y, we call a dataset distribu-

tion D (X Y) P-compatible if Yx P for all x X: i.e. the induced label distribution given any x belongs to P.

2.2. Property Elicitation and Identification

We are now ready to formally define three related concepts that are the subject of study in the property elicitation literature. These concepts are: elicitability, identifiability, and level set convexity (Cx LS) of distributional properties.

Elicitability Simply put, a property defined on a family of distributions P is called elicitable if its value on any distribution P P can be obtained by minimizing some loss function in expectation over samples from P or, said in the language of statistical learning, by solving a regression problem. As is customary in the elicitation literature, we refer to such loss functions as scoring functions: mathematically, a scoring function is just a function S : R Y R.

Definition 2.2 (Strictly Consistent Scoring Function). Fix a space of probability distributions P. A scoring function S : R Y R is strictly P-consistent for property Γ : P R if: Γ(P) = argminγ R Ey P [S(γ, y)] for all P P. We also say that S elicits Γ.

For brevity, we denote: S(γ, P) = Eγ P [S(γ, y)]. S is called P-order sensitive for Γ if for P P and γ1, γ2 R, |γ1 Γ(P)| < |γ2 Γ(P)| implies S(γ1, P) < S(γ2, P).

Definition 2.3 (Elicitable Property). Fix a space of probability distributions P. A property Γ : P R is said to be elicitable if it has a strictly P-consistent scoring function.

As a basic example of the above definitions, the scoring function defined as S(γ, y) = (γ y)2 elicits distributional means; thus, means are an elicitable property.

Convexity of Level Sets (Cx LS) A simple but deep necessary condition for elicitability due to Osband (1985) is that elicitable properties must have convex level sets (also referred to as Cx LS). This will be key to our characterization of sensibility for calibration via elicitability. Fact 1 (Osband (1985)). Let P be a convex space of probability distributions, and Γ : P R be an elicitable property. Then for all γ RangeΓ, the level set {P P : Γ(P) = γ} is convex: that is, for any P1, P2 with Γ(P1) = Γ(P2) = γ, it holds that Γ(λP1 + (1 λ)P2) = γ for all λ [0, 1].

Identifiability A concept related to the above two is identifiability (Osband, 1985; Gneiting, 2011). It requires that a property have a so-called identification, or id, function:

Definition 2.4 (Identification Function). A function V : R Y R is a P-identification (or id) function for property Γ : P R if for P P, E y P[V (γ, y)] = 0 Γ(P) = γ.

The Statistical Scope of Multicalibration

For brevity, we denote: V (γ, P) = Eγ P [V (γ, y)]. An id function V for Γ is oriented if V (γ, P) > 0 γ > Γ(P).

In other words, identifiability makes it possible to compute the value of a property by setting to zero its expected id function over the distribution. For oriented identification functions, we further have that over- (resp. under-)estimating the property value leads to positive (resp. negative) expected identification function values.

Definition 2.5 (Identifiable Property). Property Γ : P R is identifiable if there is a P-identification function for Γ.

For some intuition on how this relates to elicitability, note that if Γ has a differentiable convex scoring function, then its derivative (in the first argument) is an id function for Γ.

Connecting Elicitability, Identifiability and Cx LS Under mild assumptions, elicitability, identifiability, and Cx LS are in fact equivalent for continuous properties Γ, as shown by Steinwart et al. (2014). While Fact 1 shows, under no assumptions other than P being convex, that Cx LS is a necessary condition for elicitability, the characterization of Steinwart et al. (2014) demonstrates that it is also sufficient, subject to some further assumptions. Formally, this characterization holds for P defined in one of two ways:

Definition 2.6. Let P0 be the subspace of the Banach space WTV containing all probability distributions whose densities are upper-bounded almost everywhere. Let P>0 be the subspace of P0 containing all distributions P P0 whose densities are lower-bounded everywhere by some ϵP > 0.

Theorem 2.7 (Steinwart et al. (2014)). Consider a space of probability distributions P {P0, P>0}. Let Γ : P R be any continuous, strictly locally non-constant1 property. Then the following statements are equivalent:

1. Γ is elicitable.

2. Γ has a bounded non-negative order-sensitive scoring function S.

3. Γ is identifiable and has a bounded, oriented identification function V .

4. Γ has convex level sets (Cx LS): for any γ in the range of Γ, {P P : Γ(P) = γ} is convex.

2.3. (Multi)calibration for Property Predictors

We now give general definitions of calibration and multicalibration for (batch) predictors of any property Γ, extending the notions of mean and quantile calibration of H ebert Johnson et al. (2018) and Jung et al. (2023). A variant of

1 Strictly locally non-constant is a (weak) requirement that for every P in the interior of P with Γ(P) = γ, and any ϵneighborhood Uϵ of P in the metric topology on P, there are distributions P , P Uϵ such that Γ(P ) < γ < Γ(P ).

our definitions first appeared in Dwork et al. (2022) under the name calibration consistency under mixtures.

Fix any dataset over Z = X Y, given by its data distribution D Z. Suppose that, given any features x X, we want to predict the value of property Γ on Yx, the label distribution conditional on x. For this, we procure a Γ-predictor f : X R. We will call this predictor Γ-calibrated if for all γ Rangef, the conditional label distribution given the prediction f(x) = γ indeed has property value γ.

Definition 2.8 (Calibrated Predictor for Property Γ). A Γpredictor f : X R is Γ-calibrated on dataset distribution D (X Y) if for every γ Rangef: Γ(Yf,γ) = γ, where Yf,γ = Y{x:f(x)=γ} is the conditional label distribution induced by D conditional on f(x) = γ.

Now, we extend this definition to that of multicalibration, which offers calibration guarantees for arbitrary collections of subsets ( groups ) of the feature space X.

Definition 2.9 ((G, Γ)-Multicalibrated Γ-Predictor). Fix a collection of groups G 2X . A Γ-predictor f : X R is (G, Γ)-multicalibrated on dataset distribution D (X Y) if for every γ Rangef and G G: Γ(Yf,γ,G) = γ, where Yf,γ,G = Y{x:f(x)=γ,x G} is the conditional label distribution induced by D given f(x) = γ and x G. In other words, a (G, Γ)-multicalibrated predictor f satisfies that, conditional on both the prediction being f(x) = γ and on x being a member of group G, the conditional label distribution indeed has property value γ.

We will later need to work with a definition of approximate multicalibration for predictors with finite range. We adopt an ℓ2-notion of calibration error, generalizing approximate quantile multicalibration as defined in Jung et al. (2023).

Definition 2.10 (α-Approximately (G, Γ)-Multicalibrated Γ-Predictor). Fix a distribution D Z and a collection of groups G 2X . For each G G, let µ(G) = Pr(x,y) D[x G] be the probability mass on group G. A finite-range predictor f : X Rangef is α-approximately (G, Γ)-multicalibrated on D if for all G G: X

γ Rangef Pr (x,y) D[f(x) = γ|x G] (γ Γ(Yf,γ,G))2 α µ(G).

Note that 0-approximate (G, Γ)-multicalibration is equivalent to the (exact) Definition 2.9 of (G, Γ)-multicalibration.

3. Sensibility for Calibration and Elicitability

We are now ready to make a connection between property elicitation and (multi)calibration. First we define the notion of a property Γ being sensible for calibration.

Definition 3.1 (True Distributional Predictor for a Property). Fix a distributional property Γ : P R and a P-compatible

The Statistical Scope of Multicalibration

dataset distribution D. The true distributional predictor f D Γ for Γ on D is defined as f D Γ (x) = Γ(Yx) for x X i.e. the predictor that for every x X gives the correct value of property Γ on the conditional label distribution given x.

Definition 3.2 (Property Sensible for Calibration). Fix a property Γ : P R, and a collection D of P-compatible dataset distributions. We say that Γ is sensible for calibration over D if the true distributional predictor f D Γ is Γ-calibrated on D for all D D.

A key motivation for multicalibration (elaborated on in Dwork et al. (2021)) is that we want to produce a predictor f that is indistinguishable from f D Γ with respect to a class of calibration tests parameterized by G which only makes sense if Γ is sensible for calibration. In general, for properties that are not sensible for calibration, there need not exist calibrated predictors at all (even beyond f D Γ ).

Jung et al. (2021) observed that (in our terminology) variance is not sensible for calibration. We now significantly generalize and tighten this observation into a characterization saying that (under mild assumptions) a property is sensible for calibration if and only if it is elicitable (or, equivalently, is identifiable/has convex level sets).

We begin by showing that if a property Γ : P R does not have convex level sets on P, then it is not sensible for calibration over any family of P-compatible datasets that includes all possible datasets supported on two points in X whose respective label distributions belong to P.

Definition 3.3 (2-Point Dataset Distribution). A dataset distribution D over feature-label pairs (X, Y ) X Y is called 2-point if there exist two feature vectors x1 = x2 X such that Pr D[X {x1, x2}] = 0 and Pr D[X = x1] = 0, Pr D[X = x2] = 0 i.e., exactly two feature vectors have nonzero probability of occurring under distribution D.

Theorem 3.4 (No Cx LS = Not Sensible for Calibration). Fix a property Γ : P R with P convex, and any family D of P-compatible dataset distributions containing all the Pcompatible 2-point dataset distributions. Then, if Γ violates Cx LS on P, it is not sensible for calibration over D.

Proof. Suppose Γ violates Cx LS on P. Then, {P P : Γ(P) = γ} is nonconvex for some γ RangeΓ. Thus, there exist distributions Y1, Y2 P such that Γ(Y1) = Γ(Y2) = γ but Γ(λY1 + (1 λ)Y2) = γ for some λ [0, 1]. Now construct a 2-point dataset distribution D D as follows: have it supported on any two feature vectors x1 = x2 X, and set Yx1 = Y1, Yx2 = Y2, and Pr[X = x1] = λ = 1 Pr[X = x2]. Then, its true distributional predictor f D Γ is not Γ-calibrated, as Γ(Yf D Γ ,γ) = γ.

To prove the converse, we impose a weak and natural regularity assumption on the dataset distribution D: we require

that the mapping from features x to the corresponding Yx induced by D be just well-behaved enough that the label distributions YG over any G X are well-defined as mixtures over the individual label distributions Yx for x G.

In the case of |Y| < , the well-behaved nature of the mapping x Yx can be formalized by requiring it to be Lebesgue measurable. In the case when |Y| = , the space WTV that each Yx belongs to is a Banach space and in this setting, the notions of Lebesgue measurability and integrability (which are only defined in finite-dimensional Euclidean spaces) are replaced by the analogous concepts of Bochner measurability and integrability (see e.g. Diestel & Uhl (1977) for formal definitions). Thus, when |Y| = , we assume the map x Yx is Bochner measurable. Definition 3.5 (P-Regular Dataset Distribution). Fix feature space X, label space Y, and a family of probability distributions P over Y. Consider a P-compatible dataset distribution D over X Y. Let ξD : X P be defined by ξD(x) = Yx, i.e. ξD is the mapping x Yx from feature vectors to their label distributions induced by D. Then, D is called P-regular if any of the following is true:

1. X is finite;

2. Y is finite (|Y| < ) and ξD is Lebesgue measurable when P is viewed as a subset of R|Y|;

3. X, Y are infinite and ξD is Bochner measurable when P is viewed as a subset of WTV.

For P-regular datasets D, we now show that for any continuous property Γ that has Cx LS on P, the true distributional predictor f D Γ is Γ-calibrated. (The proof is in Appendix B.) Theorem 3.6 (Cx LS = Sensible for Calibration). Consider a continuous property Γ : P R, and any family D of P-regular dataset distributions. Then, if Γ has convex level sets on P, it is sensible for calibration over D.

Together, Theorems 3.4 and 3.6 show under mild conditions that for continuous properties, sensibility for calibration is equivalent to having convex level sets. To finally link sensibility for calibration to elicitability and identifiability, we can now invoke the above stated Theorem 2.7 of Steinwart et al. (2014), with its extra assumptions on P (see Definition 2.6) and Γ (the nowhere-locally-constant assumption, see Footnote 1), to obtain our final characterization result: Theorem 3.7 (Sensibility for Calibration Elicitability Identifiability Cx LS). Let Γ : P R be a continuous strictly locally non-constant property on a convex space of distributions P {P0, P>0}. Let D be a family of P-regular dataset distributions over Z = X Y, that includes all the P-compatible 2-point dataset distributions.

Then Γ is sensible for calibration over D Γ is elicitable is identifiable has convex level sets.

The Statistical Scope of Multicalibration

Proof. Our Theorems 3.4 and 3.6 establish that Γ is sensible for calibration if and only if it has convex level sets on P, which by Theorem 2.7 (Steinwart et al., 2014) is equivalent to Γ being elicitable, and to Γ being identifiable.

4. Batch Multicalibration

In this section we give a generic batch Γ-multicalibration algorithm for elicitable properties Γ. It generalizes (and is similar to) past multicalibration algorithms designed for specific properties, like means (H ebert-Johnson et al., 2018; Gopalan et al., 2022a) and quantiles (Jung et al., 2023). These algorithms differ in their specifics; we most closely mirror the multicalibration algorithm of Jung et al. (2023).

We henceforth focus on continuous, strictly locally nonconstant, bounded properties Γ, with RangeΓ = [0, 1]. Our canonical batch Algorithm 1, described below, will produce finite-range multicalibrated Γ-predictors f: Namely, for any integer m 1, denoting [1/m] := { 1 m+1, 2 m+1, . . . , m m+1}, Algorithm 1 finds an O( 1

m)-approximately multicalibrated Γ-predictor f with Rangef = [1/m].

Any elicitable Γ, under the just stated assumptions, has a strictly consistent scoring function and an id function (by Theorem 2.7). We will now need a further assumption: Assumption 4.1. Assume Γ : P RangeΓ has an identification function V such that V ( , Y ) is strictly increasing and L-Lipschitz for each label distribution Y P: |V (γ, Y ) V (γ , Y )| L|γ γ | for all γ, γ .

This assumption is arguably mild. Let S be an antiderivative of V , so that V (γ, y) = S(γ,y)

γ . Then, V being strictly increasing is equivalent to S being a convex strictly consistent scoring function for Γ. Such a convexity assumption is quite natural in the context of optimization; furthermore, Finocchiaro & Frongillo (2018) show that for elicitable properties over finite label spaces |Y| < , this is without loss of generality. The extra Lipschitz assumption is what will allow us to quantify our algorithm s convergence rate.

To state Algorithm 1, it is convenient for us to reparameterize our Definition 2.10 of Γ-multicalibration in terms of an id function V for Γ. (By properties of V , as this updated notion of calibration error goes to 0, so will the one in Definition 2.10.) Below, let Y(G,γ) be the label distribution conditional on the event {x X : f(x) = γ, x G}. Definition 4.2 (Approximate (G, V )-Multicalibration). Fix groups G, a distribution D, and an id function V for a property Γ. A finite-range Γ-predictor f : X [0, 1] is α-approximately (G, V )-multicalibrated if for all G G: X

γ Rangef Pr x X[f(x)=γ|x G] (V (γ, Y(G,γ)))2 α Pr x X[x G].

Algorithm 1 is quite natural. While it can, it finds an inter-

section Qt of a group G G and a level set of the current predictor f, such that f s prediction on Qt is too far from the truth (as measured by the magnitude of the expected id function value over Qt) and fixes the situation by shifting f s value on Qt to the best grid point γ [1/m].

Algorithm 1 Batch Multicalibration(Γ, G, m, f, L)

Initialize t = 1 and f1 = f. Let α = 4L2

m , and let V : RangeΓ Y R be an L-Lipschitz id function for Γ satisfying Assumption 4.1. while ft not α-approximately (G, V )-multicalibrated do

Let Qt = {x : ft(x) = γ, x G}, where

(γ, G) argmax

(γ ,G ) [ 1

m ] G Pr x [ft(x)=γ , x G ](V (γ , Y(γ ,G )))2.

Let: γ = argminγ [ 1

m ] |V (γ , YQt)| Update: ft+1(x) := 1[x Qt] ft(x)+1[x Qt] γ

for all x X, and t t + 1. end while Output ft.

Theorem 4.3 (Guarantees of Algorithm 1). Fix data distribution D Z and groups G 2X . Fix an elicitable property Γ with its scoring function S and id function V = S

γ satisfying Assumption 4.1, so that V ( , YQ) is L-Lipschitz on all label distributions YQ (for Q X) induced by D. Set discretization m 1. If Algorithm 1 is initialized with predictor f1 : X R with score E(x,y) D[S(f1(x), y)] = Cinit, and Copt = E(x,y) D[S(f D Γ (x), y)] is the score of the true distributional predictor f D Γ , then Algorithm 1 produces a 4L2

m -approximately (G, V )-multicalibrated Γ-predictor f after at most (Cinit Copt) m2

The proof of Theorem 4.3 is given in Appendix C, and is similar to the analyses of several existing multicalibration algorithms (H ebert-Johnson et al., 2018; Jung et al., 2023; Deng et al., 2023). We show that the expected score ED[S] (where V = S

γ ) decreases at every step of Algorithm 1, making it a potential function for the algorithm. Convergence rates follow by lower-bounding the per-iteration decrease in E[S]. This naturally generalizes, to arbitrary elicitable properties, the existing analyses of mean (H ebert-Johnson et al., 2018) and quantile multicalibration (Jung et al., 2023), which respectively use squared loss (consistent for means) and pinball loss (consistent for quantiles) as potential functions.

We have described Algorithm 1 as able to directly query the expected identification function V on the true data distribution D. In practice, we would instead run it on the empirical distribution over an i.i.d. sample ˆD Dn of n points from D. Appendix D gives finite sample guarantees for this case.

The Statistical Scope of Multicalibration

5. Joint Multicalibration

Now we take a step towards understanding two interrelated issues: (1) how to multicalibrate vector-valued properties, and (2) how to appropriately extend the notion of multicalibration to some practically important scalar properties that have non-convex level sets and are thus neither elicitable nor sensible for calibration by our Theorem 3.4 (e.g. variance).

Specifically, we study the important case of twodimensional properties Γ = (Γ0, Γ1), where Γ0 is elicitable whereas Γ1 is not elicitable per se, but is elicitable conditional on any fixed value of Γ0. We give an algorithm that can produce jointly multicalibrated estimators for such Γ. Definition 5.1 (Conditional Elicitability). We say that a property Γ1 : P R is elicitable conditionally on a property Γ0 : P R, if the restriction of Γ1 to each level set of Γ0 (i.e. to each distribution family Pγ0 = {P P : Γ0(P) = γ0} for γ0 RangeΓ0) is elicitable.

For the elicitable component Γ0, we denote its scoring and identification functions by S0, V 0. For property Γ1 that is elicitable conditionally on Γ0, for each γ0 RangeΓ0 we denote by V 1 γ0 : RangeΓ1 Y R a function that identifies Γ1 on every distribution P such that Γ0(P) = γ0, and by S1 γ0 : RangeΓ1 Y R a score that is strictly consistent for Γ1 on every distribution P such that Γ0(P) = γ0.

We assume that the elicitable Γ0 satisfies Assumption 4.1, with V 0(γ0, y) strictly increasing and L0-Lipschitz in γ0. We will also need the opposite (similarly mild) assumption: Assumption 5.2. V 0( , Y ) is L0 a-anti-Lipschitz at Γ0(Y ) for all Y P: |γ0 Γ0(Y )| L0 a|V 0(γ0, Y )| for all γ0.

The situation with Γ1 is more complex: it has different id functions V 1 γ0 for different level sets of Γ0, instead of a single function for all P P. In general, nothing prevents these functions Vγ0 from being completely unrelated to each other for different values of γ0 (and even undefined on each other s level sets). However, for most properties of interest we can expect V 1 γ0(γ1, P) to vary continuously in γ0 and be well-defined even for distributions P for which Γ0(P ) = γ0. To reflect this, and enable our joint multicalibration algorithm s guarantees, we make the following (mildly stronger) assumption:

Assumption 5.3. Assume for all P P that V 1 γ0(γ1, P) is defined and is Lc-Lipschitz as a function of γ0: that is, for any γ1, γ0 1, γ1 1, |V 1 γ0 0(γ1, P) V 1 γ0 1(γ1, P)| Lc|γ0 0 γ0 1|.

Further, we assume that the conditional identification functions V 1 γ0 for Γ1 on Γ0 s level sets {Γ0 = γ0} retain their shape , i.e. remain strictly increasing and Lipschitz, even for distributions from other level sets of Γ0. While this is a nontrivial assumption to make, in Section 6 we confirm that it holds e.g. when (Γ0, Γ1) is a so-called Bayes pair. This

will let us establish Bayes pairs, an important and general class of properties (Embrechts et al., 2021), as a major use case for our theory of joint multicalibration.

Assumption 5.4. For all2 γ0 RangeΓ0, assume V 1 γ0( , P) is L1-Lipschitz and strictly increasing for all P P.

We now define the central notion of jointly multicalibrated predictors for properties Γ = (Γ0, Γ1) : P R2. Analogously to Definitions 2.10, 4.2, we define this concept in two versions: one that is parameterized by the property Γ itself, and another one that involves its id functions (V 0, V 1). As in Section 4, the latter version serves to simplify notation in our algorithm analysis (and as this notion of multicalibration error goes to 0, so does the one parameterized by (Γ0, Γ1)). We will use some shorthands (for i = 0, 1): µf(γi|G, γ1 i) := Prx[f i(x) = γi|x G, f 1 i(x) = γ1 i] and µf(G, γi) := Prx[x G, f i(x) = γi]. Also, we let Y(G,γ0,γ1) := (Y |{x G, f(x) = (γ0, γ1)}).

Definition 5.5 (Approximate Joint Multicalibration). Fix distribution D Z and group family G. Given a property Γ = (Γ0, Γ1), a finite-range predictor f = (f 0, f 1) : X R2 is (α0, α1)-approximately (G, Γ0, Γ1)-jointly multicalibrated if for all G G, γ0 Rangef 0, γ1 Rangef 1: X

γ0 Rangef0 µf(γ0|G, γ1) (γ0 Γ0(Y(G,γ0,γ1)))2 α0 µf (G,γ1),

γ1 Rangef1 µf(γ1|G, γ0) (γ1 Γ1(Y(G,γ0,γ1)))2 α1 µf (G,γ0).

Similarly, given id functions V 0, {V 1 γ0}γ0 RangeΓ0 , predictor f =(f 0, f 1) is (α0, α1)-approximately (G, V 0, V 1)-jointly multicalibrated if for G G, γ0 Rangef 0, γ1 Rangef 1: X

γ0 Rangef0 µf(γ0|G, γ1) V 0(γ0, Y(G,γ0,γ1)) 2 α0 µf (G,γ1),

γ1 Rangef1 µf(γ1|G,γ0)(V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1)))2 α1

The Canonical Joint Multicalibration Algorithm: We now introduce Joint Multicalibration (Algorithm 2), a canonical algorithm for learning a jointly multicalibrated predictor f = (f 0, f 1) for (Γ0, Γ1). Algorithm 2 significantly generalizes the (mean, moment)- multicalibration algorithm of Jung et al. (2021), leading to some key differences in the analysis. We defer the discussion of these differences, as well as the proof of its convergence guarantees in Theorem 5.6, to Appendix E.1, and give only a brief overview of the algorithmic ideas below.

2In fact, we only need this to (much less restrictively) hold for γ0 [ 1

m], as Algorithm 2 gives predictors ranging in [ 1

The Statistical Scope of Multicalibration

Algorithm 2 Joint Multicalibration((Γ0, Γ1), G, m, (f 0, f 1))

Let V 0 and {V 1 γ0}γ0 [1/m] be id functions for Γ0 and Γ1

satisfying Assumptions 4.1, 5.2, 5.3, 5.4. Initialize t = 1 and f1 = (f 0, f 1). while (γ0, γ1, G) [ 1

m] G s.t. Pr x X[ft(x) =

(γ0, γ1), x G] V 0(γ0, Y(γ0,γ1,G)) 2 α0

m do Let G0 t {G {x X: f 1 t (x)=γ1}: G G, γ1 [ 1

m]} Update f 0 t+1 Batch Multicalibration V(V 0, G0 t , m, f 0 t, α0) for γ0 [1/m] do

Let G1,γ0 t {G {x X : f 0 t+1(x)=γ0} : G G}

Let f 1,γ0 t+1 Batch Multicalibration V(V 1 γ0, G1,γ0 t , m, f 1 t, α1) end for Update f 1 t+1(x) P

γ0 [1/m] 1[f 0 t+1(x)=γ0] f 1,γ0 t+1(x), x X

Update t t + 1. end while Output ft = (f 0 t , f 1 t ).

Theorem 5.6 (Guarantees of Algorithm 2). Consider any property Γ = (Γ0, Γ1), with Γ0 elicitable and Γ1

elicitable conditionally on Γ0, whose id functions satisfy Assumptions 4.1, 5.2, 5.3, 5.4. Fix any group family G 2X and discretization m 1. Set α0 = 4(L0)2

m and α1 = 4(L1)2

m . Let α1 = 8((L0L0 a Lc)2+(L1)2)

m . Then, Joint Multicalibration (Algorithm 2) will output an (α0, α1 )-approximately (G, V 0, V 1)- jointly multicalibrated Γ-predictor f = (f 0, f 1), via at most B0B1m4

L0L1 updates to f. Here, B0 := supγ,y [0,1] S0(γ, y) infγ,y [0,1] S0(γ, y) for S0 an antiderivative of V 0(γ0, y) wrt. γ0, and B1 :=

max γ0 [1/m]

supγ,y [0,1] S1 γ0(γ, y) infγ,y [0,1] S1 γ0(γ, y)

for each S1 γ0 an antiderivative of V 1 γ0(γ1, y) wrt. γ1.

To train a two-dimensional predictor, we employ a two-stage structure whereby we alternately multicalibrate f 0 on the current level sets of f 1, and f 1 on the current level sets of f 0, until the desired level of joint multicalibration error is reached (in the sense of Equations 1, 2 of Definition 5.5).

As in Section 4, our predictor is discretized: Rangef 0 = Rangef 1 = [ 1

m]. Both f 0 and f 1 are updated via calls to a subroutine Batch Multicalibration V , which is very similar to Batch Multicalibration (Algorithm 1) but has a stricter stopping rule to meet the extra demands of joint multicalibration, and accepts id functions V rather than properties Γ to simplify notation. We defer the pseudocode and the analysis for Batch Multicalibration V to Algorithm 3 and Lemma E.1 in Appendix E.

Throughout the execution of Algorithm 2, the subroutine is invoked on auxiliary group families consisting of pairwise intersections of groups in G with the level sets of f 0, f 1.

Due to updates to f 0 and f 1, these auxiliary groups are always in drift across these invocations, and careful bookkeeping is needed to verify that this does not prevent overall convergence. A key fact we prove towards this is that across all invocations on V 0 throughout Algorithm 2, the subroutine will perform boundedly many updates on f 0, implying that also f 1 will be re-calibrated at most that many times.

6. Applications

By combining our theory with known results from the elicitation literature in an essentially blackbox way, we can obtain several novel positive and negative results shedding light on an important question: when is it possible to produce (multi)calibrated predictors for various risk measures? We now summarize our results informally, and relegate the corresponding formal statements to Appendix G.

Joint Multicalibration of Bayes Pairs An elicitable property Γ by definition minimizes some scoring function S. Thus, we can view S as a loss which yields an accurate Γ-predictor when E[S] is minimized over a dataset. For instance, if Γ is the mean, we would minimize E(x,y) D[S(γ, y)] for S(γ, y) = (γ y)2 which is just the familiar L2 regression. As another example, τquantiles are elicited by optimizing the expected pinball loss Sτ(γ, y) := (1 τ)γ + max{y γ, 0}; this procedure is known as quantile regression.

In loss minimization settings, one may care not about the minimizer per se, but rather about the loss value it induces, effectively asking: how large is the expected (strictly consistent) loss S at the true property value of Γ, i.e. minγ E(x,y) D[S(γ, y)]? This object is known as the Bayes risk ΓB of Γ with respect to S, and the two-dimensional property ΓBP := (Γ, ΓB) is a Bayes pair with respect to S.

Definition 6.1 (Bayes Risk, Bayes Pair). Fix an elicitable Γ : P R and a strictly consistent Γ-scoring function S. The property ΓB : P R given by ΓB(P) := S(Γ(P), P) for P P is the Bayes risk of Γ, and the property ΓBP := (Γ, ΓB) is the Bayes pair with respect to S.

For instance, (mean, variance) is a Bayes pair with respect to the squared loss. Another important Bayes pair, (quantile, CVa R), will be discussed shortly. As it turns out, Bayes risks are often not elicitable per se (Embrechts et al., 2021). However, any Bayes risk ΓB is evidently elicitable conditionally on its underlying property Γ: knowing the value of Γ reveals the value of ΓB. Thus, Bayes pairs are amenable to our joint multicalibration techniques:

Theorem 6.2 (Informal). Under mild assumptions, all Bayes pairs ΓBP := (Γ, ΓB) with respect to Lipschitz losses S are jointly multicalibratable using Algorithm 2.

The Statistical Scope of Multicalibration

CVa R Multicalibration Conditional Value at Risk (CVa R), known also as Expected Shortfall (ES), is a tail risk measure of central significance in the literature. Proposed by Artzner et al. (1999) and Rockafellar & Uryasev (2000), τ-CVa R (for any τ [0, 1]) measures the mean of the top (1 τ)-fraction of a random variable s largest values, defined (with qτ(P) denoting the τ-quantile of P) as:

CVa Rτ(P) := EY P [Y |Y > qτ(P)].

Given its ability to capture tail risk beyond the quantile, as well as many appealing features such as its coherence (see Artzner et al. (1999)), the CVa R has gained widespread prominence in areas ranging from finance to robust optimization. Its real-world significance is underscored by its recent introduction into international banking regulations, known as the Basel Accords, as a replacement for quantiles (called Value-at-Risk (Va R) in finance) for the purposes of market risk capital calculations (Embrechts et al., 2014). Thus, it is very important to ask if the CVa R is sensible for calibration as this would let us train multicalibrated CVa R predictors using our canonical batch and online methods, complementing recent algorithmic quantile multicalibration results of Bastani et al. (2022) and Jung et al. (2023).

The answer to this question turns out to be nuanced. We show that while CVa Rτ is not sensible for calibration for any τ [0, 1] (eliminating the possibility of directly training multicalibrated predictors for it), it can be multicalibrated jointly with the corresponding quantile qτ.

Theorem 6.3 (Informal). For τ [0, 1], τ-CVa R is not sensible for calibration. However, τ-CVa R is multicalibratable jointly with the τ-quantile, by instantiating Algorithm 2.

For the negative part of this theorem, we simply invoke our Theorem 3.4 with a well-known result of (Gneiting, 2011) that CVa R violates Cx LS. For the positive part, we invoke our joint multicalibration Theorem 6.2 by using the known fact that (qτ, CVa Rτ) is a Bayes pair relative to the (rescaled) τ-pinball loss (see e.g. Frongillo & Kash (2021)).

An Impossibility Result for Distortion Risk Measures Distortion risk measures (Wang et al., 1997) are a large theoretically and practically important class of risk measures. Its representatives include means, quantiles, CVa R, spectral risk measures, and numerous other important risk measures; see e.g. Kou & Peng (2016) or Gzyl & Mayoral (2008). Definition 6.4 (Distortion Risk Measure). Given a distortion function h : [0, 1] [0, 1] (i.e., h is nondecreasing and satisfies h(0) = 0 and h(1) = 1), the corresponding distortion risk measure Γh : P R is given by:3

Γh(P) := Z 0

(h (1 FP (x)) 1) dx + Z

0 h(1 FP (x)) dx

3We assume the integrals exist for all P P.

for P P, where FP is the CDF of P.

For instance, the choice h(x) = x for x [0, 1] leads to Γh

being the distribution mean. Letting hτ(x) = 1[x > 1 τ] leads to Γhτ being the τ-quantile.

As Wang & Ziegel (2015) and Kou & Peng (2016) showed, means and quantiles are essentially4 the only distortion risks with convex level sets on finite-support distributions. Paired with our Theorem 3.4 (no Cx LS = not sensible for calibration), this yields a sweeping negative result: Theorem 6.5 (Informal). No distortion risk measures, other than (essentially) means and quantiles, are sensible for calibration on any dataset family D which allows for finitesupport label distributions.

Thus, we learn that there will not be another multicalibration algorithm for distortion risks: the existing mean and quantile multicalibration methods are (essentially) the only ones.

Acknowledgments

This research was done in part while Georgy Noarov was visiting the Simons Institute for the Theory of Computing, and was supported in part by a Gift from AWS AI for Research in Trustworthy AI, the Simons Collaboration on the Theory of Algorithmic Fairness, and NSF grants FAI-2147212 and CCF-2217062. We warmly thank Christopher Jung and Arpit Agarwal for enlightening conversations at an early stage of this work.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent measures of risk. Mathematical finance, 9(3):203 228, 1999.

Bastani, O., Gupta, V., Jung, C., Noarov, G., Ramalingam, R., and Roth, A. Practical adversarial multivalid conformal prediction. In Neural Information Processing Systems (Neur IPS), 2022.

Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1 3, 1950.

Deng, Z., Dwork, C., and Zhang, L. Happymap: A generalized multicalibration method. In Innovations in Theoretical Computer Science (ITCS), 2023.

Diestel, J. and Uhl, J. J. Vector measures, vol. 15 of mathematical surveys. American Mathematical Society, Providence, RI, USA, 1977.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Outcome indistinguishability. In Proceedings

4Up to two further quantile-like properties; see Definition G.3 and Theorem G.4 in Section G for the precise statement.

The Statistical Scope of Multicalibration

of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1095 1108, 2021.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Beyond bernoulli: Generating random outcomes that cannot be distinguished from nature. In International Conference on Algorithmic Learning Theory, pp. 342 380. PMLR, 2022.

Embrechts, P., Puccetti, G., R uschendorf, L., Wang, R., and Beleraj, A. An academic response to basel 3.5. Risks, 2 (1):25 48, 2014.

Embrechts, P., Mao, T., Wang, Q., and Wang, R. Bayes risk, elicitability, and the expected shortfall. Mathematical Finance, 31(4):1190 1217, 2021.

Finocchiaro, J. and Frongillo, R. Convex elicitation of continuous properties. Advances in Neural Information Processing Systems, 31, 2018.

Frongillo, R. M. and Kash, I. A. Elicitation complexity of statistical properties. Biometrika, 108(4):857 879, 2021.

Garg, S., Jung, C., Reingold, O., and Roth, A. Oracle efficient online multicalibration and omniprediction. Manuscript, 2023.

Globus-Harris, I., Harrison, D., Kearns, M., Roth, A., and Sorrell, J. Multicalibration as boosting for regression. International Conference on Machine Learning (ICML), 2023.

Gneiting, T. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746 762, 2011.

Good, I. J. Rational decisions. In Breakthroughs in statistics, pp. 365 377. Springer, 1992.

Gopalan, P., Kalai, A. T., Reingold, O., Sharan, V., and Wieder, U. Omnipredictors. In ITCS, 2022a.

Gopalan, P., Kim, M. P., Singhal, M. A., and Zhao, S. Lowdegree multicalibration. In Conference on Learning Theory, pp. 3193 3234. PMLR, 2022b.

Gupta, V., Jung, C., Noarov, G., Pai, M. M., and Roth, A. Online multivalid learning: Means, moments, and prediction intervals. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss Dagstuhl-Leibniz-Zentrum f ur Informatik, 2022.

Gzyl, H. and Mayoral, S. On a relationship between distorted and spectral risk measures. Rev Econ Financ, 2008.

H ebert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationallyidentifiable) masses. In International Conference on Machine Learning, pp. 1939 1948. PMLR, 2018.

Jung, C., Lee, C., Pai, M., Roth, A., and Vohra, R. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, pp. 2634 2678. PMLR, 2021.

Jung, C., Noarov, G., Ramalingam, R., and Roth, A. Batch multivalid conformal prediction. In International Conference on Learning Representations (ICLR), 2023.

Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Blackbox post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 247 254, 2019.

Kim, M. P., Kern, C., Goldwasser, S., Kreuter, F., and Reingold, O. Universal adaptability: Target-independent inference that competes with propensity scoring. Proceedings of the National Academy of Sciences, 119(4): e2108097119, 2022.

Kou, S. and Peng, X. On the measurement of economic tail risk. Operations Research, 64(5):1056 1072, 2016.

Lambert, N. S., Pennock, D. M., and Shoham, Y. Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, pp. 129 138, 2008.

Lee, D., Noarov, G., Pai, M. M., and Roth, A. Online minimax multiobjective optimization: Multicalibeating and other applications. In Neural Information Processing Systems (Neur IPS), 2022.

Osband, K. H. Providing Incentives for Better Cost Forecasting (Prediction, Uncertainty Elicitation). Ph D thesis, University of California, Berkeley, 1985.

Rockafellar, R. T. and Uryasev, S. Optimization of conditional value-at-risk. Journal of risk, 2:21 42, 2000.

Roth, A. Uncertain: Modern topics in uncertainty estimation. https://www.cis.upenn.edu/ aaroth/uncertainty-notes.pdf, 2022.

Savage, L. J. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783 801, 1971.

Steinwart, I., Pasin, C., Williamson, R., and Zhang, S. Elicitation and identification of properties. In Conference on Learning Theory, pp. 482 526. PMLR, 2014.

Wang, R. and Ziegel, J. F. Elicitable distortion risk measures: A concise proof. Statistics & Probability Letters, 100: 172 175, 2015.

Wang, S. S., Young, V. R., and Panjer, H. H. Axiomatic characterization of insurance prices. Insurance: Mathematics and economics, 21(2):173 183, 1997.

The Statistical Scope of Multicalibration

A. Additional Related Work

The main conceptual contribution of our paper is to connect the literature on multicalibration with the literature on property elicitation, both of which have a number of related threads.

Multicalibration (Mean) multicalibration in the batch setting was introduced by H ebert-Johnson et al. (2018), and has subsequently been generalized in a number of ways. As already discussed, Jung et al. (2021) study mean conditioned moment multicalibration in the batch setting, Gupta et al. (2022) study mean, quantile, and mean conditioned moment multicalibration in the sequential setting, and Jung et al. (2023) study quantile multicalibration in the batch setting. These generalizations can be seen as asking for calibration with respect to different distributional properties i.e. they are generalizations of the type that we characterize in our paper. Dwork et al. (2021) study outcome indistinguishability, generalizing batch multi-calibration in a binary-label setting to allow for distinguishers that can evaluate the expectations of arbitrary predicates of triples (x, f(x), y), and consider strengthenings in which the distinguisher gets further access to f e.g. by being able to query it in arbitrary points, or by having access to its code. In particular, they show that without additional access to f, outcome indistinguishability can be reduced to (mean) multicalibration. Note that many distributional properties (e.g. variances, quantiles, etc) only become interesting when the label space is not binary, but rather consists of more than two labels or is real valued.

The most closely related works study abstract generalizations of multi-calibration. Dwork et al. (2022) study a generalization of outcome indistinguishability to real valued labels, and along the way consider batch multicalibration with respect to linearizing statistics. In our language, these are distributional properties Γ that behave linearly over mixture distributions informally that for any two distributions D1, D2 and any α [0, 1], Γ(αD1 + (1 α)D2) = αΓ(D1) + (1 α)Γ(D2). We adopt one of their definitions of multicalibration with respect to general distributional properties. All linearizing statistics have convex level sets (and so are elicitable), but not all elicitable properties are linear in this sense for example, quantiles do not linearize. So our characterization of multicalibration implies that it is possible to multicalibrate with respect to a broader class of properties than are studied by Dwork et al. (2022). Lee et al. (2022) study a general online learning problem that they call Online Minimax Multiobjective Optimization , and derive algorithms for multicalibration along with a number of other applications in this framework. We make use of this framework to derive our sequential multicalibration bounds, using an identification function that arises from the connection we make to property elicitation. Recent work of Deng et al. (2023) studies a one-dimensional generalization of multicalibration that asks for the condition that E[c(f(x), x)s(f(x), y)] = 0 for abstract functions c and s, and derives sufficient (but not necessary) conditions under which this can be achieved in the batch setting. The algorithms we derive for batch multicalibration are similar to theirs, where an identification function takes the place of their s function, and a scoring function takes the place of their potential function; they do not consider sequential or multi-dimensional problems. Relative to this line of work, our result is the first to provide a characterization of when property multicalibration can be obtained, and to provide unifying results for both batch and sequential multicalibration.

There are also generalizations in orthogonal directions. Gopalan et al. (2022b) define low degree multicalibration , which is a hierarchy of properties of predictors that are still trying to predict means, but relax the conditioning event that f(x) = v. At the bottom of the hierarchy is multi-accuracy (H ebert-Johnson et al., 2018; Kim et al., 2019) which does not condition on f(x) at all. Multicalibration lies at the top of the hierarchy; in between are conditions that depend on f(x) only smoothly, through a degree k polynomial. Gopalan et al. (2022b) show that intermediate levels of this hierarchy have some of the desirable properties of multicalibration and can be easier to obtain. Several works (Kim et al., 2019; Gopalan et al., 2022a; Kim et al., 2022; Globus-Harris et al., 2023) study generalizations of mean multicalibration in which groups representing subsets of the data domain are relaxed to arbitrary real valued functions and give a number of applications. In this setting, Globus-Harris et al. (2023) provide a characterization of when mean multicalibration implies Bayes optimality. See Roth (2022) for an introductory exposition of much of this work.

Property elicitation Brier (1950), Good (1992), and Savage (1971) study proper scoring rules which are contracts for paying an expert as a function of their prediction and of the realized outcome, with the property that they maximally reward the expert (in expectation) for truthfully reporting their estimate of the probability of the outcome event. Since scoring rules directly elicit probabilities, they could in principle be used to elicit an entire probability distribution by eliciting the probability of every event in its support, but this is generally infeasible. Instead, Lambert et al. (2008) introduce the problem of property elicitation, whose goal is to design contracts that incentivize experts to truthfully report some property of a large or infinite support distribution like its mean, variance, median, etc. Informally a property is elicitable if there

The Statistical Scope of Multicalibration

exists some function of a report and an outcome that in expectation over the outcome is minimized at the property value. There is now a large literature on property elicitation; we make use of several key results. Osband (1985) and Gneiting (2011) define the notion of an identification function for a property, which like a scoring rule is a function of a report and an outcome; an identification function takes value 0 in expectation over the outcome if the report is equal to the property value. Steinwart et al. (2014) prove a central characterization theorem (subject to mild technical conditions) a continuous property is elicitable if and only if it has an identification function if and only if it has convex level sets. The characterization of Steinwart et al. (2014) holds generally for continuous outcome spaces. When the outcome space is finite, Finocchiaro & Frongillo (2018) show that (subject to technical conditions), elicitable properties can be elicited with convex scoring rules.

B. Proof of Theorem 3.6

Theorem 3.6 (Cx LS = Sensible for Calibration). Consider a continuous property Γ : P R, and any family D of P-regular dataset distributions. Then, if Γ has convex level sets on P, it is sensible for calibration over D.

Proof. Suppose that Γ has convex level sets on P. We now establish that for every dataset distribution D D, the true distributional predictor f D Γ is calibrated: namely, that for all γ Rangef D Γ RangeΓ, we have Γ(Yf D Γ ,γ) = γ. For the remainder of the proof, fix any dataset distribution D D (which is P-regular by assumption) and any value γ Rangef D Γ .

Let LSΓ(γ) = {P P : Γ(P) = γ} be the γ-level set of property Γ which is convex by our assumption on Γ. Further, let Qγ := {x X : Yx LSΓ(γ)} be the feature space region consisting of all points x X whose label distributions have property value γ. Let Xγ be the conditional probability measure on Qγ induced by the dataset distribution D.

Proving that Γ(Yf D Γ ,γ) = γ is equivalent to establishing that Yf D Γ ,γ LSΓ(γ). The conditional distribution Yf D Γ ,γ is a mixture distribution over the individual label distributions Yx for x Qγ, so we can write the following formal expression:

Yf D Γ ,γ = E x Xγ[Yx] := Z

Qγ Yx d Xγ .

If X is finite: In this case, Yf D Γ ,γ is just a convex combination of the constituent distributions Yx for x Qγ:

Yf D Γ ,γ = X

x Qγ Xγ(x) Yx.

The convexity of the level set LSΓ(γ) then implies that indeed Yf D Γ ,γ LSΓ(γ), since Yf D Γ ,γ is a convex combination, over x Qγ, of distributions Yx LSΓ(γ).

If Y is finite: In this case, Yf D Γ ,γ = Ex Xγ[Yx] is naturally given by a Lebesgue integral of the random variable Yx over the simplex (d) Rd. By our assumption that the dataset is P-regular, we have that Yx is a bounded (e.g. in the ℓ norm) and Lebesgue measurable random variable. Consequently, Yx is in fact Lebesgue integrable, so the expectation Ex Xγ[Yx] is well-defined and evaluates to some point u Rd. It remains to show that u LSΓ(γ).

For this, introduce the indicator function 1LSΓ(γ) : (d) {0} {+ }, defined to be 0 for Yx LSΓ(γ), and otherwise. As the set LSΓ(γ) is convex, its indicator function 1LSΓ(γ) is convex. Therefore, we can apply Jensen s inequality to 1LSΓ(γ) to conclude that:

1LSΓ(γ)(u) = 1LSΓ(γ)

E x Xγ[Yx] E x Xγ

1LSΓ(γ)(Yx) = 0,

implying that 1LSΓ(γ)(u) = 0. By definition of 1LSΓ(γ), this demonstrates that u LSΓ(γ), as desired.

X, Y infinite: In this case, we define Yf D Γ ,γ = Ex Xγ[Yx] as the Bochner integral of the Bochner measurable map ξD. (Recall that ξD is defined in Definition 3.5, and see Diestel & Uhl (1977) for formal definitions and properties of Bochner measurability and integrability.) By a standard Bochner integrability criterion (see Theorem 2 on p. 45 of Diestel & Uhl (1977)), this integral indeed exists and evaluates to a point in the ambient space WTV, as it is easy to check that Ex Xγ[||Yx||TV] < (indeed, the TV norm of any probability distribution is 1 so Ex Xγ[||Yx||TV] = 1 < ). Again, we want to show that Yf D Γ ,γ = Ex Xγ[Yx] LSΓ(γ). For this, we use the following result, which can be interpreted as a mean value theorem for Bochner integrals:

The Statistical Scope of Multicalibration

Fact 2 (Corollary 8 on p. 48 of Diestel & Uhl (1977)). Let (Ω, A, µ) be a finite measure space, E a Banach space, and f : Ω E a Bochner µ-integrable map. For G E, let co(G) be the closure of the convex hull of G. Then, for any A A with µ(A) > 0, the Bochner integral of f over A belongs to the closure of the convex hull of the image of A under f:

A f dµ co(f(A)).

To instantiate this fact, we let: (1) Ω:= X, together with the probability measure induced by the dataset over X; (2) the Banach space E := WTV; (3) the Bochner integrable mapping f := ξD; and (4) the measurable event A := Qγ X.

Note that f(A) = f(Qγ) LSΓ(γ) (with this inclusion being strict whenever there is some label distribution P LSΓ(γ) that is not induced by the dataset distribution D conditional on any x X), so we have that co(f(A)) co(LSΓ(γ)). Observe that LSΓ(γ) is convex by assumption, and it is also closed in the standard metric topology on WTV since it is the preimage under the continuous mapping Γ of the closed singleton {γ} RangeΓ. Thus, LSΓ(γ) is a closed convex set so co(LSΓ(γ)) = LSΓ(γ). As a result, we in fact see that co(f(A)) LSΓ(γ).

Since Xγ is the conditional distribution induced by D over x Qγ, we can see that 1 µ(A) R

A fdµ = Ex Xγ[Yx] = Yf D Γ ,γ. Together with our observation that co(f(A)) LSΓ(γ), this lets us conclude by Fact 2 that Yf D Γ ,γ LSΓ(γ), as desired.

C. Convergence Analysis for Batch Multicalibration (Proof of Theorem 4.3)

Our convergence analysis of Algorithm 1 will utilize the following natural potential function:

Definition C.1. The potential for Algorithm 1 at round t is:

Φt := E (x,y) D[S(ft(x), y)] = E x X[S(ft(x), Yx)],

where ft : X R is the property predictor at the beginning of iteration t of the algorithm and S is a strictly consistent scoring function for property Γ satisfying Assumption 4.1.

First, we prove the following helper Lemma that bounds the change in S the potential function of Algorithm 1 in the scenario where an incorrect prediction γ for the property value Γ(Y ) is corrected on a label distribution Y . We will later use this fact to bound the progress of the algorithm after every update to the predictor f for Γ.

Lemma C.2. Consider any property Γ : P R. Suppose S : RangeΓ Y R and V : RangeΓ Y R with V (γ, y) = S(γ,y)

γ are a strictly consistent scoring function and the corresponding identification function for Γ that satisfy Assumption 4.1. Then for any γ RangeΓ and any label distribution Y P, letting LY be the Lipschitz constant of V ( , Y ), it holds that: (V (γ, Y ))2

2LY S(γ, Y ) S(Γ(Y ), Y ) V (γ, Y )(γ Γ(Y )) (V (γ, Y ))2

Proof. We will prove this result with the help of the following claim.

Claim 1. For any L-Lipschitz increasing function h defined on any interval [a, b], it holds that:

h(a)(b a) + (h(b) h(a))2

a h(t)dt h(b)(b a) (h(b) h(a))2

Proof. Under these constraints on h, the largest value of the integral R b a h(t)dt would be obtained if h(t) first increased from h(a) to h(b) for t [a, t ], where t [a, b] is defined by (t a)L = h(b) h(a), at the fastest rate possible (that is, at the rate L), and stayed constant at the value h(b) for t [t , b]. The integral of this piecewise linear function on [a, b] gives the upper bound.

Conversely, the smallest value of the integral R b a h(t)dt would be obtained if h(t) first stayed constant at the value h(a) for t [a, t ], where t is defined so that (b t )L = h(b) h(a), and then increased from h(a) to h(b) at the fastest rate possible (that is, at the rate L) for t [t , b]. Integrating this function on [a, b] gives the claimed lower bound.

The Statistical Scope of Multicalibration

As V is a derivative of S, by the fundamental theorem of calculus we get: S(γ, Y ) S(Γ(Y ), Y ) = R γ Γ(Y ) V (t, Y )dt.

First assume γ Γ(Y ). By Assumption 4.1, V continuously increases from Γ(Y ) to γ, and has Lipschitz constant LY . Then, by Claim 1, and using that V (Γ(Y ), Y ) = 0, we obtain

(V (γ, Y ))2

Γ(Y ) V (t, Y )dt V (γ, Y )(γ Γ(Y )) (V (γ, Y ))2

Now assume γ < Γ(Y ). Then, we have:

S(γ, Y ) S(Γ(Y ), Y ) = Z γ

Γ(Y ) V (t, Y )dt = Z Γ(Y )

γ V (t, Y )dt.

By Assumption 4.1, V continuously increases from γ to Γ(Y ), and has Lipschitz constant LY . By Claim 1, we have: V (Γ(Y ), Y )(Γ(Y ) γ) + (V (Γ(Y ),Y ) V (γ,Y ))2

2LY R Γ(Y ) γ V (t, Y )dt V (γ, Y )(Γ(Y ) γ) (V (Γ(Y ),Y ) V (γ,Y ))

which from V (Γ(Y ), Y ) = 0 simplifies to: (V (γ,Y ))2

2LY R Γ(Y ) γ V (t, Y )dt V (γ, Y )(γ Γ(Y )) (V (γ,Y ))2

2LY . Thus, we have shown our bound for both cases γ Γ(Y ) and γ < Γ(Y ).

Now, we are ready to prove Theorem 4.3, which gives the convergence rate for Algorithm 1. We restate the theorem here for convenience.

Theorem 4.3 (Guarantees of Algorithm 1). Fix data distribution D Z and groups G 2X . Fix an elicitable property Γ with its scoring function S and id function V = S

γ satisfying Assumption 4.1, so that V ( , YQ) is L-Lipschitz on all label distributions YQ (for Q X) induced by D. Set discretization m 1. If Algorithm 1 is initialized with predictor f1 : X R with score E(x,y) D[S(f1(x), y)] = Cinit, and Copt = E(x,y) D[S(f D Γ (x), y)] is the score of the true distributional predictor f D Γ , then Algorithm 1 produces a 4L2

m -approximately (G, V )-multicalibrated Γ-predictor f after at most (Cinit Copt) m2

Proof. Suppose the algorithm has not halted at round t. Thus, ft does not yet satisfy α-approximate (G, V )-multicalibration, so by the pigeonhole principle there is a pair (G, γ) G [1/m] such that on the set Qt := {x X : x g, ft(x) = γ}:

|V (γ, YQt)|

α/m Prx X[x Qt]. (3)

Now, letting γ = argminγ [1/m] |V (γ , YQt)|, the algorithm will update ft ft+1 via the rule:

ft+1(x) := 1[x Qt] ft(x) + 1[x Qt] γ .

From the definition of the potential function values Φt and Φt+1, we have

Φt+1 = Pr x X[x Qt] E x X[S(ft+1(x), Yx)|x Qt] + Pr x X[x Qt] E x X[S(ft+1(x), Yx)|x Qt]

= Pr x X[x Qt] E x X[S(ft+1(x), Yx)|x Qt] + Pr x X[x Qt] E x X[S(ft(x), Yx)|x Qt]

= Φt + Pr x X[x Qt] E x X[S(ft+1(x), Yx) S(ft(x), Yx)|x Qt]

= Φt + Pr x X[x Qt] E x X[S(γ , Yx) S(γ, Yx)|x Qt]

= Φt + Pr x X[x Qt] (S(γ , YQt) S(γ, YQt)) .

Here Step 2 follows because ft(x) = ft+1(x) for all x outside Qt, and Step 5 uses the fact that ft and ft+1 are both constant on Qt to rewrite expected scores of ft, ft+1 over x X simply as the scores of the predictor values γ, γ with respect to the mixture distribution YQt of labels over the region Qt.

The Statistical Scope of Multicalibration

From here, we have:

Φt+1 Φt = Pr x X[x Qt] (S(γ , YQt) S(Γ(YQt), YQt)) (S(γ, YQt) S(Γ(YQt), YQt))

Pr x X[x Qt] V (γ , YQt)(γ Γ(YQt)) (V (γ , YQt))2

2L (V (γ, YQt))2

Pr x X[x Qt] V (γ , YQt)(γ Γ(YQt)) (V (γ , YQt))2

Pr x X[x Qt] V (γ , YQt)(γ Γ(YQt)) α 2Lm

V (γ , YQt)(γ Γ(YQt)) α 2Lm

L|γ Γ(YQt)| |γ Γ(YQt)| α 2Lm

The equality follows by introducing an added and subtracted term S(Γ(YQt), YQt). The 1st inequality applies the upper and lower bound of Lemma C.2 to the two score differences. The 2nd inequality follows by the α-miscalibration condition of Equation 3. The 3rd inequality drops the nonpositive term Prx X[x Qt] (V (γ ,YQt))2

2L . The 4th inequality drops the factor Prx X[x Qt] 1. The 5th inequality holds since by the Lipschitzness of V , we have |V (γ , YQt)| = |V (γ , YQt) V (Γ(YQt), YQt)| L|γ Γ(YQt)|. The 6th inequality holds because γ must be at most 1

m away from the true property value Γ(YQt) on Qt: the algorithm cannot select a farther grid point γ , as that would result in a greater value of |V (γ , YQt)|, by the structure of the m-discretization and the monotonic increase of |V ( , YQt)| in both directions away from Γ(YQt).

Now setting α = 4L2

m , we get Φt+1 Φt L m2 α 2Lm = L

m2 . Telescoping this over the rounds t = 1, . . . , T, where T is the total number of iterations before convergence, we obtain:

By assumption Φ1 = Cinit and ΦT Copt, so we have T L

m2 Φ1 ΦT Cinit Copt, and thus

T (Cinit Copt)m2

concluding the proof.

D. Finite Sample Guarantees for Batch Multicalibration

We have described Algorithm 1 as if it has direct access to the underlying distribution D (since it computes expectations of the identification function V on the underlying distribution). In general we do not have access to D directly, and instead have access only to a sample ˆD Dn of n points sampled i.i.d. from D. In practice, we would run the algorithm on the empirical distribution over the n points in ˆD, and its guarantees would carry over to the underlying distribution D from which these points were sampled. Jung et al. (2023) proved this for the special case of quantiles, but in fact their proof uses nothing other than the conditions in Assumption 4.1. We state the more general version of the theorem here (implicit in Jung et al. (2023)) and briefly sketch the argument. We note that this argument is to establish that Algorithm 1 generalizes when used as an empirical risk minimization algorithm. An alternative way to obtain similar generalization bounds would be to follow the strategy of H ebert-Johnson et al. (2018) and use techniques from adaptive data analysis to implement a statistical query oracle, and to then modify the algorithm so as to compute the quantities V (γ, Qt) only through this oracle. Theorem D.1 (Implicit in Jung et al. (2023)). Fix a distribution D Z and a property Γ together with a bounded identification function V . Suppose Algorithm 1 is run using the empirical distribution ˆD Dn over n i.i.d. samples drawn from D. Then if Algorithm 1 halts after T rounds and returns a model f T , with probability 1 δ over the randomness of the data distribution, f T satisfies α-approximate (G, V )-multicalibration with respect to D for:

ln(1/δ) + T ln(m2|G|)m

n + m ln(1/δ) + T ln(m2|G|)

The Statistical Scope of Multicalibration

The proof has a simple structure. Since V is bounded, expectations of V over D (i.e. the quantities of the form V (γ, YG,γ) that appear in the definition of approximate (G, V )-multicalibration) concentrate around their expectations with high probability when evaluated on the empirical distribution ˆD. Thus, if the model f T was fixed, we would get that its (G, V )- multicalibration error is similar in-sample and out-of-sample by union-bounding over each G G and γ Rangeft = [1/m]. Of course the model f T output by the algorithm is not fixed before ˆD is sampled, so to establish the claim, it is also necessary that we union-bound over all models f T that might be output. But we can do this; since Algorithm 1 produces models with range restricted to [1/m], then for any t [T], we can easily see for any fixed model ft that at most |G|m2 models ft+1 could possibly result at the next step at most one for every choice of G G, γ [1/m], and γ [1/m] at iteration t of Algorithm 1. Thus, fixing any initial model f1 = f, the number of models f T that might be output after the final step T of the algorithm is bounded by (|G|m2)T . Theorem D.1 then follows by union-bounding over all such models.

Theorem D.1 upper bounds the generalization error of Algorithm 1 in terms of the number of rounds T before it halts. Thus, when paired with an upper bound on T, it gives a worst-case bound on generalization error. Theorem 4.3 upper bounds the round complexity by T O m2

L , but there is a catch: L here is the Lipschitz constant for expectations of V taken over

the true underlying distribution D, and this will generally not be preserved over the empirical distribution ˆD. Nevertheless, Jung et al. (2023) show that the same convergence bound holds when run on ˆD (up to constants) by arguing that each round of the algorithm run on ˆD decreases the potential function as measured on D (where the Lipschitz assumption has been made). This is because the algorithm decides on its update each round by measuring quantities of the form V (γ, Q) which are expectations of a bounded function V , and so concentrate around their true values.

Theorem D.2 (Implicit in Jung et al. (2023)). Fix a distribution D Z (which induces a set of conditional label distributions YQ for each Q X) and a property Γ together with an identification function V . Assume that Γ and V together with the set of label distributions P = {YQ : Q X} together satisfy Assumption 4.1 with Lipschitz constant L. Suppose Algorithm 1 is run using the empirical distribution on a dataset ˆD Dn consisting of n i.i.d. samples from D. Then for any δ > 0, if:

with probability 1 δ over the randomness of D, Algorithm 1 halts after at most T = O m2

L many steps.

Together with Theorem D.1, this establishes a worst-case generalization bound for batch property multicalibration that is polynomial in all of the parameters of the problem and in the assumed Lipschitz constant L of the property s identification function V .

E. Joint Multicalibration Guarantees

We begin by formally defining the subroutine Batch Multicalibration V as the following Algorithm 3:

Algorithm 3 Batch Multicalibration V (V, G, m, f, α)

Initialize t = 1 and f1 = f. while (γ, G) [1/m] G such that Prx X [ft(x) = γ, x G] V (γ, Y(γ,G)) 2 α/m do Let Qt = {x : ft(x) = γ, x G} Let: γ = argmin γ [1/m] |V (γ , YQt)|

Update: ft+1(x) := 1[x Qt] ft(x) + 1[x Qt] γ for all x X, and t t + 1. end while Output ft.

We here state the guarantees enjoyed by Algorithm 3. This statement is stronger than that of the guarantees for the similar Algorithm 1, in two ways: (1) the notion of achieved multicalibration error at convergence (Equation 4) is stronger than that of Algorithm 1; and (2) we show that even with the input group family G not fixed beforehand (and thus potentially changing over time), Algorithm 3 will never perform more than a certain number of updates to the predictor f, and if it does

The Statistical Scope of Multicalibration

perform that many updates then it will be approximately calibrated conditional on all measurable subsets of X (rather than just the ones it explicitly performed updates on).

Lemma E.1. Set α = 4L2

m . Batch Multicalibration V (Algorithm 3), when run on a function V that is monotonically increasing and L-Lipschitz in its first argument, outputs a [1/m]-discretized predictor f that satisfies:

Pr x X[f(x) = γ, x G] V (γ, Y(γ,G)) 2 α

m for all γ [1/m], G G. (4)

Moreover, Algorithm 3 terminates in at most Bm2

L iterations, where B = supγ,y [0,1] S(γ, y) infγ,y [0,1] S(γ, y) for S an antiderivative of V , and if it runs for that long, the resulting predictor will satisfy (4) for all (measurable) regions G X.

Proof. Denote by S an antiderivative of V , and define, similar to the proof of Theorem 4.3, the potential value at iteration t of Algorithm 3 as: Φt := E (x,y) D[S(ft(x), y)] = E x X[S(ft(x), Yx)].

Suppose that at round t, the algorithm finds a violation of its while loop condition for some G G and γ [1/m]. Let Qt = {x X : x G, f(x) = γ}. Via the same calculations as in the proof of Theorem 4.3, we have that

Φt+1 Φt Pr x X[x Qt] V (γ , YQt)(γ γ t ) (V (γ, YQt))2

Pr x X[x Qt] |V (γ , YQt)||γ γ t | (V (γ, YQt))2

Pr x X[x Qt] L|γ γ t |2 (V (γ, YQt))2

where we denote by γ t the unique point such that V (γ t , YQt) = 0 (it is the analog of Γ(YQt) in the proof of Theorem 4.3). This argument is still valid as it rests on Lemma C.2, which requires properties of V and S that are still satisfied here.

Now, just as in the aforementioned proof, we have by the monotonicity of V ( , YQt) that since the algorithm chooses γ = argminγ [1/m] |V (γ , YQt)|, it must be that |γ γ t | 1

m, and so we obtain

2L Pr x X[x Qt]|V (γ, YQt)|2.

Since the condition of the while loop demands that Prx X[x Qt] |V (γ, YQt)|2 α/m, we get

Setting α = 4L2

m , we then have Φt+1 Φt L

m2 , and thus, by telescoping, ΦT Φ0 T L

m2 . Since by definition B = supγ,y [0,1] S(γ, y) infγ,y [0,1] S(γ, y), we also have ΦT Φ0 B, and therefore T Bm2

L , providing an upper bound on the number of iterations of the algorithm.

Importantly, in this argument we never referenced the actual definition of Qt (i.e. that Qt = {x X : x G, f(x) = γ}) we only used that it satisfies the while loop condition, i.e. Prx X[x Qt] |V (γ, YQt)|2 α/m. Therefore, the upper bound Bm2

L on the total number of iterations in fact holds for any arbitrary sequence of regions Q1, Q2, . . . where each Qi X is measurable with respect to the marginal data distribution over X. As a result, we know that if Batch Multicalibration V does run for at least Bm2

L iterations, then as soon as it finishes iteration t = Bm2

L , there will not exist any measurable Q X violating condition (4). Thus, no matter which group family G Batch Multicalibration V is run on (and even if the group family were to change arbitrarily during the execution), it will never update the predictor f more than Bm2

L times, concluding the proof.

E.1. Convergence Analysis for the Joint Multicalibration Algorithm 2

Now we are ready to prove our convergence guarantee for the canonical Joint Multicalibration Algorithm 2. As mentioned in Section 5, Algorithm 2 significantly generalizes the (mean, moment)-multicalibration algorithm of Jung et al. (2021), leading to some key differences in the analysis.

The Statistical Scope of Multicalibration

Notably, in our terminology, in their specific case the re-calibration of f 1 given f 0 can be cast as a single mean multicalibration subroutine using what they call a pseudo-label technique. At our level of generality, this does not work anymore as we are forced to work with different id functions V 1 γ0 for Γ1 on each level set {f 0 = γ0}. This is why our inner for loop iterates over the level sets of f 0, re-calibrating f 1 using m separate invocations of the subroutine (fortunately, these can actually be run in parallel, since f 0 s level sets are disjoint). Even with this construction in hand, our potential function argument from Section 4 does not easily port over: each level set {f 0 = γ0} can overlap with multiple level sets of Γ0, so the true property Γ1 will generally not admit a single scoring function on {f 0 = γ0} that could be used as a potential. This is where our assumptions on the behavior of V 1 γ0 with respect to γ0 crucially enable us to show that, subject to f 0

being sufficiently multicalibrated, using the proxy id V 1 γ0 on the level set {f 0 = γ0} will not cause the multicalibration subroutines for Γ1 to fail to converge. Theorem 5.6 (Guarantees of Algorithm 2). Consider any property Γ = (Γ0, Γ1), with Γ0 elicitable and Γ1 elicitable conditionally on Γ0, whose id functions satisfy Assumptions 4.1, 5.2, 5.3, 5.4. Fix any group family G 2X and discretization m 1. Set α0 = 4(L0)2

m and α1 = 4(L1)2

m . Let α1 = 8((L0L0 a Lc)2+(L1)2)

m . Then, Joint Multicalibration (Algorithm 2) will output an (α0, α1 )-approximately (G, V 0, V 1)-jointly multicalibrated Γ-predictor f = (f 0, f 1), via at most B0B1m4

L0L1 updates to f. Here, B0 := supγ,y [0,1] S0(γ, y) infγ,y [0,1] S0(γ, y) for S0 an antiderivative of V 0(γ0, y)

wrt. γ0, and B1 := max γ0 [1/m]

supγ,y [0,1] S1 γ0(γ, y) infγ,y [0,1] S1 γ0(γ, y) for each S1 γ0 an antiderivative of V 1 γ0(γ1, y)

Proof. Runtime: First, observe that the while loop in Algorithm 2 will stop after at most B0m2

L0 iterations if we set

α0 = 4(L0)2

m . Indeed, all invocations of Batch Multicalibration V on f 0 with the identification function V 0 can be pieced together into a single process that first multicalibrates f 0 with respect to G0 1, then takes the resulting predictor and multicalibrates it with respect to G0 2, and so on until the stopping condition of the while loop in Joint Multicalibration is met. This is equivalent to a single run of Batch Multicalibration V where the group family is externally updated from time to time: G0 1 G0 2 . . . G0 t . . .. But by Lemma E.1, this process cannot perform a total of more than B0m2

L0 updates on the predictor f 0. Since the predictor f 0 is updated at least once in each iteration of the while loop of Joint Multicalibration, this also bounds the number of iterations of the while loop.

Now, for each iteration of the while loop, we have m calls to Batch Multicalibration V as applied to all identification functions V 1 γ0 for γ0 [1/m]. Again by Lemma E.1, each of them takes at most B1m2

L1 updates to converge. This follows directly from Assumption 5.4, which states that Vγ0( , P) is L1-Lipschitz and monotonically increasing for all P P (not just for P such that Γ0(P) = γ0). Naively, running the subroutine m times, once for each level set of f 0 t+1, would amount to a total of m B1m2

L1 iterations. But in fact, all these m invocations can be viewed as a single invocation of Batch Multicalibration V that updates the predictor f 1 for Γ1 using an identification function V 1 defined as V 1 γ0 on each level set {f 0 t+1 = γ0} (which is well defined since these level sets partition the domain X).

Therefore, there will be only at most B1m2

L1 across all these m invocations of Batch Multicalibration V .

Taking the above observations together, Algorithm 2 will therefore terminate after at most B0m2

L1 = B0B1m4

L0L1 updates to f 0 and to f 1, as claimed.

Multicalibration Guarantees: Now, we show that the predictors f 0 T , f 1 T output at termination satisfy the conditions of Definition 5.5 of approximate joint multicalibration.

By the stopping condition of the while loop, at termination we have for all γ0, γ1, G that Pr x X[f T (x) = (γ0, γ1), x

G] V 0(γ0, Y(γ0,γ1,G)) 2 α0

m , implying after dividing by Prx X [x G, f 1 T (x) = γ1] that

Pr x X[f 0 T (x) = γ0|x G, f 1 T (x) = γ1] V 0(γ0, Y(γ0,γ1,G)) 2 α0/m Pr[x G, f 1 T (x) = γ1] for all G G, γ1 Rangef 1 T .

(5) For every G G and γ1 Rangef 1 T , summing this inequality over all at most m values γ0 Rangef 0 T , we obtain that:

γ0 Rangef0 T

Pr x X[f 0 T (x) = γ0|x G, f 1 T (x) = γ1] V 0(γ0, Y(G,γ0,γ1)) 2 α0

Prx X[x G, f 1 T (x) = γ1],

The Statistical Scope of Multicalibration

so the predictor f 0 T satisfies its joint multicalibration condition (1) of Definition 5.5.

Now we show that the predictor f 1 T for Γ1 satisfies its joint multicalibration condition (2). By construction, for each γ0 [1/m] the function f 1 T is equal to f 1,γ0

T in the region {x X : f 0 T (x) = γ0}. Since f 1,γ0

T is output by the corresponding call to Batch Multicalibration V , by Lemma E.1 this guarantees for each G G1,γ0

X : f 0 T (x) = γ0} : G G} and for each γ1 [1/m] that Pr x X[f 1 T (x) = γ1, x G ] V 1 γ0(γ1, Y(γ1,G )) 2 α1

m , implying

that Prx X [f T (x) = (γ0, γ1), x G] V 1 γ0(γ1, Y(γ0,γ1,G)) 2 α1

m for all G G.

Therefore, we have for all γ0, γ1, G the bound

V 1 γ0(γ1, Y(γ0,γ1,G))

α1/m Prx X [f T (x) = (γ0, γ1), x G]. (6)

But observe that we instead want to bound V 1 Γ0(Y(G,γ0,γ1)) γ1, Y(G,γ0,γ1) , the absolute value of the true identification

function on this set. This is where we can make use of Assumption 5.3, which gives us Lc-Lipschitzness of V 1 γ0( , ) as a function of γ0, as well as Assumption 5.2, which gives us L0 a-anti-Lipschitzness of V 0(γ0, ) as a function of γ0: we obtain that

|V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1))| |V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1)) V 1 γ0(γ1, Y(γ0,γ1,G))| + |V 1 γ0(γ1, Y(γ0,γ1,G))|

Lc|γ0 Γ0 Y(G,γ0,γ1) | + |V 1 γ0(γ1, Y(γ0,γ1,G))|

Lc L0 a|V 0(γ0, Y(G,γ0,γ1))| + |V 1 γ0(γ1, Y(γ0,γ1,G))|

v u u t α0/m Pr x X[f T (x) = (γ0, γ1), x G] +

v u u t α1/m Pr x X[f T (x) = (γ0, γ1), x G],

where the fourth step is by substituting in Inequalities 5 and 6.

From here, for all γ0, γ1, G we have the bound V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1)) (Lc L0 a

Pr x X[f T (x) = (γ0, γ1), x G] ,

and after squaring both sides of the inequality, we obtain: V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1)) 2 (Lc L0 a

α1)2/m Pr x X[f T (x) = (γ0, γ1), x G] for all γ0, γ1, G.

Now multiplying both sides by Pr x X[f 1 T (x) = γ1|x G, f 0 T (x) = γ0] and noting that

α1)2 2((Lc L0 a)2α0 + α1) = 2((Lc L0 a)2 4(L0)2 + 4(L1)2)/m = 8((L0L0 a Lc)2 + (L1)2)/m = α1 ,

Pr x X[f 1 T (x) = γ1|x G, f 0 T (x) = γ0] V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1)) 2 α1 /m Pr x X[f 0 T (x) = γ0, x G] for all γ0, γ1, G.

For every G G and γ0 Rangef 0 T , summing this inequality over all at most m values γ1 Rangef 1 T , we obtain that:

γ1 Rangef1 T

Pr x X[f 1 T (x) = γ1|x G, f 0 T (x) = γ0] V 1 Γ0(Y(G,γ0,γ1))(γ1, Y(G,γ0,γ1)) 2 α1 Pr x X[f 0 T (x) = γ0, x G],

so the predictor f 1 T satisfies its joint multicalibration condition (2) of Definition 5.5, thus concluding the proof.

The Statistical Scope of Multicalibration

F. Sequential Multicalibration

We now turn to the sequential adversarial setting, in which there is no underlying distribution, and our goal will be to obtain approximate (G, V )-multicalibration (Definition 4.2) on the empirical distribution defined by the transcript π of an interaction between the Learner and an Adversary. This generalizes sequential multicalibration for means and quantiles studied by Gupta et al. (2022) to arbitrary elicitable properties. In fact, even for quantiles, we give a strengthening of the result of Gupta et al. (2022) they give an ℓ variant of calibration that makes use of bucketing in its conditioning event we give a bound on the same ℓ2-notion of calibration we use for batch calibration, without any bucketing. Garg et al.

(2023) similarly obtain this guarantee for sequential mean multicalibration.

F.1. Setup and Preliminaries

F.1.1. THE SEQUENTIAL LEARNING SETTING

In the sequential setting, a Learner interacts with an Adversary in rounds t = 1 to T as follows:

1. The Adversary chooses a feature vector xt X and a distribution Yt Y (possibly subject to some restrictions), and reveals xt to the Learner.

2. The Learner makes a prediction pt R.

3. The Adversary samples yt Yt and reveals yt to the Learner.

The record of the interaction accumulates in a transcript π = {(xt, pt, yt)}T t=1. For any s T and transcript π, the prefix of the transcript π<s is defined as π<s = {(xt, pt, yt)}s 1 t=1. We write Π<s for the domain of all transcripts of length < s. A Learner is a collection of mappings (for each round t T) Lt : Π<t X R, and an Adversary is a collection of mappings At : Π<t X Y, specifying their behavior given their observations thus far.

Now we can introduce our strong, ℓ2, definition of online multicalibration that we will then show how to achieve. Definition F.1 (Online Multicalibration). Fix a transcript π = {(xt, pt, yt)}T t=1. Let n(π, G) = |{t : xt G}| denote the number of rounds containing a member of group G in π, and n(π, γ, G) = |{t : xt G, pt = γ}| denote the number of rounds containing a group G in which the prediction pt was γ.

Fix π, a collection of groups G, a property Γ, and an identification function V for Γ. We say that the transcript π is α-approximately (G, V )-multicalibrated if for all G G:

t:pt=γ,xt G

V (γ, yt) n(π, γ, G)

α T n(π, G).

Remark F.2. Observe that this is exactly the definition of approximate multicalibration we gave in Definition 4.2, in which the empirical distribution over π replaces the distribution D.

We can simplify the notion of multicalibration somewhat by canceling terms: Observation 1. Fix a transcript π, a collection of groups G, a property Γ, and an identification function V for Γ. For each group G G define the quantity:

K2(G, π) = X

1 n(π, γ, G)

t:pt=γ,xt G V (γ, yt)

Then π is α-approximately (G, V )-multicalibrated if K2(G, π) αT for all groups G G.

In the online setting, our goal will be to control the growth of K2(G, π) as the transcript is generated, for each G G. The following Lemma will be key: Lemma F.3. Fix a partial transcript π<s = {(xt, pt, yt)}s 1 t=1 and a one-round continuation (xs, ps, ys). Write π s = π<s (xs, ps, ys) for the transcript extended by one round. Define:

R(π<s, G, γ) = X

t<s:pt=γ,xt G V (γ, yt).

The Statistical Scope of Multicalibration

Then for every G G, if xs G, we have:

K2(G, π s) K2(G, π<s) = 0.

If xs G and ps = γ, we have:

K2(G, π s) K2(G, π<s) 1 n(π<s, γ, G) 2V (γ, ys)R(π<s, G, γ) + V (γ, ys)2 .

Proof. If xs G, then K2(G, π s) = K2(G, π<s) by definition and we are done. Otherwise, if xs G we can calculate:

K2(G, π s) K2(G, π<s)

= 1 n(π<s, γ, G) + 1

t<s:pt=γ,xt G V (γ, yt)

+ V (γ, ys)

1 n(π<s, γ, G)

t<s:pt=γ,xt G V (γ, yt)

1 n(π<s, γ, G)

t<s:pt=γ,xt G V (γ, yt)

+ V (γ, ys)

1 n(π<s, γ, G)

t<s:pt=γ,xt G V (γ, yt)

1 n(π<s, γ, G) 2V (γ, ys)R(π<s, G, γ) + V (γ, ys)2 .

This concludes the proof.

F.1.2. KEY TOOL: ONLINE MINIMAX MULTIOBJECTIVE OPTIMIZATION

For our online algorithm below, we will use the Multiobjective Optimization framework introduced by Lee et al. (2022).

Definition F.4 (Online Minimax Multiobjective Optimization Setting). A Learner plays against an Adversary over rounds t [T] := {1, . . . , T}. Over these rounds, the Learner accumulates a d-dimensional loss vector (d 1), where each round s loss vector lies in [ C, C]d for some C > 0. At each round t, the Learner and the Adversary interact as follows:

1. Before round t, the Adversary selects and reveals to the Learner an environment comprising:

(a) The Learner s and Adversary s respective convex compact action sets X t, Yt embedded into a finite-dimensional Euclidean space; (b) A continuous vector valued loss function ℓt( , ) : X t Yt [ C, C]d, with each ℓt j( , ) : X t Yt [ C, C] (for j [d]) convex in the 1st and concave in the 2nd argument.

2. The Learner selects some xt X t.

3. The Adversary observes the Learner s selection xt, and responds with some yt Yt.

4. The Learner suffers (and observes) the loss vector ℓt(xt, yt).

The Learner s objective is to minimize the value of the maximum dimension of the accumulated loss vector after T rounds in other words, to minimize: maxj [d] P

t [T ] ℓt j(xt, yt).

A key quantity in the analysis of the Learner s performance in the online minimax multi-objective optimization setting is the Adversary-Moves-First value of the stage games at each round t of the interaction i.e. how well the Learner could do if (counter-factually) she knew the Adversary s action ahead of time.

Definition F.5 (Adversary-Moves-First (AMF) Value at Round t). The Adversary-Moves-First value of the game defined by the environment (X t, Yt, ℓt) at round t is:

wt A := sup yt Yt min xt X t

max j [d] ℓt j(xt, yt) .

The Statistical Scope of Multicalibration

We can measure the performance of the Learner by comparing it to a benchmark defined by the Adversary moves first values of the games defined at each round.

Definition F.6 (Adversary-Moves-First (AMF) Regret). On transcript πt ={(X s, Ys, ℓs), xs, ys}t s=1, we define the Learner s Adversary Moves First (AMF) Regret for the jth dimension at time t to be:

Rt j(πt) :=

s=1 ℓs j(xs, ys)

The overall AMF Regret is then defined as follows: Rt(πt) = maxj [d] Rt j.

Lee et al. (2022) show that in any online minimax multiobjective optimization setting, the following Algorithm 4 obtains diminishing AMF regret.

Algorithm 4 General Algorithm for the Learner that Achieves Sublinear AMF Regret

for rounds t = 1, . . . , T do

Learn adversarially chosen X t, Yt, and loss function ℓt( , ).

Let χt j := exp η Pt 1 s=1 ℓs j(xs, ys)

i [d] exp η Pt 1 s=1 ℓs i(xs, ys) for j [d].

Play xt argmin x X t max y Yt X

j [d] χt j ℓt j(x, y).

Observe the Adversary s selection of yt Yt. end for

Theorem F.7 (AMF Regret Guarantee of Algorithm 4 (Lee et al., 2022)). For any T ln d, Algorithm 4 with learning rate

ln d 4T C2 obtains, against any Adversary, AMF regret bounded by: RT 4C

F.2. Canonical Algorithm for Sequential Multicalibration

In the rest of this section, we show how for any bounded and continuous elicitable property Γ with a Lipschitz identification function V , and for any finite group structure G, the problem of obtaining diminishing (G, V )-multicalibration error in the sequential adversarial setting can be cast as an instance of online minimax multiobjective optimization, and so can be solved with an appropriate instantiation of Algorithm 4 with multicalibration error bounds following from an appropriate instantiation of Theorem F.7.

Assumptions Throughout this section, we fix a continuous elicitable property Γ with a bounded range, which we w.l.o.g. rescale such that RangeΓ = [0, 1]; and we also fix an identification function V for Γ. Moreover, we will assume that the label space Y = [0, 1].

In our current online setting, we need to make two continuity assumptions on the identification function V in relation to the Adversary s play. Our first assumption is a weaker version of the batch Assumption 4.1. Namely, we will assume that the Adversary s chosen distributions Yt in every round t are such that V ( , Yt) is Lipschitz in the Learner s prediction, but we do not assume anything about the magnitude of the individual Lipschitz constants in each round: only that they exist, and that their average value over all rounds is bounded by some L.

Assumption F.8 (Average Lipschitzness of V in the Learner s Action). Assume that at each round t, the Adversary s label distribution Yt is such that the identification function V for property Γ is Lt-Lipschitz for some Lt < :

|V (γ, Yt) V (γ , Yt)| Lt|γ γ | for all γ, γ .

We make no assumption about the individual Lipschitz constants Lt, but assume that their average value is bounded by L:

The Statistical Scope of Multicalibration

Our second continuity assumption on V requires that, holding the Learner s play fixed, V should be appropriately piecewise uniformly continuous in the Adversary s choice of label. Intuitively, this assumption s main function will be to ensure that the Adversary s action space is compact and finite-dimensional (thus satisfying the technical requirements of the online minimax multiobjective framework on the Adversary s action space), up to an arbitrarily small error.

Assumption F.9 (Either the Adversary Plays Finite-Support Distributions, or V Is Piecewise Uniformly Continuous in the Adversary s Action). We assume that either the Adversary s distributions Yt are finite-support in all rounds t, or otherwise the following condition on V must hold:

Fix any integer discretization parameter m 1 for the Learner s play. Then, we assume that fixing any γ [1/m], the function V (γ, ) : Y R is piecewise uniformly continuous in the Adversary s action, in the sense that there exist finitely many points 0 = δ1 < δ2 < . . . < δM+1 = 1 such that the function V (γ, ) is uniformly continuous on each subinterval (δi, δi+1) for all i = 1 . . . M.

We now introduce our canonical algorithm, and prove its guarantees subject to Assumptions F.8 and F.9.

Algorithm 5 Online Multicalibration(G, V, m)

Initialize an empty transcript π 0. for rounds t = 1, . . . , T do

Observe the Adversary s chosen feature vector xt. Define the loss function ℓt : [1/m] G R|G| such that for each G G:

ℓt G(γt, yt) = X

γ [1/m] 1[xt G, γt = γ] 1 n(π<t, γ, G) 2V (γ, yt)R(π<t, G, γ) + V (γ, yt)2 ,

where: R(π<t, G, γ) = X

s<t:ps=γ,xs G V (γ, ys).

Let χt G := exp η Pt 1 s=1 ℓs G(ps, ys)

G G exp η Pt 1 s=1 ℓs G (ps, ys) for G G.

Let P t argmin P [1/m] max y

G G E p P χt G ℓt G(p, y) .

Sample pt P t and make prediction pt. Observe the Adversary s selection of yt. Update the transcript π t = π t 1 (xt, pt, yt). end for

Theorem F.10 (Algorithmic Guarantees for Sequential Multicalibration). Fix any finite collection of groups G and any bounded elicitable property Γ with RangeΓ = [0, 1] and with a bounded identification function V satisfying |V (γ, y)| C for all γ, y. Suppose that the Adversary chooses a sequence of distributions that together with V satisfy Assumption F.8 with Lipschitz constant L, and Assumption F.9. Then for any m > 0, and any T max{ln |G|, 3m}, there is a randomized algorithm for the Learner (Algorithm 5) that chooses amongst m discrete predictions at every round and that (together with the Adversary) induces a transcript distribution after T rounds that produces a transcript satisfying α-approximate (G, V )-multicalibration for:

m + 2m C2 ln T

Taking m = Θ

T , this gives:

max ln2 T, ln |G|

The Statistical Scope of Multicalibration

Proof. We embed our learning problem into the online minimax optimization setting so that we can apply Theorem F.7. First, what are the Learner s and the Adversary s strategy spaces? We now define them, and show that both of these are convex, compact and finite dimensional sets as required.

At each round we let the Learner s strategy space be [1/m], the simplex of probability distributions over predictions γt discretized at the granularity of 1/m. This clearly is a convex, compact, and finite-dimensional space.

For the Adversary s strategy space, we would have wanted it to be the set of all probability distributions over Y, but that would not be compact or finite-dimensional; so instead we assume that in each round t, the Adversary s strategy space is the (convex, compact and finite-dimensional) set of all distributions over Y = [0, 1] supported on at most some (arbitrary and unspecified) finite number Nt of points in [0, 1]. (To be clear, our algorithm below does not need to know Nt, nor will it be included in the performance bounds for it.) Recalling Assumption F.9, we easily see that this is without loss of generality, up to an arbitrarily small error term ϵ > 0. Indeed, for any distribution Yt Y, this assumption lets us find, for any ϵ > 0, a finite-support distribution Yt (supported over sufficiently many points) such that for any γ [1/m], | Ey Yt[V (γ, y)] Ey Yt[V (γ, y)]| < ϵ. (Which would have its support consist of the points δi from the assumption, plus sufficiently many discretization points inside each interval (δi, δi+1).)

Next, we need to define the loss function ℓt used at each round. We take the dimension of the loss function to be d = |G| with a coordinate devoted to each group G G. Suppose at round t, the Adversary has chosen feature vector xt (which, recall, is shown to the Learner before she must make a prediction). Then we define the loss vector ℓt as follows. For each G G:

ℓt G(Algt, Yt) = E γt Algt,yt Yt

1 xt G, γt = i

R π<t, G, i

where Algt [1/m] is the distribution over predictions chosen by the Learner, and Yt is the label distribution chosen by the Adversary. By linearity of expectation, this loss function is linear in the actions of both players, and so in particular is convex-concave as required. By the boundedness of V , the definition of R, and the fact that 1 xt G, γt = i m = 1 for exactly one value of i, this loss function takes values in [ C , C ] as required, for C 3C2.

Next, we need to upper-bound the Adversary-Moves-First value of the game at round t:

w A t = sup Yt min Algt [ 1

m ] max G G E γt Algt,yt Yt

i [m] 1 xt G, γt = i

m, yt)R(π<t, G, i

To bound w A t , consider what the Learner should do if the Adversary goes first, revealing the true label distribution Yt. The Learner then knows the true property value γ t = Γ(Yt), so if she could play γt = γ t , this would ensure that V (γt, Yt) = 0, implying that

w A t = (V (γ t , yt))2

n(π<t, γ t , G) C2

n(π<t, γ t , G).

The Learner cannot generally play γ t (since it may not be a multiple of 1/m and hence not in her strategy space), but she can select the discrete point γt [1/m] that is closest to γ t (in particular, this γt will satisfy |γ t γt| 1 m). With this action, the Learner will achieve 0 loss in all coordinates corresponding to groups G such that xt G (since for each of these coordinates, the indicator 1[xt G, γt = i m] = 0 for all i). Thus, it remains to consider the coordinates corresponding to groups G such that xt G. Let i be such that γt = i/m. Then, the indicator 1[xt G, γt = i m] = 1, so the loss value in this coordinate can be bounded as:

" 1 n π<t, i

R π<t, G, i

where we used that Eyt Yt[V (γ t , yt)] = 0, that |γt γ t | 1

m, and that V ( , Yt) is Lt-Lipschitz by Assumption F.8.

This latter expression thus serves as an upper bound on the AMF value w A t . With this bound in hand, we can now apply

The Statistical Scope of Multicalibration

Theorem F.7 to conclude that Algorithm 4 obtains the following AMF regret bound:

max G G 1 T

t=1 E γt Algt,yt Yt

1 n (π<t, γt, G)

2V (γt, yt) R π<t, G, γt + V (γt, yt)2

max G G 1 T

n (π<t, γt, G)

m + max G G 1 T

n (π<t, γt, G) + 12C2 r

m + max G G C2

n(π<T ,γ,G) X

m + 2m C2 ln T

where the second inequality uses our Lipschitz Assumption F.8, the equality uses that n(π t, i

m, G) = n(π<t, i

m, G) + 1 on any round t in which xt G and γt = i m, and the second-to-last inequality uses that P

γ [1/m] n(π<T , γ, G) = T.

By Lemma F.3, this implies that for all groups G G, their miscalibration error is bounded as:

E[K2(G, π)] =

t=1 E K2 G, π t K2 G, π<t 2CL

m + 2m C2 ln T

which completes the proof.

G. Formal Statements of Results in Section 6

G.1. Joint Multicalibration of Bayes Risks

Consider any Bayes pair (Γ, ΓB) with respect to a strictly consistent scoring function S(γ, y). As in Section 5, we assume that RangeΓ [0, 1] and RangeΓB [0, 1]. To show that Bayes pairs are jointly multicalibratable, we will need to set up several assumptions on the scoring and identification functions associated with (Γ, ΓB), in order to ensure the satisfaction of Assumptions 4.1, 5.2, 5.3, and 5.4 that the generic joint multicalibration result of Section 5 relies on.

To satisfy Assumption 4.1, we assume that the property Γ has an identification function V that is strictly increasing and L-Lipschitz in its first argument. To satisfy Assumption 5.2, we additionally assume that V ( , P) is La-anti-Lipschitz for P P.

Now note that for all γ RangeΓ, the Bayes risk ΓB by definition satisfies ΓB(P) = S(γ, P) on the level set {P P : Γ(P) = γ} of Γ. As a result, the identification function for the Bayes risk ΓB on the level set {P P : Γ(P) = γ} can be simply taken to be: V B γ (γB, y) := γB S(γ, y)

for all γB, y [0, 1]. Taking the expectation over any P P, we can thus write the expected conditional identification function of ΓB conditioned on Γ = γ as V B γ (γB, P) := γB S(γ, P).

To satisfy Assumption 5.3, we need to enforce the Lipschitzness of V B γ be Lipschitz with respect to its subscript γ. To do so, we assume that the scoring function S for the Bayes pair (Γ, ΓB) is LS-Lipschitz in its first argument. For any γB and any P, this lets us write |Vγ(γB, P) Vγ (γB, P)| = |S(γ , P) S(γ, P)| LS|γ γ |, implying that V B γ is LS-Lipschitz in γ.

The Statistical Scope of Multicalibration

Finally, we verify Assumption 5.4 of Section 5. Note that the identification function V B γ ( , P) for the Bayes risk ΓB is well-defined for every γ [0, 1] and P P, even when Γ(P) = γ. Furthermore, V B γ (γB, P) is linear in γB with slope 1. Thus, V B γ ( , P) is strictly increasing and, in fact, 1-Lipschitz for γ [0, 1] and P P, as desired.

With all requisite assumptions on the scoring and identification functions for Γ and ΓB satisfied, we can now invoke Theorem 5.6 to obtain the following joint multicalibration guarantees for Bayes pairs:

Theorem G.1 (Bayes Pairs Are Jointly Multicalibratable). Consider any Bayes pair (Γ, ΓB) with respect to a strictly consistent scoring function S. Let V be an identification function for Γ. Assume that: (1) The scoring function S is LS-Lipschitz in its first argument; (2) V is strictly increasing, L-Lipschitz and La-anti-Lipschitz in its first argument.

Pick a discretization factor m 1. Set α0 = 4L2

m and α1 = 4 m. Let α1 = 8 m((LLa LS)2 + 1). Given any G 2X , instantiate Joint Multicalibration (Algorithm 2) using the id function V for Γ, and the id function collection V B

for ΓB, such that V B γ (γB, P) := γB S(γ, P) for all γ, γB [0, 1], P P.

Then, Algorithm 2 will output an 4L2

m((LLa LS)2 + 1) -approximately (G, V, V B)-jointly multicalibrated predictor

f = (f 0, f 1) for (Γ, ΓB), in at most O m4

L updates5 to the joint predictor f.

G.2. Joint (Quantile, CVa R) Multicalibration

By itself, the CVa R is not sensible for calibration. Using our Theorem 3.4, this follows automatically from a well-known negative result of Gneiting (2011), who showed that CVa Rτ is not elicitable as it has nonconvex level sets for various simple distribution families P. Fact 3 (CVa Rτ has nonconvex level sets (Gneiting, 2011)). For any τ [0, 1], CVa Rτ has nonconvex level sets relative to any class P of distributions over some interval I R that includes the finite-support distributions, or the finite mixtures of compact-support distributions with well-defined PDF.

On the positive side, as an easy corollary of Theorem G.1, we obtain our next result that the pair (quantile, CVa R) can be jointly multicalibrated. To be able to apply Theorem G.1, it suffices to identify a strictly consistent scoring function Sτ for which the pair (τ-quantile, CVa Rτ) for any τ [0, 1] is a Bayes pair, and then obtain the Lipschitz constant for Sτ, as well as the Lipschitz and anti-Lipschitz constants for a strictly increasing identification function Vτ for the τ-quantile.

And indeed, it is well-known (see e.g. Example 1 in Embrechts et al. (2021)) that (τ-quantile, CVa Rτ) is a Bayes pair for a scoring function Sτ that is the rescaled (by a factor of 1 1 τ ) pinball loss:

Fact 4 ((τ-quantile, CVa Rτ) is a Bayes pair). Fix any τ [0, 1] and let Γ := qτ be a τ-quantile, and ΓB := CVa Rτ be the τ-CVa R. Then (Γ, ΓB) is a Bayes pair with respect to the strictly Γ-consistent scoring function Sτ defined, for all γ, y [0, 1], as:

Sτ(γ, y) := γ + 1 1 τ (y γ)+,

where we have denoted (u)+ = max{u, 0}.

To bound the Lipschitz constant of Sτ, note that its derivative in the first argument is Sτ (γ,y)

γ = 1[y γ] τ 1 τ 1[y > γ].

Thus Sτ has Lipschitz constant LSτ supγ ,y Sτ (γ ,y )

γ = max{1, τ 1 τ }.

Now we need to settle on a strictly increasing (in the first argument) identification function Vτ for the τ-quantile qτ and investigate its Lipschitz properties. Specifically, let us use the standard quantile id function defined as Vτ(γ, P) := Pry P [y γ] τ for all γ and all P P. Evidently, Vτ( , P) is just the CDF of P shifted by τ. Thus, by assuming that all distributions in P have a strictly increasing CDF, we ensure that Vτ is strictly increasing in γ.

To conveniently quantify the Lipschitzness of Vτ, assume that it is differentiable in γ: this is equivalent to all P P having a well-defined PDF pdf P , which will then be the derivative of Vτ( , P): namely, Vτ (γ,P )

γ = pdf P (γ). Therefore, enforcing a Lipschitz and an anti-Lipschitz constant on Vτ simply translates to assuming an upper and a lower bound on

5Specifically, Algorithm 2 will perform at most R R+ m4

L updates on the predictor f, where we have denoted R =

sup γ,y [0,1] S(γ, y) inf γ,y [0,1] S(γ, y) and R+ = 1

2 max γ [1/m]

sup γB,y [0,1] (γB S(γ, y))2 inf γB,y [0,1](γB S(γ, y))2 !

The Statistical Scope of Multicalibration

the PDF of the distributions in the underlying family P. Indeed, if we now assume that for all P P, the PDF satisfies 0 < M1 pdf P (y) M2 < for all y [0, 1], this gives us that Vτ is M2-Lipschitz and M1-anti-Lipschitz.

Plugging the above Lipschitz and anti-Lipschitz bounds on Sτ and Vτ into Theorem G.1, we thus obtain the following joint (quantile, CVa R) multicalibration result:

Theorem G.2 (Joint Multicalibration of (τ-quantile, CVa Rτ)). Fix any constants 0 < M1 < M2, and take any family P of probability distributions over [0, 1] such that each P P has a strictly increasing CDF and a well-defined density function pdf P satisfying M1 pdf P (y) M2 for all y [0, 1].

Fix any target coverage level τ [0, 1], and any group structure G 2X on the dataset. Pick a discretization m 1. Set α0 = 4M 2 2 m and α1 = 4

m. Let α1 = 8

m((M1M2 max{1, τ/(1 τ)})2 + 1).

Then, by appropriately instantiating Joint Multicalibration (Algorithm 2), we can compute a 4M 2 2 m , 8

M1M2 max 1, τ 1 τ

approximately jointly G-multicalibrated predictor

f = (f 0, f 1) for the pair (τ-quantile, CVa Rτ), after at most O m4 M2

updates to the joint predictor f.

G.3. Most Distortion Risk Measures Are Not Sensible for Calibration

We begin by formally stating the result of Kou & Peng (2016) and Wang & Ziegel (2015) that we will use. It shows that out of all distortion risk measures, the only ones that have convex level sets across the family of all finite-support distributions are: (1) means, (2) quantiles, and (3) two other risk measures which are quantile variants; here are the corresponding definitions.

Definition G.3. Consider any family P of probability distributions. For any distribution P P, let its CDF (which need not be strictly increasing or continuous) be denoted FP . We define the following distributional properties over P:

1. For any τ [0, 1], the τ-quantile is defined by:

qτ(P) = inf{y : FP (y) τ} for P P.

2. For any τ [0, 1] and c [0, 1], define the property:

q1 τ,c(P) := c inf{y : FP (y) τ} + (1 c) inf{y : FP (y) > τ} for P P.

3. For any τ [0, 1] and c [0, 1], define the property:

q2 τ,c(P) := c inf{y : FP (y) > 0} + (1 c) inf{y : FP (y) = 1} for P P.

Observe that (1) q2 τ,c is just a convex combination of the 0% quantile and the 100% quantile of the distribution; and (2) q1 τ,c in fact is (for all c [0, 1]) the τ-quantile subject to the CDF FP being strictly increasing.

Kou & Peng (2016) showed that distribution means, together with the three (parametric) properties listed in Definition G.3, are the only distortion risk measures with convex level sets. The proof of this result was then simplified and refined by Wang & Ziegel (2015), who showed that it holds even over the family of distributions supported on at most 3 points.

Theorem G.4 (On Distortion Risk Measures with Convex Level Sets (Kou & Peng, 2016; Wang & Ziegel, 2015)). Let P3 be the set of all probability distributions supported on at most 3 real-valued points. Let Pbd be the set of all bounded distributions over the reals with a well-defined PDF. Let P be any family of distributions over the reals such that either P P3, or P Pbd.

Consider any distortion risk measure Γ : P R. Then, Γ violates the convex level sets assumption on P, unless it is one of the following distributional properties:

1. The distributional mean;

The Statistical Scope of Multicalibration

2. A τ-quantile qτ, for some τ [0, 1];

3. The property q1 τ,c, for some τ, c [0, 1];

4. The property q2 τ,c, for some τ, c [0, 1].

Now, our Theorem 3.4 lets us immediately conclude that for any P as in Theorem G.4, no distortion risk measure other than means, quantiles, or the two parametric properties q1 τ,c or q2 τ,c is sensible for calibration over any P-compatible family of dataset distributions D that includes all the P-compatible 2-point dataset distributions. To formally restate this:

Theorem G.5 (Sensibility for Calibration for Distortion Risk Measures). Let P3 be the set of all probability distributions supported on at most 3 real-valued points. Let Pbd be the set of all bounded distributions over the reals with a well-defined PDF. Let P be any convex space of distributions over the reals such that either P P3, or P Pbd.

Consider any distortion risk measure Γ : P R, and any family D of P-compatible dataset distributions that includes all the P-compatible 2-point dataset distributions.

Then Γ is not sensible for calibration over D, unless Γ is one of the following distributional properties:

1. The distributional mean;

2. A τ-quantile qτ, for some τ [0, 1];

3. The property q1 τ,c, for some τ, c [0, 1];

4. The property q2 τ,c, for some τ, c [0, 1].