# Multicalibration as Boosting for Regression

Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, Jessica Sorrell

Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia PA, USA. Correspondence to: Aaron Roth. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. We study the connection between multicalibration and boosting for squared error regression. First we prove a useful characterization of multicalibration in terms of a swap-regret-like condition on squared error. Using this characterization, we give an exceedingly simple algorithm that can be analyzed both as a boosting algorithm for regression and as a multicalibration algorithm for a class H, making use only of a standard squared error regression oracle for H. We give a weak learning assumption on H that ensures convergence to Bayes optimality without the need to make any realizability assumptions, giving us an agnostic boosting algorithm for regression. We then show that our weak learning assumption on H is both necessary and sufficient for multicalibration with respect to H to imply Bayes optimality. We also show that if H satisfies our weak learning condition relative to another class C, then multicalibration with respect to H implies multicalibration with respect to C. Finally, we investigate the empirical performance of our algorithm experimentally using an open source implementation that we make available on GitHub at https://github.com/Declancharrison/Level-Set-Boosting.

1. Introduction

We revisit the problem of boosting for regression, and develop a new agnostic regression boosting algorithm via a connection to multicalibration. In doing so, we shed additional light on multicalibration, a recent learning objective that has emerged from the algorithmic fairness literature (Hébert-Johnson et al., 2018). In particular, we characterize multicalibration in terms of a swap-regret-like condition, and use it to answer the question "what property must a collection of functions H have so that multicalibration with respect to H implies Bayes optimality?", giving a complete answer to a question asked by (Burhanpurkar et al., 2021).

Using our swap-regret characterization, we derive an especially simple algorithm for learning a multicalibrated predictor for a class of functions H by reduction to a standard squared-error regression algorithm for H. The same algorithm can also be analyzed as a boosting algorithm for squared error regression that makes calls to a weak learner for squared error regression on subsets of the original data distribution, without the need to relabel examples (in contrast to Gradient Boosting as well as existing multicalibration algorithms). This lets us specify a weak learning condition that is sufficient for convergence to the Bayes optimal predictor (even if the Bayes optimal predictor does not have zero error), avoiding the kinds of realizability assumptions that are implicit in analyses of boosting algorithms that converge to zero error. We conclude that ensuring multicalibration with respect to H corresponds to boosting for squared error regression in which H forms the set of weak learners. Finally, we define a weak learning condition for H relative to a constrained class of functions C (rather than with respect to the Bayes optimal predictor).
We show that multicalibration with respect to H implies multicalibration with respect to C if H satisfies the weak learning condition with respect to C, which in turn implies accuracy at least that of the best function in C.

Multicalibration. Consider a distribution $\mathcal{D} \in \Delta Z$ defined over a domain $Z = X \times \mathbb{R}$ of feature vectors $x \in X$ paired with real valued labels $y$. Informally, a regression function $f : X \to \mathbb{R}$ is calibrated if for every $v$ in the range of $f$, $E_{(x,y) \sim \mathcal{D}}[y \mid f(x) = v] = v$. In other words, $f(x)$ must be an unbiased estimator of $y$, even conditional on the value of its own prediction. Calibration on its own is a weak condition, because it only asks for $f$ to be unbiased on average over all points $x$ such that $f(x) = v$. For example, the constant predictor that predicts $f(x) = E_{(x,y) \sim \mathcal{D}}[y]$ is calibrated. Thus calibration does not imply accuracy: a calibrated predictor need not make predictions with lower squared error than the best constant predictor. Calibration also does not imply that $f$ is equally representative of the label distribution on different subsets of the feature space $X$. For example, given a subset of the feature space $G \subseteq X$, even if $f$ is calibrated, it may be that $f$ is not calibrated on the distribution conditional on $x \in G$: it might be, e.g., that $E[y \mid f(x) = v, x \in G] \gg v$ and $E[y \mid f(x) = v, x \notin G] \ll v$. To correct this last deficiency, (Hébert-Johnson et al., 2018) defined multicalibration, which is a condition parameterized by a collection of groups $G \subseteq X$, each defined by an indicator function $h : X \to \{0, 1\}$ in some class H. It asks (informally) that for each such $h \in H$, and for each $v$ in the range of $f$, $E[h(x)(y - v) \mid f(x) = v] = 0$. Since $h$ is a binary indicator function for some set $G$, this is equivalent to asking for calibration not just marginally over $\mathcal{D}$, but simultaneously for calibration over $\mathcal{D}$ conditional on $x \in G$. (Kim et al., 2019) and (Gopalan et al., 2022) generalize multicalibration beyond group indicator functions to arbitrary real valued functions $h : X \to \mathbb{R}$. Intuitively, as H becomes a richer and richer set of functions, multicalibration becomes an increasingly stringent condition. But if H consists of the indicator functions for even a very large number of randomly selected subsets $G \subseteq X$, then the constant predictor $f(x) = E_{(x,y) \sim \mathcal{D}}[y]$ will still be approximately multicalibrated with respect to H. What property of H ensures that multicalibration with respect to H implies that $f$ is a Bayes optimal regression function? This question was recently asked by (Burhanpurkar et al., 2021), and we provide a necessary and sufficient condition.

Boosting for Regression. Boosting refers broadly to a collection of learning techniques that reduce the problem of strong learning (informally, finding an error-optimal model) to a series of weak learning tasks (informally, finding a model that offers only a small improvement over a trivial model); see (Schapire and Freund, 2013) for a textbook treatment. The vast majority of theoretical work on boosting studies the problem of binary classification, in which a weak learner is a learner that obtains classification error bounded below 1/2. Several recent papers (Kim et al., 2019; Gopalan et al., 2022) have made connections between algorithms for guaranteeing multicalibration and boosting algorithms for binary classification.
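Before turning to regression boosting, the distinction drawn above between calibration and multicalibration can be made concrete with a minimal numpy sketch (our own illustration, not from the paper): the constant predictor $f(x) = E[y]$ is calibrated marginally, but conditional on membership in a group $G$ its predictions are badly biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
in_G = rng.random(n) < 0.5                  # group indicator h(x) = 1[x in G]
y = np.where(in_G, 0.8, 0.2)                # E[y | x in G] = 0.8, otherwise 0.2

f = np.full(n, y.mean())                    # constant predictor f(x) = E[y] ~ 0.5

# Marginal calibration: E[y - v | f(x) = v] for the single level set v = E[y].
print("marginal bias:", (y - f).mean())               # = 0.0, so f is calibrated
# Group-conditional bias: E[y - v | f(x) = v, x in G] is far from 0.
print("bias on G:    ", (y[in_G] - f[in_G]).mean())   # ~ +0.3, so f is not multicalibrated
```

Multicalibration with respect to a class containing this group indicator rules out exactly this failure mode.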
In this paper, we show a direct connection between multicalibration and the much less well-studied problem of boosting for squared error regression (Friedman, 2001; Duffy and Helmbold, 2002). There is not a single established notion for what constitutes a weak learner in the regression setting ((Duffy and Helmbold, 2002) introduce several different notions), and unlike boosting algorithms for classification problems which often work by calling a weak learner on a reweighting of the data distribution, existing algorithms for boosting for regression typically resort to calling a learning algorithm on relabelled examples. We give a boosting algorithm for regression that only requires calling a squared error regression learning algorithm on subsets of examples from the original distribution (without relabelling), which lets us formulate a weak learning condition that is sufficient to converge to the Bayes optimal predictor, without making the kinds of realizability assumptions implicit in the analysis of boosting algorithms that assume one can drive error to zero. 1.1. Our Results We focus on classes of real valued functions H that are closed under affine transformations i.e. classes such that if fpxq P H, then for any pair of constants a, b P R, pafpxq bq P H as well. Many natural classes of models satisfy this condition already (e.g. linear and polynomial functions and regression trees), and any neural network architecture that does not already satisfy this condition can be made to satisfy it by adding two additional parameters (a and b) while maintaining differentiability. Thus we view closure under affine transformations to be a weak assumption that is enforceable if necessary. First in Section 3 we prove the following characterization for multicalibration over H, for any class H that is closed under affine transformations. Informally, we show that a model f is multicalibrated with respect to H if and only if, for every v in the range of f: E px,yq Drpfpxq yq2|fpxq vs ď min h PH E px,yq Drphpxq yq2|fpxq vs. (See Theorem 3.2 for the formal statement). This is a swap regret -like condition (as in (Foster and Vohra, 1999) and (Blum and Mansour, 2005)), that states that f must have lower squared error than any model h P H, even conditional on its own prediction. Using this characterization, in Section 4 we give an exceedingly simple algorithm for learning a multicalibrated predictor over H given a squared error regression oracle for H. The algorithm simply repeats the following over t rounds until convergence, maintaining a model f : X Ñ t0, 1{m, 2{m, . . . , 1u with a discrete range with support over multiples of 1{m for some discretization factor m: 1. For each level set v P t0, 1{m, 2{m, . . . , 1u, run a regression algorithm to find the ht v P H that minimizes squared error on the distribution D|pft 1pxq vq, the distribution conditional on ft 1pxq v. 2. Replace each level set v of ft 1pxq with ht vpxq to produce a new model ft, and round its output to the discrete range t0, 1{m, 2{m, . . . , 1u Each iteration decreases the squared error of ft, ensuring convergence, and our characterization of multicalibration Multicalibration as Boosting for Regression ensures that we are multicalibrated with respect to H at convergence. Compared to existing multicalibration algorithms (e.g. 
the split and merge algorithm of (Gopalan et al., 2022)), our algorithm is exceptionally simple and makes use of a standard squared-error regression oracle on subsets of the original distribution, rather than using a classification oracle or requiring example relabelling. We can also view the same algorithm as a boosting algorithm for squared error regression. Suppose H (or equivalently our weak learning algorithm) satisfies the following weak learning assumption: informally, that on any restriction of D on which the Bayes optimal predictor is nonconstant, there should be some h P H that obtains squared error better than that of the best constant predictor. Then our algorithm converges to the Bayes optimal predictor. In Section C we give uniform convergence bounds which guarantee that the algorithm s accuracy and multicalibration guarantees generalize out of sample, with sample sizes that are linear in the pseudodimension of H. We then show in Section 5 that in a strong sense this is the right weak learning assumption: Multicalibration with respect to H implies Bayes optimality if and only if H satisfies this weak learning condition. This gives a complete answer to the question of when multicalibration implies Bayes optimality. In Appendix E, we generalize our weak learning condition to a weak learning condition relative to a constrained class of functions C (rather than relative to the Bayes optimal predictor), and show that if H satisfies the weak learning condition relative to C, then multicalibration with respect to H implies multicalibration with respect to C, and hence error that is competitive with the best model in C. We give a fast, parallelizable implementation of our algorithm and in Section 7 demonstrate its convergence to Bayes optimality on two-dimensional datasets useful for visualization, as well as evaluate the accuracy and calibration guarantees of our algorithm on real Census derived data using the Folktables package (Ding et al., 2021). 1.2. Additional Related Work Calibration as a statistical objective dates back at least to (Dawid, 1982). (Foster and Vohra, 1999) showed a tight connection between marginal calibration and internal (equivalently swap) regret. We extend this characterization to multicalibration. Multicalibration was introduced by (Hébert-Johnson et al., 2018), and variants of the original definition have been studied by a number of works (Kim et al., 2019; Jung et al., 2021; Gopalan et al., 2022; Kim et al., 2022; Roth, 2022). We use the ℓ2 variant of multicalibration studied in (Roth, 2022) but this definition implies all of the other variants of multicalibration up to a change in parameters(Appendix A). (Burhanpurkar et al., 2021) first asked the question when does multicalibration with respect to H imply accuracy , and gave a sufficient condition: when H contains (refinements of) the levelsets of the Bayes optimal regression function, together with techniques for attempting to find these. This can be viewed as a strong learning assumption, in contrast to our weak learning assumption on H. Several other papers use regression as a tool to obtain calibration like guarantees of various sorts. (Foster and Kakade, 2006) give an algorithm that gives a smooth variant of multicalibration over a finite class of functions H in an online setting by solving a least-squares linear regression problem using the set of functions h P H as a basis i.e. at every round, they solve a |H| dimensional linear regression problem. 
In addition to being defined only for finite classes H, this does not give an oracle-efficient algorithm: rather than solving regression problems over H, they solve |H|-dimensional linear regression problems, which has a polynomial dependence on H (and obtains calibration error bounds that also scale polynomially with |H|). (Gopalan et al., 2023a) define a weaker condition than multicalibration calibrated multiaccuracy and show that predictors that are calibrated and multiaccurate with respect to an appropriate class of functions are also good loss minimizers. They also show how to obtain calibrated multiaccuracy with a regression based algorithm. However, in contrast to multicalibration, in order for calibrated multiaccurate predictors to be competitive loss minimizers for a large class of loss functions with respect to a benchmark class H, they must be multiaccurate not just with respect to H, but with respect to the levelsets of H. In general the learning problem over the levelsets of H can be much harder than the learning problem over H itself. For example, if H is the class of linear functions, then learning over H corresponds to linear regression, which has an efficient (closed form) solution. However the levelsets of H are (intersections of) linear threshold functions, for which learning is computationally hard. One motivation for (Gopalan et al., 2023a) s work is the fact that only relying on regression makes algorithms easier to implement (as regression problems have differentiable loss even when they don t have efficient closed-form solutions). This motivation also appears in the contextual bandits literature, where recent works have provided reductions to regression oracles despite the fact that algorithms reducing to classification oracles already existed (Foster and Rakhlin, 2020; Foster et al., 2018). Our reduction to a regression rather than classification oracle follow the spirit of (Gopalan et al., 2023a), but with the stronger guarantee of multicalibration rather than calibrated multiaccuracy. Boosting for binary classification was introduced by Multicalibration as Boosting for Regression (Schapire, 1990) and has since become a major topic of both theoretical and empirical study see (Schapire and Freund, 2013) for a textbook overview. Both (Kim et al., 2019) and (Gopalan et al., 2022) have drawn connections between algorithms for multicalibration and boosting for binary classification. In particular, (Gopalan et al., 2022) draw direct connections between their split-andmerge multicalibration algorithm and agnostic boosting algorithms of (Kalai, 2004; Kanade and Kalai, 2009; Kalai et al., 2008). Boosting for squared error regression is much less well studied. (Freund and Schapire, 1997) give a variant of Adaboost (Adaboost.R) that reduces regression examples to infinite sets of classification examples, and requires a base regressor that optimizes a non-standard loss function. (Friedman, 2001) introduced the popular gradient boosting method, which for squared error regression corresponds to iteratively fitting the residuals of the current model and then applying an additive update, but did not give a theoretical analysis. (Duffy and Helmbold, 2002) give a theoretical analysis of several different boosting algorithms for squared error regression under several different weak learning assumptions. Their weak learning assumptions imply convergence of their algorithms to 0 error, which implicitly makes a very strong realizability assumption. 
Our weak learning assumption implies convergence to the Bayes optimal model without realizability assumptions; thus our results can be viewed as giving an agnostic boosting algorithm for regression.

2. Preliminaries

We study prediction tasks over a domain $Z = X \times Y$. Here $X$ represents the feature domain and $Y$ represents the label domain. We focus on the bounded regression setting where $Y = [0, 1]$ (the scaling to $[0, 1]$ is arbitrary). We write $\mathcal{D} \in \Delta Z$ to denote a distribution over labelled examples, $\mathcal{D}_X$ to denote the induced marginal distribution over features, and write $D \sim \mathcal{D}^n$ to denote a dataset consisting of $n$ labelled examples sampled i.i.d. from $\mathcal{D}$. We will be interested in the squared error of a model $f$ with respect to the distribution $\mathcal{D}$, $E_{(x,y) \sim \mathcal{D}}[(y - f(x))^2]$. We abuse notation and identify datasets $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with the empirical distribution over the examples they contain, and so we can write the empirical squared error over $D$ as $E_{(x,y) \sim D}[(y - f(x))^2] = \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$. When taking expectations over a distribution that is clear from context, we will frequently suppress notation indicating the relevant distribution for readability. We write $R(f)$ to denote the range of a function $f$, and when $R(f)$ is finite, use $m$ to denote the cardinality of its range: $m = |R(f)|$.

We are interested in finding models that are multicalibrated with respect to a class of real valued functions H. We use an $\ell_2$ notion of multicalibration as used in (Roth, 2022). This notion has natural relationships to the $\ell_1$ and $\ell_\infty$ notions of multicalibration used by e.g. (Hébert-Johnson et al., 2018; Gopalan et al., 2022). See Appendix A for more details.

Definition 2.1 (Multicalibration). Fix a distribution $\mathcal{D} \in \Delta Z$ and a model $f : X \to [0, 1]$ that maps onto a countable subset of $[0, 1]$. Let H be an arbitrary collection of real valued functions $h : X \to \mathbb{R}$. We say that $f$ is $\alpha$-approximately multicalibrated with respect to $\mathcal{D}$ and H if for every $h \in H$:
$$K_2(f, h, \mathcal{D}) = \sum_{v \in R(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v] \left( E_{(x,y) \sim \mathcal{D}}[h(x)(y - v) \mid f(x) = v] \right)^2 \le \alpha.$$
We say that $f$ is $\alpha$-approximately calibrated if:
$$K_2(f, \mathcal{D}) = \sum_{v \in R(f)} \Pr_{(x,y) \sim \mathcal{D}}[f(x) = v] \left( E_{(x,y) \sim \mathcal{D}}[(y - v) \mid f(x) = v] \right)^2 \le \alpha.$$
If $\alpha = 0$, then we simply say that a model is multicalibrated or calibrated. We will sometimes refer to $K_2(f, \mathcal{D})$ as the mean squared calibration error of a model $f$.

We will characterize the relationship between multicalibration and Bayes optimality.

Definition 2.2 (Bayes Optimal Predictor). Let $f^* : X \to [0, 1]$. We say that $f^*$ is the Bayes optimal predictor for $\mathcal{D}$ if:
$$E_{(x,y) \sim \mathcal{D}}[(y - f^*(x))^2] \le \min_{f : X \to [0,1]} E_{(x,y) \sim \mathcal{D}}[(y - f(x))^2].$$
The Bayes optimal predictor satisfies $f^*(x) = E_{(x',y) \sim \mathcal{D}}[y \mid x' = x]$. We say that a function $f : X \to [0, 1]$ is $\gamma$-approximately Bayes optimal if
$$E_{(x,y) \sim \mathcal{D}}[(y - f(x))^2] \le E_{(x,y) \sim \mathcal{D}}[(y - f^*(x))^2] + \gamma.$$
Throughout this paper, we will denote the Bayes optimal predictor as $f^*$.

3. A Characterization of Multicalibration

In this section we give a simple swap-regret-like characterization of multicalibration for any class of functions H that is closed under affine transformations:

Definition 3.1. A class of functions H is closed under affine transformations if for every $a, b \in \mathbb{R}$, if $h(x) \in H$ then $h'(x) := a h(x) + b \in H$.
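Closure under affine transformations can be enforced mechanically when a base class does not already satisfy it. The sketch below is our own illustration (the choice of a regression tree as the base model is arbitrary): it wraps a fitted base regressor $h$ with two extra parameters $a, b$ and fits $a\,h(x) + b$ by least squares, mirroring the two-parameter trick the paper suggests for neural network architectures.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # illustrative base class

class AffineWrapped:
    """Fit a * h(x) + b on top of a base regressor h, so the effective class
    contains all affine transformations of its members."""

    def __init__(self, base):
        self.base, self.a, self.b = base, 1.0, 0.0

    def fit(self, X, y):
        self.base.fit(X, y)
        h = self.base.predict(X)
        # closed-form least squares for the two parameters (a, b)
        A = np.vstack([h, np.ones_like(h)]).T
        self.a, self.b = np.linalg.lstsq(A, y, rcond=None)[0]
        return self

    def predict(self, X):
        return self.a * self.base.predict(X) + self.b

# usage: AffineWrapped(DecisionTreeRegressor(max_depth=1)).fit(X, y)
```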
As already discussed, closure under affine transformation is a mild assumption: it is already satisfied by many classes of functions H, like linear and polynomial functions and decision trees, and can be enforced for neural network architectures when it is not already satisfied by adding two additional parameters $a$ and $b$, without affecting our ability to optimize over the class.

The first direction of our characterization states that if $f$ fails the multicalibration condition for some $h \in H$, then there is some other $h' \in H$ that improves over $f$ in terms of squared error, when restricted to a level set of $f$. The second direction states the converse: if $f$ is calibrated (but not necessarily multicalibrated), and if there is some level set of $f$ on which $h$ improves over $f$ in terms of squared error, then in fact $f$ must fail the multicalibration condition for $h$. An equivalent characterization was independently shown by (Gopalan et al., 2023b).

Theorem 3.2. Suppose H is closed under affine transformation. Fix a model $f : X \to \mathbb{R}$ and a levelset $v \in R(f)$ of $f$. Then:

1. If there exists an $h \in H$ such that $E[h(x)(y - v) \mid f(x) = v] \ge \alpha$ for $\alpha > 0$, then there exists an $h' \in H$ such that:
$$E[(f(x) - y)^2 - (h'(x) - y)^2 \mid f(x) = v] \ge \frac{\alpha^2}{E[h(x)^2 \mid f(x) = v]}.$$

2. If $f$ is calibrated and there exists an $h \in H$ such that $E[(f(x) - y)^2 - (h(x) - y)^2 \mid f(x) = v] \ge \alpha$, then:
$$E[h(x)(y - v) \mid f(x) = v] \ge \frac{\alpha}{2}.$$

See Appendix B for the proof.

4. An Algorithm (For Multicalibration And Regression Boosting)

We now give a single algorithm, and then show how to analyze it both as an algorithm for obtaining a multicalibrated predictor $f$, and as a boosting algorithm for squared error regression. We give generalization bounds for the algorithm in Appendix C.

Let $m \in \mathbb{N}$ be a discretization term, and let $[1/m] := \{0, \frac{1}{m}, \ldots, \frac{m-1}{m}, 1\}$ denote the set of points in $[0, 1]$ that are multiples of $1/m$. We will learn a model $f$ whose range is $[1/m]$, which we will enforce by rounding its outputs to this range as necessary using the following operation:

Definition 4.1 (Round(f; m)). Let $\mathcal{F}$ be the family of all functions $f : X \to \mathbb{R}$. Let $\mathrm{Round} : \mathcal{F} \times \mathbb{N} \to \mathcal{F}$ be a function such that $\mathrm{Round}(f; m)$ outputs the function $\tilde{f}$ with $\tilde{f}(x) = \arg\min_{v \in [1/m]} |f(x) - v|$.

Unlike other algorithms for multicalibration, which make use of agnostic learning oracles for binary classification, our algorithm makes use of an algorithm for solving squared-error regression problems over H:

Definition 4.2. $A_H$ is a squared error regression oracle for a class of real valued functions H if for every $\mathcal{D} \in \Delta Z$, $A_H(\mathcal{D})$ outputs a function $h \in H$ such that $h \in \arg\min_{h' \in H} E_{(x,y) \sim \mathcal{D}}[(h'(x) - y)^2]$.

For example, if H is the set of all linear functions, then $A_H$ simply solves a linear regression problem (which has a closed form solution).

Algorithm 1 (LSBoost) repeats the following operation until it no longer decreases overall squared error: it runs squared error regression on each of the level sets of $f_t$, and then replaces those level sets with the solutions to the regression problems, and rounds the output to $[1/m]$. We will now analyze the algorithm first as a multicalibration algorithm, and then as a boosting algorithm. For simplicity, in this section we will analyze the algorithm as if it is given direct access to the distribution $\mathcal{D}$. In practice, the algorithm will be run on the empirical distribution over a dataset $D \sim \mathcal{D}^n$, and the multicalibration guarantees proven in this section will hold for this empirical distribution.
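As a concrete instance of Definitions 4.1 and 4.2, the following minimal sketch (ours; function names are illustrative) implements Round(f; m) and a regression oracle for the class H of linear functions, for which, as noted above, the oracle is ordinary least squares and has a closed form.

```python
import numpy as np

def round_to_grid(preds, m):
    """Round(f; m): map each prediction to the nearest multiple of 1/m in [0, 1]."""
    return np.clip(np.round(np.asarray(preds) * m) / m, 0.0, 1.0)

def linear_oracle(X, y):
    """A_H for linear H: the least-squares solution argmin_h sum_i (h(x_i) - y_i)^2.
    The intercept column also makes the class closed under affine transformations."""
    A = np.hstack([np.asarray(X), np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w                                  # h(x) = [x, 1] @ w
```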
In Appendix C we prove generalization theorems, which allow us to translate our in-sample error and multicalibration guarantees over D to out-of-sample guarantees over D. 4.1. Analysis as a Multicalibration Algorithm Theorem 4.3. Fix any distribution D P Z, any model f : X Ñ r0, 1s, any α ă 1, any class of real valued functions H that is closed under affine transformations, and a squared error regression oracle AH for H. For any bound B ą 0 let: HB th P H : max x PX hpxq2 ď Bu be the set of functions in h with squared magnitude bounded by B. Then LSBoostpf, α, AH, D, Bq (Algorithm 1) halts after at most T ď 2B α many iterations and outputs a model f T 1 such that f T 1 is α-approximately multicalibrated with respect to D and HB. See Appendix B for the proof. 3LSBoost can be taken to stand for either Level Set Boost" or Least Squares Boost , at the reader s discretion. Multicalibration as Boosting for Regression Algorithm 1 LSBoost(f, α, AH, D, B) α . Let f0 Roundpf; mq, err0 Epx,yq Drpf0pxq yq2s, err 1 8 and t 0. while perrt 1 errtq ě α 2B do for each v P r1{ms do Let Dt 1 v D|pftpxq vq. Let ht 1 v AHp Dt 1 v q. end for Let: v Pr1{ms 1rftpxq vs ht 1 v pxq ft 1 Roundp ft 1, mq Let errt 1 Epx,yq Drpft 1pxq yq2s and t t 1. end while Output ft 1. 4.2. Analysis as a Boosting Algorithm We now analyze the same algorithm (Algorithm 1) as a boosting algorithm designed to boost a weak learning algorithm AH to a strong learning algorithm. Often in the boosting literature, a strong learning algorithm is one that can obtain accuracy arbitrarily close to perfect, which is only possible under strong realizability assumptions. In this paper, by strong learning , we mean that Algorithm 1 should output a model that is close to Bayes optimal, which is a goal we can enunciate for any distribution D without needing to make realizability assumptions. (Observe that if the Bayes optimal predictor has zero error, then our meaning of strong learning corresponds to the standard meaning, so our analysis is only more general). We now turn to our definition of weak learning. Intuitively, a weak learning algorithm should return a hypothesis that makes predictions that are slightly better than trivial whenever doing so is possible. We take trivial predictions to be those of the best constant predictor as measured by squared error i.e. the squared error obtained by simply returning the label mean. A weak learning algorithm for us can be run on any restriction of the data distribution D to a subset S Ď X, and must return a hypothesis with squared error slightly better than the squared error of the best constant prediction, whenever the Bayes optimal predictor f has squared error slightly better than a constant predictor; on restrictions for which the Bayes optimal predictor also does not improve over constant prediction, our weak learning algorithm is not required to do better either. Traditionally, weak learning assumptions do not distinguish between the optimization ability of the algorithm and the representation ability of the hypothesis class it opti- mizes over. Since we have defined a squared error regression oracle AH as exactly optimizing the squared error over some class H, we will state our weak learning assumption as an assumption on the representation ability of H but this is not important for our analysis here. 
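Putting the pieces together, here is a compact Python sketch of the LSBoost loop as we read Algorithm 1. It is our own illustration: the function name, the use of depth-1 regression trees as the oracle class H, the constant starting model standing in for the input $f$, and the handling of empty level sets are all illustrative choices, and inference-time replay of the stored per-round models is omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost(X, y, m, alpha, B, max_rounds=1000):
    """Sketch of LSBoost: labels are assumed to lie in [0, 1]; m should be taken
    on the order of 2B/alpha (cf. the proof of Theorem 4.3)."""
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    grid = np.arange(m + 1) / m
    f = np.full(len(y), np.round(y.mean() * m) / m)        # f_0: rounded constant model
    err = np.mean((f - y) ** 2)
    history = []                                           # accepted per-round oracle fits
    for _ in range(max_rounds):
        fits, f_new = {}, f.copy()
        for v in grid:                                     # regress on each level set of f
            mask = f == v
            if mask.any():
                h_v = DecisionTreeRegressor(max_depth=1).fit(X[mask], y[mask])
                fits[v] = h_v
                f_new[mask] = h_v.predict(X[mask])
        f_new = np.clip(np.round(f_new * m) / m, 0.0, 1.0)  # Round( . ; m)
        err_new = np.mean((f_new - y) ** 2)
        if err - err_new < alpha / (2 * B):                # halting condition of Algorithm 1
            return f, history                              # output the pre-update model
        f, err = f_new, err_new
        history.append(fits)
    return f, history
```

The model returned at halting is the one whose multicalibration the small final improvement certifies: if some $h \in H_B$ still witnessed a violation, the round just attempted would have decreased squared error by more than $\alpha/(2B)$.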
To prove Theorem 4.5 we could equally well assume that AH returns a hypothesis h that improves over a constant predictor whenever one exists, without assuming that h optimizes squared error over all of H. Definition 4.4 (Weak Learning Assumption). Fix a distribution D P Z and a class of functions H. Let f pxq Ey Dpxqrys denote the true conditional label expectation conditional on x. We say that H satisfies the γweak learning condition relative to D if for every S Ď X with Prx DX rx P Ss ą 0, if: Erpf pxq yq2|x P Ss ă min c PR Erpc yq2|x P Ss γ then there exists an h P H such that: Erphpxq yq2|x P Ss ă min c PR Erpc yq2|x P Ss γ When γ 0 we simply say that H satisfies the weak learning condition relative to D. Observe why our weak learning assumption is weak : the Bayes optimal predictor f may improve arbitrarily over the best constant predictor on some set S in terms of squared error, but in this case we only require of H that it include a hypothesis that improves by some γ which might be very small. Since the γ-weak learning condition does not make any requirements on H on sets for which f pxq improves over a constant predictor by less than γ, the best we can hope to prove under this assumption is γ-approximate Bayes optimality, which is what we do next. Theorem 4.5. Fix any distribution D P Z, any model f : X Ñ r0, 1s, any γ ą 0, any class of real valued functions H that satisfies the γ-weak learning condition relative to D, and a squared error regression oracle AH for H. Let α γ and B 1{γ (or any pair such that α{B γ2). Then LSBoostpf, α, AH, D, Bq halts after at most T ď 2 γ2 many iterations and outputs a model f T 1 such that f T 1 is 2γ-approximately Bayes optimal over D: E px,yq Drpf T 1pxq yq2s ď E px,yq Drpf pxq yq2s 2γ where f pxq Epx,yq Drys is the function that minimizes squared error over D. See Appendix B for the proof. 5. When Multicalibration Implies Accuracy We analyzed the same algorithm (Algorithm 1) as both an algorithm for obtaining multicalibration with respect to H, Multicalibration as Boosting for Regression and, when H satisfied the weak learning condition given in Definition 4.4, as a boosting algorithm that converges to the Bayes optimal model. In this section we show that this is no coincidence: multicalibration with respect to H implies Bayes optimality if and only if H satisfies the weak learning condition from Definition 4.4, First we define what we mean when we say that multicalibration with respect to H implies Bayes optimality. Note that the Bayes optimal model f pxq is multicalibrated with respect to any set of functions, so it is not enough to require that there exist Bayes optimal functions f that are multicalibrated with respect to H. Instead, we have to require that every function that is multicalibrated with respect to H is Bayes optimal: Definition 5.1. Fix a distribution D P Z. We say that multicalibration with respect to H implies Bayes optimality over D if for every f : X Ñ R that is multicalibrated with respect to D and H, we have: E px,yq Drpfpxq yq2s E px,yq Drpf pxq yq2s Where f pxq Ey Dpxqrys is the function that has minimum squared error over the set of all functions. Recall that when the weak learning parameter γ in Definition 4.4 is set to 0, we simply call it the weak learning condition relative to D. We first state and prove our characterization for the exact case when γ 0, because it leads to an exceptionally simple statement. 
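As an aside, the $\gamma$-weak-learning condition of Definition 4.4 is easy to probe empirically on a given subset S. A minimal sketch (ours; depth-1 regression trees stand in for the argmin over H):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def improves_on_constant(X_S, y_S, gamma):
    """Check whether some h in H beats the best constant predictor (the label mean)
    on S by at least gamma in squared error; depth-1 trees stand in for H here."""
    y_S = np.asarray(y_S, dtype=float)
    const_mse = np.mean((y_S - y_S.mean()) ** 2)        # min_c E[(c - y)^2 | x in S]
    h = DecisionTreeRegressor(max_depth=1).fit(X_S, y_S)
    h_mse = np.mean((h.predict(X_S) - y_S) ** 2)
    return h_mse < const_mse - gamma
```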
We subsequently extend this characterization to relate approximate Bayes optimality and approximate multicalibration under quantitative weakenings of the weak learning condition. Theorem 5.2. Fix a distribution D P Z. Let H be a class of functions that is closed under affine transformation. Multicalibration with respect to H implies Bayes optimality over D if and only if H satisfies the weak learning condition relative to D. See Appendix B for proof. We additionally derive a relationship between approximate multicalibration and approximate Bayes optimality in Appendix D. 6. Weak Learners with Respect to Constrained Classes Thus far we have studied function classes H that satisfy a weak learning condition with respect to the Bayes optimal predictor f . But we can also study function classes H that satisfy a weak learning condition defined with respect to another constrained class of real valued functions. Definition 6.1 (Weak Learning Assumption Relative to C). Fix a distribution D P Z and two classes of functions H and C. We say that H satisfies the γ-weak learning condition relative to C and D if for every S Ă X with Prx DX rx P Ss ą 0, if: min c PC Erpcpxq yq2 | x P Ss ă Erp y S yq2 | x P Ss γ, where y S Ery | x P Ss, then there exists h P H such that Erphpxq yq2 | x P Ss ă Erp y S yq2 | x P Ss γ. When γ 0 we simply say that H satisfies the weak learning condition relative to C and D. We will show that if a predictor f is multicalibrated with respect to H, and H satisfies the weak learning assumption with respect to C, then in fact: 1. f is multicalibrated with respect to C, and 2. f has squared error at most that of the minimum error predictor in C. In fact, (Gopalan et al., 2022) show that if f is multicalibrated with respect to C, then it is an omnipredictor for C, which implies that f has loss no more than the best function cpxq P C, where loss can be measured with respect to any Lipschitz convex loss function (not just squared error). Thus our results imply that to obtain an omnipredictor for C, it is sufficient to be multicalibrated with respect to a class H that satisfies our weak learning assumption with respect to C. Theorem 6.2. Fix a distribution D P Z and two classes of functions H and C that are closed under affine transformations. Then if f : X Ñ r0, 1s is multicalibrated with respect to D and H, and if H satisfies the weak learning condition relative to C and D, then in fact f is multicalibrated with respect to D and C as well. Furthermore, E px,yq Drpfpxq yq2s ď min c PC E px,yq Drpcpxq yq2s. See Appendix B for proof. For the corresponding relationship between approximate multicalibration and approximate loss minimization, see Appendix E. 7. Empirical Evaluation In this section, we study Algorithm 1 empirically via an efficient, open-source Python implementation of our algorithm on both synthetic and real regression problems. An important feature of Algorithm 1 which distinguishes it from traditional boosting algorithms is the ability to parallelize not only during inference, but also during training. Let ft be the model maintained by Algorithm 1 at round t with m level sets. Given a data set X, ft creates a partition of X defined by Xt 1 i tx|ftpxq viu. Since the Multicalibration as Boosting for Regression Figure 1: The update process at round t with m level sets during training. Xi are disjoint, each call ht 1 i AHp Xt 1 i q can be made on a separate worker followed by a merge and round operation to obtain ft 1 and ft 1 respectively, as shown in Figure 1. 
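The training-time parallelism just described can be sketched as follows (our own illustration; joblib is one convenient backend, and depth-1 trees again stand in for the weak learner class): because the level sets are disjoint, the per-level-set regressions of a single round can be dispatched to separate workers and then merged.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

def fit_level_set(X, y, mask):
    """Fit the weak learner on one level set; empty level sets are skipped."""
    if not mask.any():
        return None
    return DecisionTreeRegressor(max_depth=1).fit(X[mask], y[mask])

def parallel_round(X, y, f_t, grid, n_jobs=-1):
    masks = [f_t == v for v in grid]
    fits = Parallel(n_jobs=n_jobs)(delayed(fit_level_set)(X, y, m) for m in masks)
    f_next = f_t.copy()
    for mask, h in zip(masks, fits):          # merge: splice predictions back in
        if h is not None:
            f_next[mask] = h.predict(X[mask])
    return f_next, fits                       # round f_next to the grid afterwards
```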
A parallel inference pass at round t works nearly identically, but uses the historical weak learners ht 1 i obtained from training and applies them to each set Xt 1 i . We compare the training times of Algorithm 1 with scikitlearn s Gradient Boosting regressor (Pedregosa et al., 2011) in Appendix F. From Theorem 5.2, we know that multicalibration with respect to a hypothesis class H satisfying our weak learning condition implies Bayes optimality. To visualize the fast convergence of our algorithm to Bayes optimality, we create two synthetic datasets; each dataset contains one million samples with two features. We define the label of these points using two functions, C0 and C1 (pictured in Figure 2), with differing complexity and attempt to learn the underlying function with Algorithm 1. Figure 2: C0 maps x1, x2 P r 2, 2s to four cylindrical cones symmetric about the origin. C1 maps x1, x2 P r 1, 1s to a hilly terrain from a more complex function. In Figure 4, we show an example of Algorithm 1 learning C0 using a discretization of five-hundred level sets and a weak learner hypothesis class of depth one decision trees. Each image in figure 4 corresponds to the map produced by Algorithm 1 at the round listed in the top of the image. As the round count increases, the number of non-empty level sets increases until each level set is filled, at which point the updates become more granular. The termination round titled final round occurs at T 199 and paints an approximate map of C0. The image titled out of sample is the map produced on a set of one million points randomly drawn outside of the training sample, and shows that Algorithm 1 is in fact an approximation of the Bayes Optimal C0. In Appendix F.2, Figure 8 plots the same kind of progression as Figure 4, but with a more complicated underlying function C1 using a variety of weak learner classes. We are able to learn this more complex surface out of sample with all base classes except for linear regression, which results in a noisy out-of-sample plot. Figure 3: Comparison of Algorithm 1 (LS) and Gradient Boosting (GB), both using depth 1 regression trees. * indicates termination round of Algorithm 1. We evaluate the empirical performance of Algorithm 1 on US Census data compiled using the Python folktables package (Ding et al., 2021). In this dataset, the feature space consists of demographic information about individuals, and the labels correspond to the individual s annual income. We cap income at $100,000 and then rescale all labels into r0, 1s. On an 80/20% train-test split with 500,000 total samples, we compare the performance of Algorithm 1 with Gradient Boosting with two performance metrics: mean Multicalibration as Boosting for Regression Figure 4: Evolution of Algorithm 1 learning C0. squared error (MSE), and mean squared calibration error (MSCE). For less expressive weak learner classes (such as DT(1), see Figure 3), Algorithm 1 has superior MSE out of sample compared to Gradient Boosting through one hundred rounds while maintaining significantly lower MSCE, and converges quicker. However, as the weak learning class becomes more expressive (e.g. increasing decision tree depths), Algorithm 1 is more prone to overfitting than gradient boosting (see Appendix F.1, Figure 5). Avrim Blum and Yishay Mansour. From external to internal regret. In International Conference on Computational Learning Theory, pages 621 636. Springer, 2005. Maya Burhanpurkar, Zhun Deng, Cynthia Dwork, and Linjun Zhang. Scaffolding sets. 
ar Xiv preprint ar Xiv:2111.03135, 2021. A Philip Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605 610, 1982. Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34, 2021. Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learning, 47(2):153 200, 2002. Dean P Foster and Sham M Kakade. Calibration via regression. In 2006 IEEE Information Theory Workshop ITW 06 Punta del Este, pages 82 86. IEEE, 2006. Dean P Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29 (1-2):7 35, 1999. Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199 3210. PMLR, 2020. Dylan Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539 1548. PMLR, 2018. Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55 (1):119 139, 1997. Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189 1232, 2001. Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In ITCS, 2022. Parikshit Gopalan, Lunjia Hu, Michael P. Kim, Omer Reingold, and Udi Wieder. Loss minimization through the Multicalibration as Boosting for Regression lens of outcome indistinguishability. In Yael Tauman Kalai, editor, 14th Innovations in Theoretical Computer Science Conference, ITCS 2023, January 10-13, 2023, MIT, Cambridge, Massachusetts, USA, volume 251 of LIPIcs, pages 60:1 60:20. Schloss Dagstuhl - Leibniz Zentrum für Informatik, 2023a. Parikshit Gopalan, Michael P Kim, and Omer Reingold. Characterizing notions of omniprediction via multicalibration. ar Xiv preprint ar Xiv:2302.06726, 2023b. Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939 1948. PMLR, 2018. Christopher Jung, Changhwa Lee, Mallesh Pai, Aaron Roth, and Rakesh Vohra. Moment multicalibration for uncertainty estimation. In Conference on Learning Theory, pages 2634 2678. PMLR, 2021. Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. Batch multivalid conformal prediction. ar Xiv preprint ar Xiv:2209.15145, 2022. Adam Kalai. Learning monotonic linear functions. In International Conference on Computational Learning Theory, pages 487 501. Springer, 2004. Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777 1805, 2008. Varun Kanade and Adam Kalai. Potential-based agnostic boosting. Advances in neural information processing systems, 22, 2009. Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247 254, 2019. Michael P Kim, Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. Universal adaptability: Target-independent inference that competes with propensity scoring. 
Proceedings of the National Academy of Sciences, 119(4):e2108097119, 2022. Balas K Natarajan. On learning sets and functions. Machine Learning, 4(1):67 97, 1989. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012. Aaron Roth. Uncertain: Modern topics in uncertainty estimation. https://www.cis.upenn.edu/ aaroth/uncertaintynotes.pdf, 2022. Robert E Schapire. The strength of weak learnability. Machine learning, 5(2):197 227, 1990. Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. Kybernetes, 2013. Eliran Shabat, Lee Cohen, and Yishay Mansour. Sample complexity of uniform convergence for multicalibration. Advances in Neural Information Processing Systems, 33: 13331 13340, 2020. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. V.N. Vapnik and A. YA. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities, 1971. Multicalibration as Boosting for Regression A. Relationships between Multicalibration Definitions Note that multicalibration with respect to function classes is strictly more general than notions of multicalibration with respect to a class of (potentially sensitive) groups. When the functions hpxq have binary range, we can view them as indicator functions for some subset of the data domain S Ă X, in which case multicalibration corresponds to asking for calibration conditional on membership in these subsets S. Allowing the functions h to have real valued range is only a more general condition. Our notion of approximate multicalibration takes a weighted average over the level sets v of the predictor f, weighted by the probability that fpxq v. This is necessary for any kind of out of sample generalization statement otherwise we could not even necessarily measure calibration error from a finite sample. Other work on multicalibration use related measures of multicalibration that we think of as ℓ1 or ℓ8 variants, that we can write as K1pf, h, Dq ÿ v PRpfq Pr px,yq Drfpxq vs ˇˇˇˇ E px,yq Drhpxqpy vq|fpxq vs ˇˇˇˇ K8pf, h, Dq max v PRpfq Pr px,yq Drfpxq vs ˆ E px,yq Drhpxqpy vq|fpxq vs . These notions are related to each other: K2pf, h, Dq ď K1pf, h, Dq ď a K2pf, h, Dq and K8pf, h, Dq ď K1pf, h, Dq ď m K8pf, h, Dq (Roth, 2022). Theorem 3.2. Suppose H is closed under affine transformation. Fix a model f : X Ñ R and a levelset v P Rpfq of f. Then: 1. If there exists an h P H such that: Erhpxqpy vq|fpxq vs ě α, for α ą 0, then there exists an h1 P H such that: Erpfpxq yq2 ph1pxq yq2|fpxq vs Erhpxq2|fpxq vs. 2. If f is calibrated and there exists an h P H such that Erpfpxq yq2 phpxq yq2|fpxq vs ě α, then: Erhpxqpy vq|fpxq vs ě α Proof. We prove each direction in turn. Lemma B.1. Fix a model f : X Ñ R. Suppose for some v P Rpfq there is an h P H such that: Erhpxqpy vq|fpxq vs ě α Let h1 v ηhpxq for η α Erhpxq2|fpxq vs. Then: Erpfpxq yq2 ph1pxq yq2|fpxq vs ě α2 Erhpxq2|fpxq vs Multicalibration as Boosting for Regression Proof. 
We calculate: Erpfpxq yq2 ph1pxq yq2|fpxq vs Erpv yq2 pv ηhpxq yq2|fpxq vs Erv2 2vy y2 pv ηhpxqq2 2ypv ηhpxqq y2|fpxq vs Er2yηhpxq 2vηhpxq η2hpxq2|fpxq vs Er2ηhpxqpy vq η2hpxq2|fpxq vs ě 2ηα η2 Erhpxq2|fpxq vs Erhpxq2|fpxq vs Where the last line follows from the definition of η. The first direction of Theorem 3.2 follows from Lemma B.1, and the observation that since H is closed under affine transformations, the function h1 defined in the statement of Lemma B.1 is in H. Now for the second direction. Note that we prove a more general statement in this lemma, showing that for any set S, if there exists an h that has squared error α better than the best constant predictor y S for that set, then Erhpxqpy y Sq | x P Ss ě α{2. The result claimed in part 2 of Theorem 3.2 follows as a corollary. For any calibrated f we have v Ery | fpxq vs, and so for each level set of f, f is the best constant predictor on that set. Lemma B.2. Fix a model f : X Ñ R. Suppose for some S Ă X there is an h P H such that: Erp y S yq2 phpxq yq2|x P Ss ě α, where y S Ery | x P Ss. Then it must be that: Erhpxqpy y Sq|x P Ss ě α Proof. We calculate: E px,yq Drhpxqpy y Sq|x P Ss E px,yq Drhpxqy|x P Ss y S E px,yq Drhpxq|x P Ss ˆ 2 E px,yq Drhpxqy|x P Ss 2 y S E px,yq Drhpxq|x P Ss ˆ 2 E px,yq Drhpxqy|x P Ss 2 y S E px,yq Drhpxq|x P Ss E px,yq Drphpxq y Sq2|x P Ss E px,yq Dr2hpxqy hpxq2 y2 S|x P Ss E px,yq Dr2hpxqy hpxq2 2 y Sy y2 S|x P Ss E px,yq Drp y S yq2 phpxq yq2|x P Ss where the 3rd to last line follows from adding and subtracting y2 S. Multicalibration as Boosting for Regression Theorem 4.3. Fix any distribution D P Z, any model f : X Ñ r0, 1s, any α ă 1, any class of real valued functions H that is closed under affine transformations, and a squared error regression oracle AH for H. For any bound B ą 0 let: HB th P H : max x PX hpxq2 ď Bu be the set of functions in h with squared magnitude bounded by B. Then LSBoostpf, α, AH, D, Bq (Algorithm 1) halts after at most T ď 2B α many iterations and outputs a model f T 1 such that f T 1 is α-approximately multicalibrated with respect to D and HB. Remark B.3. Note the form of this theorem we do not promise multicalibration at approximation parameter α for all of H, but only for HB i.e. those functions in H satisfying a bound on their squared value. This is necessary, since H is closed under affine transformations. To see this, note that if Erhpxqpy vqs ě α, then it must be that Erc hpxqpy vqs ě c α. Since h1pxq chpxq is also in H by assumption, approximate multicalibration bounds must always also be paired with a bound on the norm of the functions for which we promise those bounds. Proof. Since f0 takes values in r0, 1s and y P r0, 1s, we have err0 ď 1, and by definition err T ě 0 for all T. By construction, if the algorithm has not halted at round t it must be that errt ď errt 1 α 2B , and so the algorithm must halt after at most T ď 2B α many iterations to avoid a contradiction. It remains to show that when the algorithm halts at round T, the model f T 1 that it outputs is α-approximately multicalibrated with respect to D and HB. We will show that if this is not the case, then err T 1 err T ą α 2B , which will be a contradiction to the halting criterion of the algorithm. Suppose that f T 1 is not α-approximately multicalibrated with respect to D and HB. This means there must be some h P HB such that: ř v Pr1{ms Prpx,yq Drf T 1pxq vs Epx,yq Drhpxqpy vq|f T 1pxq vs 2 ą α. 
For each v P r1{ms define αv Prpx,yq Drf T 1pxq vs Epx,yq Drhpxqpy vq|f T 1pxq vs 2 So we have ř v Pr1{ms αv ą α. Applying the 1st part of Theorem 3.2 we learn that for each v, there must be some hv P H such that: Erpf T 1pxq yq2 phvpxq yq2|f T 1pxq vs ą 1 Erhpxq2|f T 1pxq vs αv Prpx,yq Drf T 1pxq vs ě 1 B αv Prpx,yq Drf T 1pxq vs where the last inequality follows from the fact that h P HB Now we can compute: E px,yq Drpf T 1pxq yq2 p f T pxq yq2s v Pr1{ms Pr px,yq Drf T 1pxq vs E px,yq Drpf T 1pxq yq2 p f T pxq yq2|f T 1pxq vs v Pr1{ms Pr px,yq Drf T 1pxq vs E px,yq Drpf T 1pxq yq2 ph T v pxq yq2|f T 1pxq vs v Pr1{ms Pr px,yq Drf T 1pxq vs E px,yq Drpf T 1pxq yq2 phvpxq yq2|f T 1pxq vs Here the third line follows from the definition of f T and the fourth line follows from the fact hv P H and that h T v minimizes squared error on DT v amongst all h P H. Multicalibration as Boosting for Regression Finally we calculate: err T 1 err T E px,yq Drpf T 1pxq yq2 pf T pxq yq2s E px,yq Drpf T 1pxq yq2 p f T pxq yq2s E px,yq Drp f T pxq yq2 pf T pxq yq2s ą α B E px,yq Drp f T pxq yq2 pf T pxq yq2s where the last equality follows from the fact that m ě 2B The 2nd inequality follows from the fact that for every pair px, yq: p f T pxq yq2 pf T pxq yq2 ě 1 To see this we consider two cases. Since y P r0, 1s, if f T pxq ą 1 or f T pxq ă 0 then the Round operation decreases squared error and we have p f T pxq yq2 pf T pxq yq2 ě 0. In the remaining case we have f T pxq P r0, 1s and f T pxq f T pxq is such that | | ď 1 2m. In this case we can compute: p f T pxq yq2 pf T pxq yq2 pf T pxq yq2 pf T pxq yq2 2 pfpxq yq 2 Theorem 4.5. Fix any distribution D P Z, any model f : X Ñ r0, 1s, any γ ą 0, any class of real valued functions H that satisfies the γ-weak learning condition relative to D, and a squared error regression oracle AH for H. Let α γ and B 1{γ (or any pair such that α{B γ2). Then LSBoostpf, α, AH, D, Bq halts after at most T ď 2 γ2 many iterations and outputs a model f T 1 such that f T 1 is 2γ-approximately Bayes optimal over D: E px,yq Drpf T 1pxq yq2s ď E px,yq Drpf pxq yq2s 2γ where f pxq Epx,yq Drys is the function that minimizes squared error over D. Proof. At each round t before the algorithm halts, we have by construction that errt ď errt 1 α 2B , and since the squared error of f0 is at most 1, and squared error is non-negative, we must have T ď 2B Now suppose the algorithm halts at round T and outputs f T 1. It must be that err T ą err T 1 γ2 2 . Suppose also that f T 1 is not 2γ-approximately Bayes optimal: E px,yq Drpf T 1pxq yq2 pf pxq yq2s ą 2γ We can write this condition as: ÿ v Pr1{ms Prrf T 1pxq vs E px,yq Drpf T 1pxq yq2 pf pxq yq2|f T 1pxq vs ą 2γ Define the set: S tv P r1{ms : E px,yq Drpf T 1pxq yq2 pf pxq yq2|f T 1pxq vs ě γu Multicalibration as Boosting for Regression to denote the set of values v in the range of f T 1 such that conditional on f T 1pxq v, f T 1 is at least γ-sub-optimal. Since we have both y P r0, 1s and f T 1pxq P r0, 1s, for every v we must have that Erpf T 1pxq yq2 pf pxq yq2|f T 1pxq vs ď 1. 
Therefore we can bound: v Pr1{ms Prrf T 1pxq vs E px,yq Drpf T 1pxq yq2 pf pxq yq2|f T 1pxq vs ď Pr px,yq Drx P Ss p1 Pr px,yq Drx P Ssqγ Solving we learn that: Pr px,yq Drx P Ss ě 2γ γ p1 γq ě 2γ γ γ Now observe that by the fact that H is assumed to satisfy the γ-weak learning assumption with respect to D, at the final round T of the algorithm, for every v P S we have that h T v satisfies: E px,yq Drpf T 1pxq yq2 ph T v pxq yq2|f T 1pxq vs ě γ Let err T Epx,yq Drp f T pxq yq2s Therefore we have: err T 1 err T ÿ v Pr1{ms Pr px,yq Drf T 1pxq vs E px,yq Drpf T 1pxq yq2 ph T v pxq yq2|f T 1pxq vs ě Pr px,yq Drf T 1pxq P Ssγ We recall that | err T err T | ď 1{m γ2 2 and so we can conclude that err T 1 err T ě γ2 which contradicts the fact that the algorithm halted at round T, completing the proof. Theorem 5.2. Fix a distribution D P Z. Let H be a class of functions that is closed under affine transformation. Multicalibration with respect to H implies Bayes optimality over D if and only if H satisfies the weak learning condition relative to D. Proof. To avoid measurability issues we assume that models f have a countable range (which is true in particular whenever X is countable) but this assumption can be avoided with more care. First we show that if H satisfies the weak learner condition relative to D, then multicalibration with respect to H implies Bayes optimality over D. Suppose not. Then there exists a function f that is multicalibrated with respect to D and H, but is such that: E px,yq Drpfpxq yq2s ą E px,yq Drpf pxq yq2s By linearity of expectation we have: ÿ v PRpfq Prrfpxq vs E px,yq Drpfpxq yq2 pf pxq yq2|fpxq vs ą 0 In particular there must be some v P Rpfq with Prx DX rfpxq vs ą 0 such that: E px,yq Drpfpxq yq2|fpxq vs ą E px,yq Drpf pxq yq2|fpxq vs Multicalibration as Boosting for Regression Let S tx : fpxq vu. Observe that if H is closed under affine transformation, the constant function hpxq 1 is in H, and hence multicalibration with respect to H implies calibration. Since f is calibrated, we know that: E px,yq Drpv yq2|x P Ss min c PR E px,yq Drpc yq2|x P Ss Thus by the weak learning assumption there must exist some h P H such that: Erpv yq2 phpxq yq2|x P Ss Erpfpxq yq2 phpxq yq2|fpxq vs ą 0 By Theorem 3.2, there must therefore exist some h1 P H such that: E px,yq Drh1pxqpy vq|fpxq vs ą 0 implying that f is not multicalibrated with respect to D and H, a contradiction. In the reverse direction, we show that for any H that does not satisfy the weak learning condition with respect to D, then multicalibration with respect to H and D does not imply Bayes optimality over D. In particular, we exhibit a function f such that f is multicalibrated with respect to H and D, but such that: E px,yq Drpfpxq yq2s ą E px,yq Drpf pxq yq2s Since H does not satisfy the weak learning assumption over D, there must exist some set S Ď X with Prrx P Ss ą 0 such that E px,yq Drpf pxq yq2|x P Ss ă min c PR E px,yq Drpc yq2|x P Ss but for every h P H: E px,yq Drphpxq yq2|x P Ss ě min c PR E px,yq Drpc yq2|x P Ss. Let cp Sq Epx,yq Dry|x P Ss. We define fpxq as follows: # f pxq x R S cp Sq x P S We can calculate that: E px,yq Drpfpxq yq2s Pr px,yq Drx P Ss E px,yq Drpcp Sq yq2|x P Ss Pr px,yq Drx R Ss E px,yq Drpf pxq yq2|x R Ss ą Pr px,yq Drx P Ss E px,yq Drpf pxq yq2|x P Ss Pr px,yq Drx R Ss E px,yq Drpf pxq yq2|x R Ss E px,yq Drpf pxq yq2s In other words, f is not Bayes optimal. So if we can demonstrate that f is multicalibrated with respect to H and D we are done. Suppose otherwise. 
Then there exists some h P H and some v P Rpfq such that E px,yq Drhpxqpy vq|fpxq vs ą 0 By Theorem 3.2, there exists some h1 P H such that: E px,yq Drph1pxq yq2|fpxq vs ă E px,yq Drpfpxq yq2|fpxq vs Multicalibration as Boosting for Regression We first observe that it must be that v cp Sq. If this were not the case, by definition of f we would have that: E px,yq Drph1pxq yq2|fpxq vs ă E px,yq Drpf pxq yq2|fpxq vs which would contradict the Bayes optimality of f . Having established that v cp Sq we can calculate: E px,yq Drph1pxq yq2|fpxq cp Sqs Pr px,yq Drx P Ss E px,yq Drph1pxq yq2|x P Ss Pr px,yq Drx R S, fpxq cp Sqs E px,yq Drph1pxq yq2|x R S, fpxq cp Sqs ě Pr px,yq Drx P Ss E px,yq Drph1pxq yq2|x P Ss Pr px,yq Drx R S, fpxq cp Sqs E px,yq Drpfpxq yq2|x R S, fpxq cp Sqs where in the last inequality we have used the fact that by definition, fpxq f pxq for all x R S, and so is pointwise Bayes optimal for all x R S. Hence the only way we can have Epx,yq Drph1pxq yq2|fpxq cp Sqs ă Epx,yq Drpfpxq yq2|fpxq cp Sqs is if: E px,yq Drph1pxq yq2|x P Ss ă E px,yq Drpcp Sq yq2|x P Ss But this contradicts our assumption that H violates the weak learning condition on S, which completes the proof. Theorem 6.2. Fix a distribution D P Z and two classes of functions H and C that are closed under affine transformations. Then if f : X Ñ r0, 1s is multicalibrated with respect to D and H, and if H satisfies the weak learning condition relative to C and D, then in fact f is multicalibrated with respect to D and C as well. Furthermore, E px,yq Drpfpxq yq2s ď min c PC E px,yq Drpcpxq yq2s. Proof. We prove the theorem in two lemmas as follows. Lemma B.4. Fix a distribution D P Z and two classes of functions H and C that are closed under affine transformations. Then if f : X Ñ r0, 1s is multicalibrated with respect to D and H, and if H satisfies the weak learning condition relative to C and D, then in fact f is multicalibrated with respect to D and C as well. Proof. We assume for simplicity that f has a countable range (which is without loss of generality e.g. whenever X is countable). Suppose for contradiction that f is not multicalibrated with respect to C and D. In this case there must be some c P C such that: ÿ v PRpfq Prrfpxq vs ˆ E px,yq Drcpxqpy vq|fpxq vs 2 ą 0 Since C is closed under affine transformations (and so both c and c are in C), there must be some c1 P C and some v P Rpfq with Prrfpxq vs ą 0 such that: E px,yq Drc1pxqpy vq|fpxq vs ą 0 Therefore, by the first part of Theorem 3.2, there must be some c2 P C such that: E px,yq Drpc2pxq yq2|fpxq vs ă E px,yq Drpv yq2|fpxq vs Multicalibration as Boosting for Regression Since H is closed under affine transformations, the function hpxq 1 is in H and so multicalibration with respect to H implies calibration. Thus v y Sv for Sv tx : fpxq vu. Therefore, the fact that H satisfies the weak learning condition relative to C and D implies that there must be some h P H such that: E px,yq Drphpxq yq2|fpxq vs ă E px,yq Drpv yq2|fpxq vs Finally, the second part of Theorem 3.2 implies that: E px,yq Drhpxqpy vq|fpxq vs ą 0 which is a violation of our assumption that f is multicalibrated with respect to H and D, a contradiction. Lemma B.5. Fix a distribution D P Z and two classes of functions H and C. Then if f : X Ñ r0, 1s is calibrated and multicalibrated with respect to D and H, and if H satisfies the weak learning condition relative to C and D, then: E px,yq Drpfpxq yq2s ď min c PC E px,yq Drpcpxq yq2s Proof. 
We assume for simplicity that f has a countable range (which is without loss of generality e.g. whenever X is countable). Suppose for contradiction that there is some c P C such that: E px,yq Drpcpxq yq2s ă E px,yq Drpfpxq yq2s Then there must be some v P Rpfq with Prrfpxq vs ą 0 and: E px,yq Drpcpxq yq2|fpxq vs ă E px,yq Drpv yq2|fpxq vs Since f is calibrated, v y Sv for Sv tx : fpxq vu. Therefore, the fact that H satisfies the weak learning condition relative to C and D implies that there must be some h P H such that: E px,yq Drphpxq yq2|fpxq vs ă E px,yq Drpv yq2|fpxq vs Finally, the second part of Theorem 3.2 implies that: E px,yq Drhpxqpy vq|fpxq vs ą 0 which is a violation of our assumption that f is multicalibrated with respect to H and D, a contradiction. C. Generalization Bounds Our analysis of Algorithm 1 assumed direct access to the data distribution D. In practice, we will run the algorithm on the empirical distribution over a sample of n points D Dn. In this section, we show that when we do this, so long as n is sufficiently large, both our squared error and our multicalibration guarantees carry over from the empirical distribution over D to the distribution D from which D was sampled. Most generalization bounds for multicalibration algorithms (e.g. (Hébert-Johnson et al., 2018; Jung et al., 2021; 2022; Shabat et al., 2020)) are either stated and proven for finite classes H, or are proven for algorithms that do not operate as empirical risk minimization algorithms, but instead gain access to a fresh sample of data from the distribution at each iteration, or are proven for hypotheses classes that are fixed independently of the algorithm. We have a different challenge: Like (Hébert-Johnson et al., 2018; Jung et al., 2021) we study an iterative algorithm whose final hypothesis class is not fixed up front, but implicitly defined as a function of H. But we wish to study the algorithms as they are used as empirical risk minimization algorithms so we do not want our analysis to depend on using a fresh sample of data at each iteration. And unlike the analysis in (Jung et al., 2022), for us H is continuously large (since it is closed under affine transformations), so we cannot rely on bounds that depend on log |H|. Instead we give a uniform convergence analysis that depends on the pseudo-dimension of our class of weak learners H: Multicalibration as Boosting for Regression Definition C.1. Pseudodimension[(Pollard, 2012)] Let H be a class of functions from X to R. We say that a set S px1, . . . , xm, y1, . . . , ymq P X m ˆ Rm is pseudo-shattered by H if for any pb1, . . . , bmq P t0, 1um there exists h P H such that @i, hpxiq ą y ðñ bi 1 The pseudodimension of H, denoted Pdimp Hq is the largest integer m for which H pseudo-shatters some set S of cardinality m. Although hypotheses in H are continuously valued, Algorithm 1 outputs functions that have finite range r1{ms, and so we can view them as multi-class classification functions. Our analysis will proceed by studying the generalization properties of these multiclass functions, which we will characterize using Natarajan dimension: Definition C.2 (Shattering for multiclass functions). (Natarajan, 1989; Shalev-Shwartz and Ben-David, 2014) A set C Ď X is shattered by H if there exists two functions f0, f1 : C Ñ rks such that 1. For every x P C, f0pxq f1pxq. 2. For every B Ď C there exists a function h P H such that @x P B, hpxq f0pxq and @x P C B, hpxq f1pxq. Definition C.3 (Natarajan dimension). 
(Natarajan, 1989; Shalev-Shwartz and Ben-David, 2014) The Natarajan dimension of $H$, denoted $\mathrm{Ndim}(H)$, is the maximal size of a shattered set $C \subseteq X$.

We can then rely on the following standard uniform convergence bound for multiclass classification. The statement is slightly modified from the result in Shalev-Shwartz and Ben-David to account for our use of squared error; it continues to hold because the Chernoff bound only requires that the loss function be bounded, and ours is bounded between 0 and 1.

Theorem C.4 (Multiclass uniform convergence). (Shalev-Shwartz and Ben-David, 2014) Let $\epsilon, \delta > 0$ and let $H$ be a class of functions $h : X \to [1/k]$ such that the Natarajan dimension of $H$ is $d$. Let $\mathcal{D} \in \Delta(X \times [0,1])$ be an arbitrary distribution and let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, with $(x_i, y_i) \sim \mathcal{D}$, be a sample of $n$ points from $\mathcal{D}$. Then for $n = O\!\left(\frac{d \log(k) + \log(1/\delta)}{\epsilon^2}\right)$:
$$\Pr\left[\max_{h \in H} \left| \mathbb{E}_{(x,y) \sim D}\left[(y - h(x))^2\right] - \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[(y - h(x))^2\right] \right| \geq \epsilon\right] \leq \delta.$$

Our strategy will be to bound the Natarajan dimension of the class of models that can be output by Algorithm 1 in terms of the pseudodimension of the underlying weak learner class, and then apply the above uniform convergence result. To do so, we first use the following lemma, which bounds the Natarajan dimension of functions that can be described as post-processings of binary-valued functions from a class of bounded VC-dimension.

Lemma C.5. (Shalev-Shwartz and Ben-David, 2014) Suppose we have $\ell$ binary classifiers from a binary class $H_{\mathrm{bin}}$ and a rule $r : \{0,1\}^{\ell} \to [k]$ that determines a multiclass label according to the predictions of the $\ell$ binary classifiers. Define the hypothesis class corresponding to this rule as $H = \{r(h_1(\cdot), \ldots, h_{\ell}(\cdot)) : (h_1, \ldots, h_{\ell}) \in (H_{\mathrm{bin}})^{\ell}\}$. Then, if $d = \mathrm{VCdim}(H_{\mathrm{bin}})$, $\mathrm{Ndim}(H) \leq 3 \ell d \log(\ell d)$.

Recall that the VC-dimension of a class of binary classifiers is defined as follows:

Definition C.6 (VC-dimension). (Vapnik and Chervonenkis, 1971) Let $H$ be a class of binary classifiers $h : X \to \{0,1\}$. Let $S = \{x_1, \ldots, x_m\}$ and let $\Pi_H(S) = \{(h(x_1), \ldots, h(x_m)) : h \in H\} \subseteq \{0,1\}^m$. We say that $S$ is shattered by $H$ if $\Pi_H(S) = \{0,1\}^m$. The Vapnik-Chervonenkis (VC) dimension of $H$, denoted $\mathrm{VCdim}(H)$, is the cardinality of the largest set $S$ shattered by $H$.

Lemma C.7. Let $H_{\mathrm{boost}}$ be the class of models output by $\mathrm{LSBoost}(f, \alpha, A_H, \cdot, B)$ (Algorithm 1) for any input distribution, and let $d$ be the pseudodimension of its input weak learner class $H$. Then
$$\mathrm{Ndim}(H_{\mathrm{boost}}) \leq 24 (B/\alpha)^3 d \log\!\left((2B/\alpha)^3 d\right).$$

Proof. Let $m$ be defined (as in $\mathrm{LSBoost}(f, \alpha, A_H, \mathcal{D}, B)$) to be $2B/\alpha$. Because our models are always rounded to the nearest value in $[1/m]$, we can view the model $f_t$ generated in each round of the algorithm as a multiclass classifier over $m$ classes. We will show that our final model can be written as a decision rule that maps the outputs of some $\ell$ Boolean classifiers to $[1/m]$, and that these Boolean classifiers have VC-dimension bounded by the pseudodimension of the weak learner class. Then we will apply Lemma C.5 to get an upper bound on the Natarajan dimension of the class of models in terms of $\alpha$, $B$, and the pseudodimension of the input weak learner class $H$.

Consider the initial round of the algorithm. We can convert our (rounded) initial regressor $f_0$ into a series of $m$ Boolean thresholdings $g_v$ which return 1 when $f_0(x) \geq v$:
$$g_v(x) = \begin{cases} 1 & \text{if } f_0(x) \geq v, \\ 0 & \text{otherwise.} \end{cases}$$
These m Boolean thresholdings can then be mapped back to the original prediction over r1{ms using a decision rule r : t0, 1um Ñ r1{ms which picks the largest of the thresholds that evaluates to 1, and assigns that index to the prediction: r0ptg0 vuv Pr1{msqpxq arg max i Pr1{ms i1rgvpxq 1s. Note that since our initial predictor f0 was already rounded to take values in r1{ms, the largest v such that f0pxq ě v will be exactly f0pxq, so r0 is exactly equivalent to f0. Similarly, at round t 1 of LSBoostpf, α, AH, D, Bq, we will show that the model ft 1 can be written as a decision rule rt 1 over m pt 1qm2 binary classifiers g, where # 1 if ht vpxq ě i 1{p2mq, 0 otherwise. , Here, the thresholds measure halfway between each level set, as ht vpxq has yet to be rounded to the nearest level set. We can write a decision rule that maps these thresholds to classifications over r1{ms: rt 1 rt, tgt 1 v,i ui,v Pr1{ms pxq ÿ v Pr1{ms 1rrtpxq vs arg max i Pr1{ms i 1rgt 1 v,i pxq 1s , Now, we need to show that this decision rule evaluated at round t is equivalent to ft. We proceed inductively. For our base case, we have already argued that our initial decision rule r0 is equivalent to the classifier f0. Now, say that we have decision rule rt over binary classifiers g that is equivalent to model ft. Then, we can write rt 1 rt, tgt 1 v,i ui,v Pr1{ms pxq ÿ v Pr1{ms 1rrtpxq vs arg max i Pr1{ms i 1rgt 1 v,i pxq 1s , v Pr1{ms 1rftpxq vs arg max i Pr1{ms i 1rgt 1 v,i pxq 1s v Pr1{ms 1rftpxq vs arg max i Pr1{ms i 1rht 1 v pxq ě i 1{p2mqs v Pr1{ms 1rftpxq vs Roundpht 1 v pxqq where the second line comes from the inductive hypothesis and the second to last line s equality comes from the fact that the largest i such that ht 1 v pxq 1{p2mq ě i will be the exact rounded prediction of ht 1 v pxq. Now, we need to show that at round t 1, the decision rule is a decision rule over m pt 1qm2 binary classifiers. Note that our initial decision rule r0 has m m 0 m2 binary classifiers. Say that at round t we have a decision rule rt over Multicalibration as Boosting for Regression m tm2 classifiers. In the following round, we build m2 new Boolean classifiers gv, i for v, i P r1{ms. So, at round t 1 we have m tm2 m2 m pt 1qm2 classifiers total. From Theorem 4.3, we know that Algorithm 1 halts after at most T ď 2B{α rounds, at which point it outputs model f T 1. So, we can rewrite f T 1 as a decision rule r T 1 composed of at most m p T 1qm2 ă Tm2 Boolean models. Plugging in our bound for T and definition of m, this gives us a decision rule r T 1 composed of at most 2B α 3 Boolean classifiers. Let G be the class of Boolean threshold functions over H, i.e. functions g : X Ñ t0, 1u such that # 1 hpxq ě i 0 hpxq ă i, for some h P H and i P R. Say that the VC-dimension of G is d1. Then, applying lemma C.5, it follows that Ndimp Hboostq ď 3 ˆ2B Now, it remains to show that we can bound the VC-dimension of these thresholding functions by the pseudodimension of the weak learner class H. Note that G as we have defined it above is a richer hypothesis class than the actual class of thresholding functions used in the above analysis, because it can threshold at any value in R rather than being restricted to r1{ms. Thus, its VC dimension can only be greater than the VC dimension of the class of threshold functions over H restricted to r1{ms, and hence an upper bound on the VC dimension of G in terms of the pseudodimension of H will also be an upper bound on the VC dimension of the restricted class of threshold functions. 
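To make the encoding in the proof of Lemma C.7 concrete, here is a minimal numerical sketch of the construction (function names such as round_to_grid and decision_rule are illustrative, not taken from the paper's repository): a predictor rounded to the grid of level sets is encoded as m Boolean threshold classifiers g_v(x) = 1[f(x) >= v], and the decision rule recovers each prediction as the largest level whose threshold fires.

```python
import numpy as np

def round_to_grid(preds, m):
    # Round predictions in [0, 1] to the nearest level in {0, 1/m, ..., 1}
    # (a stand-in for the rounding step of Algorithm 1).
    return np.round(preds * m) / m

def to_thresholds(preds, m):
    # Encode rounded predictions as Boolean classifiers g_v(x) = 1[f(x) >= v],
    # one per level v = i/m, as in the proof of Lemma C.7.
    levels = np.arange(1, m + 1) / m
    return (preds[:, None] >= levels[None, :]).astype(int)  # shape (n, m)

def decision_rule(threshold_bits, m):
    # Recover the prediction as the largest level whose threshold fires;
    # if no threshold fires, the prediction is 0.
    levels = np.arange(1, m + 1) / m
    fired = threshold_bits * levels[None, :]
    return fired.max(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m = 10
    f = round_to_grid(rng.uniform(size=8), m)
    bits = to_thresholds(f, m)
    # The decision rule reproduces the rounded predictor exactly.
    assert np.allclose(decision_rule(bits, m), f)
```

The same encoding is what the proof iterates round by round, adding a fresh batch of threshold classifiers for each level set's weak learner.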
Let d be the pseudodimension of H, and say that d ă d1. By the definition of VC-dimension, t0, 1ud 1 must be shattered by G. I.e., for any set of d 1 points x1, . . . , xd 1 P X with arbitrary labels b1, . . . , bd 1, there is some hypothesis g P G that realizes those labels on px1, . . . , xd 1q. Consider the function g that, given the d 1 points in X, realizes the labels b1, . . . , bd 1. By the construction of G, g is a thresholding of some function h P H at some point i. So, there is be some i P R such that hpxiq ą i ñ bi 1 and such that bi 1 ñ hpxiq ą i. But this means that t0, 1ud 1 is pseudo-shattered by H, and thus the pseudodimension of H is not d. Thus, it cannot be the case that d ă d1, and hence d1 ď d, i.e. the VC dimension of G is bounded above by the pseudodimension of H. Plugging this bound into the above bound on the Natarajan dimension gives us that Ndimp Hboostq ď 24 ˆB Now, we can state the following uniform convergence theorem for our final model. Theorem C.8 (Squared Error Generalization for Algorithm 1.). Let ϵ, δ, α, B ą 0. Let Hboost be the class of models that can be output by LSBoostpf, α, AH, , Bq (Algorithm 1) for any input distribution D and let d be the pseudodimension of its input weak learner class H. Let D tpx1, y1q, . . . , pxn, ynqupxi,yiq D be a sample of n points drawn i.i.d. from D. Then if n O ˆd B3 log2pd B{αq α3ϵ2 logp1{δq Multicalibration as Boosting for Regression Pr max h PHboost ˇˇˇˇ E px,yq Drpy hpxqq2s E px,yq Drpy hpxqq2s ˇˇˇˇ ě ϵs ȷ ď δ. Proof. This follows directly from Theorem C.4 and the bound on the Natarajan dimension in Lemma C.7. We also would like to know that our multicalibration guarantees are generalizable. Rather than doing a bespoke analysis here, we can rely on the connection that we have established between failure of multicalibration and ability to improve squared error and argue that if the final hypothesis output by the algorithm was not multicalibrated with high probability then it would be possible to improve its squared error out-of-sample. Thus, by our previous generalization result for squared error, it would be possible to improve the squared error in-sample as well, giving us a contradiction. Theorem C.9 (Multicalibration generalization guarantee). Let ϵ, δ, α, B ą 0 and consider the model f T 1 output by LSBoostpf, α, AH, D, Bq for some sample D of n points drawn i.i.d. from distribution D such that n O ˆd B3 log2pd B{αq α3ϵ2 logp1{δq Then if ϵ ď α 4B , with probability greater than or equal to 1 2δ it follows that f T 1 is 2α-approximately multicalibrated with respect to the distribution D. Proof. Let D tpx1, y1q, . . . , pxn, ynqupxi,yiq D. Consider the model f T 1 output by LSBoostpf, α, AH, D, Bq, and recall that within the run of the algorithm there was also a model f T defined in the final round. Say that model f T 1 is not 2α-approximately multicalibrated with respect to HB and the true distribution D. Since the algorithm running on the sample halted, it must have been that the model in the final round improved in squared error by less than α{p2Bq when measured with respect to the sample D: E px,yq Drpf T 1 yq2s E px,yq Drpf T yq2s ď pα{2Bq. Consider what happens if we run the algorithm again, but with f T 1 as its initial model and now with the underlying distribution as input rather than the sample of n points. Let f 1 T be the model found in the first round of running this process LSBoostpf T 1, α, AH, D, Bq. 
Since f T 1 is not 2α approximately multicalibrated with respect to D and HB, then by an identical argument as in the proof of Theorem 4.3, it it must be that a single round of the algorithm improves the squared error on D by at least α{B. Thus, Epx,yq Drpf T 1 yq2s Epx,yq Drpf 1 T yq2s ą α{B. We know from our previous convergence bound, Theorem C.8, that with probability 1 δ, | Epx,yq Drpf 1 T yq2s Epx,yq Drpf T yq2s| ă ϵ. So, f 1 T must with high probability also improve on the sample D: α B ă E px,yq Drpf T 1 yq2s E px,yq Drpf 1 T yq2s ă E px,yq Drpf T 1 yq2s E px,yq Drpf 1 T yq2s ϵ (with probability ě 1 δ) ă E px,yq Drpf T 1 yq2s E px,yq Drpf 1 T yq2s 2ϵ (with probability ě 1 2δ) where the last line comes from the fact that the error of f 1 T on D cannot be less than the error of f T on D, or else the regression oracle would have found it. Now we have a contradiction: since we have set ϵ ď α 4B , So, it must follow that f T 1 is, with probability 1 2δ, 2α approximately multicalibrated. Multicalibration as Boosting for Regression D. The Relationship Between Approximate Multicalibration and Approximate Bayes Optimality We now turn our attention to deriving a relationship between approximate multicalibration and approximate Bayes optimality. To do so, we ll introduce an even weaker weak learning condition that has one additional parameter ρ, lower bounding the mass of sets S that we can condition on while still requiring the weak learning condition to hold. We remark that Algorithm 1 can be analyzed as a boosting algorithm under this weaker weak learning assumption as well, with only minor modifications in the analysis. Definition D.1 ( pγ, ρq-weak learning condition). Fix a distribution D P Z and let H be a class of arbitrary real-valued functions. We say that H satisfies the pγ, ρq-weak learning condition for D if the following holds. For every set S Ď X such that Prx DX rx P Ss ą ρ, if E px,yq Drpf yq2 | x P Ss ă E px,yq Drp y S yq2 | x P Ss γ, where y S Epx,yq Dry | x P Ss, then there exists h P H such that E px,yq Drphpxq yq2 | x P Ss ă E px,yq Drp y S yq2 | x P Ss γ. We may now prove our theorem showing that approximate multicalibration with respect to a class H implies approximate Bayes optimality if and only if H satisfies the pγ, ρq-weak learning condition. We recall Remark B.3, which notes that we must restrict approximate multicalibration to a bounded subset of H, as we will assume that H is closed under affine transformation. Theorem D.2. Fix any distribution D P Z, any model f : X Ñ r0, 1s, and any class of real valued functions H that is closed under affine transformation. Let: H1 th P H : max x PX hpxq2 ď 1u be the set of functions in H upper-bounded by 1 on X. Let m |Rpfq|, γ ą 0, and α ď γ3 16m. Then if H satisfies the pγ, γ{mq-weak learning condition and f is α-approximately multicalibrated with respect to H1 on D, then f has squared error E px,yq Drpfpxq yq2s ď E px,yq Drpf yq2s 3γ. Conversely, if H does not satisfy the pγ, γ{mq-weak learning condition, there exists a model f : X Ñ r0, 1s that is αapproximately multicalibrated with respect to H1 on D, for α γ, and is perfectly calibrated on D, but f has squared error E px,yq Drpfpxq yq2s ě E px,yq Drpf yq2s γ2{m. Proof. We begin by arguing that α-approximate multicalibration with respect to H1 on D implies approximate Bayes optimality when H satisfies the pγ, γ{mq-weak learning condition. 
Suppose not, and there exists a function f that is α-multicalibrated with respect to H1, but E px,yq Drpf yq2s ă E px,yq Drpfpxq yq2s 3γ. Then there must exist some v P Rpfq such that Prpx,yq Drfpxq vs ą γ{m and E px,yq Drpf yq2 | fpxq vs ă E px,yq Drpfpxq yq2 | fpxq vs 2γ. We observe that since H is closed under affine transformation, the constant function hpxq 1 is in H, and so αapproximate multicalibration with respect to H1 implies α-approximate calibration as well. Thus by definition, Prrfpxq vs ˆ E px,yq Drv y | fpxq vs 2 ď α. Multicalibration as Boosting for Regression Letting yv Ery | fpxq vs, our lower-bound that Prrfpxq vs ą γ{m gives us that pv yvq2 ă αm{γ ď γ 4 2. We now use this upper-bound on calibration error in conjuction with our lower-bound on distance from Bayes optimality to show that the squared error of the constant predictor yv must also be far from Bayes optimal. E px,yq Drpf pxq yq2 | fpxq vs ă E px,yq Drpfpxq yq2 | fpxq vs 2γ E px,yq Drpv yv yv yq2 | fpxq vs 2γ E px,yq Drp yv yq2 | fpxq vs pv yvq2 2γ ă E px,yq Drp yv yq2 | fpxq vs γ. The pγ, γ{mq-weak learning condition then guarantees that there exists some h P H such that E px,yq Drph yq2 | fpxq vs ă E px,yq Drp yv yq2 | fpxq vs γ. By Lemma B.2, the fact that h improves on the squared loss of yv by an additive factor γ, on the set of x such that fpxq v, implies that Erhpxqpy yvq | fpxq vs ą γ{2. Because f is α-approximately calibrated on D, we can use the existence of such an h to witness a failure of multicalibration: Erhpy vq | fpxq vs Erhpxqpy yv yv vq | fpxq vs Erhpxqpy yvq | fpxq vs Erhpxqp yv vq | fpxq vs ą γ{2 | yv v| Prrfpxq vs ˆ E px,yq Drhpxqpy vq | fpxq vs 2 ą γ3 contradicting our assumption that f is α-approximately multicalibrated with respect to H1 for α ă γ3 16m. Therefore approximate multicalibration with respect to H1 must imply that f is approximately Bayes optimal. It remains to show the other direction, that α-approximate multicalibration with respect to a class H1 implies approximate Bayes optimality only if H satisfies the pγ, γ{mq-weak learning condition. If this claim were not true for the stated parameters, then there must exist a class H such that every predictor f that: is α-approximately multicalibrated with respect to H1 is perfectly calibrated on D has range with cardinality |Rpfq| m also has squared error within γ2{m of Bayes optimal, but H does not satisfy the weak learning condition. We will show that no such class exists by defining, for any class H not satisfying the weak learning condition, a predictor f that is α-approximately multicalibrated with respect to that class, but has squared error that is not within γ2{m of Bayes optimal. Recall that if a class H does not satisfy the pγ, γ{mq-weak learning condition, then there must be some set SH such that Prrx P SHs ą γ{m, there does not exist an h P H such that E px,yq Drph yq2 | x P SHs ă E px,yq Drp y SH yq2 | x P SHs γ, but for the Bayes optimal predictor, it holds that its squared loss satisfies E px,yq Drpf yq2 | x P SHs ă E px,yq Drp y SH yq2 | x P SHs γ, Multicalibration as Boosting for Regression where y SH Ery | x P SHs. For some hypothesis class H not satisfying the weak learning condition, and associated set SH, let f H be defined as follows: # f pxq, x R SH y SH, x P SH. . Note that, because f H is constant on SH, there must be some v P Rpfq such that the level set Sv tx P X : fpxq vu contains SH. 
To see that f H is α-approximately multicalibrated with respect to H1, we first consider the contribution to multicalibration error from the level sets not containing SH. For all h P H and v P Rpfq such that v y SH, E px,yq Drhpxqpy f Hpxqq | f Hpxq vs E px,yq Drhpxqpy f pxqq | f Hpxq vs E x Dx E y Dypxqrhpxqy | f Hpxq vs E x Dx rhpxqf pxq | f Hpxq vs E x Dx E y Dypxqrhpxqy | f Hpxq vs E x Dx E y Dypxqrhpxqy | f Hpxq vs For the level set Sv for which SH Ď Sv, we know from the argument above that the elements x P Svz SH contribute nothing to the multicalibration error, as fpxq f pxq on these elements. So, E px,yq Drhpxqpy f Hpxqq | fpxq vs Pr x DXrx P SHs E px,yq Drhpxqpy y SHq | x P SHs Pr x DXrx R SHs E px,yq Drhpxqpy f pxqq | x P Svz SHs Pr x DXrx P SHs E px,yq Drhpxqpy y SHq | x P SHs Therefore if f H is not α-approximately multicalibrated with respect to H1 on D, it must be the case that there exists some h P H1 such that Erhpxqpy y SHq | x P SHs ą ?α. Then by Theorem 3.2, there must exist a h1 P H such that E px,yq Drp y SH yq2 ph1pxq yq2 | x P SHs ą α γ. But SH was defined to be a subset of X for which no such h1 exists and for which Prrx P SHs ą γ{m. This would contradict our assumption that H does not satisfy the pγ, γ{mq-weak learning condition on D, and therefore f H is αapproximately multicalibrated with respect to H1 on D. It remains to prove that f H is far from Bayes optimal. E px,yq Drpf Hpxq yq2s Pr x DXrx P SHs E px,yq Drp y SH yq2 | x P SHs Prrx R SHs E px,yq Drpf pxq yq2 | x R SHs ě Pr x DXrx P SHs ˆ E px,yq Drpf yq2 | x P SHs γ Prrx R SHs E px,yq Drpf pxq yq2 | x R SHs E px,yq Drpf yq2s γ Pr x DXrx P SHs ě E px,yq Drpf yq2s γ2{m. E. Approximate Multicalibration and Loss Minimization for Constrained Classes In this section we prove approximate versions of the exact statements from Section 6, showing that approximate multicalibration with respect to a class H implies approximate multicalibration with respect to C when H satisfies a weak learning condition relative to C. To do so, we need a refined version of one direction of Theorem 3.2 that shows us that if f witnesses a failure of multicalibration with respect to some h P H, then there is another function h1 P H that can be used to improve on f s squared error, while controlling the norm of h1. Multicalibration as Boosting for Regression Lemma E.1. Suppose H is closed under affine transformation. Fix a model f : X Ñ r0, 1s, a levelset v P Rpfq, and a bound B ą 0. Then if there exists an h P H such that maxx PX hpxq2 ď B and Erhpxqpy vq|fpxq vs ě α, for α ě 0, then there exists an h1 P H such that maxx PX h1pxq2 ď p1 ? B α q2 and: Erpfpxq yq2 ph1pxq yq2|fpxq vs ě α2 Proof. Let h1pxq v ηhpxq where η α Erhpxq2|fpxq vs, as in Theorem 3.2. Because hpxq2 is uniformly bounded by B on X, it follows that Erhpxq2s ď B, and we have already shown in the proof of Theorem 3.2 that this implies Erpfpxq yq2 ph1pxq yq2|fpxq vs ě α2 It only remains to bound maxx PX h1pxq2. We begin by lower-bounding Erhpxq2 | fpxq vs in terms of α. α2 ď Erhpxqpy vq | fpxq vs2 ď Er|hpxq||py vq| | fpxq vs2 ď Er|hpxq| | fpxq vs2 ď Erhpxq2 | fpxq vs where the last inequality follows from Jensen s inequality applied to the convex function fpxq x2. We will also need a parameterized version of our weak learning condition. 
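Before stating the parameterized condition, the following small numerical sketch illustrates the construction $h'(x) = v + \eta h(x)$ from the proof of Lemma E.1 (and of Theorem 3.2). On the sample, the squared-error improvement works out to exactly $\alpha^2 / \mathbb{E}[h(x)^2]$, which is at least $\alpha^2 / B$ when $h^2$ is bounded by $B$, consistent with the (truncated) bound stated above. The data and variable names are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic level set: points x with f(x) = v, noisy labels y in [0, 1].
n, v = 50_000, 0.4
y = np.clip(v + 0.2 * rng.standard_normal(n) + 0.1, 0.0, 1.0)  # slightly biased labels
h = rng.uniform(-1.0, 1.0, size=n)      # some real-valued test function h
h = h + 0.5 * (y - v)                   # correlate h with the residual so alpha > 0

alpha = np.mean(h * (y - v))            # E[h(x)(y - v) | f(x) = v], the violation
assert alpha > 0

# Lemma E.1 / Theorem 3.2 construction: h'(x) = v + eta * h(x).
eta = alpha / np.mean(h ** 2)
h_prime = v + eta * h

improvement = np.mean((v - y) ** 2) - np.mean((h_prime - y) ** 2)
# On the sample, improvement equals alpha^2 / E[h^2] exactly (up to float error),
# so a multicalibration violation always yields a squared-error improvement.
print(improvement, alpha ** 2 / np.mean(h ** 2))
```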
Recalling Remark B.3, for approximate multicalibration to be meaningful with respect to a class that is closed under affine transformation, we must specify a bounded subset of that class with respect to which a predictor is approximately multicalibrated. Then to show that approximate multicalibration with respect to one potentially unbounded class implies approximate multicalibration with respect to another, we will need to specify the subsets of each class with respect to which a predictor is claimed to be approximately multicalibrated. This motivates a parameterization of our previous weak learning condition relative to a class C. We will need to assume that whenever there is a B-bounded function in C that improves over the best constant predictor on a restriction of D, there also exists a B-bounded function in H that improves on the restriction as well. Definition E.2 (B-Bounded Weak Learning Assumption Relative to C). Fix a distribution D P Z and two classes of functions H and C. Fix a bound B ą 0 and let HB and CB denote the sets HB th P H : max x PX hpxq2 ď Bu and CB tc P C : max x PX cpxq2 ď Bu respectively. We say that H satisfies the B-bounded γ-bounded weak learning condition relative to C and D if for every S Ď X with Prx DX rx P Ss ą 0, if: min c PCB E px,yq Drpcpxq yq2 | x P Ss ă E px,yq Drp y S yq2 | x P Ss γ, where y S Ery | x P Ss, then there exists h P HB such that E px,yq Drphpxq yq2 | x P Ss ă E px,yq Drp y S yq2 | x P Ss γ. Multicalibration as Boosting for Regression Theorem E.3. Fix a distribution D P Z and two classes of functions H and C that are closed under affine transformations. Fix αC, B ą 0. Let B1 p1 b 2B αC q2 and γ αC 4B . Fix a function f : X Ñ r0, 1s that maps into a countable subset of its range, and let m |Rpfq|, αH ă α3 C 29m B12 , and α ă αCγ2 32m B12 . Then if H satisfies the B1-bounded γ-weak learning condition relative to C and D f is αH-approximately multicalibrated with respect to D and HB1 f is α-approximately calibrated on D, then f is αC-approximately multicalibrated with respect to D and CB. Proof. Suppose not and there exists some c P CB such that v PRpfq Pr x Dxrfpxq vs ˆ E px,yq Drcpxqpy vq | fpxq vs 2 ą αC. Then there must exist some v P Rpfq such that Prrfpxq vs ą αC E px,yq Drcpxqpy vq | fpxq vs2 ą αC{2. Because C is closed under affine transformations, CB is closed under negation, so there must also exist some c1 P CB such that E px,yq Drc1pxqpy vq | fpxq vs ą a Then Lemma B.1 shows that there is a c2 P Cp1 b 2B αC q2 CB1 such that E px,yq Drpy fpxqq2 py c2pxqq2 | fpxq vs ě αC Because f is α-calibrated on D, by definition we have Pr x Dxrfpxq vs ˆ E px,yq Drv y | fpxq vs 2 ă α. Letting yv Ery | fpxq vs, our lower-bound that Prrfpxq vs ą αC 2m gives us that pv yvq2 ă 2αm 16B12 ă γ. So, because v is close to yv, we can show the squared error of f must be close to the squared error of yv on this level set. E px,yq Drpy fpxqq2 | fpxq vs E px,yq Drpy yv yv fpxqq2 | fpxq vs E px,yq Drpy yvq2 2py yvqp yv vq | fpxq vs p yv vq2 E px,yq Drpy yvq2 | fpxq vs p yv vq2 ă E px,yq Drpy yvq2 | fpxq vs γ. 
Then, because the squared error of c2 on this level set is much less than the squared error of f, we find that c2 must also have squared error less than that of yv: E px,yq Drpy yvq2 py c2pxqq2 | fpxq vs ą E px,yq Drpy fpxqq2 γ py c2pxqq2 | fpxq vs Multicalibration as Boosting for Regression We assumed H satisfies the B1-bounded γ-weak learning condition relative to C, so this gives us a function h P HB1 such that E px,yq Drpy yvq2 py hpxqq2 | fpxq vs ą γ. Then Lemma B.1 shows that Erhpxqpy yvq | fpxq vs ą γ{2. So h witnesses a failure of multicalibration of f, since it follows that Erhpxqpy vq | fpxq vs Erhpxqpy yvq | fpxq vs Erhpxqp yv vq | fpxq vs ą γ{2 B1 | yv v| Pr x Dxrfpxq vs ˆ E px,yq Drhpxqpy vq | fpxq vs 2 ą αCγ2 contradicting αH-approximate multicalibration of f on HB1 and D. In (Gopalan et al., 2022), Gopalan, Kalai, Reingold, Sharan, and Wieder show that any predictor that is approximately multicalibrated for a class H and distribution D can be efficiently post-processed to approximately minimize any convex, Lipschitz loss function relative to the class H. The theorem we have just proved can now be used to extend their result to approximate loss minimization over any other class C, so long as H satisfies the B-bounded γ-weak learning assumption relative to C. Intuitively, this follows from the fact that if f is approximately multicalibrated with respect to H on D, it is also approximately multicalibrated with respect to C. However, the notion of approximate multicalibration adopted in (Gopalan et al., 2022) differs from the one in this work. So, to formalize our intuition above, we will first state the covariance-based definition of approximate multicalibration appearing in (Gopalan et al., 2022) and prove a lemma relating it to our own. We note that, going forward, we will restrict ourselves to distributions D over X ˆ t0, 1u, as in this case the two definitions of approximate multicalibration are straightforwardly connected. Definition E.4 (Approximate Covariance Multicalibration (Gopalan et al., 2022)). Fix a distribution D over X ˆ t0, 1u and a function f : X Ñ r0, 1s that maps onto a countable subset of its range, denoted Rpfq. Let H be an arbitrary collection of real valued functions h : X Ñ R. Then f is α-approximately covariance multicalibrated with respect to H on D if ÿ v PRpfq Pr x DXrfpxq vs ˇˇErphpxq hvqpy yvq | fpxq vs ˇˇ ď α, where hv Erhpxq | fpxq vs and yv Ery | fpxq vs. Lemma E.5. Fix a distribution D over X ˆ t0, 1u and a class of functions on X, H. Let HB denote the subset HB th P H : max x PX hpxq2 ď Bu. Fix a function f : X Ñ r0, 1s that maps onto a countable subset of its range, denoted Rpfq. Then if f is α-approximately multicalibrated with respect to HB on D, then f is p?αp1 ? Bqq-approximately covariance multicalibrated. That is, for all h P HB, f satisfies v PRpfq Prrfpxq vs ˇˇErphpxq hvqpy yvq | fpxq vs ˇˇ ď ?αp1 ? Multicalibration as Boosting for Regression v PRpfq Prrfpxq vs ˇˇErphpxq hvqpy yvq | fpxq vs ˇˇ v PRpfq Prrfpxq vs ˇˇErhpxqy | fpxq vs yv hv ˇˇ v PRpfq Prrfpxq vs ˇˇErhpxqy | fpxq vs v hv v hv yv hv ˇˇ v PRpfq Prrfpxq vs ˇˇErhpxqpy vq | fpxq vs hvpv yvq ˇˇ v PRpfq Prrfpxq vs |Erhpxqpy vq | fpxq vs| ˇˇ hvpv yvq ˇˇ v PRpfq Prrfpxq vs |v yv| where the second inequality follows from the fact that Erxs ď a Erx2s and the bound maxx PX hpxq2 ď B. 
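As a concrete illustration of the two notions related by Lemma E.5, the following sketch estimates both the multicalibration error used in this paper and the covariance-based error of Definition E.4 from a finite sample, for a single test function h and binary labels. The predictor and test function below are arbitrary stand-ins, not objects from the paper.

```python
import numpy as np

def multicalib_error(f_vals, y, h_vals):
    # sum_v Pr[f = v] * ( E[h(x)(y - v) | f(x) = v] )^2   (this paper's notion)
    total = 0.0
    for v in np.unique(f_vals):
        mask = f_vals == v
        total += mask.mean() * np.mean(h_vals[mask] * (y[mask] - v)) ** 2
    return total

def cov_multicalib_error(f_vals, y, h_vals):
    # sum_v Pr[f = v] * | Cov(h(x), y | f(x) = v) |   (covariance notion, Definition E.4)
    total = 0.0
    for v in np.unique(f_vals):
        mask = f_vals == v
        hv, yv = h_vals[mask].mean(), y[mask].mean()
        total += mask.mean() * abs(np.mean((h_vals[mask] - hv) * (y[mask] - yv)))
    return total

rng = np.random.default_rng(2)
x = rng.uniform(size=10_000)
y = (rng.uniform(size=x.size) < x).astype(float)   # Bernoulli(x) labels in {0, 1}
f_vals = np.round(x * 10) / 10                     # a roughly calibrated predictor
h_vals = np.sin(6 * x)                             # an arbitrary bounded test function
print(multicalib_error(f_vals, y, h_vals), cov_multicalib_error(f_vals, y, h_vals))
```

Lemma E.5 says that if the first quantity is at most $\alpha$ for every $h$ with $h^2 \leq B$, the second is at most $\sqrt{\alpha}(1 + \sqrt{B})$.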
We now recall a theorem of (Gopalan et al., 2022), showing that approximate covariance multicalibration with respect to a class H implies approximate loss minimization relative to H, for convex, Lipschitz losses.

Theorem E.6. Fix a distribution $\mathcal{D}$ over $X \times \{0,1\}$ and a class $H$ of real-valued functions on $X$. Fix a function $f : X \to [0,1]$ that maps onto a countable subset of its range, denoted $R(f)$. Let $\mathcal{L}$ be a class of functions on $\{0,1\} \times \mathbb{R}$ that are convex and $L$-Lipschitz in their second argument. If $f$ is $\alpha$-approximately covariance multicalibrated with respect to $H_B$ on $\mathcal{D}$, then for every $\ell \in \mathcal{L}$ there exists an efficient post-processing function $k_\ell$ such that
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, k_\ell(f(x)))] \leq \min_{h \in H_B} \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, h(x))] + 2\alpha L.$$

Corollary E.7. Fix a distribution $\mathcal{D}$ over $X \times \{0,1\}$ and two classes $H$ and $C$ of real-valued functions on $X$ that are closed under affine transformation. Fix a function $f : X \to [0,1]$ that maps onto a countable subset of its range, denoted $R(f)$. Let $\mathcal{L}$ be a class of functions on $\{0,1\} \times \mathbb{R}$ that are convex and $L$-Lipschitz in their second argument. Fix $\alpha_C, B > 0$. Let $B' = (1 + \sqrt{2B/\alpha_C})^2$ and $\gamma = \frac{\alpha_C}{4B}$. Let $m = |R(f)|$, $\alpha_H < \frac{\alpha_C^3}{2^9 m B'^2}$, and $\alpha < \frac{\alpha_C \gamma^2}{32 m B'^2}$. Then if $H$ satisfies the $B'$-bounded $\gamma$-weak learning condition relative to $C$ and $\mathcal{D}$, $f$ is $\alpha_H$-approximately multicalibrated with respect to $\mathcal{D}$ and $H_{B'}$, and $f$ is $\alpha$-approximately calibrated on $\mathcal{D}$, then for every $\ell \in \mathcal{L}$ there exists an efficient post-processing function $k_\ell$ such that
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, k_\ell(f(x)))] \leq \min_{c \in C_B} \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, c(x))] + 2L\sqrt{\alpha_C}\,(1 + \sqrt{B}).$$

Proof. We have from Theorem E.3 that under the assumed conditions, $f$ is $\alpha_C$-approximately multicalibrated with respect to $C_B$ on $\mathcal{D}$. It follows from Lemma E.5 that $f$ is $\sqrt{\alpha_C}(1 + \sqrt{B})$-approximately covariance multicalibrated with respect to $C_B$ on $\mathcal{D}$. The result of (Gopalan et al., 2022) then gives us that for all $\ell \in \mathcal{L}$, there exists an efficient post-processing function $k_\ell$ such that
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, k_\ell(f(x)))] \leq \min_{c \in C_B} \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, c(x))] + 2L\sqrt{\alpha_C}\,(1 + \sqrt{B}).$$

F. Additional Experimental Details

In this section we provide further details from Section 7 regarding the empirical evaluation of Algorithm 1.

F.1. Prediction on Census Data

In the evaluation of Algorithm 1 on US Census data from the folktables package (Ding et al., 2021), the exact features included are:

feature   description
AGEP      age
COW       class of worker
SCHL      education level
MAR       marital status
OCCP      occupation
POBP      place of birth
RELP      relationship
WKHP      work hours per week
SEX       binary sex
RAC1P     race

Table 1: Features included in the income prediction task.

# Weak Learners      DT(1)             DT(2)             DT(3)
                   LS      GB        LS      GB        LS      GB
50 level sets
  100             9.11   11.97      5.86   23.01      6.88   32.92
  300            18.70   35.81     14.90   69.17     15.64  102.14
  500            27.00   58.19     21.74  115.65     24.77  169.90
  1000           46.73  116.49     42.92  231.74     46.38  336.89
100 level sets
  100             7.18   11.97      5.29   23.01      5.06   32.92
  300            13.08   35.81     13.55   69.17     14.72  102.14
  500            21.20   58.19     19.57  115.65     21.79  169.90
  1000           41.99  116.49     36.26  231.74     40.92  336.89
300 level sets
  100             5.87   11.97      9.18   23.01      6.54   32.92
  300            13.21   35.81     17.46   69.17     11.13  102.14
  500            19.05   58.19     22.20  115.65     19.64  169.90
  1000           32.80  116.49     36.61  231.74     27.12  336.89

Table 2: Time (in seconds) for Algorithm 1 (LS), run with 50, 100, and 300 level sets, and Gradient Boosting (GB) to train a given number of weak learners, for various weak learner classes.

In Table 2, we compare the time taken to train n weak learners with Algorithm 1 and with scikit-learn's version of Gradient Boosting on our census data.
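For readers who want to reproduce a comparison of this flavor, the following schematic timing harness shows the Gradient Boosting side using standard scikit-learn calls. The data here is synthetic stand-in data rather than the folktables census task, and the LSBoost interface sketched in the trailing comment is hypothetical (the actual API of the paper's open-source implementation may differ).

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in data; the paper's experiments use the folktables ACS income task instead.
X, y = make_regression(n_samples=20_000, n_features=10, noise=10.0, random_state=0)
y = (y - y.min()) / (y.max() - y.min())            # rescale labels to [0, 1]

def time_gb(n_estimators, max_depth):
    gb = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=max_depth)
    t0 = time.perf_counter()
    gb.fit(X, y)
    return time.perf_counter() - t0

for depth in (1, 2, 3):
    print(f"GB, depth-{depth} trees, 100 estimators: {time_gb(100, depth):.2f}s")

# The LS columns of Table 2 would be produced analogously, e.g. (hypothetical API):
#   from lsboost import LSBoostingRegressor
#   ls = LSBoostingRegressor(n_level_sets=50, weak_learner_max_depth=depth)
#   ... timing ls.fit(X, y) until it has trained the same number of weak learners.
```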
Recall that our algorithm trains multiple weak learners per round of boosting, and so comparing the two algorithms for a fixed number of calls to the weak learner is distinct from comparing them for a fixed number of rounds. Because models output by Algorithm 1 may be more complex than those produced by Gradient Boosting run for the same number of rounds, we use the number of weak learners trained as a proxy for model complexity, and compare the two algorithms holding this measure fixed. We see that the trend for Gradient Boosting is linear in the number of weak learners, whereas Algorithm 1 does not follow the same linear pattern at first. This is because the algorithm cannot fully leverage parallelization of weak learner training in the early stages of boosting. At each round, Algorithm 1 calls the weak learner on every sufficiently large level set of the current model, and it is these independent calls that can easily be parallelized. However, in the early rounds of boosting the model may be relatively simple, and so many level sets may be sparsely populated. As the model becomes more expressive over subsequent rounds, the weak learner is invoked on more sets per round, allowing us to fully utilize parallelization.

Figure 5: MSE and MSCE comparison of Algorithm 1 (LS) and Gradient Boosting (GB) on linear regression and decision trees of varying depths. * indicates the termination round of LS and occurs, from top to bottom, at T = 41, 23, 39, 20.

In Figure 5, we measure MSE and MSCE for Algorithm 1 and Gradient Boosting over rounds of training on our census data. Again, we note that one round of Algorithm 1 is not equivalent to one round of Gradient Boosting, but we intend to demonstrate error comparisons and rates of convergence. For the linear regression plots, Gradient Boosting does not reduce either error, since combinations of linear models are also linear. As the complexity of the underlying model class increases, Gradient Boosting surpasses Algorithm 1 in terms of MSE, though it does not minimize calibration error. To further show the advantages of parallel training, in Figure 6 we plot training performance over time for both Gradient Boosting and Algorithm 1 using various weak learners.

Figure 6: MSE as a function of training time for Algorithm 1 (LS, with 50, 100, and 200 level sets) and Gradient Boosting (GB) on decision trees of depths 1 through 4.

We notice that Algorithm 1, like most machine learning algorithms, is prone to overfitting when allowed. Future performance heuristics we intend to investigate include validating updates, complexity penalties, and weighted mixtures of updates.
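The per-round parallelism described above can be sketched as follows: each sufficiently populated level set of the current model receives its own independent regression-oracle call, and those calls can be dispatched in parallel, e.g. with joblib. This is an illustrative sketch with made-up names, not the repository's actual code.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

def fit_on_level_set(X, y, mask, max_depth=1):
    # One regression-oracle call, restricted to a single level set of the current model.
    h = DecisionTreeRegressor(max_depth=max_depth)
    h.fit(X[mask], y[mask])
    return h

def boosting_round(X, y, f_vals, levels, min_count=30, n_jobs=-1):
    # Fit one weak learner per sufficiently populated level set, in parallel.
    masks = [(v, f_vals == v) for v in levels]
    masks = [(v, m) for v, m in masks if m.sum() >= min_count]
    learners = Parallel(n_jobs=n_jobs)(
        delayed(fit_on_level_set)(X, y, m) for _, m in masks
    )
    return dict(zip((v for v, _ in masks), learners))

# Early rounds touch few level sets, later rounds touch many, which is why
# wall-clock time is not linear in the number of weak learners trained.
rng = np.random.default_rng(3)
X = rng.uniform(size=(5_000, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(5_000)
f_vals = np.round(X[:, 0] * 10) / 10      # stand-in for the current rounded model
hs = boosting_round(X, y, f_vals, levels=np.unique(f_vals))
print(len(hs), "weak learners trained this round")
```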
F.2. Prediction on Synthetic Data

In the synthetic prediction tasks, we produce labels for each point using one of the following two functions (Figure 7). The first is defined piecewise over the four quadrants of the plane:

(x 1)^2 (y 1)^2,  if x ≤ 0, y ≥ 0
(x 1)^2 (y 1)^2,  if x > 0, y ≥ 0
(x 1)^2 (y 1)^2,  if x ≤ 0, y < 0
(x 1)^2 (y 1)^2,  if x > 0, y < 0

and the second is:

x 20xy^2 cos( 8x) sin(8y) (1.5x 4)(x 1)^2 y 3 (y 1)^2,  if x ≤ 0, y ≥ 0
x 20xy^2 cos(8x) sin(8y) (1.5x 4)(x 1)^2 y 3 (y 1)^2,  if x > 0, y ≥ 0
x 20xy^2 cos( 8x) sin(8y) (1.5x 4)(x 1)^2 y 3 (y 1)^2,  if x ≤ 0, y < 0
x 20xy^2 cos(8x) sin(8y) (1.5x 4)(x 1)^2 y 3 (y 1)^2,  if x > 0, y < 0

Figure 7: 3-D visual mapping of C0 and C1. Repeated image of Figure 2 with a larger scale for ease of viewing.

Figure 8: Stages of Algorithm 1 learning C1 with linear regression (LR) and depth-d decision trees (DT(d)). In the out-of-sample plot for linear regression, points are not mapped to their proper positions, implying that C1 cannot be learned by linear functions. All other hypothesis classes eventually converge to C1.
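A schematic version of this synthetic experiment can be set up as below. The quadrant-wise label function here is only a stand-in for the two functions displayed above (whose exact expressions are shown in Figure 7), and the model classes mirror those compared in Figure 8.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(20_000, 2))

def quadrant_label(x):
    # Stand-in for the piecewise label functions C0/C1: a different smooth
    # surface on each quadrant of the square [-1, 1]^2.
    a, b = x[:, 0], x[:, 1]
    sign = np.where(a <= 0, -1.0, 1.0) * np.where(b >= 0, 1.0, -1.0)
    return sign * ((a + 1) ** 2 + (b - 1) ** 2) / 8.0

y = quadrant_label(X)

for name, model in [("LR", LinearRegression()),
                    ("DT(3)", DecisionTreeRegressor(max_depth=3)),
                    ("DT(8)", DecisionTreeRegressor(max_depth=8))]:
    model.fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)
    print(f"{name}: in-sample MSE = {mse:.4f}")
# As in Figure 8, linear regression cannot capture the quadrant structure,
# while sufficiently deep trees (and boosted ensembles of them) can.
```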