# Regression with Multi-Expert Deferral

Anqi Mao¹, Mehryar Mohri²,¹, Yutao Zhong¹

Learning to defer with multiple experts is a framework where the learner can choose to defer the prediction to several experts. While this problem has received significant attention in classification contexts, it presents unique challenges in regression due to the infinite and continuous nature of the label space. In this work, we introduce a novel framework of regression with deferral, which involves deferring the prediction to multiple experts. We present a comprehensive analysis for both the single-stage scenario, where there is simultaneous learning of predictor and deferral functions, and the two-stage scenario, which involves a pre-trained predictor with a learned deferral function. We introduce new surrogate loss functions for both scenarios and prove that they are supported by H-consistency bounds. These bounds provide consistency guarantees that are stronger than Bayes consistency, as they are non-asymptotic and hypothesis set-specific. Our framework is versatile, applying to multiple experts, accommodating any bounded regression losses, addressing both instance-dependent and label-dependent costs, and supporting both single-stage and two-stage methods. Our single-stage formulation subsumes as a special case the recent regression with abstention (Cheng et al., 2023) framework, where only a single expert is considered, specifically for the squared loss and a label-independent cost. Minimizing our proposed loss functions directly leads to novel algorithms for regression with deferral. We report the results of extensive experiments showing the effectiveness of our proposed algorithms.

¹Courant Institute of Mathematical Sciences, New York, NY; ²Google Research, New York, NY. Correspondence to: Anqi Mao, Mehryar Mohri, Yutao Zhong. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

The accuracy of learning algorithms can be greatly enhanced by redirecting uncertain predictions to experts or advanced pre-trained models. Experts can be individuals with specialized domain knowledge or more sophisticated, albeit costly, pre-trained models. The cost of an expert is important to consider, as it may capture the computational resources it requires or the quality of its performance. The cost can further be instance-dependent and label-dependent. How can we effectively assign each input instance to the most suitable expert among a pool of several, considering both accuracy and cost? This is the challenge of learning to defer in the presence of multiple experts, which is prevalent in various domains, including natural language generation tasks, notably large language models (LLMs) (Wei et al., 2022; Bubeck et al., 2023), speech recognition, image annotation and classification, medical diagnosis, financial forecasting, natural language processing, computer vision, and many others.

This paper deals with the problem of learning to defer with multiple experts in the regression setting. While this problem has received significant attention in classification contexts (Hemmer et al., 2022; Keswani et al., 2021; Kerrigan et al., 2021; Straitouri et al., 2022; Benz & Rodriguez, 2022; Verma et al., 2023; Mao et al., 2023a; 2024a), it presents unique challenges in regression due to the infinite and continuous nature of the label space.
In particular, the score-based formulation commonly used in classification is inapplicable here, since regression problems cannot be represented using multi-class scoring functions, with auxiliary labels corresponding to each expert. Our approach involves defining prediction and deferral functions, consistent with previous studies in classification (Mao et al., 2023a; 2024a). We present a comprehensive analysis for both the single-stage scenario (simultaneous learning of predictor and deferral functions) (Section 3), and the twostage scenario (pre-trained predictor with learned deferral function) (Section 4). We introduce new surrogate loss functions for both scenarios and prove that they are supported by H-consistency bounds. These are consistency guarantees that are stronger than Bayes consistency, as they are non-asymptotic and hypothesis set-specific. Our framework is versatile, applying to multiple experts, accommodating Regression with Multi-Expert Deferral any bounded regression losses, addressing both instancedependent and label-dependent costs, and supporting both single-stage and two-stage methods. We also instantiate our formulations in the special case of a single expert (Section 5), and demonstrate that our single-stage formulation includes the recent regression with abstention framework (Cheng et al., 2023) as a special case, where only a single expert, the squared loss and a label-independent cost are considered. In Section 6, we report the results of extensive experiments showing the effectiveness of our proposed algorithms. Previous related work. The problem of learning to defer, or the special case of learning with abstention characterized by a single expert and constant cost, has received much attention in classification tasks. Previous work on this topic mainly includes the following formulations or methods: confidence-based, predictor-rejector, score-based, and selective classification. In the confidence-based formulation, the rejection function r is based on the magnitude of the value of the predictor h (Chow, 1957; 1970; Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; 2011). This approach has been further extended to multi-class classification in (Ramaswamy et al., 2018; Ni et al., 2019), where the function r is based on the magnitude of the value of the probability (e.g., softmax) corresponding to the predictor h. This formulation becomes inapplicable in regression, since in this setting the prediction value cannot be interpreted as a measure of confidence. The score-based formulation was proposed in the multiclass classification scenario, where the multi-class categories are augmented with additional labels corresponding to the experts, and the deferral is determined using the highest score (Mozannar & Sontag, 2020; Verma & Nalisnick, 2022; Cao et al., 2022; Mao et al., 2024c; Verma et al., 2023; Mao et al., 2024a). However, this formulation is also inapplicable in regression, since regression problems cannot be represented using multi-class scoring functions with auxiliary labels corresponding to each expert. The approach of learning based on two distinct yet jointly learned functions h and r in this paper is commonly referred as the predictor-rejector formulation (Cortes et al., 2016b;a; Charoenphakdee et al., 2021; Cortes et al., 2023; Mohri et al., 2024; Mao et al., 2024b). We show that this method can be extended to the regression setting for deferral with multiple experts, which underscores its versatility and significance. 
An alternative approach of selective classification (El-Yaniv et al., 2010; Wiener & El-Yaniv, 2011; El-Yaniv & Wiener, 2012; Wiener & El-Yaniv, 2012; 2015; Geifman & El-Yaniv, 2017; 2019) optimizes non-abstained sample generalization error under a fixed selection rate. However, this method does not apply to the deferral case where the cost depends on the label y and where there are multiple experts. Moreover, it has been reported to perform suboptimally compared to the predictor-rejector formulation in regression with abstention settings with constant cost and a single expert (Cheng et al., 2023). More recently, a series of publications (Mao et al., 2023a; Mohri et al., 2024; Mao et al., 2024b) have explored the two-stage method of learning with deferral or abstention, wherein the predictor h is first learned and subsequently used in the learning process of the deferral function r. This scenario is crucial in practice because the predictor is often given and often cannot be retrained. This method also differs from post-hoc approaches (Okati et al., 2021; Narasimhan et al., 2022), which are not applicable to existing predictors trained in the standard classification scenario. In this work, we will study both the single-stage and two-stage methods for regression with deferral. In the special case of regression with abstention (corresponding to a single expert and label-independent cost case), Wiener & El-Yaniv (2012) characterized the optimal selector for selective regression, Zaoui et al. (2020) studied non-parametric algorithms, Geifman & El-Yaniv (2019) and Jiang et al. (2020) explored the selective classification using neural network-based algorithms; Shah et al. (2022) used the selective classification with greedy algorithms; De et al. (2020) presented a method that is tailored specifically to ridge regression and small datasets. It proposed an approximate procedure for learning a linear hypothesis that determines which training instances should be deferred. They then used a nearest neighbor approach to defer on new instances; and Li et al. (2023) investigated a two-step norejection learning strategy. However, none of these previous publications studied surrogate losses for regression with abstention. This excludes (Cheng et al., 2023), who proposed a single-stage surrogate loss for learning the predictor-rejector pair. We will show that their method coincides with a special case of our single-stage regression with deferral surrogate losses, where there is a single expert and where the cost does not depend on the label y. Another line of work has studied dynamic classifier selection or dynamic ensemble selection (Ko et al., 2008; Cruz et al., 2018; Ekanayake et al., 2023), which aims to select the most competent classifiers or ensemble of classifiers in the local region where each instance is located. While these methods also consider how to select an expert from a pool of several, their primary mechanism involves dividing the feature space into distinct regions (a region is typically defined via clustering or nearest-neighbor techniques). In contrast, learning to defer with multiple experts aims to learn a deferral function by minimizing a surrogate loss that accounts for the accuracy and cost of each expert across Regression with Multi-Expert Deferral all instances. Also, no local region or local competence is considered. It is worth noting that not all dynamic classifier selection methods consider a local region or local competence. For instance, the recent work by Ekanayake et al. 
(2023) learns a joint feature acquisition and classifier selection policy to identify the most relevant subset of features based on which classification should be performed, and the classifier to be used. In that sense, the policy for classifier selection essentially functions as a deferral mechanism, since it decides when to defer the decision to an expert. However, this method bases its decisions solely on accuracy, not on cost, whereas our focus is on regression, considering both accuracy and base (inference) cost.

Learning to defer with multiple experts in the classification setting has been studied in (Mao et al., 2023a; Verma et al., 2023; Mao et al., 2024a). Verma et al. (2023) and Mao et al. (2024a) investigated the single-stage scenario with a score-based formulation, while Mao et al. (2023a) explored the two-stage scenario with both score-based and predictor-rejector formulations. However, as previously highlighted, the score-based formulation does not apply in the regression setting. Our new predictor-rejector formulation not only overcomes this limitation, but also provides the foundation for the design of new deferral algorithms for classification.

2. Preliminaries

Learning scenario of regression. We first describe the familiar problem of supervised regression and introduce our notation. Let $\mathcal{X}$ be the input space and $\mathcal{Y} \subseteq \mathbb{R}$ the label space. We write $\mathcal{D}$ to denote a distribution over $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{H}_{\mathrm{all}}$ be the family of all real-valued measurable functions $h\colon \mathcal{X} \to \mathcal{Y}$, and let $\mathcal{H} \subseteq \mathcal{H}_{\mathrm{all}}$ be the hypothesis set adopted. The learning challenge in regression is to use the labeled sample to find a hypothesis $h \in \mathcal{H}$ with small expected loss or generalization error $\mathcal{E}_L(h)$, with $\mathcal{E}_L(h) = \mathbb{E}_{(x,y)\sim \mathcal{D}}[L(h(x),y)]$, where $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function used to measure the magnitude of error in the regression. In the most common case, where $L$ is the squared loss $L_2$ defined by $L_2(y',y) = (y' - y)^2$, this represents the mean squared error. In the case where $L$ is the $L_1$ loss defined by $L_1(y',y) = |y' - y|$, this represents the mean absolute error. More generally, $L$ can be an $L_p$ loss, defined by $L_p(y',y) = |y' - y|^p$ for all $y',y \in \mathcal{Y}$, for some $p \geq 1$. In this work, we will consider an arbitrary regression loss function $L$, subject to the boundedness assumption, that is $L(y',y) \leq \bar{\ell}$ for some constant $\bar{\ell} > 0$ and for all $y',y \in \mathcal{Y}$. This assumption is commonly adopted in the theoretical analysis of regression (Mohri et al., 2018).

Regression with deferral. We introduce a novel framework where a learner can defer predictions to multiple experts, $g_1,\ldots,g_{n_e}$. Each expert may represent a pre-trained model or a human expert. The learner's output is a pair $(h,r)$, where $h\colon \mathcal{X} \to \mathcal{Y}$ is a prediction function and $r\colon \mathcal{X} \times \{0,1,\ldots,n_e\} \to \mathbb{R}$ a deferral function. For any input $x$, $r(x) = \operatorname{argmax}_{j \in \{0,1,\ldots,n_e\}} r(x,j)$ denotes the deferral decision: $r(x) = j > 0$ means the prediction is deferred to expert $g_j$, and $r(x) = 0$ means no deferral. The learner makes the prediction $h(x)$ when $r(x) = 0$, or defers to $g_j$ when $r(x) = j > 0$. Deferral incurs the cost $L(g_j(x),y) + \alpha_j$, where $\alpha_j$ is a base cost. The base cost can be the inference cost incurred when querying an expert, factoring in scenarios where engaging experts entails certain costs. Non-deferral incurs the cost $L(h(x),y)$. Let $\mathcal{H}_{\mathrm{all}}$ and $\mathcal{R}_{\mathrm{all}}$ denote the families of all measurable functions $h\colon \mathcal{X} \to \mathcal{Y}$ and $r\colon \mathcal{X} \times \{0,1,\ldots,n_e\} \to \mathbb{R}$, respectively.
Given a hypothesis set $\mathcal{H} \subseteq \mathcal{H}_{\mathrm{all}}$ and a hypothesis set $\mathcal{R} \subseteq \mathcal{R}_{\mathrm{all}}$, the goal of the regression with deferral problem consists of using the labeled sample to find a pair $(h,r) \in (\mathcal{H},\mathcal{R})$ with small expected deferral loss $\mathcal{E}_{L_{\mathrm{def}}}(h,r) = \mathbb{E}_{(x,y)\sim \mathcal{D}}[L_{\mathrm{def}}(h,r,x,y)]$, where $L_{\mathrm{def}}$ is defined for any $(h,r) \in \mathcal{H} \times \mathcal{R}$ and $(x,y) \in \mathcal{X} \times \mathcal{Y}$ by

$$L_{\mathrm{def}}(h,r,x,y) = L(h(x),y)\,1_{r(x)=0} + \sum_{j=1}^{n_e} c_j(x,y)\,1_{r(x)=j} \qquad (1)$$

and $c_j(x,y) > 0$ is a cost function, which can typically be chosen as $\alpha_j + L(g_j(x),y)$ for an expert $g_j$ and a base cost $\alpha_j > 0$, as mentioned before. Here, we adopt general cost functions $c_j$ for any $j$, and only require that the cost remains bounded: $c_j(x,y) \leq \bar{c}_j$ for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$, for some constant $\bar{c}_j > 0$.

Learning with surrogate losses. As with most target losses in learning problems, such as the zero-one loss in classification (Zhang, 2004a; Bartlett et al., 2006; Zhang, 2004b; Tewari & Bartlett, 2007) and the classification with abstention loss (Bartlett & Wegkamp, 2008; Cortes et al., 2016b), directly minimizing the deferral loss $L_{\mathrm{def}}$ is computationally hard for most hypothesis sets due to its non-continuity and non-differentiability. Instead, surrogate losses are proposed and adopted in practice. Examples include the hinge loss in binary classification (Cortes & Vapnik, 1995), the (multinomial) logistic loss in multi-class classification (Verhulst, 1838; 1845; Berkson, 1944; 1951), and the predictor-rejector abstention loss in classification with abstention (Cortes et al., 2016b). We will derive surrogate losses for the deferral loss.

Given a surrogate loss $L(h,r,x,y) \in \mathbb{R}_+$, we denote by $\mathcal{E}_L(h,r)$ the generalization error of a pair $(h,r)$, defined as $\mathcal{E}_L(h,r) = \mathbb{E}_{(x,y)\sim \mathcal{D}}[L(h,r,x,y)]$. Let $\mathcal{E}^*_L(\mathcal{H},\mathcal{R}) = \inf_{h \in \mathcal{H}, r \in \mathcal{R}} \mathcal{E}_L(h,r)$ be the best-in-class error within the family $\mathcal{H} \times \mathcal{R}$. One desired property for surrogate losses in this context is Bayes-consistency (Steinwart, 2007). This means that minimizing the expected surrogate loss over the family of all measurable functions leads to minimizing the expected deferral loss over the same family. More precisely, a surrogate loss $L(h,r,x,y) \in \mathbb{R}_+$ is Bayes-consistent with respect to $L_{\mathrm{def}}$ if, for all sequences $\{(h_n,r_n)\}_{n \in \mathbb{N}} \subseteq \mathcal{H}_{\mathrm{all}} \times \mathcal{R}_{\mathrm{all}}$ and all distributions,

$$\mathcal{E}_L(h_n,r_n) - \mathcal{E}^*_L(\mathcal{H}_{\mathrm{all}},\mathcal{R}_{\mathrm{all}}) \xrightarrow{\;n \to +\infty\;} 0 \implies \mathcal{E}_{L_{\mathrm{def}}}(h_n,r_n) - \mathcal{E}^*_{L_{\mathrm{def}}}(\mathcal{H}_{\mathrm{all}},\mathcal{R}_{\mathrm{all}}) \xrightarrow{\;n \to +\infty\;} 0.$$

Recently, Awasthi et al. (2022a;b) (see also (Awasthi et al., 2021a;b; 2023a;b; Mao et al., 2023f;c;d; Zheng et al., 2023; Mao et al., 2023b;e; 2024e;d;f)) pointed out that Bayes-consistency does not take into account the hypothesis set $\mathcal{H}$ and is only an asymptotic guarantee. Thus, they proposed a stronger guarantee called H-consistency bounds. In our context, a surrogate loss $L$ is said to admit an $(\mathcal{H},\mathcal{R})$-consistency bound with respect to $L_{\mathrm{def}}$ if, for all $(h,r) \in \mathcal{H} \times \mathcal{R}$ and all distributions, the following inequality holds:

$$f\big(\mathcal{E}_{L_{\mathrm{def}}}(h,r) - \mathcal{E}^*_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R})\big) \leq \mathcal{E}_L(h,r) - \mathcal{E}^*_L(\mathcal{H},\mathcal{R})$$

for some non-decreasing function $f\colon \mathbb{R}_+ \to \mathbb{R}_+$. In particular, when $(\mathcal{H},\mathcal{R}) = (\mathcal{H}_{\mathrm{all}},\mathcal{R}_{\mathrm{all}})$, the $(\mathcal{H},\mathcal{R})$-consistency bound implies Bayes-consistency. We will prove $(\mathcal{H},\mathcal{R})$-consistency bounds for our proposed surrogate losses, which imply their Bayes-consistency. One key term in our bounds is the minimizability gap, defined as $\mathcal{M}_L(\mathcal{H},\mathcal{R}) = \mathcal{E}^*_L(\mathcal{H},\mathcal{R}) - \mathbb{E}_x\big[\inf_{h \in \mathcal{H}, r \in \mathcal{R}} \mathbb{E}_{y|x}[L(h,r,x,y)]\big]$. The minimizability gap characterizes the difference between the best-in-class error and the expected best-in-class pointwise error, and is non-negative. As shown by Mao et al. (2023f), the minimizability gap is upper bounded by the approximation error, satisfying $0 \leq \mathcal{M}_L(\mathcal{H},\mathcal{R}) \leq \mathcal{E}^*_L(\mathcal{H},\mathcal{R}) - \mathcal{E}^*_L(\mathcal{H}_{\mathrm{all}},\mathcal{R}_{\mathrm{all}})$, and is generally a finer quantity.
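To make the target loss concrete before turning to surrogates, the following minimal NumPy sketch evaluates the deferral loss of Eq. (1) on a finite sample, with the costs chosen as $c_j(x,y) = \alpha_j + L(g_j(x),y)$ and $L$ the squared loss, as discussed above; the function names and array shapes are illustrative and are not taken from the authors' code.

```python
import numpy as np

def squared_loss(y_pred, y):
    return (y_pred - y) ** 2

def deferral_loss(h_pred, r_scores, expert_preds, y, alphas):
    """Average target deferral loss L_def of Eq. (1) over a batch, with costs chosen
    as c_j(x, y) = alpha_j + L(g_j(x), y) and L the squared loss.

    h_pred:       (n,) predictions of the regressor h
    r_scores:     (n, n_e + 1) deferral scores r(x, j), j = 0 (no deferral), 1, ..., n_e
    expert_preds: (n, n_e) predictions g_j(x) of the n_e experts
    y:            (n,) true labels
    alphas:       (n_e,) base costs alpha_j for querying each expert
    """
    # Deferral decision: r(x) = argmax_{j in {0, ..., n_e}} r(x, j).
    decision = np.argmax(r_scores, axis=1)                          # (n,)
    own_cost = squared_loss(h_pred, y)                              # L(h(x), y)
    expert_costs = alphas + squared_loss(expert_preds, y[:, None])  # c_j(x, y), (n, n_e)
    all_costs = np.column_stack([own_cost, expert_costs])           # (n, n_e + 1)
    # Each example incurs the cost of the option selected by the deferral function.
    return all_costs[np.arange(len(y)), decision].mean()
```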
The minimizability gap vanishes when (H,R) = (Hall,Rall), or, more generally, when E L(H,R) = E L(Hall,Rall). Given a loss function ℓ (r,x,y) R+ that only depends on the hypothesis r, the notions of generalization error, bestin-class generalization error, and minimizability gaps, as well as Bayes-consistency and R-consistency bounds, are similarly defined (Awasthi et al., 2022a;b). In the next sections, we study the problem of learning a pair (h,r) in the framework of regression with deferral. We will derive a family of surrogate losses of Ldef, starting from first principles. We will show that these loss functions benefit from strong consistency guarantees, which yield directly principled algorithms for our deferral problem. We will specifically distinguish two approaches: the single-stage surrogate losses, where the predictor h and the deferral function r are jointly learned, and the two-stage surrogate losses wherein the predictor h have been previously trained and is fixed and subsequently used in the learning process of the deferral function r. 3. Single-Stage Scenario In this section, we derive single-stage surrogate losses for the deferral loss and prove their strong (H,R)-consistency bounds guarantees. To do so, we first prove that the following alternative expression holds for Ldef. Lemma 3.1. For any (h,r) H R and (x,y) X Y, the loss function Ldef can be expressed as follows: Ldef(h,r,x,y) = ne j=1 cj(x,y) 1r(x) 0 ne j=1 [L(h(x),y) + ne k=1 ck(x,y)1k j]1r(x) j (ne 1) L(h(x),y) + ne j=1 cj(x,y) . Let ℓ0 1 be the zero-one multi-class classification loss defined by ℓ0 1(r,x,y) = 1r(x) y for all r R and (x,y) X Y and let ℓ R X [ne] R+ be a surrogate loss for ℓ0 1 such that ℓ ℓ0 1. ℓmay be chosen to be the logistic loss, for example. Since the last term (ne 1) ne j=1 cj(x,y) in the expression of Ldef in Lemma 3.1 does not depend on h and r, the following loss function Lℓdefined for all (h,r) H R and (x,y) X Y by Lℓ(h,r,x,y) = ne j=1 cj(x,y) ℓ(r,x,0) (2) L(h(x),y) + ne j j cj (x,y) ℓ(r,x,j) (ne 1)L(h(x),y), is a natural single-stage surrogate loss for Ldef. We will show that when ℓadmits a strong R-consistency bound with respect to ℓ0 1, then Lℓadmits an (H,R)-consistency bound with respect to Ldef. Let us underscore the novelty of the surrogate loss formulation presented in equation (2) in the context of learning to defer with multiple experts. This formulation represents a substantial departure from the existing score-based approach prevalent in classification. As previously highlighted, the score-based formulation becomes inapplicable in regression. Our new predictor-rejector formulation not only overcomes this limitation, but also provides the foundation for the design of new deferral algorithms for classification. We say that a hypothesis set R is regular if for any x X, the predictions made by the hypotheses in R cover the complete set of possible classification labels: {r(x) r R} = {0,1,...,ne}. Widely used hypothesis sets such as linear hypotheses, neural networks, and of course the family of all measurable functions are all regular. Regression with Multi-Expert Deferral Recent studies by Awasthi et al. (2022b) and Mao et al. (2023f) demonstrate that common multi-class surrogate losses, such as constrained losses and comp-sum losses (including the logistic loss), admit strong R-consistency bounds with respect to the multi-class zero-one loss ℓ0 1, when using such regular hypothesis sets. 
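As an illustration of the single-stage surrogate in Eq. (2), here is a short PyTorch-style sketch with $\ell$ chosen as the base-2 logistic loss (so that $\ell \geq \ell_{0\text{-}1}$) and $L$ the squared loss; the interface and variable names are assumptions made for this sketch, not the authors' implementation.

```python
import math
import torch

def single_stage_surrogate(h_pred, r_scores, expert_preds, y, alphas):
    """Single-stage surrogate L_ell of Eq. (2), with ell the base-2 logistic loss
    (which upper bounds the multi-class zero-one loss) and L the squared loss.

    h_pred:       (n,) outputs of the predictor h (trained jointly with r)
    r_scores:     (n, n_e + 1) deferral scores r(x, j), j = 0 (no deferral), 1, ..., n_e
    expert_preds: (n, n_e) fixed expert predictions g_j(x)
    y:            (n,) labels
    alphas:       (n_e,) base costs alpha_j
    """
    n_e = expert_preds.shape[1]
    L_h = (h_pred - y) ** 2                                    # L(h(x), y)
    costs = alphas + (expert_preds - y.unsqueeze(1)) ** 2      # c_j(x, y) = alpha_j + L(g_j(x), y)
    ell = -torch.log_softmax(r_scores, dim=1) / math.log(2.0)  # ell(r, x, j), base-2 logistic
    coeff0 = costs.sum(dim=1)                                  # sum_j c_j(x, y)
    coeff = L_h.unsqueeze(1) + costs.sum(dim=1, keepdim=True) - costs
    # Eq. (2): coefficient-weighted surrogate terms minus the (n_e - 1) L(h(x), y) correction.
    loss = coeff0 * ell[:, 0] + (coeff * ell[:, 1:]).sum(dim=1) - (n_e - 1) * L_h
    return loss.mean()
```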
The next result shows that, for multi-class loss functions ℓ, their corresponding deferral surrogate losses Lℓ(Eq. (2)) also exhibit (H,R)-consistency bounds with respect to the deferral loss (Eq. (1)). Theorem 3.2. Let R be a regular hypothesis set and ℓ a surrogate loss for the multi-class loss function ℓ0 1 upper-bounding ℓ0 1. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following R-consistency bound holds for all r R and any distribution, Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) Γ(Eℓ(r) E ℓ(R) + Mℓ(R)). Then, the following (H,R)-consistency bound holds for all h H, r R and any distribution, ELdef(h,r) E Ldef(H,R) + MLdef(H,R) Γ(ELℓ(h,r) E Lℓ(H,R) + MLℓ(H,R)), where Γ(t) = max{t,(ne(l + ne j=1 cj)) 1 αβ tα}. The proof is given in Appendix B. As already mentioned, when the best-in-class error coincides with the Bayes error E L(H,R) = E L(Hall,Rall) for L = Lℓand L = Ldef, the minimizability gaps MLℓ(H,R) and MLdef(H,R) vanish. In such cases, the (H,R)-consistency bound guarantees that when the surrogate estimation error ELℓ(h,r) E Lℓ(H,R) is optimized up to ϵ, the estimation error of the deferral loss ELdef(h,r) E Ldef(H,R) is upper bounded by Γ(ϵ). In particular, when both H and R include all measurable functions, all the minimizability gap terms in Theorem 3.2 vanish, which yields the following result. Corollary 3.3. Given a multi-class loss function ℓ ℓ0 1. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following excess error bound holds for all r Rall and any distribution, Eℓ0 1(r) E ℓ0 1(Rall) Γ(Eℓ(r) E ℓ(Rall)). Then, the following excess error bound holds for all h Hall, r Rall and any distribution, ELdef(h,r) E Ldef(Hall,Rall) Γ(ELℓ(h,r) E Lℓ(Hall,Rall)), where Γ(t) = max{t,(ne(l + ne j=1 cj)) 1 αβ tα}. In this case, as shown by Mao et al. (2023f), Γ(t) can be expressed as 2t for the logistic loss ℓlog (r,x,y) log2( ne j=0 er(x,j) r(x,y)). Then, by Corollary 3.3, we further obtain the following corollary. Corollary 3.4. For any h Hall, r Rall and distribution, ELdef(h,r) E Ldef(Hall,Rall) Γ(ELℓlog (h,r) E Lℓlog (Hall,Rall)), where Γ(t) = max{t, 2ne(l + ne j=1 cj) 1 2 t 1 2 }. By taking the limit on both sides, we derive the Bayesconsistency of these single-stage surrogate losses Lℓwith respect to the deferral loss Ldef. More generally, Corollary 3.3 shows that Lℓadmits an excess error bound with respect to Ldef when ℓadmits an excess error bound with respect to ℓ0 1. 4. Two-Stage Scenario In the single-stage scenario, we introduced a family of surrogate losses and resulting algorithms for effectively learning the pair (h,r). However, practical applications often encounter a two-stage scenario, where deferral decisions are based on a fixed, pre-trained predictor h. Retraining this predictor is often prohibitively expensive or time-consuming. Thus, this two-stage scenario (Mao et al., 2023a) requires a different approach to optimize deferral decisions controlled by r, while using the existing predictor h. In this section, we will introduce a principled two-stage algorithm for regression with deferral, with favorable consistency guarantees. Remarkably, we show that the singlestage approach can be adapted for the two-stage scenario if we fix the predictor h and disregard constant terms. Let h be a predictor learned by minimizing a regression loss L in a first stage. 
A deferral function r is then learned based on that predictor h and the following loss function Lh ℓin the second stage: for any r R, x X and y Y, Lh ℓ(r,x,y) = ne j=1 cj(x,y) ℓ(r,x,0) (3) L(h(x),y) + ne j j cj (x,y) ℓ(r,x,j), where ℓis a surrogate loss in the standard multi-class classification. Equation (3) resembles (2), except for the constant term (ne 1)L(h(x),y). In (3), the predictor h remains fixed while only the deferral function r is optimized. In (2), both h and r are learned jointly. Similarly, we define Lh def as the deferral loss (1) with a fixed Regression with Multi-Expert Deferral predictor h as follows: Lh def(r,x,y) = L(h(x),y)1r(x)=0+ ne j=1 cj(x,y)1r(x)=j. (4) Here too, h is fixed in (4). Both Lh ℓand Lh def are loss functions defined for deferral function r, while ℓℓand ℓdef are loss functions defined for pairs (h,r) (H,R). As with the proposed single-stage approach, the two-stage surrogate losses Lh ℓbenefit from strong consistency guarantees. We show that in the second stage where a predictor h is fixed, the surrogate loss function Lh ℓbenefits from Rconsistency bounds with respect to Lh def when ℓadmits a strong R-consistency bound with respect to the binary zeroone loss ℓ0 1. Theorem 4.1. Given a hypothesis set R, a multi-class loss function ℓ ℓ0 1 and a predictor h. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following R-consistency bound holds for all r R and any distribution, Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) Γ(Eℓ(r) E ℓ(R) + Mℓ(R)). Then, the following R-consistency bound holds for all r R and any distribution, ELh def(r) E Lh def(R) + MLh def(R) Γ(ELh ℓ(r) E Lh ℓ(R) + MLh ℓ(R)), where Γ(t) = (ne(l + ne j=1 cj)) 1 αβ tα. The proof is given in Appendix C. When the best-in-class error coincides with the Bayes error, E L(R) = E L(Rall) for L = Lh ℓand L = Lh def, the minimizability gaps MLh ℓ(R) and MLh def(R) vanish. In that case, the R-consistency bound guarantees that when the surrogate estimation error ELh ℓ(r) E Lh ℓ(R) is optimized up to ϵ, the target estimation error ELh def(r) E Lh def(R) is upper bounded by Γ(ϵ). In the special case where H and R are the family of all measurable functions, all the minimizability gap terms in Theorem 4.1 vanish. Thus, we obtain the following corollary. Corollary 4.2. Given a multi-class loss function ℓ ℓ0 1 and a predictor h. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following excess error bound holds for all r R and any distribution, Eℓ0 1(r) E ℓ0 1(Rall) Γ(Eℓ(r) E ℓ(Rall)). Then, the following excess error bound holds for all r Rall and any distribution, ELh def(r) E Lh def(Rall)) Γ(ELh ℓ(r) E Lh ℓ(Rall)), (5) where Γ(t) = (ne(l + ne j=1 cj)) 1 αβ tα. Corollary 4.2 shows that Lh ℓadmits an excess error bound with respect to Lh def when ℓadmits an excess error bound with respect to ℓ0 1. We now establish (H,R)-consistency bounds the entire two-stage approach with respect to the deferral loss function Ldef. This result applies to any multi-class loss function ℓ that satisfies a strong R-consistency bound with respect to the multi-class zero-one loss ℓ0 1. Theorem 4.3. Given a hypothesis set H, a regular hypothesis set R and a multi-class loss function ℓ ℓ0 1. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following R-consistency bound holds for all r R and any distribution, Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) Γ(Eℓ(r) E ℓ(R) + Mℓ(R)). 
Then, the following (H,R)-consistency bound holds for all h H, r R and any distribution, ELdef(h,r) E Ldef(H,R) + MLdef(H,R) EL(h) EL(H) + ML(H) + Γ(ELh ℓ(r) E Lh ℓ(R) + MLh ℓ(R)), where Γ(t) = (ne(l + ne j=1 cj)) 1 αβ tα. The proof is presented in Appendix D. As before, when H and R are the family of all measurable functions, all the minimizability gap terms in Theorem 4.3 vanish. In particular, Γ(t) can be expressed as 2t for the logistic loss. Thus, we obtain the following on excess error bounds. Corollary 4.4. Given a multi-class loss function ℓ ℓ0 1. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following excess error bound holds for all r Rall and any distribution, Eℓ0 1(r) E ℓ0 1(Rall) Γ(Eℓ(r) E ℓ(Rall)). Then, the following excess error bound holds for all h Hall, r Rall and any distribution, ELdef(h,r) E Ldef(Hall,Rall) EL(h) EL(Hall) + Γ(ELh ℓ(r) E Lh ℓ(Rall)), (7) where Γ(t) = (ne(l + ne j=1 cj)) 1 αβ tα. In particular, Γ(t) = 2ne(l + ne j=1 cj) 1 2 t 1 2 for ℓ= ℓlog. Corollary 4.4 shows that our two-stage approach admits an excess error bound with respect to Ldef when ℓadmits an excess error bound with respect to ℓ0 1. More generally, Regression with Multi-Expert Deferral when the minimizability gaps are zero, as when the bestin-class errors coincide with the Bayes errors, the (H,R)- consistency bound of Theorem 4.3 guarantees that the target estimation error, ELdef(h,r) E Ldef(H,R), is upper bounded by ϵ1 + Γ(ϵ2) provided that the surrogate estimation error in the first stage, EL(h) EL(H), is reduced to ϵ1 and the surrogate estimation error in the second stage, ELh ℓ(r) E Lh ℓ(R), reduced to ϵ2. Significance and novelty. The challenges in dealing with multiple experts in the theoretical analysis of learning to deferral in regression arise first from the need to formulate new surrogate losses that cannot be directly extended from previous work. Furthermore, proving theoretical guarantees requires analyzing the conditional regret for both the surrogate and target deferral loss, which becomes more complex with multiple experts. The novelty and significance of our work are rooted in these new surrogate losses and algorithmic solutions, which come with strong theoretical guarantees specifically tailored for this context. These enhancements are non-trivial and represent a substantial extension beyond the existing framework of regression with abstention, which is limited in scope to a single expert, squared loss, and label-independent cost. 5. Special Case of a Single Expert In the special case of a single expert, ne = 1, both the singlestage surrogate loss Lℓand the two-stage surrogate loss Lh ℓ can be simplified as follows: c(x,y)ℓ(r,x,0) + L(h(x),y)ℓ(r,x,1). Let ℓ(r,x,0) = Φ(r(x)) and ℓ(r,x,1) = Φ( r(x)), where Φ R R+ is a non-increasing auxiliary function upper bounding the indicator u 1u 0. Here, r X R is a function whose sign determines if there is deferral, that is r(x) 0: ℓdef(h,r,x,y) = L(h(x),y)1r(x)>0 + c(x,y)1r(x) 0. As an example, Φ can be the auxiliary function that defines a margin-based loss in the binary classification. Thus, both the single-stage surrogate loss ℓΦ and the two-stage surrogate loss ℓh Φ can be reformulated as follows: c(x,y)Φ(r(x)) + L(h(x),y)Φ( r(x)). (8) Some common examples of Φ are listed in Table 2 in Appendix E. Note that (8) is a straightforward extension of the two-stage formulation given in (Mao et al., 2024b, Eq. (5)). 
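For illustration, a minimal PyTorch sketch of the single-expert surrogate (8) is given below, instantiating $\Phi$ with the base-2 logistic margin loss, a standard margin-based choice, and $L$ with the squared loss; the cost form $c(x,y) = \alpha + L(g(x),y)$ and all names are assumptions made for this sketch rather than the authors' code.

```python
import math
import torch

def single_expert_surrogate(h_pred, r_score, expert_pred, y, alpha=0.0):
    """Single-expert surrogate of Eq. (8): c(x, y) Phi(r(x)) + L(h(x), y) Phi(-r(x)),
    with Phi(u) = log2(1 + exp(-u)) (which upper bounds 1_{u <= 0}) and L the squared loss.
    The sign of r(x) decides the action: predict with h if r(x) > 0, defer otherwise.

    h_pred:      (n,) predictor outputs h(x)
    r_score:     (n,) real-valued deferral scores r(x)
    expert_pred: (n,) predictions of the single expert g(x)
    y:           (n,) labels
    alpha:       scalar base cost; alpha = 0 recovers a cost equal to the expert's loss
    """
    phi = lambda u: torch.nn.functional.softplus(-u) / math.log(2.0)  # log2(1 + e^{-u})
    L_h = (h_pred - y) ** 2                  # L(h(x), y)
    cost = alpha + (expert_pred - y) ** 2    # c(x, y) = alpha + L(g(x), y)
    return (cost * phi(r_score) + L_h * phi(-r_score)).mean()
```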
In their formulation, the zero-one loss function replaces the regression loss and is tailored for the classification context. A special case of the straightforward extension (8) is one where the cost does not depend on the label y and the squared loss is considered. This coincides with the loss function (Cheng et al., 2023, Eq. (10)) in the context of regression with abstention. It is important to note that incorporating y as argument of the cost functions is crucial in the more general deferral setting, as each cost takes into account the accuracy of the corresponding expert. Let the binary zero-one loss be ℓbi 0 1(r,x,y) = 1sign(r(x)) y, where sign(α) = 1α>0 1α 0. We say that a hypothesis set R consists of functions mapping from X to R is regular, if {sign(r(x)) r R} = {+1, 1} for any x X. Then, Theorems 3.2 and 4.3 can be reduced to Theorems 5.1 and 5.3 below respectively. We present these guarantees and their corresponding corollaries in the following sections. 5.1. Single-Stage Guarantees Here, we present guarantees for the single-stage surrogate. Theorem 5.1. Given a hypothesis set H, a regular hypothesis set R and a margin-based loss function Φ. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following R-consistency bound holds for all r R and any distribution, Eℓbi 0 1(r) E ℓbi 0 1(R) + Mℓbi 0 1(R) Γ(EΦ(r) E Φ(R) + MΦ(R)). Then, the following (H,R)-consistency bound holds for all h H, r R and any distribution, Eℓdef(h,r) E ℓdef(H,R) + Mℓdef(H,R) Γ(EℓΦ(h,r) E ℓΦ(H,R) + MℓΦ(H,R)), where Γ(t) = max{t,(l + c) 1 αβ tα}. In particular, when H and R are the family of all measurable functions, all the minimizability gap terms in Theorem 5.1 vanish. In this case, as shown by Awasthi et al. (2022a), Γ(t) can be expressed as t2 2 for exponential and logistic loss, t2 for quadratic loss and t for hinge, sigmoid and ρ-margin losses. Thus, the following result holds. Corollary 5.2. Given a margin-based loss function Φ. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following excess error bound holds for all r Rall and any distribution, Eℓbi 0 1(r) E ℓbi 0 1(Rall) Γ(EΦ(r) E Φ(Rall)). Then, the following excess error bound holds for all h Hall, r Rall and any distribution, Eℓdef(h,r) E ℓdef(Hall,Rall) Γ(EℓΦ(h,r) E ℓΦ(Hall,Rall)), where Γ(t) = max{t,(l + c) 1 αβ tα}. In particular, Γ(t) = max{t, 1 2(l + c) 1 2 t 1 2 } for Φ = Φexp and Φlog, Regression with Multi-Expert Deferral Table 1. System MSE of deferral with multiple experts: mean standard deviation over three runs. Dataset Base cost Method Base model Single expert Two experts Three experts Single 18.98 2.44 13.16 0.93 8.53 1.57 Two 23.35 1.90 18.64 1.96 13.33 0.92 8.81 1.56 Single 18.83 2.14 13.79 0.75 8.64 1.40 Two 23.35 1.90 19.15 1.99 15.12 0.62 10.06 1.54 Single 14.85 5.40 14.75 3.53 12.43 2.03 Two 22.72 7.68 16.26 5.58 14.82 3.60 12.02 1.97 Single 15.17 5.18 15.07 2.88 14.80 3.48 Two 22.72 7.68 16.24 4.64 15.62 3.04 14.87 4.04 Single 104.38 5.55 41.08 2.05 37.83 2.60 Two 120.20 8.09 114.73 6.50 44.46 5.34 36.75 1.76 Single 105.01 5.40 39.52 2.81 38.46 1.79 Two 120.20 8.09 114.11 5.34 39.93 2.77 37.51 2.32 Γ(t) = max{t,(l + c) 1 2 t 1 2 } for Φ = Φquad, and Γ(t) = t for Φ = Φhinge, Φsig, and Φρ. By taking the limit on both sides, we derive the Bayesconsistency and excess error bound of these single-stage surrogate losses ℓΦ with respect to the deferral loss ℓdef. 
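To illustrate numerically how such a guarantee converts a surrogate estimation gap into a deferral estimation gap, here is a small Python sketch of $\bar{\Gamma}(t) = \max\{t, (\bar{\ell} + \bar{c})^{1-\alpha}\beta\, t^{\alpha}\}$ from Corollary 5.2; the logistic-loss instantiation $(\alpha,\beta) = (1/2, \sqrt{2})$ is our reading of the corollary, and the bound constants $\bar{\ell}$, $\bar{c}$ are illustrative numbers only.

```python
import math

def gamma_bar(t, ell_bar, c_bar, alpha, beta):
    """Gamma_bar(t) = max{t, (ell_bar + c_bar)^(1 - alpha) * beta * t^alpha} (Corollary 5.2),
    where ell_bar bounds the regression loss and c_bar bounds the cost."""
    return max(t, (ell_bar + c_bar) ** (1.0 - alpha) * beta * t ** alpha)

# Illustrative numbers: regression loss bounded by 25, cost bounded by 30, and a surrogate
# estimation gap of 1e-3 (with vanishing minimizability gaps).
# Hinge-type surrogate, alpha = beta = 1: the deferral gap is bounded linearly.
print(gamma_bar(1e-3, ell_bar=25.0, c_bar=30.0, alpha=1.0, beta=1.0))
# Logistic surrogate, read as alpha = 1/2, beta = sqrt(2): square-root dependence.
print(gamma_bar(1e-3, ell_bar=25.0, c_bar=30.0, alpha=0.5, beta=math.sqrt(2.0)))
```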
More generally, Corollary 5.2 shows that ℓΦ admits an excess error bound with respect to ℓdef when Φ admits an excess error bound with respect to ℓ0 1. Corollary 5.2 also include the theoretical guarantees in (Cheng et al., 2023, Theorems 7 and 8) as a special case where the cost does not depend on the label y and the squared loss is considered. 5.2. Two-Stage Guarantees Here, we present guarantees for the two-stage surrogate. Theorem 5.3. Given a hypothesis set H, a regular hypothesis set R and a margin-based loss function Φ. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following R-consistency bound holds for all r R and any distribution, Eℓbi 0 1(r) E ℓbi 0 1(R) + Mℓbi 0 1(R) Γ(EΦ(r) E Φ(R) + MΦ(R)). Then, the following (H,R)-consistency bound holds for all h H, r R and any distribution, Eℓdef(h,r) E ℓdef(H,R) + Mℓdef(H,R) EL(h) EL(H) + ML(H) + Γ(Eℓh Φ(r) E ℓh Φ(R) + Mℓh Φ(R)), where Γ(t) = (l + c) 1 αβ tα. As before, when H and R include all measurable functions, all the minimizability gap terms in Theorem 4.3 vanish. In particular, Γ(t) can be expressed as t2 2 for exponential and logistic loss, t2 for quadratic loss and t for hinge, sigmoid and ρ-margin losses (Awasthi et al., 2022a). Thus, we obtain the following result. Corollary 5.4. Given a margin-based loss function Φ. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following excess error bound holds for all r Rall and any distribution, Eℓbi 0 1(r) E ℓbi 0 1(Rall) Γ(EΦ(r) E Φ(Rall)). Then, the following excess error bound holds for all h Hall, r Rall and any distribution, Eℓdef(h,r) E ℓdef(Hall,Rall) EL(h) EL(Hall) + Γ(Eℓh Φ(r) E ℓh Φ(Rall)), (9) where Γ(t) = (l + c) 1 αβ tα. In particular, Γ(t) = 1 2(l + c) 1 2 t 1 2 for Φ = Φexp and Φlog, Γ(t) = (l + c) 1 2 t 1 2 for Φ = Φquad, and Γ(t) = t for Φ = Φhinge, Φsig, and Φρ. Corollary 5.4 shows that the proposed two-stage approach admits an excess error bound with respect to ℓdef when Φ admits an excess error bound with respect to ℓ0 1. More generally, in the cases where the minimizability gaps are zero (as when the best-in-class errors coincide with the Bayes errors), the (H,R)-consistency bound in Theorem 5.3 guarantees that when the surrogate estimation error in the first stage EL(h) EL(H) is minimized up to ϵ1 and the surrogate estimation error in the second stage Eℓh Φ(r) E ℓh Φ(R) is minimized up to ϵ2, the target estimation error Eℓdef(h,r) E ℓdef(H,R) is upper bounded by ϵ1 + Γ(ϵ2). It is worth noting that while our theoretical results are general, they can be effectively applied to derive bounds for specific loss functions. Applicable regression loss functions are those with a boundedness assumption, and any classification loss function that benefits from H-consistency bounds is suitable. Our theoretical analysis guides the selection of the loss function by considering several key factors: the functional form Γ of the bound, the approximation properties indicated by the minimizability gaps, the optimization Regression with Multi-Expert Deferral advantages of each loss function, and how favorably the bounds depend on the number of experts. 6. Experiments In this section, we report the empirical results for our singlestage and two-stage algorithms for regression with deferral on three datasets from the UCI machine learning repository (Asuncion & Newman, 2007), the Airfoil, Housing and Concrete, which have also been studied in (Cheng et al., 2023). Setup and Metrics. 
For each dataset, we randomly split it into a training set of 60% examples, a validation set of 20% examples and a test set of 20% examples. We report results averaged over three such random splits. We adopted linear models for both the predictor h and the deferral function r. We considered three experts g1, g2 and g3, each trained by feedforward neural networks with Re LU activation functions (Nair & Hinton, 2010) with one, two, and three hidden layers, respectively. We used the Adam optimizer (Kingma & Ba, 2014) with a batch size of 256 and 2,000 training epochs. The learning rate for all datasets is selected from {0.01,0.05,0.1}. We adopted the squared loss as the regression loss (L = L2). For our single-stage surrogate loss (2) and two-stage surrogate loss (3), we choose ℓ= ℓlog as the logistic loss. In the experiments, we considered two types of costs: cj(x,y) = L(gj(x),y) and cj(x,y) = L(gj(x),y)+αj, for 1 j ne. In the first case, the cost corresponds exactly to the expert s squared error. In the second case, the constant αj is the base cost for deferring to expert gj. We chose (α1,α2,α3) = (4.0,8.0,12.0). We selected those values of α because they are empirically determined to encourage an optimal balance, ensuring a reasonable number of input instances are deferred to each expert. For evaluation, we compute the system mean squared error (MSE), that is the average squared difference between the target value and the prediction made by the predictor h or the expert selected by the deferral function r. We also report the empirical regression loss, 1 n n i=1 L(h(xi),yi), of the base model used in the two-stage algorithm. Results. In Table 1, we report the mean and standard deviation of the empirical regression loss of the base model, as well as the System MSE obtained by using a single expert g1, two experts g1 and g2 and three experts g1, g2 and g3, over three random splits of the dataset. Here, the base model is the predictor. We did not report its performance in the single-stage method because it is independently trained and exhibits varying accuracies across settings with single expert, two experts, and three experts, in contrast to the two-stage method where the base model is pre-learned and fixed. Additionally, in the single-stage method, the base model is always used in conjunction with deferral, rather than being used separately. Table 1 shows that the performance of both our single-stage and two-stage algorithms improves as more experts are taken into account across the Airfoil, Housing and Concrete datasets. In particular, our algorithms are able to effectively defer difficult test instances to more suitable experts and outperform the base model. Table 1 also shows that the two-stage algorithm usually does not perform as well as the single-stage one in the regression setting, mainly due to the error accumulation in the two-stage process, particularly if the first-stage predictor has large errors. However, the two-stage algorithm is still useful when an existing predictor cannot be retrained due to cost or time constraints. In such cases, we can still improve its performance by learning a multi-expert deferral with our two-stage surrogate losses. Note that since our work is the first to study regression with multi-expert deferral, there are no existing baselines for direct comparison. Nevertheless, for completeness, we include additional experimental results with three simple baselines in Appendix F, further demonstrating the effectiveness of our approach. 
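For concreteness, the following compact PyTorch sketch mirrors the two-stage setup described above: a linear deferral model $r$, the logistic surrogate $\ell = \ell_{\mathrm{log}}$, the squared loss, costs $c_j(x,y) = L(g_j(x),y) + \alpha_j$ with $(\alpha_1,\alpha_2,\alpha_3) = (4.0, 8.0, 12.0)$, and the Adam optimizer. It uses full-batch updates and random stand-in data for brevity; all function names, shapes, and the stand-in data are assumptions of this sketch rather than the authors' code.

```python
import math
import torch

def two_stage_surrogate(r_scores, h_pred, expert_preds, y, alphas):
    """Two-stage surrogate L^h_ell of Eq. (3) with ell the base-2 logistic loss:
    (sum_j c_j) * ell(r, x, 0) + sum_j (L(h(x), y) + sum_{j' != j} c_{j'}) * ell(r, x, j)."""
    L_h = (h_pred - y) ** 2                                    # fixed predictor's loss
    costs = alphas + (expert_preds - y.unsqueeze(1)) ** 2      # c_j(x, y), shape (n, n_e)
    ell = -torch.log_softmax(r_scores, dim=1) / math.log(2.0)  # ell(r, x, j)
    coeff0 = costs.sum(dim=1)
    coeff = L_h.unsqueeze(1) + costs.sum(dim=1, keepdim=True) - costs
    return (coeff0 * ell[:, 0] + (coeff * ell[:, 1:]).sum(dim=1)).mean()

def system_mse(r_scores, h_pred, expert_preds, y):
    """System MSE: squared error of whichever prediction the deferral rule selects."""
    decision = r_scores.argmax(dim=1)                          # 0 = use h, j = defer to g_j
    all_preds = torch.cat([h_pred.unsqueeze(1), expert_preds], dim=1)  # (n, n_e + 1)
    used = all_preds.gather(1, decision.unsqueeze(1)).squeeze(1)
    return ((used - y) ** 2).mean()

def train_deferral(X, y, h_pred, expert_preds, alphas, epochs=2000, lr=0.01):
    """Second stage: fit a linear deferral model r(x, .) over {0, 1, ..., n_e} with Adam,
    keeping the first-stage predictor h and the experts fixed."""
    n_e = expert_preds.shape[1]
    r_model = torch.nn.Linear(X.shape[1], n_e + 1)             # linear deferral function
    opt = torch.optim.Adam(r_model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = two_stage_surrogate(r_model(X), h_pred, expert_preds, y, alphas)
        loss.backward()
        opt.step()
    return r_model

# Example usage with random stand-in data (in the paper: UCI datasets, a linear h, and
# experts g_1, g_2, g_3 given by feedforward networks with 1, 2 and 3 hidden layers).
torch.manual_seed(0)
X = torch.randn(256, 5)
y = X.sum(dim=1) + 0.1 * torch.randn(256)
h_pred = torch.zeros(256)                                      # stand-in frozen predictor
expert_preds = torch.stack([y + s * torch.randn(256) for s in (2.0, 1.0, 0.5)], dim=1)
alphas = torch.tensor([4.0, 8.0, 12.0])
r_model = train_deferral(X, y, h_pred, expert_preds, alphas)
print(float(system_mse(r_model(X), h_pred, expert_preds, y)))
```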
In real-life scenarios, cancellation effects might occur when using experts with similar expertise (Verma et al., 2023). However, the experts we have used do not exhibit such an effect. This is because, for each expert, there are specific input instances that they can predict correctly, which others cannot. Therefore, without base costs, the system s error after using deferral is lower than that of any individual expert. Furthermore, in our scenario, we operate under the assumption that the experts are predefined. Our H-consistency guarantees demonstrate that in both singleand two-stage scenarios, given a sufficient amount of data, our algorithms can approximate results close to the optimal deferral loss for the given experts. Our analysis and experiments do not directly address the process of selecting experts beforehand. Optimally selecting diverse and accurate experts is an interesting research question. 7. Conclusion We introduced a novel and principled framework for regression with deferral, enhancing the accuracy and reliability of regression predictions through the strategic use of multiple experts. Our comprehensive analysis of this framework includes the formulation of novel surrogate losses for both single-stage and two-stage scenarios, and the proof of strong H-consistency bounds. These theoretical guarantees lead to powerful algorithms that leverage multiple experts, providing a powerful tool for addressing the inherent challenges of regression problems. Empirical results validate the effectiveness of our approach, showcasing its practical significance and opening up new avenues for developing robust solutions across diverse regression tasks. Regression with Multi-Expert Deferral Acknowledgements We thank the reviewers for suggesting useful baselines for our experiments and for bringing to our attention the literature on dynamic classifier selection, which, while distinct from the learning-to-defer framework studied in this paper, has somewhat relevant connections. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Asuncion, A. and Newman, D. Uci machine learning repository, 2007. Awasthi, P., Frank, N., Mao, A., Mohri, M., and Zhong, Y. Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems, pp. 9804 9815, 2021a. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. A finer calibration analysis for adversarial robustness. ar Xiv preprint ar Xiv:2105.01550, 2021b. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Hconsistency bounds for surrogate loss minimizers. In International Conference on Machine Learning, pp. 1117 1174, 2022a. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Multi-class H-consistency bounds. In Advances in neural information processing systems, pp. 782 795, 2022b. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for adversarial robustness. In International Conference on Artificial Intelligence and Statistics, pp. 10077 10094, 2023a. Awasthi, P., Mao, A., Mohri, M., and Zhong, Y. DCprogramming for neural network optimizations. Journal of Global Optimization, 2023b. Bartlett, P. L. and Wegkamp, M. H. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8), 2008. Bartlett, P. L., Jordan, M. I., and Mc Auliffe, J. D. 
Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Benz, N. L. C. and Rodriguez, M. G. Counterfactual inference of second opinions. In Uncertainty in Artificial Intelligence, pp. 453 463. PMLR, 2022. Berkson, J. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357 -365, 1944. Berkson, J. Why I prefer logits to probits. Biometrics, 7(4): 327 -339, 1951. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. ar Xiv preprint ar Xiv:2303.12712, 2023. Cao, Y., Cai, T., Feng, L., Gu, L., Gu, J., An, B., Niu, G., and Sugiyama, M. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. In Advances in neural information processing systems, 2022. Charoenphakdee, N., Cui, Z., Zhang, Y., and Sugiyama, M. Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, pp. 1507 1517, 2021. Cheng, X., Cao, Y., Wang, H., Wei, H., An, B., and Feng, L. Regression with cost-based rejection. In Advances in Neural Information Processing Systems, 2023. Chow, C. An optimum character recognition system using decision function. IEEE T. C., 1957. Chow, C. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41 46, 1970. Cortes, C. and Vapnik, V. Support-vector networks. Machine learning, 20:273 297, 1995. Cortes, C., De Salvo, G., and Mohri, M. Boosting with abstention. In Advances in Neural Information Processing Systems, pp. 1660 1668, 2016a. Cortes, C., De Salvo, G., and Mohri, M. Learning with rejection. In International Conference on Algorithmic Learning Theory, pp. 67 82, 2016b. Cortes, C., De Salvo, G., and Mohri, M. Theory and algorithms for learning with rejection in binary classification. Annals of Mathematics and Artificial Intelligence, 2023. Cruz, R. M., Sabourin, R., and Cavalcanti, G. D. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41:195 216, 2018. De, A., Koley, P., Ganguly, N., and Gomez-Rodriguez, M. Regression under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2611 2620, 2020. Regression with Multi-Expert Deferral Ekanayake, S. P., Zois, D.-S., and Chelmis, C. Sequential datum wise feature acquisition and classifier selection. IEEE Transactions on Artificial Intelligence, 2023. El-Yaniv, R. and Wiener, Y. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(2), 2012. El-Yaniv, R. et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11 (5), 2010. Geifman, Y. and El-Yaniv, R. Selective classification for deep neural networks. In Advances in neural information processing systems, 2017. Geifman, Y. and El-Yaniv, R. Selectivenet: A deep neural network with an integrated reject option. In International conference on machine learning, pp. 2151 2159, 2019. Hemmer, P., Schellhammer, S., V ossing, M., Jakubik, J., and Satzger, G. Forming effective human-ai teams: Building machine learning models that complement the capabilities of multiple experts. ar Xiv preprint ar Xiv:2206.07948, 2022. Jiang, W., Zhao, Y., and Wang, Z. Risk-controlled selective prediction for regression deep neural network models. 
In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1 8, 2020. Kerrigan, G., Smyth, P., and Steyvers, M. Combining human predictions with model probabilities via confusion matrices and calibration. Advances in Neural Information Processing Systems, 34:4421 4434, 2021. Keswani, V., Lease, M., and Kenthapadi, K. Towards unbiased and accurate deferral to multiple experts. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 154 165, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Ko, A. H., Sabourin, R., and Britto Jr, A. S. From dynamic classifier selection to dynamic ensemble selection. Pattern recognition, 41(5):1718 1731, 2008. Li, X., Liu, S., Sun, C., and Wang, H. When no-rejection learning is optimal for regression with rejection. ar Xiv preprint ar Xiv:2307.02932, 2023. Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Two-stage learning to defer with multiple experts. In Advances in Neural Information Processing Systems, 2023a. Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds: Characterization and extensions. In Advances in Neural Information Processing Systems, 2023b. Mao, A., Mohri, M., and Zhong, Y. H-consistency bounds for pairwise misranking loss surrogates. In International conference on Machine learning, 2023c. Mao, A., Mohri, M., and Zhong, Y. Ranking with abstention. In ICML 2023 Workshop The Many Facets of Preference Based Learning, 2023d. Mao, A., Mohri, M., and Zhong, Y. Structured prediction with stronger consistency guarantees. In Advances in Neural Information Processing Systems, 2023e. Mao, A., Mohri, M., and Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, 2023f. Mao, A., Mohri, M., and Zhong, Y. Principled approaches for learning to defer with multiple experts. In International Symposium on Artificial Intelligence and Mathematics, 2024a. Mao, A., Mohri, M., and Zhong, Y. Predictor-rejector multiclass abstention: Theoretical analysis and algorithms. In International Conference on Algorithmic Learning Theory, pp. 822 867, 2024b. Mao, A., Mohri, M., and Zhong, Y. Theoretically grounded loss functions and algorithms for score-based multi-class abstention. In International Conference on Artificial Intelligence and Statistics, pp. 4753 4761, 2024c. Mao, A., Mohri, M., and Zhong, Y. H-consistency guarantees for regression. ar Xiv preprint ar Xiv:2403.19480, 2024d. Mao, A., Mohri, M., and Zhong, Y. Top-k classification and cardinality-aware prediction. ar Xiv preprint ar Xiv:2403.19625, 2024e. Mao, A., Mohri, M., and Zhong, Y. A universal growth rate for learning with smooth surrogate losses. ar Xiv preprint ar Xiv:2405.05968, 2024f. Mohri, C., Andor, D., Choi, E., Collins, M., Mao, A., and Zhong, Y. Learning to reject with a fixed predictor: Application to decontextualization. In International Conference on Learning Representations, 2024. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018. Mozannar, H. and Sontag, D. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pp. 7076 7087, 2020. Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807 814, 2010. Regression with Multi-Expert Deferral Narasimhan, H., Jitkrittum, W., Menon, A. K., Rawat, A. S., and Kumar, S. 
Post-hoc estimators for learning to defer to an expert. In Advances in Neural Information Processing Systems, pp. 29292 29304, 2022. Ni, C., Charoenphakdee, N., Honda, J., and Sugiyama, M. On the calibration of multiclass classification with rejection. In Advances in Neural Information Processing Systems, pp. 2582 2592, 2019. Okati, N., De, A., and Rodriguez, M. Differentiable learning under triage. Advances in Neural Information Processing Systems, 34:9140 9151, 2021. Ramaswamy, H. G., Tewari, A., and Agarwal, S. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530 554, 2018. Shah, A., Bu, Y., Lee, J. K., Das, S., Panda, R., Sattigeri, P., and Wornell, G. W. Selective regression under fairness criteria. In International Conference on Machine Learning, pp. 19598 19615, 2022. Steinwart, I. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225 287, 2007. Straitouri, E., Wang, L., Okati, N., and Rodriguez, M. G. Provably improving expert predictions with conformal prediction. ar Xiv preprint ar Xiv:2201.12006, 2022. Tewari, A. and Bartlett, P. L. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007. Verhulst, P. F. Notice sur la loi que la population suit dans son accroissement. Correspondance math ematique et physique, 10:113 -121, 1838. Verhulst, P. F. Recherches math ematiques sur la loi d accroissement de la population. Nouveaux M emoires de l Acad emie Royale des Sciences et Belles-Lettres de Bruxelles, 18:1 -42, 1845. Verma, R. and Nalisnick, E. Calibrated learning to defer with one-vs-all classifiers. In International Conference on Machine Learning, pp. 22184 22202, 2022. Verma, R., Barrej on, D., and Nalisnick, E. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 11415 11434, 2023. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. Co RR, abs/2206.07682, 2022. Wiener, Y. and El-Yaniv, R. Agnostic selective classification. In Advances in neural information processing systems, 2011. Wiener, Y. and El-Yaniv, R. Pointwise tracking the optimal regression function. Advances in Neural Information Processing Systems, 25, 2012. Wiener, Y. and El-Yaniv, R. Agnostic pointwise-competitive selective classification. Journal of Artificial Intelligence Research, 52:171 201, 2015. Yuan, M. and Wegkamp, M. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(1), 2010. Yuan, M. and Wegkamp, M. SVMs with a reject option. In Bernoulli, 2011. Zaoui, A., Denis, C., and Hebiri, M. Regression with reject option and application to knn. In Advances in Neural Information Processing Systems, pp. 20073 20082, 2020. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, 2004a. Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004b. Zheng, C., Wu, G., Bao, F., Cao, Y., Li, C., and Zhu, J. Revisiting discriminative vs. generative classifiers: Theory and implications. 
ar Xiv preprint ar Xiv:2302.02334, 2023. Regression with Multi-Expert Deferral Contents of Appendix A Useful lemmas 14 B Proof of Theorem 3.2 15 Case I: r(x) = 0 and c0(x) minne j=1 cj(x). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Case II: c0(x) > minne j=1 cj(x). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Case III: r(x) > 0 and c0(x) minne j=1 cj(x). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C Proof of Theorem 4.1 18 D Proof of Theorem 4.3 20 E Common margin-based losses and corresponding deferral surrogate losses 21 F Additional experiments 22 Regression with Multi-Expert Deferral A. Useful lemmas Lemma 3.1. For any (h,r) H R and (x,y) X Y, the loss function Ldef can be expressed as follows: Ldef(h,r,x,y) = ne j=1 cj(x,y) 1r(x) 0 + ne j=1 [L(h(x),y) + ne k=1 ck(x,y)1k j]1r(x) j (ne 1) L(h(x),y) + ne j=1 cj(x,y) . Proof. Observe that, for any x X, since r(x) = 0 if and only if r(x) j for all j 1, the following equality holds: 1r(x)=0 = 1 ne j=1{r(x) j} = ne j=1 1r(x) j (ne 1). Similarly, since r(x) = j if and only if r(x) k for k j and r(x) 0, the following equality holds: 1r(x)=j = 1r(x) 0 + ne k=1 1r(x) k1k j (ne 1). In view of these identities, starting from the definition of Ldef, we can write: Ldef(h,r,x,y) = L(h(x),y)1r(x)=0 + ne j=1 cj(x,y)1r(x)=j = L(h(x),y) ne j=1 1r(x) j (ne 1) + ne j=1 cj(x,y)[1r(x) 0 + ne k=1 1r(x) k1k j (ne 1)] ne j=1 cj(x,y) 1r(x) 0 + ne j=1 L(h(x),y)1r(x) j + ne k=1 cj(x,y)1k j1r(x) k (ne 1) L(h(x),y) + ne j=1 cj(x,y) ne j=1 cj(x,y) 1r(x) 0 + ne j=1 L(h(x),y)1r(x) j + ne j=1 ck(x,y)1k j1r(x) j (ne 1) L(h(x),y) + ne j=1 cj(x,y) (change of variables k and j) ne j=1 cj(x,y) 1r(x) 0 + ne j=1 [L(h(x),y) + ne k=1 ck(x,y)1k j]1r(x) j (ne 1) L(h(x),y) + ne j=1 cj(x,y) . This completes the proof. Lemma A.1. Assume that the following R-consistency bound holds for all r R and any distribution, Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) Γ(Eℓ(r) E ℓ(R) + Mℓ(R)). Then, for any p = (p0,...,pne) ne and x X, we have ne j=0 pj1r(x) j inf r R ne j=0 pj1r(x) j Γ ne j=0 pjℓ(r,x,j) inf r R ne j=0 pjℓ(r,x,j) . Proof. For any x X, consider a distribution δx that concentrates on that point. Let pj = P(y = j x), j [ne]. Then, by definition, Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) can be expressed as Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) = ne j=0 pj1r(x) j inf r R ne j=0 pj1r(x) j . Regression with Multi-Expert Deferral Similarly, Eℓ(r) E ℓ(R) + Mℓ(R) can be expressed as Eℓ(r) E ℓ(R) + Mℓ(R) = ne j=0 pjℓ(r,x,j) inf r R ne j=0 pjℓ(r,x,j) . Since the R-consistency bound holds by the assumption, we complete the proof. B. Proof of Theorem 3.2 Theorem 3.2. Let R be a regular hypothesis set and ℓa surrogate loss for the multi-class loss function ℓ0 1 upper-bounding ℓ0 1. Assume that there exists a function Γ(t) = β tα for some α (0,1] and β > 0, such that the following R-consistency bound holds for all r R and any distribution, Eℓ0 1(r) E ℓ0 1(R) + Mℓ0 1(R) Γ(Eℓ(r) E ℓ(R) + Mℓ(R)). Then, the following (H,R)-consistency bound holds for all h H, r R and any distribution, ELdef(h,r) E Ldef(H,R) + MLdef(H,R) Γ(ELℓ(h,r) E Lℓ(H,R) + MLℓ(H,R)), where Γ(t) = max{t,(ne(l + ne j=1 cj)) 1 αβ tα}. Proof. The conditional error of the deferral loss can be expressed as E y x[Ldef(h,r,x,y)] = E y x[L(h(x),y)]1r(x)=0 + ne j=1 E y x[cj(x,y)]1r(x)=j. (10) Let c0(x) = infh H Ey x[L(h(x),y)] and cj(x) = Ey x[c(x,y)]. Thus, the best-in class conditional error of the deferral loss can be expressed as inf h H,r R E y x[Ldef(h,r,x,y)] = min j [ne]cj(x). 
B. Proof of Theorem 3.2

Theorem 3.2. Let $\mathcal{R}$ be a regular hypothesis set and $\ell$ a surrogate loss for the multi-class loss function $\ell_{0-1}$ upper-bounding $\ell_{0-1}$. Assume that there exists a function $\Gamma(t) = \beta\,t^{\alpha}$ for some $\alpha \in (0,1]$ and $\beta > 0$ such that the following $\mathcal{R}$-consistency bound holds for all $r \in \mathcal{R}$ and any distribution:
\[
\mathcal{E}_{\ell_{0-1}}(r) - \mathcal{E}^*_{\ell_{0-1}}(\mathcal{R}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{R}) \leq \Gamma\big(\mathcal{E}_{\ell}(r) - \mathcal{E}^*_{\ell}(\mathcal{R}) + \mathcal{M}_{\ell}(\mathcal{R})\big).
\]
Then, the following $(\mathcal{H},\mathcal{R})$-consistency bound holds for all $h \in \mathcal{H}$, $r \in \mathcal{R}$ and any distribution:
\[
\mathcal{E}_{L_{\mathrm{def}}}(h,r) - \mathcal{E}^*_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) + \mathcal{M}_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) \leq \overline{\Gamma}\big(\mathcal{E}_{L_{\ell}}(h,r) - \mathcal{E}^*_{L_{\ell}}(\mathcal{H},\mathcal{R}) + \mathcal{M}_{L_{\ell}}(\mathcal{H},\mathcal{R})\big),
\]
where $\overline{\Gamma}(t) = \max\{t, (n_e(\overline{l} + \sum_{j=1}^{n_e}\overline{c}_j))^{1-\alpha}\beta\,t^{\alpha}\}$, with $L \leq \overline{l}$ and $c_j \leq \overline{c}_j$ for all $j \in [n_e]$.

Proof. The conditional error of the deferral loss can be expressed as
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = \mathbb{E}_{y|x}[L(h(x),y)]\,1_{r(x)=0} + \sum_{j=1}^{n_e} \mathbb{E}_{y|x}[c_j(x,y)]\,1_{r(x)=j}. \tag{10}
\]
Let $c_0(x) = \inf_{h\in\mathcal{H}} \mathbb{E}_{y|x}[L(h(x),y)]$ and $c_j(x) = \mathbb{E}_{y|x}[c_j(x,y)]$ for $j \in [n_e]$. Thus, the best-in-class conditional error of the deferral loss can be expressed as
\[
\inf_{h\in\mathcal{H},\,r\in\mathcal{R}} \mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = \min_{j\in\{0,1,\ldots,n_e\}} c_j(x). \tag{11}
\]
The conditional error of the surrogate loss can be expressed as
\[
\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] = \Big[\sum_{j=1}^{n_e} c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j'\neq j} c_{j'}(x)\Big]\ell(r,x,j) - (n_e-1)\,\mathbb{E}_{y|x}[L(h(x),y)]. \tag{12}
\]
Note that the coefficient of the term $\mathbb{E}_{y|x}[L(h(x),y)]$, namely $\sum_{j=1}^{n_e}\ell(r,x,j) - (n_e-1)$, is non-negative since $\ell \geq \ell_{0-1}$. The expression in (12) is therefore non-decreasing in $\mathbb{E}_{y|x}[L(h(x),y)]$, and the best-in-class conditional error of the surrogate loss can be expressed as
\[
\inf_{h\in\mathcal{H},\,r\in\mathcal{R}} \mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] = \inf_{r\in\mathcal{R}}\Big\{\Big[\sum_{j=1}^{n_e} c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[c_0(x) + \sum_{j'\neq j} c_{j'}(x)\Big]\ell(r,x,j) - (n_e-1)\,c_0(x)\Big\}. \tag{13}
\]
Next, we analyze three cases separately to show that the calibration gap of the surrogate loss can be lower bounded in terms of that of the deferral loss.

Case I: $r(x) = 0$ and $c_0(x) \leq \min_{j=1}^{n_e} c_j(x)$. In this case, by (10) and (11), the calibration gap of the deferral loss can be expressed as
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h\in\mathcal{H},\,r\in\mathcal{R}}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = \mathbb{E}_{y|x}[L(h(x),y)] - \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)].
\]
By (12) and (13), lower-bounding the infimum in (13) by its value at the same $r$, the terms involving $c_j(x)$, $j \in [n_e]$, cancel and the calibration gap of the surrogate loss can be lower bounded as
\[
\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] \geq \Big(\mathbb{E}_{y|x}[L(h(x),y)] - \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)]\Big)\Big(\sum_{j=1}^{n_e}\ell(r,x,j) - n_e + 1\Big).
\]
Since $\ell \geq \ell_{0-1}$ and $r(x) = 0$, we have $\ell(r,x,j) \geq 1_{r(x)\neq j} = 1$ for every $j \in [n_e]$, so the second factor is at least $1$ and therefore
\[
\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] \geq \mathbb{E}_{y|x}[L(h(x),y)] - \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)] = \mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)].
\]

Case II: $c_0(x) > \min_{j=1}^{n_e} c_j(x)$. In this case, by (10) and (11), the calibration gap of the deferral loss can be expressed as
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = c_{r(x)}(x) - \min_{j=1}^{n_e} c_j(x),
\]
where, by convention, $c_{r(x)}(x)$ stands for $\mathbb{E}_{y|x}[L(h(x),y)]$ when $r(x) = 0$. By (12) and (13), using the fact that $c_0(x) = \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)] \leq \mathbb{E}_{y|x}[L(h(x),y)]$ and that the expression within the infimum in (13) is non-decreasing in that quantity, the calibration gap of the surrogate loss can be lower bounded as
\[
\begin{aligned}
&\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)]\\
&\quad\geq \Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j)\\
&\qquad - \inf_{r\in\mathcal{R}}\Big\{\Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j)\Big\}\\
&\quad= n_e\Big[\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j=1}^{n_e}c_j(x)\Big]\Big[\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j)\Big],
\end{aligned}
\]
where we let $p_0 = \frac{\sum_{j=1}^{n_e}c_j(x)}{n_e(\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j=1}^{n_e}c_j(x))}$ and $p_j = \frac{\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j'\neq j}c_{j'}(x)}{n_e(\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j=1}^{n_e}c_j(x))}$, $j\in[n_e]$, in the last equality. By Lemma A.1 and the regularity of $\mathcal{R}$, and since $\mathbb{E}_{y|x}[L(h(x),y)] \geq c_0(x) > \min_{j=1}^{n_e}c_j(x)$, we have
\[
\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) \geq \Gamma^{-1}\Big(\max_{j}p_j - p_{r(x)}\Big) = \Gamma^{-1}\bigg(\frac{c_{r(x)}(x) - \min_{j=1}^{n_e}c_j(x)}{n_e\big(\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j=1}^{n_e}c_j(x)\big)}\bigg).
\]
Therefore, we obtain
\[
\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] \geq n_e\Big[\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j=1}^{n_e}c_j(x)\Big]\,\Gamma^{-1}\bigg(\frac{\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)]}{n_e\big(\mathbb{E}_{y|x}[L(h(x),y)]+\sum_{j=1}^{n_e}c_j(x)\big)}\bigg),
\]
which, using $\Gamma(t)=\beta t^{\alpha}$ with $\alpha\in(0,1]$ and $\beta>0$, together with $L\leq\overline{l}$ and $c_j\leq\overline{c}_j$, $j\in[n_e]$, yields
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] \leq \big(n_e(\overline{l}+\textstyle\sum_{j=1}^{n_e}\overline{c}_j)\big)^{1-\alpha}\beta\,\big(\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)]\big)^{\alpha}.
\]

Case III: $r(x) > 0$ and $c_0(x) \leq \min_{j=1}^{n_e} c_j(x)$. In this case, by (10) and (11), the calibration gap of the deferral loss can be expressed as
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = c_{r(x)}(x) - c_0(x).
\]
By (12) and (13), using the fact that $\mathbb{E}_{y|x}[L(h(x),y)] \geq \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)] = c_0(x)$ and that the coefficient of $\mathbb{E}_{y|x}[L(h(x),y)]$ in (12) is non-negative, the calibration gap of the surrogate loss can be lower bounded as
\[
\begin{aligned}
&\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)]\\
&\quad\geq \Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[c_0(x)+\sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j) - \inf_{r\in\mathcal{R}}\Big\{\Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[c_0(x)+\sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j)\Big\}\\
&\quad= n_e\Big[\sum_{j=0}^{n_e}c_j(x)\Big]\Big[\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j)\Big],
\end{aligned}
\]
where we let $p_0 = \frac{\sum_{j=1}^{n_e}c_j(x)}{n_e\sum_{j=0}^{n_e}c_j(x)}$ and $p_j = \frac{c_0(x)+\sum_{j'\neq j}c_{j'}(x)}{n_e\sum_{j=0}^{n_e}c_j(x)}$, $j\in[n_e]$, in the last equality. By Lemma A.1 and the regularity of $\mathcal{R}$, and since $c_0(x) \leq \min_{j=1}^{n_e}c_j(x)$ implies $\max_j p_j = p_0$, we have
\[
\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) \geq \Gamma^{-1}\Big(\max_{j}p_j - p_{r(x)}\Big) = \Gamma^{-1}\bigg(\frac{c_{r(x)}(x) - c_0(x)}{n_e\sum_{j=0}^{n_e}c_j(x)}\bigg).
\]
Therefore, we obtain
\[
\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] \geq n_e\Big[\sum_{j=0}^{n_e}c_j(x)\Big]\,\Gamma^{-1}\bigg(\frac{\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)]}{n_e\sum_{j=0}^{n_e}c_j(x)}\bigg),
\]
which, using $\Gamma(t)=\beta t^{\alpha}$ with $\alpha\in(0,1]$ and $\beta>0$, together with $c_0 \leq \overline{l}$ and $c_j\leq\overline{c}_j$, $j\in[n_e]$, yields
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] \leq \big(n_e(\overline{l}+\textstyle\sum_{j=1}^{n_e}\overline{c}_j)\big)^{1-\alpha}\beta\,\big(\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)] - \inf_{h,r}\mathbb{E}_{y|x}[L_{\ell}(h,r,x,y)]\big)^{\alpha}.
\]
Overall, by taking the expectation of the deferral and surrogate calibration gaps and using Jensen's inequality in each case, we obtain
\[
\mathcal{E}_{L_{\mathrm{def}}}(h,r) - \mathcal{E}^*_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) + \mathcal{M}_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) \leq \overline{\Gamma}\big(\mathcal{E}_{L_{\ell}}(h,r) - \mathcal{E}^*_{L_{\ell}}(\mathcal{H},\mathcal{R}) + \mathcal{M}_{L_{\ell}}(\mathcal{H},\mathcal{R})\big),
\]
where $\overline{\Gamma}(t) = \max\{t, (n_e(\overline{l}+\sum_{j=1}^{n_e}\overline{c}_j))^{1-\alpha}\beta\,t^{\alpha}\}$.
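As a complement to the proof above, here is a minimal sketch (not from the paper; the function names, the particular choice of $\ell$, and the example values are illustrative assumptions) of a pointwise single-stage surrogate whose conditional expectation coincides with the expression in Eq. (12). The multi-class surrogate $\ell$ is instantiated here as a base-2 logistic loss over the $n_e + 1$ deferral scores, one simple choice satisfying $\ell \geq \ell_{0-1}$.

```python
import numpy as np

def ell(scores, j):
    """Base-2 logistic surrogate for the multi-class 0-1 loss:
    ell(r, x, j) = log2(sum_k 2^(r_k - r_j)) >= 1_{argmax_k r_k != j}."""
    z = scores - scores[j]
    return np.log2(np.sum(np.exp2(z)))

def single_stage_surrogate(L_hx_y, costs, scores):
    """Pointwise surrogate whose conditional expectation matches Eq. (12).

    L_hx_y : regression loss L(h(x), y) of the current predictor.
    costs  : array of expert costs [c_1(x, y), ..., c_{n_e}(x, y)].
    scores : deferral scores [r_0(x), ..., r_{n_e}(x)];
             index 0 = predict with h, index j = defer to expert j.
    """
    n_e = len(costs)
    total_cost = costs.sum()
    value = total_cost * ell(scores, 0)
    for j in range(1, n_e + 1):
        value += (L_hx_y + total_cost - costs[j - 1]) * ell(scores, j)
    return value - (n_e - 1) * L_hx_y

# Example: squared loss, three experts, arbitrary deferral scores.
h_pred, y = 2.3, 2.0
L_hx_y = (h_pred - y) ** 2
costs = np.array([0.5, 1.2, 0.9])
scores = np.array([0.1, -0.3, 0.8, 0.2])
print(single_stage_surrogate(L_hx_y, costs, scores))
```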
C. Proof of Theorem 4.1

Theorem 4.1. Given a hypothesis set $\mathcal{R}$, a multi-class loss function $\ell \geq \ell_{0-1}$ and a predictor $h$, assume that there exists a function $\Gamma(t) = \beta\,t^{\alpha}$ for some $\alpha\in(0,1]$ and $\beta>0$ such that the following $\mathcal{R}$-consistency bound holds for all $r\in\mathcal{R}$ and any distribution:
\[
\mathcal{E}_{\ell_{0-1}}(r) - \mathcal{E}^*_{\ell_{0-1}}(\mathcal{R}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{R}) \leq \Gamma\big(\mathcal{E}_{\ell}(r) - \mathcal{E}^*_{\ell}(\mathcal{R}) + \mathcal{M}_{\ell}(\mathcal{R})\big).
\]
Then, the following $\mathcal{R}$-consistency bound holds for all $r\in\mathcal{R}$ and any distribution:
\[
\mathcal{E}_{L^h_{\mathrm{def}}}(r) - \mathcal{E}^*_{L^h_{\mathrm{def}}}(\mathcal{R}) + \mathcal{M}_{L^h_{\mathrm{def}}}(\mathcal{R}) \leq \overline{\Gamma}\big(\mathcal{E}_{L^h_{\ell}}(r) - \mathcal{E}^*_{L^h_{\ell}}(\mathcal{R}) + \mathcal{M}_{L^h_{\ell}}(\mathcal{R})\big),
\]
where $\overline{\Gamma}(t) = (n_e(\overline{l} + \sum_{j=1}^{n_e}\overline{c}_j))^{1-\alpha}\beta\,t^{\alpha}$.

Proof. Given the hypothesis set $\mathcal{R}$, the multi-class loss function $\ell$ and the predictor $h$, for any $r\in\mathcal{R}$, $x\in\mathcal{X}$ and $y\in\mathcal{Y}$, the conditional errors of $L^h_{\mathrm{def}}$ and $L^h_{\ell}$ can be written as
\[
\mathbb{E}_{y|x}[L^h_{\mathrm{def}}(r,x,y)] = \mathbb{E}_{y|x}[L(h(x),y)]\,1_{r(x)=0} + \sum_{j=1}^{n_e}\mathbb{E}_{y|x}[c_j(x,y)]\,1_{r(x)=j},
\]
\[
\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] = \Big[\sum_{j=1}^{n_e}\mathbb{E}_{y|x}[c_j(x,y)]\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j'\neq j}\mathbb{E}_{y|x}[c_{j'}(x,y)]\Big]\ell(r,x,j).
\]
Let $c_0(x) = \mathbb{E}_{y|x}[L(h(x),y)]$ denote the conditional error of the fixed predictor $h$, and let $c_j(x) = \mathbb{E}_{y|x}[c_j(x,y)]$ for $j\in[n_e]$. Thus, the best-in-class conditional errors of $L^h_{\mathrm{def}}$ and $L^h_{\ell}$ can be expressed as
\[
\inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\mathrm{def}}(r,x,y)] = \min_{j\in\{0,1,\ldots,n_e\}} c_j(x),
\]
\[
\inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] = \inf_{r\in\mathcal{R}}\Big\{\Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[c_0(x)+\sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j)\Big\}.
\]
Let $p_0 = \frac{\sum_{j=1}^{n_e}c_j(x)}{n_e(c_0(x)+\sum_{j=1}^{n_e}c_j(x))}$ and $p_j = \frac{c_0(x)+\sum_{j'\neq j}c_{j'}(x)}{n_e(c_0(x)+\sum_{j=1}^{n_e}c_j(x))}$, $j\in[n_e]$. Then, the calibration gap of $L^h_{\ell}$ can be written as
\[
\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] = n_e\Big[c_0(x)+\sum_{j=1}^{n_e}c_j(x)\Big]\Big[\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j)\Big].
\]
By Lemma A.1, we have
\[
\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) \geq \Gamma^{-1}\Big(\max_{j}p_j - p_{r(x)}\Big) = \Gamma^{-1}\bigg(\frac{c_{r(x)}(x) - \min_{j\in\{0,1,\ldots,n_e\}}c_j(x)}{n_e\big(c_0(x)+\sum_{j=1}^{n_e}c_j(x)\big)}\bigg).
\]
Therefore, we obtain
\[
\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] \geq n_e\Big[c_0(x)+\sum_{j=1}^{n_e}c_j(x)\Big]\,\Gamma^{-1}\bigg(\frac{\mathbb{E}_{y|x}[L^h_{\mathrm{def}}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\mathrm{def}}(r,x,y)]}{n_e\big(c_0(x)+\sum_{j=1}^{n_e}c_j(x)\big)}\bigg),
\]
which, using $\Gamma(t)=\beta t^{\alpha}$ with $\alpha\in(0,1]$ and $\beta>0$, together with $L\leq\overline{l}$ and $c_j\leq\overline{c}_j$, $j\in[n_e]$, yields
\[
\mathbb{E}_{y|x}[L^h_{\mathrm{def}}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\mathrm{def}}(r,x,y)] \leq \big(n_e(\overline{l}+\textstyle\sum_{j=1}^{n_e}\overline{c}_j)\big)^{1-\alpha}\beta\,\big(\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)]\big)^{\alpha}.
\]
Taking the expectation on both sides and using Jensen's inequality, we obtain
\[
\mathcal{E}_{L^h_{\mathrm{def}}}(r) - \mathcal{E}^*_{L^h_{\mathrm{def}}}(\mathcal{R}) + \mathcal{M}_{L^h_{\mathrm{def}}}(\mathcal{R}) \leq \overline{\Gamma}\big(\mathcal{E}_{L^h_{\ell}}(r) - \mathcal{E}^*_{L^h_{\ell}}(\mathcal{R}) + \mathcal{M}_{L^h_{\ell}}(\mathcal{R})\big),
\]
where $\overline{\Gamma}(t) = (n_e(\overline{l}+\sum_{j=1}^{n_e}\overline{c}_j))^{1-\alpha}\beta\,t^{\alpha}$.
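For concreteness, the following display (not in the original text; it simply specializes the statement just proven) instantiates Theorem 4.1 in the case where the assumed $\mathcal{R}$-consistency bound holds with $\alpha = \tfrac{1}{2}$, that is $\Gamma(t) = \beta\sqrt{t}$:

```latex
% Specialization of Theorem 4.1 to Gamma(t) = beta * sqrt(t), i.e., alpha = 1/2.
\mathcal{E}_{L^{h}_{\mathrm{def}}}(r) - \mathcal{E}^{*}_{L^{h}_{\mathrm{def}}}(\mathcal{R}) + \mathcal{M}_{L^{h}_{\mathrm{def}}}(\mathcal{R})
\;\leq\;
\beta\,\sqrt{\,n_e\Big(\overline{l} + \sum_{j=1}^{n_e}\overline{c}_j\Big)\,}\;
\sqrt{\,\mathcal{E}_{L^{h}_{\ell}}(r) - \mathcal{E}^{*}_{L^{h}_{\ell}}(\mathcal{R}) + \mathcal{M}_{L^{h}_{\ell}}(\mathcal{R})\,}.
```

In this regime, driving the surrogate estimation error of the deferral function to zero drives the deferral error to zero at a square-root rate, up to the minimizability gaps.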
D. Proof of Theorem 4.3

Theorem 4.3. Given a hypothesis set $\mathcal{H}$, a regular hypothesis set $\mathcal{R}$ and a multi-class loss function $\ell \geq \ell_{0-1}$, assume that there exists a function $\Gamma(t)=\beta\,t^{\alpha}$ for some $\alpha\in(0,1]$ and $\beta>0$ such that the following $\mathcal{R}$-consistency bound holds for all $r\in\mathcal{R}$ and any distribution:
\[
\mathcal{E}_{\ell_{0-1}}(r) - \mathcal{E}^*_{\ell_{0-1}}(\mathcal{R}) + \mathcal{M}_{\ell_{0-1}}(\mathcal{R}) \leq \Gamma\big(\mathcal{E}_{\ell}(r) - \mathcal{E}^*_{\ell}(\mathcal{R}) + \mathcal{M}_{\ell}(\mathcal{R})\big).
\]
Then, the following $(\mathcal{H},\mathcal{R})$-consistency bound holds for all $h\in\mathcal{H}$, $r\in\mathcal{R}$ and any distribution:
\[
\mathcal{E}_{L_{\mathrm{def}}}(h,r) - \mathcal{E}^*_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) + \mathcal{M}_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) \leq \mathcal{E}_{L}(h) - \mathcal{E}^*_{L}(\mathcal{H}) + \mathcal{M}_{L}(\mathcal{H}) + \overline{\Gamma}\big(\mathcal{E}_{L^h_{\ell}}(r) - \mathcal{E}^*_{L^h_{\ell}}(\mathcal{R}) + \mathcal{M}_{L^h_{\ell}}(\mathcal{R})\big),
\]
where $\overline{\Gamma}(t) = (n_e(\overline{l}+\sum_{j=1}^{n_e}\overline{c}_j))^{1-\alpha}\beta\,t^{\alpha}$.

Proof. The conditional error of the deferral loss can be expressed as
\[
\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = \mathbb{E}_{y|x}[L(h(x),y)]\,1_{r(x)=0} + \sum_{j=1}^{n_e}\mathbb{E}_{y|x}[c_j(x,y)]\,1_{r(x)=j}.
\]
Let $c_0(x) = \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)]$ and $c_j(x) = \mathbb{E}_{y|x}[c_j(x,y)]$ for $j\in[n_e]$. Thus, the best-in-class conditional error of the deferral loss can be expressed as
\[
\inf_{h\in\mathcal{H},\,r\in\mathcal{R}}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] = \min_{j\in\{0,1,\ldots,n_e\}}c_j(x).
\]
Thus, by introducing the term $\min\{\mathbb{E}_{y|x}[L(h(x),y)], \min_{j=1}^{n_e}c_j(x)\}$ and subsequently subtracting it after rearranging, the conditional regret of the deferral loss $L_{\mathrm{def}}$ can be written as follows:
\[
\begin{aligned}
&\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)] - \inf_{h\in\mathcal{H},\,r\in\mathcal{R}}\mathbb{E}_{y|x}[L_{\mathrm{def}}(h,r,x,y)]\\
&\quad= \Big[\mathbb{E}_{y|x}[L(h(x),y)]\,1_{r(x)=0} + \sum_{j=1}^{n_e}c_j(x)\,1_{r(x)=j} - \min\Big\{\mathbb{E}_{y|x}[L(h(x),y)], \min_{j=1}^{n_e}c_j(x)\Big\}\Big]\\
&\qquad+ \Big[\min\Big\{\mathbb{E}_{y|x}[L(h(x),y)], \min_{j=1}^{n_e}c_j(x)\Big\} - \min_{j\in\{0,1,\ldots,n_e\}}c_j(x)\Big]. 
\end{aligned}\tag{16}
\]
Note that, by the property of the minimum, the second term in (16) can be upper bounded as
\[
\min\Big\{\mathbb{E}_{y|x}[L(h(x),y)], \min_{j=1}^{n_e}c_j(x)\Big\} - \min\Big\{c_0(x), \min_{j=1}^{n_e}c_j(x)\Big\} \leq \mathbb{E}_{y|x}[L(h(x),y)] - \inf_{h\in\mathcal{H}}\mathbb{E}_{y|x}[L(h(x),y)].
\]
Next, we upper bound the first term. Note that the conditional error and the best-in-class conditional error of $L^h_{\ell}$ can be expressed as
\[
\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] = \Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j),
\]
\[
\inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] = \inf_{r\in\mathcal{R}}\Big\{\Big[\sum_{j=1}^{n_e}c_j(x)\Big]\ell(r,x,0) + \sum_{j=1}^{n_e}\Big[\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j'\neq j}c_{j'}(x)\Big]\ell(r,x,j)\Big\}.
\]
Let $p_0 = \frac{\sum_{j=1}^{n_e}c_j(x)}{n_e(\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j=1}^{n_e}c_j(x))}$ and $p_j = \frac{\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j'\neq j}c_{j'}(x)}{n_e(\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j=1}^{n_e}c_j(x))}$, $j\in[n_e]$. Then, the first term can be rewritten as
\[
\mathbb{E}_{y|x}[L(h(x),y)]\,1_{r(x)=0} + \sum_{j=1}^{n_e}c_j(x)\,1_{r(x)=j} - \min\Big\{\mathbb{E}_{y|x}[L(h(x),y)], \min_{j=1}^{n_e}c_j(x)\Big\} = n_e\Big[\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j=1}^{n_e}c_j(x)\Big]\Big[\sum_{j=0}^{n_e}p_j\,1_{r(x)\neq j} - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,1_{r(x)\neq j}\Big].
\]
By Lemma A.1, we have
\[
\sum_{j=0}^{n_e}p_j\,1_{r(x)\neq j} - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,1_{r(x)\neq j} \leq \Gamma\Big(\sum_{j=0}^{n_e}p_j\,\ell(r,x,j) - \inf_{r\in\mathcal{R}}\sum_{j=0}^{n_e}p_j\,\ell(r,x,j)\Big) = \Gamma\bigg(\frac{\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)]}{n_e\big(\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j=1}^{n_e}c_j(x)\big)}\bigg).
\]
Therefore, the first term can be upper bounded as
\[
n_e\Big[\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j=1}^{n_e}c_j(x)\Big]\,\Gamma\bigg(\frac{\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)]}{n_e\big(\mathbb{E}_{y|x}[L(h(x),y)] + \sum_{j=1}^{n_e}c_j(x)\big)}\bigg) \leq \big(n_e(\overline{l}+\textstyle\sum_{j=1}^{n_e}\overline{c}_j)\big)^{1-\alpha}\beta\,\big(\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)] - \inf_{r\in\mathcal{R}}\mathbb{E}_{y|x}[L^h_{\ell}(r,x,y)]\big)^{\alpha},
\]
where we used $\Gamma(t)=\beta t^{\alpha}$ with $\alpha\in(0,1]$ and $\beta>0$, together with $L\leq\overline{l}$ and $c_j\leq\overline{c}_j$, $j\in[n_e]$, in the last inequality. After upper bounding the first and second terms in (16) as above, taking the expectation on both sides and using Jensen's inequality, we obtain
\[
\mathcal{E}_{L_{\mathrm{def}}}(h,r) - \mathcal{E}^*_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) + \mathcal{M}_{L_{\mathrm{def}}}(\mathcal{H},\mathcal{R}) \leq \mathcal{E}_{L}(h) - \mathcal{E}^*_{L}(\mathcal{H}) + \mathcal{M}_{L}(\mathcal{H}) + \overline{\Gamma}\big(\mathcal{E}_{L^h_{\ell}}(r) - \mathcal{E}^*_{L^h_{\ell}}(\mathcal{R}) + \mathcal{M}_{L^h_{\ell}}(\mathcal{R})\big),
\]
where $\overline{\Gamma}(t) = (n_e(\overline{l}+\sum_{j=1}^{n_e}\overline{c}_j))^{1-\alpha}\beta\,t^{\alpha}$.

E. Common margin-based losses and corresponding deferral surrogate losses

Table 2. Common margin-based losses and corresponding deferral surrogate losses.

Exponential: $\Phi_{\exp}(u) = e^{-u}$; deferral surrogate $\ell_{\Phi}$: $L(h(x),y)\,e^{r(x)} + c(x,y)\,e^{-r(x)}$.
Logistic: $\Phi_{\log}(u) = \log(1+e^{-u})$; deferral surrogate $\ell_{\Phi}$: $L(h(x),y)\log(1+e^{r(x)}) + c(x,y)\log(1+e^{-r(x)})$.
Quadratic: $\Phi_{\mathrm{quad}}(u) = (\max\{1-u,0\})^2$; deferral surrogate $\ell_{\Phi}$: $L(h(x),y)\,\Phi_{\mathrm{quad}}(-r(x)) + c(x,y)\,\Phi_{\mathrm{quad}}(r(x))$.
Hinge: $\Phi_{\mathrm{hinge}}(u) = \max\{1-u,0\}$; deferral surrogate $\ell_{\Phi}$: $L(h(x),y)\,\Phi_{\mathrm{hinge}}(-r(x)) + c(x,y)\,\Phi_{\mathrm{hinge}}(r(x))$.
Sigmoid: $\Phi_{\mathrm{sig}}(u) = 1-\tanh(ku)$, $k>0$; deferral surrogate $\ell_{\Phi}$: $L(h(x),y)\,\Phi_{\mathrm{sig}}(-r(x)) + c(x,y)\,\Phi_{\mathrm{sig}}(r(x))$.
$\rho$-Margin: $\Phi_{\rho}(u) = \min\{1,\max\{0,1-\frac{u}{\rho}\}\}$, $\rho>0$; deferral surrogate $\ell_{\Phi}$: $L(h(x),y)\,\Phi_{\rho}(-r(x)) + c(x,y)\,\Phi_{\rho}(r(x))$.
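To illustrate how a row of Table 2 can be turned into a trainable objective, here is a minimal sketch (not the paper's implementation; the toy data, the linear parameterization of $r$, the learning rate, and the decision convention noted in the comments are all illustrative assumptions) of the logistic deferral surrogate $L(h(x),y)\log(1+e^{r(x)}) + c(x,y)\log(1+e^{-r(x)})$ for a single expert, fitted by plain gradient descent. The form of the surrogate implies that larger values of $r(x)$ favor keeping the predictor's own output, so the sketch defers when $r(x) \leq 0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a pre-trained predictor h, its squared loss, and a constant expert cost.
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
h = lambda X: X @ np.array([1.0, -2.0, 0.0])   # imperfect pre-trained predictor
L = (h(X) - y) ** 2                            # L(h(x), y), squared loss
c = np.full(n, 0.3)                            # expert cost c(x, y)

# Logistic deferral surrogate from Table 2, with a linear rejection function r.
def surrogate(w, b):
    r = X @ w + b
    return np.mean(L * np.log1p(np.exp(r)) + c * np.log1p(np.exp(-r)))

def grad(w, b):
    r = X @ w + b
    s = 1.0 / (1.0 + np.exp(-r))               # sigmoid(r)
    dr = L * s - c * (1.0 - s)                 # d/dr of the pointwise surrogate
    return X.T @ dr / n, dr.mean()

w, b = np.zeros(d), 0.0
for _ in range(500):                           # plain gradient descent
    gw, gb = grad(w, b)
    w, b = w - 0.5 * gw, b - 0.5 * gb

defer = (X @ w + b) <= 0                       # defer to the expert when r(x) <= 0
print(f"surrogate={surrogate(w, b):.3f}, deferral rate={defer.mean():.2f}")
```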
F. Additional experiments

Here, we report additional experimental results with three simple baselines:
Baseline 1: the prediction error of each expert used on its own.
Baseline 2: always defer to an expert (chosen randomly or not) with probability a%.
Baseline 3: the single-expert formulation using only expert 1 (or 2, or 3).

In Table 3, we report the empirical results of our two-stage method without base cost on the Housing dataset alongside the corresponding baselines, which further demonstrates the effectiveness of our approach. For our method, the single-expert deferral ratio is 91%; with two experts, the deferral rate is 8% for the first expert and 85% for the second expert; with three experts, the deferral rate is 4% for the first expert, 35% for the second expert, and 60% for the third expert. We use the same deferral rates for randomly deferring to experts in Baseline 2. The error of the base model is 22.72 ± 7.68. EXP indicates which experts are used, and system MSE values are reported. Our method outperforms all three baselines.

Table 3. Comparison of our proposed method with three simple baselines (system MSE, mean ± standard deviation; EXP indicates the experts used).
Baseline 1 | EXP 1: 17.37 ± 4.80 | EXP 2: 15.07 ± 3.03 | EXP 3: 12.72 ± 2.30
Baseline 2 | EXP 1: 17.77 ± 5.12 | EXP 1, 2: 15.43 ± 2.83 | EXP 1, 2, 3: 12.92 ± 2.45
Baseline 3 | EXP 1: 16.26 ± 5.58 | EXP 2: 15.44 ± 2.25 | EXP 3: 12.36 ± 3.32
Ours | EXP 1: 16.26 ± 5.58 | EXP 1, 2: 14.82 ± 3.60 | EXP 1, 2, 3: 12.02 ± 1.97
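For clarity about how system MSE entries of the kind reported in Table 3 can be computed, here is a schematic evaluation routine (not the paper's code; the synthetic data, the function names, the random-deferral baseline, and the oracle reference are all illustrative assumptions): each test point is scored either with the base model's squared error or with the squared error of the expert it is deferred to.

```python
import numpy as np

def system_mse(y_true, base_pred, expert_preds, deferral):
    """System MSE under deferral.

    y_true       : (n,) ground-truth targets.
    base_pred    : (n,) predictions of the base model h.
    expert_preds : (n_e, n) predictions of the experts.
    deferral     : (n,) integers in {0, ..., n_e}; 0 keeps the base model,
                   j > 0 defers to expert j.
    """
    final = np.where(deferral == 0, base_pred,
                     expert_preds[np.maximum(deferral - 1, 0), np.arange(len(y_true))])
    return np.mean((final - y_true) ** 2)

# Illustration with synthetic predictions: a random-deferral baseline (defer to
# expert 1 with a fixed probability, as in Baseline 2) and an oracle reference.
rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)
base = y + rng.normal(scale=3.0, size=n)
experts = np.stack([y + rng.normal(scale=s, size=n) for s in (2.0, 1.5, 1.0)])

random_deferral = rng.choice([0, 1], size=n, p=[0.09, 0.91])   # defer to expert 1 with prob. 91%
print("random deferral:", system_mse(y, base, experts, random_deferral))

oracle_deferral = np.argmin(
    np.vstack([(base - y) ** 2, (experts - y) ** 2]), axis=0)  # best per-point choice (oracle)
print("oracle deferral:", system_mse(y, base, experts, oracle_deferral))
```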