# regression_with_costbased_rejection__0203aec3.pdf

Regression with Cost-based Rejection

Xin Cheng1 Yuzhou Cao2 Haobo Wang3 Hongxin Wei4 Bo An2 Lei Feng2

1College of Computer Science, Chongqing University, China 2School of Computer Science and Engineering, Nanyang Technological University, Singapore 3School of Software Technology, Zhejiang University, China 4Department of Statistics and Data Science, Southern University of Science and Technology, China xincheng9215@gmail.com, yuzhou002@e.ntu.edu.sg, wanghaobo@zju.edu.cn weihx@sustech.edu.cn, boan@ntu.edu.sg, lfengqaq@gmail.com

Learning with rejection is an important framework that can refrain from making predictions to avoid critical mispredictions by balancing between prediction and rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can reject to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should reject to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.

1 Introduction

In machine learning, the learned model from training data is expected to make predictions on unknown test data as accurately as possible. However, it would be unreasonable for the learned model to make predictions on all the test instances, as there may exist some difficult instances that the learned model cannot give an accurate prediction. Incorrect predictions can cause severe consequences and even can be life-threatening, especially in risk-sensitive applications [5, 19, 36, 11], such as healthcare management, autonomous driving, and product inspection. Therefore, the learning with rejection (Lw R) framework [10, 11, 36, 27, 31, 17, 8, 6, 38] was extensively investigated, which aims to provide a reject option to not make a prediction in order to prevent critical false predictions. In this case, the Lw R model can be learned by balancing the rejection and the prediction.

So far, most of the existing studies on Lw R have focused on the classification setting, i.e., classification with rejection [10, 3, 44, 45, 14, 19, 37, 17, 8, 6]. A well-known framework for classification with rejection that has been studied extensively is called the cost-based framework, i.e., classification with cost-based rejection [11, 12, 18, 39, 16, 36, 8, 6]. In the classification with cost-based rejection setting, there is a pre-determined rejection cost c for each instance, which must be smaller than the classification error 1. A typical approach for classification with cost-based rejection is the confidence-based approach [22, 3, 45, 39, 40, 8]. The main idea is to use the real-valued output of the classifier as the confidence score and decide whether to reject the prediction based on the

Corresponding author: Lei Feng.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

confidence score and the given rejection cost c. Another effective approach is classifier-rejector approach [11, 12, 36], which simultaneously trains a classifier and a rejector, and this approach achieves state-of-the-art performance in binary classification [36].

Despite many previous studies on Lw R, they only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In many real-world scenarios, regression tasks with continuous real-valued targets can be commonly encountered. However, even state-of-the-art regression models may make incorrect predictions, and blindly trusting the model results may lead to critical consequences, especially in risk-sensitive applications [7, 4, 28, 23, 48]. Therefore, it is necessary to consider adding a rejection option for the regression problem to not make predictions in order to avoid critical mispredictions. To this end, many studies have been conducted on regression with rejection. A widely studied framework in regression with rejection is called selective regression [26, 44, 20, 46, 27, 41] that trains a regression model with a reject option given a fixed reject rate of predictions. However, this selective regression setting fails to consider the cost-based rejection scenario where a certain cost could be incurred if the model chooses to refrain from making a prediction for a certain instance.

In this paper, we provide the first attempt to investigate a novel regression setting called regression with cost-based rejection (Rc R), where the model could reject to make predictions on some instances at certain costs to avoid critical mispredictions. To solve the Rc R problem, we first formulate the expected risk and then derive the Bayes optimal solution, which shows that the optimal model should reject to make predictions on the examples whose variance is larger than the rejection cost when the popular mean squared error is used as the regression loss. However, it is difficult to directly optimize the expected risk to derive the optimal solution, since the variance of the instances cannot be easily accessed. Therefore, we propose a surrogate loss function to train the model that considers the rejection behavior as a binary classification and we provide theoretical analyses to show that the Bayes optimal solution can be recovered by minimizing our surrogate loss under mild conditions. Our main contributions can be summarized as follows:

We propose a surrogate loss function considering rejection as binary classification and we demonstrate the regressor-consistency and the rejector-calibration when the binary classification loss function is classification-calibrated and is always greater than 0. We also provide a relaxed condition that allows the classification-calibrated binary classification loss to be non-negative. In the relaxed condition, the regressor-consistency can still be satisfied for correctly accepted instances. We derive a regret transfer bound and an estimation error bound for our proposed method, and extensive experiments demonstrate the effectiveness of our method.

2 Preliminaries

In this section, we introduce preliminary knowledge of ordinary regression, classification with rejection, and selective regression.

2.1 Ordinary Regression

For the ordinary regression problem, let the feature space be X Rd and the label space be Y R. Let us denote by (x, y) an example including an instance x and a real-valued label y. Each example (x, y) X Y is assumed to be independently sampled from an unknown data distribution with probability density p(x, y). For the regression task, we aim to learn a regression model h : X 7 R that minimizes the following expected risk: R(L) = Ep(x,y)[L(h(x), y)], (1)

where Ep(x,y) denotes the expectation over the data distribution p(x, y) and L : R R 7 R+ is a conventional loss function (such as mean squared error and mean absolute error) for regression, which measures how well a model estimates a given real-valued label.

2.2 Classification with Rejection

A widely studied framework in classification with rejection is the cost-based framework [10, 36, 8, 6] that aims to train a classifier f : X Z that can reject to make a prediction, where denotes the

reject option. The evaluation metric of this task is the zero-one-c loss ℓ01c defined as follows:

ℓ01c(f(x), z) = c(x), f(x) = , ℓ01(f(x, z), otherwise, (2)

where c(x) is the rejection cost associated with x. Then, the expected risk with ℓ01c can be represented as follows:

R01c(f) = Ep(x,y)[ℓ01c(f(x), y)], (3)

The optimal solution for classification with rejection f = argminf FR01c(f) given by Chow s rule [10] can be expressed as follows:

Definition 1. (Chow s Rule [10]) A classifier f : X Z is the optimal solution of expected risk (3) if and only if the following conditions are almost satisfied:

f(x) = , maxzηz(x) 1 c(x), argmaxzηz(x), otherwise, (4)

where ηz(x) = p(z|x). Chow s rule shows that classification with rejection can be solved when η(x) is known. However, the estimation of the probability is difficult especially when using deep neural networks [24].

2.3 Selective Regression

In selective regression, for a given instance x, the selective model can choose to make a prediction for it or reject to make a prediction without costs. Formally, the selective model is a pair (h, r) where h : X R is a regression prediction model and r : X R is a selection model, which serves as the rejection rule as follows,

(h, r)(x) = h(x), r(x) > 0, , otherwise. (5)

Let us denote by ϵ the expected rejection rate and denote by ϕ(r) = Ep(x,y)I[r(x) > 0] the coverage of selective regression. The purpose of selective regression is to derive a pair (h, r) such that the risk R(h, r) = Ep(x,y)[L(h(x), y)I[r(x) > 0]] is minimized with coverage 1 ϕ(r) < ϵ, where L(h(x), y) is a conventional regression loss function. However, the selective regression setting fails to consider the cost-based rejection scenario but fixes the rejection rate ϵ. In many real-world scenarios, rejection with costs is more common, and the cost c(x) is easier to provide compared with the rejection rate ϵ.

3 Regression with Cost-based Rejection

Let X Rd be the d-dimensional feature space and Y R be the label space. Suppose the training set is denoted by D = {(xi, yi)}n i=1, and each training example (xi, yi) X Y is assumed to be sampled from an unknown data distribution with probability density p(x, y). In Rc R setting, for a given instance x, the learner has the option to reject making a prediction or to make a regression prediction. If the learner rejects an instance, the cost is a non-negative loss c(x). The goal of Rc R is to induce a pair (h, r) where h : X 7 R is a regressor to predict the accepted instance and r : X 7 R is a rejector to determine whether to reject an instance or not. The evaluation metric of this task is the following loss function L(h, r, c, x, y):

L(h, r, c, x, y) = L(h(x), y), r(x) > 0, c(x), otherwise, (6)

where L(h(x), y) is a conventional regression loss function (e.g., mean squared error).

In what follows, we will present a Bayes optimal solution to the Rc R problem and provide a surrogate loss function to train the regressor-rejector.

3.1 Bayes Optimal Solution

In this paper, we only discuss the case where the loss function L(h(x), y) is the mean squared error (MSE), which is the most widely used regression loss function. The expected risk of L(h, r, c, x, y) over the data distribution can be represented as follows: RRc R(h, r) = Ep(x,y)[L(h, r, c, x, y)]. (7)

Let us denote by (h , r ) = argmin(h,r)RRc R(h, r) the optimal pair of the expected risk RRc R and we use Ep(y|x)[y] = R

Y p(y|x)ydy and Dp(y|x)[y] = R

Y p(y|x)(y Ep(y|x)[y])2dy denote the expectation and variance of y over the distribution p(y|x). For a given cost function c(x), we have the following theorem: Theorem 2. For a given instance x, a non-negative cost c(x) and the Bayes optimal pair (h , r ) of the risk RRc R, the following equality holds: ( h (x) = Ep(y|x)[y],

r (x) = I c(x) Dp(y|x)[y] . (8)

The proof of Theorem 2 is provided in Appendix A. It is worth noting that our derived Bayes optimal solution in Theorem 2 can be considered as a generalized version of Proposition 2.1 in Zaoui et al. [46], with an instance-dependent cost function. Theorem 2 shows the expected optimal pair (h , r ) of risk RRc R where the rejector r should reject making a prediction if the variance of the distribution of labels y associated with x is so large that it exceeds a given rejection cost c(x). This is intuitive and easy to understand. Unfortunately the probability density function p(y|x) is usually unknown, meaning that obtaining the variance Dp(y|x)[y] and expectation Ep(y|x)[y] is difficult or even impossible. Many previous studies adopted specific assumptions to avoid directly estimating the variance and the expectation (e.g., homoscedasticity [26, 43, 42] and heteroscedasticity [29, 30, 9, 32]), while all of them have certain constraints. Therefore, the key challenge of Rc R is how to learn the optimal solution (h , r ) without the expectation and the variance.

3.2 Surrogate Loss Function of Training Regressor-Rejector

From Theorem 2, we know how the optimal pair (h , r ) makes rejection and prediction for an unknown instance, but since the expectation and the variance are difficult to obtain, we cannot directly derive the optimal regressor and rejector. Let us reconsider the Rc R loss function L(h, r, c, x, y) by the following equation: L(h, r, c, x, y) = (h(x) y)2I[r(x) > 0] + c(x)I[r(x) 0], (9)

where I[ ] denotes the indicator function. We cannot directly derive a regressor h and a rejector r by the above loss since the loss function contains non-convex and discontinuous parts I[r(x) > 0] and I[r(x) 0]. In order to efficiently optimize the target loss, using surrogate loss is preferred. It is worth noting that the behavior of the rejector is similar to binary classification due to the only two options, rejection and acceptance. We may consider it directly as a binary classification {+1, 1}, where +1 means acceptance and 1 means rejection. Then we have the following surrogate loss function: ψ(h, r, c, x, y) = (h(x) y)2ℓ(r(x), 1) + c(x)ℓ(r(x), +1), (10) where ℓ( ) is an arbitrary binary classification loss function such as hinge loss. Then the expected risk with our surrogate loss ψ can be represented as follows:

Rψ Rc R(h, r) = Ep(x,y)[ψ(h, r, c, x, y)]. (11) The intuition behind this is that when the squared error is less than the given cost, we expect its weight ℓ(r(x), 1) to be larger, i.e., ℓ(r(x), +1) to be smaller. However, not every binary classification loss is theoretically grounded.

4 Theoretical Analysis

In this section, we first introduce the definitions of regressor-consistency and rejector-calibration. Then, we show the condition that our method can result in the Bayes optimal solution. Furthermore, we provide a relaxed condition that the regressor-consistency is only satisfied for correctly accepted instances. Finally, We derive a regret transfer bound and an estimation error bound for our method.

4.1 Regressor-Consistency and Rejector-Calibration

The rejector-calibration we are talking about here is related to the notion of classification calibration [1, 47, 15]. The notion of classification calibration of surrogate losses is defined as the minimum requirement to ensure that a risk-minimizing classifier becomes the Bayes optimal classifier, which is a pointwise version of consistency. The definition of rejector-calibration is given below. Definition 3. (Rejector-calibration) We say a rejector r : X R is calibrated if r always makes the same decisions as the Bayes optimal rejector r in Theorem 2, i.e., sign(r(x)) = sign(r (x)) for all x X such that r (x) = 0.

The definition of rejector-calibration indicates that we do not need to obtain the exact optimal rejector due to the difficulty of obtaining the variance, therefore, we just need to ensure that our rejector makes the same decisions as the optimal rejector in Theorem 2.

On the other hand, we define the regressor-consistency as follows. Definition 4. (Regressor-consistency) We say a regressor h : X R is consistent if h is equivalent to the Bayes optimal regressor h in Theorem 2, i.e., h = h .

Then we demonstrate that our method is regressor-consistent by the following theorem: Theorem 5. Suppose the binary loss ℓis always larger than 0. For a given non-negative cost function c(x), for any fixed rejector r, the optimal regressor h ψ = argminh HRψ Rc R(h, r) is equivalent to the Bayes optimal regressor h .

The proof of Theorem 5 is provided in Appendix B.1. Theorem 5 shows that the regressor h ψ learned from our method can be equivalent to the Bayes optimal regressor h .

It is worth noting that there is a special case ℓ(r(x), 1) = 0, where the regressor actually ignores the instance x. Here we show a relaxed condition for regressor-consistency by the following theorem: Theorem 6. Suppose the binary loss function ℓis classification-calibrated and is always nonnegative. Given a non-negative cost function c(x), for any fixed rejector r and for the optimal regressor h ψ = argminh HRψ Rc R(h, r), the regressor-consistency can be only satisfied for correctly accepted instances (i.e., x X, r (x) > 0).

The proof of Theorem 6 is provided in Appendix B.2. Theorem 6 gives a relaxed condition of consistency, where the regressor-consistency is still satisfied for correctly accepted instances.

Then, we demonstrate that our method is rejector-calibrated by the following theorem: Theorem 7. Suppose the binary loss ℓis classification-calibrated and is always larger than 0. For a given non-negative cost function c(x), the optimal rejector r ψ = argminr RRψ Rc R(h ψ, r) is calibrated (i.e., sign(r ψ(x)) = sign(r (x))), where r is the Bayes optimal rejector.

The proof of Theorem 7 is provided in Appendix B.3. Theorem 7 shows that our method is rejectorcalibrated if the used binary loss ℓis classification-calibrated.

4.2 Regret Transfer Bound and Estimation Error Bound

In the previous section, we have given the Bayes consistency analysis of our method, i.e., if the minimizer of our proposed risk can be the optimal one in Theorem 2. However, such a result does not guarantee the performance of models that are close to but not the minimizer of the risk Rψ Rc R, which occurs commonly since we usually minimize the empirical risk in practice. We provide a theoretical guarantee for such cases by showing the following regret transfer bound: Theorem 8. Suppose that x X, y Y, Ep(y|x)[(h(x) y)2] M and c(x) C hold almost surely:

RRc R(h, r) R Rc R ξ(Rψ Rc R(h, r) Rψ Rc R)),

This regret transfer bound holds for widely used binary losses, e.g., when ℓis the sigmoid loss or the hinge loss, ξ(u) = |u|. When ℓis the logistic loss or the square loss, ξ(u) =

min n 2u, 2 p

(M + C)u o .

The proof of Theorem 8 is provided in Appendix C.1. This theorem guarantees that even if the obtained (h, r) is not exactly the minimizer of Rψ Rc R, we can also expect them to achieve good performance as long as they have a low risk Rψ Rc R. Then we can further get the following estimation error bound:

Theorem 9. Suppose the binary loss is upper bounded by M1 > 0 and ρ-Lipschitz continuous, |h|, c(x), and |y| are bounded by M2 > 0. Given the empirical risk minimizers ˆh and ˆr, the following bound holds with probability at least 1 δ:

Rψ Rc R(ˆh, ˆr) Rψ Rc R 2

2L1(Rn(H) + Rn(R)) + C1

where C1 = (4M 2 2 + M2)M1, L1 = p

(4M 2 1 ρ + M1ρ)2 + 16M 4 1 M 2 2 , n is the i.i.d. sample size, and Rn is the Rademacher complexity [2].

The proof of Theorem 9 is provided in Appendix C.2. Given the fact that the Rademacher complexity usually decays at the rate of O(1/ n), we can finally conclude that the performance of our model can approximate its optimal performance with the increasing size of the training set. In the following sections, we will demonstrate the effectiveness of our approach through experiments.

5 Experiments

In this section, we show the experimental results when our method is equipped with various binary classification losses and is compared with selective regression methods. In addition to the evaluation metrics commonly used for Lw R, we propose additional evaluation metrics. Details of the experiment and the complete experiment can be found in Appendix D.

5.1 Implementation Details

When using deep neural networks as the model and using the gradient descent optimization, we consider a possible scenario where the regressor h predicts any instance x with such a large error that ℓ(h(x), y) >> c(x). In this case the rejector r expects to reject all instances to make the empirical risk minimal. However, when the rejector r converges quickly to reject all train instances, i.e., ℓ(r(x), 1) 0 for all train instances, the surrogate loss ψ will be constant equal to c(x)ℓ(r(x), +1). At that point the gradient of the regressor h suffers from gradient vanishing. The main reason for this situation is that the regressor h has not learned the distribution of the label, but the rejector r has converged, which means that the regressor is not ready. Fortunately, we can avoid such a situation by training the rejector after the regressor is ready, and we name such a method Slow-Start. Specifically, Slow-Start prioritizes training the regressor h without training the rejector r, and then co-trains the regressor h and rejector r when the regressor h is capable of making predictions.

5.2 Datasets and Backbone Models

We conduct experiments on seven datasets, including one computer vision dataset (Age DB [35]), one healthcare dataset (Breast Path Q [33]), and five datasets from the UCI Machine Learning Repository [13] (Abalone, Airfoil, Auto-mpg, Housing and Concrete). For each dataset, we randomly split the original dataset into training, validation, and test sets by the proportions of 60%, 20%, and 20%, respectively. It is worth noting that our approach has no restrictions on the regressor h and rejector r, so h and r can be two separate parts or share parameters.

Age DB is a regression dataset on age prediction [21] collected by [35]. It contains 16.4K face images with a minimum age of 0 and a maximum age of 101. Age prediction is not an easy task, especially when only a single photo is available. Lighting, clothing, makeup, and facial expressions all tend to affect the intuitive age, and even friends can hardly say they can identify the age in a photo. Rejecting predictions for photos with complex environments can avoid large errors. We employ Res Net-50 [25] as our backbone network for Age DB, and the regressor h and rejector r share parameters. We use the Adam optimizer to train our method for 100 epochs where the Slow Start is set to 40 epochs, the initial learning rate of 10 3 and fix the batch size to 256.

Table 1: Test performance (mean and std) of our surrogate loss equipped MAE on Breast Path Q. We repeat the sampling-and-training process 5 times. The metrics Rej, AR, RA are scaled to 0-100 and Sup, Rc RLoss, AL and RL are all magnified by a factor of 1000.

Cost Sup Rc RLoss AL RL Rej AR RA

5 4.37 2.70 31.51 72.53 52.61 6.53 (0.17) (1.07) (2.29) (4.44) (5.10) (2.66)

10 8.22 5.50 37.14 60.08 43.14 11.01 (0.70) (1.98) (4.43) (4.34) (4.49) (4.22)

15 16.77 11.11 6.84 40.39 53.49 38.39 15.46 (1.22) (0.55) (1.43) (1.67) (3.39) (2.86) (3.97)

20 13.84 9.53 43.41 40.65 29.98 29.02 (0.62) (1.69) (5.34) (7.28) (6.58) (9.81)

25 16.01 12.91 46.62 24.47 17.46 48.97 (1.32) (2.48) (9.43) (4.26) (4.96) (8.00)

Table 2: Test performance (mean and std) of our surrogate loss equipped MAE on Age DB. We repeat the sampling-and-training process 5 times. The metrics Rej, AR and RA are scaled to 0-100.

Cost Sup Rc RLoss AL RL Rej AR RA

60 59.80 54.25 156.81 95.40 93.13 2.51 (0.31) (4.41) (23.21) (2.88) (4.30) (1.56)

70 69.00 61.56 151.04 86.22 81.41 8.12 (0.39) (4.10) (12.05) (2.94) (3.07) (2.49)

80 77.10 67.32 150.52 76.00 70.63 16.11 (1.72) (2.21) (12.36) (15.71) (16.36) (13.20)

90 100.34 85.36 73.07 162.44 73.38 67.33 17.20 (3.73) (2.23) (3.21) (12.45) (11.50) (12.07) (9.08)

100 92.94 82.89 170.04 58.35 52.15 30.56 (3.02) (7.47) (20.53) (12.51) (11.59) (12.48)

110 95.08 79.62 166.07 52.15 46.13 34.38 (5.62) (5.44) (13.75) (14.96) (14.76) (13.40)

120 96.80 82.44 173.14 37.11 32.54 51.31 (7.45) (2.40) (12.58) (22.64) (21.42) (23.96)

Breast Path Q [33] is a healthcare dataset collected at the Sunnybrook Health Sciences Centre, Toronto. The dataset contains 2579 patch images, each patch has been assigned a tumor cellularity score score of 0 to 1 by 1 expert pathologist. Currently, this task is performed manually and relies upon expert interpretation of complex tissue structures. Moreover, cancer cellularity scoring is extremely risky and the use of automated methods could lead to irreversible disasters. Regression with rejection can improve this problem very well by predicting only the accepted samples and leaving the rejected samples back to the experts for evaluation. We use the same network as Age DB and train 300 epochs using Adam optimizer where the Slow-Start is set to 50 epochs, the initial learning rate of 10 3 and fix the batch size to 128.

We conducted experiments on five UCI benchmark datasets including Abalone, Airfoil, Auto-mpg, Housing and Concrete. All of these datasets can be downloaded from the UCI Machine Learning [13]. Since our proposed method do not depend on a specific model, and we train two types of base models including the linear model and the multilayer perceptron (MLP) to support the flexibility of our method on choosing a model, where the MLP model is a five-layer (d-20-30-10-1) neural network with a Re LU activation function. For the rejector r and regressor h, we consider them as two separate parts with the same structure. For both the linear model and the MLP model, we use the Adam optimization method with the batch size set to 1024 and the number of training epochs set to 1000 where the Slow-Start is set to 200 epochs. The learning rate for all UCI benchmark datasets is selected from {10 1, 10 2, 10 3}.

5.3 Evaluation Metrics

For evaluation metrics, we use the Rc R loss (Rc RLoss) in Eq. (6) and the rejection rate (Rej). In order to further investigate how the model work, we propose additional metrics. Accepted losses

Table 3: Test performance (mean and std) of our surrogate loss equipped MAE on five UCI datasets trained with the MLP model. We repeat the sampling-and-training process 10 times. The metrics Rej, AR, and RA are scaled to 0-100.

Datasets Cost Sup Rc RLoss AL RL Rej AR RA

3 2.41 1.99 8.13 42.04 32.82 33.33 (0.12) (0.21) (1.08) (3.18) (3.44) (3.22)

4 2.88 2.30 11.37 33.70 25.56 39.27 4.44 (0.13) (0.21) (1.70) (2.47) (2.81) (3.71)

5 (0.46) 3.22 2.66 10.30 23.43 16.83 48.98 (0.23) (0.35) (1.25) (2.94) (2.41) (5.90)

6 3.53 2.93 12.13 19.32 13.81 53.20 (0.25) (0.35) (1.69) (3.47) (3.33) (5.67)

9 7.20 4.23 37.80 62.23 41.49 11.60 (0.35) (0.86) (2.95) (3.73) (5.73) (3.29)

12 8.11 5.39 51.51 40.33 23.37 25.88 (0.36) (0.86) (10.51) (7.95) (7.20) (9.04)

16 9.15 6.84 72.80 24.92 11.92 38.17 12.96 (0.43) (0.70) (20.79) (5.67) (6.93) (5.02)

20 (2.60) 11.32 8.83 58.28 21.53 13.70 48.66 (0.75) (1.47) (8.87) (7.71) (5.34) (18.08)

25 11.47 9.24 74.38 14.19 8.08 52.11 (1.54) (1.35) (16.07) (5.11) (3.60) (12.32)

30 11.68 11.17 96.55 2.52 1.38 86.35 (3.07) (3.20) (16.60) (3.81) (1.78) (20.06)

4 3.64 2.99 13.98 56.92 46.80 28.74 (0.29) (0.83) (4.16) (13.00) (15.49) (10.51)

6 4.83 3.83 18.04 37.31 29.01 42.42 (0.93) (1.70) (5.95) (14.10) (12.74) (19.54)

8 8.34 6.75 6.14 25.59 22.95 19.26 64.99 (2.16) (1.93) (2.41) (12.48) (19.88) (18.27) (23.95)

10 7.14 6.11 23.29 24.07 17.15 48.47 (1.64) (2.24) (9.54) (6.58) (5.12) (15.98)

13 8.13 7.42 35.49 12.56 10.38 71.52 (2.41) (2.83) (23.74) (6.83) (6.14) (14.52)

9 8.80 6.25 40.28 84.46 77.60 9.72 (0.34) (3.22) (17.30) (11.67) (15.88) (5.91)

12 9.52 7.40 58.94 44.65 33.30 31.25 12.57 (0.75) (1.48) (25.98) (8.69) (8.99) (8.64)

16 (3.43) 10.12 8.35 88.14 22.38 14.21 51.84 (1.84) (1.58) (44.53) (8.90) (6.81) (14.41)

20 10.50 9.59 184.24 8.51 5.81 73.40 (3.32) (3.50) (109.35) (6.82) (5.32) (13.11)

20 18.03 13.17 82.17 69.42 54.06 12.34 (1.32) (4.91) (14.58) (6.92) (9.37) (4.47)

30 24.20 19.29 112.13 44.08 27.43 26.80 (1.85) (3.85) (30.32) (8.81) (8.55) (7.90)

40 34.44 28.63 23.12 136.51 31.50 18.32 39.49 (3.05) (2.56) (4.59) (46.59) (8.98) (7.30) (12.07)

50 32.48 27.90 168.19 19.76 10.54 53.82 (2.76) (4.31) (41.73) (7.54) (4.51) (13.74)

60 34.33 30.33 197.26 12.82 5.67 60.95 (3.50) (4.89) (49.03) (6.62) (3.21) (14.99)

(AL) and rejected losses (RL) denote losses on accepted instances and rejected instances, and they are defined as

Pn i=1 I[r(xi)>0](h(xi) yi)2

Pn i=1 I[r(xi)>0] and

Pn i=1 I[r(xi) 0](h(xi) yi)2

Pn i=1 I[r(xi) 0] . We also present the false rejection rate (AR) and false acceptance rate (RA) similar to false negative and false positive, which denote the rate of instances that should be accepted that are rejected and the ratio of instances that should be rejected that are accepted, and they are defined as

Pn i=1 I[(h(xi) yi)2<c(xi)]I[r(xi) 0] Pn i=1 I[(h(xi) yi)2<c(xi)] and Pn i=1 I[(h(xi) yi)2 c(xi)]I[r(xi)>0] Pn i=1 I[(h(xi) yi)2 c(xi)] . It is worth noting that the optimal pair (h , r ) is unknown, so AR and RA are for the current regressor and rejector. We also provide the results under supervised regression method (Sup) that directly trains the model with MSE from fully training set.

5.4 Formulation of Surrogates and Setting of Rejection Costs

In our experiments, we consider a variety of binary classification loss functions, such as mean absolute error (MAE), mean square loss, logistic loss, sigmoid and hinge loss. The rejection cost

5 10 15 20 25 Rejection cost c

Hinge Logistic Square Sigmoid Mae

(a) Rc RLoss on Breast

5 10 15 20 25 Rejection cost c

Hinge Logistic Square Sigmoid Mae

(b) AL on Breast

3 4 5 6 Rejection cost c

Hinge Logistic Square Sigmoid Mae

(c) Rc RLoss on Abalone

3 4 5 6 Rejection cost c

Hinge Logistic Square Sigmoid Mae

(d) AL on Abalone

Figure 1: Figures (a) and (b) report the Rc RLoss and AL with different rejection cost when different binary classification losses are equipped to the Breast dataset, respectively. Figures (c) and (d) report the Rc RLoss and AL with different rejection costs when different binary classification losses are equipped to the Abalone dataset, respectively.

c(x) is considered as a constant, which is the most commonly considered scenario in learning with rejection [6, 8, 36, 11]. For each dataset, we set various rejection costs c including extreme cases and unstressed cases depending on the supervised loss. The complete experiments are provided in Appendix D.

5.5 Experimental Performance

Table 1, Table 2 and Table 3 show the experimental results when our surrogate loss function equipped with MAE on the Age DB, Breast Path Q, and UCI datasets, respectively. From these three tables, we have the following observations: (1) Our proposed method significantly outperforms the supervised regression method in almost all cases, which validates the effectiveness of our method by rejecting difficult test instances. (2) In most cases, the average loss of our method in accepted test instances is always smaller than the average loss of the supervised regression model in all test instances. This further indicates the ability of our method to identify hard-to-predict instances and reject them. (3) As the rejection cost c increases, we can clearly see the following trends in all datasets: Rc RLoss decreases; Rejection rate decrease; Accepted loss increases. This is because as the prediction error we can accept increases, i.e., the rejection cost increases, the rejector will accept more instances, leading to a decrease in the rejection rate. However, the regressor capacity remains the same and more instances (containing difficult instances) also face more challenges, so Rc RLoss and AL increase but remain smaller than Sup. (4) In setting the rejection cost c, we consider many extreme cases, e.g., the rejection cost is much smaller and much larger than the average loss in the supervised regression. In such extreme cases, our approach is still effective to identify and reject difficult test instances. In addition, we show in Figure 1 a plot of Rc RLoss with the increasing rejection cost when different binary classification losses are equipped on the Breast and Abalone datasets. As the rejection cost c increases, both AL loss and Rc RLoss also increase.

5.6 Comparison with Selective Regression Methods

Since our paper provides the first attempt to investigate regression with cost-based rejection, there does not exist a baseline that can be directly compared. However, the Bayes optimal solution of regression with cost-based rejection is the same as the Bayes optimal solution of selective regression, so we can compare with selective regression methods. We have conducted additional experiments to compare with Selective Net [20] and Knn-Plug [46]. For Selective Net, this work proposes a neural network architecture and an optimization goal to control the rejection threshold of the rejector to ensure a specific rejection rate. To ensure a fair comparison, we use the same base model for all datasets. For Knn-Plug, this method proposes a semi-supervised learning process for learning a data-driven predictor with a reject option based on the plug-in principle. Specifically, this method learns a regression function h and a conditional variance function σ from the labeled dataset, while the unlabeled dataset is used to calibrate the conditional variance function σ to ensure the desired rejection rate. For each dataset, we randomly split the original dataset into a labeled training set, an unlabeled training set, and a test set by the proportions of 60% (the same size of labeled training set as ours), 20%, and 20%, respectively. To ensure fairness, we replace the original base model Knn with MLP and named MLP-Plug.

Table 4: Comparison with Selective Net and MLP-Plug. The cost is the rejection cost in regression with costbased rejection and Rj is the expected rejection rate in selective regression.

Datasets Rc R_MAE MLP Plug Selective Net Cost Rc RLoss AL Rej Rj AL Rej Rj AL Rej

3 2.41 1.99 42.04 42.04 2.13 40.93 42.00 2.10 42.80 (0.12) (0.21) (3.18) (0.22) (2.70) (0.19) (1.84)

4 2.88 2.30 33.70 33.70 2.36 32.30 33.00 2.40 34.26 (0.13) (0.21) (2.47) (0.20) (2.89) (0.22) (1.52)

5 3.22 2.66 23.43 23.43 2.72 22.00 23.00 2.84 25.10 (0.23) (0.35) (2.94) (0.23) (3.69) (0.42) (1.71)

6 3.53 2.93 19.32 19.32 2.94 17.64 19.00 2.92 21.20 (0.25) (0.35) (3.47) (0.31) (2.71) (0.39) (2.00)

4 3.64 2.99 56.92 56.92 4.83 56.08 56.00 3.58 57.44 (0.29) (0.83) (13.00) (2.28) (6.53) (1.91) (5.06)

6 4.83 3.83 37.31 37.31 5.64 36.46 37.00 6.29 35.13 (0.93) (1.70) (14.10) (2.14) (9.47) (2.48) (6.34)

8 6.75 6.14 22.95 22.95 6.30 23.67 23.00 6.86 23.59 (1.93) (2.41) (19.88) (2.03) (9.82) (2.08) (5.91)

13 8.13 7.42 12.56 12.56 6.68 10.63 13.00 7.79 11.15 (2.41) (2.83) (6.83) (1.93) (3.54) (2.29) (3.49)

9 8.80 6.25 84.46 84.46 8.56 84.90 84.00 7.70 86.88 (0.34) (3.22) (11.67) (4.70) (8.22) (2.22) (4.56)

12 9.52 7.40 44.65 44.65 9.76 44.80 45.00 8.20 46.83 (0.75) (1.48) (8.69) (2.82) (8.62) (2.11) (5.86)

16 10.12 8.35 22.38 22.38 10.52 22.55 22.00 8.73 26.44 (1.84) (1.58) (8.90) (3.70) (6.50) (1.64) (4.52)

20 10.50 9.59 8.51 8.51 11.16 10.20 9.00 8.79 11.39 (3.32) (3.50) (6.82) (3.72) (4.14) (1.67) (2.44)

20 18.03 13.17 69.42 69.00 18.96 71.17 69.00 14.44 71.60 (1.32) (4.91) (6.92) (5.71) (4.93) (4.42) (3.16)

30 24.20 19.29 44.08 44.00 25.13 48.20 44.00 21.48 49.27 (1.85) (3.85) (8.81) (4.57) (8.23) (3.77) (2.54)

40 28.63 23.12 31.50 32.00 26.69 33.74 31.00 25.02 36.99 (2.56) (4.59) (8.98) (4.83) (7.59) (1.67) (2.46)

50 32.48 27.90 19.76 20.00 29.26 21.36 20.00 28.66 28.54 (2.76) (4.31) (7.54) (4.56) (5.37) (3.23) (2.91)

60 34.33 30.33 12.82 13.00 31.23 14.90 12.00 30.51 20.00 (3.50) (4.89) (6.62) (4.65) (4.81) (3.90) (3.59)

Table 4 shows the results of comparison experiments. Specifically, to be able to establish a connection between our studied Rc R and selective regression, we set the expected rejection rate (Rj) of the selective regression method based on the results of Rc R_MAE (our surrogate loss equipped MAE). It is important to note that there is no way to perfectly match the rejection rate, as the set Rj is "expected". In addition, we show plots of the variation in AL loss for all methods with different rejection rates in Appendix D.7. As can be seen Table 4, our proposed method outperforms (smaller AL loss with the same rejection rate) almost all compared methods, which validates the effectiveness of our method.

6 Conclusion

In this paper, we investigated a novel regression problem called regression with cost-based rejection, which aims to learn a model that can reject predictions to avoid critical mispredictions at a certain rejection cost. In order to solve this problem, we first formulated the expected risk for regression with cost-based rejection and derived the Bayes optimal solution for the expected risk, which shows that we should reject instances where the variance is greater than the rejection cost. Since the variance is difficult to obtain, we proposed a surrogate loss function that considers rejection as binary classification. Further, we provided conditions for the consistency of our method, implying that the optimal solution can be recovered by our method. Finally, we derived the regret transfer bound and the estimation error bound for our method and conducted extensive experiments on various datasets to demonstrate the effectiveness of our proposed method. We expect that our first study of a simple yet theoretically grounded method for regression with cost-based rejection can inspire more interesting studies on this new problem.

Acknowledgements

Lei Feng is supported by Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, CAAI-Huawei Mind Spore Open Fund, and Openl Community (https://openi.pcl.ac.cn). Bo An is supported by the National Research Foundation, Singapore under its Industry Alignment Fund Pre-positioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

[1] P. L. Bartlett, M. I. Jordan, and J. D. Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006.

[2] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of the American Statistical Association, 3:463 482, 2002.

[3] P. L. Bartlett and M. H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(8), 2008.

[4] E. Beede, E. Baylor, F. Hersch, A. Iurchenko, L. Wilcox, P. Ruamviboonsuk, and L. M. Vardoulakis. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In CHI, pages 1 12, 2020.

[5] R. K. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, et al. Ai fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. ar Xiv preprint ar Xiv:1810.01943, 2018.

[6] Y. Cao, T. Cai, L. Feng, L. Gu, J. Gu, B. An, G. Niu, and M. Sugiyama. Generalizing consistent multi-class classification with rejection to be compatible with arbitrary losses. In Neur IPS, pages 521 534, 2022.

[7] I. Chalkidis, I. Androutsopoulos, and N. Aletras. Neural legal judgment prediction in english. In ACL, 2019.

[8] N. Charoenphakdee, Z. Cui, Y. Zhang, and M. Sugiyama. Classification with rejection based on cost-sensitive classification. In ICML, pages 1507 1517, 2021.

[9] K. Chaudhuri, P. Jain, and N. Natarajan. Active heteroscedastic regression. In ICML, pages 694 702, 2017.

[10] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41 46, 1970.

[11] C. Cortes, G. De Salvo, and M. Mohri. Boosting with abstention. In Neur IPS, 2016.

[12] C. Cortes, G. De Salvo, and M. Mohri. Learning with rejection. In ALT, pages 67 82, 2016.

[13] D. Dua and C. Graff. UCI machine learning repository, 2017.

[14] R. El-Yaniv et al. On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11(5), 2010.

[15] J. Finocchiaro, R. Frongillo, and B. Waggoner. An embedding framework for consistent polyhedral surrogates. In Neur IPS, 2019.

[16] V. Franc and D. Prusa. On discriminative learning of prediction uncertainty. In ICML, pages 1963 1971, 2019.

[17] A. Gangrade, A. Kag, and V. Saligrama. Selective classification via one-sided prediction. In AISTATS, pages 2179 2187, 2021.

[18] W. Gao and Z.-H. Zhou. On the consistency of auc pairwise optimization. ar Xiv preprint ar Xiv:1208.0645, 2012.

[19] Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In Neur IPS, 2017.

[20] Y. Geifman and R. El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. In ICML, pages 2151 2159, 2019.

[21] X. Geng, C. Yin, and Z.-H. Zhou. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401 2412, 2013.

[22] Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Neur IPS, 2008.

[23] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362 386, 2020.

[24] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321 1330, 2017.

[25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770 778, 2016.

[26] C. M. Jarque and A. K. Bera. Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6(3):255 259, 1980.

[27] W. Jiang, Y. Zhao, and Z. Wang. Risk-controlled selective prediction for regression deep neural network models. In IJCNN, pages 1 8, 2020.

[28] M. A. Kadampur and S. Al Riyaee. Skin cancer detection: Applying a deep learning based model driven architecture in the cloud for classifying dermal cell images. Informatics in Medicine Unlocked, 18:100282, 2020.

[29] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard. Most likely heteroscedastic gaussian process regression. In ICML, pages 393 400, 2007.

[30] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic gaussian process regression. In ICML, pages 841 848, 2011.

[31] J. K. Lee, Y. Bu, D. Rajan, P. Sattigeri, R. Panda, S. Das, and G. W. Wornell. Fair selective classification via sufficiency. In ICML, 2021.

[32] H. Liu, Y.-S. Ong, and J. Cai. Large-scale heteroscedastic regression via gaussian process. IEEE Transactions on Neural Networks and Learning Systems, 32(2):708 721, 2020.

[33] A. L. Martel, S. Nofech-Mozes, S. Salama, S. Akbar, and M. Peikari. Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [data], 2019.

[34] A. Maurer. A vector-contraction inequality for rademacher complexities. In ALT, pages 3 17, 2016.

[35] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected, in-the-wild age database. In CVPRW, pages 1997 2005, 2017.

[36] C. Ni, N. Charoenphakdee, J. Honda, and M. Sugiyama. On the calibration of multiclass classification with rejection. In Neur IPS, 2019.

[37] T. Pietraszek. Optimizing abstaining classifiers using roc analysis. In ICML, pages 665 672, 2005.

[38] A. Pugnana. Topics in selective classification. In AAAI, pages 16129 16130, 2023.

[39] H. G. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with a reject option. ar Xiv preprint ar Xiv:1505.04137, 2015.

[40] H. G. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530 554, 2018.

[41] A. Shah, Y. Bu, J. K. Lee, S. Das, R. Panda, P. Sattigeri, and G. W. Wornell. Selective regression under fairness criteria. In ICML, pages 19598 19615, 2022.

[42] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1):1 21, 1985.

[43] H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817 838, 1980.

[44] Y. Wiener and R. El-Yaniv. Pointwise tracking the optimal regression function. In Neur IPS, 2012.

[45] M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11(1), 2010.

[46] A. Zaoui, C. Denis, and M. Hebiri. Regression with reject option and application to knn. In Neur IPS, pages 20073 20082, 2020.

[47] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225 1251, 2004.

[48] Y. Zoabi, S. Deri-Rozov, and N. Shomron. Machine learning-based prediction of covid-19 diagnosis based on symptoms. npj Digital Medicine, 4(1):1 5, 2021.

A Proof of Theorem 2

For an instance x, we have the following expected risk for x:

RRc R|x(h, r) = Ep(y|x)[L(h, r, c, x, y)]

Y p(y|x)L(h, r, c, x, y)dy

If we refuse to make a prediction for x, i.e., r(x) < 0, the above expected risk transforms into the following equation:

RRc R|x,r(x)<0(h, r) = Z

Y p(y|x)L(h, r, c, x, y)dy

Y p(y|x)c(x)dy

When we make a prediction for x, i.e., r(x) > 0, the above expected risk transforms into the following equation:

RRc R|x,r(x)>0(h, r) = Z

Y p(y|x)L(h, r, c, x, y)dy

Y p(y|x)(h(x) y)2dy

Y p(y|x)(h(x)2 2yh(x) + y2)dy

Y p(y|x)h(x)2dy Z

Y p(y|x)2yh(x)dy + Z

Y p(y|x)y2dy

= h2(x) 2h(x)Ep(y|x)[y] + Ep(y|x)[y2]

= h2(x) 2h(x)Ep(y|x)[y] + E2 p(y|x)[y] + Dp(y|x)[y]

= (h(x) Ep(y|x)[y])2 + Dp(y|x)[y]

When h(x) = Ep(y|x)[y] makes RRc R|x,r(x)>0 = Dp(y|x)[y] minimum. It is easy to know that RRc R|x,r(x)>0 < RRc R|x,r(x)<0 when c(x) Dp(y|x)[y] > 0 and RRc R|x,r(x)>0 > RRc R|x,r(x)<0 when c(x) Dp(y|x)[y] < 0 which means that RRc R|x is minimum when the following equation holds.

r (x) = I(c(x) Dp(y|x)[y]).

The proof is completed.

B Proofs of Consistent and Calibration

B.1 Proof of Theorem 5

First, we prove that the optimal regressor h is also an optimal regressor for Rψ Rc R as follows.

Rψ Rc R(h , r) = Ep(x,y)[ψ(h , r, c, x, y)]

= Ep(x,y)[(h (x) y)2ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]

= Ep(x,y)[(h (x)2 2yh (x) + y2)ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]

Y p(x, y)[(h (x)2 2yh (x) + y2)ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]dydx

Y p(y|x)p(x)[(h (x)2 2yh (x) + y2)ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]dydx

X p(x)[(h (x)2 Z

Y 2yh (x)p(y|x)dy + Z

Y y2p(y|x)dy)ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]dx

X p(x)[(h (x)2 2h (x)Ep(y|x)[y] + Ep(y|x)[y2])ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]dx

X p(x)[(h (x)2 2h (x)Ep(y|x)[y] + E2 p(y|x)[y] + Dp(y|x)[y])ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]dx

X p(x)[((h (x)2 Ep(y|x)[y])2 + Dp(y|x)[y])ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]dx, (12)

where h (x) = Ep(y|x)[y] makes the above risk minimal when x X, ℓ(r(x), z) 0, and thus h is an optimal regressor for Rψ Rc R, with any rejector r. On the other hand, we prove that h is the only optimal regressor if the condition: x X, ℓ(r(x), z) > 0 is achieved.

First we prove that h is not the only optimal regressor when the condition: x X, ℓ(r(x), z) 0 is achieved by contradiction. Specifically, we assume that there is at least one other regressor that makes risk minimize. Suppose given an instance x0 and a rejector r such that ℓ(r (x0), 1) = 0, hence we have at least one other regressor h such that Rψ Rc R(h , r ) = Rψ Rc R(h , r ) and h (x0) = h (x0) due to the following equation holds:

Dp(y|x0)[y]ℓ(r (x0), 1) = 0. (13)

The above equation implies that there exist multiple regressors that minimize the risk when the condition: x X, ℓ(r(x), z) 0 is achieved. This is mainly due to the binary classification loss ℓ(r(x), z) = 0. However, we can easily know that h is the only optimal regressor when the condition: x X, ℓ(r(x), z) > 0 is achieved, since there are no ignored terms in Eq. (12).

B.2 Proof of Theorem 6

Let us go back to the discussion of Eq. (12):

Rψ Rc R(h , r) = Z

X [((h (x)2 Ep(y|x)[y])2 + Dp(y|x)[y])ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]p(x)dx

X Dp(y|x)[y]ℓ(r(x), 1)p(x) + c(x)ℓ(r(x), +1)p(x)dx.

Similarly to the proof of Theorem 5, the optimal regressor h still minimizes the risk for any rejector. However, it is easy to know that for a rejector r0 when there exists an instance x0 such that ℓ(r (x0), 1) = 0, there exists at least one other regressor h such that Rψ Rc R(h , r ) = Rψ Rc R(h , r ) and h (x0) = h (x0) due to ((h(x)2 Ep(y|x)[y])2 + Dp(y|x)[y])ℓ(r(x), 1) = 0 always holds. Therefore, the optimal regressor h is not the only optimal solution. Fortunately, we can still show that it is regressor-consistent for some instances in this case.

For a binary classification loss function ℓ, we denote by X 1 ℓthe space where x X 1 ℓ, ℓ(r(x), 1) = 0 for any rejector r. It is worth noting that when ℓ(r(x), 1) = 0, we can infer r(x) < 0, which means that the instance x will be rejected. Then we have the following equation:

Rψ Rc R(h, r) = Z

X [((h(x) Ep(y|x)[y])2 + Dp(y|x)[y])ℓ(r(x), 1) + c(x)ℓ(r(x), +1)]p(x)dx

X 1 ℓ (h(x) Ep(y|x)[y])2ℓ(r(x), 1)p(x)dx

X 1 ℓ Dp(y|x)[y]ℓ(r(x), 1)p(x) + c(x)ℓ(r(x), +1)p(x)dx.

When for all x X 1 ℓ, we have the above risk is minimised when h(x) = Ep(y|x)[y] for all x X 1 ℓ. In this case, we obtain a similar form as the proof of Theorem 5. Therefore, we have proved that h ψ is consistent (i.e., h ψ = h ) for accepted instances when the loss ℓis non-negative. The proof is completed.

B.3 Proof of Theorem 7

By fixing the optimal regressor h , let us discuss the pointwise version of Eq. (12): Ep(y|x)[ψ(h , r, c, x, y)] = Dp(y|x)[y]ℓ(r(x), 1) + c(x)ℓ(r(x), +1).

The rejector r(x) that minimizes the expected value is given by min r(x) Dp(y|x)[y]ℓ(r(x), 1) + c(x)ℓ(r(x), +1)

If the binary classification loss ℓ(r(x), z) is classification-calibrated, then it is easy to see that the optimal rejector r ψ should have the same sign with Dp(y|x)[y] c(x) by the definition of classification calibration. When Dp(y|x)[y] < c(x), the optimal rejector r ψ(x) must have a positive sign, and when Dp(y|x)[y] c(x), the optimal rejector r ψ(x) must have a negative sign. The proof is completed.

C Proofs of the Regret Transfer Bound and the Estimation Error Bound

C.1 Proof of Theorem 8

Proof. Without loss of generality, we discuss the regret transfer on a point x s conditional risk and it can be generalized to the whole distribution using Jensen s inequality of concave functions. We denote by D = Dp(y|x)[y], D = Ep(x,y)[(h(x) y)2], α = r(x), and c = c(x). Notice we are using margin loss ℓ, thus we denote by ℓ(α, 1) = ϕ(α) and ℓ(α, 1) = ϕ(α).

Then we can denote our conditional risk as R( D, α) = Dϕ( α)+cϕ(α), where Ep(x)[R( D, α)] = RRc R(h, r). According to our previous discussion, R( D, α) achieves its minimum at (D, α ), where α = argminα R Dϕ( α) + cϕ(α). We can learn that D D. Then we can write our point-wise regret as:

R( D, α) R(D, α ) = Dϕ( α) + cϕ(α) Dϕ( α ) cϕ(α ) Then we can begin our proof. There are three cases that the point-wise regret of our target loss is non-zero:

When D < c but α 0, the point-wise regret of target loss is c D.

In this case: R( D, α) R(D, α ) = Dϕ( α) + cϕ(α) Dϕ( α ) cϕ(α ) Dϕ( α) + cϕ(α) Dϕ( α ) cϕ(α )

(D + c) D D + cϕ( α) + c D + cϕ(α) D D + cϕ( α ) c D + cϕ(α )

(D + c)ξ 1 c D

When D > c but α 0, the point-wise regret of target loss is D c.

In this case, denote by α = argmin R( D, α), we can learn that R(D, α ) R(D, α ):

R( D, α) R(D, α ) R( D, α) R(D, α )

= Dϕ( α) + cϕ(α) Dϕ( α ) cϕ( α )

Dϕ( α) + cϕ(α) Dϕ( α ) cϕ( α )

" D D + c ϕ( α) + c D + c ϕ(α) D D + c ϕ( α ) c D + c ϕ( α )

( D + c)ξ 1 D c

When D < c and α 0, the point-wise regret of target loss is D D.

In this case, we should notice that α > 0. We further split this case into two cases:

When D > c, α < 0:

R( D, α) R(D, α ) = Dϕ( α) + cϕ(α) Dϕ( α ) cϕ(α )

= Dϕ( α) + cϕ(α) Dϕ( α ) cϕ( α ) + Dϕ( α ) + cϕ( α ) Dϕ( α ) cϕ(α )

Dϕ( α) + cϕ(α) Dϕ( α ) cϕ( α ) + (Dϕ( α ) + cϕ( α ) Dϕ( α ) cϕ(α ))

( D + c)ξ 1 D c

+ (D + c)ξ 1 c D

When D c, we can learn α 0 denote by α = argminα Dϕ(α) + Dϕ( α), and we can learn α 0:

R( D, α) R(D, α ) = Dϕ( α) + cϕ(α) Dϕ( α ) cϕ(α )

Dϕ( α ) + cϕ( α ) Dϕ( α ) cϕ(α )

( D + c)H c D + c

+ (D + c)H c D + c

Notice that for sigmoid loss and hinge loss, ξ 1(α) = |α|, H(α) = 2(1 α). For logistic loss, ξ 1 = 1

2α2 and H(α) = α log(α) (1 α) log(1 α). Denote by γ the point-wise regret of target loss. Using the linear bound of sigmoid loss and hinge loss, we can learn:

R( D, α) R(D, α ) γ.

For logistic loss, the derivation is more complicated. In the first two cases:

R( D, α) R(D, α ) D + c

2(M + c)(R( D, α) R(D, α )) γ. In the third case, we can learn:

R( D, α) R(D, α ) D + c

1 2( D + c) (( D c)2 + (c D)2)

1 4( D + c) ( D D)2

and thus 2 q

(M + c)(R( D, α) R(D, α )) γ. In the last case:

R( D, α) R(D, α ) ( D + c)H c D + c

+ (D + c)H c D + c

c log c D + c

+ c log c D + c

+ D log D D + c

1 + D D D + c

+ D log 1 + c

D log 1 + c

1 + D D D + c

D+c 1 + D D

Combining these cases, we can learn

γ max{2(R( D, α) R(D, α )), 2 q

(M + c)(R( D, α) R(D, α ))}.

C.2 Proof of Theorem 9

Definition 10. (Rademacher complexity) Let Z1, , Zn be n i.i.d. random variables drawn from a probability distribution µ and F = {f : Z R} be a class of measurable functions. Then the expected Rademacher complexity of function class F is given as follows:

Rn(F) = EZ1, ,Zn µEσ

i=1 σif(Zi)

where σ1, , σn are the Rademacher variables that take the value from { 1, +1} uniformly.

Then we can begin proving Theorem 9.

Proof. Suppose that the binary loss is bounded by M1 and ρ-Lipschitz continuous, |h|, c(x), |y| is bounded by M2 and the hypothesis space H and R is strong enough, then we can learn that the total loss is bounded by C1 = (4M 2 2 + M2)M1, and is L1-Lipschitz continuous w.r.t. (h, r), where L1 = p

(4M 2 1 ρ + M1ρ)2 + 16M 4 1 M 2 2 . By applying the Mc Diarmid s inequality, it is routine to show that the following inequalities hold with probability at least 1 δ

2, respectively:

sup h,r H,R

Rψ Rc R(h, r) ˆRψ Rc R(h, r) E x1, ,xn

sup h,r H,R

Rψ Rc R(h, r) ˆRψ Rc R(h, r) #

sup h,r H,R

ˆRψ Rc R(h, r) Rψ Rc R(h, r) E x1, ,xn

sup h,r H,R

ˆRψ Rc R(h, r) Rψ Rc R(h, r) #

By applying Talagrand s contraction lemma [34], we can learn that:

sup h,r H,R

ˆRψ Rc R(h, r) Rψ Rc R(h, r) #

2L1(Rn(H) + Rn(R))

and this conclusion also holds for another direction. Plugging this conclusion into the former inequalities and using the union bound, we can learn this inequality holds with probability at least 1 δ:

sup h,r H,R

ˆRψ Rc R(h, r) Rψ Rc R(h, r)

2L1(Rn(H) + Rn(R)) + C1

According to the definition of empirical risk minimization and identifiable condition, we can get the following conclusion:

Rψ Rc R(ˆh, ˆr) min h,r H,R Rψ Rc R(h, r) = Rψ Rc R(ˆh, ˆr) Rψ Rc R(h , r )

= Rψ Rc R(ˆh, ˆr) ˆRψ Rc R(ˆh, ˆr) + ˆRψ Rc R(ˆh, ˆr) ˆRψ Rc R(h , r ) + ˆRψ Rc R(h , r ) Rψ Rc R(h , r )

Rψ Rc R(ˆh, ˆr) ˆRψ Rc R(ˆh, ˆr) + ˆRψ Rc R(h , r ) Rψ Rc R(h , r )

2 sup h,r H,R

ˆRψ Rc R(h, r) Rψ Rc R(h, r) .

Combining Theorem 7, we can conclude the proof.

Table 5: Test performance (mean and std) of our surrogate loss equipped hinge loss on Breast Path Q. We repeat the sampling-and-training process 5 times. The metrics Rej, AR, RA are scaled to 0-100 and Sup, Rc RLoss, AL, and RL are all magnified by a factor of 1000.

Cost Sup Rej AL RL Rej AR RA

5 4.74 3.51 53.05 80.86 60.37 4.19 (0.38) (1.96) (20.33) (4.17) (6.70) (2.36)

10 8.32 4.58 58.90 68.99 46.63 5.91 (0.21) (1.74) (13.15) (4.71) (6.45) (2.54)

15 11.89 6.44 49.42 62.45 46.86 10.69 16.77 (0.31) (1.82) (8.12) (4.76) (5.12) (3.84)

20 (1.22) 15.07 9.53 49.58 52.33 38.04 17.88 (0.33) (1.15) (8.10) (3.94) (3.47) (5.36)

25 16.54 10.36 58.39 41.23 29.71 25.34 (0.78) (2.39) (19.85) (8.36) (7.36) (12.36)

Table 6: Test performance (mean and std) of our surrogate loss equipped hinge loss on Age DB. We repeat the sampling-and-training process 5 times. The metrics Rej, AR, RA are scaled to 0-100.

Cost Sup Rej AL RL Rej AR RA

60 59.99 44.80 177.36 97.30 95.84 1.43 (0.10) (13.97) (40.19) (2.16) (3.19) (1.17)

70 70.24 71.81 185.68 92.41 88.90 4.32 (0.50) (4.61) (26.75) (1.20) (1.75) (0.74)

80 79.67 76.43 185.14 87.23 82.63 7.83 (1.40) (12.86) (18.51) (2.08) (2.63) (1.88)

90 100.34 88.71 76.78 166.84 83.43 79.46 11.19 (3.73) (1.08) (8.93) (5.90) (11.01) (12.07) (9.13)

100 96.95 77.02 182.70 84.78 80.39 9.14 (0.67) (7.46) (13.20) (6.77) (7.29) (5.49)

110 104.29 85.84 192.05 73.52 67.73 17.41 (0.31) (6.98) (26.75) (10.39) (10.33) (9.56)

120 111.54 92.59 186.50 67.31 61.11 21.74 (2.23) (4.79) (13.73) (11.60) (11.56) (10.43)

Table 7: Test performance (mean and std) of our surrogate loss equipped logistic loss on five UCI datasets trained with the Linear model. We repeat the sampling-and-training process 10 times. The metrics Rej, AR, and RA are scaled to 0-100.

Datasets Cost Supervised Rej AL RL Rej AR RA

3 2.52 1.94 7.84 54.77 44.76 24.00 (0.08) (0.21) (1.06) (2.66) (3.45) (1.93)

4 2.99 2.39 9.83 36.93 27.94 36.55 4.92 (0.11) (0.19) (1.44) (2.78) (3.02) (2.83)

5 (0.51) 3.38 2.80 11.78 25.90 18.43 46.24 (0.18) (0.25) (1.86) (2.27) (2.08) (3.20)

6 3.69 3.19 13.81 17.80 11.89 55.08 (0.26) (0.31) (2.00) (1.98) (1.45) (4.19)

9 8.83 7.55 26.44 85.58 78.07 7.24 (0.35) (2.64) (1.99) (6.10) (8.53) (3.99)

12 11.31 8.76 27.59 79.93 71.90 9.74 (0.49) (2.18) (2.06) (3.65) (5.57) (1.97)

16 14.46 10.90 30.60 69.47 60.88 17.14 23.32 (0.51) (1.46) (2.13) (6.49) (7.17) (5.66)

20 (1.54) 17.10 13.01 33.51 58.54 50.38 26.19 (0.83) (1.79) (4.17) (5.07) (5.67) (6.54)

25 19.62 15.95 35.26 39.30 34.28 47.29 (1.26) (2.52) (3.40) (5.76) (5.41) (8.82)

30 20.94 17.60 42.36 25.38 21.35 61.14 (1.84) (3.18) (7.32) (8.07) (6.97) (12.70)

4 3.92 2.78 15.05 79.87 73.18 15.28 (0.26) (1.68) (4.12) (11.10) (14.13) (7.25)

6 5.73 5.25 17.98 58.97 52.23 32.06 (0.70) (1.63) (6.14) (8.50) (9.69) (10.05)

8 11.66 6.73 5.72 21.58 42.56 36.08 43.16 (2.26) (0.52) (0.92) (7.67) (7.84) (8.66) (11.24)

10 7.37 5.94 26.05 31.28 25.72 53.96 (0.95) (1.61) (11.45) (12.77) (10.98) (21.76)

13 8.75 7.72 28.93 19.62 17.31 69.71 (1.64) (1.94) (12.64) (4.69) (4.60) (13.77)

9 8.65 6.95 33.87 67.92 58.56 19.28 (0.75) (3.16) (12.62) (11.51) (14.04) (7.94)

12 10.27 8.19 40.93 58.32 48.25 23.32 24.08 (1.08) (2.90) (16.74) (10.20) (13.11) (5.75)

16 (5.34) 12.34 9.20 50.31 45.35 36.24 31.42 (1.14) (2.03) (19.14) (6.04) (6.00) (6.77)

20 14.19 10.48 55.08 38.42 32.22 42.45 (1.67) (2.72) (23.26) (6.08) (6.62) (12.64)

20 19.80 10.00 204.20 97.57 95.25 1.55 (0.29) (6.44) (63.34) (2.04) (4.29) (0.86)

30 29.51 24.08 227.13 91.17 87.60 6.02 (0.92) (12.35) (99.54) (4.57) (6.54) (3.11)

40 111.12 38.09 28.95 282.98 80.34 73.83 12.05 (8.01) (1.38) (7.18) (96.63) (6.60) (7.46) (6.39)

50 46.94 34.22 242.13 75.34 68.94 15.98 (1.82) (9.57) (98.34) (10.15) (11.79) (7.51)

60 51.96 41.36 370.24 56.26 45.96 25.26 (2.08) (5.76) (113.24) (2.61) (2.24) (4.63)

D Additional Information of Experiments

D.1 Evaluation Metrics

We describe in detail all the evaluation metrics we used in our experiments.

Rc R loss. The Rc R loss (Rc RLoss) is the main evaluation metric for Rc R. For a given example (x, y) and the rejection cost c(x), the Rc RLoss defined as if r(x) > 0, L(h, r, c, x, y) = (h(x) y)2, otherwise L(h, r, c, x, y) = c(x).

Rejection rate. The rejection rate (Rej) is defined as

Pn i=1 I[r(x) 0]

n . Rej indicates the rejection rate of our model on the test dataset.

Table 8: Test performance (mean and std) of our surrogate loss equipped logistic loss on five UCI datasets trained with the MLP model. We repeat the sampling-and-training process 10 times. The metrics Rej, AR, and RA are scaled to 0-100.

Datasets Cost Supervised Rej AL RL Rej AR RA

3 2.41 1.95 8.34 43.14 33.37 33.32 (0.10) (0.18) (0.87) (2.84) (3.34) (2.81)

4 2.87 2.33 9.68 32.29 24.28 41.96 4.44 (0.18) (0.29) (1.18) (3.28) (3.29) (3.29)

5 (0.46) 3.20 2.59 10.86 25.25 17.86 45.15 (0.20) (0.29) (1.34) (2.15) (2.02) (3.61)

6 3.48 2.82 11.54 20.44 14.74 51.48 (0.25) (0.35) (1.53) (1.72) (1.42) (4.63)

9 6.78 5.05 51.45 43.68 20.35 23.75 (0.47) (0.82) (4.32) (5.19) (5.21) (3.96)

12 7.96 5.79 59.74 35.05 12.94 26.90 (0.64) (0.82) (3.53) (3.02) (2.79) (3.03)

16 9.04 7.46 68.20 18.57 7.36 48.00 12.96 (0.56) (0.47) (10.78) (4.08) (3.14) (7.21)

20 (2.60) 9.64 7.81 74.27 15.05 5.83 47.91 (0.74) (0.44) (10.19) (7.76) (1.91) (11.27)

25 10.47 8.50 80.28 12.03 3.78 49.00 (1.08) (0.52) (27.67) (4.43) (1.89) (17.65)

30 10.95 8.95 89.30 9.50 2.93 50.67 (1.11) (0.78) (31.58) (3.86) (1.10) (18.37)

4 3.85 3.22 11.99 62.44 54.18 25.19 (0.56) (1.51) (3.81) (10.72) (9.53) (13.29)

6 5.33 4.67 15.08 43.01 35.19 41.25 (0.82) (1.36) (3.83) (16.01) (16.09) (15.35)

8 8.34 6.53 5.86 19.34 29.49 23.19 53.57 (2.16) (1.18) (1.53) (7.60) (13.45) (13.60) (14.78)

10 7.06 6.42 21.71 17.95 14.15 65.59 (1.60) (1.91) (8.88) (3.63) (3.48) (11.17)

13 7.80 7.04 28.55 13.59 11.05 70.15 (1.90) (2.09) (14.96) (5.35) (5.32) (14.83)

9 8.60 8.43 45.60 26.57 19.05 66.95 (2.49) (3.15) (27.02) (5.78) (5.63) (10.39)

12 9.50 8.63 63.52 25.44 17.38 52.68 12.57 (1.56) (2.10) (26.15) (6.43) (5.84) (12.58)

16 (3.43) 9.30 8.03 90.19 15.45 10.84 62.99 (1.37) (1.70) (38.08) (4.30) (3.87) (9.05)

20 9.67 8.33 103.55 11.18 8.71 70.06 (1.40) (1.77) (54.73) (2.97) (2.70) (8.59)

20 18.65 14.32 58.33 68.93 57.72 16.19 (1.41) (4.80) (15.88) (13.47) (17.08) (9.45)

30 25.64 23.16 80.43 32.85 19.30 48.23 (2.50) (4.95) (14.47) (12.82) (10.81) (15.91)

40 34.44 29.79 25.25 107.54 30.24 19.97 44.38 (3.05) (2.33) (3.55) (22.18) (8.58) (7.91) (9.87)

50 31.63 25.79 120.02 24.22 15.29 47.42 (4.29) (4.22) (24.12) (11.22) (10.47) (10.35)

60 34.04 33.26 165.71 2.77 2.14 92.59 (4.48) (4.55) (62.18) (5.13) (2.93) (12.85)

Accepted loss. The accepted loss (AL) is defined as

Pn i=1 I[r(xi)>0](h(xi) yi)2

Pn i=1 I[r(xi)>0] . AL denotes the average loss of our regressor on the accepted test dataset.

Rejected loss. The rejected loss (RL) is defined as

Pn i=1 I[r(xi) 0](h(xi) yi)2

Pn i=1 I[r(xi) 0] . RL denotes the average loss of our regressor on the rejected test dataset.

False rejection rate. The false rejection rate (AR) is defined as

Pn i=1 I[(h(xi) yi)2<c(xi)]I[r(xi) 0] Pn i=1 I[(h(xi) yi)2<c(xi)] . AR denotes the rate of instances that should be accepted that are rejected.

Table 9: Test performance (mean and std) of our surrogate loss equipped logistic loss on Breast Path Q. We repeat the sampling-and-training process 5 times. The metrics Rej, AR, RA are scaled to 0-100 and Sup, Rc RLoss, AL and RL are all magnified by a factor of 1000.

Cost Sup Rc RLoss AL RL Rej AR RA

5 4.41 2.92 35.59 71.56 47.48 5.71 (0.35) (1.51) (7.36) (3.20) (5.95) (3.02)

10 7.99 4.72 41.34 61.72 40.69 10.30 (0.47) (1.01) (9.74) (9.54) (4.90) (1.02)

15 16.77 11.52 7.98 42.67 50.48 35.50 20.18 (1.22) (0.36) (1.36) (5.87) (5.85) (4.61) (7.72)

20 13.69 8.92 63.51 43.65 31.11 22.61 (0.81) (1.05) (49.02) (5.24) (5.00) (5.72)

25 16.64 12.96 35.27 29.84 23.63 44.77 (0.94) (2.28) (2.63) (5.93) (5.42) (10.54)

Table 10: Test performance (mean and std) of our surrogate loss equipped square loss on Breast Path Q. We repeat the sampling-and-training process 5 times. The metrics Rej, AR, RA are scaled to 0-100 and Sup, Rc RLoss, AL and RL are all magnified by a factor of 1000.

Cost Sup Rc RLoss AL RL Rej AR RA

5 4.67 3.60 36.70 69.23 44.09 6.26 (0.41) (1.17) (4.46) (4.94) (4.16) (3.25)

10 8.13 5.30 43.66 59.69 37.47 10.42 (0.38) (0.73) (6.95) (2.59) (2.40) (2.60)

15 12.02 8.83 39.83 51.70 36.78 17.65 16.77 (1.09) (2.14) (8.27) (2.56) (0.80) (2.03)

20 (1.22) 14.58 9.66 43.69 44.72 33.20 24.21 (0.57) (2.55) (7.58) (9.54) (7.89) (11.09)

25 15.75 11.98 45.65 27.57 19.94 43.73 (0.76) (2.59) (7.18) (8.62) (7.86) (12.04)

False acceptance rate. The false acceptance rate (RA) denotes the rate of instances that should be rejected that are accepted, and is defined as

Pn i=1 I[(h(xi) yi)2 c(xi)]I[r(xi)>0] Pn i=1 I[(h(xi) yi)2 c(xi)] .

D.2 Some Results for Hinge Loss

In this section, we show some experimental results of the surrogate loss equipped with hinge loss, which can be formulated as follows:

ψ(h, r, c, x, y) = (h(x) y)2max(0, 1 + r(x)) + c(x)max(0, 1 r(x)).

Table 5 and Table 6 show some of the experimental results on the Age DB adn Breast Path Q when surrogate loss equipped hinge loss, respectively. From this table, we can see that Rc RLoss and AL is always lower than Sup in almost all experiments, which means that our method is effective in identifying test instances should be accepted and test instances should be rejected. It is worth noting that in most experiments, there is a low RA, which means that there is a higher tendency to reject hard-to-predict test instances to avoid serious errors when equipped hinge loss.

D.3 Some Results for Logistic Loss

In this section, we show some experimental results of the surrogate loss equipped with logistic loss, which can be formulated as follows:

ψ(h, r, c, x, y) = (h(x) y)2log(1 + exp(r(x))) + c(x)log(1 + exp( r(x))).

Table 7, Table 8 and Table 9 show some of the experimental results on the UCI datasets and Breast Path Q when surrogate loss equipped logistic loss, respectively. Our proposed method significantly outperforms the supervised regression method in almost all cases, which verifies the ability of our

Table 11: Test performance (mean and std) of our surrogate loss equipped square loss on five UCI datasets trained with the MLP model. We repeat the sampling-and-training process 10 times. The metrics Rej, AR, and RA are scaled to 0-100.

Datasets Cost Supervised Rej AL RL Rej AR RA

3 2.39 1.96 7.82 42.54 32.79 32.58 (0.10) (0.19) (0.63) (2.49) (2.69) (2.63)

4 2.84 2.33 8.79 31.82 23.72 40.81 4.44 (0.16) (0.25) (0.97) (1.87) (2.14) (2.77)

5 (0.46) 3.18 2.60 9.89 25.37 18.32 45.65 (0.18) (0.27) (1.15) (2.13) (2.07) (4.21)

6 3.50 2.89 10.40 20.38 14.37 49.83 (0.29) (0.42) (1.11) (2.17) (1.85) (5.47)

9 6.40 4.36 51.93 43.65 22.05 22.22 (0.25) (0.36) (5.13) (3.26) (2.93) (3.71)

12 7.46 5.11 61.13 33.75 15.04 27.05 (0.31) (0.38) (5.83) (2.90) (3.10) (2.74)

16 8.57 5.81 70.20 26.98 10.54 28.83 12.96 (0.40) (0.30) (7.82) (3.23) (3.08) (1.79)

20 (2.60) 9.27 6.66 76.90 19.34 7.70 35.09 (0.42) (0.43) (10.02) (2.34) (1.54) (4.65)

25 9.97 7.23 87.37 15.35 5.19 32.96 (0.60) (0.51) (9.07) (2.44) (1.38) (5.61)

30 10.33 7.82 85.67 11.23 3.92 37.40 (0.86) (0.79) (18.03) (1.95) (1.25) (8.27)

4 3.65 2.83 11.93 62.31 51.76 22.05 (0.26) (0.90) (3.22) (11.46) (12.43) (10.30)

6 5.19 4.31 18.00 39.62 33.51 47.55 (0.77) (1.47) (7.69) (21.51) (21.46) (24.43)

8 8.34 6.51 5.82 22.62 29.10 22.28 52.29 (2.16) (1.35) (1.74) (8.96) (14.57) (13.86) (16.21)

10 6.80 6.08 23.57 17.82 13.91 65.62 (1.16) (1.43) (9.41) (2.84) (1.93) (9.09)

13 7.28 6.45 30.51 12.69 10.05 71.16 (1.30) (1.36) (15.46) (3.74) (3.46) (15.70)

9 8.41 8.22 53.44 28.22 21.42 56.77 (1.56) (2.10) (20.25) (7.81) (7.35) (9.38)

12 9.03 8.36 76.10 17.13 12.16 66.75 12.57 (1.26) (1.64) (47.37) (5.71) (5.11) (11.99)

16 (3.43) 8.52 7.64 109.61 10.10 7.30 73.14 (1.35) (1.62) (60.72) (3.67) (2.71) (13.36)

20 9.40 8.56 148.19 7.03 5.09 73.76 (1.94) (2.16) (98.03) (3.30) (2.31) (16.54)

20 19.95 18.77 75.19 55.10 43.24 28.28 (2.56) (5.05) (11.68) (13.77) (14.20) (11.92)

30 25.22 22.44 103.99 33.45 22.25 43.75 (3.22) (5.34) (18.06) (6.93) (6.21) (9.93)

40 34.44 31.21 28.84 127.77 21.17 12.47 56.12 (3.05) (1.50) (1.84) (19.65) (5.11) (4.12) (7.44)

50 29.55 25.00 147.99 18.01 9.87 52.49 (2.97) (3.80) (36.87) (4.62) (3.88) (9.18)

60 33.07 28.81 158.65 13.64 7.28 59.55 (3.51) (3.94) (33.36) (3.71) (2.91) (8.52)

method to reject difficult test instances demonstrating the effectiveness of our method. In most cases, the average loss of our method in the accepted test instances (AL) is always smaller than the average loss of the supervised regression model (Sup) in all test instances. This further indicates the ability of our method to identify hard-to-predict samples and reject them. On both MLP and Linear models, our method is effective in avoiding serious errors, which verifies that our method can be adapted to different models.

D.4 Some Results for Square Loss

In this section, we show some experimental results of the surrogate loss equipped with square loss, which can be formulated as follows:

ψ(h, r, c, x, y) = (h(x) y)2(r(x) + 1)2 + c(x)(r(x) 1)2.

Table 12: Test performance (mean and std) of our surrogate loss equipped sigmoid on Breast Path Q. We repeat the sampling-and-training process 5 times. The metrics Rej, AR, RA are scaled to 0-100 and Sup, Rc RLoss, AL and RL are all magnified by a factor of 1000.

Cost Sup Rej AL RL Rej AR RA

5 4.42 2.40 43.86 79.22 56.41 2.81 (0.17) (1.01) (10.60) (1.54) (4.19) (0.92)

10 8.44 5.09 51.35 69.34 47.12 6.39 (0.46) (1.64) (7.30) (2.56) (3.83) (1.83)

15 11.63 6.30 58.40 60.23 41.23 10.88 16.77 (0.38) (1.83) (14.79) (6.75) (5.42) (5.57)

20 (1.22) 14.31 8.91 57.83 49.19 33.84 17.31 (0.66) (1.26) (9.11) (4.68) (5.32) (3.13)

25 16.61 9.09 95.10 47.38 29.46 17.78 (0.60) (1.53) (59.42) (2.92) (4.84) (4.73)

Table 10 and Table 11 show some of the experimental results on the Breast Path Q and UCI datasets with MLP model when surrogate loss equipped square loss, respectively. When the rejection cost c is small, both Rc RLoss and AL are significantly smaller than Sup. When the rejection cost c is large, Rc RLoss and AL are close to Sup but always smaller, which shows the effectiveness of our method to deal with regression with cost-based rejection.

D.5 Some Results for Sigmoid

In this section, we show some experimental results of the Sigmoid function equipped with sigmoid, which can be formulated as follows:

ψ(h, r, c, x, y) = (h(x) y)2sigmoid(r(x)) + c(x)sigmoid( r(x)).

Unlike other binary classification losses, sigmoid can be viewed as weight balancing prediction loss and the rejection cost due to sigmoid(r(x)) + sigmoid( r(x)) = 1. Table 12 show some of the experimental results on Breast Path Q when surrogate loss equipped sigmoid, respectively. Rc RLoss and AL are always smaller than Sup, verifying the effectiveness of our method.

In our experiments, we used multiple binary classification losses (MAE, hinge loss, logistic loss, square loss and sigmoid) and different datasets including two deep datasets (Breast Path Q and Age DB) and five uci datasets (Abalone, Airfoil, Auto-mpg, Housing and Concrete), and our method outperformed supervised regression in most cases, which demonstrates the effective of our method.

D.6 Performance of Increasing Training Data

As we showed in Theorem 8 and Theorem 9, the pair (h, r) learned by our surrogate loss could converge to the optimal pair (h , r ) when the number of training examples approaches to infinity. Therefore, the performance of the model can be improved as the training samples increase. To empirically validate such a theoretical finding, we further conduct experiments on the Abalone and Auto-mpg datasets by changing the fraction of training examples (100% means that we use all training examples in the training set). As shown in Figure 2, the Rc RLoss and the AL generally decreases when more training examples are used for model training. This observation is clearly in accordance with our theoretical analyses in Theorem 8 and Theorem 9, because the learned model would be closer to the optimal model as more training examples are provided.

D.7 Curves of AL and Rej

We show the AL loss for all methods with different rejection rates on Abalone, Autompg and Concrete datasets in Figure 3. In the plot of the relationship between AL loss and rejection rate, a curve at the bottom means better results.

20% 40% 60% 80% 100% Training set size

Rc RLoss AL

(a) Breast with c = 5

20% 40% 60% 80% 100% Training set size

Rc RLoss AL

(b) Breast with c = 10

20% 40% 60% 80% 100% Training set size

Rc RLoss AL

(c) Abalone with c = 4

20% 40% 60% 80% 100% Training set size

Rc RLoss AL

(d) Abalone with c = 6

20% 40% 60% 80% 100% Training set size

Rc RLoss AL

(e) Housing with c = 12

20% 40% 60% 80% 100% Training set size

Rc RLoss AL

(f) Housing with c = 16

Figure 2: The test performance on the Breast, Abalone and Housing datasets for the surrogate loss equipped MAE when the number of training examples increases.

20 25 30 35 40 45 Rejection rate (100%)

MLP-Plug Selective Hinge Logistic Square Sigmoid Mae

(a) Abalone

10 20 30 40 50 60 70

Rejection rate (100%)

MLP-Plug Selective Hinge Logistic Square Sigmoid Mae

(b) Autompg

0 10 20 30 40 50 60 70 Rejection rate (100%)

MLP-Plug Selective Hinge Logistic Square Sigmoid Mae

(c) Concrete

Figure 3: Figures (a), (b) and (c) report the accepted loss (AL) for all methods with different rejection rates on abalone, auto-mpg and concrete datasets, respectively.

E Limitations

In Theorem 4 and Theorem 5, we show that there is a limitation in our proposed method that requires the binary classification loss ℓ(r(x), z) to be always greater than 0. This is easily satisfied by the design of the binary classification loss. However, to avoid the modification of the binary classification loss, we further propose Theorem 6, which only requires the binary classification loss to be non-negative, and this is easily satisfied. Extensive experiments on various datasets demonstrate the effectiveness of our proposed method. on the other hand, we use the pointwise cost functions c(x) in our theoretical analysis, which shows that our method can be used for pointwise cost functions. For simplicity, in our experiments, we only consider the rejection cost as a constant c, but our method can easily handle the various pointwise cost functions in different application scenarios.