# Improving Expert Predictions with Conformal Prediction

Eleni Straitouri¹, Lequn Wang², Nastaran Okati¹, Manuel Gomez Rodriguez¹

¹Max Planck Institute for Software Systems, Kaiserslautern, Germany. ²Department of Computer Science, Cornell University, Ithaca, United States. Correspondence to: Eleni Straitouri.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

## Abstract

Automated decision support systems promise to help human experts solve multiclass classification tasks more efficiently and accurately. However, existing systems typically require experts to understand when to cede agency to the system or when to exercise their own agency. Otherwise, the experts may be better off solving the classification tasks on their own. In this work, we develop an automated decision support system that, by design, does not require experts to understand when to trust the system in order to improve performance. Rather than providing (single) label predictions and letting experts decide when to trust these predictions, our system provides sets of label predictions (prediction sets) constructed using conformal prediction and forcefully asks experts to predict labels from these sets. By using conformal prediction, our system can precisely trade off the probability that the true label is not in the prediction set, which determines how frequently our system will mislead the experts, and the size of the prediction set, which determines the difficulty of the classification task the experts need to solve using our system. In addition, we develop an efficient and near-optimal search method to find the conformal predictor under which the experts benefit the most from using our system. Simulation experiments using synthetic and real expert predictions demonstrate that our system may help experts make more accurate predictions and is robust to the accuracy of the classifier the conformal predictor relies on.

## 1. Introduction

In recent years, there has been an increasing interest in developing automated decision support systems to help human experts solve tasks in a wide range of critical domains, from medicine (Jiao et al., 2020) and drug discovery (Liu et al., 2021) to candidate screening (Wang et al., 2022) and criminal justice (Grgić-Hlača et al., 2019), to name a few. Among them, one of the main focuses has been multiclass classification tasks, where a decision support system uses a classifier to make label predictions and the experts decide when to follow the predictions made by the classifier (Bansal et al., 2019; Lubars & Tan, 2019; Bordt & von Luxburg, 2020). However, these systems typically require human experts to understand when to trust a prediction made by the classifier. Otherwise, the experts may be better off solving the classification tasks on their own (Suresh et al., 2020). This follows from the fact that, in general, the accuracy of a classifier differs across data samples (Raghu et al., 2019). In this context, several recent studies have analyzed how factors such as model confidence, model explanations and overall model calibration modulate trust (Papenmeier et al., 2019; Wang & Yin, 2021; Vodrahalli et al., 2022). Unfortunately, it is not yet clear how to make sure that the experts do not develop a misplaced trust that decreases their performance (Yin et al., 2019; Nourani et al., 2020; Zhang et al., 2020).
In this work, we develop a decision support system for multiclass classification tasks that, by design, does not require experts to understand when to trust the system in order to improve their performance.

**Our contributions.** For each data sample, our decision support system provides a set of label predictions (a prediction set) and forcefully asks human experts to predict a label from this set.¹ We view this type of decision support system as more natural since, given a set of alternatives, experts tend to narrow down their options to a subset of them before making their final decision (Wright & Barbour, 1977; Beach, 1993; Ben-Akiva & Boccara, 1995). In a way, our support system helps experts by automatically narrowing down their options for them, decreasing their cognitive load and allowing them to focus their attention where it is most needed. This could be particularly useful when the task is tedious or requires domain knowledge, since it is then difficult to outsource the task, and domain experts are often a scarce resource. In the context of clinical text annotation,² a recent empirical study has concluded that, in terms of overall accuracy, it may be more beneficial to recommend a subset of options than a single option (Levy et al., 2021).

¹There are many systems used every day by experts (e.g., pilots flying a plane) that, under normal operation, restrict their choices. This does not mean that, in extreme circumstances, the expert should not have the ability to essentially switch off the system.

By using the theory of conformal prediction (Vovk et al., 2005; Angelopoulos & Bates, 2021) to construct the above prediction set, our system can precisely control the trade-off between the probability that the true label is not in the prediction set, which determines how frequently our system will mislead an expert, and the size of the prediction set, which determines the difficulty of the classification task the expert needs to solve using our system. In this context, note that, if our system did not forcefully ask the expert to predict a label from the prediction set, it would not be able to exert this level of control, and good performance would depend on the expert developing a good sense of when to predict a label from the prediction set and when to predict a label from outside the set, as noted by Levy et al. (2021).

In addition, given an estimator of the expert's success probability for any of the possible prediction sets, we develop an efficient and near-optimal search method to find the conformal predictor under which the expert is guaranteed to achieve the greatest accuracy with high probability. In this context, we also propose a practical method to obtain such an estimator using the confusion matrix of the expert predictions in the original classification task and a given discrete choice model.

Finally, we perform simulation experiments using synthetic and real expert predictions on several multiclass classification tasks. The results demonstrate that our decision support system is robust to both the accuracy of the classifier and the estimator of the expert's success probability it relies on: the competitive advantage it provides improves with their accuracy, and the human experts do not decrease their performance by using the system even if the classifier or the estimator is very inaccurate.
Additionally, the results show that, even if the classifiers our system relies on have high accuracy, an expert using our system may achieve significantly higher accuracy than the classifiers on their own: in our experiments with real data, the relative reduction in misclassification probability is over 72%. Moreover, our results suggest that, by using our system, the expert would reduce their own misclassification probability by 80%.³

²Clinical text annotation is a task where medical experts aim to identify clinical concepts in medical notes and map them to labels in a large ontology.

³An open-source implementation of our system is available at https://github.com/Networks-Learning/improve-expert-predictions-conformal-prediction.

**Further related work.** Our work builds upon further related work on distribution-free uncertainty quantification, reliable classification and learning under algorithmic triage.

There exist three fundamental notions of distribution-free uncertainty quantification in the literature: calibration, confidence intervals, and prediction sets (Vovk et al., 2005; Balasubramanian et al., 2014; Gupta et al., 2020; Angelopoulos & Bates, 2021). Our work is most closely related to the rapidly increasing literature on prediction sets (Romano et al., 2019; 2020; Angelopoulos et al., 2021; Podkopaev & Ramdas, 2021); however, to the best of our knowledge, prediction sets have not been optimized to serve automated decision support systems such as ours. In this context, we acknowledge that Babbar et al. (2022) have also very recently proposed using prediction sets in decision support systems. However, in contrast to our work, for each data sample, they allow the expert to predict label values outside the recommended subset, i.e., to predict any alternative from the entire universe of alternatives, and they do not optimize the probability that the true label belongs to the subset. As a result, their method is not directly comparable to ours.⁴

There is an extensive line of work on reliable or cautious classification (Del Coz et al., 2009; Liu et al., 2014; Yang et al., 2017; Mortier et al., 2021; Ma & Denoeux, 2021; Nguyen & Hüllermeier, 2021). Reliable classification aims to develop models that can provide set-valued predictions to account for the prediction uncertainty of a classifier. However, in this line of work, in contrast with ours, there are no human experts who make the final predictions given the set-valued predictions. Moreover, the set-valued predictions typically lack distribution-free guarantees.

Learning under algorithmic triage seeks the development of machine learning models that operate under different automation levels: models that make decisions for a given fraction of instances and leave the remaining ones to human experts (Raghu et al., 2019; Mozannar & Sontag, 2020; De et al., 2020; 2021; Okati et al., 2021). This line of work has predominantly focused on supervised learning settings, with a few very recent notable exceptions (Straitouri et al., 2021; Meresht et al., 2022). However, in this line of work, each sample is either predicted by the model or by the human expert. In contrast, in our work, the model helps the human predict each sample. That being said, it is worth noting that there may be classifiers, data distributions and conformal scores under which the optimal conformal predictor and the optimal triage policy coincide, i.e., the optimal conformal predictor recommends either a single label or the entire set of labels.
⁴In our simulation experiments, we estimate the performance achieved by an expert using our system via a model-based estimator of the expert's success probability. Therefore, to compare our system with the system by Babbar et al. (2022), we would need to model the expert's success probability whenever the expert can predict any label given a prediction set, a problem for which discrete choice theory provides little guidance.

## 2. Problem Formulation

We consider a multiclass classification task where a human expert observes a feature vector⁵ $x \in \mathcal{X}$, with $x \sim P(X)$, and needs to predict a label $y \in \mathcal{Y} = \{1, \ldots, n\}$, with $y \sim P(Y \mid X)$. Then, our goal is to design an automated decision support system $\mathcal{C} : \mathcal{X} \to 2^{\mathcal{Y}}$ that, given a feature vector $x \in \mathcal{X}$, helps the expert by automatically narrowing down the set of potential labels to a subset of them, $\mathcal{C}(x) \subseteq \mathcal{Y}$, using a trained classifier $\hat{f} : \mathcal{X} \to [0,1]^{|\mathcal{Y}|}$ that outputs scores for each class (e.g., softmax scores).⁶ The higher the score $\hat{f}_y(x)$, the more the classifier believes that the true label is $Y = y$. Here, we assume that, for each $x \sim P(X)$, the expert predicts a label $\hat{Y}$ among those in the subset $\mathcal{C}(x)$ according to an unknown policy $\pi(x, \mathcal{C}(x))$. More formally, $\hat{Y} \sim \pi(x, \mathcal{C}(x))$, where $\pi : \mathcal{X} \times 2^{\mathcal{Y}} \to \Delta(\mathcal{Y})$, $\Delta(\mathcal{Y})$ denotes the probability simplex over the set of labels $\mathcal{Y}$, and $\pi_y(x, \mathcal{C}(x)) = 0$ if $y \notin \mathcal{C}(x)$. Refer to Figure 1 for an illustration of the automated decision support system we consider.

Ideally, we would like that, by design, the expert can only benefit from using the automated decision support system $\mathcal{C}$, i.e.,

$$P[\hat{Y} = Y;\, \mathcal{C}] \;\geq\; P[\hat{Y} = Y;\, \mathcal{Y}], \quad (1)$$

where $P[\hat{Y} = Y;\, \mathcal{C}]$ denotes the expert's success probability if, for each $x \sim P(X)$, the human expert predicts a label $\hat{Y}$ among those in the subset $\mathcal{C}(x)$. However, not all automated decision support systems fulfilling the above requirement will be equally useful: some will help experts increase their success probability more than others. For example, a system that always recommends $\mathcal{C}(x) = \mathcal{Y}$ for all $x \in \mathcal{X}$ satisfies Eq. 1; however, it is useless to the experts. Therefore, among those systems satisfying Eq. 1, we would like to find the system $\mathcal{C}^*$ that helps the experts achieve the highest success probability,⁷ i.e.,

$$\mathcal{C}^* = \operatorname*{argmax}_{\mathcal{C}} \; P[\hat{Y} = Y;\, \mathcal{C}]. \quad (2)$$

⁵We denote random variables with capital letters and realizations of random variables with lower case letters.

⁶The assumption that $\hat{f}(x) \in [0,1]^n$ is without loss of generality.

⁷Note that maximizing the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}]$ is equivalent to minimizing the expected 0-1 loss $E[\mathbb{I}(\hat{Y} \neq Y);\, \mathcal{C}]$. Considering other types of losses is left as an interesting avenue for future work.

To address the design of such a system, we will look at the problem from the perspective of conformal prediction (Vovk et al., 2005; Angelopoulos & Bates, 2021).
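To make the interaction model above concrete, here is a minimal sketch of one round of it in Python (our own illustration, not part of the paper; all function and variable names are hypothetical): the system recommends a subset $\mathcal{C}(x)$ and the expert samples a prediction from a policy whose mass outside the subset is zeroed out.

```python
import numpy as np

def expert_prediction(x, subset, policy, rng):
    """Sample the expert's prediction Yhat ~ pi(x, C(x)).

    `policy(x)` returns a probability vector over all n labels; we zero out
    the labels outside the recommended subset C(x) and renormalize, so that
    pi_y(x, C(x)) = 0 for every y not in C(x), as required by the setup.
    """
    p = policy(x)
    mask = np.zeros_like(p)
    mask[list(subset)] = 1.0
    p = p * mask
    p = p / p.sum()  # the expert always predicts some label from C(x)
    return rng.choice(len(p), p=p)

# Toy usage: n = 4 labels, a uniform expert policy, recommended subset {0, 2}.
rng = np.random.default_rng(0)
uniform_policy = lambda x: np.full(4, 0.25)
y_hat = expert_prediction(x=None, subset={0, 2}, policy=uniform_policy, rng=rng)
assert y_hat in {0, 2}
```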
## 3. Subset Selection using Conformal Prediction

In general, if the trained classifier $\hat{f}$ we use to build $\mathcal{C}(X)$ is not perfect, the true label $Y$ may or may not be included in $\mathcal{C}(X)$. In what follows, we will construct the subsets $\mathcal{C}(X)$ using the theory of conformal prediction. This will allow our system to be robust to the accuracy of the classifier $\hat{f}$ it uses: the probability $P[Y \in \mathcal{C}(X)]$ that the true label $Y$ belongs to the subset $\mathcal{C}(X) = \mathcal{C}_\alpha(X)$ will match almost exactly a given target probability $1-\alpha$ with high probability, without making any distributional assumptions about the data distribution $P(X)P(Y \mid X)$ nor the classifier $\hat{f}$.

Let $\mathcal{D}^{\text{cal}} = \{(x_i, y_i)\}_{i=1}^{m}$ be a calibration set, where $(x_i, y_i) \sim P(X)P(Y \mid X)$, let $s(x_i, y_i) = 1 - \hat{f}_{y_i}(x_i)$ be the conformal score⁸ (i.e., if the classifier is catastrophically wrong, the conformal score will be close to one), and let $\hat{q}_\alpha$ be the $\lceil (m+1)(1-\alpha) \rceil / m$ empirical quantile of the conformal scores $s(x_1, y_1), \ldots, s(x_m, y_m)$. Then, if we construct the subsets $\mathcal{C}_\alpha(X)$ for new data samples as follows:

$$\mathcal{C}_\alpha(X) = \{ y \mid s(X, y) \leq \hat{q}_\alpha \}, \quad (3)$$

we have that the probability that the true label $Y$ belongs to the subset $\mathcal{C}_\alpha(X)$, conditionally on the calibration set $\mathcal{D}^{\text{cal}}$, is almost exactly $1-\alpha$ with high probability, as long as the size $m$ of the calibration set is sufficiently large. More specifically, we first note that the coverage probability is a random quantity⁹ whose distribution is given by the following proposition (refer to Appendix A.5 in Hulsman (2022) for the proof):

**Proposition 3.1.** For a decision support system $\mathcal{C}_\alpha$ that constructs the subsets $\mathcal{C}_\alpha(X)$ using Eq. 3, it holds that

$$P[Y \in \mathcal{C}_\alpha(X) \mid \mathcal{D}^{\text{cal}}] \sim \mathrm{Beta}\bigl(\lceil (m+1)(1-\alpha) \rceil,\, \lfloor (m+1)\alpha \rfloor\bigr) \quad (4)$$

as long as the conformal scores $s(X_i, Y_i)$ for all $(X_i, Y_i) \in \mathcal{D}^{\text{cal}}$ are almost surely distinct.

⁸In general, the conformal score $s(x, y)$ can be any function of $x$ and $y$ measuring the similarity between samples.

⁹The randomness comes from the randomness of the calibration set sampling.

As an immediate consequence of Proposition 3.1, using the definition of the beta distribution, we have that

$$1 - \alpha \;\leq\; E\bigl[P[Y \in \mathcal{C}_\alpha(X) \mid \mathcal{D}^{\text{cal}}]\bigr] \;=\; 1 - \frac{\lfloor (m+1)\alpha \rfloor}{m+1} \;\leq\; 1 - \alpha + \frac{1}{m+1}.$$

Moreover, given a target probability $1-\alpha$ and tolerance values $\delta, \epsilon \in (0,1)$, we can compute the minimum size $m$ of the calibration set $\mathcal{D}^{\text{cal}}$ such that $\mathcal{C}_\alpha$ enjoys Probably Approximately Correct (PAC) coverage guarantees, i.e., with probability $1-\delta$, it holds that (Angelopoulos & Bates, 2021)

$$1 - \alpha - \epsilon \;\leq\; P[Y \in \mathcal{C}_\alpha(X) \mid \mathcal{D}^{\text{cal}}] \;\leq\; 1 - \alpha + \epsilon.$$

Figure 1. Our automated decision support system $\mathcal{C}$. Given a sample with a feature vector $x$, our system $\mathcal{C}$ narrows down the set of potential labels $y \in \mathcal{Y}$ to a subset of them $\mathcal{C}(x)$ using the scores $\hat{f}_y(x)$ provided by a classifier $\hat{f}$ for each class $y$. The human expert receives the recommended subset $\mathcal{C}(x)$, together with the sample, and predicts a label $\hat{y}$ from $\mathcal{C}(x)$ according to a policy $\pi(x, \mathcal{C}(x))$. (Example scores shown in the figure: Airplane 0.1, Car 0.2, Bird 0.01, Cat 0.01, Deer 0.01, Dog 0.02, Frog 0.01, Horse 0.01, Ship 0.03, Truck 0.6.)

While the above coverage guarantee is valid for any choice of $\alpha$, we would like to emphasize that there may be some $\alpha$ values that will lead to larger gains in terms of success probability $P[\hat{Y} = Y;\, \mathcal{C}_\alpha]$ than others. Therefore, in what follows, our goal is to find the optimal $\alpha$ that maximizes the expert's success probability given a calibration set $\mathcal{D}^{\text{cal}}$.

**Remark.** Most of the literature on conformal prediction focuses on the following conformal calibration guarantee (refer to Appendix D in Angelopoulos & Bates (2021) for the proof):

**Theorem 3.2.** For an automated decision support system $\mathcal{C}_\alpha$ that constructs the subsets $\mathcal{C}_\alpha(X)$ using Eq. 3, it holds that

$$1 - \alpha \;\leq\; P[Y \in \mathcal{C}_\alpha(X)] \;\leq\; 1 - \alpha + \frac{1}{m+1},$$

where the probability is over the randomness in the sample it helps predicting and the calibration set used to compute the empirical quantile $\hat{q}_\alpha$.

However, to afford the above marginal guarantee in our work, we would be unable to optimize the $\alpha$ value to maximize the expert's success probability given a calibration set $\mathcal{D}^{\text{cal}}$. This is because the guarantee requires that $\alpha$ and $\mathcal{D}^{\text{cal}}$ are independent.
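As a concrete illustration of Eq. 3, the following sketch (ours, not the authors' released implementation) computes the conformal scores on a calibration set, takes the $\lceil (m+1)(1-\alpha) \rceil / m$ empirical quantile, and builds the prediction set for a new sample:

```python
import numpy as np

def calibrate_quantile(softmax_cal, y_cal, alpha):
    """q_alpha: the ceil((m+1)(1-alpha))/m empirical quantile of the
    conformal scores s(x_i, y_i) = 1 - f_{y_i}(x_i) on the calibration set."""
    m = len(y_cal)
    scores = 1.0 - softmax_cal[np.arange(m), y_cal]
    k = int(np.ceil((m + 1) * (1.0 - alpha)))  # rank of the order statistic
    return np.sort(scores)[k - 1]

def prediction_set(softmax_x, q_alpha):
    """C_alpha(x) = {y : s(x, y) <= q_alpha}, cf. Eq. 3."""
    return np.flatnonzero(1.0 - softmax_x <= q_alpha)

# Toy usage with Dirichlet noise standing in for a trained classifier's scores.
rng = np.random.default_rng(0)
m, n = 1200, 10
softmax_cal = rng.dirichlet(np.ones(n), size=m)
y_cal = rng.integers(0, n, size=m)
q = calibrate_quantile(softmax_cal, y_cal, alpha=0.1)
print(prediction_set(rng.dirichlet(np.ones(n)), q))
```

Note that, for $\alpha = 1 - i/(m+1)$, the rank $k$ above equals $i$, which is why there are only $m$ distinct conformal predictors, a fact exploited in the next section.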
Regarding this last point, we note that, in our experiments, we have empirically found that the optimal $\alpha$ does not vary significantly across calibration sets, as shown in Appendix E.

## 4. Optimizing Across Conformal Predictors

We start by realizing that, given a calibration set $\mathcal{D}^{\text{cal}} = \{(x_i, y_i)\}_{i=1}^{m}$, there only exist $m$ different conformal predictors. This is because the empirical quantile $\hat{q}_\alpha$, which the subsets $\mathcal{C}_\alpha(x)$ depend on, can only take $m$ different values. As a result, to find the optimal conformal predictor that maximizes the expert's success probability, we need to solve the following maximization problem:

$$\alpha^* = \operatorname*{argmax}_{\alpha \in \mathcal{A}} \; P[\hat{Y} = Y;\, \mathcal{C}_\alpha], \quad (5)$$

where $\mathcal{A} = \{\alpha_i\}_{i \in [m]}$, with $\alpha_i = 1 - i/(m+1)$, and the probability is only over the randomness in the samples the system helps predicting. However, to find a near-optimal solution $\hat{\alpha}$ to the above problem, we need to estimate the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}_\alpha]$.

Assume for now that, for each $\alpha \in \mathcal{A}$, we have access to an estimator $\hat{\mu}_\alpha$ of the expert's success probability such that, for any $\delta \in (0,1)$, with probability at least $1-\delta$, it holds that $|\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha]| \leq \epsilon_{\alpha,\delta}$. Then, we can use the following proposition to find a near-optimal solution $\hat{\alpha}$ to Eq. 5 with high probability:

**Proposition 4.1.** For any $\delta \in (0,1)$, consider an automated decision support system $\mathcal{C}_{\hat{\alpha}}$ with

$$\hat{\alpha} = \operatorname*{argmax}_{\alpha \in \mathcal{A}} \; \hat{\mu}_\alpha - \epsilon_{\alpha,\delta/m}. \quad (6)$$

With probability at least $1-\delta$, it holds that

$$P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}] \;\geq\; P[\hat{Y} = Y;\, \mathcal{C}_\alpha] - 2\epsilon_{\alpha,\delta/m} \quad \text{for all } \alpha \in \mathcal{A} \text{ simultaneously.}$$

More specifically, the above result directly implies that, for any $\delta \in (0,1)$, with probability at least $1-\delta$, it holds that

$$P[\hat{Y} = Y;\, \mathcal{C}_{\alpha^*}] - P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}] \;\leq\; 2\epsilon_{\alpha^*,\delta/m}. \quad (7)$$

Here, note that the above guarantees do not make use of the PAC coverage guarantees afforded by conformal prediction: they hold for any parameterized set-valued predictor.

```
Algorithm 1: Finding a near-optimal α̂
Input: f̂, D_est, D_cal, δ, m
Initialize: A ← {}, α̂ ← 0, t ← 0
for i = 1, ..., m do
    α ← 1 − i/(m+1)
    A ← A ∪ {α}
end for
for α ∈ A do
    (μ̂_α, ε_{α,δ/m}) ← ESTIMATE(α, δ, D_est, D_cal, f̂)
    if t ≤ μ̂_α − ε_{α,δ/m} then
        t ← μ̂_α − ε_{α,δ/m}
        α̂ ← α
    end if
end for
return α̂
```

In what follows, we propose a practical method to estimate the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}_\alpha]$ that builds upon the multinomial logit model (MNL), one of the most popular models in the vast literature on discrete choice models (Heiss, 2016). More specifically, given a sample $(x, y)$, we assume that the expert's conditional success probability for the subset $\mathcal{C}_\alpha(x)$ is given by

$$P[\hat{Y} = y;\, \mathcal{C}_\alpha \mid y \in \mathcal{C}_\alpha(x)] = \frac{e^{u_{yy}}}{\sum_{y' \in \mathcal{C}_\alpha(x)} e^{u_{yy'}}}, \quad (8)$$

where $u_{yy'}$ denotes the expert's preference for label value $y' \in \mathcal{Y}$ whenever the true label is $y$. In the language of discrete choice models, one can view the true label $y$ as the context in which the expert chooses among alternatives (Tversky & Simonson, 1993). In Appendix I, we consider and experiment with a more expressive context that, in addition to the true label, distinguishes between different levels of difficulty across data samples. Further, to estimate the parameters $u_{yy'}$, we assume we have access to (an estimation of) the confusion matrix $\mathbf{C}$ for the expert predictions in the (original) multiclass classification task, similarly as in Kerrigan et al. (2021), i.e., $\mathbf{C} = [C_{yy'}]_{y,y' \in \mathcal{Y}}$, where $C_{yy'} = P[\hat{Y} = y';\, \mathcal{Y} \mid Y = y]$, and we naturally set $u_{yy'} = \log C_{yy'}$.
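For illustration, here is a minimal sketch (ours; the names are illustrative) of the conditional success probability in Eq. 8 with the preferences $u_{yy'} = \log C_{yy'}$ read off a given confusion matrix; since $e^{u_{yy'}} = C_{yy'}$, the softmax over the subset reduces to a ratio of confusion-matrix entries:

```python
import numpy as np

def conditional_success_prob(conf_matrix, y_true, subset):
    """P[Yhat = y; C_alpha | y in C_alpha(x)] under the MNL model of Eq. 8.

    With u_{yy'} = log C_{yy'}, the softmax over the recommended subset
    reduces to C_{yy} / sum_{y' in subset} C_{yy'}. Assumes y_true is in
    `subset`, i.e., we condition on the true label being in the set.
    """
    idx = np.asarray(sorted(subset))
    return conf_matrix[y_true, y_true] / conf_matrix[y_true, idx].sum()

# Toy usage: an expert who is right 70% of the time, errors spread uniformly.
n = 10
C = np.full((n, n), 0.3 / (n - 1))
np.fill_diagonal(C, 0.7)
print(conditional_success_prob(C, y_true=3, subset={1, 3, 5}))  # ~0.91 > 0.7
```

Narrowing the subset removes competing preference mass from the denominator, which is precisely why, under the MNL model, a well-chosen prediction set increases the expert's conditional success probability.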
Given this conditional success probability, we can compute a Monte-Carlo estimator $\hat{\mu}_\alpha$ of the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}_\alpha]$ using an estimation set $\mathcal{D}^{\text{est}} = \{(x_i, y_i)\}_{i \in [m]}$,¹⁰ i.e.,

$$\hat{\mu}_\alpha = \frac{1}{m} \sum_{i \in [m] \,:\, y_i \in \mathcal{C}_\alpha(x_i)} P[\hat{Y} = y_i;\, \mathcal{C}_\alpha \mid y_i \in \mathcal{C}_\alpha(x_i)]. \quad (9)$$

¹⁰The number of samples in $\mathcal{D}^{\text{cal}}$ and $\mathcal{D}^{\text{est}}$ can differ. For simplicity, we assume both sets contain $m$ samples.

Finally, for each $\alpha \in \mathcal{A}$, using Hoeffding's inequality,¹¹,¹² we can conclude that, with probability at least $1-\delta$, it holds that (refer to Appendix A.2)

$$|\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha]| \;\leq\; \sqrt{\frac{\log(1/\delta)}{2m}} := \epsilon_{\alpha,\delta}. \quad (10)$$

As a consequence, as $m \to \infty$, $\epsilon_{\alpha,\delta}$ converges to zero. This directly implies that the near-optimal $\hat{\alpha}$ converges to the true optimal $\alpha^*$ and that, with probability at least $1-\delta$, our system $\mathcal{C}_{\hat{\alpha}}$ satisfies Eq. 1 asymptotically with respect to the number of samples $m$ in the estimation set.

¹¹By using Hoeffding's inequality, we derive a fairly conservative constant error bound for all $\alpha$ values; however, we have experimentally found that, even with a relatively small amount of estimation and calibration data, our algorithm identifies near-optimal $\hat{\alpha}$ values, as shown in Figure 2. That being said, one could use tighter concentration inequalities such as Hoeffding–Bentkus and Waudby-Smith–Ramdas (Bates et al., 2021).

¹²We are applying Hoeffding's inequality only on the randomness of the samples $(X_i, Y_i)$, which are independent and identically distributed.

Algorithm 1 summarizes the overall search method, where the function ESTIMATE(·) uses Eqs. 9 and 10. The algorithm first builds $\mathcal{A}$ and then finds the near-optimal $\hat{\alpha}$ in $\mathcal{A}$. To build $\mathcal{A}$, it needs $O(m)$ steps. To find the near-optimal $\hat{\alpha}$, for each value $\alpha \in \mathcal{A}$ and each sample $(x_i, y_i) \in \mathcal{D}^{\text{est}}$, it needs to compute a subset $\mathcal{C}_\alpha(x_i)$. This is achieved by sorting the conformal scores and reusing computations across $\alpha$ values, which takes $O(m \log m + mn \log n)$ steps. Therefore, the overall time complexity is $O(m \log m + mn \log n)$.

**Remarks.** By using the MNL, we implicitly assume the independence of irrelevant alternatives (IIA) (Luce, 1959), an axiom that states that the expert's relative preference between two alternatives remains the same over all possible subsets containing these alternatives. While IIA is one of the most widely used axioms in the literature on discrete choice models, there is also a large body of experimental literature claiming to document real-world settings where IIA fails to hold (Tversky, 1972; Huber et al., 1982; Simonson, 1989). Fortunately, we have empirically found that experts may benefit from using our system even under strong violations of the IIA assumption in the estimator of the expert's success probability (i.e., when the estimator of the expert's success probability is not accurate), as shown in Figures 3 and 4.

Conformal prediction is one of many possible ways to construct set-valued predictors (Chzhen et al., 2021), i.e., predictors that, for each sample $x \in \mathcal{X}$, output a set of label candidates $\mathcal{C}(x)$. In our work, we favor conformal predictors over alternatives because they provably output trustworthy sets $\mathcal{C}_\alpha(x)$ without making any assumption about the data distribution nor the classifier they rely upon. In fact, we can use conformal predictors with any off-the-shelf classifier.
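Putting Eqs. 8–10 together, here is a sketch of our reading of Algorithm 1 (reusing the hypothetical helpers `calibrate_quantile`, `prediction_set` and `conditional_success_prob` from the sketches above); it searches $\mathcal{A}$ for the $\hat{\alpha}$ that maximizes the lower confidence bound $\hat{\mu}_\alpha - \epsilon_{\alpha,\delta/m}$ of Eq. 6:

```python
import numpy as np

def estimate(alpha, delta, softmax_est, y_est, conf_matrix, softmax_cal, y_cal):
    """ESTIMATE(.) in Algorithm 1: the Monte-Carlo estimator of Eq. 9 and the
    Hoeffding error bound of Eq. 10; `delta` here is the per-alpha budget."""
    m = len(y_est)
    q = calibrate_quantile(softmax_cal, y_cal, alpha)
    mu = 0.0
    for i in range(m):
        subset = prediction_set(softmax_est[i], q)
        if y_est[i] in subset:  # terms with y_i outside the set contribute zero
            mu += conditional_success_prob(conf_matrix, y_est[i], subset)
    eps = np.sqrt(np.log(1.0 / delta) / (2.0 * m))
    return mu / m, eps

def find_alpha_hat(delta, softmax_est, y_est, conf_matrix, softmax_cal, y_cal):
    """Return alpha_hat = argmax_{alpha in A} (mu_hat - eps), cf. Eq. 6."""
    m = len(y_cal)
    best_alpha, best_lcb = None, -np.inf
    for i in range(1, m + 1):
        alpha = 1.0 - i / (m + 1.0)           # A = {1 - i/(m+1)}_{i in [m]}
        mu, eps = estimate(alpha, delta / m,  # union bound over the m values
                           softmax_est, y_est, conf_matrix, softmax_cal, y_cal)
        if mu - eps > best_lcb:
            best_lcb, best_alpha = mu - eps, alpha
    return best_alpha
```

For clarity, this sketch recomputes every prediction set from scratch for each $\alpha$; the $O(m \log m + mn \log n)$ complexity quoted above comes from sorting the conformal scores once and reusing computations across $\alpha$ values.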
However, we would like to emphasize that our efficient search method (Algorithm 1) is rather generic and, together with an estimator of the expert's success probability with provable guarantees, may be used to find, within a discrete set of set-valued predictors, a near-optimal set-valued predictor that maximizes the expert's success probability. This is because our near-optimality guarantees in Proposition 4.1 do not make use of the PAC guarantees afforded by conformal prediction, as discussed previously. In Appendix D, we discuss an alternative set-valued predictor with PAC coverage guarantees, which may provide improved performance in scenarios where the classifier underpinning our system does not have particularly high average accuracy. We hope our work will encourage others to develop set-valued predictors specifically designed to serve decision support systems.

## 5. Experiments on Synthetic Data

In this section, we evaluate our system against the accuracy of the expert and the classifier, the size of the calibration and estimation sets, as well as the number of label values. Moreover, we analyze the robustness of our system to violations of the IIA assumption in the estimator of the expert's success probability.¹³

¹³All algorithms ran on a Debian machine equipped with an Intel Xeon E5-2667 v4 @ 3.2 GHz, 32 GB of memory and two Nvidia Tesla M40 GPU cards. See Appendix B for further details.

**Experimental setup.** We create a variety of synthetic prediction tasks, each with 20 features per sample and a varying number of label values $n$ and difficulty. Refer to Appendix B for more details about the prediction tasks. For each prediction task, we generate 10,000 samples, pick 20% of these samples at random as a test set, which we use to estimate the performance of our system, and randomly split the remaining 80% into three disjoint subsets for training, calibration, and estimation, whose sizes we vary across experiments. In each experiment, we specify the number of samples in the calibration and estimation sets; the remaining samples are used for training. For each prediction task, we train a logistic regression model $P_\theta(Y \mid X)$ which, depending on the difficulty of the prediction task, achieves different success probability values $P[Y' = Y]$, where $Y'$ denotes the label predicted by the classifier. Moreover, we sample the expert's predictions $\hat{Y}$ from the multinomial logit model defined by Eq. 8, with $C_{yy} = \pi + \gamma\,\epsilon_c$ and $C_{yy'} = (1 - C_{yy})/n + \beta$, where $\pi$ is a parameter that controls the expert's success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, $\epsilon_c \sim U(0, \min(1-\pi, \pi)/n)$, $\beta \sim \mathcal{N}\bigl(0, ((1 - C_{yy})/(6n))^2\bigr)$ for all $y \neq y'$, and $\gamma$ is a normalization term. Finally, we repeat each experiment ten times and, each time, we sample different train, estimation, calibration, and test sets following the above procedure.

Table 1. Empirical success probability achieved by four different experts using our system during test, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, on four prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. Each column corresponds to a prediction task, and each row to an expert. In each task, the number of label values is $n = 10$ and the size of the calibration and estimation sets is $m = 1{,}200$. Each cell shows only the average since all standard errors are below $10^{-2}$.

| $P[\hat{Y} = Y;\, \mathcal{Y}]$ \ $P[Y'=Y]$ | 0.3 | 0.5 | 0.7 | 0.9 |
|---|---|---|---|---|
| 0.3 | 0.41 | 0.58 | 0.75 | 0.91 |
| 0.5 | 0.55 | 0.68 | 0.80 | 0.93 |
| 0.7 | 0.72 | 0.79 | 0.87 | 0.95 |
| 0.9 | 0.90 | 0.91 | 0.95 | 0.98 |
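The exact noise terms in the construction of $\mathbf{C}$ above are tuned per task; the following simplified sketch (ours, with the tuning abstracted away) conveys the idea: build a row-stochastic confusion matrix whose diagonal is approximately $\pi$ and sample the expert's predictions from its rows, which is what the MNL model of Eq. 8 does when the recommended subset is the full label set.

```python
import numpy as np

def make_expert_confusion(n, pi, rng):
    """A simplified stand-in for the construction above: diagonal ~ pi,
    off-diagonal mass spread with multiplicative noise, rows summing to 1."""
    C = rng.uniform(0.5, 1.5, size=(n, n)) * (1.0 - pi) / (n - 1)
    np.fill_diagonal(C, pi)
    return C / C.sum(axis=1, keepdims=True)

def sample_expert_predictions(C, y_true, rng):
    """Draw Yhat ~ Categorical(C[y, :]) for each true label y."""
    return np.array([rng.choice(len(C), p=C[y]) for y in y_true])

rng = np.random.default_rng(0)
C = make_expert_confusion(n=10, pi=0.7, rng=rng)
y_true = rng.integers(0, 10, size=10_000)
y_hat = sample_expert_predictions(C, y_true, rng)
print("empirical expert success probability:", (y_hat == y_true).mean())  # ~0.7
```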
**Experts always benefit from our system even if the classifier has low accuracy.** We estimate the success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}]$ achieved by four different experts, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, on four prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. Table 1 summarizes the results, where each column corresponds to a different prediction task and each row corresponds to a different expert. We find that, using our system, the expert solves the prediction task significantly more accurately than either the expert or the classifier on their own. Moreover, it is rather remarkable that, even if the classifier has low accuracy, the expert always benefits from using our system; in other words, our system is robust to the performance of the classifier it relies on. In Appendix C, we show qualitatively similar results for prediction tasks with other values of $n$ and $m$.

**The performance of our system under $\hat{\alpha}$ found by Algorithm 1 and under $\alpha^*$ is very similar.** Given three prediction tasks where the expert and the classifier achieve different success probabilities $P[\hat{Y} = Y;\, \mathcal{Y}]$ and $P[Y' = Y]$, we compare the performance of our system under the near-optimal $\hat{\alpha}$ found by Algorithm 1 and under all other possible $\alpha \in \mathcal{A}$ values. Figure 2 summarizes the results, which suggest that: (i) the performance under $\hat{\alpha}$ is very close to that under $\alpha^*$, as suggested by Proposition 4.1; (ii) as long as $\alpha \leq \alpha^*$, the performance of our system increases monotonically with respect to $\alpha$; however, once $\alpha > \alpha^*$, the performance deteriorates as we increase $\alpha$; and (iii) the higher the expert's success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, the smaller the near-optimal $\hat{\alpha}$ and thus the greater the average size of the subsets $\mathcal{C}_{\hat{\alpha}}(X)$. In Appendix G, we also show that, the smaller the near-optimal $\hat{\alpha}$, the greater the spread of the empirical distribution of the size of the subsets $\mathcal{C}_{\hat{\alpha}}(X)$. We found qualitatively similar results using other expert-classifier pairs with different success probabilities.

Figure 2. Empirical success probability achieved by two different experts using our system, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, and average size of the recommended sets during test, for each $\alpha \in \mathcal{A}$, on two synthetic prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. Panel (a) shows the empirical success probability vs. $\alpha$ and panel (b) shows the empirical average set size vs. $\alpha$. Here, note that the empirical average set size only depends on the classifier's success probability $P[Y' = Y]$, not on the expert, and thus we only need two lines in panel (b). In all experiments, the number of label values is $n = 10$ and the size of the calibration and estimation sets is $m = 1{,}200$. Each marker corresponds to a different $\alpha$ value, and the darker points correspond to $\hat{\alpha}$. The coloring of the darker points for each prediction task is the same in both panels.

**Our system needs a relatively small amount of calibration and estimation data.** We vary the amount of calibration and estimation data $m$ we feed into Algorithm 1 and, each time, estimate the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}]$. Across prediction tasks, we consistently find that our system needs a relatively small amount of calibration and estimation data to perform well.
For example, for all prediction tasks with $n = 10$ label values and varying levels of difficulty, the relative gain in empirical success probability achieved by an expert using our system with respect to an expert on their own, averaged across experts with $P[\hat{Y} = Y;\, \mathcal{Y}] \in \{0.3, 0.5, 0.7, 0.9\}$, goes from $47.56 \pm 4.51\%$ for $m = 160$ to $48.66 \pm 4.54\%$ for $m = 1{,}200$.

**The greater the number of label values, the more an expert benefits from using our system.** We consider prediction tasks with a varying number of label values, from $n = 10$ to $n = 100$, and estimate the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}]$ for each task. Our results suggest that the relative gain in success probability, averaged across experts with $P[\hat{Y} = Y;\, \mathcal{Y}] \in \{0.3, 0.5, 0.7, 0.9\}$, increases with the number of label values. For example, for $m = 400$, it goes from $48.36 \pm 4.50\%$ for $n = 10$ to $69.44 \pm 5.20\%$ for $n = 100$. For other $m$ values, we found a similar trend.

**Our system is robust to strong violations of the IIA assumption in the estimator of the expert's success probability.** To study the robustness of our system to violations of the IIA assumption in the estimator of the expert's success probability, we allow the expert's preference $u_{yy'}$ for each label value $y' \neq y$ in Eq. 8 to depend on the corresponding prediction set $\mathcal{C}_{\hat{\alpha}}(x)$ at test time. More specifically, we set

$$u_{yy'} = \log\left( C_{yy'} + p \, \frac{\mathbb{I}[y' \neq y]}{|\mathcal{C}_{\hat{\alpha}}(x) \setminus \{y\}|} \sum_{y'' \notin \mathcal{C}_{\hat{\alpha}}(x)} C_{yy''} \right),$$

where $p \in [0, 1]$ is a parameter that controls the severity of the violation of the IIA assumption at test time. Here, note that, if $p = 1$, the expert does not benefit from using our system as long as the prediction set $\mathcal{C}_\alpha(x) \neq \{y\}$, i.e., the expert's conditional success probability is given by $P[\hat{Y} = y;\, \mathcal{C}_\alpha \mid y \in \mathcal{C}_\alpha(x)] = P[\hat{Y} = y;\, \mathcal{Y}]$. Figure 3 summarizes the results, which show that our system is robust to (strong) violations of the IIA assumption in the estimator of the expert's success probability.

## 6. Experiments on Real Data

In this section, we experiment with a dataset with real expert predictions on a multiclass classification task over natural images and several popular and highly accurate deep neural network classifiers. In doing so, we benchmark the performance of our system against a competitive top-k set-valued predictor baseline, which always returns the $k$ label values with the highest scores, and analyze its robustness to violations of the IIA assumption in the estimator of the expert's success probability. Here, we would like to explicitly note that we rely on the confusion matrix estimated using real expert predictions on the (original) multiclass classification task and the multinomial logit model defined by Eq. 8 to estimate the performance of our system and the competitive top-k set-valued predictor baseline; no real experts actually used our decision support system.

**Data description.** We experiment with the CIFAR-10H dataset (Peterson et al., 2019), which contains 10,000 natural images taken from the test set of the standard CIFAR-10 dataset (Krizhevsky et al., 2009). Each of these images belongs to one of $n = 10$ classes and contains approximately 50
expert predictions $\hat{Y}$.¹⁴ Here, we randomly split the dataset into three disjoint subsets for calibration, estimation and test, whose sizes we vary across experiments. In each experiment, we use the test set to estimate the performance of our system, and we specify the number of samples in the calibration and estimation sets; the remaining samples are used for testing.

¹⁴The CIFAR-10H dataset is among the only publicly available datasets (released under a Creative Commons BY-NC-SA 4.0 license) that we found containing multiple expert predictions per sample, necessary to estimate $\mathbf{C}$, a relatively large number of samples, and more than two classes. However, since our methodology is rather general, our system may be useful in other applications.

Figure 3. Empirical success probability achieved by three different experts using our system during test, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}] \in \{0.5, 0.7, 0.9\}$ (one per panel), against severity $p$ of the violation of the IIA assumption, on three prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. In each panel, the horizontal line shows the empirical success probability achieved by the expert at solving the (original) multiclass task during test. The number of labels is $n = 10$ and the size of the calibration and estimation sets is $m = 1{,}200$. Shaded regions correspond to 95% confidence intervals.

**Experimental setup.** Rather than training a classifier, we use three popular and highly accurate deep neural network classifiers trained on CIFAR-10, namely ResNet-110 (He et al., 2016a), PreResNet-110 (He et al., 2016b) and DenseNet (Huang et al., 2017). Moreover, we use the human predictions $\hat{Y}$ to estimate the confusion matrix $\mathbf{C}$ for the expert predictions in the (original) multiclass classification task (Kerrigan et al., 2021), and we then sample the expert's prediction $\hat{Y}$ from the multinomial logit model defined by Eq. 8, both to estimate the expert's conditional success probabilities in Eq. 9 in Algorithm 1 and to estimate the expert's success probability during testing. In what follows, even though the expert's performance during testing is estimated using the multinomial logit model, rather than using real predictions from experts using our system, we refer to (the performance of) such a simulated expert as an expert.

**Performance evaluation.** We start by estimating the success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}]$ achieved by an expert using our system ($\mathcal{C}_{\hat{\alpha}}$) and the best top-k set-valued predictor ($\mathcal{C}_k$), which returns the $k$ label values with the highest scores.¹⁵

¹⁵Appendix F shows the success probability achieved by an expert using the top-k set-valued predictor for different $k$ values, both for synthetic and real data.

Table 2. Empirical success probabilities achieved by three popular deep neural network classifiers and by an expert using our system ($\mathcal{C}_{\hat{\alpha}}$) and the best top-k set-valued predictor ($\mathcal{C}_k$) with these classifiers during test on the CIFAR-10H dataset. The size of the calibration and estimation sets is $m = 1{,}500$ and the expert's empirical success probability at solving the (original) multiclass task is $P[\hat{Y} = Y;\, \mathcal{Y}] \approx 0.947$. Each cell shows only the average since the standard errors are all below $10^{-2}$.

| Classifier | $P[Y' = Y]$ | $\mathcal{C}_{\hat{\alpha}}$ | $\mathcal{C}_k$ |
|---|---|---|---|
| ResNet-110 | 0.928 | 0.987 | 0.967 |
| PreResNet-110 | 0.944 | 0.989 | 0.972 |
| DenseNet | 0.964 | 0.990 | 0.980 |

Table 2 summarizes the results, where we also report the (empirical) success probability achieved by an expert solving the (original) multiclass task on their own.
We find that, by allowing for recommended subsets of varying size, our system is consistently superior to the top-k set-valued predictor. Moreover, we also find it very encouraging that, although the classifiers are highly accurate, our results suggest that an expert using our system can solve the prediction task significantly more accurately than the classifiers. More specifically, the relative reduction in misclassification probability goes from 72.2% (DenseNet) to 81.9% (ResNet-110); for instance, for ResNet-110, the misclassification probability drops from $1 - 0.928 = 0.072$ to $1 - 0.987 = 0.013$, i.e., a relative reduction of $(0.072 - 0.013)/0.072 \approx 81.9\%$. Finally, our results suggest that, by using our system, the (average) expert would reduce their misclassification probability by 80%.

**Robustness to violations of the IIA assumption in the estimator of the expert's success probability.** To study the robustness of our system to violations of the IIA assumption in the estimator of the expert's success probability, we use the same experimental setting as in the synthetic experiments, where the parameter $p$ controls the severity of the violation of the IIA assumption at test time. Figure 4 summarizes the results for different $p$ values. It is remarkable that, even for highly accurate classifiers like the ones used in our experiments, the expert benefits from using our system even when $p = 1$. This is because, for accurate classifiers, many prediction sets are singletons containing the true label, as shown in Appendix H.

Figure 4. Empirical success probability achieved by an expert using our system with three different classifiers (ResNet-110, PreResNet-110 and DenseNet-BC) during test, against severity $p$ of the violation of the IIA assumption, on the CIFAR-10H dataset. The empirical success probability achieved by the expert at solving the (original) multiclass task during test is $P[\hat{Y} = Y;\, \mathcal{Y}] \approx 0.947$. The size of the calibration and estimation sets is $m = 1{,}500$. Shaded regions correspond to 95% confidence intervals.

## 7. Conclusions

We have initiated the development of automated decision support systems that, by design, do not require human experts to understand when each of their recommendations is accurate in order to improve the experts' performance with high probability. We have focused on multiclass classification and designed a system that, for each data sample, recommends a subset of labels to the experts using a classifier. Moreover, we have shown that our system can help experts make predictions more accurately and is robust to the accuracy of the classifier and the estimator of the expert's success probability.

Our work opens up many interesting avenues for future work. For example, we have considered a simple, well-known conformal score function from the literature. However, it would be valuable to develop score functions especially designed for decision support systems. Moreover, it would be interesting to perform online estimation of the expert's conditional success probability. Further, it would be important to investigate the ethical impact of our system, including human trust and bias, to understand the robustness of our system to malicious attacks, and to consider alternative performance metrics such as expert prediction time. Finally, it would be important to deploy and evaluate our system on a real-world application with human experts.

## Acknowledgements

We would like to thank the anonymous reviewers for constructive feedback, which has helped improve our paper.
Gomez-Rodriguez acknowledges support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719). Wang acknowledges support from NSF Awards IIS-1901168 and IIS-2008139. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

## References

Angelopoulos, A., Bates, S., Malik, J., and Jordan, M. I. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations, 2021.

Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Babbar, V., Bhatt, U., and Weller, A. On the utility of prediction sets in human-AI teams. arXiv preprint arXiv:2205.01411, 2022.

Balasubramanian, V., Ho, S.-S., and Vovk, V. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014.

Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., and Horvitz, E. Beyond accuracy: The role of mental models in human-AI team performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7(1):2–11, Oct. 2019. URL https://ojs.aaai.org/index.php/HCOMP/article/view/5285.

Bates, S., Angelopoulos, A., Lei, L., Malik, J., and Jordan, M. Distribution-free, risk-controlling prediction sets. J. ACM, 68(6), Sep 2021. ISSN 0004-5411. doi: 10.1145/3478535. URL https://doi.org/10.1145/3478535.

Beach, L. R. Broadening the definition of decision making: The role of prechoice screening of options. Psychological Science, 4(4):215–220, 1993.

Ben-Akiva, M. and Boccara, B. Discrete choice models with latent choice sets. International Journal of Research in Marketing, 12(1):9–24, 1995.

Bordt, S. and von Luxburg, U. When humans and machines make joint decisions: A non-symmetric bandit model. arXiv preprint arXiv:2007.04800, 2020.

Chzhen, E., Denis, C., Hebiri, M., and Lorieul, T. Set-valued classification overview via a unified framework. arXiv preprint arXiv:2102.12318, 2021.

De, A., Koley, P., Ganguly, N., and Gomez-Rodriguez, M. Regression under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

De, A., Okati, N., Zarezade, A., and Gomez-Rodriguez, M. Classification under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.

Del Coz, J. J., Díez, J., and Bahamonde, A. Learning nondeterministic classifiers. Journal of Machine Learning Research, 10(10), 2009.

Grgić-Hlača, N., Engel, C., and Gummadi, K. P. Human decision making with machine assistance: An experiment on bailing and jailing. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–25, 2019.

Gupta, C., Podkopaev, A., and Ramdas, A. Distribution-free binary classification: prediction sets, confidence intervals and calibration. In Advances in Neural Information Processing Systems, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Heiss, F. Discrete choice methods with simulation. Taylor & Francis, 2016.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017. doi: 10.1109/CVPR.2017.243.

Huber, J., Payne, J. W., and Puto, C. Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, 9(1):90–98, 1982.

Hulsman, R. Distribution-free finite-sample guarantees and split conformal prediction. arXiv preprint arXiv:2210.14735, 2022.

Jiao, W., Atwal, G., Polak, P., Karlic, R., Cuppen, E., Danyi, A., de Ridder, J., van Herpen, C., Lolkema, M. P., Steeghs, N., et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nature Communications, 11(1):1–12, 2020.

Kerrigan, G., Smyth, P., and Steyvers, M. Combining human predictions with model probabilities via confusion matrices and calibration. arXiv preprint arXiv:2109.14591, 2021.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Citeseer, 2009.

Levy, A., Agrawal, M., Satyanarayan, A., and Sontag, D. Assessing the impact of automated suggestions on decision making: Domain experts mediate model errors but take less initiative. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021.

Liu, R., Rizzo, S., Whipple, S., Pal, N., Pineda, A. L., Lu, M., Arnieri, B., Lu, Y., Capra, W., Copping, R., et al. Evaluating eligibility criteria of oncology trials using real-world data and AI. Nature, 592(7855):629–633, 2021.

Liu, Z.-G., Pan, Q., Dezert, J., and Mercier, G. Credal classification rule for uncertain data based on belief functions. Pattern Recognition, 47(7):2532–2541, 2014.

Lubars, B. and Tan, C. Ask not what AI can do, but what AI should do: Towards a framework of task delegability. Advances in Neural Information Processing Systems, 32:57–67, 2019.

Luce, R. D. On the possible psychophysical laws. Psychological Review, 66(2):81, 1959.

Ma, L. and Denoeux, T. Partial classification in the belief function framework. Knowledge-Based Systems, 214:106742, 2021.

Meresht, V. B., De, A., Singla, A., and Gomez-Rodriguez, M. Learning to switch among agents in a team. Transactions on Machine Learning Research, 2022.

Mortier, T., Wydmuch, M., Dembczyński, K., Hüllermeier, E., and Waegeman, W. Efficient set-valued prediction in multi-class classification. Data Mining and Knowledge Discovery, 35(4):1435–1469, 2021.

Mozannar, H. and Sontag, D. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pp. 7076–7087, 2020.

Nguyen, V.-L. and Hüllermeier, E. Multilabel classification with partial abstention: Bayes-optimal prediction under label independence. Journal of Artificial Intelligence Research, 72:613–665, 2021.

Nourani, M., King, J. T., and Ragan, E. D. The role of domain expertise in user trust and the impact of first impressions with intelligent systems. arXiv preprint arXiv:2008.09100, 2020.

Okati, N., De, A., and Gomez-Rodriguez, M. Differentiable learning under triage. In Advances in Neural Information Processing Systems, 2021.

Papenmeier, A., Englebienne, G., and Seifert, C. How model accuracy and explanation fidelity influence user trust. arXiv preprint arXiv:1907.12652, 2019.

Peterson, J. C., Battleday, R. M., Griffiths, T. L., and Russakovsky, O. Human uncertainty makes classification more robust.
arXiv preprint arXiv:1908.07086, 2019.

Podkopaev, A. and Ramdas, A. Distribution-free uncertainty quantification for classification under label shift. In Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence, 2021.

Raghu, M., Blumer, K., Corrado, G., Kleinberg, J., Obermeyer, Z., and Mullainathan, S. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220, 2019.

Romano, Y., Patterson, E., and Candes, E. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32:3543–3553, 2019.

Romano, Y., Sesia, M., and Candes, E. Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581–3591, 2020.

Simonson, I. Choice based on reasons: The case of attraction and compromise effects. Journal of Consumer Research, 16(2):158–174, 1989.

Straitouri, E., Singla, A., Meresht, V. B., and Gomez-Rodriguez, M. Reinforcement learning under algorithmic triage. arXiv preprint arXiv:2109.11328, 2021.

Suresh, H., Lao, N., and Liccardi, I. Misplaced trust: Measuring the interference of machine learning in human decision-making. In 12th ACM Conference on Web Science, pp. 315–324, 2020.

Tversky, A. Elimination by aspects: A theory of choice. Psychological Review, 79(4):281, 1972.

Tversky, A. and Simonson, I. Context-dependent preferences. Management Science, 39(10):1179–1189, 1993. ISSN 00251909, 15265501. URL http://www.jstor.org/stable/2632953.

Vodrahalli, K., Gerstenberg, T., and Zou, J. Uncalibrated models can improve human-AI collaboration. In Advances in Neural Information Processing Systems, 2022.

Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer-Verlag, Berlin, Heidelberg, 2005. ISBN 0387001522.

Wang, L., Joachims, T., and Gomez-Rodriguez, M. Improving screening processes via calibrated subset selection. In Proceedings of the 39th International Conference on Machine Learning, 2022.

Wang, X. and Yin, M. Are explanations helpful? A comparative study of the effects of explanations in AI-assisted decision-making. In 26th International Conference on Intelligent User Interfaces, pp. 318–328, 2021.

Wright, P. and Barbour, F. Phased decision strategies: Sequels to an initial screening. Graduate School of Business, Stanford University, 1977.

Yang, G., Destercke, S., and Masson, M.-H. Cautious classification with nested dichotomies and imprecise probabilities. Soft Computing, 21(24):7447–7462, 2017.

Yin, M., Wortman Vaughan, J., and Wallach, H. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12, 2019.

Zhang, Y., Liao, Q. V., and Bellamy, R. K. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 295–305, 2020.

## A.1. Proof of Proposition 4.1

Given the estimators $\hat{\mu}_\alpha$ of $P[\hat{Y} = Y;\, \mathcal{C}_\alpha]$, we have that, for each $\alpha \in \mathcal{A}$, it holds that

$$|\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha]| \;\leq\; \epsilon_{\alpha,\delta/m} \quad (11)$$

with probability at least $1 - \delta/m$. By applying the union bound, we know that the above events hold simultaneously for all $\alpha \in \mathcal{A}$ with probability at least $1-\delta$. Moreover, by rearranging, the above expression can be rewritten as

$$\hat{\mu}_\alpha - \epsilon_{\alpha,\delta/m} \;\leq\; P[\hat{Y} = Y;\, \mathcal{C}_\alpha] \;\leq\; \hat{\mu}_\alpha + \epsilon_{\alpha,\delta/m}. \quad (12)$$

Let $\hat{\alpha} = \operatorname{argmax}_{\alpha \in \mathcal{A}} \{\hat{\mu}_\alpha - \epsilon_{\alpha,\delta/m}\}$.
For $\hat{\alpha}$, with probability $1-\delta$, it holds that, for all $\alpha \in \mathcal{A}$,

$$P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}] \;\geq\; \hat{\mu}_{\hat{\alpha}} - \epsilon_{\hat{\alpha},\delta/m} \;\geq\; \hat{\mu}_\alpha - \epsilon_{\alpha,\delta/m} \;=\; \hat{\mu}_\alpha - \epsilon_{\alpha,\delta/m} + 2\epsilon_{\alpha,\delta/m} - 2\epsilon_{\alpha,\delta/m} \;=\; \hat{\mu}_\alpha + \epsilon_{\alpha,\delta/m} - 2\epsilon_{\alpha,\delta/m} \;\geq\; P[\hat{Y} = Y;\, \mathcal{C}_\alpha] - 2\epsilon_{\alpha,\delta/m},$$

where the last inequality follows from Eq. 12.

## A.2. Derivation of Error Expression for Hoeffding's Inequality

From Hoeffding's inequality we have that:

**Theorem A.1.** Let $Z_1, \ldots, Z_k$ be i.i.d., with $Z_i \in [a, b]$, $i = 1, \ldots, k$, $a < b$, and let $\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} Z_i$ be the empirical estimate of $E[Z] = E[Z_i]$. Then,

$$P[\hat{\mu} - E[Z] \geq \epsilon] \;\leq\; \exp\left(\frac{-2k\epsilon^2}{(b-a)^2}\right) \quad \text{and} \quad P[\hat{\mu} - E[Z] \leq -\epsilon] \;\leq\; \exp\left(\frac{-2k\epsilon^2}{(b-a)^2}\right)$$

hold for all $\epsilon \geq 0$.

In our case, we have $k = m$ and $Z_i = \mathbb{I}\{Y_i \in \mathcal{C}_\alpha(X_i)\}\, P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha \mid Y_i \in \mathcal{C}_\alpha(X_i)] \in [0, 1)$. Moreover, note that the expectation of $Z_i$ is given by

$$E[Z_i] = E\bigl[\mathbb{I}\{Y_i \in \mathcal{C}_\alpha(X_i)\}\, P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha \mid Y_i \in \mathcal{C}_\alpha(X_i)]\bigr] = E\bigl[\mathbb{I}\{Y_i \in \mathcal{C}_\alpha(X_i)\}\, P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha \mid Y_i \in \mathcal{C}_\alpha(X_i)] + \mathbb{I}\{Y_i \notin \mathcal{C}_\alpha(X_i)\}\, P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha \mid Y_i \notin \mathcal{C}_\alpha(X_i)]\bigr] = E\bigl[P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha]\bigr] = P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha],$$

where the second equality holds because $P[\hat{Y} = Y_i;\, \mathcal{C}_\alpha \mid Y_i \notin \mathcal{C}_\alpha(X_i)] = 0$ (the expert must predict a label from the recommended subset), and the expectations are over the joint distribution of prediction sets $\mathcal{C}_\alpha(X)$ and true labels $Y$. Hence, for the empirical estimate $\hat{\mu} = \hat{\mu}_\alpha$ of $P[\hat{Y} = Y;\, \mathcal{C}_\alpha]$ and its error $\epsilon = \epsilon_{\alpha,\delta}$,

$$P\bigl[\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha] \geq \epsilon_{\alpha,\delta}\bigr] \;\leq\; \exp\left(\frac{-2m\epsilon_{\alpha,\delta}^2}{(1-0)^2}\right) \quad \text{and} \quad P\bigl[\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha] \leq -\epsilon_{\alpha,\delta}\bigr] \;\leq\; \exp\left(\frac{-2m\epsilon_{\alpha,\delta}^2}{(1-0)^2}\right)$$

hold. Further, if we set

$$\delta = \exp\left(-2m\epsilon_{\alpha,\delta}^2\right), \quad (17)$$

then

$$1 - P\bigl[\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha] \geq \epsilon_{\alpha,\delta}\bigr] \geq 1 - \delta \;\iff\; P\bigl[\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha] < \epsilon_{\alpha,\delta}\bigr] \geq 1 - \delta \quad (18)$$

$$1 - P\bigl[\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha] \leq -\epsilon_{\alpha,\delta}\bigr] \geq 1 - \delta \;\iff\; P\bigl[\hat{\mu}_\alpha - P[\hat{Y} = Y;\, \mathcal{C}_\alpha] > -\epsilon_{\alpha,\delta}\bigr] \geq 1 - \delta \quad (19)$$

hold for any $\epsilon_{\alpha,\delta} \geq 0$. As follows, based on Eq. 17:

$$\delta = \exp\left(-2m\epsilon_{\alpha,\delta}^2\right) \;\iff\; \log\frac{1}{\delta} = 2m\epsilon_{\alpha,\delta}^2 \;\iff\; \epsilon_{\alpha,\delta}^2 = \frac{\log(1/\delta)}{2m} \;\iff\; \epsilon_{\alpha,\delta} = \sqrt{\frac{\log(1/\delta)}{2m}}.$$

## A.3. Proof of Proposition D.1

We proceed similarly as in Appendix A.5 in Hulsman (2022). First, note that, by definition, we have that $\hat{q}_{\alpha_1} = s_{(\lceil (1-\alpha_1)(m+1) \rceil)}$ and $\hat{q}_{\alpha_2} = s_{(\lceil (1-\alpha_2)(m+1) \rceil)}$, where $s_{(i)}$ denotes the $i$-th smallest conformal score in the calibration set $\mathcal{D}^{\text{cal}}$. Then, as long as the conformal scores in the calibration set are almost surely distinct, it follows directly from Proposition 4 in Hulsman (2022) that

$$P[\hat{q}_{\alpha_2} < s(X, Y) \leq \hat{q}_{\alpha_1} \mid \mathcal{D}^{\text{cal}}] \sim \mathrm{Beta}(l, m - l + 1), \quad (20)$$

where $l = \lceil (m+1)(1-\alpha_1) \rceil - \lceil (m+1)(1-\alpha_2) \rceil$. Moreover, for any $(X, Y) \sim P(X)P(Y \mid X)$, we have that, by construction, $Y \in \mathcal{C}_{\alpha_1,\alpha_2}(X)$ if and only if $s(X, Y) \in (\hat{q}_{\alpha_2}, \hat{q}_{\alpha_1}]$. Then, Eq. 22 follows directly from Eq. 20.

## B. Implementation Details

To implement our algorithms and run all the experiments on synthetic and real data, we used PyTorch 1.12.1, NumPy 1.20.1 and scikit-learn 1.0.2 on Python 3.9.2. For reproducibility, we use a fixed random seed in all random procedures. Moreover, we set $\delta = 0.1$ everywhere.

**Synthetic prediction tasks.** We create $4 \times 3 = 12$ different prediction tasks, where we vary the number of labels $n \in \{10, 50, 100\}$ and the level of difficulty of the task. More specifically, for each value of $n$, we create four different tasks of increasing difficulty, where the success probability of the logistic regression classifier is $P[Y' = Y] = 0.9$, $0.7$, $0.5$ and $0.3$, respectively. To create each task, we use the function `make_classification` of the scikit-learn library. This function allows the creation of data for synthetic prediction tasks with very particular user-defined characteristics, through the generation of clusters of normally distributed points on the vertices of a multidimensional hypercube. The number of dimensions of the hypercube indicates the number of informative features of each sample, which in our case we set to 15 for all prediction tasks. Linear combinations of these points, i.e., of the informative features, are used to create redundant features, the number of which we set to 5. The difficulty of the prediction task is controlled through the size of the hypercube, via a multiplicative factor, namely `class_sep`, which we tuned for each value of $n$ so that the success probability of the logistic regression classifier spans a wide range of values across tasks. All the selected values of this parameter can be found in the configuration file `config.py` in the code. Finally, we set the proportion of the samples assigned to each label, i.e., the function parameter `weights`, using a Dirichlet distribution of order $n$ with parameters $\alpha_1 = \cdots = \alpha_n = 1$.
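Concretely, a single synthetic task could be generated as follows (a sketch of our own; `class_sep=1.0` and the split sizes are placeholders, since the tuned values live in the authors' `config.py`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

n = 10
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(n)).tolist()   # label proportions ~ Dirichlet
X, y = make_classification(
    n_samples=10_000, n_features=20, n_informative=15, n_redundant=5,
    n_classes=n, n_clusters_per_class=1, weights=weights,
    class_sep=1.0,      # controls task difficulty (tuned values in config.py)
    random_state=0,
)
clf = LogisticRegression(max_iter=1000).fit(X[:6_000], y[:6_000])
print("classifier success probability:", clf.score(X[8_000:], y[8_000:]))
```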
## C. Additional Synthetic Prediction Tasks, Number of Labels and Amount of Calibration and Estimation Data

To complement the results in Table 1 in the main paper, we experiment with additional prediction tasks with a different number of labels $n$ and amount of calibration and estimation data $m$. For each value of $n$ and $m$, we estimate the success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}]$ achieved by four different experts using our system, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, on four prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. Figure 5 summarizes the results.

(a) $n = 50$, $m = 1{,}200$

| $P[\hat{Y} = Y;\, \mathcal{Y}]$ \ $P[Y'=Y]$ | 0.3 | 0.5 | 0.7 | 0.9 |
|---|---|---|---|---|
| 0.3 | 0.56 | 0.72 | 0.84 | 0.94 |
| 0.5 | 0.68 | 0.80 | 0.89 | 0.95 |
| 0.7 | 0.79 | 0.87 | 0.93 | 0.97 |
| 0.9 | 0.92 | 0.95 | 0.97 | 0.99 |

(b) $n = 100$, $m = 1{,}200$

| $P[\hat{Y} = Y;\, \mathcal{Y}]$ \ $P[Y'=Y]$ | 0.3 | 0.5 | 0.7 | 0.9 |
|---|---|---|---|---|
| 0.3 | 0.62 | 0.76 | 0.87 | 0.95 |
| 0.5 | 0.72 | 0.83 | 0.91 | 0.96 |
| 0.7 | 0.83 | 0.90 | 0.95 | 0.98 |
| 0.9 | 0.93 | 0.96 | 0.98 | 0.99 |

(c) $n = 10$, $m = 400$

| $P[\hat{Y} = Y;\, \mathcal{Y}]$ \ $P[Y'=Y]$ | 0.3 | 0.5 | 0.7 | 0.9 |
|---|---|---|---|---|
| 0.3 | 0.42 | 0.58 | 0.75 | 0.91 |
| 0.5 | 0.55 | 0.66 | 0.80 | 0.93 |
| 0.7 | 0.72 | 0.79 | 0.87 | 0.96 |
| 0.9 | 0.90 | 0.92 | 0.94 | 0.98 |

(d) $n = 50$, $m = 400$

| $P[\hat{Y} = Y;\, \mathcal{Y}]$ \ $P[Y'=Y]$ | 0.3 | 0.5 | 0.7 | 0.9 |
|---|---|---|---|---|
| 0.3 | 0.56 | 0.73 | 0.84 | 0.94 |
| 0.5 | 0.67 | 0.80 | 0.88 | 0.96 |
| 0.7 | 0.79 | 0.88 | 0.93 | 0.98 |
| 0.9 | 0.92 | 0.94 | 0.97 | 0.99 |

(e) $n = 100$, $m = 400$

| $P[\hat{Y} = Y;\, \mathcal{Y}]$ \ $P[Y'=Y]$ | 0.3 | 0.5 | 0.7 | 0.9 |
|---|---|---|---|---|
| 0.3 | 0.62 | 0.77 | 0.87 | 0.95 |
| 0.5 | 0.73 | 0.83 | 0.91 | 0.97 |
| 0.7 | 0.83 | 0.89 | 0.95 | 0.98 |
| 0.9 | 0.93 | 0.96 | 0.98 | 0.99 |

Figure 5. Empirical success probability achieved by four different experts using our system during test, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, on four prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. Each table corresponds to a different number of label values $n$ and calibration and estimation set size $m$. For readability, each cell shows only the average since the standard errors are all below $10^{-2}$.

## D. Beyond Standard Conformal Prediction

In Section 4, we have used standard conformal prediction (Angelopoulos & Bates, 2021) to construct the recommended subsets $\mathcal{C}(X)$: we have constructed $\mathcal{C}(X)$ by comparing the conformal scores $s(X, y)$ to a single threshold $\hat{q}$, as shown in Eq. 3. Here, we introduce a set-valued predictor based on conformal prediction that constructs $\mathcal{C}(X)$ using two thresholds, $\hat{q}_{\alpha_1}$ and $\hat{q}_{\alpha_2}$. By doing so, the recommended subsets will include label values whose corresponding conformal scores are neither unreasonably large, as in standard conformal prediction, nor unreasonably low in comparison with the conformal scores of the samples in the calibration set $\mathcal{D}^{\text{cal}}$. This may be useful in scenarios where the classifier underpinning our system does not have particularly high average accuracy.¹⁶
C. Additional Synthetic Prediction Tasks, Number of Labels and Amount of Calibration and Estimation Data

To complement the results in Table 1 in the main paper, we experiment with additional prediction tasks with a different number of labels $n$ and amount of calibration and estimation data $m$. For each value of $n$ and $m$, we estimate the success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\hat{\alpha}}]$ achieved by four different experts using our system, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, on four prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. Figure 5 summarizes the results.

(a) n = 50, m = 1,200

    P[Ŷ=Y; Y] \ P[Y′=Y]    0.3     0.5     0.7     0.9
    0.3                     0.56    0.72    0.84    0.94
    0.5                     0.68    0.80    0.89    0.95
    0.7                     0.79    0.87    0.93    0.97
    0.9                     0.92    0.95    0.97    0.99

(b) n = 100, m = 1,200

    P[Ŷ=Y; Y] \ P[Y′=Y]    0.3     0.5     0.7     0.9
    0.3                     0.62    0.76    0.87    0.95
    0.5                     0.72    0.83    0.91    0.96
    0.7                     0.83    0.90    0.95    0.98
    0.9                     0.93    0.96    0.98    0.99

(c) n = 10, m = 400

    P[Ŷ=Y; Y] \ P[Y′=Y]    0.3     0.5     0.7     0.9
    0.3                     0.42    0.58    0.75    0.91
    0.5                     0.55    0.66    0.80    0.93
    0.7                     0.72    0.79    0.87    0.96
    0.9                     0.90    0.92    0.94    0.98

(d) n = 50, m = 400

    P[Ŷ=Y; Y] \ P[Y′=Y]    0.3     0.5     0.7     0.9
    0.3                     0.56    0.73    0.84    0.94
    0.5                     0.67    0.80    0.88    0.96
    0.7                     0.79    0.88    0.93    0.98
    0.9                     0.92    0.94    0.97    0.99

(e) n = 100, m = 400

    P[Ŷ=Y; Y] \ P[Y′=Y]    0.3     0.5     0.7     0.9
    0.3                     0.62    0.77    0.87    0.95
    0.5                     0.73    0.83    0.91    0.97
    0.7                     0.83    0.89    0.95    0.98
    0.9                     0.93    0.96    0.98    0.99

Figure 5. Empirical success probability achieved by four different experts using our system during test, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$ (rows), on four prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$ (columns). Each table corresponds to a different number of label values $n$ and calibration and estimation set size $m$. For readability, each cell shows only the average, since the standard errors are all below $10^{-2}$.

D. Beyond Standard Conformal Prediction

In Section 4, we used standard conformal prediction (Angelopoulos & Bates, 2021) to construct the recommended subsets $\mathcal{C}(X)$, that is, we constructed $\mathcal{C}(X)$ by comparing the conformal scores $s(X, y)$ to a single threshold $\hat{q}$, as shown in Eq. 3. Here, we introduce a set-valued predictor based on conformal prediction that constructs $\mathcal{C}(X)$ using two thresholds, $\hat{q}_{\alpha_1}$ and $\hat{q}_{\alpha_2}$. By doing so, the recommended subsets include label values whose corresponding conformal scores are neither unreasonably large, as in standard conformal prediction, nor unreasonably low in comparison with the conformal scores of the samples in the calibration set $\mathcal{D}_{\text{cal}}$. This may be useful in scenarios where the classifier underpinning our system does not have a particularly high average accuracy.16

More specifically, given a calibration set $\mathcal{D}_{\text{cal}} = \{(x_i, s_i)\}_{i=1}^{m}$, let $\alpha_1, \alpha_2 \in [0, 1]$ with $\alpha_1 < \alpha_2$, and let $\hat{q}_{\alpha_1}$ and $\hat{q}_{\alpha_2}$ be the $\lceil (m+1)(1-\alpha_1) \rceil / m$ and $\lceil (m+1)(1-\alpha_2) \rceil / m$ empirical quantiles of the conformal scores $s(x_1, y_1), \ldots, s(x_m, y_m)$. If we construct the subsets $\mathcal{C}_{\alpha_1,\alpha_2}(X)$ for new data samples as
$$\mathcal{C}_{\alpha_1,\alpha_2}(X) = \{y \mid \hat{q}_{\alpha_2} < s(X, y) \leq \hat{q}_{\alpha_1}\}, \tag{21}$$
then the probability that the true label $Y$ belongs to the subset $\mathcal{C}_{\alpha_1,\alpha_2}(X)$, conditional on the calibration set $\mathcal{D}_{\text{cal}}$, is almost exactly $\alpha_2 - \alpha_1$ with high probability, as long as the size $m$ of the calibration set is sufficiently large. More specifically, we first note that the coverage probability is a random quantity whose distribution is given by the following proposition, the counterpart of Proposition 3.1:

Proposition D.1. For a decision support system $\mathcal{C}_{\alpha_1,\alpha_2}$ that constructs $\mathcal{C}_{\alpha_1,\alpha_2}(X)$ using Eq. 21, as long as the conformal scores $s(x_i, y_i)$ for all $(x_i, y_i) \in \mathcal{D}_{\text{cal}}$ are almost surely distinct, it holds that
$$P[Y \in \mathcal{C}_{\alpha_1,\alpha_2}(X) \mid \mathcal{D}_{\text{cal}}] \sim \text{Beta}(l, m - l + 1), \tag{22}$$
where $l = \lceil (m+1)(1-\alpha_1) \rceil - \lceil (m+1)(1-\alpha_2) \rceil$.

As an immediate consequence of Proposition D.1, using the definition of the beta distribution, we have that
$$\alpha_2 - \alpha_1 \leq E\left[P[Y \in \mathcal{C}_{\alpha_1,\alpha_2}(X) \mid \mathcal{D}_{\text{cal}}]\right] = \alpha_2 - \alpha_1 + \frac{c_1 - c_2}{m + 1} \leq \alpha_2 - \alpha_1 + \frac{1}{m + 1},$$
where $c_1, c_2 \in [0, 1)$. Moreover, given a target probability $\alpha_2 - \alpha_1$ and tolerance values $\delta, \epsilon \in (0, 1)$, we can compute the minimum size $m$ of the calibration set $\mathcal{D}_{\text{cal}}$ such that $\mathcal{C}_{\alpha_1,\alpha_2}$ enjoys Probably Approximately Correct (PAC) coverage guarantees, i.e., with probability $1 - \delta$, it holds that
$$\alpha_2 - \alpha_1 - \epsilon \leq P[Y \in \mathcal{C}_{\alpha_1,\alpha_2}(X) \mid \mathcal{D}_{\text{cal}}] \leq \alpha_2 - \alpha_1 + \epsilon.$$
Finally, given an estimator $\hat{\mu}_{\alpha_1,\alpha_2}$ of the expert's success probability such that, for each $\alpha_1 < \alpha_2$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ it holds that $|\hat{\mu}_{\alpha_1,\alpha_2} - P[\hat{Y} = Y;\, \mathcal{C}_{\alpha_1,\alpha_2}]| \leq \epsilon_{\alpha_1,\alpha_2,\delta}$, we can proceed similarly as in standard conformal prediction to find the near-optimal $\hat{\alpha}_1, \hat{\alpha}_2 \in \mathcal{A}$ that maximize the expert's success probability with high probability, by using $\hat{\mu}_{\alpha_1,\alpha_2}$ and $\epsilon_{\alpha_1,\alpha_2,2\delta/(m(m-1))}$. Here, it is worth pointing out that, in contrast with standard conformal prediction, the time complexity of finding the near-optimal $\hat{\alpha}_1$ and $\hat{\alpha}_2$ is $O(m \log m + mn \log n + mn^2)$. Moreover, we can still rely on the practical method to estimate the expert's conditional success probability introduced in Section 4.

16 In such scenarios, the conformal scores of the samples in the calibration set can occasionally take low values (otherwise, the classifier would be highly accurate), and it is thus beneficial to exclude from the recommended subsets label values with (very) low conformal scores, namely those label values the classifier is confidently wrong about.
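To make the construction in Eq. 21 concrete, here is a minimal sketch, assuming NumPy; the function name is ours, and a label is recommended exactly when its score falls in the interval $(\hat{q}_{\alpha_2}, \hat{q}_{\alpha_1}]$, as the equation prescribes.

```python
import numpy as np

def two_threshold_sets(cal_scores, test_scores, alpha1, alpha2):
    """Construct the subsets C_{alpha1, alpha2}(X) of Eq. 21.

    cal_scores: conformal scores s(x_i, y_i) of the calibration set, shape (m,).
    test_scores: conformal scores s(X, y) for new samples, shape (n_test, n_labels).
    Requires alpha1 < alpha2, so that q_hat_alpha2 <= q_hat_alpha1."""
    m = len(cal_scores)
    s_sorted = np.sort(cal_scores)
    # q_hat_alpha is the ceil((m + 1)(1 - alpha))-th smallest calibration score.
    i1 = int(np.ceil((m + 1) * (1 - alpha1)))
    i2 = int(np.ceil((m + 1) * (1 - alpha2)))
    q1 = s_sorted[min(i1, m) - 1]  # upper threshold; min(., m) guards small alpha
    q2 = s_sorted[min(i2, m) - 1]  # lower threshold
    # Boolean mask: label y is recommended iff q2 < s(X, y) <= q1.
    return (test_scores > q2) & (test_scores <= q1)
```

By Proposition D.1, the coverage of the subsets returned by such a construction concentrates around $\alpha_2 - \alpha_1$ as $m$ grows.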
E. Sensitivity to the Choice of Calibration Set

In this section, we repeat the experiments on synthetic and real data using 100 independent realizations of the calibration, estimation and test sets. Then, for each data split, we compare the empirical coverage
$$1 - \hat{\alpha}_{\text{emp}} := \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{(x, y) \in \mathcal{D}_{\text{test}}} \mathbb{I}[y \in \mathcal{C}_{\hat{\alpha}}(x)]$$
achieved by our system $\mathcal{C}_{\hat{\alpha}}$ on the test set $\mathcal{D}_{\text{test}}$ to the corresponding target coverage $1 - \hat{\alpha}$. Figure 6 summarizes the results for (a) one synthetic prediction task and one synthetic expert and (b) one popular deep neural network classifier on the CIFAR-10H dataset. We find that the value of the near-optimal $\hat{\alpha}$ does not vary significantly across experiments (i.e., across calibration sets) and, for each experiment, the empirical coverage $1 - \hat{\alpha}_{\text{emp}}$ is very close to, and typically higher than, the target coverage $1 - \hat{\alpha}$. We found similar results for other expert-classifier pairs with different success probabilities.

[Figure 6: two panels, (a) Synthetic Task and (b) CIFAR-10H, plotting the empirical coverage $1 - \hat{\alpha}_{\text{emp}}$ and the target coverage $1 - \hat{\alpha}$ against the index of each of the 100 realizations.]

Figure 6. Empirical test coverage $1 - \hat{\alpha}_{\text{emp}}$ and target coverage $1 - \hat{\alpha}$ for 100 independent realizations of the calibration, estimation and test sets. In Panel (a), the synthetic task comprises a classifier with $P[Y' = Y] = 0.5$ and an expert with $P[\hat{Y} = Y;\, \mathcal{Y}] = 0.5$, the number of labels is $n = 10$ and the size of the calibration and estimation sets is $m = 1{,}200$. In Panel (b), the classifier is the popular DenseNet classifier and $m = 1{,}500$.
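The empirical coverage above amounts to a single array operation. A minimal sketch, assuming NumPy and the boolean prediction-set masks of the earlier sketches (the names are ours):

```python
import numpy as np

def empirical_coverage(pred_sets: np.ndarray, labels: np.ndarray) -> float:
    """Empirical test coverage 1 - alpha_emp: the fraction of test samples whose
    true label belongs to the recommended subset C_alpha_hat(x).

    pred_sets: boolean mask of shape (n_test, n_labels); labels: shape (n_test,)."""
    return float(pred_sets[np.arange(len(labels)), labels].mean())
```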
F. Success Probability Achieved by an Expert Using Top-k Set-Valued Predictors

In this section, we estimate the success probability achieved by an expert using the top-k set-valued predictor for different values of k, using both synthetic and real data. Figures 7 and 8 summarize the results, which show that, by allowing for recommended subsets of varying size, our system is consistently superior to the top-k set-valued predictor across configurations. Moreover, the results on synthetic data also show that, the higher the expert's success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, the greater the optimal value of k (i.e., the greater the optimal size of the recommended subsets $\mathcal{C}_k(X)$). This latter observation is consistent with the behavior exhibited by our system, where the higher the expert's success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, the lower the value of the near-optimal $\hat{\alpha}$ and thus the greater the average size of the recommended subsets $\mathcal{C}_{\hat{\alpha}}(X)$, as shown in Figure 9.

[Figure 7: six panels plotting success probability against k ∈ {2, ..., 9}, one per combination of expert success probability $P[\hat{Y} = Y;\, \mathcal{Y}] \in \{0.5, 0.9\}$ and classifier success probability $P[Y' = Y] \in \{0.5, 0.7, 0.9\}$.]

Figure 7. Empirical success probability achieved by two different experts using the top-k set-valued predictor ($\mathcal{C}_k$) during test, each with a different success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$, on three prediction tasks where the classifier achieves a different success probability $P[Y' = Y]$. In each panel, the horizontal dashed line shows the empirical success probability achieved by the same experts using our system ($\mathcal{C}_{\hat{\alpha}}$) during test. In all panels, the number of labels is $n = 10$, the size of the calibration and estimation sets is $m = 1{,}200$ and the results for the optimal k value during test are highlighted in orange.

[Figure 8: three panels plotting success probability against k ∈ {2, ..., 9}, one per classifier: ResNet-110 ($P[Y' = Y] = 0.928$), PreResNet-110 ($P[Y' = Y] = 0.944$) and DenseNet ($P[Y' = Y] = 0.964$).]

Figure 8. Empirical success probability achieved by an expert using three different top-k predictors ($\mathcal{C}_k$) during test, each with a different deep neural network classifier, on the CIFAR-10H dataset. In each panel, the horizontal dashed line shows the empirical success probability achieved by the same expert using our system ($\mathcal{C}_{\hat{\alpha}}$) during test. In all panels, the size of the calibration and estimation sets is $m = 1{,}500$ and the results for the optimal k value during test are highlighted in orange.
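For reference, the top-k baseline recommends, for every sample, the k labels to which the classifier assigns the highest scores. The following is a minimal sketch, assuming NumPy, with names of our choosing.

```python
import numpy as np

def top_k_sets(class_probs: np.ndarray, k: int) -> np.ndarray:
    """Top-k set-valued predictor C_k: a boolean mask of shape (n_test, n_labels)
    that marks, for every sample, the k labels with the highest predicted scores.
    Unlike the subsets C_alpha_hat(X), every recommended subset has exactly size k."""
    n_test, n_labels = class_probs.shape
    top_k = np.argsort(class_probs, axis=1)[:, -k:]  # indices of the k largest scores
    mask = np.zeros((n_test, n_labels), dtype=bool)
    mask[np.arange(n_test)[:, None], top_k] = True
    return mask
```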
G. Size Distribution of the Recommended Subsets

Figure 9 shows the empirical size distribution of the subsets $\mathcal{C}_{\hat{\alpha}}(X)$ recommended by our system during test for different experts and prediction tasks on synthetic data. The results show that, as the expert's success probability $P[\hat{Y} = Y;\, \mathcal{Y}]$ increases and the near-optimal $\hat{\alpha}$ decreases, the spread of the size distribution increases.

[Figure 9: nine histogram panels of the empirical distribution of $|\mathcal{C}_{\hat{\alpha}}(X)|$, one per combination of expert success probability $P[\hat{Y} = Y;\, \mathcal{Y}] \in \{0.5, 0.7, 0.9\}$ and classifier success probability $P[Y' = Y] \in \{0.5, 0.7, 0.9\}$.]

Figure 9. Empirical size distribution of the subsets $\mathcal{C}_{\hat{\alpha}}(X)$ recommended by our system during test for different prediction tasks where the expert and the classifier achieve different success probabilities $P[\hat{Y} = Y;\, \mathcal{Y}]$ and $P[Y' = Y]$, respectively. In all panels, the number of labels is $n = 10$ and the size of the calibration and estimation sets is $m = 1{,}200$.

H. Performance of Our System Under Different α Values

In this section, we complement the results on the CIFAR-10H dataset in the main paper by comparing, for each choice of classifier, the performance of our system under the near-optimal $\hat{\alpha}$ found by Algorithm 1 and under all other possible $\alpha$ values, including the optimal $\alpha^*$. Figure 10 summarizes the results, which suggest that, similarly as in the experiments on synthetic data, the performance of our system under $\hat{\alpha}$ and $\alpha^*$ is very similar. However, since the classifiers are all highly accurate, the average size of the recommended subsets under $\hat{\alpha}$ and $\alpha^*$ is quite close to one, even though $\hat{\alpha}$ is much smaller than in the experiments on synthetic data.

[Figure 10: two panels, (a) empirical success probability against $\alpha$ and (b) empirical average set size against $\alpha$, for $\alpha \in [10^{-3}, 1]$ on a logarithmic axis and for the ResNet-110, PreResNet-110 and DenseNet-BC classifiers.]

Figure 10. Empirical success probability achieved by an expert using our system and average size of the recommended sets during test, for each $\alpha \in \mathcal{A}$ and for three popular deep neural network classifiers on the CIFAR-10H dataset. The size of the calibration and estimation sets is $m = 1{,}500$. Each marker corresponds to a different $\alpha$ value and the darker points correspond to $\hat{\alpha}$ for each task.

I. Additional Experiments Using an Estimator of the Expert's Success Probability with a More Expressive Context

In this section, we repeat the experiments on the CIFAR-10H dataset using an alternative discrete choice model with a more expressive context which, in addition to the true label, distinguishes between different levels of difficulty across data samples. The goal here is to show that our results are not an artifact of the choice of context used in the main paper. We consider three increasing levels of difficulty, denoted as $L_{\text{easy}}$, $L_{\text{medium}}$ and $L_{\text{hard}}$. The difficulty levels correspond to the 50% and 25% quantiles of the experts' fractions of correct predictions per sample in the (original) multiclass classification task. Samples with a fraction of correct predictions larger than the 50% quantile belong to $L_{\text{easy}}$, those with a fraction of correct predictions smaller than the 25% quantile belong to $L_{\text{hard}}$, and the remaining ones belong to $L_{\text{medium}}$. Then, given a sample $(x, y)$ of difficulty $L$, we assume that the expert's conditional success probability for the subset $\mathcal{C}_{\alpha}(x)$ is given by:
$$P[\hat{Y} = y;\, \mathcal{C}_{\alpha} \mid y \in \mathcal{C}_{\alpha}(x), L] = \frac{e^{u^L_{yy}}}{\sum_{y' \in \mathcal{C}_{\alpha}(x)} e^{u^L_{yy'}}}, \tag{23}$$
where $u^L_{yy'}$ denotes the expert's preference for the label value $y' \in \mathcal{Y}$ whenever the true label is $y$ and the difficulty level of the sample is $L$. Further, to estimate the parameters $u^L_{yy'}$, we resort to the conditional confusion matrix for the expert predictions on samples of difficulty $L$, i.e., $C^L = (C^L_{yy'})_{y, y' \in \mathcal{Y}}$, where $C^L_{yy'} = P[\hat{Y} = y';\, \mathcal{Y} \mid Y = y, L]$, and set $u^L_{yy'} = \log C^L_{yy'}$. Finally, we compute a Monte Carlo estimate $\hat{\mu}_{\alpha}$ of the expert's success probability $P[\hat{Y} = Y;\, \mathcal{C}_{\alpha}]$ required by Algorithm 1 using the above conditional success probability and an estimation set $\mathcal{D}_{\text{est}} = \{(x_i, y_i)\}_{i \in [m]}$, i.e.,
$$\hat{\mu}_{\alpha} = \frac{1}{m} \sum_{i \in [m] \,:\, y_i \in \mathcal{C}_{\alpha}(x_i)} P[\hat{Y} = y_i;\, \mathcal{C}_{\alpha} \mid y_i \in \mathcal{C}_{\alpha}(x_i), L(x_i)], \tag{24}$$
where $L(x_i) \in \{L_{\text{easy}}, L_{\text{medium}}, L_{\text{hard}}\}$ denotes the difficulty level of $x_i$.

Table 3 summarizes the results, which suggest that, in agreement with the main paper, an expert using our system may solve the prediction task significantly more accurately than the expert or the classifier on their own.

Table 3. Empirical success probabilities achieved by three popular deep neural network classifiers and by an expert using our system with these classifiers during test on the CIFAR-10H dataset. Here, we assume the expert follows the alternative discrete choice model defined by Eq. 23. The size of the calibration and estimation sets is $m = 1{,}500$ and the expert's empirical success probability at solving the (original) multiclass task is $P[\hat{Y} = Y;\, \mathcal{Y}] \approx 0.947$. Each cell shows only the average, since the standard errors are all below $10^{-2}$.

    Classifier       P[Y′ = Y]    Expert using Cˆα
    ResNet-110       0.928        0.981
    PreResNet-110    0.944        0.983
    DenseNet         0.964        0.987

Finally, similarly as in the main paper, we also found that our system performs well with a small amount of calibration and estimation data: the relative gain in empirical success probability achieved by an expert using our system, with respect to the same expert on their own, rises from 3.02 ± 0.05% under m = 200 to just 3.28 ± 0.04% under m = 1,500.
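To illustrate Eqs. 23 and 24, here is a minimal sketch, assuming NumPy; the function names are ours, and the conditional confusion matrices, difficulty levels and boolean prediction-set masks are taken as given.

```python
import numpy as np

def conditional_success_prob(conf_mats, y, level, pred_set):
    """Eq. 23: multinomial logit over the labels in the recommended subset, with
    preferences u^L_{yy'} = log C^L_{yy'}, so that exp(u^L_{yy'}) = C^L_{yy'}.

    conf_mats: conditional confusion matrices, shape (n_levels, n_labels, n_labels).
    pred_set: boolean mask over labels; assumes the true label y is in the set."""
    weights = conf_mats[level, y] * pred_set  # zero outside the recommended subset
    return conf_mats[level, y, y] / weights.sum()

def estimate_success_prob(conf_mats, labels, levels, pred_sets):
    """Eq. 24: Monte Carlo estimate mu_hat_alpha over an estimation set; samples
    whose true label falls outside the recommended subset contribute zero."""
    m = len(labels)
    total = 0.0
    for i in range(m):
        if pred_sets[i, labels[i]]:
            total += conditional_success_prob(conf_mats, labels[i], levels[i], pred_sets[i])
    return total / m
```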