Human-Aligned Calibration for AI-Assisted Decision Making

Nina L. Corvelo Benz, Max Planck Institute for Software Systems and ETH Zürich, ninacobe@mpi-sws.org
Manuel Gomez Rodriguez, Max Planck Institute for Software Systems, manuel@mpi-sws.org

Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. The decision maker is then supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has often been argued that the confidence value should correspond to a well-calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties developing a good sense of when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then to investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values: an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.

1 Introduction

In recent years, there has been increasing excitement about the potential of machine learning models to improve decision making in a variety of high-stakes domains such as medicine, education or criminal justice [1–3]. One of the main focuses has been binary classification tasks, where a classifier helps a decision maker by predicting a binary label of interest using a set of observable features [4–7]. For example, in medical treatment, the classifier may help a doctor by predicting whether a patient may benefit from a treatment. In college admissions, it may help an admissions committee by predicting whether a candidate may successfully complete an undergraduate program. In loan decisions, it may help a bank by predicting whether a prospective customer may default on a loan. In all these scenarios, the decision maker (the doctor, the committee or the bank) aims to use these predictions, together with their own predictions, to take good decisions that maximize a given utility function. In this context, since the predictions are unlikely to always match the truth, it has been widely agreed that the classifier should also provide a confidence value together with each prediction [8, 9].

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
While the conventional wisdom is that the confidence value should be a well-calibrated estimate of the probability that the predicted label matches the true label [10–16], multiple lines of empirical evidence have recently shown that decision makers have difficulties developing a good sense of when to trust a prediction using these confidence values [17–19]. Therein, Vodrahalli et al. [17] have shown that, in certain scenarios, decision makers take better decisions using uncalibrated probability estimates rather than calibrated ones. However, a theoretical framework explaining this puzzling observation has been missing, and it is yet unclear what properties we should be looking for to guarantee that confidence values are useful for AI-assisted decision making. In our work, we aim to bridge this gap.

Our contributions. We start by formally characterizing AI-assisted decision making using a structural causal model (SCM) [20], as seen in Figure 1. Building upon this characterization, we first argue that, if a decision maker is rational, the level of trust she places on predictions will be monotone on the confidence values: she will place more (less) trust on predictions with higher (lower) confidence values. Then, we show that, for a broad class of utility functions, there are data distributions for which a rational decision maker can never take optimal decisions using calibrated estimates of the probability that the predicted label matches the true label as confidence values. However, we further show that, if the confidence values a decision maker uses satisfy a natural alignment property with respect to the confidence she has on her own predictions, which we refer to as human-alignment, then the decision maker can both be rational and take optimal decisions. In addition, we demonstrate that human-alignment can be achieved via multicalibration [11], a statistical notion introduced in the context of algorithmic fairness. In particular, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for human-alignment. Finally, we validate our theoretical framework using real data from four different AI-assisted decision making tasks in which a classifier provides decision support to human decision makers on binary classification problems. Our results suggest that, comparing across tasks, classifiers providing human-aligned confidence values facilitate better decisions than classifiers providing confidence values that are not human-aligned. Moreover, our results also suggest that rational decision makers' trust level increases monotonically with the classifier's provided confidence.

Further related work. Our work builds upon a rapidly growing literature on AI-assisted decision making (refer to Lai et al. [21] for a recent review). More specifically, it is motivated by several empirical studies showing that decision makers have difficulties modulating trust using confidence values [17–19], as discussed previously. In this context, it is also worth noting that other empirical studies have analyzed how other factors, such as model explanations and accuracy, modulate trust [22–26]. However, except for a very recent notable exception [27], theoretical frameworks, which could be used to better understand the mixed findings of these empirical studies, have been missing.
More broadly, our work also relates to a flurry of recent work on reinforcement learning with human feedback [28–30], which aims to better align the outputs of large language models (LLMs) with human preferences. However, our formulation is fundamentally different and our technical contributions are orthogonal to theirs.

2 A Causal Model of AI-Assisted Decision Making

We consider an AI-assisted decision making process where, for each realization of the process, a decision maker first observes a set of features (x, v) ∈ X × V, then takes a binary decision t ∈ {0, 1} informed by a classifier's prediction ŷ = argmax_y f_y(x), as well as its confidence f_ŷ(x) ∈ [0, 1], of a binary label of interest y ∈ {0, 1}, and finally receives a utility u(t, y) ∈ R. Such an AI-assisted decision making process fits a variety of real-world applications. For example, in medical treatment, the features (x, v) may comprise multiple sources of information regarding a patient's health¹, the label y may indicate whether a patient would benefit from a specific treatment, the decision t may indicate whether the doctor applies the specific treatment to the patient, and the utility u(t, y) may quantify the trade-off between health benefit to the patient and economic cost to the decision maker. In what follows, rather than working with both ŷ and f_ŷ(x), we will work with just b = f_1(x), which we will refer to as the classifier's confidence, without loss of generality².

¹ Our formulation allows for a subset of the features v to be available only to the decision maker but not to the classifier.
² We can recover ŷ and f_ŷ(x) from b, i.e., if b > 0.5, we have that ŷ = 1 and f_ŷ(x) = b; if b < 0.5, ŷ = 0 and f_ŷ(x) = 1 − b.

Figure 1: Our structural causal model M. Orange circles represent endogenous random variables and blue boxes represent exogenous random variables. The value of each endogenous variable is given by a function of the values of its ancestors in the structural causal model, as defined by Eqs. 2 and 3. The value of each exogenous variable is sampled independently from a given distribution.

Moreover, we will assume that the utility u(t, y) is greater if the values of t and y coincide, i.e.,

u(1, 1) > u(1, 0), u(1, 1) > u(0, 1), u(0, 0) > u(1, 0), and u(0, 0) ≥ u(0, 1),   (1)

a condition that we think is natural under an appropriate choice of label and decision values. For example, in medical diagnosis, if t = 1 means the patient is tested early for a disease and y = 1 means the patient suffers from the disease, the above condition implies that the utility of either testing a patient who suffers from the disease or not testing a patient who does not suffer from the disease is greater than the utility of either not testing a patient who suffers from the disease or testing a patient who does not. In condition (1), we allow for a non-strict inequality u(0, 0) ≥ u(0, 1) because, in settings in which the label Y is only realized whenever the decision is t = 1 (e.g., in our previous example on medical treatment, we can only observe whether a treatment is eventually beneficial if the patient is treated), it has been argued that, whenever t = 0, any choice of utility must be independent of the label value [4–6], i.e., u(0, 0) = u(0, 1) = u(0).
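To make condition (1) concrete, the following minimal sketch checks the condition for an illustrative utility matrix and derives the break-even confidence above which deciding t = 1 maximizes expected utility (cf. Lemma 2 in Appendix A). The specific utility values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Minimal sketch: an illustrative (hypothetical) utility u(t, y) satisfying Eq. (1),
# and the break-even confidence above which deciding t = 1 maximizes expected utility.
u = {(1, 1): 1.0, (1, 0): -0.5,   # utility of deciding t = 1 when y = 1 / y = 0
     (0, 1): 0.0, (0, 0): 0.0}    # with t = 0, the utility is label-independent here

# Check the condition in Eq. (1).
assert u[1, 1] > u[1, 0] and u[1, 1] > u[0, 1]
assert u[0, 0] > u[1, 0] and u[0, 0] >= u[0, 1]

def expected_utility(t: int, p: float) -> float:
    """Expected utility of decision t when the probability of Y = 1 is p."""
    return p * u[t, 1] + (1 - p) * u[t, 0]

# Deciding t = 1 is preferable in expectation whenever p exceeds the break-even value c,
# obtained by equating expected_utility(1, p) and expected_utility(0, p).
c = (u[0, 0] - u[1, 0]) / (u[1, 1] - u[1, 0] + u[0, 0] - u[0, 1])
for p in (0.2, 0.5, 0.8):
    t_best = int(expected_utility(1, p) >= expected_utility(0, p))
    print(f"P(Y=1)={p:.1f}: break-even c={c:.3f}, best decision t={t_best}")
```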
Next, we characterize the above AI-assisted decision making process using a structural causal model (SCM) [20], which we denote as M. The SCM M is defined by a set of assignments, which entail a distribution P_M and divide naturally into two subsets. One subset comprises the features and the label³, i.e.,

X = f_X(D), V = f_V(D) and Y = f_Y(D),   (2)

where D is an independent exogenous random variable, often called exogenous noise, characterizing the data generating process, and f_X, f_V and f_Y are given functions⁴. The second subset comprises the decision maker and the classifier, i.e.,

H = f_H(X, V, Q), B = f_B(X, H), T = π(H, B, W) and U = u(T, Y),   (3)

where f_H and f_B are given functions, which determine the decision maker's confidence H and the classifier's confidence B that the value of the label of interest is Y = 1, π is a given AI-assisted decision policy, which determines the decision maker's decision T, u is a given utility function, which determines the utility U, and Q and W are independent exogenous variables modeling the decision maker's individual characteristics influencing her own confidence H and her decision T, respectively. By distinguishing both sources of noise, we allow for the presence of uncertainty on the decision T even after conditioning on fixed confidence values h and b. This accounts for the fact that, in reality, a decision maker may take different decisions T for instances with the same confidence values h and b. For example, in medical treatment, for two different patients with the same confidence values h and b, a doctor's decision may differ due to limited resources.

³ We denote random variables with capital letters and realizations of random variables with lower case letters.
⁴ Our model allows both for causal and anticausal features [31].

In our SCM M, the decision maker's confidence H refers to the confidence the decision maker has that the label is Y = 1 before observing the classifier's confidence B. Moreover, following previous behavioral studies showing that humans' confidence H is discretized in a few distinct levels [32, 33], we assume H takes values h from a totally ordered discrete set H. We say that the decision maker's confidence f_H is monotone (with respect to the probability distribution P(Y = 1)) if, for all h, h' ∈ H such that h ≥ h', it holds that P(Y = 1 | H = h) ≥ P(Y = 1 | H = h'). Further, we allow the classifier's confidence B to depend on the decision maker's confidence H because this will be necessary to achieve human-alignment via multicalibration in Section 5. However, our negative result in Section 3 also holds if the classifier's confidence f_B(X, H) = f_B(X) only depends on the features, as is usual in most classifiers designed for AI-assisted decision making. In the remainder, we will use Z = (X, H) and denote the space of features and human confidence values as Z = X × H. Figure 1 shows a visual representation of our SCM M.

Under this characterization, we argue that, if a rational decision maker has decided t under confidence values b and h, then she would have decided t' ≥ t had the confidence values been b' ≥ b and h' ≥ h while holding everything else fixed [20]. For example, in medical treatment, assume a doctor's and a classifier's confidence that a patient would benefit from treatment is b = h = 0.7 and the doctor decides to treat the patient. Then we argue that, if the doctor is rational, she would have also treated the patient had the doctor's and the classifier's confidence been b = h = 0.8 > 0.7.
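The following is a minimal simulation sketch of the SCM M above. All functional forms for f_X, f_V, f_Y, f_H, f_B, the decision policy π and the noise distributions are hypothetical choices made only to illustrate the structure of Eqs. 2 and 3 together with a monotone policy; they are not the model used in the paper.

```python
# Minimal simulation sketch of the SCM M in Eqs. (2)-(3). Every functional form and
# noise distribution below is a hypothetical choice, used only to illustrate the model.
import numpy as np

rng = np.random.default_rng(0)

def sample_process(n, policy, u):
    d = rng.normal(size=n)                                   # exogenous noise D
    x = d + rng.normal(scale=0.5, size=n)                    # X = f_X(D)
    v = d + rng.normal(scale=0.5, size=n)                    # V = f_V(D), seen only by the human
    y = (d + rng.normal(scale=0.3, size=n) > 0).astype(int)  # Y = f_Y(D)

    q = rng.normal(scale=0.2, size=n)                        # exogenous noise Q
    h_raw = 1 / (1 + np.exp(-(x + v) / 2 - q))               # human belief that Y = 1
    h = np.digitize(h_raw, [1 / 3, 2 / 3]) / 2.0             # H = f_H(X, V, Q), three discrete levels
    b = 1 / (1 + np.exp(-x))                                 # B = f_B(X, H) (here it uses X only)

    w = rng.uniform(size=n)                                  # exogenous noise W
    t = policy(h, b, w)                                      # T = pi(H, B, W)
    return (u[1, 1] * t * y + u[1, 0] * t * (1 - y)          # U = u(T, Y)
            + u[0, 1] * (1 - t) * y + u[0, 0] * (1 - t) * (1 - y))

# A monotone AI-assisted decision policy: threshold an average of h and b, with the
# decision noise W shifting the threshold slightly.
def monotone_policy(h, b, w):
    return ((h + b) / 2 > 0.5 + 0.1 * (w - 0.5)).astype(int)

u = {(1, 1): 1.0, (1, 0): -0.5, (0, 1): 0.0, (0, 0): 0.0}
print("average utility:", sample_process(10_000, monotone_policy, u).mean())
```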
Further, we say that any AI-assisted decision policy π that satisfies this property is monotone, i.e.,

Definition 1 (Monotone AI-assisted decision policy). An AI-assisted decision policy π is monotone if and only if, for any b, b' ∈ [0, 1] and h, h' ∈ H such that b ≥ b' and h ≥ h', it holds that π(h, b, w) ≥ π(h', b', w) for any value w of W.

Finally, note that, under any monotone AI-assisted decision policy, it trivially follows that, for any b ≥ b' and h ≥ h',

E[T | H = h, B = b] ≥ E[T | H = h', B = b'],   (4)

where the expectation is over the uncertainty on the decision maker's individual characteristics and the data generating process.

3 Impossibility of AI-Assisted Decision Making Under Calibration

In AI-assisted decision making, classifiers are usually demanded to provide calibrated confidence values [10–16]. A confidence function f_B : Z → [0, 1] is said to be perfectly calibrated if, for any b ∈ [0, 1], it holds that P(Y = 1 | f_B(Z) = b) = b. Unfortunately, using finite amounts of (calibration) data, one can only hope to construct approximately calibrated confidence functions. Many different notions of approximate calibration have been proposed over the years. Here, for concreteness, we adopt the notion of α-calibration⁵ introduced by Hébert-Johnson et al. [11]; however, our theoretical results can easily be adapted to other notions of approximate calibration⁶.

⁵ Note that, if α = 0 and S = Z, the confidence function f_B is perfectly calibrated.
⁶ All proofs can be found in Appendix A.

Definition 2 (Calibration). A confidence function f_B : Z → [0, 1] satisfies α-calibration with respect to S ⊆ Z if there exists some S' ⊆ S, with |S'| ≥ (1 − α)|S|, such that, for any b ∈ [0, 1], it holds that

|P(Y = 1 | f_B(Z) = b, Z ∈ S') − b| ≤ α.   (5)

If the decision maker's decision T only depends on the classifier's confidence B, i.e., π(H, B, W) = π(B), and f_B satisfies α-calibration with respect to Z, then it readily follows from previous work that, for any utility function that satisfies Eq. 1, a simple monotone AI-assisted decision policy π_B that takes decisions by thresholding the confidence values is optimal [4–7], i.e., π_B = argmax_{π ∈ Π(B)} E_π[u(T, Y)], where the expectation is with respect to the probability distribution P_M and Π(B) denotes the class of AI-assisted decision policies using B. However, one of the main motivations to favor AI-assisted decision making over fully automated decision making is that the decision maker may have access to additional features V and may like to weigh the classifier's confidence B against her own confidence H. Hence, the decision maker may seek the optimal decision policy π* over the class Π(H, B) of AI-assisted decision policies using H and B, i.e., π* = argmax_{π ∈ Π(H, B)} E_π[u(T, Y)], since it may offer greater expected utility than π_B. Unfortunately, the following negative result shows that, in general, a rational decision maker may be unable to discover such an optimal decision policy π* using (perfectly) calibrated confidence values, and this is true even if f_H is monotone:

Theorem 3. There exist (infinitely many) AI-assisted decision making processes M satisfying Eqs. 2 and 3, with utility functions u(T, Y) satisfying Eq. 1, such that f_B is perfectly calibrated and f_H is monotone but any AI-assisted decision policy π ∈ Π(H, B) that satisfies monotonicity is suboptimal, i.e., E_π[u(T, Y)] < E_π*[u(T, Y)].

In the proof of the above result in Appendix A.2, we show that there always exists a perfectly calibrated f_B(Z) = f_B(X, H) that depends on both X and H for which any monotone AI-assisted decision policy is suboptimal.
This is due to the fact that f_B(Z) is calibrated on average over H; however, it may not be calibrated, nor even monotone, after conditioning on a specific value H = h. Further, we also show that, even if f_B(Z) = P_M(Y = 1 | X) matches the true distribution of the label Y given the features X, which has typically been the ultimate goal in the machine learning literature, there always exists a monotone f_H for which any monotone AI-assisted decision policy is suboptimal. This is due to the fact that the decision maker's confidence H can differ across instances with the same value of the features X because it also depends on the features V and noise Q. Hence, f_H may not be monotone after conditioning on a specific value X = x. In both cases, when a rational decision maker compares pairs of confidence values (h, b) and (h', b'), the rate of positive outcomes Y = 1 for each pair may appear contradictory with the magnitude of confidence. In what follows, we will show that, if f_B satisfies a natural alignment property with respect to f_H, which we refer to as human-alignment, there always exists an optimal AI-assisted decision policy that is monotone.

4 AI-Assisted Decision Making Under Human-Aligned Calibration

Intuitively, to avoid that pairs of confidence values B and H appear contradictory to a rational decision maker, we need to make sure that, with high probability, both f_B and f_H are monotone after conditioning on specific values of H and B, respectively. Next, we formalize this intuition by means of the following property, which we refer to as α-alignment:

Definition 4 (Human-alignment). A confidence function f_B satisfies α-alignment with respect to a confidence function f_H if, for any h ∈ H, there exists some S̃_h ⊆ S_h, with S_h = {(x, h') ∈ Z | h' = h} and |S̃_h| ≥ (1 − α/2)|S_h|, such that, for any b, b' ∈ [0, 1] and h, h' ∈ H such that b ≥ b' and h ≥ h', it holds that

P(Y = 1 | f_B(X, H) = b', (X, H) ∈ S̃_{h'}) − P(Y = 1 | f_B(X, H) = b, (X, H) ∈ S̃_h) ≤ α.   (6)

The above definition just means that, if f_B is α-aligned with respect to f_H, then, for any h, h' ∈ H, we can bound any violation of monotonicity by f_B between at least a (1 − α/2) fraction of the subspaces of features S_h and S_{h'}. Moreover, note that, if f_B is 0-aligned with respect to f_H, then there are no violations of monotonicity, i.e., P(Y = 1 | f_B(X, H) = b', (X, H) ∈ S̃_{h'}) ≤ P(Y = 1 | f_B(X, H) = b, (X, H) ∈ S̃_h), and we say that f_B is perfectly aligned with respect to f_H.

Given the above definition, we are now ready to state our main result, which shows that human-alignment allows for AI-assisted decision policies that satisfy monotonicity and (near-)optimality:

Theorem 5. Let M be any AI-assisted decision making process satisfying Eqs. 2 and 3, with a utility function u(T, Y) satisfying Eq. 1. If f_B satisfies α-alignment w.r.t. f_H, then there always exists an AI-assisted decision policy π̂ ∈ Π(H, B) that satisfies monotonicity and is near-optimal, i.e.,

E_π̂[u(T, Y)] ≥ E_π*[u(T, Y)] − α (u(1, 1) − u(0, 1) + (3/2)(u(0, 0) − u(1, 0))),   (7)

where π* = argmax_{π ∈ Π(H, B)} E_π[u(T, Y)] is the optimal policy.

Corollary 1. If f_B is perfectly aligned with respect to f_H, then there always exists an AI-assisted decision policy π ∈ Π(H, B) that satisfies monotonicity and is optimal.
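As an illustration of Definition 4, the sketch below estimates, from samples, the probabilities P(Y = 1 | f_B in a bin, H = h) for every cell and reports the largest monotonicity violation across comparable cells. It is a simplified, assumed implementation: it bins the classifier's confidence, uses all samples rather than the (1 − α/2)-fraction subsets S̃_h of the definition, and assumes numerically ordered human confidence levels.

```python
# Simplified empirical check of alignment (Definition 4): bin the classifier's
# confidence, estimate P(Y = 1) in every (human level, bin) cell, and report the
# largest monotonicity violation across comparable cells. This ignores the
# (1 - alpha/2)-fraction subset selection of the definition and is only a sketch.
import numpy as np

def max_alignment_violation(b, h, y, n_bins=8, min_count=30):
    """b: classifier confidence, h: (ordered, discrete) human confidence, y: labels."""
    bins = np.clip((b * n_bins).astype(int), 0, n_bins - 1)
    rates = {}                                              # (h level, bin) -> empirical P(Y = 1)
    for level in np.unique(h):
        for k in range(n_bins):
            mask = (h == level) & (bins == k)
            if mask.sum() >= min_count:
                rates[(level, k)] = y[mask].mean()
    worst = 0.0
    for (l1, k1), p1 in rates.items():                      # lower-confidence cell
        for (l2, k2), p2 in rates.items():                  # higher-confidence cell
            if l2 >= l1 and k2 >= k1:
                worst = max(worst, p1 - p2)                 # positive gap = alignment violation
    return worst
```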
Finally, in many high-stakes applications, we may like to make sure that the confidence values provided by f_B are both useful and interpretable [34]. Hence, we may like to seek confidence functions f_B that satisfy human-aligned calibration, which we define as follows:

Definition 6 (Human-aligned calibration). A confidence function f_B satisfies α-aligned calibration with respect to a confidence function f_H if and only if f_B satisfies α-alignment with respect to f_H and it satisfies α-calibration with respect to Z.

In the next section, we will show how to achieve human-alignment and human-aligned calibration via multicalibration, a statistical notion introduced in the context of algorithmic fairness [11].

5 Achieving Human-Aligned Calibration via Multicalibration

Multicalibration was introduced by Hébert-Johnson et al. [11] as a notion to achieve fairness in supervised learning. It strengthens the notion of calibration by requiring that the confidence function is calibrated simultaneously across a large collection of subspaces of features C ⊆ 2^Z, which may or may not be disjoint. More formally, it is defined as follows:

Definition 7 (Multicalibration). A confidence function f_B : Z → B satisfies α-multicalibration with respect to C ⊆ 2^Z if f_B satisfies α-calibration with respect to every S ∈ C.

Then, we can show that, for an appropriate choice of C, if f_B satisfies α-multicalibration with respect to C, then it satisfies α-aligned calibration with respect to f_H. More specifically, we have the following result:

Theorem 8. If f_B satisfies (α/2)-multicalibration with respect to {S_h}_{h ∈ H}, with S_h = {(x, h') ∈ Z | h' = h}, then f_B satisfies α-aligned calibration with respect to f_H.

The above theorem suggests that, given a classifier's confidence function f_B, we can multicalibrate f_B with respect to {S_h}_{h ∈ H} to achieve α-aligned calibration with respect to f_H. To achieve multicalibration guarantees using finite amounts of (calibration) data, multicalibration algorithms need to discretize the range of f_B [9, 11, 12]. In what follows, we briefly revisit two algorithms, which carry out this discretization differently, and discuss their complexity and data requirements with respect to achieving α-aligned calibration.

Multicalibration algorithm via λ-discretization. This algorithm, which was introduced by Hébert-Johnson et al. [11], discretizes the range of f_B, i.e., the interval [0, 1], into bins of fixed size λ > 0 with values Λ[0, 1] = {λ/2, 3λ/2, ..., 1 − λ/2}. Let λ(b) = [b − λ/2, b + λ/2). The algorithm partitions each subspace S_h into 1/λ groups S_{h,λ(b)} = {(x, h) ∈ S_h | f_B(x, h) ∈ λ(b)}, with b ∈ Λ[0, 1]. It iteratively updates the confidence values of the function f_B for these groups until f_B satisfies a discretized notion of α'-multicalibration over these groups. The algorithm then returns a discretized confidence function f_{B,λ}(x, h) = E[f_B(X, H) | f_B(X, H) ∈ λ(b)], with b ∈ Λ[0, 1] such that f_B(x, h) ∈ λ(b), which is guaranteed to satisfy (α' + λ)-multicalibration. Refer to Algorithm 1 in Appendix B for pseudocode. Then, as a direct consequence of Theorem 8, we can obtain a (discretized) confidence function f_{B,λ} that satisfies α-aligned calibration by setting α' = λ = α/4. However, the following proposition shows that, to satisfy just α-alignment, it is enough to set α' = (3/8)α > α/4 and λ = α/4:

Proposition 1. The discretized confidence function f_{B,λ} returned by Algorithm 1 satisfies (2α' + λ)-alignment with respect to f_H.
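The sketch below gives a condensed, assumed version of the iterative patching idea behind Algorithm 1: it repeatedly shifts the confidence values of any group S_{h,λ(b)} whose empirical label rate deviates from its mean predicted value by more than α', and finally returns bin-averaged (discretized) confidence values. It omits the sample splitting and failure-probability bookkeeping of Hébert-Johnson et al. [11], so it should be read as an illustration rather than as the algorithm itself.

```python
# Condensed sketch of multicalibration via lambda-discretization, in the spirit of
# Algorithm 1 of Hebert-Johnson et al. [11], but simplified: full-batch empirical
# estimates, no sample splitting, no failure-probability accounting.
import numpy as np

def multicalibrate(b, h, y, lam=0.125, alpha=0.05, max_iter=100, min_count=30):
    """b: initial confidence values, h: discrete human confidence, y: labels."""
    b = b.astype(float).copy()
    n_bins = int(round(1 / lam))
    for _ in range(max_iter):
        bins = np.clip((b / lam).astype(int), 0, n_bins - 1)
        updated = False
        for level in np.unique(h):
            for k in range(n_bins):
                mask = (h == level) & (bins == k)            # group S_{h, lambda(b)}
                if mask.sum() < min_count:
                    continue
                gap = y[mask].mean() - b[mask].mean()
                if abs(gap) > alpha:                         # group violates calibration
                    b[mask] = np.clip(b[mask] + gap, 0.0, 1.0)
                    updated = True
        if not updated:
            break
    # Return the discretized confidence: the average confidence value within each bin.
    bins = np.clip((b / lam).astype(int), 0, n_bins - 1)
    out = np.empty_like(b)
    for k in range(n_bins):
        out[bins == k] = b[bins == k].mean() if (bins == k).any() else (k + 0.5) * lam
    return out
```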
Finally, it is worth noting that, to implement Algorithm 1, we need to compute empirical estimates of the expectations and probabilities above using a calibration set D. In this context, Theorem 2 in Hébert-Johnson et al. [11] shows that, if we use a calibration set of size O(log(|H| / (α γ ξ)) / (α^(11/2) γ^(3/2))), with P((X, H) ∈ S_h) > γ for all h ∈ H, then f_{B,λ} is guaranteed to satisfy α-multicalibration with probability at least 1 − ξ, in time O(|H| poly(1/α, 1/γ)).

Multicalibration algorithm via uniform mass binning. Uniform mass binning (UMD) [9, 12] was originally designed to calibrate f_B with respect to Z using a calibration set D. However, since the subspaces {S_h}_{h ∈ H} are disjoint, i.e., S_h ∩ S_{h'} = ∅ for every h ≠ h', we can multicalibrate f_B with respect to {S_h}_{h ∈ H} by just running |H| instances of UMD, each using the subset of samples D ∩ S_h. Here, we would like to emphasize that we can use UMD to achieve multicalibration because, in our setting, the subspaces {S_h}_{h ∈ H} are disjoint. Each instance of UMD discretizes the range of f_B, i.e., the interval [0, 1], into N = 1/λ bins with values Λ_h[0, 1] = {P̂(Y = 1 | f_B(X, h) ∈ [0, q̂_1]), ..., P̂(Y = 1 | f_B(X, h) ∈ [q̂_{N−1}, q̂_N])}, where q̂_i denotes the (i/N)-th empirical quantile of the confidence values f_B(x, h) of the samples (x, h) ∈ D ∩ S_h and P̂ denotes an empirical estimate of the probability using samples from D ∩ S_h as well. Here, note that, by construction, the bins have similar probability mass. Then, for each (x, h) ∈ Z, the corresponding instance of UMD provides the value of the discretized confidence function f_{B,λ}(x, h) = b, where b ∈ Λ_h[0, 1] denotes the value of the bin whose corresponding defining interval includes f_B(x, h). Finally, we have the following theorem, which guarantees that, as long as the calibration set is large enough, the discretized confidence function f_{B,λ} satisfies α-aligned calibration with respect to f_H with high probability:

Figure 2: Empirical estimate of the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}), where b ∈ Λ[0, 1] and h ∈ {low, mid, high} are the discretized confidence values for the classifiers and human participants, respectively. One panel per task, with task-specific 'low'/'mid'/'high' bin boundaries. Error bars represent 90% confidence intervals and hatched bars mark alignment violations between confidence pairs (h, b) with |S_{h,λ(b)}| ≥ 30.

Theorem 9. The discretized confidence function f_{B,λ} returned by |H| instances of UMD, one per S_h, satisfies α-aligned calibration with respect to f_H with probability at least 1 − ξ as long as the size of the calibration set satisfies |D| = O((|H| / (α² λ γ)) log(|H| / (λ ξ))), with P((X, H) ∈ S_h) ≥ γ.

6 Experiments

In this section, we validate our theoretical results using a dataset with real expert predictions in an AI-assisted decision making scenario comprising four different binary classification tasks⁷.

Data description. We experiment with the publicly available Human-AI Interactions dataset [35]. The dataset comprises 34,783 unique predictions from 1,088 different human participants on four different binary prediction tasks ("Art", "Sarcasm", "Cities" and "Census").
Overall, there are approximately 32 different instances per task. In the "Art" task, participants need to determine the art period of a painting given two choices and, overall, there are paintings from four art periods. In the "Sarcasm" task, participants need to detect whether sarcasm is present in text snippets from the Reddit sarcasm dataset [36]. In the "Cities" task, participants need to determine which large US city is depicted in an image given two choices and, overall, there are images of four different US cities. Finally, in the "Census" task, participants need to determine whether an individual earns more than 50k a year based on certain demographic information in tabular form. For "Sarcasm", x is a representation of the text snippets and we set y = 1 if sarcasm is present; for "Art" and "Cities", x is a representation of the images and we set y = 1 or y = 0 at random for each different instance; and, for "Census", x summarizes demographic information and we set y = 1 if an individual earns more than 50k a year. In each of the tasks, human participants provide confidence values about their predictions before (h) and after (h+AI) receiving AI advice from a classifier in the form of the classifier's confidence values b.⁸

⁷ We release the code to reproduce our analysis at https://github.com/Networks-Learning/human-alignedcalibration.
⁸ Refer to Appendix C for more details on the dataset.

Table 1: Misalignment, miscalibration and AUC.

Task    | EAE      | MAE   | ECE   | MCE   | AUC π_B | AUC π_H | AUC π_{H+AI}
Art     | 4.5×10⁻⁴ | 0.058 | 0.084 | 0.186 | 86.7%   | 72.7%   | 82.0%
Sarcasm | 3.8×10⁻³ | 0.224 | 0.085 | 0.310 | 89.9%   | 82.5%   | 86.5%
Cities  | 6.2×10⁻⁵ | 0.013 | 0.066 | 0.158 | 84.4%   | 79.0%   | 84.7%
Census  | 9.0×10⁻³ | 0.298 | 0.109 | 0.270 | 80.0%   | 77.3%   | 79.9%

The original dataset contains predictions by participants from different, but overlapping, sets of countries across tasks, who were told the AI advice had different values of accuracy.⁹ In our experiments, to control for these confounding factors, we focus on participants from the US who were told the AI advice was 80% accurate, resulting in 15,063 unique predictions from 471 different human participants.

Experimental setup and evaluation metrics. For each of the tasks, we first measure (i) the degree of misalignment between the classifiers' confidence values b and the participants' confidence values h before receiving AI advice and (ii) the difference h+AI − h between the human participants' confidence values before and after receiving AI advice b. Then, we compare the utility achieved by an AI-assisted decision policy π_{H+AI}, which predicts the value of y by thresholding the humans' confidence values h+AI after observing the classifier's confidence values, against two baselines: (i) a decision policy π_B that predicts the value of y by thresholding the classifier's confidence values b and (ii) a decision policy π_H that predicts the value of y by thresholding the humans' confidence values h before observing the classifier's confidence values.

To measure the degree of misalignment, we discretize the confidence values b and h into bins. For the classifiers' confidence b, we use 8 uniformly sized bins per task with (centered) values in Λ[0, 1], where λ = 1/8. For the human participants' confidence h before receiving AI advice b, we use three bins per task ("low", "mid" and "high"), where we set the bin boundaries so that each bin contains approximately the same probability mass and set the bin values to the average confidence value within each bin.
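The discretization described above can be sketched as follows. The helper below is an assumed illustration (the paper's released code may differ): it bins the classifier's confidence b into uniform-width bins labelled by their centers and the participants' confidence h into three equal-mass bins labelled by their average confidence.

```python
# Minimal sketch of the discretization into cells, assuming 1-D numpy arrays
# b (classifier confidence) and h (participant confidence before advice).
import numpy as np

def discretize(b, h, n_b_bins=8, n_h_bins=3):
    # Classifier confidence: uniform-width bins, labelled by their centers.
    lam = 1.0 / n_b_bins
    b_bin = np.clip((b / lam).astype(int), 0, n_b_bins - 1)
    b_value = (b_bin + 0.5) * lam

    # Human confidence: (approximately) equal-mass bins, labelled by the average
    # confidence of the samples that fall inside each bin.
    edges = np.quantile(h, np.linspace(0, 1, n_h_bins + 1)[1:-1])
    h_bin = np.digitize(h, edges)                       # 0 = 'low', 1 = 'mid', 2 = 'high'
    h_value = np.array([h[h_bin == k].mean() for k in range(n_h_bins)])[h_bin]
    return b_value, h_value
```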
In what follows, we refer to the pairs of discretized confidence values (h, b) as cells, where the samples (x, h) ∈ Z whose confidence values lie in the cell (h, b) define the group S_{h,λ(b)}. Note that we choose a rather low number of bins for both b and h so that most cells have sufficient data samples to reliably estimate several misalignment metrics, which we describe next.

We use three different misalignment metrics: (i) the number of alignment violations between cell pairs, (ii) the expected alignment error (EAE) and (iii) the maximum alignment error (MAE). There is an alignment violation between cell pairs (h, b) and (h', b'), with h ≤ h' and b ≤ b', if P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) > P(Y = 1 | (X, H) ∈ S_{h',λ(b')}). Moreover, we have that

EAE = (1/N) Σ_{h ≤ h', b ≤ b'} [ P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) − P(Y = 1 | (X, H) ∈ S_{h',λ(b')}) ]₊ and
MAE = max_{h ≤ h', b ≤ b'} [ P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) − P(Y = 1 | (X, H) ∈ S_{h',λ(b')}) ]₊,

where [·]₊ = max(0, ·) and N = |{((h, b), (h', b')) : h ≤ h', b ≤ b'}|. Here, note that the number of alignment violations tells us how frequently the left-hand side of Eq. 6 is positive across cell pairs given S̃_h = S_h, and the EAE and MAE quantify the average and maximum value of the left-hand side of Eq. 6 across cells violating alignment. To obtain reliable estimates of the above metrics, we only consider cells (h, b) with |S_{h,λ(b)}| ≥ 30 samples. Moreover, we also report the expected calibration error (ECE) and maximum calibration error (MCE) [12, 37], which are natural counterparts to the EAE and MAE, respectively. As a measure of utility, we estimate the true positive rate (TPR) and false positive rate (FPR) of the decision policies π_B, π_H and π_{H+AI} for all possible choices of threshold values, which we summarize using the area under the ROC curve (AUC); in Appendix C, we also report ROC curves.

⁹ Participants were also either told that the advice is from a "Human" or from an "AI" based on a random assignment of participants to a treatment or control group. Since the actual advice received in both groups was identical for the same instance and the "perceived advice source" is randomized, we use data from both treatment and control groups in the experiments.

Figure 3: Empirical estimate of the average difference E[h+AI − h | (X, H) ∈ S_{h,λ(b)}], where b ∈ Λ[0, 1] and h ∈ {low, mid, high} are the discretized confidence values for the classifier and human participants, respectively. One panel per task, with task-specific 'low'/'mid'/'high' bin boundaries. Error bars represent 90% confidence intervals and hatched bars mark alignment violations between confidence pairs (h, b) with |S_{h,λ(b)}| ≥ 30.
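The misalignment and miscalibration metrics above can be estimated as in the following sketch. It assumes a dict mapping each retained cell (h, b) to its empirical P(Y = 1 | cell), with numerically ordered cell labels, and, for calibration, dicts mapping each confidence bin b to its empirical label rate and size; this is an assumed illustration of one reading of the definitions, not the paper's evaluation code.

```python
# Sketch of the misalignment (number of violations, EAE, MAE) and miscalibration
# (ECE, MCE) metrics. `rate` maps each retained cell (h, b) to its empirical
# P(Y = 1 | cell); `rate_b` and `count_b` map each confidence bin b to its empirical
# label rate and size. Cell labels are assumed to be numerically ordered.
def misalignment_metrics(rate):
    pairs = [(p_low, p_high)
             for (h1, b1), p_low in rate.items()
             for (h2, b2), p_high in rate.items()
             if (h1, b1) != (h2, b2) and h2 >= h1 and b2 >= b1]
    gaps = [max(0.0, p_low - p_high) for p_low, p_high in pairs]  # positive part of Eq. (6)
    n_violations = sum(g > 0 for g in gaps)
    eae = sum(gaps) / len(pairs) if pairs else 0.0                # expected alignment error
    mae = max(gaps, default=0.0)                                  # maximum alignment error
    return n_violations, eae, mae

def miscalibration_metrics(rate_b, count_b):
    n = sum(count_b.values())
    gaps = {b: abs(p - b) for b, p in rate_b.items()}             # |P(Y = 1 | bin) - b|
    ece = sum(count_b[b] * g for b, g in gaps.items()) / n        # expected calibration error
    mce = max(gaps.values(), default=0.0)                         # maximum calibration error
    return ece, mce
```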
Results. We start by looking at the empirical estimates of the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) and of our measures of misalignment (EAE, MAE) and miscalibration (ECE, MCE) in Figure 2 and Table 1 (left and middle columns). The results show that, for "Cities", the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) are (approximately) monotonically increasing with respect to the classifier's confidence values b. More specifically, as shown in Figure 2, there is only one alignment violation between cell pairs and, hence, our misalignment metrics also take very low values. In contrast, for "Art", "Sarcasm" and especially "Census", there is an increasing number of alignment violations and our misalignment metrics take higher values, up to several orders of magnitude higher for "Census". These results also show that misalignment and miscalibration go hand in hand; however, in terms of miscalibration, "Census" does not stand out as strongly.

Next, we look at the difference h+AI − h between the human participants' recorded confidence values before and after receiving AI advice b across samples in each of the subsets S_{h,λ(b)} induced by the discretized confidence values used above. Figure 3 summarizes the results, which reveal that the difference h+AI − h increases monotonically with respect to the classifier's confidence b. This suggests that participants always expect b to reflect the probability of a positive outcome irrespective of their confidence value h before receiving AI advice, providing support for our hypothesis that (rational) decision makers implement monotone AI-assisted decision policies. Further, this finding also implies that, for "Art", "Sarcasm" and "Census", any policy π_{H+AI} that predicts the value of the label y by thresholding the confidence value h+AI will necessarily be suboptimal, because the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) are not monotonically increasing with b.

Finally, we look at the AUC achieved by the decision policies π_B, π_H and π_{H+AI}. Table 1 (right columns) summarizes the results, which show that π_{H+AI} outperforms π_H consistently across all tasks but only outperforms π_B in a single task ("Cities") out of four. These findings provide empirical support for Theorem 3, which predicts that, in the presence of human-alignment violations such as those observed in "Art", "Sarcasm" and "Census", any monotone AI-assisted decision policy will be suboptimal, and they also provide support for Theorem 5, which predicts that, under human-alignment, there exist near-optimal AI-assisted decision policies satisfying monotonicity.

7 Discussion and Limitations

In this section, we discuss the intended scope of our work and identify several limitations of our theoretical and experimental results, which may serve as starting points for future work.

Decision making setting. We have focused on decision making settings where both decisions and outcomes are binary. However, we think that it may be feasible to extend our theoretical analysis to settings with multi-categorical (or real-valued) outcome variables and decisions. One of the main challenges would be to identify which natural conditions utility functions may satisfy in such settings. Further, we also think that it would be significantly more challenging to extend our theoretical analysis to sequential settings (multicalibration in sequential settings is an open area of research), but our ideas may still be a useful starting point. In addition, our theoretical analysis assumes that decision makers aim to maximize the average utility of their decisions. However, whenever human decisions are consequential to individuals, the decision maker may have fairness desiderata.

Confidence values. In our causal model of AI-assisted decision making, we allow the classifier's confidence values to depend on the decision maker's confidence values because this is necessary to achieve human-alignment via multicalibration as described in Section 5.
However, we would like to clarify that both Theorems 3 and 5 still hold if the classifier's confidence values do not depend on the decision maker's confidence, as is typically the status quo today. Looking into the future, our work questions this status quo by showing that, by allowing the classifier's confidence values to depend on the decision maker's confidence values, a decision maker may end up taking decisions with higher utility. Moreover, we would also like to clarify that, while the motivation behind our work is AI-assisted human decision making, our theoretical results do not depend on who (be it a classifier or another human) gives advice. As long as the advice comes in the form of confidence values, our results are valid. Finally, while we have shown that human-alignment can be achieved via multicalibration, we hypothesize that algorithms specifically designed to achieve human-alignment may have lower data and computational requirements than multicalibration algorithms.

Experimental results. Our experimental results demonstrate that, across tasks, the average utility achieved by decision makers is relatively higher if the classifier they use satisfies human-alignment. However, they do not empirically demonstrate that, for a fixed task, there is an improvement in the average utility achieved by decision makers if the classifier they use satisfies human-alignment. The reason why we could not demonstrate the latter is that, in our experiments, we used an observational dataset gathered by others [35]. Looking into the future, it would be very important to run a human subject study to empirically demonstrate the latter and, for now, to treat our conclusions with caution.

8 Conclusions

We have introduced a theoretical framework to investigate what properties confidence values should have to help decision makers take better decisions. We have shown that there exist data distributions for which a rational decision maker using calibrated confidence values will always take suboptimal decisions. However, we have further shown that, if the confidence values satisfy a natural alignment property, which can be achieved via multicalibration, then a rational decision maker using these confidence values can take optimal decisions. Finally, we have illustrated our theoretical results using real human predictions on four AI-assisted decision making tasks.

Acknowledgements. We would like to thank Nastaran Okati for fruitful discussions at an early stage of the project. Gomez-Rodriguez acknowledges support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719).

References

[1] Wei Jiao, Gurnit Atwal, Paz Polak, Rosa Karlic, Edwin Cuppen, Alexandra Danyi, Jeroen de Ridder, Carla van Herpen, Martijn P Lolkema, Neeltje Steeghs, et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nature Communications, 11(1):1–12, 2020. [2] Jacob Whitehill, Kiran Mohan, Daniel Seaton, Yigal Rosen, and Dustin Tingley. MOOC dropout prediction: How to measure accuracy? In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, pages 161–164, 2017. [3] Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018. [4] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness.
In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pages 797 806, 2017. [5] Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277 287. PMLR, 2020. [6] Isabel Valera, Adish Singla, and Manuel Gomez-Rodriguez. Enhancing the accuracy and fairness of human decision making. In Advances in Neural Information Processing Systems, 2018. [7] Guy N Rothblum and Gal Yona. Decision-making under miscalibration. In Innovations in Theoretical Computer Science, 2023. [8] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, 2017. [9] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, 2001. [10] Tilmann Gneiting, Fadoua Balabdaoui, and Adrian Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007. [11] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, 2018. [12] Chirag Gupta and Aaditya K. Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In Proceedings of the 38th International Conference on Machine Learning, 2021. [13] Roshni Sahoo, Shengjia Zhao, Alyssa Chen, and Stefano Ermon. Reliable decisions with threshold calibration. Advances in Neural Information Processing Systems, 34:1831 1844, 2021. [14] Shengjia Zhao, Michael P Kim, Roshni Sahoo, Tengyu Ma, and Stefano Ermon. Calibrating predictions to decisions: A novel approach to multi-class calibration. In Advances in Neural Information Processing Systems, 2021. [15] Yingxiang Huang, Wentao Li, Fima Macheret, Rodney A Gabriel, and Lucila Ohno-Machado. A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association, 27(4):621 633, 2020. [16] Lequn Wang, Thorsten Joachims, and Manuel Gomez-Rodriguez. Improving screening processes via calibrated subset selection. In Proceedings of the 39th International Conference on Machine Learning, 2023. [17] Kailas Vodrahalli, Tobias Gerstenberg, and James Zou. Uncalibrated models can improve human-ai collaboration. In Advances in Neural Information Processing Systems, 2022. [18] Gal Yona, Amir Feder, and Itay Laish. Useful confidence measures: Beyond the max score. ar Xiv preprint ar Xiv:2210.14070, 2022. [19] Eleni Straitouri, Lequn Wang, Nastaran Okati, and Manuel Gomez-Rodriguez. Improving expert predictions with conformal prediction. In Proceedings of the 40th International Conference on Machine Learning, 2023. [20] Judea Pearl. Causality. Cambridge university press, 2009. [21] Vivian Lai, Chacha Chen, Q Vera Liao, Alison Smith-Renner, and Chenhao Tan. Towards a science of human-ai decision making: a survey of empirical studies. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2023. [22] Andrea Papenmeier, Gwenn Englebienne, and Christin Seifert. How model accuracy and explanation fidelity influence user trust. 
ar Xiv preprint ar Xiv:1907.12652, 2019. [23] Xinru Wang and Ming Yin. Are explanations helpful? a comparative study of the effects of explanations in ai-assisted decision-making. In 26th International Conference on Intelligent User Interfaces, pages 318 328, 2021. [24] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 chi conference on human factors in computing systems, pages 1 12, 2019. [25] Mahsan Nourani, Joanie T. King, and Eric D. Ragan. The role of domain expertise in user trust and the impact of first impressions with intelligent systems. Ar Xiv, abs/2008.09100, 2020. [26] Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. Effect of confidence and explanation on accuracy and trust calibration in ai-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 295 305, 2020. [27] Chacha Chen, Shi Feng, Amit Sharma, and Chenhao Tan. Machine explanations and human understanding. Transactions of Machine Learning Research, 2023. [28] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 2017. [29] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022. [30] Banghua Zhu, Jiantao Jiao, and Michael Jordan. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. ar Xiv preprint ar Xiv:2301.11270, 2023. [31] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, 2012. [32] Matteo Lisi, Gianluigi Mongillo, Georgia Milne, Tessa Dekker, and Andrei Gorea. Discrete confidence levels revealed by sequential decisions. Nature Human Behaviour, 5(2):273 280, 2021. [33] Hang Zhang, Nathaniel D Daw, and Laurence T Maloney. Human representation of visuo-motor uncertainty as mixtures of orthogonal basis distributions. Nature neuroscience, 18(8):1152 1158, 2015. [34] Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, et al. Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 401 413, 2021. [35] Kailas Vodrahalli, Roxana Daneshjou, Tobias Gerstenberg, and James Zou. Do humans trust advice more if it comes from ai? an analysis of human-ai interactions. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 763 777, 2022. [36] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. ar Xiv preprint ar Xiv:1704.05579, 2017. [37] Telmo Silva Filho, Hao Song, Miquel Perello-Nieto, Raul Santos-Rodriguez, Meelis Kull, and Peter Flach. Classifier calibration: How to assess and improve predicted class probabilities: a survey. ar Xiv e-prints, pages ar Xiv 2112, 2021. A.1 Additional Lemmas Lemma 1 (Monotonicity). If a utility function u satisfies Eq. 
1, then u is monotone with respect to the probability that Y = 1, i.e., for any P, P P({0, 1}) such that P(Y = 1) P (Y = 1), it holds that EY P [u(1, Y )] EY P [u(1, Y )]. Proof. We readily have that EY P [u(1, Y )] = P(Y = 1) u(1, 1) + (1 P(Y = 1)) u(1, 0) P (Y = 1) u(1, 1) + (1 P (Y = 1)) u(1, 0) = EY P [u(1, Y )], where, in the above inequality, we use that u(1, 1) > u(1, 0) and P(Y = 1) P (Y = 1). Lemma 2 (Trivial policies are not always optimal). If a utility function u satisfies Eq. 1, then there exist P, P P({0, 1}) such that the trivial policies π that either always decide T = 1 or always decide T = 0 are suboptimal. In particular, for any P, P P({0, 1}) such that P(Y = 1) < c and P (Y = 1) > c, where c = u(0, 0) u(1, 0) u(1, 1) u(1, 0) + u(0, 0) u(0, 1) (0, 1), (8) it holds that EY P [u(1, Y )] < EY P [u(0, Y )] and EY P [u(1, Y )] > EY P [u(0, Y )]. (9) Proof. Let P be any distribution such that P(Y = 1) < c = u(0, 0) u(1, 0) u(1, 1) u(1, 0) + u(0, 0) u(0, 1), where c (0, 1) because, by assumption, u satisfies Eq. 1. Now, by rearranging the above inequality, we have that P(Y = 1) u(1, 1) + (1 P(Y = 1)) u(1, 0) < P(Y = 1) u(0, 1) + (1 P(Y = 1)) u(0, 0), and, using the definition of the expectation, it immediately follows that EY P [u(1, Y )] < EY P [u(0, Y )]. The same argument can be used to show that, for any distribution P such that P (Y = 1) > c, it holds that EY P [u(1, Y )] > EY P [u(0, Y )]. Finally, note that, since c (0, 1), we know that such distributions P and P exist. A.2 Proof of Theorem 3 Before proving Theorem 3, we rewrite the expected utility with respect to the probability distribution P M in terms of confidence H and B by using the law of total expectation, Eπ[u(T, Y )] = EH,B P M(H,B) [Eπ[u(T, Y )|H, B]] . Here, to simplify notation, we will write EH,B [Eπ[u(T, Y ) | H, B]] , where note that, using the law of total expectation, we can write the inner expectation in the above expression in terms of the utilities of the trivial policies, i.e., Eπ[u(T, Y ) | H, B] = E[u(1, Y ) | H, B] Pπ(T = 1 | H, B) + E[u(0, Y ) | H, B] Pπ(T = 0 | H, B), (10) and we will use P to refer to probabilities induced by SCM M, e.g., P(H, B) to denote P M(H, B). Now, we restate and prove Theorem 3. Theorem 3. There exist (infinitely many) AI-assisted decision making processes M satisfying Eqs. 2 and 3, with utility functions u(T, Y ) satisfying Eq. 1, such that f B is perfectly calibrated and f H is monotone but any AI-assisted decision policy π Π(H, B) that satisfies monotonicity is suboptimal, i.e., Eπ[u(T, Y )] < Eπ [u(T, Y )]. Proof. To prove the above claim, we construct a monotone confidence function f H, perfectly calibrated confidence function f B and distribution P M for which any monotone AI-assisted decision policy π Π(H, B) achieves strictly lower utility than a carefully constructed non monotone AI-assisted decision policy π Π(H, B). We will present the proof in three parts. First, we will introduce the main building block and idea behind the proof by a small construction of f H, f B and P M with |H| = |B| = 3, where B [0, 1] denotes the (discrete) output space of the classifier s confidence function. We then construct examples of f H, f B and P M for arbitrary |H| = k and |B| = m with m, k N, m > k 2. Lastly, we construct examples where B is non-discrete and |H| = k with k > 2. Main building block and small example. We start by presenting the main idea of the proof using an example with a small set of confidence values H and B. 
Let the values of the decision maker s confidence H be in H = {h1, h2, h3} and the values of the classifier s confidence B be in B = {b1, b2, b3}, with order hi < (hi + 1) and bi < (bi + 1) respectively. Our main building block, consists of two distributions P , P + P({0, 1}) with P (Y = 1) < c and P +(Y = 1) > c, where c depends on utility u as described by Eq. 8 in Lemma 2. We use these distributions for our constructions of f H, f B and P M, so that for some realizations of H, B distribution P(Y = 1 | H, B) is either P or P +. Using Lemma 2 and from Eq. 10, we have that: (I) For any hi, bi such that P(Y | H = hi, B = bi) = P , it holds that E[u(1, Y ) | H = hi, B = bi] < E[u(0, Y ) | H = hi, B = bi]. Hence, decreasing Pπ(T = 1 | H, B) increases E[u(T, Y ) | H = hi, B = bi]. (II) For any hi, bi such that P(Y | H = hi, B = bi) = P +, it holds that E[u(1, Y ) | H = hi, B = bi] > E[u(0, Y ) | H = hi, B = bi]. Hence, increasing Pπ(T = 1 | H, B) increases E[u(T, Y ) | H = hi, B = bi]. Intuitively, suppose we now have that, for confidence values h2, b2, Y P + and, for confidence values h3, b2, Y P , i.e., P(Y | H = h2, B = b2) = P + and P(Y | H = h3, B = b2) = P . Then, any non-monotone AI-assisted decision policy π with P π(T = 1 | H = h2, B = b2) > P π(T = 1 | H = h3, B = b2) will have higher expected utility than any monotone AI-assisted decision policy given confidence values h2, b2 and h3, b2. Finally, under an appropriate choice of distribution P(H, B), such non-monotone AI-assisted decision policies π will offer higher overall utility in expectation. We formalize this intuition with the following lemma: Lemma 3. Let M be any AI-assisted decision making process satisfying Eqs. 2 and 3, with utility function u(T, Y ) satisfying Eq. 1. If f H, f B and P M are such that there exists confidence values b B, hi, hj H, with hi < hj, which satisfy P(H = hi, B = b) > 0, P(H = hj, B = b) > 0, P(Y | H = hi, B = b) = P + and P(Y | H = hj, B = b) = P , (11) for some distributions P , P + with P (Y = 1) < c and P +(Y = 1) > c, where c = u(0, 0) u(1, 0) u(1, 1) u(1, 0) + u(0, 0) u(0, 1). (12) Then, for any monotone AI-assisted decision policy π Π(H, B), there exists an AI-assisted decision policy π Π(H, B) which is not monotone and achieves a stricly greater utility than π, i.e., Eπ[u(T, Y )] < E π[u(T, Y )]. Proof. Let π be a monotone AI-assisted decision policy, then it must hold that Pπ(T = 1 | H = hi, B = b) Pπ(T = 1 | H = hj, B = b) (see Eq. 4). Let π be an identical AI-assisted decision policy to π up to the decision for confidence values hi, b and hj, b. We distinguish between three cases. Case 1: Pπ(T = 1 | H = hi, B = b) < Pπ(T = 1 | H = hj, B = b). Let the probability of T = 1 under π for confidence values hi, b and hj, b be switched compared to π, i.e., P π(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b), P π(T = 1 | H = hj, B = b) = Pπ(T = 1 | H = hi, B = b). Then, π is not monotone, as Eq. 4 is not satisfied, and it holds that P π(T = 1 | H = hi, B = b) > Pπ(T = 1 | H = hi, B = b), P π(T = 1 | H = hj, B = b) < Pπ(T = 1 | H = hj, B = b). As we decreased P(T = 1 | H = hj, B = b) and increased P(T = 1 | H = hi, B = b), by properties (I) and (II), it must hold that the expected utility of π given confidence values hi, b and hj, b is higher than the one of π, i.e., E π[u(T, Y ) | H = hi, B = b] > Eπ[u(T, Y ) | H = hi, B = b] and (13) E π[u(T, Y ) | H = hj, B = b] > Eπ[u(T, Y ) | H = hj, B = b]. (14) Case 2: 0 < Pπ(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b) 1. 
Let the probability of T = 1 under π for confidence values hj, b be strictly lower compared to π and be the same as π for hi, b. Then, π is not monotone, since by case assumption P π(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b) > P π(T = 1 | H = hj, B = b) and the inequality in Eq. 14 holds by property (I). Case 3: Pπ(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b) = 0. Let the probability of T = 1 under π for confidence values hi, b be strictly higher compared to π and be the same as π for hj, b. Then, π is not monotone, since by case assumption P π(T = 1 | H = hj, B = b) = Pπ(T = 1 | H = hi, B = b) < P π(T = 1 | H = hi, B = b) and the inequality in Eq. 13 holds by property (II). As in all three cases at least one of the strict inequalities in Eqs. 13 or 14 holds and π is equivalent to π (i.e., it has the same expected conditional utility) given any other pair of confidence values h H, b B, we have that E π[u(T, Y )] = E[E π[u(T, Y )]|H, B] > E[Eπ[u(T, Y )|H, B] = Eπ[u(T, Y )]. Before proceeding further, we would like to note that we may also state Lemma 3 using h H, bi, bj B, with bi < bj, the proof would follow analogously. Now, we construct an AI-decision making process M, with H = {h1, h2, h3} and B = {b1, b2, b3}, such the decision maker s confidence f H is monotone, the classifier s confidence f B is perfectly calibrated, and the conditions of Lemma 3 are satisfied. First, let f H, f B and P M be such that P(f B(Z) = bj) = 3/6 if j = 1 2/6 if j = 2 1/6 if j = 3 0 otherwise P(H = hi | B = bj) := PX,V (H = hi | f B(Z) = bj) = ( 1 4 j if i j 0 otherwise. Then, it readily follows that P(H = hi, B = bj) = 1/6 for i j and P(H = hi, B = bj) = 0 otherwise. Moreover, for each pair of confidence values (hi, bj) with positive probability P(H = hi, B = bj), we set P(Y = 1 | H = hi, B = bj) = P + if i = j = 2 or (i = 3 and j {1, 3}) P if (j = 2 and i = 3) or (j = 1 and i {1, 2}), h1 h2 h3 h4 h5 h6 Figure 4: Nonzero values of P(Y = 1|H = hi, B = bj) and P(H = hi, B = bj) for every hi H and bj B used in the first (left) and second (right) part of the proof of Theorem 3. In each cell (hi, bj) in both panels, P + or P is the value of P(Y = 1|H = hi, B = bj) and lighter color means lower value of P(H = hi, B = bj), where white means P(Y = 1|h = hi, B = bj) = 0 and P(H, B) = 0. In both panels, the assignment of values is very stylized to facilitate the proof the classifier s confidence function f B partitions the feature space in a way such that a rational decision maker is unable to take decisions that maximize utility for almost all confidence values. However, less stylized examples also satisfy the conditions of Lemma 3. For example, as long as there is one triplet of confidence values b2, h2, h3 (or h3, b1, b2 in the left example) for which a rational decision maker is unable to take decisions that maximize utility, Lemma 3 can be applied. as shown in Figure 4 (left). Then, it readily follows that f H is monotone with respect to the probability that Y = 1, i.e., P(Y = 1 | H = hi) P(Y = 1 | H = hi+1)), and we have that the classifier s confidence values i:i j P(H = hi | B = bj) P(Y = 1 | H = hi, B = bj) 2/3 P + 1/3 P + if j = 1 1/2 P + 1/2 P + if j = 2 P + if j = 3 0 otherwise are perfectly calibrated and satisfy that bj < bj+1. Finally, using Lemma 3 with b = b2, hi = h2, hj = h3, we have that any monotone AI-assisted decision policy is suboptimal for any M with f H, f B and P M as defined above. Construction with arbitrary |H| = k and |B| = m, m > k 2. 
Construction with arbitrary $|\mathcal{H}| = k$ and $|\mathcal{B}| = m$, $m > k \geq 2$. In this second part of the proof, we construct an AI-assisted decision making process $M$, with $|\mathcal{H}| = k$ and $|\mathcal{B}| = m$ such that $m > k \geq 2$, such that the decision maker's confidence $f_H$ is monotone, the classifier's confidence $f_B$ is perfectly calibrated, and the conditions of Lemma 3 are satisfied. First, let the space of confidence values be $\mathcal{H} = \{h_i\}_{i \in [k]}$ and $\mathcal{B} = \{b_j\}_{j \in [m]}$, with orders $h_i < h_{i+1}$ and $b_j < b_{j+1}$, respectively, and let $f_H$, $f_B$ and $P_M$ be such that $P(f_B(Z) = b_j) = 1/m$ and
$$P(H = h_i \mid B = b_j) := P_{X,V}(H = h_i \mid f_B(Z) = b_j) = \begin{cases} \frac{m - j + 1}{m} & \text{if } j = i \\ \frac{m - j + 1}{m} & \text{if } i = 1,\ j > k \\ \frac{j - 1}{m} & \text{if } j = i + 1,\ j \leq k \\ \frac{j - 1}{m} & \text{if } i = k,\ j > k \\ 0 & \text{otherwise.} \end{cases}$$
Moreover, for each pair of confidence values $(h_i, b_j)$ with positive probability $P(H = h_i, B = b_j)$, we set
$$P(Y \mid H = h_i, B = b_j) = \begin{cases} P^- & \text{if } j = i \\ P^- & \text{if } i = 1,\ j > k \\ P^+ & \text{if } j = i + 1,\ j \leq k \\ P^+ & \text{if } i = k,\ j > k, \end{cases}$$
as shown in Figure 4 (right). Further, we set the classifier's confidence values $b_j$ to
$$b_j := \frac{(m - j + 1)\, P^-(Y = 1) + (j - 1)\, P^+(Y = 1)}{m}.$$
Then, it holds that $b_j < b_{j+1}$ and $f_B$ is perfectly calibrated, since
$$P(Y = 1 \mid B = b_j) = \begin{cases} P(H = h_j \mid B = b_j)\, P^-(Y = 1) + P(H = h_{j-1} \mid B = b_j)\, P^+(Y = 1) & \text{if } j \leq k \\ P(H = h_1 \mid B = b_j)\, P^-(Y = 1) + P(H = h_k \mid B = b_j)\, P^+(Y = 1) & \text{if } j > k \end{cases}$$
and thus, using the definitions of $P(H \mid B)$ and $P(Y \mid H, B)$, we have that $P(Y = 1 \mid B = b_j) = b_j$.

To show that $f_H$ is monotone with respect to the probability that $Y = 1$, first note that $P(H = h_i, B = b_i)$ decreases as $i$ increases and $P(H = h_i, B = b_{i+1})$ increases as $i$ increases. Moreover, further note that $P(Y = 1 \mid H = h_i, B = b_i) = P^-(Y = 1) < P(Y = 1 \mid H = h_i, B = b_{i+1}) = P^+(Y = 1)$. Hence, for any $i \in \{2, \ldots, k - 1\}$, it readily follows that
$$P(Y = 1 \mid H = h_i) = P^+(Y = 1)\, P(B = b_{i+1} \mid H = h_i) + P^-(Y = 1)\, P(B = b_i \mid H = h_i) \leq P(Y = 1 \mid H = h_{i+1}),$$
and, for $i = 1$, it is evident that $P(Y = 1 \mid H = h_1) < P(Y = 1 \mid H = h_2)$. Finally, using Lemma 3 with any choice of confidence values $b = b_j$, $h_i = h_{j-1}$ and $h_j = h_j$ with $j \in \{2, \ldots, k\}$, we have that any monotone AI-assisted decision policy $\pi$ is suboptimal for any $M$ with $|\mathcal{H}| = k$ and $|\mathcal{B}| = m$, $m > k \geq 2$, and $f_H$, $f_B$ and $P_M$ as defined above. Here, note that, as we do not fix the exact distributions $P^-$ and $P^+$, the above lemma applies to infinitely many AI-assisted decision making processes $M$.

Construction with $\mathcal{B} \subseteq [0, 1]$ and $|\mathcal{H}| = k$. In this last part of the proof, we construct an AI-assisted decision making process $M$, with $|\mathcal{H}| = k \geq 2$ and $\mathcal{B} \subseteq [0, 1]$, such that the decision maker's confidence function $f_H$ is monotone, the classifier's confidence function $f_B$ is perfectly calibrated, and the conditions of Lemma 3 are satisfied. First, let the space of confidence values be $\mathcal{H} = \{h_i\}_{i \in [k]}$, with order $h_i < h_{i+1}$, the feature space$^{10}$ $\mathcal{X} = [0, 1]$, and $f^-, f^+$ be two strictly monotone increasing functions with
$$f^- : [0, 1] \to [0, c) \quad \text{and} \quad f^+ : [0, 1] \to (c, 1], \qquad (17)$$
where
$$c = \frac{u(0, 0) - u(1, 0)}{u(1, 1) - u(1, 0) + u(0, 0) - u(0, 1)}. \qquad (18)$$
Further, let $Q_{k+1} = \{q_0, q_1, \ldots, q_k, q_{k+1}\}$ be a set of quantiles such that $P(X \leq q_j) = j/(k + 1)$ for all $j \in \{0, 1, \ldots, k + 1\}$. Thus, for all $j \in [k + 1]$ and $I_j := (q_{j-1}, q_j]$, it holds that $P(X \in I_j) = \frac{1}{k + 1}$. Now, let $f_H$ and $P_M$ be such that
$$P_V(H = h_i \mid X, X \in I_j) = \begin{cases} 1 & \text{if } i = j = 1 \text{ or } (i = k \text{ and } j = k + 1) \\ 1/2 & \text{if } 1 < j \leq k \text{ and } i \in \{j - 1, j\} \\ 0 & \text{otherwise,} \end{cases} \qquad (19)$$

$^{10}$For a more general feature space $\mathcal{X}$, we can use a mapping $\varphi$ of $\mathcal{X}$ to $[0, 1]$. The proof works analogously by substituting $X$ with $\varphi(X)$.

Figure 5: Nonzero values of $P(Y = 1 \mid X, H = h_i, X \in I_j)$ for every $h_i \in \mathcal{H}$, with $|\mathcal{H}| = 3$, and $I_j = (q_{j-1}, q_j]$, with $q_j \in Q_4$, used in the last part of the proof of Theorem 3. Lighter color means lower value of $f^-$ or $f^+$.
and
$$P(Y = 1 \mid X, H = h_i, X \in I_j) = \begin{cases} f^-(X) & \text{if } j = i \\ f^+(X) & \text{if } j = i + 1 \text{ or } (i = k \text{ and } j = k + 1), \end{cases} \qquad (20)$$
as shown in Figure 5. Next, we define
$$f_B(Z) = f_B(X) := P(Y = 1 \mid X) = \begin{cases} f^-(X) & \text{if } X \in I_1 \\ f^+(X) & \text{if } X \in I_{k+1} \\ (f^-(X) + f^+(X))/2 & \text{otherwise,} \end{cases}$$
which, by construction, is perfectly calibrated. To show that the decision maker's confidence function $f_H$ is monotone with respect to the probability that $Y = 1$, we first note that, using Eq. 19, we have that
$$P(X \in I_j \mid H = h_i) = 1/2 \ \text{ if } 1 < i < k \text{ and } j \in \{i, i + 1\}, \quad \text{and} \quad P(X \in I_j \mid H = h_i) = 0 \ \text{ whenever } j \notin \{i, i + 1\}. \qquad (21)$$
Hence, using Eq. 21 and the law of total probability, for any $i \in \{2, \ldots, k - 2\}$, we have that
$$P(Y = 1 \mid H = h_i) = \tfrac{1}{2}\big[P(Y = 1 \mid H = h_i, X \in I_i) + P(Y = 1 \mid H = h_i, X \in I_{i+1})\big] \leq \tfrac{1}{2}\big[f^-(q_i) + f^+(q_{i+1})\big] = \tfrac{1}{2}\big[f^-(\inf I_{i+1}) + f^+(\inf I_{i+2})\big] \leq \tfrac{1}{2}\big[P(Y = 1 \mid H = h_{i+1}, X \in I_{i+1}) + P(Y = 1 \mid H = h_{i+1}, X \in I_{i+2})\big] = P(Y = 1 \mid H = h_{i+1}),$$
where the inequalities follow from the fact that $f^-$ and $f^+$ are strictly monotone increasing. The corner cases $i = 1$ and $i = k - 1$ can be shown analogously by further using that $f^-(X) < c < f^+(X)$ for all $X$. Finally, using Lemma 3 with any choice of confidence values $h_i = h_{j-1}$, $h_j = h_j$, $j \in \{2, \ldots, k - 1\}$, and $b = f_B(X)$ with $X \in I_j$, we have that any monotone AI-assisted decision policy $\pi$ is suboptimal for any $M$ with $\mathcal{B} \subseteq [0, 1]$ and $|\mathcal{H}| = k$, $k \geq 2$, and $f_H$, $f_B$ and $P_M$ as defined above.

A.3 Proof of Theorem 5

We prove the statement by contradiction. Let $M$ be an AI-assisted decision making process satisfying Eqs. 2 and 3, with a utility function $u(T, Y)$ satisfying Eq. 1, and let $M$ be such that $f_B$ satisfies $\alpha$-alignment with respect to $f_H$ and $f_B$ has output space $\mathcal{B} \subseteq [0, 1]$. Assume there exists no (near-)optimal monotone AI-assisted decision policy for utility $u$. Then, there must exist an optimal AI-assisted decision policy $\pi \in \Pi(\mathcal{H}, \mathcal{B})$ which is not monotone and has strictly greater expected utility than any monotone policy. However, we show that we can modify $\pi$ into a monotone AI-assisted decision policy $\hat{\pi} \in \Pi(\mathcal{H}, \mathcal{B})$ with near-optimal expected utility, a contradiction.

As $\pi$ is not monotone, there must exist confidence values $h_1, h_2 \in \mathcal{H}$, $h_1 \leq h_2$, and $b_1, b_2 \in \mathcal{B}$, $b_1 \leq b_2$, such that
$$\pi(h_1, b_1, w) > \pi(h_2, b_2, w) \ \text{ for some } w \in \mathcal{W}, \qquad (22)$$
where $\mathcal{W}$ denotes the space of noise values. In what follows, let $W^{(\pi, h_2, b_2)}_{h_1, b_1} \subseteq \mathcal{W}$ denote the set containing any such $w$ and let $W^{(\pi, h_2, b_2)} = \bigcup_{(h, b) \in \mathcal{H} \times \mathcal{B}} W^{(\pi, h_2, b_2)}_{h, b}$.

For any confidence values $h', b' \in \mathcal{H} \times [0, 1]$, we modify the policy $\pi$ into a policy $\hat{\pi}$ as follows. Let $\{\tilde{S}_h\}_{h \in \mathcal{H}}$ denote the sets satisfying the $\alpha$-alignment condition for $f_B$ with respect to $f_H$ and, given confidence $h'$, let $\hat{b}_{h'}$ denote the smallest confidence value of $f_B$ such that there exists $h \leq h'$ with $P(Y = 1 \mid B = \hat{b}_{h'}, Z \in \tilde{S}_h) \geq c$, i.e.,
$$\hat{b}_{h'} := \min\{b \in \mathcal{B} \mid P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \geq c \ \text{ for some } h \leq h'\}. \qquad (23)$$
Now, we define the new AI-assisted decision policy $\hat{\pi}$ from $\pi$ as follows:
$$\hat{\pi}(h', b', w) := \begin{cases} 1 & \text{if } b' \geq \hat{b}_{h'} \text{ and } w \in \bigcup_{h \leq h',\, b \in [\hat{b}_{h'}, b']} W^{(\pi, h, b)} \\ 0 & \text{if } b' < \hat{b}_{h'} \text{ and } w \in \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)} \\ \pi(h', b', w) & \text{otherwise.} \end{cases} \qquad (24)$$
Next, we show that $\hat{\pi}$ is monotone and that $\mathbb{E}_{\hat{\pi}}[u(T, Y)] \geq \mathbb{E}_\pi[u(T, Y)] - \alpha\, a$ for some constant $a$.

Proof that $\hat{\pi}$ is a monotone AI-assisted decision policy. To prove that $\hat{\pi} \in \Pi(\mathcal{H}, \mathcal{B})$ is a monotone AI-assisted decision policy, we show that, for all $h', h'' \in \mathcal{H}$ and $b', b'' \in \mathcal{B}$, with $h' \leq h''$ and $b' \leq b''$, it holds that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$. We distinguish between three cases.

Case 1: $b' \geq \hat{b}_{h'}$ and $b'' \geq \hat{b}_{h''}$. Since $h' \leq h''$, $b' \leq b''$ and, by definition, $\hat{b}_{h''} \leq \hat{b}_{h'}$ (as $h' \leq h''$), we have that
$$\bigcup_{h \leq h',\, b \in [\hat{b}_{h'}, b']} W^{(\pi, h, b)} \subseteq \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)}.$$
Hence, we can conclude that
$$\hat{\pi}(h', b', w) \leq 1 = \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)}. \qquad (25)$$
Further, for any other $w \in \mathcal{W} \setminus \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)} \subseteq \mathcal{W} \setminus W^{(\pi, h'', b'')}_{h', b'}$, we have that $\hat{\pi}(h', b', w) = \pi(h', b', w)$ and $\hat{\pi}(h'', b'', w) = \pi(h'', b'', w)$ and, by the definition of $W^{(\pi, h'', b'')}_{h', b'}$, it follows that
$$\hat{\pi}(h', b', w) \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \mathcal{W} \setminus \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)}. \qquad (26)$$
From Eqs. 25 and 26, it follows that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$.

Case 2: $b' < \hat{b}_{h'}$ and $b'' \geq \hat{b}_{h''}$. By the definition of $\hat{\pi}$, we have that
$$\hat{\pi}(h', b', w) \leq 1 = \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)} \qquad (27)$$
and
$$\hat{\pi}(h', b', w) = 0 \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}. \qquad (28)$$
Analogously to Case 1, since the remaining values of $w$ also lie in $\mathcal{W} \setminus W^{(\pi, h'', b'')}_{h', b'}$ and $\hat{\pi}$ is equivalent to $\pi$ for these values, we have that
$$\hat{\pi}(h', b', w) \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \mathcal{W} \setminus \Big(\bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)} \cup \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}\Big). \qquad (29)$$
From Eqs. 27, 28 and 29, it follows that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$.

Case 3: $b' < \hat{b}_{h'}$ and $b'' < \hat{b}_{h''}$. Since $h' \leq h''$, $b' \leq b''$ and, by definition, $\hat{b}_{h''} \leq \hat{b}_{h'}$ (as $h' \leq h''$), we have that
$$\bigcup_{h \geq h'',\, b \in [b'', \hat{b}_{h''})} W^{(\pi, h, b)} \subseteq \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}.$$
Hence, we can conclude that
$$\hat{\pi}(h', b', w) = 0 \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}. \qquad (30)$$
Again analogously to Case 1, since the remaining values of $w$ also lie in $\mathcal{W} \setminus W^{(\pi, h'', b'')}_{h', b'}$ and $\hat{\pi}$ is equivalent to $\pi$ for these values, we have that
$$\hat{\pi}(h', b', w) \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \mathcal{W} \setminus \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}. \qquad (31)$$
From Eqs. 30 and 31, it follows that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$.

Note that we cannot have a case where $b' \geq \hat{b}_{h'}$ and $b'' < \hat{b}_{h''}$, as this would imply $b'' < b'$. Since, in all three possible cases, we have shown that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$, we can conclude that $\hat{\pi} \in \Pi(\mathcal{H}, \mathcal{B})$ is monotone.

Proof that $\hat{\pi}$ is near-optimal. First, we rewrite the inner expectation in Eq. 10 as
$$\mathbb{E}_\pi[u(T, Y) \mid H, B] = \mathbb{E}[u(0, Y) \mid H, B] + \big(\mathbb{E}[u(1, Y) \mid H, B] - \mathbb{E}[u(0, Y) \mid H, B]\big)\, P_\pi(T = 1 \mid H, B).$$
Further, recall that $|\tilde{S}_h| \geq (1 - \alpha/2)|S_h|$ for all $h \in \mathcal{H}$ and that, for all $h', h'' \in \mathcal{H}$, $h' \leq h''$, and all $b', b'' \in [0, 1]$, $b' \leq b''$, we have
$$P(Y = 1 \mid f_B(Z) = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid f_B(Z) = b'', Z \in \tilde{S}_{h''}) \leq \alpha. \qquad (32)$$
Now, for any $h' \in \mathcal{H}$ and $b' \in \mathcal{B}$, we derive an upper bound on $\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b']$. We distinguish between three cases.

Case 1: $b' \geq \hat{b}_{h'}$ and $P(Y = 1 \mid H = h', B = b') \geq c$. Using Lemma 2, we have that
$$\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b'] \geq 0. \qquad (33)$$
Moreover, as $b' \geq \hat{b}_{h'}$, the probability of a positive decision under $\hat{\pi}$ can only increase for $h', b'$ compared to $\pi$ (see Eq. 24), i.e.,
$$P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b') \leq 0.$$
Hence, it follows that
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] = \big(\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b']\big)\big(P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b')\big) \leq 0. \qquad (34)$$

Case 2: $b' \geq \hat{b}_{h'}$ and $P(Y = 1 \mid H = h', B = b') < c$. Since $b' \geq \hat{b}_{h'}$, there exist $(h, b) \in \mathcal{H} \times \mathcal{B}$, with $h \leq h'$, $b \leq b'$, such that $P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \geq c$. Moreover, using the definition of $\alpha$-alignment, we have that
$$P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \leq P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) + \alpha. \qquad (35)$$
Then, we can use this to lower bound the expected utility of $T = 1$ given $B = b'$ and $Z \in \tilde{S}_{h'}$ in terms of that given $B = b$ and $Z \in \tilde{S}_h$:
$$\mathbb{E}[u(1, Y) \mid B = b, Z \in \tilde{S}_h] - \mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] = u(1, 1)\big(P(Y = 1 \mid B = b, Z \in \tilde{S}_h) - P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'})\big) + u(1, 0)\big(P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid B = b, Z \in \tilde{S}_h)\big) \leq \big(u(1, 1) - u(1, 0)\big)\, \alpha, \qquad (36)$$
where the last inequality holds due to Eq. 35 and the assumption that $u(1, 1) - u(1, 0) > 0$.
Analogously, we can also upper bound the expected utility of $T = 0$ given $B = b'$ and $Z \in \tilde{S}_{h'}$ as follows:
$$\mathbb{E}[u(0, Y) \mid B = b, Z \in \tilde{S}_h] - \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}] = u(0, 1)\big(P(Y = 1 \mid B = b, Z \in \tilde{S}_h) - P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'})\big) + u(0, 0)\big(P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid B = b, Z \in \tilde{S}_h)\big) \geq \big(u(0, 1) - u(0, 0)\big)\, \alpha, \qquad (37)$$
where the last inequality holds due to Eq. 35 and the assumption that $u(0, 1) - u(0, 0) < 0$. Now, as $P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \geq c$, by Lemma 2, we have that
$$\mathbb{E}[u(1, Y) \mid B = b, Z \in \tilde{S}_h] \geq \mathbb{E}[u(0, Y) \mid B = b, Z \in \tilde{S}_h]. \qquad (38)$$
Combining Eqs. 36, 37 and 38, we obtain
$$\mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] + \alpha\big(u(1, 1) - u(1, 0)\big) \geq \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}] + \alpha\big(u(0, 1) - u(0, 0)\big). \qquad (39)$$
In addition, note that we have the following trivial bounds for the expectations when $H = h'$ but $Z \notin \tilde{S}_{h'}$:
$$u(1, 0) \leq \mathbb{E}[u(1, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] \leq u(1, 1), \qquad (40)$$
$$u(0, 1) \leq \mathbb{E}[u(0, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] \leq u(0, 0). \qquad (41)$$
Moreover, since $b' \geq \hat{b}_{h'}$, the probability of a positive decision under $\hat{\pi}$ can only increase for $h', b'$ compared to $\pi$, i.e., $P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b') \leq 0$. Hence, we have that
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] \leq (-1)\,\big(\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b']\big), \qquad (42)$$
where the inequality follows since $\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b'] \leq 0$ by Lemma 2, as $P(Y = 1 \mid H = h', B = b') < c$. Finally, combining Eqs. 39, 40, 41 and 42 and using the law of total expectation, we obtain
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] \leq (1 - \beta_{(h', b')})\big(\mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}] - \mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}]\big) + \beta_{(h', b')}\big(\mathbb{E}[u(0, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] - \mathbb{E}[u(1, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}]\big) \leq (1 - \beta_{(h', b')})\,\alpha\,\big(u(1, 1) - u(1, 0) + u(0, 0) - u(0, 1)\big) + \beta_{(h', b')}\big(u(0, 0) - u(1, 0)\big), \qquad (43)$$
where $\beta_{(h', b')}$ denotes the probability of $Z \notin \tilde{S}_{h'}$ given $H = h', B = b'$, i.e., $\beta_{(h', b')} = P(Z \notin \tilde{S}_{h'} \mid H = h', B = b')$.

Case 3: $b' < \hat{b}_{h'}$. For all $(h, b)$, with $h \leq h'$, $b \leq b'$, we have that $P(Y = 1 \mid B = b, Z \in \tilde{S}_h) < c$. In particular, $P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) < c$. Thus, by Lemma 2,
$$\mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] < \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}]. \qquad (44)$$
In this case, since $b' < \hat{b}_{h'}$, the probability of a positive decision under $\hat{\pi}$ can only decrease for $h', b'$ compared to $\pi$, i.e.,
$$0 \leq P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b') \leq 1.$$
Combining Eqs. 44, 40 and 41 and using the law of total expectation, we obtain
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] \leq \big(\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b']\big) \cdot 1 = (1 - \beta_{(h', b')})\big(\mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] - \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}]\big) + \beta_{(h', b')}\big(\mathbb{E}[u(1, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] - \mathbb{E}[u(0, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}]\big) \leq \beta_{(h', b')}\big(u(1, 1) - u(0, 1)\big), \qquad (45)$$
where, again, $\beta_{(h', b')} = P(Z \notin \tilde{S}_{h'} \mid H = h', B = b')$.

Now, for a fixed $h' \in \mathcal{H}$, since $|\tilde{S}_{h'}| \geq (1 - \alpha/2)|S_{h'}|$, we know that $0 \leq \mathbb{E}_B[\beta_{(h', B)} \mid H = h'] \leq \alpha/2$. Hence, combining Eqs. 34, 43 and 45 from the three cases above, we have that
$$\mathbb{E}_B\big[\mathbb{E}_\pi[u(T, Y) \mid H = h', B]\big] - \mathbb{E}_B\big[\mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B]\big] = \mathbb{E}_B\big[\mathbb{E}_\pi[u(T, Y) \mid H = h', B] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B]\big] \leq \max\Big\{\alpha\big(u(1, 1) - u(1, 0) + u(0, 0) - u(0, 1)\big) + \tfrac{\alpha}{2}\big(u(0, 0) - u(1, 0)\big),\ \tfrac{\alpha}{2}\big(u(1, 1) - u(0, 1)\big)\Big\} \leq \alpha\Big(u(1, 1) - u(0, 1) + \tfrac{3}{2}\big(u(0, 0) - u(1, 0)\big)\Big).$$
Finally, since by assumption $\pi$ is optimal, i.e., $\mathbb{E}_\pi[u(T, Y)] = \max_{\pi' \in \Pi(\mathcal{H}, \mathcal{B})} \mathbb{E}_{\pi'}[u(T, Y)]$, we can conclude by the law of total expectation that
$$\mathbb{E}_\pi[u(T, Y)] = \mathbb{E}_H\Big[\mathbb{E}_B\big[\mathbb{E}_{Y, T \mid \pi}[u(T, Y) \mid H, B]\big]\Big] \leq \mathbb{E}_{\hat{\pi}}[u(T, Y)] + \alpha\Big(u(1, 1) - u(0, 1) + \tfrac{3}{2}\big(u(0, 0) - u(1, 0)\big)\Big).$$
This concludes the proof.
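The proof above uses the alignment condition in Eq. 32 only through the conditional probabilities $P(Y = 1 \mid B = b, Z \in \tilde{S}_h)$. As a rough illustration of how one might gauge this condition on a finite sample, the sketch below estimates the largest violation of a binned version of the alignment inequality. The function name, the binning scheme and the minimum cell size are our own choices, not the estimator used in the paper, and the empirical frequencies only approximate the population quantities in Eq. 32.

```python
import numpy as np

def empirical_alignment_gap(h, b, y, n_bins=10, min_count=20):
    """Estimate the largest violation of the (binned) alignment condition, i.e., the maximum of
    P(Y=1 | B in bin', H=h') - P(Y=1 | B in bin'', H=h'') over h' <= h'' and bin' <= bin''."""
    h, b, y = np.asarray(h), np.asarray(b), np.asarray(y)
    h_levels = np.sort(np.unique(h))
    bins = np.minimum((b * n_bins).astype(int), n_bins - 1)          # coarse bins over [0, 1]
    rate = np.full((len(h_levels), n_bins), np.nan)
    for i, hv in enumerate(h_levels):
        for j in range(n_bins):
            mask = (h == hv) & (bins == j)
            if mask.sum() >= min_count:                              # drop tiny cells, cf. the trimming to S~_h
                rate[i, j] = y[mask].mean()
    gap = 0.0
    for i1 in range(len(h_levels)):
        for j1 in range(n_bins):
            for i2 in range(i1, len(h_levels)):
                for j2 in range(j1, n_bins):
                    if not (np.isnan(rate[i1, j1]) or np.isnan(rate[i2, j2])):
                        gap = max(gap, rate[i1, j1] - rate[i2, j2])
    return gap

# Synthetic usage: a calibrated classifier whose confidence also drives the human's confidence
# is aligned, so the estimated gap should be close to zero, up to sampling noise.
rng = np.random.default_rng(0)
b_syn = rng.uniform(size=5000)
y_syn = rng.binomial(1, b_syn)
h_syn = (b_syn > 0.5).astype(int)
print(empirical_alignment_gap(h_syn, b_syn, y_syn))
```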
A.4 Proof of Theorem 8

If $f_B$ is $\alpha/2$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then, by definition, for any $h \in \mathcal{H}$, there exists $\tilde{S}_h \subseteq S_h$ with $|\tilde{S}_h| \geq (1 - \alpha/2)|S_h|$ such that, for any $b \in [0, 1]$, it holds that
$$|P(Y = 1 \mid f_B(Z) = b, Z \in \tilde{S}_h) - b| \leq \alpha/2.$$
This directly implies that, for any $h', h'' \in \mathcal{H}$ and $b', b'' \in [0, 1]$, we have that
$$\big(P(Y = 1 \mid f_B(Z) = b', Z \in \tilde{S}_{h'}) - b'\big) - \big(P(Y = 1 \mid f_B(Z) = b'', Z \in \tilde{S}_{h''}) - b''\big) \leq \alpha \qquad (46)$$
and, rearranging terms, we further have that
$$P(Y = 1 \mid f_B(Z) = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid f_B(Z) = b'', Z \in \tilde{S}_{h''}) \leq \alpha + b' - b'', \qquad (47)$$
showing that, whenever $b' \leq b''$, the $\alpha$-alignment condition is met. This proves that $f_B$ is $\alpha$-aligned with respect to $f_H$. Finally, if $f_B$ is $\alpha/2$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then it is $\alpha/2$-calibrated with respect to each of the sets $S_h$. Since $\mathcal{Z} = \bigcup_{h \in \mathcal{H}} S_h$, this implies that $f_B$ is $\alpha/2$-calibrated with respect to $\mathcal{Z}$. This concludes the proof.

A.5 Proof of Proposition 1

Given a discretization parameter $\lambda$, Algorithm 1 works with a discretized notion of $\alpha$-multicalibration, namely $(\alpha, \lambda)$-multicalibration:

Definition 10. Let $\mathcal{C} \subseteq 2^{\mathcal{Z}}$ be a collection of subsets of $\mathcal{Z}$. For any $\alpha, \lambda > 0$, a confidence function $f_B : \mathcal{Z} \to [0, 1]$ is $(\alpha, \lambda)$-multicalibrated with respect to $\mathcal{C}$ if, for all $S_h \in \mathcal{C}$, all $b \in \Lambda[0, 1]$, and all $S_{h, \lambda}(b)$ such that $|S_{h, \lambda}(b)| \geq \alpha \lambda |S_h|$, it holds that
$$\big|\mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h, \lambda}(b)]\big| \leq \alpha. \qquad (48)$$

Here, we can analogously define a discretized notion of $\alpha$-alignment, namely $(\alpha, \lambda)$-alignment.

Definition 11. For $\alpha, \lambda > 0$, a confidence function $f_B : \mathcal{Z} \to [0, 1]$ is $(\alpha, \lambda)$-aligned with respect to $f_H$ if, for all $h', h'' \in \mathcal{H}$, $h' \leq h''$, and all $b', b'' \in \Lambda[0, 1]$, $b' \leq b''$, with $|S_{h', \lambda}(b')| > (\alpha/2)\, \lambda |S_{h'}|$ and $|S_{h'', \lambda}(b'')| > (\alpha/2)\, \lambda |S_{h''}|$, we have
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b')) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'')) \leq \alpha. \qquad (49)$$

In what follows, we first show that $(\alpha, \lambda)$-multicalibration with respect to $\{S_h\}_{h \in \mathcal{H}}$ implies $(2\alpha + \lambda, \lambda)$-alignment with respect to $f_H$.

Theorem 12. For $\alpha, \lambda > 0$, if $f_B$ is $(\alpha, \lambda)$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then $f_B$ is $(2\alpha + \lambda, \lambda)$-aligned with respect to $f_H$.

Proof. If $f_B$ is $(\alpha, \lambda)$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then, by definition, for all $h \in \mathcal{H}$, all $b \in \Lambda[0, 1]$, and all $S_{h, \lambda}(b)$ such that $|S_{h, \lambda}(b)| \geq \alpha \lambda |S_h|$, it holds that
$$\big|\mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h, \lambda}(b)]\big| \leq \alpha. \qquad (50)$$
This directly implies that, for all $h', h'' \in \mathcal{H}$ and $b', b'' \in \Lambda[0, 1]$ with $|S_{h', \lambda}(b')| \geq \alpha \lambda |S_{h'}|$ and $|S_{h'', \lambda}(b'')| \geq \alpha \lambda |S_{h''}|$, it holds that
$$\mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h', \lambda}(b')] - \mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h'', \lambda}(b'')] \leq 2\alpha \qquad (51)$$
and, using the linearity of expectation, we have that
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b')) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'')) \leq 2\alpha + \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h', \lambda}(b')] - \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h'', \lambda}(b'')]. \qquad (52)$$
Whenever $b' \leq b''$, due to the $\lambda$-discretization, we have that
$$\mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h', \lambda}(b')] - \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h'', \lambda}(b'')] \leq \lambda. \qquad (53)$$
Hence, we have shown that, if $f_B$ is $(\alpha, \lambda)$-multicalibrated, then, for all $h', h'' \in \mathcal{H}$ and $b', b'' \in \Lambda[0, 1]$ with $|S_{h', \lambda}(b')| \geq \alpha \lambda |S_{h'}|$, $|S_{h'', \lambda}(b'')| \geq \alpha \lambda |S_{h''}|$ and $b' \leq b''$, we have
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b')) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'')) \leq 2\alpha + \lambda. \qquad (54)$$
Further, note that $\frac{2\alpha + \lambda}{2}\,\lambda > \alpha\,\lambda$ as $\lambda > 0$, i.e., the group-size threshold in the definition of $(2\alpha + \lambda, \lambda)$-alignment is larger than the one for which Eq. 54 was established. This concludes the proof.
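As an illustration of the quantity that Definition 10 controls, the following sketch estimates, on a finite sample, the residual $|\mathbb{E}[f_B(X,H) - P(Y = 1 \mid X, H) \mid (X,H) \in S_{h,\lambda}(b)]|$ for every sufficiently large cell $S_{h,\lambda}(b)$, approximating $P(Y = 1 \mid X, H)$ by the empirical label frequency within the cell. The function and the synthetic data are illustrative; this is not the estimation procedure used by Algorithm 1.

```python
import numpy as np

def multicalibration_residuals(f_b, h, y, lam=0.1, alpha=0.1):
    """Per group S_h and per lambda-bin, estimate |E[f_B(X,H) - P(Y=1 | X,H) | S_{h,lambda}(b)]|,
    using the empirical label frequency in the cell as a proxy for P(Y=1 | X,H)."""
    f_b, h, y = np.asarray(f_b), np.asarray(h), np.asarray(y)
    grid = np.arange(lam / 2, 1, lam)                        # the lambda-discretization Lambda[0, 1]
    residuals = {}
    for hv in np.unique(h):
        in_group = (h == hv)
        for b in grid:
            cell = in_group & (np.abs(f_b - b) <= lam / 2)   # S_{h,lambda}(b)
            if cell.sum() >= alpha * lam * in_group.sum():   # only the cells Definition 10 constrains
                residuals[(hv, round(float(b), 3))] = abs(f_b[cell].mean() - y[cell].mean())
    return residuals

# Synthetic usage: when f_B equals the true conditional probability, all residuals should be small.
rng = np.random.default_rng(1)
p = rng.uniform(size=4000)
y_syn = rng.binomial(1, p)
h_syn = (p > 0.5).astype(int)
print(max(multicalibration_residuals(f_b=p, h=h_syn, y=y_syn).values()))
```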
Next, we show that, if $f_B$ is $(\alpha, \lambda)$-aligned, then $f_{B,\lambda}$ is $\alpha$-aligned with respect to $f_H$.

Theorem 13. For $\alpha, \lambda > 0$, if $f_B$ is $(\alpha, \lambda)$-aligned with respect to $f_H$, then $f_{B,\lambda}$ is $\alpha$-aligned with respect to $f_H$.

Proof. The proof is similar to the proof of Lemma 1 in Hébert-Johnson et al. [11]. Consider all $S_{h, \lambda}(b)$ such that $|S_{h, \lambda}(b)| < \alpha \lambda |S_h|$. By the $\lambda$-discretization, there are at most $1/\lambda$ such sets; thus, the cardinality of their union is at most $\frac{1}{\lambda}\, \alpha \lambda |S_h| = \alpha |S_h|$. Hence, for all $h \in \mathcal{H}$, there exists a subset $\tilde{S}_h \subseteq S_h$ with $|\tilde{S}_h| \geq (1 - \alpha)|S_h|$ such that, for all $h', h'' \in \mathcal{H}$, with $h' \leq h''$, and all $b', b'' \in \Lambda[0, 1]$, with $b' \leq b''$, it holds that
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b') \cap \tilde{S}_{h'}) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'') \cap \tilde{S}_{h''}) \leq \alpha. \qquad (55)$$
The $\lambda$-discretization sets all values of $(x, h) \in S_{h', \lambda}(b')$ to $f_{B, \lambda}(x, h) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b')]$. Note that, for $(x, h) \in S_{h', \lambda}(b')$, $f_{B, \lambda}(x, h) \in \lambda(b')$ and, for $(x, h) \in S_{h'', \lambda}(b'')$, $f_{B, \lambda}(x, h) \in \lambda(b'')$, so it still holds that $\mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b')] \leq \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b'')]$. Thus, using Eq. 55, we have that
$$P\big(Y = 1 \mid f_{B, \lambda}(X, H) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b')],\ (X, H) \in \tilde{S}_{h'}\big) - P\big(Y = 1 \mid f_{B, \lambda}(X, H) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b'')],\ (X, H) \in \tilde{S}_{h''}\big) \leq \alpha. \qquad (56)$$
This concludes the proof.

Finally, using Theorems 12 and 13, it readily follows that, given a parameter $\alpha$, the discretized confidence function $f_{B, \lambda}$ returned by Algorithm 1 satisfies $(2\alpha + \lambda)$-aligned calibration with respect to $f_H$.

A.6 Proof of Theorem 9

We structure the proof in three parts. We first explain the calibration guarantee that UMD provides and how it relates to human-aligned calibration. Then, we derive a lower bound on the size of the subsets $D \cap S_h$ so that the discretized confidence function $f_{B, \lambda}$ satisfies $\alpha$-aligned calibration with respect to $f_H$ with high probability. Finally, building on this result, we derive an upper bound on the number of samples $|D|$ needed so that $f_{B, \lambda}$ satisfies $\alpha$-aligned calibration with high probability, as long as there exists $\gamma > 0$ such that $P((X, H) \in S_h) \geq \gamma$ for all $h \in \mathcal{H}$.

Conditional calibration implies human-aligned calibration. Running UMD on a dataset $D \in (\mathcal{Z} \times \mathcal{Y})^n$, where each datapoint is sampled from $P_M$, guarantees $(\alpha, \xi)$-conditional calibration, a PAC-style calibration guarantee [12]. Given a dataset $D$, a confidence function $f_B$ satisfies $(\alpha, \xi)$-conditional calibration if, with probability at least $1 - \xi$ over the randomness in $D$,
$$\forall b \in [0, 1], \quad |P(Y = 1 \mid f_B(X, H) = b) - b| \leq \alpha.$$
This stands in contrast to the definition of $\alpha$-calibration, which requires only that the confidence $f_B(X, H)$ is at most $\alpha$ away from the true probability for a $1 - \alpha$ fraction of $\mathcal{Z}$. Similarly, using a union bound over all $h \in \mathcal{H}$, $(\alpha/2, \xi/|\mathcal{H}|)$-conditional calibration of $f_B$ on each $S_h$, $h \in \mathcal{H}$, implies that, with probability at least $1 - \xi$ over the randomness in $D$, $f_B$ satisfies that
$$\forall h \in \mathcal{H},\ \forall b \in [0, 1], \quad |P(Y = 1 \mid f_B(X, H) = b, H = h) - b| \leq \alpha/2. \qquad (57)$$
Hence, analogously to the proof of Theorem 8, this implies that, with probability at least $1 - \xi$ over the randomness in $D$, $f_B$ also satisfies that
$$\forall h, h' \in \mathcal{H},\ h \leq h',\ \forall b, b' \in G,\ b \leq b', \quad P(Y = 1 \mid f_B(X, H) = b, H = h) \leq P(Y = 1 \mid f_B(X, H) = b', H = h') + \alpha. \qquad (58)$$
In summary, from Eqs. 57 and 58, we can conclude that $(\alpha/2, \xi/|\mathcal{H}|)$-conditional calibration of $f_B$ on each $S_h$, $h \in \mathcal{H}$, implies that, with probability at least $1 - \xi$, $f_B$ satisfies $\alpha$-aligned calibration, where, for all $h \in \mathcal{H}$, we have that $\tilde{S}_h = S_h$.

Lower bound on $|D \cap S_h|$ to achieve conditional calibration with UMD. Running UMD on each partition $D \cap S_h$ of $D$ induced by $h \in \mathcal{H}$ achieves $(\alpha/2, \xi/|\mathcal{H}|)$-conditional calibration as long as each subset $D \cap S_h$ of the data is large enough. More specifically, the following lower bound on the size of the subsets $D \cap S_h$ readily follows from Theorem 3 in Gupta et al. [12].

Lemma 4. The discretized confidence function $f_{B, \lambda}$ returned by $|\mathcal{H}|$ instances of UMD, one per $S_h$, is $(\alpha/2, \xi/|\mathcal{H}|)$-conditionally calibrated on each $S_h$ for any $\xi \in (0, 1)$ if
$$|D \cap S_h| \geq n_{\min} := \frac{1}{\lambda}\left(\frac{2}{\alpha^2}\log\frac{2|\mathcal{H}|}{\lambda\,\xi} + 1\right). \qquad (59)$$

Proof. Let $B$ denote the number of bins used by UMD.
Theorem 3 in Gupta et al. [12] states that, if $f_B(X, H)$ is absolutely continuous with respect to the Lebesgue measure$^{11}$ and $|D \cap S_h| \geq 2B$, then the discretized confidence function output by UMD is $(\epsilon, \xi')$-conditionally calibrated for any $\xi' \in (0, 1)$ and
$$\epsilon = \sqrt{\frac{\log(2B/\xi')}{2\left(\lfloor |D \cap S_h| / B \rfloor - 1\right)}}. \qquad (60)$$
Then, for a given $\alpha$, setting $\epsilon = \alpha/2$, $B = 1/\lambda$ and $\xi' = \xi/|\mathcal{H}|$, we can solve Eq. 60 for the lower bound $|D \cap S_h| \geq n_{\min}$, with $n_{\min}$ as defined in Eq. 59.

Upper bound on $|D|$ to achieve conditional calibration with UMD. Suppose that $P((X, H) \in S_h) \geq \gamma$ for all $h \in \mathcal{H}$. When $|\mathcal{H}| \geq 2$, we give an upper bound on the number of samples $|D|$ needed so that, with high probability, $|D \cap S_h| \geq n_{\min}$ for all $h \in \mathcal{H}$. In the process of sampling $D \in (\mathcal{Z} \times \mathcal{Y})^n$ from $P_M$, let $R^{(h)}_i = 1$ denote the event that the $i$-th datapoint $(x_i, h_i, y_i)$ has confidence value $h$, i.e., $h_i = h$. Then, we can express $|D \cap S_h|$ in terms of the random variable $R^{(h)}$, defined as
$$R^{(h)} = \sum_{i = 1}^n R^{(h)}_i. \qquad (61)$$
Since each $R^{(h)}_i$ is a Bernoulli random variable with $P(R^{(h)}_i = 1) = P((X, H) \in S_h)$, the expected value of $R^{(h)}$ is $\mu^{(h)} := \mathbb{E}[R^{(h)}] = P((X, H) \in S_h) \cdot |D| \geq \gamma\, |D|$. Let $|D| = \lceil 2 |\mathcal{H}| \log(2/\xi) \rceil \frac{1}{\gamma}\, n_{\min}$ and observe that, in this case,
$$P\big(R^{(h)} \leq n_{\min}\big) \leq P\Big(R^{(h)} \leq \frac{\gamma}{2 |\mathcal{H}| \log(2/\xi)}\, |D|\Big).$$
For $|\mathcal{H}| \geq 2$ and $\xi \in (0, 1)$, we have $\frac{1}{2 |\mathcal{H}| \log(2/\xi)} \in (0, 1)$ and we can use a variation of the Chernoff bound to show that
$$P\big(R^{(h)} \leq n_{\min}\big) \leq P\Big(R^{(h)} \leq \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\, \mu^{(h)}\Big) \leq e^{-\frac{\mu^{(h)}}{2}\left(1 - \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\right)^2} \leq e^{-|\mathcal{H}| \log(2/\xi)\, n_{\min} \left(1 - \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\right)^2},$$
where the first and last inequalities result from using $\mu^{(h)} \geq \gamma |D| \geq 2 |\mathcal{H}| \log(2/\xi)\, n_{\min}$. We can now use a union bound to obtain a lower bound on the probability that, for all $h \in \mathcal{H}$, $|D \cap S_h| \geq n_{\min}$, i.e.,
$$P\big(\exists\, h \in \mathcal{H} : |D \cap S_h| < n_{\min}\big) \leq |\mathcal{H}|\, e^{-|\mathcal{H}| \log(2/\xi)\, n_{\min} \left(1 - \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\right)^2}. \qquad (62)$$
One can verify that, for $|\mathcal{H}| \geq 2$ and $n_{\min} \geq 1$, the right-hand side of Eq. 62 is at most $\xi/2$, i.e., $P\big(\exists\, h \in \mathcal{H} : |D \cap S_h| < n_{\min}\big) \leq \xi/2$. Hence, if $|D| = \lceil 2 |\mathcal{H}| \log(2/\xi) \rceil \frac{1}{\gamma}\, n_{\min}$, then, for all $h \in \mathcal{H}$, $|D \cap S_h| \geq n_{\min}$ with probability at least $1 - \xi/2$.

$^{11}$If $f_B$ is not absolutely continuous with respect to the Lebesgue measure (or, equivalently, if $f_B$ does not have a probability density function), a randomization trick can be used to ensure that the results of the theorem hold.

Combining this result and Lemma 4, we have that the discretized confidence function $f_{B, \lambda}$ returned by $|\mathcal{H}|$ instances of UMD, one per $S_h$, is $(\alpha/2, \xi/(2|\mathcal{H}|))$-conditionally calibrated on each $S_h$ with probability at least $1 - \xi/2$ for any $\xi \in (0, 1)$ if $|D| = \lceil 2 |\mathcal{H}| \log(2/\xi) \rceil \frac{1}{\gamma}\, n_{\min}$. Finally, using a union bound, we can conclude that $f_{B, \lambda}$ achieves $\alpha$-aligned calibration with respect to $f_H$ with probability at least $1 - \xi$ from
$$|D| = O\left(\frac{|\mathcal{H}|\, \log(2/\xi)\, \log\big(|\mathcal{H}| / (\xi \lambda)\big)}{\gamma\, \lambda\, \alpha^2}\right)$$
samples. This concludes the proof.
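The guarantee above is stated for UMD run separately on each $S_h$. The sketch below shows the overall shape of such a group-wise recalibration scheme, using plain uniform-mass histogram binning ($B = 1/\lambda$ bins per group) as a simplified stand-in for UMD from Gupta et al. [12]. It omits, among other details, the randomization trick mentioned in footnote 11, so it should be read as an illustration under these simplifying assumptions rather than an implementation of the analyzed procedure.

```python
import numpy as np

def fit_groupwise_binning(f_b, h, y, n_bins=10):
    """For each group h, compute uniform-mass bin edges over f_B and the empirical P(Y=1) per bin."""
    f_b, h, y = np.asarray(f_b), np.asarray(h), np.asarray(y)
    models = {}
    for hv in np.unique(h):
        m = (h == hv)
        edges = np.quantile(f_b[m], np.linspace(0, 1, n_bins + 1))          # uniform-mass bin edges
        idx = np.clip(np.searchsorted(edges, f_b[m], side="right") - 1, 0, n_bins - 1)
        rates = np.array([y[m][idx == j].mean() if np.any(idx == j) else 0.5
                          for j in range(n_bins)])
        models[hv] = (edges, rates)
    return models

def apply_groupwise_binning(models, f_b, h):
    """Replace each confidence value by the recalibrated value of its group-specific bin."""
    f_b, h = np.asarray(f_b, dtype=float), np.asarray(h)
    out = f_b.copy()                                          # unseen groups keep the raw confidence
    for hv, (edges, rates) in models.items():
        m = (h == hv)
        idx = np.clip(np.searchsorted(edges, f_b[m], side="right") - 1, 0, len(rates) - 1)
        out[m] = rates[idx]
    return out
```

Fitting one binning model per value of the decision maker's confidence mirrors the "one UMD instance per $S_h$" structure of the proof; the per-group sample-size requirement in Lemma 4 is exactly what keeps each group's bins well populated.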
B Multicalibration Algorithm

In this section, we give a high-level description of the post-processing algorithm for multicalibration introduced by Hébert-Johnson et al. [11]. The algorithm works with a discretization of $[0, 1]$ into uniformly sized bins of size $\lambda$, for some $\lambda > 0$. Formally, the $\lambda$-discretization of $[0, 1]$ is defined as follows:

Definition 14 ($\lambda$-discretization [11]). Let $\lambda > 0$. The $\lambda$-discretization of $[0, 1]$, denoted by $\Lambda[0, 1] = \{\frac{\lambda}{2}, \frac{3\lambda}{2}, \ldots, 1 - \frac{\lambda}{2}\}$, is the set of $1/\lambda$ evenly spaced real values over $[0, 1]$. For $b \in \Lambda[0, 1]$, let
$$\lambda(b) = [b - \lambda/2, b + \lambda/2) \qquad (64)$$
be the $\lambda$-interval centered around $b$ (except for the final interval, which is $[1 - \lambda, 1]$).

The algorithm starts by partitioning each subspace $S_h$ into $1/\lambda$ groups $S_{h, \lambda}(b) = \{(x, h) \in S_h \mid f_B(x, h) \in \lambda(b)\}$, with $b \in \Lambda[0, 1]$. Then, it repeatedly looks for a large enough group $S_{h, \lambda}(b)$ such that the absolute difference between the average confidence value $\mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h, \lambda}(b)]$ and the probability $P(Y = 1 \mid (X, H) \in S_{h, \lambda}(b))$ is larger than $\alpha$ and, if it finds one, it updates the confidence value $f_B(x, h)$ of each $(x, h) \in S_{h, \lambda}(b)$ by this difference. Once the algorithm cannot find any such group anymore, it returns a discretized confidence function $f_{B, \lambda}(x, h) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b)]$, with $b \in \Lambda[0, 1]$ such that $f_B(x, h) \in \lambda(b)$, which is guaranteed to satisfy $(\alpha + \lambda)$-multicalibration. Algorithm 1 provides a pseudocode implementation of the overall algorithm. Within the implementation, it is worth noting that the expectations and probabilities can be estimated with fresh samples from the distribution or from a fixed dataset using tools from differential privacy and adaptive data analysis, as discussed in Hébert-Johnson et al. [11].

Algorithm 1 Post-processing algorithm for $(\alpha + \lambda)$-multicalibration
1: Input: confidence function $f_B$, parameters $\alpha, \lambda > 0$
2: Output: confidence function $f_{B, \lambda}$
3: repeat
4:   updated ← false
5:   for $S_h \in \mathcal{C}$ and $b \in \Lambda[0, 1]$ do
6:     $S_{h, \lambda}(b) \leftarrow S_h \cap \{(x, h) \in \mathcal{Z} \mid f_B(x, h) \in \lambda(b)\}$
7:     if $P((X, H) \in S_{h, \lambda}(b)) < \alpha \lambda\, P((X, H) \in S_h)$ then
8:       continue
9:     $b_{h, \lambda(b)} \leftarrow \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h, \lambda}(b)]$
10:    $r_{h, \lambda(b)} \leftarrow P(Y = 1 \mid (X, H) \in S_{h, \lambda}(b))$
11:    if $|r_{h, \lambda(b)} - b_{h, \lambda(b)}| > \alpha$ then
12:      updated ← true
13:      for $(x, h) \in S_{h, \lambda}(b)$ do
14:        $f_B(x, h) \leftarrow f_B(x, h) + (r_{h, \lambda(b)} - b_{h, \lambda(b)})$ {project into $[0, 1]$ if necessary}
15: until updated = false
16: for $b \in \Lambda[0, 1]$ do
17:   $b_{\lambda(b)} \leftarrow \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b)]$
18:   for $(x, h) \in \mathcal{Z}$ such that $f_B(x, h) \in \lambda(b)$ do
19:     $f_{B, \lambda}(x, h) \leftarrow b_{\lambda(b)}$
20: return $f_{B, \lambda}$

C Additional Details about the Experiments

Transformation of confidence values. In the Human-AI Interactions dataset, the AI model is a simple statistical model in which $b$ is just a noisy average of the confidence $h$ of an independent set of ca. 50 human labelers on each task instance. Moreover, the confidence values were originally recorded on a scale of $[-1, 1]$, where $1$ means complete certainty in the correct true label and $-1$ means complete certainty in the incorrect label. To better match our theoretical framework, we transform all confidence values to a scale of $[0, 1]$, where $1$ means complete certainty that the true label is $y = 1$ and $0$ means complete certainty that the true label is $y = 0$. More formally, let $\hat{b}, \hat{h}, \hat{h}_{+AI} \in [-1, 1]$ be the original confidence values in the dataset. Then, we obtain $b \in [0, 1]$ via the following transformation:
$$b = \begin{cases} (\hat{b} + 1)/2 & \text{if } y = 1 \\ 1 - (\hat{b} + 1)/2 & \text{if } y = 0, \end{cases}$$
and analogously for $h$ and $h_{+AI}$.

Comparing decision policies $\pi_B$, $\pi_H$ and $\pi_{H+AI}$. Figure 6 shows the ROC curves for the decision policies $\pi_B$, $\pi_H$ and $\pi_{H+AI}$ in each of the four tasks in the Human-AI Interactions dataset.

Figure 6: ROC curves (true positive rate vs. false positive rate) for the decision policies $\pi_B$, $\pi_H$ and $\pi_{H+AI}$ in each of the four tasks.
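For completeness, the following small sketch implements the transformation of confidence values described above. The function and variable names are our own; the dataset's actual column names may differ.

```python
import numpy as np

def to_unit_confidence(raw_conf, y):
    """Map a confidence on the [-1, 1] scale (certainty in the correct vs. incorrect label)
    to the [0, 1] scale (certainty that the true label is 1), given the true label y."""
    raw_conf, y = np.asarray(raw_conf, dtype=float), np.asarray(y)
    conf_in_correct = (raw_conf + 1.0) / 2.0        # certainty in the correct label, rescaled to [0, 1]
    return np.where(y == 1, conf_in_correct, 1.0 - conf_in_correct)

# Example: full certainty in the correct label maps to 1 when y = 1 and to 0 when y = 0,
# while full certainty in the incorrect label when y = 0 maps to 1.
print(to_unit_confidence([1.0, 1.0, -1.0], [1, 0, 0]))      # -> [1. 0. 1.]
```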