Human-Aligned Calibration for AI-Assisted Decision Making

Nina L. Corvelo Benz, Max Planck Institute for Software Systems and ETH Zürich, ninacobe@mpi-sws.org
Manuel Gomez Rodriguez, Max Planck Institute for Software Systems, manuel@mpi-sws.org

Whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. The decision maker is then supposed to use the confidence value to calibrate how much to trust the prediction. In this context, it has often been argued that the confidence value should correspond to a well-calibrated estimate of the probability that the predicted label matches the ground truth label. However, multiple lines of empirical evidence suggest that decision makers have difficulties developing a good sense of when to trust a prediction using these confidence values. In this paper, our goal is first to understand why and then to investigate how to construct more useful confidence values. We first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values: an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. However, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. Further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. Experiments on four different AI-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.

1 Introduction

In recent years, there has been increasing excitement about the potential of machine learning models to improve decision making in a variety of high-stakes domains such as medicine, education or criminal justice [1–3]. One of the main focuses has been binary classification tasks, where a classifier helps a decision maker by predicting a binary label of interest using a set of observable features [4–7]. For example, in medical treatment, the classifier may help a doctor by predicting whether a patient may benefit from a treatment. In college admissions, it may help an admissions committee by predicting whether a candidate may successfully complete an undergraduate program. In loan decisions, it may help a bank by predicting whether a prospective customer may default on a loan. In all these scenarios, the decision maker (the doctor, the committee or the bank) aims to use these predictions, together with their own predictions, to take good decisions that maximize a given utility function. In this context, since the predictions are unlikely to always match the truth, it has been widely agreed that the classifier should also provide a confidence value together with each prediction [8, 9].

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
While the conventional wisdom is that the confidence value should be a well-calibrated estimate of the probability that the predicted label matches the true label [10–16], multiple lines of empirical evidence have recently shown that decision makers have difficulties developing a good sense of when to trust a prediction using these confidence values [17–19]. Therein, Vodrahalli et al. [17] have shown that, in certain scenarios, decision makers take better decisions using uncalibrated probability estimates rather than calibrated ones. However, a theoretical framework explaining this puzzling observation has been missing, and it is yet unclear what properties we should be looking for to guarantee that confidence values are useful for AI-assisted decision making. In our work, we aim to bridge this gap.

Our contributions. We start by formally characterizing AI-assisted decision making using a structural causal model (SCM) [20], as seen in Figure 1. Building upon this characterization, we first argue that, if a decision maker is rational, the level of trust she places on predictions will be monotone on the confidence values: she will place more (less) trust on predictions with higher (lower) confidence values. Then, we show that, for a broad class of utility functions, there are data distributions for which a rational decision maker can never take optimal decisions using calibrated estimates of the probability that the predicted label matches the true label as confidence values. However, we further show that, if the confidence values a decision maker uses satisfy a natural alignment property with respect to the confidence she has on her own predictions, which we refer to as human-alignment, then the decision maker can both be rational and take optimal decisions. In addition, we demonstrate that human-alignment can be achieved via multicalibration [11], a statistical notion introduced in the context of algorithmic fairness. In particular, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for human-alignment. Finally, we validate our theoretical framework using real data from four different AI-assisted decision making tasks in which a classifier provides decision support to human decision makers on binary classification problems. Our results suggest that, comparing across tasks, classifiers providing human-aligned confidence values facilitate better decisions than classifiers providing confidence values that are not human-aligned. Moreover, our results also suggest that rational decision makers' trust level increases monotonically with the classifier's provided confidence.

Further related work. Our work builds upon a rapidly growing literature on AI-assisted decision making (refer to Lai et al. [21] for a recent review). More specifically, it is motivated by several empirical studies showing that decision makers have difficulties modulating trust using confidence values [17–19], as discussed previously. In this context, it is also worth noting that other empirical studies have analyzed how other factors, such as model explanations and accuracy, modulate trust [22–26]. However, except for a very recent notable exception [27], theoretical frameworks, which could be used to better understand the mixed findings of these empirical studies, have been missing.
More broadly, our work also relates to a flurry of recent work on reinforcement learning with human feedback [28–30], which aims to better align the outputs of large language models (LLMs) with human preferences. However, our formulation is fundamentally different and our technical contributions are orthogonal to theirs.

2 A Causal Model of AI-Assisted Decision Making

We consider an AI-assisted decision making process where, for each realization of the process, a decision maker first observes a set of features (x, v) ∈ X × V, then takes a binary decision t ∈ {0, 1} informed by a classifier's prediction ŷ = argmax_y f_y(x), as well as its confidence f_ŷ(x) ∈ [0, 1], of a binary label of interest y ∈ {0, 1}, and finally receives a utility u(t, y) ∈ R. Such an AI-assisted decision making process fits a variety of real-world applications. For example, in medical treatment, the features (x, v) may comprise multiple sources of information regarding a patient's health¹, the label y may indicate whether a patient would benefit from a specific treatment, the decision t may indicate whether the doctor applies the specific treatment to the patient, and the utility u(t, y) may quantify the trade-off between health benefit to the patient and economic cost to the decision maker. In what follows, rather than working with both ŷ and f_ŷ(x), we will work with just b = f_1(x), which we will refer to as the classifier's confidence, without loss of generality².

¹ Our formulation allows for a subset of the features v to be available only to the decision maker but not to the classifier.
² We can recover ŷ and f_ŷ(x) from b, i.e., if b > 0.5, we have that ŷ = 1 and f_ŷ(x) = b; if b < 0.5, ŷ = 0 and f_ŷ(x) = 1 − b.

Figure 1: Our structural causal model M. Orange circles represent endogenous random variables and blue boxes represent exogenous random variables. The value of each endogenous variable is given by a function of the values of its ancestors in the structural causal model, as defined by Eqs. 2 and 3. The value of each exogenous variable is sampled independently from a given distribution.

Moreover, we will assume that the utility u(t, y) is greater if the values of t and y coincide, i.e.,

u(1, 1) > u(1, 0), u(1, 1) > u(0, 1), u(0, 0) > u(1, 0), and u(0, 0) ≥ u(0, 1),   (1)

a condition that we think is natural under an appropriate choice of label and decision values. For example, in medical diagnosis, if t = 1 means the patient is tested early for a disease and y = 1 means the patient suffers from the disease, the above condition implies that the utility of either testing a patient who suffers from the disease or not testing a patient who does not suffer from the disease is greater than the utility of either not testing a patient who suffers from the disease or testing a patient who does not. In condition (1), we allow for a non-strict inequality u(0, 0) ≥ u(0, 1) because, in settings in which the label Y is only realized whenever the decision is t = 1 (e.g., in our previous example on medical treatment, we can only observe whether a treatment is eventually beneficial if the patient is treated), it has been argued that, whenever t = 0, any choice of utility must be independent of the label value [4–6], i.e., u(0, 0) = u(0, 1) = u(0).
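To make condition (1) concrete, the following minimal sketch checks the condition for an illustrative utility matrix and derives the break-even confidence above which deciding t = 1 maximizes expected utility (cf. Lemma 2 in Appendix A). The specific utility values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Minimal sketch: an illustrative (hypothetical) utility u(t, y) satisfying Eq. (1),
# and the break-even confidence above which deciding t = 1 maximizes expected utility.
u = {(1, 1): 1.0, (1, 0): -0.5,   # utility of deciding t = 1 when y = 1 / y = 0
     (0, 1): 0.0, (0, 0): 0.0}    # with t = 0, the utility is label-independent here

# Check the condition in Eq. (1).
assert u[1, 1] > u[1, 0] and u[1, 1] > u[0, 1]
assert u[0, 0] > u[1, 0] and u[0, 0] >= u[0, 1]

def expected_utility(t: int, p: float) -> float:
    """Expected utility of decision t when the probability of Y = 1 is p."""
    return p * u[t, 1] + (1 - p) * u[t, 0]

# Deciding t = 1 is preferable in expectation whenever p exceeds the break-even value c,
# obtained by equating expected_utility(1, p) and expected_utility(0, p).
c = (u[0, 0] - u[1, 0]) / (u[1, 1] - u[1, 0] + u[0, 0] - u[0, 1])
for p in (0.2, 0.5, 0.8):
    t_best = int(expected_utility(1, p) >= expected_utility(0, p))
    print(f"P(Y=1)={p:.1f}: break-even c={c:.3f}, best decision t={t_best}")
```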
Next, we characterize the above AI-assisted decision making process using a structural causal model (SCM) [20], which we denote as M. The SCM M is defined by a set of assignments, which entail a distribution P_M and divide naturally into two subsets. One subset comprises the features and the label³, i.e.,

X = f_X(D), V = f_V(D) and Y = f_Y(D),   (2)

where D is an independent exogenous random variable, often called exogenous noise, characterizing the data generating process, and f_X, f_V and f_Y are given functions⁴. The second subset comprises the decision maker and the classifier, i.e.,

H = f_H(X, V, Q), B = f_B(X, H), T = π(H, B, W) and U = u(T, Y),   (3)

where f_H and f_B are given functions, which determine the decision maker's confidence H and the classifier's confidence B that the value of the label of interest is Y = 1, π is a given AI-assisted decision policy, which determines the decision maker's decision T, u is a given utility function, which determines the utility U, and Q and W are independent exogenous variables modeling the decision maker's individual characteristics influencing her own confidence H and her decision T, respectively. By distinguishing both sources of noise, we allow for the presence of uncertainty on the decision T even after conditioning on fixed confidence values h and b. This accounts for the fact that, in reality, a decision maker may take different decisions T for instances with the same confidence values h and b. For example, in medical treatment, for two different patients with the same confidence values h and b, a doctor's decision may differ due to limited resources.

³ We denote random variables with capital letters and realizations of random variables with lower case letters.
⁴ Our model allows both for causal and anticausal features [31].

In our SCM M, the decision maker's confidence H refers to the confidence the decision maker has that the label is Y = 1 before observing the classifier's confidence B. Moreover, following previous behavioral studies showing that humans' confidence H is discretized in a few distinct levels [32, 33], we assume H takes values h from a totally ordered discrete set H. We say that the decision maker's confidence f_H is monotone (with respect to the probability distribution P(Y = 1)) if, for all h, h' ∈ H such that h ≥ h', it holds that P(Y = 1 | H = h) ≥ P(Y = 1 | H = h'). Further, we allow the classifier's confidence B to depend on the decision maker's confidence H because this will be necessary to achieve human-alignment via multicalibration in Section 5. However, our negative result in Section 3 also holds if the classifier's confidence f_B(X, H) = f_B(X) only depends on the features, as is usual in most classifiers designed for AI-assisted decision making. In the remainder, we will use Z = (X, H) and denote the space of features and human confidence values as Z = X × H. Figure 1 shows a visual representation of our SCM M.

Under this characterization, we argue that, if a rational decision maker has decided t under confidence values b and h, then she would have decided t' ≥ t had the confidence values been b' ≥ b and h' ≥ h while holding everything else fixed [20]. For example, in medical treatment, assume a doctor's and a classifier's confidence that a patient would benefit from treatment is b = h = 0.7 and the doctor decides to treat the patient. Then we argue that, if the doctor is rational, she would have also treated the patient had the doctor's and the classifier's confidence been b = h = 0.8 > 0.7.
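The following is a minimal simulation sketch of the SCM M above. All functional forms for f_X, f_V, f_Y, f_H, f_B, the decision policy π and the noise distributions are hypothetical choices made only to illustrate the structure of Eqs. 2 and 3 together with a monotone policy; they are not the model used in the paper.

```python
# Minimal simulation sketch of the SCM M in Eqs. (2)-(3). Every functional form and
# noise distribution below is a hypothetical choice, used only to illustrate the model.
import numpy as np

rng = np.random.default_rng(0)

def sample_process(n, policy, u):
    d = rng.normal(size=n)                                   # exogenous noise D
    x = d + rng.normal(scale=0.5, size=n)                    # X = f_X(D)
    v = d + rng.normal(scale=0.5, size=n)                    # V = f_V(D), seen only by the human
    y = (d + rng.normal(scale=0.3, size=n) > 0).astype(int)  # Y = f_Y(D)

    q = rng.normal(scale=0.2, size=n)                        # exogenous noise Q
    h_raw = 1 / (1 + np.exp(-(x + v) / 2 - q))               # human belief that Y = 1
    h = np.digitize(h_raw, [1 / 3, 2 / 3]) / 2.0             # H = f_H(X, V, Q), three discrete levels
    b = 1 / (1 + np.exp(-x))                                 # B = f_B(X, H) (here it uses X only)

    w = rng.uniform(size=n)                                  # exogenous noise W
    t = policy(h, b, w)                                      # T = pi(H, B, W)
    return (u[1, 1] * t * y + u[1, 0] * t * (1 - y)          # U = u(T, Y)
            + u[0, 1] * (1 - t) * y + u[0, 0] * (1 - t) * (1 - y))

# A monotone AI-assisted decision policy: threshold an average of h and b, with the
# decision noise W shifting the threshold slightly.
def monotone_policy(h, b, w):
    return ((h + b) / 2 > 0.5 + 0.1 * (w - 0.5)).astype(int)

u = {(1, 1): 1.0, (1, 0): -0.5, (0, 1): 0.0, (0, 0): 0.0}
print("average utility:", sample_process(10_000, monotone_policy, u).mean())
```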
Further, we say that any AI-assisted decision policy π that satisfies this property is monotone, i.e.,

Definition 1 (Monotone AI-assisted decision policy). An AI-assisted decision policy π is monotone if and only if, for any b, b' ∈ [0, 1] and h, h' ∈ H such that b ≥ b' and h ≥ h', it holds that π(h, b, w) ≥ π(h', b', w) for any value w of W.

Finally, note that, under any monotone AI-assisted decision policy, it trivially follows that, for any b ≥ b' and h ≥ h',

E[T | H = h, B = b] ≥ E[T | H = h', B = b'],   (4)

where the expectation is over the uncertainty on the decision maker's individual characteristics and the data generating process.

3 Impossibility of AI-Assisted Decision Making Under Calibration

In AI-assisted decision making, classifiers are usually demanded to provide calibrated confidence values [10–16]. A confidence function f_B : Z → [0, 1] is said to be perfectly calibrated if, for any b ∈ [0, 1], it holds that P(Y = 1 | f_B(Z) = b) = b. Unfortunately, using finite amounts of (calibration) data, one can only hope to construct approximately calibrated confidence functions. Many different notions of approximate calibration have been proposed over the years. Here, for concreteness, we adopt the notion of α-calibration⁵ introduced by Hébert-Johnson et al. [11]; however, our theoretical results can easily be adapted to other notions of approximate calibration⁶.

⁵ Note that, if α = 0 and S = Z, the confidence function f_B is perfectly calibrated.
⁶ All proofs can be found in Appendix A.

Definition 2 (Calibration). A confidence function f_B : Z → [0, 1] satisfies α-calibration with respect to S ⊆ Z if there exists some S' ⊆ S, with |S'| ≥ (1 − α)|S|, such that, for any b ∈ [0, 1], it holds that

|P(Y = 1 | f_B(Z) = b, Z ∈ S') − b| ≤ α.   (5)

If the decision maker's decision T only depends on the classifier's confidence B, i.e., π(H, B, W) = π(B), and f_B satisfies α-calibration with respect to Z, then it readily follows from previous work that, for any utility function that satisfies Eq. 1, a simple monotone AI-assisted decision policy π_B that takes decisions by thresholding the confidence values is optimal [4–7], i.e., π_B = argmax_{π ∈ Π(B)} E_π[u(T, Y)], where the expectation is with respect to the probability distribution P_M and Π(B) denotes the class of AI-assisted decision policies using B. However, one of the main motivations to favor AI-assisted decision making over fully automated decision making is that the decision maker may have access to additional features V and may like to weigh the classifier's confidence B against her own confidence H. Hence, the decision maker may seek the optimal decision policy π* over the class Π(H, B) of AI-assisted decision policies using H and B, i.e., π* = argmax_{π ∈ Π(H, B)} E_π[u(T, Y)], since it may offer greater expected utility than π_B. Unfortunately, the following negative result shows that, in general, a rational decision maker may be unable to discover such an optimal decision policy π* using (perfectly) calibrated confidence values, and this is true even if f_H is monotone:

Theorem 3. There exist (infinitely many) AI-assisted decision making processes M satisfying Eqs. 2 and 3, with utility functions u(T, Y) satisfying Eq. 1, such that f_B is perfectly calibrated and f_H is monotone but any AI-assisted decision policy π ∈ Π(H, B) that satisfies monotonicity is suboptimal, i.e., E_π[u(T, Y)] < E_π*[u(T, Y)].

In the proof of the above result in Appendix A.2, we show that there always exists a perfectly calibrated f_B(Z) = f_B(X, H) that depends on both X and H for which any monotone AI-assisted decision policy is suboptimal.
This is due to the fact that f_B(Z) is calibrated on average over H; however, it may not be calibrated, nor even monotone, after conditioning on a specific value H = h. Further, we also show that, even if f_B(Z) = P_M(Y = 1 | X) matches the true distribution of the label Y given the features X, which has typically been the ultimate goal in the machine learning literature, there always exists a monotone f_H for which any monotone AI-assisted decision policy is suboptimal. This is due to the fact that the decision maker's confidence H can differ across instances with the same value of the features X because it also depends on the features V and noise Q. Hence, f_H may not be monotone after conditioning on a specific value X = x. In both cases, when a rational decision maker compares pairs of confidence values (h, b) and (h', b'), the rate of positive outcomes Y = 1 for each pair may appear contradictory with the magnitude of confidence. In what follows, we will show that, if f_B satisfies a natural alignment property with respect to f_H, which we refer to as human-alignment, there always exists an optimal AI-assisted decision policy that is monotone.

4 AI-Assisted Decision Making Under Human-Aligned Calibration

Intuitively, to avoid that pairs of confidence values B and H appear contradictory to a rational decision maker, we need to make sure that, with high probability, both f_B and f_H are monotone after conditioning on specific values of H and B, respectively. Next, we formalize this intuition by means of the following property, which we refer to as α-alignment:

Definition 4 (Human-alignment). A confidence function f_B satisfies α-alignment with respect to a confidence function f_H if, for any h ∈ H, there exists some S̃_h ⊆ S_h, with S_h = {(x, h') ∈ Z | h' = h} and |S̃_h| ≥ (1 − α/2)|S_h|, such that, for any b, b' ∈ [0, 1] and h, h' ∈ H such that b ≥ b' and h ≥ h', it holds that

P(Y = 1 | f_B(X, H) = b', (X, H) ∈ S̃_{h'}) − P(Y = 1 | f_B(X, H) = b, (X, H) ∈ S̃_h) ≤ α.   (6)

The above definition just means that, if f_B is α-aligned with respect to f_H, then, for any h, h' ∈ H, we can bound any violation of monotonicity by f_B between at least a (1 − α/2) fraction of the subspaces of features S_h and S_{h'}. Moreover, note that, if f_B is 0-aligned with respect to f_H, then there are no violations of monotonicity, i.e., P(Y = 1 | f_B(X, H) = b', (X, H) ∈ S̃_{h'}) ≤ P(Y = 1 | f_B(X, H) = b, (X, H) ∈ S̃_h), and we say that f_B is perfectly aligned with respect to f_H.

Given the above definition, we are now ready to state our main result, which shows that human-alignment allows for AI-assisted decision policies that satisfy monotonicity and (near-)optimality:

Theorem 5. Let M be any AI-assisted decision making process satisfying Eqs. 2 and 3, with a utility function u(T, Y) satisfying Eq. 1. If f_B satisfies α-alignment w.r.t. f_H, then there always exists an AI-assisted decision policy π̂ ∈ Π(H, B) that satisfies monotonicity and is near-optimal, i.e.,

E_π̂[u(T, Y)] ≥ E_π*[u(T, Y)] − α (u(1, 1) − u(0, 1) + (3/2)(u(0, 0) − u(1, 0))),   (7)

where π* = argmax_{π ∈ Π(H, B)} E_π[u(T, Y)] is the optimal policy.

Corollary 1. If f_B is perfectly aligned with respect to f_H, then there always exists an AI-assisted decision policy π ∈ Π(H, B) that satisfies monotonicity and is optimal.
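As an illustration of Definition 4, the sketch below estimates, from samples, the probabilities P(Y = 1 | f_B in a bin, H = h) for every cell and reports the largest monotonicity violation across comparable cells. It is a simplified, assumed implementation: it bins the classifier's confidence, uses all samples rather than the (1 − α/2)-fraction subsets S̃_h of the definition, and assumes numerically ordered human confidence levels.

```python
# Simplified empirical check of alignment (Definition 4): bin the classifier's
# confidence, estimate P(Y = 1) in every (human level, bin) cell, and report the
# largest monotonicity violation across comparable cells. This ignores the
# (1 - alpha/2)-fraction subset selection of the definition and is only a sketch.
import numpy as np

def max_alignment_violation(b, h, y, n_bins=8, min_count=30):
    """b: classifier confidence, h: (ordered, discrete) human confidence, y: labels."""
    bins = np.clip((b * n_bins).astype(int), 0, n_bins - 1)
    rates = {}                                              # (h level, bin) -> empirical P(Y = 1)
    for level in np.unique(h):
        for k in range(n_bins):
            mask = (h == level) & (bins == k)
            if mask.sum() >= min_count:
                rates[(level, k)] = y[mask].mean()
    worst = 0.0
    for (l1, k1), p1 in rates.items():                      # lower-confidence cell
        for (l2, k2), p2 in rates.items():                  # higher-confidence cell
            if l2 >= l1 and k2 >= k1:
                worst = max(worst, p1 - p2)                 # positive gap = alignment violation
    return worst
```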
Finally, in many high-stakes applications, we may like to make sure that the confidence values provided by f_B are both useful and interpretable [34]. Hence, we may like to seek confidence functions f_B that satisfy human-aligned calibration, which we define as follows:

Definition 6 (Human-aligned calibration). A confidence function f_B satisfies α-aligned calibration with respect to a confidence function f_H if and only if f_B satisfies α-alignment with respect to f_H and it satisfies α-calibration with respect to Z.

In the next section, we will show how to achieve human-alignment and human-aligned calibration via multicalibration, a statistical notion introduced in the context of algorithmic fairness [11].

5 Achieving Human-Aligned Calibration via Multicalibration

Multicalibration was introduced by Hébert-Johnson et al. [11] as a notion to achieve fairness in supervised learning. It strengthens the notion of calibration by requiring that the confidence function is calibrated simultaneously across a large collection of subspaces of features C ⊆ 2^Z, which may or may not be disjoint. More formally, it is defined as follows:

Definition 7 (Multicalibration). A confidence function f_B : Z → B satisfies α-multicalibration with respect to C ⊆ 2^Z if f_B satisfies α-calibration with respect to every S ∈ C.

Then, we can show that, for an appropriate choice of C, if f_B satisfies α-multicalibration with respect to C, then it satisfies α-aligned calibration with respect to f_H. More specifically, we have the following result:

Theorem 8. If f_B satisfies (α/2)-multicalibration with respect to {S_h}_{h ∈ H}, with S_h = {(x, h') ∈ Z | h' = h}, then f_B satisfies α-aligned calibration with respect to f_H.

The above theorem suggests that, given a classifier's confidence function f_B, we can multicalibrate f_B with respect to {S_h}_{h ∈ H} to achieve α-aligned calibration with respect to f_H. To achieve multicalibration guarantees using finite amounts of (calibration) data, multicalibration algorithms need to discretize the range of f_B [9, 11, 12]. In what follows, we briefly revisit two algorithms, which carry out this discretization differently, and discuss their complexity and data requirements with respect to achieving α-aligned calibration.

Multicalibration algorithm via λ-discretization. This algorithm, which was introduced by Hébert-Johnson et al. [11], discretizes the range of f_B, i.e., the interval [0, 1], into bins of fixed size λ > 0 with values Λ[0, 1] = {λ/2, 3λ/2, ..., 1 − λ/2}. Let λ(b) = [b − λ/2, b + λ/2). The algorithm partitions each subspace S_h into 1/λ groups S_{h,λ(b)} = {(x, h) ∈ S_h | f_B(x, h) ∈ λ(b)}, with b ∈ Λ[0, 1]. It iteratively updates the confidence values of the function f_B for these groups until f_B satisfies a discretized notion of α'-multicalibration over these groups. The algorithm then returns a discretized confidence function f_{B,λ}(x, h) = E[f_B(X, H) | f_B(X, H) ∈ λ(b)], with b ∈ Λ[0, 1] such that f_B(x, h) ∈ λ(b), which is guaranteed to satisfy (α' + λ)-multicalibration. Refer to Algorithm 1 in Appendix B for pseudocode. Then, as a direct consequence of Theorem 8, we can obtain a (discretized) confidence function f_{B,λ} that satisfies α-aligned calibration by setting α' = λ = α/4. However, the following proposition shows that, to satisfy just α-alignment, it is enough to set α' = (3/8)α > α/4 and λ = α/4:

Proposition 1. The discretized confidence function f_{B,λ} returned by Algorithm 1 satisfies (2α' + λ)-alignment with respect to f_H.
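The sketch below gives a condensed, assumed version of the iterative patching idea behind Algorithm 1: it repeatedly shifts the confidence values of any group S_{h,λ(b)} whose empirical label rate deviates from its mean predicted value by more than α', and finally returns bin-averaged (discretized) confidence values. It omits the sample splitting and failure-probability bookkeeping of Hébert-Johnson et al. [11], so it should be read as an illustration rather than as the algorithm itself.

```python
# Condensed sketch of multicalibration via lambda-discretization, in the spirit of
# Algorithm 1 of Hebert-Johnson et al. [11], but simplified: full-batch empirical
# estimates, no sample splitting, no failure-probability accounting.
import numpy as np

def multicalibrate(b, h, y, lam=0.125, alpha=0.05, max_iter=100, min_count=30):
    """b: initial confidence values, h: discrete human confidence, y: labels."""
    b = b.astype(float).copy()
    n_bins = int(round(1 / lam))
    for _ in range(max_iter):
        bins = np.clip((b / lam).astype(int), 0, n_bins - 1)
        updated = False
        for level in np.unique(h):
            for k in range(n_bins):
                mask = (h == level) & (bins == k)            # group S_{h, lambda(b)}
                if mask.sum() < min_count:
                    continue
                gap = y[mask].mean() - b[mask].mean()
                if abs(gap) > alpha:                         # group violates calibration
                    b[mask] = np.clip(b[mask] + gap, 0.0, 1.0)
                    updated = True
        if not updated:
            break
    # Return the discretized confidence: the average confidence value within each bin.
    bins = np.clip((b / lam).astype(int), 0, n_bins - 1)
    out = np.empty_like(b)
    for k in range(n_bins):
        out[bins == k] = b[bins == k].mean() if (bins == k).any() else (k + 0.5) * lam
    return out
```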
Finally, it is worth noting that, to implement Algorithm 1, we need to compute empirical estimates of the expectations and probabilities above using a calibration set D. In this context, Theorem 2 in Hébert-Johnson et al. [11] shows that, if we use a calibration set of size O(log(|H| / (α γ ξ)) / (α^(11/2) γ^(3/2))), with P((X, H) ∈ S_h) > γ for all h ∈ H, then f_{B,λ} is guaranteed to satisfy α-multicalibration with probability at least 1 − ξ, in time O(|H| poly(1/α, 1/γ)).

Multicalibration algorithm via uniform mass binning. Uniform mass binning (UMD) [9, 12] was originally designed to calibrate f_B with respect to Z using a calibration set D. However, since the subspaces {S_h}_{h ∈ H} are disjoint, i.e., S_h ∩ S_{h'} = ∅ for every h ≠ h', we can multicalibrate f_B with respect to {S_h}_{h ∈ H} by just running |H| instances of UMD, each using the subset of samples D ∩ S_h. Here, we would like to emphasize that we can use UMD to achieve multicalibration because, in our setting, the subspaces {S_h}_{h ∈ H} are disjoint. Each instance of UMD discretizes the range of f_B, i.e., the interval [0, 1], into N = 1/λ bins with values Λ_h[0, 1] = {P̂(Y = 1 | f_B(X, h) ∈ [0, q̂_1]), ..., P̂(Y = 1 | f_B(X, h) ∈ [q̂_{N−1}, q̂_N])}, where q̂_i denotes the (i/N)-th empirical quantile of the confidence values f_B(x, h) of the samples (x, h) ∈ D ∩ S_h and P̂ denotes an empirical estimate of the probability using samples from D ∩ S_h as well. Here, note that, by construction, the bins have similar probability mass. Then, for each (x, h) ∈ Z, the corresponding instance of UMD provides the value of the discretized confidence function f_{B,λ}(x, h) = b, where b ∈ Λ_h[0, 1] denotes the value of the bin whose corresponding defining interval includes f_B(x, h). Finally, we have the following theorem, which guarantees that, as long as the calibration set is large enough, the discretized confidence function f_{B,λ} satisfies α-aligned calibration with respect to f_H with high probability:

Figure 2: Empirical estimate of the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}), where b ∈ Λ[0, 1] and h ∈ {low, mid, high} are the discretized confidence values for the classifiers and human participants, respectively. One panel per task, with task-specific 'low'/'mid'/'high' bin boundaries. Error bars represent 90% confidence intervals and hatched bars mark alignment violations between confidence pairs (h, b) with |S_{h,λ(b)}| ≥ 30.

Theorem 9. The discretized confidence function f_{B,λ} returned by |H| instances of UMD, one per S_h, satisfies α-aligned calibration with respect to f_H with probability at least 1 − ξ as long as the size of the calibration set satisfies |D| = O((|H| / (α² λ γ)) log(|H| / (λ ξ))), with P((X, H) ∈ S_h) ≥ γ.

6 Experiments

In this section, we validate our theoretical results using a dataset with real expert predictions in an AI-assisted decision making scenario comprising four different binary classification tasks⁷.

Data description. We experiment with the publicly available Human-AI Interactions dataset [35]. The dataset comprises 34,783 unique predictions from 1,088 different human participants on four different binary prediction tasks ("Art", "Sarcasm", "Cities" and "Census").
Overall, there are approximately 32 different instances per task. In the "Art" task, participants need to determine the art period of a painting given two choices and, overall, there are paintings from four art periods. In the "Sarcasm" task, participants need to detect whether sarcasm is present in text snippets from the Reddit sarcasm dataset [36]. In the "Cities" task, participants need to determine which large US city is depicted in an image given two choices and, overall, there are images of four different US cities. Finally, in the "Census" task, participants need to determine whether an individual earns more than 50k a year based on certain demographic information in tabular form. For "Sarcasm", x is a representation of the text snippets and we set y = 1 if sarcasm is present; for "Art" and "Cities", x is a representation of the images and we set y = 1 or y = 0 at random for each different instance; and, for "Census", x summarizes demographic information and we set y = 1 if an individual earns more than 50k a year. In each of the tasks, human participants provide confidence values about their predictions before (h) and after (h+AI) receiving AI advice from a classifier in the form of the classifier's confidence values b.⁸

⁷ We release the code to reproduce our analysis at https://github.com/Networks-Learning/human-alignedcalibration.
⁸ Refer to Appendix C for more details on the dataset.

Table 1: Misalignment, miscalibration and AUC.

Task    | EAE      | MAE   | ECE   | MCE   | AUC π_B | AUC π_H | AUC π_{H+AI}
Art     | 4.5×10⁻⁴ | 0.058 | 0.084 | 0.186 | 86.7%   | 72.7%   | 82.0%
Sarcasm | 3.8×10⁻³ | 0.224 | 0.085 | 0.310 | 89.9%   | 82.5%   | 86.5%
Cities  | 6.2×10⁻⁵ | 0.013 | 0.066 | 0.158 | 84.4%   | 79.0%   | 84.7%
Census  | 9.0×10⁻³ | 0.298 | 0.109 | 0.270 | 80.0%   | 77.3%   | 79.9%

The original dataset contains predictions by participants from different, but overlapping, sets of countries across tasks, who were told the AI advice had different values of accuracy.⁹ In our experiments, to control for these confounding factors, we focus on participants from the US who were told the AI advice was 80% accurate, resulting in 15,063 unique predictions from 471 different human participants.

Experimental setup and evaluation metrics. For each of the tasks, we first measure (i) the degree of misalignment between the classifiers' confidence values b and the participants' confidence values h before receiving AI advice and (ii) the difference h+AI − h between the human participants' confidence values before and after receiving AI advice b. Then, we compare the utility achieved by an AI-assisted decision policy π_{H+AI}, which predicts the value of y by thresholding the humans' confidence values h+AI after observing the classifier's confidence values, against two baselines: (i) a decision policy π_B that predicts the value of y by thresholding the classifier's confidence values b and (ii) a decision policy π_H that predicts the value of y by thresholding the humans' confidence values h before observing the classifier's confidence values.

To measure the degree of misalignment, we discretize the confidence values b and h into bins. For the classifiers' confidence b, we use 8 uniformly sized bins per task with (centered) values in Λ[0, 1], where λ = 1/8. For the human participants' confidence h before receiving AI advice b, we use three bins per task ("low", "mid" and "high"), where we set the bin boundaries so that each bin contains approximately the same probability mass and set the bin values to the average confidence value within each bin.
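The discretization described above can be sketched as follows. The helper below is an assumed illustration (the paper's released code may differ): it bins the classifier's confidence b into uniform-width bins labelled by their centers and the participants' confidence h into three equal-mass bins labelled by their average confidence.

```python
# Minimal sketch of the discretization into cells, assuming 1-D numpy arrays
# b (classifier confidence) and h (participant confidence before advice).
import numpy as np

def discretize(b, h, n_b_bins=8, n_h_bins=3):
    # Classifier confidence: uniform-width bins, labelled by their centers.
    lam = 1.0 / n_b_bins
    b_bin = np.clip((b / lam).astype(int), 0, n_b_bins - 1)
    b_value = (b_bin + 0.5) * lam

    # Human confidence: (approximately) equal-mass bins, labelled by the average
    # confidence of the samples that fall inside each bin.
    edges = np.quantile(h, np.linspace(0, 1, n_h_bins + 1)[1:-1])
    h_bin = np.digitize(h, edges)                       # 0 = 'low', 1 = 'mid', 2 = 'high'
    h_value = np.array([h[h_bin == k].mean() for k in range(n_h_bins)])[h_bin]
    return b_value, h_value
```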
In what follows, we refer to the pairs of discretized confidence values (h, b) as cells, where the samples (x, h) ∈ Z whose confidence values lie in the cell (h, b) define the group S_{h,λ(b)}. Note that we choose a rather low number of bins for both b and h so that most cells have sufficient data samples to reliably estimate several misalignment metrics, which we describe next.

We use three different misalignment metrics: (i) the number of alignment violations between cell pairs, (ii) the expected alignment error (EAE) and (iii) the maximum alignment error (MAE). There is an alignment violation between cell pairs (h, b) and (h', b'), with h ≤ h' and b ≤ b', if P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) > P(Y = 1 | (X, H) ∈ S_{h',λ(b')}). Moreover, we have that

EAE = (1/N) Σ_{h ≤ h', b ≤ b'} [ P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) − P(Y = 1 | (X, H) ∈ S_{h',λ(b')}) ]₊ and
MAE = max_{h ≤ h', b ≤ b'} [ P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) − P(Y = 1 | (X, H) ∈ S_{h',λ(b')}) ]₊,

where [·]₊ = max(0, ·) and N = |{((h, b), (h', b')) : h ≤ h', b ≤ b'}|. Here, note that the number of alignment violations tells us how frequently the left-hand side of Eq. 6 is positive across cell pairs given S̃_h = S_h, and the EAE and MAE quantify the average and maximum value of the left-hand side of Eq. 6 across cells violating alignment. To obtain reliable estimates of the above metrics, we only consider cells (h, b) with |S_{h,λ(b)}| ≥ 30 samples. Moreover, we also report the expected calibration error (ECE) and maximum calibration error (MCE) [12, 37], which are natural counterparts to the EAE and MAE, respectively. As a measure of utility, we estimate the true positive rate (TPR) and false positive rate (FPR) of the decision policies π_B, π_H and π_{H+AI} for all possible choices of threshold values, which we summarize using the area under the ROC curve (AUC); in Appendix C, we also report ROC curves.

⁹ Participants were also either told that the advice is from a "Human" or from an "AI" based on a random assignment of participants to a treatment or control group. Since the actual advice received in both groups was identical for the same instance and the "perceived advice source" is randomized, we use data from both treatment and control groups in the experiments.

Figure 3: Empirical estimate of the average difference E[h+AI − h | (X, H) ∈ S_{h,λ(b)}], where b ∈ Λ[0, 1] and h ∈ {low, mid, high} are the discretized confidence values for the classifier and human participants, respectively. One panel per task, with task-specific 'low'/'mid'/'high' bin boundaries. Error bars represent 90% confidence intervals and hatched bars mark alignment violations between confidence pairs (h, b) with |S_{h,λ(b)}| ≥ 30.
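The misalignment and miscalibration metrics above can be estimated as in the following sketch. It assumes a dict mapping each retained cell (h, b) to its empirical P(Y = 1 | cell), with numerically ordered cell labels, and, for calibration, dicts mapping each confidence bin b to its empirical label rate and size; this is an assumed illustration of one reading of the definitions, not the paper's evaluation code.

```python
# Sketch of the misalignment (number of violations, EAE, MAE) and miscalibration
# (ECE, MCE) metrics. `rate` maps each retained cell (h, b) to its empirical
# P(Y = 1 | cell); `rate_b` and `count_b` map each confidence bin b to its empirical
# label rate and size. Cell labels are assumed to be numerically ordered.
def misalignment_metrics(rate):
    pairs = [(p_low, p_high)
             for (h1, b1), p_low in rate.items()
             for (h2, b2), p_high in rate.items()
             if (h1, b1) != (h2, b2) and h2 >= h1 and b2 >= b1]
    gaps = [max(0.0, p_low - p_high) for p_low, p_high in pairs]  # positive part of Eq. (6)
    n_violations = sum(g > 0 for g in gaps)
    eae = sum(gaps) / len(pairs) if pairs else 0.0                # expected alignment error
    mae = max(gaps, default=0.0)                                  # maximum alignment error
    return n_violations, eae, mae

def miscalibration_metrics(rate_b, count_b):
    n = sum(count_b.values())
    gaps = {b: abs(p - b) for b, p in rate_b.items()}             # |P(Y = 1 | bin) - b|
    ece = sum(count_b[b] * g for b, g in gaps.items()) / n        # expected calibration error
    mce = max(gaps.values(), default=0.0)                         # maximum calibration error
    return ece, mce
```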
Results. We start by looking at the empirical estimates of the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) and of our measures of misalignment (EAE, MAE) and miscalibration (ECE, MCE) in Figure 2 and Table 1 (left and middle columns). The results show that, for "Cities", the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) are (approximately) monotonically increasing with respect to the classifier's confidence values b. More specifically, as shown in Figure 2, there is only one alignment violation between cell pairs and, hence, our misalignment metrics also take very low values. In contrast, for "Art", "Sarcasm" and especially "Census", there is an increasing number of alignment violations and our misalignment metrics take higher values, up to several orders of magnitude higher for "Census". These results also show that misalignment and miscalibration go hand in hand; however, in terms of miscalibration, "Census" does not stand out as strongly.

Next, we look at the difference h+AI − h between the human participants' recorded confidence values before and after receiving AI advice b across samples in each of the subsets S_{h,λ(b)} induced by the discretized confidence values used above. Figure 3 summarizes the results, which reveal that the difference h+AI − h increases monotonically with respect to the classifier's confidence b. This suggests that participants always expect b to reflect the probability of a positive outcome irrespective of their confidence value h before receiving AI advice, providing support for our hypothesis that (rational) decision makers implement monotone AI-assisted decision policies. Further, this finding also implies that, for "Art", "Sarcasm" and "Census", any policy π_{H+AI} that predicts the value of the label y by thresholding the confidence value h+AI will necessarily be suboptimal, because the probabilities P(Y = 1 | (X, H) ∈ S_{h,λ(b)}) are not monotonically increasing with b.

Finally, we look at the AUC achieved by the decision policies π_B, π_H and π_{H+AI}. Table 1 (right columns) summarizes the results, which show that π_{H+AI} outperforms π_H consistently across all tasks but only outperforms π_B in a single task ("Cities") out of four. These findings provide empirical support for Theorem 3, which predicts that, in the presence of human-alignment violations such as those observed in "Art", "Sarcasm" and "Census", any monotone AI-assisted decision policy will be suboptimal, and they also provide support for Theorem 5, which predicts that, under human-alignment, there exist near-optimal AI-assisted decision policies satisfying monotonicity.

7 Discussion and Limitations

In this section, we discuss the intended scope of our work and identify several limitations of our theoretical and experimental results, which may serve as starting points for future work.

Decision making setting. We have focused on decision making settings where both decisions and outcomes are binary. However, we think that it may be feasible to extend our theoretical analysis to settings with multi-categorical (or real-valued) outcome variables and decisions. One of the main challenges would be to identify which natural conditions utility functions may satisfy in such settings. Further, we also think that it would be significantly more challenging to extend our theoretical analysis to sequential settings (multicalibration in sequential settings is an open area of research), but our ideas may still be a useful starting point. In addition, our theoretical analysis assumes that decision makers aim to maximize the average utility of their decisions. However, whenever human decisions are consequential to individuals, the decision maker may have fairness desiderata.

Confidence values. In our causal model of AI-assisted decision making, we allow the classifier's confidence values to depend on the decision maker's confidence values because this is necessary to achieve human-alignment via multicalibration as described in Section 5.
However, we would like to clarify that both Theorems 3 and 5 still hold if the classifier's confidence values do not depend on the decision maker's confidence, as is typically the status quo today. Looking into the future, our work questions this status quo by showing that, by allowing the classifier's confidence values to depend on the decision maker's confidence values, a decision maker may end up taking decisions with higher utility. Moreover, we would also like to clarify that, while the motivation behind our work is AI-assisted human decision making, our theoretical results do not depend on who (be it a classifier or another human) gives advice. As long as the advice comes in the form of confidence values, our results are valid. Finally, while we have shown that human-alignment can be achieved via multicalibration, we hypothesize that algorithms specifically designed to achieve human-alignment may have lower data and computational requirements than multicalibration algorithms.

Experimental results. Our experimental results demonstrate that, across tasks, the average utility achieved by decision makers is relatively higher if the classifier they use satisfies human-alignment. However, they do not empirically demonstrate that, for a fixed task, there is an improvement in the average utility achieved by decision makers if the classifier they use satisfies human-alignment. The reason why we could not demonstrate the latter is that, in our experiments, we used an observational dataset gathered by others [35]. Looking into the future, it would be very important to run a human subject study to empirically demonstrate the latter and, for now, to treat our conclusions with caution.

8 Conclusions

We have introduced a theoretical framework to investigate what properties confidence values should have to help decision makers take better decisions. We have shown that there exist data distributions for which a rational decision maker using calibrated confidence values will always take suboptimal decisions. However, we have further shown that, if the confidence values satisfy a natural alignment property, which can be achieved via multicalibration, then a rational decision maker using these confidence values can take optimal decisions. Finally, we have illustrated our theoretical results using real human predictions on four AI-assisted decision making tasks.

Acknowledgements. We would like to thank Nastaran Okati for fruitful discussions at an early stage of the project. Gomez-Rodriguez acknowledges support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719).

References

[1] Wei Jiao, Gurnit Atwal, Paz Polak, Rosa Karlic, Edwin Cuppen, Alexandra Danyi, Jeroen de Ridder, Carla van Herpen, Martijn P Lolkema, Neeltje Steeghs, et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nature Communications, 11(1):1–12, 2020. [2] Jacob Whitehill, Kiran Mohan, Daniel Seaton, Yigal Rosen, and Dustin Tingley. MOOC dropout prediction: How to measure accuracy? In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, pages 161–164, 2017. [3] Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018. [4] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness.
In Proceedings of the 23rd acm sigkdd international conference on knowledge discovery and data mining, pages 797 806, 2017. [5] Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and Isabel Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277 287. PMLR, 2020. [6] Isabel Valera, Adish Singla, and Manuel Gomez-Rodriguez. Enhancing the accuracy and fairness of human decision making. In Advances in Neural Information Processing Systems, 2018. [7] Guy N Rothblum and Gal Yona. Decision-making under miscalibration. In Innovations in Theoretical Computer Science, 2023. [8] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, 2017. [9] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, 2001. [10] Tilmann Gneiting, Fadoua Balabdaoui, and Adrian Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007. [11] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, 2018. [12] Chirag Gupta and Aaditya K. Ramdas. Distribution-free calibration guarantees for histogram binning without sample splitting. In Proceedings of the 38th International Conference on Machine Learning, 2021. [13] Roshni Sahoo, Shengjia Zhao, Alyssa Chen, and Stefano Ermon. Reliable decisions with threshold calibration. Advances in Neural Information Processing Systems, 34:1831 1844, 2021. [14] Shengjia Zhao, Michael P Kim, Roshni Sahoo, Tengyu Ma, and Stefano Ermon. Calibrating predictions to decisions: A novel approach to multi-class calibration. In Advances in Neural Information Processing Systems, 2021. [15] Yingxiang Huang, Wentao Li, Fima Macheret, Rodney A Gabriel, and Lucila Ohno-Machado. A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association, 27(4):621 633, 2020. [16] Lequn Wang, Thorsten Joachims, and Manuel Gomez-Rodriguez. Improving screening processes via calibrated subset selection. In Proceedings of the 39th International Conference on Machine Learning, 2023. [17] Kailas Vodrahalli, Tobias Gerstenberg, and James Zou. Uncalibrated models can improve human-ai collaboration. In Advances in Neural Information Processing Systems, 2022. [18] Gal Yona, Amir Feder, and Itay Laish. Useful confidence measures: Beyond the max score. ar Xiv preprint ar Xiv:2210.14070, 2022. [19] Eleni Straitouri, Lequn Wang, Nastaran Okati, and Manuel Gomez-Rodriguez. Improving expert predictions with conformal prediction. In Proceedings of the 40th International Conference on Machine Learning, 2023. [20] Judea Pearl. Causality. Cambridge university press, 2009. [21] Vivian Lai, Chacha Chen, Q Vera Liao, Alison Smith-Renner, and Chenhao Tan. Towards a science of human-ai decision making: a survey of empirical studies. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 2023. [22] Andrea Papenmeier, Gwenn Englebienne, and Christin Seifert. How model accuracy and explanation fidelity influence user trust. 
ar Xiv preprint ar Xiv:1907.12652, 2019. [23] Xinru Wang and Ming Yin. Are explanations helpful? a comparative study of the effects of explanations in ai-assisted decision-making. In 26th International Conference on Intelligent User Interfaces, pages 318 328, 2021. [24] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 chi conference on human factors in computing systems, pages 1 12, 2019. [25] Mahsan Nourani, Joanie T. King, and Eric D. Ragan. The role of domain expertise in user trust and the impact of first impressions with intelligent systems. Ar Xiv, abs/2008.09100, 2020. [26] Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. Effect of confidence and explanation on accuracy and trust calibration in ai-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 295 305, 2020. [27] Chacha Chen, Shi Feng, Amit Sharma, and Chenhao Tan. Machine explanations and human understanding. Transactions of Machine Learning Research, 2023. [28] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 2017. [29] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022. [30] Banghua Zhu, Jiantao Jiao, and Michael Jordan. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. ar Xiv preprint ar Xiv:2301.11270, 2023. [31] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, 2012. [32] Matteo Lisi, Gianluigi Mongillo, Georgia Milne, Tessa Dekker, and Andrei Gorea. Discrete confidence levels revealed by sequential decisions. Nature Human Behaviour, 5(2):273 280, 2021. [33] Hang Zhang, Nathaniel D Daw, and Laurence T Maloney. Human representation of visuo-motor uncertainty as mixtures of orthogonal basis distributions. Nature neuroscience, 18(8):1152 1158, 2015. [34] Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, et al. Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 401 413, 2021. [35] Kailas Vodrahalli, Roxana Daneshjou, Tobias Gerstenberg, and James Zou. Do humans trust advice more if it comes from ai? an analysis of human-ai interactions. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 763 777, 2022. [36] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. ar Xiv preprint ar Xiv:1704.05579, 2017. [37] Telmo Silva Filho, Hao Song, Miquel Perello-Nieto, Raul Santos-Rodriguez, Meelis Kull, and Peter Flach. Classifier calibration: How to assess and improve predicted class probabilities: a survey. ar Xiv e-prints, pages ar Xiv 2112, 2021. A.1 Additional Lemmas Lemma 1 (Monotonicity). If a utility function u satisfies Eq. 
1, then u is monotone with respect to the probability that Y = 1, i.e., for any P, P P({0, 1}) such that P(Y = 1) P (Y = 1), it holds that EY P [u(1, Y )] EY P [u(1, Y )]. Proof. We readily have that EY P [u(1, Y )] = P(Y = 1) u(1, 1) + (1 P(Y = 1)) u(1, 0) P (Y = 1) u(1, 1) + (1 P (Y = 1)) u(1, 0) = EY P [u(1, Y )], where, in the above inequality, we use that u(1, 1) > u(1, 0) and P(Y = 1) P (Y = 1). Lemma 2 (Trivial policies are not always optimal). If a utility function u satisfies Eq. 1, then there exist P, P P({0, 1}) such that the trivial policies π that either always decide T = 1 or always decide T = 0 are suboptimal. In particular, for any P, P P({0, 1}) such that P(Y = 1) < c and P (Y = 1) > c, where c = u(0, 0) u(1, 0) u(1, 1) u(1, 0) + u(0, 0) u(0, 1) (0, 1), (8) it holds that EY P [u(1, Y )] < EY P [u(0, Y )] and EY P [u(1, Y )] > EY P [u(0, Y )]. (9) Proof. Let P be any distribution such that P(Y = 1) < c = u(0, 0) u(1, 0) u(1, 1) u(1, 0) + u(0, 0) u(0, 1), where c (0, 1) because, by assumption, u satisfies Eq. 1. Now, by rearranging the above inequality, we have that P(Y = 1) u(1, 1) + (1 P(Y = 1)) u(1, 0) < P(Y = 1) u(0, 1) + (1 P(Y = 1)) u(0, 0), and, using the definition of the expectation, it immediately follows that EY P [u(1, Y )] < EY P [u(0, Y )]. The same argument can be used to show that, for any distribution P such that P (Y = 1) > c, it holds that EY P [u(1, Y )] > EY P [u(0, Y )]. Finally, note that, since c (0, 1), we know that such distributions P and P exist. A.2 Proof of Theorem 3 Before proving Theorem 3, we rewrite the expected utility with respect to the probability distribution P M in terms of confidence H and B by using the law of total expectation, Eπ[u(T, Y )] = EH,B P M(H,B) [Eπ[u(T, Y )|H, B]] . Here, to simplify notation, we will write EH,B [Eπ[u(T, Y ) | H, B]] , where note that, using the law of total expectation, we can write the inner expectation in the above expression in terms of the utilities of the trivial policies, i.e., Eπ[u(T, Y ) | H, B] = E[u(1, Y ) | H, B] Pπ(T = 1 | H, B) + E[u(0, Y ) | H, B] Pπ(T = 0 | H, B), (10) and we will use P to refer to probabilities induced by SCM M, e.g., P(H, B) to denote P M(H, B). Now, we restate and prove Theorem 3. Theorem 3. There exist (infinitely many) AI-assisted decision making processes M satisfying Eqs. 2 and 3, with utility functions u(T, Y ) satisfying Eq. 1, such that f B is perfectly calibrated and f H is monotone but any AI-assisted decision policy π Π(H, B) that satisfies monotonicity is suboptimal, i.e., Eπ[u(T, Y )] < Eπ [u(T, Y )]. Proof. To prove the above claim, we construct a monotone confidence function f H, perfectly calibrated confidence function f B and distribution P M for which any monotone AI-assisted decision policy π Π(H, B) achieves strictly lower utility than a carefully constructed non monotone AI-assisted decision policy π Π(H, B). We will present the proof in three parts. First, we will introduce the main building block and idea behind the proof by a small construction of f H, f B and P M with |H| = |B| = 3, where B [0, 1] denotes the (discrete) output space of the classifier s confidence function. We then construct examples of f H, f B and P M for arbitrary |H| = k and |B| = m with m, k N, m > k 2. Lastly, we construct examples where B is non-discrete and |H| = k with k > 2. Main building block and small example. We start by presenting the main idea of the proof using an example with a small set of confidence values H and B. 
Let the values of the decision maker s confidence H be in H = {h1, h2, h3} and the values of the classifier s confidence B be in B = {b1, b2, b3}, with order hi < (hi + 1) and bi < (bi + 1) respectively. Our main building block, consists of two distributions P , P + P({0, 1}) with P (Y = 1) < c and P +(Y = 1) > c, where c depends on utility u as described by Eq. 8 in Lemma 2. We use these distributions for our constructions of f H, f B and P M, so that for some realizations of H, B distribution P(Y = 1 | H, B) is either P or P +. Using Lemma 2 and from Eq. 10, we have that: (I) For any hi, bi such that P(Y | H = hi, B = bi) = P , it holds that E[u(1, Y ) | H = hi, B = bi] < E[u(0, Y ) | H = hi, B = bi]. Hence, decreasing Pπ(T = 1 | H, B) increases E[u(T, Y ) | H = hi, B = bi]. (II) For any hi, bi such that P(Y | H = hi, B = bi) = P +, it holds that E[u(1, Y ) | H = hi, B = bi] > E[u(0, Y ) | H = hi, B = bi]. Hence, increasing Pπ(T = 1 | H, B) increases E[u(T, Y ) | H = hi, B = bi]. Intuitively, suppose we now have that, for confidence values h2, b2, Y P + and, for confidence values h3, b2, Y P , i.e., P(Y | H = h2, B = b2) = P + and P(Y | H = h3, B = b2) = P . Then, any non-monotone AI-assisted decision policy π with P π(T = 1 | H = h2, B = b2) > P π(T = 1 | H = h3, B = b2) will have higher expected utility than any monotone AI-assisted decision policy given confidence values h2, b2 and h3, b2. Finally, under an appropriate choice of distribution P(H, B), such non-monotone AI-assisted decision policies π will offer higher overall utility in expectation. We formalize this intuition with the following lemma: Lemma 3. Let M be any AI-assisted decision making process satisfying Eqs. 2 and 3, with utility function u(T, Y ) satisfying Eq. 1. If f H, f B and P M are such that there exists confidence values b B, hi, hj H, with hi < hj, which satisfy P(H = hi, B = b) > 0, P(H = hj, B = b) > 0, P(Y | H = hi, B = b) = P + and P(Y | H = hj, B = b) = P , (11) for some distributions P , P + with P (Y = 1) < c and P +(Y = 1) > c, where c = u(0, 0) u(1, 0) u(1, 1) u(1, 0) + u(0, 0) u(0, 1). (12) Then, for any monotone AI-assisted decision policy π Π(H, B), there exists an AI-assisted decision policy π Π(H, B) which is not monotone and achieves a stricly greater utility than π, i.e., Eπ[u(T, Y )] < E π[u(T, Y )]. Proof. Let π be a monotone AI-assisted decision policy, then it must hold that Pπ(T = 1 | H = hi, B = b) Pπ(T = 1 | H = hj, B = b) (see Eq. 4). Let π be an identical AI-assisted decision policy to π up to the decision for confidence values hi, b and hj, b. We distinguish between three cases. Case 1: Pπ(T = 1 | H = hi, B = b) < Pπ(T = 1 | H = hj, B = b). Let the probability of T = 1 under π for confidence values hi, b and hj, b be switched compared to π, i.e., P π(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b), P π(T = 1 | H = hj, B = b) = Pπ(T = 1 | H = hi, B = b). Then, π is not monotone, as Eq. 4 is not satisfied, and it holds that P π(T = 1 | H = hi, B = b) > Pπ(T = 1 | H = hi, B = b), P π(T = 1 | H = hj, B = b) < Pπ(T = 1 | H = hj, B = b). As we decreased P(T = 1 | H = hj, B = b) and increased P(T = 1 | H = hi, B = b), by properties (I) and (II), it must hold that the expected utility of π given confidence values hi, b and hj, b is higher than the one of π, i.e., E π[u(T, Y ) | H = hi, B = b] > Eπ[u(T, Y ) | H = hi, B = b] and (13) E π[u(T, Y ) | H = hj, B = b] > Eπ[u(T, Y ) | H = hj, B = b]. (14) Case 2: 0 < Pπ(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b) 1. 
Let the probability of T = 1 under π for confidence values hj, b be strictly lower compared to π and be the same as π for hi, b. Then, π is not monotone, since by case assumption P π(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b) > P π(T = 1 | H = hj, B = b) and the inequality in Eq. 14 holds by property (I). Case 3: Pπ(T = 1 | H = hi, B = b) = Pπ(T = 1 | H = hj, B = b) = 0. Let the probability of T = 1 under π for confidence values hi, b be strictly higher compared to π and be the same as π for hj, b. Then, π is not monotone, since by case assumption P π(T = 1 | H = hj, B = b) = Pπ(T = 1 | H = hi, B = b) < P π(T = 1 | H = hi, B = b) and the inequality in Eq. 13 holds by property (II). As in all three cases at least one of the strict inequalities in Eqs. 13 or 14 holds and π is equivalent to π (i.e., it has the same expected conditional utility) given any other pair of confidence values h H, b B, we have that E π[u(T, Y )] = E[E π[u(T, Y )]|H, B] > E[Eπ[u(T, Y )|H, B] = Eπ[u(T, Y )]. Before proceeding further, we would like to note that we may also state Lemma 3 using h H, bi, bj B, with bi < bj, the proof would follow analogously. Now, we construct an AI-decision making process M, with H = {h1, h2, h3} and B = {b1, b2, b3}, such the decision maker s confidence f H is monotone, the classifier s confidence f B is perfectly calibrated, and the conditions of Lemma 3 are satisfied. First, let f H, f B and P M be such that P(f B(Z) = bj) = 3/6 if j = 1 2/6 if j = 2 1/6 if j = 3 0 otherwise P(H = hi | B = bj) := PX,V (H = hi | f B(Z) = bj) = ( 1 4 j if i j 0 otherwise. Then, it readily follows that P(H = hi, B = bj) = 1/6 for i j and P(H = hi, B = bj) = 0 otherwise. Moreover, for each pair of confidence values (hi, bj) with positive probability P(H = hi, B = bj), we set P(Y = 1 | H = hi, B = bj) = P + if i = j = 2 or (i = 3 and j {1, 3}) P if (j = 2 and i = 3) or (j = 1 and i {1, 2}), h1 h2 h3 h4 h5 h6 Figure 4: Nonzero values of P(Y = 1|H = hi, B = bj) and P(H = hi, B = bj) for every hi H and bj B used in the first (left) and second (right) part of the proof of Theorem 3. In each cell (hi, bj) in both panels, P + or P is the value of P(Y = 1|H = hi, B = bj) and lighter color means lower value of P(H = hi, B = bj), where white means P(Y = 1|h = hi, B = bj) = 0 and P(H, B) = 0. In both panels, the assignment of values is very stylized to facilitate the proof the classifier s confidence function f B partitions the feature space in a way such that a rational decision maker is unable to take decisions that maximize utility for almost all confidence values. However, less stylized examples also satisfy the conditions of Lemma 3. For example, as long as there is one triplet of confidence values b2, h2, h3 (or h3, b1, b2 in the left example) for which a rational decision maker is unable to take decisions that maximize utility, Lemma 3 can be applied. as shown in Figure 4 (left). Then, it readily follows that f H is monotone with respect to the probability that Y = 1, i.e., P(Y = 1 | H = hi) P(Y = 1 | H = hi+1)), and we have that the classifier s confidence values i:i j P(H = hi | B = bj) P(Y = 1 | H = hi, B = bj) 2/3 P + 1/3 P + if j = 1 1/2 P + 1/2 P + if j = 2 P + if j = 3 0 otherwise are perfectly calibrated and satisfy that bj < bj+1. Finally, using Lemma 3 with b = b2, hi = h2, hj = h3, we have that any monotone AI-assisted decision policy is suboptimal for any M with f H, f B and P M as defined above. Construction with arbitrary |H| = k and |B| = m, m > k 2. 
Construction with arbitrary $|\mathcal{H}| = k$ and $|\mathcal{B}| = m$, $m > k \geq 2$. In this second part of the proof, we construct an AI-assisted decision making process $M$, with $|\mathcal{H}| = k$ and $|\mathcal{B}| = m$ such that $m > k \geq 2$, such that the decision maker's confidence $f_H$ is monotone, the classifier's confidence $f_B$ is perfectly calibrated, and the conditions of Lemma 3 are satisfied. First, let the space of confidence values be $\mathcal{H} = \{h_i\}_{i \in [k]}$ and $\mathcal{B} = \{b_j\}_{j \in [m]}$, with orders $h_i < h_{i+1}$ and $b_j < b_{j+1}$, respectively, and let $f_H$, $f_B$ and $P_M$ be such that $P(f_B(Z) = b_j) = 1/m$ and
$$P(H = h_i \mid B = b_j) := P_{X,V}(H = h_i \mid f_B(Z) = b_j) = \begin{cases} \frac{m - j + 1}{m} & \text{if } j = i \\ \frac{m - j + 1}{m} & \text{if } i = 1,\ j > k \\ \frac{j - 1}{m} & \text{if } j = i + 1,\ j \leq k \\ \frac{j - 1}{m} & \text{if } i = k,\ j > k \\ 0 & \text{otherwise.} \end{cases}$$
Moreover, for each pair of confidence values $(h_i, b_j)$ with positive probability $P(H = h_i, B = b_j)$, we set
$$P(Y \mid H = h_i, B = b_j) = \begin{cases} P^- & \text{if } j = i \\ P^- & \text{if } i = 1,\ j > k \\ P^+ & \text{if } j = i + 1,\ j \leq k \\ P^+ & \text{if } i = k,\ j > k, \end{cases}$$
as shown in Figure 4 (right). Further, we set the classifier's confidence values $b_j$ to
$$b_j := \frac{(m - j + 1)\, P^-(Y = 1) + (j - 1)\, P^+(Y = 1)}{m}.$$
Then, it holds that $b_j < b_{j+1}$ and $f_B$ is perfectly calibrated, since
$$P(Y = 1 \mid B = b_j) = \begin{cases} P(H = h_j \mid B = b_j)\, P^-(Y = 1) + P(H = h_{j-1} \mid B = b_j)\, P^+(Y = 1) & \text{if } j \leq k \\ P(H = h_1 \mid B = b_j)\, P^-(Y = 1) + P(H = h_k \mid B = b_j)\, P^+(Y = 1) & \text{if } j > k \end{cases}$$
and thus, using the definitions of $P(H \mid B)$ and $P(Y \mid H, B)$, we have that $P(Y = 1 \mid B = b_j) = b_j$.

To show that $f_H$ is monotone with respect to the probability that $Y = 1$, first note that $P(H = h_i, B = b_i)$ decreases as $i$ increases and $P(H = h_i, B = b_{i+1})$ increases as $i$ increases. Moreover, further note that $P(Y = 1 \mid H = h_i, B = b_i) = P^-(Y = 1) < P(Y = 1 \mid H = h_i, B = b_{i+1}) = P^+(Y = 1)$. Hence, for any $i \in \{2, \ldots, k - 1\}$, it readily follows that
$$P(Y = 1 \mid H = h_i) = P^+(Y = 1)\, P(B = b_{i+1} \mid H = h_i) + P^-(Y = 1)\, P(B = b_i \mid H = h_i) \leq P(Y = 1 \mid H = h_{i+1}),$$
and, for $i = 1$, it is evident that $P(Y = 1 \mid H = h_1) < P(Y = 1 \mid H = h_2)$. Finally, using Lemma 3 with any choice of confidence values $b = b_j$, $h_i = h_{j-1}$ and $h_j = h_j$ with $j \in \{2, \ldots, k\}$, we have that any monotone AI-assisted decision policy $\pi$ is suboptimal for any $M$ with $|\mathcal{H}| = k$ and $|\mathcal{B}| = m$, $m > k \geq 2$, and $f_H$, $f_B$ and $P_M$ as defined above. Here, note that, as we do not fix the exact distributions $P^-$ and $P^+$, the above lemma applies to infinitely many AI-assisted decision making processes $M$.

Construction with $\mathcal{B} \subseteq [0, 1]$ and $|\mathcal{H}| = k$. In this last part of the proof, we construct an AI-assisted decision making process $M$, with $|\mathcal{H}| = k \geq 2$ and $\mathcal{B} \subseteq [0, 1]$, such that the decision maker's confidence function $f_H$ is monotone, the classifier's confidence function $f_B$ is perfectly calibrated, and the conditions of Lemma 3 are satisfied. First, let the space of confidence values be $\mathcal{H} = \{h_i\}_{i \in [k]}$, with order $h_i < h_{i+1}$, the feature space$^{10}$ $\mathcal{X} = [0, 1]$, and $f^-, f^+$ be two strictly monotone increasing functions with
$$f^- : [0, 1] \to [0, c) \quad \text{and} \quad f^+ : [0, 1] \to (c, 1], \qquad (17)$$
where
$$c = \frac{u(0, 0) - u(1, 0)}{u(1, 1) - u(1, 0) + u(0, 0) - u(0, 1)}. \qquad (18)$$
Further, let $Q_{k+1} = \{q_0, q_1, \ldots, q_k, q_{k+1}\}$ be a set of quantiles such that $P(X \leq q_j) = j/(k + 1)$ for all $j \in \{0, 1, \ldots, k + 1\}$. Thus, for all $j \in [k + 1]$ and $I_j := (q_{j-1}, q_j]$, it holds that $P(X \in I_j) = \frac{1}{k + 1}$. Now, let $f_H$ and $P_M$ be such that
$$P_V(H = h_i \mid X, X \in I_j) = \begin{cases} 1 & \text{if } i = j = 1 \text{ or } (i = k \text{ and } j = k + 1) \\ 1/2 & \text{if } 1 < j \leq k \text{ and } i \in \{j - 1, j\} \\ 0 & \text{otherwise,} \end{cases} \qquad (19)$$

$^{10}$For a more general feature space $\mathcal{X}$, we can use a mapping $\varphi$ of $\mathcal{X}$ to $[0, 1]$. The proof works analogously by substituting $X$ with $\varphi(X)$.

Figure 5: Nonzero values of $P(Y = 1 \mid X, H = h_i, X \in I_j)$ for every $h_i \in \mathcal{H}$, with $|\mathcal{H}| = 3$, and $I_j = (q_{j-1}, q_j]$, with $q_j \in Q_4$, used in the last part of the proof of Theorem 3. Lighter color means lower value of $f^-$ or $f^+$.
and
$$P(Y = 1 \mid X, H = h_i, X \in I_j) = \begin{cases} f^-(X) & \text{if } j = i \\ f^+(X) & \text{if } j = i + 1 \text{ or } (i = k \text{ and } j = k + 1), \end{cases} \qquad (20)$$
as shown in Figure 5. Next, we define
$$f_B(Z) = f_B(X) := P(Y = 1 \mid X) = \begin{cases} f^-(X) & \text{if } X \in I_1 \\ f^+(X) & \text{if } X \in I_{k+1} \\ (f^-(X) + f^+(X))/2 & \text{otherwise,} \end{cases}$$
which, by construction, is perfectly calibrated. To show that the decision maker's confidence function $f_H$ is monotone with respect to the probability that $Y = 1$, we first note that, using Eq. 19, we have that
$$P(X \in I_j \mid H = h_i) = 1/2 \ \text{ if } 1 < i < k \text{ and } j \in \{i, i + 1\}, \quad \text{and} \quad P(X \in I_j \mid H = h_i) = 0 \ \text{ whenever } j \notin \{i, i + 1\}. \qquad (21)$$
Hence, using Eq. 21 and the law of total probability, for any $i \in \{2, \ldots, k - 2\}$, we have that
$$P(Y = 1 \mid H = h_i) = \tfrac{1}{2}\big[P(Y = 1 \mid H = h_i, X \in I_i) + P(Y = 1 \mid H = h_i, X \in I_{i+1})\big] \leq \tfrac{1}{2}\big[f^-(q_i) + f^+(q_{i+1})\big] = \tfrac{1}{2}\big[f^-(\inf I_{i+1}) + f^+(\inf I_{i+2})\big] \leq \tfrac{1}{2}\big[P(Y = 1 \mid H = h_{i+1}, X \in I_{i+1}) + P(Y = 1 \mid H = h_{i+1}, X \in I_{i+2})\big] = P(Y = 1 \mid H = h_{i+1}),$$
where the inequalities follow from the fact that $f^-$ and $f^+$ are strictly monotone increasing. The corner cases $i = 1$ and $i = k - 1$ can be shown analogously by further using that $f^-(X) < c < f^+(X)$ for all $X$. Finally, using Lemma 3 with any choice of confidence values $h_i = h_{j-1}$, $h_j = h_j$, $j \in \{2, \ldots, k - 1\}$, and $b = f_B(X)$ with $X \in I_j$, we have that any monotone AI-assisted decision policy $\pi$ is suboptimal for any $M$ with $\mathcal{B} \subseteq [0, 1]$ and $|\mathcal{H}| = k$, $k \geq 2$, and $f_H$, $f_B$ and $P_M$ as defined above.

A.3 Proof of Theorem 5

We prove the statement by contradiction. Let $M$ be an AI-assisted decision making process satisfying Eqs. 2 and 3, with a utility function $u(T, Y)$ satisfying Eq. 1, and let $M$ be such that $f_B$ satisfies $\alpha$-alignment with respect to $f_H$ and $f_B$ has output space $\mathcal{B} \subseteq [0, 1]$. Assume there exists no (near-)optimal monotone AI-assisted decision policy for utility $u$. Then, there must exist an optimal AI-assisted decision policy $\pi \in \Pi(\mathcal{H}, \mathcal{B})$ which is not monotone and has strictly greater expected utility than any monotone policy. However, we show that we can modify $\pi$ into a monotone AI-assisted decision policy $\hat{\pi} \in \Pi(\mathcal{H}, \mathcal{B})$ with near-optimal expected utility, a contradiction.

As $\pi$ is not monotone, there must exist confidence values $h_1, h_2 \in \mathcal{H}$, $h_1 \leq h_2$, and $b_1, b_2 \in \mathcal{B}$, $b_1 \leq b_2$, such that
$$\pi(h_1, b_1, w) > \pi(h_2, b_2, w) \ \text{ for some } w \in \mathcal{W}, \qquad (22)$$
where $\mathcal{W}$ denotes the space of noise values. In what follows, let $W^{(\pi, h_2, b_2)}_{h_1, b_1} \subseteq \mathcal{W}$ denote the set containing any such $w$ and let $W^{(\pi, h_2, b_2)} = \bigcup_{(h, b) \in \mathcal{H} \times \mathcal{B}} W^{(\pi, h_2, b_2)}_{h, b}$.

For any confidence values $h', b' \in \mathcal{H} \times [0, 1]$, we modify the policy $\pi$ into a policy $\hat{\pi}$ as follows. Let $\{\tilde{S}_h\}_{h \in \mathcal{H}}$ denote the sets satisfying the $\alpha$-alignment condition for $f_B$ with respect to $f_H$ and, given confidence $h'$, let $\hat{b}_{h'}$ denote the smallest confidence value of $f_B$ such that there exists $h \leq h'$ with $P(Y = 1 \mid B = \hat{b}_{h'}, Z \in \tilde{S}_h) \geq c$, i.e.,
$$\hat{b}_{h'} := \min\{b \in \mathcal{B} \mid P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \geq c \ \text{ for some } h \leq h'\}. \qquad (23)$$
Now, we define the new AI-assisted decision policy $\hat{\pi}$ from $\pi$ as follows:
$$\hat{\pi}(h', b', w) := \begin{cases} 1 & \text{if } b' \geq \hat{b}_{h'} \text{ and } w \in \bigcup_{h \leq h',\, b \in [\hat{b}_{h'}, b']} W^{(\pi, h, b)} \\ 0 & \text{if } b' < \hat{b}_{h'} \text{ and } w \in \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)} \\ \pi(h', b', w) & \text{otherwise.} \end{cases} \qquad (24)$$
Next, we show that $\hat{\pi}$ is monotone and that $\mathbb{E}_{\hat{\pi}}[u(T, Y)] \geq \mathbb{E}_\pi[u(T, Y)] - \alpha\, a$ for some constant $a$.

Proof that $\hat{\pi}$ is a monotone AI-assisted decision policy. To prove that $\hat{\pi} \in \Pi(\mathcal{H}, \mathcal{B})$ is a monotone AI-assisted decision policy, we show that, for all $h', h'' \in \mathcal{H}$ and $b', b'' \in \mathcal{B}$, with $h' \leq h''$ and $b' \leq b''$, it holds that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$. We distinguish between three cases.

Case 1: $b' \geq \hat{b}_{h'}$ and $b'' \geq \hat{b}_{h''}$. Since $h' \leq h''$, $b' \leq b''$ and, by definition, $\hat{b}_{h''} \leq \hat{b}_{h'}$ (as $h' \leq h''$), we have that
$$\bigcup_{h \leq h',\, b \in [\hat{b}_{h'}, b']} W^{(\pi, h, b)} \subseteq \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)}.$$
Hence, we can conclude that
$$\hat{\pi}(h', b', w) \leq 1 = \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)}. \qquad (25)$$
Further, for any other $w \in \mathcal{W} \setminus \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)} \subseteq \mathcal{W} \setminus W^{(\pi, h'', b'')}_{h', b'}$, we have that $\hat{\pi}(h', b', w) = \pi(h', b', w)$ and $\hat{\pi}(h'', b'', w) = \pi(h'', b'', w)$ and, by the definition of $W^{(\pi, h'', b'')}_{h', b'}$, it follows that
$$\hat{\pi}(h', b', w) \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \mathcal{W} \setminus \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)}. \qquad (26)$$
From Eqs. 25 and 26, it follows that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$.

Case 2: $b' < \hat{b}_{h'}$ and $b'' \geq \hat{b}_{h''}$. By the definition of $\hat{\pi}$, we have that
$$\hat{\pi}(h', b', w) \leq 1 = \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)} \qquad (27)$$
and
$$\hat{\pi}(h', b', w) = 0 \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}. \qquad (28)$$
Analogously to Case 1, since the remaining values of $w$ also lie in $\mathcal{W} \setminus W^{(\pi, h'', b'')}_{h', b'}$ and $\hat{\pi}$ is equivalent to $\pi$ for these values, we have that
$$\hat{\pi}(h', b', w) \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \mathcal{W} \setminus \Big(\bigcup_{h \leq h'',\, b \in [\hat{b}_{h''}, b'']} W^{(\pi, h, b)} \cup \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}\Big). \qquad (29)$$
From Eqs. 27, 28 and 29, it follows that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$.

Case 3: $b' < \hat{b}_{h'}$ and $b'' < \hat{b}_{h''}$. Since $h' \leq h''$, $b' \leq b''$ and, by definition, $\hat{b}_{h''} \leq \hat{b}_{h'}$ (as $h' \leq h''$), we have that
$$\bigcup_{h \geq h'',\, b \in [b'', \hat{b}_{h''})} W^{(\pi, h, b)} \subseteq \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}.$$
Hence, we can conclude that
$$\hat{\pi}(h', b', w) = 0 \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}. \qquad (30)$$
Again analogously to Case 1, since the remaining values of $w$ also lie in $\mathcal{W} \setminus W^{(\pi, h'', b'')}_{h', b'}$ and $\hat{\pi}$ is equivalent to $\pi$ for these values, we have that
$$\hat{\pi}(h', b', w) \leq \hat{\pi}(h'', b'', w) \ \text{ for all } w \in \mathcal{W} \setminus \bigcup_{h \geq h',\, b \in [b', \hat{b}_{h'})} W^{(\pi, h, b)}. \qquad (31)$$
From Eqs. 30 and 31, it follows that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$.

Note that we cannot have a case where $b' \geq \hat{b}_{h'}$ and $b'' < \hat{b}_{h''}$, as this would imply $b'' < b'$. Since, in all three possible cases, we have shown that $W^{(\hat{\pi}, h'', b'')}_{h', b'} = \emptyset$, we can conclude that $\hat{\pi} \in \Pi(\mathcal{H}, \mathcal{B})$ is monotone.

Proof that $\hat{\pi}$ is near-optimal. First, we rewrite the inner expectation in Eq. 10 as
$$\mathbb{E}_\pi[u(T, Y) \mid H, B] = \mathbb{E}[u(0, Y) \mid H, B] + \big(\mathbb{E}[u(1, Y) \mid H, B] - \mathbb{E}[u(0, Y) \mid H, B]\big)\, P_\pi(T = 1 \mid H, B).$$
Further, recall that $|\tilde{S}_h| \geq (1 - \alpha/2)|S_h|$ for all $h \in \mathcal{H}$ and that, for all $h', h'' \in \mathcal{H}$, $h' \leq h''$, and all $b', b'' \in [0, 1]$, $b' \leq b''$, we have
$$P(Y = 1 \mid f_B(Z) = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid f_B(Z) = b'', Z \in \tilde{S}_{h''}) \leq \alpha. \qquad (32)$$
Now, for any $h' \in \mathcal{H}$ and $b' \in \mathcal{B}$, we derive an upper bound on $\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b']$. We distinguish between three cases.

Case 1: $b' \geq \hat{b}_{h'}$ and $P(Y = 1 \mid H = h', B = b') \geq c$. Using Lemma 2, we have that
$$\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b'] \geq 0. \qquad (33)$$
Moreover, as $b' \geq \hat{b}_{h'}$, the probability of a positive decision under $\hat{\pi}$ can only increase for $h', b'$ compared to $\pi$ (see Eq. 24), i.e.,
$$P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b') \leq 0.$$
Hence, it follows that
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] = \big(\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b']\big)\big(P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b')\big) \leq 0. \qquad (34)$$

Case 2: $b' \geq \hat{b}_{h'}$ and $P(Y = 1 \mid H = h', B = b') < c$. Since $b' \geq \hat{b}_{h'}$, there exist $(h, b) \in \mathcal{H} \times \mathcal{B}$, with $h \leq h'$, $b \leq b'$, such that $P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \geq c$. Moreover, using the definition of $\alpha$-alignment, we have that
$$P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \leq P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) + \alpha. \qquad (35)$$
Then, we can use this to lower bound the expected utility of $T = 1$ given $B = b'$ and $Z \in \tilde{S}_{h'}$ in terms of that given $B = b$ and $Z \in \tilde{S}_h$:
$$\mathbb{E}[u(1, Y) \mid B = b, Z \in \tilde{S}_h] - \mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] = u(1, 1)\big(P(Y = 1 \mid B = b, Z \in \tilde{S}_h) - P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'})\big) + u(1, 0)\big(P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid B = b, Z \in \tilde{S}_h)\big) \leq \big(u(1, 1) - u(1, 0)\big)\, \alpha, \qquad (36)$$
where the last inequality holds due to Eq. 35 and the assumption that $u(1, 1) - u(1, 0) > 0$.
Analogously, we can also upper bound the expected utility of $T = 0$ given $B = b'$ and $Z \in \tilde{S}_{h'}$ as follows:
$$\mathbb{E}[u(0, Y) \mid B = b, Z \in \tilde{S}_h] - \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}] = u(0, 1)\big(P(Y = 1 \mid B = b, Z \in \tilde{S}_h) - P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'})\big) + u(0, 0)\big(P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid B = b, Z \in \tilde{S}_h)\big) \geq \big(u(0, 1) - u(0, 0)\big)\, \alpha, \qquad (37)$$
where the last inequality holds due to Eq. 35 and the assumption that $u(0, 1) - u(0, 0) < 0$. Now, as $P(Y = 1 \mid B = b, Z \in \tilde{S}_h) \geq c$, by Lemma 2, we have that
$$\mathbb{E}[u(1, Y) \mid B = b, Z \in \tilde{S}_h] \geq \mathbb{E}[u(0, Y) \mid B = b, Z \in \tilde{S}_h]. \qquad (38)$$
Combining Eqs. 36, 37 and 38, we obtain
$$\mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] + \alpha\big(u(1, 1) - u(1, 0)\big) \geq \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}] + \alpha\big(u(0, 1) - u(0, 0)\big). \qquad (39)$$
In addition, note that we have the following trivial bounds for the expectations when $H = h'$ but $Z \notin \tilde{S}_{h'}$:
$$u(1, 0) \leq \mathbb{E}[u(1, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] \leq u(1, 1), \qquad (40)$$
$$u(0, 1) \leq \mathbb{E}[u(0, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] \leq u(0, 0). \qquad (41)$$
Moreover, since $b' \geq \hat{b}_{h'}$, the probability of a positive decision under $\hat{\pi}$ can only increase for $h', b'$ compared to $\pi$, i.e., $P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b') \leq 0$. Hence, we have that
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] \leq (-1)\,\big(\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b']\big), \qquad (42)$$
where the inequality follows since $\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b'] \leq 0$ by Lemma 2, as $P(Y = 1 \mid H = h', B = b') < c$. Finally, combining Eqs. 39, 40, 41 and 42 and using the law of total expectation, we obtain
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] \leq (1 - \beta_{(h', b')})\big(\mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}] - \mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}]\big) + \beta_{(h', b')}\big(\mathbb{E}[u(0, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] - \mathbb{E}[u(1, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}]\big) \leq (1 - \beta_{(h', b')})\,\alpha\,\big(u(1, 1) - u(1, 0) + u(0, 0) - u(0, 1)\big) + \beta_{(h', b')}\big(u(0, 0) - u(1, 0)\big), \qquad (43)$$
where $\beta_{(h', b')}$ denotes the probability of $Z \notin \tilde{S}_{h'}$ given $H = h', B = b'$, i.e., $\beta_{(h', b')} = P(Z \notin \tilde{S}_{h'} \mid H = h', B = b')$.

Case 3: $b' < \hat{b}_{h'}$. For all $(h, b)$, with $h \leq h'$, $b \leq b'$, we have that $P(Y = 1 \mid B = b, Z \in \tilde{S}_h) < c$. In particular, $P(Y = 1 \mid B = b', Z \in \tilde{S}_{h'}) < c$. Thus, by Lemma 2,
$$\mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] < \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}]. \qquad (44)$$
In this case, since $b' < \hat{b}_{h'}$, the probability of a positive decision under $\hat{\pi}$ can only decrease for $h', b'$ compared to $\pi$, i.e.,
$$0 \leq P_\pi(T = 1 \mid H = h', B = b') - P_{\hat{\pi}}(T = 1 \mid H = h', B = b') \leq 1.$$
Combining Eqs. 44, 40 and 41 and using the law of total expectation, we obtain
$$\mathbb{E}_\pi[u(T, Y) \mid H = h', B = b'] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B = b'] \leq \big(\mathbb{E}[u(1, Y) \mid H = h', B = b'] - \mathbb{E}[u(0, Y) \mid H = h', B = b']\big) \cdot 1 = (1 - \beta_{(h', b')})\big(\mathbb{E}[u(1, Y) \mid B = b', Z \in \tilde{S}_{h'}] - \mathbb{E}[u(0, Y) \mid B = b', Z \in \tilde{S}_{h'}]\big) + \beta_{(h', b')}\big(\mathbb{E}[u(1, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}] - \mathbb{E}[u(0, Y) \mid H = h', B = b', Z \notin \tilde{S}_{h'}]\big) \leq \beta_{(h', b')}\big(u(1, 1) - u(0, 1)\big), \qquad (45)$$
where, again, $\beta_{(h', b')} = P(Z \notin \tilde{S}_{h'} \mid H = h', B = b')$.

Now, for a fixed $h' \in \mathcal{H}$, since $|\tilde{S}_{h'}| \geq (1 - \alpha/2)|S_{h'}|$, we know that $0 \leq \mathbb{E}_B[\beta_{(h', B)} \mid H = h'] \leq \alpha/2$. Hence, combining Eqs. 34, 43 and 45 from the three cases above, we have that
$$\mathbb{E}_B\big[\mathbb{E}_\pi[u(T, Y) \mid H = h', B]\big] - \mathbb{E}_B\big[\mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B]\big] = \mathbb{E}_B\big[\mathbb{E}_\pi[u(T, Y) \mid H = h', B] - \mathbb{E}_{\hat{\pi}}[u(T, Y) \mid H = h', B]\big] \leq \max\Big\{\alpha\big(u(1, 1) - u(1, 0) + u(0, 0) - u(0, 1)\big) + \tfrac{\alpha}{2}\big(u(0, 0) - u(1, 0)\big),\ \tfrac{\alpha}{2}\big(u(1, 1) - u(0, 1)\big)\Big\} \leq \alpha\Big(u(1, 1) - u(0, 1) + \tfrac{3}{2}\big(u(0, 0) - u(1, 0)\big)\Big).$$
Finally, since by assumption $\pi$ is optimal, i.e., $\mathbb{E}_\pi[u(T, Y)] = \max_{\pi' \in \Pi(\mathcal{H}, \mathcal{B})} \mathbb{E}_{\pi'}[u(T, Y)]$, we can conclude by the law of total expectation that
$$\mathbb{E}_\pi[u(T, Y)] = \mathbb{E}_H\Big[\mathbb{E}_B\big[\mathbb{E}_{Y, T \mid \pi}[u(T, Y) \mid H, B]\big]\Big] \leq \mathbb{E}_{\hat{\pi}}[u(T, Y)] + \alpha\Big(u(1, 1) - u(0, 1) + \tfrac{3}{2}\big(u(0, 0) - u(1, 0)\big)\Big).$$
This concludes the proof.
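The proof above uses the alignment condition in Eq. 32 only through the conditional probabilities $P(Y = 1 \mid B = b, Z \in \tilde{S}_h)$. As a rough illustration of how one might gauge this condition on a finite sample, the sketch below estimates the largest violation of a binned version of the alignment inequality. The function name, the binning scheme and the minimum cell size are our own choices, not the estimator used in the paper, and the empirical frequencies only approximate the population quantities in Eq. 32.

```python
import numpy as np

def empirical_alignment_gap(h, b, y, n_bins=10, min_count=20):
    """Estimate the largest violation of the (binned) alignment condition, i.e., the maximum of
    P(Y=1 | B in bin', H=h') - P(Y=1 | B in bin'', H=h'') over h' <= h'' and bin' <= bin''."""
    h, b, y = np.asarray(h), np.asarray(b), np.asarray(y)
    h_levels = np.sort(np.unique(h))
    bins = np.minimum((b * n_bins).astype(int), n_bins - 1)          # coarse bins over [0, 1]
    rate = np.full((len(h_levels), n_bins), np.nan)
    for i, hv in enumerate(h_levels):
        for j in range(n_bins):
            mask = (h == hv) & (bins == j)
            if mask.sum() >= min_count:                              # drop tiny cells, cf. the trimming to S~_h
                rate[i, j] = y[mask].mean()
    gap = 0.0
    for i1 in range(len(h_levels)):
        for j1 in range(n_bins):
            for i2 in range(i1, len(h_levels)):
                for j2 in range(j1, n_bins):
                    if not (np.isnan(rate[i1, j1]) or np.isnan(rate[i2, j2])):
                        gap = max(gap, rate[i1, j1] - rate[i2, j2])
    return gap

# Synthetic usage: a calibrated classifier whose confidence also drives the human's confidence
# is aligned, so the estimated gap should be close to zero, up to sampling noise.
rng = np.random.default_rng(0)
b_syn = rng.uniform(size=5000)
y_syn = rng.binomial(1, b_syn)
h_syn = (b_syn > 0.5).astype(int)
print(empirical_alignment_gap(h_syn, b_syn, y_syn))
```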
A.4 Proof of Theorem 8

If $f_B$ is $\alpha/2$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then, by definition, for any $h \in \mathcal{H}$, there exists $\tilde{S}_h \subseteq S_h$ with $|\tilde{S}_h| \geq (1 - \alpha/2)|S_h|$ such that, for any $b \in [0, 1]$, it holds that
$$|P(Y = 1 \mid f_B(Z) = b, Z \in \tilde{S}_h) - b| \leq \alpha/2.$$
This directly implies that, for any $h', h'' \in \mathcal{H}$ and $b', b'' \in [0, 1]$, we have that
$$\big(P(Y = 1 \mid f_B(Z) = b', Z \in \tilde{S}_{h'}) - b'\big) - \big(P(Y = 1 \mid f_B(Z) = b'', Z \in \tilde{S}_{h''}) - b''\big) \leq \alpha \qquad (46)$$
and, rearranging terms, we further have that
$$P(Y = 1 \mid f_B(Z) = b', Z \in \tilde{S}_{h'}) - P(Y = 1 \mid f_B(Z) = b'', Z \in \tilde{S}_{h''}) \leq \alpha + b' - b'', \qquad (47)$$
showing that, whenever $b' \leq b''$, the $\alpha$-alignment condition is met. This proves that $f_B$ is $\alpha$-aligned with respect to $f_H$. Finally, if $f_B$ is $\alpha/2$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then it is $\alpha/2$-calibrated with respect to each of the sets $S_h$. Since $\mathcal{Z} = \bigcup_{h \in \mathcal{H}} S_h$, this implies that $f_B$ is $\alpha/2$-calibrated with respect to $\mathcal{Z}$. This concludes the proof.

A.5 Proof of Proposition 1

Given a discretization parameter $\lambda$, Algorithm 1 works with a discretized notion of $\alpha$-multicalibration, namely $(\alpha, \lambda)$-multicalibration:

Definition 10. Let $\mathcal{C} \subseteq 2^{\mathcal{Z}}$ be a collection of subsets of $\mathcal{Z}$. For any $\alpha, \lambda > 0$, a confidence function $f_B : \mathcal{Z} \to [0, 1]$ is $(\alpha, \lambda)$-multicalibrated with respect to $\mathcal{C}$ if, for all $S_h \in \mathcal{C}$, all $b \in \Lambda[0, 1]$, and all $S_{h, \lambda}(b)$ such that $|S_{h, \lambda}(b)| \geq \alpha \lambda |S_h|$, it holds that
$$\big|\mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h, \lambda}(b)]\big| \leq \alpha. \qquad (48)$$

Here, we can analogously define a discretized notion of $\alpha$-alignment, namely $(\alpha, \lambda)$-alignment.

Definition 11. For $\alpha, \lambda > 0$, a confidence function $f_B : \mathcal{Z} \to [0, 1]$ is $(\alpha, \lambda)$-aligned with respect to $f_H$ if, for all $h', h'' \in \mathcal{H}$, $h' \leq h''$, and all $b', b'' \in \Lambda[0, 1]$, $b' \leq b''$, with $|S_{h', \lambda}(b')| > (\alpha/2)\, \lambda |S_{h'}|$ and $|S_{h'', \lambda}(b'')| > (\alpha/2)\, \lambda |S_{h''}|$, we have
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b')) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'')) \leq \alpha. \qquad (49)$$

In what follows, we first show that $(\alpha, \lambda)$-multicalibration with respect to $\{S_h\}_{h \in \mathcal{H}}$ implies $(2\alpha + \lambda, \lambda)$-alignment with respect to $f_H$.

Theorem 12. For $\alpha, \lambda > 0$, if $f_B$ is $(\alpha, \lambda)$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then $f_B$ is $(2\alpha + \lambda, \lambda)$-aligned with respect to $f_H$.

Proof. If $f_B$ is $(\alpha, \lambda)$-multicalibrated with respect to $\{S_h\}_{h \in \mathcal{H}}$, then, by definition, for all $h \in \mathcal{H}$, all $b \in \Lambda[0, 1]$, and all $S_{h, \lambda}(b)$ such that $|S_{h, \lambda}(b)| \geq \alpha \lambda |S_h|$, it holds that
$$\big|\mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h, \lambda}(b)]\big| \leq \alpha. \qquad (50)$$
This directly implies that, for all $h', h'' \in \mathcal{H}$ and $b', b'' \in \Lambda[0, 1]$ with $|S_{h', \lambda}(b')| \geq \alpha \lambda |S_{h'}|$ and $|S_{h'', \lambda}(b'')| \geq \alpha \lambda |S_{h''}|$, it holds that
$$\mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h', \lambda}(b')] - \mathbb{E}[f_B(X, H) - P(Y = 1 \mid X, H) \mid (X, H) \in S_{h'', \lambda}(b'')] \leq 2\alpha \qquad (51)$$
and, using the linearity of expectation, we have that
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b')) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'')) \leq 2\alpha + \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h', \lambda}(b')] - \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h'', \lambda}(b'')]. \qquad (52)$$
Whenever $b' \leq b''$, due to the $\lambda$-discretization, we have that
$$\mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h', \lambda}(b')] - \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h'', \lambda}(b'')] \leq \lambda. \qquad (53)$$
Hence, we have shown that, if $f_B$ is $(\alpha, \lambda)$-multicalibrated, then, for all $h', h'' \in \mathcal{H}$ and $b', b'' \in \Lambda[0, 1]$ with $|S_{h', \lambda}(b')| \geq \alpha \lambda |S_{h'}|$, $|S_{h'', \lambda}(b'')| \geq \alpha \lambda |S_{h''}|$ and $b' \leq b''$, we have
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b')) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'')) \leq 2\alpha + \lambda. \qquad (54)$$
Further, note that $\frac{2\alpha + \lambda}{2}\,\lambda > \alpha\,\lambda$ as $\lambda > 0$, i.e., the group-size threshold in the definition of $(2\alpha + \lambda, \lambda)$-alignment is larger than the one for which Eq. 54 was established. This concludes the proof.
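As an illustration of the quantity that Definition 10 controls, the following sketch estimates, on a finite sample, the residual $|\mathbb{E}[f_B(X,H) - P(Y = 1 \mid X, H) \mid (X,H) \in S_{h,\lambda}(b)]|$ for every sufficiently large cell $S_{h,\lambda}(b)$, approximating $P(Y = 1 \mid X, H)$ by the empirical label frequency within the cell. The function and the synthetic data are illustrative; this is not the estimation procedure used by Algorithm 1.

```python
import numpy as np

def multicalibration_residuals(f_b, h, y, lam=0.1, alpha=0.1):
    """Per group S_h and per lambda-bin, estimate |E[f_B(X,H) - P(Y=1 | X,H) | S_{h,lambda}(b)]|,
    using the empirical label frequency in the cell as a proxy for P(Y=1 | X,H)."""
    f_b, h, y = np.asarray(f_b), np.asarray(h), np.asarray(y)
    grid = np.arange(lam / 2, 1, lam)                        # the lambda-discretization Lambda[0, 1]
    residuals = {}
    for hv in np.unique(h):
        in_group = (h == hv)
        for b in grid:
            cell = in_group & (np.abs(f_b - b) <= lam / 2)   # S_{h,lambda}(b)
            if cell.sum() >= alpha * lam * in_group.sum():   # only the cells Definition 10 constrains
                residuals[(hv, round(float(b), 3))] = abs(f_b[cell].mean() - y[cell].mean())
    return residuals

# Synthetic usage: when f_B equals the true conditional probability, all residuals should be small.
rng = np.random.default_rng(1)
p = rng.uniform(size=4000)
y_syn = rng.binomial(1, p)
h_syn = (p > 0.5).astype(int)
print(max(multicalibration_residuals(f_b=p, h=h_syn, y=y_syn).values()))
```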
Next, we show that, if $f_B$ is $(\alpha, \lambda)$-aligned, then $f_{B,\lambda}$ is $\alpha$-aligned with respect to $f_H$.

Theorem 13. For $\alpha, \lambda > 0$, if $f_B$ is $(\alpha, \lambda)$-aligned with respect to $f_H$, then $f_{B,\lambda}$ is $\alpha$-aligned with respect to $f_H$.

Proof. The proof is similar to the proof of Lemma 1 in Hébert-Johnson et al. [11]. Consider all $S_{h, \lambda}(b)$ such that $|S_{h, \lambda}(b)| < \alpha \lambda |S_h|$. By the $\lambda$-discretization, there are at most $1/\lambda$ such sets; thus, the cardinality of their union is at most $\frac{1}{\lambda}\, \alpha \lambda |S_h| = \alpha |S_h|$. Hence, for all $h \in \mathcal{H}$, there exists a subset $\tilde{S}_h \subseteq S_h$ with $|\tilde{S}_h| \geq (1 - \alpha)|S_h|$ such that, for all $h', h'' \in \mathcal{H}$, with $h' \leq h''$, and all $b', b'' \in \Lambda[0, 1]$, with $b' \leq b''$, it holds that
$$P(Y = 1 \mid (X, H) \in S_{h', \lambda}(b') \cap \tilde{S}_{h'}) - P(Y = 1 \mid (X, H) \in S_{h'', \lambda}(b'') \cap \tilde{S}_{h''}) \leq \alpha. \qquad (55)$$
The $\lambda$-discretization sets all values of $(x, h) \in S_{h', \lambda}(b')$ to $f_{B, \lambda}(x, h) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b')]$. Note that, for $(x, h) \in S_{h', \lambda}(b')$, $f_{B, \lambda}(x, h) \in \lambda(b')$ and, for $(x, h) \in S_{h'', \lambda}(b'')$, $f_{B, \lambda}(x, h) \in \lambda(b'')$, so it still holds that $\mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b')] \leq \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b'')]$. Thus, using Eq. 55, we have that
$$P\big(Y = 1 \mid f_{B, \lambda}(X, H) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b')],\ (X, H) \in \tilde{S}_{h'}\big) - P\big(Y = 1 \mid f_{B, \lambda}(X, H) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b'')],\ (X, H) \in \tilde{S}_{h''}\big) \leq \alpha. \qquad (56)$$
This concludes the proof.

Finally, using Theorems 12 and 13, it readily follows that, given a parameter $\alpha$, the discretized confidence function $f_{B, \lambda}$ returned by Algorithm 1 satisfies $(2\alpha + \lambda)$-aligned calibration with respect to $f_H$.

A.6 Proof of Theorem 9

We structure the proof in three parts. We first explain the calibration guarantee that UMD provides and how it relates to human-aligned calibration. Then, we derive a lower bound on the size of the subsets $D \cap S_h$ so that the discretized confidence function $f_{B, \lambda}$ satisfies $\alpha$-aligned calibration with respect to $f_H$ with high probability. Finally, building on this result, we derive an upper bound on the number of samples $|D|$ needed so that $f_{B, \lambda}$ satisfies $\alpha$-aligned calibration with high probability, as long as there exists $\gamma > 0$ such that $P((X, H) \in S_h) \geq \gamma$ for all $h \in \mathcal{H}$.

Conditional calibration implies human-aligned calibration. Running UMD on a dataset $D \in (\mathcal{Z} \times \mathcal{Y})^n$, where each datapoint is sampled from $P_M$, guarantees $(\alpha, \xi)$-conditional calibration, a PAC-style calibration guarantee [12]. Given a dataset $D$, a confidence function $f_B$ satisfies $(\alpha, \xi)$-conditional calibration if, with probability at least $1 - \xi$ over the randomness in $D$,
$$\forall b \in [0, 1], \quad |P(Y = 1 \mid f_B(X, H) = b) - b| \leq \alpha.$$
This stands in contrast to the definition of $\alpha$-calibration, which requires only that the confidence $f_B(X, H)$ is at most $\alpha$ away from the true probability for a $1 - \alpha$ fraction of $\mathcal{Z}$. Similarly, using a union bound over all $h \in \mathcal{H}$, $(\alpha/2, \xi/|\mathcal{H}|)$-conditional calibration of $f_B$ on each $S_h$, $h \in \mathcal{H}$, implies that, with probability at least $1 - \xi$ over the randomness in $D$, $f_B$ satisfies that
$$\forall h \in \mathcal{H},\ \forall b \in [0, 1], \quad |P(Y = 1 \mid f_B(X, H) = b, H = h) - b| \leq \alpha/2. \qquad (57)$$
Hence, analogously to the proof of Theorem 8, this implies that, with probability at least $1 - \xi$ over the randomness in $D$, $f_B$ also satisfies that
$$\forall h, h' \in \mathcal{H},\ h \leq h',\ \forall b, b' \in G,\ b \leq b', \quad P(Y = 1 \mid f_B(X, H) = b, H = h) \leq P(Y = 1 \mid f_B(X, H) = b', H = h') + \alpha. \qquad (58)$$
In summary, from Eqs. 57 and 58, we can conclude that $(\alpha/2, \xi/|\mathcal{H}|)$-conditional calibration of $f_B$ on each $S_h$, $h \in \mathcal{H}$, implies that, with probability at least $1 - \xi$, $f_B$ satisfies $\alpha$-aligned calibration, where, for all $h \in \mathcal{H}$, we have that $\tilde{S}_h = S_h$.

Lower bound on $|D \cap S_h|$ to achieve conditional calibration with UMD. Running UMD on each partition $D \cap S_h$ of $D$ induced by $h \in \mathcal{H}$ achieves $(\alpha/2, \xi/|\mathcal{H}|)$-conditional calibration as long as each subset $D \cap S_h$ of the data is large enough. More specifically, the following lower bound on the size of the subsets $D \cap S_h$ readily follows from Theorem 3 in Gupta et al. [12].

Lemma 4. The discretized confidence function $f_{B, \lambda}$ returned by $|\mathcal{H}|$ instances of UMD, one per $S_h$, is $(\alpha/2, \xi/|\mathcal{H}|)$-conditionally calibrated on each $S_h$ for any $\xi \in (0, 1)$ if
$$|D \cap S_h| \geq n_{\min} := \frac{1}{\lambda}\left(\frac{2}{\alpha^2}\log\frac{2|\mathcal{H}|}{\lambda\,\xi} + 1\right). \qquad (59)$$

Proof. Let $B$ denote the number of bins used by UMD.
Theorem 3 in Gupta et al. [12] states that, if $f_B(X, H)$ is absolutely continuous with respect to the Lebesgue measure$^{11}$ and $|D \cap S_h| \geq 2B$, then the discretized confidence function output by UMD is $(\epsilon, \xi')$-conditionally calibrated for any $\xi' \in (0, 1)$ and
$$\epsilon = \sqrt{\frac{\log(2B/\xi')}{2\left(\lfloor |D \cap S_h| / B \rfloor - 1\right)}}. \qquad (60)$$
Then, for a given $\alpha$, setting $\epsilon = \alpha/2$, $B = 1/\lambda$ and $\xi' = \xi/|\mathcal{H}|$, we can solve Eq. 60 for the lower bound $|D \cap S_h| \geq n_{\min}$, with $n_{\min}$ as defined in Eq. 59.

Upper bound on $|D|$ to achieve conditional calibration with UMD. Suppose that $P((X, H) \in S_h) \geq \gamma$ for all $h \in \mathcal{H}$. When $|\mathcal{H}| \geq 2$, we give an upper bound on the number of samples $|D|$ needed so that, with high probability, $|D \cap S_h| \geq n_{\min}$ for all $h \in \mathcal{H}$. In the process of sampling $D \in (\mathcal{Z} \times \mathcal{Y})^n$ from $P_M$, let $R^{(h)}_i = 1$ denote the event that the $i$-th datapoint $(x_i, h_i, y_i)$ has confidence value $h$, i.e., $h_i = h$. Then, we can express $|D \cap S_h|$ in terms of the random variable $R^{(h)}$, defined as
$$R^{(h)} = \sum_{i = 1}^n R^{(h)}_i. \qquad (61)$$
Since each $R^{(h)}_i$ is a Bernoulli random variable with $P(R^{(h)}_i = 1) = P((X, H) \in S_h)$, the expected value of $R^{(h)}$ is $\mu^{(h)} := \mathbb{E}[R^{(h)}] = P((X, H) \in S_h) \cdot |D| \geq \gamma\, |D|$. Let $|D| = \lceil 2 |\mathcal{H}| \log(2/\xi) \rceil \frac{1}{\gamma}\, n_{\min}$ and observe that, in this case,
$$P\big(R^{(h)} \leq n_{\min}\big) \leq P\Big(R^{(h)} \leq \frac{\gamma}{2 |\mathcal{H}| \log(2/\xi)}\, |D|\Big).$$
For $|\mathcal{H}| \geq 2$ and $\xi \in (0, 1)$, we have $\frac{1}{2 |\mathcal{H}| \log(2/\xi)} \in (0, 1)$ and we can use a variation of the Chernoff bound to show that
$$P\big(R^{(h)} \leq n_{\min}\big) \leq P\Big(R^{(h)} \leq \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\, \mu^{(h)}\Big) \leq e^{-\frac{\mu^{(h)}}{2}\left(1 - \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\right)^2} \leq e^{-|\mathcal{H}| \log(2/\xi)\, n_{\min} \left(1 - \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\right)^2},$$
where the first and last inequalities result from using $\mu^{(h)} \geq \gamma |D| \geq 2 |\mathcal{H}| \log(2/\xi)\, n_{\min}$. We can now use a union bound to obtain a lower bound on the probability that, for all $h \in \mathcal{H}$, $|D \cap S_h| \geq n_{\min}$, i.e.,
$$P\big(\exists\, h \in \mathcal{H} : |D \cap S_h| < n_{\min}\big) \leq |\mathcal{H}|\, e^{-|\mathcal{H}| \log(2/\xi)\, n_{\min} \left(1 - \frac{1}{2 |\mathcal{H}| \log(2/\xi)}\right)^2}. \qquad (62)$$
One can verify that, for $|\mathcal{H}| \geq 2$ and $n_{\min} \geq 1$, the right-hand side of Eq. 62 is at most $\xi/2$, i.e., $P\big(\exists\, h \in \mathcal{H} : |D \cap S_h| < n_{\min}\big) \leq \xi/2$. Hence, if $|D| = \lceil 2 |\mathcal{H}| \log(2/\xi) \rceil \frac{1}{\gamma}\, n_{\min}$, then, for all $h \in \mathcal{H}$, $|D \cap S_h| \geq n_{\min}$ with probability at least $1 - \xi/2$.

$^{11}$If $f_B$ is not absolutely continuous with respect to the Lebesgue measure (or, equivalently, if $f_B$ does not have a probability density function), a randomization trick can be used to ensure that the results of the theorem hold.

Combining this result and Lemma 4, we have that the discretized confidence function $f_{B, \lambda}$ returned by $|\mathcal{H}|$ instances of UMD, one per $S_h$, is $(\alpha/2, \xi/(2|\mathcal{H}|))$-conditionally calibrated on each $S_h$ with probability at least $1 - \xi/2$ for any $\xi \in (0, 1)$ if $|D| = \lceil 2 |\mathcal{H}| \log(2/\xi) \rceil \frac{1}{\gamma}\, n_{\min}$. Finally, using a union bound, we can conclude that $f_{B, \lambda}$ achieves $\alpha$-aligned calibration with respect to $f_H$ with probability at least $1 - \xi$ from
$$|D| = O\left(\frac{|\mathcal{H}|\, \log(2/\xi)\, \log\big(|\mathcal{H}| / (\xi \lambda)\big)}{\gamma\, \lambda\, \alpha^2}\right)$$
samples. This concludes the proof.
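The guarantee above is stated for UMD run separately on each $S_h$. The sketch below shows the overall shape of such a group-wise recalibration scheme, using plain uniform-mass histogram binning ($B = 1/\lambda$ bins per group) as a simplified stand-in for UMD from Gupta et al. [12]. It omits, among other details, the randomization trick mentioned in footnote 11, so it should be read as an illustration under these simplifying assumptions rather than an implementation of the analyzed procedure.

```python
import numpy as np

def fit_groupwise_binning(f_b, h, y, n_bins=10):
    """For each group h, compute uniform-mass bin edges over f_B and the empirical P(Y=1) per bin."""
    f_b, h, y = np.asarray(f_b), np.asarray(h), np.asarray(y)
    models = {}
    for hv in np.unique(h):
        m = (h == hv)
        edges = np.quantile(f_b[m], np.linspace(0, 1, n_bins + 1))          # uniform-mass bin edges
        idx = np.clip(np.searchsorted(edges, f_b[m], side="right") - 1, 0, n_bins - 1)
        rates = np.array([y[m][idx == j].mean() if np.any(idx == j) else 0.5
                          for j in range(n_bins)])
        models[hv] = (edges, rates)
    return models

def apply_groupwise_binning(models, f_b, h):
    """Replace each confidence value by the recalibrated value of its group-specific bin."""
    f_b, h = np.asarray(f_b, dtype=float), np.asarray(h)
    out = f_b.copy()                                          # unseen groups keep the raw confidence
    for hv, (edges, rates) in models.items():
        m = (h == hv)
        idx = np.clip(np.searchsorted(edges, f_b[m], side="right") - 1, 0, len(rates) - 1)
        out[m] = rates[idx]
    return out
```

Fitting one binning model per value of the decision maker's confidence mirrors the "one UMD instance per $S_h$" structure of the proof; the per-group sample-size requirement in Lemma 4 is exactly what keeps each group's bins well populated.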
B Multicalibration Algorithm

In this section, we give a high-level description of the post-processing algorithm for multicalibration introduced by Hébert-Johnson et al. [11]. The algorithm works with a discretization of $[0, 1]$ into uniformly sized bins of size $\lambda$, for some $\lambda > 0$. Formally, the $\lambda$-discretization of $[0, 1]$ is defined as follows:

Definition 14 ($\lambda$-discretization [11]). Let $\lambda > 0$. The $\lambda$-discretization of $[0, 1]$, denoted by $\Lambda[0, 1] = \{\frac{\lambda}{2}, \frac{3\lambda}{2}, \ldots, 1 - \frac{\lambda}{2}\}$, is the set of $1/\lambda$ evenly spaced real values over $[0, 1]$. For $b \in \Lambda[0, 1]$, let
$$\lambda(b) = [b - \lambda/2, b + \lambda/2) \qquad (64)$$
be the $\lambda$-interval centered around $b$ (except for the final interval, which is $[1 - \lambda, 1]$).

The algorithm starts by partitioning each subspace $S_h$ into $1/\lambda$ groups $S_{h, \lambda}(b) = \{(x, h) \in S_h \mid f_B(x, h) \in \lambda(b)\}$, with $b \in \Lambda[0, 1]$. Then, it repeatedly looks for a large enough group $S_{h, \lambda}(b)$ such that the absolute difference between the average confidence value $\mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h, \lambda}(b)]$ and the probability $P(Y = 1 \mid (X, H) \in S_{h, \lambda}(b))$ is larger than $\alpha$ and, if it finds one, it updates the confidence value $f_B(x, h)$ of each $(x, h) \in S_{h, \lambda}(b)$ by this difference. Once the algorithm cannot find any such group anymore, it returns a discretized confidence function $f_{B, \lambda}(x, h) = \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b)]$, with $b \in \Lambda[0, 1]$ such that $f_B(x, h) \in \lambda(b)$, which is guaranteed to satisfy $(\alpha + \lambda)$-multicalibration. Algorithm 1 provides a pseudocode implementation of the overall algorithm. Within the implementation, it is worth noting that the expectations and probabilities can be estimated with fresh samples from the distribution or from a fixed dataset using tools from differential privacy and adaptive data analysis, as discussed in Hébert-Johnson et al. [11].

Algorithm 1 Post-processing algorithm for $(\alpha + \lambda)$-multicalibration
1: Input: confidence function $f_B$, parameters $\alpha, \lambda > 0$
2: Output: confidence function $f_{B, \lambda}$
3: repeat
4:   updated ← false
5:   for $S_h \in \mathcal{C}$ and $b \in \Lambda[0, 1]$ do
6:     $S_{h, \lambda}(b) \leftarrow S_h \cap \{(x, h) \in \mathcal{Z} \mid f_B(x, h) \in \lambda(b)\}$
7:     if $P((X, H) \in S_{h, \lambda}(b)) < \alpha \lambda\, P((X, H) \in S_h)$ then
8:       continue
9:     $b_{h, \lambda(b)} \leftarrow \mathbb{E}[f_B(X, H) \mid (X, H) \in S_{h, \lambda}(b)]$
10:    $r_{h, \lambda(b)} \leftarrow P(Y = 1 \mid (X, H) \in S_{h, \lambda}(b))$
11:    if $|r_{h, \lambda(b)} - b_{h, \lambda(b)}| > \alpha$ then
12:      updated ← true
13:      for $(x, h) \in S_{h, \lambda}(b)$ do
14:        $f_B(x, h) \leftarrow f_B(x, h) + (r_{h, \lambda(b)} - b_{h, \lambda(b)})$ {project into $[0, 1]$ if necessary}
15: until updated = false
16: for $b \in \Lambda[0, 1]$ do
17:   $b_{\lambda(b)} \leftarrow \mathbb{E}[f_B(X, H) \mid f_B(X, H) \in \lambda(b)]$
18:   for $(x, h) \in \mathcal{Z}$ such that $f_B(x, h) \in \lambda(b)$ do
19:     $f_{B, \lambda}(x, h) \leftarrow b_{\lambda(b)}$
20: return $f_{B, \lambda}$

C Additional Details about the Experiments

Transformation of confidence values. In the Human-AI Interactions dataset, the AI model is a simple statistical model in which $b$ is just a noisy average of the confidence $h$ of an independent set of ca. 50 human labelers on each task instance. Moreover, the confidence values were originally recorded on a scale of $[-1, 1]$, where $1$ means complete certainty in the correct true label and $-1$ means complete certainty in the incorrect label. To better match our theoretical framework, we transform all confidence values to a scale of $[0, 1]$, where $1$ means complete certainty that the true label is $y = 1$ and $0$ means complete certainty that the true label is $y = 0$. More formally, let $\hat{b}, \hat{h}, \hat{h}_{+AI} \in [-1, 1]$ be the original confidence values in the dataset. Then, we obtain $b \in [0, 1]$ via the following transformation:
$$b = \begin{cases} (\hat{b} + 1)/2 & \text{if } y = 1 \\ 1 - (\hat{b} + 1)/2 & \text{if } y = 0, \end{cases}$$
and analogously for $h$ and $h_{+AI}$.

Comparing decision policies $\pi_B$, $\pi_H$ and $\pi_{H+AI}$. Figure 6 shows the ROC curves for the decision policies $\pi_B$, $\pi_H$ and $\pi_{H+AI}$ in each of the four tasks in the Human-AI Interactions dataset.

Figure 6: ROC curves (true positive rate vs. false positive rate) for the decision policies $\pi_B$, $\pi_H$ and $\pi_{H+AI}$ in each of the four tasks.
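For completeness, the following small sketch implements the transformation of confidence values described above. The function and variable names are our own; the dataset's actual column names may differ.

```python
import numpy as np

def to_unit_confidence(raw_conf, y):
    """Map a confidence on the [-1, 1] scale (certainty in the correct vs. incorrect label)
    to the [0, 1] scale (certainty that the true label is 1), given the true label y."""
    raw_conf, y = np.asarray(raw_conf, dtype=float), np.asarray(y)
    conf_in_correct = (raw_conf + 1.0) / 2.0        # certainty in the correct label, rescaled to [0, 1]
    return np.where(y == 1, conf_in_correct, 1.0 - conf_in_correct)

# Example: full certainty in the correct label maps to 1 when y = 1 and to 0 when y = 0,
# while full certainty in the incorrect label when y = 0 maps to 1.
print(to_unit_confidence([1.0, 1.0, -1.0], [1, 0, 0]))      # -> [1. 0. 1.]
```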