# Uncalibrated Models Can Improve Human-AI Collaboration

Kailas Vodrahalli, Stanford University, kailasv@stanford.edu
Tobias Gerstenberg, Stanford University, gerstenberg@stanford.edu
James Zou, Stanford University, jamesz@stanford.edu

Abstract

In many practical applications of AI, an AI model is used as a decision aid for human users. The AI provides advice that a human (sometimes) incorporates into their decision-making process. The AI advice is often presented with some measure of confidence that the human can use to calibrate how much they depend on or trust the advice. In this paper, we present an initial exploration suggesting that showing AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human-AI interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks dealing with images, text, and tabular data, involving hundreds of human participants. We further support our findings with simulation analysis. Our findings suggest the importance of jointly optimizing the human-AI system as opposed to the standard paradigm of optimizing the AI model alone.

1 Introduction

In safety-critical settings like medicine, AI is often integrated as interactive feedback, allowing a human to decide when and to what extent the AI's advice is utilized [29]. For example, in medical diagnosis tasks, this places the AI in a similar category as lab tests or other exams a doctor may order to aid in diagnosis. This mitigates the often black-box nature of AI algorithms, which limits a user's trust in and usage of AI [6, 24, 34, 18]. Typically, these algorithms are designed and optimized to minimize loss using standard training objectives independent of the human end user. In this paper, we question this premise and ask whether a joint optimization of the entire human-AI system is possible. In particular, we revisit the conventional wisdom that models with calibrated confidence enable more accurate transfer of prediction uncertainty between models and/or humans [8]. We propose calibrating a model for human use, with the intuition that humans are often unable to integrate calibration information correctly and so may benefit from uncalibrated information. This approach is complementary to existing ideas on how to help humans utilize AI advice (e.g., through some form of explanation accompanying the advice or through feedback on model/human performance) [36, 14].

We draw on prior work modeling human predictions [30] and assume black-box access to an AI algorithm that provides advice. From data collected on thousands of human interactions, we fit a model for how humans incorporate (or ignore) AI advice. Using this learned human behavior model, we optimize a monotonic transformation that modifies the AI's reported confidence to maximize human-AI system performance (measured as the human's final accuracy). We demonstrate first in simulations and subsequently with real-world data using crowd-sourced human study participants
that our modified AI advice results in higher overall performance. These results suggest there is merit to jointly optimizing human-AI systems and that an optimal AI algorithm may not be optimal for human use.

Our contributions We propose and empirically validate a framework for optimizing an AI algorithm with respect to the human end user, resulting in human-calibrated AI: AI whose reported confidence is calibrated for human use. We only require black-box access to the AI. Our framework relies on a model for human behavior, which we create using a large collection of empirical data on human interaction with AI advice. To the best of our knowledge, these are some of the first results using an empirically validated model of human behavior to optimize AI performance in a collaborative human-AI system. We would like to emphasize, however, that these methods are not currently suitable for practical use.

Ethical considerations One of the main conclusions of our study is that human users do not incorporate confidence in the optimal way (i.e., in the way confidence is modeled in statistics). In an attempt to correct for this, we present both empirical and simulation-based evidence that modifying the presented AI advice improves overall performance. One potential question about our approach is: does modifying the AI's confidence constitute misleading the user? We agree that, in our experiments, the AI can mislead the user. However, our goal here is not to propose a method that should be used in practical applications, but rather to highlight the importance of modeling human interaction when designing AI for human use.

Related works The experimental setup we use comes from the advice utilization community. Here, studies often use the judge-advisor system (JAS) framework for measuring and comparing the effect of different forms of advice [28]. In many of these studies, algorithmic advice is manipulated to elicit various responses from humans. We take this a step further in our paper by manipulating the advice to improve model calibration with respect to the user and thereby improve performance. Previous work has studied the effects of peer vs. expert advice [15], automated vs. human advice [22, 5, 15, 21], task difficulty [7], and advice confidence [25, 26]. Some work has also sought to characterize the perceptions people have of an AI when making decisions with AI advice and how these perceptions affect trust in and utilization of the AI [13]. Of particular relevance to our work is [25], which provides evidence that uncalibrated, confident advice is more utilized by humans in certain settings. We also leverage a proposed model for human behavior from [30].

The conventional approach to human-AI collaboration is to ensure model and human calibration [8, 3] and to make the AI's predictions explainable [31, 27]. These explanations can affect human understanding of AI confidence [12, 35, 36, 14]. It is also standard practice to optimize the AI in isolation (i.e., to maximize the AI's performance). However, recent work suggests that the objectives used to optimize the AI do not adequately consider the human-AI team. Alternative approaches include learning to defer, where a classifier determines when expert input is required [16, 19]; learning to complement, where the AI is optimized to perform well only on the tasks that humans struggle with [33]; and methods that optimize objectives which assign costs to requesting AI advice and/or human expert input [1].
Our work is most closely related to [1], which also emphasizes the importance of confident AI predictions for human decisions. Our approach is complementary to these prior works and contains several novel aspects: we model human behavior using empirical data, demonstrate that this human model allows us to optimize the AI advice, and demonstrate both in simulation and empirically that our modified advice improves overall system performance. Furthermore, we only require black-box access to the AI advice. In relation to [1], we further differentiate our work in that the model for human interaction is different: we give advice after the user makes an initial assessment.

2 Experimental setup

Our goal is to augment the performance of a human-AI system by optimizing the AI for use by a human. To make this goal tractable, we assume (1) that we have a model for human interaction with AI advice, and (2) that we have black-box access to the AI algorithm producing the advice.

Human-AI interaction The human-AI interaction involves two stages. First, a human is shown a task and completes it. AI advice is then presented, and the human is allowed to modify their response. As we measure performance before and after receiving advice, this design is conducive to studying the effect of advice and so is commonly used in the literature [22, 30, 17].

Data collection and tasks We collected data from crowd-sourced participants on a diverse set of tasks that all share the same basic design. We focused on binary prediction tasks where humans respond on a continuous scale. The continuous scale allows for a richer analysis of the user's prediction than the simple binary prediction label alone would permit. Participants were shown a sliding scale with each end indicating a different binary response. They were instructed to move a pointer on the scale to the location that indicated their response and their confidence, with the midpoint of the scale indicating an uncertain response. Participants submitted an initial response; subsequently, the advice was indicated with a visual marker on the scale, and participants were allowed to adjust their initial response. Participants were incentivized so that higher confidence in correct and incorrect answers increased and decreased their monetary reward, respectively.

We collected data on a set of four tasks spanning diverse data modalities. These tasks were originally proposed in prior work [30] and were designed to be accessible to the general public, allowing the use of crowd-sourced study participants for obtaining empirical data. See below for a brief description of the tasks we used; examples of task instances from each of the four tasks are shown in Appendix B.2. The data collected is summarized in Table 1. Each task consists of 32 binary questions. For each task, the 32 questions were carefully selected to cover a range of difficulty. More details on the distribution of task difficulty are presented in Appendix B.3. All participants received the same set of questions, though the order was randomized for each participant. All participants also received the same advice, although a small amount of noise was randomly added for each participant. The advice had an accuracy of 80% for each task; participants were informed of this accuracy. However, no feedback on AI or human accuracy on individual questions was provided during the experiment.
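The next paragraph checks that this advice is roughly calibrated using the expected calibration error (ECE) [20]. For reference, here is a minimal sketch of the standard binned ECE computation (assuming NumPy; the bin count and bin edges are illustrative choices and may differ from the computation behind the reported value):

```python
import numpy as np

def expected_calibration_error(advice_conf, advice_correct, n_bins=10):
    """Binned ECE: group predictions by confidence in the advised label, then take the
    frequency-weighted average of |empirical accuracy - mean confidence| per bin.

    advice_conf: confidence of the advised (binary) label, values in [0.5, 1].
    advice_correct: 1 if the advised label was correct, else 0.
    """
    conf = np.asarray(advice_conf, dtype=float)
    correct = np.asarray(advice_correct, dtype=float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    edges[0] -= 1e-9  # include points exactly at 0.5 in the first bin
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```

An ECE of 0 would indicate perfect calibration; larger values indicate a larger average gap between stated confidence and empirical accuracy.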
Since the datasets we are using are relatively small and curated primarily for human use rather than AI training, we opted to use aggregated human responses as a proxy for an AI model. In particular, the advice was constructed by averaging the responses of a prior group of human participants who answered the questions without receiving any advice. This allowed us to ensure the advice given was reasonable, in the sense that the advice is generally less confident on difficult tasks and more confident on easy tasks. We checked the calibration of this advice by computing the expected calibration error (ECE) [20]. The ECE for the AI advice is 0.074, indicating the advice is roughly calibrated (0 indicates perfect calibration).

We recruited 49-79 participants for each task, resulting in several thousand datapoints (each question is one datapoint). Participants were US residents, roughly 50% were female, and the average age was approximately 30 years. Despite limitations in the age distribution and geographic location of participants, a survey question we asked indicated a varied range of AI familiarity among our participants. We followed standard practice in collecting data. Our study protocol was approved by our institution's IRB. The data collection process was low risk, and we ensured samples contained diverse participants across age, sex, and ethnicity. We compensated participants at approximately $10.00 per hour (varying with bonus pay). The total cost was around $600 (Section 2) + $750 (Section 5.2).

We have four types of data with corresponding user tasks.

Art dataset (Image data) Contains images of paintings from four art periods: Renaissance, Baroque, Romanticism, and Modern Art. Participants were asked to determine which art period a painting is from, given a binary choice.

Cities dataset (Image data) Contains images from four major US cities: San Francisco, Los Angeles, Chicago, and New York City. The task is to identify which city an image is from, given a binary choice.

Sarcasm dataset (Text data) A subset of the Reddit sarcasm dataset [10], which includes text snippets from the discussion forum website Reddit. Participants were asked to detect whether sarcasm was present in a given text snippet.

Census dataset (Tabular data) Comes from US census data [32]. The task is to identify an individual's income level given some of their demographic information (e.g., state of residence, age, education level).

Figure 1: Activation-integration model for human behavior.

Table 1: Summary of data used for fitting our human behavior model. This data is publicly available at https://github.com/kailas-v/human-ai-interactions.

| Task | # Participants | # Datapoints | Activation Rate | Accuracy (before advice) | Accuracy (after advice) |
| --- | --- | --- | --- | --- | --- |
| Art | 68 | 2176 | 0.523 | 63.7% | 75.5% |
| Cities | 79 | 2528 | 0.557 | 70.7% | 77.1% |
| Sarcasm | 49 | 1567 | 0.347 | 72.7% | 76.1% |
| Census | 50 | 1600 | 0.475 | 70.1% | 73.6% |

3 Human behavior model

Table 1 provides a summary of the data collected for developing our human behavior model. We find that humans actually changed their response after receiving advice roughly 50% of the time, with the exact numbers shown in the Activation Rate column of Table 1. When a person does change their response, we call this person activated. Specifically, a person is activated if their final response changes by at least a small amount.
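To make the notion of activation concrete, the activation rates in Table 1 can be computed from paired responses roughly as follows (a minimal sketch, assuming NumPy; the threshold `eps` is an illustrative stand-in for the paper's "small amount"):

```python
import numpy as np

def activation_rate(initial, final, eps=0.05):
    """Fraction of participant-question pairs whose response changed by more than eps.

    initial, final: responses on the normalized [-1, 1] scale, one entry per
    participant-question pair. eps is an assumed threshold; the paper only
    specifies that the response must change "by at least a small amount".
    """
    initial = np.asarray(initial, dtype=float)
    final = np.asarray(final, dtype=float)
    return float(np.mean(np.abs(final - initial) > eps))
```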
This observation motivates a simple, two-stage model for human behavior: a human first decides whether to modify their response (the "activation" stage), and subsequently, if they were activated, they modify their response (the "integration" stage). A diagram of this model is shown in Figure 1. This two-stage model is consistent with previously proposed models from the psychology and HCI literature [9, 30].

We fit functions for each of the two stages: an activation model that predicts how likely it is that a person is activated, and an integration model that predicts how the person modifies their response, assuming they are activated. Taken together, these let us model human behavior in the two-stage human-AI system we use to run our experiments. We define this model's behavior more rigorously in Section 4. Our fitted models will only be used to optimize our advice-modifying function and will not be used at test time. As such, we can utilize demographic information about the participants to fit the model, even though such information may not be available during a real-world deployment. We opt to fit a small neural network for each of these stages, as we require a differentiable model (for easy optimization) with moderately more complexity than a linear model.

3.1 Activation model

Our activation model takes 12 input features extracted from a single person-task instance interaction. It outputs a probability for whether the person will be activated on the given task instance. The input features are based on the person's initial response, the AI advice, and several demographic features including age and sex. We do not encode the initial response or AI advice directly, as the label (the left or right slider location of the correct answer) is randomized for each participant. Instead, we use the response and advice confidence, as well as a binary feature that encodes whether the response and advice agreed on the label. This implies a symmetry in the model's predictions. Please see Appendix D.1 for a description of all the features used.

We use a 3-layer neural network with ReLU non-linearities for the activation model. We also tried other simple algorithms for modeling the activation behavior of humans, such as linear models, but found the neural networks to perform better. We trained the network using binary cross-entropy loss with an Adam optimizer [11] and stopped training when validation loss increased, as is standard. The activation model achieves a ROC-AUC > 0.81 on held-out test participants, suggesting reasonable performance despite the high variability in our dataset.

Figure 2: Our trained model for human behavior: (a) Partial dependence plot (PDP) measuring the effect of the advice on activation. (b) Heatmap of the average behavior of our integration model when we vary the unmodified AI advice and the human's initial response.

In Figure 2a, we show a partial dependence plot for the learned activation model. This plot shows the average behavior (across our entire test dataset) of the activation model when we fix the AI advice to a certain value (the x-value) for all data points. The AI advice can be positive (correct advice) or negative (incorrect advice), while the magnitude corresponds to the advice confidence. We note that (1) there is a base level of activation around 50%, and (2) our activation model predicts people will be most activated when the advice is confidently incorrect.
This occurs because (a) most people in our dataset answer correctly on average, and seeing confident advice for the opposite label tends to make people change their initial response (e.g., by decreasing their confidence even if the predicted label does not change); and (b) confident advice that agrees with a person's initial response (whether correct or incorrect) tends to increase their confidence. We also observe that (3) the activation rate increases when advice is confident and correct, for the same reason noted in (b). The activation rate is lower for correct advice than for incorrect advice because most people in our dataset are, on average, correct, and agreement tends to elicit a smaller response.

3.2 Integration model

Our integration model is similar to our activation model but is trained for a different target. In particular, we use the same input features, model architecture, and training procedure. The integration model outputs a prediction for $\mathrm{sign}(r_1)(r_2 - r_1)$, where $r_1, r_2 \in [-1, 1]$ are the initial and final responses, respectively. Note that $r_1$ and $r_2$ are probabilities normalized to the $[-1, 1]$ scale. Training minimizes the RMSE between the predicted change and the actual change. The optimization is performed only on the subset of the dataset that was actually activated (roughly 50% of the data). Our integration model has reasonable performance, with an RMSE of 0.25 (the maximum possible absolute error is 2) and an $R^2$ of 0.73 on test data.

In Figure 2b, we plot a heatmap of the fitted integration model's behavior. We plot the average output of our integration model across our test data split when the person's initial response (x-axis) and the AI advice (y-axis) are fixed to a certain value for all datapoints. The sign and magnitude of the initial response and AI advice correspond to correctness (negative is incorrect) and confidence. The integration model prediction can be positive (increasing confidence in the initial predicted label) or negative (decreasing confidence in the initial predicted label, or changing the predicted label).

We observe three features in the heatmap: (1) The heatmap is symmetric across quadrants 1, 3 and 2, 4. This is by design: since the model does not use the initial response and AI advice values directly (it uses their magnitudes, i.e., confidences, along with a binary feature indicating whether they agree on the label), the integration model can only predict a change relative to the sign of the initial response. (2) The output has the largest magnitude when the advice is confident and opposite the person's initial response. (3) If the advice and the person's initial response share the same label, the integration model output is largest when the person's initial response has low confidence and the advice has high confidence. These observations confirm that our integration model behaves reasonably.

4 Modifying the AI advice

Figure 3: Simulation results. (a) Fitted modified advice $g_{\alpha^*,\beta^*}$ vs. original unmodified advice. We show results across synthetically varied average advice accuracy in the training data. The curve labeled "Advice given" is what we actually show people in our experiments. (b) Optimal human-calibrated AI output for the setting described in Section 4.2. The heatmap shows the change in advice relative to calibrated advice: red and blue values indicate where the AI should be under- and overconfident, respectively, in order to achieve the best human-AI system output.
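Section 4 treats the fitted activation and integration models as fixed, differentiable components through which the advice transformation is optimized. As a rough sketch of what these two stages could look like in code (assuming PyTorch; the 3-layer ReLU architecture follows Sections 3.1-3.2, while the hidden width, output parametrization, and feature handling are illustrative assumptions):

```python
import torch
import torch.nn as nn

class StageMLP(nn.Module):
    """3-layer ReLU MLP used for both stages; the width and feature count are illustrative."""
    def __init__(self, n_features=12, hidden=32, sigmoid_out=True):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        self.sigmoid_out = sigmoid_out  # probability output for the activation stage

    def forward(self, x):
        out = self.body(x)
        return torch.sigmoid(out) if self.sigmoid_out else out

# Activation stage: P(person changes their response); trained with binary cross-entropy.
f_activate = StageMLP(sigmoid_out=True)
# Integration stage: predicted signed change sign(r1) * (r2 - r1); trained with
# squared error (RMSE) on the activated subset only.
f_integrate = StageMLP(sigmoid_out=False)
```

In the paper's setup, both stages consume the same 12 engineered features (response and advice confidences, an agreement flag, and demographics) rather than raw responses; the sketch above leaves that feature construction implicit.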
Here we discuss how we modify the raw AI advice output. First, let's establish some notation. The notation described below is in reference to a single datapoint; later, we will add the subscript $i$ to index across datapoints. Let $A \in \mathbb{R}$ denote the inverse sigmoid of the advice presented to participants (i.e., the unmodified AI advice is presented as $\sigma(A)$). We work with $A$ rather than $\sigma(A)$ here, as we aim to replace the $\sigma$ transform with an optimized function that improves human-AI performance. Let $g : \mathbb{R} \to [0, 1]$ be a function (which we will optimize) that maps the AI advice to the probability that is shown to the user. Note that for the baseline, unmodified advice, we set $g$ to be the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. Let $r_1, r_2$ be the initial and final responses from a participant ($r_2$ is a function of $r_1$, $A$, and other features), and let $y \in \{0, 1\}$ denote the true label. Let $u$ represent the demographic features of a person, and let $x = (r_1, A, u)$ be the input feature vector to $f_{\text{activate}}$ and $f_{\text{integrate}}$, the activation and integration models for human behavior described in Section 3. To simplify notation, we let $f_{\text{integrate}}$ denote the function that outputs the predicted $r_2$ for an activated person (rather than the delta from $r_1$ that we optimize for). Finally, let $f_{\text{HB}}$ represent the entire model for human behavior. For convenience of notation, we assume all feature preprocessing is hidden inside $f_{\text{activate}}$ and $f_{\text{integrate}}$.

4.1 Optimizing for human-AI performance

We consider $g$ of the form
$$g(A) = g_{\alpha,\beta}(A) = \frac{1}{1 + e^{-\operatorname{sign}(A)(\alpha|A| + \beta)}}, \tag{1}$$
where $\alpha, \beta \in \mathbb{R}_{\geq 0}$. Note that setting $(\alpha, \beta) = (1, 0)$ results in $g_{1,0} = \sigma$, the sigmoid function used to produce the unmodified advice. Note also that we do not perform an exactly linear transformation on $A$: the $|A|$ and $\operatorname{sign}(A)$ terms are included to ensure that $g$ does not change the recommended label, as they make the function symmetric around $A = 0$. $\alpha$ modulates the rate at which the presented confidence increases with AI confidence, while $\beta$ adjusts the minimum confidence level of the presented advice. For example, if $\beta = 1$, the presented confidence is $> \frac{e}{1+e} > 0.73$ (relative to the chosen label). In general, we could use other differentiable functions for $g$; we chose the form in Equation 1 because it is easy to optimize, simple to understand, and still flexible enough to fit our data. As a shorthand for when we modify the features input to $f_{\text{activate}}$ and $f_{\text{integrate}}$, we denote $g(x) = (g(A), r_1, u)$.

We optimize $\alpha, \beta$ to maximize the expected final accuracy under our human behavior model $f_{\text{HB}}$. As discussed in Section 3, $f_{\text{HB}}$ has the following representation:
$$f_{\text{HB}}(g(x)) = f_{\text{HB}}(g_{\alpha,\beta}(x)) = \begin{cases} r_1 & \text{w.p. } 1 - f_{\text{activate}}(g_{\alpha,\beta}(x)), \\ f_{\text{integrate}}(g_{\alpha,\beta}(x)) & \text{w.p. } f_{\text{activate}}(g_{\alpha,\beta}(x)). \end{cases}$$
We then solve the following optimization problem to obtain $\alpha^*, \beta^*$:
$$(\alpha^*, \beta^*) = \operatorname*{arg\,min}_{\alpha, \beta} \; \mathbb{E}\left[L(f_{\text{HB}}(g_{\alpha,\beta}(x)), y)\right], \tag{2}$$
where $L$ is the binary cross-entropy loss function. The expectation is taken with respect to the randomness in $f_{\text{HB}}$ and the data distribution. Given a training dataset $D$, we optimize the empirical average over datapoints $i$ of
$$\mathbb{E}\left[L(f_{\text{HB}}(g(x_i)), y_i)\right] = L(r_{1i}, y_i)\left(1 - f_{\text{activate}}(g(x_i))\right) + L\left(f_{\text{integrate}}(g(x_i)), y_i\right) f_{\text{activate}}(g(x_i)).$$
In practice, we carry out this optimization using stochastic gradient descent, which is possible because we choose $f_{\text{activate}}$ and $f_{\text{integrate}}$ to be differentiable functions (see Appendix D.2). For data, we use the empirical data described in Section 2. The full training procedure is visualized in Figure 4.
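The pieces above can be assembled into a short optimization loop. The sketch below is a minimal, hedged rendering of Equations 1 and 2 under several assumptions: PyTorch; `f_activate` and `f_integrate` are the fixed, differentiable behavior models (with `f_integrate` returning the predicted final response directly, as in the notation above); responses and labels are represented as probabilities of label 1 in [0, 1]; the feature vector is simplified to (shown advice, initial response, demographics); and softplus keeps $\alpha$ and $\beta$ non-negative. The exact feature preprocessing and optimizer settings in the paper may differ.

```python
import torch
import torch.nn.functional as F

def g_transform(A, alpha, beta):
    """Eq. (1): label-preserving transform of the advice logit A.
    (alpha, beta) = (1, 0) recovers the plain sigmoid, i.e., unmodified advice."""
    return torch.sigmoid(torch.sign(A) * (alpha * torch.abs(A) + beta))

def expected_system_loss(alpha, beta, A, r1, u, y, f_activate, f_integrate):
    """Expected binary cross-entropy of the human-AI system under the behavior model,
    marginalizing over the activation event (the two branches of f_HB).
    A, r1, y: 1-D float tensors (y holds 0./1. labels); u: (n, d_u) demographics."""
    shown = g_transform(A, alpha, beta)                        # advice actually displayed
    x = torch.cat([shown.unsqueeze(1), r1.unsqueeze(1), u], dim=1)
    p_act = f_activate(x).squeeze(1)                           # P(activated)
    r2_hat = f_integrate(x).squeeze(1).clamp(1e-4, 1 - 1e-4)   # predicted final response
    loss_keep = F.binary_cross_entropy(r1.clamp(1e-4, 1 - 1e-4), y, reduction="none")
    loss_move = F.binary_cross_entropy(r2_hat, y, reduction="none")
    return ((1.0 - p_act) * loss_keep + p_act * loss_move).mean()

def fit_alpha_beta(A, r1, u, y, f_activate, f_integrate, steps=2000, lr=1e-2):
    """Eq. (2): gradient descent on (alpha, beta) through the fixed behavior model."""
    raw = torch.zeros(2, requires_grad=True)   # arbitrary init; softplus(0) is about 0.69
    opt = torch.optim.Adam([raw], lr=lr)
    for _ in range(steps):
        alpha, beta = F.softplus(raw[0]), F.softplus(raw[1])
        loss = expected_system_loss(alpha, beta, A, r1, u, y, f_activate, f_integrate)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.softplus(raw.detach()).tolist()   # (alpha*, beta*)
```

Note that only the two scalars are updated here; the behavior-model parameters stay fixed, mirroring the procedure depicted in Figure 4.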
Figure 4: Overview of the optimization procedure for $g_{\alpha,\beta}$. The AI advice is passed through $g_{\alpha,\beta}$, and the transformed advice is input into the human behavior model together with the user's initial response. The logistic loss is then computed, and its gradient is backpropagated through the human behavior model to update $\alpha$ and $\beta$. Note that the human behavior model was trained previously and is fixed here.

Once we have optimized $(\alpha^*, \beta^*)$, we can apply the transformation to new data without needing the user demographics $u$ (i.e., we do not personalize the confidence transformation to individual users). Note that the choice of loss function is somewhat arbitrary here (we used logistic loss as our task is binary classification). In particular, for high-risk settings where a mistake is very costly, $\alpha^*, \beta^*$ could be selected using an appropriately weighted loss function.

Table 2: Simulated performance difference of the modified advice (the "Advice given" curve in Figure 3a) relative to the unmodified advice. A positive difference indicates a higher value for the modified advice. Results are shown across three average advice accuracy levels. The bolded row corresponds to the advice used in our human experiments.

| Advice Accuracy | Final Accuracy | Correct Confidence | Activation Rate |
| --- | --- | --- | --- |
| 75% | +0.7% | +0.097 | +0.083 |
| **79%** | **+1.9%** | **+0.114** | **+0.091** |
| 85% | +3.1% | +0.131 | +0.111 |

4.2 When does under/overconfident advice help?

Here we describe simple theoretical situations where uncalibrated advice improves performance. We consider a binary classification task and assume we want to minimize the logistic loss of the human-AI classifier. We also assume (1) that a constrained version of the activation-integration model perfectly models human behavior; in particular, humans either keep their initial response or follow the given advice exactly; and (2) that we have an oracle for human behavior and know exactly the calibration of both the human and the AI advice.

We now construct a function $f(a, r_1)$ that outputs $a^*$, the optimal advice for minimizing the logistic loss of the human-AI system. To select $a^*$, we need to consider two quantities:
$$L_{r_1} = -\left[p_{r_1}\log(r_1) + (1 - p_{r_1})\log(1 - r_1)\right], \tag{3}$$
$$L_{a^*} = -\left[p_a\log(a^*) + (1 - p_a)\log(1 - a^*)\right]. \tag{4}$$
Here, $p_a$ and $p_{r_1}$ are the probabilities that the advice and the initial response are correct, respectively. $L_{r_1}$ is the log-loss for non-activated humans, and $L_{a^*}$ is the log-loss of following the modified advice $a^*$. We then simply select $a^*$ to minimize the expected loss incurred by the human-AI system,
$$a^* = \operatorname*{arg\,min}_{a'} \;(1 - p_{\text{activation}})\,L_{r_1} + p_{\text{activation}}\,L_{a'}, \tag{5}$$
for $p_{\text{activation}} = f_{\text{activate}}(a', r_1, u)$ (which we assumed we know exactly).

For example, consider a setting where
$$p_{\text{activation}} = \mathbb{1}\left[\overline{r_1} + \delta < \overline{a}\right], \qquad p_{r_1} = f(\overline{r_1}),$$
where $\overline{x} = \max\{x, 1 - x\}$, $\delta \in [-0.5, 0.5]$, and $f : [0, 1] \to [0, 1]$ is a generic function. Here a human under/overvalues their own response relative to the advice (depending on $\delta$), and the human is not calibrated (depending on $f$). In this setting, it is indeed optimal for the advice $a^*$ to be under/overconfident in order to correct for the human's misutilization of advice (depending on the advice and initial response). An example where $\delta = 0.1$ and $f(\overline{r_1}) = \overline{r_1}^2$ is shown in Figure 3b. Further discussion is included in Appendix C.

5.1 Simulation results

Optimizing Equation 2 yields optimal values $\alpha^*, \beta^*$. Using $f_{\text{HB}}$, we can then analyze the expected benefit of $g_{\alpha^*,\beta^*}$ over the baseline $g_{1,0}$. In Figure 3a, we show $g_{\alpha^*,\beta^*}$. We plot three separate curves.
The curve labeled "Advice given" is optimized using the previously described dataset, and this is the function we use in our empirical results. We also synthetically modify the dataset by biasing the advice towards the correct label and by adding noise to the advice, to respectively increase and decrease the average advice accuracy in the dataset to 85% and 75%. We then fit a new $(\alpha, \beta)$ to each modified dataset and plot these curves. Note that $f_{\text{HB}}$ is fit to the unmodified dataset and is fixed when optimizing all three of these curves.

Table 3: Empirical performance for UK-based participants, showing the difference in each metric between the modified advice and the baseline (a value is > 0 when the modified advice resulted in a larger value). We recruited 50 participants for each task. Parenthesized values are advice accuracy.

| Task | Final Accuracy | Correct Confidence | Activation Rate |
| --- | --- | --- | --- |
| Art (90%) | +8.5% | +0.150 | +0.101 |
| Cities (87%) | +1.7% | +0.140 | +0.169 |
| All (89%) | +5.2% | +0.149 | +0.143 |

For all three curves, we notice a large discontinuity at an unmodified advice value of 0.5. This is a result of a large $\beta^*$, and it suggests that advice is not useful to humans unless it reaches a certain level of confidence (e.g., enough to make the human activated). That all three curves follow relatively similar paths suggests the performance of $g_{\alpha^*,\beta^*}$ may be robust to small changes in advice accuracy. This is confirmed (in simulation) in Table 2, where we show the simulated performance difference between modified and unmodified advice. The final accuracy, confidence in the correct label, and activation rate all increase with the modified advice.

5.2 Human experiments

Figure 5: Our empirical results: we plot the average change between the first (before advice) and second (after advice) responses in (a) accuracy and (b) confidence of the correct label, averaged by participant. These results are consistent with our simulation findings in Table 2. Error bars represent 1 standard deviation.

We verify our simulation results across the four previously described tasks (with the exception that we do not add noise to the advice, so all participants now see exactly the same advice). Adult, US-based participants were recruited through the Prolific crowdworker platform [23]. We recruited 50 participants for each task and randomly assigned either the unmodified (baseline) or the modified advice to each person.

In Figure 5, we show the average change in (a) accuracy and (b) confidence of the correct label (the average confidence, with correct and incorrect labels receiving positive and negative weight, respectively), partitioned by the advice received by participants. Accuracy and confidence increase across all tasks when using advice ($p = 3.84 \times 10^{-6}$). Moreover, we can largely confirm our simulation results. Accuracy and confidence increased due to the modified advice across nearly all tasks. Additionally, the activation rate (not plotted) increased with the modified advice, with an average increase of 3.3% across tasks.

Another interesting finding is that tasks with higher advice accuracy (Art and Cities) exhibit larger increases in accuracy and confidence. The AI advice is more accurate in Art and Cities (accuracy: 90% and 87%) than in Sarcasm and Census (accuracy: 78% for both). This makes intuitive sense: we are making the advice more confident in its predictions, so while this benefits all tasks, it disproportionately benefits tasks where the advice is more accurate. This effect was predicted by our simulation results.

Our primary human experiments involved US-based participants.
To show that our findings generalize, we took the advice modification learned on the US data and applied it to new participants based in the UK for the Art and Cities tasks. The findings are highly consistent: modified advice increases activation and improves final human accuracy and confidence (Table 3). Combining the US and UK data, the performance improvement is significant for both accuracy ($p = 8.86 \times 10^{-3}$) and confidence ($p = 4.43 \times 10^{-11}$).

6 Discussion, Limitations, and Future Work

In this paper, we proposed optimizing a simple function to better calibrate AI advice for human use in a collaborative human-AI system. To optimize the reported confidence, we trained a human behavior model on experimental data and used it to simulate the full human-AI system. The modified AI advice is explicitly uncalibrated, but in both simulated and empirical results it led to substantially higher final human performance. We further validated our approach using new participants from a different country. These results highlight that AI, when deployed in collaborative systems, should not be optimized in isolation.

Our results also suggest that we can view modifying AI advice as a nudge for the human user. If the AI says its confidence is 0.7, the user may not fully absorb the import of 0.7, since humans are not great at intuitive probability. Modifying the reported confidence could be a useful nudge to help users better absorb the information. This view is supported by prior work suggesting humans prefer clean decision boundaries [2].

Our work has several limitations that would be interesting to address in future studies. In particular, it is unclear how reliable such a method would be in practice. We argue that our method for creating human-calibrated AI could be deployed in practice, but it requires an accurate model of human behavior. In particular, if we had access to an oracle for an individual's utilization of AI advice, we could exactly optimize the output of a calibrated AI model to correct for that individual's biases. Note that this may lead to under- or overcalibrated advice depending on the user and the advice confidence. However, as we cannot perfectly model any given individual, the question to ask is whether we can create a robust model of human behavior (e.g., with confidence intervals) and then optimize with respect to that model while accounting for its error. This extension was out of scope for this paper and is an important direction for future research.

Other limitations include: we considered tasks where advice accuracy generally exceeded individual human performance, and we did not evaluate expert tasks (e.g., clinicians making clinical decisions). We also did not consider how human interaction with an AI evolves over time; that is, we assumed each individual is static over the course of the study, while in general humans may naturally adapt to the AI, and modifying the AI may affect this process. These are also important directions for future research. Finally, we recognize that the human behavior model we used in this work does not capture all real-world settings. It may be important to consider different models for different applications, and our results will need to be validated in these new settings.

References

[1] Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, and Daniel S. Weld. Is the most accurate AI the best teammate? Optimizing AI for teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11405-11414, 2021.
[2] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 2-11, 2019.
[3] Erin K. Chiou and John D. Lee. Trusting automation: Designing for responsivity and resilience. Human Factors, page 00187208211009995, 2021.
[4] Joshua R. De Leeuw. jsPsych: A JavaScript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47(1):1-12, 2015.
[5] Mary T. Dzindolet, Linda G. Pierce, Hall P. Beck, and Lloyd A. Dawe. The perceived utility of human and automated aids in a visual detection task. Human Factors, 44(1):79-94, 2002.
[6] Robin C. Feldman, Ehrik Aldana, and Kara Stein. Artificial intelligence in the health care space: How we can trust what we cannot know. Stan. L. & Pol'y Rev., 30:399, 2019.
[7] Francesca Gino and Don A. Moore. Effects of task difficulty on use of advice. Journal of Behavioral Decision Making, 20(1):21-35, 2007.
[8] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321-1330. PMLR, 2017.
[9] Nigel Harvey and Ilan Fischer. Taking advice: Accepting help, improving judgment, and sharing responsibility. Organizational Behavior and Human Decision Processes, 70(2):117-133, 1997.
[10] Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. arXiv preprint arXiv:1704.05579, 2017.
[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Vivian Lai and Chenhao Tan. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 29-38, 2019.
[13] Min Kyung Lee. Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management. Big Data & Society, 5(1):2053951718756684, 2018.
[14] Zhuoran Lu and Ming Yin. Human reliance on machine learning models when performance feedback is limited: Heuristics and risks. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1-16, 2021.
[15] Poornima Madhavan and Douglas A. Wiegmann. Effects of information source, pedigree, and reliability on operator interaction with decision support systems. Human Factors, 49(5):773-785, 2007.
[16] David Madras, Toniann Pitassi, and Richard Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer. arXiv preprint arXiv:1711.06664, 2017.
[17] Neda Mesbah, Christoph Tauchert, and Peter Buxmann. Whose advice counts more - man or machine? An experimental investigation of AI-based advice utilization. In Proceedings of the 54th Hawaii International Conference on System Sciences, page 4083. ScholarSpace, 2021.
[18] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1-38, 2019.
[19] Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pages 7076-7087. PMLR, 2020.
[20] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[21] Dilek Önkal, Paul Goodwin, Mary Thomson, Sinan Gönül, and Andrew Pollock. The relative influence of advice from human experts and statistical methods on forecast adjustments. Journal of Behavioral Decision Making, 22(4):390-409, 2009.
[22] Andrew Prahl and Lyn Van Swol. Understanding algorithm aversion: When is advice from automation discounted? Journal of Forecasting, 36(6):691-702, 2017.
[23] Prolific Academic. https://www.prolific.co, 2022.
[24] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144. ACM, 2016.
[25] Sunita Sah, Don A. Moore, and Robert J. MacCoun. Cheap talk and credibility: The consequences of confidence and accuracy on advisor credibility and persuasiveness. Organizational Behavior and Human Decision Processes, 121(2):246-255, 2013.
[26] Thomas Schultze, Anne-Fernandine Rakotoarisoa, and Stefan Schulz-Hardt. Effects of distance between initial estimates and advice on advice utilization. Judgment & Decision Making, 10(2), 2015.
[27] Donghee Shin. The effects of explainability and causability on perception, trust, and acceptance: Implications for explainable AI. International Journal of Human-Computer Studies, 146:102551, 2021.
[28] Lyn M. Van Swol, Jihyun Esther Paik, and Andrew Prahl. Advice recipients: The psychology of advice utilization. In The Oxford Handbook of Advice, pages 21-41. Oxford University Press, 2018.
[29] Kailas Vodrahalli, Roxana Daneshjou, Roberto A. Novoa, Albert Chiou, Justin M. Ko, and James Zou. TrueImage: A machine learning algorithm to improve the quality of telehealth photos. In BIOCOMPUTING 2021: Proceedings of the Pacific Symposium, pages 220-231. World Scientific, 2020.
[30] Kailas Vodrahalli, Tobias Gerstenberg, and James Zou. Do humans trust advice more if it comes from AI? An analysis of human-AI interactions. arXiv preprint arXiv:2107.07015, 2021.
[31] Katharina Weitz, Dominik Schiller, Ruben Schlagowski, Tobias Huber, and Elisabeth André. "Do you trust me?" Increasing user-trust by integrating virtual agents in explainable AI interaction design. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, pages 7-9, 2019.
[32] Alex West and Anusha Praturu. Enhancing the census income prediction dataset. https://people.ischool.berkeley.edu/~alexwest/w210_census_income_html/, 2019. Accessed: 2021-05-15.
[33] Bryan Wilder, Eric Horvitz, and Ece Kamar. Learning to complement humans. arXiv preprint arXiv:2005.00582, 2020.
[34] Yao Xie, Melody Chen, David Kao, Ge Gao, and Xiang Anthony Chen. CheXplain: Enabling physicians to explore and understand data-driven, AI-enabled medical imaging analysis. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1-13. ACM, 2020.
[35] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1-12, 2019.
[36] Yunfeng Zhang, Q. Vera Liao, and Rachel K. E. Bellamy. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 295-305, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [Yes]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We include instructions and materials to reproduce experimental results in the Supplement.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] In addition to details included in the paper, we include additional details in the Supplement.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] Compute resources were minimal (i.e., experiments could be run on a single-core laptop in under a minute).
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [No] The assets are all for public use, and we cite the references.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] The Sarcasm dataset was filtered to omit potentially offensive content.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] In the appendix.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [Yes]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes]