Differentiable Learning Under Triage

Nastaran Okati (MPI for Software Systems, nastaran@mpi-sws.org), Abir De (IIT Bombay, abir@cse.iitb.ac.in), Manuel Gomez-Rodriguez (MPI for Software Systems, manuelgr@mpi-sws.org)

Multiple lines of evidence suggest that predictive models may benefit from algorithmic triage. Under algorithmic triage, a predictive model does not predict all instances but instead defers some of them to human experts. However, the interplay between the prediction accuracy of the model and the human experts under algorithmic triage is not well understood. In this work, we start by formally characterizing under which circumstances a predictive model may benefit from algorithmic triage. In doing so, we also demonstrate that models trained for full automation may be suboptimal under triage. Then, given any model and desired level of triage, we show that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level. Building upon these results, we introduce a practical gradient-based algorithm that is guaranteed to find a sequence of predictive models and triage policies of increasing performance. Experiments on a wide variety of supervised learning tasks using synthetic and real data from two important applications, content moderation and scientific discovery, illustrate our theoretical results and show that the models and triage policies provided by our algorithm outperform those provided by several competitive baselines.

1 Introduction

In recent years, there has been rising interest in a new learning paradigm which seeks the development of predictive models that operate under different automation levels: models that take decisions for a given fraction of instances and leave the remaining ones to human experts.
This new paradigm has been so far referred to as learning under algorithmic triage [1], learning under human assistance [2, 3], learning to complement humans [4, 5], and learning to defer to an expert [6]. Here, one does not only have to find a predictive model but also a triage policy which determines who predicts each instance. The motivation that underpins learning under algorithmic triage is the observation that, while there are high-stakes tasks where predictive models have matched, or even surpassed, the average performance of human experts [8, 9], they are still less accurate than human experts on some instances, where they make far more errors than average [1]. The main promise is that, by working together, human experts and predictive models are likely to achieve a considerably better performance than each of them would achieve on their own.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

While the above mentioned work has shown some success at fulfilling this promise, the interplay between the predictive accuracy of a predictive model and its human counterpart under algorithmic triage is not well understood. One of the main challenges in learning under algorithmic triage is that, for each potential triage policy, there is an optimal predictive model; however, the triage policy is also something one seeks to optimize, as first noted by De et al. [2]. In this context, previous work on learning under algorithmic triage can be naturally differentiated into two lines of work. The first line of work has developed rather general heuristic algorithms that do not enjoy theoretical guarantees [1, 4, 5]. The second has developed algorithms with theoretical guarantees [2, 3, 6]; however, they have focused on more restrictive settings. More specifically, De et al.
[2, 3] focus on ridge regression and support vector machines and reduce the problem to the maximization of approximately submodular functions, and Mozannar and Sontag [6] view the problem from the perspective of cost-sensitive learning and introduce a convex and consistent surrogate loss of the objective function they consider.

Our contributions. Our starting point is a theoretical investigation of the interplay between the prediction accuracy of supervised learning models and human experts under algorithmic triage. By doing so, we hope to better inform the design of general purpose techniques for training differentiable models under algorithmic triage. Our investigation yields the following insights:

I. To find the optimal triage policy and predictive model, we need to take into account the amount of human expert disagreement, or expert uncertainty, on a per-instance level.
II. We identify under which circumstances a predictive model that is optimal under full automation may be suboptimal under a desired level of triage.
III. Given any predictive model and desired level of triage, the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level.

Building on the above insights, we introduce a practical gradient-based algorithm that finds a sequence of predictive models and triage policies of increasing performance subject to a constraint on the maximum level of triage. We apply our gradient-based algorithm to a wide variety of supervised learning tasks using both synthetic and real-world data from two important applications, content moderation and scientific discovery. Our experiments illustrate our theoretical results and show that the models and triage policies provided by our algorithm outperform those provided by several competitive baselines.1

Further related work.
Our work is also related to the areas of learning to defer and active learning. In learning to defer, the goal is to design machine learning models that are able to defer predictions [10-18]. Most previous work focuses on supervised learning and designs classifiers that learn to defer either by considering the defer action as an additional label value or by training an independent classifier to decide about deferred decisions. However, in this line of work, there are no human experts who make predictions whenever the classifiers defer them, in contrast with the literature on learning under algorithmic triage [1-6]; they just pay a constant cost every time they defer predictions. Moreover, the classifiers are trained to predict the labels of all samples in the training set, as in full automation. In active learning, the goal is to find which subset of samples one should label so that a model trained on these samples predicts accurately any sample at test time [19-26]. In contrast, in our work, the trained model only needs to predict accurately a fraction of samples picked by the triage policy at test time and relies on human experts to predict the remaining samples.

2 Supervised Learning under Triage

Let X ⊆ R^m be the feature domain, Y be the label domain, and assume features and labels are sampled from a ground truth distribution P(x, y) = P(x) P(y | x). Moreover, let ŷ = h(x) be the label predictions provided by a human expert and assume they are sampled from a distribution P(h | x), which models the disagreements amongst experts [27]. Then, in supervised learning under triage, one needs to find: (i) a triage policy π : X → {0, 1}, which determines who predicts each feature vector, either a supervised learning model (π(x) = 0) or a human expert (π(x) = 1); and (ii) a predictive model m : X → Y, which needs to provide label predictions ŷ = m(x) for those feature vectors x for which π(x) = 0.
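For concreteness, the division of labor induced by a triage policy can be sketched as follows; all names are hypothetical, and drawing a random member of an expert pool stands in for sampling h ~ P(h | x):

```python
import random

def triage_predict(x, model, triage_policy, expert_pool):
    """Route instance x according to the triage policy: the model predicts
    when pi(x) = 0, otherwise a (randomly chosen) human expert predicts."""
    if triage_policy(x) == 0:
        return model(x)                     # y_hat = m(x)
    return random.choice(expert_pool)(x)    # y_hat = h, drawn from the pool

# Toy usage: the policy sends negative features to humans.
model = lambda x: 1 if x > 0.5 else 0
experts = [lambda x: 0, lambda x: 1 if x < -1 else 0]
policy = lambda x: 1 if x < 0 else 0
triage_predict(2.0, model, policy, experts)   # -> 1 (model predicts)
```

The sketch only illustrates the interface; the rest of the paper is about how to choose π and m jointly.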
Here, similarly as in standard supervised learning, we look for the triage policy and the predictive model that result in the most accurate label predictions, as measured by a loss function ℓ(ŷ, y). More formally, let Π be the set of all triage policies; then, given a hypothesis class of predictive models M, our goal is to solve the following minimization problem:2

minimize_{π ∈ Π, m ∈ M} L(π, m)  subject to  E_x[π(x)] ≤ b,   (1)

where b is a given parameter that limits the level of triage, i.e., the percentage of samples human experts need to provide predictions for, and

L(π, m) = E_{x,y,h}[(1 − π(x)) ℓ(m(x), y) + π(x) ℓ(h, y)].   (2)

Here, one might think of replacing h with its point estimate μ_h(x) = argmin_{μ_h} E_{h|x}[ℓ′(h, μ_h)], where ℓ′(·) is a general loss function. However, the resulting objective would have a bias term, as formalized by the following proposition3 for the quadratic loss ℓ′(h, μ_h) = (h − μ_h)²:

Proposition 1. Let ℓ′(ŷ, y) = (ŷ − y)² and assume there exists x ∈ X for which the distribution of human predictions P(h | x) is not a point mass. Then, the function

L′(π, m) = E_{x,y}[(1 − π(x)) ℓ′(m(x), y) + π(x) ℓ′(μ_h(x), y)]

is a biased estimate of the true average loss defined in Eq. 2.

The above result implies that, to find the optimal triage policy and predictive model, we need to take into account the amount of expert disagreement, or expert uncertainty, on each feature vector x rather than just an average expert prediction.

3 On the Interplay Between Prediction Accuracy and Triage

Let m*_0 be the optimal predictive model under full automation, i.e., m*_0 = argmin_{m ∈ M} L(π_0, m), where π_0(x) = 0 for all x ∈ X.

1 Our code and data are available at https://github.com/Networks-Learning/differentiable-learning-under-triage
Then, the following proposition tells us that, if the predictions made by m*_0 are less accurate than those by human experts on some instances, the model will always benefit from algorithmic triage:

Proposition 2. If there is a subset V ⊆ X of positive measure under P such that

∫_{x∈V} E_{y|x}[ℓ(m*_0(x), y)] dP > ∫_{x∈V} E_{y,h|x}[ℓ(h, y)] dP,

then there exists a nontrivial triage policy π ≠ π_0 such that L(π, m*_0) < L(π_0, m*_0).

Moreover, if we rewrite the average loss as

L(π, m) = E_x[(1 − π(x)) E_{y|x}[ℓ(m(x), y)] + π(x) E_{y,h|x}[ℓ(h, y)]],

it becomes apparent that, for any model m ∈ M, the optimal triage policy π*_m = argmin_{π ∈ Π} L(π, m) is a deterministic threshold rule in which triage decisions are derived by thresholding the difference between the model and human loss on a per-instance level. Formally, we have the following theorem:

Theorem 3. Let m ∈ M be any fixed predictive model. Then, the optimal triage policy that minimizes the loss L(π, m) subject to a constraint E_x[π(x)] ≤ b on the maximum level of triage is given by:

π*_{m,b}(x) = 1 if E_{y|x}[ℓ(m(x), y)] − E_{h|x}[ℓ(h, y)] > t*_{P,b,m}, and 0 otherwise,   (3)

where t*_{P,b,m} = argmin_{τ ≥ 0} E_x[τ b + max(E_{y|x}[ℓ(m(x), y)] − E_{h|x}[ℓ(h, y)] − τ, 0)].

Then, if we plug Eq. 3 into Eq. 1, we can rewrite our minimization problem as:

minimize_{m ∈ M} L(π*_{m,b}, m),   (4)

L(π*_{m,b}, m) = E_x[E_{y|x}[ℓ(m(x), y)] − THRES_{t*_{P,b,m}}(E_{y|x}[ℓ(m(x), y)] − E_{h|x}[ℓ(h, y)], 0)],   (5)

with THRES_t(x, val) = x if x > t and val otherwise. Here, note that, in the unconstrained case, t*_{P,b,m} = 0 and THRES_0(x, 0) = max(x, 0).

2 One might think that minimizing over the set of all possible triage policies is not a well-posed problem, i.e., the optimal triage policy is an extremely complex function. However, our theoretical analysis will reveal that the optimal triage policy does have a simple form and its complexity depends on the considered hypothesis class of predictive models (refer to Theorem 3).
3 All proofs can be found in Appendix A.
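On a finite sample, the threshold rule of Theorem 3 reduces to sorting instances by the per-instance loss difference and deferring the largest positive differences, up to the budget b. A minimal empirical sketch, assuming per-instance estimates of the model and human losses are available (`triage_policy` and the toy loss values are hypothetical):

```python
import numpy as np

def triage_policy(model_loss, human_loss, b):
    """Empirical version of the threshold rule in Eq. 3: defer to humans
    at most a b-fraction of instances, namely those where the model loss
    exceeds the human loss by the largest (positive) margin."""
    diff = model_loss - human_loss            # per-instance loss difference
    n_defer = min(int(b * len(diff)), int(np.sum(diff > 0)))
    pi = np.zeros(len(diff), dtype=int)
    if n_defer > 0:
        pi[np.argsort(-diff)[:n_defer]] = 1   # largest differences -> humans
    return pi

model_loss = np.array([0.9, 0.1, 0.5, 0.05])
human_loss = np.array([0.2, 0.3, 0.1, 0.5])
triage_policy(model_loss, human_loss, b=0.5)  # -> [1, 0, 1, 0]
```

Note that instances where the model already beats the human (diff ≤ 0) are never deferred, even when the budget b would allow it.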
Next, building on the above expression, we prove that the optimal predictive model under full automation m_{θ*_0} is suboptimal under algorithmic triage as long as the average gradient across the subset of samples which are assigned to the human under the corresponding optimal triage policy π*_{m_{θ*_0}, b} is not zero. Formally, our main result is the following proposition:

Proposition 4. Let m_{θ*_0} be the optimal predictive model under full automation within a hypothesis class of parameterized models M(Θ), π*_{m_{θ*_0}, b} the optimal triage policy for m_{θ*_0} defined in Eq. 3 for a given maximum level of triage b, and V = {x | π*_{m_{θ*_0}, b}(x) = 1}. If

∫_{x∈V} E_{y|x}[∇_θ ℓ(m_θ(x), y)|_{θ=θ*_0}] dP ≠ 0,   (6)

then it holds that L(π*_{m_{θ*_0}, b}, m_{θ*_0}) > min_{θ∈Θ} L(π*_{m_θ, b}, m_θ).

Finally, we can also identify the circumstances under which any predictive model m_{θ′} within a hypothesis class of parameterized predictive models M(Θ) is suboptimal under algorithmic triage:

Proposition 5. Let m_{θ′} be a predictive model within a hypothesis class of parameterized models M(Θ), π*_{m_{θ′}, b} the optimal triage policy for m_{θ′} defined in Eq. 3 for a given maximum level of triage b, and V = {x | π*_{m_{θ′}, b}(x) = 0}. If

∫_{x∈V} E_{y|x}[∇_θ ℓ(m_θ(x), y)|_{θ=θ′}] dP ≠ 0,   (7)

then it holds that L(π*_{m_{θ′}, b}, m_{θ′}) > min_{θ∈Θ} L(π*_{m_θ, b}, m_θ).

The above results lay the foundations for our practical gradient-based algorithm for differentiable learning under triage in the next section.

4 How to Learn Under Triage

In this section, our goal is to find the model m_{θ*} within a hypothesis class of parameterized predictive models M(Θ) that minimizes the loss L(π*_{m_θ, b}, m_θ) defined in Eq. 4. To this end, we now introduce a general purpose gradient-based algorithm that first approximates m_{θ*} given a desirable maximum level of triage b and then approximates the corresponding optimal triage policy π*_{m_{θ*}, b}.4 To approximate m_{θ*}, the main obstacle we face is that the threshold value t*_{P,b,m_θ} in the average loss L(π*_{m_θ, b}, m_θ) given by Eq.
5 depends on the predictive model m_θ which we are trying to learn. To overcome this challenge, we proceed sequentially, starting from the triage policy π_0, with π_0(x) = 0 for all x ∈ X, and build a sequence of triage policies and predictive models {(π*_{m_{θ_t}, b}, m_{θ_t})}_{t=0}^T. More specifically, in each step t, we find the parameters of the predictive model m_{θ_t} via stochastic gradient descent (SGD) [28], i.e.,

θ_t^{(j)} = θ_t^{(j−1)} − α^{(j−1)} ∇_θ L(π*_{m_{θ_{t−1}}, b}, m_θ)|_{θ=θ_t^{(j−1)}}
         = θ_t^{(j−1)} − α^{(j−1)} ∇_θ E_x[π*_{m_{θ_{t−1}}, b}(x) E_{y,h|x}[ℓ(h, y)] + (1 − π*_{m_{θ_{t−1}}, b}(x)) E_{y|x}[ℓ(m_θ(x), y)]]|_{θ=θ_t^{(j−1)}}
         = θ_t^{(j−1)} − α^{(j−1)} E_x[(1 − π*_{m_{θ_{t−1}}, b}(x)) E_{y|x}[∇_θ ℓ(m_θ(x), y)|_{θ=θ_t^{(j−1)}}]],   (8)

where α^{(j)} is the learning rate at iteration j. Moreover, the following proposition shows that, under mild conditions, the performance of the triage policies and predictive models improves in each step:

Proposition 6. Assume that ∇²_θ ℓ(m_θ(x), y) ⪯ ΛI for all x ∈ X and y ∈ Y, and α^{(j)} < 1/Λ for all j > 0, for some constant Λ > 0. If, in each step t, we find the parameters of the predictive model m_{θ_t} using Eq. 8, with θ_t^{(0)} = θ_{t−1}, then it holds that L(π*_{m_{θ_t}, b}, m_{θ_t}) < L(π*_{m_{θ_{t−1}}, b}, m_{θ_{t−1}}).

In addition, the following theorem shows that, whenever the loss function ℓ(·) is convex with respect to θ, our algorithm enjoys a global convergence guarantee:

Theorem 7. Let ℓ(·) be convex with respect to θ and let the output of the SGD algorithm be θ_t = argmin_θ L(π*_{m_{θ_{t−1}}, b}, m_θ). Moreover, assume that Λ_min I ⪯ ∇²_θ ℓ(m_θ(x), y) ⪯ Λ_max I, with Λ_min > 0, and let ℓ(·) be H-Lipschitz, i.e., |ℓ(m_θ(x), y) − ℓ(m_{θ′}(x), y)| ≤ H ‖θ − θ′‖. Then, we have that

lim_{t→∞} L(π*_{m_{θ_t}, b}, m_{θ_t}) − L(π*_{m_{θ*}, b}, m_{θ*}) ≤ 4H²Λ_max / (Λ²_min (1 − b)²).   (9)

In practice, given a set of samples D = {(x_i, y_i, h_i)}, we can use the following finite sample Monte-Carlo estimator for the gradient ∇_θ L(π*_{m_{θ_{t−1}}, b}, m_θ):

∇_θ L(π*_{m_{θ_{t−1}}, b}, m_θ) ≈ ∇_θ (1/|D|) Σ_i [ℓ(m_θ(x_i), y_i) − THRES_{t*_{P,b,m_{θ_{t−1}}}}(ℓ(m_θ(x_i), y_i) − ℓ(h_i, y_i), 0)]
                               = (1/|D|) Σ_{i=1}^{max(⌈(1−b)|D|⌉, p)} ∇_θ ℓ(m_θ(x_[i]), y_[i]),

where [i] denotes the i-th sample in increasing value of the difference between the model and the human loss5 ℓ(m_{θ_{t−1}}(x_[i]), y_[i]) − ℓ(h_[i], y_[i]) and p is the number of samples with ℓ(m_{θ_{t−1}}(x_[i]), y_[i]) − ℓ(h_[i], y_[i]) < 0. In the above, we do not have to explicitly compute the threshold t*_{P,b,m_{θ_{t−1}}} nor the triage policy π*_{m_{θ_{t−1}}, b}(x_i) for every sample x_i in the set D; we just need to pick the max(⌈(1−b)|D|⌉, p) samples with the lowest value of the model loss minus the human loss ℓ(m_{θ_{t−1}}(x_[i]), y_[i]) − ℓ(h_[i], y_[i]) using the predictive model m_{θ_{t−1}} fitted in step t−1. To understand why, note that, as long as t*_{P,b,m_{θ_{t−1}}} > 0, by definition, t*_{P,b,m_{θ_{t−1}}} needs to satisfy that

∇_τ Σ_{i∈D} [τ b + max(ℓ(m_{θ_{t−1}}(x_i), y_i) − ℓ(h_i, y_i) − τ, 0)]|_{τ=t*_{P,b,m_{θ_{t−1}}}} = 0,

and this can only happen if ℓ(m_{θ_{t−1}}(x_i), y_i) − ℓ(h_i, y_i) − t*_{P,b,m_{θ_{t−1}}} > 0 for ⌈b|D|⌉ out of |D| samples. Here, we are implicitly estimating the optimal triage policies using the observed training labels and human predictions, i.e., we are not approximating them using a parameterized model, and, due to Proposition 6, the implementation of the above procedure with Monte-Carlo estimates is guaranteed to converge to a local minimum of the empirical loss. Moreover, note that we can think of the procedure as a particular instance of disciplined parameterized programming [29, 30], where the differentiable convex optimization layer is given by the minimization with respect to the triage policy. While training each of the predictive models m_{θ_t}, we can implicitly compute the optimal triage policy π*_{m_{θ_t}, b} as described above; however, at test time, we cannot do the same since we would need to observe the label and human prediction of each unseen sample x.

4 At test time, given a predictive model m_{θ*} and an unseen sample x, we cannot directly evaluate (or, more precisely, estimate using Monte-Carlo) the value of the optimal triage policy π*_{m_{θ*}, b}(x), given by Eq. 3, since it depends on E_{y|x}[·] and E_{h|x}[·].
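In an implementation, this selection step is all one needs per mini-batch, since the threshold never has to be computed explicitly. A minimal sketch with toy loss values (`machine_samples` is a hypothetical helper returning the indices the model trains on in one step):

```python
import numpy as np

def machine_samples(model_loss, human_loss, b):
    """Select the samples the model's gradient is computed on in one step:
    the max(ceil((1 - b)|D|), p) samples with the lowest value of model
    loss minus human loss, where p = #{i : diff_i < 0}."""
    diff = model_loss - human_loss            # per-instance loss difference
    p = int(np.sum(diff < 0))                 # samples the model already wins
    k = max(int(np.ceil((1 - b) * len(diff))), p)
    return np.argsort(diff)[:k]               # lowest differences stay with the model

model_loss = np.array([0.9, 0.1, 0.5, 0.05])
human_loss = np.array([0.2, 0.3, 0.1, 0.5])
machine_samples(model_loss, human_loss, b=0.5)  # -> indices {1, 3}
```

With b = 0 (full automation), k = |D| and the model trains on every sample, recovering vanilla SGD.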
To overcome this, after training the last machine model m_{θ_T}, we also fit a model π̂_γ(x) to approximate π*_{m_{θ_T}, b}(x) using SGD, i.e.,

γ^{(j)} = γ^{(j−1)} − α^{(j−1)} ∇_γ Σ_i ℓ′(π̂_γ(x_i), π*_{m_{θ_T}, b}(x_i))|_{γ=γ^{(j−1)}},   (10)

where α^{(j)} is the learning rate at iteration j and the choice of loss ℓ′(·) depends on the model class chosen for π̂_γ. In our experiments, we have found that, using the above procedure, we can approximate well the optimal triage policy π*_{m_{θ_T}, b}.

Algorithm 1 DIFFERENTIABLE TRIAGE: it returns the weights of a predictive model m_θ and the weights of a triage policy π̂_γ.
Require: Set of training samples D, maximum level of triage b, number of time steps T, number of epochs N, mini-batches M, batch size B, learning rate α.
1: function TRAINMACHINEUNDERTRIAGE(T, D, N, M, B, b, α)
2:   θ_0 ← INITIALIZETHETA()
3:   for t = 1, . . . , T do
4:     θ_t ← TRAINMODEL(θ_{t−1}, D, M, B, b, α)
5:   γ ← APPROXIMATETRIAGEPOLICY(θ_T, D, N, M, B, b, α)
6:   return θ_T, γ
7: function TRIAGE(D, b, θ)
8:   p ← number of instances in D with ℓ(m_θ(x), y) − ℓ(h, y) < 0
9:   D′ ← ∅
10:  for i = 1, . . . , max(⌈(1 − b)|D|⌉, p) do
11:    D′ ← D′ ∪ {i-th sample from D in increasing value of ℓ(m_θ(x), y) − ℓ(h, y)}
12:  return D′
13: function TRAINMODEL(θ′, D, M, B, b, α)
14:  θ^{(0)} ← θ′
15:  for i = 0, . . . , M − 1 do
16:    D^{(i)} ← the i-th mini-batch of D
17:    D^{(i)′} ← TRIAGE(D^{(i)}, b, θ^{(i)})
18:    Δ ← 0
19:    for (x, y, h) ∈ D^{(i)′} do
20:      Δ ← Δ + ∇_θ ℓ(m_θ(x), y)|_{θ=θ^{(i)}}
21:    θ^{(i+1)} ← θ^{(i)} − (α/B) Δ
22:  return θ^{(M)}
23: function APPROXIMATETRIAGEPOLICY(θ, D, N, M, B, b, α)
24:  γ^{(M)} ← INITIALIZEGAMMA()
25:  for j = 1, . . . , N do
26:    γ^{(0)} ← γ^{(M)}
27:    for i = 0, . . . , M − 1 do
28:      D^{(i)} ← the i-th mini-batch of D
29:      Δ ← 0
30:      for (x, y, h) ∈ D^{(i)} do
31:        Δ ← Δ + ∇_γ ℓ′(π̂_γ(x), π*_{m_θ, b}(x))|_{γ=γ^{(i)}}
32:      γ^{(i+1)} ← γ^{(i)} − (α/B) Δ
33:  return γ^{(M)}

5 Note that, if the set of samples contains several predictions h_[i] by different human experts for each sample x_[i], we would use all of them to estimate the (average) human loss.
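As a minimal sketch of this last stage, assuming a logistic π̂_γ over raw features (the experiments use a CNN or residual network instead) and full-batch rather than stochastic gradient steps on the cross-entropy loss of Eq. 10:

```python
import numpy as np

def fit_triage_approximator(X, pi_star, epochs=500, lr=0.5):
    """Fit a logistic model pi_hat_gamma(x) = sigmoid(gamma . x) to the
    implicitly computed optimal triage labels pi_star, via full-batch
    gradient steps on the cross-entropy loss (sketch of Eq. 10)."""
    gamma = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ gamma))        # pi_hat_gamma(x_i)
        gamma -= lr * X.T @ (p - pi_star) / len(X)  # cross-entropy gradient
    return gamma

# Toy usage: instances with negative features are optimally deferred.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])  # [x, bias]
pi_star = np.array([1.0, 1.0, 0.0, 0.0])
gamma = fit_triage_approximator(X, pi_star)
pi_hat = (1.0 / (1.0 + np.exp(-X @ gamma)) > 0.5).astype(int)     # recovers pi_star
```

The fitted π̂_γ is what gets deployed at test time, where the labels and human predictions needed to evaluate Eq. 3 directly are unavailable.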
However, we would like to note that this problem can also be viewed as finding an estimator for the α-superlevel set C_α(f) = {x ∈ X : f(x) ≥ α} of the function f(x) = E_{y|x}[ℓ(m_{θ_T}(x), y)] − E_{h|x}[ℓ(h, y)] with α = t*_{P,b,m_{θ_T}} from a set of noisy observations. Under this view, it might be possible to derive estimators with error performance bounds building upon recent work on level set estimation [31, 32], which is left as future work. Refer to Algorithm 1 for a pseudocode implementation of the overall gradient-based algorithm, which returns θ_T and γ. Appendix B provides a detailed scalability analysis, which suggests that our algorithm does not significantly increase the computational complexity of vanilla SGD.

5 Experiments on Synthetic Data

In this section, our goal is to shed light on the theoretical results from Section 3. To this end, we use our gradient-based algorithm in a simple regression task in which the optimal predictive model under full automation is suboptimal under algorithmic triage.6

6 All algorithms were implemented in Python 3.7 and ran on a V100 Nvidia Tesla GPU with 32GB of memory.

[Figure 1: in each panel, the first row plots the feature x against the training samples and model predictions and the second row plots the human expert loss ℓ(h, y) against the model loss ℓ(m_θ(x), y). (a) Predictive model m_θ0 trained under full automation. (b) Predictive model m_θ trained under triage.] Figure 1: Interplay between the per-instance accuracy of predictive models and experts under different triage policies. In both panels, the first row shows the training samples (x, y) along with the predictions made by the models m(x) and the triage policy values π(x), and the second row shows the predictive model loss ℓ(m(x), y) against the human expert loss ℓ(h, y) on a per-instance level. Columns correspond to the settings (1-4) from left to right.
The triage policy π*_{m_θ0, b} is optimal for the predictive model m_θ0 and the triage policy π*_{m_θ, b} is optimal for the predictive model m_θ. Each point corresponds to one instance and, for each instance, the color indicates the amount of noise in the predictions by experts, as given by Eq. 11, and the tone indicates the triage policy value. In all panels, we used ℓ(ŷ, y) = (ŷ − y)² and the class of predictive models parameterized by sigmoid functions, i.e., m_θ(x) = S_θ(x).

Experimental setup. We generate |D| = 72 samples, where we first draw the features x ∈ R uniformly at random, i.e., x ~ U[−3, 3], and then obtain the response variables y using two different sigmoid functions S_θ(x) = 1/(1 + exp(−θx)). More specifically, we set y = S_1(x) if x ∈ [−3, −1.5) ∪ [0, 1.5) and y = S_5(x) if x ∈ [−1.5, 0) ∪ [1.5, 3]. Moreover, we assume human experts provide noisy predictions of the response variables, i.e., h(x) = y + ε(x), where ε(x) ~ N(0, σ²_ε(x)) with

σ²_ε(x) = 8 × 10⁻³ if x ∈ [−3, −1.5);  1 × 10⁻³ if x ∈ [−1.5, 0);  4 × 10⁻³ if x ∈ [0, 1.5);  2 × 10⁻³ if x ∈ [1.5, 3].   (11)

In the above, we are using heteroscedastic noise motivated by multiple lines of evidence that suggest that human experts' performance on a per-instance level spans a wide range [1, 2, 27]. Then, we consider the hypothesis class of predictive models M(Θ) parameterized by sigmoid functions, i.e., m_θ(x) = S_θ(x), and utilize the sum of squared errors on the predictions as loss function, i.e., ℓ(ŷ, y) = (ŷ − y)², to train the models and triage policies in the following four settings:

1. Predictive model trained under full automation m_θ0 without algorithmic triage, i.e., π_{m_θ0, b}(x) = π_0(x) = 0 for all x ∈ X.
2. Predictive model trained under full automation m_θ0 with optimal algorithmic triage π*_{m_θ0, b}.
3. Predictive model trained under algorithmic triage m_θ, with b = 1, with suboptimal algorithmic triage π*_{m_θ0, b}. Here, we use the triage policy that is optimal for the predictive model trained under full automation.
4.
Predictive model trained under algorithmic triage m_θ, with b = 1, with optimal algorithmic triage π*_{m_θ, b}.

In all the cases, we train the predictive models m_θ0 and m_θ using our method with b = 0 and b = 1, respectively. Finally, we investigate the interplay between the accuracy of the above predictive models and the human experts and the structure of the triage policies at a per-instance level.

[Figure 2: Panel (a) plots the training loss E[ℓ(m_θ(x), y)] of the predictive models against the time step t and Panel (b) plots the training loss E[ℓ′(π̂(x), π*(x))] of the triage policies against the epoch, on the Hatespeech and Galaxy zoo datasets, for several maximum triage levels b ∈ {0.0, 0.2, 0.4, 0.6}.] Figure 2: Average training losses achieved by the predictive models m_θt and the triage policies π̂_γ(i) on the Hatespeech and Galaxy zoo datasets during training. In Panel (a), each predictive model m_θt is the output of TRAINMODEL(·) at step t and, in Panel (b), each triage policy π̂_γ(i) is the output of APPROXIMATETRIAGEPOLICY(·) at epoch i. Both functions are defined in Algorithm 1. Error bars correspond to plus and minus one standard error.

Results. Figure 1 shows the training samples (x, y) along with the predictions made by the predictive models m_θ0 and m_θ and the values of the triage policies π_0(x), π*_{m_θ0, b} and π*_{m_θ, b}, as well as the losses achieved by the models and triage policies (1-4) on a per-instance level. The results provide several interesting insights. Since the predictive model trained under full automation m_θ0 seeks to generalize well across the entire feature space, the loss it achieves on a per-instance level is never too high, but neither too low, as shown in the left column of Panel (a). As a consequence, this model under no triage achieves the highest average loss among all alternatives, L(π_0, m_θ0) = 0.0053 (setting 1).
This may not come as a surprise since the mapping between feature and response variables does not lie within the hypothesis class of predictive models used during training. However, since the predictions by human experts are more accurate than those provided by the above model in some regions of the feature space, we can deploy the model with the optimal triage policy π*_{m_θ0, b} given by Theorem 3 and lower the average loss to L(π*_{m_θ0, b}, m_θ0) = 0.0020 (setting 2), as shown in the right column of Panel (a) and suggested by Proposition 2. In contrast with the predictive model trained under full automation m_θ0, the predictive model trained under triage m_θ learns to predict very accurately the instances that lie in the regions of the feature space colored in green and yellow but it gives up on the regions colored in red and blue, where its predictions incur a very high loss, as shown in Panel (b). However, these latter instances, where the loss would have been the highest if the predictive model had to predict their response variables y, are those that the optimal triage policy π*_{m_θ, b} hands in to human experts to make predictions. As a result, this predictive model under the optimal triage policy does achieve the lowest average loss among all alternatives, L(π*_{m_θ, b}, m_θ) = 0.0009 (setting 4), as suggested by Propositions 4 and 5. Finally, our results also show that deploying the predictive model m_θ under a suboptimal triage policy may actually lead to a higher loss, L(π*_{m_θ0, b}, m_θ) = 0.0031 (setting 3), than the loss achieved by the predictive model trained under full automation m_θ0 with its optimal triage policy π*_{m_θ0, b}. This happens because the predictive model m_θ is trained to work well only on the instances x such that π*_{m_θ, b}(x) = 0 and not necessarily on those with π*_{m_θ0, b}(x) = 0.

6 Experiments on Real Data

In this section, we use our gradient-based algorithm in two binary and multi-class classification tasks in content moderation and scientific discovery.
We first investigate the interplay between the accuracy of the predictive models and human experts and the structure of the optimal triage policies at different steps of the training process. Then, we compare the performance of our algorithm with several competitive baselines.

Experimental setup. We use two publicly available datasets [33, 34], one from an application in content moderation and the other from scientific discovery:7

7 We chose these two particular applications because the corresponding datasets are among the only publicly available datasets that we found containing multiple human predictions per instance, necessary to estimate the human loss at an instance level, and a relatively large number of instances. The Hatespeech dataset is publicly available under MIT license and the Galaxy zoo dataset is publicly available under

[Figure 3: scatter plots of the human expert loss ℓ(h, y) against the model loss ℓ(m_θ(x), y), with instances colored by triage policy value (π(x) = 0 vs. π(x) = 1) and the line ℓ(m_θ(x), y) − ℓ(h, y) = t*_{P,b,m} marked. (a) Step t = 1. (b) Step t = 2. (c) Step t = 3.] Figure 3: Predictive model and expert losses at a per-instance level on a randomly selected subset of 500 samples of the Hatespeech (top row) and Galaxy zoo (bottom row) datasets throughout the execution of our method during training. The maximum level of triage is set to b = 0.4 for the Hatespeech dataset and b = 1.0 for the Galaxy zoo dataset. Each point corresponds to an individual instance and, for each instance, the color pattern indicates the triage policy value.

Hatespeech: It consists of |D| = 24,783 tweets containing lexicons used in hate speech. Each tweet is labeled by three to five human experts from Crowdflower as "hate speech", "offensive", or "neither".
Galaxy zoo: It consists of |D| = 10,000 galaxy images.8 Each image is labeled by 30+ human experts as "early type" or "spiral".
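Both datasets provide multiple expert labels per instance, which is what makes the per-instance human loss estimable. A minimal sketch of turning one instance's expert labels into an empirical P(h | x) and a majority ground-truth label (`expert_distribution` is a hypothetical helper):

```python
import numpy as np

def expert_distribution(expert_labels, n_classes):
    """Estimate P(h | x) from the labels several experts gave one instance
    and set the ground-truth label y to the most-voted class."""
    counts = np.bincount(expert_labels, minlength=n_classes)
    p_h = counts / counts.sum()          # empirical P(h | x)
    y = int(np.argmax(counts))           # y = argmax_h P(h | x)
    return p_h, y

# Five experts label one tweet; classes 0, 1, 2 = hate speech, offensive, neither.
p_h, y = expert_distribution(np.array([0, 2, 2, 1, 2]), n_classes=3)
# p_h = [0.2, 0.2, 0.6], y = 2
```

At test time, a human prediction for a deferred instance is then simulated by sampling h from this empirical distribution.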
Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 license.

For each tweet in the Hatespeech dataset, we first generate a 100-dimensional feature vector using fasttext [35] as x, similarly as in De et al. [2]. For each image in the Galaxy zoo dataset, we use its corresponding pixel map9 as x. Given an instance with feature value x, we estimate P(h | x) = n_x(h) / Σ_{h′ ∈ Y} n_x(h′), where n_x(h) denotes the number of human experts who predicted label h, and we set its true label to y = argmax_{h ∈ Y} P(h | x). Moreover, at test time, for each instance that the triage policy assigns to humans, we sample h ~ P(h | x). In all our experiments, we consider the hypothesis class of probabilistic predictive models M(Θ) parameterized by softmax distributions, i.e., m_θ(x) ~ p_{θ;x} = Multinomial([exp(φ_{y,θ}(x))]_{y ∈ Y}), where, for the Hatespeech dataset, φ_{·,θ} is the convolutional neural network (CNN) by Kim [36] and, for the Galaxy zoo dataset, it is the deep residual network by He et al. [37]. During training, we use a cross entropy loss on the observed labels, i.e., ℓ(ŷ, y) = −log P(ŷ = y | x). Here, if an instance is assigned to the predictive model, we have that P(ŷ = y | x) = exp(φ_{y,θ}(x)) / Σ_{y′} exp(φ_{y′,θ}(x)), and, if an instance is assigned to a human expert, we have that P(ŷ = y | x) = P(h = y | x). For the function π̂_γ(x), we use the class of logistic functions, i.e., π̂_γ(x) = 1/(1 + exp(−φ_γ(x))), where φ_γ is the same CNN and deep residual network as in the predictive model, respectively. Here, we also use the cross entropy loss, i.e., ℓ′(π̂_γ(x), π*_{m_{θ_T}, b}(x)) = −π*_{m_{θ_T}, b}(x) log π̂_γ(x) − (1 − π*_{m_{θ_T}, b}(x)) log(1 − π̂_γ(x)). In each experiment, we used 60% of the samples for training, 20% for validation and 20% for testing. Refer to Appendix C for additional details on the experimental setup.

Results. First, we look at the average loss achieved by the predictive models m_θt and triage policies π̂_γ(i) throughout the execution of our method during training. Figure 2 summarizes the results, which
Since our algorithm is general, it may be useful in other applications in which predictive models trained under full automation perform worse than humans on some instances (see Proposition 2).
8 The original Galaxy zoo dataset consists of 61,577 images; however, we report results on a randomly chosen subset of 10,000 images due to scalability reasons. We found similar results in other random subsets.
9 The pixel maps for each image are available at https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge

[Figure 4: (a) Hatespeech. (b) Galaxy zoo.] Figure 4: Misclassification test error P(ŷ ≠ y) against the triage level b on the Hatespeech and Galaxy zoo datasets for our algorithm (differentiable triage), confidence-based triage [5], score-based triage [1], surrogate-based triage [6] and full automation triage. Appendix C contains more details on the baselines.

reveal several insights. For small values of the triage level, the models m_θt aim to generalize well across a large portion of the feature space. As a result, they incur a large training loss, as shown in Panel (a). In contrast, for b ≥ 0.4, the models m_θt are trained to generalize across a smaller region of the feature space, which leads to a considerably smaller training loss. However, for such a high triage level, the overall performance of our method is also contingent on how well π̂_γ approximates the optimal triage policy. Fortunately, Panel (b) shows that, as epochs increase, the average training loss of π̂_γ(j) decreases. Appendix D further validates the trained approximate triage policies π̂_γ. Next, we compare the predictive model and the human expert losses per training instance throughout the execution of our method during training. Figure 3 summarizes the results.
At each step t, we find that the optimal triage policy π*_{m_{θt},b} hands over to human experts those instances (in orange) on which the loss would have been highest had the predictive model predicted their response variables y. Moreover, at the beginning of the training process (i.e., for low step values t), since the predictive model m_{θt} seeks to generalize across a large portion of the feature space, the model loss remains similar across the feature space. However, later in the training process (i.e., for high step values t), the predictive models m_{θt} focus on predicting more accurately the samples that the triage policy hands over to the model, achieving a lower loss on those samples, and give up on the remaining samples, on which they achieve a high loss. Finally, we compare the performance of our method against four baselines in terms of test misclassification error. Refer to Appendix C for more details on the baselines, which we refer to as confidence-based triage [5], score-based triage [1], surrogate-based triage [6] and full automation triage¹⁰. Figure 4 summarizes the results. We find that, on each dataset, our algorithm is able to find the predictive model and the triage policy with the lowest misclassification error across the entire span of b values. One could view the triage levels at which each method achieves its lowest error as the optimal automation levels (within the level values we experimented with).

7 Conclusions

In this paper, we have contributed towards a better understanding of supervised learning under algorithmic triage. We have first identified under which circumstances predictive models, including those trained for full automation, may benefit from algorithmic triage. Then, given a predictive model and a desired level of triage, we have shown that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived by thresholding the difference between the per-instance model and human errors.
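To make this threshold rule concrete, the sketch below hands over to humans, for a given triage level b, the fraction of instances with the largest per-instance loss difference between model and humans. This is an illustrative sketch that assumes the per-instance losses are precomputed; the function name is hypothetical:

```python
import numpy as np

def threshold_triage(model_loss, human_loss, b):
    """Deterministic threshold rule (sketch): triage to humans the
    fraction b of instances with the largest model_loss - human_loss,
    i.e., those on which deferring to a human helps the most."""
    n = len(model_loss)
    k = int(np.floor(b * n))          # number of instances handed to humans
    diff = model_loss - human_loss    # per-instance error difference
    triage = np.zeros(n, dtype=bool)  # False -> model predicts
    if k > 0:
        triage[np.argsort(-diff)[:k]] = True  # True -> human predicts
    return triage

model_loss = np.array([0.9, 0.1, 0.5, 0.7])
human_loss = np.array([0.2, 0.3, 0.6, 0.1])
# Differences are [0.7, -0.2, -0.1, 0.6]; with b = 0.5, instances 0 and 3
# (where the model would err most relative to humans) go to the humans.
mask = threshold_triage(model_loss, human_loss, b=0.5)
```

Sorting by the loss difference and cutting at the triage level is exactly a threshold rule: every triaged instance has a difference above the value at the cut point.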
Finally, we have introduced a practical algorithm to train supervised learning models under triage and have shown that it outperforms several competitive baselines. Our work also opens many interesting avenues for future work. For example, we have assumed that each instance is predicted by either a predictive model or a human expert. However, there may be many situations in which human experts predict all instances but their predictions are informed by a predictive model [38]. We have shown that our algorithm is guaranteed to converge to a local minimum of the empirical risk; however, it would be interesting to analyze its convergence rate and generalization error. Finally, it would be valuable to assess the performance of supervised learning models under algorithmic triage using interventional experiments in a real-world application.

¹⁰Among all the baselines we are aware of [1–6], we did not compare with De et al. [2, 3] because they only allow linear models and SVMs, and we did not compare with Wilder et al. [4] because they did not provide enough details to implement their method.

Acknowledgments. Okati and Gomez-Rodriguez acknowledge support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719). De has been partially supported by a DST Inspire Faculty Award.

References

[1] Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220, 2019.
[2] Abir De, Paramita Koley, Niloy Ganguly, and Manuel Gomez-Rodriguez. Regression under human assistance. In AAAI, 2020.
[3] Abir De, Nastaran Okati, Ali Zarezadeh, and Manuel Gomez-Rodriguez. Classification under human assistance. In AAAI, 2021.
[4] Bryan Wilder, Eric Horvitz, and Ece Kamar. Learning to complement humans. In IJCAI, 2020.
[5] Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, and Daniel S. Weld. Optimizing AI for teamwork. In AAAI, 2021.
[6] Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In ICML, 2020.
[7] Justin Cheng, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. Antisocial behavior in online discussion communities. In ICWSM, 2015.
[8] Michael Pradel and Koushik Sen. DeepBugs: A learning approach to name-based bug detection. In OOPSLA, 2018.
[9] Eric Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44, 2019.
[10] David Madras, Toniann Pitassi, and Richard S. Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer. In NeurIPS, 2018.
[11] Peter Bartlett and Marten Wegkamp. Classification with a reject option using a hinge loss. JMLR, 9(Aug):1823–1840, 2008.
[12] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In ALT, 2016.
[13] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. 2019.
[14] Harish Ramaswamy, Ambuj Tewari, and Shivani Agarwal. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530–554, 2018.
[15] Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In ICML, 2019.
[16] Ziyin Liu, Zhikang Wang, Paul Pu Liang, Russ R Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Deep gamblers: Learning to abstain with portfolio theory. In NeurIPS, 2019.
[17] Sunil Thulasidasan, Tanmoy Bhattacharya, Jeff Bilmes, Gopinath Chennupati, and Jamal Mohd-Yusof. Combating label noise in deep learning using abstention. arXiv preprint arXiv:1905.10964, 2019.
[18] Liu Ziyin, Blair Chen, Ru Wang, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency, and Masahito Ueda. Learning not to learn in the presence of noisy labels.
arXiv preprint arXiv:2002.06541, 2020.
[19] David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models. In NeurIPS, 1995.
[20] Steven Hoi, Rong Jin, Jianke Zhu, and Michael R Lyu. Batch mode active learning and its application to medical image classification. In ICML, 2006.
[21] Masashi Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. JMLR, 7(Jan):141–166, 2006.
[22] Rebecca Willett, Robert Nowak, and Rui M Castro. Faster rates in regression via active learning. In NeurIPS, 2006.
[23] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In NeurIPS, 2008.
[24] Sivan Sabato and Rémi Munos. Active regression by stratification. In NeurIPS, 2014.
[25] Xue Chen and Eric Price. Active regression via linear-sample sparsification. In COLT, 2017.
[26] Abolfazl Hashemi, Mahsa Ghasemi, Haris Vikalo, and Ufuk Topcu. Submodular observation selection and information gathering for quadratic models. In ICML, 2019.
[27] Maithra Raghu, Katy Blumer, Rory Sayres, Ziad Obermeyer, Bobby Kleinberg, Sendhil Mullainathan, and Jon Kleinberg. Direct uncertainty prediction for medical second opinions. In ICML, 2019.
[28] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[29] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.
[30] Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and Zico Kolter. Differentiable convex optimization layers. In NeurIPS, 2019.
[31] Rebecca M Willett and Robert D Nowak. Minimax optimal level-set estimation. IEEE Transactions on Image Processing, 16(12):2965–2979, 2007.
[32] Aarti Singh. Nonparametric Set Estimation Problems in Statistical Inference and Learning. PhD thesis, University of Wisconsin–Madison, 2008.
[33] Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. Automated hate speech detection and the problem of offensive language. ICWSM, pages 512–515.
[34] Steven P Bamford, Robert C Nichol, Ivan K Baldry, Kate Land, Chris J Lintott, Kevin Schawinski, Anže Slosar, Alexander S Szalay, Daniel Thomas, Mehri Torki, et al. Galaxy zoo: the dependence of morphology and colour on environment. Monthly Notices of the Royal Astronomical Society, 393(4):1324–1352, 2009.
[35] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
[36] Yoon Kim. Convolutional neural networks for sentence classification, 2014.
[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[38] Brian Lubars and Chenhao Tan. Ask not what AI can do, but what AI should do: Towards a framework of task delegability. 2020.
[39] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.