Classification Under Human Assistance

Abir De¹*, Nastaran Okati²*, Ali Zarezade², Manuel Gomez Rodriguez²
¹IIT Bombay  ²Max Planck Institute for Software Systems
*Equal contributions.

Most supervised learning models are trained for full automation. However, their predictions are sometimes worse than those by human experts on some specific instances. Motivated by this empirical observation, our goal is to design classifiers that are optimized to operate under different automation levels. More specifically, we focus on convex margin-based classifiers and first show that the problem is NP-hard. Then, we further show that, for support vector machines, the corresponding objective function can be expressed as the difference of two functions $f = g - c$, where $g$ is monotone, non-negative and $\gamma$-weakly submodular, and $c$ is non-negative and modular. This representation allows us to utilize a recently introduced deterministic greedy algorithm, as well as a more efficient randomized variant of the algorithm, both of which enjoy approximation guarantees at solving the problem. Experiments on synthetic and real-world data from several applications in medical diagnosis illustrate our theoretical findings and demonstrate that, under human assistance, supervised learning models trained to operate under different automation levels can outperform those trained for full automation as well as humans operating alone.

1 Introduction

In recent years, machine learning models have matched, or even surpassed, the average performance of human experts at tasks for which intelligence is required (Graves, Abdel-Rahman, and Hinton 2013; Krizhevsky, Sutskever, and Hinton 2012; Silver et al. 2016; Sutskever, Vinyals, and Le 2014). As a consequence, there is a widespread discussion on the possibility of letting machine learning models take high-stake decisions: the promise is that the timeliness and quality of the decisions would greatly improve. For example, in medical diagnosis, patients would not need to wait for months to be diagnosed by a specialist. In content moderation, online publishers could moderate toxic comments before they trigger incivility in their platforms. In software development, developers would easily find bugs in large software projects and would not need to spend long hours on code reviews. Unfortunately, the decisions taken by machine learning models are still worse than those by human experts on some instances, where they make far more errors than average (Raghu et al. 2019). Motivated by this observation, there has been a paucity of work on developing machine learning models that are optimized to operate under different automation levels (De et al. 2020a; Bordt and von Luxburg 2020; Mozannar and Sontag 2020; Raghu et al. 2019; Wilder, Horvitz, and Kamar 2020), i.e., models that are optimized to take decisions for a given fraction of the instances and leave the remaining ones to humans. However, most of this work has developed heuristic algorithms that do not enjoy theoretical guarantees. One of the only exceptions is the work by De et al. (2020a), which has reduced the problem of ridge regression under different automation levels to the maximization of an α-submodular function (Gatmiry and Gomez-Rodriguez 2019).
In our work, rather than (ridge) regression, we focus on classification under human assistance and show that, for support vector machines, the problem can be solved using algorithms with theoretical guarantees. More specifically, we first show that, for convex margin-based classifiers, the problem of classification under human assistance is NP-hard. This is due to its combinatorial nature: for each potential meta-decision about which instances the classifier will decide upon, there is an optimal set of parameters for the classifier; however, the meta-decision is also something we seek to optimize. Then, for support vector machines, we derive an alternative representation of the objective function as a difference of two functions $f = g - c$, where $g$ is monotone, non-negative, and $\gamma$-weakly submodular (Bian et al. 2017; Das and Kempe 2018) and $c$ is non-negative and modular. Moreover, we further show that, in our problem, the submodularity ratio $\gamma$, which characterizes how close the function $g$ is to being submodular, can be lower bounded. These properties allow a recently introduced deterministic greedy algorithm (Algorithm 1) as well as a more efficient randomized variant of the algorithm (Harshaw et al. 2019) to enjoy nontrivial approximation guarantees. Finally, we experiment with synthetic and real-world data from several applications in medical diagnosis. Our experiments on synthetic data reveal that, by outsourcing samples to humans during training, the resulting support vector machine is able to reduce the number of training samples inside or on the wrong side of the margin, among those samples it needs to decide upon. Our experiments on real data demonstrate that, under human assistance, support vector machines trained to operate under different automation levels outperform those trained for full automation as well as humans operating alone. Our code and data are available at https://github.com/Networks-Learning/classification-under-assistance.

Further related work. There is a rapidly increasing line of work devoted to designing classifiers that are able to defer decisions (Bartlett and Wegkamp 2008; Cortes, De Salvo, and Mohri 2016; El-Yaniv et al. 2010; Geifman and El-Yaniv 2019; Geifman, Uziel, and El-Yaniv 2018; Hsu et al. 2020; Liu et al. 2019; Ramaswamy, Tewari, and Agarwal 2018; Thulasidasan et al. 2019; Wiener and El-Yaniv 2011; Ziyin et al. 2020). However, this line of work does not consider that there is a human decision maker, with a human error model, who takes a decision whenever the classifiers defer it, and the classifiers are trained to predict the labels of all samples in the training set, as in full automation. Our work also relates to the area of active learning (Chen and Price 2017; Cohn, Ghahramani, and Jordan 1995; Guo and Schuurmans 2008; Hashemi et al. 2019; Hoi et al. 2006; Sabato and Munos 2014; Sugiyama 2006; Willett, Nowak, and Castro 2006), where the goal is to determine which subset of training samples one should label so that a supervised machine learning model, trained on these samples, generalizes well across the entire feature space at test time. However, there is a fundamental difference between our work and active learning. In our work, the trained model only needs to accurately predict samples which are close to the samples assigned to the machine during training time. In contrast, in active learning, the trained model needs to predict well any sample at test time. Finally, our work advances the state of the art on human-machine collaboration (Ghosh et al. 2019; Grover et al.
2018; Hadfield-Menell et al. 2016; Haug, Tschiatschek, and Singla 2018; Kamalaruban et al. 2019; Macindoe, Kaelbling, and Lozano-Pérez 2012; Meresht et al. 2020; Nikolaidis et al. 2017, 2015; Radanovic et al. 2019; Tschiatschek et al. 2019; Wilson and Daugherty 2018). However, rather than considering a setting in which the machine and the human interact with each other, as most previous work does, we develop algorithms that learn to distribute decisions between humans and machines.

2 Problem Formulation

In this section, we formally introduce the problem of designing convex margin-based classifiers that are optimized to operate under different automation levels. Then, we show that, for convex margin-based classifiers, the problem is NP-hard. For simplicity, we will consider binary classification; however, our ideas can be extended to m-ary classification.

In binary classification, one needs to find a mapping function $f(x)$ between feature vectors $x \in \mathbb{R}^m$, with $x \sim p(x)$, and class labels $y \in \{-1, 1\}$, with $y \sim p(y \mid x)$. To this end, one utilizes a training set $\mathcal{D} = \{(x_i, y_i)\}_{i \in V}$ to construct a mapping that works well on an unseen test set. For margin-based classifiers, finding this mapping usually reduces to building a decision boundary, defined by a set of parameters $\theta$, that separates feature vectors in the training set according to their class labels. One typically looks for the decision boundary that achieves the greatest classification accuracy in a test set by minimizing a convex loss function $\ell(h_\theta(x), y)$ over the training set, i.e., $\theta^* = \mathrm{argmin}_\theta \sum_{i \in V} \ell(h_\theta(x_i), y_i)$, where $h_\theta(x)$ denotes the signed distance from the feature vector $x$ to the decision boundary. Then, given an unseen feature vector $x$ from the test set, the classifier predicts $f(x) = 1$ if $h_{\theta^*}(x) \geq 0$ and $f(x) = -1$ otherwise.

In binary classification under human assistance, for every feature vector $x \in \mathbb{R}^m$, the mapping function $f(x)$ can resort to either a classifier or a human expert. For margin-based classifiers, finding the mapping then reduces to picking the subset of training samples $S \subseteq V$ that are outsourced to human experts, with $|S| \leq n$, and building a decision boundary that separates feature vectors in the subset of training samples $S^c = V \setminus S$ according to their class labels. Using the same convex loss function $\ell(h_\theta(x), y)$ as in standard binary classification, our goal is then to solve the following minimization problem:

$$\underset{S, \theta}{\text{minimize}} \quad \sum_{i \in V \setminus S} \ell(h_\theta(x_i), y_i) + \sum_{i \in S} c(x_i, y_i) \quad \text{subject to} \quad |S| \leq n, \qquad (1)$$

where $c(x, y)$ denotes the human error per sample, which we will define more precisely in the next section. Here, we assume that human annotations are independent, which allows us to cast the total human error as the sum of the human error per sample over all instances. Moreover, we use a linear constraint on the number of examples outsourced to humans because, in most practical scenarios, humans get paid every time they make a prediction: if they make n predictions, they get paid n times. Given the optimal set $S^*$, we can find the optimal parameter $\theta^* = \theta^*(V \setminus S^*)$ in polynomial time since, by assumption, the loss $\ell(h_\theta(x_i), y_i)$ is convex. Unfortunately, the following theorem tells us that, in general, we cannot expect to find both $S^*$ and $\theta^*$ in polynomial time (proven in the long version of our paper (De et al. 2020b)):

Theorem 1 The problem of designing margin-based classifiers under human assistance defined in Eq. 1 is NP-hard.
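To make the combinatorial structure of Eq. 1 concrete, the following sketch (ours, not the authors' released code) evaluates the objective for one fixed candidate set S: the classifier parameters are refit on V \ S by convex minimization, and the human error is accrued on S. The logistic loss and the SciPy optimizer are illustrative stand-ins for any convex margin-based loss and solver.

```python
import numpy as np
from scipy.optimize import minimize

def objective_for_outsourcing_set(X, y, human_error, S, reg=1.0):
    """Evaluate the objective of Eq. 1 for a fixed outsourcing set S.

    X: (N, m) features; y: (N,) labels in {-1, +1};
    human_error: (N,) precomputed c(x_i, y_i); S: set of outsourced indices.
    Uses an L2-regularized logistic loss as an illustrative convex margin loss.
    """
    keep = np.array([i for i in range(len(y)) if i not in S], dtype=int)

    def loss(theta):
        margins = y[keep] * (X[keep] @ theta)      # signed margins y_i * h_theta(x_i)
        return np.logaddexp(0.0, -margins).sum() + reg * theta @ theta

    # Refit the classifier on V \ S (convex, hence solvable in polynomial time).
    res = minimize(loss, np.zeros(X.shape[1]), method="L-BFGS-B")

    # Machine loss on V \ S plus accumulated human error on S.
    return res.fun + human_error[list(S)].sum()
```

The NP-hardness of Eq. 1 stems precisely from the outer loop this sketch leaves out: searching over the exponentially many candidate sets S.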
Moreover, given the solution to the above minimization problem, we would still need to decide whether to outsource an unseen feature vector $x$ from the test set to a human expert, even if $x \neq x_i$ for all $i \in V$. To this end, we could train an additional model $\pi(d \mid x)$ to decide which samples to outsource to a human using the labeled set $\{(x_i, d_i)\}_{i \in V}$, where $x_i$ are the feature vectors in the training set and $d_i = +1$ if $i \in S$ and $d_i = -1$ otherwise. Then, as long as this model does not make mistakes on the training set, one can readily conclude that the samples assigned to the classifier during training are as if they were sampled from the feature distribution $p(x)\pi(d = -1 \mid x)$ induced by $\pi$. As a direct consequence, if the model $\pi(d \mid x)$ is smooth with respect to $x$, one can further conclude that the trained margin-based classifier will work well on the unseen samples it needs to decide upon at test time, i.e., samples from $p(x)\pi(d = -1 \mid x)$.

3 Algorithms with Approximation Guarantees for Support Vector Machines

In this section, we show that, for support vector machines (SVMs), the optimization problem defined in Eq. 1 can be rewritten as a maximization of the difference of two functions $g - c$, where $g$ is monotone, non-negative, and $\gamma$-weakly submodular and $c$ is non-negative and modular. Moreover, we further show that the submodularity ratio $\gamma$ can be lower bounded and, as a consequence, a recently introduced deterministic greedy algorithm (Harshaw et al. 2019) as well as a more efficient randomized variant of the algorithm enjoy approximation guarantees at solving the problem.

Monotonicity and weak submodularity. For (soft margin) SVMs, we can first rewrite the minimization problem defined in Eq. 1 as follows²:

$$\underset{S, w, b}{\text{minimize}} \quad \sum_{i \in V \setminus S} \underbrace{\left[ \lambda \|w\|^2 + [1 - y_i(w^\top \Phi(x_i) + b)]_+ \right]}_{\ell(h_{w,b}(x_i),\, y_i)} + \sum_{i \in S} \underbrace{[1 - y_i h(x_i)]_+}_{c(x_i,\, y_i)} \quad \text{subject to} \quad |S| \leq n, \qquad (2)$$

where $\Phi(\cdot)$ denotes a given feature transformation, $h(\cdot) \in [-H, H]$ is the (normalized) score provided by the human experts, which is only known for the training samples, and $H > 0$ is a given constant. In the above, we measure the human error $c(x, y)$ using a hinge loss $[1 - y\, h(x)]_+$ because the SVM formulation also uses a hinge loss $[1 - y(w^\top x + b)]_+$ to measure the machine error. This is necessary in order to compare human and machine performance meaningfully. However, our solution is agnostic to this specific choice; it is applicable to any human error model. Now, for any given set $S$, let $w^*(V \setminus S)$ and $b^*(V \setminus S)$ be the parameters that minimize the objective function above, i.e., $(w^*(V \setminus S), b^*(V \setminus S)) = \mathrm{argmin}_{w,b} \sum_{i \in V \setminus S} \left[ \lambda \|w\|^2 + (1 - y_i(w^\top \Phi(x_i) + b))_+ \right]$. Here, note that these parameters can be found in polynomial time since the objective function is convex in $w$ and $b$. Then, we can rewrite the above minimization problem as a set function maximization problem:

$$\underset{S}{\text{maximize}} \quad g(S) - c(S), \quad \text{subject to} \quad |S| \leq n, \qquad (3)$$

where

$$g(S) = \lambda |V| \|w^*(V)\|^2 + \sum_{i \in V} [1 - y_i(w^*(V)^\top \Phi(x_i) + b^*(V))]_+ - \lambda |V \setminus S| \|w^*(V \setminus S)\|^2 - \sum_{i \in V \setminus S} [1 - y_i(w^*(V \setminus S)^\top \Phi(x_i) + b^*(V \setminus S))]_+, \qquad (4)$$

and

$$c(S) = \sum_{i \in S} [1 - y_i h(x_i)]_+. \qquad (5)$$

In the above, the first term $\lambda |V| \|w^*(V)\|^2 + \sum_{i \in V} [1 - y_i(w^*(V)^\top \Phi(x_i) + b^*(V))]_+$ ensures that the function $g(S)$ is non-negative, and the function $c(S)$ is clearly non-negative and modular³.

²In the long version of our paper (De et al. 2020b), we also consider hard margin linear SVMs, which are relevant whenever the data is linearly separable.
³A set function $f(S)$ is modular iff it satisfies $f(S \cup \{j\}) - f(S) = f(L \cup \{j\}) - f(L)$ for all $S \subseteq L \subseteq V$ and $j \in V \setminus L$.

Figure 1: Convex hulls and reduced convex hulls. Panels (a) and (b) show the convex hulls $C^\pm$ and the reduced convex hulls $C^\pm_{1/s}$ of a training set whose feature vectors are non-separable; panel (b) shows the sequence of $C^\pm_{1/s}$ for $s \in \mathbb{N}_+$. In both panels, cyan and orange dots represent feature vectors $x_i$ with $y_i = -1$ and $y_i = 1$, respectively.
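For concreteness, here is one way to evaluate $g(S)$ and $c(S)$ from Eqs. 4 and 5; a minimal sketch, assuming a linear kernel and using scikit-learn's LinearSVC as the inner SVM solver. The mapping C = 1/(2λ|V \ S|) between our objective and LinearSVC's is our own bookkeeping (it matches λ|V \ S|‖w‖² plus the sum of hinge losses up to how the intercept is regularized), not something taken from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_objective(X, y, lam=1.0):
    """Return lam * |A| * ||w*||^2 + sum of hinge losses for the SVM fit on (X, y)."""
    # LinearSVC minimizes (1/2)||w||^2 + C * sum hinge; C = 1 / (2 * lam * |A|)
    # recovers lam * |A| * ||w||^2 + sum hinge up to a positive rescaling
    # (the intercept is treated approximately).
    clf = LinearSVC(C=1.0 / (2.0 * lam * len(y)), loss="hinge",
                    max_iter=100000).fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
    return lam * len(y) * (w @ w) + hinge

def g_and_c(X, y, h, S, lam=1.0):
    """g(S) and c(S) as in Eqs. 4-5; h holds the (normalized) human scores."""
    keep = np.setdiff1d(np.arange(len(y)), list(S))
    g = svm_objective(X, y, lam) - svm_objective(X[keep], y[keep], lam)
    c = np.maximum(0.0, 1.0 - y[list(S)] * h[list(S)]).sum()
    return g, c
```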
Moreover, we have the following proposition, which shows that $g(S)$ is a monotone function (proven in the long version of our paper (De et al. 2020b)):

Proposition 2 The set function $g(S)$, defined in Eq. 4, is monotone, i.e., $g(S \cup \{j\}) - g(S) \geq 0$ for all $S \subseteq V$ and $j \in V$.

If the feature vectors in the training set are not separable according to their class labels, there exist instances of the problem in which the function $g(S)$ has submodularity ratio $\gamma = 0$⁴. However, we will now identify a general class of feature distributions for which the function $g(S)$ has a nonzero submodularity ratio and its value can be lower bounded. This lower bound will allow a recently introduced deterministic greedy algorithm as well as its randomized variant to enjoy approximation guarantees at solving the problem. In the remainder, we first consider linear SVMs, i.e., $\Phi(x) = x$, and then nonlinear SVMs.

Let $V^+$ and $V^-$ be the sets of training samples with positive and negative labels, respectively, let $C^+$ and $C^-$ be their corresponding convex hulls, i.e., $C^\pm = \{\sum_{i \in V^\pm} \mu_i x_i \mid \sum_{i \in V^\pm} \mu_i = 1,\ \mu_i \geq 0\}$, and let $\Delta$ be the minimum distance between them, i.e., $\Delta = \min_{x^+ \in C^+,\, x^- \in C^-} \|x^+ - x^-\|_2$. Then, note that, whenever the feature vectors in the training set are not separable according to their class labels, the above convex hulls overlap and thus $\Delta = 0$. However, under mild technical conditions, there will always exist subsets of feature vectors within these convex hulls that do not overlap, as shown in Figure 1. To characterize these subsets, we introduce the notion of reduced convex hulls (Bennett and Bredensteiner 2000):

$$C^\pm_{1/s} = \left\{ \sum_{i \in V^\pm} \mu_i x_i \,\Big|\, \sum_{i \in V^\pm} \mu_i = 1,\ 0 \leq \mu_i \leq \frac{1}{s} \right\},$$

where $s \in \mathbb{N}_+$, with $s \leq \min\{|V^+|, |V^-|\}$ and, similarly as before, we denote the minimum distance between them as $\Delta_{1/s} = \min_{x^+ \in C^+_{1/s},\, x^- \in C^-_{1/s}} \|x^+ - x^-\|$. Now, consider the sequence of reduced convex hulls $\{(C^+_{1/s}, C^-_{1/s})\}_{s=1}^{V_{\min}}$, where $V_{\min} = \min\{|V^+|, |V^-|\}$, illustrated in Figure 1(b), and note that the corresponding minimum distances, by definition, satisfy $\Delta_{1/s'} \geq \Delta_{1/s}$ for all $s' > s$. Then, we measure to what extent feature vectors with positive and negative labels overlap using the distance

$$\Delta^* = \min_{s \in \{1, \ldots, V_{\min}\}} \{\Delta_{1/s} \mid \Delta_{1/s} > 0\}, \qquad (8)$$

where the higher the value of $s^* = \mathrm{argmin}_s \{\Delta_{1/s} \mid \Delta_{1/s} > 0\}$, the higher the overlap between feature vectors with positive and negative labels. Moreover, if there are no elements in the sequence with positive distance, we set $\Delta^* = 0$ and $s^* = 0$.

Given the above, we are now ready to present one of our key results, which characterizes the submodularity ratio of the function $g(S)$ in terms of the amount of overlap between feature vectors with positive and negative labels, as measured by the distance $\Delta^*$ (proven in the long version of our paper (De et al. 2020b)):

Theorem 3 Let $\Phi(x) = x$, $\rho^* = V_{\min}/|V|$, $\sigma^* = s^*/|V|$, and $\eta = 2\sqrt{\lambda} + \max_{i \in V} \|x_i\| / \sqrt{\lambda}$. Then, the submodularity ratio $\gamma$ of the function $g(S)$ (defined in Eq. 4) satisfies

$$\gamma \geq \gamma^* = \min\left\{ \frac{\sigma^* (\Delta^*)^2}{4\lambda},\ \frac{1}{(\eta - 2)^2} \right\} \cdot \frac{1}{4 + 4|V|(\eta - 1) + (\eta - 1)^2 |V|^2},$$

as long as the number of samples outsourced to humans satisfies $n \leq (\rho^* - \sigma^*)|V|$.

⁴A set function $f(S)$ is $\gamma$-weakly submodular iff $\sum_{j \in L \setminus S} [f(S \cup \{j\}) - f(S)] \geq \gamma\, [f(S \cup L) - f(S)]$ for all $S, L \subseteq V$. The largest $\gamma \leq 1$ such that the inequality holds is called the submodularity ratio.
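The distance Δ_{1/s} appearing in Theorem 3 is itself a small convex program, so it can be computed directly; below is a minimal numerical sketch (our illustration, not part of the paper's algorithm) using SciPy's SLSQP solver. Scanning s = 1, ..., V_min and keeping the smallest positive value then yields Δ* from Eq. 8.

```python
import numpy as np
from scipy.optimize import minimize

def reduced_hull_distance(X_pos, X_neg, s):
    """Minimum distance Delta_{1/s} between the reduced convex hulls C^+_{1/s}
    and C^-_{1/s}: minimize || mu^T X_pos - nu^T X_neg ||_2 subject to mu and nu
    lying on the simplex with the extra cap mu_i, nu_j <= 1/s
    (Bennett and Bredensteiner 2000)."""
    n_pos, n_neg = len(X_pos), len(X_neg)

    def sq_dist(mu):  # squared distance between the two convex combinations
        diff = mu[:n_pos] @ X_pos - mu[n_pos:] @ X_neg
        return diff @ diff

    constraints = [
        {"type": "eq", "fun": lambda mu: mu[:n_pos].sum() - 1.0},  # mu on simplex
        {"type": "eq", "fun": lambda mu: mu[n_pos:].sum() - 1.0},  # nu on simplex
    ]
    bounds = [(0.0, 1.0 / s)] * (n_pos + n_neg)                    # reduced-hull cap
    mu0 = np.concatenate([np.full(n_pos, 1.0 / n_pos),
                          np.full(n_neg, 1.0 / n_neg)])            # feasible start
    res = minimize(sq_dist, mu0, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return float(np.sqrt(max(res.fun, 0.0)))
```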
For nonlinear SVMs, rather than characterizing the submodularity ratio in terms of $\Delta^*$ and $s^*$, we resort to the spectral properties of the kernel matrix $K = [K(x_i, x_j)]_{i,j \in V}$. In particular, our key result is the following theorem (proven in the long version of our paper (De et al. 2020b)):

Theorem 4 Let $\eta = 2\sqrt{\lambda} + \max_{i \in V} \sqrt{K(x_i, x_i)} / \sqrt{\lambda}$, $Y = \mathrm{diag}(\{y_i\}_{i \in V})$, $\rho^* = V_{\min}/|V|$, and

$$\zeta = \min_{\mu \geq 0,\ \sum_{i \in V^+} \mu_i = 1,\ \sum_{i \in V^-} \mu_i = 1} \mu^\top Y K Y \mu.$$

If the kernel matrix is full rank, i.e., $\mathrm{rank}(K) = |V|$, then the submodularity ratio $\gamma$ of the function $g(S)$ satisfies

$$\gamma \geq \gamma^* = \min\left\{ \frac{\zeta \sigma^2}{4\lambda},\ \frac{1}{(\eta - 2)^2} \right\} \cdot \frac{1}{4 + 4|V|(\eta - 1) + (\eta - 1)^2 |V|^2},$$

as long as $n \leq (\rho^* - \sigma)|V|$ for some $\sigma \in (0, \rho^*]$.

Algorithm 1: Distorted greedy algorithm
Input: ground set $V$, functions $g$ and $c$, parameters $n$ and $\gamma$.
Initialize: $S \leftarrow \emptyset$
for $i = 0, \ldots, n - 1$ do
  $\omega_i \leftarrow (1 - \gamma/n)^{n - (i + 1)}$
  $k^* \leftarrow \mathrm{argmax}_{k \in V \setminus S} \{\omega_i\, [g(S \cup \{k\}) - g(S)] - c(\{k\})\}$
  if $\omega_i\, [g(S \cup \{k^*\}) - g(S)] - c(\{k^*\}) > 0$ then
    $S \leftarrow S \cup \{k^*\}$
  end
end
Return $S$

Finally, for the particular case of (soft margin) SVMs without offset, i.e., $b = 0$ in Eq. 2, we can derive a stronger lower bound, which does not depend on $\Delta^*$ and $s^*$, by exploiting their greater stability properties (proven in the long version of our paper (De et al. 2020b)):

Theorem 5 Let $\eta = 2\sqrt{\lambda} + \max_{i \in V} \sqrt{K(x_i, x_i)} / \sqrt{\lambda}$. Then, for SVMs without offset ($b = 0$ in Eq. 2), the submodularity ratio of the function $g(S)$ satisfies

$$\gamma \geq \gamma^* = \min\left\{ \frac{1}{(\eta - 2)^2},\ 1 \right\}.$$

Proof sketch of our key technical results. The proofs of Theorems 3, 4 and 5 consist of two steps. In the first step, they show that $g(S \cup \{j\}) - g(S) \geq \lambda \|w^*(V \setminus S)\|^2$ and use the dual formulation of the SVM as well as the properties of the corresponding SVM model to derive a lower bound on $\|w^*(V \setminus S)\|$. In the second step, they use the stability property of SVMs (Bousquet and Elisseeff 2002) to derive an upper bound on $g(S \cup L) - g(S)$. These two steps together lead to the bound on $\gamma^*$.

While the bounds in all the above theorems are tight in terms of the size of the dataset, there are notable differences between the different model classes. For SVMs with offsets, Theorems 3 and 4 suggest that the submodularity ratio bound decreases proportionally to $1/|V|$ and this is due to their poor stability properties (Hush, Scovel, and Steinwart 2007). For SVMs without offsets, Theorem 5 tells us that the submodularity ratio bound is independent of $|V|$ and this is due to their greater stability properties (Bousquet and Elisseeff 2002). More specifically, for any $L \subseteq V \setminus S$, the marginal gain $g(S \cup L) - g(S)$ can be upper bounded by a smaller quantity than in the case of SVMs with offsets, and this results in a stronger lower bound.

Distorted greedy algorithm. The distorted greedy algorithm (Harshaw et al. 2019) proceeds iteratively and, at each iteration, it assigns to the humans the sample $(x_k, y_k)$ that provides the highest distorted marginal gain among the remaining training samples $V \setminus S$. Algorithm 1 summarizes the algorithm, which requires the value of the submodularity ratio $\gamma$ as an input; a Python transcription of Algorithm 1 is sketched below. Since we only have data-dependent bounds on $\gamma$, in our experiments, we use the meta-algorithm proposed in Harshaw et al. (2019) to guess the value of $\gamma$.

Figure 2: Linear and nonlinear support vector machines with offset under human assistance trained using the distorted greedy algorithm (Alg. 1), for $n/|V| = 0.12$ and $n/|V| = 0.3$.
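Algorithm 1 translates almost line by line into Python; a minimal sketch assuming black-box set functions g and c (for instance, wrappers around the SVM evaluators sketched earlier) and a guess for γ.

```python
def distorted_greedy(V, g, c, n, gamma):
    """Distorted greedy algorithm (Alg. 1) for maximizing g(S) - c(S), |S| <= n.

    V: ground set of indices; g, c: callables mapping a set of indices to a
    float; gamma: (a guess of) the submodularity ratio of g. Returns S.
    """
    S = set()
    for i in range(n):
        omega = (1.0 - gamma / n) ** (n - (i + 1))   # distortion factor at step i
        g_S = g(S)
        # Distorted marginal gain of adding k: omega * [g(S + k) - g(S)] - c({k}).
        gains = {k: omega * (g(S | {k}) - g_S) - c({k}) for k in V - S}
        k_star = max(gains, key=gains.get)
        if gains[k_star] > 0:                        # add only if the distorted gain is positive
            S.add(k_star)
    return S
```

This makes the O(n|V|) evaluations of g explicit; per Harshaw et al. (2019), the randomized variant replaces the argmax over V \ S with an argmax over a random subset at each step, which is what yields the faster run time discussed next.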
In each panel, circles represent the feature vectors in the training set; filled and empty circles are assigned to humans and machines, respectively; the solid line indicates $w^*(V \setminus S^*)^\top \phi(x) + b^*(V \setminus S^*) = 0$; and the cyan and orange dashed lines indicate $w^*(V \setminus S^*)^\top \phi(x) + b^*(V \setminus S^*) = -1$ and $w^*(V \setminus S^*)^\top \phi(x) + b^*(V \setminus S^*) = 1$, respectively. For both linear and nonlinear SVMs, we used $\lambda = 1$, $H = 0.2$ and, for nonlinear SVMs, we used a quadratic kernel $K(x_i, x_j) = \left(\tfrac{1}{2}\langle x_i, x_j\rangle\right)^2$.

Since we have shown that the objective function in Eq. 3 can be expressed as the difference of two functions $g - c$, where $g$ is monotone, non-negative and $\gamma$-weakly submodular and $c$ is non-negative and modular, it readily follows from Theorem 3 in Harshaw et al. (2019) that the distorted greedy algorithm enjoys approximation guarantees. More specifically, the distorted greedy algorithm is guaranteed to return a set $S$ such that

$$g(S) - c(S) \geq (1 - e^{-\gamma})\, g(\mathrm{OPT}) - c(\mathrm{OPT}), \qquad (9)$$

where OPT is the optimal set and $\gamma \geq \gamma^*$ with $\gamma^*$ defined in Theorem 3 (linear SVMs with offset), Theorem 4 (nonlinear SVMs with offset), or Theorem 5 (SVMs without offset). In addition to the distorted greedy algorithm, which needs to make $O(n|V|)$ evaluations of the function $g(\cdot)$, Harshaw et al. (2019) have also proposed a randomized variant of the algorithm, which enjoys an asymptotically faster run time due to the use of sampling techniques and is also applicable to our problem. Instead of optimizing over the entire ground set $V$ at each iteration, it optimizes over a random sample $B_i \subseteq V$ of size $O\!\left(\frac{|V|}{n} \log \frac{1}{\epsilon}\right)$. Hence, it only needs to make $O\!\left(|V| \log \frac{1}{\epsilon}\right)$ evaluations of $g(\cdot)$ and it returns a set $S$ such that

$$\mathbb{E}[g(S) - c(S)] \geq (1 - e^{-\gamma} - \epsilon)\, g(\mathrm{OPT}) - c(\mathrm{OPT}). \qquad (10)$$

In the next sections, we will demonstrate that, in addition to enjoying the above approximation guarantees, the distorted greedy algorithm as well as its randomized variant perform better in practice than several competitive baselines.

4 Experiments on Synthetic Data

In this section, we first look into the solutions provided by the distorted greedy algorithm (Alg. 1) in a variety of synthetic examples. Then, we compare the performance of the distorted greedy algorithm and its randomized variant with several competitive baselines. Finally, we investigate the influence that the amount of human error has on the number of samples outsourced to humans by the distorted greedy algorithm.

Experimental setup. In all our experiments, we generate $|V| = 400$ samples using a generative process under which the corresponding SVM under full automation is unable to perform well. For linear SVMs with offset, we draw the class labels uniformly at random, i.e., $y \sim \mathrm{Bernoulli}(0.5)$ over $\{-1, 1\}$, and then draw the features from two different distributions, $p(x \mid y = -1) = \mathcal{N}([0, 0], [6, 1; 1, 6])$ and $p(x \mid y = 1) = \beta\, \mathcal{N}([5, 5], [6, 1; 1, 6]) + (1 - \beta)\, \mathcal{N}([-5, -5], [6, 1; 1, 6])$, where $\beta \sim \mathrm{Bernoulli}(0.5)$. For nonlinear SVMs, we draw the features from a single distribution $p(x) = \mathcal{N}([0, 0], [12, 1; 1, 14])$ and, for each sampled feature $x$, we set $y = +1$ if $\|x - [1, 1]\| \leq 2$ or $\|x + [1, 1]\| \geq 5$, and $y = -1$ otherwise. Here, we use a quadratic kernel $K(x_i, x_j) = \left(\tfrac{1}{2}\langle x_i, x_j\rangle\right)^2$. Moreover, we generate the scores provided by human experts $h(x)$ per label by drawing samples from two uniform distributions, i.e., $h(x) \sim \mathrm{Unif}[-H, -H + 1]$ if $y = 1$ and $h(x) \sim \mathrm{Unif}[-1 + H, H]$ otherwise. The exact choice of $H$ varies across experiments and is mentioned therein. Finally, in each experiment, we use 60% of the samples for training and 40% of the samples for testing, set $\lambda = 1$, and follow the procedure described in Section 2 to train a multilayer perceptron $\pi(\cdot \mid x)$ that decides which samples to outsource to humans at test time (a sketch of this generative process appears below).
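The linear-SVM generative process translates directly into code; a minimal numpy sketch under our reading of the stated distributions (the class assignment of the two densities and the human-score intervals are reconstructed from the text above, so treat them as assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear(n=400):
    """Class-conditional Gaussians for the linear SVM experiment (assumed mapping:
    y = -1 is the single Gaussian at the origin, y = 1 the two-component mixture)."""
    y = rng.choice([-1, 1], size=n)                  # y ~ Bernoulli(0.5) over {-1, 1}
    cov = np.array([[6.0, 1.0], [1.0, 6.0]])
    X = np.empty((n, 2))
    for i, label in enumerate(y):
        if label == -1:
            X[i] = rng.multivariate_normal([0.0, 0.0], cov)
        else:                                        # mixture of two Gaussians, beta ~ Bern(0.5)
            mean = [5.0, 5.0] if rng.random() < 0.5 else [-5.0, -5.0]
            X[i] = rng.multivariate_normal(mean, cov)
    return X, y

def sample_human_scores(y, H=0.2):
    """Human scores: Unif[-H, -H + 1] if y = 1 and Unif[-1 + H, H] otherwise."""
    lo = np.where(y == 1, -H, -1.0 + H)
    return rng.uniform(lo, lo + 1.0)
```

Note that, under this reading, a human mislabels a positive sample with probability H, which is consistent with the no-automation misclassification error of 0.19 reported for H = 0.2 in Figure 3.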
Baseline methods. We compare the performance of the distorted greedy algorithm and its randomized variant with four competitive baselines:

- Triage based on algorithmic uncertainty (Raghu et al. 2019): it first trains a support vector machine for full automation, i.e., $S = \emptyset$. Then, at test time, it outsources to humans the top $n$ testing samples sorted in decreasing order of the classification uncertainty of this support vector machine, defined as $1/|w^*(V)^\top \phi(x) + b^*(V)|$ (a short sketch of this baseline appears at the end of this section).
- Triage based on predicted error (Raghu et al. 2019): it trains a support vector machine for full automation, i.e., $S = \emptyset$, and two additional supervised models that predict the human error and the support vector machine error. Then, at test time, it outsources to humans the top $n$ testing samples sorted in decreasing order of the difference between the predicted machine error and the predicted human error.
- Full automation: it trains a support vector machine for full automation and, at test time, the trained support vector machine predicts all testing samples.
- No automation: humans predict all testing samples.

Figure 3: Performance of all methods (distorted greedy, stochastic distorted greedy, algorithmic uncertainty triage, predicted errors triage, full automation) on synthetic data with a linear SVM with offset, in terms of (a) misclassification error and (b) F1 score, against $n/|V| \in [0, 0.3]$. For all methods, we set $H = 0.2$. If humans predict all testing samples (No automation), we have that $P(\hat{y} \neq y) = 0.19$ and F1-score = 0.81.

We evaluate the performance of all methods in terms of the misclassification test error $P(y \neq \hat{y})$ and the F1 score on positive test samples. Here, the F1 score is a useful metric in medical diagnosis, since it measures the ability of a model to detect the specific disease.

Results. First, we look into the solutions $(S^*, w^*(V \setminus S^*), b^*(V \setminus S^*))$ provided by the distorted greedy algorithm. Figure 2 summarizes the results, which show that: (i) the distorted greedy algorithm outsources to humans those samples that are more prone to be misclassified by the machine, i.e., $\{(x, y) \mid y[w^*(V \setminus S^*)^\top \phi(x) + b^*(V \setminus S^*)] < 0\}$; and (ii) the higher the number of samples $n$ outsourced to humans, the lower the number of training samples inside the margin, i.e., $\{(x, y) \mid |w^*(V \setminus S^*)^\top \phi(x) + b^*(V \setminus S^*)| \leq 1\}$, among those samples the SVM needs to decide upon.

Next, we compare the performance of the distorted greedy algorithm and its randomized variant with all the baselines (algorithmic uncertainty triage, predicted error triage, full automation and no automation) in terms of the misclassification test error $P(y \neq \hat{y})$ and the F1 score on positive test samples. Figure 3 summarizes the results, which show that both algorithms consistently outperform all the baselines by large margins.

Finally, we investigate the influence that the amount of human error has on the number of samples $|S| \leq n$ outsourced to humans by the distorted greedy algorithm and its randomized variant. Figure 4 summarizes the results for the linear SVM with offset. We find that, as long as the amount of human error is small, both algorithms outsource $|S| = n$ samples to humans; however, for higher levels of human error, the algorithms outsource fewer samples, $|S| < n$, since this is more beneficial for the minimization of the overall error.
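As promised above, the algorithmic-uncertainty triage baseline reduces to a few lines once an SVM has been fit for full automation; a sketch assuming a scikit-learn-style fitted classifier clf (hypothetical variable names).

```python
import numpy as np

def uncertainty_triage(clf, X_test, n):
    """Indices of the n test points closest to the decision boundary, i.e., with
    the largest classification uncertainty 1 / |w^T phi(x) + b|."""
    margins = np.abs(clf.decision_function(X_test))   # |w^T phi(x) + b|
    return np.argsort(margins)[:n]                    # smallest margin = most uncertain
```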
5 Experiments on Real Data

We experiment with three real-world medical datasets and first show that, under human assistance, support vector machines trained using the distorted greedy algorithm as well as its randomized variant outperform those trained for full automation as well as humans operating alone. Then, we investigate the influence that the amount of human error has on the performance of the distorted greedy algorithm.

Figure 4: Number of samples |S| outsourced to humans by (a) the distorted greedy algorithm and (b) its randomized variant (stochastic distorted greedy) against the maximum number of outsourced samples n, for different amounts of human error E[c(x, y)] ∈ {0.2, 0.4, 0.6, 0.8, 1.0}.

Experimental setup. We experiment with three publicly available datasets (Decencière et al. 2014; Hoover, Kouznetsova, and Goldbaum 2000), each of them from a different application in medical diagnosis: (i) Messidor: it consists of |V| = 400 eye images. Each image is assigned a score by one single medical expert, on a four-point scale, which measures the severity of a retinopathy. (ii) Stare: it consists of |V| = 373 retinal images. Each image is assigned a score by one single medical expert, on a five-point scale, which measures the severity of a retinal hemorrhage. (iii) Aptos: it consists of |V| = 705 retinal images. Each image is given a score by one single clinician, on a five-point scale, which measures the severity of diabetic retinopathy.

Following previous work, we first generate a 1000-dimensional feature vector using Resnet (He et al. 2016) for each sample in the Stare dataset and a 4096-dimensional feature vector using VGG16 (Simonyan and Zisserman 2014) for each sample in the Messidor and the Aptos datasets. Then, we use the top 50 features, as identified by PCA, as $x$ in our experiments. Moreover, for each sample, we set $y = -1$ if its severity score $q$ corresponds to one of the two lowest grades of the associated disease and $y = +1$ otherwise. For each sample, we only have access to the score $q$ given by a single human expert. Therefore, we sample the scores given by humans from a categorical distribution $h(x) = \hat{q} \sim \mathrm{Cat}(p_{x,q})$, where $p_{x,q} \sim \mathrm{Dirichlet}(\chi_{x,q})$ are the probabilities of each potential score $\hat{q}$ for a sample $(x, y)$ and $\chi_{x,q}$ is a vector parameter that controls the human accuracy; we then scale these scores so that $h(\cdot) \in [-1, 1]$ and compute the human error as $c(x, y) = \mathbb{E}[(1 - y h(x))_+]$ (a sketch of this sampling scheme appears below). For each sample $(x, y)$, the element of $\chi_{x,q}$ corresponding to the score $\hat{q} = q$ given by the human expert has the highest value. Finally, for each dataset, we use 60% of the samples for training and 40% of the samples for testing, set the value of $\lambda$ using cross-validation under full automation, and follow the procedure described in Section 2 to train a logistic regression model $\pi(d \mid x)$ that decides which samples to outsource to humans at test time.

Figure 5: Performance of all methods (distorted greedy, stochastic distorted greedy, algorithmic uncertainty triage, predicted errors triage, full automation) on the three medical datasets, in terms of misclassification error and F1 score, against $n/|V| \in [0, 0.2]$; panels (a)-(c) correspond to Messidor, Stare and Aptos. If humans predict all testing samples (No automation), we have that $P(\hat{y} \neq y) = 0.15, 0.14, 0.20$ and F1-score = 0.87, 0.78, 0.8 for the Messidor, Stare and Aptos datasets, respectively.
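The human-score model just described can be sketched as follows; a minimal version in which the concrete concentration values and the grade-to-score scaling are our own illustrative choices, matching the description above but not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_human_scores(q_expert, n_grades=5, chi_hi=10.0, chi_lo=1.0, n_draws=1000):
    """Draw scores q_hat ~ Cat(p) with p ~ Dirichlet(chi), where the coordinate
    of chi at the expert's grade q_expert is the largest, then rescale the
    grades 0, ..., n_grades-1 so that h(x) lies in [-1, 1]."""
    chi = np.full(n_grades, chi_lo)
    chi[q_expert] = chi_hi                       # expert's grade is the most likely one
    p = rng.dirichlet(chi)
    q_hat = rng.choice(n_grades, size=n_draws, p=p)
    return 2.0 * q_hat / (n_grades - 1) - 1.0    # scale scores so h is in [-1, 1]

def expected_human_error(y, scores):
    """c(x, y) = E[(1 - y h(x))_+], estimated by Monte Carlo over sampled scores."""
    return np.maximum(0.0, 1.0 - y * scores).mean()
```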
For the model $\pi$, the logistic regression uses just $|w^*(V \setminus S)^\top \phi(x) + b^*(V \setminus S)|$ as its only feature. The long version of our paper (De et al. 2020b) contains additional details about the generation of human scores, the additional model $\pi$, and the baseline algorithms.

Results. First, we compare the performance of all methods in Figure 5. The results show that the distorted greedy algorithm as well as its randomized variant: (i) outperform the triage baselines at most automation levels; (ii) benefit from human assistance both in cases when humans on their own (No automation) are better than machines on their own (Full automation) and vice versa (Panels (a) and (b) vs Panel (c)); and (iii) perform comparably to each other, sometimes one beating the other and vice versa (refer to Section 3.4 in Harshaw et al. (2019) to better understand the reasons). Here, note that we intentionally experimented with different values of $\chi_{x,q}$ across the different datasets so that the performance of the no automation baseline varies. Moreover, since in our datasets the parameter $\rho^* = \min\{|V^+|, |V^-|\}/|V| \in (0.33, 0.45)$, to satisfy the conditions of Theorem 3 we only experiment with $n/|V| \leq 0.2 < \rho^*$.

Next, we assess the influence that the human error $c(x, y)$ has on the performance of the greedy algorithm. Figure 6 summarizes the results for the Aptos dataset, which show that, as long as the human error is small, the performance of the distorted greedy algorithm improves with the maximum number of outsourced samples n. However, for higher levels of human error, outsourcing samples does not help.

Figure 6: Performance of the stochastic distorted greedy algorithm, in terms of (a) misclassification error and (b) F1 score, against the maximum number of outsourced samples n, for different levels of human error E[c(x, y)] ∈ {0.01, 0.03, 0.05, 0.12, 0.23} on the Aptos dataset.

6 Conclusions

In this paper, we have shown that, for support vector machines, we can solve the problem of classification under human assistance using algorithms with approximation guarantees. Moreover, we have further shown that, under human assistance, support vector machines trained to operate under different automation levels can provide superior empirical performance than those trained for full automation. Our work also opens many interesting avenues for future work. For example, we have assumed that the human error is known; however, in practice, the spectrum of human abilities spans a broad range. It would be interesting to develop algorithms that, over time, adapt to the particular human(s) they are dealing with. Moreover, we have assumed that the human annotations are independent; however, there exist scenarios in which annotations are correlated, e.g., when a single user sequentially reviews a set of items in a session. It would be interesting to lift the independence assumption and design algorithms that account for the correlation between annotations. It would also be valuable to find tighter lower bounds on the parameter γ, which better characterize the good empirical performance.
Finally, it would be worthwhile to extend our analysis to other convex margin-based classifiers as well as to design machine learning algorithms operating under different automation levels in sequential decision-making tasks, e.g., semi-autonomous driving (Meresht et al. 2020).

Broader Impact

There are several technical, societal, and legal challenges that prevent the use of fully autonomous machine learning systems for high-stake decision making in safety-critical applications such as medical diagnosis. In our work, we argue that it is possible to address such challenges by letting humans and machines assist and complement each other. Moreover, our theoretical and experimental results suggest that humans and machines working together may achieve better performance than what they would achieve on their own.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 945719).

References

Bartlett, P.; and Wegkamp, M. 2008. Classification with a reject option using a hinge loss. JMLR 9(Aug): 1823-1840.
Bennett, K. P.; and Bredensteiner, E. J. 2000. Duality and geometry in SVM classifiers. In ICML, volume 2000, 57-64.
Bian, A. A.; Buhmann, J. M.; Krause, A.; and Tschiatschek, S. 2017. Guarantees for greedy maximization of non-submodular functions with applications. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 498-507.
Bordt, S.; and von Luxburg, U. 2020. When Humans and Machines Make Joint Decisions: A Non-Symmetric Bandit Model. arXiv preprint arXiv:2007.04800.
Bousquet, O.; and Elisseeff, A. 2002. Stability and generalization. Journal of Machine Learning Research 2(Mar): 499-526.
Chen, X.; and Price, E. 2017. Active regression via linear-sample sparsification. arXiv preprint arXiv:1711.10051.
Cohn, D.; Ghahramani, Z.; and Jordan, M. 1995. Active learning with statistical models. In NeurIPS, 705-712.
Cortes, C.; De Salvo, G.; and Mohri, M. 2016. Learning with rejection. In ALT, 67-82. Springer.
Das, A.; and Kempe, D. 2018. Approximate submodularity and its applications: subset selection, sparse approximation and dictionary selection. The Journal of Machine Learning Research 19(1): 74-107.
De, A.; Koley, P.; Ganguly, N.; and Gomez-Rodriguez, M. 2020a. Regression under human assistance. In AAAI.
De, A.; Okati, N.; Zarezade, A.; and Gomez-Rodriguez, M. 2020b. Classification Under Human Assistance. arXiv preprint arXiv:2006.11845.
Decencière, E.; Zhang, X.; Cazuguel, G.; Lay, B.; Cochener, B.; Trone, C.; Gain, P.; Ordonez, R.; Massin, P.; Erginay, A.; Charton, B.; and Klein, J.-C. 2014. Feedback on a publicly distributed database: the Messidor database. Image Analysis & Stereology 33(3): 231-234.
El-Yaniv, R.; et al. 2010. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research 11(5).
Gatmiry, K.; and Gomez-Rodriguez, M. 2019. Non-submodular Function Maximization subject to a Matroid Constraint, with Applications. arXiv preprint arXiv:1811.07863.
Geifman, Y.; and El-Yaniv, R. 2019. SelectiveNet: A Deep Neural Network with an Integrated Reject Option. arXiv preprint arXiv:1901.09192.
Geifman, Y.; Uziel, G.; and El-Yaniv, R. 2018. Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers.
Ghosh, A.; Tschiatschek, S.; Mahdavi, H.; and Singla, A. 2019. Towards deployment of robust AI agents for human-machine partnerships. In AAMAS.
Graves, A.; Abdel-Rahman, M.; and Hinton, G. E.
2013. Speech recognition with deep recurrent neural networks. In ICASSP.
Grover, A.; Al-Shedivat, M.; Gupta, J. K.; Burda, Y.; and Edwards, H. 2018. Learning policy representations in multiagent systems. In ICML.
Guo, Y.; and Schuurmans, D. 2008. Discriminative batch mode active learning. In NeurIPS, 593-600.
Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909-3917.
Harshaw, C.; Feldman, M.; Ward, J.; and Karbasi, A. 2019. Submodular Maximization Beyond Non-negativity: Guarantees, Fast Algorithms, and Applications. In ICML.
Hashemi, A.; Ghasemi, M.; Vikalo, H.; and Topcu, U. 2019. Submodular Observation Selection and Information Gathering for Quadratic Models. arXiv preprint arXiv:1905.09919.
Haug, L.; Tschiatschek, S.; and Singla, A. 2018. Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, 8464-8473.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hoi, S.; Jin, R.; Zhu, J.; and Lyu, M. R. 2006. Batch mode active learning and its application to medical image classification. In ICML.
Hoover, A.; Kouznetsova, V.; and Goldbaum, M. 2000. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical Imaging 19(3): 203-210.
Hsu, Y.-C.; Shen, Y.; Jin, H.; and Kira, Z. 2020. Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10951-10960.
Hush, D.; Scovel, C.; and Steinwart, I. 2007. Stability of unstable learning algorithms. Machine Learning 67(3): 197-206.
Kamalaruban, P.; Devidze, R.; Cevher, V.; and Singla, A. 2019. Interactive teaching algorithms for inverse reinforcement learning. In IJCAI.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
Liu, Z.; Wang, Z.; Liang, P. P.; Salakhutdinov, R. R.; Morency, L.-P.; and Ueda, M. 2019. Deep Gamblers: Learning to Abstain with Portfolio Theory. In NeurIPS.
Macindoe, O.; Kaelbling, L. P.; and Lozano-Pérez, T. 2012. POMCoP: Belief space planning for sidekicks in cooperative games. In Eighth Artificial Intelligence and Interactive Digital Entertainment Conference.
Meresht, V. B.; De, A.; Singla, A.; and Gomez-Rodriguez, M. 2020. Learning to Switch Between Machines and Humans. arXiv preprint arXiv:2002.04258.
Mozannar, H.; and Sontag, D. 2020. Consistent Estimators for Learning to Defer to an Expert. arXiv preprint arXiv:2006.01862.
Nikolaidis, S.; Forlizzi, J.; Hsu, D.; Shah, J.; and Srinivasa, S. 2017. Mathematical models of adaptation in human-robot collaboration. arXiv preprint arXiv:1707.02586.
Nikolaidis, S.; Ramakrishnan, R.; Gu, K.; and Shah, J. 2015. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 189-196. IEEE.
Radanovic, G.; Devidze, R.; Parkes, D.; and Singla, A. 2019. Learning to collaborate in Markov decision processes. In ICML.
Raghu, M.; Blumer, K.; Corrado, G.; Kleinberg, J.; Obermeyer, Z.; and Mullainathan, S. 2019. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220.
Ramaswamy, H.; Tewari, A.; and Agarwal, S. 2018.
Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics 12(1): 530-554.
Sabato, S.; and Munos, R. 2014. Active regression by stratification. In NeurIPS, 469-477.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sugiyama, M. 2006. Active learning in approximately linear regression based on conditional expectation of generalization error. JMLR 7(Jan): 141-166.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS.
Thulasidasan, S.; Bhattacharya, T.; Bilmes, J.; Chennupati, G.; and Mohd-Yusof, J. 2019. Combating Label Noise in Deep Learning Using Abstention. arXiv preprint arXiv:1905.10964.
Tschiatschek, S.; Ghosh, A.; Haug, L.; Devidze, R.; and Singla, A. 2019. Learner-aware teaching: Inverse reinforcement learning with preferences and constraints. In Advances in Neural Information Processing Systems, 4147-4157.
Wiener, Y.; and El-Yaniv, R. 2011. Agnostic selective classification. In Advances in Neural Information Processing Systems, 1665-1673.
Wilder, B.; Horvitz, E.; and Kamar, E. 2020. Learning to Complement Humans. arXiv preprint arXiv:2005.00582.
Willett, R.; Nowak, R.; and Castro, R. M. 2006. Faster rates in regression via active learning. In NeurIPS.
Wilson, H. J.; and Daugherty, P. R. 2018. Collaborative intelligence: humans and AI are joining forces. Harvard Business Review 96(4): 114-123.
Ziyin, L.; Chen, B.; Wang, R.; Liang, P. P.; Salakhutdinov, R.; Morency, L.-P.; and Ueda, M. 2020. Learning Not to Learn in the Presence of Noisy Labels. arXiv preprint arXiv:2002.06541.