# Regression Under Human Assistance

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Abir De (MPI-SWS, ade@mpi-sws.org), Paramita Koley (IIT Kharagpur, paramita.koley@iitkgp.ac.in), Niloy Ganguly (IIT Kharagpur, niloy@cse.iitkgp.ac.in), Manuel Gomez-Rodriguez (MPI-SWS, manuelgr@mpi-sws.org)

Abstract. Decisions are increasingly taken by both humans and machine learning models. However, machine learning models are currently trained for full automation: they are not aware that some of the decisions may still be taken by humans. In this paper, we take a first step towards the development of machine learning models that are optimized to operate under different automation levels. More specifically, we first introduce the problem of ridge regression under human assistance and show that it is NP-hard. Then, we derive an alternative representation of the corresponding objective function as a difference of nondecreasing submodular functions. Building on this representation, we further show that the objective is nondecreasing and satisfies α-submodularity, a recently introduced notion of approximate submodularity. These properties allow a simple and efficient greedy algorithm to enjoy approximation guarantees at solving the problem. Experiments on synthetic and real-world data from two important applications, medical diagnosis and content moderation, demonstrate that the greedy algorithm beats several competitive baselines.

## 1 Introduction

In a wide range of critical applications, societies rely on the judgement of human experts to take consequential decisions, that is, decisions which have significant consequences. Unfortunately, the timeliness and quality of these decisions are often compromised by the large number of decisions to be taken and a shortage of human experts. For example, in certain medical specialties, patients in most countries need to wait for months to be diagnosed by a specialist. In content moderation, online publishers often stop hosting comments sections because their staff is unable to moderate the myriad of comments they receive. In software development, bugs may sometimes be overlooked by software developers who spend long hours on code reviews for large software projects.

In this context, there is a widespread discussion on the possibility of letting machine learning models take decisions in these high-stakes tasks, where they have matched, or even surpassed, the average performance of human experts (Topol 2019; Cheng, Danescu-Niculescu-Mizil, and Leskovec 2015; Pradel and Sen 2018). Currently, these models are mostly trained for full automation: they assume they will take all the decisions. However, their decisions are still worse than those by human experts on some instances, where they make far more errors than average (Raghu et al. 2019a). Motivated by this observation, our goal is to develop machine learning models that are optimized to operate under different automation levels. In other words, these models are optimized to take decisions for a given fraction of the instances and to leave the remaining ones to humans. More specifically, we focus on ridge regression and introduce a novel problem formulation that allows for different automation levels. Based on this formulation, we make the following contributions:

I. We show that the problem is NP-hard.
This is due to its combinatorial nature: for each potential meta-decision about which instances the machine will decide upon, there is an optimal set of parameters for the regression model; however, the meta-decision is also something we seek to optimize.

II. We derive an alternative representation of the objective function as a difference of nondecreasing submodular functions. This representation enables us to use a recent iterative algorithm (Iyer and Bilmes 2012) to solve the problem; however, this algorithm does not enjoy approximation guarantees.

III. Building on the above representation, we further show that the objective function is nondecreasing and satisfies α-submodularity, a notion of approximate submodularity (Gatmiry and Gomez-Rodriguez 2019). These properties allow a simple and efficient greedy algorithm (refer to Algorithm 1) to enjoy approximation guarantees.

Here, we would like to acknowledge that our contributions are just a first step towards designing machine learning models that are optimized to operate under different automation levels. It would be very interesting to extend our work to more sophisticated machine learning models and other machine learning tasks (e.g., classification).

Finally, we experiment with synthetic and real-world data from two important applications, medical diagnosis and content moderation. Our results show that the greedy algorithm beats several competitive algorithms, including the iterative algorithm for the maximization of a difference of submodular functions mentioned above, and is able to identify and outsource to humans those samples where their expertise is required. To facilitate research in this area, we are releasing an open source implementation of our method.[1]

Related work. The work most closely related to ours is by Raghu et al. (2019a), in which a classifier can outsource samples to humans. However, in contrast to our work, their classifier is trained to predict the labels of all samples in the training set, as in full automation, and the proposed algorithm does not enjoy theoretical guarantees. As a result, a natural extension of their algorithm to ridge regression achieves a significantly lower performance than ours, as shown in Figure 4.

There is a rapidly increasing line of work devoted to designing classifiers that are able to defer decisions (Bartlett and Wegkamp 2008; Cortes, De Salvo, and Mohri 2016; Geifman, Uziel, and El-Yaniv 2018; Geifman and El-Yaniv 2019; Raghu et al. 2019b; Ramaswamy et al. 2018; Thulasidasan et al. 2019; Liu et al. 2019). Here, the classifiers learn to defer either by considering the defer action as an additional label value or by training an independent classifier to decide about deferred decisions. However, there are two fundamental differences between this line of work and ours. First, it does not consider that there is a human decision maker, with a human error model, who takes a decision whenever the classifier defers it. Second, the classifiers are trained to predict the labels of all samples in the training set, as in full automation.

Our work is also related to active learning (Cohn, Ghahramani, and Jordan 1995; Hoi et al. 2006; Willett, Nowak, and Castro 2006; Sugiyama 2006; Guo and Schuurmans 2008; Sabato and Munos 2014; Chen and Price 2017; Hashemi et al. 2019), robust linear regression (Bhatia et al. 2017; Suggala et al. 2019; Tsakonas et al. 2014; Wright and Ma 2010) and robust logistic regression (Feng et al. 2014).
In active learning, the goal is to determine which subset of training samples one should label so that a supervised machine learning model, trained on these samples, generalizes well across the entire feature space at test time. In contrast, our trained model only needs to accurately predict samples which are close to the samples assigned to the machine during training, and it relies on humans to predict the remaining samples. In robust linear regression and robust logistic regression, the (implicit) assumption is that a constant fraction of the output variables are corrupted by unbounded noise. There, the goal is to find a consistent estimator of the model parameters which ignores the samples whose output variables are noisy. In contrast, in our work, we do not assume any noise model for the output variables but rather a human error per sample, and we find an estimator of the model parameters that outsources some of the samples to humans.

[1] https://github.com/Networks-Learning/regression-under-assistance

## 2 Problem Formulation

In this section, we formally state the problem of ridge regression under human assistance, where some of the predictions can be outsourced to humans.

Given a set of training samples $\{(x_i, y_i)\}_{i \in \mathcal{V}}$ and a human error per sample $c(x_i, y_i)$, we can outsource a subset $\mathcal{S} \subseteq \mathcal{V}$ of the training samples to humans, with $|\mathcal{S}| \leq n$. Then, ridge regression under human assistance seeks to minimize the overall training error, including that of the outsourced samples, i.e.,
$$\underset{w,\, \mathcal{S}}{\text{minimize}} \;\; \ell(w, \mathcal{S}) \quad \text{subject to} \quad |\mathcal{S}| \leq n, \tag{1}$$
where
$$\ell(w, \mathcal{S}) = \sum_{i \in \mathcal{S}} c(x_i, y_i) + \sum_{j \in \mathcal{S}^c} \left[ (y_j - x_j^\top w)^2 + \lambda \|w\|_2^2 \right],$$
the first term accounts for the human error, the second term accounts for the machine error, and $\lambda$ is a given regularization parameter for the machine. Moreover, if we define $y = [y_1, y_2, \ldots, y_N]^\top$ and $X = [x_1, x_2, \ldots, x_N]$, we can rewrite the above objective function as
$$\sum_{i \in \mathcal{S}} c(x_i, y_i) + (y_{\mathcal{S}^c} - X_{\mathcal{S}^c}^\top w)^\top (y_{\mathcal{S}^c} - X_{\mathcal{S}^c}^\top w) + \lambda\, |\mathcal{S}^c| \, \|w\|_2^2,$$
where $y_{\mathcal{S}^c}$ is the subvector of $y$ indexed by $\mathcal{S}^c$ and $X_{\mathcal{S}^c}$ is the submatrix formed by the columns of $X$ indexed by $\mathcal{S}^c$. Then, whenever $\mathcal{S} \subset \mathcal{V}$, it readily follows that the optimal parameter $w^* = w^*(\mathcal{S})$ is given by
$$w^*(\mathcal{S}) = \left( \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top \right)^{-1} X_{\mathcal{S}^c}\, y_{\mathcal{S}^c}.$$
If we plug the above equation into Eq. 1, we can rewrite the ridge regression problem under human assistance as a set function maximization problem, i.e.,
$$\underset{\mathcal{S}}{\text{maximize}} \;\; -\log \ell(w^*(\mathcal{S}), \mathcal{S}) \quad \text{subject to} \quad |\mathcal{S}| \leq n, \tag{2}$$
where
$$\ell(w^*(\mathcal{S}), \mathcal{S}) = \begin{cases} \sum_{i \in \mathcal{S}} c(x_i, y_i) + y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c} - y_{\mathcal{S}^c}^\top X_{\mathcal{S}^c}^\top \left( \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top \right)^{-1} X_{\mathcal{S}^c}\, y_{\mathcal{S}^c} & \text{if } \mathcal{S} \subset \mathcal{V}, \\ \sum_{i \in \mathcal{S}} c(x_i, y_i) & \text{if } \mathcal{S} = \mathcal{V}. \end{cases}$$

Unfortunately, due to its combinatorial nature, the above problem formulation is difficult to solve, as formalized by the following theorem:

Theorem 1. The problem of ridge regression under human assistance defined in Eq. 1 is NP-hard.

Proof. Consider a particular instance of the problem with $c(x_i, y_i) = 0$ for all $i \in \mathcal{V}$ and $\lambda = 0$. Moreover, assume the response variables $y$ are generated as follows:
$$y = X^\top w^* + b, \tag{3}$$
where $b$ is an $n$-sparse vector which takes non-zero values on at most $n$ corrupted samples and is zero elsewhere. Then, the problem can be viewed as a robust least squares regression (RLSR) problem (Studer et al. 2011), i.e.,
$$\underset{w,\, \mathcal{S}}{\text{minimize}} \; \sum_{i \in \mathcal{S}} (y_i - x_i^\top w)^2 \quad \text{subject to} \quad |\mathcal{S}| = |\mathcal{V}| - n,$$
which has been shown to be NP-hard (Bhatia et al. 2017). This concludes the proof.

However, in the next section, we will show that, perhaps surprisingly, a simple greedy algorithm enjoys approximation guarantees. In the remainder of the paper, to ease the notation, we will write $\ell(\mathcal{S}) = \ell(w^*(\mathcal{S}), \mathcal{S})$.
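For concreteness, the closed-form expressions above translate directly into code. The following NumPy sketch (ours; the function name and data layout are assumptions, not taken from the authors' released implementation) evaluates $\ell(w^*(\mathcal{S}), \mathcal{S})$ for a candidate outsourced set:

```python
import numpy as np

def loss_under_assistance(X, y, c, S, lam):
    """Evaluate l(w*(S), S) of Eq. 2 at the closed-form optimum w*(S).

    X : d x N feature matrix (columns are samples x_i); y : length-N
    responses; c : length-N human errors c(x_i, y_i); S : indices
    outsourced to humans; lam : regularization parameter lambda.
    """
    S = set(S)
    Sc = [i for i in range(len(y)) if i not in S]  # samples left to the machine
    human_term = sum(c[i] for i in S)
    if not Sc:                                     # S = V: everything outsourced
        return human_term
    Xc, yc = X[:, Sc], y[Sc]
    # w*(S) = (lambda |S^c| I + X_Sc X_Sc^T)^{-1} X_Sc y_Sc
    A = lam * len(Sc) * np.eye(X.shape[0]) + Xc @ Xc.T
    w = np.linalg.solve(A, Xc @ yc)
    # Machine term of Eq. 2: y_Sc^T y_Sc - y_Sc^T X_Sc^T w*(S)
    return human_term + yc @ yc - yc @ (Xc.T @ w)
```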
Remarks. Once the model is trained, given a new unlabeled sample $x$, we outsource the sample to a human if the nearest neighbor of $x$ in the set $\mathcal{V}$ belongs to $\mathcal{S}^*$, i.e., $\operatorname{argmin}_{i \in \mathcal{V}} \|x_i - x\| \in \mathcal{S}^*$, where $\mathcal{S}^*$ is the solution to the above maximization problem, and we pass it on to the machine, i.e., $\hat{y} = x^\top w^*(\mathcal{S}^*)$, otherwise. Here, note that, as long as the feature distribution does not change at test time, this procedure guarantees that the fractions of samples outsourced to humans during training and test will be similar. The following proposition formalizes this result:

Proposition 2. Let $\{x_i\}_{i \in \mathcal{V}}$ be a set of training samples, $\{x'_j\}_{j \in \mathcal{V}'}$ a set of test samples, and $n$ and $n'$ the number of training and test samples outsourced to humans, respectively. If $x_i, x'_j \sim P(x)$ for all $i \in \mathcal{V}$, $j \in \mathcal{V}'$, then it holds that $E[n'] / |\mathcal{V}'| = n / |\mathcal{V}|$.

## 3 An Algorithm With Approximation Guarantees

In this section, we first show that the objective function in Eq. 2 can be represented as a difference of nondecreasing submodular functions. Then, we build on this representation to show that the objective function is nondecreasing and satisfies α-submodularity (Gatmiry and Gomez-Rodriguez 2019), a recently introduced notion of approximate submodularity. Finally, we present an efficient greedy algorithm that, due to the α-submodularity of the objective function, enjoys approximation guarantees.

Difference of submodular functions. We start by rewriting the objective function $-\log \ell(\mathcal{S})$ using the following lemma, which states a well-known property of the Schur complement of a block matrix:

Lemma 3. Let $Z = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$. If $D$ is invertible, then $\det(Z) = \det(D)\, \det(A - B D^{-1} C)$.

More specifically, consider $A = \sum_{i \in \mathcal{S}} c(x_i, y_i) + y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c}$, $B = C^\top = y_{\mathcal{S}^c}^\top X_{\mathcal{S}^c}^\top$ and $D = \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top$ in the above lemma. Then, for $\mathcal{S} \subset \mathcal{V}$, it readily follows that
$$\log \ell(\mathcal{S}) = f(\mathcal{S}) - g(\mathcal{S}), \tag{4}$$
where
$$f(\mathcal{S}) = \log \det \begin{pmatrix} \sum_{i \in \mathcal{S}} c(x_i, y_i) + y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c} & y_{\mathcal{S}^c}^\top X_{\mathcal{S}^c}^\top \\ X_{\mathcal{S}^c}\, y_{\mathcal{S}^c} & \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top \end{pmatrix} \quad \text{and} \quad g(\mathcal{S}) = \log \det \left[ \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top \right].$$

In the above, note that, for $\mathcal{S} = \mathcal{V}$, the functions $f$ and $g$ are not defined. As will become clearer later, for $\mathcal{S} = \mathcal{V}$, it will be useful to define their values as follows:
$$f(\mathcal{V}) = \min_{k_1, k_2} \min \Big\{ f(\mathcal{V}\backslash k_1) + f(\mathcal{V}\backslash k_2) - f(\mathcal{V}\backslash \{k_1, k_2\}),\;\; g(\mathcal{V}\backslash k_1) + g(\mathcal{V}\backslash k_2) - g(\mathcal{V}\backslash \{k_1, k_2\}) + \log \sum_{i \in \mathcal{V}} c(x_i, y_i) \Big\},$$
$$g(\mathcal{V}) = f(\mathcal{V}) - \log \sum_{i \in \mathcal{V}} c(x_i, y_i),$$
where note that these values also satisfy Eq. 4.
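To illustrate Lemma 3 and Eq. 4, both log-determinants can be computed from the same bordered matrix. Here is a small sketch (ours) covering only the case $\mathcal{S} \subset \mathcal{V}$, with the corner-case values at $\mathcal{S} = \mathcal{V}$ omitted:

```python
import numpy as np

def f_and_g(X, y, c, S, lam):
    """Return (f(S), g(S)) such that log l(S) = f(S) - g(S), see Eq. 4."""
    Sc = [i for i in range(len(y)) if i not in set(S)]
    Xc, yc = X[:, Sc], y[Sc]
    d = X.shape[0]
    D = lam * len(Sc) * np.eye(d) + Xc @ Xc.T   # bottom-right block
    a = sum(c[i] for i in S) + yc @ yc          # top-left scalar block
    b = (Xc @ yc).reshape(d, 1)                 # bordering column X_Sc y_Sc
    Z = np.block([[np.array([[a]]), b.T],
                  [b, D]])
    f = np.linalg.slogdet(Z)[1]                 # f(S) = log det Z
    g = np.linalg.slogdet(D)[1]                 # g(S) = log det D
    return f, g
```

By Lemma 3, f(S) − g(S) recovers log ℓ(S), which can be checked numerically against the loss_under_assistance sketch from Section 2.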
Next, we show that, under mild technical conditions, the above functions are nonincreasing and satisfy a natural diminishing returns property called submodularity.[2]

[2] A set function $f(\cdot)$ is submodular iff it satisfies $f(\mathcal{S} \cup \{k\}) - f(\mathcal{S}) \geq f(\mathcal{T} \cup \{k\}) - f(\mathcal{T})$ for all $\mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}$ and $k \notin \mathcal{T}$, where $\mathcal{V}$ is the ground set.

Theorem 4. Assume $c(x_k, y_k) \leq \gamma y_k^2$ and $\lambda \geq \frac{\gamma}{1-\gamma} \max_{i \in \mathcal{V}} \|x_i\|_2^2$ with $0 \leq \gamma < 1$. Then, $f$ and $g$ are nonincreasing and submodular.

Proof. We start by showing that $f$ is submodular, i.e., $f(\mathcal{S} \cup k) - f(\mathcal{S}) \geq f(\mathcal{T} \cup k) - f(\mathcal{T})$ for all $\mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}$ and $k \notin \mathcal{T}$. First, define
$$M(\mathcal{S}) = \begin{pmatrix} \sum_{i \in \mathcal{S}} c(x_i, y_i) + y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c} & y_{\mathcal{S}^c}^\top X_{\mathcal{S}^c}^\top \\ X_{\mathcal{S}^c}\, y_{\mathcal{S}^c} & \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top \end{pmatrix}$$
and observe that
$$M(\mathcal{S} \cup k) = M(\mathcal{S}) - \begin{pmatrix} y_k^2 - c(x_k, y_k) & y_k x_k^\top \\ x_k y_k & \lambda I + x_k x_k^\top \end{pmatrix}.$$
Then, it follows from Proposition 10 (refer to Appendix B) that $M(\mathcal{S}) - M(\mathcal{S} \cup k) \succeq 0$. Hence, we have a Cholesky decomposition $M(\mathcal{S}) - M(\mathcal{S} \cup k) = Q_k Q_k^\top$. Similarly, we have that $M(\mathcal{T} \cup k) = M(\mathcal{T}) - Q_k Q_k^\top$, and hence
$$M(\mathcal{T}) = M(\mathcal{S}) - \sum_{i \in \mathcal{T}\backslash\mathcal{S}} Q_i Q_i^\top. \tag{5}$$
Now, for $\mathcal{T} \cup k \subset \mathcal{V}$, a few steps of calculation show that $f(\mathcal{S} \cup k) - f(\mathcal{S}) - f(\mathcal{T} \cup k) + f(\mathcal{T})$ equals
$$\log \frac{\det(I - Q_k^\top M^{-1}(\mathcal{S})\, Q_k)}{\det(I - Q_k^\top M^{-1}(\mathcal{T})\, Q_k)}.$$
Moreover, Eq. 5 indicates that $M(\mathcal{S}) - M(\mathcal{T}) \succeq 0$. Therefore $M^{-1}(\mathcal{T}) \succeq M^{-1}(\mathcal{S})$, and hence $I - Q_k^\top M^{-1}(\mathcal{S}) Q_k \succeq I - Q_k^\top M^{-1}(\mathcal{T}) Q_k$. In addition, we also note that $M(\mathcal{T}) - Q_k Q_k^\top \succ 0$. This, together with Lemma 3, implies that $I - Q_k^\top M^{-1}(\mathcal{T}) Q_k \succ 0$. Hence, due to Proposition 11 (refer to Appendix B), we have $\det(I - Q_k^\top M^{-1}(\mathcal{S}) Q_k) \geq \det(I - Q_k^\top M^{-1}(\mathcal{T}) Q_k)$. Finally, for $\mathcal{T} \cup k = \mathcal{V}$, we have that
$$f(\mathcal{S} \cup k_1) - f(\mathcal{S}) \geq f(\mathcal{V}\backslash k_2) - f(\mathcal{V}\backslash\{k_1, k_2\}) \tag{6}$$
$$\geq f(\mathcal{V}) - f(\mathcal{V}\backslash\{k_1\}), \tag{7}$$
where the first inequality follows from the proof of submodularity for $\mathcal{T} \cup k \subset \mathcal{V}$ and the second inequality comes from the definition of $f(\mathcal{S})$ for $\mathcal{S} = \mathcal{V}$. This concludes the proof of the submodularity of $f$.

Next, we show that $f$ is nonincreasing. First, recall that, for $|\mathcal{S}| < |\mathcal{V}| - 1$, we have that
$$f(\mathcal{S} \cup k) - f(\mathcal{S}) = \log \frac{\det(M(\mathcal{S}) - Q_k Q_k^\top)}{\det(M(\mathcal{S}))}. \tag{8}$$
Then, note that $M(\mathcal{S}) - Q_k Q_k^\top \preceq M(\mathcal{S})$ and $M(\mathcal{S}) - Q_k Q_k^\top \succ 0$. Hence, using Proposition 11 (refer to Appendix B), it follows that $\det(M(\mathcal{S}) - Q_k Q_k^\top) \leq \det(M(\mathcal{S}))$, which proves that $f$ is nonincreasing for $|\mathcal{S}| < |\mathcal{V}| - 1$. Finally, for $|\mathcal{S}| = |\mathcal{V}| - 1$, it readily follows from Eq. 7 that
$$f(\mathcal{V}) - f(\mathcal{V}\backslash\{k_1\}) \leq f(\mathcal{V}\backslash k_2) - f(\mathcal{V}\backslash\{k_1, k_2\}). \tag{9}$$
Now $f(\mathcal{V}\backslash k_2) - f(\mathcal{V}\backslash\{k_1, k_2\}) \leq 0$, since we have proved that $f(\mathcal{S})$ is nonincreasing for $|\mathcal{S}| < |\mathcal{V}| - 1$. This concludes the proof of the monotonicity of $f$. Proceeding similarly, it can be proven that $g$ is also nonincreasing and submodular.

We would like to highlight that the above technical conditions have a natural interpretation: the first condition is satisfied if the human error is not greater than a fraction $\gamma$ of the squared true response variable, and the second condition is satisfied if the regularization parameter is not too small. In our experiments, this result will enable us to use a recent heuristic iterative algorithm for maximizing the difference of submodular functions (Iyer and Bilmes 2012) as a baseline. However, such algorithms do not enjoy approximation guarantees; they only guarantee to monotonically reduce the objective function at every step.

Monotonicity. We start by analyzing the monotonicity of $\log \ell(\mathcal{S})$ whenever $\mathcal{S} = \mathcal{V}\backslash k$, for any $k \in \mathcal{V}$, in the following lemma (proven in Appendix A):

Lemma 5. Assume $c(x_k, y_k) < \gamma y_k^2$ and $\lambda > \frac{\gamma}{1-\gamma} \max_{i \in \mathcal{V}} \|x_i\|_2^2$ with $0 \leq \gamma < 1$. Then, it holds that $\log \ell(\mathcal{V}) - \log \ell(\mathcal{V}\backslash k) < 0$ for all $k \in \mathcal{V}$.

Then, building on the above lemma, we have the following theorem, which shows that $\log \ell(\mathcal{S})$ is a strictly nonincreasing function (proven in Appendix A):

Theorem 6. Assume $c(x_k, y_k) < \gamma y_k^2$ and $\lambda > \frac{\gamma}{1-\gamma} \max_{i \in \mathcal{V}} \|x_i\|_2^2$ with $0 \leq \gamma < 1$. Then, the function $\log \ell(\mathcal{S})$ is strictly nonincreasing, i.e., $\log \ell(\mathcal{S} \cup k) - \log \ell(\mathcal{S}) < 0$ for all $\mathcal{S} \subset \mathcal{V}$ and $k \in \mathcal{V}\backslash\mathcal{S}$.

Finally, note that the above result does not imply that the human error $c(x_k, y_k)$ is always smaller than the machine error $(y_k - x_k^\top w^*(k))^2$, where $w^*(k)$ is the optimal parameter whenever $\mathcal{S}^c = \{k\}$, as formalized by the following proposition (proven in Appendix A):

Proposition 7. Assume $\rho^2 y_k^2 < c(x_k, y_k) < \gamma y_k^2$ and $\frac{\gamma}{1-\gamma} \max_{i \in \mathcal{V}} \|x_i\|_2^2 < \lambda < \frac{\rho}{1-\rho} \max_{i \in \mathcal{V}} \|x_i\|_2^2$ with $\gamma < \rho < \sqrt{\gamma}$ and $0 \leq \gamma < 1$. Then, it holds that $c(x_k, y_k) > (y_k - x_k^\top w^*(k))^2$.

Algorithm 1: Greedy algorithm
Input: Ground set $\mathcal{V}$, set of training samples $\{(x_i, y_i)\}_{i \in \mathcal{V}}$, parameters $n$ and $\lambda$.
Output: Set of outsourced samples $\mathcal{S}$
1: $\mathcal{S} \leftarrow \emptyset$
2: while $|\mathcal{S}| < n$ do
3:   % Find the best sample
4:   $k^* \leftarrow \operatorname{argmax}_{k \in \mathcal{V}\backslash\mathcal{S}} \left[ -\log \ell(\mathcal{S} \cup k) + \log \ell(\mathcal{S}) \right]$
5:   % The sample is outsourced to humans
6:   $\mathcal{S} \leftarrow \mathcal{S} \cup \{k^*\}$
7: end while
8: return $\mathcal{S}$
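In code, Algorithm 1 is only a few lines. A minimal sketch (ours), reusing the hypothetical loss_under_assistance helper from Section 2; a practical implementation would update the matrix inverse across iterations via rank-one (Sherman-Morrison, Lemma 12) updates rather than refitting from scratch:

```python
import numpy as np

def greedy_assistance(X, y, c, n, lam):
    """Algorithm 1: greedily outsource the sample with the highest
    marginal gain -log l(S + k) + log l(S)."""
    S = []
    log_l = np.log(loss_under_assistance(X, y, c, S, lam))
    while len(S) < n:
        # Marginal gain of -log l for every sample still with the machine
        gains = {k: log_l - np.log(loss_under_assistance(X, y, c, S + [k], lam))
                 for k in range(len(y)) if k not in S}
        k_best = max(gains, key=gains.get)   # best sample to outsource
        S.append(k_best)
        log_l -= gains[k_best]               # maintain log l(S) incrementally
    return S
```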
α-submodularity. Given the above results, we are now ready to present and prove our main result, which characterizes the objective function of the optimization problem defined in Eq. 2:

Theorem 8. Assume $c(x_k, y_k) < \gamma y_k^2$ and $\lambda > \frac{\gamma}{1-\gamma} \max_{i \in \mathcal{V}} \|x_i\|_2^2$ with $0 \leq \gamma < 1$, and $\sum_{i \in \mathcal{V}} c(x_i, y_i) \geq 1$.[3] Then, the function $-\log \ell(\mathcal{S})$ is a nondecreasing α-submodular function[4] and the parameter $\alpha$ satisfies
$$\alpha \leq \bar{\alpha} = 1 - \min\left\{ \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{V})}{\max_{k_1, k_2} \left[ f(\mathcal{V}\backslash\{k_1, k_2\}) - f(\mathcal{V}\backslash\{k_1\}) \right]},\;\; \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{V})}{\max_{k} \left[ \log \ell(\mathcal{V}\backslash k) - \log \ell(\mathcal{V}) \right]} \right\}, \tag{10}$$
with
$$\kappa_\ell = \frac{\log\left[ \ell(\emptyset) - \min_k \left( \ell(\mathcal{V}\backslash k) - \ell(\mathcal{V}) \right) \right]}{\log \ell(\emptyset)}.$$

[3] Note that we can always rescale the data to satisfy this last condition.

[4] A function $f(\cdot)$ is α-submodular (Gatmiry and Gomez-Rodriguez 2019) iff it satisfies $f(\mathcal{S} \cup \{k\}) - f(\mathcal{S}) \geq (1 - \alpha)\left[ f(\mathcal{T} \cup \{k\}) - f(\mathcal{T}) \right]$ for all $\mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}$ and $k \notin \mathcal{T}$, where $\mathcal{V}$ is the ground set and $\alpha$ is the generalized curvature (Lehmann, Lehmann, and Nisan 2006; Bogunovic, Zhao, and Cevher 2018; Hashemi et al. 2019).

Proof. Using that $\sum_{i \in \mathcal{V}} c(x_i, y_i) \geq 1$ and that the function $\ell$ is nonincreasing, we can conclude that $1 \leq \ell(\mathcal{V}) < \ell(\mathcal{S})$ for all $\mathcal{S} \subset \mathcal{V}$. Then, it readily follows from the proof of Theorem 6 that
$$1 < \ell(\mathcal{S} \cup k) \leq \ell(\mathcal{S}) - \left( \ell(\mathcal{V}\backslash k) - \ell(\mathcal{V}) \right). \tag{11}$$
Hence, we have
$$\log \ell(\mathcal{S} \cup k) \leq \log\left[ \ell(\mathcal{S}) - \left( \ell(\mathcal{V}\backslash k) - \ell(\mathcal{V}) \right) \right] \overset{(a)}{\leq} \kappa_\ell \log \ell(\mathcal{S}), \tag{12}$$
where inequality (a) follows from Proposition 14 (refer to Appendix B) together with Theorem 6, which implies $\ell_{\max} = \ell(\emptyset)$. Then, we have that
$$1 - \alpha = \min_{k,\, \mathcal{S} \subseteq \mathcal{T} \subset \mathcal{V}} \frac{\log \ell(\mathcal{S}) - \log \ell(\mathcal{S} \cup k)}{\log \ell(\mathcal{T}) - \log \ell(\mathcal{T} \cup k)} \geq \min\left\{ \min_{k,\, \mathcal{S} \subseteq \mathcal{T}:\, |\mathcal{T}| \leq |\mathcal{V}| - 2} \frac{\log \ell(\mathcal{S}) - \log \ell(\mathcal{S} \cup k)}{\log \ell(\mathcal{T}) - \log \ell(\mathcal{T} \cup k)},\;\; \min_{\mathcal{S},\, k} \frac{\log \ell(\mathcal{S}) - \log \ell(\mathcal{S} \cup k)}{\log \ell(\mathcal{V}\backslash k) - \log \ell(\mathcal{V})} \right\}. \tag{13}$$
Next, we bound the first term as follows:
$$\min_{k,\, \mathcal{S} \subseteq \mathcal{T}:\, |\mathcal{T}| \leq |\mathcal{V}| - 2} \frac{\log \ell(\mathcal{S}) - \log \ell(\mathcal{S} \cup k)}{\log \ell(\mathcal{T}) - \log \ell(\mathcal{T} \cup k)} \overset{(a)}{\geq} \min_{k,\, \mathcal{S} \subseteq \mathcal{T}:\, |\mathcal{T}| \leq |\mathcal{V}| - 2} \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{S})}{\log \ell(\mathcal{T}) - \log \ell(\mathcal{T} \cup k)} \overset{(b)}{\geq} \min_{k,\, |\mathcal{T}| \leq |\mathcal{V}| - 2} \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{V})}{\log \ell(\mathcal{T}) - \log \ell(\mathcal{T} \cup k)}$$
$$= \min_{k,\, |\mathcal{T}| \leq |\mathcal{V}| - 2} \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{V})}{f(\mathcal{T}) - f(\mathcal{T} \cup k) - \left( g(\mathcal{T}) - g(\mathcal{T} \cup k) \right)} \overset{(c)}{\geq} \min_{k,\, |\mathcal{T}| \leq |\mathcal{V}| - 2} \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{V})}{f(\mathcal{T}) - f(\mathcal{T} \cup k)} \overset{(d)}{\geq} \frac{(1 - \kappa_\ell)\, \log \ell(\mathcal{V})}{\max_{k_1, k_2} \left[ f(\mathcal{V}\backslash\{k_1, k_2\}) - f(\mathcal{V}\backslash k_1) \right]},$$
where inequality (a) follows from Eq. 12, inequality (b) follows from the monotonicity of $\log \ell(\mathcal{S})$, and inequalities (c) and (d) follow from Theorem 4. Finally, using the monotonicity of $\log \ell(\mathcal{S})$ and Eq. 12 again, the second term in Eq. 13 is always greater than $(1 - \kappa_\ell)\, \log \ell(\mathcal{V}) / \max_k \left[ \log \ell(\mathcal{V}\backslash k) - \log \ell(\mathcal{V}) \right]$, which concludes the proof.

[Figure 1: Solution (w*(S*), S*) provided by our greedy algorithm for a Gaussian and a logistic response variable distribution and different numbers of outsourced samples n = 80, 160, 240, 320 (panels (a)-(d): logistic, features x ~ Unif[-8, 8]; panels (e)-(h): Gaussian, features x ~ Unif[-1, 1]). In all cases, we used d = 1 and σ² = 0.01. For the logistic distribution, as n increases, the greedy algorithm lets the machine focus on the samples where the relationship between the features and the response variable is more linear and outsources the remaining points to humans. For the Gaussian distribution, as n increases, the greedy algorithm outsources samples on the tails of the distribution to humans.]

A greedy algorithm. The greedy algorithm proceeds iteratively and, at each step, it assigns to the humans the sample $(x_k, y_k)$ that provides the highest marginal gain among the set of samples which are currently assigned to the machine. Algorithm 1 summarizes the greedy algorithm.
Since the objective function in Eq. 2 is α-submodular, it readily follows from Theorem 9 in Gatmiry and Gomez-Rodriguez (2019) that the above greedy algorithm enjoys an approximation guarantee. More specifically, we have the following theorem:

Theorem 9. The greedy algorithm returns a set $\hat{\mathcal{S}}$ such that $-\log \ell(\hat{\mathcal{S}}) \geq \left( 1 + 1/(1 - \alpha) \right)^{-1} \mathrm{OPT}$, where $\mathrm{OPT}$ is the optimal value and $\alpha \leq \bar{\alpha}$ with $\bar{\alpha}$ defined in Eq. 10.

In the above, note that, due to Theorem 6, the actual (regularized) loss function is strictly nonincreasing, and thus the greedy algorithm always runs until $|\mathcal{S}| = n$; however, the overall accuracy may be higher for some values of $|\mathcal{S}| < n$, as shown in Figure 6. In the next section, we will demonstrate that, in addition to enjoying the above approximation guarantees, the greedy algorithm performs better in practice than several competitive baselines.

## 4 Experiments on Synthetic Data

In this section, we experiment with a variety of synthetic examples. First, we look into the solution $(w^*(\mathcal{S}^*), \mathcal{S}^*)$ provided by the greedy algorithm. Then, we compare the performance of the greedy algorithm with several competitive baselines. Finally, we investigate how the performance of the greedy algorithm varies with respect to the amount of human error.

Experimental setup. For each sample $(x, y)$, we first generate each dimension of the feature vector $x \in \mathbb{R}^d$ uniformly at random, i.e., $x_i \sim U(-1, 1)$, and then sample the response variable $y$ from either (i) a Gaussian distribution $N(\mathbf{1}^\top x / d, \sigma_1^2)$ or (ii) a logistic distribution $1 / (1 + \exp(-\mathbf{1}^\top x / d))$. Moreover, we sample the associated human error from a Gaussian distribution, i.e., $c(x, y) \sim N(0, \sigma_2^2)$. In each experiment, we use $|\mathcal{V}| = 500$ training samples and we compare the performance of the greedy algorithm with three competitive baselines:

- An iterative heuristic algorithm (DS) for maximizing the difference of submodular functions by Iyer and Bilmes (2012).
- A greedy algorithm (Distorted greedy) for maximizing γ-weakly submodular functions by Harshaw et al. (2019).[5]
- A natural extension of the algorithm (Triage) by Raghu et al. (2019a), originally developed for classification under human assistance, where we first solve the standard ridge regression problem for the entire training set, then map each test sample to its nearest neighbor in the training set, and finally outsource to humans the top n samples sorted in decreasing order of the difference between the machine and human error of the assigned training sample.

[5] Note that any α-submodular function is γ-weakly submodular (Gatmiry and Gomez-Rodriguez 2019).

[Figure 2: Mean squared error (MSE) against the number of outsourced samples n for the proposed greedy algorithm, DS (Iyer and Bilmes 2012), distorted greedy (Harshaw et al. 2019) and Triage (Raghu et al. 2019a) on synthetic data; panel (a): Gaussian, panel (b): logistic. In all cases, we used d = 5 and σ² = 10⁻³. In panel (a), we set λ = 5·10⁻³ and, in panel (b), we set λ = 10⁻³. The greedy algorithm consistently outperforms the baselines across the entire range of automation levels.]
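For reproducibility, the synthetic setup above can be sketched as follows (ours; parameter names, the interpretation of σ₁² and σ₂² as variances, and the absolute value making the sampled human error nonnegative are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_data(num=500, d=5, var1=1e-3, var2=1e-2, response="gaussian"):
    """Features uniform on [-1, 1]^d; response either Gaussian around
    the linear mean 1^T x / d or a logistic function of it."""
    X = rng.uniform(-1.0, 1.0, size=(d, num))   # columns are samples x_i
    mean = X.sum(axis=0) / d                    # 1^T x / d per sample
    if response == "gaussian":
        y = rng.normal(mean, np.sqrt(var1))
    else:
        y = 1.0 / (1.0 + np.exp(-mean))         # logistic response
    # The text draws c(x, y) from N(0, sigma_2^2); we take the absolute
    # value so that the per-sample human error is nonnegative.
    c = np.abs(rng.normal(0.0, np.sqrt(var2), size=num))
    return X, y, c
```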
Results. We first look into the solution $(w^*(\mathcal{S}^*), \mathcal{S}^*)$ provided by the greedy algorithm, both for the Gaussian and the logistic distributions and for a different number of outsourced samples n. Figure 1 summarizes the results, which reveal several interesting insights. For the logistic distribution, as n increases, the greedy algorithm lets the machine focus on the samples where the relationship between the features and the response variable is more linear and outsources the remaining points to humans. For the Gaussian distribution, as n increases, the greedy algorithm outsources samples on the tails of the distribution to humans.

Then, we compare the performance of the greedy algorithm in terms of mean squared error (MSE) on a held-out set against the three competitive baselines. Figure 2 summarizes the results, which show that the greedy algorithm consistently outperforms the baselines across the entire range of automation levels.

Finally, we investigate how the performance of our greedy algorithm varies with respect to the amount of human error. Figure 3 summarizes the results, which show that, for low levels of human error, the overall mean squared error decreases monotonically with respect to the number of outsourced samples. In contrast, for high levels of human error, it is not beneficial to outsource samples to humans.

[Figure 3: Mean squared error (MSE) achieved by the proposed greedy algorithm against the number of outsourced samples n for different levels of human error (σ² = 0.01, 0.02, 0.03, 0.04, 0.05) on synthetic data; panel (a): Gaussian, panel (b): logistic. In all cases, we used d = 5. In panel (a), we set λ = 5·10⁻³ and, in panel (b), we set λ = 10⁻³.]

## 5 Experiments on Real Data

In this section, we experiment with four real-world datasets from two important applications, medical diagnosis and content moderation, and show that the greedy algorithm beats several competitive baselines. Moreover, we also look at the samples that the greedy algorithm outsources to humans and show that, for different distributions of human error, the outsourced samples are those that humans are able to predict more accurately.

Experimental setup. We experiment with one dataset for content moderation and three datasets for medical diagnosis, which are publicly available (Davidson et al. 2017; Decencière et al. 2014; Hoover, Kouznetsova, and Goldbaum 2000). More specifically:

(i) Hatespeech: 1,000 tweets[6] containing words, phrases and lexicons used in hate speech. Each tweet is given several scores by three to five annotators from CrowdFlower, which measure the severity of hate speech.
(ii) Stare-H: 400 retinal images. Each image is given a score by a single expert, on a five-point scale, which measures the severity of a retinal hemorrhage.
(iii) Stare-D: the same set of images as Stare-H. However, in this dataset, each image is given a score by a single expert, on a six-point scale, which measures the severity of the Drusen disease.
(iv) Messidor: 500 eye images. Each image is given a score by a single expert, on a three-point scale, which measures the severity of an edema.
[6] The original Hatespeech dataset consists of 25,000 tweets; however, we report results on a randomly chosen subset of 1,000 tweets because the distorted greedy and DS algorithms did not scale to the original dataset. We found similar results in other random subsets.

[Figure 4: Mean squared error (MSE) against the number of outsourced samples n for the proposed greedy algorithm, DS (Iyer and Bilmes 2012), distorted greedy (Harshaw et al. 2019) and Triage (Raghu et al. 2019a) on the four real-world datasets; panels (a)-(d): Hatespeech, Stare-H, Stare-D and Messidor. In panels (b)-(d), we set the parameter p = 0.90. The greedy algorithm outperforms the baselines across a majority of automation levels. The only exceptions are high automation levels, where the distorted greedy and the Triage algorithms sometimes achieve slightly better performance.]

[Figure 5: An easy (a) and a difficult (b) sample image from the Stare-D dataset. Both images are given a score of severity zero for the Drusen disease, which is characterized by pathological yellow spots. The easy sample does not contain yellow spots and thus it is easy to predict its score. In contrast, the difficult sample contains yellow spots, which are manifested not from Drusen but from diabetic retinopathy, and thus it is challenging to accurately predict its score. As a result, the greedy algorithm decides to outsource the difficult sample to humans, whereas it lets the machine decide about the easy one.]

We first generate a 100-dimensional feature vector using fastText (Joulin et al. 2016) for each sample in the Hatespeech dataset and a 1,000-dimensional feature vector using ResNet (He et al. 2016) for each sample in the Stare-H, Stare-D, and Messidor datasets. Then, we use the top 50 features, as identified by PCA, as $x$ in our experiments. For the image datasets, the response variable $y$ is just the available score given by a single expert, and the human predictions are sampled from a categorical distribution $s \sim \mathrm{Cat}(p_x)$, where $p_x$ contains the probabilities of each potential score value $s$ for a sample with features $x$. More specifically, if the response variable takes values on a $t$-point scale, we consider
$$p_x(k) = \begin{cases} p & \text{if } k = y, \\ \frac{1-p}{2} & \text{if } k \in \{y - 1, y + 1\} \text{ and } 1 < y < t, \\ 1 - p & \text{if } k = y - 1 \text{ and } y = t, \\ 1 - p & \text{if } k = y + 1 \text{ and } y = 1, \end{cases}$$
where $p$ is a parameter that controls the human accuracy.
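The human prediction model above is straightforward to simulate. A sketch (ours) for scores 1, ..., t, together with the induced human error c(x, y) = E[(y − s)²] used below:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_human_score(y, t, p):
    """Draw a human prediction s ~ Cat(p_x): the true score y with
    probability p, and an adjacent score otherwise."""
    if y == 1:                       # lowest score: all leftover mass on y + 1
        scores, probs = [y, y + 1], [p, 1 - p]
    elif y == t:                     # highest score: all leftover mass on y - 1
        scores, probs = [y, y - 1], [p, 1 - p]
    else:                            # interior score: split over both neighbors
        scores, probs = [y, y - 1, y + 1], [p, (1 - p) / 2, (1 - p) / 2]
    return rng.choice(scores, p=probs)

def human_error(p):
    # Every wrong prediction lies exactly one step from y, so
    # c(x, y) = E[(y - s)^2] = 1 - p, independently of y and t.
    return 1.0 - p
```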
For the Hatespeech dataset, the response variable $y$ is the mean of the scores provided by the annotators, and the human predictions are picked uniformly at random from the available individual scores given by each annotator. In each dataset, we compute the human error as $c(x, y) = E[(y - s)^2]$ for each sample $(x, y)$ and set the same value of $\lambda$ across all competing methods. Finally, in each experiment, we use 80% of the samples for training and 20% for testing.

Results. We first compare the performance of the greedy algorithm in terms of mean squared error (MSE) on a held-out set against the same competitive baselines used in the experiments on synthetic data, i.e., DS (Iyer and Bilmes 2012), distorted greedy (Harshaw et al. 2019), and Triage (Raghu et al. 2019a). Figure 4 summarizes the results, which show that the greedy algorithm outperforms the baselines across a majority of automation levels. The only exceptions are high automation levels, where the distorted greedy and the Triage algorithms sometimes achieve slightly better performance.

Next, we look at the samples that the greedy algorithm outsources to humans and those that it leaves to the machine. Intuitively, human assistance should be required for those samples which are difficult for a machine (and easy for a human) to decide about. Figure 5 provides an illustrative example of an easy and a difficult sample image. While both sample images are given a score of severity zero for the Drusen disease, one of them contains yellow spots, which are often a sign of the Drusen disease[7], and is therefore difficult to predict. In this particular case, the greedy algorithm outsourced the difficult sample to humans and let the machine decide about the easy one.

[7] In this particular case, the patient suffered from diabetic retinopathy, which is also characterized by yellow spots.

Does this intuitive assignment happen consistently? To answer this question, we run our greedy algorithm on the Stare-H and Stare-D datasets under different distributions of human error and assess to what extent the greedy algorithm outsources to humans those samples they can predict more accurately. More specifically, we sample the human predictions from a non-uniform categorical distribution under which the human error is low ($c(x, y) = 10^{-4}$) for a fraction $\rho_c$ of the samples and high ($c(x, y) = 0.5$) for the remaining fraction $1 - \rho_c$. Figure 6 shows the performance of the greedy algorithm for different values of $\rho_c$. We observe that, as long as there are samples that humans can predict with low error, the greedy algorithm does outsource them to humans and thus the overall performance improves. However, whenever the fraction of outsourced samples is higher than the fraction of samples with low human error, the performance degrades. This results in a characteristic U-shaped curve.

[Figure 6: Mean squared error (MSE) achieved by the proposed greedy algorithm against the number of outsourced samples n under different distributions of human error (ρ_c = 0.2, 0.4, 0.6, 0.8) on two real-world datasets; panel (a): Stare-H, panel (b): Stare-D.]

## 6 Conclusions

In this paper, we have initiated the development of machine learning models that are optimized to operate under different automation levels. We have focused on ridge regression under human assistance and shown that a simple greedy algorithm is able to find a solution with nontrivial approximation guarantees. Moreover, using both synthetic and real-world data, we have shown that this greedy algorithm beats several competitive baselines and is able to identify and outsource to humans those samples they can predict more accurately.

Our work also opens many interesting avenues for future work. For example, it would be very interesting to advance the development of other, more sophisticated machine learning models, both for regression and classification, under different automation levels. It would be valuable to find tighter bounds on the parameter α, which would better characterize the good empirical performance.
It would be very interesting to study sequential decision making scenarios under human assistance, e.g., autonomous driving under different automation levels. Finally, we have assumed that we can measure the human error for every training sample. It would be interesting to tackle the problem under uncertainty.

## A Proofs

Proof of Proposition 2. Let the feature space be $\mathcal{F}$. Moreover, denote $\mathcal{X} = \{x_i\}_{i \in \mathcal{V}}$ and $\mathcal{X}' = \{x'_j\}_{j \in \mathcal{V}'}$. Then, define
$$H_{x_i} = \bigcap_{k \in \mathcal{V}} \left\{ x \in \mathcal{F} \,:\, \|x_i - x\| \leq \|x_k - x\| \right\}. \tag{14}$$
Hence, the set of test samples which are nearest to $x_i$ is $\mathcal{X}' \cap H_{x_i}$. Since the features in $\mathcal{X}$ and $\mathcal{X}'$ are i.i.d. random variables, the $|\mathcal{X}' \cap H_{x_i}|$ are also i.i.d. random variables for different realizations of $\mathcal{X}$ and $\mathcal{X}'$. Let us define $\vartheta = E[|\mathcal{X}' \cap H_{x_i}|]$. Hence, we have
$$E[n'] = \sum_{i \in \mathcal{S}^*} E[|\mathcal{X}' \cap H_{x_i}|] = n \vartheta \quad \text{and} \quad E[|\mathcal{V}'| - n'] = (|\mathcal{V}| - n)\vartheta,$$
which leads to the required result.

Proof of Lemma 5. By definition, we have that
$$\ell(w^*(\mathcal{V}), \mathcal{V}) = \sum_{i \in \mathcal{V}\backslash k} c(x_i, y_i) + c(x_k, y_k),$$
$$\ell(w^*(\mathcal{V}\backslash k), \mathcal{V}\backslash k) = y_k^2 - y_k^2\, x_k^\top (\lambda I + x_k x_k^\top)^{-1} x_k + \sum_{i \in \mathcal{V}\backslash k} c(x_i, y_i).$$
Moreover, note that it is enough to prove that $\ell(w^*(\mathcal{V}), \mathcal{V}) - \ell(w^*(\mathcal{V}\backslash k), \mathcal{V}\backslash k) < 0$, without the logarithms, to prove the result. Then, we have that
$$\ell(w^*(\mathcal{V}), \mathcal{V}) - \ell(w^*(\mathcal{V}\backslash k), \mathcal{V}\backslash k) = c(x_k, y_k) - y_k^2 + y_k^2\, x_k^\top (\lambda I + x_k x_k^\top)^{-1} x_k \overset{(a)}{=} c(x_k, y_k) - y_k^2 + y_k^2 \frac{x_k^\top x_k}{\lambda + x_k^\top x_k} = c(x_k, y_k) - \frac{\lambda\, y_k^2}{\lambda + x_k^\top x_k} < \gamma y_k^2 - \frac{\lambda\, y_k^2}{\lambda + x_k^\top x_k} \overset{(b)}{<} 0,$$
where equality (a) follows from Lemma 12 and inequality (b) follows from the lower bound on $\lambda$, which implies $\lambda / (\lambda + x_k^\top x_k) > \gamma$.

Proof of Theorem 6. Define $\Lambda_0 = \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top$, $\Lambda_1 = \lambda |\mathcal{S}^c| I + X_{\mathcal{S}^c} X_{\mathcal{S}^c}^\top - \lambda I - x_k x_k^\top$ and $\Theta = \lambda I + x_k x_k^\top$. Moreover, note that $\Lambda_1 = \Lambda_0 - \Theta$ and
$$\Lambda_1^{-1} \overset{(a)}{=} \Lambda_0^{-1} + \left( \Lambda_0 \Theta^{-1} \Lambda_0 - \Lambda_0 \right)^{-1} = \Lambda_0^{-1} + \Omega^{-1},$$
where we define $\Omega = \Lambda_0 \Theta^{-1} \Lambda_0 - \Lambda_0$ and equality (a) follows from Proposition 13. Then, writing $a = X_{\mathcal{S}^c}\, y_{\mathcal{S}^c}$, it follows that
$$\ell(\mathcal{S} \cup k) = \sum_{i \in \mathcal{S}} c(x_i, y_i) + c(x_k, y_k) + y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c} - y_k^2 - (a - y_k x_k)^\top \Lambda_1^{-1} (a - y_k x_k)$$
$$= \sum_{i \in \mathcal{S}} c(x_i, y_i) + c(x_k, y_k) + y_{\mathcal{S}^c}^\top y_{\mathcal{S}^c} - y_k^2 - a^\top \Lambda_0^{-1} a - a^\top \Omega^{-1} a + 2 y_k\, a^\top \Lambda_1^{-1} x_k - y_k^2\, x_k^\top \Lambda_1^{-1} x_k$$
$$= \ell(\mathcal{S}) + c(x_k, y_k) - y_k^2 - \left( a - \Omega \Lambda_1^{-1} y_k x_k \right)^\top \Omega^{-1} \left( a - \Omega \Lambda_1^{-1} y_k x_k \right) - y_k^2\, x_k^\top \left( \Lambda_1^{-1} - \Lambda_1^{-1} \Omega \Lambda_1^{-1} \right) x_k$$
$$\overset{(b)}{\leq} \ell(\mathcal{S}) + c(x_k, y_k) - y_k^2 - y_k^2\, x_k^\top \left( \Lambda_1^{-1} - \Lambda_1^{-1} \Omega \Lambda_1^{-1} \right) x_k \overset{(c)}{=} \ell(\mathcal{S}) + c(x_k, y_k) - y_k^2 + y_k^2\, x_k^\top (\lambda I + x_k x_k^\top)^{-1} x_k \overset{(d)}{=} \ell(\mathcal{S}) + \ell(\mathcal{V}) - \ell(\mathcal{V}\backslash k),$$
where inequality (b) uses that $\Omega^{-1} \succeq 0$, equality (c) follows from the observation
$$\Lambda_1^{-1} - \Lambda_1^{-1} \Omega \Lambda_1^{-1} = (\Lambda_0^{-1} + \Omega^{-1}) - (\Lambda_0^{-1} + \Omega^{-1})\, \Omega\, (\Lambda_0^{-1} + \Omega^{-1}) = -\Lambda_0^{-1} \Omega \Lambda_0^{-1} - \Lambda_0^{-1} = -\Lambda_0^{-1} (\Lambda_0 \Theta^{-1} \Lambda_0 - \Lambda_0) \Lambda_0^{-1} - \Lambda_0^{-1} = -\Theta^{-1},$$
and equality (d) follows from the proof of Lemma 5. Finally, Lemma 5 gives $\ell(\mathcal{V}) - \ell(\mathcal{V}\backslash k) < 0$, and hence $\ell(\mathcal{S} \cup k) < \ell(\mathcal{S})$, which concludes the proof.

Proof of Proposition 7. We have
$$(y_k - x_k^\top w^*(k))^2 = (y_k - x_k^\top w^*(k))^2 + \lambda \|w^*(k)\|_2^2 - \lambda \|w^*(k)\|_2^2 = y_k^2 - y_k^2\, x_k^\top (\lambda I + x_k x_k^\top)^{-1} x_k - \lambda y_k^2\, x_k^\top (\lambda I + x_k x_k^\top)^{-2} x_k$$
$$\overset{(a)}{=} y_k^2 - y_k^2 \frac{x_k^\top x_k}{\lambda + x_k^\top x_k} - \lambda y_k^2 \frac{x_k^\top x_k}{(\lambda + x_k^\top x_k)^2} = \frac{\lambda y_k^2}{\lambda + x_k^\top x_k} - \lambda y_k^2 \frac{x_k^\top x_k}{(\lambda + x_k^\top x_k)^2} = \frac{\lambda^2 y_k^2}{(\lambda + x_k^\top x_k)^2} \overset{(b)}{\leq} \rho^2 y_k^2,$$
where equality (a) follows from Lemma 12 and inequality (b) follows from the assumption $\lambda < \frac{\rho}{1-\rho} \max_{i \in \mathcal{V}} \|x_i\|_2^2$. Finally, since $c(x_k, y_k) > \rho^2 y_k^2$, we can conclude that $c(x_k, y_k) > (y_k - x_k^\top w^*(k))^2$.
## B Auxiliary Lemmas and Propositions

Proposition 10. Assume $c(x_k, y_k) \leq \gamma y_k^2$ and $\lambda \geq \frac{\gamma}{1-\gamma} \|x_k\|_2^2$. Then,
$$\begin{pmatrix} y_k^2 - c(x_k, y_k) & y_k x_k^\top \\ x_k y_k & \lambda I + x_k x_k^\top \end{pmatrix} \succeq 0.$$

Proof. We use the Schur complement property for positive semidefiniteness (Boyd et al. 1994, page 8) on the above matrix. We have
$$\lambda I + x_k x_k^\top - x_k x_k^\top \frac{y_k^2}{y_k^2 - c(x_k, y_k)} \tag{15}$$
$$\succeq \lambda I - \frac{\gamma}{1-\gamma}\, x_k x_k^\top. \tag{16}$$
Given that $x_k x_k^\top$ is a rank-one matrix, it has only one non-zero eigenvalue, which therefore equals $\operatorname{tr}(x_k x_k^\top) = \|x_k\|_2^2$. This, along with the assumed bound on $\lambda$, proves that the matrix in Eq. 15 is positive semidefinite. Then, from the Schur complement method, we have
$$\begin{pmatrix} y_k^2 - c(x_k, y_k) & y_k x_k^\top \\ x_k y_k & \lambda I + x_k x_k^\top \end{pmatrix} \succeq 0.$$

Proposition 11. If $A \succeq B$ and $A$ and $B$ are both positive definite matrices, then $\det(A) \geq \det(B)$.

Proof. Let $A = L L^\top$ be the Cholesky factorization; since $A$ is strictly positive definite, $L$ has an inverse. Then,
$$A \succeq B \;\Longrightarrow\; I - L^{-1} B L^{-\top} \succeq 0 \;\Longrightarrow\; 1 \geq \operatorname{eig}_i(L^{-1} B L^{-\top}) > 0 \text{ for all eigenvalues } \operatorname{eig}_i \;\Longrightarrow\; \prod_i \operatorname{eig}_i(L^{-1} B L^{-\top}) \leq 1 \;\Longrightarrow\; \det(L^{-1} B L^{-\top}) = \frac{\det(B)}{\det(A)} \leq 1,$$
which immediately gives the required result.

Lemma 12 (Sherman and Morrison 1950). Assume $A$ is an invertible matrix. Then, the following equality holds:
$$(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}. \tag{17}$$

Proposition 13. Assume $A$ and $B$ are invertible matrices. Then, the following equality holds:
$$(A - B)^{-1} = A^{-1} + (A B^{-1} A - A)^{-1}. \tag{18}$$

Proof. We observe that $(A B^{-1} A - A) = (A B^{-1} A - A) A^{-1} (A - B) + (A - B)$. Pre-multiplying by $(A B^{-1} A - A)^{-1}$ and post-multiplying by $(A - B)^{-1}$ on both sides gives the result.

Proposition 14. The function $t(x) = \log(x - a) / \log x$ is increasing for $x > a + 1$, since
$$\frac{dt(x)}{dx} = \frac{x \log x - (x - a) \log(x - a)}{x (x - a) (\log x)^2} > 0. \tag{19}$$
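The two matrix identities above are easy to sanity-check numerically; a small script (ours) with random, well-conditioned matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d)) + 10.0 * np.eye(d)  # diagonally dominated, invertible
B = rng.normal(size=(d, d)) + 3.0 * np.eye(d)
u, v = rng.normal(size=d), rng.normal(size=d)
Ainv = np.linalg.inv(A)

# Lemma 12 (Sherman-Morrison), Eq. 17
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1.0 + v @ Ainv @ u)
assert np.allclose(lhs, rhs)

# Proposition 13, Eq. 18
lhs = np.linalg.inv(A - B)
rhs = Ainv + np.linalg.inv(A @ np.linalg.inv(B) @ A - A)
assert np.allclose(lhs, rhs)
```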
## References

Bartlett, P. L., and Wegkamp, M. H. 2008. Classification with a reject option using a hinge loss. JMLR 9(Aug):1823–1840.

Bhatia, K.; Jain, P.; Kamalaruban, P.; and Kar, P. 2017. Consistent robust regression. In NeurIPS, 2110–2119.

Bogunovic, I.; Zhao, J.; and Cevher, V. 2018. Robust maximization of non-submodular objectives. arXiv preprint arXiv:1802.07073.

Boyd, S.; El Ghaoui, L.; Feron, E.; and Balakrishnan, V. 1994. Linear Matrix Inequalities in System and Control Theory, volume 15. SIAM.

Chen, X., and Price, E. 2017. Active regression via linear-sample sparsification. arXiv preprint arXiv:1711.10051.

Cheng, J.; Danescu-Niculescu-Mizil, C.; and Leskovec, J. 2015. Antisocial behavior in online discussion communities. In ICWSM.

Cohn, D. A.; Ghahramani, Z.; and Jordan, M. I. 1995. Active learning with statistical models. In NeurIPS, 705–712.

Cortes, C.; De Salvo, G.; and Mohri, M. 2016. Learning with rejection. In ALT, 67–82. Springer.

Davidson, T.; Warmsley, D.; Macy, M.; and Weber, I. 2017. Automated hate speech detection and the problem of offensive language. In ICWSM, 512–515.

Decencière, E.; Zhang, X.; Cazuguel, G.; Lay, B.; Cochener, B.; Trone, C.; Gain, P.; Ordonez, R.; Massin, P.; Erginay, A.; Charton, B.; and Klein, J.-C. 2014. Feedback on a publicly distributed database: the Messidor database. Image Analysis & Stereology 33(3):231–234.

Feng, J.; Xu, H.; Mannor, S.; and Yan, S. 2014. Robust logistic regression and classification. In NeurIPS, 253–261.

Gatmiry, K., and Gomez-Rodriguez, M. 2019. Non-submodular function maximization subject to a matroid constraint, with applications. arXiv preprint arXiv:1811.07863.

Geifman, Y., and El-Yaniv, R. 2019. SelectiveNet: A deep neural network with an integrated reject option. arXiv preprint arXiv:1901.09192.

Geifman, Y.; Uziel, G.; and El-Yaniv, R. 2018. Bias-reduced uncertainty estimation for deep neural classifiers.

Guo, Y., and Schuurmans, D. 2008. Discriminative batch mode active learning. In NeurIPS, 593–600.

Harshaw, C.; Feldman, M.; Ward, J.; and Karbasi, A. 2019. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. arXiv preprint arXiv:1904.09354.

Hashemi, A.; Ghasemi, M.; Vikalo, H.; and Topcu, U. 2019. Submodular observation selection and information gathering for quadratic models. arXiv preprint arXiv:1905.09919.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

Hoi, S. C.; Jin, R.; Zhu, J.; and Lyu, M. R. 2006. Batch mode active learning and its application to medical image classification. In ICML.

Hoover, A.; Kouznetsova, V.; and Goldbaum, M. 2000. Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical Imaging 19(3):203–210.

Iyer, R., and Bilmes, J. 2012. Algorithms for approximate minimization of the difference between submodular functions, with applications. arXiv preprint arXiv:1207.0560.

Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; and Mikolov, T. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Lehmann, B.; Lehmann, D.; and Nisan, N. 2006. Combinatorial auctions with decreasing marginal utilities. Games and Economic Behavior 55(2):270–296.

Liu, Z.; Wang, Z.; Liang, P. P.; Salakhutdinov, R. R.; Morency, L.-P.; and Ueda, M. 2019. Deep gamblers: Learning to abstain with portfolio theory. In NeurIPS.

Pradel, M., and Sen, K. 2018. DeepBugs: A learning approach to name-based bug detection. Proceedings of the ACM on Programming Languages 2(OOPSLA):147.

Raghu, M.; Blumer, K.; Corrado, G.; Kleinberg, J.; Obermeyer, Z.; and Mullainathan, S. 2019a. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220.

Raghu, M.; Blumer, K.; Sayres, R.; Obermeyer, Z.; Kleinberg, B.; Mullainathan, S.; and Kleinberg, J. 2019b. Direct uncertainty prediction for medical second opinions. In ICML.

Ramaswamy, H. G.; Tewari, A.; and Agarwal, S. 2018. Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics 12(1):530–554.

Sabato, S., and Munos, R. 2014. Active regression by stratification. In NeurIPS, 469–477.

Sherman, J., and Morrison, W. J. 1950. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics 21(1):124–127.

Studer, C.; Kuppinger, P.; Pope, G.; and Bölcskei, H. 2011. Recovery of sparsely corrupted signals. IEEE Transactions on Information Theory 58(5):3115–3130.

Suggala, A. S.; Bhatia, K.; Ravikumar, P.; and Jain, P. 2019. Adaptive hard thresholding for near-optimal consistent robust regression. arXiv preprint arXiv:1903.08192.

Sugiyama, M. 2006. Active learning in approximately linear regression based on conditional expectation of generalization error. JMLR 7(Jan):141–166.

Thulasidasan, S.; Bhattacharya, T.; Bilmes, J.; Chennupati, G.; and Mohd-Yusof, J. 2019. Combating label noise in deep learning using abstention. arXiv preprint arXiv:1905.10964.

Topol, E. J. 2019. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25(1):44.

Tsakonas, E.; Jaldén, J.; Sidiropoulos, N. D.; and Ottersten, B. 2014. Convergence of the Huber regression M-estimate in the presence of dense outliers. IEEE Signal Processing Letters 21(10):1211–1214.

Willett, R.; Nowak, R.; and Castro, R. M. 2006. Faster rates in regression via active learning. In NeurIPS.

Wright, J., and Ma, Y. 2010. Dense error correction via ℓ1-minimization. IEEE Transactions on Information Theory 56(7):3540–3560.