# Classification with Strategically Withheld Data

Anilesh K. Krishnaswamy,¹ Haoming Li,² David Rein,¹ Hanrui Zhang,¹ Vincent Conitzer¹
¹ Duke University, ² University of Southern California
anilesh@cs.duke.edu, haoming.li@usc.edu, irving.rein@duke.edu, hrzhang@cs.duke.edu, conitzer@cs.duke.edu

Machine learning techniques can be useful in applications such as credit approval and college admission. However, to be classified more favorably in such contexts, an agent may decide to strategically withhold some of her features, such as bad test scores. This is a missing data problem with a twist: which data is missing depends on the chosen classifier, because the specific classifier is what may create the incentive to withhold certain feature values. We address the problem of training classifiers that are robust to this behavior.

We design three classification methods: MINCUT, HILL-CLIMBING (HC) and Incentive-Compatible Logistic Regression (IC-LR). We show that MINCUT is optimal when the true distribution of data is fully known. However, it can produce complex decision boundaries, and hence be prone to overfitting in some cases. Based on a characterization of truthful classifiers (i.e., those that give no incentive to strategically hide features), we devise a simpler alternative called HC, which consists of a hierarchical ensemble of out-of-the-box classifiers, trained using a specialized hill-climbing procedure which we show to be convergent. For several reasons, MINCUT and HC are not effective in utilizing a large number of complementarily informative features. To this end, we present IC-LR, a modification of logistic regression that removes the incentive to strategically drop features. We also show that our algorithms perform well in experiments on real-world data sets, and present insights into their relative performance in different settings.

A version of the paper including the Supplement is available at https://arxiv.org/abs/2012.10203.

## 1 Introduction

Applicants to most colleges in the US are required to submit their scores for at least one of the SAT and the ACT. Both tests are more or less equally popular, with close to two million taking each in 2016 (Adams 2017). Applicants usually take whichever of these two tests works to their advantage.[1] However, given the growing competitiveness of college admissions, many applicants now take both tests and then strategically decide whether to drop one of the scores (if they think it will hurt their application) or report both.[2] The key issue here is that it is impossible to distinguish between an applicant who takes both tests but reports only one, and an applicant who takes only one test, for example because she simply took the one required by her school, the dates for the other test did not work with her schedule, or for other reasons that are not strategic in nature.[3]

Say a college wants to take a principled machine learning approach to making admission decisions based on the scores from these two tests. For simplicity, assume no other information is available.

[1] https://www.princetonreview.com/college/sat-act
[2] https://blog.collegevine.com/should-you-submit-your-satact-scores/
[3] https://blog.prepscholar.com/do-you-need-to-take-both-theact-and-sat
Assume that the college has enough historical examples that contain the scores of individuals (on whichever tests were taken, truthfully reported) along with the corresponding ideal (binary) admission decisions.[4] Based on this data, the college has to choose a decision function that determines which future applicants are accepted. If this function is known to the applicants, they are bound to strategize and use their knowledge of the decision function to decide which scores they report.[4] How can the classifier be trained to handle strategic reporting of scores at prediction time? To see the intricacies of this problem, let us consider a simple example.

[4] We make these assumptions more generally throughout the paper.

Example 1. Say the scores for each of the two tests (SAT and ACT) take one of two values: $h$ (for high) or $l$ (for low). Let $*$ denote a missing value. Then there are eight possible inputs (excluding $(*, *)$, since at least one score is required): $(h, h)$, $(h, l)$, $(l, h)$, $(l, l)$, $(h, *)$, $(*, h)$, $(l, *)$ and $(*, l)$. Assume the natural distribution (without any withholding) over these inputs is known, and so are the conditional probabilities of the label $Y \in \{0, 1\}$, as shown in Table 1.

| $X$ | $(h,h)$ | $(h,l)$ | $(l,h)$ | $(l,l)$ | $(h,*)$ | $(*,h)$ | $(l,*)$ | $(*,l)$ |
|---|---|---|---|---|---|---|---|---|
| $\Pr(X)$ | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 | 1/8 |
| $\Pr(Y=1 \mid X)$ | 0.9 | 0.7 | 0.3 | 0.1 | 0.6 | 0.6 | 0.2 | 0.2 |
| $\Pr(Y=0 \mid X)$ | 0.1 | 0.3 | 0.7 | 0.9 | 0.4 | 0.4 | 0.8 | 0.8 |

Table 1: True distribution of inputs and targets.

Assume $Y = 1$ is the more desirable accept decision. Then, ideally, we would like to predict $\hat{Y} = 1$ whenever $X \in \{(h, h), (h, l), (h, *), (*, h)\}$. However, the strategic reporting of scores at prediction time effectively means, for example, that an input $(*, h)$ cannot be assigned the accept decision $\hat{Y} = 1$ unless the same is done for $(l, h)$ as well; otherwise, someone with $(l, h)$ would simply not report the first test, thereby misreporting $(*, h)$ and being accepted. Taking this into account, the classifier with minimum error is given by $\hat{Y} = 1$ whenever $X \in \{(h, h), (h, l), (h, *)\}$.

There are many other settings where a similar problem arises. Many law schools now allow applicants to choose between the GRE and the traditional LSAT.[5] Recently, as a result of the COVID-19 pandemic, universities have implemented optional pass/fail policies, where students can choose to take some or all of their courses for pass/fail credit, as opposed to a standard letter grade that influences their GPA. They are often able to decide the status after already knowing their performance in the course. For credit scoring, some individuals might not report some of their information, especially if reporting it is not mandated by law (Florez-Lopez 2010).

[5] https://www.ets.org/gre/revised_general/about/law/

The ability of strategic agents to withhold some of their features at prediction time poses a challenge only when the data used to train the classifier has some naturally missing components to begin with. For if not, the principal (e.g., the entity deciding on admissions) can reject all agents that withhold any of their features, thereby forcing them to reveal all features. We focus on how a principal can best train classifiers that are robust even when there is strategic withholding of data by agents. Our methods produce classifiers that eliminate the incentive for agents to withhold data.
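To make the reasoning in Example 1 concrete, the following minimal sketch (an illustration we add here, not code from the paper) brute-forces every accept/reject rule over the eight inputs of Table 1, simulates the applicants' strategic withholding, and recovers the optimal decision rule claimed above.

```python
from itertools import product

# Inputs from Example 1 (Table 1); "*" marks a withheld/missing score.
X = ["hh", "hl", "lh", "ll", "h*", "*h", "l*", "*l"]
p_y1 = {"hh": .9, "hl": .7, "lh": .3, "ll": .1, "h*": .6, "*h": .6, "l*": .2, "*l": .2}

def reports(x):
    """All data points an applicant holding x can report (withhold one score, not both)."""
    opts = {x}
    if "*" not in x:
        opts |= {x[0] + "*", "*" + x[1]}
    return opts

best_acc, best_eff = -1.0, None
for bits in product([0, 1], repeat=len(X)):           # every possible accept/reject rule
    f = dict(zip(X, bits))
    # Effective accept set once each applicant best-responds to the rule.
    eff = {x for x in X if any(f[r] for r in reports(x))}
    acc = sum((p_y1[x] if x in eff else 1 - p_y1[x]) for x in X) / 8
    if acc > best_acc:
        best_acc, best_eff = acc, eff

print(round(best_acc, 3), sorted(best_eff))           # 0.725 ['h*', 'hh', 'hl']
```

The best achievable accuracy under strategic reporting is 0.725, attained exactly when the effectively accepted inputs are $(h,h)$, $(h,l)$ and $(h,*)$, matching the classifier described in Example 1.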
Our contributions. We now describe the key questions we face, and how we answer them. Our model is described formally in Section 2. All proofs are in the Supplement.

If the true input distribution is known, can we compute the optimal classifier? (Section 3) We answer this question in the affirmative by showing that the problem of computing the optimal classifier (Theorem 1) in this setting reduces to the classical min-cut problem (Cormen et al. 2009). This analysis gives us the MINCUT classifier, which can be computed on the empirical distribution, estimated using whatever data is available. However, since it can potentially give complex decision boundaries, it might not generalize well.

Are there simpler classifiers that are robust to strategic withholding of features? (Section 4) We first characterize the structure of classifiers that are truthful, i.e., give no incentive to strategically hide features at prediction time (Theorem 2). Using this characterization, we devise a hill-climbing procedure (HC) to train a hierarchical ensemble of out-of-the-box classifiers and show that the procedure converges (Theorem 4), as long as we have black-box access to an agnostic learning oracle. We also analytically bound the generalization error of HC (Theorem 3). The ensemble of HC can be populated with any of the commonly used classifiers, such as logistic regression, ANNs, etc.

Another truthful classifier we present is a modification of logistic regression. This method, called IC-LR (Incentive-Compatible Logistic Regression), works by encoding all features with positive values and using positive regression coefficients, whereby it is in every agent's best interest to report all features truthfully. IC-LR uses projected gradient descent for its training. The advantage of this method is that it can be directly applied to a large number of features.

How do our methods perform on real data sets? (Section 6) We conduct experiments on several real-world data sets to test the performance of our methods, comparing them to each other, as well as to other methods that handle missing data but ignore the strategic aspect of the problem. We see that our methods perform well overall, and uncover some interesting insights on their relative performance:
1. When the number of features is small, HC is the most reliable across the board.
2. When the number of features is small, and many of them are discrete/categorical (or suitably discretized), MINCUT and IC-LR perform better.
3. If a large number of features must be used, IC-LR gives the best performance, although HC performs reasonably well with some simple feature selection techniques.

Related work. Our work falls broadly in the area of strategic machine learning, wherein a common assumption is that strategic agents can modify their features (i.e., misreport) in certain ways (normally at some cost), either to improve outcomes based on the classifier chosen by the principal (Hardt et al. 2016) or to influence which classifier is chosen in the first place (Dekel, Fischer, and Procaccia 2010). The main challenge in strategic machine learning, as in this paper, is the potential misalignment between the interests of the agents and the principal. Existing results in this line of work (Chen et al. 2018; Kleinberg and Raghavan 2019; Haghtalab et al. 2020), which are mainly theoretical, consider classifiers of a specific form, say linear, and ways of misreporting or modifying features in that context.
Our results are different in that we focus on a specific type of strategic misreporting, namely withholding parts of the data, and devise general methods that are robust to this behavior and that, in addition to having theoretical guarantees, can be tested practically. Some experimental results (Hardt et al. 2016) do exist, but our work is quite different; for instance, we do not need to invent a cost function (as in Hardt et al. (2016)). Another major difference is that we consider generalization in the presence of strategic behavior, while most previous work does not (except for a concurrent paper (Zhang and Conitzer 2021), which studies the sample complexity of PAC learning in the presence of strategic behavior).

Our problem can also be viewed as an instance of automated mechanism design with partial verification (Green and Laffont 1986; Yu 2011; Kephart and Conitzer 2015, 2016), where it is typically assumed that the feature space (usually called the type space in mechanism design) is discrete and has reasonably small cardinality, and that a prior distribution is known over the feature space. In contrast, the feature spaces considered in this paper consist of all possible combinations of potentially continuous feature values. Moreover, the population distribution can only be accessed by observing examples. Thus, common methodologies in automated mechanism design do not suffice for our setting.

A set of theoretical results closely related to ours (in particular, to Theorem 1) are those of Zhang, Cheng, and Conitzer (2019b,a, 2021b) on the problem of distinguishing good agents from bad ones (where each produces a different distribution over a sample space, and the agent can misreport the set of $n$ samples that she has drawn). However, our work is different in that we consider the standard classification problem, we focus more on practical aspects, and we do not rely on full knowledge of the input distribution.

Our work also finds a happy intersection between strategic machine learning and the literature on classification with missing data (Marlin 2008). The problem we study is also connected to adversarial classification (Dalvi et al. 2004; Dekel, Shamir, and Xiao 2010). We discuss these connections in more detail in the Supplement.

## 2 Preliminaries

Model with strategically withheld features: We have an input space $X$, a label space $Y = \{0, 1\}$, and a distribution $D$ over $X \times Y$ which models the population. A classifier $f : X \to Y$ maps a combination of features to a label. Let $F = [k] = \{1, \ldots, k\}$ be the set of features, each of which a data point may or may not have. For $x \in X$, let $x_i$ denote the value of its $i$-th feature ($x_i = *$ if $x$ does not have feature $i \in [k]$). For any $S \subseteq [k]$, define $x|_S$ to be the projection of $x$ onto $S$ (i.e., retain the features in $S$ and drop those not in $S$):
$$(x|_S)_i = \begin{cases} x_i, & \text{if } i \in S, \\ *, & \text{otherwise.} \end{cases}$$

We assume that data can be strategically manipulated at prediction (test) time in the following way: an agent whose true data point is $x$ can report any other data point $x'$ such that $x|_S = x'$ for some $S \subseteq [k]$. We use $\succeq$ to denote the relation between any such pair $x, x'$, that is, $x \succeq x' \iff \exists S \subseteq [k] : x|_S = x'$. Note that $\succeq$ is transitive, i.e., for any $x_1, x_2, x_3 \in X$, $x_1 \succeq x_2$ and $x_2 \succeq x_3$ together imply $x_1 \succeq x_3$.

We assume agents prefer label 1 to 0: in response to a classifier $f$, an agent with data point $x$ will always withhold features[6] to receive label 1 if possible, i.e., the agent will report some $x^* \in \arg\max_{x' : x \succeq x'} f(x')$. Incorporating such strategic behavior into the loss of a classifier $f$, we get
$$\ell_D(f) = \Pr_{(x,y) \sim D}\left[\, y \neq \max_{x' : x \succeq x'} f(x') \,\right].$$
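To illustrate this loss, here is a minimal sketch (our own, not code from the paper) that computes an agent's best response and the resulting empirical strategic loss for an arbitrary classifier; missing feature values are represented by None, and the helper names are ours.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[Optional[float], ...]   # None marks a missing (withheld) feature value

def withheld_versions(x: Point):
    """All reports x' with x >= x': every way of additionally withholding present features.

    Note: this enumeration is exponential in the number of present features; it is only
    meant to illustrate the definition, not to be an efficient best-response oracle.
    """
    present = [i for i, v in enumerate(x) if v is not None]
    for r in range(len(present) + 1):
        for keep in combinations(present, r):
            yield tuple(v if i in keep else None for i, v in enumerate(x))

def strategic_prediction(f: Callable[[Point], int], x: Point) -> int:
    """Label the agent actually obtains by best-responding to f (the max in the loss)."""
    return max(f(xp) for xp in withheld_versions(x))

def strategic_loss(f: Callable[[Point], int], data: Sequence[Tuple[Point, int]]) -> float:
    """Empirical analogue of l_D(f) under strategic withholding."""
    return sum(strategic_prediction(f, x) != y for x, y in data) / len(data)

# Example: a (non-truthful) rule that accepts anyone not reporting a low second feature.
f = lambda x: 1 if x[1] is None or x[1] >= 0.5 else 0
data = [((0.9, 0.2), 1), ((0.1, 0.3), 0), ((0.8, None), 1)]
print(strategic_loss(f, data))   # the second agent hides x[1] and gets accepted: loss 1/3
```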
Truthful classifiers. We will also be interested in truthful classifiers, which provably eliminate incentives for such strategic manipulation. A classifier $f$ is truthful if for any $x, x' \in X$ where $x \succeq x'$, $f(x) \geq f(x')$. In other words, not withholding any features is always an optimal way to respond to a truthful classifier. As a result, the loss of any truthful classifier $f$ in the presence of strategically withheld features has the standard form $\ell_D(f) = \Pr_{(x,y) \sim D}[f(x) \neq y]$.

[6] In practice, $f$ might not be perfectly known, and agents might not be able to best respond. This problem does not arise for our methods, since they are truthful. For other classifiers, their accuracy may go up or down if agents fail to best respond; but the assumption that agents best respond is common in many such contexts.

Note that the so-called revelation principle (which states that, in the presence of strategic behavior, any classifier $f$ is equivalent to some truthful classifier $f'$) holds in this case because the reporting structure is transitive.[7] In other words, we are guaranteed that, for any classifier $f$, there exists a truthful classifier $f'$ such that for any $x \in X$, $\max_{x' : x \succeq x'} f(x') = f'(x)$. Therefore, we focus on truthful classifiers in our model, without loss of generality.

## 3 The MINCUT Classifier

We first present a method for computing an optimal classifier when the input distribution is fully known.[8] Assuming $X$ is finite, our goal is to characterize a classifier $f$ which minimizes the loss $\ell_D(\cdot)$ for a known input distribution $D$. As shorthand, we define, for all $x \in X$:

Definition 1. $D^+(x) := \Pr_{(x',y') \sim D}[x' = x \wedge y' = 1]$ and $D^-(x) := \Pr_{(x',y') \sim D}[x' = x \wedge y' = 0]$.

The basic idea here is simple: partition $X$ into two sides, one labeled 1 and the other 0, where the error accrued for each $x \in X$ is $D^-(x)$ if $x$ is labeled 1, and $D^+(x)$ if $x$ is labeled 0. Such a partition should crucially respect the constraints imposed by the strategic behavior of agents: if $x \succeq x'$, then either $x$ is labeled 1 or $x'$ is labeled 0.

Definition 2. Given $X$ and $D$, let $G(D, X)$ be a directed capacitated graph with vertices $V = X \cup \{s, t\}$, where the edges $E$ and edge capacities $u$ are defined as follows:
- For each $x \in X$, there are edges $(s, x)$ and $(x, t)$ in $E$, with capacities $u(s, x) = D^-(x)$ and $u(x, t) = D^+(x)$.
- For all pairs $x, x' \in X$ such that $x \succeq x'$, there is an edge $(x, x') \in E$ with capacity $u(x, x') = \infty$.

In terms of the graph defined above, computing the optimal classifier $f^*$ we seek is equivalent to finding a minimum $s$-$t$ cut of $G(D, X)$. The intuition is that the edges from $s$ and to $t$ reflect the value gained from labeling an example 0 or 1, respectively; one of the two edges must be cut, reflecting the loss of not assigning the example to the corresponding side. Moreover, if $x \succeq x'$, then the corresponding edge with infinite capacity prevents assigning 0 to $x$ and 1 to $x'$.

Theorem 1. If $(S, \bar{S})$ is a minimum $s$-$t$ cut of $G(D, X)$ (where $S$ is the side containing $s$), then for the classifier $f^*(x) := \mathbb{1}(x \in \bar{S})$, we have $\ell_D(f^*) = \min_f \ell_D(f)$.

[7] More details, including a formal proof, are in the Supplement.
[8] A theoretical companion paper (Zhang, Cheng, and Conitzer 2021a) contains a more general version of the mincut-based algorithm. There, the goal is to compute an optimal classifier with possibly more than 2 outcomes, given perfect knowledge of the entire population distribution. In this paper, we investigate the special case with only 2 outcomes (i.e., accept and reject), but do not assume prior knowledge about the population distribution.
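To make the construction in Definition 2 and Theorem 1 concrete, here is a minimal sketch (ours, not from the paper) that builds $G(D, X)$ for the known distribution of Example 1 and reads the classifier off a minimum $s$-$t$ cut using networkx.

```python
import networkx as nx

# Known distribution from Example 1 (Table 1); "*" marks a withheld score.
X = ["hh", "hl", "lh", "ll", "h*", "*h", "l*", "*l"]
p_y1 = {"hh": .9, "hl": .7, "lh": .3, "ll": .1, "h*": .6, "*h": .6, "l*": .2, "*l": .2}
Dplus = {x: p_y1[x] / 8 for x in X}         # D+(x) = Pr[X = x, Y = 1]
Dminus = {x: (1 - p_y1[x]) / 8 for x in X}  # D-(x) = Pr[X = x, Y = 0]

def dominates(x, xp):
    """x >= x': x' is obtained from x by withholding some features."""
    return x != xp and all(b == "*" or a == b for a, b in zip(x, xp))

G = nx.DiGraph()
for x in X:
    G.add_edge("s", x, capacity=Dminus[x])  # cut if x is labeled 1: cost D-(x)
    G.add_edge(x, "t", capacity=Dplus[x])   # cut if x is labeled 0: cost D+(x)
for x in X:
    for xp in X:
        if dominates(x, xp):
            G.add_edge(x, xp)               # no capacity attribute = infinite capacity

cut_value, (S, S_bar) = nx.minimum_cut(G, "s", "t")
accept = sorted(S_bar - {"t"})              # Theorem 1: label 1 iff x is on t's side
print(round(1 - cut_value, 3), accept)      # 0.725 ['h*', 'hh', 'hl']
```

The cut value equals the loss of the optimal classifier, so the recovered accuracy (0.725) and accept set match the brute-force check after Example 1.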
We note that, consequently, the optimal classifier can be computed in time polynomial in $|X|$. In practice, it is natural to expect that we do not know $D$ exactly, but have a finite number of samples from it. A more practical option is to apply Theorem 1 to the empirical distribution induced by the samples observed, and hope for the classifier computed from that to generalize to the true population distribution $D$.

Implementing MINCUT. Given a set $\hat{X}$ of $m$ i.i.d. samples from $D$, let $\hat{D}$ be the corresponding empirical distribution over $\hat{X}$, and let $\bar{X} := \hat{X} \cup \{x' : x \succeq x', x \in \hat{X}\}$. The MINCUT classifier is then obtained by applying Theorem 1 to $G(\hat{D}, \hat{X})$, and extending it to $\bar{X}$ as and when required. Here, note that MINCUT runs in time $\mathrm{poly}(m)$ (and not $\mathrm{poly}(|X|)$), since $G(\hat{D}, \hat{X})$ has $m$ nodes, and checking whether a test point is in $\bar{X}$ takes $\mathrm{poly}(m)$ time.

In light of traditional wisdom, the smaller $m$ is relative to $|X|$, the larger the generalization error will be. While not attempting a theoretical analysis in this regard, we note that when $X$ is large, the generalization error can be extremely large (see Example 2 in the Supplement). The reason for this is two-fold:
1. MINCUT can give complicated decision boundaries.
2. MINCUT is indecisive on samples not in $\bar{X}$.[9]
Therefore, a suitable discretization of features is sometimes useful (see Section 6). Note that MINCUT is truthful, by virtue of the infinite-capacity edges in Definition 2.

[9] This is more likely to happen when using a large number of features.

## 4 Truthful Classifiers and HILL-CLIMBING

The other drawback of MINCUT, related to the issue of generalization just discussed, is that it can be hard to interpret meaningfully in a practical setting. In this section, we devise a simpler alternative called HILL-CLIMBING. To help introduce this algorithm, we first present a characterization of truthful classifiers in our setting, since we can limit our focus to them without loss of generality (as discussed in Section 2). As shorthand, we use the following definition:

Definition 3 ($F'$-classifier). For a subset of features $F' \subseteq F$, a classifier $f$ is said to be an $F'$-classifier if for all $x \in X$, we have $f(x) = f(x|_{F'})$, and if there exists $i \in F'$ such that $x_i = *$, then $f(x) = 0$.

In other words, an $F'$-classifier depends only on the features in $F'$, rejecting all $x$ for which any of these is empty. We collect many such classifiers into an ensemble as follows:

Definition 4 (MAX ensemble). For a collection of classifiers $C = \{f_j\}$, its MAX ensemble classifier is given by $\mathrm{MAX}_C(\cdot) := \max_j f_j(\cdot)$.

This is equivalent to getting each agent to pick the most favorable classifier from among those in $\{f_j\}$. Now we have the following characterization of truthful classifiers:

Theorem 2. A classifier $f$ is truthful iff $f(\cdot) = \mathrm{MAX}_C(\cdot)$ for a collection of classifiers $C = \{f_j\}$ such that, for some $\{F_j\} \subseteq 2^F$, each $f_j$ is an $F_j$-classifier.

For any truthful classifier $f$, we seek to bound the gap between its population loss $\ell_D(f)$ and its empirical loss on the samples in $\hat{X}$, defined by $\ell_{\hat{X}}(f) := \frac{1}{m} \sum_{i \in [m]} |f(x_i) - y_i|$. Before stating a theorem to this end, we define the following entities: let $H$ be a base hypothesis space over $X$, and let $n \in \{1, \ldots, 2^k\}$ be a parameter. Define $d := d_{VC}(H)$, the VC dimension of $H$, and let $\bar{H}$ be the set of all classifiers that can be written as the MAX ensemble of $n$ classifiers in $H$.

Theorem 3. Let $\hat{X} = \{(x_i, y_i)\}_{i \in [m]}$ be $m$ i.i.d. samples from $D$.
For any $f \in \bar{H}$ and any $\delta > 0$, with probability at least $1 - \delta$, we have
$$\ell_D(f) \leq \ell_{\hat{X}}(f) + O\left(\sqrt{\frac{dn \log(dn) \log m + \log(1/\delta)}{m}}\right).$$

It is easy to see that, for any of the commonly used hypothesis spaces (say, $H$ consists of linear hypotheses), if a truthful classifier $f$ is in $\bar{H}$, then so are the components of the MAX ensemble version of $f$ as in Theorem 2. We have, however, stated Theorem 3 in slightly more general terms.

The HILL-CLIMBING classifier. We now present a hill-climbing approach with provable convergence and generalization guarantees. The HILL-CLIMBING classifier (henceforth HC) is of the same form as given by the characterization of truthful classifiers in Theorem 2.[10] Intuitively, the approach works by considering a hierarchy of classifiers, organized by the features involved. For example, consider a setting with $k = 3$ features. We make a choice as to which classifiers we use: say $f_1$ for input of the form $(x_1, *, *)$, $f_2$ for input of the form $(x_1, x_2, *)$, and $f_3$ for input of the form $(x_1, x_2, x_3)$. Any agent with features 1 and 2 (but not 3), for example, should be able to report both features so as to be classified by $f_2$, or only feature 1 so as to be classified by $f_1$ instead. So in effect, assuming full knowledge of the classifiers, each agent can check all of the classifiers and choose the most favorable one. Without loss of generality, we assume that when a data point does not have all the features required by a classifier, it is automatically rejected.

[10] And, therefore, is truthful, and inherits Theorem 3.

Algorithm 1: HILL-CLIMBING (HC) Classifier

    Input: data set $\hat{X} = \{(x_i, y_i)\}_{i \in [m]}$; $n$ subsets $F_1, F_2, \ldots, F_n$ of $F$.
    Initialize: $t \leftarrow 0$; initial classifiers $\{f^0_1, \ldots, f^0_n\}$.
    while $\Delta > 0$:
        for $i = 1, 2, \ldots, n$:
            $S_i \leftarrow \{(x, y) \in \hat{X} : f^t_j(x|_{F_j}) = 0 \text{ for all } j \neq i\}$
            $f^{t+1}_i \leftarrow \arg\min_{f \in H} \sum_{(x, y) \in S_i} |f(x|_{F_i}) - y|$
        $f^* \leftarrow \mathrm{MAX}_{\{f^{t+1}_1, \ldots, f^{t+1}_n\}}$;  $\ell^t \leftarrow \ell_{\hat{X}}(f^*)$
        $\Delta \leftarrow \ell^{t-1} - \ell^t$;  $t \leftarrow t + 1$
    Return: $f^*$.

In short, HC (defined formally in Algorithm 1) works as follows: first choose a hypothesis space $H$, in order for Theorem 3 to apply. Then select $n$ subsets of $F$ (where $n$ is a parameter), say $F_1, F_2, \ldots, F_n$. For each $F_j$, we learn an $F_j$-classifier, say $f_j$, from among those in $H$. Start by initializing these classifiers to any suitable $\{f^0_1, \ldots, f^0_n\}$. In each iterative step, each of the subclassifiers is updated to minimize the empirical loss on the samples that are rejected by all other classifiers. We next show that such an update procedure always converges. To do so, as far as our theoretical analysis goes, we assume we have black-box access to an agnostic learning oracle (the $\arg\min$ step in Algorithm 1). After convergence, the HC classifier is obtained as the MAX ensemble of these classifiers. The generalization guarantee of Theorem 3 applies directly to the HC classifier.

Theorem 4. Algorithm 1 converges.

Connection with MINCUT. The HC formulation given above can be thought of as a less complicated version of MINCUT: some of the edge constraints are ignored when learning the individual classifiers, and are instead factored in via the MAX function. Say $F_1 \subseteq F_2$. For some $x$, it is possible that $f_1(x|_{F_1}) = 1$ and $f_2(x|_{F_2}) = 0$. In other words, the individual classifiers could violate the MINCUT constraints, in order to learn classification functions that generalize well individually, and also collectively, thanks to the combined HC training procedure.
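The following is a minimal sketch of the training loop in Algorithm 1 (our illustration; scikit-learn's LogisticRegression stands in for the agnostic learning oracle, and names such as HillClimbingEnsemble are ours). Missing feature values are encoded as NaN, and each subclassifier rejects any point that lacks one of its features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class HillClimbingEnsemble:
    """MAX ensemble of F_i-classifiers trained roughly as in Algorithm 1 (sketch)."""

    def __init__(self, subsets, max_iters=50):
        self.subsets = [np.array(S) for S in subsets]   # feature index subsets F_1..F_n
        self.max_iters = max_iters
        self.models = [None] * len(subsets)             # None = reject everything initially

    def _predict_one(self, i, X):
        """F_i-classifier: reject if any feature in F_i is missing, else use the fitted model."""
        Xi = X[:, self.subsets[i]]
        ok = ~np.isnan(Xi).any(axis=1)
        out = np.zeros(len(X), dtype=int)
        if self.models[i] is not None and ok.any():
            out[ok] = self.models[i].predict(Xi[ok])
        return out

    def predict(self, X):
        """MAX over subclassifiers: the label a best-responding agent would obtain."""
        return np.max([self._predict_one(i, X) for i in range(len(self.subsets))], axis=0)

    def fit(self, X, y):
        prev_loss = np.inf
        for _ in range(self.max_iters):
            # Predictions of the current (time-t) subclassifiers, fixed for this sweep.
            preds = [self._predict_one(i, X) for i in range(len(self.subsets))]
            new_models = list(self.models)
            for i in range(len(self.subsets)):
                others = (np.max([p for j, p in enumerate(preds) if j != i], axis=0)
                          if len(preds) > 1 else np.zeros(len(X), dtype=int))
                # S_i: points rejected by every other subclassifier (and usable by f_i).
                Si = (others == 0) & ~np.isnan(X[:, self.subsets[i]]).any(axis=1)
                if Si.any() and len(np.unique(y[Si])) == 2:
                    new_models[i] = LogisticRegression().fit(X[Si][:, self.subsets[i]], y[Si])
            self.models = new_models
            loss = np.mean(self.predict(X) != y)
            if prev_loss - loss <= 0:      # Delta <= 0: no further improvement, stop
                break
            prev_loss = loss
        return self

# Toy usage: 3 features, subsets {1}, {1,2}, {1,2,3}; NaN marks a missing value.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)); X[rng.random((200, 3)) < 0.2] = np.nan
y = (np.nan_to_num(X).sum(axis=1) > 0).astype(int)
clf = HillClimbingEnsemble(subsets=[[0], [0, 1], [0, 1, 2]]).fit(X, y)
print("training accuracy:", np.mean(clf.predict(X) == y))
```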
Implementation. The set of classifiers $\{f_1, f_2, \ldots, f_n\}$ in HC can be populated with any standard out-of-the-box methods, such as logistic regression or neural networks, and this choice can influence the performance of $f^*$. In Section 6, we test HC with a few such options. The assumption of having access to an agnostic learning oracle does not play a crucial role in practice, with standard training methods performing well enough to ensure convergence. Also, HC converges in at most $m$ (the number of training examples) iterations, because in each iteration the number of correctly classified examples increases by at least one. (An iteration may need to train $n$ individual classifiers.) This also means there is no difference between checking whether $\Delta > 0$ or $\Delta \geq 1/m$. In our experiments, we run HC with a stopping threshold of $10^{-4}$ on $\Delta$, and convergence is achieved fairly quickly (see the Supplement for exact details).

Choosing subsets. Note that we are free to choose any $F_1, F_2, \ldots, F_n$ to define HC. Its generalization (via Theorem 3) will depend on the choice of $n$. As more and more subsets of features are included (and as they are further binned based on their values), HC starts behaving more and more like MINCUT. In addition, using a large number of subsets increases the computational complexity of HC. In practice, therefore, the number of subsets must be limited somehow; we find that some simple strategies like the following work reasonably well: (a) selecting a few valuable features and taking all subsets of those features, or (b) taking all subsets whose size is below a small fixed bound (say, 2). In many practical situations, a few features (possibly with their values put in just a few bins) are often enough to get close to optimal accuracy, which also improves interpretability (e.g., see Wang and Rudin (2015) or Jung et al. (2017)). The question of devising a more nuanced algorithm for selecting these subsets merits a separate investigation, and we leave this to future work.

## 5 Incentive-Compatible Logistic Regression

As we just mentioned, it is challenging to directly apply HC and MINCUT to a large number of features. As we will see, we can address this challenge in various ways and still get very strong performance with HC. Moreover, HC enjoys remarkable generality, as well as generalization and convergence guarantees. Nevertheless, we would like to have an algorithm that tries to make use of all the available features, while still being truthful. In this section, we present such an approach, which, as we show later in Section 6, indeed performs comparably to, and in some cases better than, MINCUT and HC.

Below we present a simple and truthful learning algorithm, Incentive-Compatible Logistic Regression (IC-LR), which is a truthful variant of classical gradient-based algorithms for logistic regression. Recall that in logistic regression, the goal is to learn a set of coefficients $\{\beta_i\}$, one for each feature $i \in F$, as well as an intercept $\beta_0$, such that for each data point $(x, y)$, the predicted label
$$\hat{y} := \mathbb{1}\left[\sigma\left(\beta_0 + \sum_{i \in F} x_i \beta_i\right) \geq 0.5\right]$$
fits $y$ as well as possible, where $\sigma(t) = 1/(1 + e^{-t})$ is the logistic function.

Roughly speaking, IC-LR (formally defined in Algorithm 2) works by restricting the coefficients $\{\beta_i\}$ in such a way that dropping a feature (i.e., setting $x_i$ to 0) can never make the predicted label larger. If, without loss of generality, all feature values $x_i$ are nonnegative (or suitably translated so that they are), then this is equivalent to requiring that, for each feature $i \in F$, the coefficient $\beta_i \geq 0$. IC-LR enforces this nonnegativity constraint throughout the training procedure by requiring a projection step after each gradient step, which projects the coefficients back to the feasible nonnegative region by setting any negative coefficient to 0 (equivalently, an $\ell_1$ projection).

Algorithm 2: Incentive-Compatible Logistic Regression (IC-LR)

    Input: data set $\hat{X} = \{(x, y)\}$; learning rates $\{\eta_t\}$; threshold $\delta \geq 0$.
    Initialize: $t \leftarrow 0$; coefficients $\{\beta_0, \beta_1, \ldots, \beta_k\}$.
    while $\Delta > \delta$:
        $g_i \leftarrow 0$ for all $i \in \{0, 1, \ldots, k\}$
        for $(x, y) \in \hat{X}$:
            $g_0 \leftarrow g_0 + \sigma\!\left(\beta_0 + \sum_{j \in F} x_j \beta_j\right) - y$
            for $i \in F$:
                $g_i \leftarrow g_i + \left(\sigma\!\left(\beta_0 + \sum_{j \in F} x_j \beta_j\right) - y\right) x_i$
        for all $i \in \{0, 1, \ldots, k\}$: $\beta_i \leftarrow \max\{\beta_i - \eta_t g_i, 0\}$
        $f^*(x) := \mathbb{1}\left[\sigma\left(\beta_0 + \sum_{i \in F} \beta_i x_i\right) \geq 0.5\right]$
        $\ell^t \leftarrow \ell_{\hat{X}}(f^*)$;  $\Delta \leftarrow \ell^{t-1} - \ell^t$;  $t \leftarrow t + 1$
    Return: $f^*$.
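Here is a minimal numpy sketch of projected gradient descent for IC-LR (our illustration, not the paper's reference implementation; for simplicity it runs a fixed number of steps and clips only the feature coefficients, which is what truthfulness requires). It assumes feature values are nonnegative and that a withheld or missing feature is encoded as 0.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_ic_lr(X, y, lr=0.1, n_steps=2000):
    """Projected gradient descent for IC-LR (sketch).

    Assumes feature values are nonnegative and a withheld/missing feature is encoded as 0,
    so keeping every feature coefficient >= 0 removes any incentive to drop features.
    """
    n, k = X.shape
    beta0, beta = 0.0, np.zeros(k)
    for _ in range(n_steps):
        p = sigmoid(beta0 + X @ beta)               # predicted probabilities
        g0 = np.sum(p - y)                          # logistic-loss gradient w.r.t. beta0
        g = X.T @ (p - y)                           # gradient w.r.t. the feature coefficients
        beta0 -= lr * g0 / n
        beta = np.maximum(beta - lr * g / n, 0.0)   # projection onto the nonnegative orthant
    return beta0, beta

def predict_ic_lr(beta0, beta, X):
    return (sigmoid(beta0 + X @ beta) >= 0.5).astype(int)

# Toy usage: nonnegative features, 0 = withheld. Dropping a feature can only lower the score.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 5)); X[rng.random(X.shape) < 0.2] = 0.0
y = (X[:, 0] + 2 * X[:, 1] > 1.2).astype(int)
beta0, beta = fit_ic_lr(X, y)
print("train accuracy:", np.mean(predict_ic_lr(beta0, beta, X) == y), "min coef:", beta.min())
```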
One potential issue with IC-LR is the following: if a certain feature $x_i \geq 0$ is negatively correlated with the positive classification label, then IC-LR is forced to ignore it (since it is constrained to use positive coefficients). To make good use of such a feature, we can include an inverted copy $x'_i = \lambda - x_i$ (where $\lambda$ is chosen such that $x'_i \geq 0$). We could also choose an apt discretization of such features (using cross-validation) and translate the discretized bins into separate binary variables. Such a discretization can account for more complex forms of correlation, e.g., a certain feature's being too high or too low making the positive label likelier. In practice, we find that the latter method does better. If such transformations are undesirable, perhaps for reasons of complexity or interpretability, HC methods are a safer bet.

## 6 Evaluation

In this section, we show that, when strategic withholding is at play, MINCUT, HC and IC-LR perform well and provide a significant advantage over several out-of-the-box counterparts (that do not account for strategic behavior).

Datasets. Four credit approval datasets are obtained from the UCI repository (Dua and Graff 2017), one each from Australia, Germany, Poland and Taiwan. As is common for credit approval datasets, they are imbalanced to various degrees. In order to demonstrate the performance of classifiers in a standard, controlled setting, we balance them by random undersampling. There is a dedicated community (Chawla, Japkowicz, and Kotcz 2004) that looks at the issue of imbalanced learning. We do not delve into these issues in our paper, and evaluate our methods on both balanced and imbalanced datasets (see the Supplement for the latter). In addition, to demonstrate the challenge that high-dimensional data imposes on some of the classification methods, the experiments are run on the datasets (a) restricted to 4 features,[11] and (b) with all available features. The basic characteristics of the datasets are summarized in Table 2; note that there is enough variation in terms of the types of features present. We then randomly remove a fraction $\epsilon = 0, 0.1, \ldots, 0.5$ of all feature values in each dataset to simulate data that is missing naturally, i.e., not due to strategic withholding.

[11] According to ANOVA F-value, evaluated before dropping any feature values.

| Data set | Size | Total # of features | Size after balancing | Features after restriction |
|---|---|---|---|---|
| Australia | 690 | 15 | 614 | 2 num., 2 cat. |
| Germany | 1,000 | 20 | 600 | 1 num., 3 cat. |
| Poland | 5,910 | 64 | 820 | 4 num. |
| Taiwan | 30,000 | 23 | 13,272 | 4 ordinal |

Table 2: Data set summary statistics (num. = numerical, cat. = categorical).
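For concreteness, a short sketch of the data preparation described above (our illustration; it assumes each dataset has been loaded into a pandas DataFrame with a binary label column):

```python
import numpy as np
import pandas as pd

def balance_by_undersampling(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Randomly undersample the majority class so both classes have equal size."""
    groups = [g for _, g in df.groupby(label)]
    m = min(len(g) for g in groups)
    return pd.concat([g.sample(m, random_state=seed) for g in groups]).sample(
        frac=1, random_state=seed)

def mask_features(df: pd.DataFrame, label: str, eps: float, seed: int = 0) -> pd.DataFrame:
    """Remove a fraction eps of feature values uniformly at random (naturally missing data)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    feats = [c for c in df.columns if c != label]
    mask = rng.random((len(df), len(feats))) < eps
    out[feats] = out[feats].mask(mask)   # masked entries become NaN
    return out
```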
Testing. We test all methods under two ways of reporting: truthful, i.e., all features are reported as is, and strategic, i.e., some features might be withheld if that leads to a better outcome. We measure the test accuracy of each classifier, averaged over N = 100 runs, with randomness over the undersampling and over the data that is randomly chosen to be missing (to simulate data missing for non-strategic reasons). Other metrics, and details about implementing and training the classifiers, are discussed in the Supplement. It is important to note that, for testing any method, we have to, in effect, compute the best response of each data point toward the classifier. Since the methods we propose are truthful, this is a trivial task. But for other methods, this might not be easy, thereby limiting which baselines can be used.

Classifiers. We evaluate our proposed methods, MINCUT, HC with logistic regression (HC(LR)) or neural networks (HC(ANN)) as subclassifiers, and incentive-compatible logistic regression (IC-LR), against several baseline methods. First, they are compared against three out-of-the-box baseline classifiers: logistic regression (LR), neural networks (ANN) and random forest (RF). We select LR for its popularity in credit approval applications; we select ANN for being the best-performing individual classifier on some credit approval datasets (Lessmann et al. 2015); and we select RF for being the best-performing homogeneous ensemble on some credit approval datasets (Lessmann et al. 2015), as HC can be viewed as a homogeneous ensemble method. For the sake of exposition, we present numbers just for the baselines based on LR, as they perform relatively better.

Second, for purposes of comparison, we include MAJ (predict the majority label if examples with the exact same feature values appeared in the training set, and reject if not), which can be thought of as a non-strategic counterpart of MINCUT. We also include k-nearest neighbors (KNN) as a baseline, since it is closely related to MAJ. These out-of-the-box classifiers need help dealing with missing data, whether it is missing naturally at training and test time or strategically at test time; to this end, we employ (a) IMP: mean/mode imputation (Lessmann et al. 2015), and (b) R-F: reduced-feature modeling (Saar-Tsechansky and Provost 2007), for each of them.

When the dataset has a large number of features, MINCUT and IC-LR can be applied directly. For HC, we assist it in two ways: (a) by selecting 4 features based on the training data, denoted by FS (feature selection),[12] and (b) by choosing a limited number of small subsets (30 with 1 feature and 30 with 2 features), denoted by APP (approximation). Note that since our proposed methods are truthful, we can assume that features are reported as is. However, for all out-of-the-box classifiers except IMP(LR), it is infeasible to simulate strategic withholding of feature values, due to the enormous number of combinations of features. Last but not least, we test all methods with discretization of continuous features (into categorical ones) (Fayyad and Irani 1993), for the reasons given in earlier sections.

[12] Such a technique can be applied to other methods too; the results (see the Supplement) are not very different from those in Table 4.

| Classifier | Australia (Tru.) | Australia (Str.) | Germany (Tru.) | Germany (Str.) | Poland (Tru.) | Poland (Str.) | Taiwan (Tru.) | Taiwan (Str.) |
|---|---|---|---|---|---|---|---|---|
| HC(LR) | .792 | .792 | .639 | .639 | .659 | .659 | .648 | .648 |
| MINCUT | .770 | .770 | .580 | .580 | .501 | .501 | .652 | .652 |
| IC-LR | .788 | .788 | .654 | .654 | .639 | .639 | .499 | .499 |
| IMP(LR) | .796 | .791 | .663 | .580 | .714 | .660 | .670 | .618 |
| R-F(LR) | .808 | .545 | .631 | .508 | .670 | .511 | .665 | .590 |

Table 3: Our methods vs. the rest: mean classifier accuracy for $\epsilon = 0.2$, balanced datasets, 4 features.
| Classifier | Australia (Tru.) | Australia (Str.) | Germany (Tru.) | Germany (Str.) | Poland (Tru.) | Poland (Str.) | Taiwan (Tru.) | Taiwan (Str.) |
|---|---|---|---|---|---|---|---|---|
| HC(LR) w/ disc. | .794 | .794 | .641 | .641 | .692 | .692 | .650 | .650 |
| MINCUT w/ disc. | .789 | .789 | .629 | .629 | .692 | .692 | .649 | .649 |
| IC-LR w/ disc. | .800 | .800 | .651 | .651 | .698 | .698 | .646 | .646 |
| IMP(LR) w/ disc. | .799 | .762 | .652 | .577 | .719 | .631 | .686 | .541 |
| R-F(LR) w/ disc. | .796 | .542 | .633 | .516 | .708 | .522 | .684 | .587 |

Table 4: Our methods vs. the rest: mean classifier accuracy for $\epsilon = 0.2$, balanced datasets, 4 features ("w/ disc." stands for "with discretization of features").

| Classifier | Australia (Tru.) | Australia (Str.) | Germany (Tru.) | Germany (Str.) | Poland (Tru.) | Poland (Str.) | Taiwan (Tru.) | Taiwan (Str.) |
|---|---|---|---|---|---|---|---|---|
| HC-FS(LR) | .795 | .795 | .625 | .625 | .678 | .678 | .648 | .648 |
| HC-APP(LR) | .777 | .777 | .617 | .617 | .658 | .658 | .638 | .638 |
| MINCUT | .496 | .496 | .499 | .499 | .499 | .499 | .499 | .499 |
| IC-LR | .798 | .798 | .654 | .654 | .607 | .607 | .588 | .588 |
| HC-FS(LR) w/ disc. | .794 | .794 | .632 | .632 | .694 | .694 | .649 | .649 |
| HC-APP(LR) w/ disc. | .782 | .782 | .620 | .620 | .724 | .724 | .644 | .644 |
| MINCUT w/ disc. | .534 | .534 | .503 | .503 | .499 | .499 | .550 | .550 |
| IC-LR w/ disc. | .805 | .805 | .653 | .653 | .773 | .773 | .667 | .667 |
| IMP(LR) | .802 | .701 | .663 | .523 | .729 | .507 | .657 | .501 |
| IMP(LR) w/ disc. | .809 | .723 | .659 | .554 | .783 | .503 | .697 | .501 |

Table 5: Our methods vs. the rest: mean classifier accuracy for $\epsilon = 0.2$, balanced datasets, all features.

Results. For want of space, we report results only for $\epsilon = 0.2$. We also limit our exposition of the HC, IMP and R-F methods to those based on logistic regression, as these perform better than their ANN/RF/KNN counterparts. For a comprehensive compilation of all results, along with standard deviation numbers, please refer to the Supplement.

With a small number of features (Table 3): As expected, the out-of-the-box baselines perform well under truthful reporting, but not under strategic reporting. Our methods are robust to strategic withholding, and, in line with the earlier discussion of the potential issues faced by MINCUT and IC-LR (in Sections 3 and 5), we see that (a) HC(LR) performs most consistently, and (b) in some cases, MINCUT (e.g., Poland) and IC-LR (e.g., Taiwan) do not do well.

With discretization (Table 4): As expected, discretization of numerical features into binary categories improves the performance of MINCUT and IC-LR, for reasons explained in Sections 3 and 5, respectively. We also see some benefit from discretization for HC(LR) when the features are mostly continuous (e.g., Poland), and less so when they are already discrete (e.g., Taiwan).

With a large number of features (Table 5): We see broadly similar trends here, except that, in the case with discretization, IC-LR performs much better than before (e.g., Poland). The reason for this is that IC-LR is able to use all the available features once they are discretized into binary categories. However, without discretization, HC methods are more reliable (e.g., Poland and Taiwan).

On the out-of-the-box baselines: Imputation-based methods are sensitive vis-à-vis the mean/mode values used. There is an incentive to drop a certain feature if its imputed value is a positive signal. If there are many such features, then these methods perform poorly, as seen in Table 5 (cf. Table 3, Australia). If the imputed values do not give a clear signal (e.g., when the distribution of each feature value is not skewed), there is high variance in the performance of these methods (see the Supplement). In some cases, the benchmarks perform as well as, or slightly better than, our incentive-compatible classifiers.
For example, in Table 3, for the Australia and Poland data sets, the accuracy of IMP(LR) and that of HC(LR) are within 0.001 of each other. This happens because the imputed values are, in these cases (but not in most of our other cases), negative indicators of the positive label, and therefore there is generally no incentive to strategically drop features. Reduced-feature modeling, despite performing well under truthful reporting, allows too many examples to be accepted under strategic reporting, which hurts its performance. This is especially true for smaller $\epsilon$, as each subclassifier has fewer examples to train on, giving several viable options for strategic withholding. We note here that the variance (in the accuracy achieved) of our methods, since they are robust to strategic withholding, is much smaller than that of the baseline methods (exact numbers are deferred to the Supplement).

## 7 Conclusion

In this paper, we studied the problem of classification when each agent, at prediction time, can strategically withhold some of its features to obtain a more favorable outcome. We devised classification methods (MINCUT, HC and IC-LR) that are robust to this behavior and, in addition, characterized the space of all possible truthful classifiers in our setting. We tested our methods on real-world data sets, showing that they outperform out-of-the-box methods that do not account for the aforementioned strategic behavior. An immediate question that follows is relaxing the assumption of having access to truthful training data: for example, one could ask what the best incentive-compatible classifier is given that the training data consists of best responses to a known classifier $f$; or, one could consider an online learning model where the goal is to bound the overall loss over time. A much broader question for future work is to develop a more general theory of robustness to missing data that naturally includes the case of strategic withholding.

## Acknowledgements

We are thankful for support from NSF under award IIS-1814056.

## Ethics Statement

The methods presented in this paper are geared towards preventing the strategic withholding of data when machine learning methods are used in real-world applications. This will increase the robustness of ML techniques in these contexts: without taking this issue into account, deployment of these techniques will generally result in a rapid change in the distribution of submitted data due to the new incentives faced, causing the techniques to work much more poorly than expected at training time. Thus, there is an AI safety (Amodei et al. 2016) benefit to our work. The lack of strategic withholding also enables the collection of (truthful) quality data. Of course, there can be a downside to this as well if the data is not used responsibly, which could be the case especially if the features that (without our techniques) would have been withheld are sensitive or private in nature.

The other issues to consider in our context are those of transparency and fairness. We assume that the classifier is public knowledge, and therefore agents can appropriately best respond. In practice, this might not be the case; however, agents may learn how to best respond over time if similar decisions are made repeatedly (e.g., in the case of college admissions or loan applications).
While US college admission is often a black box, it need not be; many countries have transparent public criteria for university admissions (e.g., the Indian IIT admission system), and the same is true in many other contexts (e.g., Canadian immigration). Of course, transparency goes hand in hand with interpretability, i.e., the classifier must be easily explainable as well, and there could be a trade-off, in principle, between how easy the classifier is to interpret and the accuracy it can achieve. It is also possible that our methods hurt the chances of those with more missing data (similarly to how immigrants without credit history may not be able to get a credit card). This is to some extent inevitable, because if one could get in without any feature, everyone could get in by dropping all features. Therefore, the issue of fairness might arise in the case where some groups systematically tend to have more missing data.

## References

Adams, C. J. 2017. In Race for Test-Takers, ACT Outscores SAT for Now. URL https://www.edweek.org/ew/articles/2017/05/24/in-race-for-test-takers-act-outscores-sat--for.html.

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P. F.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. arXiv abs/1606.06565.

Chawla, N. V.; Japkowicz, N.; and Kotcz, A. 2004. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6(1): 1–6.

Chen, Y.; Podimata, C.; Procaccia, A. D.; and Shah, N. 2018. Strategyproof linear regression in high dimensions. In Proceedings of the 2018 ACM Conference on Economics and Computation, 9–26.

Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; and Stein, C. 2009. Introduction to Algorithms. MIT Press.

Dalvi, N.; Domingos, P.; Sanghai, S.; and Verma, D. 2004. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 99–108.

Dekel, O.; Fischer, F.; and Procaccia, A. D. 2010. Incentive compatible regression learning. Journal of Computer and System Sciences 76(8): 759–777.

Dekel, O.; Shamir, O.; and Xiao, L. 2010. Learning to classify with missing and corrupted features. Machine Learning 81(2): 149–178.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml (last accessed 07/09/2020).

Fayyad, U. M.; and Irani, K. B. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Bajcsy, R., ed., Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, France, August 28 - September 3, 1993, 1022–1029. Morgan Kaufmann. URL http://ijcai.org/Proceedings/93-2/Papers/022.pdf.

Florez-Lopez, R. 2010. Effects of Missing Data in Credit Risk Scoring. A Comparative Analysis of Methods to Achieve Robustness in the Absence of Sufficient Data. The Journal of the Operational Research Society 61(3): 486–501.

Green, J. R.; and Laffont, J.-J. 1986. Partially verifiable information and mechanism design. The Review of Economic Studies 53(3): 447–456.

Haghtalab, N.; Immorlica, N.; Lucier, B.; and Wang, J. 2020. Maximizing Welfare with Incentive-Aware Evaluation Mechanisms. In 29th International Joint Conference on Artificial Intelligence.

Hardt, M.; Megiddo, N.; Papadimitriou, C.; and Wootters, M. 2016. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 111–122.

Jung, J.; Concannon, C.; Shroff, R.; Goel, S.; and Goldstein, D. G. 2017. Simple rules for complex decisions. Available at SSRN 2919024.
Kephart, A.; and Conitzer, V. 2015. Complexity of mechanism design with signaling costs. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 357–365.

Kephart, A.; and Conitzer, V. 2016. The revelation principle for mechanism design with reporting costs. In Proceedings of the 2016 ACM Conference on Economics and Computation, 85–102.

Kleinberg, J.; and Raghavan, M. 2019. How Do Classifiers Induce Agents to Invest Effort Strategically? In Proceedings of the 2019 ACM Conference on Economics and Computation, 825–844.

Lessmann, S.; Baesens, B.; Seow, H.-V.; and Thomas, L. C. 2015. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research 247(1): 124–136.

Marlin, B. 2008. Missing data problems in machine learning. Ph.D. thesis, Citeseer.

Saar-Tsechansky, M.; and Provost, F. 2007. Handling missing values when applying classification models. Journal of Machine Learning Research 8(Jul): 1623–1657.

Wang, F.; and Rudin, C. 2015. Falling rule lists. In Artificial Intelligence and Statistics, 1013–1022.

Yu, L. 2011. Mechanism design with partial verification and revelation principle. Autonomous Agents and Multi-Agent Systems 22(1): 217–223.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2019a. Distinguishing Distributions When Samples Are Strategically Transformed. In Advances in Neural Information Processing Systems, 3187–3195.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2019b. When samples are strategically selected. In International Conference on Machine Learning, 7345–7353.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2021a. Automated Mechanism Design for Classification with Partial Verification. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2021b. Classification with Few Tests through Self-Selection. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zhang, H.; and Conitzer, V. 2021. Incentive-Aware PAC Learning. In Proceedings of the AAAI Conference on Artificial Intelligence.