# Maximizing Welfare with Incentive-Aware Evaluation Mechanisms

Nika Haghtalab¹, Nicole Immorlica², Brendan Lucier² and Jack Z. Wang¹
¹Cornell University  ²Microsoft Research
{nika,jackzwang}@cs.cornell.edu, {nicimm,brlucier}@microsoft.com

Abstract

Motivated by applications such as college admission and insurance rate determination, we propose an evaluation problem where the inputs are controlled by strategic individuals who can modify their features at a cost. A learner can only partially observe the features, and aims to classify individuals with respect to a quality score. The goal is to design an evaluation mechanism that maximizes the overall quality score, i.e., welfare, in the population, taking any strategic updating into account. We further study the algorithmic aspects of finding the welfare-maximizing evaluation mechanism under two specific settings in our model. When scores are linear and mechanisms use linear scoring rules on the observable features, we show that the optimal evaluation mechanism is an appropriate projection of the quality score. When mechanisms must use linear thresholds, we design a polynomial time algorithm with a (1/4)-approximation guarantee when the underlying feature distribution is sufficiently smooth and admits an oracle for finding dense regions. We extend our results to settings where the prior distribution is unknown and must be learned from samples.

1 Introduction

Automatic decision making is increasingly used to identify qualified individuals in areas such as education, hiring, and public health. This has inspired a line of work aimed at improving the performance and interpretability of classifiers for identifying qualification and excellence within a society given access to limited visible attributes of individuals. As these classifiers become widely deployed at a societal level, they can take on the additional role of defining excellence and qualification. That is, classifiers encourage people who seek to be identified as qualified to acquire attributes that are deemed to be qualified by the classifier. For example, a college admission policy that heavily relies on SAT scores will naturally encourage students to increase their SAT scores, which they might do by getting a better understanding of the core material, taking SAT prep lessons, cheating, etc.

Progress in machine learning has not fully leveraged the classifier's role in incentivizing people to change their feature attributes, and at times has even considered it an inconvenience to the designer, who must now take steps to ensure that their classifier cannot be gamed [Meir et al., 2012; Chen et al., 2018; Hardt et al., 2016; Dekel et al., 2010; Cai et al., 2015]. One of the motivations for such strategyproof classification is Goodhart's law, which states that "when a measure becomes a target, it ceases to be a good measure." Taking Goodhart's law to heart, one might view an individual's attributes as immutable, so that any strategic response to a classifier (changes in one's attributes) only serves to mask an agent's true qualification and thereby degrade the usefulness of the classifier. What this narrative does not address is that in many applications of classification, one's qualifications can truly be improved by changing one's attributes. For example, students who improve their understanding of core material truly become more qualified for college education.
These changes have the potential to raise the overall level of qualification and excellence in a society and should be encouraged by a social planner. In this work, we focus on this powerful and under-utilized role of classification in machine learning and ask how to design an evaluation mechanism on visible features that incentivizes individuals to improve a desired quality. For instance, in college admissions, the planner might wish to maximize the quality of candidates. Quality is a function of many features, such as persistence, creativity, GPA, and past achievements, only a few of which may be directly observable by the planner. Nevertheless, the planner designs an admission test on visible features to identify qualified individuals. To pass the test, ideally candidates improve their features and truly increase their quality as a result. In another motivating example, consider a designer who wants to increase average driver safety, which can depend on many features detailing every aspect of a driver's habits. The designer may only have access to a set of visible features such as age, driver training, or even driving statistics like acceleration/deceleration speed (as recorded by cars' GPS devices). Then a scoring rule (that can affect a driver's insurance premium) attempts to estimate and abstract a notion of safety from these features. Drivers naturally adjust their behavior to maximize their score. In both cases, the mechanism does not just pick out high quality candidates or safe drivers in the underlying population; it actually causes their distributions to change.

Modeling and Contributions. Motivated by these observations, we introduce the following general problem. Agents are described by feature vectors in a high-dimensional feature space, and can change their innate features at a cost. There is a true function mapping feature vectors to a true quality score (binary or real-valued). The planner observes a low-dimensional projection of agents' features and chooses an evaluation mechanism, from a fixed class of mechanisms, which maps this low-dimensional space to observed scores (binary or real-valued). Agents get value from having a high observed score (e.g., getting admitted to university or having a low car insurance premium), whereas the planner wishes to maximize welfare, i.e., the average true score of the population.

To further study this model, we analyze the algorithmic aspects of maximizing welfare in two specific instantiations. In Section 3, we show that when the true quality function is linear and the mechanism class is the set of all linear functions, the optimal mechanism is a projection of the true quality function on the visible subspace. Our most interesting algorithmic results (Section 4) arise from the case when the true function is linear and the mechanism class is the set of all linear thresholds. In this case, a simple projection does not work: we need to consider the distribution of agents (projected on the visible feature space) when choosing the mechanism. For this case, we provide polynomial time approximation algorithms for finding the optimal linear threshold. In Section 5, we also provide sample complexity guarantees for learning the optimal mechanism from samples only.

Prior Work. Our work builds upon the strategic machine learning literature introduced by Hardt et al. [2016].
As in our work, agents are represented by feature vectors which can be manipulated at a cost. Hardt et al. [2016] design optimal learning algorithms in the presence of these costly strategic manipulations. Hu et al. [2019] and Milli et al. [2019] extend Hardt et al. [2016] by assuming different groups of agents have different costs of manipulation and study the disparate impact on their outcomes. Dong et al. [2018] consider a setting in which the learner does not know the distribution of agents' features or costs but must learn them through revealed preference. Importantly, in all these works, the manipulations do not change the underlying features of the agent and hence purely disadvantage the learning algorithm.

Kleinberg and Raghavan [2019] introduce a different model in which manipulations do change the underlying features. Some changes are advantageous, and the designer chooses a rule that incentivizes these while discouraging disadvantageous changes. Their main result is that simple linear mechanisms suffice for a single known agent (i.e., known initial feature vector). In contrast, we study a population of agents with a known distribution of feature vectors and optimize over the class of linear, or linear threshold, mechanisms. Alon et al. [2020] also study extensions of Kleinberg and Raghavan [2019] to multiple agents. In that work, agents differ in how costly it is for them to manipulate their features but all share the same starting feature representation; in our work, agents differ in their starting features while facing the same manipulation cost.

As noted by Kleinberg and Raghavan [2019], their model is closely related to the field of contract design in economics. The canonical principal-agent model (see, e.g., [Grossman and Hart, 1983; Ross, 1973]) involves a single principal and a single agent. There is a single-dimensional output, say the crop yield of a farm, and the principal wishes to incentivize the agent to expend costly effort to maximize output. However, the principal can only observe output, and the mapping of effort to output is noisy. Under a variety of conditions, the optimal contract in such settings pays the agent an affine function of output [Carroll, 2015; Dütting et al., 2019; Holmstrom and Milgrom, 1987], although the optimal contract in general settings can be quite complex [McAfee and McMillan, 1986]. Our model differs from this canonical literature in that both effort and output are multi-dimensional. In this regard, the work of Holmstrom and Milgrom [1991] is closest to ours. They also study a setting with a multidimensional feature space in which the principal observes only a low-dimensional representation. Important differences include that they only study one type of agent whereas we allow agents to have different starting feature vectors, and they assume transferable utility whereas in our setting payments are implicit and do not reduce the utility of the principal.

2 Preliminaries

The Model. We denote the true features of an individual by $x \in \mathbb{R}^n$, where the feature space $\mathbb{R}^n$ encodes all relevant features of a candidate, e.g., health history, biometrics, vaccination record, exercise and dietary habits. We denote by $f : \mathbb{R}^n \to [0, 1]$ the mapping from one's true features to their true quality. For example, a real-valued $f(x) \in [0, 1]$ can express the overall quality of candidate $x$, and a binary-valued $f(x)$ can determine whether $x$ meets a certain qualification level. These true features of an individual may not be visible to the designer.
Instead, there is an $n \times n$ projection matrix $P$ of rank $k$, i.e., $P^2 = P$, such that for any $x$, $Px$ represents the visible representation of the individual, such as vaccination record and health history, but not exercise and dietary habits. We define by $\mathrm{Img}(P) = \{z \in \mathbb{R}^n \mid z = Pz\}$ the set of all feature representations that are visible to the designer. We denote by $g : \mathbb{R}^n \to \mathbb{R}$ a mechanism whose outcome for any individual $x$ depends only on $Px$, i.e., the visible features of $x$ (equivalently, $g(x) = g_{\|}(Px)$ for an unrestricted $g_{\|} : \mathbb{R}^n \to \mathbb{R}$). For example, $g(x) \in \mathbb{R}$ can express the payoff an individual receives from the mechanism, $g(x) \in [0, 1]$ can express the probability that $x$ is accepted by a randomized mechanism $g$, or a binary-valued $g(x) \in \{0, 1\}$ can express whether $x$ is accepted or rejected deterministically.

Let $\mathrm{cost}(x, x')$ represent the cost that an individual incurs when changing their features from $x$ to $x'$. We consider $\mathrm{cost}(x, x') = c\,\|x - x'\|_2$ for some $c > 0$. Given mechanism $g$, the total payoff $x$ receives from changing its feature representation to $x'$ is $U_g(x, x') = g(x') - \mathrm{cost}(x, x')$. Let $\delta_g : \mathbb{R}^n \to \mathbb{R}^n$ denote the best response of an individual to $g$, i.e., $\delta_g(x) = \arg\max_{x'} U_g(x, x')$.

We consider a distribution $D$ over feature vectors in $\mathbb{R}^n$, representing individuals. Our goal is to design a mechanism $g$ such that, when individuals from $D$ best respond to it, it yields the highest quality individuals on average. That is, we wish to find a mechanism $g \in \mathcal{G}$ that maximizes
$$\mathrm{Val}(g) = \mathbb{E}_{x \sim D}\big[f(\delta_g(x))\big]. \quad (1)$$
We often consider the gain in the quality of individuals compared to the average quality before deploying any mechanism, i.e., the baseline $\mathbb{E}[f(x)]$, defined by
$$\mathrm{Gain}(g) = \mathrm{Val}(g) - \mathbb{E}_{x \sim D}[f(x)]. \quad (2)$$
We denote the mechanism with highest gain by $g_{opt} \in \mathcal{G}$. For example, $f$ can indicate the overall health of an individual and $\mathcal{G}$ the set of governmental policies on how to set insurance premiums based on the observable features of individuals, e.g., setting lower premiums for those who received preventative care in the previous year. Then, $g_{opt}$ corresponds to the policy that leads to the healthiest society on average. In Sections 3 and 4, we work with different design choices for $f$ and $\mathcal{G}$, and show how to find (approximately) optimal mechanisms.

Types of Mechanisms. More specifically, we consider a known true quality function $f$ that is a linear function $f(x) = w_f \cdot x - b_f$ for some vector $w_f \in \mathbb{R}^n$ and $b_f \in \mathbb{R}$. Without loss of generality, we assume that the domain and cost multiplier $c$ are scaled so that $w_f$ is a unit vector. In Section 3, we consider the class of linear mechanisms $\mathcal{G}_{lin}$, i.e., any $g \in \mathcal{G}_{lin}$ is represented by $g(x) = w_g \cdot x - b_g$ for a scalar $b_g \in \mathbb{R}$ and vector $w_g \in \mathrm{Img}(P)$ of length $\|w_g\|_2 \le R$ for some $R \in \mathbb{R}_+$. In Section 4, we consider the class of linear threshold (halfspace) mechanisms $\mathcal{G}_{0\text{-}1}$, where a $g \in \mathcal{G}_{0\text{-}1}$ is represented by $g(x) = \mathrm{sign}(w_g \cdot x - b_g)$ for some unit vector $w_g \in \mathrm{Img}(P)$ and scalar $b_g \in \mathbb{R}$.

Other Notation. $\|\cdot\|$ is the $L_2$ norm of a vector unless otherwise stated. For $\ell \ge 0$, the $\ell$-margin density of a halfspace is
$$\mathrm{Den}^{\ell}_D(w, b) = \Pr_{x \sim D}\big[\, w \cdot x - b \in [-\ell, 0] \,\big].$$
This is the total density $D$ assigns to points that are at distance at most $\ell$ from being included in the positive side of the halfspace. The soft $\ell$-margin density of this halfspace is defined as
$$\text{S-Den}^{\ell}_D(w, b) = \mathbb{E}_{x \sim D}\Big[\,(b - w \cdot x)\,\mathbf{1}\big(w \cdot x - b \in [-\ell, 0]\big)\,\Big].$$
We suppress $\ell$ and $D$ when the context is clear. We assume that individuals have features that are within a ball of radius $r$ of the origin, i.e., $D$ is only supported on $\mathcal{X} = \{x \mid \|x\| \le r\}$. In Section 4, we work with a distribution $D$ that is additionally $\sigma$-smooth. $D$ is a $\sigma$-smoothed distribution if there is a corresponding distribution $\mathcal{P}$ over $\mathcal{X}$ such that to sample $x \sim D$ one first samples $\bar{x} \sim \mathcal{P}$ and then sets $x = \bar{x} + \mathcal{N}(0, \sigma^2 I)$. Smoothing is a common assumption in theoretical computer science, where $\mathcal{N}(0, \sigma^2 I)$ models uncertainties in real-life measurements. To ensure that the noise in measurements is small compared to the radius of the domain $r$, we assume that $\sigma \in O(r)$.
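To make these definitions concrete, here is a minimal Python sketch (our own illustration, not code from the paper) that estimates $\mathrm{Gain}(g)$, the $\ell$-margin density, and the soft $\ell$-margin density from a finite sample of feature vectors drawn from $D$. The best-response map $\delta_g$ is supplied as a function, and all names are hypothetical.

```python
import numpy as np

# Minimal sketch (ours, not the paper's code) of the Section 2 quantities,
# evaluated on an i.i.d. sample: rows of X are feature vectors drawn from D.

def estimate_gain(X, f, delta_g):
    """Empirical Gain(g): mean f(delta_g(x)) minus mean f(x) over the sample."""
    moved = np.apply_along_axis(delta_g, 1, X)   # best response of each individual
    return float(f(moved).mean() - f(X).mean())

def margin_density(X, w, b, ell):
    """Empirical Den^ell(w, b): fraction of points with w.x - b in [-ell, 0]."""
    s = X @ w - b
    return float(np.mean((s >= -ell) & (s <= 0)))

def soft_margin_density(X, w, b, ell):
    """Empirical S-Den^ell(w, b): mean of (b - w.x) over points in the margin."""
    s = X @ w - b
    return float(np.mean(np.where((s >= -ell) & (s <= 0), -s, 0.0)))

# Example usage with a linear true quality f(x) = w_f . x for a unit vector w_f:
# w_f = np.ones(3) / np.sqrt(3); f = lambda X: X @ w_f
```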
3 Linear Mechanisms

In this section, we show how the optimal linear mechanism in $\mathcal{G}_{lin}$ is characterized by $g_{opt}(x) = w_g \cdot x$ for $w_g$ that is (the largest vector) in the direction of $P w_f$. This leads to an algorithm with $O(n)$ runtime for finding the optimal mechanism. At a high level, this result shows that when the true quality function $f$ is linear, the optimal linear mechanism is in the direction of the closest vector to $w_f$ in the visible feature space. Indeed, this characterization extends to any true quality function that is a monotonic transformation of a linear function.

Theorem 1 (Linear Mechanisms). Let $f(x) = h(w_f \cdot x - b_f)$ for some monotonic function $h : \mathbb{R} \to \mathbb{R}$. Let $g(x) = \frac{R\, P w_f}{\|P w_f\|_2} \cdot x$. Then $g$ has the optimal gain.

It is interesting to note that designing an optimal linear mechanism (Theorem 1) does not use any information about the distribution of instances in $D$; rather, it directly projects $w_f$ on the subspace of visible features. We see in Section 4 that this is not the case for linear threshold mechanisms, where the distribution of feature attributes plays a central role in choosing the optimal mechanism.
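As a quick illustration of the Theorem 1 construction, the following sketch (assuming NumPy; function and variable names are ours) computes the weight vector $R\, P w_f / \|P w_f\|_2$ of the optimal linear mechanism.

```python
import numpy as np

# Sketch of the Theorem 1 construction: the optimal linear mechanism points in
# the direction of P w_f and uses the largest allowed length R.

def optimal_linear_mechanism(P, w_f, R):
    """Return w_g = R * P w_f / ||P w_f||_2, the weight vector from Theorem 1."""
    proj = P @ w_f
    norm = np.linalg.norm(proj)
    if norm == 0:
        # w_f is entirely invisible to the designer; no linear mechanism creates gain.
        return np.zeros_like(w_f)
    return R * proj / norm

# Example: project onto the last two of three coordinates.
P = np.diag([0.0, 1.0, 1.0])
w_f = np.ones(3) / np.sqrt(3)
w_g = optimal_linear_mechanism(P, w_f, R=2.0)   # approximately [0, 1.41, 1.41]
```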
4 Linear Threshold Mechanisms

In this section, we consider the class of linear threshold mechanisms $\mathcal{G}_{0\text{-}1}$ and explore the computational aspects of finding $g_{opt} \in \mathcal{G}_{0\text{-}1}$. Note that any $g(x) = \mathrm{sign}(w_g \cdot x - b_g)$ corresponds to the transformation of a linear mechanism from $\mathcal{G}_{lin}$ that only rewards those whose quality passes a threshold. As in the case of linear mechanisms, individuals only move in the direction of $w_g$, and every unit of their movement improves the payoff of the mechanism by $w_g \cdot w_f$. However, an individual's payoff from the mechanism improves only if its movement pushes the feature representation from the $0$-side of the linear threshold to the $+1$-side. Therefore, only those individuals whose feature representations are close to the decision boundary move.

Lemma 1. For $g(x) = \mathrm{sign}(w_g \cdot x - b_g) \in \mathcal{G}_{0\text{-}1}$,
$$\delta_g(x) = x - \mathbf{1}\Big(w_g \cdot x - b_g \in \big[-\tfrac{1}{c}, 0\big]\Big)\,(w_g \cdot x - b_g)\, w_g.$$

This leads to very different dynamics and challenges compared to the linear mechanism case. For example, $w_g \propto P w_f$ no longer leads to a good mechanism, as it may only incentivize a small fraction of individuals, as shown in the following example.

Example 1. As in Figure 1, consider $w_f = \frac{1}{\sqrt{3}}(1, 1, 1)$ and $P$ that projects a vector on its second and third coordinates. Consider a distribution $D$ that consists of $x = (0, 0, x_3)$ for $x_3 \sim \mathrm{Unif}[-r, r]$. Any $g$ only incentivizes individuals who are at distance at most $\ell = 1/c$ from the decision boundary, highlighted by the shaded regions. Mechanism $g(x) = \mathrm{sign}(x_2 - \ell)$ incentivizes everyone to move $\ell$ units in the direction of $x_2$ and leads to total utility $\mathbb{E}[f(\delta_g(x))] = \ell/\sqrt{3}$. But $g'(x) = \mathrm{sign}(w'_g \cdot x - b'_g)$ for the unit vector $w'_g \propto P w_f$ only incentivizes a $\sqrt{2}\,\ell/2r$ fraction of the individuals, each of whom on average moves only $(\ell/2)\sqrt{2/3}$ units in the direction of $w_f$. Therefore, $\mathrm{Gain}(g') \le \frac{\ell}{2r}\,\mathrm{Gain}(g) \ll \mathrm{Gain}(g)$ when the unit cost $c \gg 1/r$.

Figure 1: Mechanism $g'(x) = \mathrm{sign}(w'_g \cdot x - b'_g)$ for any $w'_g \propto P w_f$ is far from optimal.

Using the characterization of an individual's best response to $g$ from Lemma 1, for any true quality function $f(x) = w_f \cdot x - b_f$ and $g(x) = \mathrm{sign}(w_g \cdot x - b_g)$, we have
$$\mathrm{Gain}(g) = (w_g \cdot w_f)\ \text{S-Den}^{1/c}_D(w_g, b_g). \quad (3)$$

While mechanisms with $w_g \propto P w_f$ can be far from the optimal, one can still achieve a $1/(4rc)$-approximation to the optimal mechanism by using such mechanisms and optimizing over the bias term $b_g$.

Theorem 2 ($1/(4rc)$-approximation). Consider the polynomial time algorithm that returns the best $g$ from $\mathcal{G}' = \big\{\mathrm{sign}\big(\tfrac{P w_f}{\|P w_f\|} \cdot x - b_g\big) \;\big|\; b_g = i/2c,\ i \in [2rc + 1]\big\}$. Then, $\mathrm{Gain}(g) \ge \frac{1}{4rc}\,\mathrm{Gain}(g_{opt})$.

This is a good approximation when the cost unit $c$ is not too large compared to the radius of the domain $r$. However, in most cases the cost to change one's features is much larger than a constant fraction of the radius. In this case, to find $g_{opt} \in \mathcal{G}_{0\text{-}1}$, we need to simultaneously optimize the total density of instances that fall within the margin of the classifier, their average distance to the decision boundary, and the correlation between $w_g$ and $w_f$.

One of the challenges involved here is finding a halfspace whose margin captures a large fraction of instances, i.e., has large margin density. Many variants of this problem have been studied in the past and are known to be hard. For example, the densest subspace, the densest halfspace, the densest cube, and the densest ball are all known to be hard to approximate [Ben-David et al., 2002; Hardt and Moitra, 2013; Johnson and Preparata, 1978]. Yet, finding dense regions is a routine unsupervised learning task for which existing optimization tools are known to perform well in practice. Therefore, we assume that we have access to such an optimization tool, which we call a density optimization oracle.

Definition 1 (Density Optimization Oracle). Oracle $O$ takes any distribution $D$, margin $\ell$, and a set $K \subseteq \mathbb{R}^n$, takes $O(1)$ time, and returns
$$O(D, \ell, K) \in \operatorname*{argmax}_{w \in K,\ \|w\| = 1,\ b \in \mathbb{R}} \mathrm{Den}^{\ell}_D(w, b). \quad (4)$$

Algorithm 1: $(1/4 - \epsilon)$-approximation for $\mathcal{G}_{0\text{-}1}$
Input: $\sigma$-smoothed distribution $D$ with radius $r$ before perturbation, vector $w_f$, projection matrix $P$, desired accuracy $\epsilon$.
Output: a linear threshold mechanism $g$.
Let $\epsilon' = \min\{\epsilon^4,\ \epsilon^2\sigma^4/r^4\}$ and $C = \emptyset$.
For $\eta = 0,\ \epsilon'\|P w_f\|,\ 2\epsilon'\|P w_f\|,\ \dots,\ \|P w_f\|$:
    Let $K_\eta = \mathrm{Img}(P) \cap \{w \mid w \cdot w_f = \eta\}$.
    Use the density optimization oracle to compute $(w_\eta, b_\eta) \leftarrow O(D, \tfrac{1}{c}, K_\eta)$.
    Let $C \leftarrow C \cup \{(w_\eta, b_\eta),\ (w_\eta, b_\eta + \tfrac{1}{2c})\}$.
End for.
Return $g(x) = \mathrm{sign}(w \cdot x - b)$, where $(w, b) \in \operatorname*{argmax}_{(w, b) \in C}\ (w \cdot w_f)\,\text{S-Den}^{1/c}_D(w, b)$.

Another challenge is that $\mathrm{Gain}(\cdot)$ is a non-smooth function. As a result, there are distributions for which small changes to $(w, b)$, e.g., to improve $w \cdot w_f$, could result in a completely different gain. However, one of the properties of real-life distributions is that there is some amount of noise in the data, e.g., because measurements are not perfect. This is modeled by smoothed distributions, as described in Section 2. Smoothed distributions provide an implicit regularization that smooths the expected loss function over the space of all solutions.

Using these two assumptions, i.e., access to a density optimization oracle and smoothness of the distribution, we show that there is a polynomial time algorithm that achieves a $1/4$-approximation to the gain of $g_{opt}$. At a high level, smoothness of $D$ allows us to limit our search to those $w_g \in \mathbb{R}^n$ for which $w_f \cdot w_g = v$ for a small set of discrete values $v$. For each $v$, we use the density oracle to search over all $w_g$ such that $w_f \cdot w_g = v$ and return a candidate with high margin density. We show how a mechanism with high margin density leads us to (a potentially different) mechanism with high soft-margin density and, as a result, a near-optimal gain. This approach is illustrated in more detail in Algorithm 1, and a sample-based sketch follows below.
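The following Python sketch illustrates the grid search of Algorithm 1 on an i.i.d. sample standing in for $D$. The density optimization oracle of Definition 1 is passed in as a callable, since its implementation is left abstract; every function and parameter name here is our own illustration, not the paper's code.

```python
import numpy as np

# Rough, sample-based sketch of the grid search behind Algorithm 1.
# density_oracle(X, ell, P, w_f, eta) is assumed to return a unit vector w in
# Img(P) with w . w_f = eta and a bias b that (approximately) maximize the
# empirical ell-margin density, playing the role of the oracle in Definition 1.

def soft_margin_density(X, w, b, ell):
    """Empirical S-Den^ell(w, b) on sample rows X (re-declared for self-containment)."""
    s = X @ w - b
    return float(np.mean(np.where((s >= -ell) & (s <= 0), -s, 0.0)))

def approx_threshold_mechanism(X, w_f, P, c, sigma, eps, density_oracle):
    """Grid-search over eta = w . w_f, query the oracle on each slice K_eta,
    and keep the candidate maximizing (w . w_f) * S-Den^{1/c}(w, b)."""
    ell = 1.0 / c                              # only points within 1/c of the boundary move
    r = np.max(np.linalg.norm(X, axis=1))      # empirical radius of the support
    eps_prime = min(eps**4, eps**2 * sigma**4 / r**4)
    proj_norm = np.linalg.norm(P @ w_f)
    step = max(eps_prime * proj_norm, 1e-12)   # guard for the degenerate case P w_f = 0

    candidates = []
    for eta in np.arange(0.0, proj_norm + step, step):
        # Oracle: densest 1/c-margin halfspace with unit w in Img(P) and w . w_f = eta.
        w_eta, b_eta = density_oracle(X, ell, P, w_f, eta)
        candidates += [(w_eta, b_eta), (w_eta, b_eta + ell / 2)]

    return max(candidates,
               key=lambda wb: (wb[0] @ w_f) * soft_margin_density(X, wb[0], wb[1], ell))
```

The number of oracle calls in this sketch scales as $1/\epsilon'$, mirroring the oracle-call bound stated in Theorem 3 below.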
Theorem 3 ((Main) $1/4$-approximation). Consider a distribution over $\mathcal{X}$ and the corresponding $\sigma$-smoothed distribution $D$ for some $\sigma \in O(r)$. Then, for any small enough $\epsilon$, Algorithm 1 runs in time $\mathrm{poly}(n, 1/\epsilon)$, makes $O\!\left(\frac{1}{\epsilon^4} + \frac{r^2}{\epsilon^2\sigma^2}\right)$ oracle calls, and returns $g \in \mathcal{G}_{0\text{-}1}$ such that $\mathrm{Gain}(g) \ge \frac{1}{4}\mathrm{Gain}(g_{opt}) - \epsilon$.

Proof Sketch of Theorem 3. Recall from Equation 3 that $\mathrm{Gain}(g_{opt})$ is the product of two values, $(w_{g_{opt}} \cdot w_f)$ and $\text{S-Den}(w_{g_{opt}}, b_{g_{opt}})$. The first lemma shows that we can do a grid search over the value of $(w_{g_{opt}} \cdot w_f)$. In other words, there is a predefined grid on the values of $w \cdot w_f$ and a vector $w$ (close to $w_{g_{opt}}$) whose correlation with $w_f$ lies on the grid and satisfies $w \cdot w_f \ge w_{g_{opt}} \cdot w_f$. This is demonstrated in Figure 2.

Figure 2: The value $w_g \cdot w_f$ can be increased by any small amount $\epsilon$ to match a specific grid value when $w_f$ is shifted by an $\epsilon$ angle to a vector $w$. When a distribution is smooth, the margin densities of mechanisms $(w, b)$ and $(w_g, b_g)$ are close when $\theta(w, w_g)$ and $|b - b_g|$ are small. The lightly shaded area shows that instances outside of the margin contribute to the (soft-)margin density.

Lemma 2 (Discretization). For any two unit vectors $w_1, w_2 \in \mathrm{Img}(P)$ and any $\epsilon$ such that $w_1 \cdot w_2 \le 1 - 2\epsilon$, there is a unit vector $w \in \mathrm{Img}(P)$ such that $w \cdot w_2 = w_1 \cdot w_2 + \epsilon$ and $w \cdot w_1 \ge 1 - \epsilon$.

The second technical lemma shows that approximating $w_g$ by a close vector $w$, and $b_g$ by a close scalar $b$, only has a small effect on the soft margin density; that is, the soft margin density is Lipschitz. For this claim to hold, it is essential for the distribution to be smooth. The key property of a $\sigma$-smoothed distribution $D$ corresponding to the original distribution $\mathcal{P}$ is that $\mathrm{Den}^{\ell}_D(w_g, b_g)$ includes instances $x \sim \mathcal{P}$ that are not in the margin of $(w_g, b_g)$. As shown in Figure 2, these instances also contribute to the soft margin density of any other $w$ and $b$, as long as the distances of $x$ to the halfplanes $w \cdot x = b$ and $w_g \cdot x = b_g$ are comparable. So, it is sufficient to show that the distance of any $x$ to two halfplanes with a small angle between them is approximately the same. Here, the angle between two unit vectors is defined as $\theta(w, w') = \arccos(w \cdot w')$. Lemma 3 leverages this fact to prove that the soft margin density is Lipschitz smooth.

Lemma 3 (Soft Margin Lipschitzness). For any distribution over $\mathcal{X}$ and its corresponding $\sigma$-smooth distribution $D$, for $\epsilon \le \frac{1}{3R^2}$, any $\nu < 1$, $R = 2r + \sigma\sqrt{2\ln(2/\nu)}$, and any $\ell \in O(R)$: if $\theta(w_1, w_2) \le \epsilon\sigma^2$ and $|b_1 - b_2| \le \epsilon\sigma$, then
$$\big|\text{S-Den}^{\ell}_D(w_1, b_1) - \text{S-Den}^{\ell}_D(w_2, b_2)\big| \le O(\nu + \epsilon R^2).$$

The final step in bringing these results together is to show that we can find a mechanism (among those on the grid) that has a large soft margin density. To do this, we show that the mechanism (with potentially small changes to it) that has the highest margin density has at least $1/4$ of the soft margin density of the optimal mechanism. The proof technique of this lemma involves analyzing whether more than half of the instances within an $\ell$-margin of a mechanism are at distance at most $\ell/2$ from the edge of the margin, in which case the soft margin density of the mechanism is at least $1/4$ of its margin density. Otherwise, shifting the bias of the mechanism by $\ell/2$ results in a mechanism whose soft margin density is at least $1/4$ of the original margin density.

Lemma 4 (Margin Density Approximates Soft-Margin Density).
For any distribution $D$ over $\mathbb{R}^n$, a class of unit vectors $K \subseteq \mathbb{R}^n$, margin $\ell > 0$, and a set of values $N$, let $(w_\eta, b_\eta) = O(D, \ell, K \cap \{w \mid w \cdot w_f = \eta\})$ for each $\eta \in N$, and let $(\hat{w}, \hat{b})$ be the mechanism with maximum $\eta\, \mathrm{Den}^{\ell}_D(w_\eta, b_\eta)$ among these options. Then,
$$\max\Big\{\mathrm{Gain}(\hat{w}, \hat{b}),\ \mathrm{Gain}\big(\hat{w}, \hat{b} + \tfrac{\ell}{2}\big)\Big\} \;\ge\; \frac{1}{4}\, \max_{w \in K,\ w \cdot w_f \in N,\ b \in \mathbb{R}} \mathrm{Gain}(w, b).$$

Putting these lemmas together, one can show that Algorithm 1 finds a $\frac{1}{4}$-approximation.

5 Learning Optimal Mechanisms From Samples

Up to this point, we have assumed that distribution $D$ is known to the mechanism designer and all computations, such as measuring margin density and soft margin density, can be performed directly on $D$. However, in most applications the underlying distribution over feature attributes is unknown, or the designer only has access to a small set of historical data. In this section, we show that all computations performed in this paper can be done instead on a small set of samples drawn i.i.d. from distribution $D$. Note that (since no mechanism is yet deployed at this point) these samples represent the initial non-strategic feature attributes.

We first note that the characterization of the optimal linear mechanism (Theorem 1) is independent of distribution $D$; that is, even with no samples from $D$ we can still design the optimal linear mechanism. On the other hand, the optimal linear threshold mechanism (Theorem 3) heavily depends on the distribution of instances, e.g., through the margin and soft-margin density of $D$. To address the sample complexity of learning a linear threshold mechanism, we use the concept of pseudo-dimension, the analog of VC dimension for real-valued functions.

Definition 2 (Pseudo-dimension). Consider a set of real-valued functions $\mathcal{F}$ on the instance space $\mathcal{X}$. A set of instances $x^{(1)}, \dots, x^{(n)} \in \mathcal{X}$ is shattered by $\mathcal{F}$ if there are values $v^{(1)}, \dots, v^{(n)}$ such that for any subset $T \subseteq [n]$ there is a function $f_T \in \mathcal{F}$ for which $f_T(x^{(i)}) \ge v^{(i)}$ if and only if $i \in T$. The size of the largest set that is shattered by $\mathcal{F}$ is the pseudo-dimension of $\mathcal{F}$.

It is well known that the pseudo-dimension of a function class closely (via upper and lower bounds) controls the number of samples needed for learning a good function. This is characterized by the uniform convergence property, as shown below.

Theorem 4 (Uniform Convergence [Pollard, 2011]). Consider a class of functions $\mathcal{F}$ such that $f : \mathcal{X} \to [0, H]$ with pseudo-dimension $d$. For any distribution $D$ and a set $S$ of $m \in O\!\left(\epsilon^{-2} H^2 \big(d + \ln(1/\delta)\big)\right)$ i.i.d. randomly drawn samples from $D$, with probability $1 - \delta$, for all $f \in \mathcal{F}$,
$$\left|\frac{1}{m} \sum_{x \in S} f(x) - \mathbb{E}_{x \sim D}[f(x)]\right| \le \epsilon.$$

As is commonly used in machine learning, uniform convergence implies that choosing any mechanism based on its performance on the sample set leads to a mechanism whose expected performance is within $2\epsilon$ of the optimal mechanism for the underlying distribution.
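As an illustration of this sample-based approach, the sketch below (function names and signatures are ours, not the paper's) estimates the gain of candidate threshold mechanisms on a sample via the per-instance form implied by Lemma 1 and Equation 3, and returns the empirically best candidate.

```python
import numpy as np

# Hedged sketch of empirical mechanism selection: for a finite candidate set of
# threshold mechanisms (w, b), estimate Gain on an i.i.d. sample X using
#   Gain_x(g) = 1(w.x - b in [-1/c, 0]) * (b - w.x) * (w . w_f),
# and keep the empirical winner.

def empirical_gain(X, w, b, w_f, c):
    """Mean of 1(w.x - b in [-1/c, 0]) * (b - w.x) * (w . w_f) over sample rows X."""
    s = X @ w - b
    moved = np.where((s >= -1.0 / c) & (s <= 0), -s, 0.0)   # distance each x moves
    return float(np.mean(moved) * (w @ w_f))

def best_on_sample(X, candidates, w_f, c):
    """candidates: iterable of (w, b) pairs, e.g., the set C built by Algorithm 1."""
    return max(candidates, key=lambda wb: empirical_gain(X, wb[0], wb[1], w_f, c))
```

By Theorem 4, together with the pseudo-dimension bound of Lemma 5 below, such empirical gains concentrate around their expectations, which yields the sample complexity guarantee of Theorem 5.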
Lemma 5 (Pseudo-dimension). For each $g = \mathrm{sign}(w_g \cdot x - b_g)$ and $x$, let $\mathrm{Gain}_x(g)$ be the gain of $g$ just on instance $x$, and let $\mathrm{Den}^{\ell}_x(g) = \mathbf{1}(w_g \cdot x - b_g \in [-\ell, 0])$. The class of real-valued functions $\{x \mapsto \mathrm{Gain}_x(g)\}_{g \in \mathcal{G}_{0\text{-}1}}$ has pseudo-dimension at most $O(\mathrm{rank}(P))$. Moreover, the class of functions $\{x \mapsto \mathrm{Den}^{\ell}_x(g)\}_{g \in \mathcal{G}_{0\text{-}1}}$ has pseudo-dimension (equivalently, VC dimension) $O(\mathrm{rank}(P))$.

Proof. For $g \in \mathcal{G}_{0\text{-}1}$, note that
$$\mathrm{Gain}_x(g) = \mathbf{1}\Big(w_g \cdot x - b_g \in \big[-\tfrac{1}{c}, 0\big]\Big)\,(b_g - w_g \cdot x)\,(w_f \cdot w_g).$$
Writing $\ell = 1/c$, we can express $\mathrm{Gain}_x(g) = \min\{h^1_g(x), h^2_g(x)\}$ for $h^1_g(x) := \mathbf{1}(w_g \cdot x - b_g \le 0)\,(b_g - w_g \cdot x)\,(w_g \cdot w_f)$ and $h^2_g(x) := \mathbf{1}(w_g \cdot x - b_g \ge -\ell)\,\ell\,(w_g \cdot w_f)$. As shown in Figure 3, $h^1_g$ and $h^2_g$ are both monotone functions of $w_g \cdot x - b_g$. Pollard [2011] shows that the set of all linear functions in a rank-$k$ subspace has pseudo-dimension $k$, and also that the set of monotone transformations of these functions has pseudo-dimension $O(k)$. Therefore, the sets of functions $\{h^1_g\}_{g \in \mathcal{G}_{0\text{-}1}}$ and $\{h^2_g\}_{g \in \mathcal{G}_{0\text{-}1}}$ each have pseudo-dimension $O(\mathrm{rank}(P))$. It is well known that the set of all minimums of two functions from two classes, each with pseudo-dimension $d$, has pseudo-dimension $O(d)$. Therefore, $\{x \mapsto \mathrm{Gain}_x(g)\}_{g \in \mathcal{G}_{0\text{-}1}}$ has pseudo-dimension at most $O(\mathrm{rank}(P))$. A similar construction shows that $\{x \mapsto \mathrm{Den}^{\ell}_x(g)\}_{g \in \mathcal{G}_{0\text{-}1}}$ has VC dimension at most $O(\mathrm{rank}(P))$.

Figure 3: $\mathrm{Gain}_x(g)$ is the minimum of $h^1_g(x)$ and $h^2_g(x)$.

We now give a bound on the number of samples sufficient for finding a near-optimal mechanism in $\mathcal{G}_{0\text{-}1}$.

Theorem 5 (Sample Complexity). For any small enough $\epsilon$ and $\delta$, there is $m \in O\!\left(\frac{1}{c^2\epsilon^2}\big(\mathrm{rank}(P) + \ln(1/\delta)\big)\right)$ such that for a set $S \sim D^m$ of i.i.d. samples, with probability $1 - \delta$, the mechanism $\hat{g} \in \mathcal{G}_{0\text{-}1}$ that optimizes the empirical gain on $S$ has $\mathrm{Gain}(\hat{g}) \ge \mathrm{Gain}(g_{opt}) - \epsilon$. Furthermore, with probability $1 - \delta$, when Algorithm 1 is run on $S$, the outcome $\hat{g}$ has $\mathrm{Gain}(\hat{g}) \ge \frac{1}{4}\mathrm{Gain}(g_{opt}) - \epsilon$.

6 Discussion

In this work, we focused on increasing the expected quality of the population, i.e., welfare, through classification. There are, however, other reasonable objectives that one may want to consider. In some cases, our goal may be to design evaluation mechanisms that lead to the highest quality accepted individuals, not accounting for those who are rejected by the mechanism. It would be interesting to study these objectives under our model as well. An example of this is to consider, for a set of boolean-valued functions $\mathcal{G}$, $\max_{g \in \mathcal{G}} \mathbb{E}_{x \sim D}\big[f(\delta_g(x)) \mid g(x) = 1\big]$.

Throughout this work, we focused on the $L_2$ cost function (i.e., $\mathrm{cost}(x, x') = c\|x - x'\|_2$). This specification models scenarios in which an individual can improve multiple features most efficiently by combining effort, rather than working on each feature individually (as with an $L_1$ cost). For example, simultaneously improving a student's writing, vocabulary, and critical analysis by, say, reading novels is more effective than spending effort to improve vocabulary alone by, say, memorizing vocabulary lists, and then improving the other attributes separately. It would be interesting to analyze the algorithmic aspects of alternative cost functions, such as $L_1$ costs or even non-metric costs (e.g., ones with learning curves whereby the first 10% improvement is cheaper than the next 10%), and different costs for different types of individuals.

Finally, we have assumed we know the true mapping of features to qualities (i.e., $f$). In many settings, one might not know this mapping, or even the full set of features. Instead, the designer only observes the quality of individuals after they respond to their incentives (i.e., $f(\delta_g(x))$) and the projection of their new feature set (i.e., $P\delta_g(x)$). Existing works on Stackelberg games and strategic classification have considered the use of learning tools for designing optimal mechanisms without knowledge of the underlying function [Blum et al., 2014; Haghtalab et al., 2016; Miller et al., 2020]. It would be interesting to characterize how the nature of observations and interventions available in our setting specifically affects this learning process.

References

[Alon et al., 2020] Tal Alon, Magdalen Dobson, Ariel D. Procaccia, Inbal Talgam-Cohen, and Jamie Tucker-Foltz.
Multiagent evaluation mechanisms. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), 2020.

[Ben-David et al., 2002] Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. The computational complexity of densest region detection. Journal of Computer and System Sciences, 64(1):22–47, 2002.

[Blum et al., 2014] Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Learning optimal commitment to overcome insecurity. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 1826–1834, 2014.

[Cai et al., 2015] Yang Cai, Constantinos Daskalakis, and Christos Papadimitriou. Optimum statistical estimation with strategic data sources. In Proceedings of the 28th Conference on Computational Learning Theory (COLT), pages 280–296, 2015.

[Carroll, 2015] Gabriel Carroll. Robustness and linear contracts. American Economic Review, 105(2):536–563, 2015.

[Chen et al., 2018] Yiling Chen, Chara Podimata, Ariel D. Procaccia, and Nisarg Shah. Strategyproof linear regression in high dimensions. In Proceedings of the 19th ACM Conference on Economics and Computation (EC), pages 9–26, 2018.

[Dekel et al., 2010] Ofer Dekel, Felix Fischer, and Ariel D. Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759–777, 2010.

[Dong et al., 2018] Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 19th ACM Conference on Economics and Computation (EC), pages 55–70, 2018.

[Dütting et al., 2019] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 20th ACM Conference on Economics and Computation (EC), pages 369–387, 2019.

[Grossman and Hart, 1983] Sanford J. Grossman and Oliver D. Hart. An analysis of the principal-agent problem. Econometrica, 51(1):7–45, 1983.

[Haghtalab et al., 2016] Nika Haghtalab, Fei Fang, Thanh Hong Nguyen, Arunesh Sinha, Ariel D. Procaccia, and Milind Tambe. Three strategies to success: Learning adversary models in security games. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 308–314, 2016.

[Hardt and Moitra, 2013] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In Conference on Computational Learning Theory (COLT), pages 354–375, 2013.

[Hardt et al., 2016] Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 7th ACM Conference on Innovations in Theoretical Computer Science (ITCS), pages 111–122, 2016.

[Holmstrom and Milgrom, 1987] Bengt Holmstrom and Paul Milgrom. Aggregation and linearity in the provision of intertemporal incentives. Econometrica, 55(2):303–328, 1987.

[Holmstrom and Milgrom, 1991] Bengt Holmstrom and Paul Milgrom. Multitask principal-agent analyses: Incentive contracts, asset ownership, and job design. Journal of Law, Economics, and Organization, 7:24–52, 1991.

[Hu et al., 2019] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency (FAT*), pages 259–268, 2019.

[Johnson and Preparata, 1978] David S. Johnson and Franco P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107, 1978.
[Kleinberg and Raghavan, 2019] Jon Kleinberg and Manish Raghavan. How do classifiers induce agents to invest effort strategically? In Proceedings of the 20th ACM Conference on Economics and Computation (EC), pages 825–844, 2019.

[McAfee and McMillan, 1986] R. Preston McAfee and John McMillan. Bidding for contracts: A principal-agent analysis. The RAND Journal of Economics, 17(3):326–338, 1986.

[Meir et al., 2012] Reshef Meir, Ariel D. Procaccia, and Jeffrey S. Rosenschein. Algorithms for strategyproof classification. Artificial Intelligence, 186:123–156, 2012.

[Miller et al., 2020] John Miller, Smitha Milli, and Moritz Hardt. Strategic classification is causal modeling in disguise. arXiv preprint arXiv:1910.10362, 2020.

[Milli et al., 2019] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification. In Proceedings of the 2nd ACM Conference on Fairness, Accountability, and Transparency (FAT*), pages 230–239, 2019.

[Pollard, 2011] D. Pollard. Convergence of Stochastic Processes. Springer Series in Statistics, 2011.

[Ross, 1973] Stephen A. Ross. The economic theory of agency: The principal's problem. American Economic Review, 63(2):134–139, 1973.