# Incentive-Aware PAC Learning

Hanrui Zhang, Vincent Conitzer
Duke University
hrzhang@cs.duke.edu, conitzer@cs.duke.edu

## Abstract

We study PAC learning in the presence of strategic manipulation, where data points may modify their features in certain predefined ways in order to receive a better outcome. We show that the vanilla ERM principle fails to achieve any nontrivial guarantee in this context. Instead, we propose an incentive-aware version of the ERM principle which has asymptotically optimal sample complexity. We then focus our attention on incentive-compatible classifiers, which provably prevent any kind of strategic manipulation. We give a sample complexity bound that is, curiously, independent of the hypothesis class, for the ERM principle restricted to incentive-compatible classifiers. This suggests that incentive compatibility alone can act as an effective means of regularization. We further show that it is without loss of generality to consider only incentive-compatible classifiers when opportunities for strategic manipulation satisfy a transitivity condition. As a consequence, in such cases, our hypothesis-class-independent sample complexity bound applies even without incentive compatibility. Our results set the foundations of incentive-aware PAC learning.

## 1 Introduction

Consider the following scenario. A financial institution plans to offer a product to a selected group of customers, and decides that the only thing that should matter in the selection process is the amount of money a customer currently owns in savings (possibly held at a different financial institution). Each customer may prove her savings balance by submitting a bank statement, and the number shown therein will be used for the selection. However, customers may choose to underreport their balance, by, for example, temporarily transferring their money to another account before the statement date. (We assume for the purpose of this example that overreporting is not possible.) The question is, how does this ability of customers to strategically underreport their balance affect the selection process?

The answer, as one would expect, depends on the specific criteria of the selection, which are presumably determined by the nature of the product. For instance, the product could be a secured loan that requires the customer to use the savings as collateral. In that case, customers benefit from their balance being as high as possible, which means they would submit a statement with the full balance. As a result, the fact that customers can underreport would not affect the selection at all. Alternatively, the financial institution could be a non-profit entity that aims to make a product available only to those with low savings. In that case, every customer who wants access to this product would underreport her balance in order to be approved, and this would make the selection effectively meaningless.

More generally, often the right criteria for the selection may not be clear a priori, but have to be learned from observations. In such cases, the decision maker (e.g., the institution) has access to labeled historical data (e.g., the amount of savings owned by some customer, and whether the product was successful for the same customer).
A classifier (e.g., selection criteria) is derived from the data and implemented, to which future data points (e.g., new customers) to be classified respond in a strategic way. The goal, naturally, is to classify future data points as accurately as possible, taking into account that their features may be strategically modified. The key challenge is for the derived classifier to generalize from past unmodified observations to future strategic data points. Such problems are partially captured by the PAC learning model (discussed in detail below), but also exhibit additional complexity from strategic manipulation of the features, where the nature of this manipulation is affected by the choice of classifier. In this paper, we investigate novel phenomena in PAC learning introduced by the presence of strategic manipulation.

**Our results.** Two central questions in PAC learning are the following: statistically, how many labeled data points does one need to observe in order to derive a classifier of a desired quality, and computationally, given this many labeled data points, how can one compute such a classifier? We answer these questions for arbitrary feature spaces and structures of strategic manipulation, by presenting an adapted version of the empirical risk minimization (ERM) principle, which roughly says that one should simply pick a classifier that has the best quality on the unmodified labeled data points that are observed. We show that our incentive-aware version of the ERM principle requires asymptotically the minimum possible number of labeled data points for any specific problem setup. This can be viewed as a strategic version of VC theory, one of the central results in traditional PAC learning.

We further consider incentive-compatible classifiers, which provably prevent any kind of strategic manipulation by making revealing one's true features always optimal for any data point. Here, our most remarkable result is a hypothesis-class-independent bound on the number of labeled data points required to compute an incentive-compatible classifier of a desired quality. In traditional PAC learning, it is well known that without any prior knowledge about the ground truth (typically modeled by a hypothesis class consisting of possible classifiers to be considered), it is impossible to learn anything nontrivial, unless the number of labeled data points observed is trivially large or even infinite. By contrast, we show that in the presence of strategic manipulation, it is possible to learn a nontrivial classifier via our strategic version of the ERM principle without any such prior knowledge, except that the classifier learned has to be incentive-compatible. In other words, incentive compatibility acts as a means of regularization, which provides nontrivial learning guarantees in and of itself.

Moreover, when the structure of strategic manipulation is transitive (i.e., if A can pretend to be B and B can pretend to be C, then A can pretend to be C), considering incentive-compatible classifiers is without loss of generality; this is commonly known as the revelation principle in economic theory. This, together with our results for incentive-compatible classifiers, implies that ERM-type learning in the presence of transitive strategic manipulation is automatically regularized.
That is, the hypothesis-class-independent bound discussed above applies even without any exogenous requirement of incentive compatibility (since requiring this condition is without loss of generality), or any other prior knowledge.

**Related work.** There is an extremely rich body of research on PAC learning (Valiant 1984). Since the seminal result by Vapnik and Chervonenkis (1971), various measures of complexity have been studied (Alon et al. 1997; Bartlett and Mendelson 2002; Pollard 2012; Daniely et al. 2015), and tighter sample complexity bounds have been developed (Talagrand 1994; Hanneke 2016). See, e.g., (Devroye, Györfi, and Lugosi 2013; Shalev-Shwartz and Ben-David 2014) for a comprehensive exposition of PAC learning. Along this line of research, one result that is particularly related to ours is (Cullina, Bhagoji, and Mittal 2018). There, the authors also study PAC learning when data points may modify their features in certain ways. The model they consider is similar to ours in several ways, e.g., historical observations are not modified, and new data points to be classified can modify their features according to a binary relation over the feature space. The main difference between our results and theirs is that they assume data points to be adversarial, in that they always want to receive the incorrect label, whereas in our model, data points are strategic, in that they always want to receive the more desirable label (i.e., label 1), regardless of whether that is the correct label or not.

A closely related research topic is strategic machine learning, where it is commonly assumed that strategic agents seek to maximize their utility by modifying their features in certain restricted ways, often at some cost (Hardt et al. 2016; Kleinberg and Raghavan 2019; Haghtalab et al. 2020; Zhang, Cheng, and Conitzer 2021a). Specific topics include strategic aspects of linear regression (Perote and Perote-Peña 2004; Dekel, Fischer, and Procaccia 2010; Chen et al. 2018) and online learning (Roughgarden and Schrijvers 2017; Braverman et al. 2019; Feng, Parkes, and Xu 2019; Freeman et al. 2020). Our results differ from the above in that we consider a PAC model, where the key challenge is to generalize from observations to future data points to be classified. Also, we consider arbitrary feature spaces and structures of strategic manipulation, while existing results often focus on relatively restrictive setups, e.g., linear models.

Another closely related line of research is mechanism design with partial verification (Green and Laffont 1986; Yu 2011; Kephart and Conitzer 2015, 2016), which is often motivated by similar considerations as strategic machine learning. The general problem considered there is to design an (approximately) optimal mechanism assigning outcomes to data points based on their features, where the authenticity of the features can only be partially verified. Particularly relevant is a series of papers by Zhang, Cheng, and Conitzer (2019b,a, 2021b) with a pronounced learning aspect. The main difference between our results and theirs is that in our model, the underlying distribution of data is only accessible via the samples observed, and we do not restrict the structure of possible strategic manipulation.

In a companion paper with overlapping authors (Krishnaswamy et al. 2021), we study empirically an incentive-aware classification setting where not every data point has every feature, and a data point may strategically hide some of its features in order to get a better label.
Our experimental results in that paper, on ERM-style algorithms and real-world credit approval datasets, suggest that with carefully designed ERM-style algorithms that are robust to strategic misreporting, one can do almost as well as with pristine data. We could use results in the current paper to theoretically explain generalization performance in a setting like that of (Krishnaswamy et al. 2021).

## 2 Preliminaries

In this section, we review relevant definitions and results in PAC learning, and formulate the notion of strategic manipulation in classification.

### 2.1 Background on PAC Learning

Our problems of interest fit well with the probably approximately correct (PAC) learning model (Valiant 1984). In PAC learning, there is a feature space $X$, a label space $Y$, and a joint population distribution $D \in \Delta(X \times Y)$ over the feature space and the label space. For a classifier $f : X \to Y$, the population loss $\ell_D(f)$ is defined as the probability that $f$ assigns a wrong label to a random point with respect to $D$, i.e.,
$$\ell_D(f) = \Pr_{(x, y) \sim D}[f(x) \neq y].$$
The goal of PAC learning, with target relative loss $\varepsilon > 0$ and failure probability $\delta > 0$, is to find a classifier $f \in H$ from a predetermined hypothesis class $H \subseteq Y^X$, by observing $m = m(\varepsilon, \delta)$ i.i.d. samples from $D$, such that with probability at least $1 - \delta$, the population loss of $f$, $\ell_D(f)$, is at most $\ell_D(H) + \varepsilon$, where $\ell_D(H)$ is the population loss of the best classifier in $H$, i.e., $\ell_D(H) = \min_{f' \in H} \ell_D(f')$. $\ell_D(H)$ is sometimes called the approximation error, which models how well the hypothesis class $H$ approximates the ground truth classifier.

Throughout this paper, we focus on binary labels, i.e., $Y = \{0, 1\}$. With binary labels, a classifier $f : X \to \{0, 1\}$ corresponds bijectively to the subset of $X$ to which $f$ assigns label 1, i.e., $\{x \in X \mid f(x) = 1\}$. It is sometimes more convenient to deal with the subset of $X$ associated with $f$, and we will treat a classifier $f$ and its associated subset interchangeably in the rest of the paper.

The two central questions of PAC learning discussed earlier are partially answered by the ERM principle, formally defined below. Given $m$ i.i.d. samples $S = \{(x_i, y_i)\}_i \sim D^m$,[1] the ERM principle finds a classifier $f \in H$ which minimizes the empirical loss $\ell_S(f)$ on $S$, defined as
$$\ell_S(f) = \frac{1}{|S|} \sum_{i \in [|S|]} |f(x_i) - y_i|.$$
That is, the ERM principle finds $f \in \operatorname{argmin}_{f' \in H} \ell_S(f')$. This gives a concrete, though not always computationally efficient, way of computing a classifier: it is known that with sufficiently many samples, the classifier found by the ERM principle achieves the desired loss and failure probability. Moreover, the number of samples (also known as the sample complexity; see below for a formal definition) required by the ERM principle is optimal up to a constant factor.[2]

[1] We treat $S$ as an unordered set, since the order does not carry information.

[2] This is true only in the agnostic setting where $\ell_D(H) > 0$, on which we focus our attention throughout the paper. Similar results can be established for the realizable setting where $\ell_D(H) = 0$.
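To make the ERM principle concrete, the following is a minimal sketch (not from the paper; all names and the toy data are illustrative) of vanilla ERM over a finite feature space, with classifiers represented as the sets of features they label 1.

```python
from itertools import chain, combinations

def empirical_loss(f, samples):
    """Fraction of labeled samples (x, y) misclassified by f, where f is the set of features labeled 1."""
    return sum((x in f) != y for x, y in samples) / len(samples)

def erm(hypothesis_class, samples):
    """Vanilla ERM: return any classifier in the class minimizing the empirical loss."""
    return min(hypothesis_class, key=lambda f: empirical_loss(f, samples))

# Toy instance: X = {0, 1, 2}, H = 2^X (the power set of X), illustrative samples.
X = [0, 1, 2]
H = [frozenset(s) for s in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
samples = [(0, 0), (1, 1), (0, 0), (1, 1)]
print(sorted(erm(H, samples)))  # e.g., [1] -- a classifier with zero empirical loss
```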
This sample complexity is asymptotically determined by the so-called VC dimension of the hypothesis class $H$, defined below.

**Definition 1 (VC Dimension).** A subset $S$ of the feature space $X$ is shattered by a hypothesis class $H$ if for any $T \subseteq S$, there is a classifier $f \in H$ such that $f \cap S = T$. The VC dimension of $H$, $d_{VC}(H)$, is the cardinality of the largest subset of $X$ that is shattered by $H$, i.e.,
$$d_{VC}(H) = \max\left\{|S| \;\middle|\; |\{f \cap S \mid f \in H\}| = 2^{|S|}\right\}.$$

The VC dimension captures the capacity of the hypothesis class $H$: the larger the capacity, the more samples it takes to find an approximately optimal classifier in $H$. More precisely, the sample complexity of ERM is given by the following theorem (see, e.g., (Shalev-Shwartz and Ben-David 2014)).

**Theorem 1 (Sample Complexity of ERM).** Fix a feature space $X$, a population distribution $D$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(H) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H$ minimizing the empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H} \ell_S(f')$, has population loss at most $\ell_D(f) \le \ell_D(H) + \varepsilon$. Moreover, finding any classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(H) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

The above theorem says that the ERM principle finds an approximately optimal (up to an additive loss $\varepsilon$) classifier within $H$, with an asymptotically optimal number of samples. In other words, the low empirical loss of the classifier found by the ERM principle generalizes with respect to the population distribution $D$.

### 2.2 Strategic Manipulation in Classification

In the classical PAC learning model, the classifier always observes the real feature of the point being classified. However, as argued above, this is often not the case in real-life scenarios, where the point being classified may strategically modify its feature in order to receive a more desirable outcome. In this paper, we model a data point's ability to modify its feature using a binary relation $\to$, which captures the reporting structure over the feature space. For $x_1, x_2 \in X$, we say $x_1 \to x_2$ if a point with feature $x_1$ can pretend to have (i.e., report) feature $x_2$. For any $x \in X$, we always have $x \to x$, corresponding to the fact that the point may choose not to modify its feature.[3]

[3] Note that this is not a technical requirement; all results in this paper still hold without this assumption.

Throughout the paper, we assume that all points being classified prefer label 1 (corresponding to, e.g., acceptance) to 0. As a result, fixing a misreporting structure $\to$ and a classifier $f : X \to \{0, 1\}$, a point with feature $x$ will pretend to have a feature $x^*$ such that $f(x^*) = 1$ whenever possible, i.e., it will report some $x^* \in \operatorname{argmax}_{x' : x \to x'} f(x')$. Note that it is possible that $x^* = x$. This can happen when $f(x) = 1$, or when $f(x') = 0$ for all $x'$ such that $x \to x'$, including $x$ itself.[4]

[4] A seemingly more general model is one in which there is a cost $c(x, x') \ge 0$ for a point with feature $x$ to report feature $x'$. We remark that with binary labels, it is without loss of generality to ignore this cost, since a way of reporting is feasible iff the gain of receiving label 1 is larger than the cost of reporting.

In the rest of the paper, we focus on the following variant of the PAC learning model. The learning algorithm has access to $m$ i.i.d. unmodified samples $\{(x_i, y_i)\}_i \sim D^m$, based on which a classifier $f \in H$ is computed.[5] The classifier $f$ is then deployed, and random points drawn from $D$ modify their features strategically in response to the classifier. The strategic population loss of $f$, taking strategic manipulation into consideration, can be computed as
$$\hat{\ell}_D(f) = \Pr_{(x, y) \sim D}\left[\max_{x' : x \to x'} f(x') \neq y\right].$$

[5] We consider scenarios where observed data points have no interest in manipulating the learning process, and thus will not modify their features; this is a standard assumption, and appears also in, e.g., (Cullina, Bhagoji, and Mittal 2018). It is certainly true that in many other scenarios, the samples observed may be strategically modified too. However, we do not consider this in the current paper.
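The strategic loss of a deployed classifier can be evaluated on any labeled dataset by letting each point best-respond to it. Below is a minimal sketch (ours, not the paper's), assuming the reporting structure is given as a dictionary `reports` mapping each feature to the set of features it can report (self-reports are implicit).

```python
def best_response_label(x, f, reports):
    """Label a point with true feature x ends up with under classifier f (a set of features labeled 1):
    1 if any feature it can report, including x itself, is labeled 1."""
    return int(any(r in f for r in reports.get(x, set()) | {x}))

def strategic_loss(f, data, reports):
    """Strategic loss of f on labeled pairs (x, y), where each point best-responds before classification."""
    return sum(best_response_label(x, f, reports) != y for x, y in data) / len(data)
```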
Again, the goal is to find a classifier $f \in H$ with strategic population loss at most $\hat{\ell}_D(H) + \varepsilon$, with probability at least $1 - \delta$, where $\hat{\ell}_D(H) = \min_{f' \in H} \hat{\ell}_D(f')$.

## 3 Incentive-Aware Empirical Risk Minimization

In this section, we investigate the ERM principle in the presence of strategic manipulation. We first give an example showing that the vanilla ERM principle, which ignores the strategic aspect of the problem, has poor performance in general. We then consider an incentive-aware variant of ERM, and analyze its generalization behavior.

### 3.1 How Vanilla ERM Fails

Consider the following example. The feature space is $X = \{x_1, x_2, x_3\}$, the hypothesis class is $H = 2^X$, and the population distribution $D$ assigns probability 0.5 to the feature-label pair $(x_1, 0)$, and probability 0.5 to $(x_2, 1)$. The reporting structure allows (1) $x \to x$ for any $x \in X$, and (2) $x_1 \to x_2$ and $x_2 \to x_3$. Note that since the feature space is extremely simple, even the power set $H = 2^X$ (which has VC dimension $d_{VC}(H) = 3$) exhibits decent generalization behavior without strategic manipulation.

Recall that the vanilla ERM principle finds an arbitrary classifier which minimizes the empirical loss based on the unmodified sample set. Suppose the number of samples is $m = \omega(1)$. Then with high probability, the sample set consists of copies of $(x_1, 0)$ and $(x_2, 1)$ and nothing else. Any classifier $f$ satisfying $f(x_1) = 0$ and $f(x_2) = 1$ achieves 0 empirical loss. However, with strategic manipulation, such an $f$ ends up assigning label 1 to all (new) points whose (true) feature is $x_1$, since those points can report $x_2$ to fool the classifier. As a result, the strategic population loss is
$$\hat{\ell}_D(f) = \Pr_{(x, y) \sim D}\left[\max_{x' : x \to x'} f(x') \neq y\right] \ge 0.5 \cdot \left|\max_{x' : x_1 \to x'} f(x') - 0\right| = 0.5 \cdot |f(x_2) - 0| = 0.5.$$
In other words, with high probability, any classifier found using the vanilla ERM principle is no better than random guessing. But in fact, the classifier $f^*$ with $f^*(x_1) = f^*(x_2) = 0$ and $f^*(x_3) = 1$ would have 0 loss with strategic behavior, because $\max_{x' : x_1 \to x'} f^*(x') = 0$ and $\max_{x' : x_2 \to x'} f^*(x') = f^*(x_3) = 1$. Hence it is possible to do much better than vanilla ERM.

The above example shows that with strategic manipulation, the vanilla ERM principle fails spectacularly even in extremely simple cases that would be trivial without strategic manipulation. This observation aligns with the intuition that any learning procedure achieving nontrivial performance must exploit the reporting structure.

### 3.2 The Incentive-Aware ERM Principle

Below we present an incentive-aware version of the ERM principle (henceforth IA ERM), and show that (1) IA ERM finds a classifier with the desired properties, and (2) the sample complexity of IA ERM is asymptotically optimal. These properties of IA ERM closely resemble those of its classical counterpart without strategic manipulation, and suggest that IA ERM is the right version of the ERM principle in the strategic setting that we consider.

**Incentive-aware ERM.** The idea is to minimize the strategic empirical loss $\hat{\ell}_S(f)$ on the sample set $S = \{(x_i, y_i)\}_i$, computed by replacing the true feature of every sample point with the most beneficial feature that the point can report, i.e.,
$$\hat{\ell}_S(f) = \frac{1}{|S|} \sum_{i \in [|S|]} \left|\max_{x' : x_i \to x'} f(x') - y_i\right|.$$
The strategic empirical loss is a natural quantity to consider, since the expectation of $\hat{\ell}_S(f)$ over $S$ is precisely the strategic population loss $\hat{\ell}_D(f)$, which we aim to minimize. Given the notion of strategic empirical loss, IA ERM simply finds any classifier $f$ in the hypothesis class $H$ which minimizes $\hat{\ell}_S(f)$, i.e., $f \in \operatorname{argmin}_{f' \in H} \hat{\ell}_S(f')$.
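As an illustration (a sketch of ours, not the paper's implementation), IA ERM over a finite hypothesis class can be run by brute force on the example of Section 3.1; it recovers the classifier $f^*$ that labels only $x_3$ as 1.

```python
from itertools import chain, combinations

def strategic_empirical_loss(f, samples, reports):
    """Strategic empirical loss: each sample point reports its most beneficial reachable feature."""
    effective_label = lambda x: int(any(r in f for r in reports.get(x, set()) | {x}))
    return sum(effective_label(x) != y for x, y in samples) / len(samples)

def ia_erm(hypothesis_class, samples, reports):
    """Incentive-aware ERM: pick any classifier minimizing the strategic empirical loss."""
    return min(hypothesis_class, key=lambda f: strategic_empirical_loss(f, samples, reports))

# The example of Section 3.1: X = {x1, x2, x3}, H = 2^X, with x1 -> x2 and x2 -> x3.
X = ["x1", "x2", "x3"]
H = [frozenset(s) for s in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
reports = {"x1": {"x2"}, "x2": {"x3"}}  # self-reports are implicit
samples = [("x1", 0), ("x2", 1)] * 10   # with many samples, only these two pairs appear

print(sorted(ia_erm(H, samples, reports)))  # ['x3'] -- the classifier f* from the example above
```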
We now analyze the generalization behavior of IA ERM. Recall that in the classical PAC setting without strategic manipulation, the sample complexity of ERM depends on the VC dimension of the hypothesis class $H$. Here, with strategic manipulation, the VC dimension of $H$ is no longer the right capacity measure to consider. Instead of $H$, we consider the effective hypothesis class $\hat{H}$, defined below.

**Definition 2 (Effective Classifier / Hypothesis Class).** Fixing a feature space $X$ and a reporting structure $\to$, for each classifier $f \in 2^X$, the effective classifier $\hat{f}$ consists of the set of features to which $f$ effectively assigns label 1, i.e.,
$$\hat{f} = \{x \in X \mid \exists x' : x \to x' \text{ and } f(x') = 1\}.$$
Fixing a hypothesis class $H$, the effective hypothesis class $\hat{H}$ consists of the effective classifier of every classifier in $H$, i.e., $\hat{H} = \{\hat{f} \mid f \in H\}$.

One may intuitively interpret the effective hypothesis class in the following way. Any classifier $f$ in the presence of strategic manipulation is equivalent to the effective classifier $\hat{f}$ without strategic manipulation. Therefore, by considering the effective hypothesis class $\hat{H}$, we can effectively ignore strategic manipulation, and reduce the generalization analysis to that in the classical PAC setting, i.e., Theorem 1. This gives us the following theorem.

**Theorem 2 (Sample Complexity of IA ERM).** Fix a feature space $X$, a population distribution $D$, a reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(\hat{H}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H$ minimizing the strategic empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H} \hat{\ell}_S(f')$, has strategic population loss at most $\hat{\ell}_D(f) \le \hat{\ell}_D(H) + \varepsilon$. Moreover, finding any classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(\hat{H}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

The proof of Theorem 2, as well as all other proofs, can be found in Appendix A. In words, the above theorem says that IA ERM finds a classifier whose strategic population loss is close to that of the best classifier in the hypothesis class $H$, and the sample complexity of finding such a classifier is determined by the VC dimension of the effective hypothesis class $\hat{H}$, rather than that of $H$ itself.

We make a few remarks regarding IA ERM. First, when the reporting structure is trivial, i.e., a point with feature $x$ can only report $x$ itself, the effective class is $\hat{H} = H$, and IA ERM reduces to vanilla ERM. In such cases, Theorem 2 reduces precisely to the classical Theorem 1. Second, in general, the VC dimension $d_{VC}(\hat{H})$ of the effective class $\hat{H}$ may be smaller or larger than that of $H$. Below we give two examples, showing that either of the two quantities can be arbitrarily larger than the other.

We first give an example where $d_{VC}(H) = \infty$ and $d_{VC}(\hat{H}) = 1$. Consider the example discussed in the introduction. There, the feature space is $X = \mathbb{R}_+ = [0, \infty)$, and the reporting structure satisfies, for any $x_1, x_2 \in \mathbb{R}_+$, $x_1 \to x_2 \iff x_2 \le x_1$ (a point may only underreport). Consider $H = 2^{\mathbb{R}_+}$. Clearly $d_{VC}(H) = \infty$, since any $S \subseteq \mathbb{R}_+$ is shattered by $H$. On the other hand, for any $f \in H$, $\hat{f}$ is essentially determined by a threshold $\theta_f = \inf\{x \mid f(x) = 1\}$:
$$\hat{f}(x) = \begin{cases} 0, & x < \theta_f \\ 1, & x > \theta_f \\ f(x), & x = \theta_f. \end{cases}$$
In other words, $\hat{H}$ is the class of threshold classifiers over $\mathbb{R}_+$. It is well known that $d_{VC}(\hat{H}) = 1$.
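For finite feature spaces, the effective classifier of Definition 2 can be computed directly. The sketch below is our own illustration, with the same assumed dictionary representation of the reporting structure as before; it shows the effective classifier of the classifier from Section 3.1 that labels only $x_2$ as 1.

```python
def effective_classifier(f, X, reports):
    """Effective classifier of f (Definition 2): x is effectively labeled 1 iff some feature
    it can report (including x itself) is labeled 1 by f."""
    return frozenset(x for x in X if any(r in f for r in reports.get(x, set()) | {x}))

# In the example of Section 3.1, the classifier {x2} effectively also accepts x1,
# since a point with feature x1 can report x2.
X = ["x1", "x2", "x3"]
reports = {"x1": {"x2"}, "x2": {"x3"}}
print(sorted(effective_classifier(frozenset({"x2"}), X, reports)))  # ['x1', 'x2']
```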
Now we show that $d_{VC}(\hat{H})$ can be arbitrarily larger than $d_{VC}(H)$. Let the feature space be $X = \mathbb{N}$, and let $H = \{\{i\} \mid i \in \mathbb{N}\}$, i.e., the set of all singletons. It is clear that $d_{VC}(H) = 1$. Now, for any $d \in \mathbb{N}$, we construct $\to$ such that $d_{VC}(\hat{H}) = d$.[6] Let $\to$ be such that (1) $i \to i$ for any $i \in \mathbb{N}$, and (2) for any $i \in \{d, \ldots, d + 2^d - 1\}$ and $j \in \{0, \ldots, d - 1\}$, $j \to i$ if the $(j+1)$-th digit in the binary representation of $i - d$ is 1. Now we argue that $\hat{H}$ shatters $\{0, \ldots, d - 1\}$. In fact, any subset $S \subseteq \{0, \ldots, d - 1\}$ can be viewed as the binary representation of an integer $i_S \in \{0, \ldots, 2^d - 1\}$, where the $(j+1)$-th digit of $i_S$ is 1 iff $j \in S$. Consider the classifier $f_S = \{i_S + d\} \in H$. Clearly $\widehat{f_S} \cap \{0, \ldots, d - 1\} = S$, since precisely the points in $S$ can report $i_S + d$, which is the only point assigned label 1 by $f_S$.

[6] Similar constructions could also give $d_{VC}(\hat{H}) = \infty$.

## 4 Incentive-Compatible Empirical Risk Minimization

The IA ERM principle, together with Theorem 2, provides a rather general solution for PAC learning in the presence of strategic manipulation. Still, one may wonder whether it is possible to achieve more desirable properties, e.g., enhanced robustness against strategic manipulation and better generalization guarantees, by further refining or regularizing IA ERM. In this section, we propose incentive compatibility as a means of regularization, which leads to the incentive-compatible ERM principle (henceforth IC ERM).

We first present the IC ERM principle and provide a general analysis of its sample complexity in Section 4.1. In particular, we show that IC ERM generalizes no worse than IA ERM when applied to the same hypothesis class. In Section 4.2, we consider a special case of IC ERM, free IC ERM, where the hypothesis class is implicitly the set of all possible classifiers over the feature space $X$, i.e., $H = 2^X$. We give an efficient algorithm for free IC ERM, and more importantly, we show that, somewhat surprisingly, it is still possible to derive nontrivial generalization bounds for free IC ERM. Based on the generalization analysis of free IC ERM, we provide a hypothesis-class-independent sample complexity bound for IC ERM, which further illustrates the power of incentive compatibility as a means of regularization. Finally, in Section 4.3, we discuss an important class of reporting structures, transitive reporting structures, on which IC ERM is equivalent to the more general IA ERM. Based on this equivalence, we give a hypothesis-class-independent sample complexity bound for IA ERM that applies whenever the reporting structure is transitive.

### 4.1 The Incentive-Compatible ERM Principle

First we introduce the notions of incentive-compatible classifiers and hypothesis classes. Incentive compatibility is a standard concept in mechanism design, and straightforwardly applying it in our context results in the following definition.

**Definition 3 (Incentive-Compatible Classifiers / Hypothesis Classes).** Fixing a feature space $X$ and a reporting structure $\to$, a classifier $f : X \to \{0, 1\}$ is incentive compatible if for any $x_1, x_2 \in X$,
$$x_1 \to x_2 \implies f(x_1) \ge f(x_2).$$
In other words, no point can receive a better outcome by pretending to have a different feature. A hypothesis class $H$ is incentive compatible if it contains only incentive-compatible classifiers.
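Definition 3 is straightforward to check mechanically. A minimal sketch (ours, not the paper's), again with the reporting structure represented as a dictionary and classifiers as sets of accepted features:

```python
def is_incentive_compatible(f, reports):
    """Definition 3: for every report x1 -> x2, require f(x1) >= f(x2), so misreporting never helps."""
    return all((x1 in f) >= (x2 in f) for x1, targets in reports.items() for x2 in targets)

reports = {"x1": {"x2"}, "x2": {"x3"}}  # the reporting structure from Section 3.1
print(is_incentive_compatible(frozenset({"x2"}), reports))        # False: x1 gains by reporting x2
print(is_incentive_compatible(frozenset({"x3"}), reports))        # False: x2 gains by reporting x3
print(is_incentive_compatible(frozenset({"x1", "x2"}), reports))  # True: labels never increase along ->
```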
The intuition behind these definitions is simple: when an incentive-compatible classifier is deployed, no point is motivated to strategically modify its feature, since it is impossible to obtain a better outcome by doing so. As a result, the effective classifier induced by any incentive-compatible classifier $f$ is $f$ itself, i.e., $\hat{f} = f$, and the strategic (empirical) loss of an incentive-compatible classifier $f$ is the same as the (empirical) loss of the same classifier without strategic manipulation, i.e., $\hat{\ell}_D(f) = \ell_D(f)$ and $\hat{\ell}_S(f) = \ell_S(f)$. Incentive-compatible classifiers therefore provide arguably the strongest robustness one may hope for against strategic manipulation: they eliminate strategic manipulation altogether. The IC ERM principle presented below always finds a classifier that is incentive compatible.

**Incentive-compatible ERM.** For any hypothesis class $H$, let the incentive-compatible subclass $H^{IC}(\to)$ be the largest subset of $H$ that is incentive compatible under $\to$, i.e.,
$$H^{IC}(\to) = \{f \in H \mid f \text{ is incentive compatible under } \to\}.$$
We omit the parameter $\to$ when it is clear from the context. Given a hypothesis class $H$, the IC ERM principle finds any classifier $f$ in the incentive-compatible subclass $H^{IC}$ minimizing the empirical loss $\ell_S(f)$ on the sample set $S$, i.e., $f \in \operatorname{argmin}_{f' \in H^{IC}} \ell_S(f') = \operatorname{argmin}_{f' \in H^{IC}} \hat{\ell}_S(f')$.

We now analyze the sample complexity of IC ERM. Observe that IC ERM with hypothesis class $H$ is equivalent to vanilla ERM with hypothesis class $H^{IC}$. Therefore, applying Theorem 1 to $H^{IC}$, we immediately obtain the following asymptotically optimal sample complexity bound for IC ERM.

**Theorem 3 (Sample Complexity of IC ERM).** Fix a feature space $X$, a population distribution $D$, a reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(H^{IC}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H^{IC}$ minimizing the empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H^{IC}} \ell_S(f')$, has strategic population loss at most $\hat{\ell}_D(f) \le \hat{\ell}_D(H^{IC}) + \varepsilon$. Moreover, finding any incentive-compatible classifier within $H$ achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(H^{IC}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

Theorem 3 is but a direct application of the classical Theorem 1. To obtain further insights into the sample complexity of IC ERM, we need to take a closer look at the VC dimension of the incentive-compatible subclass $H^{IC}$, which dictates the sample complexity. As discussed above, IC ERM can be viewed as a regularized version of IA ERM. Below we formalize this intuition, by showing that the sample complexity of IC ERM is in fact no larger than that of IA ERM, or that of vanilla ERM, when applied to the same hypothesis class $H$. This is done in the following claim via upper bounding the VC dimension of $H^{IC}$ by that of $\hat{H}$, as well as that of $H$.

**Proposition 1.** Fixing a feature space $X$ and a reporting structure $\to$, for any hypothesis class $H$ over $X$, $H^{IC} \subseteq H$ and $H^{IC} \subseteq \hat{H}$. As a result, $d_{VC}(H^{IC}) \le \min\{d_{VC}(H), d_{VC}(\hat{H})\}$.

In many natural scenarios, the VC dimension of $H^{IC}$ is significantly smaller than $d_{VC}(\hat{H})$. Below we give an example where $d_{VC}(H) = \infty$ and $d_{VC}(H^{IC}) = 1$. Consider again the introductory example, where $X = \mathbb{R}_+$. As argued above, when $H = 2^{\mathbb{R}_+}$, $\hat{H}$ is the class of all threshold classifiers, and $d_{VC}(\hat{H}) = 1$. On the other hand, by Proposition 1, $H^{IC} \subseteq \hat{H}$. In fact, one may show that every classifier in $\hat{H}$ is incentive compatible, so $H^{IC} = \hat{H}$. As a result, $d_{VC}(H^{IC}) = d_{VC}(\hat{H}) = 1$.

So, in addition to the highly desirable property of incentive compatibility, IC ERM also has at most the same, and often much better, sample complexity than IA ERM, which translates to better generalization for a fixed number of samples.
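IC ERM itself is then just vanilla ERM over the incentive-compatible subclass. The brute-force sketch below (illustrative only, under the same assumed representations as before) runs it on the example of Section 3.1; note that no incentive-compatible classifier can do better than loss 0.5 there, which previews the approximation-error tradeoff discussed next.

```python
from itertools import chain, combinations

def is_ic(f, reports):
    """Incentive compatible: every report x1 -> x2 satisfies f(x1) >= f(x2)."""
    return all((x1 in f) >= (x2 in f) for x1, ts in reports.items() for x2 in ts)

def ic_erm(hypothesis_class, samples, reports):
    """IC ERM: restrict to the incentive-compatible subclass, then run vanilla ERM on it.
    (No strategic correction is needed, since IC classifiers are immune to manipulation.)"""
    ic_subclass = [f for f in hypothesis_class if is_ic(f, reports)]
    empirical_loss = lambda f: sum((x in f) != y for x, y in samples) / len(samples)
    return min(ic_subclass, key=empirical_loss)

X = ["x1", "x2", "x3"]
H = [frozenset(s) for s in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
reports = {"x1": {"x2"}, "x2": {"x3"}}
samples = [("x1", 0), ("x2", 1)] * 10
print(sorted(ic_erm(H, samples, reports)))  # e.g., [] -- every IC classifier misclassifies half of this sample
```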
We also remark that IC ERM finds an approximately optimal classifier in the incentive-compatible subclass $H^{IC}$, which may have a worse approximation error than $H$. In other words, to achieve incentive compatibility, one in general has to sacrifice some accuracy in the form of a larger approximation error. We will see more desirable properties of IC ERM in the rest of this section.

### 4.2 Free IC ERM and Generalization from Incentive Compatibility

We now consider free IC ERM, which is a special case of IC ERM where the hypothesis class consists of all possible classifiers over $X$, i.e., $H = H_0 = 2^X$. At first glance this may appear senseless: in the classical PAC setting without strategic manipulation, no (nontrivial) generalization is possible if nothing is known about the ground truth a priori, i.e., when the hypothesis class consists of all possible classifiers. However, as we show below, the reporting structure, together with the requirement of incentive compatibility, in fact induces nontrivial structure over the incentive-compatible subclass $H_0^{IC}$, which allows nontrivial sample complexity and generalization bounds. This structure is captured by the following complexity measure, which we term the intrinsic VC dimension.

**Definition 4 (Intrinsic VC Dimension).** Fix a reporting structure $\to$ over a feature space $X$. For any $x, x' \in X$, we say $x$ can reach $x'$ (denoted $x \rightsquigarrow x'$) if there exists a sequence of features $x_1, \ldots, x_k$ for some integer $k > 0$ such that $x_1 = x$, $x_k = x'$, and for any $i \in [k - 1]$, $x_i \to x_{i+1}$. A set of features $S \subseteq X$ is independent if for any $x, x' \in S$ where $x \ne x'$, $x$ cannot reach $x'$. The intrinsic VC dimension of $\to$ over $X$, $d_{VC}(X, \to)$, is the cardinality of a maximum independent subset of $X$, i.e.,
$$d_{VC}(X, \to) = \max\left\{|S| \;\middle|\; S \subseteq X : \nexists\, x, x' \in S \text{ s.t. } x \ne x' \text{ and } x \rightsquigarrow x'\right\}.$$

It turns out that the intrinsic VC dimension of the reporting structure is precisely the VC dimension of the incentive-compatible subclass $H_0^{IC}$ of the null hypothesis class $H_0$, as formalized in the following proposition.

**Proposition 2 (VC Dimension of the Null Hypothesis Class).** For any feature space $X$ and reporting structure $\to$ over $X$, $d_{VC}(X, \to) = d_{VC}(H_0^{IC}(\to))$, where $H_0 = 2^X$ is the null hypothesis class.

Note that for any hypothesis class $H \subseteq 2^X = H_0$ over $X$, the incentive-compatible subclass of $H$ is a subclass of that of the null hypothesis class $H_0$, i.e., $H^{IC} \subseteq H_0^{IC}$. As a result, we always have $d_{VC}(H^{IC}) \le d_{VC}(H_0^{IC}) = d_{VC}(X, \to)$. This, together with Theorem 3, immediately implies the following hypothesis-class-independent sample complexity bound for IC ERM, which, in particular, applies to free IC ERM.

**Theorem 4 (Hypothesis-Class-Independent Sample Complexity Bound for IC ERM).** Fix a feature space $X$, a population distribution $D$, a reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H^{IC}$ minimizing the empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H^{IC}} \ell_S(f')$, has strategic population loss $\hat{\ell}_D(f) \le \hat{\ell}_D(H^{IC}) + \varepsilon$. Moreover, when the hypothesis class $H$ is the null hypothesis class $H_0$, finding any incentive-compatible classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

We also present an efficient algorithm for free IC ERM in Appendix B.
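For finite feature spaces, the intrinsic VC dimension of Definition 4 can be computed (inefficiently) by brute force: take the reachability closure of the reporting relation, then search for a largest independent set. The sketch below is our own illustration, exponential in $|X|$ and only meant to make the definition concrete.

```python
from itertools import combinations

def reachability(reports, X):
    """reach[x] = set of features x can reach via a chain of reports (Definition 4), including x itself."""
    reach = {x: set(reports.get(x, set())) | {x} for x in X}
    changed = True
    while changed:
        changed = False
        for x in X:
            bigger = set().union(*(reach[t] for t in reach[x]))
            if not bigger <= reach[x]:
                reach[x] |= bigger
                changed = True
    return reach

def intrinsic_vc_dimension(X, reports):
    """Largest S such that no two distinct features in S can reach one another."""
    reach = reachability(reports, X)
    independent = lambda S: all(b not in reach[a] for a in S for b in S if a != b)
    return max(len(S) for r in range(len(X) + 1) for S in combinations(X, r) if independent(S))

# The chain x1 -> x2 -> x3 from Section 3.1 has intrinsic VC dimension 1:
# any two distinct features are comparable under reachability.
print(intrinsic_vc_dimension(["x1", "x2", "x3"], {"x1": {"x2"}, "x2": {"x3"}}))  # 1
```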
### 4.3 Transitive Reporting and the Revelation Principle

Finally, we investigate an important family of reporting structures: transitive reporting structures. It turns out that with transitivity, the so-called revelation principle from mechanism design holds: if one accounts for strategic reporting, then without loss of generality one may focus on incentive-compatible classifiers. That is, IC ERM is as general as IA ERM. As a result, with transitivity, the hypothesis-class-independent sample complexity bound extends to IA ERM applied to any hypothesis class.

We first give the formal definition of transitive reporting structures, which is essentially the same as that of transitive binary relations.

**Definition 5 (Transitive Reporting Structures).** A reporting structure $\to$ over $X$ is transitive if for any $x_1, x_2, x_3 \in X$,
$$(x_1 \to x_2 \text{ and } x_2 \to x_3) \implies x_1 \to x_3.$$

One example of a transitive reporting structure is the example from the introduction, where $X = \mathbb{R}_+$ and $\to$ is precisely the relation $\ge$ (a point may report any balance no larger than its true one), which is clearly transitive. It turns out that transitivity of reporting structures is equivalent to the revelation principle holding, which roughly says that any classifier is equivalent to a (possibly different) incentive-compatible classifier. Formally:

**Proposition 3 (Revelation Principle for PAC Learning).** The following is true if and only if the reporting structure $\to$ is transitive: for any classifier $f : X \to \{0, 1\}$, the effective classifier $\hat{f}$ is incentive compatible, and as a result, for any hypothesis class $H$, the effective class $\hat{H}$ is incentive compatible (i.e., $(\hat{H})^{IC} = \hat{H}$).

In light of Proposition 3, when the reporting structure is transitive, for any $H$, IA ERM with hypothesis class $H$ is equivalent to IC ERM with hypothesis class $\hat{H}$, in the sense that they yield the same effective classifier given any sample set, with the same sample complexity. This gives us a way to translate the hypothesis-class-independent sample complexity bound for IC ERM to IA ERM, and obtain the following theorem.

**Theorem 5 (Hypothesis-Class-Independent Sample Complexity Bound for IA ERM).** Fix a feature space $X$, a population distribution $D$, a transitive reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H$ minimizing the strategic empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H} \hat{\ell}_S(f')$, has strategic population loss $\hat{\ell}_D(f) \le \hat{\ell}_D(H) + \varepsilon$. Moreover, when the hypothesis class $H$ is the null hypothesis class $H_0$, finding any classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

To develop intuition, consider again the introductory example. As discussed above, there, $X = \mathbb{R}_+$ and the reporting structure $\to$, which coincides with $\ge$, is transitive. Moreover, the intrinsic VC dimension of $\to$ is $d_{VC}(X, \to) = 1$. This is because any singleton set is independent, and for any $x_1 \ne x_2$, either $x_1 < x_2$ or $x_2 < x_1$, so $\{x_1, x_2\}$ cannot be independent. On the other hand, as argued above, any classifier $f \in 2^X$ effectively implements a threshold $\theta_f$, where the effective classifier assigns label 1 iff $x \ge \theta_f$ or iff $x > \theta_f$ (depending on whether the threshold itself is labeled 1). It is well known (see, e.g., (Shalev-Shwartz and Ben-David 2014)) that $\Theta(\log(1/\delta) / \varepsilon^2)$ samples suffice to learn such a threshold with relative loss $\varepsilon$ and failure probability $\delta$, which coincides with the above sample complexity bound based on the intrinsic VC dimension.
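Proposition 3 can be sanity-checked numerically on a transitive reporting structure such as a discretized version of the one from the introduction (a point may only underreport). The sketch below is our own illustration: for randomly drawn classifiers, the induced effective classifier is always incentive compatible.

```python
import random

def effective(f, X, reports):
    """Effective classifier of f: x is labeled 1 iff some feature it can report lies in f."""
    return frozenset(x for x in X if any(r in f for r in reports[x] | {x}))

def is_ic(f, reports):
    """Incentive compatible: every report a -> b satisfies f(a) >= f(b)."""
    return all((a in f) >= (b in f) for a, ts in reports.items() for b in ts)

# Transitive reporting structure: x -> y iff y <= x (underreporting), as in the introduction.
X = list(range(10))
reports = {x: {y for y in X if y <= x} for x in X}

random.seed(0)
for _ in range(200):
    f = frozenset(x for x in X if random.random() < 0.5)
    assert is_ic(effective(f, X, reports), reports)  # the revelation principle: the effective classifier is IC
print("every sampled effective classifier was incentive compatible")
```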
## Acknowledgements

The authors are thankful for support from NSF under award IIS-1814056.

## Ethics Statement

Potential positive impact: Our results can be applied, for example, to prevent fraud. Naïve learning processes that do not anticipate strategic behavior will not perform at all as expected in practice, creating substantial risk. Our results could make a socially beneficial classification process more accurate. By encouraging truthful reporting, our results could allow for more efficient allocation of resources to those in actual need. Classifiers that are not incentive compatible make strategic behavior advantageous to individuals. Vulnerable communities often are not aware of this or do not know how to play the game strategically, exacerbating inequality. Incentive-compatible classifiers remove this concern. (Similar arguments have been made for the use of incentive-compatible mechanisms in allocating students to public schools, where wealthier parents often know how to play the game better.)

Potential negative impact: Our results could make a socially harmful classification process more accurate. Our results encourage entities being classified to report their true features, which, if handled carelessly, could result in privacy issues.

## References

Alon, N.; Ben-David, S.; Cesa-Bianchi, N.; and Haussler, D. 1997. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM) 44(4): 615–631.

Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3(Nov): 463–482.

Braverman, M.; Mao, J.; Schneider, J.; and Weinberg, S. M. 2019. Multi-armed Bandit Problems with Strategic Arms. In Conference on Learning Theory, 383–416.

Chen, Y.; Podimata, C.; Procaccia, A. D.; and Shah, N. 2018. Strategyproof linear regression in high dimensions. In Proceedings of the 2018 ACM Conference on Economics and Computation, 9–26.

Cullina, D.; Bhagoji, A. N.; and Mittal, P. 2018. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, 230–241.

Daniely, A.; Sabato, S.; Ben-David, S.; and Shalev-Shwartz, S. 2015. Multiclass learnability and the ERM principle. The Journal of Machine Learning Research 16(1): 2377–2404.

Dekel, O.; Fischer, F.; and Procaccia, A. D. 2010. Incentive compatible regression learning. Journal of Computer and System Sciences 76(8): 759–777.

Devroye, L.; Györfi, L.; and Lugosi, G. 2013. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media.

Feng, Z.; Parkes, D. C.; and Xu, H. 2019. The intrinsic robustness of stochastic bandits to strategic manipulation. arXiv preprint arXiv:1906.01528.

Freeman, R.; Pennock, D. M.; Podimata, C.; and Vaughan, J. W. 2020. No-Regret and Incentive-Compatible Prediction with Expert Advice. arXiv preprint arXiv:2002.08837.

Green, J. R.; and Laffont, J.-J. 1986. Partially verifiable information and mechanism design. The Review of Economic Studies 53(3): 447–456.

Haghtalab, N.; Immorlica, N.; Lucier, B.; and Wang, J. 2020. Maximizing Welfare with Incentive-Aware Evaluation Mechanisms. In 29th International Joint Conference on Artificial Intelligence.

Hanneke, S. 2016. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research 17(1): 1319–1333.

Hardt, M.; Megiddo, N.; Papadimitriou, C.; and Wootters, M. 2016. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 111–122.

Kephart, A.; and Conitzer, V. 2015. Complexity of mechanism design with signaling costs. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 357–365.

Kephart, A.; and Conitzer, V. 2016. The revelation principle for mechanism design with reporting costs. In Proceedings of the 2016 ACM Conference on Economics and Computation, 85–102.
Kleinberg, J.; and Raghavan, M. 2019. How Do Classifiers Induce Agents to Invest Effort Strategically? In Proceedings of the 2019 ACM Conference on Economics and Computation, 825–844.

Krishnaswamy, A.; Li, H.; Rein, D.; Zhang, H.; and Conitzer, V. 2021. Classification with Strategically Withheld Data. In Proceedings of the AAAI Conference on Artificial Intelligence.

Perote, J.; and Perote-Peña, J. 2004. Strategy-proof estimators for simple regression. Mathematical Social Sciences 47(2): 153–176.

Pollard, D. 2012. Convergence of stochastic processes. Springer Science & Business Media.

Roughgarden, T.; and Schrijvers, O. 2017. Online prediction with selfish experts. In Advances in Neural Information Processing Systems, 1300–1310.

Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding machine learning: From theory to algorithms. Cambridge University Press.

Talagrand, M. 1994. Sharper bounds for Gaussian and empirical processes. The Annals of Probability: 28–76.

Valiant, L. G. 1984. A theory of the learnable. Communications of the ACM 27(11): 1134–1142.

Vapnik, V.; and Chervonenkis, A. Y. 1971. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications 16(2): 264–280.

Yu, L. 2011. Mechanism design with partial verification and revelation principle. Autonomous Agents and Multi-Agent Systems 22(1): 217–223.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2019a. Distinguishing Distributions When Samples Are Strategically Transformed. In Advances in Neural Information Processing Systems, 3187–3195.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2019b. When samples are strategically selected. In International Conference on Machine Learning, 7345–7353.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2021a. Automated Mechanism Design for Classification with Partial Verification. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2021b. Classification with Few Tests through Self-Selection. In Proceedings of the AAAI Conference on Artificial Intelligence.