# Incentive-Aware PAC Learning

Hanrui Zhang, Vincent Conitzer
Duke University
hrzhang@cs.duke.edu, conitzer@cs.duke.edu

## Abstract

We study PAC learning in the presence of strategic manipulation, where data points may modify their features in certain predefined ways in order to receive a better outcome. We show that the vanilla ERM principle fails to achieve any nontrivial guarantee in this context. Instead, we propose an incentive-aware version of the ERM principle which has asymptotically optimal sample complexity. We then focus our attention on incentive-compatible classifiers, which provably prevent any kind of strategic manipulation. We give a sample complexity bound that is, curiously, independent of the hypothesis class, for the ERM principle restricted to incentive-compatible classifiers. This suggests that incentive compatibility alone can act as an effective means of regularization. We further show that it is without loss of generality to consider only incentive-compatible classifiers when opportunities for strategic manipulation satisfy a transitivity condition. As a consequence, in such cases, our hypothesis-class-independent sample complexity bound applies even without incentive compatibility. Our results set the foundations of incentive-aware PAC learning.

## 1 Introduction

Consider the following scenario. A financial institution plans to offer a product to a selected group of customers, and decides that the only thing that should matter in the selection process is the amount of money a customer currently owns in savings (possibly held at a different financial institution). Each customer may prove her savings balance by submitting a bank statement, and the number shown therein will be used for the selection. However, customers may choose to underreport their balance, by, for example, temporarily transferring their money to another account before the statement date. (We assume for the purpose of this example that overreporting is not possible.) The question is, how does this ability of customers to strategically underreport their balance affect the selection process?

The answer, as one would expect, depends on the specific criteria of the selection, which are presumably determined by the nature of the product. For instance, the product could be a secured loan that requires the customer to use the savings as collateral. In that case, customers benefit from their balance being as high as possible, which means they would submit a statement with the full balance. As a result, the fact that customers can underreport would not affect the selection at all. Alternatively, the financial institution could be a non-profit entity that aims to make a product available only to those with low savings. In that case, every customer who wants access to this product would underreport her balance in order to be approved, and this would make the selection effectively meaningless.

More generally, often the right criteria for the selection may not be clear a priori, but have to be learned from observations. In such cases, the decision maker (e.g., the institution) has access to labeled historical data (e.g., the amount of savings owned by some customer, and whether the product was successful for the same customer).
A classifier (e.g., selection criteria) is derived from the data and implemented, to which future data points (e.g., new customers) to be classified respond in a strategic way. The goal, naturally, is to classify future data points as accurately as possible, taking into account that their features may be strategically modified. The key challenge is for the derived classifier to generalize from past unmodified observations to future strategic data points. Such problems are partially captured by the PAC learning model (discussed in detail below), but also exhibit additional complexity from strategic manipulation of the features, where the nature of this manipulation is affected by the choice of classifier. In this paper, we investigate novel phenomena in PAC learning introduced by the presence of strategic manipulation.

**Our results.** Two central questions in PAC learning are the following: statistically, how many labeled data points does one need to observe in order to derive a classifier of a desired quality, and computationally, given this many labeled data points, how can one compute such a classifier? We answer these questions for arbitrary feature spaces and structures of strategic manipulation, by presenting an adapted version of the empirical risk minimization (ERM) principle, which roughly says that one should simply pick a classifier that has the best quality on the unmodified labeled data points that are observed. We show that our incentive-aware version of the ERM principle requires asymptotically the minimum possible number of labeled data points for any specific problem setup. This can be viewed as a strategic version of VC theory, one of the central results in traditional PAC learning.

We further consider incentive-compatible classifiers, which provably prevent any kind of strategic manipulation by making revealing one's true features always optimal for any data point. Here, our most remarkable result is a hypothesis-class-independent bound on the number of labeled data points required to compute an incentive-compatible classifier of a desired quality. In traditional PAC learning, it is well known that without any prior knowledge about the ground truth (typically modeled by a hypothesis class consisting of possible classifiers to be considered), it is impossible to learn anything nontrivial, unless the number of labeled data points observed is trivially large or even infinite. By contrast, we show that in the presence of strategic manipulation, it is possible to learn a nontrivial classifier via our strategic version of the ERM principle without any such prior knowledge, except that the classifier learned has to be incentive-compatible. In other words, incentive compatibility acts as a means of regularization, which provides nontrivial learning guarantees in and of itself.

Moreover, when the structure of strategic manipulation is transitive (i.e., if A can pretend to be B and B can pretend to be C, then A can pretend to be C), considering incentive-compatible classifiers is without loss of generality; this is commonly known as the revelation principle in economic theory. This, together with our results for incentive-compatible classifiers, implies that ERM-type learning in the presence of transitive strategic manipulation is automatically regularized.
That is, the hypothesis-class-independent bound discussed above applies even without any exogenous requirement of incentive compatibility (since requiring this condition is without loss of generality), or any other prior knowledge.

**Related work.** There is an extremely rich body of research on PAC learning (Valiant 1984). Since the seminal result by Vapnik and Chervonenkis (1971), various measures of complexity have been studied (Alon et al. 1997; Bartlett and Mendelson 2002; Pollard 2012; Daniely et al. 2015), and tighter sample complexity bounds have been developed (Talagrand 1994; Hanneke 2016). See, e.g., (Devroye, Györfi, and Lugosi 2013; Shalev-Shwartz and Ben-David 2014) for a comprehensive exposition of PAC learning. Along this line of research, one result that is particularly related to ours is (Cullina, Bhagoji, and Mittal 2018). There, the authors also study PAC learning when data points may modify their features in certain ways. The model they consider is similar to ours in several ways, e.g., historical observations are not modified, and new data points to be classified can modify their features according to a binary relation over the feature space. The main difference between our results and theirs is that they assume data points to be adversarial, in that they always want to receive the incorrect label, whereas in our model, data points are strategic, in that they always want to receive the more desirable label (i.e., label 1), regardless of whether that is the correct label or not.

A closely related research topic is strategic machine learning, where it is commonly assumed that strategic agents seek to maximize their utility by modifying their features in certain restricted ways, often at some cost (Hardt et al. 2016; Kleinberg and Raghavan 2019; Haghtalab et al. 2020; Zhang, Cheng, and Conitzer 2021a). Specific topics include strategic aspects of linear regression (Perote and Perote-Peña 2004; Dekel, Fischer, and Procaccia 2010; Chen et al. 2018) and online learning (Roughgarden and Schrijvers 2017; Braverman et al. 2019; Feng, Parkes, and Xu 2019; Freeman et al. 2020). Our results differ from the above in that we consider a PAC model, where the key challenge is to generalize from observations to future data points to be classified. Also, we consider arbitrary feature spaces and structures of strategic manipulation, while existing results often focus on relatively restrictive setups, e.g., linear models.

Another closely related line of research is mechanism design with partial verification (Green and Laffont 1986; Yu 2011; Kephart and Conitzer 2015, 2016), which is often motivated by similar considerations as strategic machine learning. The general problem considered there is to design an (approximately) optimal mechanism assigning outcomes to data points based on their features, where the authenticity of the features can only be partially verified. Particularly relevant is a series of papers by Zhang, Cheng, and Conitzer (2019b,a, 2021b) with a pronounced learning aspect. The main difference between our results and theirs is that in our model, the underlying distribution of data is only accessible via the samples observed, and we do not restrict the structure of possible strategic manipulation.

In a companion paper with overlapping authors (Krishnaswamy et al. 2021), we study empirically an incentive-aware classification setting where not every data point has every feature, and a data point may strategically hide some of its features in order to get a better label.
Our experimental results in that paper, on ERM-style algorithms and real-world credit approval datasets, suggest that with carefully designed ERM-style algorithms that are robust to strategic misreporting, one can do almost as well as with pristine data. We could use results in the current paper to theoretically explain generalization performance in a setting like that of (Krishnaswamy et al. 2021).

## 2 Preliminaries

In this section, we review relevant definitions and results in PAC learning, and formulate the notion of strategic manipulation in classification.

### 2.1 Background on PAC Learning

Our problems of interest fit well with the probably approximately correct (PAC) learning model (Valiant 1984). In PAC learning, there is a feature space $X$, a label space $Y$, and a joint population distribution $D \in \Delta(X \times Y)$ over the feature space and the label space. For a classifier $f : X \to Y$, the population loss $\ell_D(f)$ is defined as the probability that $f$ assigns a wrong label to a random point with respect to $D$, i.e.,
$$\ell_D(f) = \Pr_{(x, y) \sim D}[f(x) \neq y].$$
The goal of PAC learning, with target relative loss $\varepsilon > 0$ and failure probability $\delta > 0$, is to find a classifier $f \in H$ from a predetermined hypothesis class $H \subseteq Y^X$, by observing $m = m(\varepsilon, \delta)$ i.i.d. samples from $D$, such that with probability at least $1 - \delta$, the population loss of $f$, $\ell_D(f)$, is at most $\ell_D(H) + \varepsilon$, where $\ell_D(H)$ is the population loss of the best classifier in $H$, i.e., $\ell_D(H) = \min_{f' \in H} \ell_D(f')$. $\ell_D(H)$ is sometimes called the approximation error, which models how well the hypothesis class $H$ approximates the ground truth classifier.

Throughout this paper, we focus on binary labels, i.e., $Y = \{0, 1\}$. With binary labels, a classifier $f : X \to \{0, 1\}$ corresponds bijectively to the subset of $X$ to which $f$ assigns label 1, i.e., $\{x \in X \mid f(x) = 1\}$. It is sometimes more convenient to deal with the subset of $X$ associated with $f$, and we will treat a classifier $f$ and its associated subset interchangeably in the rest of the paper.

The two central questions of PAC learning discussed earlier are partially answered by the ERM principle, formally defined below. Given $m$ i.i.d. samples $S = \{(x_i, y_i)\}_i \sim D^m$,[1] the ERM principle finds a classifier $f \in H$ which minimizes the empirical loss $\ell_S(f)$ on $S$, defined as
$$\ell_S(f) = \frac{1}{|S|} \sum_{i \in [|S|]} |f(x_i) - y_i|.$$
That is, the ERM principle finds $f \in \operatorname{argmin}_{f' \in H} \ell_S(f')$. This gives a concrete, though not always computationally efficient, way of computing a classifier: it is known that with sufficiently many samples, the classifier found by the ERM principle achieves the desired loss and failure probability. Moreover, the number of samples (also known as the sample complexity; see below for a formal definition) required by the ERM principle is optimal up to a constant factor.[2]

[1] We treat $S$ as an unordered set, since the order does not carry information.

[2] This is true only in the agnostic setting where $\ell_D(H) > 0$, on which we focus our attention throughout the paper. Similar results can be established for the realizable setting where $\ell_D(H) = 0$.
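To make the ERM principle concrete, the following is a minimal sketch (not from the paper; all names and the toy data are illustrative) of vanilla ERM over a finite feature space, with classifiers represented as the sets of features they label 1.

```python
from itertools import chain, combinations

def empirical_loss(f, samples):
    """Fraction of labeled samples (x, y) misclassified by f, where f is the set of features labeled 1."""
    return sum((x in f) != y for x, y in samples) / len(samples)

def erm(hypothesis_class, samples):
    """Vanilla ERM: return any classifier in the class minimizing the empirical loss."""
    return min(hypothesis_class, key=lambda f: empirical_loss(f, samples))

# Toy instance: X = {0, 1, 2}, H = 2^X (the power set of X), illustrative samples.
X = [0, 1, 2]
H = [frozenset(s) for s in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
samples = [(0, 0), (1, 1), (0, 0), (1, 1)]
print(sorted(erm(H, samples)))  # e.g., [1] -- a classifier with zero empirical loss
```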
This sample complexity is asymptotically determined by the so-called VC dimension of the hypothesis class $H$, defined below.

**Definition 1 (VC Dimension).** A subset $S$ of the feature space $X$ is shattered by a hypothesis class $H$ if for any $T \subseteq S$, there is a classifier $f \in H$ such that $f \cap S = T$. The VC dimension of $H$, $d_{VC}(H)$, is the cardinality of the largest subset of $X$ that is shattered by $H$, i.e.,
$$d_{VC}(H) = \max\left\{|S| \;\middle|\; |\{f \cap S \mid f \in H\}| = 2^{|S|}\right\}.$$

The VC dimension captures the capacity of the hypothesis class $H$: the larger the capacity, the more samples it takes to find an approximately optimal classifier in $H$. More precisely, the sample complexity of ERM is given by the following theorem (see, e.g., (Shalev-Shwartz and Ben-David 2014)).

**Theorem 1 (Sample Complexity of ERM).** Fix a feature space $X$, a population distribution $D$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(H) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H$ minimizing the empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H} \ell_S(f')$, has population loss at most $\ell_D(f) \le \ell_D(H) + \varepsilon$. Moreover, finding any classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(H) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

The above theorem says that the ERM principle finds an approximately optimal (up to an additive loss $\varepsilon$) classifier within $H$, with an asymptotically optimal number of samples. In other words, the low empirical loss of the classifier found by the ERM principle generalizes with respect to the population distribution $D$.

### 2.2 Strategic Manipulation in Classification

In the classical PAC learning model, the classifier always observes the real feature of the point being classified. However, as argued above, this is often not the case in real-life scenarios, where the point being classified may strategically modify its feature in order to receive a more desirable outcome. In this paper, we model a data point's ability to modify its feature using a binary relation $\to$, which captures the reporting structure over the feature space. For $x_1, x_2 \in X$, we say $x_1 \to x_2$ if a point with feature $x_1$ can pretend to have (i.e., report) feature $x_2$. For any $x \in X$, we always have $x \to x$, corresponding to the fact that the point may choose not to modify its feature.[3]

[3] Note that this is not a technical requirement; all results in this paper still hold without this assumption.

Throughout the paper, we assume that all points being classified prefer label 1 (corresponding to, e.g., acceptance) to 0. As a result, fixing a misreporting structure $\to$ and a classifier $f : X \to \{0, 1\}$, a point with feature $x$ will pretend to have a feature $x^*$ such that $f(x^*) = 1$ whenever possible, i.e., it will report some $x^* \in \operatorname{argmax}_{x' : x \to x'} f(x')$. Note that it is possible that $x^* = x$. This can happen when $f(x) = 1$, or when $f(x') = 0$ for all $x'$ such that $x \to x'$, including $x$ itself.[4]

[4] A seemingly more general model is one in which there is a cost $c(x, x') \ge 0$ for a point with feature $x$ to report feature $x'$. We remark that with binary labels, it is without loss of generality to ignore this cost, since a way of reporting is feasible iff the gain of receiving label 1 is larger than the cost of reporting.

In the rest of the paper, we focus on the following variant of the PAC learning model. The learning algorithm has access to $m$ i.i.d. unmodified samples $\{(x_i, y_i)\}_i \sim D^m$, based on which a classifier $f \in H$ is computed.[5] The classifier $f$ is then deployed, and random points drawn from $D$ modify their features strategically in response to the classifier. The strategic population loss of $f$, taking strategic manipulation into consideration, can be computed as
$$\hat{\ell}_D(f) = \Pr_{(x, y) \sim D}\left[\max_{x' : x \to x'} f(x') \neq y\right].$$

[5] We consider scenarios where observed data points have no interest in manipulating the learning process, and thus will not modify their features; this is a standard assumption, and appears also in, e.g., (Cullina, Bhagoji, and Mittal 2018). It is certainly true that in many other scenarios, the samples observed may be strategically modified too. However, we do not consider this in the current paper.
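The strategic loss of a deployed classifier can be evaluated on any labeled dataset by letting each point best-respond to it. Below is a minimal sketch (ours, not the paper's), assuming the reporting structure is given as a dictionary `reports` mapping each feature to the set of features it can report (self-reports are implicit).

```python
def best_response_label(x, f, reports):
    """Label a point with true feature x ends up with under classifier f (a set of features labeled 1):
    1 if any feature it can report, including x itself, is labeled 1."""
    return int(any(r in f for r in reports.get(x, set()) | {x}))

def strategic_loss(f, data, reports):
    """Strategic loss of f on labeled pairs (x, y), where each point best-responds before classification."""
    return sum(best_response_label(x, f, reports) != y for x, y in data) / len(data)
```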
Again, the goal is to find a classifier $f \in H$ with strategic population loss at most $\hat{\ell}_D(H) + \varepsilon$, with probability at least $1 - \delta$, where $\hat{\ell}_D(H) = \min_{f' \in H} \hat{\ell}_D(f')$.

## 3 Incentive-Aware Empirical Risk Minimization

In this section, we investigate the ERM principle in the presence of strategic manipulation. We first give an example showing that the vanilla ERM principle, which ignores the strategic aspect of the problem, has poor performance in general. We then consider an incentive-aware variant of ERM, and analyze its generalization behavior.

### 3.1 How Vanilla ERM Fails

Consider the following example. The feature space is $X = \{x_1, x_2, x_3\}$, the hypothesis class is $H = 2^X$, and the population distribution $D$ assigns probability 0.5 to the feature-label pair $(x_1, 0)$, and probability 0.5 to $(x_2, 1)$. The reporting structure allows (1) $x \to x$ for any $x \in X$, and (2) $x_1 \to x_2$ and $x_2 \to x_3$. Note that since the feature space is extremely simple, even the power set $H = 2^X$ (which has VC dimension $d_{VC}(H) = 3$) exhibits decent generalization behavior without strategic manipulation.

Recall that the vanilla ERM principle finds an arbitrary classifier which minimizes the empirical loss based on the unmodified sample set. Suppose the number of samples is $m = \omega(1)$. Then with high probability, the sample set consists of copies of $(x_1, 0)$ and $(x_2, 1)$ and nothing else. Any classifier $f$ satisfying $f(x_1) = 0$ and $f(x_2) = 1$ achieves 0 empirical loss. However, with strategic manipulation, such an $f$ ends up assigning label 1 to all (new) points whose (true) feature is $x_1$, since those points can report $x_2$ to fool the classifier. As a result, the strategic population loss is
$$\hat{\ell}_D(f) = \Pr_{(x, y) \sim D}\left[\max_{x' : x \to x'} f(x') \neq y\right] \ge 0.5 \cdot \left|\max_{x' : x_1 \to x'} f(x') - 0\right| = 0.5 \cdot |f(x_2) - 0| = 0.5.$$
In other words, with high probability, any classifier found using the vanilla ERM principle is no better than random guessing. But in fact, the classifier $f^*$ with $f^*(x_1) = f^*(x_2) = 0$ and $f^*(x_3) = 1$ would have 0 loss with strategic behavior, because $\max_{x' : x_1 \to x'} f^*(x') = 0$ and $\max_{x' : x_2 \to x'} f^*(x') = f^*(x_3) = 1$. Hence it is possible to do much better than vanilla ERM.

The above example shows that with strategic manipulation, the vanilla ERM principle fails spectacularly even in extremely simple cases that would be trivial without strategic manipulation. This observation aligns with the intuition that any learning procedure achieving nontrivial performance must exploit the reporting structure.

### 3.2 The Incentive-Aware ERM Principle

Below we present an incentive-aware version of the ERM principle (henceforth IA ERM), and show that (1) IA ERM finds a classifier with the desired properties, and (2) the sample complexity of IA ERM is asymptotically optimal. These properties of IA ERM closely resemble those of its classical counterpart without strategic manipulation, and suggest that IA ERM is the right version of the ERM principle in the strategic setting that we consider.

**Incentive-aware ERM.** The idea is to minimize the strategic empirical loss $\hat{\ell}_S(f)$ on the sample set $S = \{(x_i, y_i)\}_i$, computed by replacing the true feature of every sample point with the most beneficial feature that the point can report, i.e.,
$$\hat{\ell}_S(f) = \frac{1}{|S|} \sum_{i \in [|S|]} \left|\max_{x' : x_i \to x'} f(x') - y_i\right|.$$
The strategic empirical loss is a natural quantity to consider, since the expectation of $\hat{\ell}_S(f)$ over $S$ is precisely the strategic population loss $\hat{\ell}_D(f)$, which we aim to minimize. Given the notion of strategic empirical loss, IA ERM simply finds any classifier $f$ in the hypothesis class $H$ which minimizes $\hat{\ell}_S(f)$, i.e., $f \in \operatorname{argmin}_{f' \in H} \hat{\ell}_S(f')$.
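As an illustration (a sketch of ours, not the paper's implementation), IA ERM over a finite hypothesis class can be run by brute force on the example of Section 3.1; it recovers the classifier $f^*$ that labels only $x_3$ as 1.

```python
from itertools import chain, combinations

def strategic_empirical_loss(f, samples, reports):
    """Strategic empirical loss: each sample point reports its most beneficial reachable feature."""
    effective_label = lambda x: int(any(r in f for r in reports.get(x, set()) | {x}))
    return sum(effective_label(x) != y for x, y in samples) / len(samples)

def ia_erm(hypothesis_class, samples, reports):
    """Incentive-aware ERM: pick any classifier minimizing the strategic empirical loss."""
    return min(hypothesis_class, key=lambda f: strategic_empirical_loss(f, samples, reports))

# The example of Section 3.1: X = {x1, x2, x3}, H = 2^X, with x1 -> x2 and x2 -> x3.
X = ["x1", "x2", "x3"]
H = [frozenset(s) for s in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
reports = {"x1": {"x2"}, "x2": {"x3"}}  # self-reports are implicit
samples = [("x1", 0), ("x2", 1)] * 10   # with many samples, only these two pairs appear

print(sorted(ia_erm(H, samples, reports)))  # ['x3'] -- the classifier f* from the example above
```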
We now analyze the generalization behavior of IA ERM. Recall that in the classical PAC setting without strategic manipulation, the sample complexity of ERM depends on the VC dimension of the hypothesis class $H$. Here, with strategic manipulation, the VC dimension of $H$ is no longer the right capacity measure to consider. Instead of $H$, we consider the effective hypothesis class $\hat{H}$, defined below.

**Definition 2 (Effective Classifier / Hypothesis Class).** Fixing a feature space $X$ and a reporting structure $\to$, for each classifier $f \in 2^X$, the effective classifier $\hat{f}$ consists of the set of features to which $f$ effectively assigns label 1, i.e.,
$$\hat{f} = \{x \in X \mid \exists x' : x \to x' \text{ and } f(x') = 1\}.$$
Fixing a hypothesis class $H$, the effective hypothesis class $\hat{H}$ consists of the effective classifier of every classifier in $H$, i.e., $\hat{H} = \{\hat{f} \mid f \in H\}$.

One may intuitively interpret the effective hypothesis class in the following way. Any classifier $f$ in the presence of strategic manipulation is equivalent to the effective classifier $\hat{f}$ without strategic manipulation. Therefore, by considering the effective hypothesis class $\hat{H}$, we can effectively ignore strategic manipulation, and reduce the generalization analysis to that in the classical PAC setting, i.e., Theorem 1. This gives us the following theorem.

**Theorem 2 (Sample Complexity of IA ERM).** Fix a feature space $X$, a population distribution $D$, a reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(\hat{H}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H$ minimizing the strategic empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H} \hat{\ell}_S(f')$, has strategic population loss at most $\hat{\ell}_D(f) \le \hat{\ell}_D(H) + \varepsilon$. Moreover, finding any classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(\hat{H}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

The proof of Theorem 2, as well as all other proofs, can be found in Appendix A. In words, the above theorem says that IA ERM finds a classifier whose strategic population loss is close to that of the best classifier in the hypothesis class $H$, and the sample complexity of finding such a classifier is determined by the VC dimension of the effective hypothesis class $\hat{H}$, rather than that of $H$ itself.

We make a few remarks regarding IA ERM. First, when the reporting structure is trivial, i.e., a point with feature $x$ can only report $x$ itself, the effective class is $\hat{H} = H$, and IA ERM reduces to vanilla ERM. In such cases, Theorem 2 reduces precisely to the classical Theorem 1. Second, in general, the VC dimension $d_{VC}(\hat{H})$ of the effective class $\hat{H}$ may be smaller or larger than that of $H$. Below we give two examples, showing that either of the two quantities can be arbitrarily larger than the other.

We first give an example where $d_{VC}(H) = \infty$ and $d_{VC}(\hat{H}) = 1$. Consider the example discussed in the introduction. There, the feature space is $X = \mathbb{R}_+ = [0, \infty)$, and the reporting structure satisfies, for any $x_1, x_2 \in \mathbb{R}_+$, $x_1 \to x_2 \iff x_2 \le x_1$ (a point may only underreport). Consider $H = 2^{\mathbb{R}_+}$. Clearly $d_{VC}(H) = \infty$, since any $S \subseteq \mathbb{R}_+$ is shattered by $H$. On the other hand, for any $f \in H$, $\hat{f}$ is essentially determined by a threshold $\theta_f = \inf\{x \mid f(x) = 1\}$:
$$\hat{f}(x) = \begin{cases} 0, & x < \theta_f \\ 1, & x > \theta_f \\ f(x), & x = \theta_f. \end{cases}$$
In other words, $\hat{H}$ is the class of threshold classifiers over $\mathbb{R}_+$. It is well known that $d_{VC}(\hat{H}) = 1$.
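For finite feature spaces, the effective classifier of Definition 2 can be computed directly. The sketch below is our own illustration, with the same assumed dictionary representation of the reporting structure as before; it shows the effective classifier of the classifier from Section 3.1 that labels only $x_2$ as 1.

```python
def effective_classifier(f, X, reports):
    """Effective classifier of f (Definition 2): x is effectively labeled 1 iff some feature
    it can report (including x itself) is labeled 1 by f."""
    return frozenset(x for x in X if any(r in f for r in reports.get(x, set()) | {x}))

# In the example of Section 3.1, the classifier {x2} effectively also accepts x1,
# since a point with feature x1 can report x2.
X = ["x1", "x2", "x3"]
reports = {"x1": {"x2"}, "x2": {"x3"}}
print(sorted(effective_classifier(frozenset({"x2"}), X, reports)))  # ['x1', 'x2']
```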
Now we show that $d_{VC}(\hat{H})$ can be arbitrarily larger than $d_{VC}(H)$. Let the feature space be $X = \mathbb{N}$, and let $H = \{\{i\} \mid i \in \mathbb{N}\}$, i.e., the set of all singletons. It is clear that $d_{VC}(H) = 1$. Now, for any $d \in \mathbb{N}$, we construct $\to$ such that $d_{VC}(\hat{H}) = d$.[6] Let $\to$ be such that (1) $i \to i$ for any $i \in \mathbb{N}$, and (2) for any $i \in \{d, \ldots, d + 2^d - 1\}$ and $j \in \{0, \ldots, d - 1\}$, $j \to i$ if the $(j+1)$-th digit in the binary representation of $i - d$ is 1. Now we argue that $\hat{H}$ shatters $\{0, \ldots, d - 1\}$. In fact, any subset $S \subseteq \{0, \ldots, d - 1\}$ can be viewed as the binary representation of an integer $i_S \in \{0, \ldots, 2^d - 1\}$, where the $(j+1)$-th digit of $i_S$ is 1 iff $j \in S$. Consider the classifier $f_S = \{i_S + d\} \in H$. Clearly $\widehat{f_S} \cap \{0, \ldots, d - 1\} = S$, since precisely the points in $S$ can report $i_S + d$, which is the only point assigned label 1 by $f_S$.

[6] Similar constructions could also give $d_{VC}(\hat{H}) = \infty$.

## 4 Incentive-Compatible Empirical Risk Minimization

The IA ERM principle, together with Theorem 2, provides a rather general solution for PAC learning in the presence of strategic manipulation. Still, one may wonder whether it is possible to achieve more desirable properties, e.g., enhanced robustness against strategic manipulation and better generalization guarantees, by further refining or regularizing IA ERM. In this section, we propose incentive compatibility as a means of regularization, which leads to the incentive-compatible ERM principle (henceforth IC ERM).

We first present the IC ERM principle and provide a general analysis of its sample complexity in Section 4.1. In particular, we show that IC ERM generalizes no worse than IA ERM when applied to the same hypothesis class. In Section 4.2, we consider a special case of IC ERM, free IC ERM, where the hypothesis class is implicitly the set of all possible classifiers over the feature space $X$, i.e., $H = 2^X$. We give an efficient algorithm for free IC ERM, and more importantly, we show that, somewhat surprisingly, it is still possible to derive nontrivial generalization bounds for free IC ERM. Based on the generalization analysis of free IC ERM, we provide a hypothesis-class-independent sample complexity bound for IC ERM, which further illustrates the power of incentive compatibility as a means of regularization. Finally, in Section 4.3, we discuss an important class of reporting structures, transitive reporting structures, on which IC ERM is equivalent to the more general IA ERM. Based on this equivalence, we give a hypothesis-class-independent sample complexity bound for IA ERM that applies whenever the reporting structure is transitive.

### 4.1 The Incentive-Compatible ERM Principle

First we introduce the notions of incentive-compatible classifiers and hypothesis classes. Incentive compatibility is a standard concept in mechanism design, and straightforwardly applying it in our context results in the following definition.

**Definition 3 (Incentive-Compatible Classifiers / Hypothesis Classes).** Fixing a feature space $X$ and a reporting structure $\to$, a classifier $f : X \to \{0, 1\}$ is incentive compatible if for any $x_1, x_2 \in X$,
$$x_1 \to x_2 \implies f(x_1) \ge f(x_2).$$
In other words, no point can receive a better outcome by pretending to have a different feature. A hypothesis class $H$ is incentive compatible if it contains only incentive-compatible classifiers.
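Definition 3 is straightforward to check mechanically. A minimal sketch (ours, not the paper's), again with the reporting structure represented as a dictionary and classifiers as sets of accepted features:

```python
def is_incentive_compatible(f, reports):
    """Definition 3: for every report x1 -> x2, require f(x1) >= f(x2), so misreporting never helps."""
    return all((x1 in f) >= (x2 in f) for x1, targets in reports.items() for x2 in targets)

reports = {"x1": {"x2"}, "x2": {"x3"}}  # the reporting structure from Section 3.1
print(is_incentive_compatible(frozenset({"x2"}), reports))        # False: x1 gains by reporting x2
print(is_incentive_compatible(frozenset({"x3"}), reports))        # False: x2 gains by reporting x3
print(is_incentive_compatible(frozenset({"x1", "x2"}), reports))  # True: labels never increase along ->
```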
The intuition behind these definitions is simple: when an incentive-compatible classifier is deployed, no point is motivated to strategically modify its feature, since it is impossible to obtain a better outcome by doing so. As a result, the effective classifier induced by any incentive-compatible classifier $f$ is $f$ itself, i.e., $\hat{f} = f$, and the strategic (empirical) loss of an incentive-compatible classifier $f$ is the same as the (empirical) loss of the same classifier without strategic manipulation, i.e., $\hat{\ell}_D(f) = \ell_D(f)$ and $\hat{\ell}_S(f) = \ell_S(f)$. Incentive-compatible classifiers therefore provide arguably the strongest robustness one may hope for against strategic manipulation: they eliminate strategic manipulation altogether. The IC ERM principle presented below always finds a classifier that is incentive compatible.

**Incentive-compatible ERM.** For any hypothesis class $H$, let the incentive-compatible subclass $H^{IC}(\to)$ be the largest subset of $H$ that is incentive compatible under $\to$, i.e.,
$$H^{IC}(\to) = \{f \in H \mid f \text{ is incentive compatible under } \to\}.$$
We omit the parameter $\to$ when it is clear from the context. Given a hypothesis class $H$, the IC ERM principle finds any classifier $f$ in the incentive-compatible subclass $H^{IC}$ minimizing the empirical loss $\ell_S(f)$ on the sample set $S$, i.e., $f \in \operatorname{argmin}_{f' \in H^{IC}} \ell_S(f') = \operatorname{argmin}_{f' \in H^{IC}} \hat{\ell}_S(f')$.

We now analyze the sample complexity of IC ERM. Observe that IC ERM with hypothesis class $H$ is equivalent to vanilla ERM with hypothesis class $H^{IC}$. Therefore, applying Theorem 1 to $H^{IC}$, we immediately obtain the following asymptotically optimal sample complexity bound for IC ERM.

**Theorem 3 (Sample Complexity of IC ERM).** Fix a feature space $X$, a population distribution $D$, a reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(H^{IC}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H^{IC}$ minimizing the empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H^{IC}} \ell_S(f')$, has strategic population loss at most $\hat{\ell}_D(f) \le \hat{\ell}_D(H^{IC}) + \varepsilon$. Moreover, finding any incentive-compatible classifier within $H$ achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(H^{IC}) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

Theorem 3 is but a direct application of the classical Theorem 1. To obtain further insights into the sample complexity of IC ERM, we need to take a closer look at the VC dimension of the incentive-compatible subclass $H^{IC}$, which dictates the sample complexity. As discussed above, IC ERM can be viewed as a regularized version of IA ERM. Below we formalize this intuition, by showing that the sample complexity of IC ERM is in fact no larger than that of IA ERM, or that of vanilla ERM, when applied to the same hypothesis class $H$. This is done in the following claim via upper bounding the VC dimension of $H^{IC}$ by that of $\hat{H}$, as well as that of $H$.

**Proposition 1.** Fixing a feature space $X$ and a reporting structure $\to$, for any hypothesis class $H$ over $X$, $H^{IC} \subseteq H$ and $H^{IC} \subseteq \hat{H}$. As a result, $d_{VC}(H^{IC}) \le \min\{d_{VC}(H), d_{VC}(\hat{H})\}$.

In many natural scenarios, the VC dimension of $H^{IC}$ is significantly smaller than $d_{VC}(\hat{H})$. Below we give an example where $d_{VC}(H) = \infty$ and $d_{VC}(H^{IC}) = 1$. Consider again the introductory example, where $X = \mathbb{R}_+$. As argued above, when $H = 2^{\mathbb{R}_+}$, $\hat{H}$ is the class of all threshold classifiers, and $d_{VC}(\hat{H}) = 1$. On the other hand, by Proposition 1, $H^{IC} \subseteq \hat{H}$. In fact, one may show that every classifier in $\hat{H}$ is incentive compatible, so $H^{IC} = \hat{H}$. As a result, $d_{VC}(H^{IC}) = d_{VC}(\hat{H}) = 1$.

So, in addition to the highly desirable property of incentive compatibility, IC ERM also has at most the same, and often much better, sample complexity than IA ERM, which translates to better generalization for a fixed number of samples.
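IC ERM itself is then just vanilla ERM over the incentive-compatible subclass. The brute-force sketch below (illustrative only, under the same assumed representations as before) runs it on the example of Section 3.1; note that no incentive-compatible classifier can do better than loss 0.5 there, which previews the approximation-error tradeoff discussed next.

```python
from itertools import chain, combinations

def is_ic(f, reports):
    """Incentive compatible: every report x1 -> x2 satisfies f(x1) >= f(x2)."""
    return all((x1 in f) >= (x2 in f) for x1, ts in reports.items() for x2 in ts)

def ic_erm(hypothesis_class, samples, reports):
    """IC ERM: restrict to the incentive-compatible subclass, then run vanilla ERM on it.
    (No strategic correction is needed, since IC classifiers are immune to manipulation.)"""
    ic_subclass = [f for f in hypothesis_class if is_ic(f, reports)]
    empirical_loss = lambda f: sum((x in f) != y for x, y in samples) / len(samples)
    return min(ic_subclass, key=empirical_loss)

X = ["x1", "x2", "x3"]
H = [frozenset(s) for s in chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]
reports = {"x1": {"x2"}, "x2": {"x3"}}
samples = [("x1", 0), ("x2", 1)] * 10
print(sorted(ic_erm(H, samples, reports)))  # e.g., [] -- every IC classifier misclassifies half of this sample
```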
We also remark that IC ERM finds an approximately optimal classifier in the incentive-compatible subclass $H^{IC}$, which may have a worse approximation error than $H$. In other words, to achieve incentive compatibility, one in general has to sacrifice some accuracy in the form of a larger approximation error. We will see more desirable properties of IC ERM in the rest of this section.

### 4.2 Free IC ERM and Generalization from Incentive Compatibility

We now consider free IC ERM, which is a special case of IC ERM where the hypothesis class consists of all possible classifiers over $X$, i.e., $H = H_0 = 2^X$. At first glance this may appear senseless: in the classical PAC setting without strategic manipulation, no (nontrivial) generalization is possible if nothing is known about the ground truth a priori, i.e., when the hypothesis class consists of all possible classifiers. However, as we show below, the reporting structure, together with the requirement of incentive compatibility, in fact induces nontrivial structure over the incentive-compatible subclass $H_0^{IC}$, which allows nontrivial sample complexity and generalization bounds. This structure is captured by the following complexity measure, which we term the intrinsic VC dimension.

**Definition 4 (Intrinsic VC Dimension).** Fix a reporting structure $\to$ over a feature space $X$. For any $x, x' \in X$, we say $x$ can reach $x'$ (denoted $x \rightsquigarrow x'$) if there exists a sequence of features $x_1, \ldots, x_k$ for some integer $k > 0$ such that $x_1 = x$, $x_k = x'$, and for any $i \in [k - 1]$, $x_i \to x_{i+1}$. A set of features $S \subseteq X$ is independent if for any $x, x' \in S$ where $x \ne x'$, $x$ cannot reach $x'$. The intrinsic VC dimension of $\to$ over $X$, $d_{VC}(X, \to)$, is the cardinality of a maximum independent subset of $X$, i.e.,
$$d_{VC}(X, \to) = \max\left\{|S| \;\middle|\; S \subseteq X : \nexists\, x, x' \in S \text{ s.t. } x \ne x' \text{ and } x \rightsquigarrow x'\right\}.$$

It turns out that the intrinsic VC dimension of the reporting structure is precisely the VC dimension of the incentive-compatible subclass $H_0^{IC}$ of the null hypothesis class $H_0$, as formalized in the following proposition.

**Proposition 2 (VC Dimension of the Null Hypothesis Class).** For any feature space $X$ and reporting structure $\to$ over $X$, $d_{VC}(X, \to) = d_{VC}(H_0^{IC}(\to))$, where $H_0 = 2^X$ is the null hypothesis class.

Note that for any hypothesis class $H \subseteq 2^X = H_0$ over $X$, the incentive-compatible subclass of $H$ is a subclass of that of the null hypothesis class $H_0$, i.e., $H^{IC} \subseteq H_0^{IC}$. As a result, we always have $d_{VC}(H^{IC}) \le d_{VC}(H_0^{IC}) = d_{VC}(X, \to)$. This, together with Theorem 3, immediately implies the following hypothesis-class-independent sample complexity bound for IC ERM, which, in particular, applies to free IC ERM.

**Theorem 4 (Hypothesis-Class-Independent Sample Complexity Bound for IC ERM).** Fix a feature space $X$, a population distribution $D$, a reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H^{IC}$ minimizing the empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H^{IC}} \ell_S(f')$, has strategic population loss $\hat{\ell}_D(f) \le \hat{\ell}_D(H^{IC}) + \varepsilon$. Moreover, when the hypothesis class $H$ is the null hypothesis class $H_0$, finding any incentive-compatible classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

We also present an efficient algorithm for free IC ERM in Appendix B.
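For finite feature spaces, the intrinsic VC dimension of Definition 4 can be computed (inefficiently) by brute force: take the reachability closure of the reporting relation, then search for a largest independent set. The sketch below is our own illustration, exponential in $|X|$ and only meant to make the definition concrete.

```python
from itertools import combinations

def reachability(reports, X):
    """reach[x] = set of features x can reach via a chain of reports (Definition 4), including x itself."""
    reach = {x: set(reports.get(x, set())) | {x} for x in X}
    changed = True
    while changed:
        changed = False
        for x in X:
            bigger = set().union(*(reach[t] for t in reach[x]))
            if not bigger <= reach[x]:
                reach[x] |= bigger
                changed = True
    return reach

def intrinsic_vc_dimension(X, reports):
    """Largest S such that no two distinct features in S can reach one another."""
    reach = reachability(reports, X)
    independent = lambda S: all(b not in reach[a] for a in S for b in S if a != b)
    return max(len(S) for r in range(len(X) + 1) for S in combinations(X, r) if independent(S))

# The chain x1 -> x2 -> x3 from Section 3.1 has intrinsic VC dimension 1:
# any two distinct features are comparable under reachability.
print(intrinsic_vc_dimension(["x1", "x2", "x3"], {"x1": {"x2"}, "x2": {"x3"}}))  # 1
```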
### 4.3 Transitive Reporting and the Revelation Principle

Finally, we investigate an important family of reporting structures: transitive reporting structures. It turns out that with transitivity, the so-called revelation principle from mechanism design holds: if one accounts for strategic reporting, then without loss of generality one may focus on incentive-compatible classifiers. That is, IC ERM is as general as IA ERM. As a result, with transitivity, the hypothesis-class-independent sample complexity bound extends to IA ERM applied to any hypothesis class.

We first give the formal definition of transitive reporting structures, which is essentially the same as that of transitive binary relations.

**Definition 5 (Transitive Reporting Structures).** A reporting structure $\to$ over $X$ is transitive if for any $x_1, x_2, x_3 \in X$,
$$(x_1 \to x_2 \text{ and } x_2 \to x_3) \implies x_1 \to x_3.$$

One example of a transitive reporting structure is the example from the introduction, where $X = \mathbb{R}_+$ and $\to$ is precisely the relation $\ge$ (a point may report any balance no larger than its true one), which is clearly transitive. It turns out that transitivity of reporting structures is equivalent to the revelation principle holding, which roughly says that any classifier is equivalent to a (possibly different) incentive-compatible classifier. Formally:

**Proposition 3 (Revelation Principle for PAC Learning).** The following is true if and only if the reporting structure $\to$ is transitive: for any classifier $f : X \to \{0, 1\}$, the effective classifier $\hat{f}$ is incentive compatible, and as a result, for any hypothesis class $H$, the effective class $\hat{H}$ is incentive compatible (i.e., $(\hat{H})^{IC} = \hat{H}$).

In light of Proposition 3, when the reporting structure is transitive, for any $H$, IA ERM with hypothesis class $H$ is equivalent to IC ERM with hypothesis class $\hat{H}$, in the sense that they yield the same effective classifier given any sample set, with the same sample complexity. This gives us a way to translate the hypothesis-class-independent sample complexity bound for IC ERM to IA ERM, and obtain the following theorem.

**Theorem 5 (Hypothesis-Class-Independent Sample Complexity Bound for IA ERM).** Fix a feature space $X$, a population distribution $D$, a transitive reporting structure $\to$, and a hypothesis class $H$. For any $\varepsilon > 0$, $\delta > 0$, given $m = O\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples $S \sim D^m$, with probability at least $1 - \delta$, any classifier $f \in H$ minimizing the strategic empirical loss on $S$, i.e., $f \in \operatorname{argmin}_{f' \in H} \hat{\ell}_S(f')$, has strategic population loss $\hat{\ell}_D(f) \le \hat{\ell}_D(H) + \varepsilon$. Moreover, when the hypothesis class $H$ is the null hypothesis class $H_0$, finding any classifier achieving the same relative loss $\varepsilon$ and failure probability $\delta$ requires $\Omega\left((d_{VC}(X, \to) + \log(1/\delta)) / \varepsilon^2\right)$ i.i.d. samples.

To develop intuition, consider again the introductory example. As discussed above, there, $X = \mathbb{R}_+$ and the reporting structure $\to$, which coincides with $\ge$, is transitive. Moreover, the intrinsic VC dimension of $\to$ is $d_{VC}(X, \to) = 1$. This is because any singleton set is independent, and for any $x_1 \ne x_2$, either $x_1 < x_2$ or $x_2 < x_1$, so $\{x_1, x_2\}$ cannot be independent. On the other hand, as argued above, any classifier $f \in 2^X$ effectively implements a threshold $\theta_f$, where the effective classifier assigns label 1 iff $x \ge \theta_f$ or iff $x > \theta_f$ (depending on whether the threshold itself is labeled 1). It is well known (see, e.g., (Shalev-Shwartz and Ben-David 2014)) that $\Theta(\log(1/\delta) / \varepsilon^2)$ samples suffice to learn such a threshold with relative loss $\varepsilon$ and failure probability $\delta$, which coincides with the above sample complexity bound based on the intrinsic VC dimension.
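Proposition 3 can be sanity-checked numerically on a transitive reporting structure such as a discretized version of the one from the introduction (a point may only underreport). The sketch below is our own illustration: for randomly drawn classifiers, the induced effective classifier is always incentive compatible.

```python
import random

def effective(f, X, reports):
    """Effective classifier of f: x is labeled 1 iff some feature it can report lies in f."""
    return frozenset(x for x in X if any(r in f for r in reports[x] | {x}))

def is_ic(f, reports):
    """Incentive compatible: every report a -> b satisfies f(a) >= f(b)."""
    return all((a in f) >= (b in f) for a, ts in reports.items() for b in ts)

# Transitive reporting structure: x -> y iff y <= x (underreporting), as in the introduction.
X = list(range(10))
reports = {x: {y for y in X if y <= x} for x in X}

random.seed(0)
for _ in range(200):
    f = frozenset(x for x in X if random.random() < 0.5)
    assert is_ic(effective(f, X, reports), reports)  # the revelation principle: the effective classifier is IC
print("every sampled effective classifier was incentive compatible")
```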
## Acknowledgements

The authors are thankful for support from NSF under award IIS-1814056.

## Ethics Statement

Potential positive impact: Our results can be applied, for example, to prevent fraud. Naïve learning processes that do not anticipate strategic behavior will not perform at all as expected in practice, creating substantial risk. Our results could make a socially beneficial classification process more accurate. By encouraging truthful reporting, our results could allow for more efficient allocation of resources to those in actual need. Classifiers that are not incentive compatible make strategic behavior advantageous to individuals. Vulnerable communities often are not aware of this or do not know how to play the game strategically, exacerbating inequality. Incentive-compatible classifiers remove this concern. (Similar arguments have been made for the use of incentive-compatible mechanisms in allocating students to public schools, where wealthier parents often know how to play the game better.)

Potential negative impact: Our results could make a socially harmful classification process more accurate. Our results encourage entities being classified to report their true features, which, if handled carelessly, could result in privacy issues.

## References

Alon, N.; Ben-David, S.; Cesa-Bianchi, N.; and Haussler, D. 1997. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM) 44(4): 615–631.

Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3(Nov): 463–482.

Braverman, M.; Mao, J.; Schneider, J.; and Weinberg, S. M. 2019. Multi-armed Bandit Problems with Strategic Arms. In Conference on Learning Theory, 383–416.

Chen, Y.; Podimata, C.; Procaccia, A. D.; and Shah, N. 2018. Strategyproof linear regression in high dimensions. In Proceedings of the 2018 ACM Conference on Economics and Computation, 9–26.

Cullina, D.; Bhagoji, A. N.; and Mittal, P. 2018. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, 230–241.

Daniely, A.; Sabato, S.; Ben-David, S.; and Shalev-Shwartz, S. 2015. Multiclass learnability and the ERM principle. The Journal of Machine Learning Research 16(1): 2377–2404.

Dekel, O.; Fischer, F.; and Procaccia, A. D. 2010. Incentive compatible regression learning. Journal of Computer and System Sciences 76(8): 759–777.

Devroye, L.; Györfi, L.; and Lugosi, G. 2013. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media.

Feng, Z.; Parkes, D. C.; and Xu, H. 2019. The intrinsic robustness of stochastic bandits to strategic manipulation. arXiv preprint arXiv:1906.01528.

Freeman, R.; Pennock, D. M.; Podimata, C.; and Vaughan, J. W. 2020. No-Regret and Incentive-Compatible Prediction with Expert Advice. arXiv preprint arXiv:2002.08837.

Green, J. R.; and Laffont, J.-J. 1986. Partially verifiable information and mechanism design. The Review of Economic Studies 53(3): 447–456.

Haghtalab, N.; Immorlica, N.; Lucier, B.; and Wang, J. 2020. Maximizing Welfare with Incentive-Aware Evaluation Mechanisms. In 29th International Joint Conference on Artificial Intelligence.

Hanneke, S. 2016. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research 17(1): 1319–1333.

Hardt, M.; Megiddo, N.; Papadimitriou, C.; and Wootters, M. 2016. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 111–122.

Kephart, A.; and Conitzer, V. 2015. Complexity of mechanism design with signaling costs. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 357–365.

Kephart, A.; and Conitzer, V. 2016. The revelation principle for mechanism design with reporting costs. In Proceedings of the 2016 ACM Conference on Economics and Computation, 85–102.
Kleinberg, J.; and Raghavan, M. 2019. How Do Classifiers Induce Agents to Invest Effort Strategically? In Proceedings of the 2019 ACM Conference on Economics and Computation, 825–844.

Krishnaswamy, A.; Li, H.; Rein, D.; Zhang, H.; and Conitzer, V. 2021. Classification with Strategically Withheld Data. In Proceedings of the AAAI Conference on Artificial Intelligence.

Perote, J.; and Perote-Peña, J. 2004. Strategy-proof estimators for simple regression. Mathematical Social Sciences 47(2): 153–176.

Pollard, D. 2012. Convergence of stochastic processes. Springer Science & Business Media.

Roughgarden, T.; and Schrijvers, O. 2017. Online prediction with selfish experts. In Advances in Neural Information Processing Systems, 1300–1310.

Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding machine learning: From theory to algorithms. Cambridge University Press.

Talagrand, M. 1994. Sharper bounds for Gaussian and empirical processes. The Annals of Probability: 28–76.

Valiant, L. G. 1984. A theory of the learnable. Communications of the ACM 27(11): 1134–1142.

Vapnik, V.; and Chervonenkis, A. Y. 1971. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications 16(2): 264–280.

Yu, L. 2011. Mechanism design with partial verification and revelation principle. Autonomous Agents and Multi-Agent Systems 22(1): 217–223.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2019a. Distinguishing Distributions When Samples Are Strategically Transformed. In Advances in Neural Information Processing Systems, 3187–3195.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2019b. When samples are strategically selected. In International Conference on Machine Learning, 7345–7353.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2021a. Automated Mechanism Design for Classification with Partial Verification. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zhang, H.; Cheng, Y.; and Conitzer, V. 2021b. Classification with Few Tests through Self-Selection. In Proceedings of the AAAI Conference on Artificial Intelligence.