# Information Discrepancy in Strategic Learning

Yahav Bechavod¹, Chara Podimata², Zhiwei Steven Wu³, Juba Ziani⁴

**Abstract.** We initiate the study of the effects of non-transparency in decision rules on individuals' ability to improve in strategic learning settings. Inspired by real-life settings, such as loan approvals and college admissions, we remove the assumption typically made in the strategic learning literature that the decision rule is fully known to individuals, and focus instead on settings where it is inaccessible. Lacking direct knowledge of the rule, individuals try to infer it by learning from their peers (e.g., friends and acquaintances who previously applied for a loan), naturally forming groups in the population, each with a possibly different type and level of information regarding the decision rule. We show that, in equilibrium, the principal's decision rule optimizing welfare across sub-populations may cause a strong negative externality: the true quality of some of the groups can actually deteriorate. On the positive side, we show that, in many natural cases, optimal improvement can be guaranteed simultaneously for all sub-populations. We further introduce a measure we term the *information overlap proxy*, and demonstrate its usefulness in characterizing the disparity in improvements across sub-populations. Finally, we identify a natural condition under which improvement can be guaranteed for all sub-populations while maintaining high predictive accuracy. We complement our theoretical analysis with experiments on real-world datasets.

¹School of Computer Science and Engineering, The Hebrew University. ²School of Engineering and Applied Sciences, Harvard University. ³School of Computer Science, Carnegie Mellon University. ⁴School of Industrial and Systems Engineering, Georgia Institute of Technology. Correspondence to: Yahav Bechavod, Chara Podimata, Zhiwei Steven Wu, Juba Ziani.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## 1. Introduction

Machine learning algorithms are increasingly used to make consequential decisions across a variety of domains, including loan approvals, college admissions, probation qualifications, and hiring. Given the high stakes of these decisions, individuals are incentivized to invest effort in changing their attributes in order to obtain more favorable decisions. The evidence for such strategic adaptation from multiple domains (e.g., Björkegren et al. (2020); Dee et al. (2019); Dranove et al. (2003); Greenstone et al. (2020); Gonzalez-Lira & Mobarak (2019)) has inspired a growing literature on *strategic learning* that studies the interaction between learning algorithms and strategic individuals ("agents"). Models in the strategic learning literature, however, typically make a *full transparency* assumption; that is, the agents fully observe the deployed scoring or decision rule (Hardt et al., 2016; Dong et al., 2018; Chen et al., 2020a; Bechavod et al., 2021; Shavit et al., 2020; Hu et al., 2019; Kleinberg & Raghavan, 2020). In the context of credit scores, for example, this translates to the assumption that individuals know the deployed credit scoring rule in full detail. In reality, however, such a full-transparency assumption is often far-fetched: many credit scoring rules are proprietary, and banks or financial agencies rarely make their machine learning models fully transparent to outsiders.
Instead, they may only provide some labeled examples (e.g., past applicants who were granted loans) or explanations (e.g., ways to improve one's credit score). As the actual scoring rule in effect is not directly observable, agents naturally attempt to infer it using other sources of information, which may differ greatly across individuals. This is the case when the population is naturally clustered (due to, e.g., demographic, geographic, and cultural differences) and people have a tendency toward observational learning (Bandura, 2008; Apesteguia et al., 2007; Smith & Sørensen, 2000; Bikhchandani et al., 1998); that is, agents learn by observing others within their communities. For example, when applying for a loan at a specific bank, individuals may learn from the past experiences of their peers and friends (i.e., their applications and loan decisions) to gauge the decision rule. Hence, individuals from different peer-networks may form different ideas about the decision rule, which in turn can lead to disparities in strategic investments and outcomes. To make things worse, there is often a regulatory requirement that the same decision rule be used on all sub-populations (due to, e.g., the risk of redlining (Hunt, 2005; Rothstein, 2017)), prohibiting the use of group-specific decision rules by the decision-maker to mitigate the adverse effects of information discrepancy.

### 1.1. Our Work

Our work introduces the first framework to study the disparate effects of non-transparency in strategic learning on individuals' ability to improve. Below, we provide an overview of our contributions and a roadmap of the paper.

**Equilibrium Model.** We propose a model for the principal-agent interaction when individuals from different sub-populations learn from their peers (Sec. 2). We then show how individuals from different sub-populations use the information available among their peers to form estimates of the decision rule, and compute the closed-form solutions for their and the principal's responses in equilibrium (Sec. 3).

**Improvement Across Sub-Populations.** Using our proposed model, we first prove a strong negative externality result: even if the principal deploys a decision rule that is optimized for maximizing the improvement across sub-populations, and individuals best-respond to the principal's rule, some sub-populations may still suffer deterioration in their quality (true label). On the positive side, we show that improvement can be guaranteed simultaneously for all sub-populations under moderate conditions, e.g., when they have similar costs for effort exertion or when the extent to which they share the same information is minimal (Sec. 4.1). We then examine the extent to which information discrepancy regarding the deployed decision rule may result in disparity of improvement between sub-populations. We introduce the *information overlap proxy* measure and prove that it upper bounds this disparity. We conclude by characterizing the exact conditions for the disparity to vanish (Sec. 4.2). Subsequently, we study how efficiently each of the sub-populations exerts their efforts in improving their quality. For that, we introduce the *per-unit improvement* (which measures the efficiency of the sub-populations' effort exertion), and we identify moderate conditions under which individuals from all sub-populations exert their effort optimally (Sec. 4.3).
Finally, we consider a case where the principal interpolates between the objectives of outcome improvement and predictive accuracy. We identify a natural condition under which improvement in all sub-populations is guaranteed while maintaining high predictive accuracy (Sec. 4.4 and App. E). We further show that similar conclusions can be drawn in the general setting where the principal is a learner who does not originally know the properties of the individuals' sub-populations, but has to learn them instead (App. A).

**Empirical Evaluation.** Our experiments on real-world datasets (TAIWAN-CREDIT and ADULT) illustrate our theoretical results, further highlighting the pivotal role access to information plays in strategic settings (Sec. 5).

### 1.2. Related Work

Our work is primarily related to three strands of literature on strategic behavior in learning (of independent interest is Jagadeesan et al. (2021), who re-examine the standard assumptions around strategic learning). The first one advocates that changes in the agents' original features are considered "gaming", hence the learner wishes to construct algorithms that are robust to such behavior (Meir et al., 2010; 2011; 2012; Perote & Perote-Pena, 2004; Dekel et al., 2010; Chen et al., 2018; Roughgarden & Schrijvers, 2017; Freeman et al., 2020; Dalvi et al., 2004; Brückner et al., 2012; Hardt et al., 2016; Dong et al., 2018; Chen et al., 2020a; Ahmadi et al., 2021; Sundaram et al., 2020; Ioannidis & Loiseau, 2013; Horel et al., 2014; Cai et al., 2015). When the learner optimizes for robustness to strategic behavior, the deployed algorithm has disparate impact on different sub-populations (Hu et al., 2019; Milli et al., 2019). Braverman & Garg (2020) study the impact of randomized or noisy classifiers on mitigating inequalities, but focus on a single-dimensional case. In our work, we do not consider the disparate impact of "robustness", but rather of information disparities across groups. In concurrent and independent work, Ghalme et al. (2021) study non-transparency in strategic classification, characterize the "price of opacity", and show conditions under which fully transparent classifiers are the recommended policy. While sharing some of the assumptions, the goals of the two works are orthogonal: the goal of our paper is not to develop new learning algorithms or to decide what decision rule should or should not be used by the principal. Rather, our aim is to provide an understanding of the effects of non-transparency of the decision rule on the ability of individuals from different sub-populations to improve.

The second strand of literature advocates that machine learning algorithms should incentivize good strategic behavior (a.k.a. improvements) (Kleinberg & Raghavan, 2020; Ustun et al., 2019; Khajehnejad et al., 2019; Tsirtsis & Rodriguez, 2020; Liu et al., 2020; Haghtalab et al., 2020; Alon et al., 2020; Chen et al., 2020b; Gupta et al., 2019). Our work is most closely related to Liu et al. (2020), Haghtalab et al. (2020), and Gupta et al. (2019). Liu et al. (2020) study the long-term impact of strategic learning on different sub-populations, but focus on decision rules that are fully known to the agents. Haghtalab et al. (2020) study social welfare maximization when the learner does not have full knowledge of the feature space of the agents, contrary to our model where the information discrepancies appear on the agents' side.
Gupta et al. (2019) minimize the difference in recourse across sub-populations, whereas we focus on a principal optimizing the social welfare. The third strand concerns causality in strategic learning (Miller et al. (2020); Bechavod et al. (2021); Shavit et al. (2020), and more broadly Perdomo et al. (2020)), where the learner tries to learn the causal relationship between agents' features and their labels/scores by leveraging the agents' strategic behavior. Importantly, in our setting, even if the principal knows the causal relationship perfectly, the disparate impact from the algorithm may still be unavoidable.

Our work is also connected to a literature on social welfare and fairness (Heidari et al., 2018; 2019; Hu & Chen, 2020). Heidari et al. (2018) propose incorporating social welfare considerations into the standard loss minimization goal of machine learning; our focus differs due to the presence of information discrepancies. Hu & Chen (2020) study the social welfare implications that result from a fair classification algorithm and show that applying stricter fairness criteria, codified as parity constraints, can worsen welfare outcomes across sub-populations; our point of view is reversed and looks at the effect of welfare maximization on fairness. Heidari et al. (2019) also study how agents in different sub-populations invest their efforts through observational learning, by imitating a model behavior within their group. We consider a different type of observational learning, where agents try to infer the deployed rule instead. Further, while they focus on disparities in level of effort across groups, we focus on disparities in improvement. More broadly, the fact that peer-influenced behavior might induce disparities in the absence of perfect information has been studied in economics and sociology (e.g., Coate & Loury (1993); Okafor (2020); McPherson et al. (2001); Gündoğdu et al. (2019); DiMaggio & Garip (2011); Calvó-Armengol & Jackson (2004)). In this paper, however, we aim to formally understand this phenomenon in the context of strategic learning. We also go beyond characterizing such disparities, to consider objectives such as efficient effort exertion and improvement while maintaining high accuracy.

## 2. Model and Preliminaries

We study a Stackelberg game between a principal and a population of agents comprised of $m$ sub-populations ("groups") with different distributions over the feature space $\mathcal{X} \subseteq \mathbb{R}^d$. We focus on the case $m = 2$ for clarity, but our results extend to arbitrary $m$, as outlined in Appendix C. Let the groups be $G_1$ and $G_2$, with associated distributions of feature vectors $\mathcal{D}_1$ and $\mathcal{D}_2$ respectively over $\mathcal{X}$. Let $S_1, S_2$ be the subspaces defined by the supports of $\mathcal{D}_1, \mathcal{D}_2$ respectively, and let $\Pi_1, \Pi_2 \in \mathbb{R}^{d \times d}$ be the orthogonal projection matrices onto subspaces $S_1, S_2$ respectively. Let $w^\star \in \mathbb{R}^d$ denote the ground truth linear assessment rule (which is known¹ to the principal through past observations): i.e., for a feature vector $x$, the corresponding agent's expected true quality is given by $\mathbb{E}[y \mid x] = \langle w^\star, x \rangle$. Note that, while $w^\star$ is optimal for prediction accuracy, it may not be the rule maximizing the welfare across groups. This is because it is often worth incentivizing modifications of features that are easy for agents to improve, and of features that can be modified by, and benefit, several groups.
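To make these objects concrete, the following minimal NumPy sketch (ours, with toy dimensions; the same values reappear in Example 3.4) constructs a unit-norm ground truth rule $w^\star$, orthogonal projectors $\Pi_g$ onto the groups' informational subspaces, and group samples supported on those subspaces.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Ground-truth linear assessment rule w* (unit norm), known to the principal.
w_star = np.array([2/3, 2/3, 1/3])

def projector(basis_rows):
    """Orthogonal projection matrix onto the span of the given (row) vectors."""
    Q = np.linalg.qr(np.asarray(basis_rows, dtype=float).T)[0]  # orthonormalize
    return Q @ Q.T

# Group 1's information lives in span{e1, e3}; group 2's in span{e2, e3}.
Pi1 = projector([[1, 0, 0], [0, 0, 1]])
Pi2 = projector([[0, 1, 0], [0, 0, 1]])
assert np.allclose(Pi1 @ Pi1, Pi1) and np.allclose(Pi1, Pi1.T)  # Pi^2 = Pi = Pi^T

# Agents' features come from group-specific distributions supported on S_g;
# here, Gaussians pushed through Pi_g so that Pi_g x = x for every sample.
X1 = (Pi1 @ rng.normal(size=(d, 5))).T  # five agents from group 1
assert np.allclose(Pi1 @ X1.T, X1.T)
```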
The principal deploys a linear scoring rule $w \in \mathbb{R}^d$. Agent $i$ from group $g$ draws a private feature vector $x_{g,i} \sim \mathcal{D}_g$. Initially, agents from both groups have no information regarding $w$, so they simply report $x_{g,i}$ to the principal and receive scores $\hat{y}_{g,i} = \langle w, x_{g,i} \rangle$. After enough agents from both groups have received scores for their reported features, the remaining agents use this past information (i.e., feature-predicted score tuples) to appropriately alter their feature vectors from $x \sim \mathcal{D}_g$ to $\hat{x}(x; g)$. Knowing that the ground truth assessment rule and the scoring rule that the principal deploys are linear, and given the fact that they are risk-averse, agents perform empirical risk minimization (ERM) on the peer-dataset comprised of the first $N_g$ unmodified samples $S_g = \{(x_{g,i}, \hat{y}_{g,i})\}_{i \in [N_g]}$ to compute an estimate $w_{est}(g)$ of the deployed scoring rule $w$. Running ERM is a natural choice given that the agents are risk-averse, fully rational, and have no other information.

**Remark 2.1.** Note that in practice, information may also be shared between groups. For example, agents in group $G_1$ may all see a few additional samples from group $G_2$. In this case, note that we can extend the definition of subspace $S_1$ to include the support of the observed samples from group $G_2$, and extend the definition of group $G_1$ to being supported on this new subspace. This reduces to situations in the paper where the subspaces (and hence the information) seen by groups $G_1$ and $G_2$ overlap, which constitute a major part of the study of this paper.

Given original features $x$ and estimated rule $w_{est}(g)$, each (myopically rational) agent chooses $\hat{x}(x; g)$ as the $x'$ that optimizes their underlying utility function (which is a generalization of the standard utility function used in the literature on strategic classification), defined as

$$u(x, x'; g) = \text{Score}(x'; g) - \text{Cost}(x, x'; g) \quad (1)$$

where $\text{Score}(x'; g) = \langle w_{est}(g), x' \rangle$ is the estimated² value the agent derives from reporting feature vector $x'$, and $\text{Cost}(x, x'; g) = \frac{1}{2}(x' - x)^\top A_g (x' - x)$ is the agent's cost for modifying vector $x$ into $x'$. We call $A_g \in \mathbb{R}^{d \times d}$ the cost matrix for group $g$, and assume it is positive definite (PD).³ Since we do not restrict $A_g$ further, this family of cost functions is rather large and encapsulates cost functions used in the literature on strategic classification (e.g., Dong et al. (2018); Ahmadi et al. (2021)). This functional form is a simple way to model important practical situations in which features can be modified in a correlated manner, and investing in one feature may lead to changes in other features. At a high level, the utility in Eq. (1) captures the net gains that an agent obtains from spending effort to report $x'$ rather than $x$. Since $\hat{x}(x; g)$ is the best response coming from Eq. (1), we have $\hat{x}(x; g) = \arg\max_{x' \in \mathcal{X}} u(x, x'; g)$. As we show in Section 3, the best response $\hat{x}(x; g)$ takes the form $x + \Delta_g(w)$ for a movement function⁴ $\Delta_g(w)$ to be specified shortly. Putting everything together, the protocol in Algorithm 1 summarizes the principal-agent interaction.

**Algorithm 1: Principal-Agent Interaction Protocol**
1. Nature selects the ground truth scoring function $w^\star$.
2. Learner deploys scoring rule $w \in \mathbb{R}^d$ (solution to Eq. (2)), but does not directly reveal it to the agents.
3. Agents from groups $g \in \{1, 2\}$ draw their (private) feature vectors $x \sim \mathcal{D}_g$.
4. Given the peer-dataset $S_g$, (private) feature vector $x$, and utility function $u(x, x'; g)$, agents best-respond with feature vector $\hat{x}(x; g) = \arg\max_{x' \in \mathcal{X}} u(x, x'; g)$.

¹We relax this assumption in App. A. ²The actual value that the agent derives by reporting $x'$ is the outcome $\langle w^\star, x' \rangle$. But $w^\star$ is never revealed to the agent; the only information that she has is the estimate of the principal's $w$. ³In turn, one can write $\text{Cost}(x, x'; g) = \frac{1}{2}\|\sqrt{A_g}(x' - x)\|_2^2$, noting that $\sqrt{A_g}$ is well-defined. ⁴Slightly abusing notation, $w$ is an argument of $\Delta_g(\cdot)$, but this is only used in the analysis; the agents do not directly see $w$.
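As a sanity check on Eq. (1), the sketch below (ours, with made-up values for $A_g$ and for the group's estimate $w_{est}$) evaluates the agent's utility and its maximizer via the first-order condition of the concave quadratic.

```python
import numpy as np

A_g = np.array([[2.0, 0.5],
                [0.5, 1.0]])      # positive-definite cost matrix of group g
w_est = np.array([0.6, 0.8])      # the group's estimate of the deployed rule

def utility(x, x_new):
    """u(x, x'; g) = <w_est, x'> - (1/2)(x'-x)^T A_g (x'-x), as in Eq. (1)."""
    delta = x_new - x
    return w_est @ x_new - 0.5 * delta @ A_g @ delta

def best_response(x):
    # u is a concave quadratic in x', so its maximizer solves the first-order
    # condition  w_est - A_g (x' - x) = 0,  i.e.  x' = x + A_g^{-1} w_est.
    return x + np.linalg.solve(A_g, w_est)

rng = np.random.default_rng(1)
x = np.array([1.0, -1.0])
x_hat = best_response(x)
# No small perturbation of x_hat should increase the utility.
for _ in range(100):
    z = x_hat + 1e-3 * rng.normal(size=2)
    assert utility(x, z) <= utility(x, x_hat) + 1e-12
```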
When it comes to the principal's behavior, we posit that the principal's objective is to maximize the agents' average social welfare across groups ("social welfare" for short), defined as the sum over groups of the average (over agents) and expected (over the randomness of the labels) improvement of their true (as measured by $w^\star$) labels after best-responding. In other words, the principal deploys the equilibrium scoring rule $w^{SW}$:

$$w^{SW} = \arg\max_{w : \|w\|_2 \le 1} SW(w) = \arg\max_{w : \|w\|_2 \le 1} \sum_{g \in \{1,2\}} \mathbb{E}_{x \sim \mathcal{D}_g} \left[ \langle \hat{x}(x; g), w^\star \rangle \right] \quad (2)$$

In Sec. 4, we additionally consider a principal who wishes to trade off predictive accuracy and social welfare.

We aim to study the improvement among groups, in the presence of information discrepancy, at a Stackelberg equilibrium of our game. In other words, the principal and agents best-respond to each other, with the principal acting first and committing to a rule in anticipation of the strategic best responses of the agents. We quantify improvement using two notions: total improvement and per-unit improvement.

**Definition 2.2.** For a rule $\widetilde{w} \in \mathbb{R}^d$, we define the *total improvement* ("improvement") for group $g$ as:

$$I_g(\widetilde{w}) = \langle \hat{x}(x; g), w^\star \rangle - \langle x, w^\star \rangle = \langle x + \Delta_g(\widetilde{w}), w^\star \rangle - \langle x, w^\star \rangle = \langle \Delta_g(\widetilde{w}), w^\star \rangle.$$

For the same rule, the *per-unit improvement* for group $g$ is:

$$uI_g(\widetilde{w}) = I_g\left( \frac{\Pi_g \widetilde{w}}{\|\Pi_g \widetilde{w}\|_2} \right).$$

The usefulness of defining the total improvement as one of our measures is clear. The per-unit improvement only considers the part of the deployed scoring rule that lies in the relevant subspace of each group, and measures how efficient the direction of this rule, projected onto the relevant subspace, is at inducing improvement for the group.

We focus on three objectives for the two groups: do-no-harm, equality, and optimality.

**Definition 2.3 (Do-No-Harm).** A rule $\widetilde{w}$ causes no harm for group $g$ if $I_g(\widetilde{w}) \ge 0$.

**Definition 2.4 (Equality).** A rule $\widetilde{w}$ enforces group-equality if $I_1(\widetilde{w}) = I_2(\widetilde{w})$.

**Definition 2.5 (Optimality).** A rule $w$ enforces $g$'s group-optimality if $w = \arg\max_{\widetilde{w}} uI_g(\widetilde{w})$.

**Remark 2.6.** We note that achieving optimality in per-unit improvement (Def. 2.5) is equivalent to guaranteeing, for a rule $w$, that no other $w'$ with $\|\Pi_g w'\|_2 \le \|\Pi_g w\|_2$ can induce greater improvement than $w$ does in group $g$.

Based on these objectives, we quantify how much the equilibrium play in this strategic interaction exacerbates inequalities between the groups due to their information discrepancies, even in the best-case scenario where the principal is optimizing the population's average welfare across groups.
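Both improvement notions translate directly into code. The helper below (our sketch) uses the closed form of the movement function derived in Section 3, $\Delta_g(w) = A_g^{-1}\Pi_g w$, to evaluate the two metrics for a candidate rule.

```python
import numpy as np

def total_improvement(w_tilde, w_star, Pi_g, A_g):
    """I_g(w~) = <Delta_g(w~), w*>, with Delta_g(w) = A_g^{-1} Pi_g w (Sec. 3)."""
    return np.linalg.solve(A_g, Pi_g @ w_tilde) @ w_star

def per_unit_improvement(w_tilde, w_star, Pi_g, A_g):
    """uI_g(w~) = I_g(Pi_g w~ / ||Pi_g w~||_2): improvement per unit of the
    rule's component inside the group's informational subspace."""
    proj = Pi_g @ w_tilde
    return total_improvement(proj / np.linalg.norm(proj), w_star, Pi_g, A_g)
```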
## 3. Equilibrium Computation

In this section, we compute the equilibrium plays. We first compute the agents' estimated rules ($w_{est}(g)$), given the information from their own group. We then derive the closed form of their best response $\hat{x}(x; g)$. Using these, we solve the principal's optimization problem of Eq. (2). App. B contains the proofs for this section.

### 3.1. Computing an Estimate of the Principal's Scoring Rule

Recall that agents from each group $g$ run ERM on their peer dataset $S_g$ to derive their estimated decision rule $w_{est}$. We posit that the agents are risk-averse; that is, they prefer certain outcomes rather than betting on uncertain ones. In our setting, this corresponds to agents taking the minimum-norm ERM solution to break ties, since they only wish to move in directions that can surely improve their outcome. Note that if agents invest effort outside of the informational subspace, this could result in them not improving their outcome further (or, even worse, deteriorating their outcome), while still incurring a cost. Formally, the agents compute $w_{est}(g)$ as:

$$w_{est}(g) = \arg\min_{\widetilde{w} \in W} \|\widetilde{w}\|_2^2, \quad \text{where } W = \Big\{ w' : w' = \arg\min_{w''} \sum_{i \in [N_g]} \big( x_{g,i}^\top w'' - \hat{y}_{g,i} \big)^2 \Big\} \quad (3)$$

When agents use ERM, we can state their estimated rule in closed form.

**Lemma 3.1.** Agents from group $g$ using ERM compute the estimated rule $w_{est}(g) = \Pi_g w$.

### 3.2. Closed Form of Agents' Best Response

Slightly abusing notation, the agents' value becomes $\text{Score}(x, x'; g) = \langle w_{est}(g), x' \rangle$, which equals $\langle \Pi_g w, x' \rangle$ by Lemma 3.1. So, the agents' utility (Eq. (1)) becomes:

$$u(x, x'; g) = \langle \Pi_g w, x' \rangle - \frac{1}{2}\left\| \sqrt{A_g}(x' - x) \right\|_2^2 \quad (4)$$

**Lemma 3.2.** The best response of an agent from group $g$ with feature vector $x$ is $\hat{x}(x; g) = x + A_g^{-1}\Pi_g w$. We write $\Delta_g(w) \equiv A_g^{-1}\Pi_g w$, so that $\hat{x}(x; g) = x + \Delta_g(w)$.

In equilibrium, the principal knows that the agents' best response, as a function of their private feature vector $x$, is given by Lemma 3.2. We use this next when solving the principal's optimization problem.

### 3.3. Principal's Chosen Scoring Rule in Equilibrium

Using the fact that the principal can compute $\Delta_g(w)$ for any group $g$, we can obtain a closed-form solution for the principal's chosen rule $w$ (i.e., the solution to Eq. (2)).

**Lemma 3.3.** The principal's scoring rule that maximizes the social welfare in equilibrium is:

$$w^{SW} = \frac{\left( \Pi_1 A_1^{-1} + \Pi_2 A_2^{-1} \right) w^\star}{\left\| \left( \Pi_1 A_1^{-1} + \Pi_2 A_2^{-1} \right) w^\star \right\|_2} \quad (5)$$

We note that $w^{SW}$ does not, in general, equal $w^\star$. One reason for this is disparities in feature modification costs: even if a unit modification of feature $i$ leads to a high level of improvement $w^\star(i)$, this feature may be too costly to improve. Second, even when the costs are identical for all features (e.g., $A_g = I_{d \times d}$), it is still the case that $w^{SW} \ne w^\star$ (unless $\Pi_1 + \Pi_2 = I_{d \times d}$), since it is often worth incentivizing feature changes in directions that overlap across, and benefit, both groups, in order to maximize their joint social welfare (see the example below).

**Example 3.4.** Let $d = 3$, and let the optimal feature vector be $w^\star = (2/3, 2/3, 1/3)$ (note that $\|w^\star\| = 1$). $\Pi_1$ projects onto features 1 and 3, while $\Pi_2$ projects onto features 2 and 3. Agents' costs are $I_{3 \times 3}$ in both groups, to isolate the effect of the projections on the social-welfare-maximizing rule. When posting $w = w^\star = (2/3, 2/3, 1/3)$, we have $\Delta_1(w^\star) = \Pi_1 w^\star = (2/3, 0, 1/3)$ and $\Delta_2(w^\star) = \Pi_2 w^\star = (0, 2/3, 1/3)$; this leads to an increase in the social welfare across groups of $(w^\star)^\top (\Delta_1(w^\star) + \Delta_2(w^\star)) = (4/9 + 0 + 1/9) + (0 + 4/9 + 1/9) = 10/9$. An alternative is to put more weight on the shared feature 3, even though it yields the lowest level of improvement in each group. For example, let us pick $w = \frac{1}{\sqrt{3}}(1, 1, 1)$. We now get a better expected improvement across groups of $(w^\star)^\top (\Delta_1(w) + \Delta_2(w)) = \frac{1}{\sqrt{3}}\left( (2/3 + 1/3) + (2/3 + 1/3) \right) = 2/\sqrt{3} \approx 1.15 > 10/9$.

Using the same techniques as for Lemma 3.3, we also characterize the scoring rule that maximizes the social welfare of a single group $g$. We use this as a benchmark to understand how far from optimal $w^{SW}$ can be within each group.

**Lemma 3.5.** The scoring rule maximizing the social welfare of group $g$ is $w_g = \frac{(A_g^{-1}\Pi_g)^\top w^\star}{\|(A_g^{-1}\Pi_g)^\top w^\star\|_2}$.
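The equilibrium quantities of this section are easy to verify numerically. The sketch below (ours) replays Example 3.4: it evaluates the social welfare of posting $w^\star$ versus the shared-direction rule, and checks that the closed form of Lemma 3.3 attains the maximum.

```python
import numpy as np

# Example 3.4: d = 3, identity costs; Pi_1 spans features {1,3}, Pi_2 spans {2,3}.
w_star = np.array([2/3, 2/3, 1/3])
Pi1, Pi2 = np.diag([1.0, 0.0, 1.0]), np.diag([0.0, 1.0, 1.0])
A1 = A2 = np.eye(3)

def social_welfare(w, w_star, Pis, As):
    """SW(w) = sum_g <Delta_g(w), w*> with Delta_g(w) = A_g^{-1} Pi_g w."""
    return sum(np.linalg.solve(A, Pi @ w) @ w_star for Pi, A in zip(Pis, As))

def w_sw(w_star, Pis, As):
    """Equilibrium rule of Lemma 3.3: normalize (sum_g Pi_g A_g^{-1}) w*."""
    v = sum(Pi @ np.linalg.solve(A, w_star) for Pi, A in zip(Pis, As))
    return v / np.linalg.norm(v)

print(social_welfare(w_star, w_star, [Pi1, Pi2], [A1, A2]))    # 10/9 ~ 1.111
w_shared = np.ones(3) / np.sqrt(3)
print(social_welfare(w_shared, w_star, [Pi1, Pi2], [A1, A2]))  # 2/sqrt(3) ~ 1.155
w_eq = w_sw(w_star, [Pi1, Pi2], [A1, A2])
print(np.allclose(w_eq, w_shared))  # here the equilibrium rule is exactly w_shared
```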
## 4. Equilibrium Analysis

In this section, we study the societal impact of the equilibrium strategies of the principal and the agents computed in Section 3. We do so by examining the feasibility of the objectives of cross-group improvement introduced in Section 2. We then study the ability to achieve improvement across groups while maintaining high predictive accuracy. We assume $(A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2)^\top w^\star \ne 0$, as otherwise the objective of Eq. (17) is always 0. The proofs for this section can be found in Appendix D.

### 4.1. Do-No-Harm

When a benevolent⁵ principal deploys an equilibrium rule maximizing the social welfare of the population, one could expect that this rule does not cause any negative externality (i.e., outcome deterioration). However, this is not the case in general, as we observe in the following example.

**Example 4.1.** Assume that the cost and projection matrices for the two groups are:

$$A_1 = \begin{pmatrix} 1 & -2 \\ -2 & 5 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 4 & 4 \\ 4 & 8 \end{pmatrix}, \quad \Pi_1 = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}, \quad \Pi_2 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$

Note that $A_1, A_2$ are symmetric and PD, as their smallest eigenvalues are $\lambda_1 = 3 - 2\sqrt{2} > 0$ and $\lambda_2 = 2(3 - \sqrt{5}) > 0$ respectively. Further, $\Pi_1, \Pi_2$ are orthogonal projections, as $\Pi_g^2 = \Pi_g = \Pi_g^\top$. Finally, assume that $w^\star = (0, a)^\top$ for a scalar $a > 0$. Then, for the numerator of $I_2(w^{SW})$ we have:

$$(w^\star)^\top \left( A_1^{-1}\Pi_1\Pi_2 A_2^{-1} + A_2^{-1}\Pi_2 A_2^{-1} \right) w^\star = -\frac{5}{16}\, a^2 < 0 \quad (a > 0).$$

The construction in Example 4.1 hinges on the observation that the product of two positive semi-definite matrices is generally not positive semi-definite. The specific values in the example were hence chosen to illustrate a situation where the product of $A_1$ and $A_2$ is not positive semi-definite, and $w^\star$ is then simply selected in a manner that exploits this to cause outcome deterioration.

**Remark 4.2.** Example 4.1 highlights the fact that, perhaps surprisingly, even assuming the "best case", where the principal optimizes social welfare, is not sufficient to overcome the tension that stems from cost disparities across groups. Our experiments (App. F) illustrate this counter-intuitive insight.

We hence next abstract away from cost disparities (and consider cost functions that differ among groups only by a multiplicative factor⁶), as we wish to examine cases where the discrepancy between the two groups is only due to disparities in information regarding the principal's assessment rule. First, however, we state the more general necessary and sufficient conditions for guaranteeing no negative externality.

**Theorem 4.3.** In equilibrium, there is no negative externality for group $g$ and any $w^\star$ if and only if the matrix

$$\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right) \Pi_g A_g^{-1} + A_g^{-1}\Pi_g \left( \Pi_1 A_1^{-1} + \Pi_2 A_2^{-1} \right)$$

is PSD.

As we show next (Corollary 4.4), assuming proportional costs between groups in fact suffices to guarantee no negative externality for any of the groups in equilibrium, regardless of the information discrepancy between them.

**Corollary 4.4.** There is no negative externality for either group in equilibrium if the cost matrices are proportional to each other, i.e., $A_1 = c \cdot A_2$ for a scalar $c > 0$.

⁵In Sec. 4.4, we instead consider a principal who wishes to trade off social welfare and predictive accuracy. ⁶This covers most of the cost functions considered in prior work (Hardt et al., 2016; Dong et al., 2018; Chen et al., 2020a; Ahmadi et al., 2021), where the cost matrices are diagonal with identical coefficients for all agents.
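A numerical replay of Example 4.1 and of Theorem 4.3's certificate (our sketch; the matrices follow the reconstruction of the example given above):

```python
import numpy as np

# Example 4.1: group 2 deteriorates even under the welfare-maximizing rule.
A1 = np.array([[1.0, -2.0], [-2.0, 5.0]])
A2 = np.array([[4.0,  4.0], [ 4.0, 8.0]])
Pi1 = np.full((2, 2), 0.5)             # projector onto span{(1, 1)}
Pi2 = np.diag([1.0, 0.0])              # projector onto span{(1, 0)}
w_star = np.array([0.0, 1.0])          # w* = (0, a) with a = 1

v = Pi1 @ np.linalg.solve(A1, w_star) + Pi2 @ np.linalg.solve(A2, w_star)
w_sw = v / np.linalg.norm(v)
I = [np.linalg.solve(A, Pi @ w_sw) @ w_star for Pi, A in [(Pi1, A1), (Pi2, A2)]]
print(I)   # I_1 > 0 but I_2 < 0: a negative externality for group 2

# Theorem 4.3's certificate for g = 2: the symmetrized matrix below must be PSD
# for no-harm to hold for every w*; here it has a negative eigenvalue.
B = np.linalg.inv(A1) @ Pi1 + np.linalg.inv(A2) @ Pi2
M = B @ Pi2 @ np.linalg.inv(A2) + np.linalg.inv(A2) @ Pi2 @ B.T
print(np.linalg.eigvalsh(M))
```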
Another interesting implication of Theorem 4.3 is that no negative externality is experienced in equilibrium when the subspaces $S_1, S_2$ are orthogonal. Intuitively, this happens because the two groups have no informational overlap, and hence optimal social welfare is achieved by a rule which only has to take into account a single group in each informational subspace.

**Corollary 4.5.** There is no negative externality in equilibrium if the subspaces $S_1, S_2$ are orthogonal.

Theorem 4.3 offers an important takeaway: information discrepancy by itself is not sufficient to cause outcome deterioration. It may, however, still result in disparities in improvement, which we discuss next.

### 4.2. Equal Improvement Across Groups

While highly desirable in itself, the ability to induce improvement simultaneously in all groups does not prevent differences in the extent of such improvements across groups. In this subsection, we hence study a stronger objective: equal improvement across groups.⁷ To isolate the effects of information discrepancy, we assume throughout it that $A_1 = A_2 = I_{d \times d}$. We first introduce a measure that will be helpful in quantifying the difference in improvement.

**Definition 4.6.** Given a scoring rule $w \in \mathbb{R}^d$ and projections $\Pi_1, \Pi_2 \in \mathbb{R}^{d \times d}$, we define the *information overlap proxy* between groups $G_1, G_2$ with respect to $w$ to be $r_{1,2}(w) := \|\Pi_1 w - \Pi_2 w\|_2$.

The following lemma shows that the information overlap proxy with respect to the underlying scoring rule $w^\star$ upper bounds the difference in improvement between groups.

**Lemma 4.7.** Let $\text{diff}_{1,2}(w) \equiv |I_1(w) - I_2(w)|$ be the disparity in improvement across groups when the principal's rule is $w$. In equilibrium, if $A_1 = A_2 = I_{d \times d}$, then $\text{diff}_{1,2}(w^{SW}) \le r_{1,2}(w^\star)$. Further, equality holds if and only if $\Pi_1 w^\star$ and $\Pi_2 w^\star$ are co-linear.

**Remark 4.8.** In particular, note that the bound is tight when $\Pi_1 = \Pi_2$ (perfect overlap) and when $\Pi_1 = I_{d \times d}, \Pi_2 = 0$ (maximal informational disparity across groups).

We next derive necessary and sufficient conditions for equality of improvement in the general case.

**Theorem 4.9.** In equilibrium, the groups have equal improvement for all $w^\star$ if and only if $A_1^{-1}\Pi_1 A_1^{-1} = A_2^{-1}\Pi_2 A_2^{-1}$.

Note that Theorem 4.9 holds globally (regardless of $w^\star$), identifying the condition for the improvement disparity to vanish. Lemma 4.7, however, provides an instance-specific (as a function of $w^\star$) upper bound. As we allow for such instance-specific analysis, equal improvement may arise under weaker conditions.

### 4.3. Efficient Effort Exertion Across Groups

Achieving equality of improvement across groups does not, however, guarantee that effort to improve is exerted at the same level of efficiency across groups. Our notion of per-unit improvement (Definition 2.2) aims to capture the concept of optimal effort exertion formally (Remark 2.6). This section studies the ability to ensure efficient effort exertion across groups. We begin by exhibiting the difference between improvement and efficient effort exertion; a numerical sketch follows the proposition.

**Proposition 4.10.** Let $\alpha > 0$ be arbitrarily small. In equilibrium we may see simultaneously:
- arbitrarily different improvement across groups: $I_1(w^{SW}) < \alpha \cdot I_2(w^{SW})$;
- optimal per-unit improvement in all groups, i.e., $uI_g(w^{SW}) = uI_g(w_g)$ for all $g$.

⁷Equality of total improvement is a strictly stronger objective than do-no-harm. Indeed, achieving equal total improvement guarantees that there exists no negative externality, since the optimal social welfare is always non-negative.
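Both Lemma 4.7 and Proposition 4.10 are easy to probe numerically. The sketch below (ours, identity costs throughout) checks the overlap-proxy bound on random instances, and then builds an orthogonal-subspace instance exhibiting Proposition 4.10's two properties at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def rand_proj(k):
    Q = np.linalg.qr(rng.normal(size=(d, k)))[0]
    return Q @ Q.T

# Lemma 4.7: with A_1 = A_2 = I, |I_1(w^SW) - I_2(w^SW)| <= ||(Pi_1 - Pi_2) w*||_2.
for _ in range(1000):
    Pi1, Pi2 = rand_proj(2), rand_proj(3)
    w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
    w_sw = (Pi1 + Pi2) @ w_star
    w_sw /= np.linalg.norm(w_sw)
    I1, I2 = (Pi1 @ w_sw) @ w_star, (Pi2 @ w_sw) @ w_star
    assert abs(I1 - I2) <= np.linalg.norm((Pi1 - Pi2) @ w_star) + 1e-9

# Proposition 4.10: orthogonal subspaces, arbitrarily unequal total improvement,
# yet optimal per-unit improvement in both groups (cf. Corollary 4.12).
eps = 0.01
w_star = np.array([eps, np.sqrt(1 - eps**2), 0.0, 0.0])
Pi1, Pi2 = np.diag([1.0, 0, 0, 0]), np.diag([0, 1.0, 0, 0])
w_sw = (Pi1 + Pi2) @ w_star
w_sw /= np.linalg.norm(w_sw)
I1, I2 = (Pi1 @ w_sw) @ w_star, (Pi2 @ w_sw) @ w_star
print(I1 / I2)                       # ~ eps^2 / (1 - eps^2): arbitrarily small
for Pi in (Pi1, Pi2):
    uI = (Pi @ w_sw) @ w_star / np.linalg.norm(Pi @ w_sw)
    assert np.isclose(uI, np.linalg.norm(Pi @ w_star))   # the per-unit optimum
```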
Next, we derive the necessary and sufficient conditions for optimal per-unit improvement in the general case.

**Theorem 4.11.** In equilibrium, group $g$ gets optimal per-unit improvement if and only if:

$$\left\langle A_g^{-1}\Pi_g \frac{A_g^{-1} w^\star}{\left\| \Pi_g A_g^{-1} w^\star \right\|_2} - A_g^{-1}\Pi_g \frac{\left( \Pi_1 A_1^{-1} + \Pi_2 A_2^{-1} \right) w^\star}{\left\| \Pi_g \left( \Pi_1 A_1^{-1} + \Pi_2 A_2^{-1} \right) w^\star \right\|_2},\ w^\star \right\rangle = 0$$

Since the condition in Theorem 4.11 can be difficult to interpret, we identify two natural cases where optimal per-unit improvement is guaranteed. The first case occurs when the groups' information on the decision rule does not overlap.

**Corollary 4.12.** In equilibrium, optimal per-unit improvement across groups is guaranteed if $S_1, S_2$ are orthogonal.

The second case is when the groups have the same information regarding the decision rule, and their feature modification costs are proportional to one another.

**Corollary 4.13.** In equilibrium, optimal per-unit improvement across groups is guaranteed when the cost matrices are proportional to each other and $\Pi_1 = \Pi_2$.

Intuitively, both Corollary 4.12 and Corollary 4.13 reflect situations where the direction of the optimized solution in each group's informational subspace is not affected by the other groups.

### 4.4. Improvement With High Predictive Accuracy

As we noted before, improving social welfare may lead to inaccuracies in the learner's predictions, as in general $w^\star \ne w^{SW}, w_g$. However, in practice, a learner may want both to (i) improve the underlying quality of the population, so that agents are, for example, more likely to repay their loans, and (ii) correctly predict each agent's credit-worthiness. In this section, we hence replace the assumption of a benevolent principal with one who wishes to trade off predictive accuracy and social welfare. In other words, we study the ability to induce improvement in all groups while deploying highly accurate decision rules. The proofs for this subsection and additional intuition can be found in Appendices D.4 and E.

A simple way to take into account both the accuracy and social welfare objectives is to consider decision rules of the form $\lambda w^\star + (1 - \lambda) w^{SW}$ for $\lambda \in [0, 1]$. Such rules exhibit a trade-off between picking the accuracy-optimizing (as $\lambda \to 1$) and the welfare-optimizing (as $\lambda \to 0$) rules. We investigate conditions under which a decision rule of this form can induce improvement in all groups. To do so, we begin by introducing a simple condition regarding modification costs.

**Definition 4.14.** We say that group $g$ has *decomposable modification cost* if, for all $x \in S_g$, $A_g x \in S_g$, and for all $x \in S_g^{\perp}$, $A_g x \in S_g^{\perp}$.

At a high level, the condition in Definition 4.14 requires the cost of any modification of features to be decomposable into two independent components: the cost of modifications within the group's subspace of information, and the cost of modifications outside of it.⁸ We refer the reader to Appendix E for more intuition regarding why such a condition may arise naturally, but note that this condition encodes that agents in group $g$ never perform manipulations outside of their informational subspace $S_g$, i.e., modifications whose effect they have no understanding of. As we prove next, the condition in Definition 4.14 ensures improvement in all groups while maintaining high predictive accuracy.

**Theorem 4.15.** Assume group $g$ has decomposable modification cost. Then, Do-No-Harm for group $g$ is guaranteed if the principal deploys $w^\star$. Further, if Do-No-Harm is guaranteed for group $g$ under $w^{SW}$, it is also guaranteed for any convex combination of $w^{SW}$ and $w^\star$.

In particular, Theorem 4.15 shows that even when the principal optimizes for accuracy alone, no negative externality is experienced in any of the groups under the condition of Definition 4.14.

⁸We further note that a similar condition arises in the context of the principal's learning problem (Appendix A).
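A minimal check of Definition 4.14 and of the first part of Theorem 4.15 (our sketch; a diagonal $A_g$ that commutes with $\Pi_g$ is one simple way to obtain decomposability):

```python
import numpy as np

# Decomposable cost (Definition 4.14): A_g maps S_g to S_g and S_g's orthogonal
# complement to itself. Any A_g that commutes with Pi_g has this property.
Pi_g = np.diag([1.0, 1.0, 0.0])          # S_g = span{e1, e2} in R^3
A_g = np.diag([2.0, 3.0, 7.0])
assert np.allclose(A_g @ Pi_g, Pi_g @ A_g)

# Theorem 4.15 (first part): Do-No-Harm under the accuracy-optimal rule w*.
rng = np.random.default_rng(0)
for _ in range(1000):
    w_star = rng.normal(size=3)
    I_g = np.linalg.solve(A_g, Pi_g @ w_star) @ w_star   # I_g(w*)
    assert I_g >= -1e-12
```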
We next show a surprising implication of Theorem 4.15: unlike the case of the social-welfare-maximizing solution (as shown in Example 4.1), improvement in all groups may in fact be naturally guaranteed for the accuracy-maximizing solution.

**Corollary 4.16.** Under full information ($\Pi_1 = \Pi_2 = I_{d \times d}$), Do-No-Harm for all groups is guaranteed for $w^\star$.

Corollary 4.16 highlights an interesting perspective on the benefits of transparency in prediction in strategic contexts; even a mostly self-interested principal could in fact benefit from making its rule more transparent. For example, one could consider this in the context of loan approvals, where a bank deploys a proprietary decision rule, aiming primarily for high predictive accuracy and secondarily for increasing the quality of loan candidates across all groups. Corollary 4.16 can then be viewed as an incentive for the bank to increase the transparency of such a rule.

## 5. Experiments

Here, we empirically evaluate the impact of information disparities at equilibrium on two real-world datasets that pertain to our setting: the TAIWAN-CREDIT and ADULT datasets.⁹ Our code is available in the supplementary material.

**Experimental Setup.** For TAIWAN-CREDIT, $d = 24$, and for ADULT, $d = 14$. In order to guarantee numerical (rather than categorical) feature values, we pre-processed the ADULT dataset to transform the categorical features to integers. Specifically, for the features for which there was a clear hierarchical ordering (e.g., the Education feature, where we could order agents in terms of their highest education level reached), we reflected this ordering in the assignment of numerical values to these categories. For the TAIWAN-CREDIT dataset, no pre-processing was needed.

Table 1: Groups for the TAIWAN-CREDIT dataset.

| | Age | Education | Marriage |
|---|---|---|---|
| $G_1$ | ≤ 25 yrs old | grad school & college | married |
| $G_2$ | > 25 yrs old | ≤ high school | not married |
| $\text{diff}(w^{SW})$ | 0.34 | 0.05 | 0.23 |
| $r_{1,2}(w^\star)$ | 0.5 | 0.52 | 0.48 |

Table 2: Groups for the ADULT dataset.

| | Age | Country | Education |
|---|---|---|---|
| $G_1$ | ≤ 35 yrs old | western world | degrees ≥ high school |
| $G_2$ | > 35 yrs old | everyone else | degrees < high school |
| $\text{diff}(w^{SW})$ | 0.15 | 0.66 | 0.20 |
| $r_{1,2}(w^\star)$ | 0.29 | 0.89 | 0.77 |

In both cases, we ran ERM in order to identify $w^\star$, and we assumed that the costs are $A_1 = A_2 = I_{d \times d}$. In App. F we present additional experimental results for cost matrices $A_1, A_2$ that differ from one another. After the pre-processing step, we created the groups of the population based on categories that intuitively define peer-networks. Our judgment in picking these categories is based on folklore ideas about how people choose their networks and social circles. For the TAIWAN-CREDIT dataset, we use the categories Age, Education, and Marriage, while for the ADULT dataset, we use Age, Country, and Education; the final groups are given in Tables 1 and 2. In order to obtain the projection matrices $\Pi_1, \Pi_2$, we ran SVD on the points inside $G_1, G_2$. To be more precise, let $X_g \in \mathbb{R}^{|G_g| \times d}$ be the matrix having as rows the vectors $x^\top$, $x \in G_g$. Running SVD on $X_g$ produces three matrices: $X_g = U D V_g^\top$, where $V_g \in \mathbb{R}^{d \times r}$ and $r = \text{rank}(X_g)$. Let $V_{g,5}$ be the matrix having as columns the eigenvectors corresponding to the 5 top eigenvalues, with all other $d - 5$ columns zeroed out. Then, the projection matrix $\Pi_g$ is defined as $\Pi_g = V_{g,5}V_{g,5}^\top$.¹⁰

⁹Available at https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients and https://archive.ics.uci.edu/ml/datasets/adult. ¹⁰Effectively, we focus the feature space on the directions corresponding to the top 5 eigenvalues found in each group's data, as per traditional principal component analysis.
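The projection matrices used in the experiments can be reproduced as follows (a sketch with synthetic data standing in for the real group matrices $X_g$):

```python
import numpy as np

def group_projector(X_g, k=5):
    """Pi_g = V_{g,k} V_{g,k}^T, where V_{g,k} holds the top-k right singular
    vectors of the group's data matrix X_g (rows = agents' feature vectors)."""
    _, _, Vt = np.linalg.svd(X_g, full_matrices=False)
    V_k = Vt[:k].T
    return V_k @ V_k.T

rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 14))      # stand-in for one group's feature matrix
Pi1 = group_projector(X1)
assert np.allclose(Pi1 @ Pi1, Pi1) and np.allclose(Pi1, Pi1.T)
```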
**Results.** In summary, our experimental results illustrate our theoretical analysis, and extend our insights to settings where the projection matrices do not satisfy the exact conditions required by the formal statements of Sec. 4. First, we see that in both datasets, the principal's rule that optimizes the social welfare does not cause any negative externality when $A_1 = A_2 = I_{d \times d}$ (that said, we do observe outcome deteriorations when the cost matrices differ from one another; see Appendix F). In fact, we observe strict improvement, i.e., $I_g(w^{SW}) > 0$ for all groups $g$. Second, neither the total nor the per-unit improvements are equal. In terms of total improvements, we in fact see significant disparities when the groups are defined based on Age or Marital Status in the TAIWAN-CREDIT dataset, and based on every categorization in the ADULT dataset. These significant disparities for the particular groups we created match our intuition: we expect people from significantly different age groups or countries to have very different understandings of the scoring rule, in turn leading to possibly very disparate total improvements. We note also that in both datasets the difference in the total improvements of the groups is upper bounded by the overlap proxy (i.e., $\text{diff}(w^{SW}) \le r_{1,2}(w^\star)$), as expected from Lemma 4.7. That said, the gap between the two quantities can be rather large. This is because the magnitude of the overlap is not the only factor controlling $\text{diff}(w^{SW})$; other factors (e.g., the direction of the overlap, or how it compares to $w^\star$) also matter significantly.

Note that the optimal per-unit improvement can be very different across groups; an extreme example is the Country categorization in ADULT. It is surprising, however, that for the Education categorization in TAIWAN-CREDIT the optimal per-unit improvement is almost identical. Another interesting observation is that the per-unit improvement is close to (or almost the same as) the optimal per-unit improvement for all groups in ADULT. We suspect that this is due to the groups having very different projection matrices $\Pi_1, \Pi_2$.

*Figure 1: Left, Right: evaluation on the TAIWAN-CREDIT and ADULT datasets respectively. Tables 1 and 2 contain the characteristics of groups $G_1, G_2$. Recall that $I_g(w^{SW})$ and $uI_g(w^{SW})$ denote the total and per-unit improvement for group $g$ in equilibrium respectively, while $uI_g(w_g)$ denotes the optimal per-unit improvement for group $g$.*

## 6. Conclusion

In this work, we have taken a first step towards understanding the implications of inaccessible decision rules for different sub-populations in strategic learning. Our results establish a close connection between the informational overlap across sub-populations, the extent to which it is possible to ensure improvement in each, and whether such improvement can be induced while maintaining high predictive accuracy. We next discuss limitations of our work and avenues for future research.
First, our model uses a linearity assumption regarding the form of the decision rule; we believe this linear assumption is a natural and simple choice for a first model that studies the phenomenon of information discrepancy, but an interesting future direction would be to understand what happens beyond linearity. Second, we assume agents best-respond perfectly to the principal's choices. While one can weaken the best-response assumption, this may affect the sharpness of our results. We note that some form of assumption regarding how agents respond to the model is natural and reflects many real-life situations. Finally, while we provide guarantees for improvement with high predictive accuracy in the form of safeguarding against outcome deterioration, it would be interesting to further study the trade-offs between accuracy and improvement.

## 7. Acknowledgments

Yahav Bechavod was supported in part by Israel Science Foundation (ISF) grants 1044/16 and 2861/20, the United States Air Force and DARPA under contract FA8750-19-20222, and the Apple Scholars in AI/ML PhD Fellowship. Chara Podimata was supported in part by an MSR Dissertation Grant, a Siebel Scholarship, the National Science Foundation under grant CCF-1718549, and the Harvard Data Science Initiative. Juba Ziani was funded in part by the Warren Center for Network and Data Sciences at the University of Pennsylvania, and NSF grant AF-1763307. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA.

## References

Ahmadi, S., Beyhaghi, H., Blum, A., and Naggita, K. The strategic perceptron. In Proceedings of the 22nd ACM Conference on Economics and Computation, pp. 6–25, 2021.

Alon, T., Dobson, M., Procaccia, A., Talgam-Cohen, I., and Tucker-Foltz, J. Multiagent evaluation mechanisms. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1774–1781, 2020.

Anderson Jr., W. N., James Harner, E., and Trapp, G. E. Eigenvalues of the difference and product of projections. Linear and Multilinear Algebra, 17(3-4):295–299, 1985.

Apesteguia, J., Huck, S., and Oechssler, J. Imitation theory and experimental evidence. Journal of Economic Theory, 136(1):217–235, 2007.

Bandura, A. Observational learning. The International Encyclopedia of Communication, 2008.

Bechavod, Y., Ligett, K., Wu, S., and Ziani, J. Gaming helps! Learning from strategic interactions in natural dynamics. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 1234–1242. PMLR, 13–15 Apr 2021.

Bikhchandani, S., Hirshleifer, D., and Welch, I. Learning from the behavior of others: Conformity, fads, and informational cascades. Journal of Economic Perspectives, 12(3):151–170, September 1998. doi: 10.1257/jep.12.3.151.

Björkegren, D., Blumenstock, J. E., and Knight, S. Manipulation-proof machine learning. CoRR, abs/2004.03865, 2020.

Braverman, M. and Garg, S. The role of randomness and noise in strategic classification. In 1st Symposium on Foundations of Responsible Computing, FORC 2020, June 1-3, 2020, Harvard University, Cambridge, MA, USA (virtual conference), volume 156 of LIPIcs, pp. 9:1–9:20, 2020.

Brückner, M., Kanzow, C., and Scheffer, T. Static prediction games for adversarial learning problems. The Journal of Machine Learning Research, 13(1):2617–2654, 2012.
Cai, Y., Daskalakis, C., and Papadimitriou, C. Optimum statistical estimation with strategic data sources. In Conference on Learning Theory, pp. 280–296. PMLR, 2015.

Calvó-Armengol, A. and Jackson, M. O. The effects of social networks on employment and inequality. American Economic Review, 94(3):426–454, June 2004. doi: 10.1257/0002828041464542.

Chen, Y., Podimata, C., Procaccia, A. D., and Shah, N. Strategyproof linear regression in high dimensions. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 9–26, 2018.

Chen, Y., Liu, Y., and Podimata, C. Learning strategy-aware linear classifiers. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020a.

Chen, Y., Wang, J., and Liu, Y. Strategic recourse in linear classification. arXiv preprint arXiv:2011.00355, 2020b.

Coate, S. and Loury, G. C. Will affirmative-action policies eliminate negative stereotypes? The American Economic Review, 83(5):1220–1240, 1993. ISSN 00028282.

Dalvi, N., Domingos, P., Sanghai, S., and Verma, D. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 99–108, 2004.

Dee, T. S., Dobbie, W., Jacob, B. A., and Rockoff, J. The causes and consequences of test score manipulation: Evidence from the New York Regents examinations. American Economic Journal: Applied Economics, 11(3):382–423, July 2019. doi: 10.1257/app.20170520.

Dekel, O., Fischer, F., and Procaccia, A. D. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759–777, 2010.

DiMaggio, P. and Garip, F. How network externalities can exacerbate intergroup inequality. American Journal of Sociology, 116(6):1887–1933, 2011. ISSN 00029602, 15375390.

Dong, J., Roth, A., Schutzman, Z., Waggoner, B., and Wu, Z. S. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 55–70, 2018.

Dranove, D., Kessler, D., McClellan, M., and Satterthwaite, M. Is more information better? The effects of report cards on health care providers. Journal of Political Economy, 111(3):555–588, 2003. ISSN 00223808, 1537534X.

Freeman, R., Pennock, D., Podimata, C., and Vaughan, J. W. No-regret and incentive-compatible online learning. In International Conference on Machine Learning, pp. 3270–3279. PMLR, 2020.

Ghalme, G., Nair, V., Eilat, I., Talgam-Cohen, I., and Rosenfeld, N. Strategic classification in the dark. CoRR, abs/2102.11592, 2021.

Gonzalez-Lira, A. and Mobarak, A. M. Slippery fish: Enforcing regulation under subversive adaptation. IZA Discussion Papers 12179, Institute of Labor Economics (IZA), February 2019.

Greenstone, M., He, G., Jia, R., and Liu, T. Can technology solve the principal-agent problem? Evidence from China's war on air pollution. SSRN Electronic Journal, 01 2020. doi: 10.2139/ssrn.3638591.

Gündoğdu, D., Panzarasa, P., Oliver, N., and Lepri, B. The bridging and bonding structures of place-centric networks: Evidence from a developing country. PLoS ONE, 14(9):e0221148, September 2019. doi: 10.1371/journal.pone.0221148.

Gupta, V., Nokhiz, P., Roy, C. D., and Venkatasubramanian, S. Equalizing recourse across groups. CoRR, abs/1909.03166, 2019.

Haghtalab, N., Immorlica, N., Lucier, B., and Wang, J. Z. Maximizing welfare with incentive-aware evaluation mechanisms. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 160–166, 2020.
Hardt, M., Megiddo, N., Papadimitriou, C., and Wootters, M. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pp. 111–122, 2016.

Heidari, H., Ferrari, C., Gummadi, K. P., and Krause, A. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 2018.

Heidari, H., Nanda, V., and Gummadi, K. P. On the long-term impact of algorithmic decision policies: Effort unfairness and feature segregation through social learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2692–2701. PMLR, 2019.

Horel, T., Ioannidis, S., and Muthukrishnan, S. Budget feasible mechanisms for experimental design. In Latin American Symposium on Theoretical Informatics, pp. 719–730. Springer, 2014.

Hu, L. and Chen, Y. Fair classification and social welfare. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 535–545, 2020.

Hu, L., Immorlica, N., and Vaughan, J. W. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 259–268, 2019.

Hunt, D. B. Redlining. Encyclopedia of Chicago, 2005.

Ioannidis, S. and Loiseau, P. Linear regression as a non-cooperative game. In International Conference on Web and Internet Economics, pp. 277–290. Springer, 2013.

Jagadeesan, M., Mendler-Dünner, C., and Hardt, M. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, pp. 4687–4697. PMLR, 2021.

Khajehnejad, M., Tabibian, B., Schölkopf, B., Singla, A., and Gomez-Rodriguez, M. Optimal decision making under strategic behavior. arXiv preprint arXiv:1905.09239, 2019.

Kleinberg, J. and Raghavan, M. How do classifiers induce agents to invest effort strategically? ACM Transactions on Economics and Computation (TEAC), 8(4):1–23, 2020.

Liu, L. T., Wilson, A., Haghtalab, N., Kalai, A. T., Borgs, C., and Chayes, J. The disparate equilibria of algorithmic decision making when individuals invest rationally. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 381–391, 2020.

McPherson, M., Smith-Lovin, L., and Cook, J. M. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001. doi: 10.1146/annurev.soc.27.1.415.

Meir, R., Procaccia, A. D., and Rosenschein, J. S. On the limits of dictatorial classification. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pp. 609–616, 2010.

Meir, R., Almagor, S., Michaely, A., and Rosenschein, J. S. Tight bounds for strategyproof classification. In 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), Taipei, Taiwan, May 2-6, 2011, Volume 1-3, pp. 319–326, 2011.

Meir, R., Procaccia, A. D., and Rosenschein, J. S. Algorithms for strategyproof classification. Artificial Intelligence, 186:123–156, 2012.

Miller, J., Milli, S., and Hardt, M. Strategic classification is causal modeling in disguise. In International Conference on Machine Learning, pp. 6917–6926. PMLR, 2020.
Milli, S., Miller, J., Dragan, A. D., and Hardt, M. The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 230–239, 2019.

Okafor, C. O. All things equal? Social networks as a mechanism for discrimination. arXiv: General Economics, 2020.

Perdomo, J., Zrnic, T., Mendler-Dünner, C., and Hardt, M. Performative prediction. In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020.

Perote, J. and Perote-Pena, J. Strategy-proof estimators for simple regression. Mathematical Social Sciences, 47(2):153–176, 2004.

Rothstein, R. The Color of Law: A Forgotten History of How Our Government Segregated America. Liveright Publishing Corporation, 2017. ISBN 9781631492853.

Roughgarden, T. and Schrijvers, O. Online prediction with selfish experts. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 1300–1310, 2017.

Shavit, Y., Edelman, B., and Axelrod, B. Causal strategic linear regression. In International Conference on Machine Learning, pp. 8676–8686. PMLR, 2020.

Smith, L. and Sørensen, P. Pathological outcomes of observational learning. Econometrica, 68(2):371–398, 2000. doi: https://doi.org/10.1111/1468-0262.00113.

Sundaram, R., Vullikanti, A., Xu, H., and Yao, F. PAC-learning for strategic classification. arXiv preprint arXiv:2012.03310, 2020.

Tsirtsis, S. and Rodriguez, M. G. Decisions, counterfactual explanations and strategic behavior. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Ustun, B., Spangher, A., and Liu, Y. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 10–19, 2019.

## A. The Principal's Learning Problem

Up until now, we have assumed that the principal has full information about the parameters of the problem. In particular, the principal perfectly knows the underlying linear model $w^\star$, the cost matrices $A_1$ and $A_2$, and the projection matrices $\Pi_1$ and $\Pi_2$. In this section, we study how the principal can learn $w^{SW}$ from samples of agents' modified features. To do so, we present two simple building blocks: one that uses a batch of observations to estimate $w^\star$, and one that aims to estimate $\Delta_g(w) = A_g^{-1}\Pi_g w$ for a given $w$. We make the following commutativity assumption (see Appendix E for more regarding this assumption):

**Assumption A.1.** For all $g \in \{1, 2\}$, $\Pi_g A_g^{-1} = A_g^{-1}\Pi_g$.

We remark that this assumption holds in several cases of interest. For example, it holds when $A_g = \sigma_g I_{d \times d}$ for some $\sigma_g > 0$, i.e., when the cost of an agent for modifying features is the same across all features and independent across features. It also holds when $\Pi_g$ and $A_g^{-1}$ are both diagonal, in which case they are simultaneously diagonalizable and hence commute (for example, when $\Pi_g$ is the projection onto a subset of the features, and manipulating one feature does not affect another feature for free). Under Assumption A.1, Equation (5) can be rewritten as

$$w^{SW} = \frac{A_1^{-1}\Pi_1 w^\star + A_2^{-1}\Pi_2 w^\star}{\left\| A_1^{-1}\Pi_1 w^\star + A_2^{-1}\Pi_2 w^\star \right\|} = \frac{\Delta_1(w^\star) + \Delta_2(w^\star)}{\left\| \Delta_1(w^\star) + \Delta_2(w^\star) \right\|},$$

so accurate estimation of both $\Pi_g w^\star$ and $\Delta_g(w)$ for any given $w$ is sufficient for accurate estimation of $w^{SW}$.
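Assumption A.1 can be checked directly; the sketch below (ours) verifies it for a diagonal pair and shows that a generic positive definite cost matrix typically violates it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Pi_g = np.diag([1.0, 1.0, 0.0, 0.0])   # projection onto a subset of features

# Diagonal cost matrix: simultaneously diagonalizable with Pi_g, hence commutes.
A_diag = np.diag(rng.uniform(1.0, 3.0, size=d))
assert np.allclose(Pi_g @ np.linalg.inv(A_diag), np.linalg.inv(A_diag) @ Pi_g)

# A generic (non-diagonal) PD cost matrix does not commute with Pi_g.
B = rng.normal(size=(d, d))
A_full = B @ B.T + d * np.eye(d)
print(np.allclose(Pi_g @ np.linalg.inv(A_full), np.linalg.inv(A_full) @ Pi_g))  # False
```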
The principal can then take a classical explore-then-exploit approach, in which she first sets aside a batch of agents in group $g$ to estimate the parameters of the problem to her desired accuracy, and then uses the learned parameters to incentivize optimal outcome improvement during the rest of the time horizon.

**Estimating $\Pi_g w^\star$.** To estimate $\Pi_g w^\star$, we use Algorithm 2. The algorithm has access to $n$ agents from group $g$. It consists of first posting an initial model of $w = 0$ (w.l.o.g.¹¹), observing the agents' true, unmodified features and true labels (according to $w^\star$), and using these observations to compute and output an estimate $\bar{w}$ of $\Pi_g w^\star$.

**Algorithm 2: Estimating $\Pi_g w^\star$**
1. Post $w = 0$.
2. For $i = 1$ to $n$: the principal observes agent $i$'s true feature vector $x_i$ and his true label $y_i$.
3. Output $\bar{w} \in \arg\min_{w'} \sum_{i=1}^{n} \left( x_i^\top \Pi_g w' - y_i \right)^2$.

For simplicity of exposition, we consider the case in which the noise in the label follows a Gaussian distribution, as per the assumption below. We note, however, that our results extend to the sub-Gaussian case by classical recovery guarantees for linear least-squares regression.

**Assumption A.2.** For every agent $i$, $y_i - x_i^\top w^\star \sim \mathcal{N}(0, \sigma^2)$, where $0 \le \sigma^2 < \infty$.

**Claim A.3.** Under Assumption A.2, with probability at least $1 - \delta$, the output $\bar{w}$ of Algorithm 2 satisfies

$$\left\| \bar{w} - \Pi_g w^\star \right\|_2 = O\left( \sqrt{\frac{\sigma^2 d \log(1/\delta)}{\lambda_g n}} \right),$$

where $\lambda_g$ denotes the smallest non-zero eigenvalue of $\Sigma_g$, the covariance matrix of distribution $\mathcal{D}_g$.

*Proof.* Without loss of generality, we restrict attention to the subspace $S_g$ induced by the support of the distribution of features $\mathcal{D}_g$. Let $\Sigma_g$ be the covariance matrix of distribution $\mathcal{D}_g$; by definition of $\mathcal{D}_g$, $\Sigma_g$ is full-rank with smallest eigenvalue $\lambda_g$ in $S_g$. By the classical recovery results on least-squares regression, since $\mathbb{E}[y_i \mid x_i] = x_i^\top (\Pi_g w^\star)$ by assumption, we obtain that

$$\left\| \bar{w} - \Pi_g w^\star \right\|_2 = O\left( \sqrt{\frac{\sigma^2 d \log(1/\delta)}{\lambda_g n}} \right).$$

This concludes the proof. □

¹¹The choice of $w = 0$ in Algorithm 2 is not crucial. In fact, picking any given $w$ induces the same distribution of feature vectors as $S_g$, with its expectation shifted by the constant amount $\Delta_g(w)$, after the first $N_g$ unmodified observations. In turn, given $N_g + n$ samples, the distribution of the last $n$ feature vectors used for estimation remains full-rank in subspace $S_g$ and still has covariance matrix $\Sigma_g$. Therefore, the high-probability bound of Claim A.3 remains the same. In most cases, $N_g$ is of the order of the dimension $d$ of the problem, hence $n \gg N_g$, and the cost of waiting for agents to learn $w$ is minimal.

**Estimating $\Delta_g(w)$.** Algorithm 3 has access to $2n + N_g$ agents from group $g$, takes as input a vector $w$, and outputs an estimate of $\Delta_g(w)$.

**Algorithm 3: Estimating $\Delta_g(w)$**
1. Post $w_1 = 0$.
2. For $i = 1$ to $n$: the principal observes agent $i$'s true feature vector $x_i$ and his true label $y_i$.
3. Post $w_2 = w$.
4. For $i = n + 1$ to $n + N_g$: agent $i$ plays his true feature vector $x_i$.
5. For $i = n + N_g + 1$ to $2n + N_g$: the principal observes agent $i$'s modified feature vector $\hat{x}_i = x_i + \Delta_g(w)$.
6. Output $\bar{\Delta}_g \leftarrow \frac{1}{n}\left( \sum_{i=n+N_g+1}^{2n+N_g} \hat{x}_i - \sum_{i=1}^{n} \hat{x}_i \right)$.

**Claim A.4.** Assume that $\|x_i\|_\infty \le 1$ for all $i$. Then, with probability at least $1 - \delta$, the output $\bar{\Delta}_g$ of Algorithm 3 satisfies

$$\left\| \bar{\Delta}_g - \Delta_g(w) \right\|_2 \le 2\sqrt{\frac{2 d \log(2d/\delta)}{n}}.$$

*Proof.* First, we note that

$$\sum_{i=n+N_g+1}^{2n+N_g} \hat{x}_i = \sum_{i=n+N_g+1}^{2n+N_g} x_i + n \Delta_g(w), \quad \text{while} \quad \sum_{i=1}^{n} \hat{x}_i = \sum_{i=1}^{n} x_i.$$

In turn, we have that

$$\bar{\Delta}_g - \Delta_g(w) = \frac{1}{n}\left( \sum_{i=n+N_g+1}^{2n+N_g} x_i - \sum_{i=1}^{n} x_i \right) = \frac{1}{n} \sum_{i=1}^{n} Z_i,$$

where $Z_i = x_{i+n+N_g} - x_i$. Each $Z_i$ is a random vector with mean $\mathbb{E}[Z_i] = 0$ and covariance matrix $2\Sigma_g$, noting that $x_i$ and $x_{i+n+N_g}$ are drawn independently. Further, $|Z_i(k)| \le 2$ for every coordinate $k \in [d]$.
Estimating $\Delta_g(w)$. Algorithm 3 has access to $2n + N_g$ agents from group $g$, takes as input a vector $w$, and outputs an estimate of $\Delta_g(w)$.

Algorithm 3 Estimating $\Delta_g(w)$
  Post $w_1 = 0$;
  for $i = 1$ to $n$ do
    Principal observes agent $i$'s true feature vector $x_i$ and his true label $y_i$;
  end for
  Post $w_2 = w$;
  for $i = n + 1$ to $n + N_g$ do
    Agent $i$ plays his true feature vector $x_i$;
  end for
  for $i = n + N_g + 1$ to $2n + N_g$ do
    Principal observes agent $i$'s modified feature vector $\hat{x}_i = x_i + \Delta_g(w)$;
  end for
  Output $\hat{\Delta}_g \leftarrow \frac{1}{n}\left( \sum_{i=n+N_g+1}^{2n+N_g} \hat{x}_i - \sum_{i=1}^{n} x_i \right)$;

Claim A.4. Assume that $\|x_i\| \le 1$ for all $i$. Then, with probability at least $1 - \delta$, the output $\hat{\Delta}_g$ of Algorithm 3 satisfies
$$\left\| \hat{\Delta}_g - \Delta_g(w) \right\|_2^2 \le \frac{8d}{n}\log(2d/\delta).$$

Proof. First, we note that
$$\sum_{i=n+N_g+1}^{2n+N_g} \hat{x}_i = \sum_{i=n+N_g+1}^{2n+N_g} x_i + n\,\Delta_g(w).$$
In turn, we have that
$$\left\| \hat{\Delta}_g - \Delta_g(w) \right\|_2 = \frac{1}{n}\left\| \sum_{i=n+N_g+1}^{2n+N_g} x_i - \sum_{i=1}^{n} x_i \right\|_2 = \frac{1}{n}\left\| \sum_{i=1}^{n} Z_i \right\|_2,$$
where $Z_i = x_{i+n+N_g} - x_i$. Each $Z_i$ is a random vector with mean $\mathbb{E}[Z_i] = 0$ and covariance matrix $2\Sigma_g$ (noting that $x_i$ and $x_{i+n+N_g}$ are drawn independently), and $|Z_i(k)| \le 2$ for every coordinate $k \in [d]$. By Hoeffding's inequality, with probability at least $1 - \delta/d$, for a given $k \in [d]$,
$$\left| \sum_{i=1}^{n} Z_i(k) \right| \le \sqrt{8n\log(2d/\delta)}.$$
By a union bound, with probability at least $1 - \delta$ this holds simultaneously for all $k \in [d]$, which directly yields
$$\left\| \sum_{i=1}^{n} Z_i \right\|_2^2 \le 8dn\log(2d/\delta).$$
Dividing by $n^2$ immediately leads to the result.
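Similarly, here is a minimal simulation of Algorithm 3's difference-of-means estimator (again our own sketch; we use Gaussian features for convenience, so Claim A.4's boundedness assumption is replaced by sub-Gaussian concentration, which yields the same rate):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 20000
P_g = np.diag([1.0, 1.0, 1.0, 0.0, 0.0])       # group g's information subspace
A_g_inv = np.diag([1.0, 0.5, 2.0, 1.0, 1.0])   # inverse cost matrix
w = rng.normal(size=d)
delta_true = A_g_inv @ P_g @ w                 # Delta_g(w) = A_g^{-1} Pi_g w

def sample_features(n):
    return (P_g @ rng.normal(size=(d, n))).T   # features supported on S_g

X_unmod = sample_features(n)                   # batch observed under w_1 = 0
X_mod = sample_features(n) + delta_true        # batch under w_2 = w, after N_g rounds
delta_hat = X_mod.mean(axis=0) - X_unmod.mean(axis=0)
print(np.linalg.norm(delta_hat - delta_true))  # concentrates at rate O(sqrt(d/n))
```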
B. Supplementary Material for Section 3

B.1. Omitted Proofs from Subsection 3.1

Proof of Lemma 3.1. We first identify the rules $\tilde{w}$ that are solutions of the error-minimization part of Eq. (3):
$$\tilde{w} \in \arg\min_{w'} \sum_{i \in [N_g]} \left( x_{g,i}^\top w' - \hat{y}_{g,i} \right)^2 = \arg\min_{w'} \sum_{i \in [N_g]} \left( (\Pi_g x_{g,i})^\top w' - \hat{y}_{g,i} \right)^2 =: \arg\min_{w'} f(w'), \quad (6)$$
where the second equality is due to the fact that, since $x_{g,i} \sim \mathcal{D}_g$, it holds that $\Pi_g x_{g,i} = x_{g,i}$. To solve the minimization problem of Eq. (6), we take first-order conditions, so at the optimum $\tilde{w}$:
$$f'(\tilde{w}) = 0 \iff 2\sum_{i \in [N_g]} (\Pi_g x_{g,i})\left( (\Pi_g x_{g,i})^\top \tilde{w} - \hat{y}_{g,i} \right) = 0. \quad (7)$$
Now, $\tilde{w} = \Pi_g w$ is one of the solutions of Eq. (7), since $\hat{y}_{g,i} = (\Pi_g x_{g,i})^\top w = (\Pi_g x_{g,i})^\top \Pi_g w$. Next, we argue that, due to the norm-minimization rule we use for tie-breaking, it is also the unique solution. To do so, let $\tilde{w}$ be a norm-minimizing solution of Eq. (7), and write $\tilde{w} = \Pi_g w + x'$, where $x'$ is an arbitrary vector; note that this is without loss of generality. We can decompose $\tilde{w} = \Pi_g w + \Pi_g x' + \Pi_g^\perp x'$ (where $\Pi_g^\perp x'$ is the projection of $x'$ onto the subspace orthogonal to $S_g$). Now, note that $\Pi_g \tilde{w} = \Pi_g w + \Pi_g x' + \Pi_g\Pi_g^\perp x' = \Pi_g w + \Pi_g x'$ is also a solution to Eq. (7), as
$$\sum_{i \in [N_g]} (\Pi_g x_{g,i})\left( (\Pi_g x_{g,i})^\top \Pi_g \tilde{w} - \hat{y}_{g,i} \right) = \sum_{i \in [N_g]} (\Pi_g x_{g,i})\left( x_{g,i}^\top \Pi_g^\top \Pi_g \tilde{w} - \hat{y}_{g,i} \right) = \sum_{i \in [N_g]} (\Pi_g x_{g,i})\left( x_{g,i}^\top \Pi_g \tilde{w} - \hat{y}_{g,i} \right),$$
where the last step is due to the fact that $\Pi_g$ represents an orthogonal projection, hence $\Pi_g^\top = \Pi_g$ and $\Pi_g\Pi_g = \Pi_g$. Further, if $\Pi_g^\perp x' \neq 0$, we have that
$$\|\tilde{w}\|^2 = \left\| \Pi_g w + \Pi_g x' \right\|^2 + \left\| \Pi_g^\perp x' \right\|^2 > \left\| \Pi_g w + \Pi_g x' \right\|^2 = \left\| \Pi_g \tilde{w} \right\|^2$$
by orthogonality of $\Pi_g(w + x')$ and $\Pi_g^\perp x'$. This contradicts the fact that $\tilde{w}$ is a norm-minimizing solution of Eq. (7). Therefore, we have $\tilde{w} = \Pi_g w + \Pi_g x'$. Using this together with $\hat{y}_{g,i} = (\Pi_g x_{g,i})^\top \Pi_g w$, the left-hand side of Eq. (7) becomes
$$2\sum_{i \in [N_g]} (\Pi_g x_{g,i})\left( (\Pi_g x_{g,i})^\top \Pi_g w + (\Pi_g x_{g,i})^\top \Pi_g x' - (\Pi_g x_{g,i})^\top \Pi_g w \right) = 2\sum_{i \in [N_g]} (\Pi_g x_{g,i})(\Pi_g x_{g,i})^\top x', \quad (8)$$
where the last equality also uses the fact that $\Pi_g\Pi_g^\perp = 0_{d \times d}$. We next prove a technical lemma.

Lemma B.1. Let $Z \triangleq \sum_{i \in [N_g]} (\Pi_g x_{g,i})(\Pi_g x_{g,i})^\top$ be full-rank in subspace $S_g$. Then, for any vector $x' \in \mathbb{R}^d$, it holds that $Z(\Pi_g x') = 0$ if and only if $\Pi_g x' = 0$.

Proof of Lemma B.1. If $\Pi_g x' = 0$, then clearly $Z(\Pi_g x') = 0$. So, assume that $Z(\Pi_g x') = 0$. Let $r = \dim(S_g)$ (hence $r = \mathrm{rank}(Z)$). Denote by $v_1, \ldots, v_r$ the eigenvectors of $Z$ corresponding to eigenvalues $\lambda_1, \ldots, \lambda_r$ with $\lambda_i > 0$ for $i \in [r]$; note that $(v_1, \ldots, v_r)$ spans $S_g$. The remaining eigenvectors, corresponding to the zero eigenvalues ($\lambda_i = 0$ for $i \in [r+1, d]$), are denoted $v_{r+1}, \ldots, v_d$. Without loss of generality, we take each $v_i$ to have norm 1; since $Z$ is a symmetric matrix, $(v_1, \ldots, v_r)$ is an orthonormal basis for $S_g$ and $(v_1, \ldots, v_d)$ is an orthonormal basis for $\mathbb{R}^d$. Let $V$ denote the $d \times d$ matrix $[v_1\ v_2\ \cdots\ v_d]$, i.e., the change of basis that transforms the standard basis into $(v_1, \ldots, v_d)$. By orthonormality of $(v_1, \ldots, v_d)$, $V$ is unitary (i.e., $V^\top V = I$). In turn,
$$Z(\Pi_g x') = V V^\top Z V V^\top \Pi_g x'. \quad (9)$$
Let us define $P_1 = V^\top Z V$ and $P_2 = V^\top \Pi_g x'$. $P_1$ is a diagonal matrix with $\lambda_1, \ldots, \lambda_d$ on the diagonal (hence it has positive entries up to row $r$ and zeros for rows in $\{r+1, \ldots, d\}$). Also, $P_2 = V^\top \Pi_g x' = [a_1\ \cdots\ a_r\ 0\ \cdots\ 0]^\top$, where $a_i = v_i^\top (\Pi_g x')$. Substituting the values of $P_1, P_2$ into Eq. (9), we have that
$$Z(\Pi_g x') = V\,[\lambda_1 a_1\ \cdots\ \lambda_r a_r\ 0\ \cdots\ 0]^\top.$$
But $Z(\Pi_g x') = 0$ if and only if $\lambda_i a_i = 0$ for all $i \in [r]$, because $V$ is invertible. Since $\lambda_i > 0$ for $i \in [r]$, it must be that $a_i = 0$. But then $V^\top \Pi_g x' = 0$, and since $V$ is invertible, this implies that $\Pi_g x' = 0$.

Defining $Z$ as above (given enough samples from the peer dataset, i.e., a large enough $N_g$, one can guarantee that $Z$ is full-rank in $S_g$), it follows from Lemma B.1 that Eq. (8) equals 0 if and only if $\Pi_g x' = 0$. This directly yields $\tilde{w} = \Pi_g w + \Pi_g x' = \Pi_g w$.

B.2. Omitted Proofs from Subsection 3.2

Proof of Lemma 3.2. The function in Eq. (4) is concave. At the optimum $x'$, the first-order conditions give
$$\nabla_{x'} u(x, x'; g) = \Pi_g w - A_g(x' - x) = 0.$$
Solving the latter in terms of $x'$, and using the fact that the matrix $A_g$ is positive definite (hence also invertible), gives the result.

B.3. Omitted Proofs from Subsection 3.3

Lemma B.2. Let $Q \in \mathbb{R}^{d \times d}$ be a symmetric positive definite matrix and $c$ a non-zero vector in $\mathbb{R}^d$. Then, the optimization problem
$$\max_{x \in \mathbb{R}^d} c^\top x \quad \text{s.t.} \quad x^\top Q x \le b$$
has the unique solution
$$x = \frac{\sqrt{b}\, Q^{-1} c}{\sqrt{c^\top Q^{-1} c}}.$$

Proof. We first write the Lagrangian:
$$L(x, \lambda) = c^\top x - \frac{\lambda}{2}\left( x^\top Q x - b \right). \quad (10)$$
The KKT conditions are:
$$c - \lambda Q x = 0 \quad (11)$$
$$\lambda \ge 0 \quad (12)$$
$$\lambda\left( x^\top Q x - b \right) = 0 \quad (13)$$
$$x^\top Q x \le b \quad (14)$$
At the maximum it must be the case that $\lambda > 0$: from Eq. (12), $\lambda \ge 0$, and $\lambda = 0$ would contradict Eq. (11) since $c \neq 0$. Hence, combining Eqs. (13) and (14), we get $x^\top Q x = b$. Since $\lambda > 0$, solving Eq. (11) in terms of $x$ and using the fact that $Q$ is symmetric positive definite gives
$$x = \frac{1}{\lambda} Q^{-1} c. \quad (15)$$
Substituting the above into the equation $x^\top Q x = b$, we obtain
$$\frac{1}{\lambda^2}\, c^\top Q^{-1} c = b. \quad (16)$$
Solving this in terms of $\lambda$ gives $\lambda = \sqrt{c^\top Q^{-1} c / b}$. Substituting $\lambda$ into Eq. (15) gives the result. The proof is completed by the fact that the objective function is linear (hence concave) and the feasible set is convex; hence the global optimum is found at a KKT point.

Proof of Lemma 3.3. We rely on Lemma B.2 (formally stated and proved above): if $Q \in \mathbb{R}^{d \times d}$ is a symmetric positive definite matrix and $c$ a vector in $\mathbb{R}^d$, then the problem $\max_x c^\top x$ subject to $x^\top Q x \le b$ has the unique solution $x = \sqrt{b}\, Q^{-1} c / \sqrt{c^\top Q^{-1} c}$. Using the closed form of the agents' best response from Lemma 3.2 in Eq. (2), we get:
$$w^{SW} = \arg\max_{w:\, \|w\|_2 \le 1}\ \mathbb{E}_{x \sim \mathcal{D}_1}\left[ \langle w^\star, \hat{x}(x; 1) \rangle \right] + \mathbb{E}_{x \sim \mathcal{D}_2}\left[ \langle w^\star, \hat{x}(x; 2) \rangle \right]$$
$$= \arg\max_{w:\, \|w\|_2 \le 1}\ \mathbb{E}_{x \sim \mathcal{D}_1}\left[ \langle w^\star, x + \Delta_1(w) \rangle \right] + \mathbb{E}_{x \sim \mathcal{D}_2}\left[ \langle w^\star, x + \Delta_2(w) \rangle \right]$$
$$= \arg\max_{w:\, \|w\|_2 \le 1}\ \left\langle \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right) w,\, w^\star \right\rangle, \quad (17)$$
where the last step drops the terms $\mathbb{E}[\langle w^\star, x \rangle]$, which do not depend on $w$. We rewrite the objective function as
$$\left( w^\star \right)^\top \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right) w = \left[ \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star \right]^\top w,$$
and the constraint on $w$ remains $w^\top w \le 1$. This problem is an instance of the problem solved in Lemma B.2 with $c = \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star$, $b = 1$, and $Q$ the identity matrix. Substituting $c, Q, b$ into the solution of Lemma B.2 gives the result.
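As a quick numerical check of Lemma B.2 (our own sketch, not part of the paper), we can compare the closed-form maximizer against a generic constrained solver:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d, b = 4, 1.0
M = rng.normal(size=(d, d))
Q = M @ M.T + d * np.eye(d)        # symmetric positive definite
c = rng.normal(size=d)

# Closed form from Lemma B.2: x = sqrt(b) Q^{-1} c / sqrt(c^T Q^{-1} c).
Qinv_c = np.linalg.solve(Q, c)
x_closed = np.sqrt(b) * Qinv_c / np.sqrt(c @ Qinv_c)

# Numerical check: maximize c^T x subject to x^T Q x <= b.
res = minimize(lambda x: -c @ x, np.zeros(d),
               constraints=[{"type": "ineq", "fun": lambda x: b - x @ Q @ x}])
assert np.allclose(res.x, x_closed, atol=1e-4)
```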
C. Generalizing to Multiple Groups

Let $G$ denote the set of all groups, i.e., $G = \{1, 2, \ldots, m\}$. As is customary in the literature, we use $G_{-j}$ to denote all the groups apart from group $j$, i.e., $G_{-j} = \{1, \ldots, j-1, j+1, \ldots, m\}$. In order to explain how the theorem and proposition statements change when $m > 2$, we first outline how the principal's equilibrium rule changes as a result of the presence of $m > 2$ groups. Since the estimated rule for each group $g \in G$ is $w_{\mathrm{est}}(g) = \Pi_g w$, extending Lemma 3.3 shows that the principal's equilibrium rule becomes:
$$w^{SW} = \frac{\left( \Pi_1 A_1^{-1} + \cdots + \Pi_m A_m^{-1} \right) w^\star}{\left\| \left( \Pi_1 A_1^{-1} + \cdots + \Pi_m A_m^{-1} \right) w^\star \right\|}. \quad (18)$$

We first analyze the do-no-harm objective for the case that $m > 2$ groups are present in the population. The analogue of Theorem 4.3 for $m > 2$ groups follows.

Theorem C.1. In equilibrium, there is no negative externality for any group and any $w^\star$ if and only if, for all $g \in G$, the matrix
$$\left( \sum_{i \in G} A_i^{-1}\Pi_i \right)\Pi_g A_g^{-1} + A_g^{-1}\Pi_g\left( \sum_{i \in G} \Pi_i A_i^{-1} \right)$$
is PSD.

This means that we can still guarantee that there is no negative externality for any of the groups in equilibrium in the two cases of interest, namely:
1. when the cost matrices are proportional to each other, i.e., $A_i = c_{ij}A_j$ for all $(i, j) \in G^2$ and some scalars $c_{ij} > 0$ (analogue of Proposition 4.4 for $m > 2$ groups);
2. when the subspaces $S_1, \ldots, S_m$ are orthogonal (analogue of Proposition 4.5 for $m > 2$ groups).

To derive the aforementioned results, the only change in the proofs of Theorem 4.3 and Propositions 4.4 and 4.5 is that $w^{SW}$ should be substituted with the expression in Equation (18).

We proceed to discussing total improvement for $m > 2$ groups. We find it useful to present a slightly generalized version of the overlap proxy.

Definition C.2. Given a scoring rule $w \in \mathbb{R}^d$ and projections $\Pi_1, \ldots, \Pi_m \in \mathbb{R}^{d \times d}$, we define the overlap proxy between any two groups $G_i, G_k$ with respect to $w$ to be $r_{i,k}(w) \triangleq \left\| \Pi_i w - \Pi_k w \right\|$.

Using this definition, we can state the direct generalization of Lemma 4.7.

Lemma C.3. Let $\mathrm{diff}_{j,k}(w) \triangleq |I_j(w) - I_k(w)|$ be the disparity in total improvement across groups when the principal's rule is $w$. In equilibrium, if $A_j = A_k = I_{d \times d}$, then $\mathrm{diff}_{j,k}(w^{SW}) \le r_{j,k}(w^\star)$. Further, equality holds if and only if $\Pi_j w^\star$ and $\Pi_k w^\star$ are colinear.

The analogue of Theorem 4.9 for $m > 2$ becomes:

Theorem C.4. In equilibrium, the groups obtain equal total improvement for all $w^\star$ if and only if $A_1^{-1}\Pi_1 A_1^{-1} = A_2^{-1}\Pi_2 A_2^{-1} = \cdots = A_m^{-1}\Pi_m A_m^{-1}$.

Finally, we turn our attention to the per-unit improvement, and we state the analogue of Theorem 4.11 for $m > 2$ groups. This analogue is again derived using Equation (18) for $w^{SW}$.

Theorem C.5. In equilibrium, group $g$ gets optimal per-unit improvement if and only if
$$\left\langle \frac{A_g^{-1}\Pi_g\left( A_g^{-1}\Pi_g \right)w^\star}{\left\| \Pi_g\left( A_g^{-1}\Pi_g \right)w^\star \right\|_2} - \frac{A_g^{-1}\Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 + \cdots + A_m^{-1}\Pi_m \right)w^\star}{\left\| \Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 + \cdots + A_m^{-1}\Pi_m \right)w^\star \right\|_2},\ w^\star \right\rangle = 0.$$

Note that this means that, in equilibrium, optimal per-unit outcome improvement is guaranteed if there exists $c_g > 0$ such that
$$\Pi_g\left( A_g^{-1}\Pi_g \right)w^\star = c_g\,\Pi_g\left( A_1^{-1}\Pi_1 + \cdots + A_m^{-1}\Pi_m \right)w^\star.$$
Two notable examples for which this condition holds are:
1. when all of $S_1, \ldots, S_m$ are orthogonal to each other;
2. when $A_i = c_{ij}A_j$ and $\Pi_i = \Pi_j$ for all $(i, j) \in G^2$.

D. Supplementary Material for Section 4

D.1. Omitted Proofs from Subsection 4.1

Proof of Theorem 4.3. By Definition 2.2, having no negative externality in equilibrium translates to:
$$\forall g:\ I_g(w^{SW}) \ge 0$$
$$\iff \left\langle A_g^{-1}\Pi_g \frac{\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star}{\left\| \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star \right\|_2},\ w^\star \right\rangle \ge 0 \quad \text{(Lemma 3.3)}$$
$$\iff \left\langle \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)\Pi_g^\top A_g^{-\top} w^\star,\ w^\star \right\rangle \ge 0 \quad \text{(dropping the positive normalization; } (AB)^\top = B^\top A^\top \text{ and } \left(A^\top\right)^\top = A\text{)}$$
$$\iff \left\langle \left( A_1^{-1}\Pi_1\Pi_g^\top A_g^{-\top} + A_2^{-1}\Pi_2\Pi_g^\top A_g^{-\top} \right) w^\star,\ w^\star \right\rangle \ge 0 \quad ((A + B)C = AC + BC)$$
Using the fact that $\Pi_g^\top = \Pi_g$ (as orthogonal projection matrices are symmetric) and that $A_g^{-\top} = A_g^{-1}$ (as $A_g$ is a symmetric matrix), we obtain the condition $q(w^\star) \ge 0$, where $q(w^\star) = (w^\star)^\top M w^\star$ is a quadratic form with $M = A_1^{-1}\Pi_1\Pi_g A_g^{-1} + A_2^{-1}\Pi_2\Pi_g A_g^{-1}$. By standard linear-algebra arguments, noting that $(w^\star)^\top M w^\star = \left( (w^\star)^\top M w^\star \right)^\top = (w^\star)^\top M^\top w^\star$, we can rewrite $q(w^\star) = \frac{1}{2}(w^\star)^\top\left( M + M^\top \right)w^\star$. The condition then holds for all $w^\star$ if and only if
$$M + M^\top = A_1^{-1}\Pi_1\Pi_g A_g^{-1} + A_2^{-1}\Pi_2\Pi_g A_g^{-1} + A_g^{-1}\Pi_g\Pi_1 A_1^{-1} + A_g^{-1}\Pi_g\Pi_2 A_2^{-1}$$
is PSD. This concludes the proof.
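The following sketch (ours, with hypothetical diagonal projections so that all matrices commute) illustrates Theorem C.1's condition in the proportional-costs case: the symmetrized matrix is PSD for every group, so no group suffers a negative externality.

```python
import numpy as np

rng = np.random.default_rng(7)
d, m = 6, 3
projections = [np.diag((rng.random(d) < 0.5).astype(float)) for _ in range(m)]
sigmas = rng.uniform(0.5, 2.0, size=m)
costs = [s * np.eye(d) for s in sigmas]    # proportional costs: case 1 above

def w_sw(w_star):
    # Equilibrium rule of Equation (18).
    v = sum(P @ np.linalg.inv(A) for A, P in zip(costs, projections)) @ w_star
    return v / np.linalg.norm(v)

# Theorem C.1's condition: for each g, the symmetrized matrix below must be PSD.
T = sum(np.linalg.inv(A) @ P for A, P in zip(costs, projections))
for A_g, P_g in zip(costs, projections):
    M = T @ P_g @ np.linalg.inv(A_g)
    eigs = np.linalg.eigvalsh(M + M.T)
    assert eigs.min() >= -1e-10            # PSD: no negative externality for group g
```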
Proof of Corollary 4.4. Fix a group $g \in \{1, 2\}$ (w.l.o.g., let $g = 1$), write $A_2 = cA_1$ for some scalar $c > 0$, and let $A = A_1^{-1}$. Then, from Theorem 4.3, no negative externality for group $g$ is guaranteed if and only if, for all $w^\star$,
$$\left( w^\star \right)^\top \left( A\Pi_1 A + \frac{1}{c}\,A\Pi_2\Pi_1 A \right) w^\star \ge 0. \quad (19)$$
Since positive scalar factors do not affect positive semi-definiteness, Eq. (19) holds if and only if the matrix $A\Pi_1 A + A\Pi_2\Pi_1 A$ is PSD. Matrix $\Pi_1$ is by definition PSD. Matrix $A_1$ is PD, hence its inverse $A$ is also PD. As a result, matrix $A\Pi_1 A$ is PSD. We shift our attention to matrix $A\Pi_2\Pi_1 A$. Since $\Pi_1, \Pi_2$ are projection matrices, the eigenvalues of their product $\Pi_2\Pi_1$ are non-negative (Anderson Jr et al., 1985). Recall that a matrix is PSD if and only if its eigenvalues are non-negative; as a result, matrix $\Pi_2\Pi_1$ is PSD. Using the same property as above (i.e., that if $B$ is PSD, then so is $ABA$ for symmetric $A$), we can conclude that $A\Pi_2\Pi_1 A$ is PSD. If two matrices are PSD, then so is their sum. Hence, matrix $A\Pi_1 A + A\Pi_2\Pi_1 A$ is PSD, i.e., by definition, for any vector $z$ we have that $z^\top\left( A\Pi_1 A + \frac{1}{c}\,A\Pi_2\Pi_1 A \right)z \ge 0$. This concludes our proof.

Proof of Corollary 4.5. Fix a group $g \in \{1, 2\}$ (w.l.o.g., let $g = 1$). From Theorem 4.3 we need:
$$\left\langle \left( A_1^{-1}\Pi_1^2 A_1^{-1} + A_2^{-1}\Pi_2\Pi_1 A_1^{-1} \right) w^\star,\ w^\star \right\rangle \ge 0 \iff \left\langle A_1^{-1}\Pi_1 A_1^{-1} w^\star,\ w^\star \right\rangle \ge 0, \quad (20)$$
where for the last equivalence we used $\Pi_2\Pi_1 = 0$ (as the subspaces $S_1, S_2$ are orthogonal) and $\Pi_g^2 = \Pi_g$ (as orthogonal projection matrices). Eq. (20) holds since the matrices $\Pi_1$ and $A_1^{-1}$ are PSD.

D.2. Omitted Proofs from Subsection 4.2

Proof of Lemma 4.7. Note that
$$I_1(w^{SW}) - I_2(w^{SW}) = \left\langle \Pi_1 \frac{(\Pi_1 + \Pi_2)w^\star}{\left\| (\Pi_1 + \Pi_2)w^\star \right\|},\ w^\star \right\rangle - \left\langle \Pi_2 \frac{(\Pi_1 + \Pi_2)w^\star}{\left\| (\Pi_1 + \Pi_2)w^\star \right\|},\ w^\star \right\rangle = \left\langle (\Pi_1 - \Pi_2)\frac{(\Pi_1 + \Pi_2)w^\star}{\left\| (\Pi_1 + \Pi_2)w^\star \right\|},\ w^\star \right\rangle = \frac{1}{\left\| (\Pi_1 + \Pi_2)w^\star \right\|}\left\langle (\Pi_1 + \Pi_2)w^\star,\ (\Pi_1 - \Pi_2)w^\star \right\rangle.$$
By Cauchy–Schwarz, we have that
$$\frac{1}{\left\| (\Pi_1 + \Pi_2)w^\star \right\|}\left\langle (\Pi_1 + \Pi_2)w^\star,\ (\Pi_1 - \Pi_2)w^\star \right\rangle \le \frac{1}{\left\| (\Pi_1 + \Pi_2)w^\star \right\|}\left\| (\Pi_1 + \Pi_2)w^\star \right\|\left\| (\Pi_1 - \Pi_2)w^\star \right\| = \left\| (\Pi_1 - \Pi_2)w^\star \right\| = r_{1,2}(w^\star),$$
with equality if and only if $(\Pi_1 + \Pi_2)w^\star$ and $(\Pi_1 - \Pi_2)w^\star$ are colinear, i.e., there exists $\alpha \in \mathbb{R}$ such that $\alpha(\Pi_1 + \Pi_2)w^\star = (\Pi_1 - \Pi_2)w^\star$, which can be equivalently written as $(1 - \alpha)\Pi_1 w^\star = (1 + \alpha)\Pi_2 w^\star$, i.e., $\Pi_1 w^\star$ and $\Pi_2 w^\star$ are colinear.

Proof of Theorem 4.9. Equal total outcome improvement across groups is guaranteed in equilibrium if and only if the following holds for all $w^\star$:
$$I_1(w^{SW}) - I_2(w^{SW}) = 0 \quad \text{(Definition 2.2)}$$
$$\iff \left\langle A_1^{-1}\Pi_1\frac{\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star}{\left\| \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star \right\|_2} - A_2^{-1}\Pi_2\frac{\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star}{\left\| \left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star \right\|_2},\ w^\star \right\rangle = 0$$
$$\iff \left\langle \left[ A_1^{-1}\Pi_1 - A_2^{-1}\Pi_2 \right]\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)^\top w^\star,\ w^\star \right\rangle = 0$$
$$\iff \left\langle \left( A_1^{-1}\Pi_1\Pi_1^\top A_1^{-1} + A_1^{-1}\Pi_1\Pi_2^\top A_2^{-1} - A_2^{-1}\Pi_2\Pi_1^\top A_1^{-1} - A_2^{-1}\Pi_2\Pi_2^\top A_2^{-1} \right) w^\star,\ w^\star \right\rangle = 0$$
$$\iff \left\langle \left( A_1^{-1}\Pi_1 A_1^{-1} + A_1^{-1}\Pi_1\Pi_2 A_2^{-1} - A_2^{-1}\Pi_2\Pi_1 A_1^{-1} - A_2^{-1}\Pi_2 A_2^{-1} \right) w^\star,\ w^\star \right\rangle = 0,$$
where the second transition is due to Lemma 3.3 (dropping the positive normalization), the third is due to $Av - Bv = (A - B)v$, the fourth is due to $(AB)^\top = B^\top A^\top$ and $\left(A^\top\right)^\top = A$, and the last is due to the fact that $\Pi_g^\top = \Pi_g$, $\Pi_g^2 = \Pi_g$, and $A_g^{-\top} = A_g^{-1}$. Let $M \triangleq A_1^{-1}\Pi_1 A_1^{-1} + A_1^{-1}\Pi_1\Pi_2 A_2^{-1} - A_2^{-1}\Pi_2\Pi_1 A_1^{-1} - A_2^{-1}\Pi_2 A_2^{-1}$; the above can be written as the quadratic form $q(w^\star) = (w^\star)^\top M w^\star$. In turn, $q(w^\star) = 0$ simultaneously for all $w^\star$ (i.e., $q \equiv 0$) if and only if $M$ is skew-symmetric, which means $M + M^\top = 0$. Since the cross terms cancel upon symmetrization, this can be rewritten as
$$2A_1^{-1}\Pi_1 A_1^{-1} - 2A_2^{-1}\Pi_2 A_2^{-1} = 0.$$
This concludes the proof.
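A small numerical check of Lemma 4.7 (our own illustration): with identity costs, the disparity in total improvement is indeed bounded by the overlap proxy $r_{1,2}(w^\star)$.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
P1 = np.diag([1, 1, 1, 0, 0]).astype(float)
P2 = np.diag([0, 0, 1, 1, 1]).astype(float)

# With A_1 = A_2 = I, the equilibrium rule is (P1 + P2) w*, normalized.
v = (P1 + P2) @ w_star
w_sw = v / np.linalg.norm(v)

I1 = (P1 @ w_sw) @ w_star                   # total improvement of group 1
I2 = (P2 @ w_sw) @ w_star                   # total improvement of group 2
r12 = np.linalg.norm((P1 - P2) @ w_star)    # overlap proxy r_{1,2}(w*)
assert abs(I1 - I2) <= r12 + 1e-12          # the bound of Lemma 4.7 / Lemma C.3
print(abs(I1 - I2), r12)
```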
D.3. Omitted Proofs from Subsection 4.3

Proof of Proposition 4.10. We focus on a two-dimensional example for clarity of exposition. To abstract away from discrepancies in the cost matrices, we assume that $A_1 = A_2 = I_{2 \times 2}$, and that the projection matrices of the two groups are
$$\Pi_1 = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad \Pi_2 = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.$$
Next, we select $\varepsilon > 0$ such that $\varepsilon^2 - \left( 1 - \varepsilon^2 \right) < -\alpha$, and assume that the distribution we face is such that the optimal outcome-decision rule is $w^\star = \left[ \varepsilon,\ \sqrt{1 - \varepsilon^2} \right]^\top$. From Lemma 3.3, substituting the values of $A_1, A_2, \Pi_1, \Pi_2$ as defined above, we have that the $w$ maximizing the social welfare satisfies
$$w^{SW} = \left[ \varepsilon,\ \sqrt{1 - \varepsilon^2} \right]^\top.$$
Substituting $A_1, A_2, \Pi_1, \Pi_2$ in $\Delta_g(w) = A_g^{-1}\Pi_g w$, we have $I_1(w^{SW}) = \varepsilon^2$ and $I_2(w^{SW}) = 1 - \varepsilon^2$. Next, we compute $uI_1(w_1)$ and $uI_2(w_2)$. Substituting $A_1, A_2, \Pi_1, \Pi_2$ in the definition of $\Delta_g(w)$, and by Lemma 3.5, we have $w_1 = [1, 0]^\top$ and $w_2 = [0, 1]^\top$. Finally:
$$uI_1(w^{SW}) = uI_1\!\left( \frac{[\varepsilon, 0]^\top}{\left\| [\varepsilon, 0]^\top \right\|_2} \right) = uI_1\!\left( [1, 0]^\top \right) = \varepsilon = \max_{w'} uI_1(w') = uI_1(w_1),$$
$$uI_2(w^{SW}) = uI_2\!\left( [0, 1]^\top \right) = \sqrt{1 - \varepsilon^2} = \max_{w'} uI_2(w') = uI_2(w_2).$$
However, for the total outcome improvement,
$$I_1(w^{SW}) - I_2(w^{SW}) = \varepsilon^2 - \left( 1 - \varepsilon^2 \right) < -\alpha,$$
which concludes the proof.

Proof of Theorem 4.11. Using Lemmas 3.3 and 3.5 and Definition 2.2, we get that $w^{SW}$ induces optimal per-unit outcome improvement for group $g$ if and only if:
$$w^{SW} \in \arg\max_{w'} uI_g(w') \iff uI_g(w_g) - uI_g(w^{SW}) = 0$$
$$\iff \left\langle A_g^{-1}\frac{\Pi_g w_g}{\left\| \Pi_g w_g \right\|_2} - A_g^{-1}\frac{\Pi_g w^{SW}}{\left\| \Pi_g w^{SW} \right\|_2},\ w^\star \right\rangle = 0 \quad \text{(definition of } uI_g(\cdot)\text{)}$$
$$\iff \left\langle \frac{A_g^{-1}\Pi_g\left( A_g^{-1}\Pi_g \right)w^\star}{\left\| \Pi_g\left( A_g^{-1}\Pi_g \right)w^\star \right\|_2} - \frac{A_g^{-1}\Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)w^\star}{\left\| \Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)w^\star \right\|_2},\ w^\star \right\rangle = 0,$$
where the last transition substitutes $w_g = \left( A_g^{-1}\Pi_g \right)w^\star / \left\| \left( A_g^{-1}\Pi_g \right)w^\star \right\|_2$ and the closed form of $w^{SW}$ (Lemmas 3.3 and 3.5), uses the fact that $\Pi_g(v/c)/\left\| \Pi_g(v/c) \right\|_2 = \Pi_g v/\left\| \Pi_g v \right\|_2$ for any scalar $c > 0$ (so the outer normalizations cancel), and uses $\Pi_g\Pi_g = \Pi_g$ as $\Pi_g$ is an orthogonal projection.

Lemma D.1. In equilibrium, optimal per-unit outcome improvement is guaranteed if there exists $c_g > 0$ such that $\Pi_g\left( A_g^{-1}\Pi_g \right)w^\star = c_g\,\Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)w^\star$.

Proof of Lemma D.1. Assume the condition in the statement holds, and denote $v = \Pi_g\left( A_g^{-1}\Pi_g \right)w^\star = c_g\,\Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)w^\star$. By Theorem 4.11, we know that, for any group $g \in \{1, 2\}$, optimal per-unit outcome improvement is guaranteed if and only if the following holds:
$$\left\langle \frac{A_g^{-1}\Pi_g\left( A_g^{-1}\Pi_g \right)w^\star}{\left\| \Pi_g\left( A_g^{-1}\Pi_g \right)w^\star \right\|_2} - \frac{A_g^{-1}\Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)w^\star}{\left\| \Pi_g\left( A_1^{-1}\Pi_1 + A_2^{-1}\Pi_2 \right)w^\star \right\|_2},\ w^\star \right\rangle = 0,$$
which is in our case equivalent to requiring
$$\left\langle \frac{A_g^{-1} v}{\left\| v \right\|_2} - \frac{A_g^{-1} v}{\left\| v \right\|_2},\ w^\star \right\rangle = 0$$
(the scalar $c_g$ cancels between the numerator and denominator of the second term), which holds trivially. This concludes the proof.

Proof of Corollary 4.12. The condition in Lemma D.1 can be written as
$$\Pi_g A_g^{-1} w^\star = c_g\left( \Pi_g\Pi_1 A_1^{-1} + \Pi_g\Pi_2 A_2^{-1} \right)w^\star.$$
Since $\Pi_1\Pi_2 = \Pi_2\Pi_1 = 0$, the condition holds for $c_1 = c_2 = 1$.

Proof of Corollary 4.13. Assume $\Pi_1 = \Pi_2$ and $A_1 = cA_2$. It is immediate that the condition of Lemma D.1 holds for
$$c_1 = \frac{1}{1 + c}, \qquad c_2 = \frac{c}{1 + c}.$$
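The two-dimensional construction in the proof of Proposition 4.10 can be verified numerically; the sketch below (ours) confirms that both groups attain their per-unit optimum while the total improvements differ by more than $\alpha$.

```python
import numpy as np

eps, alpha = 0.3, 0.5              # choose eps with eps^2 - (1 - eps^2) < -alpha
P1 = np.diag([1.0, 0.0])
P2 = np.diag([0.0, 1.0])
w_star = np.array([eps, np.sqrt(1 - eps**2)])

v = (P1 + P2) @ w_star             # A_1 = A_2 = I, so w_SW = w* in this example
w_sw = v / np.linalg.norm(v)

def uI(P, w):
    # Per-unit improvement of the group with projection P, at rule w.
    Pw = P @ w
    return (Pw / np.linalg.norm(Pw)) @ w_star

# Both groups are at their per-unit optimum ...
assert np.isclose(uI(P1, w_sw), eps)                  # = max_w uI_1(w)
assert np.isclose(uI(P2, w_sw), np.sqrt(1 - eps**2))  # = max_w uI_2(w)
# ... yet the total improvements are far apart.
I1, I2 = (P1 @ w_sw) @ w_star, (P2 @ w_sw) @ w_star
assert I1 - I2 < -alpha            # eps^2 - (1 - eps^2) = -0.82 < -0.5
print(I1, I2)
```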
D.4. Omitted Proofs from Subsection 4.4

As in many practical settings, the principal is expected to take into account a joint objective of predictive accuracy and social welfare. We show next that, under mild conditions, deploying any combination of the social-welfare-maximizing solution and the true underlying predictor inherits the Do-No-Harm guarantee of the social-welfare maximizer. We begin by proving three useful lemmas.

Lemma D.2. Assume Do-No-Harm is guaranteed for group $g \in G$ for each of $w_1, w_2$. Then, Do-No-Harm is guaranteed for $\alpha w_1 + \beta w_2$, for any $\alpha, \beta \ge 0$.

Proof of Lemma D.2.
$$I_g\left( \alpha w_1 + \beta w_2 \right) = \left\langle \Delta_g\left( \alpha w_1 + \beta w_2 \right),\ w^\star \right\rangle \quad \text{(Definition 2.2)}$$
$$= \left\langle A_g^{-1}\Pi_g\left( \alpha w_1 + \beta w_2 \right),\ w^\star \right\rangle \quad \text{(Lemma 3.2)}$$
$$= \alpha\left\langle A_g^{-1}\Pi_g w_1,\ w^\star \right\rangle + \beta\left\langle A_g^{-1}\Pi_g w_2,\ w^\star \right\rangle \quad \text{(linearity)}$$
$$= \alpha I_g(w_1) + \beta I_g(w_2) \quad \text{(Definition 2.2)}$$
$$\ge 0. \quad \left( I_g(w_1), I_g(w_2) \ge 0 \right)$$

Lemma D.3. Assume that for some $g \in G$, $\Pi_g A_g^{-1}$ is positive semi-definite. Then Do-No-Harm is guaranteed for group $g$ when the principal deploys $w^\star$.

Proof of Lemma D.3.
$$I_g(w^\star) = \left\langle \Delta_g(w^\star),\ w^\star \right\rangle \quad \text{(Definition 2.2)}$$
$$= \left\langle A_g^{-1}\Pi_g w^\star,\ w^\star \right\rangle \quad \text{(Lemma 3.2)}$$
$$= \left( w^\star \right)^\top \Pi_g A_g^{-1} w^\star \quad \text{(by definition)}$$
$$\ge 0. \quad \left( \Pi_g A_g^{-1} \succeq 0 \right)$$

Lemma D.4. Assume that for some $g \in G$, $\Pi_g$ and $A_g^{-1}$ commute. Then $\Pi_g A_g^{-1}$ is positive semi-definite.

Proof of Lemma D.4. For any $w$,
$$w^\top \Pi_g A_g^{-1} w = w^\top \Pi_g\Pi_g A_g^{-1} w \quad \left( \Pi_g = \Pi_g\Pi_g \right)$$
$$= w^\top \Pi_g^\top A_g^{-1}\Pi_g w \quad \left( \Pi_g = \Pi_g^\top,\ \Pi_g A_g^{-1} = A_g^{-1}\Pi_g \right)$$
$$= \left( \Pi_g w \right)^\top A_g^{-1}\Pi_g w \quad \left( w^\top\Pi_g^\top = \left( \Pi_g w \right)^\top \right)$$
$$\ge 0. \quad \left( A_g^{-1} \succ 0 \right)$$

Observation D.5. Note that the assumption that $\Pi_g A_g^{-1}$ is positive semi-definite is very intuitive: the information recovered by group $g$ regarding the principal's decision rule $w$ resides in the subspace defined by $\Pi_g$.

Proof of Theorem 4.15. The theorem follows directly by combining Lemmas E.1, D.2, D.3, and D.4.

Proof of Corollary 4.16. Since $\Pi_g = I$ for all $g$, $\Pi_g$ and $A_g^{-1}$ trivially commute, and the corollary follows directly from Lemmas D.2, D.3, and D.4.

E. On $\Pi_g$ and $A_g^{-1}$ Commuting

Lemma E.1. Suppose that for all $x \in S_g$, $A_g x \in S_g$, and for all $x \in S_g^\perp$, $A_g x \in S_g^\perp$. Then $A_g^{-1}$ and $\Pi_g$ commute. Further,
$$\mathrm{Cost}(x, x'; g) = \left( \Pi_g(x' - x) \right)^\top A_g\left( \Pi_g(x' - x) \right) + \left( \Pi_g^\perp(x' - x) \right)^\top A_g\left( \Pi_g^\perp(x' - x) \right).$$

Proof. Remember that, because $S_g^\perp$ is the orthogonal complement of $S_g$, we have that, for all $x$, $x = \Pi_g x + \Pi_g^\perp x$. Now, note that $\Pi_g A_g x = \Pi_g A_g(\Pi_g x) + \Pi_g A_g(\Pi_g^\perp x)$. Since $\Pi_g x \in S_g$, we have that $A_g(\Pi_g x) \in S_g$, hence $\Pi_g A_g(\Pi_g x) = A_g(\Pi_g x)$ (since $\Pi_g$ is the orthogonal projection operator onto $S_g$). Further, since $\Pi_g^\perp x \in S_g^\perp$, we have that $A_g(\Pi_g^\perp x) \in S_g^\perp$, leading to $\Pi_g A_g(\Pi_g^\perp x) = 0$. This yields $\Pi_g A_g x = A_g\Pi_g x$ for all $x$, directly implying that $\Pi_g A_g = A_g\Pi_g$. This can be further rewritten as $\Pi_g = A_g\Pi_g A_g^{-1}$, or equivalently $A_g^{-1}\Pi_g = \Pi_g A_g^{-1}$, showing the first part of the result. The second part follows immediately from
$$\mathrm{Cost}(x, x'; g) = (x' - x)^\top A_g(x' - x) = \left( \Pi_g(x' - x) + \Pi_g^\perp(x' - x) \right)^\top A_g\left( \Pi_g(x' - x) + \Pi_g^\perp(x' - x) \right)$$
$$= \left( \Pi_g(x' - x) \right)^\top A_g\left( \Pi_g(x' - x) \right) + \left( \Pi_g^\perp(x' - x) \right)^\top A_g\left( \Pi_g^\perp(x' - x) \right) + \left( \Pi_g(x' - x) \right)^\top A_g\left( \Pi_g^\perp(x' - x) \right) + \left( \Pi_g^\perp(x' - x) \right)^\top A_g\left( \Pi_g(x' - x) \right).$$
Since $A_g\left( \Pi_g^\perp(x' - x) \right) \in S_g^\perp$, we have that $\left( \Pi_g(x' - x) \right)^\top A_g\left( \Pi_g^\perp(x' - x) \right) = (x' - x)^\top \Pi_g A_g\left( \Pi_g^\perp(x' - x) \right) = 0$. A similar argument shows $\left( \Pi_g^\perp(x' - x) \right)^\top A_g\left( \Pi_g(x' - x) \right) = 0$, concluding the proof.

Note that this condition is a fairly natural one: it shows that the cost of modifying a feature vector $x$ can be decomposed into two independent components: modifying $\Pi_g x$, the part of $x$ that is in $S_g$, and modifying $\Pi_g^\perp x$, the part of $x$ that is in $S_g^\perp$. In turn, it indicates that feature modifications within $S_g$ do not affect feature modifications within $S_g^\perp$, and vice versa. One reason this is natural is that agents in group $g$ are only aware of $\Pi_g x$ in the first place, and only consider feature modifications that are entirely contained within $S_g$; in this case, feature modifications within $S_g^\perp$ do not matter, as they are never considered by the agents, who are only interested in improving $(\Pi_g w)^\top x = w^\top(\Pi_g x)$ (where $w$ is the deployed rule).
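A quick numerical illustration of Lemma E.1 (our own sketch): a block-diagonal cost matrix that maps $S_g$ to $S_g$ and $S_g^\perp$ to $S_g^\perp$ commutes with $\Pi_g$, and the quadratic cost splits across the two subspaces.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 5, 2                        # S_g = span of the first k coordinates
P = np.diag([1.0] * k + [0.0] * (d - k))
P_perp = np.eye(d) - P

# A_g maps S_g to S_g and S_g^perp to S_g^perp: block-diagonal PD structure.
B1 = rng.normal(size=(k, k))
B2 = rng.normal(size=(d - k, d - k))
A = np.block([[B1 @ B1.T + np.eye(k), np.zeros((k, d - k))],
              [np.zeros((d - k, k)), B2 @ B2.T + np.eye(d - k)]])

# Commutation of Pi_g with A_g^{-1}, as established in Lemma E.1.
A_inv = np.linalg.inv(A)
assert np.allclose(P @ A_inv, A_inv @ P)

# Cost decomposition: (x'-x)^T A (x'-x) splits across S_g and its complement.
v = rng.normal(size=d)             # v = x' - x
lhs = v @ A @ v
rhs = (P @ v) @ A @ (P @ v) + (P_perp @ v) @ A @ (P_perp @ v)
assert np.isclose(lhs, rhs)
```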
F. Supplementary Material for Section 5

In this section, we study the impact of disparities in access to information about the model, not just on their own, but in conjunction with cost disparities and asymmetries of the scoring rule. To do so, we provide additional experimental results on both the ADULT and the TAIWAN-CREDIT datasets.

In Figure 2, we study disparities in improvement on the ADULT and the TAIWAN-CREDIT datasets. While we keep using the same data $X_g$, rule $w^\star$, and projection matrices $\Pi_g$ as in Section 5, we now consider non-identity cost matrices: we draw each $A_g$ at random, with each coefficient drawn uniformly over $[-1, 1]$.

[Figure 2: Left: evaluation on the TAIWAN-CREDIT dataset (features Age, Education, Marriage); Right: evaluation on the ADULT dataset (features Age, Country, Education). Bars show $I_1(w^{SW})$, $I_2(w^{SW})$, $uI_1(w^{SW})$, $uI_2(w^{SW})$, $uI_1(w_1)$, and $uI_2(w_2)$; the $A_g$'s are drawn at random.]

We first note that the scale of the improvements may differ from those of Figure 1; the difference is striking when looking, for example, at the Education feature of the left plot, for the TAIWAN-CREDIT dataset. Compared to Figure 1, where both total and per-unit outcome improvements are a bit less than 0.5 for all groups, we see that they are now above 1.5 for group 2. Significant changes in outcome improvements can also be seen for group 1 on the Age and Marriage features of the left plot (TAIWAN-CREDIT), and for group 2 on the Country and Education features of the right plot (ADULT). This comes from the fact that information disparities are not the only parameter affecting disparities of improvement across groups: changing the values and magnitudes of the $A_g$'s changes which features can be easily modified by agents, and which features give them the best improvement per unit of cost exerted. This, in turn, changes which features are desirable to invest in for the agents, and hence for a welfare-maximizing principal.

In the case of the Marriage feature, because both the total and per-unit improvements are significantly reduced compared to Figure 1, it seems that the disparities we observe are not only due to the learner putting less weight on group 1 (defined as the part of the norm of $w^{SW}$ that belongs to group 1's information space $S_1$) and more on group 2; rather, what seems to happen is that the learner focuses on directions in which both groups have information, but that are only valuable and easy to modify for group 2.

We further observe that the addition of non-identity cost matrices can lead to a degradation of outcomes in one of the groups when the principal optimizes the joint social welfare. This is visible on the left plot of Figure 2, where the total and per-unit improvements for group 1 are negative for the Age feature. This matches the relatively counter-intuitive observation of Section 4.1 that optimizing for the social welfare of both groups may hurt the welfare of one of them.

Finally, when comparing the results across groups in the left plots of Figure 1 and Figure 2 for the Age feature, we observe a significant reversal of the disparities of improvement across groups: group 1 was obtaining slightly better outcomes than group 2 in Figure 1, but group 2 has significantly worse (in fact, negative) improvements while group 1 improves slightly more than before in Figure 2. This paints a nuanced picture, showing that the amount of information a group has about the scoring rule used by the principal is not the only factor of importance.
While having more information is important, how this information interacts with the true model $w^\star$ and with the strategic behavior of the agents also matters: having a lot of information in directions that have little effect on an agent's true label, or in directions that are very costly for some agents to modify, does not help a group when it comes to improving its true labels.
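For concreteness, the following minimal sketch (with hypothetical synthetic data, not the actual ADULT/TAIWAN-CREDIT pipeline) illustrates the quantities plotted in Figure 2. Two caveats: we take per-unit improvement to be total improvement divided by the norm of the exerted effort, and we symmetrize the random cost draw to keep it positive definite; the paper's experimental setup may handle both details differently.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
P1 = np.diag([1, 1, 1, 1, 0, 0]).astype(float)  # group 1's information subspace
P2 = np.diag([0, 0, 1, 1, 1, 1]).astype(float)  # group 2's (overlapping) subspace

def random_cost(d):
    # Coefficients drawn uniformly over [-1, 1]; M M^T + I keeps the cost PD.
    M = rng.uniform(-1, 1, size=(d, d))
    return M @ M.T + np.eye(d)

A1, A2 = random_cost(d), random_cost(d)
v = np.linalg.inv(A1) @ P1 @ w_star + np.linalg.inv(A2) @ P2 @ w_star
w_sw = v / np.linalg.norm(v)                    # equilibrium rule (Lemma 3.3)

for g, (A, P) in enumerate([(A1, P1), (A2, P2)], start=1):
    delta = np.linalg.inv(A) @ P @ w_sw         # group g's best-response move
    I_g = delta @ w_star                        # total improvement
    uI_g = I_g / np.linalg.norm(delta)          # per-unit improvement
    print(f"group {g}: I = {I_g:.3f}, uI = {uI_g:.3f}")
```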