# Strategic Representation

Vineet Nair¹, Ganesh Ghalme², Inbal Talgam-Cohen¹, Nir Rosenfeld¹

¹Technion – Israel Institute of Technology. ²Indian Institute of Technology, Hyderabad. Correspondence to: Vineet Nair.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Humans have come to rely on machines for reducing excessive information to manageable representations. But this reliance can be abused: strategic machines might craft representations that manipulate their users. How can a user make good choices based on strategic representations? We formalize this as a learning problem, and pursue algorithms for decision-making that are robust to manipulation. In our main setting of interest, the system represents attributes of an item to the user, who then decides whether or not to consume. We model this interaction through the lens of strategic classification (Hardt et al., 2016), reversed: the user, who learns, plays first; and the system, which responds, plays second. The system must respond with representations that reveal nothing but the truth, but need not reveal the entire truth. Thus, the user faces the problem of learning set functions under strategic subset selection, which presents distinct algorithmic and statistical challenges. Our main result is a learning algorithm that minimizes error despite strategic representations, and our theoretical analysis sheds light on the trade-off between learning effort and susceptibility to manipulation.

1. Introduction

There is a growing recognition that learning systems deployed in social settings are likely to induce strategic interactions between the system and its users. One promising line of research in this domain is strategic classification (Hardt et al., 2016), which studies learning in settings where users can strategically modify their features in response to a learned classification rule. The primary narrative in strategic classification is one of self-interested users that act to game the system, and of systems that should defend against this form of gaming behavior. In this paper, we initiate the study of reversed scenarios, in which it is the system that strategically games its users. We argue that this is quite routine: in settings where users make choices about items (e.g., click, buy, watch), decisions are often made on the basis of only partial information. But the choice of what information to present to users often lies in the hands of the system, which can utilize its representation power to promote its own goals. Here we formalize the problem of strategic representation, and study how learning can now aid users in choosing well despite strategic system behavior.

As a concrete example, consider a user browsing for a hotel on an online booking platform. Hotels on the platform are represented by a small set of images; as there are many images available for each hotel, the system must choose which subset of these to display. Clearly, the choice of representation can have a significant effect on users' decision-making (whether to click the hotel, and subsequently whether to book it), and in turn, on the system's profit. The system may well attempt to capitalize on the control it has over what information is presented, at the expense of the user, who may be swayed to make sub-optimal decisions.
Given that users rely on system-crafted representations for making decisions, choosing well requires users to account for the strategic nature of the representations they face. Our goal in this paper is to study when and how learning can aid users in making decisions that are strategically robust. We model users as deciding based on a choice strategy h mapping represented items to binary choices, which can be learned from the user s previous experiences (e.g., hotels stayed in, and whether they were worthwhile). Since in reality full descriptions of items are often too large for humans to process effectively (e.g., hundreds or thousands of attributes), the role of the system is to provide users with compact representations (e.g., a small subset of attributes). We therefore model the system as responding to h with a representation mapping ϕ, which determines for each item x its representation z = ϕ(x), on which users rely for making choices (i.e., h is a function of z). We focus on items x that are discrete and composed of a subset of ground elements; accordingly, we consider representation mappings ϕ that are lossy but truthful, meaning they reveal a cardinalityconstrained subset of the item s full set of attributes: z x and k1 |z| k2 for some exogenously-set k1, k2. Strategic Representation Given that the system has representation power how should users choose h? The key idea underlying our formulation is that the system and its users have misaligned goals. A user wishes to choose items that are worthwhile to her, as determined by her valuation function v(x) in our case, a set function, which can reflect complex tastes (e.g., account for complementarity among the attributes, such as balcony and ocean view in the hotel example). Meanwhile, the system aims to maximize user engagement by choosing a feasible representation z = ϕ(x) that will incite the user to choose the item. Importantly, while values are based on the true x, choices rely on representations z which the system controls. This causes friction: a representation may be optimal for the system, but may not align with user interests. The challenge for users (and our goal in learning) is therefore to make good choices on the basis of strategic representations. Note that this is not a simple missing value problem, since which attributes will be missing depends on how users choose, i.e., on their learned choice function h.1 Nor is this a problem that can be addressed by mapping representations to a categorical representation (with classes 0 , 1 , and unknown ); to see this, note that given a subset representation z, it is impossible to know which of the attributes that do not appear in z are withheld and which are truly missing. Subset representations. We focus on subset representations since they provide a natural means to ensure that representations reveal nothing but the truth (but not necessarily the whole truth ). This is our primary modeling concern, which stems from realistic restrictions on the system (like consumer expectations, legal obligations, or business practices). Subset representations cleanly capture what we view as a fundamental tension between users and an informationally-advantageous system the ability to withhold information; examples include retail items described by a handful of attributes; videos presented by several key frames; and movie recommendations described by a select set of reviews, to name a few. Overview of our results. 
We begin by showing that users who choose na ıvely can easily be manipulated by a strategic system (Sec. 3). We then proceed to study users who learn (Sec. 4). Our main result is an efficient algorithm for learning h (Thm. 4.7), which holds under a certain realizability assumption on v. The algorithm minimizes the empirical error over a hypothesis class of set-function classifiers, whose complexity is controlled by parameter k, thus allowing users to trade off expressivity and statsitical efficiency. The algorithm builds upon several structural properties which we establish for set-function classifiers. Our key result here is 1Consider the following subtlety: since users must commit to a choice strategy h, the choice of how to handle missing features (a part of the strategy) determines which features will be missing. that folding the system s strategic response into the hypothesis class results in an induced class having a simple form that makes it amenable to efficient optimization (Thm. 4.6). Building on this, we continue to study the balance of power (Sec. 5), as manifested in the interplay between k (hypothesis complexity, which reflects the power of the user) and the range [k1, k2] (which determines the power of the system). For fixed k1, k2, we analyze how the choice of k affects the user s classification error, through its decomposition into estimation and approximation errors. For estimation, we give a generalization bound (Thm. 5.9), obtained by analyzing the VC dimension of the induced function class (as it relies on k). For approximation, we give several results (e.g., Thm. 5.4) that link the expressivity of the user s value function v to the complexity of the learned h (again, as it relies on k). Together, our results characterize how much is gained versus how much effort is invested in learning as k is varied. One conclusion is that even minimal learning can help significantly (whereas no learning can be catastrophic). From the system s perspective, and for fixed k, we study how the range [k1, k2] affects the system. Intuitively, we would expect that increasing the range should be beneficial to the system, as it provides more alternatives to choose z from. However, perhaps surprisingly, we find that the system can increase its payoff by tying its hands to a lower k2. This is because k2 upper-bounds not only the system s range but also the effective k of the user (who gets nothing from choosing k > k2), and the lower the k, the better it is for the system (Lemma 5.10). The choice of k1 turns out to be immaterial against fully strategic users, but highly consequential against users that are not. 1.1. Relation to Strategic Classification Our formalization of strategic representation shares a deep connection to strategic classification (Hardt et al., 2016). Strategic representation and strategic classification share an underlying structure (a leading learning player who must take into account a strategic responder), but there are important differences. The first is conceptual: in our setting, roles are reversed it is users who learn (and not the system), and the system strategically responds (and not users). This allows us to pursue questions regarding the susceptibility of users to manipulation, with different emphases and goals, while maintaining the language of strategic classification. The second difference is substantive: we consider inputs that are sets (rather than continuous vectors), and manipulations that hide information (rather than alter it). 
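To make the "hiding rather than altering" distinction concrete, the sketch below (Python, with hypothetical names that are ours rather than the paper's) enumerates the only manipulations available to the system: truthful, cardinality-bounded subsets of an item's attributes. The cost-function view of this same constraint is discussed next.

```python
from itertools import combinations

def feasible_representations(x, k1, k2):
    """All representations the system may legally choose for item x:
    z must be a subset of x (truthfulness) and satisfy k1 <= |z| <= k2."""
    x = frozenset(x)
    return [frozenset(z)
            for r in range(k1, min(k2, len(x)) + 1)
            for z in combinations(sorted(x), r)]

# Hotel example: the system may show any 2 attributes the hotel actually has,
# but cannot show an attribute it does not have.
hotel = {"pool", "balcony", "ocean_view", "parking"}
print(feasible_representations(hotel, k1=2, k2=2))
```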
Technically, switching to discrete inputs can be done by utilizing the cost function to enforce truthfulness constraints as a hard penalty on modifications. But this change is not cosmetic: since the system behaves strategically by optimizing over subsets, learning must account for set-relations between Strategic Representation different objects in input space; this transforms the problem to one of learning set functions.2 From a modeling perspective, subsets make our work compatible with classic attribute-based consumer decision theory (Lancaster, 1966). Overall, we view the close connection to strategic classification as a strength of our formalization, showing that the framework of strategic classification is useful far beyond what was previously considered; and also that a fairly mild variation can lead to a completely different learning problem, with distinct algorithmic and statistical challenges. 1.2. Related Work (see also Appx. B) Strategic classification. Strategic classification is a highly active area of research. Recent works in this field include statistical learning characterizations (Zhang & Conitzer, 2021; Sundaram et al., 2021; Ghalme et al., 2021), practical optimization methods (Levanon & Rosenfeld, 2021; 2022), relaxation of key assumptions (Ghalme et al., 2021; Bechavod et al., 2022; Jagadeesan et al., 2021; Levanon & Rosenfeld, 2022; Eilat et al., 2022), relations to causal aspects (Miller et al., 2020; Chen et al., 2020a), and consideration of societal implications (Milli et al., 2019; Hu et al., 2019; Chen et al., 2020b; Levanon & Rosenfeld, 2021). Two recent works are closely relate to ours: Zrnic et al. (2021) consider a dynamic setting of repeated learning that can change the order of play; in contrast, we switch the roles. Krishnaswamy et al. (2021) consider information withholding by users (rather than the system). They aim to learn a truth-eliciting mechanism, which incentivizes the second player to reveal all information (i.e., the whole truth ). Their mechanism ensures that withholding never occurs; in contrast, our goal is to predict despite strategic withholding (i.e., nothing but the truth ). Bayesian persuasion. In Bayesian persuasion (Kamenica & Gentzkow, 2011), a more-informed player (i.e., the system) uses its information advantage coupled with commitment power to influence the choices of a decision maker (i.e., the user). Works closest to ours are by Dughmi et al. (2015), who upper bound the number of signaled attributes, and by Haghtalab et al. (2021), who study strategic selection of anecdotes. Both works consider the human player as a learner (as do we). However, in our work, the order of play is reversed the decision-maker (user) moves first and the more-informed player (system) follows. Bayesian persuasion also assumes that the system knows the user s valuation, and crucially relies on both parties knowing the distribution D of items. In contrast, we model the user as having only 2This is a subtle point: Since sets can be expressed as binary membership vectors, it is technically possible to use conventional vector-based approaches to learn h. Nonetheless, these approaches cannot account for strategic behavior; this is since ϕ implements a set operation, which is ill-defined for continuous vector inputs. sample access to D, and the system as agnostic to it. 2. A Formalization of Strategic Representation We begin by describing the setting from the perspective of the user, which is the learning entity in our setup. 
We then present the types of users we study, and draw connections between our setting and others found in the literature. 2.1. Learning Setting In our setup, a user is faced with a stream of items, and must choose which of these to consume. Items are discrete, with each item x X 2E described by a subset of ground attributes, E, where |E| = q. We assume all feasible items have at most n attributes, |x| n. The value of items for the user are encoded by a value function, v : X R. We say an item x is worthwhile to the user if it has positive value, v(x) > 0, and use y = Y (x) = sign(v(x)) to denote worthwhileness, i.e., y = 1 if x is worthwhile, and y = 1 if it is not. Items are presented to the user as samples drawn i.i.d. from some unknown distribution D over X, and for each item, the user must choose whether to consume it (e.g., click, buy, watch) or not. We assume that the user makes choices regarding items by committing at the onset to a choice function h that governs her choice behavior. In principle, the user is free to chose h from some predefined function class H; learning will consider finding a good h H, but the implications of the choice of H itself will play a central role in our analysis. Ideally, the user would like to choose items if and only if they are worthwhile to her; practically, her goal is to find an h for which this holds with large probability over D. For this, the user can make use of her knowledge regarding items she has already consumed, and therefore also knows their value; we model this as providing the user access to a labeled sample set S = {(xi, yi)}m i=1 where xi D and yi = sign(v(xi)), which she can use for learning h. Strategic representations. The difficulty in learning h is that user choices at test time must rely only on item representations, denoted z Z, rather than on full item descriptions. Thus, learning considers choice functions that operate on representations, h : Z { 1}; the challenge lies in that while choices must be made on the basis of representations z, item values are derived from their full descriptions x which representations describe only partially. The crucial aspect of our setup is that representations are not arbitrary; rather, representations are controlled by the system, which can choose them strategically to promote its own goals. We model the system as acting through a representation mapping, ϕ : X Z, which operates independently on any x, and can be determined in response to the user s choice of h. This mimics a setting in which a Strategic Representation fast-acting system can infer and quickly respond to a user s (relatively fixed) choice patterns. We assume the system s goal is to choose a ϕh that maximizes expected user engagement: Ex D[1{h(ϕh(x)) = 1}]. (1) Nonetheless, representations cannot be arbitrary, and we require ϕh to satisfy two properties. First, chosen representations must be truthful, meaning that z x for all x. Second, representations are subject to cardinality constraints, k1 |z| k2 for some predetermined k1, k2 N. We will henceforth use Z to mean representations of feasible cardinality. Both requirements stem from realistic considerations: A nontruthful system which intentionally distorts item information is unlikely to be commercially successful in the long run; intuitively, truthfulness gives users some hope of resilience to manipulation. 
For k1, k2, we think of these as exogenous parameters of the environment, arising naturally due to physical restrictions (e.g., screen size) or cognitive considerations (e.g., information processing capacity); if k2 < n, we say representations are lossy. Under these constraints, the system can optimize Eq. (1) by choosing representations via the best-response mapping: ϕh(x) = argmax z Z h(z) s.t. z x, |z| [k1, k2] (2) Eq. (2) is a best-response since it maximizes Eq. (1) for any given h: for every x, ϕh chooses a feasible z x that triggers a positive choice event, h(z) = 1 if such a z exists. In this way, k1, k2 control how much leeway the system has in revealing only partial truths; as we will show, both parameters play a key role in determining outcomes for both system and user. From now on we overload notation and by ϕh(x) refer to this best-response mapping. Learning objective. Given function class H and a labeled sample set S, the user aims to find a choice function h H that correctly identifies worthwhile items given their representation, and in a way that is robust to strategic system manipulation. The user s objective is therefore to maximize: Ex D[1{h(ϕh(x)) = y}] (3) where ϕh is the best-response mapping in Eq. (2). Note that since h is binary, the argmax of ϕh may not be unique; e.g., if some z1 x, z2 x both have h(z1) = h(z2) = 1. Nonetheless, the particular choice of z does not matter from the user s perspective, her choice of h is invariant to the system s choice of best-response z (proof in Appx. C): Observation 2.1. Every best-response z ϕh(x) induces the same value in the user s objective function (Eq. (3)). 2.2. User Types Our main focus throughout the paper will be on users that learn h by optimizing Eq. (3). But to understand the potential benefit of learning, we also analyze simpler types of user behavior. Overall, we study three user types, varying in their sophistication and the amount of effort they invest in choosing h. These include: The na ıve user: Acts under the (false) belief that representations are chosen in her own best interest. This user type truthfully reports her preferences to the system by setting h = v as her choice function.3 The agnostic user: Makes no assumptions about the system. This user employs a simple strategy that relies on basic data statistics which provides minimal but robust guarantees regarding her payoff. The strategic user: Knows that the system is strategic, and anticipates it to best-respond. This user is willing to invest effort (in terms of data and compute) in learning a choice function h that maximizes her payoff by accounting for the system s strategic behavior. Our primary goal is to study the balance of power between users (that choose) and the system (which represents). In particular, we will be interested in exploring the tradeoff between a user s effort and her susceptibility to manipulation. 2.3. Strategic Representation as a Game Before proceeding, we give an equivalent characterization of strategic representation as a game. Our setting can be compactly described as a single-step Stackelberg game: the first player is User, which observes samples S = {(xi, yi)}m i=1, and commits to a choice function h : Z { 1}; the second player is System, which given h, chooses a truthful ϕh : X Z (note how ϕh depends on h). The payoffs are: User: Ex D[1{h(ϕh(x)) = y}] (4) System: Ex D[1{h(ϕh(x)) = 1}] (5) Note that payoffs differ only in that User seeks correct choices, whereas System benefits from positive choices. 
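The interaction defined by Eqs. (2), (4) and (5) is straightforward to simulate. The following sketch (Python; function and variable names are ours, and the brute-force enumeration assumes every item has at least k1 attributes) implements the best-response mapping and sample-based estimates of the two payoffs.

```python
from itertools import combinations

def best_response(h, x, k1, k2):
    """System's best response (Eq. 2): a truthful z with k1 <= |z| <= k2 and
    h(z) = +1 if one exists; otherwise an arbitrary feasible z."""
    feasible = [frozenset(z)
                for r in range(k1, min(k2, len(x)) + 1)
                for z in combinations(sorted(x), r)]
    return next((z for z in feasible if h(z) == 1), feasible[0])

def empirical_payoffs(h, sample, k1, k2):
    """Sample-based analogues of Eqs. (4) and (5): the user is paid for
    correct choices, the system for positive choices."""
    user = system = 0
    for x, y in sample:
        choice = h(best_response(h, x, k1, k2))
        user += (choice == y)
        system += (choice == 1)
    return user / len(sample), system / len(sample)

# Toy usage: the user accepts whenever 'balcony' is shown.
h = lambda z: +1 if "balcony" in z else -1
sample = [({"balcony", "pool", "noisy_street"}, -1), ({"balcony", "ocean_view"}, +1)]
print(empirical_payoffs(h, sample, k1=1, k2=2))
```

Note how the user's payoff is evaluated against the true label y = sign(v(x)) while her choice is made only from what the system chose to show; this is exactly the friction the learning problem must overcome.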
This reveals a clear connection to strategic classification, in which System, who plays first, is interested in accurate predictions, and for this it can learn a classifier; and User, who plays second, can manipulate individual inputs (at some cost) to obtain positive predictions. Thus, strategic representation can be viewed as a variation on strategic classification, but with roles reversed . Nonetheless, and despite 3Note that while h takes values in X, v takes values in X. Nonetheless, truthfulness implies that Z X, and so v is welldefined as a choice function over Z. Strategic Representation these structural similarities, strategic representation bears key differences: items are discrete (rather than continuous), manipulations are subject to hard set constraints (rather than soft , continuous costs), and learning regards set functions (rather than vector functions). These differences lead to distinct questions and unique challenges in learning. 3. Warm-up: Na ıve and Agnostic Users The na ıve user. The na ıve user employs a what you see is what you get policy: given a representation of an item, z, this user estimates the item s value based on z alone, acting as if z were the item itself. Consequently, the na ıve user sets h(z) = sign(v(z)), even though v is truly a function of x. The na ıve user fails to account for the system s strategic behavior (let alone the fact that z x of some actual x). Despite its naivety, there are conditions under which this user s approach makes sense. Our first result shows that the na ıve policy is sensible in settings where the system is benevolent, and promotes user interests instead of its own. Lemma 3.1. If system plays the benevolent strategy: ϕbenev h (x) = argmax z x,|z| [k1,k2] {1{h(z) = sign(v(x))}, then the na ıve approach maximizes user payoff. Proof in Appx. D. The above lemma is not meant to imply that na ıve users assume the system is benevolent; rather, it justifies why users having this belief might act in this way. Real systems, however, are unlikely to be benevolent; our next example shows a strategic system can easily manipulate na ıve users to receive arbitrarily low payoff. Example 1. Let x1 = {a1}, x2 = {a1, a2}, x3 = {a2} with v(x1) = +1 and v(x2) = v(x3) = 1. Fix k1 = k2 = 1, and let D = (ε/2, 1 ε, ε/2). Note Z = {a1, a2} are the feasible representations. The na ıve user assigns h = (a1) = +1, h(a2) = 1 according to v. For x2, a strategic system plays ϕ(x2) = a1. The expected payoff to the user is ε. One reason a na ıve user is susceptible to manipulation is because she does not make any use of the data she may have. We next describe a slightly sophisticated user that uses a simple strategy to ensure a better payoff. The agnostic user. The agnostic user puts all faith in data; this user does not make assumptions on, nor is she susceptible to, the type of system she plays against. Her strategy is simple: collect data, compute summary statistics, and choose to either always accept or always reject (or flip a coin). In particular, given a sample set S = {(xi, yi)}m i=1, the agnostic user first computes the fraction of positive examples, ˆµ := 1 m Pm i=1 yi. Then, for some tolerance τ, sets for all z, h(z) = 1 if ˆµ 1/2 + τ, h(z) = 1 if ˆµ 1/2 τ, and flips a coin otherwise. In Example 1, an agnostic user would choose h = ( 1, 1) when m is large, guaranteeing a payoff of at least m(1 ε/2) 2+ m (1 ε/2) as m . Investing minimal effort, for an appropriate choice of τ, this user s strategy turns out to be quite robust. Theorem 3.2. 
(Informal) Let µ be the true rate of positive examples, µ = ED[Y ]. Then as m increases, the agnostic user s payoff approaches max{µ, 1 µ} at rate 1/ m. Formal statement and proof in Appx. A.1. In essence, the agnostic user guarantee herself the majority rate with rudimentary usage of her data, and in a way that does not depend on how system responds. But this can be far from optimal; we now turn to the more elaborate strategic user who makes more clever use of the data at her disposal. 4. Strategic Users Who Learn A strategic agent acknowledges that the system is strategic, and anticipates that representations are chosen to maximize her own engagement. Knowing this, the strategic user makes use of her previous experiences, in the form of a labeled data set S = {(xi, yi)}m i=1, to learn a choice function ˆh from some function class H that optimizes her payoff (given that the system is strategic). Cast as a learning problem, this is equivalent to minimizing the expected classification error on strategically-chosen representations: h = argmin h H ED[1{h(ϕh(x)) = y}]. (6) Since the distribution D is unknown, we follow the conventional approach of empirical risk minimization (ERM) and optimize the empirical analog of Eq. (6): ˆh = argmin h H i=1 h(ϕh(xi)) = yi). (7) Importantly since every zi = ϕh(xi) is a set, H must include set functions h : Z { 1}, and any algorithm for optimizing Eq. (7) must take this into account. In Sections 4.1 and 4.2, we characterize the complexity of a user s choice function and relate its complexity to that of v, and in Section 4.3 give an algorithm that computes ˆh, the empirical minimizer, for a hypothesis class of a given complexity. 4.1. Complexity Classes of Set Functions Ideally, a learning algorithm should permit flexibility in choosing the complexity of the class of functions it learns (e.g., the degree of a polynomial kernel, the number of layers in a neural network), as this provides means to trade-off running time with performance and to reduce overfitting. In this section we propose a hierarchy of set-function complexity classes that is appropriate for our problem. Strategic Representation Denote by Γk(z) all subsets of z having size at most k: Γk(z) = {z 2E : z z, |z | k}. We start by defining k-order functions over the representation space. These functions are completely determined by weights placed on subsets of size at most k. Definition 4.1. We say the function h : Z { 1} is of order k if there exists real weights on sets of cardinality at most k, {w(z ) : z Γk(z)}, such that h(z) = sign X z Γk(z) w(z ) . Not all functions h(z) can necessarily be expressed as a k-order function (for some k); nonetheless, in the context of optimizing Eq. (7), we show that working with k-order functions is sufficiently general, since any set function h can be linked to a matching k-order function h (for some k k2) through how it operates on strategic inputs. Lemma 4.2. For any h : Z { 1}, there exists k k2 and a corresponding k-order function h such that: h(ϕh(x)) = h (ϕh (x)) . Lem. 4.2 permits us to focus on k-order functions. The proof is constructive (see Appx. E), and the construction itself turns out to be highly useful. In particular, the proof constructs h having a particular form of binary basis weights, w(z) {a , a+}, which we assume from now on are fixed ( k). Hence, every function h has a corresponding binaryweighted k-order function h , which motivates the following definition of functions and function classes. Definition 4.3. 
We say a k-order function h with basis weights w is binary-weighted if: ( {a , a+} z such that |z| = k = a z such that |z| < k for some fixed a ( 1, 0) and a+ > P i [k] n i . A binary weighted k-order h determines a family of k-size subsets, described by having weights as a+, such that for any z Z with |z| k, h(z) = 1 if and only if z contains a subset from the family (for z with |z| < k, h(z) = 1 always). This is made precise using the notion of lifted functions in Lem. A.4 in Appx. A.2. Next, denote: Hk = {h : h is a binary-weighted k-order function}. The classes {Hk}k will serve as complexity classes for our learning algorithm; the user provides k as input, and ALG outputs an ˆh Hk that minimizes the empirical loss4. As 4Assuming the empirical error is zero. we will show, using k as a complexity measure provides the user direct control over the tradeoff between estimation and approximation error, as well as over the running time. Next, we show that the {Hk}k classes are strictly nested. This will be important for our analysis of approximation error, as it will let us reason about the connection between the learned ˆh and the target function v (proof in Appx. E). Lemma 4.4. For all k, Hk 1 Hk and Hk \ Hk 1 = . Note that Hn includes all binary-weighted set functions, but since representations are of size at most k2, it suffices to consider only k k2. Importantly, k can be set lower than k1; for example, H1 is the class of threshold modular functions, and H2 is the class of threshold pairwise functions. The functions we consider are parameterized by their weights, w, and so any k-order function has at most |w| = Pk i=0 q i weights. In this sense, the choice of k is highly meaningful. Now that we have defined our complexity classes, we turn to discussing how they can be optimized over. 4.2. Learning via Reduction to Induced Functions The simple structure of functions in Hk makes them good candidates for optimization. But the main difficulty in optimizing the empirical error in Eq. (7) is that the choice of h does not only determine the error, but also determines the inputs on which errors are measured (indirectly through the dependence of ϕh on h). To cope with this challenge, our approach is to work with induced functions that already have the system s strategic response encoded within, which will prove useful for learning. Additionally, as they operate directly on x (and not z), they can easily be compared with v, which will become important in Sec. 5. Definition 4.5. For a class H, its induced class is: FH {f : X { 1} : h H s.t. f(x) = h(ϕh(x))} The induced class FH includes for every h H a corresponding function that already has ϕh integrated in it. We use Fk = FHk to denote the induced class of Hk. For every h, we denote its induced function by fh. Whereas h functions operate on z, induced functions operate directly on x, with each fh accounting internally for how the system strategic responds to h on each x. Our next theorem provides a key structural result: induced functions inherit the weights of their k-order counterparts. Theorem 4.6. For any h Hk with weights w: h(z) = sign X z Γk(z) w(z ) , its induced fh Fk can be expressed using the same weights, w, but with summation over subsets of x, i.e.: fh(x) = sign X z Γk(x) w(z) . Strategic Representation Thm. 
4.6 is the main pillar on which our algorithm stands: it allows us to construct h by querying the loss directly, i.e., without explicitly computing ϕh, by working with the induced fh; this is because

1{h(ϕh(xi)) ≠ yi} = 1{fh(xi) ≠ yi}.

Thus, through their shared weights, induced functions serve as a bridge between what we optimize and how.

4.3. Learning Algorithm

We now present our learning algorithm, ALG. The algorithm is exact: it takes as input a training set S and a parameter k, and returns an h ∈ Hk that minimizes the empirical loss (Eq. (7)). Correctness holds under the realizability condition Y ∈ Fk, i.e., Y is the induced function of some h ∈ Hk.⁵ The algorithm constructs h by sequentially computing its weights, w = {w(z)} for |z| ≤ k. As per Def. 4.3, only w(z) for z with |z| = k must be learned; hence, weights are sparse, in the sense that only a small subset of them are assigned a+, while the rest are a−. Weights can be implemented as a hash table, where w(z) = a+ if z is in the table, and w(z) = a− if it is not.

⁵Note that even for standard linear binary classification, finding an empirical minimizer of the 0/1 loss in the agnostic (i.e., non-realizable) case is NP-hard (Shalev-Shwartz & Ben-David, 2014).

Algorithm 1 ALG
1: Input: S = {(xi, yi)} for i ∈ [m], k ∈ [k2]
2: Pre-compute:
3: S+ = {x ∈ S : y = +1},
4: S− = {x ∈ S : y = −1}
5: Zk,S = {z : |z| = k, ∃x ∈ S s.t. z ⊆ x}
6: p̂(xi) = (1/m) Σ_{j ∈ [m]} 1{xi = xj} for all i ∈ [m]
7: Fix a− ∈ (−1, 0) and a+ > Σ_{i ∈ [1,k]} (n choose i)
8: Initialize:
9: Z+ = ∅, Z− = ∅, V = ∅, Sz = ∅ for all z ∈ Zk,S
10: Run:
11: for x ∈ S− do
12:   for z s.t. z ⊆ x and z ∈ Zk,S do
13:     Z− = Z− ∪ {z}, Zk,S = Zk,S \ {z}
14:     Sz = Sz ∪ {x}
15:   end for
16: end for
17: for x ∈ S+ do
18:   for z ⊆ x such that z ∈ Zk,S do
19:     Z+ = Z+ ∪ {z}
20:   end for
21: end for
22: Set w(z) = a+ if z ∈ Z+, and a− otherwise (implemented implicitly as a hash table)
23: Return ĥ(z) = sign(Σ_{z' ∈ Γk(z)} w(z')).

Our next result establishes the correctness of ALG. The proof leverages a property that characterizes the existence of an h ∈ Hk having zero empirical error (see Lem. E.1). The proof of Lem. E.1 uses Thm. 4.6, which enables the loss to be computed directly for the induced functions using the shared weight structure.

Theorem 4.7. For any k ∈ [k2], if Y is realizable then ALG returns an ĥ that minimizes the empirical error.

Proof in Appx. E. Note that our algorithm is exact: it returns a true minimizer of the empirical 0/1 loss, assuming Y is realizable. Additionally, ALG can be used to identify whether there exists an h ∈ Hk with zero empirical error: at Step 15, for each x ∈ S+, if there does not exist a z ∈ Zk,S or z ∈ Z+ such that z ⊆ x, then by Lem. E.1 in Appx. E there does not exist an h with zero empirical error.

Lemma 4.8. Let n be the size of elements in X, m be the number of samples, and k ≤ k2 be the user's choice of complexity. Then ALG runs in O(m (n choose k)) time.

This runtime is made possible by several key factors: (i) only k-sized weights need to be learned, (ii) all weights are binary-valued, and (iii) loss queries are efficient in induced space. Nonetheless, when n and k are large, runtime may be significant, and so k must be chosen with care. Fortunately, our results in Sec. 5.1.1 give encouraging evidence that learning with small k (even k = 1, for which runtime is O((mn)²)) is quite powerful (assuming Y is realizable).

In the analysis, the m (n choose k) factor is possible only because weights are sparse, and because ALG operates on a finite sample set of size m. Alternatively, if m is large, then this can be replaced with (q choose k). This turns out to be necessary; in Appx.
A.3 we show that, in the limit, q k is a lower bound. 5. Balance of Power Our final section explores the question: what determines the balance of power between system and users? We begin with the perspective of the user, who has commitment power, but can only minimize the empirical error. For her, the choice of complexity class k is key in balancing approximation error how well (in principle) can functions h Hk approximate v; and estimation error how close the empirical payoff of the learned ˆh is to its expected value. Our results give insight into how these types of error trade off as k is varied (here we do not assume realizability). For the system, the important factors are k1 and k2, since these determine its flexibility in choosing representations. Since more feasible representation mean more flexibility, it would seem plausible that smaller k1 and larger k2 should help the system more. However, our results indicate differently: for system, smaller k2 is better, and the choice of k1 has limited effect on strategic users. The result for k2 goes Strategic Representation through a connection to the user s choice of k; surprisingly, smaller k turns out to be, in some sense, better for all. 5.1. User s Perspective We begin by studying the effects of k on user payoff. Recall that users aim to minimize the expected error (Eq. (6)): ε(h) = ED[1{h(ϕh(x)) = sign(v(x))}], but instead minimize the empirical error (Eq. (7)). For reasoning about the expected error of the learned choice function ˆh Hk, a common approach is to decompose it into two error types approximation and estimation: ε(ˆh) = ε(h ) | {z } approx. + ε(ˆh) ε(h ) | {z } estimation , h = argmin h Hk ε(h ) Approximation error describes the lowest error obtainable by functions in Hk; this measures the expressivity of Hk, and is independent of ˆh. For approximation error, we define a matching complexity structure for value functions v, and give several results relating the choice of k and the complexity of v. Estimation error describes how far the learned ˆh is from the optimal h Hk, and depends on the data size, m. Here we give a generalization bound based on VC analysis. 5.1.1. USER APPROXIMATION ERROR To analyze the approximation error, we must be able to relate choice functions h (that operate on representations z) to the target value function v (which operates on items x). To connect the two, we will again use induced functions, for which we now define a matching complexity structure. Definition 5.1. A function f : X { 1} has an induced complexity of ℓif exists a function g : Zℓ { 1} s.t.: ( 1 if z x, |z| = ℓand g(z) = 1 1 o.w. and ℓis minimal (i.e., there is no such g : Zℓ 1 { 1}). We show in Lem. 5.2 and Cor. 5.3 that the induced complexity of a function f captures the minimum k [1, n] such that f is an induced function of an h Hk. Lemma 5.2. Let k k2. Then for every h Hk, the induced complexity of the corresponding fh is ℓ k. Corollary 5.3. Let Fk = FHk be the induced function class of Hk, as defined in Def. 4.5. Then: Fk = {f : X { 1} : f has induced complexity k}. Proof of Cor. 5.3 is in Appx. F. We now turn to considering the effect of k on approximation error. Since the worthwhileness function Y (x) = sign(v(x)) operates on x, we can consider its induced complexity, which we denote by ℓ (i.e., Y Fℓ ). The following result shows that if ℓ k, then Hk is expressive enough to perfectly recover Y . Theorem 5.4. If ℓ k then the approximation error is 0. One conclusion from Thm. 
5.4 is that if the user knows ℓ , then zero error is, in principle, obtainable; another is that there is no reason to choose k > ℓ . In practice, knowing ℓ can aid the user in tuning k according to computational (Sec. 4.3) and statistical considerations (Sec. 5.1.2). Further conclusions relate ℓ and k2: Corollary 5.5. If ℓ k2 and the distribution D has full support on X, then k = ℓ is the smallest k that gives zero approximation error. Corollary 5.6. If ℓ > k2, then the approximation error weakly increases with k, i.e., ε(h k) ε(h k 1) for all k k2. Furthermore, if the distribution D has full support on X then no k can achieve zero approximation error. Proofs in Appx. F. In general, Cor. 5.6 guarantees only weak improvement with k. Next, we show that increasing k can exhibit clear diminishing-returns behavior, with most of the gain obtained at very low k. Lemma 5.7. Let D be the uniform distribution over X. Then there is a value function v for which ε(h k) diminishes convexly with k. The proof is constructive (see Appx. F). We construct a v whose approximation error h k Hk is upper bounded by ε(h k) 1 4 q n The diminishing returns property of upper bound is illustrated in Fig. 1. Although Lem. 5.7 describes a special case, we conjecture that this phenomena applies more broadly. Our next result shows that learning k1-order functions can be as powerful as learning subadditive functions; hence, learning with k = k1 is highly expressive. Interestingly, the Figure 1. Upper bound on approximation error showing diminishing returns. Parameters: q = 400, n = 30 and k2 = 10. Strategic Representation connection between general (k-order) functions and subadditive functions is due to the strategic response mapping, ϕ. Lemma 5.8. Consider threshold-subadditive functions: HSA = {sign(g(z)) : g is subadditive on subsets in Z} Then for every threshold-subadditive hg HSA, there is an h Hk1 for which h(ϕh(x)) = hg(ϕhg(x)) x X. 5.1.2. USER ESTIMATION ERROR For estimation error, we give generalization bounds based on VC analysis. The challenge in analyzing functions in Hk is that generalization applies to the strategic 0/1 loss, i.e., 1{h(ϕh(x)) = y}, and so standard bounds (which apply to the standard 0/1 loss) do not hold. To get around this, our approach relies on directly analyzing the VC dimension of the induced class, Fk (a similar approach was taken in Sundaram et al. (2021) for SC). This allows us to employ tools from VC theory, which give the following bound. Theorem 5.9. For any k and m, given a sample set S of size m sampled from D and labeled by some v, we have ε(ˆh) ε(h ) C( q k log( q k /ϵ) + log(1/δ) m w.p. at least 1 δ over S, and for a fixed constant C. In particular, ALG in Sec. 4.3, assuming Y is realizable, returns an ˆh Hk for which: C( q k log( q k /ϵ) + log(1/δ) m w.p. at least 1 δ over S, and for a fixed constant C. The proof relies on Thm. 4.6; since h and fh share weights, the induced Fk can be analyzed as a class of q-variate degreek multilinear polynomials. Since induced functions already incorporate ϕ, VC analysis for the 0/1 loss can be applied. Note that such polynomials have exactly q k degrees of freedom; hence the term in the bound. 5.2. System s Perspective The system s expressive power derives from its flexibility in choosing representations z for items x. 
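Before analyzing the effect of k1 and k2, it helps to quantify this flexibility; the snippet below (an illustrative sketch, not taken from the paper) counts the feasible representations of a single item as the cardinality window widens.

```python
from math import comb

def num_feasible_representations(item_size, k1, k2):
    """Number of truthful representations available for an item with
    `item_size` attributes: sum of C(item_size, r) for r in [k1, k2]."""
    return sum(comb(item_size, r) for r in range(k1, min(k2, item_size) + 1))

# Widening [k1, k2] enlarges the system's choice set ...
print(num_feasible_representations(30, 1, 10))  # many alternatives
print(num_feasible_representations(30, 3, 3))   # far fewer
# ... yet, as shown below (Lemma 5.10 and Cor. 5.11), a smaller k2 can still
# help the system, because k2 also caps the user's effective complexity k.
```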
Since k1, k2 determine which representations are feasible, they directly control the system s power to manipulate; and while the system itself may not have direct control over k1, k2 (i.e., if they are set by exogenous factors like screen size), their values certainly affect the system s ability to optimize engagement. Our next result is therefore unintuitive: for system, a smaller k2 is better (in the worst case), even though it reduces the set of feasible representations. This result is obtained indirectly, by considering the effect of k2 on the user s choice of k. Lemma 5.10. There exists a distribution D and a value function v such that for all k < k k2, system has higher payoff against the optimal h k Hk than against h k Hk . The proof is in Appx. F; it uses the uniform distribution, and the value function from Thm. 5.7. Recalling that the choice of k controls the induced complexity ℓ(Cor. 5.3), and that users should choose k to be no greater than k2, we can conclude the following (in a worst-case sense): Corollary 5.11. For the system, lower k2 is better. Proof in Appx. F. For k1, it turns out that against strategic users it is inconsequential. This is since payoff to the strategic user is derived entirely from k, which is upperbounded by k2, but can be set lower than k1. This invariance is derived immediately from how functions in Hk are defined, namely that w(z) = a for all z with |z| < k (Def. 4.3). This, however, holds when the strategic user chooses to learn over Hk for some k. Consider, alternatively, a strategic user that decides to learn subadditive functions instead. In this case, Thm. 5.8 shows that k1 determines the users effective k; the smaller k1, the smaller the subset of subadditive functions that can be learned. Hence, for user, smaller k1 means worse approximation error. This becomes even more pronounced when facing a na ıve user; for her, a lower k1 means that system now has a large set of representations to choose from; if even one of them has v(z) = 1, the system can exploit this to increase its gains. In this sense, as k1 decreases, payoff to the system (weakly) improves. 6. Discussion Our analysis of the balance of power reveals a surprising conclusion: for both parties, in some sense, simple choice functions are better. For system, lower k improves its payoff through how it relates to k2 (Corr. 5.11). For users, lower k is clearly better in terms of runtime (Lem. 4.8) and estimation error (Thm. 5.9), and for approximation error, lower k has certain benefits as it relates to ℓ (Corr. 5.5), and via diminishing returns (Thm. 5.7). Thus, and despite having conflicting interests to some degree, incentives align. But the story is more complex. For users, there is no definitive notion of better ; strategic users always face a trade-off, and must choose k to balance approximation, estimation, and runtime. In principle, users are free to choose k at will; but since there is no use for k > k2, a system controlling k2 de facto controls k as well. This places a concrete restriction on the freedom of users to choose, and inequitably: for small k2, users whose v has complexity k2 (i.e., having simple tastes ) are less susceptible to manipulation than users with v of complexity > k2 (e.g., fringe users with eclectic tastes) (Thm. 5.4, Corrs. 5.5, 5.6). In this sense, the choice of k2 also has implications on fairness. We leave the further study of these aspects of strategic representation for future work. Strategic Representation Abboud, E., Agha, N., Bshouty, N. 
H., Radwan, N., and Saleh, F. Learning threshold functions with small weights using membership queries. In Proceedings of the twelfth annual conference on Computational learning theory, pp. 318 322, 1999. Abraham, I., Babaioff, M., Dughmi, S., and Roughgarden, T. Combinatorial auctions with restricted complements. In Proceedings of the 13th ACM Conference on Electronic Commerce, pp. 3 16, 2012. Angluin, D. Queries and concept learning. Machine learning, 1988. Balcan, M.-F. and Harvey, N. J. Learning submodular functions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 793 802, 2011. Bechavod, Y., Podimata, C., Wu, Z. S., and Ziani, J. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022. Chen, Y., Wang, J., and Liu, Y. Linear classifiers that encourage constructive adaptation. ar Xiv preprint ar Xiv:2011.00355, 2020a. Chen, Y., Wang, J., and Liu, Y. Strategic recourse in linear classification. ar Xiv preprint ar Xiv:2011.00355, 2020b. Chevaleyre, Y., Endriss, U., Estivie, S., and Maudet, N. Multiagent resource allocation in k -additive domains: preference representation and complexity. Ann. Oper. Res., 163(1):49 62, 2008. Conitzer, V., Sandholm, T., and Santi, P. Combinatorial auctions with k-wise dependent valuations. In AAAI, 2005. Dughmi, S., Immorlica, N., O Donnell, R., and Tan, L. Algorithmic signaling of features in auction design. In Algorithmic Game Theory - 8th International Symposium, SAGT, pp. 150 162. Springer, 2015. Eilat, I., Finkelshtein, B., Baskin, C., and Rosenfeld, N. Strategic classification with graph neural networks. ar Xiv preprint ar Xiv:2205.15765, 2022. Feige, U., Feldman, M., Immorlica, N., Izsak, R., Lucier, B., and Syrgkanis, V. A unifying hierarchy of valuations with complements and substitutes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015. Feldman, V. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163 182, 2009. Ghalme, G., Nair, V., Eilat, I., Talgam-Cohen, I., and Rosenfeld, N. Strategic classification in the dark. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. Haghtalab, N., Immorlica, N., Lucier, B., Mobius, M., and Mohan, D. Persuading with anecdotes. Technical report, National Bureau of Economic Research, 2021. Hardt, M., Megiddo, N., Papadimitriou, C. H., and Wootters, M. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016. Hu, L., Immorlica, N., and Vaughan, J. W. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pp. 259 268, 2019. Jagadeesan, M., Mendler-D unner, C., and Hardt, M. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, 2021. Kamenica, E. and Gentzkow, M. Bayesian persuasion. American Economic Review, 101(6), 2011. Krishnaswamy, A. K., Li, H., Rein, D., Zhang, H., and Conitzer, V. Classification with strategically withheld data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021. Lancaster, K. J. A new approach to consumer theory. Journal of political economy, 74(2):132 157, 1966. Levanon, S. and Rosenfeld, N. Strategic classification made practical. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021. Levanon, S. and Rosenfeld, N. 
Generalized strategic classification and the case of aligned incentives. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022. Miller, J., Milli, S., and Hardt, M. Strategic classification is causal modeling in disguise. In International Conference on Machine Learning, pp. 6917 6926. PMLR, 2020. Milli, S., Miller, J., Dragan, A. D., and Hardt, M. The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pp. 230 239, 2019. Rosenfeld, N., Oshiba, K., and Singer, Y. Predicting choice with set-dependent aggregation. In International Conference on Machine Learning, 2020. Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Strategic Representation Sundaram, R., Vullikanti, A., Xu, H., and Yao, F. Paclearning for strategic classification. In International Conference on Machine Learning, pp. 9978 9988, 2021. Zhang, H. and Conitzer, V. Incentive-aware pac learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 5797 5804, 2021. Zrnic, T., Mazumdar, E., Sastry, S., and Jordan, M. Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34, 2021. Strategic Representation A. Additional Results A.1. Agnostic User Theorem A.1 stated below is the formal version of Theorem 3.2 in Section 3. Theorem A.1 shows that given a a large enough sample size m, an agnostic user s payoff would approach max{µ, 1 µ}, where µ = ED[y]. Theorem A.1. Let 2 2+ m δ < 1/8 and τ = δ 2(1 δ) + q m , then agnostic user s expected payoff guarantee is given by (1 δ)(1 µ) if bµ 1/2 τ (1 δ)µ if bµ 1/2 + τ = 1/2 Otherwise Before we prove the theorem, we state Hoeffding s inequality, which is a well known result from probability theory. Lemma A.2 (Hoeffding). Let Sm = Pm i=1 Xi be the sum of m i.i.d. random variables with Xi [0, 1] and µ = E[Xi] for all i [m], then m µ ε) e 2mε2 and P(Sm m µ ε) e 2mε2. We will use the following equivalent form of the above inequality. Let δ := e 2mε2 i.e. ε = q m and bµ = Sm m . Then we have with probability at-least (1 δ) we have Now we are ready to give the proof of Theorem A.1 Proof of Theorem A.1. We begin with the following supporting lemma. Lemma A.3. Let 2 2+ m δ < 1/8, then τ < 1/2. Proof. The proof follows from following sequence of inequalities, 2 2 + m < δ m > 4(1/δ 1)2 = m > 4(1/δ 1) log(1/δ) δ 2(1 δ) > 2 log(1/δ) Let γ = δ 2(1 δ). We have τ = γ + γ which is an increasing function of δ, so we the maximum is achieved at δ = 1/8 and is given by 1/ 14 + 1/14 < 1/2. This completes the proof of the lemma. From Lemma A.3 we have that 1/2 + τ < 1, hence there is a non-trivial range i.e. bµ [1/2 + τ, 1] where user assigns h(z) = +1 for all z with probability 1. Similarly, when bµ [0, 1/2 τ] user assigns h(z) = 1 for all z with probability 1. We will consider three cases separately. Case 1 (bµ [1/2 + τ, 1]): From Hoeffding s inequality (Eq. 9) we have that with probability at-least (1 δ), = µ 1/2 + δ 2(1 δ) = 1 2(1 δ) Hence, with probability at-least (1 δ) an agnostic user will get a payoff of µ. Hence, the expected payoff in this case is at-least (1 δ)µ 1/2 (1 δ)(1 µ). Strategic Representation Case 2 (bµ [0, 1/2 τ]): Similar to Case 1 here we use the tail bound given by Hoeffding s inequality (Eq. 8) to get with probability at least (1 δ), = µ 1/2 δ 2(1 δ) = 1 2δ Hence, (1 µ) 1 2(1 δ). 
The agnostic user guarantees a payoff of (1 µ) with probability at least (1 δ) in this case. Hence we have the payoff of (1 δ)(1 µ) 1/2 (1 δ)µ in this case. Case 3 (bµ (1/2 τ, 1/2 + τ)): Finally, in this case, the agnostic user chooses h(z) = 1 for all z Z with probability 1/2 and h(z) = 1 for all z Z with probability 1/2. Hence, the expected payoff is given by 1 2(1 µ) = 1/2 irrespective of the true mean µ of positive samples. A.2. Lifted Functions The relation between choice functions and their induced counterparts passes through an additional type of functions that operate on sets of size exactly ℓ. Denote Zℓ= {z 2E : |z| = ℓ}, and note that all feasible representations can be partitioned as Z = Zk1 . . . Zk2. We refer to functions that operate on single-sized sets as restricted functions. Our next result shows that choice functions in Hk can be represented by restricted functions over Zk that are lifted to operate on the entire Z space. This will allow us to work only with sets of size exactly k. Lemma A.4. For each h Hk there exists a corresponding g : Zk { 1} such that h = lift(g), where: lift(g)(z) = 1 if k |z| and z z, |z | = k s.t. g(z ) = 1 1 o.w. Proof. Let h Hk. Then there is a weight function w on sets of size at most k such that either w(z) ( 1, 0) or w(z) > P i [k] n i , and h(z) = sign X z :z z,|z | k w(z ) Define g : Zk { 1, 1} such that for a z Zk, g(z) = 1 if w(z) > 0 and g(z) = 1 otherwise. It is easy to see from the choice of w(z) that h = lift(g). A.3. A Lower Bound on the Running Time As stated in Lemma 4.8, the running time of our algorithm is m( n k ). We argued in Section 4 that is made possible only since weights are sparse, and since the algorithm operates on a finite sample set of size m. If m is large, then this expression can be replaced with q k . We now show that, in the limit (or under full information), the dependence on q k is necessary. The conclusion from Lemma A.5 is that to find the loss minimizer, any algorithm must traverse at least all such h; since there exist q k such functions, this is a lower bound. This is unsurprising; Hk is tightly related to the class of multilinear polynomials, whose degrees of freedom are exactly q k . Lemma A.5. Consider a subclass of Hk composed of choice functions h which have w(z) = a+ for exactly one z with |z| = k, and w(z) = a otherwise. Then, for every such h, there exists a corresponding v, such that h is a unique minimizer (within this subclass) of the error w.r.t. v. Proof. Let z1 and z2 be distinct k size subsets, and let a (0, 1) and a+ > P i [1,k] n i . Further, let wi, i [1, 2] be a weight function that assigns a+ to zi and a to all other subsets of size at most k. Let h1 and h2 be two function in Hk defined by the binary weighted functions w1 and w2 respectively. Observe that for vi = fhi the approximation error (see (Eq. (6))) of hi is zero. Hence, to prove the lemma it is sufficient to show that fh1 = fh2. Suppose fh1 = fh2. Since z1 = z2, there exists an x X such that z1 x but z2 x. From Theorem 4.6 and the choice of a+ and a , this implies fh1(x) = 1 but fh2(x) = 1, and hence, a contradiction. Strategic Representation B. Additional Related Work Learning set functions. Concept learning refers to learning a binary function over hypercubes (Angluin, 1988) through a query access model. Abboud et al. (1999) provide a lower bound on membership queries to exactly learn a threshold function over sets where each element has small integer valued weights. 
Our learning framework admits large weights and has only a sample access in contrast with the query access studied in this literature. Feldman (2009) show that the problem of learning set functions with sample access is computationally hard. However, we show (see Section 5) that the strategic setting is more nuanced; a more complex representations are disadvantageous for both user and the system. In other words, it is in the best interest of system to choose smaller (and much simpler) representations. A by-now classic work in learning theory studies the learnability from data of submodular (non-threshold) set functions (Balcan & Harvey, 2011). Though we consider learning subadditive functions in this work, an extension to submodular valuations is a natural extension. Learning set functions is in general hard, even for certain subcalsses such as submodular functions. Rosenfeld et al. (2020) show that it s possible to learn certain parameterized subclasses of submodular functions, when the goal is to use them for optimization. But this refers to learning over approximate proxy losses; whereas in our work, we show that learning is possible directly over the 0/1 loss. Hierarchies of set functions. Conitzer et al. (2005) (and independently, Chevaleyre et al. (2008)) suggest a notion of k-wise dependent valuations, to which our Definition 4.1 is related. We also allow up to k-wise dependencies, but our valuations need not be positive and we focus on their sign (an indication whether an item is acceptable or not). Our set function valuations are also over item attributes rather than multiple items. Despite the differences, the definitions have a shared motivation: Conitzer et al. (2005) believe that this type of valuation is likely to arise in many economic scenarios, especially since due to cognitive limitations, it might be difficult for a player to understand the inter-relationships between a large group of items. Hierarchies of valuations with limited dependencies/synergies have been further studied by Abraham et al. (2012); Feige et al. (2015) under the title hypergraph valuations . These works focus on monotone valuations that have only positive weights for every subset, and are thus mathematically different than ours. C. A Missing Proof from Section 2 Observation 2.1. Every best-response z ϕh(x) induces the same value in the user s objective function (Eq. (3)). Proof. The proof follows from the definition of best response (Eq. 2). Let z1, z2 ϕh(x). Then since ϕh consists of only best response, we have either h(z1) = h(z2) = 1, or h(z1) = h(z2) = 1. Hence, h(z1) = v(x) if and only if h(z2) = v(x) for any z1, z2 ϕh(x). D. A Missing Proof from Section 3 and an Additional Example Lemma 3.1. If system plays the benevolent strategy: ϕbenev h (x) = argmax z x,|z| [k1,k2] {1{h(z) = sign(v(x))}, then the na ıve approach maximizes user payoff. Proof. Since a na ıve user plays h(z) = sign(v(z)), for each x X the payoff of the user is maximized if in response the system plays a z x such that sign(v(z)) = sign(v(x)). Observe that, if there exists a z Z and z x, such that sign(v(z)) = sign(v(x)) then z ϕbenev h (x) and consequently the user s payoff is maximized for such an x. Conversely, if there exists no z Z and z x such that sign(v(z)) = sign(v(x)), then no truthful system can ensure more than zero utility for such an x. Hence, a benevolent system maximizes the utility of a na ıve user. 
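For concreteness, the manipulation in Example 1 can be verified mechanically. The sketch below (Python, with hypothetical names; exact expectations rather than sampling) reproduces the naive user's payoff of ε against a strategic system.

```python
from fractions import Fraction

def example_1_naive_payoff(eps):
    # Items, values, and distribution from Example 1 (k1 = k2 = 1).
    items = {"x1": {"a1"}, "x2": {"a1", "a2"}, "x3": {"a2"}}
    value = {"x1": +1, "x2": -1, "x3": -1}
    dist = {"x1": eps / 2, "x2": 1 - eps, "x3": eps / 2}
    # Naive choice function: h(z) = sign(v(z)) on singleton representations.
    h = {"a1": +1, "a2": -1}
    payoff = 0
    for name, x in items.items():
        # Strategic best response: show an attribute with h = +1 if possible.
        shown = next((a for a in sorted(x) if h[a] == +1), sorted(x)[0])
        payoff += dist[name] * (h[shown] == value[name])
    return payoff

print(example_1_naive_payoff(Fraction(1, 10)))  # prints 1/10, i.e., exactly eps
```

The naive user is correct only on x1 and x3, each of probability ε/2, because the system represents x2 by a1; this matches the payoff of ε stated in Example 1.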
We now present an additional example showing how a naïve user's choice function can be manipulated by a strategic system and how, as a consequence, the user may obtain arbitrarily small payoff against a strategic system.

Example 2. Let $x_1 = \{a_1, a_2\}$, $x_2 = \{a_1, a_3\}$, $x_3 = \{a_1, a_4\}$, $x_4 = \{a_2, a_3\}$, $x_5 = \{a_3, a_4\}$ with $\mathrm{sign}(v(x_1)) = \mathrm{sign}(v(x_5)) = \mathrm{sign}(v(a_2)) = \mathrm{sign}(v(a_4)) = +1$ and $\mathrm{sign}(v(x_2)) = \mathrm{sign}(v(x_3)) = \mathrm{sign}(v(x_4)) = \mathrm{sign}(v(a_1)) = \mathrm{sign}(v(a_3)) = -1$. Further, let $k_1 = k_2 = 1$ with $z_i = a_i$ as representations, and let $D$ be a distribution supported over $(x_1, x_2, x_3, x_4, x_5)$ that places all but an $\varepsilon$ fraction of its mass on $x_3$ and $x_4$. For this instance the naïve user's choice function is $h = (-1, +1, -1, +1)$ on $(z_1, z_2, z_3, z_4)$. A strategic system can manipulate the naïve user into non-preferred choices by using the representations $(a_2, a_1, a_4, a_2, a_4)$ for $(x_1, x_2, x_3, x_4, x_5)$. Note that the naïve user expects $z_1$ as the representation for $x_3$, since $h(z_1) = \mathrm{sign}(v(x_3)) = -1$ whereas $h(z_4) = +1 \ne \mathrm{sign}(v(x_3))$. However, the strategic system chooses $a_4$, since under the given $h$ we have $h(a_4) = +1$. The naïve user's payoff in this case is reduced to $\varepsilon$, which can be arbitrarily small.

E. Missing Proofs from Section 4

Lemma 4.2. For any $h : Z \to \{\pm 1\}$, there exist $k \le k_2$ and a corresponding $k$-order function $h'$ such that $h(\phi_h(x)) = h'(\phi_{h'}(x))$ for all $x \in X$.

Proof. Define $k$ as follows: if $h(z) = -1$ for all $z \in Z$ then $k = k_1$, and otherwise
$$k = \max\{k' \in [k_1, k_2] : \exists z \text{ with } |z| = k' \text{ such that } h(z) = 1, \text{ but } h(z') = -1 \text{ for all } z' \subsetneq z \text{ with } z' \in Z\}.$$
Define $h'$ as follows: for $|z| < k$, $h'(z) = -1$; for $|z| \ge k$, $h'(z) = 1$ if there exists $z'$ with $|z'| = k$, $z' \subseteq z$, and $h(z') = 1$, and $h'(z) = -1$ otherwise.

First, we argue that $h'$ defined as above satisfies $h'(\phi_{h'}(x)) = h(\phi_h(x))$ for all $x \in X$. Suppose $h(\phi_h(x)) = 1$. Then there exists $z \in Z$ such that $z \subseteq x$ and $h(z) = 1$. From the choice of $k$, we may assume without loss of generality that $|z| = k$. Further, from the construction of $h'$, we have $h'(z) = 1$, and hence $h'(\phi_{h'}(x)) = h(\phi_h(x)) = 1$. Now suppose $h(\phi_h(x)) = -1$. Then for all $z \subseteq x$ we have $h(z) = -1$. In particular, for all $z \subseteq x$ such that $|z| = k$ we have $h(z) = -1$. This implies that for all $z \subseteq x$ such that $|z| \ge k$ we have $h'(z) = -1$: if there existed $z \subseteq x$ with $|z| \ge k$ and $h'(z) = 1$, then from the definition of $h'$ there would exist $z' \subseteq z \subseteq x$ with $|z'| = k$ and $h(z') = 1$, a contradiction. Additionally, by definition, for all $z \subseteq x$ such that $|z| < k$ we have $h'(z) = -1$. Hence, if $h(\phi_h(x)) = -1$ then $h'(\phi_{h'}(x)) = -1$.

Now we show that $h'$ is a $k$-order function. Let $a_- \in (-1, 0)$ and set $w(z) = a_-$ for all $z$ with $|z| \le k$ and $h'(z) = -1$. Further, for all $z$ with $|z| = k$, if $h'(z) = 1$ then let $w(z) = a_+ > \sum_{i \in [k]} \binom{n}{i}$. For all $z \in Z$, if $|z| < k$ then by construction of $h'$ we have $h'(z) = -1$, and since $w(z') = a_- < 0$ for all $z' \in \Gamma_k(z)$, we have $\sum_{z' \in \Gamma_k(z)} w(z') < 0$. Hence, for all $z \in Z$ with $|z| < k$,
$$h'(z) = \mathrm{sign}\Big(\sum_{z' \in \Gamma_k(z)} w(z')\Big) = -1.$$
Similarly, for all $z \in Z$ with $|z| \ge k$, by construction of $h'$ we have $h'(z) = 1$ if and only if there exists $z' \subseteq z$ with $|z'| = k$ such that $h'(z') = h(z') = 1$. In particular, if $|z| \ge k$ and $h'(z) = 1$ then there exists $z' \subseteq z$ with $|z'| = k$ such that $w(z') = a_+$. Since $a_+ > \sum_{i \in [k]} \binom{n}{i}$, $a_- \in (-1, 0)$, and $k_2 \le n$, if $|z| \ge k$ and $h'(z) = 1$ then
$$h'(z) = \mathrm{sign}\Big(\sum_{z' \in \Gamma_k(z)} w(z')\Big) = 1.$$
Finally, if $|z| \ge k$ and $h'(z) = -1$ then from the definition of $h'$ there does not exist $z' \subseteq z$ with $|z'| = k$ such that $w(z') = a_+$. Since $a_- \in (-1, 0)$, if $|z| \ge k$ and $h'(z) = -1$ then
$$h'(z) = \mathrm{sign}\Big(\sum_{z' \in \Gamma_k(z)} w(z')\Big) = -1.$$

Lemma 4.4. For all $k$, $H_{k-1} \subseteq H_k$ and $H_k \setminus H_{k-1} \ne \emptyset$.

Proof.
Arbitrarily choose $u \subseteq E$ (recall $E$ is the ground set) such that $|u| = k$, and let $w(u) = a_{k,+} > \sum_{i \in [k]} \binom{n}{i}$ (we write $a_{k,+}$ instead of $a_+$ to distinguish between the thresholds for $k$ and $k-1$). Also, for all $z \ne u$ with $|z| \le k$, let $w(z) = a_- \in (-1, 0)$. Let $h : Z \to \{\pm 1\}$ be defined as
$$h(z) = \mathrm{sign}\Big(\sum_{z' \in \Gamma_k(z)} w(z')\Big).$$
From the definition of $H_k$, we have $h \in H_k$. We show that $h \notin H_{k-1}$. First, observe that for all $z \in Z$, $h(z) = 1$ if and only if $u \subseteq z$. Suppose $h \in H_{k-1}$. Then there is a weight function $w'$ on sets of size at most $k-1$ such that either $w'(z) = a_- \in (-1, 0)$ or $w'(z) = a_{k-1,+} > \sum_{i \in [k-1]} \binom{n}{i}$, and
$$h(z) = \mathrm{sign}\Big(\sum_{z' \in \Gamma_{k-1}(z)} w'(z')\Big).$$
Let $z \in Z$ be such that $u \subseteq z$. This implies $h(z) = 1$. Hence there exists $u' \subseteq z$ such that $|u'| = k-1$ and $w'(u') = a_{k-1,+}$. Let $\tilde{z} \in Z$ be such that $u' \subseteq \tilde{z}$ but $u \nsubseteq \tilde{z}$. Such a $\tilde{z}$ exists because $u \nsubseteq u'$ (as $|u'| = k-1 < k = |u|$). Further, as $u \nsubseteq \tilde{z}$, we have $h(\tilde{z}) = -1$. But since $u' \subseteq \tilde{z}$, from the choice of $a_{k-1,+}$ and $a_-$ we have
$$\sum_{z' \in \Gamma_{k-1}(\tilde{z})} w'(z') > 0 \;\implies\; \mathrm{sign}\Big(\sum_{z' \in \Gamma_{k-1}(\tilde{z})} w'(z')\Big) = h(\tilde{z}) = 1.$$
This gives a contradiction. Hence, $h \notin H_{k-1}$.

Theorem 4.6. For any $h \in H_k$ with weights $w$, $h(z) = \mathrm{sign}\big(\sum_{z' \in \Gamma_k(z)} w(z')\big)$, its induced $f_h \in F_k$ can be expressed using the same weights $w$, but with the summation over subsets of $x$, i.e.:
$$f_h(x) = \mathrm{sign}\Big(\sum_{z \in \Gamma_k(x)} w(z)\Big).$$

Proof. Since $h \in H_k$, $w$ satisfies the following two properties (see Definition 4.3): (1) either $w(z) = a_- \in (-1, 0)$ or $w(z) = a_+ > \sum_{i \in [k]} \binom{n}{i}$; (2) $w(z) = a_-$ for all $z$ with $|z| < k$. Further, from the definition of $f_h$, we have $f_h(x) = h(\phi_h(x))$. This implies
$$f_h(x) = 1 \iff \exists z \in Z, z \subseteq x, \text{ such that } h(z) = 1.$$
From the above two properties of the weight function, we have
$$h(z) = 1 \iff \exists z' \subseteq z, |z'| = k, \text{ such that } w(z') = a_+ > 0.$$
From the above two equations we conclude that
$$f_h(x) = 1 \iff \exists z \subseteq x, |z| = k, \text{ such that } w(z) = a_+ > 0.$$
Finally, the two properties of $w$ ensure that $f_h(x) = \mathrm{sign}\big(\sum_{z \in \Gamma_k(x)} w(z)\big)$.

Theorem 4.7. For any $k \in [k_2]$, if $Y$ is realizable then ALG returns an $\hat{h}$ that minimizes the empirical error.

Proof. Throughout, for ease of notation, we use $x \in S$ to denote $x \in \{x_1, \ldots, x_m\}$. Let $Z_k = \{z : |z| = k \text{ and } \exists x \in S \text{ such that } z \subseteq x\}$. Recall that $Z_{k,S}$ equals $Z_k$ at the beginning of the algorithm. Also, for each $z \in Z_k$, let $X_z = \{x \in S \mid z \subseteq x\}$. The following lemma characterizes the training sets for which there exists an $h \in H_k$ with zero empirical error.

Lemma E.1. There exists an $h \in H_k$ with zero empirical error if and only if for every $x \in S^+$ there exists a $z \in Z_k$ with $z \subseteq x$ such that $z \nsubseteq x'$ for all $x' \in S^-$.

Proof. Suppose there exists an $h \in H_k$ with zero empirical error. Since $h \in H_k$, from Lemma A.4 there exists a $g : Z_k \to \{\pm 1\}$ such that $h = \mathrm{lift}(g)$. We state the following observation; its proof follows from the definition of $\mathrm{lift}(g)$.

Observation E.2. (1) For every $z \in Z$ such that $h(z) = 1$ there is a $z' \in Z_k$ such that $z' \subseteq z$ and $g(z') = 1$; (2) for every $z \in Z$ such that $h(z) = -1$, it must be that for every $z' \in Z_k$ with $z' \subseteq z$ we have $g(z') = -1$.

Further, as the empirical error of $h$ is zero, we have the following observation.

Observation E.3. (1) For every $x \in S^+$ there is a $z \in Z$ with $z \subseteq x$ such that $h(z) = 1$; (2) for every $x \in S^-$, it must be that for every $z \in Z$ with $z \subseteq x$ we have $h(z) = -1$.

Proof. For every $x \in S^+$, since the empirical error is zero, we have $h(\phi_h(x)) = 1$. From the definition of $\phi_h$, this implies there is a $z \in Z$ with $z \subseteq x$ such that $h(z) = 1$. Similarly, for every $x \in S^-$, since the empirical error is zero, we have $h(\phi_h(x)) = -1$. Again from the definition of $\phi_h$, it must be that for every $z \in Z$ with $z \subseteq x$ we have $h(z) = -1$.

Hence, from Observations E.2 and E.3, for every $x \in S^+$ there is a $z \in Z_k$ with $z \subseteq x$ such that $g(z) = 1$. Similarly, for every $x \in S^-$ it must be that for every $z \in Z_k$ with $z \subseteq x$ we have $g(z) = -1$.
Hence, for every $x \in S^+$ there exists a $z \in Z_k$ with $z \subseteq x$ such that $z \nsubseteq x'$ for all $x' \in S^-$.

Conversely, suppose that for every $x \in S^+$ there exists a $z \in Z_k$ with $z \subseteq x$ such that $z \nsubseteq x'$ for all $x' \in S^-$. Then define $g : Z_k \to \{\pm 1\}$ as follows: (a) for all $z \in Z_k$ such that $z \subseteq x$ for some $x \in S^-$, let $g(z) = -1$; (b) for all $z \in Z_k$ such that $z \subseteq x$ for some $x \in S^+$ and $z \nsubseteq x'$ for every $x' \in S^-$, let $g(z) = 1$; (c) for all $z \in Z_k$ such that $z \nsubseteq x$ for every $x \in S$, let $g(z) = -1$. From the supposition, we have that for every $x \in S^+$ there is a $z \in Z_k$ with $z \subseteq x$ such that $g(z) = 1$. Now define $h = \mathrm{lift}(g)$. To show that the empirical error of $h$ is zero, it is sufficient to show that $h(\phi_h(x)) = -1$ for every $x \in S^-$ and $h(\phi_h(x)) = 1$ for every $x \in S^+$. Let $x \in S^-$. From the definition of $g$, for every $z \in Z_k$ such that $z \subseteq x$, $g(z) = -1$. Hence, from the definition of lift, for every $z \in Z$ such that $z \subseteq x$, $h(z) = -1$. Now from the definition of best response, we have $h(\phi_h(x)) = -1$. Similarly, if $x \in S^+$, from our supposition and the definition of $g$ there exists a $z \in Z_k$ such that $z \subseteq x$ and $g(z) = 1$. Hence, from the definition of lift, there exists a $z \in Z$ such that $z \subseteq x$ and $h(z) = 1$. Finally, from the definition of best response, we have $h(\phi_h(x)) = 1$.

Now, if $Y \in F_k$ then there exists an $h \in H_k$ such that the induced function of $h$ equals $Y$, that is, $f_h = Y$. This implies there exists an $h \in H_k$ which attains zero empirical error on the training set. Since the empirical error is always non-negative, such an $h$ minimizes the empirical error in this case. Hence, from Lemma E.1, it follows that if $Y$ is realizable then for every $x \in S^+$, at Step 17 of ALG there is either a $z \in Z^+$ with $z \subseteq x$, or a $z \in Z_{k,S}$ with $z \subseteq x$. Now, observe that at the beginning of Step 22, the set $Z^+$ satisfies the following:
$$z \in Z^+ \iff \exists x \in S^+ \text{ such that } z \subseteq x \text{ and } \nexists x' \in S^- \text{ such that } z \subseteq x'. \qquad (10)$$
Further, at Step 22, for a $z \in Z_k$, $w(z) = a_+$ if and only if $z \in Z^+$. This implies
$$w(z) = a_+ \iff \exists x \in S^+ \text{ such that } z \subseteq x \text{ and } \nexists x' \in S^- \text{ such that } z \subseteq x'. \qquad (11)$$
Also, from Theorem 4.6, the induced function $f_{\hat{h}}$ corresponding to the returned $\hat{h}$ is given by
$$f_{\hat{h}}(x) = \mathrm{sign}\Big(\sum_{z \in \Gamma_k(x)} w(z)\Big). \qquad (12)$$
To complete the proof of the theorem, we show that $f_{\hat{h}}(x_i) = y_i$ for every $x_i \in S$. Suppose $x \in S^-$. Then from Equations (10) and (11), for every $z \subseteq x$ with $|z| \le k$ we have $w(z) = a_- < 0$, and hence from Equation (12) we have $f_{\hat{h}}(x) = y = -1$. Similarly, suppose $x \in S^+$. Then from Equation (11), there exists $z \subseteq x$ with $|z| = k$ such that $w(z) = a_+$. Hence, from Equation (12), and noting that $a_+ > \sum_{i \in [k]} \binom{n}{i}$ and $a_- \in (-1, 0)$, we have $f_{\hat{h}}(x) = y = 1$.

Lemma 4.8. Let $n$ be the size of the elements in $X$, $m$ the number of samples, and $k \le k_2$ the user's choice of complexity. Then ALG runs in $O(m\binom{n}{k})$ time.

Proof. In the first two for-loops, for each $x \in S^+$ (or in $S^-$) the internal for-loop runs in $O(\binom{n}{k})$ time. Since $|S| \le m$, this gives a total of at most $O(m\binom{n}{k})$ operations. Similarly, Step 22 places weights on at most $m\binom{n}{k}$ subsets, and hence runs in $O(m\binom{n}{k})$ time. Hence, ALG runs in $O(m\binom{n}{k})$ time.

F. Missing Proofs from Section 5

Lemma 5.2. Let $k \le k_2$. Then for every $h \in H_k$, the induced complexity of the corresponding $f_h$ is $\ell \le k$.

Proof. Let $\ell = \min\{\ell' \in [1, k] : \text{there exists } g : Z_{\ell'} \to \{\pm 1\} \text{ such that } h = \mathrm{lift}(g)\}$. From Lemma A.4, we know $\ell \le k$. Further, let $g : Z_\ell \to \{\pm 1\}$ be such that $h = \mathrm{lift}(g)$. Now, from the definition of $f_h$, for all $x \in X$, $f_h(x) = 1$ if and only if there exists $z \in Z$ such that $z \subseteq x$ and $h(z) = 1$. Since $h = \mathrm{lift}(g)$, $f_h(x) = 1$ if and only if there exists $z \in Z_\ell$ such that $z \subseteq x$ and $g(z) = 1$. This implies the induced complexity of $f_h$ is $\ell \le k$.

Corollary 5.3. Let $F_k = F_{H_k}$ be the induced function class of $H_k$, as defined in Def. 4.5.
Then $F_k = \{f : X \to \{\pm 1\} : f \text{ has induced complexity at most } k\}$.

Proof. From Lemma 5.2, we know that functions in $F_k$ have induced complexity at most $k$. We now show that if $f$ has induced complexity at most $k$ then there is an $h \in H_k$ such that $f = f_h$. Let the induced complexity of $f$ be $\ell \le k$. Then there exists a $g : Z_\ell \to \{\pm 1\}$ such that
$$f(x) = 1 \iff \exists z \in Z_\ell \text{ such that } z \subseteq x \text{ and } g(z) = 1. \qquad (13)$$
Let $h = \mathrm{lift}(g)$. First we show that $f(x) = f_h(x)$ for all $x \in X$. Since $h$ is a lift of $g$, if $g(z') = 1$ for some $z' \in Z_\ell$ then for all $z \in Z$ such that $z' \subseteq z$ we have $h(z) = 1$; more precisely,
$$h(z) = 1 \iff \exists z' \in Z_\ell \text{ such that } z' \subseteq z \text{ and } g(z') = 1. \qquad (14)$$
Hence, from Equations (13) and (14), for all $x \in X$, $f(x) = 1$ if and only if there exists $z \in Z$ such that $z \subseteq x$ and $h(z) = 1$. From the definition of induced function, this implies $f(x) = f_h(x)$ for all $x \in X$.

To show $h \in H_k$, we construct a weight function $w$ on sets of size at most $k$. For $z \in 2^E$ with $|z| < k$, let $w(z) = a_- \in (-1, 0)$. For $z \in Z_k$, let $w(z) = a_+$ for some $a_+ > \sum_{i \in [1,k]} \binom{n}{i}$ if there exists $z' \subseteq z$ with $|z'| = \ell$ and $g(z') = 1$, and let $w(z) = a_-$ otherwise. Now, from Equation (14), $h(z) = 1$ if and only if there exists $z' \in Z_\ell$ such that $z' \subseteq z$ and $g(z') = 1$. Hence, from the definition of $w$, $h(z) = 1$ if and only if there exists a $z' \subseteq z$ such that $w(z') = a_+$. In particular, since $a_+ > \sum_{i=1}^{k} \binom{n}{i}$ and $a_- \in (-1, 0)$, we have
$$h(z) = \mathrm{sign}\Big(\sum_{z' : z' \subseteq z, |z'| \le k} w(z')\Big).$$

Theorem 5.4. If $\ell^* \le k$ then the approximation error is 0.

Proof. Since the induced complexity of $v$ is $\ell^*$, there is a function $g : Z_{\ell^*} \to \{\pm 1\}$ such that
$$\mathrm{sign}(v(x)) = \begin{cases} 1 & \text{if } \exists z \subseteq x, |z| = \ell^*, \text{ and } g(z) = 1 \\ -1 & \text{o.w.} \end{cases}$$
Choose $a_+ > \sum_{i \in [1,k]} \binom{n}{i}$ and $a_- \in (-1, 0)$, and define the weight function $w$ on sets of size at most $k$ as follows: (a) if $|z| < k$ then let $w(z) = a_-$; (b) if $|z| = k$ and there exists a $z' \subseteq z$ such that $g(z') = 1$ then $w(z) = a_+$; and (c) if $|z| = k$ and there does not exist a $z' \subseteq z$ such that $g(z') = 1$ then $w(z) = a_-$. Now define $h$ using $w$ as follows:
$$h(z) = \mathrm{sign}\Big(\sum_{z' \in \Gamma_k(z)} w(z')\Big).$$
We now show that for each $x \in X$, $h(\phi_h(x)) = f_h(x) = \mathrm{sign}(v(x))$, implying $h^*_k = h$. Suppose $f_h(x) = 1$ for an $x \in X$. Then there exists a $z \in Z$ such that $z \subseteq x$ and $h(z) = 1$. From Theorem 4.6 and the choice of $a_+$ and $a_-$, there exists a $z \subseteq x$ with $|z| = k$ such that $w(z) = a_+$. From the construction of $w$ this implies there exists a $z' \subseteq x$ with $|z'| = \ell^*$ such that $g(z') = 1$. But from the above definition of $v$ this implies $\mathrm{sign}(v(x)) = 1$. Similarly, we can argue that if $f_h(x) = -1$ then $\mathrm{sign}(v(x)) = -1$ for any $x \in X$. Hence, $h(\phi_h(x)) = \mathrm{sign}(v(x))$ for each $x \in X$, implying $h^*_k = h$ and zero approximation error for $h$.

Corollary 5.5. If $\ell^* \le k_2$ and the distribution $D$ has full support on $X$, then $k = \ell^*$ is the smallest $k$ that gives zero approximation error.

Proof. In the proof of Theorem 5.4, we showed that $k = \ell^*$ gives zero approximation error. Hence, to prove the corollary it is sufficient to show that for $k < \ell^*$ the approximation error is not zero. Suppose there is an $h \in H_k$ with $\varepsilon(h) = 0$ and $k < \ell^*$. Since the distribution $D$ has full support, this implies $f_h(x) = \mathrm{sign}(v(x))$ for all $x \in X$. Hence, the induced complexity of $v$ is at most $k < \ell^*$, giving a contradiction.

Corollary 5.6. If $\ell^* > k_2$, then the approximation error weakly decreases with $k$, i.e., $\varepsilon(h^*_k) \le \varepsilon(h^*_{k-1})$ for all $k \le k_2$. Furthermore, if the distribution $D$ has full support on $X$ then no $k$ can achieve zero approximation error.

Proof. The approximation error weakly decreases because $H_{k-1} \subseteq H_k$ for all $k \le k_2$. Also, from the proof of Corollary 5.5, it is clear that no $k$ can achieve zero approximation error.

Lemma 5.7. Let $D$ be the uniform distribution over $X$. Then there is a value function $v$ for which $\varepsilon(h^*_k)$ diminishes convexly with $k$.

Proof.
We construct a $v$ such that the approximation error for $h^*_k \in H_k$ is
$$\varepsilon(h^*_k) = \frac{1}{4} \cdot \frac{|X_k|}{\binom{q}{n}},$$
with $X_k$ as defined below. It is easy to see that $\varepsilon(h^*_k)$ diminishes convexly with $k$ (see Fig. 1). We choose $k_2$ elements $e_1, e_2, \ldots, e_{k_2} \in E$ (the ground set), and let $z_e$ be the size-$k_2$ subset consisting of these $k_2$ elements. For a $v : X \to \mathbb{R}$, let $X^+_v = \{x \in X \mid \mathrm{sign}(v(x)) = 1\}$ and $X^-_v = \{x \in X \mid \mathrm{sign}(v(x)) = -1\}$. We first show that there exists a $v$ with the following three properties:
1. If $x \in X^+_v$ then there exists a $z \subseteq z_e$ such that $z \subseteq x$.
2. For $k \in [1, k_2]$, let $X_k = \{x \in X \mid \exists z \subseteq z_e, |z| = k, \text{ and } z \subseteq x\}$. Then $|X^+_v \cap X_k| = \frac{3}{4}\big(\sum_{\ell=k}^{k_2} \binom{k_2}{\ell}\binom{q-k_2}{n-\ell}\big)$, for every $k \in [1, k_2]$.
3. For every $z \subseteq z_e$, let $X_z = \{x \in X \mid z \subseteq x\}$. Then $|X^+_v \cap X_z| = \frac{3}{4}\binom{q-k}{n-k}$, where $|z| = k$.

We construct such a $v$ iteratively. We begin by making the following observation.

Observation F.1. For each $k \in [1, k_2]$, $|X_k| = \sum_{\ell=k}^{k_2} \binom{k_2}{\ell}\binom{q-k_2}{n-\ell}$.

Proof. Recall that $X$ consists of the size-$n$ subsets of $E$. For $k \in [1, k_2]$ we wish to count the size-$n$ subsets of $E$ that contain a $z \subseteq z_e$ with $|z| = k$. This is equivalent to choosing a fixed size-$\ell$ subset of $z_e$ with $\ell \ge k$ (the intersection with $z_e$) and then choosing the remaining $n - \ell$ elements from the $q - k_2$ elements of $E$ not in $z_e$. For every $\ell \ge k$ we can choose the size-$\ell$ subset of $z_e$ in $\binom{k_2}{\ell}$ ways, and for each such choice we can choose the remaining $n - \ell$ elements in $\binom{q-k_2}{n-\ell}$ ways. Since this holds for any $\ell \in [k, k_2]$, we have $|X_k| = \sum_{\ell=k}^{k_2} \binom{k_2}{\ell}\binom{q-k_2}{n-\ell}$.

Constructing $v$: The idea is to iteratively add elements of $X$ to $X^+_v$, that is, to iteratively determine the $x \in X$ such that $\mathrm{sign}(v(x)) = 1$. In the first round, we arbitrarily choose $\frac{3}{4}\binom{q-k_2}{n-k_2}$ elements from $X_{k_2}$ and add them to $X^+_v$; the remaining $\frac{1}{4}\binom{q-k_2}{n-k_2}$ are added to $X^-_v$. At round $k$, assume we have constructed a $v$ satisfying the above three properties for all $k' > k$, that is:
1. If $x \in X^+_v$ then there exists a $z \subseteq z_e$ such that $z \subseteq x$.
2. $|X^+_v \cap X_{k'}| = \frac{3}{4}\big(\sum_{\ell=k'}^{k_2} \binom{k_2}{\ell}\binom{q-k_2}{n-\ell}\big)$, for every $k' \in [k+1, k_2]$.
3. For every $z \subseteq z_e$ with $|z| = k' > k$, $|X^+_v \cap X_z| = \frac{3}{4}\binom{q-k'}{n-k'}$.

Hence, at round $k$, there are $\binom{k_2}{k}\binom{q-k_2}{n-k}$ elements of $X_k$ which are not yet in $X^+_v$ or $X^-_v$. From these elements of $X_k$, for every size-$k$ subset $z \subseteq z_e$ we arbitrarily choose $\frac{3}{4}\binom{q-k_2}{n-k}$ elements containing $z$ and add them to $X^+_v$, and add the remaining $\frac{1}{4}\binom{q-k_2}{n-k}$ elements to $X^-_v$. Now, observe that $v$ satisfies the first two properties for every $k' \in [k, k_2]$ after this procedure. We argue that $v$ satisfies the third property for any $z \subseteq z_e$ such that $|z| = k$. The size-$n$ sets in $X$ containing a $z \subseteq z_e$ with $|z| = k$ can be partitioned according to which size-$\ell$ subset of $z_e$ ($\ell \ge k$) they contain. In particular, we have the following combinatorial equality:
$$\binom{q-k}{n-k} = \sum_{\ell=k}^{k_2} \binom{k_2-k}{\ell-k}\binom{q-k_2}{n-\ell}.$$
In the above expression, $\binom{q-k_2}{n-\ell}$ corresponds to the number of size-$n$ sets that contain only a specific size-$\ell$ ($\ell \ge k$) subset of $z_e$. Since our iterative procedure ensures that from each cell of this partition at least a $\frac{3}{4}$ fraction of the $x$'s is added to $X^+_v$, we have that $v$ satisfies the third property.

Optimal $h^* \in H_k$: From the construction of $v$, it is clear that the optimal $h^* \in H_k$ for the above constructed $v$, for any $k \in [1, k_2]$, satisfies the following: for every $z \in Z$, $h^*(z) = 1$ if and only if there exists a $z' \subseteq z_e$ with $|z'| = k$ and $z' \subseteq z$. Further, as $D$ is the uniform distribution, for such an $h^*$:
$$\varepsilon(h^*_k) = \frac{1}{4} \cdot \frac{|X_k|}{\binom{q}{n}}.$$

Lemma 5.8. Consider threshold-subadditive functions:
$$H_{SA} = \{\mathrm{sign}(g(z)) : g \text{ is subadditive on subsets in } Z\}.$$
Then for every threshold-subadditive $h_g \in H_{SA}$, there is an $h \in H_{k_1}$ for which $h(\phi_h(x)) = h_g(\phi_{h_g}(x))$ for all $x \in X$.

Proof. Let $h \in H_{SA}$ with a corresponding $g : Z \to \mathbb{R}$ such that $h(z) = \mathrm{sign}(g(z))$ for all $z \in Z$. Choose $a_+ > \sum_{i=1}^{k_1} \binom{n}{i}$ and $a_- \in (-1, 0)$.
Define a weight function $w$ on sets of size at most $k_1$ as follows:
$$w(z) = \begin{cases} a_+ & \text{if } |z| = k_1 \text{ and } h(z) = 1 \\ a_- & \text{o.w.} \end{cases}$$
Let $h' \in H_{k_1}$ be the function defined by the binary weight function $w$ as defined above. We argue that for every $x \in X$, $h(\phi_h(x)) = h'(\phi_{h'}(x))$. For every $x \in X$, $h(\phi_h(x)) = 1$ if and only if there is a $z \in Z$ with $z \subseteq x$ such that $h(z) = 1$. Since $g$ is subadditive, we have
$$0 \le g(z) \le \sum_{z' \subseteq z,\, z' \ne z,\, z' \in Z} g(z'). \qquad (15)$$
A simple recursive argument then implies that $h(\phi_h(x)) = 1$ if and only if there is a $z \subseteq x$ such that $|z| = k_1$ and $h(z) = 1$, and hence $w(z) = a_+$. Hence, from Theorem 4.6, this implies $h(\phi_h(x)) = 1$ if and only if $h'(\phi_{h'}(x)) = 1$.

Theorem 5.9. For any $k$ and $m$, given a sample set $S$ of size $m$ sampled from $D$ and labeled by some $v$, we have
$$\varepsilon(\hat{h}) - \varepsilon(h^*) \le C \cdot \frac{\binom{q}{k}\log\big(\binom{q}{k}/\epsilon\big) + \log(1/\delta)}{m}$$
with probability at least $1 - \delta$ over $S$, for a fixed constant $C$. In particular, ALG in Sec. 4.3, assuming $Y$ is realizable, returns an $\hat{h} \in H_k$ for which
$$\varepsilon(\hat{h}) \le C \cdot \frac{\binom{q}{k}\log\big(\binom{q}{k}/\epsilon\big) + \log(1/\delta)}{m}$$
with probability at least $1 - \delta$ over $S$, for a fixed constant $C$.

Proof. We first argue that the VC dimension of $H_k$ is at most $\binom{q}{k}$. Let $d = \sum_{i \in [1,k]} \binom{q}{i}$, and index the coordinates of $\{0,1\}^d$ by the subsets $z' \subseteq E$ (the ground set) with $|z'| \le k$. Then each $z \in Z$ can be represented by a binary vector $e_z \in \{0,1\}^d$, with the entry indexed by $z'$ being 1 if and only if $z' \subseteq z$. Further, let $w \in \{a_-, a_+\}^d$ be a binary weight vector with $a_-$ and $a_+$ as in Def. 4.3. Then from the definition of $H_k$, for each $h \in H_k$ there is a $w_h \in \{a_-, a_+\}^d$ such that (a) $h(z) = \mathrm{sign}(\langle w_h, e_z \rangle)$ for all $z \in Z$, and (b) the entry of $w_h$ indexed by any $z'$ with $|z'| < k$ is $a_-$. From this we observe that the VC dimension of $H_k$ is at most $\binom{q}{k}$, since each $h \in H_k$ is determined by the realization of binary weights on the entries indexed by the $\binom{q}{k}$ size-$k$ sets. Now the first part of the theorem follows by noting that the first bound is the agnostic PAC generalization guarantee for an algorithm minimizing the empirical error in the standard classification setting with VC dimension at most $\binom{q}{k}$. To prove the second part, note that $Y \in F_k$, and hence the approximation error is zero, that is, $\varepsilon(h^*) = 0$ (from Lemma E.1). Further, ALG minimizes the empirical error (Theorem 4.7) and returns an $\hat{h}$ with zero empirical error.

Lemma 5.10. There exist a distribution $D$ and a value function $v$ such that for all $k < k' \le k_2$, the system has a higher payoff against the optimal $h^*_k \in H_k$ than against $h^*_{k'} \in H_{k'}$.

Proof. The $v$ is constructed as in the proof of Lemma 5.7. We recall notation from the proof of Lemma 5.7: $z_e$ is a size-$k_2$ subset, and for $k \in [1, k_2]$, $h^*_k$ is such that for all $z \in Z$, $h^*_k(z) = 1$ if and only if there exists a size-$k$ subset $z' \subseteq z$ which is also a subset of $z_e$. Now let $k, k' \in [1, k_2]$ be such that $k < k'$. Since $D$ is the uniform distribution, to show that the system's utility is higher for $k$ than for $k'$ it is sufficient to show that
$$\sum_{x \in X} \mathbb{1}\{h^*_k(\phi_{h^*_k}(x)) = 1\} > \sum_{x \in X} \mathbb{1}\{h^*_{k'}(\phi_{h^*_{k'}}(x)) = 1\}.$$
From the proof of Lemma 5.7 and Theorem 4.6, it follows that
$$\sum_{x \in X} \mathbb{1}\{h^*_k(\phi_{h^*_k}(x)) = 1\} = \sum_{x \in X} \mathbb{1}\{f_{h^*_k}(x) = 1\} = |X_k| = \sum_{\ell=k}^{k_2} \binom{k_2}{\ell}\binom{q-k_2}{n-\ell}.$$
Similarly,
$$\sum_{x \in X} \mathbb{1}\{h^*_{k'}(\phi_{h^*_{k'}}(x)) = 1\} = |X_{k'}| = \sum_{\ell=k'}^{k_2} \binom{k_2}{\ell}\binom{q-k_2}{n-\ell}.$$
Since $k < k'$, we have $|X_k| > |X_{k'}|$, implying that the system's utility is higher for $k$ than for $k'$.

Corollary 5.11. For the system, a lower $k_2$ is better.

Proof. In Lemma 5.10, we showed that there exists a user with value function $v$ such that for all $k, k' \in [1, k_2]$ with $k < k'$, the system has better utility against the optimal choice function in $H_k$ than against that in $H_{k'}$. Since the choice of $k$ the user can make is bounded by $k_2$, a lower $k_2$ maximizes the worst-case payoff to the system.
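For concreteness, the weighting scheme characterized by Eqs. (10)-(12) and Theorem 4.6, on which the proofs above rely, can be sketched in Python as follows. This is an illustration only, not the paper's ALG pseudocode from Section 4.3: it assumes items are given as frozensets of attributes and that the sample is realizable, and the name train_induced_classifier is hypothetical.

from itertools import combinations
from math import comb

def train_induced_classifier(S_pos, S_neg, k):
    # Z_plus: size-k subsets contained in some positive sample and in no
    # negative sample (cf. Eq. (10)).
    neg = {frozenset(z) for x in S_neg for z in combinations(sorted(x), k)}
    Z_plus = {frozenset(z) for x in S_pos for z in combinations(sorted(x), k)} - neg

    n = max(len(x) for x in S_pos + S_neg)
    a_plus = sum(comb(n, i) for i in range(1, k + 1)) + 1  # a_+ > sum_i C(n, i)
    a_minus = -0.5                                          # a_- in (-1, 0)

    def f_hat(x):
        # Induced prediction (Theorem 4.6 / Eq. (12)): sum the weights of all
        # subsets of x of size <= k; only size-k subsets in Z_plus get a_+.
        total = 0.0
        for size in range(1, k + 1):
            for z in combinations(sorted(x), size):
                total += a_plus if (size == k and frozenset(z) in Z_plus) else a_minus
        return 1 if total > 0 else -1

    return f_hat

The choice of a_plus above mirrors how the proofs of Theorems 4.6 and 4.7 use $a_+ > \sum_{i \in [k]} \binom{n}{i}$: a single positively weighted size-$k$ subset of $x$ outweighs all negative contributions, so $f_{\hat h}(x)$ is positive exactly when some size-$k$ subset of $x$ received weight $a_+$.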