Strategic Apple Tasting

Keegan Harris (Carnegie Mellon University, Pittsburgh, PA 15213; keeganh@cmu.edu)
Chara Podimata (MIT & Archimedes/Athena RC, Cambridge, MA 02142; podimata@mit.edu)
Zhiwei Steven Wu (Carnegie Mellon University, Pittsburgh, PA 15213; zstevenwu@cmu.edu)

Algorithmic decision-making in high-stakes domains often involves assigning decisions to agents with incentives to strategically modify their input to the algorithm. In addition to dealing with incentives, in many domains of interest (e.g., lending and hiring) the decision-maker only observes feedback regarding their policy for rounds in which they assign a positive decision to the agent; this type of feedback is often referred to as apple tasting (or one-sided) feedback. We formalize this setting as an online learning problem with apple-tasting feedback, in which a principal makes decisions about a sequence of $T$ agents, each of which is represented by a context that may be strategically modified. Our goal is to achieve sublinear strategic regret, which compares the performance of the principal to that of the best fixed policy in hindsight, if the agents were truthful when revealing their contexts. Our main result is a learning algorithm which incurs $O(\sqrt{T})$ strategic regret when the sequence of agents is chosen stochastically. We also give an algorithm capable of handling adversarially-chosen agents, albeit at the cost of $O(T^{(d+1)/(d+2)})$ strategic regret (where $d$ is the dimension of the context). Our algorithms can be easily adapted to the setting where the principal receives bandit feedback; this setting generalizes both the linear contextual bandit problem (by considering agents with incentives) and the strategic classification problem (by allowing for partial feedback).

1 Introduction

Algorithmic systems have recently been used to aid in or automate decision-making in high-stakes domains (including lending and hiring) in order to, e.g., improve efficiency or reduce human bias [12, 1]. When subjected to algorithmic decision-making in high-stakes settings, individuals have an incentive to strategically modify their observable attributes to appear more qualified. Such behavior is often observed in practice. For example, credit scores are often used to predict the likelihood that an individual will pay back a loan on time if given one. Online articles with titles like "9 Ways to Build and Improve Your Credit Fast" are ubiquitous and offer advice such as "pay credit card balances strategically" in order to improve one's credit score with minimal effort [46]. In hiring, common advice ranges from curating a list of keywords to add to one's resume, to using white font in order to trick automated resume-scanning software [23, 2]. If left unaccounted for, such strategic manipulations could result in individuals being awarded opportunities for which they are not qualified, possibly at the expense of more deserving candidates. As a result, it is critical to keep individuals' incentives in mind when designing algorithms for learning and decision-making in high-stakes settings. In addition to dealing with incentives, another challenge of designing learning algorithms for high-stakes settings is the possible selection bias introduced by the way decisions are made.
In particular, decision-makers often only have access to feedback about the deployed policy from individuals that have received positive decisions (e.g., the applicant is given the loan, or the candidate is hired for the job, and then we can evaluate how good our decision was). In the language of online learning, this type of feedback is known as apple tasting (or one-sided) feedback. When combined, these two complications (incentives and one-sided feedback) have the potential to amplify one another: algorithms can learn only when a positive decision is made, but individuals have an incentive to strategically modify their attributes in order to receive such positive decisions, which may interfere with the learning process.

1.1 Contributions

We formalize our setting as a game between a principal and a sequence of $T$ strategic agents, each with an associated context $x_t$ which describes the agent. At every time $t \in \{1, \ldots, T\}$, the principal deploys a policy $\pi_t$, a mapping from contexts to binary decisions (e.g., whether to accept/reject a loan applicant). Given policy $\pi_t$, agent $t$ then presents a (possibly modified) context $x'_t$ to the algorithm, and receives a decision $a_t = \pi_t(x'_t)$. If $a_t = 1$, the principal observes reward $r_t(a_t) = r_t(1)$; if $a_t = 0$, they receive no feedback. $r_t(0)$ is assumed to be known and constant across rounds.¹ Our metric of interest is strategic regret, i.e., regret with respect to the best fixed policy in hindsight, if agents were truthful when reporting their contexts. Our main result is an algorithm which achieves $O(\sqrt{T})$ strategic regret (with polynomial per-round runtime) when there is sufficient randomness in the distribution over agents (Algorithm 1). At a high level, our algorithm deploys a linear policy at every round which is appropriately shifted to account for the agents' strategic behavior. We identify a sufficient condition under which the data received by the algorithm at a given round is "clean", i.e., has not been strategically modified. Algorithm 1 then learns the relationship between contexts and rewards online, using only data which it is sure is clean. In contrast to the performance of algorithms which operate in the non-strategic setting, the regret of Algorithm 1 depends on an exponentially-large constant $c(d, \delta) \approx (1 - \delta)^{-d}$ due to the one-sided feedback available for learning, where $d$ is the context dimension and $\delta \in (0, 1)$ is a parameter which represents the agents' ability to manipulate. While this dependence on $c(d, \delta)$ is insignificant when the number of agents $T$ is very large, it may be problematic for the principal whenever $T$ is either small or unknown. To mitigate this issue, we show how to obtain $O(d T^{2/3})$ strategic regret by playing a modified version of the well-known explore-then-commit algorithm (Algorithm 2). At a high level, Algorithm 2 explores by always assigning action 1 for a fixed number of rounds (during which agents do not have an incentive to strategize) in order to collect sufficient information about the data-generating process. It then exploits by using this data to learn a strategy-aware linear policy. Finally, we show how to combine Algorithm 1 and Algorithm 2 to achieve $O(\min\{c(d, \delta)\sqrt{T},\ d T^{2/3}\})$ strategic regret whenever $T$ is unknown.
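To make the explore-then-commit idea concrete, below is a minimal sketch of the exploration/commit structure just described (not the exact specification of Algorithm 2): it always assigns action 1 during an exploration phase, fits ordinary least squares on the truthfully-reported contexts collected there, and then commits to a linear policy whose threshold is shifted by $\delta$ times the norm of the estimate. The helper `draw_context`, the Gaussian noise level, and the choice of exploration length are illustrative assumptions rather than the paper's tuned values.

```python
import numpy as np

def lazy_best_response(x, theta, thr, delta):
    # Agent moves toward theta just far enough to cross the threshold,
    # but only if the boundary is reachable within the l2 budget delta.
    norm = max(np.linalg.norm(theta), 1e-12)
    gap = (thr - theta @ x) / norm
    return x + gap * theta / norm if 0.0 < gap <= delta else x

def explore_then_commit(T, d, delta, r0, theta1, draw_context, rng=None):
    """Minimal sketch of a strategy-aware explore-then-commit loop."""
    rng = rng or np.random.default_rng(0)
    T0 = max(d, int(T ** (2 / 3)))                # illustrative exploration length
    X, R = [], []
    for _ in range(T0):                           # explore: always assign action 1,
        x = draw_context()                        # so agents report truthfully
        X.append(x)
        R.append(theta1 @ x + rng.normal(0, 0.1))
    theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(R), rcond=None)
    thr = delta * np.linalg.norm(theta_hat) + r0  # shifted decision boundary
    actions = []
    for _ in range(T0, T):                        # commit: greedy shifted policy
        x_rep = lazy_best_response(draw_context(), theta_hat, thr, delta)
        actions.append(int(theta_hat @ x_rep >= thr))
    return theta_hat, actions
```

The key point illustrated here is that the exploration rounds are incentive-free: since action 1 is assigned regardless of the reported context, agents gain nothing from modifying it.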
While the assumption of stochastically-chosen agents is well-motivated in general, it may be overly restrictive in some specific settings. Our next result is an algorithm which obtains $O(T^{(d+1)/(d+2)})$ strategic regret when agents are chosen adversarially (Algorithm 4). Algorithm 4 uses a variant of the popular EXP3 algorithm to trade off between a carefully constructed set of (exponentially many) policies [6]. As a result, it achieves sublinear strategic regret when agents are chosen adversarially, but requires an exponentially-large amount of computation at every round. Finally, we note that while our primary setting of interest is that of one-sided feedback, all of our algorithms can be easily extended to the more general setting in which the principal receives bandit feedback at each round, i.e., $r_t(0)$ is not constant and must be learned from data. To the best of our knowledge, we are the first to consider strategic learning in the contextual bandit setting.

1.2 Related work

Strategic responses to algorithmic decision-making. There is a growing line of work at the intersection of economics and computation on algorithmic decision-making with incentives, under the umbrella of strategic classification or strategic learning [27, 3, 41, 22], focusing on online learning settings [21, 19], causal learning [52, 31, 29, 34], incentivizing desirable behavior [39, 28, 11, 42], and incomplete information [30, 25, 37]. In its most basic form, a principal makes either a binary or real-valued prediction about a strategic agent, and receives full feedback (e.g., the agent's label) after the decision is made. While this setting is similar to ours, it crucially ignores the one-sided feedback structure present in many strategic settings of interest. In our running example of hiring, full feedback would correspond to a company not offering an applicant a job, and yet still getting to observe whether they would have been a good employee! As a result, such methods are not applicable in our setting. Concurrent work [18] studies the effects of bandit feedback in the related problem of performative prediction [47], which considers data distribution shifts at the population level in response to the deployment of a machine learning model. In contrast, our focus is on strategic responses to machine learning models at the individual level under apple tasting and bandit feedback. Ahmadi et al. [4] study an online strategic learning problem in which they consider bandit feedback with respect to the deployed classifier. In contrast, we use the term bandit feedback to refer to the fact that we only observe the outcome for the action/decision actually taken.

¹We relax this assumption in later parts of the paper with virtually no impact on our results.

Apple tasting and online learning. Helmbold et al. [32] introduce the notion of apple-tasting feedback for online learning. In particular, they study a binary prediction task over instances (e.g., fresh/rotten apples), in which a positive prediction is interpreted as accepting the instance (i.e., "tasting the apple") and a negative prediction is interpreted as rejecting the instance (i.e., not tasting the apple). The learner only receives feedback when the instance is accepted (i.e., the apple is tasted). While we are the first to consider classification under incentives with apple-tasting feedback, similar feedback models have been studied in the context of algorithmic fairness [9], partial-monitoring games [5], and recidivism prediction [24].
A related model of feedback is that of contaminated controls [40], which considers learning from (1) a treated group which contains only treated members of the agent population and (2) a contaminated control group with samples from the entire agent population (not just those under control). Technically, our results are also related to a line of work in contextual bandits which shows that greedy algorithms without explicit exploration can achieve sublinear regret as long as the underlying context distribution is sufficiently diverse [49, 8, 38, 53, 48].

Bandits and agents. A complementary line of work to ours is that of Bayesian incentive-compatible (BIC) exploration in multi-armed bandit problems [43, 35, 50, 36, 44, 45]. In such settings, the goal of the principal is to persuade a sequence of $T$ agents with incentives to explore across several different actions with bandit feedback. In contrast, in our setting it is the principal, not the agents, who takes actions under partial feedback. As a result there is no need for persuasion, but the agents now have an incentive to strategically modify their behavior in order to receive a more desirable decision/action.

Other related work. Finally, our work is broadly related to the literature on learning in repeated Stackelberg games [7, 55], online Bayesian persuasion [16, 17, 13], and online learning in principal-agent problems [20, 33, 56]. In the repeated Stackelberg game setting, the principal (leader) commits to a mixed strategy over a finite set of actions, and the agent (follower) best-responds by playing an action from a finite set of best responses. Unlike in our setting, both the principal's and agent's payoffs can be represented by matrices. In contrast, in our setting the principal commits to a pure strategy from a continuous set of actions, and the agent best-responds by playing an action from a continuous set. In online Bayesian persuasion, the principal (sender) commits to a signaling policy (a random mapping from states of the world to receiver actions) and the agent (receiver) performs a posterior update on the state based on the principal's signal, then takes an action from a (usually finite) set. In both that setting and ours, the principal's action is a policy. However, in our setting the policy is a linear decision rule, whereas in the Bayesian persuasion setting, the policy is a set of conditional probabilities which form an incentive-compatible signaling policy. This difference in the policy space for the principal typically leads to different algorithmic ideas being used in the two settings. Strategic learning problems are, broadly speaking, instances of principal-agent problems. In contract design, the principal commits to a contract (a mapping from outcomes to agent payoffs). The agent then takes an action, which affects the outcome. In particular, they take the action which maximizes their expected payoff, subject to some cost of taking the action. The goal of the principal is to design a contract such that their own expected payoff is maximized. While the settings are indeed similar, there are several key differences. First, in online contract design the principal always observes the outcome, whereas in our setting the principal only observes the reward if a positive decision is made. Second, the form of the agent's best response is different, which leads to different agent behavior and, as a result, different online algorithms for the principal.
2 Setting and background

We consider a game between a principal and a sequence of $T$ agents. Each agent is associated with a context $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$, which characterizes their attributes (e.g., a loan applicant's credit history/report). At time $t$, the principal commits to a policy $\pi_t : \mathcal{X} \to \{1, 0\}$, which maps from contexts to binary decisions (e.g., whether to accept/reject the loan application). We use $a_t = 1$ to denote the principal's positive decision at round $t$ (e.g., agent $t$'s loan application is approved), and $a_t = 0$ to denote a negative decision (e.g., the loan application is rejected). Given $\pi_t$, agent $t$ best-responds by strategically modifying their context within their effort budget as follows:

Definition 2.1 (Agent best response; lazy tiebreaking). Agent $t$ best-responds to policy $\pi_t$ by modifying their context according to the following optimization program:
$$x'_t \in \operatorname*{arg\,max}_{x' \in \mathcal{X}} \; \mathbb{1}\{\pi_t(x') = 1\} \quad \text{s.t.} \quad \|x' - x_t\|_2 \le \delta.$$
Furthermore, we assume that if an agent is indifferent between two (modified) contexts, they choose the one which requires the least amount of effort to obtain (i.e., agents are lazy when tiebreaking).

In other words, every agent wants to receive a positive decision, but has only a limited ability to modify their (initial) context (represented by the $\ell_2$ budget $\delta$).² Such an effort budget may be induced by time or monetary constraints, and is a ubiquitous model of agent behavior in the strategic learning literature (e.g., [39, 28, 19, 10]). We focus on linear thresholding policies, where the principal assigns action $\pi(x') = 1$ if and only if $\langle \beta, x' \rangle \ge \gamma$ for some $\beta \in \mathbb{R}^d$, $\gamma \in \mathbb{R}$. We refer to $\langle \beta, x'_t \rangle = \gamma$ as the decision boundary. For linear thresholding policies, the agent's best response according to Definition 2.1 is to modify their context in the direction of $\beta / \|\beta\|_2$ until the decision boundary is reached (if it can indeed be reached). While we present our results for lazy tiebreaking for ease of exposition, all of our results can be readily extended to the setting in which agents best-respond with a "trembling hand", i.e., trembling-hand tiebreaking. Under this setting, we allow agents who strategically modify their contexts to overshoot the decision boundary by some bounded amount, which can be either stochastic or adversarially-chosen. See Appendix D for more details.

The principal observes $x'_t$ and plays action $a_t = \pi_t(x'_t)$ according to policy $\pi_t$. If $a_t = 0$, the principal receives some known, constant reward $r_t(0) := r_0 \in \mathbb{R}$. On the other hand, if the principal assigns action $a_t = 1$, we assume that the reward the principal receives is linear in the agent's unmodified context, i.e.,
$$r_t(1) := \langle \theta^{(1)}, x_t \rangle + \epsilon_t \quad (1)$$
for some unknown $\theta^{(1)} \in \mathbb{R}^d$, where $\epsilon_t$ is i.i.d. zero-mean sub-Gaussian random noise with (known) variance $\sigma^2$. Note that $r_t(1)$ is observed only when the principal assigns action $a_t = 1$, and not when $a_t = 0$. Following Helmbold et al. [32], we refer to such feedback as apple tasting (or one-sided) feedback. Mapping to our lending example, the reward a bank receives for rejecting a particular loan applicant is the same across all applicants, whereas their reward for a positive decision could be anywhere from a large negative reward (e.g., if a loan is never repaid) to a large positive reward (e.g., if the loan is repaid on time, with interest).

The most natural measure of performance in our setting is that of Stackelberg regret, which compares the principal's reward over $T$ rounds with that of the optimal policy given that agents strategize.

Definition 2.2 (Stackelberg regret).
The Stackelberg regret of a sequence of policies $\{\pi_t\}_{t \in [T]}$ on agents $\{x_t\}_{t \in [T]}$ is
$$\mathrm{Reg}_{\mathrm{Stackel}}(T) := \sum_{t \in [T]} r_t(\tilde{\pi}^*(\tilde{x}_t)) - \sum_{t \in [T]} r_t(\pi_t(x'_t)),$$
where $\tilde{x}_t$ is the best response of agent $t$ to policy $\tilde{\pi}^*$, and $\tilde{\pi}^*$ is the optimal-in-hindsight policy, given that agents best-respond according to Definition 2.1.

A stronger measure of performance is that of strategic regret, which compares the principal's reward over $T$ rounds with that of the optimal policy had agents reported their contexts truthfully.

Definition 2.3 (Strategic regret). The strategic regret of a sequence of policies $\{\pi_t\}_{t \in [T]}$ on agents $\{x_t\}_{t \in [T]}$ is
$$\mathrm{Reg}_{\mathrm{strat}}(T) := \sum_{t \in [T]} r_t(\pi^*(x_t)) - \sum_{t \in [T]} r_t(\pi_t(x'_t)),$$
where $\pi^*(x_t) = 1$ if $\langle \theta^{(1)}, x_t \rangle \ge r_0$ and $\pi^*(x_t) = 0$ otherwise.

²Our results readily extend to the setting in which the agent's effort constraint takes the form of an ellipse rather than a sphere. Under this setting, the agent effort budget constraint in Definition 2.1 would be $\|A^{1/2}(x' - x_t)\|_2 \le \delta$, where $A \in \mathbb{R}^{d \times d}$ is some positive definite matrix. If $A$ is known to the principal, this may just be viewed as a linear change in the feature representation.

Classification under agent incentives with apple tasting feedback
For $t = 1, \ldots, T$:
1. Principal publicly commits to a mapping $\pi_t : \mathcal{X} \to \{1, 0\}$.
2. Agent $t$ arrives with context $x_t \in \mathcal{X}$ (hidden from the principal).
3. Agent $t$ strategically modifies their context from $x_t$ to $x'_t$ according to Definition 2.1.
4. Principal observes the (modified) context $x'_t$ and plays action $a_t = \pi_t(x'_t)$.
5. Principal observes $r_t(1) := \langle \theta^{(1)}, x_t \rangle + \epsilon_t$ if and only if $a_t = 1$.
Figure 1: Summary of our model.

Proposition 2.4. Strategic regret is a stronger performance notion compared to Stackelberg regret, i.e., $\mathrm{Reg}_{\mathrm{Stackel}}(T) \le \mathrm{Reg}_{\mathrm{strat}}(T)$.

Proof. The proof follows from the corresponding regret definitions and the fact that the principal's reward is determined by the original (unmodified) agent contexts:
$$\begin{aligned}
\mathrm{Reg}_{\mathrm{Stackel}}(T) &:= \sum_{t \in [T]} r_t(\tilde{\pi}^*(\tilde{x}_t)) - \sum_{t \in [T]} r_t(\pi_t(x'_t)) \\
&= \left( \sum_{t \in [T]} r_t(\tilde{\pi}^*(\tilde{x}_t)) - \sum_{t \in [T]} r_t(\pi^*(x_t)) \right) + \left( \sum_{t \in [T]} r_t(\pi^*(x_t)) - \sum_{t \in [T]} r_t(\pi_t(x'_t)) \right) \\
&\le 0 + \mathrm{Reg}_{\mathrm{strat}}(T),
\end{aligned}$$
where the last line follows from the fact that the principal's reward from the optimal policy when the agents strategize is at most their optimal reward when agents do not strategize.

Because of Proposition 2.4, we focus on strategic regret, and use the shorthand $\mathrm{Reg}_{\mathrm{strat}}(T) = \mathrm{Reg}(T)$ for the remainder of the paper. Strategic regret is a strong notion of optimality, as we are comparing the principal's performance with that of the optimal policy for an easier setting, in which agents do not strategize. Moreover, the apple tasting feedback introduces additional challenges which require new algorithmic ideas, since the principal needs to assign actions to both (1) learn about $\theta^{(1)}$ (which can only be done when action 1 is assigned) and (2) maximize rewards, in order to achieve sublinear strategic regret. See Figure 1 for a summary of the setting we consider. We conclude this section by pointing out that our results also apply to the more challenging setting of bandit feedback, in which $r_t(1)$ is defined as in Equation (1), $r_t(0) := \langle \theta^{(0)}, x_t \rangle + \epsilon_t$, and only $r_t(a_t)$ is observed at each time-step. We choose to highlight our results for apple tasting feedback since this is the type of feedback received by the principal in our motivating examples. Finally, we note that $\tilde{O}(\cdot)$ hides polylogarithmic factors, and that all proofs can be found in the Appendix.
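The following sketch spells out one round of the interaction in Figure 1 together with the lazy best response of Definition 2.1, for a linear thresholding policy. It is a schematic aid rather than part of the formal model; the noise level and the helper names (`lazy_best_response`, `protocol_round`) are illustrative.

```python
import numpy as np

def lazy_best_response(x, beta, thr, delta):
    """Definition 2.1 against the policy 'assign action 1 iff <beta, x'> >= thr':
    move toward beta just far enough to reach the decision boundary, and only
    if it is reachable within the l2 effort budget delta; otherwise stay put."""
    norm = max(np.linalg.norm(beta), 1e-12)
    gap = (thr - beta @ x) / norm
    return x + gap * beta / norm if 0.0 < gap <= delta else x

def protocol_round(x_true, beta, thr, delta, theta1, r0, rng):
    """One round of the interaction summarized in Figure 1 (apple tasting)."""
    x_rep = lazy_best_response(x_true, beta, thr, delta)            # step 3
    a = int(beta @ x_rep >= thr)                                    # step 4
    observed = theta1 @ x_true + rng.normal(0, 0.1) if a else None  # step 5
    # Strategic-regret comparator (Definition 2.3): the optimal action on the
    # *unmodified* context.
    a_star = int(theta1 @ x_true >= r0)
    return x_rep, a, observed, a_star
```

Note that the reward, when observed, depends on the unmodified context $x_t$, while the decision depends on the reported context $x'_t$; this gap is exactly what strategic regret measures.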
3 Strategic classification with apple tasting feedback

In this section, we present our main results: provable guarantees for online classification of strategic agents under apple tasting feedback. Our results rely on the following assumption.

Assumption 3.1 (Bounded density ratio). Let $f_{\mathcal{U}_d} : \mathcal{X} \to \mathbb{R}_{\ge 0}$ denote the density function of the uniform distribution over the $d$-dimensional unit sphere. We assume that agent contexts $\{x_t\}_{t \in [T]}$ are drawn i.i.d. from a distribution over the $d$-dimensional unit sphere with density function $f : \mathcal{X} \to \mathbb{R}_{\ge 0}$ such that $\frac{f(x)}{f_{\mathcal{U}_d}(x)} \ge c_0 > 0$ for all $x \in \mathcal{X}$.³

³Our restriction to the unit sphere is without loss of generality. All of our results and analysis extend readily to the setting where contexts are drawn from a distribution over the $d$-dimensional sphere with radius $R > 0$.

Assumption 3.1 is a condition on the initial agent contexts $\{x_t\}_{t \in [T]}$, before they are strategically modified. Indeed, one would expect the distribution over modified agent contexts to be highly discontinuous, in a way that depends on the sequence of policies deployed by the principal. Furthermore, none of our algorithms need to know the value of $c_0$. As we will see in the sequel, this assumption allows us to handle apple tasting feedback by relying on the inherent diversity in the agent population for exploration, a growing area of interest in the online learning literature (see references in Section 1.2). Moreover, such assumptions often hold in practice. For example, in the related problem of (non-strategic) contextual bandits (we will later show how our results extend to the strategic version of this problem), Bietti et al. [14] find that a greedy algorithm with no explicit exploration achieved the second-best empirical performance across a large number of datasets when compared to many popular contextual bandit algorithms. In our settings of interest (e.g., lending and hiring), such an assumption is reasonable if there is sufficient diversity in the applicant pool. In Section 4 we show how to remove this assumption, albeit at the cost of worse regret rates and exponential computational complexity.

ALGORITHM 1: Strategy-Aware OLS with Apple Tasting Feedback (SA-OLS)
Assign action 1 for the first $d$ rounds. Set $\mathcal{D}_{d+1} = \{(x_s, r_s(1))\}_{s=1}^{d}$.
for $t = d + 1, \ldots, T$ do
  Estimate $\theta^{(1)}$ as $\hat{\theta}^{(1)}_t$ using OLS and data $\mathcal{D}_t$.
  Assign action $a_t = 1$ if $\langle \hat{\theta}^{(1)}_t, x'_t \rangle \ge \delta \|\hat{\theta}^{(1)}_t\|_2 + r_0$.
  if $\langle \hat{\theta}^{(1)}_t, x'_t \rangle > \delta \|\hat{\theta}^{(1)}_t\|_2 + r_0$ then
    Conclude that $x'_t = x_t$ and set $\mathcal{D}_{t+1} = \mathcal{D}_t \cup \{(x_t, r_t(1))\}$.
  else
    Set $\mathcal{D}_{t+1} = \mathcal{D}_t$.

At a high level, our algorithm (formally stated in Algorithm 1) relies on three key ingredients to achieve sublinear strategic regret:
1. A running estimate of $\theta^{(1)}$ is used to compute a linear policy, which separates agents who receive action 1 from those who receive action 0. Before deploying it, we shift the decision boundary by the effort budget $\delta$ to account for the agents' strategizing.
2. We maintain an estimate of $\theta^{(1)}$ (denoted by $\hat{\theta}^{(1)}$) and only update it when $a_t = 1$ and we can ensure that $x'_t = x_t$.
3. We assign actions greedily (i.e., using no explicit exploration) with respect to the shifted linear policy.

Shifted linear policy. If agents were not strategic, assigning action 1 if $\langle \hat{\theta}^{(1)}_t, x_t \rangle \ge r_0$ and action 0 otherwise would be a reasonable strategy to deploy, given that $\hat{\theta}^{(1)}_t$ is our best estimate of $\theta^{(1)}$ so far. Recall that the strategically modified context $x'_t$ satisfies $\|x'_t - x_t\|_2 \le \delta$. Hence, in Algorithm 1, we shift the linear policy by $\delta \|\hat{\theta}^{(1)}_t\|_2$ to account for strategically modified contexts.
Now, action 1 is only assigned if bθ (1) t , xt δ bθ (1) 2 + r0. This serves two purposes: (1) It makes it so that any agent with unmodified context x such that bθ (1) t , x < r0 cannot receive action 1, no matter how they strategize. (2) It forces some agents with contexts in the band r0 bθ (1) t , x < δ bθ (1) 2 + r0 to strategize in order to receive action 1. Estimating θ(1) After playing action 1 for the first d rounds, Algorithm 1 forms an initial estimate of θ(1) via ordinary least squares (OLS). Note that since the first d agents will receive action 1 regardless of their context, they have no incentive to modify and thus x t = xt for t d. In future rounds, the algorithm s estimate of θ(1) is only updated whenever x t lies strictly on the positive side of the linear decision boundary. We call these contexts clean, and can infer that x t = xt due to the lazy tiebreaking assumption in Definition 2.1 (i.e. agents will not strategize more than is necessary to receive the positive classification). Condition 3.2 (Sufficient condition for x = x). Given a shifted linear policy parameterized by β(1) Rd, we say that a context x is clean if β(1), x > δ β(1) 2 + r0. Greedy action assignment By assigning actions greedily according to the current (shifted) linear policy, we are relying on the diversity in the agent population for implicit exploration (i.e., to collect more datapoints to update our estimate of θ(1)). As we will show, this implicit exploration is sufficient to achieve e O( T) strategic regret under Assumption 3.1, albeit at the cost of an exponentially-large (in d) constant which depends on the agents ability to manipulate (δ). We are now ready to present our main result: strategic regret guarantees for Algorithm 1 under apple tasting feedback. Theorem 3.3 (Informal; detailed version in Theorem B.1). With probability 1 γ, Algorithm 1 achieves the following performance guarantee: Reg(T) e O 1 c0 c1(d, δ) c2(d, δ) dσ2T log(4d T/γ) where c0 is a lower bound on the density ratio as defined in Assumption 3.1, c1(d, δ) := Px U d(x[1] δ) Θ (1 δ)d/2 d2 for sufficiently large d and c2(d, δ) := Ex U d[x[2]2|x[1] 4δ2 3, where x[i] denotes the i-th coordinate of a vector x.4 Proof sketch. Our analysis begins by using properties of the strategic agents and shifted linear decision boundary to upper-bound the per-round strategic regret for rounds t > d by a term proportional to bθ (1) t θ(1) 2, i.e., our instantaneous estimation error for θ(1). Next we show that bθ (1) t θ(1) 2 Pt s=1 xsϵs1{I(1) s } 2 λmin(Pt s=1 xsx s 1{I(1) s }) where λmin(M) is the minimum eigenvalue of (symmetric) matrix M, and I(1) s = { bθ (1) s , xs δ bθ (1) s 2 + r0} is the event that Algorithm 1 assigns action as = 1 and can verify that x s = xs. We upper-bound the numerator using a variant of Azuma s inequality for martingales with subgaussian tails. Next, we use properties of Hermitian matrices to show that λmin(Pt s=1 xsx s 1{I(1) s }) is lower-bounded by two terms: one which may be bounded w.h.p. by using the extension of Azuma s inequality for matrices, and one of the form Pt s=1 λmin(Es 1[xsx s 1{I(1) s }]), where Es 1 denotes the expected value conditioned on the filtration up to time s. Note that up until this point, we have only used the fact that contexts are drawn i.i.d. from a bounded distribution. 
Using Assumption 3.1 on the bounded density ratio, we can lower bound λmin(Es 1[xsx s 1{I(1) s }]) by λmin(EU d,s 1[xsx s 1{I(1) s }]), where the expectation is taken with respect to the uniform distribution over the d-dimensional ball. We then use properties of the uniform distribution to show that λmin(EU d,s 1[xsx s 1{I(1) s }]) O(c0 c(d, δ)). Putting everything together, we get that bθ (1) t θ(1) 2 (c0 c(d, δ) t) 1 with high probability. Via a union bound and the fact that P t 2T, we get that Reg(T) e O( 1 c0 c(d,δ) T). Finally, we use tools from highdimensional geometry to lower bound the volume of a spherical cap and we show that for sufficiently large d, c1(d, δ) Θ (1 δ)d/2 3.1 High-dimensional contexts While we typically think of the number of agents T as growing and the context dimension d as constant in our applications of interest, there may be situations in which T is either unknown or small. Under such settings, the 1/c(d,δ) dependence in the regret bound (where c(d, δ) = c1(d, δ) c2(d, δ)) may become problematic if δ is close to 1. This begs the question: Why restrict the OLS estimator in Algorithm 1 to use only clean contexts (as defined in Condition 3.2)? Perhaps unsurprisingly, we show in Appendix B that the estimate bθ (1) given by OLS will be inconsistent if even a constant fraction of agents strategically modify their contexts. 4While we assume that δ is known to the principal, Algorithm 1 is fairly robust to overestimates of δ, in the sense that (1) it will still produce a consistent estimate for θ(1) (albeit at a rate which depends on the overestimate instead of the actual value of δ) and (2) it will incur a constant penalty in regret which is proportional to the amount of over-estimation. ALGORITHM 2: Explore-Then-Commit Input :Time horizon T, failure probability γ Set T0 according to Theorem B.9 Assign action 1 for the first T0 rounds Estimate θ(1) as ˆθ (1) T0 via OLS for t = T0 + 1, . . . , T do Assign action at = 1 if ˆθ (1) T0 , xt δ ˆθ (1) T0 2 and action at = 0 otherwise Given the above, it seems reasonable to restrict ourselves to learning procedures which only use data from agents for which the principal can be sure that x = x. Under such a restriction, it is natural to ask whether there exists some sequence of linear polices which maximizes the number of points of the form (x t, rt(1)) for which the principal can be sure that x t = xt. Again, the answer is no: Proposition 3.4. For any sequence of linear policies {βt}t, the expected number of clean points is: Ex1,...,x T U d t [T ] 1{ xt, βt > δ βt 2} = c1(d, δ) T when (initial) contexts are drawn uniformly from the d-dimensional unit sphere. The proof follows from the rotational invariance of the uniform distribution over the unit sphere. Intuitively, Proposition 3.4 implies that any algorithm which wishes to learn θ(1) using clean samples will only have c1(d, δ) T datapoints in expectation. Observe that this dependence on c1(d, δ) arises as a direct result of the agents ability to strategize. We remark that a similar constant often appears in the regret analysis of BIC bandit algorithms (see Section 1.2). Much like our work, [43] find that their regret rates depend on a constant which may be arbitrarily large, depending on how hard it is to persuade agents to take the principal s desired action in their setting. The authors conjecture that this dependence is an inevitable price of incentive-compatibility . 
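The clean-data restriction discussed above is the heart of Algorithm 1. The following is a schematic simulation in the spirit of Algorithm 1 (SA-OLS): decisions are greedy with respect to the shifted threshold, and the OLS estimate is refit only on rounds whose reported context lies strictly above the shifted boundary (Condition 3.2), since lazy agents never overshoot and such contexts must therefore be unmodified. The noise level, the `draw_context` oracle, and the naive full refit are illustrative assumptions.

```python
import numpy as np

def run_sa_ols(T, d, delta, r0, theta1, draw_context, rng=None):
    """Schematic SA-OLS loop: greedy shifted-threshold decisions, OLS refit
    on provably clean rounds only."""
    rng = rng or np.random.default_rng(0)
    X, R = [], []                                      # clean data only
    theta_hat = np.zeros(d)
    for t in range(T):
        x = draw_context()
        if t < d:                                      # bootstrap: always assign action 1
            x_rep, a, thr = x, 1, -np.inf
        else:
            norm = max(np.linalg.norm(theta_hat), 1e-12)
            thr = delta * np.linalg.norm(theta_hat) + r0
            gap = (thr - theta_hat @ x) / norm
            step = gap * theta_hat / norm
            x_rep = x + step if 0.0 < gap <= delta else x   # lazy best response
            a = int(theta_hat @ x_rep >= thr)
        if a == 1:                                     # apple-tasting feedback
            r = theta1 @ x + rng.normal(0, 0.1)
            if t < d or theta_hat @ x_rep > thr:       # strictly above: provably clean
                X.append(x_rep)
                R.append(r)
                theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(R), rcond=None)
    return theta_hat
```

Agents who strategize land exactly on the boundary and are served, but never added to the regression data, which is why the fraction of usable rounds scales with $c_1(d, \delta)$ as in Proposition 3.4.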
While our results do not rule out better strategic regret rates in d for more complicated algorithms (e.g., those which deploy non-linear policies), it is often unclear how strategic agents would behave in such settings, both in theory (Definition 2.1 would require agents to solve a non-convex optimization with potentially no closed-form solution) and in practice, making the analysis of such nonlinear policies difficult in strategic settings. We conclude this section by showing that polynomial dependence on d is possible, at the cost of e O(T 2/3) strategic regret. Specifically, we provide an algorithm (Algorithm 3) which obtains the following regret guarantee whenever T is small or unknown, which uses Algorithm 1 and a variant of the explore-then-commit algorithm (Algorithm 2) as subroutines: Theorem 3.5 (Informal; details in Theorem B.13). Algorithm 3 incurs expected strategic regret E[Reg(T)] = e O min d5/2 T, d T 2/3 , where the expectation is taken with respect to the sequence of contexts {xt}t [T ] and random noise {ϵt}t [T ]. The algorithm proceeds by playing a strategy-aware variant of explore-then-commit (Algorithm 2) with a doubling trick until the switching time τ = g(d, δ) is reached. Note that g(d, δ) is a function of both d and δ, not c0. If round τ is indeed reached, the algorithm switches over to Algorithm 1 for the remaining rounds. Extension to bandit feedback Algorithm 1 can be extended to handle bandit feedback by explicitly keeping track of an estimate bθ (0) of θ(0) via OLS, assigning action at = 1 if and only if bθ (1) t bθ (0) t , x t δ bθ (1) t bθ (0) t 2, and updating the OLS estimate of bθ (0) whenever at = 0 (since agents will not strategize to receive action 0). Algorithm 3 may be extended to bandit feedback by exploring for twice as long in Algorithm 2, in addition to using the above modifications. In both cases, the strategic regret rates are within a constant factor of the rates obtained in Theorem 3.3 and Theorem 3.5. ALGORITHM 3: Strategy-aware online classification with unknown time horizon Compute switching time τ = g(d, δ) Let τ0 = 1 for i = 1, 2, 3, . . . do Let τi = 2 τi 1 if Pi j=1 τj < τ then Run Algorithm 2 with time horizon τi and failure probability 1/τ 2 i else Break and run Algorithm 1 for the remainder of the rounds 4 Beyond stochastic contexts In this section, we allow the sequence of initial agent contexts to be chosen by an (oblivious) adversary. This requires new algorithmic ideas, as the regression-based algorithms of Section 3 suffer linear strategic regret under this adversarial setting. Our algorithm (Algorithm 4) is based on the popular EXP3 algorithm [6]. At a high level, Algorithm 4 maintains a probability distribution over experts , i.e., a discretized grid E over carefully-selected policies. In particular, each grid point e E Rd represents an estimate of θ(1), and corresponds to a slope vector which parameterizes a (shifted) linear threshold policy, like the ones considered in Section 3. We use at,e to refer to the action played by the principal at time t, had they used the linear threshold policy parameterized by expert e. At every time-step, (1) the adversary chooses an agent xt, (2) a slope vector et E is selected according to the current distribution, (3) the principal commits to assigning action 1 if and only if et, x t δ et 2, (4) the agent strategically modifies their context xt x t, and (5) the principal assigns an action at according to the policy and receives the associated reward rt(at) (under apple tasting feedback). 
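Anticipating the pseudocode of Algorithm 4 given below, here is a schematic EXP3-style loop over a grid of strategy-aware linear experts. `play_round(e)` is an assumed environment oracle that commits to expert `e`'s shifted policy, lets the agent best-respond, and returns the observed reward of the action taken; the grid spacing and the parameters `eta`, `gamma`, `lam` are placeholders rather than the tuned values from the paper.

```python
import numpy as np
from itertools import product

def exp3_sae(T, d, play_round, eps=0.25, eta=0.05, gamma=0.05, lam=1.0, rng=None):
    """Schematic EXP3 over discretized strategy-aware experts: only the expert
    actually played has its loss updated, via importance weighting."""
    rng = rng or np.random.default_rng(0)
    pts = np.linspace(-1.0, 1.0, int(2 / eps) + 1)
    grid = np.array(list(product(pts, repeat=d)))    # exponentially many experts
    weights = np.ones(len(grid))
    for _ in range(T):
        p = weights / weights.sum()
        q = (1 - gamma) * p + gamma / len(grid)      # mix in uniform exploration
        i = rng.choice(len(grid), p=q)
        r = play_round(grid[i])                      # one-sided/bandit feedback only
        loss_hat = (1 + lam - r) / q[i]              # shifted, importance-weighted loss
        weights[i] *= np.exp(-eta * loss_hat)        # update the played expert only
        weights /= weights.max()                     # rescale for numerical stability
    return grid[np.argmax(weights)]
```

The per-round cost is linear in the number of grid points, i.e., exponential in $d$, which matches the computational-complexity discussion later in this section.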
Algorithm EXP4, which maintains a distribution over experts and updates the loss of all experts based on the current action taken, is not directly applicable in our setting as the strategic behavior of the agents prevents us from inferring the loss of each expert at every time-step [? ]. This is because if x t = xt under the thresholding policy associated with expert e), it is generally not possible to back out xt given x t, which prevents us from predicting the counterfactual context the agent would have modified to had the principal been using expert e instead. As a result, we use a modification of the standard importance-weighted loss estimator to update the loss of only the policy played by the algorithm (and therefore the distribution over policies). Our regret guarantees for Algorithm 4 are as follows: Theorem 4.1 (Informal; detailed version in Theorem C.1). Algorithm 4 incurs expected strategic regret E[Reg(T)] = e O(T (d+1)/(d+2)). We remark that Algorithm 4 may be extended to handle settings in which agents are selected by an adaptive adversary by using EXP3.P [6] in place of EXP3. Proof sketch. The analysis is broken down into two parts. In the first part, we bound the regret w.r.t. the best policy on the grid. In the second, we bound the error incurred for playing policies on the grid, rather than the continuous space of policies. We refer to this error as the Strategic Discretization Error (SDE(T)). The analysis of the regret on the grid mostly follows similar steps to the analysis of EXP3 / EXP4. The important difference is that we shift the reward obtained by at, by a factor of 1 + λ, where λ is a (tunable) parameter of the algorithm. This shifting (which does not affect the regret, since all the losses are shifted by the same fixed amount) guarantees that the losses at each round are non-negative and bounded with high probability. Technically, this requires bounding the tails of the subgaussian of the noise parameters ϵt. We now shift our attention to bounding SDE(T). The standard analysis of the discretization error in the non-strategic setting does not go through for our setting, since an agent may strategize very differently with respect to two policies which are close together in ℓ2 distance, depending on the agent s initial context. Our analysis proceeds with a case-by-case basis. Consider the best expert e in the grid. If at,e = π (xt) (i.e., the action of the best expert matches that of the optimal policy), there is no discretization error in round t. Otherwise, if at,e = π (xt), we show that the per-round SDE is upper-bounded by a term which looks like twice the discretization upper-bound for the non-strategic setting, plus an additional term. We show that this additional term must always be non-positive by considering two subcases (at,e = 1, π (xt) = 0 and at,e = 0, π (xt) = 1) and using properties about how agents strategize against the deployed algorithmic policies. ALGORITHM 4: EXP3 with strategy-aware experts (EXP3-SAE) Create set of discretized policies e E = [(1/ε)d], where ε = (dσ log(T)/T)1/(d+2). Set parameters η = q T λ2|E| , γ = 2ηλ|E|, and λ = σ 2 log T. Initialize probability distribution pt(e) = 1/|E|, e E. for t [T] do Choose policy et from probability distribution qt(e) = (1 γ) pt(e) + γ |E|. Observe x t. Play action at,et = 1 if et, x t δ et 2. Otherwise play action at,et = 0. Observe reward rt(at,et). Update loss estimator for each policy e E: bℓt(e) = (1 + λ rt(at,et)) 1{e = et}/qt(e). 
Update probability distribution e E: pt+1(e) pt(e) exp ηbℓt(e) . Computational complexity While both Algorithm 1 and Algorithm 3 have O(d3) per-iteration computational complexity, Algorithm 4 must maintain and update a probability distribution over a grid of size exponential in d at every time-step, making it hard to use in practice if d is large. We view the design of computationally efficient algorithms for adversarially-chosen contexts as an important direction for future research. Extension to bandit feedback Algorithm 4 may be extended to the bandit feedback setting by maintaining a grid over estimates of θ(1) θ(0) (instead of over θ(1)). No further changes are required. 5 Conclusion We study the problem of classification under incentives with apple tasting feedback. Such one-sided feedback is often what is observed in real-world strategic settings including lending and hiring. Our main result is a greedy algorithm (Algorithm 1) which achieves e O( T) strategic regret when the initial agent contexts are generated stochastically. The regret of Algorithm 1 depends on a constant c1(d, δ) which scales exponentially in the context dimension, which may be problematic in settings for which the number of agents is small or unknown. To address this, we provide an algorithm (Algorithm 3) which combines Algorithm 1 with a strategy-aware version of the explore-then-commit algorithm using a doubling trick to achieve e O(min{ d T c1(d,δ), d T 2/3}) expected strategic regret whenever T is unknown. Finally, we relax the assumption of stochastic contexts and allow for contexts to be generated adversarially. Algorithm 4 achieves e O(T d+1 d+2 ) expected strategic regret whenever agent contexts are generated adversarially by running EXP3 over a discretized grid of strategy-aware policies, but has exponential-in-d per-round computational complexity. All of our results also apply to the more general setting of bandit feedback, under slight modifications to the algorithms. There are several directions for future work: Unclean data The regret of Algorithm 1 depends on a constant which is exponentially large in d, due to the fact that it only learns using clean data (Condition 3.2). While learning using unclean data will generally produce an inconsistent estimator, it would be interesting to see if the principal could leverage this data to remove the dependence on this constant. Alternatively, lower bounds which show that using unclean data will not improve regret would also be interesting. Efficient algorithms for adversarial contexts Our algorithm for adversarially-chosen agent contexts suffers exponential-in-d per-round computational complexity, which makes it unsuitable for use in settings with high-dimensional contexts. Deriving polynomial-time algorithms with sublinear strategic regret for this setting is an exciting (but challenging) direction for future research. More than two actions Finally, it would be interesting to extend our algorithms for strategic learning under bandit feedback to the setting in which the principal has three or more actions at their disposal. While prior work [29] implies an impossibility result for strategic regret minimization with three or more actions, other (relaxed) notions of optimality (e.g., sublinear Stackelberg regret; recall Definition 2.2) may still be possible. Acknowledgements KH is supported in part by an NDSEG Fellowship. KH and ZSW are supported in part by the NSF FAI Award #1939606. 
For part of this work, CP was supported by a FODSI postdoctoral fellowship from UC Berkeley. The authors would like to thank the anonymous Neur IPS reviewers for valuable feedback. [1] Algorithmic hiring: Complex hiring by numbers? URL https://hiring.monster.com/ resources/recruiting-strategies/workforce-planning/hiring-algorithms/. [2] Using white font on a cv to trick ats. URL https://www.thecvstore.net/blog/ cv-ats-white-font/#:~:text=Even%20if%20you%20get%20past,harm%20your% 20future%20employment%20prospects. [3] Saba Ahmadi, Hedyeh Beyhaghi, Avrim Blum, and Keziah Naggita. The strategic perceptron. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 6 25, 2021. [4] Saba Ahmadi, Avrim Blum, and Kunhe Yang. Fundamental bounds on online strategic classification. In Proceedings of the 24th ACM Conference on Economics and Computation, EC 2023, London, United Kingdom, July 9-12, 2023, pages 22 58. ACM, 2023. [5] András Antos, Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Toward a classification of finite partial-monitoring games. Theoretical Computer Science, 473:77 99, 2013. [6] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. [7] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. In Proceedings of the sixteenth ACM conference on economics and computation, pages 61 78, 2015. [8] Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329 1349, 2021. doi: 10.1287/mnsc.2020. 3605. URL https://doi.org/10.1287/mnsc.2020.3605. [9] Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, and Steven Z Wu. Equal opportunity in online classification with partial feedback. Advances in Neural Information Processing Systems, 32, 2019. [10] Yahav Bechavod, Katrina Ligett, Steven Wu, and Juba Ziani. Gaming helps! learning from strategic interactions in natural dynamics. In International Conference on Artificial Intelligence and Statistics, pages 1234 1242. PMLR, 2021. [11] Yahav Bechavod, Chara Podimata, Steven Wu, and Juba Ziani. Information discrepancy in strategic learning. In International Conference on Machine Learning, pages 1691 1715. PMLR, 2022. [12] Jillian Berman. Do ai-powered lending algorithms silently discriminate? this initiative aims to find out, Nov 2021. URL https://www.marketwatch.com/story/ do-ai-powered-lending-algorithms-silently-discriminate-this-initiative\ -aims-to-find-out-11637246524. [13] Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164 2183. PMLR, 2023. [14] Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake-off. The Journal of Machine Learning Research, 22(1):5928 5976, 2021. [15] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Cambridge University Press, 2020. [16] Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persuasion. Advances in Neural Information Processing Systems, 33:16188 16198, 2020. [17] Matteo Castiglioni, Alberto Marchesi, Andrea Celli, and Nicola Gatti. Multi-receiver online bayesian persuasion. In International Conference on Machine Learning, pages 1314 1323. 
PMLR, 2021. [18] Yatong Chen, Wei Tang, Chien-Ju Ho, and Yang Liu. Performative prediction with bandit feedback: Learning through reparameterization. ar Xiv preprint ar Xiv:2305.01094, 2023. [19] Yiling Chen, Yang Liu, and Chara Podimata. Learning strategy-aware linear classifiers. Advances in Neural Information Processing Systems, 33:15265 15276, 2020. [20] Vincent Conitzer and Nikesh Garera. Learning algorithms for online principal-agent problems (and selling goods online). In Proceedings of the 23rd international conference on Machine learning, pages 209 216, 2006. [21] Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 55 70, 2018. [22] Itay Eilat, Ben Finkelshtein, Chaim Baskin, and Nir Rosenfeld. Strategic classification with graph neural networks. ar Xiv preprint ar Xiv:2205.15765, 2022. [23] Christian Eilers. Resume keywords: List by industry [for use to pass the ats], Jan 2023. URL https://zety.com/blog/resume-keywords. [24] Danielle Ensign, Sorelle A Friedler, Scott Nevlle, Carlos Scheidegger, and Suresh Venkatasubramanian. Decision making with limited feedback: Error bounds for predictive policing and recidivism prediction. In Proceedings of Algorithmic Learning Theory,, volume 83, 2018. [25] Ganesh Ghalme, Vineet Nair, Itay Eilat, Inbal Talgam-Cohen, and Nir Rosenfeld. Strategic classification in the dark. In International Conference on Machine Learning, pages 3672 3681. PMLR, 2021. [26] Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, and Alexander Wei. Learning in stackelberg games with non-myopic agents. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 917 918, 2022. [27] Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM conference on innovations in theoretical computer science, pages 111 122, 2016. [28] Keegan Harris, Hoda Heidari, and Steven Z Wu. Stateful strategic regression. Advances in Neural Information Processing Systems, 34:28728 28741, 2021. [29] Keegan Harris, Anish Agarwal, Chara Podimata, and Zhiwei Steven Wu. Strategyproof decisionmaking in panel data settings and beyond. ar Xiv preprint ar Xiv:2211.14236, 2022. [30] Keegan Harris, Valerie Chen, Joon Kim, Ameet Talwalkar, Hoda Heidari, and Steven Z Wu. Bayesian persuasion for algorithmic recourse. Advances in Neural Information Processing Systems, 35:11131 11144, 2022. [31] Keegan Harris, Dung Daniel T Ngo, Logan Stapleton, Hoda Heidari, and Steven Wu. Strategic instrumental variable regression: Recovering causal relationships from strategic responses. In International Conference on Machine Learning, pages 8502 8522. PMLR, 2022. [32] David P Helmbold, Nicholas Littlestone, and Philip M Long. Apple tasting. Information and Computation, 161(2):85 139, 2000. [33] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 359 376, 2014. [34] Guy Horowitz and Nir Rosenfeld. Causal strategic classification: A tale of two shifts. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 13233 13253. PMLR, 2023. 
[35] Xinyan Hu, Dung Ngo, Aleksandrs Slivkins, and Steven Z Wu. Incentivizing combinatorial bandit exploration. Advances in Neural Information Processing Systems, 35:37173 37183, 2022. [36] Nicole Immorlica, Jieming Mao, Aleksandrs Slivkins, and Zhiwei Steven Wu. Bayesian exploration with heterogeneous agents. In The world wide web conference, pages 751 761, 2019. [37] Meena Jagadeesan, Celestine Mendler-Dünner, and Moritz Hardt. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, pages 4687 4697. PMLR, 2021. [38] Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, page 2231 2241, Red Hook, NY, USA, 2018. Curran Associates Inc. [39] Jon Kleinberg and Manish Raghavan. How do classifiers induce agents to invest effort strategically? ACM Transactions on Economics and Computation (TEAC), 8(4):1 23, 2020. [40] Tony Lancaster and Guido Imbens. Case-control studies with contaminated controls. Journal of Econometrics, 71(1-2):145 160, 1996. [41] Sagi Levanon and Nir Rosenfeld. Strategic classification made practical. In International Conference on Machine Learning, pages 6243 6253. PMLR, 2021. [42] Sagi Levanon and Nir Rosenfeld. Generalized strategic classification and the case of aligned incentives. In International Conference on Machine Learning, pages 12593 12618. PMLR, 2022. [43] Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 565 582, 2015. [44] Daniel Ngo, Logan Stapleton, Vasilis Syrgkanis, and Zhiwei Steven Wu. Incentivizing bandit exploration: Recommendations as instruments. In Proceedings of the 2021 International Conference on Machine Learning (ICML 21), 2021. [45] Daniel Ngo, Keegan Harris, Anish Agarwal, Vasilis Syrgkanis, and Zhiwei Steven Wu. Incentiveaware synthetic control: Accurate counterfactual estimation via incentivized exploration. 2023. URL https://keeganharris.github.io/IE_SC.pdf. [46] Bev O Shea. 9 ways to build and improve your credit fast, Nov 2022. URL https://www. nerdwallet.com/article/finance/raise-credit-score-fast. [47] Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599 7609. PMLR, 2020. [48] Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. The externalities of exploration and how data diversity helps exploitation. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, volume 75 of Proceedings of Machine Learning Research, pages 1724 1738. PMLR, 2018. URL http://proceedings.mlr.press/v75/ raghavan18a.html. [49] Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. Greedy algorithm almost dominates in smoothed contextual bandits. SIAM Journal on Computing, 52(2):487 524, 2023. doi: 10.1137/19M1247115. URL https://doi.org/10.1137/ 19M1247115. [50] Mark Sellke and Aleksandrs Slivkins. The price of incentivizing exploration: A characterization via thompson sampling and sample complexity. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 795 796, 2021. 
[51] Ohad Shamir. A variant of azuma s inequality for martingales with subgaussian tails. ar Xiv preprint ar Xiv:1110.2392, 2011. [52] Yonadav Shavit, Benjamin Edelman, and Brian Axelrod. Causal strategic linear regression. In International Conference on Machine Learning, pages 8676 8686. PMLR, 2020. [53] Vidyashankar Sivakumar, Zhiwei Steven Wu, and Arindam Banerjee. Structured linear contextual bandits: A sharp and geometric smoothed analysis. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 9026 9035. PMLR, 2020. URL http://proceedings.mlr.press/v119/sivakumar20a.html. [54] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12:389 434, 2012. [55] Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in stackelberg games with an omniscient follower. In International Conference on Machine Learning, pages 42304 42316. PMLR, 2023. [56] Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I. Jordan. The sample complexity of online contract design. In Proceedings of the 24th ACM Conference on Economics and Computation, EC 2023, London, United Kingdom, July 9-12, 2023, page 1188. ACM, 2023. A Useful concentration inequalities Theorem A.1 (Matrix Azuma, Tropp [54]). Consider a self-adjoint matrix martingale {Ys : s = 1, . . . , t} in dimension d, and let {Xs}s [t] be the associated difference sequence satisfying Es 1Xs = 0d d and X2 s A2 s for some fixed sequence {As}s [t] of self-adjoint matrices. Then for all α > 0, P (λmax(Yt EYt) α) d exp( α2/8σ2), where σ2 := Pt s=1 A2 s 2. Theorem A.2 (A variant of Azuma s inequality for martingales with subgaussian tails, Shamir [51]). Let Z1, Z2, . . . Zt be a martingale difference sequence with respect to a sequence W1, W2, . . . , Wt, and suppose there are constants b > 1, c > 0 such that for any s and any α > 0, it holds that max{P(Zs > α|X1, . . . , Xs 1), P(Zs < α|X1, . . . , Xs 1)} b exp( cα2). Then for any γ > 0, it holds with probability 1 γ that 28b log(1/γ) B Proofs for Section 3 B.1 Proof of Theorem 3.3 Theorem B.1. Let f U d : X R 0 denote the density function of the uniform distribution over the d-dimensional unit sphere. If agent contexts are drawn from a distribution over the d-dimensional unit sphere with density function f : X R 0 such that f(x) f Ud(x) c0 > 0, x X, then Algorithm 1 achieves the following performance guarantee: Reg(T) 4d + 8 c0 c1(δ, d) c2(δ, d) 14dσ2T log(4d T/γ) with probability 1 γ, where 0 < c1(δ, d) := Px U d(x[1] δ) and 0 < c2(δ, d) := Ex U d[x[2]2|x[1] δ]. Proof. We start from the definition of strategic regret. Note that under apple tasting feedback, θ(0) = 0. D θ(a t ) θ(at), xt E D ˆθ (a t ) t ˆθ (at) t , xt E + D θ(a t ) ˆθ (a t ) t , xt E + D ˆθ (at) t θ(at), xt E D θ(1) ˆθ (1) t , xt E + D θ(0) ˆθ (0) t , xt E θ(1) ˆθ (1) t 2 xt 2 + θ(0) ˆθ (0) t 2 xt 2 θ(1) ˆθ (1) t 2 + θ(0) ˆθ (0) t 2 where the first inequality follows from Lemma B.2, the second inequality follows from the Cauchy Schwarz, and the third inequality follows from the fact that the instantaneous regret at each time-step is at most 2 and we use the first d rounds to bootstrap our OLS. The result follows by Lemma B.3, a union bound, and the fact that PT d+1 q bθ (a t ) t bθ (at) t , xt 0 Proof. If at = a t , the condition is satisfied trivially. 
If at = a t , then either (1) bθ (1) t bθ (0) t , x t δ bθ (1) t bθ (0) t 2 0 and θ(1) θ(0) < 0 or (2) bθ (1) t bθ (0) t , x t δ bθ (1) t bθ (0) t 2 < 0 and θ(1) θ(0) 0. Case 1: bθ (1) t bθ (0) t , x t δ bθ (1) t bθ (0) t 2 0 and θ(1) θ(0) < 0. (a t = 0, at = 1) By Definition 2.1, we can rewrite bθ (1) t bθ (0) t , x t δ bθ (1) t bθ (0) t 2 0 as bθ (1) t bθ (0) t , xt + (δ δ) bθ (1) t bθ (0) t 2 0 for some δ δ. Since (δ δ) bθ (1) t bθ (0) t 2 0, bθ (1) t bθ (0) t , xt 0 must hold. Case 2: bθ (1) t bθ (0) t , x t δ bθ (1) t bθ (0) t 2 < 0 and θ(1) θ(0) 0. (a t = 1, at = 0) Since modification did not help agent t receive action at = 1, we can conclude that bθ (1) t bθ (0) t , xt < 0. Lemma B.3. Let f U d : X R 0 denote the density function of the uniform distribution over the d-dimensional unit sphere. If T d and agent contexts are drawn from a distribution over the d-dimensional unit sphere with density function f : X R 0 such that f(x) f Ud(x) c0 R>0, x X, then the following guarantee holds under apple tasting feedback. θ(1) ˆθ (1) t+1 2 2 c0 c1(δ, d) c2(δ, d) 14dσ2 log(2d/γt) with probability 1 γt. Proof. Let I(1) s = { ˆθ (1) s , xs δ ˆθ (1) s 2 + r0}. Then, from the definition of θ(1) t+1 we have: ˆθ (1) t+1 := s=1 xsx s 1{I(1) s } s=1 xsrs(1)1{I(1) s } (closed form solution of OLS) s=1 xsx s 1{I(1) s } s=1 xs(x s θ(1) + ϵs)1{I(1) s } (plug in rs(1)) s=1 xsx s 1{I(1) s } s=1 xsϵs1{I(1) s } Re-arranging the above and taking the ℓ2 norm on both sides we get: θ(1) ˆθ (1) t+1 2 = s=1 xsx s 1{I(1) s } s=1 xsϵs1{I(1) s } s=1 xsx s 1{I(1) s } s=1 xsϵs1{I(1) s } 2 (Cauchy-Schwarz) Pt s=1 xsϵs1{I(1) s } 2 σmin Pt s=1 xsx s 1{I(1) s } Pt s=1 xsϵs1{I(1) s } 2 λmin Pt s=1 xsx s 1{I(1) s } where for a matrix M, σmin is the smallest singular value σmin(M) := min x =1 Mx and λmin is the smallest eigenvalue. Note that the two are equal since the matrix Pt s=1 xsx s 1{I(1) s } is PSD as the sum of PSD matrices (outer products induce PSD matrices). The final bound is obtained by applying Lemma B.4, Lemma B.5, and a union bound. Lemma B.4. The following bound holds on the ℓ2-norm of Pt s=1 xsϵs1{I(1) s } with probability 1 γt: s=1 xsϵs1{I(1) s } 14dσ2t log(d/γt) Proof. Let x[i] denote the i-th coordinate of a vector x. Observe that Pt s=1 ϵsxs[i]1{I(1) s } is a sum of martingale differences with Zs := ϵsxs[i]1{I(1) s }, Xs := Ps s =1 ϵs xs [i]1{I(1) s }, and max{P(Zs > α|X1, . . . , Xs 1), P(Zs < α|X1, . . . , Xs 1)} exp( α2/2σ2). By Theorem A.2, s=1 ϵsxs[i]1{I(1) s } 2 p 14σ2t log(1/γi) with probability 1 γi. The desired result follows via a union bound and algebraic manipulation. Lemma B.5. The following bound holds on the minimum eigenvalue of Pt s=1 xsx s 1{I(1) s } with probability 1 γt: s=1 xsx s 1{I(1) s } t c0 c1(δ, d) c2(δ, d) + 4 p 2t log(d/γt) s=1 xsx s 1{I(1) s } s=1 xsx s 1{I(1) s } Es 1[xsx s 1{I(1) s }] s=1 Es 1[xsx s 1{I(1) s }] s=1 xsx s 1{I(1) s } Es 1[xsx s 1{I(1) s }] s=1 λmin Es 1[xsx s 1{I(1) s }] (2) where the inequalities follow from the fact that λmin(A + B) λmin(A) + λmin(B) for two Hermitian matrices A, B. Note that the outer products form Hermitian matrices. Let Yt := Pt s=1 xsx s 1{I(1) s } Es 1[xsx s 1{I(1) s }]. Note that by the tower rule, EYt = EY0 = 0. Let Xs := Es 1[xsx s 1{I(1) s }], then Es 1[ Xs] = 0, and ( Xs)2 4Id. By Theorem A.1, P(λmax( Yt) α) d exp( α2/32t). Since λmax( Yt) = λmin(Yt), P(λmax(Yt) α) d exp( α2/32t). Therefore, λmin(Yt) 4 p 2t log(d/γt) with probability 1 γt. We now turn our attention to lower bounding λmin(Es 1[xsx s 1{I(1) s }]). 
Lemma B.4. The following bound holds on the $\ell_2$-norm of $\sum_{s=1}^t x_s \epsilon_s \mathbf{1}\{I^{(1)}_s\}$ with probability at least $1 - \gamma_t$:
$$\Big\|\sum_{s=1}^t x_s \epsilon_s \mathbf{1}\{I^{(1)}_s\}\Big\|_2 \leq \sqrt{14 d \sigma^2 t \log(d/\gamma_t)}.$$

Proof. Let $x[i]$ denote the $i$-th coordinate of a vector $x$. Observe that $\sum_{s=1}^t \epsilon_s x_s[i] \mathbf{1}\{I^{(1)}_s\}$ is a sum of martingale differences with $Z_s := \epsilon_s x_s[i] \mathbf{1}\{I^{(1)}_s\}$, $X_s := \sum_{s'=1}^{s} \epsilon_{s'} x_{s'}[i] \mathbf{1}\{I^{(1)}_{s'}\}$, and
$$\max\big\{\mathbb{P}(Z_s > \alpha \mid X_1, \dots, X_{s-1}),\ \mathbb{P}(Z_s < -\alpha \mid X_1, \dots, X_{s-1})\big\} \leq \exp(-\alpha^2/2\sigma^2).$$
By Theorem A.2,
$$\Big|\sum_{s=1}^t \epsilon_s x_s[i] \mathbf{1}\{I^{(1)}_s\}\Big| \leq 2\sqrt{14 \sigma^2 t \log(1/\gamma_i)}$$
with probability at least $1 - \gamma_i$. The desired result follows via a union bound and algebraic manipulation.

Lemma B.5. The following bound holds on the minimum eigenvalue of $\sum_{s=1}^t x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}$ with probability at least $1 - \gamma_t$:
$$\lambda_{\min}\Big(\sum_{s=1}^t x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}\Big) \geq t\, c_0\, c_1(\delta, d)\, c_2(\delta, d) - 4\sqrt{2t \log(d/\gamma_t)}.$$

Proof. We decompose:
$$\lambda_{\min}\Big(\sum_{s=1}^t x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}\Big) \geq \lambda_{\min}\Big(\sum_{s=1}^t \big(x_s x_s^\top \mathbf{1}\{I^{(1)}_s\} - \mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\big)\Big) + \lambda_{\min}\Big(\sum_{s=1}^t \mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\Big)$$
$$\geq \lambda_{\min}\Big(\sum_{s=1}^t \big(x_s x_s^\top \mathbf{1}\{I^{(1)}_s\} - \mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\big)\Big) + \sum_{s=1}^t \lambda_{\min}\big(\mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\big), \qquad (2)$$
where the inequalities follow from the fact that $\lambda_{\min}(A + B) \geq \lambda_{\min}(A) + \lambda_{\min}(B)$ for two Hermitian matrices $A, B$; note that the outer products form Hermitian matrices.

Let $Y_t := \sum_{s=1}^t \big(x_s x_s^\top \mathbf{1}\{I^{(1)}_s\} - \mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\big)$ with difference sequence $\tilde X_s := x_s x_s^\top \mathbf{1}\{I^{(1)}_s\} - \mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]$. By the tower rule, $\mathbb{E} Y_t = \mathbb{E} Y_0 = 0$; moreover, $\mathbb{E}_{s-1}[\tilde X_s] = 0$ and $\tilde X_s^2 \preceq 4 I_d$. By Theorem A.1, $\mathbb{P}(\lambda_{\max}(-Y_t) \geq \alpha) \leq d \exp(-\alpha^2/32t)$. Since $\lambda_{\max}(-Y_t) = -\lambda_{\min}(Y_t)$, we have $\mathbb{P}(\lambda_{\min}(Y_t) \leq -\alpha) \leq d \exp(-\alpha^2/32t)$, and therefore $\lambda_{\min}(Y_t) \geq -4\sqrt{2t \log(d/\gamma_t)}$ with probability at least $1 - \gamma_t$.

We now turn our attention to lower bounding $\lambda_{\min}(\mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}])$:
$$\lambda_{\min}\big(\mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\big) := \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\, \omega = \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \Big(\int x_s x_s^\top \mathbf{1}\{I^{(1)}_s\} f(x_s)\, dx_s\Big) \omega$$
$$= \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \Big(\int x_s x_s^\top \mathbf{1}\{I^{(1)}_s\} \frac{f(x_s)}{f_{\mathcal{U}_d}(x_s)} f_{\mathcal{U}_d}(x_s)\, dx_s\Big) \omega \geq c_0 \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \mathbb{E}_{s-1, \mathcal{U}_d}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\, \omega$$
$$= c_0 \min_{\omega \in \mathcal{S}^{d-1}} \mathbb{E}_{s-1, \mathcal{U}_d}\big[\langle \omega, x_s \rangle^2 \mid \langle \hat\beta_s, x_s \rangle \geq \delta \|\hat\beta_s\|_2\big] \underbrace{\mathbb{P}_{s-1, \mathcal{U}_d}\big(\langle \hat\beta_s, x_s \rangle \geq \delta \|\hat\beta_s\|_2\big)}_{c_1(\delta, d)} = c_0\, c_1(\delta, d) \min_{\omega \in \mathcal{S}^{d-1}} \mathbb{E}_{s-1, \mathcal{U}_d}\big[\langle \omega, x_s \rangle^2 \mid \langle \hat\beta_s, x_s \rangle \geq \delta \|\hat\beta_s\|_2\big]. \qquad (3)$$
Throughout the remainder of the proof, we suppress the dependence on $\mathcal{U}_d$ and note that, unless stated otherwise, all expectations are taken with respect to $\mathcal{U}_d$. Let $B_s \in \mathbb{R}^{d \times d}$ be the orthonormal matrix whose first column is $\hat\beta_s / \|\hat\beta_s\|_2$. Note that $B_s e_1 = \hat\beta_s / \|\hat\beta_s\|_2$ and $B_s x \sim \mathcal{U}_d$ when $x \sim \mathcal{U}_d$. Then
$$\mathbb{E}_{s-1}\big[x_s x_s^\top \mid \langle \hat\beta_s, x_s \rangle \geq \delta \|\hat\beta_s\|_2\big] = \mathbb{E}_{s-1}\big[(B_s x_s)(B_s x_s)^\top \mid \langle \hat\beta_s, B_s x_s \rangle \geq \delta \|\hat\beta_s\|_2\big] = B_s\, \mathbb{E}_{s-1}\big[x_s x_s^\top \mid x_s^\top B_s^\top \|\hat\beta_s\|_2 B_s e_1 \geq \delta \|\hat\beta_s\|_2\big]\, B_s^\top = B_s\, \mathbb{E}_{s-1}\big[x_s x_s^\top \mid x_s[1] \geq \delta\big]\, B_s^\top.$$
Observe that for $j \neq 1$ and $i \neq j$, $\mathbb{E}[x_s[j] x_s[i] \mid x_s[1] \geq \delta] = 0$. Therefore,
$$\mathbb{E}_{s-1}\big[x_s x_s^\top \mid \langle \hat\beta_s, x_s \rangle \geq \delta \|\hat\beta_s\|_2\big] = B_s\Big(\mathbb{E}[x_s[2]^2 \mid x_s[1] \geq \delta]\, I_d + \big(\mathbb{E}[x_s[1]^2 \mid x_s[1] \geq \delta] - \mathbb{E}[x_s[2]^2 \mid x_s[1] \geq \delta]\big) e_1 e_1^\top\Big) B_s^\top$$
$$= \mathbb{E}[x_s[2]^2 \mid x_s[1] \geq \delta]\, I_d + \big(\mathbb{E}[x_s[1]^2 \mid x_s[1] \geq \delta] - \mathbb{E}[x_s[2]^2 \mid x_s[1] \geq \delta]\big) \frac{\hat\beta_s \hat\beta_s^\top}{\|\hat\beta_s\|_2^2}.$$
Putting everything together,
$$\lambda_{\min}\big(\mathbb{E}_{s-1}[x_s x_s^\top \mathbf{1}\{I^{(1)}_s\}]\big) \geq c_0\, c_1(\delta, d) \min_{\omega \in \mathcal{S}^{d-1}} \Big\{\mathbb{E}[x_s[2]^2 \mid x_s[1] \geq \delta]\, \|\omega\|_2^2 + \big(\mathbb{E}[x_s[1]^2 \mid x_s[1] \geq \delta] - \mathbb{E}[x_s[2]^2 \mid x_s[1] \geq \delta]\big) \Big\langle \omega, \frac{\hat\beta_s}{\|\hat\beta_s\|_2} \Big\rangle^2 \Big\} \geq c_0\, c_1(\delta, d)\, c_2(\delta, d).$$

Lemma B.6. For sufficiently large values of $d$, $c_1(\delta, d) \geq \Theta\big((1-\delta)^{d/2}\big)$.

Proof. Lemma B.6 is obtained via an argument similar to Theorem 2.7 in Blum et al. [15]. As in Blum et al. [15], we are interested in the volume of a hyperspherical cap; however, we require a lower bound rather than an upper bound (as is the case in [15]). Let $A$ denote the portion of the $d$-dimensional unit sphere with $x[1] \geq \delta$ and let $H$ denote the upper hemisphere. Then
$$c_1(\delta, d) := \mathbb{P}_{x \sim \mathcal{U}_d}(x[1] \geq \delta) = \frac{\mathrm{vol}(A)}{\mathrm{vol}(H)}.$$
In order to lower-bound $c_1(\delta, d)$, it suffices to lower-bound $\mathrm{vol}(A)$ and upper-bound $\mathrm{vol}(H)$. In what follows, let $V(d)$ denote the volume of the $d$-dimensional ball of radius 1.

Lower-bounding $\mathrm{vol}(A)$: As in [15], to calculate the volume of $A$ we integrate an incremental volume that is a disk of width $dx[1]$ whose face is a ball of dimension $d-1$ and radius $\sqrt{1 - x[1]^2}$. The surface area of the disk is $(1 - x[1]^2)^{\frac{d-1}{2}} V(d-1)$ and the volume above the slice $x[1] = \delta$ is
$$\mathrm{vol}(A) = \int_\delta^1 (1 - x[1]^2)^{\frac{d-1}{2}} V(d-1)\, dx[1].$$
To get a lower bound on the integral, we use the fact that $1 - x^2 \geq 1 - x$ for $x \in [0,1]$:
$$\mathrm{vol}(A) \geq \int_\delta^1 (1 - x[1])^{\frac{d-1}{2}} V(d-1)\, dx[1] = V(d-1)\, \frac{2}{d+1}\, (1-\delta)^{\frac{d+1}{2}}.$$
Upper-bounding $\mathrm{vol}(H)$: We can obtain an exact expression for $\mathrm{vol}(H)$ in terms of $V(d-1)$ using the recursive relationship between $V(d)$ and $V(d-1)$:
$$\mathrm{vol}(H) = \frac{V(d)}{2} = \frac{\sqrt{\pi}\, \Gamma\big(\frac{d+1}{2}\big)}{2\, \Gamma\big(\frac{d}{2}+1\big)}\, V(d-1).$$
Plugging in our bounds for $\mathrm{vol}(A)$ and $\mathrm{vol}(H)$ and simplifying, we see that
$$c_1(\delta, d) \geq \frac{4\, \Gamma\big(\frac{d}{2}+1\big)}{\sqrt{\pi}\, (d+1)\, \Gamma\big(\frac{d+1}{2}\big)}\, (1-\delta)^{\frac{d+1}{2}} = \Theta\Big(\frac{(1-\delta)^{\frac{d+1}{2}}}{\sqrt{\pi(d+1)}}\Big),$$
where the equality follows from Stirling's approximation.

Lemma B.7. The following bound holds on $c_2(\delta, d)$:
$$c_2(\delta, d) \geq \frac{1}{3d}\Big(1 - \Big(\frac{1+\delta}{2}\Big)^2\Big)^2.$$

Proof. We begin by computing $\mathbb{E}[x[2]^2 \mid x[1] = \delta']$ for $\delta' \in (0,1)$. If $x[1] = \delta'$, then $x[2]^2 + \dots + x[d]^2 \leq 1 - (\delta')^2$. Using this fact, we see that
$$\mathbb{E}[x[2]^2 \mid x[1] = \delta'] = \frac{1}{d}\, \mathbb{E}_{r \sim \mathrm{Unif}[0,\, 1-(\delta')^2]}[r^2] = \frac{1}{3d}\big(1 - (\delta')^2\big)^2.$$
Since $\mathbb{E}[x[2]^2 \mid x[1] \geq \delta] \geq \mathbb{E}\big[x[2]^2 \mid x[1] = \delta + \frac{1-\delta}{2}\big]$, the result follows by setting $\delta' = \delta + \frac{1-\delta}{2} = \frac{1+\delta}{2}$.

B.2 Proof of Proposition 3.4

Proposition B.8. For any sequence of linear threshold policies $\beta_1, \dots, \beta_T$,
$$\mathbb{E}_{x_1, \dots, x_T \sim \mathcal{U}_d}\Big[\sum_{t=1}^T \mathbf{1}\{\langle x_t, \beta_t \rangle \geq \delta \|\beta_t\|_2\}\Big] = T\, \mathbb{P}_{x \sim \mathcal{U}_d}(x[1] \geq \delta).$$

Proof. Let $B_t \in \mathbb{R}^{d \times d}$ be the orthonormal matrix whose first column is $\beta_t / \|\beta_t\|_2$. Note that $B_t x \sim \mathcal{U}_d$ if $x \sim \mathcal{U}_d$ and $B_t e_1 = \beta_t / \|\beta_t\|_2$. Then
$$\mathbb{E}_{x_1, \dots, x_T \sim \mathcal{U}_d}\Big[\sum_{t=1}^T \mathbf{1}\{\langle x_t, \beta_t \rangle \geq \delta \|\beta_t\|_2\}\Big] = \sum_{t=1}^T \mathbb{P}_{x_t \sim \mathcal{U}_d}\big(\langle x_t, \beta_t \rangle \geq \delta \|\beta_t\|_2\big) = \sum_{t=1}^T \mathbb{P}_{x_t \sim \mathcal{U}_d}\big(\langle B_t x_t, \beta_t \rangle \geq \delta \|\beta_t\|_2\big)$$
$$= \sum_{t=1}^T \mathbb{P}_{x_t \sim \mathcal{U}_d}\big(x_t^\top B_t^\top \|\beta_t\|_2 B_t e_1 \geq \delta \|\beta_t\|_2\big) = \sum_{t=1}^T \mathbb{P}_{x_t \sim \mathcal{U}_d}\big(x_t^\top e_1 \geq \delta\big) = T\, \mathbb{P}_{x \sim \mathcal{U}_d}(x[1] \geq \delta).$$
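The constants $c_1(\delta, d)$ and $c_2(\delta, d)$ appearing throughout this section are easy to estimate numerically. The Python sketch below approximates both by Monte Carlo over the uniform distribution on the unit sphere, which makes the $(1-\delta)^{\Theta(d)}$ decay from Lemma B.6 visible; the sample size and the chosen values of $d$ and $\delta$ are arbitrary illustration choices.

import numpy as np

def sample_unit_sphere(n, d, rng):
    # Draw n points uniformly from the d-dimensional unit sphere.
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def estimate_constants(d, delta, n=500_000, seed=0):
    rng = np.random.default_rng(seed)
    x = sample_unit_sphere(n, d, rng)
    cap = x[:, 0] >= delta                      # the event {x[1] >= delta}
    c1 = cap.mean()                             # c_1(delta, d) = P(x[1] >= delta)
    c2 = (x[cap, 1] ** 2).mean() if cap.any() else float("nan")
    #                                             c_2(delta, d) = E[x[2]^2 | x[1] >= delta]
    return c1, c2

for d in (5, 10, 20):
    c1, c2 = estimate_constants(d, delta=0.2)
    print(d, c1, c2)    # c_1 shrinks roughly like (1 - delta)^{Theta(d)}, cf. Lemma B.6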
B.3 Explore-Then-Commit Analysis

Theorem B.9. Let $f_{\mathcal{U}_d} : \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ denote the density function of the uniform distribution over the $d$-dimensional unit sphere. If agent contexts are drawn from a distribution over the $d$-dimensional unit sphere with density function $f : \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ such that $\frac{f(x)}{f_{\mathcal{U}_d}(x)} \geq c_0 > 0$ for all $x \in \mathcal{X}$, then Algorithm 2 achieves the following performance guarantee:
$$\mathrm{Reg}_{\mathrm{ETC}}(T) \leq \frac{8 \cdot 63^{1/3}}{c_0}\, d\, \sigma^{2/3}\, T^{2/3} \log^{1/3}(4d/\gamma)$$
with probability at least $1 - \gamma$, if $T_0 := 4 \cdot 63^{1/3}\, d\, \sigma^{2/3}\, T^{2/3} \log^{1/3}(4d/\gamma)$.

Proof. We decompose the regret as
$$\mathrm{Reg}_{\mathrm{ETC}}(T) := \sum_{t=1}^T \langle \theta^{(a_t^*)} - \theta^{(a_t)}, x_t \rangle = \sum_{t=1}^{T_0} \langle \theta^{(a_t^*)} - \theta^{(a_t)}, x_t \rangle + \sum_{t=T_0+1}^{T} \langle \theta^{(a_t^*)} - \theta^{(a_t)}, x_t \rangle$$
$$\leq T_0 + \sum_{t=T_0+1}^{T} \Big(\langle \hat\theta^{(a_t^*)}_{T_0/2} - \hat\theta^{(a_t)}_{T_0/2}, x_t \rangle + \langle \theta^{(a_t^*)} - \hat\theta^{(a_t^*)}_{T_0/2}, x_t \rangle + \langle \hat\theta^{(a_t)}_{T_0/2} - \theta^{(a_t)}, x_t \rangle\Big)$$
$$\leq T_0 + \sum_{t=T_0+1}^{T} \Big(\big|\langle \theta^{(1)} - \hat\theta^{(1)}_{T_0/2}, x_t \rangle\big| + \big|\langle \theta^{(0)} - \hat\theta^{(0)}_{T_0/2}, x_t \rangle\big|\Big) \leq T_0 + \sum_{t=T_0+1}^{T} \Big(\|\theta^{(1)} - \hat\theta^{(1)}_{T_0/2}\|_2 \|x_t\|_2 + \|\theta^{(0)} - \hat\theta^{(0)}_{T_0/2}\|_2 \|x_t\|_2\Big)$$
$$\leq T_0 + T\|\theta^{(1)} - \hat\theta^{(1)}_{T_0/2}\|_2 + T\|\theta^{(0)} - \hat\theta^{(0)}_{T_0/2}\|_2 \leq T_0 + \frac{24dT}{c_0}\sqrt{\frac{7 d \sigma^2 \log(4d/\gamma)}{T_0}}$$
with probability at least $1 - \gamma$, where the last inequality follows from Lemma B.10 and a union bound. The result follows from picking $T_0 = 4 \cdot 63^{1/3}\, d\, \sigma^{2/3}\, T^{2/3} \log^{1/3}(4d/\gamma)$.

Lemma B.10. Let $f_{\mathcal{U}_d} : \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ denote the density function of the uniform distribution over the $d$-dimensional unit sphere. If $T \geq 2d$ and agent contexts are drawn from a distribution over the $d$-dimensional unit sphere with density function $f : \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ such that $\frac{f(x)}{f_{\mathcal{U}_d}(x)} \geq c_0 > 0$ for all $x \in \mathcal{X}$, then the following guarantee holds for $a \in \{0, 1\}$:
$$\|\theta^{(a)} - \hat\theta^{(a)}_{t+1}\|_2 \leq \frac{12d}{c_0}\sqrt{\frac{7 d \sigma^2 \log(2d/\gamma_t)}{t}}$$
with probability at least $1 - \gamma_t$.

Proof. Observe that
$$\hat\theta^{(a)}_{T_0/2} := \Big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\Big)^{-1} \sum_{s=1}^{T_0/2} x_{s+k} r^{(a)}_{s+k} = \Big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\Big)^{-1} \sum_{s=1}^{T_0/2} x_{s+k} \big(x_{s+k}^\top \theta^{(a)} + \epsilon_{s+k}\big) = \theta^{(a)} + \Big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\Big)^{-1} \sum_{s=1}^{T_0/2} x_{s+k} \epsilon_{s+k},$$
where $k = 0$ if $a = 0$ and $k = T_0/2$ if $a = 1$ (i.e., each arm is estimated from its half of the exploration phase). Therefore,
$$\|\theta^{(a)} - \hat\theta^{(a)}_{T_0/2}\|_2 = \Big\|\Big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\Big)^{-1} \sum_{s=1}^{T_0/2} x_{s+k} \epsilon_{s+k}\Big\|_2 \leq \frac{\big\|\sum_{s=1}^{T_0/2} x_{s+k} \epsilon_{s+k}\big\|_2}{\sigma_{\min}\big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\big)} = \frac{\big\|\sum_{s=1}^{T_0/2} x_{s+k} \epsilon_{s+k}\big\|_2}{\lambda_{\min}\big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\big)}.$$
The desired result is obtained by applying Lemma B.11, Lemma B.12, and a union bound.

Lemma B.11. The following bound holds on the $\ell_2$-norm of $\sum_{s=1}^{T_0/2} x_{s+k} \epsilon_{s+k}$ with probability at least $1 - \gamma$:
$$\Big\|\sum_{s=1}^{T_0/2} x_{s+k} \epsilon_{s+k}\Big\|_2 \leq \sqrt{7 d \sigma^2 T_0 \log(d/\gamma)}.$$

Proof. Observe that $\sum_{s=1}^{T_0/2} \epsilon_{k+s} x_{k+s}[i]$ is a sum of martingale differences with $Z_{k+s} := \epsilon_{k+s} x_{k+s}[i]$, $X_{k+s} := \sum_{s'=1}^{s} \epsilon_{k+s'} x_{k+s'}[i]$, and
$$\max\big\{\mathbb{P}(Z_{k+s} > \alpha \mid X_{k+1}, \dots, X_{k+s-1}),\ \mathbb{P}(Z_{k+s} < -\alpha \mid X_{k+1}, \dots, X_{k+s-1})\big\} \leq \exp(-\alpha^2/2\sigma^2).$$
By Theorem A.2,
$$\Big|\sum_{s=1}^{T_0/2} \epsilon_{k+s} x_{k+s}[i]\Big| \leq 2\sqrt{7 \sigma^2 T_0 \log(1/\gamma_i)}$$
with probability at least $1 - \gamma_i$. The desired result follows via a union bound and algebraic manipulation.

Lemma B.12. The following bound holds on the minimum eigenvalue of $\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top$ with probability at least $1 - \gamma$:
$$\lambda_{\min}\Big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\Big) \geq \frac{c_0\, T_0}{2d} - 4\sqrt{T_0 \log(d/\gamma)}.$$

Proof. As in the proof of Lemma B.5,
$$\lambda_{\min}\Big(\sum_{s=1}^{T_0/2} x_{s+k} x_{s+k}^\top\Big) \geq \lambda_{\min}\Big(\sum_{s=1}^{T_0/2} \big(x_{s+k} x_{s+k}^\top - \mathbb{E}[x_{s+k} x_{s+k}^\top]\big)\Big) + \lambda_{\min}\Big(\sum_{s=1}^{T_0/2} \mathbb{E}[x_{s+k} x_{s+k}^\top]\Big)$$
$$\geq \lambda_{\min}\Big(\sum_{s=1}^{T_0/2} \big(x_{s+k} x_{s+k}^\top - \mathbb{E}[x_{s+k} x_{s+k}^\top]\big)\Big) + \sum_{s=1}^{T_0/2} \lambda_{\min}\big(\mathbb{E}[x_{s+k} x_{s+k}^\top]\big),$$
where the inequalities follow from the fact that $\lambda_{\min}(A+B) \geq \lambda_{\min}(A) + \lambda_{\min}(B)$ for two Hermitian matrices $A, B$. Let $Y_{T_0/2} := \sum_{s=1}^{T_0/2} \big(x_{s+k} x_{s+k}^\top - \mathbb{E}[x_{s+k} x_{s+k}^\top]\big)$ with difference sequence $\tilde X_{s+k} := x_{s+k} x_{s+k}^\top - \mathbb{E}[x_{s+k} x_{s+k}^\top]$. Note that $\mathbb{E} Y_{T_0/2} = \mathbb{E} Y_0 = 0$, $\mathbb{E}[\tilde X_{s+k}] = 0$, and $\tilde X_{s+k}^2 \preceq 4 I_d$. By Theorem A.1, $\mathbb{P}(\lambda_{\max}(-Y_{T_0/2}) \geq \alpha) \leq d \exp(-\alpha^2/16 T_0)$. Since $\lambda_{\max}(-Y_{T_0/2}) = -\lambda_{\min}(Y_{T_0/2})$, we have $\mathbb{P}(\lambda_{\min}(Y_{T_0/2}) \leq -\alpha) \leq d \exp(-\alpha^2/16 T_0)$, and therefore $\lambda_{\min}(Y_{T_0/2}) \geq -4\sqrt{T_0 \log(d/\gamma)}$ with probability at least $1 - \gamma$. We now turn our attention to lower bounding $\lambda_{\min}(\mathbb{E}[x_{s+k} x_{s+k}^\top])$:
$$\lambda_{\min}\big(\mathbb{E}[x_{s+k} x_{s+k}^\top]\big) := \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \mathbb{E}[x_{s+k} x_{s+k}^\top]\, \omega \geq c_0 \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \mathbb{E}_{x \sim \mathcal{U}_d}[x x^\top]\, \omega = c_0 \min_{\omega \in \mathcal{S}^{d-1}} \omega^\top \frac{1}{d} I_d\, \omega = \frac{c_0}{d},$$
where the inequality uses $\frac{f(x)}{f_{\mathcal{U}_d}(x)} \geq c_0$, as in the proof of Lemma B.5. Combining the two bounds completes the proof.
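The explore-then-commit schedule analyzed above can be mirrored in a few lines of Python. The sketch below explores for $T_0$ rounds (during which every agent receives action 1 and has no reason to manipulate), fits OLS on the exploration data, and then commits to the shifted linear policy. The reward model, the cap on $T_0$, and the omission of manipulation during the commit phase are simplifying assumptions for this illustration; it is not the exact pseudocode of Algorithm 2.

import numpy as np

rng = np.random.default_rng(2)
d, T, sigma, delta, conf = 5, 200_000, 0.1, 0.2, 0.05
theta1 = rng.normal(size=d); theta1 /= np.linalg.norm(theta1)   # theta^(1); theta^(0) = 0 is known

# Exploration length in the spirit of Theorem B.9, capped so the sketch leaves rounds to commit.
T0 = int(4 * 63 ** (1 / 3) * d * sigma ** (2 / 3) * T ** (2 / 3) * np.log(4 * d / conf) ** (1 / 3))
T0 = min(T0, T // 2)

def sphere(n):
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Explore: always assign action 1, so agents have no incentive to strategize.
X_explore = sphere(T0)
r_explore = X_explore @ theta1 + sigma * rng.normal(size=T0)
theta1_hat = np.linalg.solve(X_explore.T @ X_explore, X_explore.T @ r_explore)

# Commit: deploy the shifted linear policy based on theta1_hat (with theta0_hat = 0, r0 = 0).
X_commit = sphere(T - T0)
assign = X_commit @ theta1_hat >= delta * np.linalg.norm(theta1_hat)
# Regret relative to the best truthful policy (manipulation omitted in this sketch).
regret = np.maximum(X_commit @ theta1, 0.0) - np.where(assign, X_commit @ theta1, 0.0)
print(T0, np.linalg.norm(theta1 - theta1_hat), regret.sum())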
B.4 Proof of Theorem 3.5

Theorem B.13. Let $\mathrm{Reg}_{\mathrm{OLS}}(T)$ be the strategic regret of Algorithm 1 and $\mathrm{Reg}_{\mathrm{ETC}}(T)$ be the strategic regret of Algorithm 2. The expected strategic regret of Algorithm 3 is
$$\mathbb{E}[\mathrm{Reg}(T)] \leq 4 \min\big\{\mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(T)],\ \mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(T)]\big\}.$$

Proof. Case 1: $T < \tau^*$. From Theorem B.9, we know that
$$\mathrm{Reg}_{\mathrm{ETC}}(\tau_i) \leq \frac{8 \cdot 63^{1/3}}{c_0}\, d\, \sigma^{2/3}\, \tau_i^{2/3} \log^{1/3}(4d\tau_i^2)$$
with probability at least $1 - 1/\tau_i^2$. Therefore,
$$\mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(\tau_i)] \leq \frac{8 \cdot 63^{1/3}}{c_0}\, d\, \sigma^{2/3}\, \tau_i^{2/3} \log^{1/3}(4d\tau_i^2) + 2.$$
Observe that $\sum_{j=1}^{i-1} \mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(\tau_j)] \leq \mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(\tau_i)]$. Suppose $\tau_{i-1} \leq T \leq \tau_i$ for some $i$. Under such a scenario,
$$\mathbb{E}[\mathrm{Reg}(T)] \leq 2\,\mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(\tau_i)] \leq 2\,\mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(2T)] \leq 4\,\mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(T)].$$

Case 2: $T \geq \tau^*$. Let $t^*$ denote the actual switching time of Algorithm 3. Then
$$\mathrm{Reg}(T) = \sum_{t=1}^{t^*} \langle \theta^{(a_t^*)} - \theta^{(a_t)}, x_t \rangle + \sum_{t=t^*+1}^{T} \langle \theta^{(a_t^*)} - \theta^{(a_t)}, x_t \rangle,$$
and therefore
$$\mathbb{E}[\mathrm{Reg}(T)] \leq 2\,\mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(t^*)] + \mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(T - t^*)] \leq 2\,\mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(t^*)] + \mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(T)] \leq 2\,\mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(\tau^*)] + \mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(T)] \leq 3\,\mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(T)],$$
where the first inequality follows from Case 1, the second from the fact that $t^* \leq \tau^*$ (and so $\mathbb{E}[\mathrm{Reg}_{\mathrm{ETC}}(t^*)] \leq \mathbb{E}[\mathrm{Reg}_{\mathrm{OLS}}(t^*)]$), the third from the fact that $t^* \leq \tau^*$, and the fourth from the fact that $T \geq \tau^*$.

B.5 Inconsistency of OLS when using all data

Theorem B.14. $\lim_{t\to\infty} \hat\theta^{(1)}_{t+1} = \theta^{(1)}$ if and only if $\lim_{t\to\infty} \sum_{s=1}^t x_s' x_s^\top \mathbf{1}\{a_s = 1\} = \lim_{t\to\infty} \sum_{s=1}^t x_s' (x_s')^\top \mathbf{1}\{a_s = 1\}$, where $x_s'$ denotes the (possibly modified) context reported at round $s$.

Proof. We have
$$\lim_{t\to\infty} \hat\theta^{(1)}_{t+1} := \lim_{t\to\infty} \Big(\sum_{s=1}^t x_s' (x_s')^\top \mathbf{1}\{a_s = 1\}\Big)^{-1} \sum_{s=1}^t x_s'\, r^{(1)}_s \mathbf{1}\{a_s = 1\} = \lim_{t\to\infty} \Big(\sum_{s=1}^t x_s' (x_s')^\top \mathbf{1}\{a_s = 1\}\Big)^{-1} \sum_{s=1}^t x_s' \big(x_s^\top \theta^{(1)} + \epsilon_s\big) \mathbf{1}\{a_s = 1\}$$
$$= \lim_{t\to\infty} \Big(\sum_{s=1}^t x_s' (x_s')^\top \mathbf{1}\{a_s = 1\}\Big)^{-1} \Big(\sum_{s=1}^t x_s' x_s^\top \mathbf{1}\{a_s = 1\}\Big)\, \theta^{(1)} + \lim_{t\to\infty} \Big(\sum_{s=1}^t x_s' (x_s')^\top \mathbf{1}\{a_s = 1\}\Big)^{-1} \sum_{s=1}^t x_s' \epsilon_s \mathbf{1}\{a_s = 1\}$$
$$= \lim_{t\to\infty} \Big(\sum_{s=1}^t x_s' (x_s')^\top \mathbf{1}\{a_s = 1\}\Big)^{-1} \Big(\sum_{s=1}^t x_s' x_s^\top \mathbf{1}\{a_s = 1\}\Big)\, \theta^{(1)},$$
which equals $\theta^{(1)}$ if and only if the two matrix limits in the theorem statement coincide.

C Proofs for Section 4

C.1 Proof of Theorem 4.1

Theorem C.1. Algorithm 4 with $\eta = \sqrt{\frac{\log|E|}{T\lambda^2|E|}}$, $\gamma = 2\eta\lambda|E|$, and $\varepsilon = \big(\frac{d\sigma \log T}{T}\big)^{1/(d+2)}$ incurs expected strategic regret
$$\mathbb{E}[\mathrm{Reg}(T)] \leq 6\, T^{(d+1)/(d+2)} (d\sigma \log T)^{1/(d+2)} = \tilde{O}\big(T^{(d+1)/(d+2)}\big).$$

Proof. Let $a_{t,e}$ denote the action chosen by a grid point $e \in E$ at round $t$. We simplify notation and write $a_t = a_{t,e_t}$ for the action chosen by the sampled grid point $e_t$ at round $t$. For the purposes of the analysis, we also define the shifted loss $\ell_t(e) = 1 + \lambda - r_t(a_{t,e})$.

We first analyze the difference between the loss of the algorithm and that of the best fixed point $e^*$ on the grid, i.e.,
$$\mathbb{E}[\mathrm{Reg}_\varepsilon(T)] = \max_{e^* \in E} \mathbb{E}\Big[\sum_{t\in[T]} r_t(a_{t,e^*}) - \sum_{t\in[T]} r_t(a_t)\Big] = \mathbb{E}\Big[\sum_{t\in[T]} \ell_t(e_t) - \sum_{t\in[T]} \ell_t(e^*)\Big],$$
where the equivalence between working with $\ell_t(\cdot)$ instead of $r_t(\cdot)$ holds because $\ell_t(\cdot)$ is simply a shift of $r_t(\cdot)$ that is the same across all rounds and experts. For the regret of the algorithm, we show that $\mathbb{E}[\mathrm{Reg}_\varepsilon(T)] = O\big(\sqrt{T|E|\lambda^2\log|E|}\big)$.

We define the good event as the event that the reward is in $[0,1]$ at every round $t$: $\mathcal{C} = \{r_t \in [0,1],\ \forall t \in [T]\}$. Note that this depends on the noise $\epsilon_t$ of the round. We call the complement of the good event the bad event $\bar{\mathcal{C}}$. The regret of the algorithm depends on both $\mathcal{C}$ and $\bar{\mathcal{C}}$ as follows:
$$\mathbb{E}[\mathrm{Reg}_\varepsilon(T)] = \mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \mathcal{C}]\Pr[\mathcal{C}] + \mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \bar{\mathcal{C}}]\Pr[\bar{\mathcal{C}}] \leq \mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \mathcal{C}] + T\Pr[\bar{\mathcal{C}}], \qquad (5)$$
where the inequality is due to the fact that $\Pr[\mathcal{C}] \leq 1$ and that, in the worst case, the algorithm must pick up a loss of 1 at each round. We now upper bound the probability with which the bad event happens:
$$\Pr[\bar{\mathcal{C}}] = \Pr[\exists t : r_t \notin [0,1]] \leq \sum_{t\in[T]} \Pr[r_t \notin [0,1]] \quad \text{(union bound)} \quad \leq \sum_{t\in[T]} \Pr[|\epsilon_t| \geq \lambda] \leq 2T\exp(-\lambda^2/\sigma^2) \leq \frac{2}{T} \quad \text{(substituting } \lambda).$$
Plugging $\Pr[\bar{\mathcal{C}}]$ into Equation (5) we get
$$\mathbb{E}[\mathrm{Reg}_\varepsilon(T)] \leq \mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \mathcal{C}] + 2. \qquad (6)$$
For the remainder of the proof we condition on the clean event $\mathcal{C}$ and compute $\mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \mathcal{C}]$. Conditioning on $\mathcal{C}$ means that $\ell_t(a) = 1 + \lambda - r_t(a) \in [0, \lambda]$, where $\lambda = \sigma\sqrt{2\log T}$.

We first compute the first and second moments of the estimator $\hat\ell_t(\cdot)$. For the first moment,
$$\mathbb{E}\big[\hat\ell_t(e)\big] = \sum_{e' \in E} q_t(e')\, \frac{\ell_t(e)\mathbf{1}\{e' = e\}}{q_t(e)} = \ell_t(e). \qquad (7)$$
For the second moment,
$$\mathbb{E}\big[\hat\ell_t^2(e)\big] = \sum_{e' \in E} q_t(e')\, \frac{\ell_t^2(e)\mathbf{1}\{e' = e\}}{q_t^2(e)} = \frac{\ell_t^2(e)}{q_t(e)} \leq \frac{\lambda^2}{q_t(e)} \leq \frac{\lambda^2 |E|}{\gamma}, \qquad (8)$$
where for the first inequality we have used the fact that $\ell_t(e) \leq \lambda$ (since we conditioned on $\mathcal{C}$), and the last one is due to the fact that $q_t(e) \geq \gamma/|E|$.

We define the weight assigned to grid point $e \in E$ at round $t$ as $w_t(e) = w_{t-1}(e)\exp(-\eta\hat\ell_t(e))$, with $w_0(e) = 1$ for all $e \in E$, and let $W_t = \sum_{e \in E} w_t(e)$ be the potential function. (A short illustrative sketch of this sampling and update step is given below.)
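The estimator and weight update just defined are the usual exponential-weights machinery with $\gamma$-uniform exploration. The following Python sketch implements one round of the procedure (sampling from $q_t$, forming the importance-weighted loss estimate, and re-weighting), in the spirit of Equations (7)-(8) and the weight recursion above; the grid size, horizon, and stand-in losses are placeholder assumptions, not the paper's Algorithm 4.

import numpy as np

def exp_weights_round(weights, losses, eta, gamma, rng):
    # One round of exponential weights with gamma-uniform exploration and
    # an importance-weighted loss estimate for the sampled grid point.
    K = len(weights)
    p = weights / weights.sum()                  # p_t(e)
    q = (1 - gamma) * p + gamma / K              # q_t(e) >= gamma / K
    e_t = rng.choice(K, p=q)                     # sampled grid point
    loss_hat = np.zeros(K)
    loss_hat[e_t] = losses[e_t] / q[e_t]         # unbiased: E[loss_hat(e)] = losses(e)
    return weights * np.exp(-eta * loss_hat), e_t

rng = np.random.default_rng(3)
K, T_rounds, lam = 64, 20_000, 1.0               # K plays the role of |E|, lam of lambda
eta = np.sqrt(np.log(K) / (T_rounds * lam ** 2 * K))
gamma = 2 * eta * lam * K                        # the tuning from Theorem C.1
weights = np.ones(K)
base_loss = rng.uniform(0.2, lam, size=K)        # stand-in for ell_t(e) = 1 + lambda - r_t(a_{t,e})
for _ in range(T_rounds):
    losses = np.clip(base_loss + 0.05 * rng.normal(size=K), 0.0, lam)
    weights, _ = exp_weights_round(weights, losses, eta, gamma, rng)
print(weights.argmax())                          # grid point currently favored by the weights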
Returning to the analysis, we have
$$W_0 = \sum_{e\in E} w_0(e) = |E|, \qquad (9)$$
and, using $e^*$ to denote the best fixed policy in hindsight,
$$W_T \geq w_T(e^*) = \exp\Big(-\eta \sum_{t\in[T]} \hat\ell_t(e^*)\Big). \qquad (10)$$
We next analyze how much the potential changes per round:
$$\frac{W_t}{W_{t-1}} = \sum_{e\in E} \frac{w_{t-1}(e)}{W_{t-1}}\exp\big(-\eta\hat\ell_t(e)\big) = \sum_{e\in E} p_t(e)\exp\big(-\eta\hat\ell_t(e)\big) \leq \sum_{e\in E} p_t(e)\Big(1 - \eta\hat\ell_t(e) + \eta^2\hat\ell_t^2(e)\Big) \qquad (e^{-x} \leq 1 - x + x^2,\ x > 0)$$
$$= 1 - \eta\sum_{e\in E} p_t(e)\hat\ell_t(e) + \eta^2\sum_{e\in E} p_t(e)\hat\ell_t^2(e) \qquad \Big(\textstyle\sum_{e\in E} p_t(e) = 1\Big).$$
Taking logarithms and using the fact that $\log x \leq x - 1$ for $x > 0$,
$$\log\frac{W_t}{W_{t-1}} \leq -\eta\sum_{e\in E} p_t(e)\hat\ell_t(e) + \eta^2\sum_{e\in E} p_t(e)\hat\ell_t^2(e) = -\eta\sum_{e\in E}\frac{q_t(e) - \gamma/|E|}{1-\gamma}\hat\ell_t(e) + \eta^2\sum_{e\in E}\frac{q_t(e) - \gamma/|E|}{1-\gamma}\hat\ell_t^2(e)$$
$$\leq -\eta\sum_{e\in E}\frac{q_t(e) - \gamma/|E|}{1-\gamma}\hat\ell_t(e) + \eta^2\sum_{e\in E}\frac{q_t(e)}{1-\gamma}\hat\ell_t^2(e), \qquad (11)$$
where the equality uses $p_t(e) = \frac{q_t(e)-\gamma/|E|}{1-\gamma}$. In order for this application of $\log x \leq x - 1$ to be valid, we need to verify that
$$1 - \eta\sum_{e\in E} p_t(e)\hat\ell_t(e) + \eta^2\sum_{e\in E} p_t(e)\hat\ell_t^2(e) \geq 0, \quad \text{for which it suffices that} \quad 1 - \eta\sum_{e\in E} p_t(e)\hat\ell_t(e) \geq 0. \qquad (12)$$
We do so after we explain how to tune $\eta$ and $\gamma$.

We return to Equation (11); summing up over all rounds $t \in [T]$, we get
$$\log\frac{W_T}{W_0} \leq -\eta\sum_{t\in[T]}\sum_{e\in E}\frac{q_t(e)-\gamma/|E|}{1-\gamma}\hat\ell_t(e) + \eta^2\sum_{t\in[T]}\sum_{e\in E}\frac{q_t(e)}{1-\gamma}\hat\ell_t^2(e). \qquad (13)$$
Using Equation (9) and Equation (10), we have that $\log(W_T/W_0) \geq -\eta\sum_{t\in[T]}\hat\ell_t(e^*) - \log|E|$. Combining this with the upper bound on $\log(W_T/W_0)$ from Equation (13) and multiplying both sides by $(1-\gamma)/\eta$, we get
$$\sum_{t\in[T]}\sum_{e\in E}\Big(q_t(e)-\frac{\gamma}{|E|}\Big)\hat\ell_t(e) - (1-\gamma)\sum_{t\in[T]}\hat\ell_t(e^*) \leq \eta\sum_{t\in[T]}\sum_{e\in E} q_t(e)\hat\ell_t^2(e) + \frac{(1-\gamma)\log|E|}{\eta}.$$
We can slightly relax the right-hand side using the fact that $\gamma < 1$ and get
$$\sum_{t\in[T]}\sum_{e\in E}\Big(q_t(e)-\frac{\gamma}{|E|}\Big)\hat\ell_t(e) - (1-\gamma)\sum_{t\in[T]}\hat\ell_t(e^*) \leq \eta\sum_{t\in[T]}\sum_{e\in E} q_t(e)\hat\ell_t^2(e) + \frac{\log|E|}{\eta}.$$
Taking expectations (with respect to the randomness of the algorithm) on both sides and using our derivations for the first and second moments (Equations (7) and (8), respectively), we get
$$\sum_{t\in[T]}\sum_{e\in E}\Big(q_t(e)-\frac{\gamma}{|E|}\Big)\ell_t(e) - (1-\gamma)\sum_{t\in[T]}\ell_t(e^*) \leq \eta\sum_{t\in[T]}\sum_{e\in E} q_t(e)\frac{\lambda^2}{q_t(e)} + \frac{\log|E|}{\eta} = \eta T\lambda^2|E| + \frac{\log|E|}{\eta}.$$
Using the fact that $\ell_t(\cdot) \in [0, \lambda]$, the above becomes
$$\mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \mathcal{C}] = \sum_{t\in[T]}\sum_{e\in E} q_t(e)\ell_t(e) - \sum_{t\in[T]}\ell_t(e^*) \leq \eta T\lambda^2|E| + \frac{\log|E|}{\eta}.$$
Tuning $\eta = \sqrt{\frac{\log|E|}{T\lambda^2|E|}}$ and $\gamma = 2\eta\lambda|E|$, we get that
$$\mathbb{E}[\mathrm{Reg}_\varepsilon(T) \mid \mathcal{C}] \leq 3\sqrt{T|E|\lambda^2\log|E|} = O\big(\sqrt{T|E|\sigma^2\log(T)\log|E|}\big). \qquad (14)$$
Before we proceed to bounding the discretization error that we incur by playing policies only on the grid, we verify that Equation (12) holds for the chosen $\eta$ and $\gamma$ parameters. When $\hat\ell_t(e) = 0$, Equation (12) trivially holds, so we focus on the case $\hat\ell_t(e) = \ell_t(e)/q_t(e)$:
$$\eta\sum_{e\in E} p_t(e)\hat\ell_t(e) = \eta\sum_{e\in E} p_t(e)\frac{\ell_t(e)}{q_t(e)} \leq \frac{\eta|E|}{\gamma}\sum_{e\in E} p_t(e)\ell_t(e) \leq \frac{\eta\lambda|E|}{\gamma}\sum_{e\in E} p_t(e) = \frac{\eta\lambda|E|}{\gamma} = \frac{1}{2} \leq 1,$$
where the first inequality is due to the fact that $q_t(e) \geq \gamma/|E|$ for all $e \in E$, the second is because $\ell_t(e) \leq \lambda$, the first equality is because $\sum_{e\in E} p_t(e) = 1$, and the last equality is because of the values that we chose for the parameters $\eta$ and $\gamma$.

The final step in proving the theorem is to bound the strategic discretization error that we incur because our algorithm only chooses policies on the grid, while $\theta^{(1)}, \theta^{(0)}$ (and hence the actual optimal policy) may not correspond to any grid point. Let $a_t^*$ denote the action chosen by the optimal policy. Then
$$\sum_{t\in[T]}\mathbb{E}[r_t(a_t^*)] - \sum_{t\in[T]}\mathbb{E}[r_t(a_{t,e^*})] = \sum_{t\in[T]}\big\langle \theta^{(a_t^*)} - \theta^{(a_{t,e^*})}, x_t \big\rangle.$$
We separate the $T$ rounds into three groups: group $G_1$ contains the rounds $t \in [T]$ with $a_t^* = a_{t,e^*}$; group $G_2$ the rounds with $a_t^* = 0$ but $a_{t,e^*} = 1$; and group $G_3$ the rounds with $a_t^* = 1$ but $a_{t,e^*} = 0$. With these groups in mind, one can rewrite the above as
$$\sum_{t\in G_1}\big\langle \theta^{(a_t^*)} - \theta^{(a_{t,e^*})}, x_t \big\rangle + \sum_{t\in G_2}\big\langle \theta^{(a_t^*)} - \theta^{(a_{t,e^*})}, x_t \big\rangle + \sum_{t\in G_3}\big\langle \theta^{(a_t^*)} - \theta^{(a_{t,e^*})}, x_t \big\rangle.$$
For all the rounds in $G_1$, the strategic discretization error is equal to 0. Hence the strategic discretization error becomes
$$\underbrace{\sum_{t\in G_2}\big\langle \theta^{(a_t^*)} - \theta^{(a_{t,e^*})}, x_t \big\rangle}_{\mathrm{SDE}(G_2)} + \underbrace{\sum_{t\in G_3}\big\langle \theta^{(a_t^*)} - \theta^{(a_{t,e^*})}, x_t \big\rangle}_{\mathrm{SDE}(G_3)}. \qquad (15)$$
We first analyze $\mathrm{SDE}(G_2)$:
$$\mathrm{SDE}(G_2) = \sum_{t\in G_2}\big\langle \theta^{(0)} - \theta^{(1)}, x_t \big\rangle.$$
Let us denote by $\hat\theta^{(1)}$ and $\hat\theta^{(0)}$ the points such that $e^* = \hat\theta^{(1)} - \hat\theta^{(0)}$.
Adding and subtracting $\langle e^*, x_t\rangle$ in the above, we get
$$\mathrm{SDE}(G_2) = \sum_{t\in G_2}\Big(\big\langle \theta^{(0)} - \hat\theta^{(0)}, x_t \big\rangle + \big\langle \hat\theta^{(1)} - \theta^{(1)}, x_t \big\rangle + \big\langle \hat\theta^{(0)} - \hat\theta^{(1)}, x_t \big\rangle\Big)$$
$$\leq \sum_{t\in G_2}\Big(\big\|\theta^{(0)} - \hat\theta^{(0)}\big\|_2\|x_t\|_2 + \big\|\hat\theta^{(1)} - \theta^{(1)}\big\|_2\|x_t\|_2\Big) + \sum_{t\in G_2}\underbrace{\big\langle \hat\theta^{(0)} - \hat\theta^{(1)}, x_t \big\rangle}_{Q_t} \leq 2\varepsilon|G_2| + \sum_{t\in G_2} Q_t \qquad \text{(Cauchy-Schwarz)},$$
where we use that $e^*$ may be chosen so that $\|\theta^{(0)} - \hat\theta^{(0)}\|_2 \leq \varepsilon$ and $\|\theta^{(1)} - \hat\theta^{(1)}\|_2 \leq \varepsilon$. Finally, we show that $Q_t \leq 0$. For the rounds where $a_{t,e^*} = 1$ but $a_t^* = 0$, it can be the case that $x_t \neq x_t'$ (as the agents only strategize in order to get assigned action 1). But since $a_{t,e^*} = 1$, then from the algorithm
$$\big\langle \hat\theta^{(1)} - \hat\theta^{(0)}, x_t' \big\rangle \geq \delta\|e^*\|_2, \quad \text{i.e.,} \quad \big\langle \hat\theta^{(0)} - \hat\theta^{(1)}, x_t' \big\rangle \leq -\delta\|e^*\|_2. \qquad (16)$$
Adding and subtracting $x_t'$ in the quantity $Q_t$, we have
$$Q_t = \big\langle \hat\theta^{(0)} - \hat\theta^{(1)}, x_t - x_t' \big\rangle + \big\langle \hat\theta^{(0)} - \hat\theta^{(1)}, x_t' \big\rangle \leq \big\langle \hat\theta^{(0)} - \hat\theta^{(1)}, x_t - x_t' \big\rangle - \delta\|e^*\|_2 \qquad \text{(Equation (16))}$$
$$\leq \big\|\hat\theta^{(0)} - \hat\theta^{(1)}\big\|_2\,\|x_t - x_t'\|_2 - \delta\|e^*\|_2 \leq 0 \qquad \text{(Cauchy-Schwarz and } \|x_t - x_t'\|_2 \leq \delta).$$
As a result,
$$\mathrm{SDE}(G_2) \leq 2\varepsilon T. \qquad (17)$$
Moving on to the analysis of $\mathrm{SDE}(G_3)$:
$$\mathrm{SDE}(G_3) = \sum_{t\in G_3}\big\langle \theta^{(1)} - \theta^{(0)}, x_t \big\rangle.$$
Again, we use $\hat\theta^{(1)}$ and $\hat\theta^{(0)}$ to denote the points such that $e^* = \hat\theta^{(1)} - \hat\theta^{(0)}$. Adding and subtracting $\langle e^*, x_t\rangle$ and following the same derivations as for $\mathrm{SDE}(G_2)$, we have that
$$\mathrm{SDE}(G_3) \leq 2\varepsilon T + \sum_{t\in G_3}\big\langle \hat\theta^{(1)} - \hat\theta^{(0)}, x_t \big\rangle.$$
Since $a_{t,e^*} = 0$, it must have been the case that $x_t' = x_t$; this is because the agent would not spend effort to strategize if they would still be assigned action 0. For this reason (the agent could not reach the threshold even with a $\delta$-bounded modification), it must be that $\langle \hat\theta^{(1)} - \hat\theta^{(0)}, x_t \rangle \leq 0$. Combining the upper bounds for $\mathrm{SDE}(G_2)$ and $\mathrm{SDE}(G_3)$ in Equation (15), we have that $\mathrm{SDE}(T) \leq 4\varepsilon T$.

Putting everything together, the regret is comprised of the regret incurred on the discretized grid and the strategic discretization error, i.e.,
$$\mathbb{E}[\mathrm{Reg}(T)] \leq 3\sqrt{T|E|\lambda^2\log|E|} + 4\varepsilon T = O\Big(\sqrt{T\varepsilon^{-d}\, d\,\sigma^2\log(T)\log(1/\varepsilon)}\Big) + 4\varepsilon T,$$
where $|E| = (1/\varepsilon)^d$. Tuning $\varepsilon = \big(\frac{d\sigma\log T}{T}\big)^{1/(d+2)}$, we get that the regret is
$$\mathbb{E}[\mathrm{Reg}(T)] \leq 6\, T^{(d+1)/(d+2)}(d\sigma\log T)^{1/(d+2)} = \tilde{O}\big(T^{(d+1)/(d+2)}\big).$$

D Extension to trembling hand best-response

Observe that, under lazy tiebreaking (Definition 2.1), if agent $t$ modifies their context they modify it by an amount $\delta_{L,t}$ such that
$$\delta_{L,t} := \min_{0 \leq \eta \leq \delta} \eta \quad \text{s.t.} \quad \pi_t(x_t') = 1,\ \|x_t' - x_t\|_2 = \eta.$$
We define $\gamma$-trembling hand tiebreaking as $\delta_{TH,t} = \delta_{L,t} + \alpha_t$, where $\alpha_t \in [0, \min\{\delta - \delta_{L,t}, \gamma\}]$ may be chosen arbitrarily. Our results in Section 3 may be extended to trembling hand tiebreaking by considering the following redefinition of a clean point:

Condition D.1 (Sufficient condition for $x' = x$). Given a shifted linear policy parameterized by $\beta^{(1)} \in \mathbb{R}^d$, we say that a context $x$ is clean if $\langle \beta^{(1)}, x \rangle > (\delta + \gamma)\|\beta^{(1)}\|_2 + r_0$.

No further changes are required. This results in a slightly worse constant in Theorem 3.3 (i.e., all instances of $\delta$ are replaced by $\delta + \gamma$). Our algorithms and results in Section 4 do not change. Our definition of trembling hand best-response is similar in spirit to the $\epsilon$-best-response of Haghtalab et al. [26]. Specifically, Haghtalab et al. [26] study a Stackelberg game setting in which the follower best-responds $\epsilon$-optimally. In our trembling hand setting, the strategic agent can also be thought of as $\epsilon$-best-responding (in the language of [26]), although it is important to note that a trembling hand best response in our setting only ever leads the agent to strategize more than necessary, never less.
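For completeness, Condition D.1 amounts to a one-line change in how clean points are detected; a minimal Python sketch follows (the function name is ours, and the threshold form mirrors the condition above).

import numpy as np

def is_clean_trembling(x, beta1, delta, gamma, r0=0.0):
    # Condition D.1: under gamma-trembling-hand responses, a context x is
    # guaranteed unmodified if its score clears the threshold inflated by delta + gamma.
    return float(beta1 @ x) > (delta + gamma) * np.linalg.norm(beta1) + r0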