# Incentivized Learning in Principal-Agent Bandit Games

Antoine Scheid 1, Daniil Tiapkin 1 2, Etienne Boursier 3, Aymeric Capitaine 1, Eric Moulines 1, Michael I. Jordan 4 5, El-Mahdi El-Mhamdi 1, Alain Durmus 1

1 Centre de Mathématiques Appliquées, CNRS, École polytechnique, Institut Polytechnique de Paris, route de Saclay, 91128 Palaiseau cedex. 2 Université Paris-Saclay, CNRS, Laboratoire de mathématiques d'Orsay, 91405 Orsay, France. 3 INRIA, Université Paris-Saclay, LMO, Orsay, France. 4 University of California, Berkeley. 5 Inria, École Normale Supérieure, PSL Research University. Correspondence to: Antoine Scheid. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent. The principal and the agent have misaligned objectives and the choice of action is only left to the agent. However, the principal can influence the agent's decisions by offering incentives which add up to his rewards. The principal aims to iteratively learn an incentive policy to maximize her own total utility. This framework extends usual bandit problems and is motivated by several practical applications, such as healthcare or ecological taxation, where traditionally used mechanism design theories often overlook the learning aspect of the problem. We present nearly optimal (with respect to a horizon T) learning algorithms for the principal's regret in both multi-armed and linear contextual settings. Finally, we support our theoretical guarantees through numerical experiments.

1 Introduction

Decision-making under uncertainty is a ubiquitous feature of real-world applications of machine learning, arising in domains as diverse as recommendation systems (Li et al., 2010), healthcare (Yu et al., 2021), and agriculture (Evans et al., 2017). Multi-armed bandits provide a classical point of departure for decision-making under uncertainty in these settings (Thompson, 1933; Woodroofe, 1979; Lattimore & Szepesvári, 2020; Slivkins et al., 2019). The basic bandit solution involves an agent who learns which decisions yield high rewards via repeated experimentation. Real-world decision-making problems, however, often present challenges that are not addressed in this simple optimization framework. These include the challenge of scarcity when there are multiple decision-makers, issues of misaligned objectives, and problems arising from information asymmetries and signaling. The economics literature addresses these issues through the design of game-theoretic mechanisms, including auctions and contracts (see, e.g., Myerson, 1989; Laffont & Martimort, 2009), aiming to achieve favorable outcomes despite agents' self-interest and limited information. Unfortunately, the economics literature tends to neglect the learning aspect of the problem, often assuming that preferences, or distributions on preferences, are known a priori. Our work focuses on the blend of mechanism design and learning. We study a principal-agent model with information asymmetry and we develop a learning framework in which the principal aims to uncover the true preferences of the agent while optimizing her own gains. Building on the work of Dogan et al.
(2023a;b), we consider a repeated game between a principal and an agent, where, at each round, the principal proposes an incentive transfer associated with any action. The agent greedily chooses the action that maximizes the sum of his expected reward and the incentive. The goal of the principal is to learn an incentive policy which maximizes her own utility over time, taking into account both the rewards that she reaps and the costly incentives that she offers. Our contributions are as follows. First, we present the Incentivized Principal-Agent Algorithm (IPA) framework, which comprises two steps: IPA first estimates the minimal level of incentive needed to make the agent select any desired action; then, forming an upper estimate of these incentives, IPA uses a regret-minimization algorithm in a black-box fashion. The overall algorithm achieves both nearly optimal distribution-free and instance-dependent regret bounds. Second, we extend IPA to the linear contextual bandit setting (see, e.g., Abe & Long, 1999; Auer, 2002; Dani et al., 2008), significantly broadening its applicability. Here, Contextual IPA achieves a $\mathcal{O}(d\sqrt{T}\log(T))$ regret bound. We emphasize that Contextual IPA is the first known algorithm for incentivized learning in a contextual setting. Moreover, it matches, up to logarithmic factors, the minimax lower bound $\Omega(d\sqrt{T})$ for the easier problem of stochastic linear bandits (Rusmevichientong & Tsitsiklis, 2010).

2 Related Work

While classical work on bandit problems and reinforcement learning has predominantly focused on single-agent scenarios, many emerging applications require considering multiple agents. Recent literature has accordingly begun to study frameworks for learning in multi-agent multi-armed bandit settings (see, e.g., Boursier & Perchet, 2022). Mansour et al. (2020) discuss how a social planner can simultaneously learn and influence self-interested agents' decisions through Bayesian-Incentive Compatible (BIC) recommendations. The rationale behind this notion is that a BIC recommendation guarantees to each agent a maximal reward given the past, at any step. The social planner's objective is to design BIC recommendations that maximize the global welfare. Mansour et al. (2020) propose an algorithm for solving this problem in both multi-armed and contextual bandit settings. Notably, their work turns any black-box bandit algorithm into a BIC algorithm. For this problem, Sellke & Slivkins (2021) show that Thompson sampling can be made BIC, given a sufficient number of initial observations. Hu et al. (2022) extend this work to the combinatorial bandit problem. Another line of work due to Banihashem et al. (2023) and Simchowitz & Slivkins (2023) studies how a principal can provide recommendations to agents so that they explore all reachable states in a Markov Decision Process (MDP). To this end, the principal supplies the agents with a modified history, with the modifications carefully chosen to retain the agents' trust. This line of work is closely related to the online Bayesian persuasion literature (see, e.g., Castiglioni et al., 2020; 2021), which dates to the seminal work of Kamenica & Gentzkow (2011). Online Bayesian persuasion consists of the principal sequentially influencing agents' decisions with signals in her own interest and has been extended to the online Information Design setting (Doval & Ely, 2020; Bernasconi et al., 2022).
In these works, the information asymmetry favors the principal, so that the principal can influence the agent's decision at little or no cost. In our problem, the agent instead has perfect knowledge of the problem parameters and his action can only be influenced through utility transfers. Ben-Porat et al. (2024) study a principal and an agent sharing a common Markov Decision Process (MDP) with different reward functions. Similarly to our setting, at each step the action is chosen by an agent with full knowledge of the game. The objective of the principal is to minimize her cumulative regret under a constraint on the incentive budget. Despite extending our setup to MDPs, Ben-Porat et al. (2024) do not consider uncertainty on the principal's side, turning the game into an optimization problem. Other leader-follower dynamics with reinforcement learning have for instance been studied by Chen et al. (2023) and Zhong et al. (2021), to find a Stackelberg equilibrium or an optimal policy for the leader. Related issues arise in the study of dynamic pricing (Den Boer, 2015; Javanmard & Nazerzadeh, 2019; Mao et al., 2018; Golrezaei et al., 2023). Our work diverges from dynamic pricing in that, in our case, the principal not only faces uncertainty with respect to the agent's utility, but also with respect to her own utility. Some works study similar principal-agent games, but with a specific focus on the achievable optimality of the contract (Cohen et al., 2022) or on a specific stochastic model for the agent's behavior (Conitzer & Garera, 2006). The work of Zhu et al. (2022) interestingly studies online contract design, but with a specific focus on the Stackelberg regret of the task, which provides results concerning the sample complexity. Finally, our study is inspired by the work of Dogan et al. (2023b) to explore the principal's learning mechanism within a principal-agent setting. They propose an ε-Greedy algorithm with suboptimal regret guarantees; in particular, it suffers an exponential dependence on the number of actions. Dogan et al. (2023a) extend the work of Dogan et al. (2023b), taking into account the presence of uncertainty on the agent's side, but with the same limitation. We consider the same setup and problem as introduced in (Dogan et al., 2023b;a). However, this is the only common ground between our work and theirs. First, their algorithm, as well as their regret bound, depends on a hyperparameter m that controls the level of exploration, as mentioned in Appendix B. If m is too small, there is not enough exploration and the regret scales poorly with respect to T, whereas if m is too large, the convergence takes longer to occur. The parameter m needs to be carefully chosen and depends on the specific bandit instance considered, which makes things more complicated. Furthermore, their algorithm is harder to analyze and to implement in practical settings: many difficulties come from the simultaneous exploration/exploitation of both the principal's rewards and the agent's. In contrast, the algorithmic solution that we provide for the multi-armed case is completely orthogonal, relying on a two-phase approach, and avoids fine-tuning of any kind. We provide both distribution-free and instance-dependent regret bounds that nearly match the known lower bounds. Also note that we extend our approach to the non-trivial contextual case.

3 Multi-Armed Principal-Agent Learning

Setup. We consider a repeated principal-agent game.
A contextual version of the game is introduced in Section 4. The action set for the agent (or set of arms) is fixed to be A := [K] = {1, . . . , K}, K ∈ N*. We assume that the agent's rewards, s = (s_1, . . . , s_K) ∈ R_+^K, are deterministic and that they are known to the agent and unknown to the principal. For each action a ∈ [K], the rewards of the principal are given by a random i.i.d. sequence (X_a(t))_{t∈[T]}, where X_a(t) ∼ ν_a and ν_a is the arm distribution. The distributions {ν_a}_{a=1}^K are unknown to the principal and are learned through the following principal-agent interaction. At each round t ∈ [T], where T is the game horizon, the principal proposes an incentive π(t) ∈ R_+ associated with an action a_t ∈ [K]. The agent then greedily chooses the action A_t maximizing his utility:

$A_t \in \operatorname{argmax}_{a \in [K]} \{ s_a + \mathbb{1}_{a_t}(a)\,\pi(t) \}$ ,  (1)

breaking ties arbitrarily.¹ The principal then observes the arm A_t selected by the agent, as well as her reward X_{A_t}(t). The utility of the principal on the round is X_{A_t}(t) − 1_{a_t}(A_t) π(t). For any a ∈ [K], the principal's mean reward is θ_a := E[X_a(t)]. See Table 1 in Appendix A for a summary of the main definitions used in this section.

¹Note that the related works (Ben-Porat et al., 2024; Simchowitz & Slivkins, 2023) assume a tie-breaking in favor of the principal, an assumption that we do not need here.

The sequence of incentives (a_t, π(t))_{t∈[T]} defines a sequence of actions (A_t)_{t∈[T]} chosen by the agent. The goal of the principal is to maximize her total utility. On a single round, she thus aims at proposing an optimal incentive π^opt on an arm a^opt ∈ [K], which solves

maximize $\int_{\mathbb{R}} x\,\nu_a(\mathrm{d}x) - \pi$ over $\pi \in \mathbb{R}_+,\ a \in [K]$,
such that $a \in \operatorname{argmax}_{a' \in [K]} \{ s_{a'} + \mathbb{1}_a(a')\,\pi \}$ .  (2)

This is consistent with the conventional framework for utility in bandit problems, from which we subtract the cost of the incentives paid by the principal. Here, the principal's influence is exerted solely through the strategic use of incentives, carefully designed to guide the agent's behavior. We define $\mu^\star := \int_{\mathbb{R}} x\,\nu_{a^{\mathrm{opt}}}(\mathrm{d}x) - \pi^{\mathrm{opt}}$. Maximizing the total utility of the principal over T rounds is equivalent to minimizing the expected regret, defined as

$R(T) := T\mu^\star - \sum_{t=1}^{T} \mathbb{E}\big[X_{A_t}(t) - \mathbb{1}_{a_t}(A_t)\,\pi(t)\big]$ .  (3)

Remark. In the prior work of Dogan et al. (2023b), incentives were defined as a vector of size K, where the incentive associated with an action a ∈ [K] was denoted π_a. In our setting, since the goal of the principal is to make sure that the agent picks one prescribed action, it is enough to consider a restricted family of the form π_a = 1_{a_t}(a) π(t), where (a_t, π(t)) are incentives in the sense defined above. We make the following assumption.

H1. For any a ∈ [K], s_a ∈ [0, 1].

Neither the distributions ν_a nor the preferences of the agent are known to the principal. Another difficulty arises from choosing the magnitude of the incentive π(t): if it is too small, the agent might not choose the arm a_t proposed by the principal, whereas an overly large amount leads the principal to overpay, decreasing her utility. This trade-off also arises in dynamic pricing, where sellers must strike a balance between attractive pricing and profitability. For a discussion of the results in that literature, see the comprehensive overview by Den Boer (2015). In addition, there are links between dynamic pricing and bandit problems (see, e.g., Javanmard & Nazerzadeh, 2019; Cai et al., 2023).
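To make the interaction protocol of this section concrete, here is a minimal Python simulation of one round, under the assumptions above (deterministic agent rewards s, Gaussian principal rewards with means θ). The function names, the toy values, and the Gaussian choice for ν_a are ours and purely illustrative; this is a sketch of the protocol, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def agent_best_response(s, incentivized_arm, incentive):
    """Greedy agent of Eq. (1): maximize s_a + 1_{a_t}(a) * pi(t); ties broken arbitrarily."""
    utilities = s.copy()
    utilities[incentivized_arm] += incentive
    return int(np.argmax(utilities))

def play_round(theta, s, incentivized_arm, incentive):
    """One round: the agent picks A_t, the principal observes X_{A_t}(t)
    and collects X_{A_t}(t) - 1_{a_t}(A_t) * pi(t)."""
    chosen = agent_best_response(s, incentivized_arm, incentive)
    reward = rng.normal(theta[chosen], 1.0)          # X_{A_t}(t) drawn from nu_{A_t}
    paid = incentive if chosen == incentivized_arm else 0.0
    return chosen, reward, reward - paid

# Toy instance (values are illustrative, not those of the paper's experiments).
theta = np.array([0.3, 0.8, 0.5])   # principal's mean rewards, unknown to her a priori
s = np.array([0.9, 0.2, 0.4])       # agent's deterministic rewards, unknown to the principal

chosen, reward, principal_utility = play_round(theta, s, incentivized_arm=1, incentive=0.75)
print(chosen, round(principal_utility, 3))
```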
Optimal incentives. Before introducing IPA, we highlight a pivotal observation. For any given round t ≥ 1, action a ∈ [K] and ε > 0, the principal can entice the agent to choose a by offering the incentive π^{⋆,ε}_a ∈ R_+ defined as

$\pi^{\star,\varepsilon}_a = \max_{a' \in [K]} s_{a'} - s_a + \varepsilon$ .  (4)

With this incentive, it holds for any a' ∈ [K], a' ≠ a, that s_{a'} < s_a + π^{⋆,ε}_a, which ensures that the agent chooses A_t = a, given that action a then yields a strictly superior utility. Consequently, π^⋆_a := lim_{ε→0} π^{⋆,ε}_a = max_{a'∈[K]} s_{a'} − s_a represents the infimal incentive necessary to make arm a the agent's selection. If s were known to the principal, then using π^{⋆,ε}_a for any ε > 0 across all arms a ∈ [K] would provide an expected reward of θ_a − π^⋆_a per arm, and the best arm could be found using a standard bandit algorithm. Lemma 1 allows us to define the regret in a more convenient way.

Lemma 1. For any T ∈ N*, the regret of any algorithm on our problem instance can be written as

$R(T) = T \max_{a \in [K]}\{\theta_a + s_a - \max_{a' \in [K]} s_{a'}\} - \sum_{t=1}^{T} \mathbb{E}\big[\theta_{A_t} - \mathbb{1}_{a_t}(A_t)\,\pi(t)\big]$ .

Warm up: fixed horizon solution and regret analysis. IPA separates the problem of learning the optimal incentives π^⋆_a for each action a, a problem that can be solved efficiently via binary search (see Algorithm 3), from the estimation of the principal's expected rewards (θ_a)_{a∈[K]}, which is achieved using a standard multi-armed bandit algorithm. With a known horizon T, the algorithm unfolds in two stages. First, for each action a ∈ [K], the principal devotes N_T := ⌈log₂ T⌉ rounds of binary search per arm to estimate π^⋆_a, maintaining a lower bound π̲_a(N_T) and an upper bound π̄_a(N_T) with π̲_a(N_T) ≤ π^⋆_a ≤ π̄_a(N_T). Denoting π̄_a := π̄_a(⌈log₂ T⌉) for simplicity, we compute the estimate

$\hat\pi_a := \bar\pi_a(\lceil \log_2 T \rceil) + 1/T$ ,  (5)

where 1/T is added to avoid any tie-breaking situation. We show formally in Lemma 8 that

$\hat\pi_a - 2/T \le \pi^\star_a < \hat\pi_a$ .  (6)

In the second phase, IPA then employs an arbitrary multi-armed bandit subroutine Alg in a black-box manner to learn θ.

Bandit instance. For any distributions (ν̃_a)_{a∈[K]} and any sequence of i.i.d. random variables (Y_a(t))_{t∈N*}, a ∈ [K], with Y_a(t) ∼ ν̃_a, we define the history G_t := (A_s, U_s, Y_{A_s}(s))_{s≤t}, where (U_s)_{s∈N*} is a family of independent uniform random variables on [0, 1] allowing for randomization in the subroutine. Let Alg be a bandit algorithm, i.e., Alg: (U_t, G_{t−1}) ↦ a^Rec_t. We define the expected regret of Alg as

$R_{\mathrm{Alg}}(T, \tilde\nu) := T \max_{a \in [K]} \mathbb{E}_{Y \sim \tilde\nu}[Y_a(1)] - \mathbb{E}\Big[\sum_{t=1}^{T} Y_{\mathrm{Alg}(U_t, G_{t-1})}(t)\Big]$ .

After the binary search phase, for t > K⌈log₂ T⌉, the principal plays Alg on her bandit instance driven by her own mean rewards (θ_a)_{a∈[K]} and the approximated incentives (π̂_a)_{a∈[K]}. Alg is fed with a shifted history, defined for any t > K⌈log₂ T⌉ as

$H_t := (a^{\mathrm{Rec}}_s, U_s, X_{A_s}(s) - \hat\pi_{a^{\mathrm{Rec}}_s})_{s \in [K\lceil \log_2 T \rceil + 1, t]}$ ,  (7)

with a^Rec_t the action recommended by Alg at time t and A_t the action pulled by the agent. At time t, IPA offers the incentive π̂_{a^Rec_t} to the agent if he chooses action a^Rec_t. Equation (6) ensures that this incentive makes a^Rec_t strictly preferable to any other action for the agent, so that a^Rec_t is eventually played. As can be seen in (7), the shift of each arm's mean by π̂_a is taken into account while Alg is learning. We also define the shifted distribution ρ^T_a, for any a ∈ [K], as the distribution of X_a(1) − π̂_a.
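The per-arm binary search phase just described can be sketched in a few lines of Python; the function names and the agent-feedback callback are ours, and this is only a stand-in for Algorithm 3, not its verbatim implementation. Under H1, π^⋆_a ∈ [0, 1], and each probe with incentive π^mid on arm a reveals whether π^mid ≥ π^⋆_a (the agent switches to a) or not, so that ⌈log₂ T⌉ probes shrink the bracket to width at most 1/T, as used in (5) and (6).

```python
import math

def binary_search_incentive(a, n_steps, agent_picks):
    """Bracket pi*_a with n_steps halvings of [lower, upper].
    `agent_picks(arm, incentive)` returns the arm chosen by the agent when the
    principal offers `incentive` on `arm` and nothing elsewhere."""
    lower, upper = 0.0, 1.0                    # valid under H1: pi*_a lies in [0, 1]
    for _ in range(n_steps):
        mid = (lower + upper) / 2.0
        if agent_picks(a, mid) == a:           # mid >= pi*_a: tighten the upper bound
            upper = mid
        else:                                  # mid < pi*_a: tighten the lower bound
            lower = mid
    return lower, upper

def estimate_incentives(K, T, agent_picks):
    """Upper estimates hat_pi_a = upper_a + 1/T, so that pi*_a < hat_pi_a <= pi*_a + 2/T."""
    n_steps = math.ceil(math.log2(T))
    return [binary_search_incentive(a, n_steps, agent_picks)[1] + 1.0 / T
            for a in range(K)]

# Example with a toy greedy agent (its rewards s are hidden from the estimation code).
s = [0.9, 0.2, 0.4]
def agent_picks(arm, incentive):
    utils = list(s)
    utils[arm] += incentive
    return max(range(len(utils)), key=utils.__getitem__)

print([round(p, 3) for p in estimate_incentives(K=3, T=1024, agent_picks=agent_picks)])
```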
Theorem 1. IPA run with any multi-armed bandit subroutine Alg has an overall regret R(T) such that

$R(T) \le 2 + \big(1 + \max_{a \in [K]}\{\theta_a\} - \min_{a \in [K]}\{\theta_a\}\big)\big(1 + K\lceil \log_2 T \rceil\big) + R_{\mathrm{Alg}}\big(T - K\lceil \log_2 T \rceil, \rho^T\big)$ ,

where R_Alg stands for the regret induced by Alg on the shifted vanilla multi-armed bandit problem ρ^T. The proof is postponed to Appendix C.

Algorithm 1 IPA
1: Input: set of actions A := [K], time horizon T, subroutine Alg
2: Set H_s := ∅ for any s ≤ K⌈log₂ T⌉
3: for a ∈ [K] do
4:   # See Algorithm 3
5:   π̲_a, π̄_a = Binary Search(a, ⌈log₂ T⌉, 0, 1)
6: end for
7: For any action a ∈ [K], set π̂_a = π̄_a + 1/T
8: for t = K⌈log₂ T⌉ + 1, . . . , T do
9:   Sample U_t ∼ U(0, 1) and get the action recommended by Alg, a^Rec_t = Alg(U_t, H_{t−1})
10:  Offer the incentive π̂_{a^Rec_t} on action a^Rec_t and nothing for any other action a ∈ [K], a ≠ a^Rec_t
11:  Observe A_t, X_{A_t}(t) and compute the history H_t = (a^Rec_s, U_s, X_{A_s}(s) − π̂_{a^Rec_s})_{s∈[K⌈log₂ T⌉+1, t]}
12: end for

Corollary 1. Assume that the principal's reward distribution ν_a is 1-subgaussian for any action a ∈ [K]. Then, IPA run with the bandit subroutine Alg = UCB has a regret bounded for any T ∈ N* as follows:

$R(T) \le 3 + 3\sum_{a \in [K],\, \Delta_a > 0} \Delta_a + \big(1 + \max_{a \in [K]}\{\theta_a\} - \min_{a \in [K]}\{\theta_a\}\big)\big(1 + K\lceil \log_2 T \rceil\big) + 8 \min\Big\{\sqrt{TK \log T}\ ;\ \sum_{a \in [K],\, \Delta_a > 0} \frac{2 \log T}{\Delta_a}\Big\}$ ,

where $\Delta_a := \max_{a' \in [K]}\{\theta_{a'} + s_{a'}\} - (\theta_a + s_a)$ are the reward gaps.

Note that any black-box algorithm, not necessarily UCB, can be employed, yielding other concrete bounds in the corollary. Up to the additive cost of the binary search phase, we recover the usual multi-armed UCB bounds (both distribution-free and instance-dependent): this is why IPA achieves the bound provided in Corollary 1. For completeness, the UCB subroutine is given in Appendix E.
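For readers who want a concrete Alg to plug into Algorithm 1, here is a minimal UCB-style subroutine fed with the shifted observations X_{A_t}(t) − π̂_{a^Rec_t}. It is a standard UCB index; the exact constants of the version given in Appendix E may differ, and all names and toy values below are ours.

```python
import math
import numpy as np

class UCBSubroutine:
    """Standard UCB index policy, fed with the shifted rewards X_{A_t}(t) - hat_pi_{a_t}."""
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms, dtype=int)
        self.means = np.zeros(n_arms)
        self.t = 0

    def recommend(self):
        self.t += 1
        if self.t <= len(self.counts):           # pull each arm once first
            return self.t - 1
        bonus = np.sqrt(2.0 * math.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))

    def update(self, arm, shifted_reward):
        self.counts[arm] += 1
        self.means[arm] += (shifted_reward - self.means[arm]) / self.counts[arm]

# Sketch of the second phase of IPA: offer hat_pi on the recommended arm, then feed
# the subroutine with the observed reward minus the incentive paid (toy values below).
K, T = 3, 1000
theta, hat_pi = np.array([0.3, 0.8, 0.5]), np.array([0.01, 0.71, 0.51])
alg, rng = UCBSubroutine(K), np.random.default_rng(1)
for t in range(T):
    a_rec = alg.recommend()
    x = rng.normal(theta[a_rec], 1.0)            # the agent picks a_rec thanks to Eq. (6)
    alg.update(a_rec, x - hat_pi[a_rec])
```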
4 Contextual Principal-Agent Learning

In this section, we study the same interaction between a principal and an agent, but in a contextual setting (see, e.g., Abe & Long, 1999; Auer, 2002; Dani et al., 2008). We use the simplified model of stochastic linear bandits for both the agent and the principal. Consider a set of possible actions in B(0, 1), where B(0, 1) stands for the closed unit ball in R^d, and a family of zero-mean distributions indexed by B(0, 1), (ν̃_a)_{a∈B(0,1)}, such that for any a ∈ B(0, 1) and t ∈ [T], η_a(t) ∼ ν̃_a. The principal's reward is given by the sequence {(X_a(t))_{a∈B(0,1)} : t ∈ [T]} of independent random variables such that, for any t ∈ [T] and a ∈ B(0, 1),

$X_a(t) := \langle \theta^\star, a \rangle + \eta_a(t)$ ,

where θ^⋆ is unknown to the principal. The agent's reward function is a ↦ ⟨s^⋆, a⟩, where s^⋆ ∈ R^d is known to the agent and unknown to the principal. With this notation, the agent and the principal observe an action set A_t ⊆ B(0, 1) at each round t ≥ 1. Note that this set is no longer stationary. The precise timeline is as follows. At each round, the principal proposes an incentive function κ(t, ·): A_t → R_+, associating with any action a ∈ A_t ⊆ B(0, 1) a transfer of incentives κ(t, a) from the principal to the agent. The principal chooses κ(t, ·) as a function with finite support, which makes it upper semi-continuous. The agent then greedily chooses the action A_t as follows:

$A_t \in \operatorname{argmax}_{a \in \mathcal{A}_t} \{ \langle s^\star, a \rangle + \kappa(t, a) \}$ ,  (8)

which is well-defined since κ(t, ·) is upper semi-continuous and A_t satisfies the following assumption.

H2. For any t ≥ 1, A_t is closed, therefore compact. Moreover, s^⋆ ∈ B(0, 1) and θ^⋆ ∈ B(0, 1).

The principal then observes the arm A_t selected by the agent, as well as her incurred reward X_{A_t}(t). The utility of the principal on the round is X_{A_t}(t) − κ(t, A_t). This defines, for a sequence of principal's incentive functions {κ(t, ·): t ∈ [T]}, the sequence of actions {A_t : t ∈ [T]} chosen by the agent. The goal of the principal is to maximize her total utility. On a single round t, she thus aims at proposing an optimal incentive function κ(t, ·) which solves

maximize $\langle \theta^\star, a \rangle - \kappa(t, a)$ over $\kappa(t, \cdot)\colon \mathcal{A}_t \to \mathbb{R}_+$,
such that $a \in \operatorname{argmax}_{a' \in \mathcal{A}_t} \{ \langle s^\star, a' \rangle + \kappa(t, a') \}$ .  (9)

In addition, we define the optimal average reward at time t as

$\mu^\star_t := \sup_{\kappa(t, \cdot)\colon \mathcal{A}_t \to \mathbb{R}_+} \{ \langle \theta^\star, a \rangle - \kappa(t, a) \}$ such that $a \in \operatorname{argmax}_{a' \in \mathcal{A}_t} \{ \langle s^\star, a' \rangle + \kappa(t, a') \}$ .  (10)

Maximizing the total utility of the principal over T rounds is equivalent to minimizing the expected regret, defined as

$R(T) := \sum_{t=1}^{T} \mu^\star_t - \mathbb{E}\Big[\sum_{t=1}^{T} \big(X_{A_t}(t) - \kappa(t, A_t)\big)\Big]$ .  (11)

Similarly to Lemma 1, the following result provides an alternative definition of the regret.

Lemma 2. For any T ∈ N*, the regret of any algorithm on our contextual problem instance can be written as

$R(T) = \sum_{t=1}^{T} \max_{a \in \mathcal{A}_t}\big\{ \langle \theta^\star + s^\star, a \rangle - \max_{a' \in \mathcal{A}_t} \langle s^\star, a' \rangle \big\} - \sum_{t=1}^{T} \mathbb{E}\big[ \langle \theta^\star, A_t \rangle - \kappa(t, A_t) \big]$ .

The proof is deferred to Appendix D.

Design of the optimal incentives. At any round t ≥ 1, for the agent to necessarily choose action a ∈ A_t, the principal can provide the agent with the incentive function κ^{⋆,ε}_a(t, a') := 1_a(a') π^{⋆,ε}(t, a), where, for any ε > 0,

$\pi^{\star,\varepsilon}(t, a) := \max_{a'_t \in \mathcal{A}_t} \{ \langle s^\star, a'_t - a \rangle \} + \varepsilon$ .

Lemma 10 in Appendix D guarantees that this choice of κ^{⋆,ε}_a gives A_t = a, where A_t is defined in (8). Define

$a^{\mathrm{ag}}_t := \operatorname{argmax}_{a \in \mathcal{A}_t} \langle s^\star, a \rangle$ ,  $\pi^\star(t, a) := \langle s^\star, a^{\mathrm{ag}}_t \rangle - \langle s^\star, a \rangle$  (12)

and κ^⋆_a(t, a') := 1_a(a') π^⋆(t, a). As in the non-contextual case, taking ε → 0 makes the incentive function κ^⋆_a the infimal function that makes the choice of a strictly preferable to any other arm a' ∈ A_t at time t. Similarly to the multi-armed setting, we decompose the problem into two distinct components. First, we aim to estimate the agent's reward function a ↦ ⟨s^⋆, a⟩ based on the observation of the agent's selected actions under an appropriate choice of incentives. As discussed below, this can be achieved with a binary-search-like procedure. Second, once this function is accurately estimated, the principal can use a contextual bandit algorithm Ctx Alg in a black-box manner to minimize her own regret, relying on the estimated incentive function to determine the agent's behavior.

Estimation of the agent's reward. The approach that we propose is based on a sequence of confidence sets {S_t}_{t∈[T]} that satisfy s^⋆ ∈ S_t for any t ∈ [T]. We construct the sequence (S_t)_{t∈[T]} recursively such that their diameters decrease along the iterations. This is motivated by Lemma 3, which allows us to control the estimation error of π^⋆ and relates it to the diameter of these sets. The proof is postponed to Appendix D.

Lemma 3. For any t ∈ [T] and closed subset S ⊆ B(0, 1) with s^⋆ ∈ S, it holds, for any a ∈ A_t, that

$\big| \max_{s \in S,\, a' \in \mathcal{A}_t} \langle s, a' - a \rangle - \pi^\star(t, a) \big| \le 2\,\mathrm{diam}(S, \mathcal{A}_t)$ ,

where $\mathrm{diam}(S, \mathcal{A}_t) := \max_{a \in \mathcal{A}_t} \max_{s_1, s_2 \in S} |\langle s_1 - s_2, a \rangle|$.

In the light of Lemma 3, we thus aim to build confidence sets S_t with decreasing diameters such that s^⋆ ∈ S_t for any t. To this end, the principal can offer an incentive function κ(t, ·) concentrated on a single point a_t ∈ A_t, as in the multi-armed case: κ(t, a') = π(t) 1_{a_t}(a') for π(t) ∈ R_+. In this case, the principal receives the agent's choice as feedback; by (8), either A_t = a_t or A_t = argmax_{a∈A_t} ⟨s^⋆, a⟩ = a^ag_t. In addition, A_t = a_t is equivalent to ⟨s^⋆, a^ag_t − a_t⟩ ≤ π(t). The information A_t = a_t or A_t = a^ag_t can therefore be used as binary search feedback in the direction a^ag_t − a_t, as follows. Given a current confidence set S_t at time t, it can be updated either as S_{t+1} = S_t ∩ {s : ⟨s, a^ag_t − a_t⟩ ≤ π(t)} if A_t = a_t, or S_{t+1} = S_t ∩ {s : ⟨s, a^ag_t − a_t⟩ ≥ π(t)} otherwise.
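The update of S_t above is simply an intersection with a halfspace, and the projected diameter diam(S_t, x) can be evaluated by two linear programs once S_t is represented by linear constraints. The sketch below keeps S_t as a polytope inside the box [−1, 1]^d ⊇ B(0, 1) for simplicity (a relaxation of the unit-ball constraint that is ours); it illustrates the update rule only and is not the Projected Volume subroutine of Lobel et al. (2018).

```python
import numpy as np
from scipy.optimize import linprog

class ConfidenceSet:
    """Polytope {s in [-1, 1]^d : A s <= b} containing the unknown agent vector s*."""
    def __init__(self, d):
        self.d = d
        self.A = np.zeros((0, d))
        self.b = np.zeros(0)

    def cut(self, w, pi, agent_took_incentivized_arm):
        # Feedback A_t = a_t  =>  <s*, w> <= pi  (with w = a^ag_t - a_t); otherwise <s*, w> >= pi.
        if agent_took_incentivized_arm:
            self.A, self.b = np.vstack([self.A, w]), np.append(self.b, pi)
        else:
            self.A, self.b = np.vstack([self.A, -w]), np.append(self.b, -pi)

    def projected_diameter(self, x):
        # diam(S_t, x) = max_{s in S_t} <s, x> - min_{s in S_t} <s, x>, computed by two LPs.
        A_ub = self.A if len(self.b) else None
        b_ub = self.b if len(self.b) else None
        bounds = [(-1.0, 1.0)] * self.d
        hi = linprog(-x, A_ub=A_ub, b_ub=b_ub, bounds=bounds)   # maximizes <s, x>
        lo = linprog(x, A_ub=A_ub, b_ub=b_ub, bounds=bounds)    # minimizes <s, x>
        return -hi.fun - lo.fun

# Two feedback rounds along the same direction shrink the width of S_t along that direction.
S = ConfidenceSet(d=2)
w = np.array([1.0, 0.0])                              # direction a^ag_t - a_t of the round
S.cut(w, pi=0.5, agent_took_incentivized_arm=True)    # agent took a_t:   <s*, w> <= 0.5
S.cut(w, pi=0.1, agent_took_incentivized_arm=False)   # agent refused:    <s*, w> >= 0.1
print(round(S.projected_diameter(w), 3))              # -> 0.4
```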
However, since the action set A_t is non-stationary, we cannot determine the action a^ag_t by just observing the first round. Consequently, the new set S_{t+1} cannot be computed as previously in the non-contextual setting. This makes single-point incentive functions ill-suited for efficient learning of s^⋆ over the iterations. Instead, at any time t, we seek a form of binary-search feedback in the direction a^1_t − a^2_t for any two arms a^1_t ≠ a^2_t ∈ A_t. As we will see, this can be achieved by considering an incentive function κ with support {a^1_t, a^2_t} ⊆ A_t. Indeed, an important remark is that the amount of incentive needed to make the agent play any particular action is bounded under H2, since

$\max_{a \in \mathcal{A}_t} \pi^\star(t, a) = \max_{a \in \mathcal{A}_t} \langle s^\star, a^{\mathrm{ag}}_t - a \rangle \le 2$ ,  (13)

this bound being known to the principal. For any a ∈ B(0, 1), this makes the incentive function a' ↦ 3·1_a(a') sufficient to ensure A_t = a from (8). The value 3 in the definition of the incentive function is chosen instead of 2 to avoid an arbitrary tie-breaking. Consequently, under the choice κ(t, a^1_t) = 3, κ(t, a^2_t) = 3 + π(t) for π(t) ≥ 0 and κ(t, a') = 0 for any other arm a', (13) guarantees that only a^1_t and a^2_t may be chosen by the agent, helping the principal to update her confidence set S_t in a known direction. Specifically, for such an incentive κ, the choice A_t = a^1_t reveals the following information on s^⋆: ⟨s^⋆, a^1_t − a^2_t⟩ ≥ π(t). This permits the definition of binary-search-like feedback in the direction a^1_t − a^2_t and thus allows us to update the confidence set S_t following S_{t+1} = S_t ∩ {s : ⟨s, a^1_t − a^2_t⟩ ≥ π(t)} if A_t = a^1_t, or S_{t+1} = S_t ∩ {s : ⟨s, a^1_t − a^2_t⟩ ≤ π(t)} otherwise.

Binary search. This update turns our estimation of the optimal incentives π^⋆ into a multidimensional binary search where the unknown quantity is the vector s^⋆. At each iteration t, a vector w_t from the unit sphere is given. Then, the algorithm has to guess the value of ⟨s^⋆, w_t⟩ using its previous observations. Finally, an oracle reveals as feedback whether the guess is above or below the true value ⟨s^⋆, w_t⟩, and the algorithm updates its observation history. In our case, w_t := (a^1_t − a^2_t)/‖a^1_t − a^2_t‖ for a^1_t, a^2_t ∈ A_t, and the resulting feedback is given through the agent picking either a^1_t or a^2_t. However, extending the binary search to the multidimensional case is non-trivial for two reasons.

Direction of the multidimensional binary search. In the contextual bandit setting, we cannot divide the horizon into two successive phases. Indeed, the principal cannot choose any binary search direction in R^d, since w_t depends on the action set A_t available at each iteration. For instance, the action sets A_t could be restricted to a small-dimensional subspace of R^d during the whole binary search procedure, so that the principal can only get a good estimate of s^⋆ in this subspace. After this phase, the received action sets could be totally different (e.g., in the orthogonal subspace or the whole of R^d) during the remainder of the game. We solve the issue of constrained directions for the binary search by running it in an adaptive way, depending on the available action set at each time step and on the current level of estimation on this set. More precisely, at iteration t, the principal's estimate of the true value ⟨s^⋆, w_t⟩ is ⟨ŝ_t, w_t⟩, where ŝ_t is defined as the centroid of S_t:

$\hat s_t := \frac{1}{\mathrm{Vol}(S_t)} \int_{S_t} x\,\mathrm{d}x$ with $\mathrm{Vol}(S_t) = \int_{S_t} \mathrm{d}x$ .

Whenever |⟨ŝ_t, w_t⟩ − ⟨s^⋆, w_t⟩| < 1/T, the principal incurs a negligible cost to incentivize the agent to choose her desired action.
Then, in this context, for any action a ∈ A_t that the principal wants to play, she designs the incentive

$\hat\pi(t, a) := \max_{a' \in \mathcal{A}_t} \langle \hat s_t, a' \rangle - \langle \hat s_t, a \rangle + 2/T$ ,  $\hat\kappa_a(t, a') = \mathbb{1}_a(a')\,\hat\pi(t, a)$ for any a' ∈ A_t .

To control the precision of the estimate π̂(t, a) of π^⋆(t, a) for any a ∈ A_t, Lemma 4 shows that it is sufficient to consider the event E_t, defined as

$E_t := \Big\{ \max_{a^1_t \neq a^2_t \in \mathcal{A}_t} \mathrm{diam}\Big(S_t, \frac{a^1_t - a^2_t}{\|a^1_t - a^2_t\|}\Big) \le 1/T \Big\}$ ,  (14)

where we recall the definition of the projected diameter: diam(S_t, x) := max_{s_1, s_2 ∈ S_t} |⟨s_1 − s_2, x⟩| for any x ∈ R^d. When E_t is false, the principal does not have a good characterization of the incentive function that she needs to provide, and thus Contextual IPA runs a multidimensional binary search step, which is explained in the paragraph below. Otherwise, Contextual IPA runs a contextual bandit subroutine Ctx Alg in a black-box manner on her bandit instance, driven by the principal's own mean rewards ⟨θ^⋆, a⟩ and the approximated incentives π̂(t, a) for any a ∈ A_t. Lemma 4 guarantees that these approximations are upper estimates of π^⋆(t, a). The principal proposes an incentive function κ̂_{a^Rec_t} depending on the estimate, so as to make the agent select the action a^Rec_t recommended by the bandit subroutine. Again, we do not impose any assumption on the tie-breaking, which can be arbitrary.

Lemma 4. Consider t ∈ [T], A_t ⊆ B(0, 1), S_t ⊆ B(0, 1) such that E_t defined in (14) is true. Then, for any action a ∈ A_t, we have π^⋆(t, a) < κ̂_a(t, a) ≤ π^⋆(t, a) + 4/T.

A corollary of Lemma 4 is that, when running Contextual IPA under E_t, A_t = a^Rec_t.

Issue of the diameter reduction. We illustrate the challenge of the multidimensional constrained binary search on a very simple problem. At time t, we can only run a binary search step in one of the directions w_t = a − a' for a, a' ∈ A_t. Suppose that we have two directions of interest, v_1, v_2 in R^d, such that we aim to decrease the diameter of S_t in the direction of v_1 or v_2. Even if we divide the diameter of S_t by two in a direction w_t, which is always possible, this does not necessarily imply that the diameter of S_t is reduced along either direction v_i, as illustrated in Figure 1.

[Figure 1: Illustration of a case where the volume S_0 is cut along a direction w_1 to give a new confidence set S_1, while the diameter is reduced along neither v_1 nor v_2.]

An early attempt to tackle this multidimensional binary search problem with adversarial directions was presented by Cohen et al. (2020), who used ellipsoid methods. Here, we use the more recent strategy proposed by Lobel et al. (2018) and their Projected Volume subroutine, which is described further in Appendix E.

Non-stationarity of the reward shift. For any rewards {(Y_a(t))_{a∈A_t} : t ∈ [T]}, we define the history as G_t := (A_s, A_s, U_s, Y_{A_s}(s))_{s≤t}, where the first entry is the action set and the second the selected action, and where (U_s)_{s∈N*} is a family of independent uniform random variables on [0, 1] allowing randomization in the decision making. Let Ctx Alg be a linear contextual bandit algorithm, i.e., Ctx Alg: (U_t, G_{t−1}, A_t) ↦ a^Rec_t ∈ A_t. When the principal is not running a binary search step, i.e., when E_t is true, she plays the Ctx Alg subroutine on her bandit instance. We define the subset I_t of all the iterations during which Ctx Alg is run and the shifted history H_t available at time t as

$I_t := \{ s \in [t] \text{ such that } E_s \text{ is true} \}$ ,  (15)

and H_t := (A_s, A_s, U_s, X_{a^Rec_s}(s) − κ̂_{a^Rec_s}(s, A_s))_{s∈I_t}. In our setup, Ctx Alg is fed in a black-box manner with this shifted history to issue recommendations a^Rec_t: Ctx Alg: (U_t, H_{t−1}, A_t) ↦ a^Rec_t.
At time t, for an action a^Rec_t recommended by Ctx Alg, our meta-algorithm Contextual IPA proposes an incentive designed so that the agent eventually picks a^Rec_t (Lemma 4): A_t = a^Rec_t. However, the last difference between the non-contextual and contextual cases is that the shift between π̂(t, a_t) and π^⋆(t, a_t) is no longer constant over the bandit steps t ∈ I_T. This shift of rewards is interpreted as adversarial corruption (Bubeck & Slivkins, 2012; Lykouris et al., 2018). At each round, taking this shift into account, the optimal average utility associated with action a ∈ A_t for the principal is r^⋆(t, a) := ⟨θ^⋆, a⟩ − π^⋆(t, a), while the principal can only estimate a non-stationary expected reward² r^⋆(t, a) + ε^corrupt_t, with the corruption level ε^corrupt_t defined as

$\varepsilon^{\mathrm{corrupt}}_t := \pi^\star(t, A_t) - \hat\pi(t, A_t)$  (16)

and ε^corrupt_{I_t} := (ε^corrupt_s)_{s∈I_t}. In this setup, we can define a corrupted regret as follows:

$R^{\mathrm{corrupt}}_{\mathrm{Ctx\,Alg}}(I_T, \varepsilon^{\mathrm{corrupt}}_{I_T}) := \sum_{t \in I_T} \Big( \max_{a \in \mathcal{A}_t} r^\star(t, a) - r^\star(t, a^{\mathrm{Rec}}_t) \Big)$ ,

where a^Rec_t = Ctx Alg(U_t, H_{t−1}(ε^corrupt_{I_{t−1}}), A_t) and H_t(ε^corrupt_{I_t}) = (A_s, A_s, U_s, r^⋆(s, a^Rec_s) + η_{a^Rec_s}(s) + ε^corrupt_s)_{s∈I_t}. We then aim to minimize the corrupted regret with Ctx Alg, which is not possible using a naive linear contextual bandit algorithm.

²Even if we were to feed the stochastic observations (X_{A_s}(s) − π̂(s, A_s))_{s≤t} at time t, past algorithmic decisions would depend on different observation distributions, making the direct use of classical regret bounds for the bandit subroutine impossible.

Regret analysis. We split the regret into three components, each of them bounded separately. The first component comes from the bias in the estimation of the optimal incentives. The second is the cost incurred by the iterations of Ctx Alg on the corrupted bandit instance; we use the results of He et al. (2022) with a known corruption level to bound this term. The last term follows from the multidimensional binary search steps used to estimate s^⋆. Lemma 5 allows us to bound the number of such steps; see Appendix D for a proof, which builds on the work of Lobel et al. (2018).

Lemma 5. Consider E_t defined by (14) with (S_t)_{t∈[T]} defined by Contextual IPA. Then it holds almost surely that

$\sum_{t \ge 1} \mathbb{1}_{E^c_t} \le 192\, d \log(d T)$ .

Algorithm 2 Contextual IPA
1: Input: horizon T, subroutine Ctx Alg, δ = 1/(16 T² d (d + 1)²)
2: Initialize: H_0 = ∅, V_0 = ∅, S_0 = {s ∈ R^d : ‖s‖ ≤ 1}
3: for t = 1, . . . , T do
4:   Observe the available action set A_t
5:   if E_t is FALSE then
6:     # Where E_t is defined in (14)
7:     S_{t+1}, V_{t+1} = Projected Volume(T, δ, S_t, V_t, a^1_t, a^2_t)
8:   else
9:     Compute ŝ_t as the centroid of S_t
10:    Sample U_t ∼ U(0, 1)
11:    Get a^Rec_t = Ctx Alg(U_t, H_t, A_t)
12:    Let π̂(t, a^Rec_t) = max_{a∈A_t} ⟨ŝ_t, a⟩ − ⟨ŝ_t, a^Rec_t⟩ + 2/T
13:    Propose the incentive function κ̂_{a^Rec_t}(t, a) = 1_{a^Rec_t}(a) π̂(t, a^Rec_t)
14:    Observe A_t as defined in (8) and X_{A_t}(t)
15:    Update H_t with (A_t, A_t, U_t, X_{A_t}(t) − κ(t, A_t))
16:  end if
17: end for

Theorem 2. If Contextual IPA is run with any linear contextual subroutine Ctx Alg, then, with the same constant 192 as in Lemma 5, the regret of Contextual IPA is bounded as

$R(T) \le 2 + R^{\mathrm{corrupt}}_{\mathrm{Ctx\,Alg}}(I_T, \varepsilon^{\mathrm{corrupt}}_{I_T}) + 1344\, d \log(d T)$ .

We emphasize that our results hold with any contextual linear bandit algorithm and that the overall regret is mostly driven by the term R^corrupt_{Ctx Alg}: although the principal has to solve simultaneously a pricing-like problem and a stochastic bandit problem, she almost achieves state-of-the-art linear bandit regret.
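The control flow of Algorithm 2 can be summarized by the following Python sketch. The centroid of S_t and the Projected Volume subroutine are replaced by simple stand-ins (a halfspace representation of S_t inside the box [−1, 1]^d, LP-based widths, a coordinate-wise midpoint instead of the true centroid, and a probe at the midpoint of the current extent along the queried direction); these simplifications and every name below are ours, so this illustrates the dispatch between binary-search and bandit steps rather than the actual algorithm.

```python
import numpy as np
from scipy.optimize import linprog

class ConfidenceSet:
    """S_t kept as {s in [-1, 1]^d : A s <= b}; a simple stand-in for the sets of Algorithm 2."""
    def __init__(self, d):
        self.d, self.A, self.b = d, np.zeros((0, d)), np.zeros(0)

    def _opt(self, c):
        # Minimum of <c, s> over S_t, via a linear program.
        return linprog(c, A_ub=self.A if len(self.b) else None,
                       b_ub=self.b if len(self.b) else None,
                       bounds=[(-1.0, 1.0)] * self.d).fun

    def extent(self, v):
        return self._opt(v), -self._opt(-v)            # (min, max) of <s, v> over S_t

    def width(self, v):
        lo, hi = self.extent(v)
        return hi - lo

    def center(self):
        # Coordinate-wise midpoint of the extents: a crude surrogate for the centroid s_hat_t.
        return np.array([sum(self.extent(e)) / 2.0 for e in np.eye(self.d)])

    def cut(self, v, threshold, keep_below):
        row, rhs = (v, threshold) if keep_below else (-v, -threshold)
        self.A, self.b = np.vstack([self.A, row]), np.append(self.b, rhs)

def contextual_ipa_round(S, actions, T, agent_choice, ctx_alg_recommend):
    """One round: binary-search step if E_t is false, bandit step otherwise (>= 2 distinct actions)."""
    pairs = [(a1, a2) for i, a1 in enumerate(actions) for a2 in actions[i + 1:]]
    widths = [(S.width((a1 - a2) / np.linalg.norm(a1 - a2)), a1, a2) for a1, a2 in pairs]
    worst, a1, a2 = max(widths, key=lambda z: z[0])
    if worst > 1.0 / T:
        # E_t is false: two-point incentive; the agent's choice cuts S_t along a1 - a2.
        v = a1 - a2
        lo, hi = S.extent(v)
        probe = max(0.0, (lo + hi) / 2.0)              # halves the width along v (clamped >= 0)
        chosen = agent_choice({tuple(a1): 3.0, tuple(a2): 3.0 + probe})
        S.cut(v, probe, keep_below=not np.allclose(chosen, a1))
        return "binary_search"
    # E_t is true: bandit step with an upper estimate of the optimal incentive on a_rec.
    s_hat = S.center()
    a_rec = ctx_alg_recommend(actions)
    pi_hat = max(float(s_hat @ a) for a in actions) - float(s_hat @ a_rec) + 2.0 / T
    return "bandit", a_rec, pi_hat

# Toy usage: a greedy agent with hidden s*, and a random stand-in for Ctx Alg.
rng = np.random.default_rng(0)
s_star = np.array([0.6, -0.2])
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-0.5, 0.5])]
agent = lambda inc: max(actions, key=lambda a: float(s_star @ a) + inc.get(tuple(a), 0.0))
recommend = lambda acts: acts[rng.integers(len(acts))]
S = ConfidenceSet(d=2)
steps = [contextual_ipa_round(S, actions, T=50, agent_choice=agent, ctx_alg_recommend=recommend)
         for _ in range(40)]
print(sum(step == "binary_search" for step in steps), "binary-search steps out of", len(steps))
```

Probing at the midpoint of the current extent along a^1_t − a^2_t guarantees that each binary-search step at least halves the width of S_t along the queried direction, and since the kept halfspace always contains s^⋆, the confidence set never becomes empty.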
The main difference between traditional and corrupted rewards in the bandit setting lies in the fact that, in the former case, rewards are typically assumed to be i.i.d., whereas in the latter case they may be chosen adversarially. An algorithm is robust to corruptions if it yields regret guarantees for any possible reward corruption within a specific budget. This kind of problem was first considered by Bubeck & Slivkins (2012) and has been extensively studied since (see, e.g., Kapoor et al., 2019). In our setting, a bound on the corruption budget is available, thanks to Lemma 6 below.

Lemma 6. Consider (E_t)_{t∈[T]} defined by (14) with (S_t)_{t∈[T]} defined by Contextual IPA. Let I_T and (ε^corrupt_t)_{t∈[T]} be as defined in (15) and (16), and let t ∈ [T] be such that E_t is true. Then |ε^corrupt_t| ≤ 4/T and $\sum_{t \in I_T} |\varepsilon^{\mathrm{corrupt}}_t| =: C^{\mathrm{corrupt}} \le 4$.

With standard bandit assumptions, we can then consider a corruption-robust algorithm, such as CW-OFUL from He et al. (2022).

H3. At each round t ≥ 1, for any action a ∈ A_t, the principal's reward X_a(t) is H_t-conditionally 1-subgaussian, i.e., for any λ ∈ R, we have $\mathbb{E}\big[e^{\lambda(X_a(t) - \mathbb{E}[X_a(t)])} \mid H_{t-1}\big] \le e^{\lambda^2/2}$.

Corollary 2. Suppose that H3 holds. If Algorithm 2 is run with the subroutine Ctx Alg := CW-OFUL proposed by He et al. (2022), the regularization parameter λ = 1 and a confidence level δ = 1/T, then the following bound holds:

$R(T) \le 11 + 1344\, d \log(d T) + C_{\mathrm{Ctx\,Alg}}\, d\sqrt{T} \log(T)$ ,

with C_{Ctx Alg} a universal constant.

As in the multi-armed setting, the obtained regret bounds are comparable to the best achievable performance in the standard bandit setting, where the principal does not need to estimate the agent's parameter s^⋆.

5 Experiments

We illustrate our theoretical findings with experiments on a toy example and compare IPA with the Principal's ε-Greedy algorithm of Dogan et al. (2023b). We also compare with a UCB Oracle baseline that runs UCB on the shifted bandit instance with arm means µ_a := θ_a − π^⋆_a and no principal-agent consideration. This baseline corresponds to the case where the principal knows the agent's reward vector and therefore only has to consider a bandit algorithm. Experimental details can be found in Appendix B. We observe in Figure 2 that the Principal's ε-Greedy algorithm of Dogan et al. (2023b) exhibits suboptimal performance. Additionally, another issue arises from its computational complexity, as it requires an optimization step at every round. In comparison, IPA yields a regret nearly equal to that of Oracle UCB, illustrating that the cost of estimating the agent's preferences, obtained from binary search, is negligible for IPA.

[Figure 2: Cumulative regret for different algorithms on a 5-arm instance.]

6 Lower Bounds

For the sake of clarity, we stick to the multi-armed case of Section 3 in this section. A simple observation allows us to rewrite the regret in terms of the shifted means µ_a := θ_a − π^⋆_a and µ^⋆ := max_{a∈[K]} µ_a. Even if the principal were to know the optimal incentives (π^⋆_a)_{a∈[K]}, she would still face a bandit instance with arm means µ_a. From there, we can directly extend standard lower bounds from the bandit literature to our setting (Lai & Robbins, 1985; Burnetas & Katehakis, 1996).

Proposition 1. Let D be a class of distributions. Consider the multi-armed case of Section 3 and a policy satisfying, for any instance ν ∈ D^K and α > 0, R(T) = o(T^α).
Then, for any ν DK, lim inf T R(T) µ µa KLinf(νa π a, µ , D) , where denoting by KL the Kullback-Leibler divergence, KLinf(ρ, µ , D) :=inf{KL(ρ, ρ ): ρ D , Z xρ (dx) > µ } , The complete proof is postponed to Appendix F. Proposition 1 states that IPA yields a nearly optimal regret. Similar arguments can be made in the contextual setting. 7 Conclusion and Possible Extensions This paper presents two novel algorithms called IPA and Contextual IPA, tackling generalizations of both multiarmed and contextual bandits that account for principalagent interactions. By decoupling the learning of the agent and the estimation of the principal s parameters, we are able to obtain a nearly optimal algorithm, improving over the previous work of Dogan et al. (2023b). Overall, we obtain an efficient principal-agent bandit framework that allows us to take into account an interaction between a principal and an agent with misaligned interests in a bandit environment. There are various possible extensions of our work, among which considering strategic behaviors for repeated interactions with a single agent or uncertainty on the agent s side. Information rent. Again, we consider the multi-armed case for the sake of clarity. We assumed that the agent is always greedy and therefore chooses at time t an action following At = argmaxa At{sa + 1at(a)π(t)} . (18) However, nothing prevents the agent from lying and choosing another action instead of (18). The maximal total welfare that can be extracted at each round is maxa A{sa + θa}. In our setting, with a trustful agent, this reward was shared between the two actors with an average reward maxa A sa for the agent and maxa A{sa + θa} maxa A sa for the principal. However, as it is exposed by Dogan et al. (2023b, Section 4), the agent could play with a malicious policy and choose At as if he had different (sa)a [K]. In that case, he can extract an individual reward maxa A{sa + θa} mina A θa , while letting a reward mina [K] θa to the principal. In that case, the agent exploits his information rent to increase his profits. Against such adversarial and powerful agents, the principal cannot do more than play and learn with the (sa)a [K] announced by the agent. However, this situation is not an issue with myopic agents who act greedily since each of them tries to maximize his own instantaneous reward and eventually select At from (18). This situation is encountered in many applications, where each agent has a single round interaction with the principal for the whole game. Learning agents. A possible extension would be to incorporate uncertainty on the agent s side and consider learning agents (see, e.g., Dogan et al., 2023a). However again, when considering single round interactions with the agents, each agent myopically maximizes his reward a priori. Consequently, the agent policy is stationary and would be driven by sa, the expected beliefs on the action rewards. The single interaction model is already well suited for numerous real world applications. In the case of repeated interactions between the principal and a single learning agent, it becomes much more complex, as this agent can both learn his true rewards on the run while trying to influence future actions of the principal with his own choices. Restricting the agent s policy to a specific set might then be necessary, as done by Dogan et al. (2023a) with Agent s ε-Greedy strategy. 
The major learning difficulty from the principal side would then come from the nonstationarity of the agents decisions and could be handled using non-stationary bandits algorithms (see, e.g., Gittins, 1979; Lattimore & Szepesv ari, 2020). Incentivized Learning in Bandit Games Acknowledgements Funded by the European Union (ERC, Ocean, 101071601). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. The work of DT has been supported by the Paris ˆIle-de-France R egion in the framework of DIM AI4IDF. Impact Statement This paper aims to advance the field of principal-agent problems, particularly in the online setting. We believe that this study could lead to several positive real-world outcomes, though we do not feel it is necessary to highlight any specific consequences here. However, we did mention a few relevant domains of interest in the introduction. Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In ICML, pp. 3 11, 1999. Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002. Banihashem, K., Hajiaghayi, M., Shin, S., and Slivkins, A. Bandit social learning: Exploration under myopic behavior. ar Xiv preprint ar Xiv:2302.07425, 2023. Ben-Porat, O., Mansour, Y., Moshkovitz, M., and Taitler, B. Principal-agent reward shaping in mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 9502 9510, 2024. Bernasconi, M., Castiglioni, M., Marchesi, A., Gatti, N., and Trov o, F. Sequential information design: Learning to persuade in the dark. Advances in Neural Information Processing Systems, 35:15917 15928, 2022. Bertsimas, D. and Vempala, S. Solving convex programs by random walks. Journal of the ACM (JACM), 51(4): 540 556, 2004. Boursier, E. and Perchet, V. A survey on multi-player bandits. ar Xiv preprint ar Xiv:2211.16275, 2022. Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pp. 42 1. JMLR Workshop and Conference Proceedings, 2012. Burnetas, A. N. and Katehakis, M. N. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122 142, 1996. Cai, J., Chen, R., Wainwright, M. J., and Zhao, L. Doubly high-dimensional contextual bandits: An interpretable model for joint assortment-pricing. ar Xiv preprint ar Xiv:2309.08634, 2023. Castiglioni, M., Celli, A., Marchesi, A., and Gatti, N. Online bayesian persuasion. Advances in Neural Information Processing Systems, 33:16188 16198, 2020. Castiglioni, M., Marchesi, A., Celli, A., and Gatti, N. Multi-receiver online bayesian persuasion. In International Conference on Machine Learning, pp. 1314 1323. PMLR, 2021. Chen, S., Wang, M., and Yang, Z. Actions speak what you want: Provably sample-efficient reinforcement learning of the quantal stackelberg equilibrium from strategic feedbacks. ar Xiv preprint ar Xiv:2307.14085, 2023. Cohen, A., Deligkas, A., and Koren, M. Learning approximately optimal contracts. In International Symposium on Algorithmic Game Theory, pp. 331 346. Springer, 2022. Cohen, M. C., Lobel, I., and Paes Leme, R. Feature-based dynamic pricing. Management Science, 66(11):4921 4943, 2020. Conitzer, V. and Garera, N. 
Learning algorithms for online principal-agent problems (and selling goods online). In Proceedings of the 23rd international conference on Machine learning, pp. 209 216, 2006. Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 2008. Den Boer, A. V. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in operations research and management science, 20(1): 1 18, 2015. Dogan, I., Shen, Z.-J. M., and Aswani, A. Estimating and incentivizing imperfect-knowledge agents with hidden rewards. ar Xiv preprint ar Xiv:2308.06717, 2023a. Dogan, I., Shen, Z.-J. M., and Aswani, A. Repeated principal-agent games with unobserved agent rewards and perfect-knowledge agents. ar Xiv preprint ar Xiv:2304.07407, 2023b. Doval, L. and Ely, J. C. Sequential information design. Econometrica, 88(6):2575 2608, 2020. Incentivized Learning in Bandit Games Evans, K. J., Terhorst, A., and Kang, B. H. From data to decisions: helping crop producers build their actionable knowledge. Critical reviews in plant sciences, 36(2): 71 88, 2017. Gittins, J. C. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(2):148 164, 1979. Golrezaei, N., Jaillet, P., and Liang, J. C. N. Incentive-aware contextual pricing with non-parametric market noise. In International Conference on Artificial Intelligence and Statistics, pp. 9331 9361. PMLR, 2023. Gr otschel, M., Lov asz, L., and Schrijver, A. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012. He, J., Zhou, D., Zhang, T., and Gu, Q. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. Advances in Neural Information Processing Systems, 35:34614 34625, 2022. Hu, X., Ngo, D., Slivkins, A., and Wu, S. Z. Incentivizing combinatorial bandit exploration. Advances in Neural Information Processing Systems, 35:37173 37183, 2022. Javanmard, A. and Nazerzadeh, H. Dynamic pricing in highdimensions. The Journal of Machine Learning Research, 20(1):315 363, 2019. Kamenica, E. and Gentzkow, M. Bayesian persuasion. American Economic Review, 101(6):2590 2615, 2011. Kapoor, S., Patel, K. K., and Kar, P. Corruption-tolerant bandit learning. Machine Learning, 108(4):687 715, 2019. Laffont, J.-J. and Martimort, D. The theory of incentives: the principal-agent model. In The Theory of Incentives. Princeton University Press, 2009. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4 22, 1985. Lattimore, T. and Szepesv ari, C. Bandit Algorithms. Cambridge University Press, 2020. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on the World Wide Web, pp. 661 670, 2010. Lobel, I., Leme, R. P., and Vladu, A. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346 1361, 2018. Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114 122, 2018. Mansour, Y., Slivkins, A., and Syrgkanis, V. Bayesian incentive-compatible bandit exploration. Operations Research, 68(4):1132 1161, 2020. Mao, J., Leme, R., and Schneider, J. Contextual pricing for lipschitz buyers. Advances in Neural Information Processing Systems, 31, 2018. 
Myerson, R. B. Mechanism design. Springer, 1989. Rademacher, L. A. Approximating the centroid is hard. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, pp. 302 305, 2007. Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395 411, 2010. Sellke, M. and Slivkins, A. The price of incentivizing exploration: A characterization via thompson sampling and sample complexity. In Proceedings of the 22nd ACM Conference on Economics and Computation, pp. 795 796, 2021. Simchowitz, M. and Slivkins, A. Exploration and incentives in reinforcement learning. Operations Research, 2023. Slivkins, A. et al. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(12):1 286, 2019. Smith, D. J. and Vamanamurthy, M. K. How small is a unit ball? Mathematics Magazine, 62(2):101 107, 1989. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Woodroofe, M. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799 806, 1979. Yu, C., Liu, J., Nemati, S., and Yin, G. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 55(1):1 36, 2021. Zhong, H., Yang, Z., Wang, Z., and Jordan, M. I. Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopic followers? ar Xiv preprint ar Xiv:2112.13521, 2021. Zhu, B., Bates, S., Yang, Z., Wang, Y., Jiao, J., and Jordan, M. I. The sample complexity of online contract design. ar Xiv preprint ar Xiv:2211.05732, 2022. Incentivized Learning in Bandit Games A := [K] Set of possible arms. T Horizon. NT := log2 T Number of steps dedicated to the binary search on each arm in IPA. at Arm on which the principal offers an incentive. π(t) Amount of incentive offered by the principal on action at. At Arm chosen by the agent, maximizing his utility, known by everyone. sa + 1at(a)π(t) Agent s utility for action a. νa Principal s reward distribution for action a. Xa(t) 1at(a)π(t) Principal s utility for action a. µa Principal s expected utility for action a, using the optimal incentive π a. µ Maximal expected utility for the principal. θa := E[Xa(1)] Principal s expected reward. π a Infimum amount of incentives to be offered on action at = a to make the agent choose it. H Shifted history used to feed Alg. RAlg(T) Regret of the subroutine Alg on a horizon T. R(T) Overall regret of IPA on a horizon T. Table 1: Notations used in Section 3. T Horizon. B(0, 1) Unit ball in Rd, d 1. At B(0, 1) Action set at time t among which the agent selects At. π(t) Amount of incentive offered by the principal on some action. ηa(t) Noise distribution of the principal s reward associated with action a at time t. r (t, a) = θ , a + ηa(t) π (t, a) Utility collected by the principal on action a at time t if the optimal amount of incentive is used. µ t Principal s maximal expected utility at time t. π (t, a) Infimal amount of incentives to be offered on action a with κ(t, a ) = 1a(a )π a so that the agent eventually chooses action a. ˆπ(t, a) Principal s estimation of π (, a) κ(t, ): B(0, 1) R+ Incentive function, associating each action with some amount of incentives. s B(0, 1) Agent s true reward vector. St s Principal s confidence set for s at time t. s , a + κ(t, a) Agent s utility for action a. aag t = argmaxa At s , a Agent s optimal action with null incentives at time t. 
a Rec t Action recommended by Ctx Alg at time t. θ , a + ηa(t) κ(t, a) Principal s utility for action a. µ t Maximal expected utility for the principal at time t. εcorrupt t = π (t, a) ˆπ(t, a) Shift between the optimal incentives and the estimated ones. Ccorrupt Total corruption budget due to the shift between ˆπ(t, a) and π (t, a) over the rounds. Et Event being true if the diameter of St projected At is small than 1/T. It Rounds up to time t during which Es, s t is true. H Shifted history used to feed Alg. RCtx Alg(T) Regret of the subroutine Ctx Alg on a horizon T. R(T) Overall regret of Contextual IPA on a horizon T. Table 2: Notation used in Section 4. Incentivized Learning in Bandit Games B Experimental details We ran the experiments in Figure 2 for a horizon T = 10 000 on an average of 100 runs on a five arms bandit. We plotted the standard error across the different runs. The expected rewards for the principal (θ) and the agent (s) are given in Table 3. The principal s rewards Xa(t) are drawn from an i.i.d. distribution Xa(t) N(θa, 1) for any a [K], t [T]. We also run an oracle UCB instance with rewards following a Gaussian distribution N(µa, 1) where for any a [K], µa := θa + sa maxa [K] sa , as if a UCB algorithm was run with the full knowledge of the optimal incentives and was learning his own mean rewards (µ), taking into account these incentives. The mean rewards µ are also given in Table 3. We observe that the additional exploration steps needed to learn the optimal incentives in IPA are not very costly compared to the regret achieved by the UCB oracle. For the Principal s ε-Greedy algorithm, we use the hyperparameters α = 1 and m = 500. The hyperparameter m controls the number of exploration steps. We ran the Principal s ε-Greedy algorihtm on the same bandit setting for different values m = 30, m = 100, m = 200, m = 300, m = 400, m = 500, m = 600, m = 800, m = 1000, m = 2000, m = 5000, m = 10 000. Below m = 500, the algorithm does not explore enough and incurs a linear regret on some runs, consequenly yielding a poor mean regret, whereas above m = 500, the algorithm explores excessively, leading to a higher regret due to overexploration. We ran the same experiments on longer horizons T = 100 000 and T = 1 000 000 and the algorithms exhibited the same behavior. In practice, the tuning of the ε-Greedy algorithm depends on the reward gaps and is not common to use. This is why another advantage of IPA compared to the Principal s ε-Greedy algorithm of Dogan et al. (2023b) lies in the fact that it does not need any tuning of hyperparameters, leading to a better use in practice, on potentially broader bandit instances. s 0.64 0.99 0.73 0.61 0.59 θ 0.30 0.24 0.88 0.07 0.65 µ 0.05 0.24 0.62 0.31 0.25 Table 3: Experimental parameters for Figure 2. We did not run Contextual IPA in a contextual bandit setting because it is quite tedious to implement, due to the use of the Projected Volume subroutine from the work of Lobel et al. (2018). Even though they obtain an excellent regret bound, the computations raise specific challenges. The first issue is the computation of the centroid which is known to be a #P-hard problem (Rademacher, 2007). However, it can be solved through an approximation of the centroid, which is computable in polynomial time (see, Bertsimas & Vempala, 2004, Lemma 5 and Theorem 12). A second issue is finding directions along which the set St has a small diameter, which is needed to compute the set Vt. It is solved by Lobel et al. 
(2018) with an ellipsoidal approximation E of St such that E St αE with α > 1, since such an ellipsoid can be computed in polynomial time, (see, Gr otschel et al., 2012, Corollary 4.6.9). Such a variation of the Projected Volume subroutine is presented in the work of Lobel et al. (2018, Section 9.3). It is shown that one can achieve polynomial time computations with still the same regret bound for the multidimensional binary search steps (Lobel et al., 2018, Theorem 9.4). This line of work needs to be explored for implementing Contextual IPA in practice, which is feasible but still requires a significant amount of work. C Regret Bound for Non-Contextual Setting Notations. We define πa(t) R+ as the upper estimate and πa(t) R+ as the lower estimate of π a after t rounds of binary search on arm a. For any t [T] and a [K], we define πmid a (t) := (πa(t) + πa(t))/2. We define NT := log2 T as the number of binary search steps per arm and ˆπa is the estimated incentive to make the agent choose action a after NT steps of binary search: ˆπa = πa(NT ) + 1/T. Since our problem is stationary, we write apr := argmaxa [K]{θa π a} for the optimal action that the principal could aim to play at each round. Lemma 1. For any T N, the regret of any algorithm on our problem instance can be written as R(T) = T max a [K]{θa + sa max a [K] sa } t=1 {θAt 1at(At)π(t)} Incentivized Learning in Bandit Games Proof of Lemma 1. Recall that the regret is defined in (3) as R(T) := T µ PT t=1 E[XAt(t) 1at(At)π(t)], where µ = supa [K],π R+ Eν[Xa(1)] π , such that a argmaxa [K]{sa +π} . Note that we can write µ = supa [K],π R+{θa 1 A(a, π)π}, where A := {(a, π): sa + π maxa sa + 1a(a )π}. First note that if (a, π) A, then sa sa π for any a [K], which implies by definition of the optimal incentives (4) that π a π. Consequently, µ = max a [K]{Eν[Xaopt(1)] π a} = max a [K]{θa max a [K]{sa } + sa} , hence our result about the regret. Lemma 7. Assume H1 and that we run Algorithm 3 for an action a [K] and a number of binary searches NT N. Then, for any t [NT ], 0 πa(t) πmid a (t) πa(t) 1. Proof. The proof is by induction on t [NT ]. For t = 0, it is defined by definition. Then suppose that it holds true for t 0. Note that line 6 in Algorithm 3 can be written as πa(t + 1) = 1a(At)πmid a (t) + (1 1a(At))πa(t) πa(t + 1) = (1 1a(At))πmid a (t) + 1a(At)πa(t) , (19) which completes the proof by applying the induction hypothesis. Lemma 8. Assume H1 and that we run Algorithm 3 for an action a [K] and a number of binary searches NT N. Then, for any t [NT ], π a [πa(t), πa(t)] and |πa(t) πa(t)| 1/2t . Proof. The proof is by induction on t. The case t = 0 is trivial by the initialization of Algorithm 3 and H1. Suppose that the statement holds for t. Note that 1({At = a}) = 1({πmid a (t) π a}), therefore using (19), we obtain by using the induction hypothesis that π a [πa(t + 1), πa(t + 1)] , πa(t + 1) πa(t + 1) = πa(t) πa(t) which completes the proof. Proof of Theorem 1. Recall that Lemma 1 implies that t=1 max a [K]{θa π a} (XAt(t) 1at(At)π(t)) We decompose the regret between the K log2 T = KNT first steps during which we run the Binary Search Subroutine and all the subsequent ones R(T) = (A) + (B) t=1 max a [K]{θa π a} {XAt(t) 1at(At)π(t)} t=K log2 T +1 max a [K]{θa π a} {XAt(t) 1at(At)π(t)} We separate the analysis of the regret, bounding independently two terms in the right-hand side of the previous decomposition. 
Proof of Theorem 1. Recall that Lemma 1 implies that
\[ R(T) = \mathbb{E}\Big[\sum_{t=1}^T \Big(\max_{a \in [K]}\{\theta_a - \pi^\star_a\} - \big(X_{A_t}(t) - \mathbb{1}_{a_t}(A_t)\pi(t)\big)\Big)\Big]\,. \]
We decompose the regret between the $K\lceil\log_2 T\rceil = K N_T$ first steps, during which we run the Binary Search Subroutine, and all the subsequent ones, $R(T) = (A) + (B)$, with
\[ (A) := \mathbb{E}\Big[\sum_{t=1}^{K\lceil\log_2 T\rceil} \Big(\max_{a \in [K]}\{\theta_a - \pi^\star_a\} - \big(X_{A_t}(t) - \mathbb{1}_{a_t}(A_t)\pi(t)\big)\Big)\Big]\,, \qquad (B) := \mathbb{E}\Big[\sum_{t=K\lceil\log_2 T\rceil+1}^{T} \Big(\max_{a \in [K]}\{\theta_a - \pi^\star_a\} - \big(X_{A_t}(t) - \mathbb{1}_{a_t}(A_t)\pi(t)\big)\Big)\Big]\,. \]
We bound these two terms independently.

Since $\pi(t)$ is always equal to $\pi^{\mathrm{mid}}_a(t')$ for some $t' \in [N_T]$, $a \in [K]$, during the binary search phase, we use Lemma 7 to bound $\pi(t)$ by $1$ for any $t \le K\lceil\log_2 T\rceil$ in $(A)$, giving
\[ (A) \le \mathbb{E}\Big[\sum_{t=1}^{K\lceil\log_2 T\rceil} \Big(\max_{a \in [K]}\{\theta_a\} - X_{A_t}(t) + \mathbb{1}_{a_t}(A_t)\pi(t)\Big)\Big] \le \sum_{t=1}^{K\lceil\log_2 T\rceil} \Big(1 + \max_{a \in [K]}\{\theta_a\} - \min_{a \in [K]}\{\theta_a\}\Big) \le \Big(1 + \max_{a \in [K]}\{\theta_a\} - \min_{a \in [K]}\{\theta_a\}\Big) K\big(1 + \lceil\log_2 T\rceil\big)\,. \tag{22} \]

At the end of the binary search phase and for all subsequent rounds $t > K\lceil\log_2 T\rceil$, Alg recommends an action $a^{\mathrm{Rec}}_t$ and the principal proposes the incentive $\pi(t) = \hat{\pi}_{a^{\mathrm{Rec}}_t} = \overline{\pi}_{a^{\mathrm{Rec}}_t}(\lceil\log_2 T\rceil) + 1/T$ on action $a^{\mathrm{Rec}}_t$ to make the agent choose it. Lemma 8 ensures that after $\lceil\log_2 T\rceil$ rounds of binary search on action $a \in [K]$, we have $\underline{\pi}_a(\lceil\log_2 T\rceil) \le \pi^\star_a \le \overline{\pi}_a(\lceil\log_2 T\rceil)$ and $\overline{\pi}_a(\lceil\log_2 T\rceil) - \underline{\pi}_a(\lceil\log_2 T\rceil) \le 1/2^{\lceil\log_2 T\rceil} \le 1/T$. Therefore, $\pi^\star_a < \hat{\pi}_a$ and $\hat{\pi}_a - \pi^\star_a \le 2/T$. For the agent, the utility associated with action $a^{\mathrm{Rec}}_t$ is $s_{a^{\mathrm{Rec}}_t} + \hat{\pi}_{a^{\mathrm{Rec}}_t} > s_{a^{\mathrm{Rec}}_t} + \pi^\star_{a^{\mathrm{Rec}}_t}$, which guarantees that he selects $a^{\mathrm{Rec}}_t$ at time $t$ because of (1) and (4). It ensures that $A_t = a^{\mathrm{Rec}}_t$ for any $t > K\lceil\log_2 T\rceil$.

To compute these recommendations, Alg is fed at any time $t$ with the shifted history defined in (7): $\tilde{H}_t = (a^{\mathrm{Rec}}_s, U_s, X_{a^{\mathrm{Rec}}_s}(s) - \hat{\pi}_{a^{\mathrm{Rec}}_s})_{s \in [K\lceil\log_2 T\rceil + 1,\, t]}$. Recall that we defined the shifted distribution $\rho^T_a$, for any $a \in [K]$, as the distribution of $X_a(1) - \hat{\pi}_a$. For any $t > K\lceil\log_2 T\rceil$, $a^{\mathrm{Rec}}_t = \mathrm{Alg}(U_t, \tilde{H}_{t-1})$, and we define $Y_a(t) \sim \rho^T_a$ for any $t \in [K\lceil\log_2 T\rceil + 1, T]$, $a \in [K]$. In this setup, the regret of Alg after $\tau$ subsequent steps is defined as
\[ R_{\mathrm{Alg}}(\tau, \rho^T) = \tau \max_{a \in [K]} \mathbb{E}_{\rho^T}\big[Y_a(K\lceil\log_2 T\rceil + 1)\big] - \mathbb{E}\Big[\sum_{s=K\lceil\log_2 T\rceil+1}^{K\lceil\log_2 T\rceil+\tau} Y_{\mathrm{Alg}(U_s, \tilde{H}_{s-1})}(s)\Big]\,. \]
Consequently, since $a_t = a^{\mathrm{Rec}}_t = A_t$,
\[ (B) = \mathbb{E}\Big[\sum_{t=K\lceil\log_2 T\rceil+1}^{T} \Big(\max_{a \in [K]}\{\theta_a - \pi^\star_a\} - \big(X_{a^{\mathrm{Rec}}_t}(t) - \hat{\pi}_{a^{\mathrm{Rec}}_t}\big)\Big)\Big] \le \mathbb{E}\Big[\sum_{t=K\lceil\log_2 T\rceil+1}^{T} \Big(\max_{a \in [K]}\{\theta_a - \hat{\pi}_a\} - \big(X_{a^{\mathrm{Rec}}_t}(t) - \hat{\pi}_{a^{\mathrm{Rec}}_t}\big)\Big)\Big] + \sum_{t=K\lceil\log_2 T\rceil+1}^{T} \max_{a \in [K]}\{\hat{\pi}_a - \pi^\star_a\} \]
\[ = (T - K\lceil\log_2 T\rceil)\max_{a \in [K]}\mathbb{E}\big[Y_a(K\lceil\log_2 T\rceil + 1)\big] - \mathbb{E}\Big[\sum_{t=K\lceil\log_2 T\rceil+1}^{T} \big(X_{a^{\mathrm{Rec}}_t}(t) - \hat{\pi}_{a^{\mathrm{Rec}}_t}\big)\Big] + (T - K\lceil\log_2 T\rceil)\max_{a \in [K]}\{\hat{\pi}_a - \pi^\star_a\} \le R_{\mathrm{Alg}}\big(T - K\lceil\log_2 T\rceil, (\rho^T_a)_{a \in [K]}\big) + 2\,. \]
Plugging $(A)$ and $(B)$ together finally gives a bound for the regret
\[ R(T) \le 2 + \Big(1 + \max_{a \in [K]}\{\theta_a\} - \min_{a \in [K]}\{\theta_a\}\Big) K\big(1 + \lceil\log_2 T\rceil\big) + R_{\mathrm{Alg}}\big(T - K\lceil\log_2 T\rceil, \rho^T\big)\,. \tag{23} \]

Proof of Corollary 1. The case $T \le 9K$ is trivial; assume then that $T > 9K$. Note that after $\lceil\log_2 T\rceil$ rounds of binary search, $\pi^\star_a \le \overline{\pi}_a \le \pi^\star_a + 1/T$. We define $\Delta_a := \max_{a' \in [K]}\{\theta_{a'} - \pi^\star_{a'}\} - (\theta_a - \pi^\star_a) = \max_{a' \in [K]}\{\theta_{a'} + s_{a'}\} - (\theta_a + s_a)$ and $\tilde{\Delta}_a := \max_{a' \in [K]}\{\theta_{a'} - \hat{\pi}_{a'}\} - (\theta_a - \hat{\pi}_a) = \max_{a' \in [K]}\{\theta_{a'} - \overline{\pi}_{a'}\} - (\theta_a - \overline{\pi}_a)$, using $\hat{\pi}_a = \overline{\pi}_a + 1/T$. Since $\pi^\star_a \le \overline{\pi}_a \le \pi^\star_a + 1/T$, we have $|\Delta_a - \tilde{\Delta}_a| \le 2/T$.

Using the results on the UCB algorithm that can be found in (Lattimore & Szepesvári, 2020, Theorems 7.1 and 7.2), since Alg is run in a black-box manner on the shifted bandit instance $\rho^T$ for $T - K\lceil\log_2 T\rceil$ rounds with reward gaps $\tilde{\Delta}_a$, we have
\[ R_{\mathrm{Alg}}\big(T - K\lceil\log_2 T\rceil, \rho^T\big) \le 3\sum_{a \in [K]} \tilde{\Delta}_a + 8\min\Big\{\sqrt{(T - K\lceil\log_2 T\rceil)\,K\log(T - K\lceil\log_2 T\rceil)}\;;\; \sum_{a \in [K],\, \tilde{\Delta}_a > 0} \frac{2\log(T - K\lceil\log_2 T\rceil)}{\tilde{\Delta}_a}\Big\} \le 1 + 3\sum_{a \in [K],\, \Delta_a > 0} \Delta_a + 8\min\Big\{\sqrt{TK\log T}\;;\; \sum_{a \in [K],\, \tilde{\Delta}_a > 0} \frac{2\log T}{\tilde{\Delta}_a}\Big\}\,, \tag{24} \]
where the last line holds because $T \ge 9K > 6K$. We now analyse the sum $\sum_{a \in [K],\, \tilde{\Delta}_a > 0} 2\log T / \tilde{\Delta}_a$ and consider two cases: either there exists $a \in [K]$ such that $\Delta_a \le 4/T$, or not.

First case: if there exists $a \in [K]$ such that $\Delta_a \le 4/T$, since $T > 9K$, we have for such an action $a$: $2\log T / \Delta_a \ge T\log T/2 > \sqrt{TK\log T}$, as well as $\tilde{\Delta}_a \le \Delta_a + 2/T \le 6/T$, which implies $2\log T / \tilde{\Delta}_a \ge T\log T/3 \ge \sqrt{TK\log T}$. Consequently,
\[ \min\Big\{\sqrt{TK\log T}\;;\; \sum_{a' \in [K],\, \tilde{\Delta}_{a'} > 0} \frac{2\log T}{\tilde{\Delta}_{a'}}\Big\} = \sqrt{TK\log T} = \min\Big\{\sqrt{TK\log T}\;;\; \sum_{a' \in [K],\, \Delta_{a'} > 0} \frac{2\log T}{\Delta_{a'}}\Big\}\,. \]

Second case: for any $a \in [K]$, $\Delta_a > 4/T$. Therefore $\tilde{\Delta}_a \ge \Delta_a - 2/T > \Delta_a - \Delta_a/2 = \Delta_a/2$.
Consequently RAlg(T K log2 T , ρT ) 1 + 3 X a [K], a>0 a + 8 min TK log T ; X a [K], a>0 a + 8 min TK log T ; X Finally, combining (24) and (25) in (23) completes the proof. D Regret Bound for the Contextual Setting For the whole section, we define ˆaag t := argmaxa At ˆst, a and recall that aag t = argmaxa At s , a . D.1 Technical lemmas Lemma 2. For any T N, the regret of any algorithm on our contextual problem instance can be written as t=1 max a At{ θ + s , a max a At s , a } t=1 ( θ , At κ(t, At)) Proof of Lemma 2. Recall that the regret is defined in (11) as R(T) = PT t=1 µ t E h PT t=1(XAt(t) κ(t, At)) i , where µ t := supκ(t, ): Rd R+{ θ , a κ(t, a)} such that a argmaxa At{ s , a + κ(t, a )}. Note that we can write Incentivized Learning in Bandit Games µ t = supa At,π R+{ θ , a 1 At(a, π)π} where At := {(a, π): s , a + π maxa At s , a + 1a(a )π}. First note that if (a, π) At, then s , a a π for any a At, which implies by definition of the optimal incentives (12) that π (t, a) π. Consequently µ t = max a At{ θ , a π (t, a)} = max a At{ θ , a max a At{ s , a } + s , a } , hence our result about the regret. Lemma 3. For any t [T] and closed subset S B(0, 1) with s S, it holds, for any a At, | maxs S,a At s, a a π (t, a)| 2 diam(S, At) where diam(S, At) := maxa At maxs1,s2 S | s1 s2, a |. Proof of Lemma 3. For any t [T], S B(0, 1) with s S, At B(0, 1) and a At, recall that we defined aag t = argmaxa At s , a and π (t, a) = s , aag t s , a . Consequently, defining for any s S, as t := argmaxa At s, a (the compactness of both S and At as well as the continuity of the applications that we consider guarantee the existence of such an argmax), we have, since s , aag t s , as t for any s S and associated as t, max s S,a At s, a a π (t, a) = max s S max a At{ s, a a s , aag t a } = max s S { s, as t a s , aag t a } max s S { s, as t a s , as t a } max s S | s s , as t | + max s S | s s , a | 2 diam(S, At) . Similarly, we have π (t, a) max s S,a At s, a a s , aag t a max s S s, aag t a max s S | s s, aag t | + max s S | s s, a | 2 diam(S, At) , and the proof follows. Lemma 4. Consider t [T], At B(0, 1), St B(0, 1) such that Et defined in (14) is true. Then for any action a At, we have: π (t, a) < ˆκa(t, a) π (t, a) + 4/T. Proof. The proof is similar to the proof of Lemma 3. Consider t 1, At B(0, 1), St B(0, 1) such that Et holds, at At. Then 1/T > maxa1 t =a2 t At diam St, (a1 t a2 t)/ a1 t a2 t . Recall that we defined aag t = argmaxa At s , a and ˆaag t = argmaxa At ˆst, a , π (t, at) = s , aag t s , at and ˆπ(t, at) = ˆst, ˆaag t ˆst, at + 2/T, giving ˆπ(t, at) ˆst, aag t ˆst, at + 2/T . Therefore, under Et, it holds ˆπ(t, at) π (t, at) ˆst, aag t ˆst, at + 2/T s , aag t + s , at = ˆst s , aag t at + 2/T > diam St, aag t at aag t at aag t at | {z } 2 +2 max a1 t =a2 t At diam St, a1 t a2 t a1 t a2 t Similarly, since s , aag t s , ˆaag t , under Et, we have ˆπ(t, at) π (t, at) = ˆst, ˆaag t at + 2/T s , aag t at ˆst, ˆaag t at s , ˆaag t at + 2/T = ˆst s , ˆaag t at + 2/T ˆaag t a diam St, ˆaag t at ˆaag t at + 2/T 4/T . (27) Combining (26) and (27) with the definition of ˆκ, we obtain the result. Incentivized Learning in Bandit Games Lemma 9. Let t [T] such that Et does not hold. Then Vol(ΠV t St) δ2d/d2d, where δ = 1/16T 2d(d + 1)2 and Vt is defined in (31). Proof. Since Et does not hold, there exists a direction u At B(0, 1) such that diam(St, u) 1/T δ. For any u V t , diam(St, u) δ. Lemma 6.3 of Lobel et al. 
(2018, Section 6) guarantees that ΠV t (St) contains a k-dimensional ball of radius δ/k, where k := dim(V t ). Therefore, Vol(ΠV t (St)) δkπk/2/kkΓ(k/2 + 1), where Γ stands for Euler s Gamma function (Smith & Vamanamurthy, 1989). Therefore, we have by definition of Euler s Gamma function: Vol(ΠV t (St)) δkπk/2/kkkk δ2d/d2d. Lemma 5. Consider Et defined by (14) with (St)t [T ] defined by Contextual IPA. Then it holds almost surely that X t 1 1Ec t 192 d log(d T) , where Et is defined by (14). Proof of Lemma 5. This proof follows the same line as the proof of the main theorem of Lobel et al. (2018, Section 7). Let t [T] such that Et does not hold. We define wt as wt := argmax diam St, a1 t a2 t a1 t a2 t : a1 t a2 t a1 t a2 t such that a1 t = a2 t At where St is defined in (30). Our goal is to bound the number of steps for which the diameter of St in the direction wt is strictly superior to 1/T. Define bt := 1{diam(Cyl(St, V t ), wt) 1/T} , where Vt is defined in (31). Let ΠE denote the orthogonal projection onto the subspace E Rd and Vol(ΠES) = µE(ΠES), where µE is the Lebesgue measure of ΠES, well-defined for ΠES being a convex body. Setting δ = T 2/16d(d + 1)2, we can apply the projected Gr unbaum lemma of Lobel et al. (2018, Lemma 7.1) to obtain that Vol(ΠV t St+1) 1 e 2 bt Vol(ΠV t St) . By definition of Vt in (31), we have for any u Vt, diam(St, u) δ. Therefore, by definition of St+1 in (30), the directional Gr unbaum Theorem (Lobel et al., 2018, Theorem 5.3) guarantees that we have: diam(St+1, u) δ/(d + 1). Note that for St being a convex body, Lemma 11 ensures that St+1 remains a convex body. If Jt+1 = 0, where Jt is defined in (32), then V t+1 = V t , and we have Vol(ΠV t+1St+1) = Vol(ΠV t St+1). Otherwise, let i [Jt+1 1] and v V(i), t such that V(i+1) t = V(i) t {v}. We have V(i), t span(v) = {x V(i), t : v Tx = 0} V(i+1), t V(i), t and dim(V(i+1), t ) = dim(V(i), t ) 1. Then, applying the Cylindrification Lemma from Lobel et al. (2018, Lemma 6.1) we obtain Vol(ΠV(i+1), t St+1) d(d + 1)2 δ Vol(ΠV(i), t St+1) . If Jt+1 = r, the volume can blow up by at most d(d + 1)2/ δ r. In particular, since the initial volume is bounded by Vol(B(0, 1)) 8π2/15 6 (Smith & Vamanamurthy, 1989), then by Lemma 9, we obtain: δ2d/d2d Vol(ΠLt St) 6 d(d + 1)2/ δ d 1 1/e2 PT t=1 bt . Therefore, applying the logarithm function, we obtain t=1 bt 1/ log(1 e 2)(log 6 + 2d log(16) + 5d log(d) + 6d log(d + 1) + 4d log(T)) , giving, since 1/ log(1 e 2) < 16: PT t=1 1{diam(Cyl(St, Lt), wt) 1/T} 192d log(d T). Therefore, lemma 12 ensures that diam(St, wt) diam(Cyl(St, V t ), wt) and we get t=1 1 max s St | s, wt | 1/T 192 d log(d T) , which concludes the proof by definition of wt in (28) and Et in (14). Incentivized Learning in Bandit Games Lemma 6. Consider (Et)t [T ] defined by (14) with (St)t [T ] defined by Contextual IPA. Let IT and (εcorrupt t )t [T ] as defined in (15) and (16) and t [T] such that Et is true. Then |εcorrupt t | 4/T and P t IT |εcorrupt t | := Ccorrupt 4. Proof of Lemma 6. Consider t IT , a At: by (15), Et is true. Consider εcorrupt t as defined in (16) Using Lemma 4 gives |εcorrupt t | 4/T , and summing over all the iterations t IT , since |IT | T, we have P t It |εcorrupt t | 4 1/T |IT | 4. Lemma 10. For any t [T], ε > 0 and action a At, set π ,ε(t, a) := maxaag t At s , aag t s , a + ε, and define the incentive function κ ,ε a (t, a ) = 1a(a )π ,ε(t, a) for any a At. Then At = a, where At is defined by (8). Proof of Lemma 10. 
Note that for any a At, a = a, s , a + κ ,ε a (t, a ) < s , a + κ ,ε a (t, a) and, as a result, At = a. D.2 Proof of Theorem 2 Proof of Theorem 2. Denote for any t [T], apr t := argmaxa At{ θ , a π (t, a)} = argmaxa At θ + s , a , aag t := argmaxa At s , a and ˆaag t = argmaxa At ˆst, a . Note that if Et holds, for any a1 t, a2 t At, a1 t = a2 t, H2 gives that if diam St, a1 t a2 t a1 t a2 t T , then max s St s, a1 t a2 t = max s St s, a1 t a2 t a1 t a2 t a1 t a2 t | {z } 2 and therefore diam St, a1 t a2 t < 2/T. If Et does not hold, the incentive function proposed in Algorithm 5 is given by κ(t, a) = 3 1a1 t (a) + (3 + ˆst, a1 t a2 t ) 1a2 t (a). Therefore: κ(t, a1 t) = 3, κ(t, a2 t) = 3 + ˆst, a1 t a2 t 5. When Et holds, the incentive function proposed in Contextual IPA is defined as κ(t, a) = 1a Rec t (a)ˆπ(t, a Rec t ) 2. For any t [T], we define the instantaneous regret regt at t as regt := µ t (XAt(t) κ(t, At)) , where µ t is defined in (10) and we decompose the regret into two terms, making use of the Cauchy-Schwarz inequality as well as H2 to obtain t=1 1{Et}regt + t=1 1{Ec t }regt t=1 1{Et}regt t=1 1{Ec t }(max a At θ + s , a max a At s , a θ , At + κ(t, At)) t=1 1{Et}regt t=1 1{Ec t }(max a At{ θ , a | {z } 1 + s , a aag t } | {z } 0 θ , At | {z } 1 + κ(t, At)) | {z } 5 t=1 1{Et}regt t=1 1{Ec t } Using Lemma 5, we can bound the second term t=1 1{Ec t } 192 d log(d T) . Now we bound the first term. Working on steps t such that Et is true with incentive function ˆκa Rec t (t, ) = 1a Rec t ( )ˆπ(t, a Rec t ), Lemma 4 guarantees that At = a Rec t , and we have t=1 1{Et}regt Incentivized Learning in Bandit Games max a At{ θ + s , a max a At s , a } θ , a Rec t ˆst, ˆaag t ˆst, a Rec t + 2 t=1 1{Et} θ , apr t + s , apr t s , aag t θ , a Rec t + ˆst, ˆaag t ˆst, a Rec t + t=1 1{Et} 2 T t=1 1{Et} θ , apr t + s , apr t s , aag t θ , a Rec t # t=1 1{Et} s , a Rec t + s , aag t + ˆst, ˆaag t ˆst, a Rec t + s , a Rec t s , aag t # t=1 1{Et} θ , apr t + s , apr t s , aag t θ , a Rec t s , a Rec t + s , aag t # t=1 1{Et} ˆst, ˆaag t + s ˆst, a Rec t s , aag t # Since s , aag t s , ˆaag t , we have s , aag t s , ˆaag t . Therefore t=1 1{Et} ˆst, ˆaag t + s ˆst, a Rec t s , aag t # t=1 1{Et} ˆst, ˆaag t + s ˆst, a Rec t s , ˆaag t # t=1 1{Et} s ˆst, a Rec t s ˆst, ˆaag t # t=1 1{Et} s ˆst, a Rec t ˆaag t # t=1 1{Et}1{a Rec t = ˆaag t } a Rec t ˆaag t | {z } 2 diam St, a Rec t ˆaag t a Rec t ˆaag t and plugging this inequality gives t=1 1{Et}regt t=1 1{Et} θ + s , apr t s , aag t θ + s , a Rec t + s , aag t # t IT max a At{ θ + s , a max a At s , a } t IT θ + s , a Rec t max a At s , a = Rcorrupt Ctx Alg(IT , εcorrupt IT ) + 4 , where Rcorrupt Contextual IPA is defined in (17) with π (t, a) = maxa At s , a s, a . Plugging all the terms together gives the following upper-bound for the regret: R(T) 1344 d log(d T) + 4 + Rcorrupt Ctx Alg(IT , εcorrupt IT ) . Proof of Corollary 2. Here, Lemma 6 guarantees that Ccorrupt = 4. With the setup of He et al. (2022), we take L = 1, S = 1, R = 1, and α = Incentivized Learning in Bandit Games Choose λ = 1 as a regularization parameter and δ = 1/T as a confidence level. Using CW OFUL proposed in He et al. (2022) as a subroutine robust to corruption in the stochastic linear case, since our subroutine is fed with the same reward r as in their model while Rcorrupt Ctx Alg(IT , εcorrupt IT ) is defined compared to the true reward r, we can use the result provided in He et al. 
(2022, Theorem 4.2) to bound Rcorrupt Ctx Alg(IT , εcorrupt IT ) with probability 1 1/T and use the fact that our instantaneous regret is always bounded by 7 as it is shown in the proof of Theorem 2 to get in expectation for some universal constant B > 0 Rcorrupt Ctx Alg(IT , εcorrupt IT ) (1 δ)B T log 1 + T log(1 + T) + 4d T log 1 + T T log T + T 2 + log(T + T 2) + 4d log T + T 2 3 therefore, since Ccorrupt 4, there exists a constant CCtx Alg such that Rcorrupt Ctx Alg(IT , εcorrupt IT ) 7 + CCtx Algd Finally plugging this term in the bound from Theorem 2 and integrating the 3 factor in constant B gives the result R(T) 11 + 1344 d log(d T) + CCtx Algd E Algorithms E.1 UCB subroutine We present the Binary search subroutine and the UCB algorithm that we use as a subroutine in IPA as formulated in Lattimore & Szepesv ari (2020, Algorithm 3). Algorithm 3 Binary Search Subroutine 1: Input: action a, NT 2: Initialize: πa(0), πa(0) = 0, 1 3: for t = 1, . . . , NT do 4: πmid a (t 1) = πa(t 1)+πa(t 1) 2 5: Propose incentive πmid a (t 1) on arm a 6: If At 1 = a then πa(t) = πmid a (t 1) and πa(t) = πa(t 1) else πa(t) = πmid a (t) and πa(t) = πa(t 1) 7: end for 8: Return πa(NT ), πa(NT ) E.2 Projected volume algorithm We present the Projected volume algorithm from Lobel et al. (2018) that we use as a subroutine in Contextual IPA. For any horizon T and 0 < δ < T 2/(16d(d + 1)2), this algorithm defines recursively a sequence (St, Vt)t [T ] such that (St)t [T ] is a sequence of decreasing subsets (for the inclusion) of B(0, 1) including s and (Vt)t [T ] is a sequence of increasing sets of Rd containing orthogonal directions {vi}i [n] along which the principal has a good knowledge of s , vi . The main ingredient in Lobel et al. (2018) allowing low regret is cylindrification. Given a compact convex set S Rd, V = {v1, . . . , vn}, and ΠV (S) being the orthogonal projection of S onto V , we define the cylindrification of S on V as Cyl(S, V) := ΠL(S) + Πspan(v1)(S) + . . . + Πspan(vn)(S) Incentivized Learning in Bandit Games Algorithm 4 UCB Subroutine 1: Input: Set of arms K, horizon T 2: Initialize: For any arm a [K], set ˆµa := 0, Ta := 0 3: for 1 t K: do 4: Pull arm a = t 5: Update ˆµa = Xa(t), Ta(t) = 1 6: end for 7: for t K + 1 do 8: Pull arm amax argmaxa [K] n ˆµa(t 1) + 2 q log T Ta(t 1) o 9: Update Ta(t) = Ta(t 1) + 1, ˆµa(t) = 1 Ta(t)(Ta(t 1)ˆµa(t 1) + Xa(t)) 10: end for i=1 yivi : x ΠL(S), min s S s, vi yi max s S s, vi At iteration t, given (St, Vt), we define the estimate ˆst of s as the centroid of Cyl(St, Vt): ˆst := 1 Vol(Cyl(St, Vt)) Cyl(St,Vt) xdx , (29) Note that the iterative construction of St described below together with Lemma 11 guarantees that St is always a convex body, making Vol(St) well-defined as the Lebesgue measure of St in dimension d and Vol(St) > 0. Combined with Lemma 12, it guarantees that Vol(Cyl(St, Vt)) is well-defined, and Vol(Cyl(St, Vt)) > 0. Therefore, (29) is well-defined. At iteration t, given (St, Vt) and two actions a1 t, a2 t At, recall that we defined wt as wt := argmax diam St, a1 t a2 t a1 t a2 t : a1 t a2 t a1 t a2 t such that a1 t = a2 t At Then, the principal offers the incentive function κ(t, a) = 5 1a1 t (a) + (5 + ˆs, a1 t a2 t ) 1a2 t (a). Recall that κ(t, a) is always bounded by 5 and that a1 t, a2 t are chosen such that ˆst, a1 t a2 t 0, ensuring that we either have At = a1 t or At = a2 t. If At = a1 t, it means that s , wt 0 and we update St+1 = St {s| s, wt xt}. 
Otherwise, if At = a2 t, it means that s , wt 0 and we update St+1 = St {s| s, wt xt}. This defines the subset St+1 such that s St+1: St+1 = St (1a1 t (At){s| s, wt xt} + 1a2 t (At){s| s, wt xt}) (30) Regarding the subspace Vt+1, we consider Vt+1 = VJt+1 t+1 where (Vi t+1)i N is defined as V(0 t+1) := Vt V (i+1) t+1 := ( V (i) t+1 {v} if v (V(i) t+1) : diam(St+1, v) δ V(i) t+1 otherwise (31) Jt+1 := min{i: it does not exist v (V(i) t+1) such that diam(St+1, v) δ} , (32) which exists since if it does not exist i such that it does not exist v (V(i) t+1) such that diam(St+1, v) δ, we would have dim(span(V(i+1) t )) = dim(span(V(i) t )) + 1 for any i N, which would imply a contradiction. In what follows, we provide technical results ensuring that St is a convex body for any t [T] and St Cyl(St, Vt), which implies that Cyl(St, Vt) has non-empty interior. Lemma 11. Let S be a convex body in Rd and s be a point in the interior of S: s S. Let G be the half-space defined by G := {x Rd : h , x 0}, for some h Rd. Suppose that s G. Then S G is a convex body. Incentivized Learning in Bandit Games Proof. Note that we only need to show that S G has non-empty interior since the intersection of two compact convex sets is compact and convex. The only case that we consider is s H where H is the hyperplane defined by H := {x Rd : h , x = 0}. The other case simply follows from the fact that the intersection of two open sets is also open. Since S is a convex body and s S, there exists a ball B(s, r) centered in s with r > 0, such that B(s, r) S. For any x B(s, r) G is equivalent to h , x s 0 since h , s = 0 and x s r. Now we define y0 = s + rh /(2 h ) and consider the ball B(y0, r/2). For any y B(y0, r/2), we can write y = y0 + y with y r/2. Using h , s = 0, we have y s y0 s + y r and h , y = h , y0 + h , y with h , y0 = r/2 and h , y h y 1 r/2 = r/2. Therefore h , y 0 and we obtain y B(s, r) G, which gives B(y0, r/2) B(s, r) G. Lemma 12. Given a convex body S Rd and a set of orthonormal vectors V = {v1, . . . , vn} Rd, let ΠV (S) be the projection of S on the subspace V = {x Rd | x, vi = 0}. Define the cylindrification of S onto V as Cyl(S, V) := ΠV (S) + Πspan(v1)(S) + . . . + Πspan(vn)(S) i=1 yivi|x ΠL(S), min s S s, vi yi max s S s, vi Then it holds that S Cyl(S, V). Proof. Define ΠV as the orthogonal projector on V . Then we have for any s S s = ΠV s + (I ΠV )s , where (I ΠV ) is an orthogonal projector on the space span({v1, . . . , vn}), thus (I ΠV )s = Pn i=1 yivi for yi = s, vi . This decomposition allows us to conclude. Algorithm 5 Projected Volume 1: Input: T, δ, St such that diam(St) δ, Vt, a1 t, a2 t 2: Compute Cyl(St, Vt) and its centroid ˆst, wt = (a1 t a2 t)/ a1 t a2 t , xt = ˆst, wt , δ 0, 1/16d(d + 1)2T 2 3: Propose an incentive function κ(t, a) = 5 1a1 t (a) + (5 + ˆst, a1 t a2 t ) 1a2 t (a) 4: if At = a1 t then 5: St+1 = St {s| s, wt xt} 6: else 7: St+1 = St {s| s, wt xt} 8: end if 9: Let Vt+1 = Vt 10: if v Vt+1 such that diam(St+1, v) δ then 11: add v to Vt 12: Repeat this step as many times as necessary 13: end if 14: Output: St+1, Vt+1. F Lower Bound Proof of Proposition 1. Suppose that the principal was to know (sa)a [K]. For any incentive π(t) offered on action at at round t, the agent selects his action following (1): At = argmaxa [K] sa + 1at(a)π(t). The principal s expected reward is θAt 1at(At)π(t) θAt π At = µa , Incentivized Learning in Bandit Games by definition of π a := maxa [K] sa sa as the infimal amount of incentive to be offered on action a to make the agent choose it. 
Consequently, we have
\[ R(T) = T\mu^\star - \sum_{t=1}^T \mathbb{E}\big[\theta_{A_t} - \mathbb{1}_{a_t}(A_t)\pi(t)\big] \ \ge\ \sum_{t=1}^T \mathbb{E}\big[\mu^\star - (\theta_{A_t} - \pi^\star_{A_t})\big] \ =\ \sum_{t=1}^T \mathbb{E}\big[\mu^\star - \mu_{A_t}\big]\,. \]
Assuming the principal knows $(s_a)_{a \in [K]}$, observing $X_a(t)$ is equivalent to observing $X_a(t) - \pi^\star_a$, so the problem reduces to a standard stochastic bandit instance with shifted reward distributions. Using the result of Burnetas & Katehakis (1996) (see, e.g., Lattimore & Szepesvári, 2020, Theorem 16.2), it then comes
\[ \liminf_{T \to \infty} \frac{R(T)}{\log T} \ \ge\ \sum_{a \in [K]:\, \mu_a < \mu^\star} \frac{\mu^\star - \mu_a}{\mathrm{KL}_{\inf}\big(\nu_a - \pi^\star_a,\, \mu^\star,\, \mathcal{D}\big)}\,, \]
where $\nu_a - \pi^\star_a$ denotes the distribution of $X_a(1) - \pi^\star_a$.
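As an illustration of how this bound reads in a simple case (not part of the paper's statement): if the family $\mathcal{D}$ consists of unit-variance Gaussian distributions, as in the experiments of Appendix B, then the infimum defining $\mathrm{KL}_{\inf}$ is attained in the limit at $\mathcal{N}(\mu^\star, 1)$, and a standard computation gives
\[ \mathrm{KL}_{\inf}\big(\nu_a - \pi^\star_a,\, \mu^\star,\, \mathcal{D}\big) = \mathrm{KL}\big(\mathcal{N}(\mu_a, 1),\, \mathcal{N}(\mu^\star, 1)\big) = \frac{(\mu^\star - \mu_a)^2}{2}\,, \qquad\text{so that}\qquad \liminf_{T \to \infty} \frac{R(T)}{\log T} \ \ge\ \sum_{a \in [K]:\, \mu_a < \mu^\star} \frac{2}{\mu^\star - \mu_a}\,, \]
recovering the familiar $\sum_a 2\log T / \Delta_a$ rate of Gaussian bandits with gaps $\Delta_a = \mu^\star - \mu_a$.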