# generalized_principalagent_problem_with_a_learning_agent__100037c4.pdf

Published as a conference paper at ICLR 2025

GENERALIZED PRINCIPAL-AGENT PROBLEM WITH A LEARNING AGENT

Tao Lin, Yiling Chen John A. Paulson School of Engineering and Applied Sciences Harvard University tlin@g.harvard.edu, yiling@seas.harvard.edu

Generalized principal-agent problems, including Stackelberg games, contract design, and Bayesian persuasion, are a class of economic problems where an agent best responds to a principal s committed strategy. We study repeated generalized principal-agent problems under the assumption that the principal does not have commitment power and the agent uses algorithms to learn to respond to the principal. We reduce this problem to a one-shot generalized principal-agent problem where the agent approximately best responds. Using this reduction, we show that: (1) if the agent uses contextual no-regret learning algorithms with regret Reg(T),

then the principal can guarantee utility at least U Θ q

T , where U is the principal s optimal utility in the classic model with a best-responding agent. (2) If the agent uses contextual no-swap-regret learning algorithms with swap-regret SReg(T), then the principal cannot obtain utility more than U + O( SReg(T)

T ). But (3) if the agent uses mean-based learning algorithms (which can be no-regret but not no-swap-regret), then the principal can sometimes do significantly better than U . These results not only refine previous results in Stackelberg games and contract design, but also lead to new results for Bayesian persuasion with a learning agent and all generalized principal-agent problems where the agent does not have private information.

1 INTRODUCTION

Classic economic models of principal-agent interactions, including auction design, contract design, and Bayesian persuasion, often assume that the agent is able to best respond to the strategy committed by the principal. For example, in Bayesian persuasion, the agent (receiver) needs to compute the posterior belief about the state of the world after receiving some information from the principal (sender) and take an optimal action based on the posterior belief; this requires the receiver accurately knowing the prior of the state as well as the signaling scheme used by the sender. In contract design, where a principal specifies an outcome-dependent payment scheme to incentivize the agent to take certain actions, the agent has to know the action-dependent outcome distribution in order to best respond to the contract. Requiring strong rationality assumptions, the best-responding behavior is often observed to be violated in practice (Camerer, 1998; Benjamin, 2019).

In this work, using Bayesian persuasion as the main example, we study general principal-agent problems under an alternative behavioral model for the agent: learning. The use of learning as a behavioral model dates back to early economic literature on learning in games (Brown, 1951; Fudenberg & Levine, 1998) and has been actively studied by computer scientists in recent years (e.g., Nekipelov et al. (2015); Braverman et al. (2018); Deng et al. (2019); Mansour et al. (2022); Cai et al. (2024); Guruganesh et al. (2024)). A learning agent no longer has perfect knowledge of the parameter of the game or the principal s strategy. Instead of best responding, which is no longer possible or well-defined, the agent chooses his action based on past interactions with the principal. We focus on no-regret learning, which requires the agent to not suffer a large average regret at the end of repeated interactions with the principal, for not taking the optimal action at hindsight. This is a mild requirement satisfied by many natural learning algorithms (e.g., ε-greedy, MWU, UCB, EXP-3) and can reasonably serve as a possible behavioral assumption for real-world agents.

Published as a conference paper at ICLR 2025

Can the principal achieve a better outcome with a learning agent than with a best-responding agent? Previous works (Deng et al., 2019; Guruganesh et al., 2024) have shown that, in Stackelberg games and contract design, the leader/principal can obtain utility U o(1) against a no-regret learning follower/agent, where U is the Stackelberg value, defined to be the principal s optimal utility in the classic model with a best-responding agent. On the other hand, if the agent does a stronger version of no-regret learning, called no-swap-regret learning (Hart & Mas-Colell, 2000; Blum & Mansour, 2007), then the principal cannot obtain utility more than the Stackelberg value U + o(1). Interestingly, the conclusion that no-swap-regret learning can cap the principal s utility at U +o(1) does not hold when the agent has private information, such as in auctions (Braverman et al., 2018) and Bayesian Stackelberg games (Mansour et al., 2022): the principal can sometimes exploit a noswap-regret learning agent with private information to do much better than U in those games.

Three natural questions then arise: (1) What is the largest class of principal-agent problems under which the agent s no-swap-regret learning can cap the principal s utility at the Stackelberg value U + o(1)? (2) In cases where the principal s optimal utility against a learning agent is bounded by [U o(1), U + o(1)], what is the exact magnitude of the o(1) terms? (3) Instead of analyzing games like Stackelberg games and contract design separately, can we analyze all principal-agent problems with learning agents in a unified way?

Our contributions. Our work defines a general model of principal-agent problems with a learning agent, answering all questions (1) - (3). For (1), we show that the principal s utility is bounded around U in all generalized principal-agent problems where the agent does not have private information but the principal can be privately informed. In particular, this includes complete-information games like Stackelberg games and contract design, as well as Bayesian persuasion where the sender/principal privately observes the state of the world.

For (2) and (3), we provide a unified analytical framework to derive tight bounds on the principal s achievable utility against a no-regret or no-swap-regret learning agent in all generalized principalagent problems where the agent does not have private information. Specifically, we explicitly characterize the o(1) difference between the principal s utility and U in terms of the agent s regret.

Result 1 (from Theorems 3.1, 4.1, 4.2). Against a no-regret learning agent with regret Reg(T) in

T periods, the principal can obtain an average utility of at least U O q

Result 2 (from Theorems 3.4, 4.1, 4.2). Against a no-swap-regret learning agent with swap-regret SReg(T) in T periods, the principal cannot obtain average utility larger than U + O SReg(T )

Interestingly, the squared root bound U O q

T in Result 1 and the linear bound U +

T in Result 2 are not symmetric. We show that such an asymmetry is intrinsic: there exist

cases where the principal cannot achieve better than U O q

Result 3 (from Theorem 3.3 and Example 4.1). There is a Bayesian persuasion instance where, for any strategy of the principal, there is a no-swap-regret learning algorithm for the agent under which

the principal s utility is at most U Ω q

T . The same holds for no-regret algorithms.

Results 1, 2, 3 together characterize the range of utility achievable by the principal against a no-

swap-regret learning agent: [U Θ( q

T ), U + O( SReg(T )

T )]. For no-regret but not neces-

sarily no-swap-regret algorithms, the upper bound U + O( Reg(T )

T ) does not hold:

Result 4 (Theorem 3.5). We construct a Bayesian persuasion instance where, against a no-regret but not no-swap-regret learning agent (in particular, mean-based learning agent), the principal can do significantly better than the Stackelberg value U .

In summary, our Results 1, 2, 3 not only refine previous works on playing against learning agents in specific games by characterizing the principal s utility exactly, but also generalize to all principalagent problems where the agent does not have private information. In particular, when applied to Bayesian persuasion, our results imply that the sender cannot exploit a no-swap-regret learning receiver even if the sender possesses informational advantage over the receiver.

Published as a conference paper at ICLR 2025

Some intuition. What is the intuition behind the asymmetry between the worst-case utility

T ) and the best-case utility U + O( SReg(T )

T ) that the principal can obtain against a no-swap-regret learning agent? At a high level, a no-swap-regret learning agent is approximately best responding to the principal s average strategy over all T periods, with the degree of approximate best response measured by the average regret SReg(T )

T = δ. However, because no-swap-regret learning algorithms are randomized, they correspond to randomized approximately best responding strategies of the agent that are worse than the best responding strategy by a margin of δ in expectation. This means that the agent might take

δ-sub-optimal actions with probability

δ, which can cause a loss of 1 to the principal s utility with probability

δ. So, the principal s expected utility can

be decreased to U Ω(

δ) = U Ω( q

T ) in the worst case. On the other hand, when considering the principal s best-case utility, we care about the δ-approximately-best-responding strategy of the agent that maximizes the principal s utility. That strategy turns out to be equivalent to a deterministic strategy that gives the principal a utility of at most U + O(δ) = U + O( SReg(T )

T ). This explains the asymmetry between the worst-case and best-case bounds.

2 GENERALIZED PRINCIPAL-AGENT PROBLEM WITH A LEARNING AGENT

This section defines our model, generalized principal-agent problem with a learning agent. This model includes Stackelberg games, contract design, and Bayesian persuasion with learning agents.

2.1 GENERALIZED PRINCIPAL-AGENT PROBLEM

Generalized principal-agent problem, proposed by Myerson (1982); Gan et al. (2024), is a general model that includes auction design, contract design, Stackelberg games, and Bayesian persuasion. While Myerson (1982) and Gan et al. (2024) allow the agent to have private information, our model assumes an agent with no private information. There are two players in a generalized principal-agent problem: a principal and an agent. The principal has a convex, compact decision space X and the agent has a finite action set A. The principal and the agent have utility functions u, v : X A R. We assume that u(x, a), v(x, a) are linear in x X, which is satisfied by all the examples of generalized principal-agent problems we will consider (Bayesian persuasion, Stackelberg games, contract design). There is a signal/message set S. Signals are usually interpreted as recommendations of actions for the agent, where S = A, but we allow any signal set of size |S| |A|. A strategy of the principal is a distribution π (X S) over pairs of decision and signal. When the utility functions u, v are linear, it is without loss of generality to assume that the principal does not randomize over multiple decisions for one signal (Gan et al., 2024), namely, the principal chooses a distribution over signals and a unique decision xs associated with each signal s S. So, we can write a principal strategy as π = {(πs, xs)}s S where πs 0 is the probability of signal s S, P

s S πs = 1, and xs X. There are two variants of generalized principal-agent problems:

Unconstrained (Myerson, 1982): there is no restriction on the principal s strategy π. Constrained (Gan et al., 2024): the principal s strategy π has to satisfy constraint P

s S πsxs C where C X is some convex set.

Unconstrained generalized principal-agent problems include contract design and Stackelberg games. Constrained generalized principal-agent problems include Bayesian persuasion (see Section 2.3).

In a one-shot generalized principal-agent problem where the principal has commitment power, the principal first commits to a strategy π = {(πs, xs)}s S, then nature draws a signal s S according to the distribution {πs}s S and sends s to the agent (note: due to the commitment assumption, this is equivalent to revealing the pair (s, xs) to the agent), then the agent takes an action as arg maxa A v(xs, a) that maximizes its utility, and the principal obtains utility u(xs, as). The principal aims to maximize its expected utility Es π[u(xs, as)] by choosing the strategy π.

2.2 LEARNING AGENT

Now we define the model of generalized principal-agent problem with a learning agent. The game is repeated for T rounds. Unlike the static model above, the principal here does not commit. The agent

Published as a conference paper at ICLR 2025

does not know the strategy πt or the decision xt of the principal at each round. Instead, the agent uses some adaptive algorithm to learn which action to take in response to each possible signal.

Generalized Principal-Agent Problem with a Learning Agent

In each round t = 1, . . . , T:

(1) Using some algorithm that learns from history (including signals, actions, and utility feedback in the past, described in details later), the agent chooses a strategy ρt : S (A) that maps each possible signal s S to a distribution over actions ρt(s) (A).

(2) The principal chooses a strategy πt = {(πt s, xt s)}s S, which is a distribution over signals S and a decision xt s X associated with each signal.

(3) Nature draws signal st πt and reveals it. The principal makes decision xt = xt st. The agent draws action at ρt(st).

(4) The principal and the agent obtain utility ut = u(xt, at) and vt = v(xt, at). The agent observes some feedback (e.g., vt(xt, at) or xt).

We assume that the principal knows the utility functions u and v, has some knowledge about the agent s learning algorithm, and aims to maximize the expected average utility 1

T E PT t=1 u(xt, at) .

Agent s learning problem. The agent s learning problem can be regarded as a contextual multiarmed bandit problem (Tyler Lu et al., 2010) where A is the set of arms, and a signal st S serves as a context that affects the utility of each arm a A. The agent picks an arm to pull based on the current context st and the historical information about each arm under different contexts, adjusting its strategy over time based on the feedback collected after each round.

What feedback can the agent observe after each round? One may assume that the agent sees the principal s decision xt after each round (this is call full-information feedback in the multi-armed bandit literature), or the utility vt = v(xt, at) obtained in that round (this is called bandit feedback), or some unbiased estimate of v(xt, at). We do not make specific assumptions on the feedback. All we need is that the feedback is sufficient for the agent to achieve contextual no-regret or contextual no-swap-regret, which are defined below: Definition 2.1. The agent s learning algorithm is said to satisfy:

contextual no-regret if: there is a function CReg(T) = o(T) such that, for any strategy of the principal, for any deviation function d : S A, the regret of the agent not deviating according to d is at most CReg(T): E PT t=1 v(xt, d(st)) v(xt, at) CReg(T).

contextual no-swap-regret if: there is a function CSReg(T) = o(T) such that, for any strategy of the principal, for any deviation function d : S A A, the regret of the receiver not deviating according to d is at most CSReg(T): E PT t=1 v(xt, d(st, at)) v(xt, at) CSReg(T).

We call CReg(T) and CSReg(T) the contextual regret and contextual swap-regret of the agent.

Contextual no-regret is implied by contextual no-swap-regret because the latter has a larger set of deviation functions. Contextual no-(swap-)regret algorithms with O(|A| p

|S|T) contextual (swap- )regret are known to exist under bandit feedback. In fact, they can be easily constructed by running an ordinary no-(swap-)regret algorithm for each context independently. See Appendix B for details.

2.3 SPECIAL CASE: BAYESIAN PERSUASION WITH A LEARNING AGENT

We show that Bayesian persuasion (Kamenica & Gentzkow, 2011) is a special case of constrained generalized principal-agent problems. We will also show that Bayesian persuasion is in fact equivalent to cheap talk (Crawford & Sobel, 1982) under our learning agent model.

Bayesian persuasion as a generalized principal-agent problem. There are two players in Bayesian persuasion: a sender (principal) and a receiver (agent). There are a finite set Ωof states of the world, a signal set S, an action set A, a prior distribution µ0 (Ω) over the states, and utility functions u, v : Ω A R for the sender and the receiver. When the state is ω Ωand the receiver takes action a A, the sender and the receiver obtain utility u(ω, a), v(ω, a), respectively.

Published as a conference paper at ICLR 2025

Both players know µ0, but only the sender has access to the realized state ω µ0. The sender commits to some signaling scheme π : Ω (S), mapping any state to a probability distribution over signals, to partially reveal information about the state w to the receiver. In the classic model, after receiving a signal s S, the receiver will form the posterior belief µs (Ω) about the state: µs(ω) = µ0(ω)π(s|ω)

πs , where πs = P

ω Ωµ0(ω)π(s|ω) is the total probability that signal s is sent, and take an optimal action with respect to µs, i.e., as arg maxa A P

ω Ωµs(ω)v(ω, a). The sender aims to find a signaling scheme to maximizde its expected utility E[u(ω, as)].

It is well-known (Kamenica & Gentzkow, 2011) that a signaling scheme π : Ω (S) decomposes the prior µ0 into a distribution over posteriors whose average is equal to the prior µ0: X

s S πsµs = µ0 {µ0} =: C, X

s S πs = 1. (1)

Equation (1) is called the Bayes plausibility condition. Conversely, any distribution over posteriors {(ps, µs)}s S satisfying Bayes plausibility P

s S psµs = µ0 can be converted into a signaling scheme that sends signal s with probability ps. Thus, we can use a distribution over posteriors {(πs, µs)}s S satisfying Bayes plausibility to represent a signaling scheme. Then, let s equate the posterior belief µs in Bayesian persuasion to the principal s decision xs in the generalized principalagent problem, so the principal/sender s decision space becomes X = (Ω). The Bayes plausibility condition (1) becomes the constraint in the constrained generalized principal-agent problem. When the agent/receiver takes action a, the principal/sender s (expected) utility under decision/posterior xs = µs is u(xs, a) = Eω µsu(ω, a) = P ω Ωµs(ω)u(ω, a). Suppose the agent takes action as given signal s S. Then we see that the sender s utility of using signaling scheme π in Bayesian persuasion (left) is equal to the principal s utility of using strategy π in the generalized principalagent problem (right): X

ω Ω µ0(ω) X

s S π(s|ω)u(ω, as) = X

ω Ω µs(ω)u(ω, as) = X

s S πsu(xs, as) = Es π[u(xs, a)].

Similarly, the agent/receiver s utilities in the two problems are equal. The utility functions u(x, a), v(x, a) are linear in the principal s decision x X, satisfying our assumption.

Persuasion (or cheap talk) with a learning agent When specialized to Bayesian persuasion, the generalized principal-agent problem with a learning agent becomes the following:

Persuasion (or Cheap Talk) with a Learning Receiver

In each round t = 1, . . . , T, the following events happen:

(1) Using some algorithm that learns from history, the receiver chooses a strategy ρt : S (A) that maps each signal s S to a distribution over actions ρt(s) (A).

(2) The sender chooses a signaling scheme πt : Ω (S).

(3) A state of the world ωt µ0 is realized, observed by the sender but not the receiver. The sender sends signal st πt(ωt) to the receiver. The receiver draws action at ρt(s).

(4) The sender obtains utility ut = u(ωt, at) and the receiver obtains utility vt = v(ωt, at).a

a The definition of utility here, u(ωt, at), v(ωt, at), is different from the definition in the general model, which was the expected utility on decision/posterior xt, u(xt, at), v(xt, at). Because we only care about the sender s expected utility and the receiver s expected regret, this difference does not matter.

In the above model, the receiver does not need to know the prior µ0 or the sender s signaling scheme because his multi-armed bandit learning algorithm does not need such information. This makes the receiver s and the sender s decisions simultaneous, which corresponds to the cheap talk model (Crawford & Sobel, 1982) where the sender does commit to the signaling scheme. So, our persuasion with a learning receiver model is equivalent to cheap talk with a learning receiver .

3 REDUCTION FROM LEARNING TO APPROXIMATE BEST RESPONSE

In this section, we reduce the generalized principal-agent problem with a learning agent to the problem with an approximately-best-responding agent. We show that, if the agent uses contextual no-regret learning algorithms, then the principal can obtain an average utility that is at least the

Published as a conference paper at ICLR 2025

maxmin approximate-best-response objective OBJR CReg(T)/T (to be defined below). On the other hand, if the agent does contextual no-swap-regret learning, then the principal cannot do better than the maxmax approximate-best-response objective OBJ R CSReg(T)/T . In addition, if the agent uses some learning algorithms that are no-regret but not no-swap-regret, the principal can sometimes do better than the maxmax objective OBJ R CSReg(T)/T .

3.1 GENERALIZED PRINCIPAL-AGENT PROBLEM WITH APPROXIMATE BEST RESPONSE

We first define the generalized principal-agent problem with an approximately-best-responding agent. The classic generalized principal-agent problem (Section 2.1) assumes that, after receiving a signal s S (and observing the principal s decision xs X), the agent will take an optimal action with respect to xs. This means that the agent uses a strategy ρ that best responds to the principal s strategy π: ρ (s) arg max a A v(xs, a), s S = ρ arg max ρ:S (A) V (π, ρ). (2)

Here, V (π, ρ) = P

a A ρ(a|s)v(xs, a) denotes the expected utility of the agent when the principal uses strategy π and the agent uses (randomized) strategy ρ : S (A).

Here, we allow the agent to approximately best respond. Let δ 0 be a parameter. We define two types of δ-best-responding strategies for the agent: deterministic and randomized.

A deterministic strategy ρ: for each signal s S, the agent takes an action a that is δ-optimal for xs. Denote this set of strategies by Dδ(π):

Dδ(π) = ρ : S A | v(xs, ρ(s)) v(xs, a ) δ, a A . (3)

A randomized strategy ρ: for each signals s, the agent can take a randomized action. The expected utility of ρ is at most δ-worst than the best strategy ρ . Rδ(π) = ρ : S (A) | V (π, ρ) V (π, ρ ) δ . (4)

Equivalently, Rδ(π) = ρ : S (A) | V (π, ρ) V (π, ρ ) δ, ρ : S A .

Our model of approximately-best-responding agent includes, for example, two other models in the Bayesian persuasion literature that also relax the agent s Bayesian rationality assumption: the quantal response model (proposed by (Mc Kelvey & Palfrey, 1995) in normal-form games and studied by (Feng et al., 2024) in Bayesian persuasion) and a model where the agent makes mistakes in Bayesian update (de Clippel & Zhang, 2022). See Appendix C for details.

Principal s objectives. With an approximately-best-responding agent, we will study two types of objectives for the principal. The first type is the maximal utility that the principal can obtain if the agent approximately best responds in the worst way for the principal: for X {D, R}, define

OBJX(δ) = sup π min ρ Xδ(π) U(π, ρ), (5)

where U(π, ρ) = P

a A ρ(a|s)u(xs, a) is the principal s expected utility when the principal uses strategy π and the agent uses strategy ρ. We used sup in (5) because the maximizer does not necessarily exist. OBJX(δ) is a maxmin objective and can be regarded as the objective of a robust generalized principal-agent problem .

The second type of objectives is the maximal utility that the principal can obtain if the agent approximately best responds in the best way:

OBJ X(δ) = max π max ρ Xδ(π) U(π, ρ). (6)

This is a maxmax objective that quantifies the maximal extent to which the principal can exploit the agent s irrational behavior.

Clearly, OBJX(δ) OBJX(0) OBJ X(0) OBJ X(δ). And we note that OBJ X(0) = OBJ(0) is independent of X and equal to the optimal utility of the principal in the classic generalized principal-agent problem, which we denote by U : U = OBJ(0) = max π max ρ: best-response to π U(π, ρ). (7)

Published as a conference paper at ICLR 2025

Finally, we note that, because D0(π) Dδ(π) Rδ(π), the chain of inequalities OBJR(δ) OBJD(δ) U OBJ D(δ) OBJ R(δ) hold.

3.2 AGENT S NO-REGRET LEARNING: LOWER BOUND ON PRINCIPAL S UTILITY

Theorem 3.1. Suppose the agent uses a contextual no-regret learning algorithm with a contextual regret upper bounded by CReg(T). The principal knows CReg(T) but not the exact learning algorithm of the agent. By using some fixed strategy πt = π in all T rounds, the principal can obtain an average utility 1

T E PT t=1 u(xt, at) that is arbitrarily close to OBJR CReg(T )

To prove Theorem 3.1, we provide a lemma (with proof in Appendix E.1) to relate the agent s regret and the principal s utility in the learning model to those in the static model. We define some notations. Let the principal use some fixed strategy πt = π and the agent use some learning algorithm. Let pt a|s = Pr[at = a | st = s] be the probability that the agent s algorithm chooses action a conditioning on signal s being sent in round t. Let ρ : S (A) be a randomized agent strategy that,

given signal s, chooses each action a A with probability ρ(a|s) =

PT t=1 pt a|s T .

Lemma 3.2. When the principal uses a fixed strategy πt = π in all T rounds, the regret of the agent not deviating according to d : S A is equal to 1

T E PT t=1 v(xt, d(st)) v(xt, at) = V (π, d) V (π, ρ), and the average utility of the principal 1

T E PT t=1 u(xt, at) is equal to U(π, ρ).

Proof of Theorem 3.1. By Lemma 3.2 and the no-regret condition that the agent s regret E PT t=1 v(xt, d(st)) v(xt, at) CReg(T), we have

V (π, d) V (π, ρ) = 1

v(xt, d(st)) v(xt, at) i CReg(T)

T , d : S A.

This means that the agent s randomized strategy ρ is a δ = CReg(T )

T -best-response to the principal s fixed signaling scheme π, ρ Rδ= CReg(T )

T (π). This holds for any π. In particular, if for any ε > 0

the principal uses a signaling scheme πε that obtains an objective that is ε-close to OBJR(δ) = supπ minρ Rδ(π) U(π, ρ), then the principal obtains an expected utility of, by Lemma 3.2,

1 T E h T X

t=1 u(at, ωt) i = U(πε, ρ) min ρ Rδ(πε) U(πε, ρ) OBJR δ = CReg(T)

in the learning model. Letting ε 0 proves the theorem.

We then show that the result in Theorem 3.1 is tight: there exist cases where the principal cannot do better than OBJR CReg(T )

2T ) even using adaptive strategies (see Appendix E.2 for the proof): Theorem 3.3. For any adaptive strategy of the principal, there exists a contextual no-regret learning algorithm for the agent under which the principal s average utility is no more than OBJR CReg(T )

2T ). There also exists a contextual no-swap-regret learning algorithm for the agent under which the principal s average utility is no more than OBJR CSReg(T )

3.3 AGENT S NO-SWAP-REGRET LEARNING: UPPER BOUND ON PRINCIPAL S UTILITY

Theorem 3.4. Against a contextual no-swap-regret learning agent, the principal cannot obtain utility more than 1

T E PT t=1 u(xt, at) OBJ R CSReg(T )

T even using adaptive strategies.

The key idea to prove this theorem is to think of the signal st πt from the principal and the action at ρt(st) recommended by the agent s learning algorithm together as a joint signal (st, at) from some hypothetical signaling scheme π . The agent takes the recommended action at, namely using the mapping (st, at) 7 at as his strategy, in response to π . A no-swap-regret algorithm guarantees that the agent is at most CSReg(T )

T worse compared to the best-responding strategy d : S A A. So, the agent s overall strategy is a CSReg(T )

T -approximate best response to π , which limits the

principal s overall utility to be at most OBJ R CSReg(T )

T . See details in Apendix E.3.

Published as a conference paper at ICLR 2025

3.4 AGENT S MEAN-BASED LEARNING: EXPLOITABLE BY THE PRINCIPAL

Many no-regret (but not no-swap-regret) learning algorithms (e.g., MWU, FTPL, EXP-3) satisfy the following contextual mean-based property:

Definition 3.1 (Braverman et al. (2018)). Let σt s(a) = P

j [t]:sj=s v(ωj, a) be the sum of historical utilities of the receiver in the first t rounds if he takes action a when the signal/context is s. An algorithm is called γ-mean-based if: whenever a such that σt 1 s (a) < σt 1 s (a ) γT, the probability that the algorithm chooses action a at round t if the context is s is Pr[at = a|st = s] < γ, with γ = o(1).

Theorem 3.5. There exists a Bayesian persuasion instance where, as long as the receiver does γ-mean-based learning, the sender can obtain a utility significantly larger than OBJ R(γ) and U .

The proof of this theorem is in Appendix E.4.

4 GENERALIZED PRINCIPAL-AGENT PROBLEMS WITH APPROXIMATE BEST RESPONSE

After presenting the reduction from learning to approximate best response, we now study generalized principal-agent problems with approximate best response. We will show that both the maxmin objectives OBJD(δ), OBJR(δ) and the maxmax objectives OBJ D(δ), OBJ R(δ) are close to the optimal principal objective U in the best-response model when the degree δ of the agent s approximate best response is small, under some natural assumptions described below.

Assumptions and notations. We make some innocuous assumptions. First, the agent has no weakly dominated action:

Assumption 1 (No Dominated Action). An action a0 A of the agent is weakly dominated if there exists a mixed action α (A \ {a0}) such that v(x, α ) = Ea α [v(x, a)] v(x, a0) for all x X. We assume that the agent has no weakly dominated action.

Claim 1. Assumption 1 implies: there exists a constant G > 0 such that, for any agent action a A, there exists a principal decision x X such that v(x, a) v(x, a ) G for every a A \ {a}.

The proof of this claim is in Appendix F.1. The constant G > 0 in Claim 1 is analogous to the concept of inducibility gap in Stackelberg games (Von Stengel & Zamir, 2004; Gan et al., 2023). In fact, Gan et al. (2023) show that, if the inducibility gap G > δ, then the maximin approximate-bestresponse objective satisfies OBJD(δ) U δ

G in Stackelberg games. Our results will significantly generalize theirs to any generalized principal-agent problem, to randomized agent strategies, and to the maximax objectives OBJ D(δ), OBJ R(δ).

To present our results, we need to introduce a few more notions and assumptions. Let diam(X; ) = maxx1,x2 X x1 x2 be the diameter of the space X, where is some norm. For convenience we assume X Rd and use the ℓ1-norm x 1 = Pd i=1 |x(i)| or the ℓ -norm x = maxd i=1 |x(i)|. For a generalized principal-agent problem with constraint P

s S πsxs C, let X be the boundary of X and let dist(C, X) = minc C,x X c x be the distance from C to the boundary of X. We assume that C is away from the boundary of X:

Assumption 2 (C is in the interior of X). dist(C, X) > 0.

Assumption 3 (Bounded and Lipschitz utility). The principal s utility function is bounded: |u(x, a)| B, and L-Lipschitz in x X: |u(x1, a) u(x2, a)| L x1 x2 .

Main results. We now present the main results of this section: lower bounds on OBJX(δ) and upper bounds on OBJ X(δ) in generalized principal-agent problems without and with constraints.

Theorem 4.1 (Without constraint). For an unconstrained generalized principal-agent problem, under Assumptions 1 and 3, for 0 δ < G, we have

OBJD(δ) U diam(X)L δ

Published as a conference paper at ICLR 2025

OBJR(δ) U 2 q

G diam(X)δ for δ < diam(X)GL

OBJ D(δ) OBJ R(δ) U + diam(X)L δ

G. Theorem 4.2 (With constraint). For a generalized principal-agent problem with the constraint P

s S πsxs C, under Assumptions 1, 2 and 3, for 0 δ < Gdist(C, X)

diam(X) , we have

OBJD(δ) U diam(X)L + 2B diam(X)

dist(C, X) δ

OBJR(δ) U 2 q

G diam(X)L + 2B diam(X)

dist(C, X) δ.

OBJ D(δ) OBJ R(δ) U + diam(X)L + 2B diam(X)

dist(C, X) δ

The expression diam(X)

dist(C, X)δ suggests that 1 dist(C, X) is similar to a condition number (Renegar, 1994) that quantifies the stability of the principal-agent problem against the agent s approximatebest-responding behavior. When dist(C, X) is larger (C is further away from the boundary of X), the condition number is smaller, the problem is more stable, and the δ-best-response objectives OBJX(δ) and OBJ X(δ) are closer to the best-response objective U .

High-level idea: perturbation. The high level idea to prove Theorems 4.1 and 4.2 is a perturbation argument. Consider proving the upper bounds on OBJ D(δ) for example. Let (π, ρ) be any pair of principal s strategy and agent s δ-best-responding strategy. We perturb the principal s strategy π slightly to be a strategy π to which ρ is exactly best-responding (such a perturbation is possible due to Assumption 1). Since ρ is best-responding to π , the pair (π , ρ) cannot give the principal a higher utility than U (which is the optimal principal utility under the best-response model). This means that the original pair (π, ρ) cannot give the principal a utility much higher than U , thus implying an upper bound on OBJ D(δ). Extra care is needed when dealing with randomized strategies of the agent. See details in Appendix F.3.

The bound OBJR(δ) U O(

δ) is tight. We note that, in Theorems 4.1 and 4.2, the maxmin objective with randomized agent strategies is bounded by OBJR(δ) U O(

δ), while the objective with deterministic agent strategies is bounded by OBJD(δ) U O(δ). This is not because our analysis is not tight. In fact, the squared root bound U Θ(

δ) for OBJR(δ) is tight. We prove this by giving an example where OBJR(δ) U Ω(

δ). Consider the following classical Bayesian persuasion example: Example 4.1. There are 2 states Ω= {Good, Bad}, 2 actions A = {a, b}, with utility matrices

sender a b receiver a b Good 1 0 Good 1 0 Bad 1 0 Bad 1 0

The prior probability of Good state is µ0 < 1

2, so the receiver takes action b by default. In this example, for δ < µ0

2 , OBJR(δ) U 2 2µ0δ +δ = U Ω(

δ). See Appendix F.2 for a proof.

5 APPLICATIONS TO SPECIFIC PRINCIPAL-AGENT PROBLEMS

We apply the general results in Section 3 and 4 to derive concrete results for three specific principalagent problems: Bayesian persuasion, Stackelberg games, and contract design.

Bayesian persuasion. As noted in Section 2, Bayesian persuasion is a generalized principal-agent problem with constraint P

s S πsxs C = {µ0} where each xs = µs = (µs(ω))ω Ω X = (Ω) is a posterior belief. Suppose the principal s utility is bounded: |u(ω, a)| B. Then, the principal s utility function u(µs, a) = P

ω Ωµs(ω)u(ω, a) is (L = B)-Lipschitz in µs (under ℓ1norm), so Assumption 3 is satisfied. Suppose the prior µ0 has positive probability for every ω Ω, and let p0 = minω Ωµ0(ω) > 0. Then, we have the distance

dist(C, X) = min µ0 µ 1 : µ (Ω) s.t. µ(ω) = 0 for some ω Ω p0 > 0,

Published as a conference paper at ICLR 2025

so Assumption 2 is satisfied. The diameter satisfies diam(X; ℓ1) = maxµ1,µ2 (Ω) µ1 µ2 1 2. Finally, we assume Assumption 1 (no dominated action for the agent). Then, Theorem 4.2 gives bounds on the approximate-best-response objectives in Bayesian persuasion:

Corollary 5.1 (Bayesian persuasion with approximate best response). For 0 δ < Gp0

OBJD(δ) U 2B(1 + 2

G, and OBJR(δ) U 4B q

OBJ D(δ) OBJ R(δ) U + 2B(1 + 2

Further applying Theorem 3.1 and 3.4, we obtain the central result for our motivating problem, persuasion with a learning agent:

Corollary 5.2 (Persuasion with a learning agent). Suppose T is sufficiently large such that CReg(T )

2 and CSReg(T )

with a contextual no-regret learning agent, the principal can obtain utility at least

t=1 u(xt, at) OBJR CReg(T )

using a fixed signaling schemes in all rounds.

with a contextual no-swap-regret learning agent, the principal s obtainable utility is at most

t=1 u(xt, at) OBJ D CSReg(T )

T U + 2B(1 + 2

G CSReg(T )

even using adaptive signaling schemes.

Result (9) is particularly interesting. First, it shows that the principal cannot exploit a no-swapregret learning agent beyond U + o(1) even if the principal has informational advantage (knowing the state ω). Second, this result still holds even if we allow the principal to see the agent s strategy ρt before choosing the signaling scheme πt. The principal still cannot exploit the agent in this case.

Stackelberg games and contract design. When applied to Stackelberg games and contract de-

sign, our results show that the principal can obtain U O( q

T ) against a no-regret agent and

no more than U + O( CSReg(T )

T ) against a no-swap-regret agent. These results refine the U + o(1) and U o(1) bounds in Deng et al. (2019); Guruganesh et al. (2024). See Appendix D for details. This demonstrates the generality and usefulness of our framework.

6 DISCUSSION

In summary, our work provides an explicit characterization of the principal s achievable utility in generalized principal-agent problems with a contextual no-swap-regret learning agent. It is an asym-

metric range U O( q

T ), U + O( CSReg(T )

T ) . We show that this conclusion holds in all generalized principal-agent problems where the agent does not have private information, in particular including Bayesian persuasion where the principal is privately informed. As we mentioned in the Introduction, the upper bound U + O( CSReg(T )

T ) does not hold when the agent has private information or does certain types of no-regret but not no-swap-regret learning. Deriving the exact upper bound in the latter cases is an interesting direction for future work.

Other directions for future work include, for example, relaxing the assumption that the principal has perfect knowledge of the environment what if both principal and agent are learning players? And what if the environment is non-stationary, like a Markovian environment (Jain & Perchet, 2024) or an adversarial dynamic environment (Camara et al., 2020)? In unknown or non-stationary environments, the benchmark U needs to be redefined, and a joint design of both players learning algorithms might be interesting.

Published as a conference paper at ICLR 2025

ACKNOWLEDGMENTS

Yiling is supported by the National Science Foundation under grant no. IIS-2147187 and by Amazon. Tao is supported by a Siebel Ph D scholarship.

Published as a conference paper at ICLR 2025

Jerry Anunrojwong, Krishnamurthy Iyer, and David Lingenbrink. Persuading Risk-Conscious Agents: A Geometric Approach. Operations Research, pp. opre.2023.2438, March 2023. URL http://pubsonline.informs.org/doi/10.1287/opre.2023.2438.

Eshwar Ram Arunachaleswaran, Natalie Collina, and Jon Schneider. Pareto-Optimal Algorithms for Learning in Games. In Proceedings of the 25th ACM Conference on Economics and Computation, pp. 490 510, New Haven CT USA, July 2024. ACM. URL https://dl.acm.org/doi/ 10.1145/3670865.3673517.

Jean-Yves Audibert and S ebastien Bubeck. Regret Bounds and Minimax Policies under Partial Monitoring. Journal of Machine Learning Research, pp. 2785 2836, 2010. URL http:// jmlr.org/papers/v11/audibert10a.html.

Peter Auer, Nicol o Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, pp. 48 77, January 2002. URL http://epubs.siam.org/doi/10.1137/S0097539701398375.

Yakov Babichenko, Inbal Talgam-Cohen, Haifeng Xu, and Konstantin Zabarnyi. Regret-Minimizing Bayesian Persuasion. In Proceedings of the 22nd ACM Conference on Economics and Computation, pp. 128 128, Budapest Hungary, July 2021. ACM. URL https://dl.acm.org/doi/ 10.1145/3465456.3467574.

Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Markov persuasion processes: Learning to persuade from scratch. ar Xiv preprint ar Xiv:2402.03077, 2024.

Daniel J Benjamin. Errors in probabilistic reasoning and judgment biases. Handbook of Behavioral Economics: Applications and Foundations 1, 2:69 186, 2019.

Avrim Blum and Yishay Mansour. From External to Internal Regret. Journal of Machine Learning Research, pp. 1307 1324, 2007. URL http://jmlr.org/papers/v8/blum07a.html.

Mark Braverman, Jieming Mao, Jon Schneider, and Matt Weinberg. Selling to a No-Regret Buyer. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 523 538, Ithaca NY USA, June 2018. ACM. URL https://dl.acm.org/doi/10.1145/3219166. 3219233.

George W. Brown. Iterative Solution of Games by Fictitious Play. In Activity Analysis of Production and Allocation. Wiley, New York, 1951.

Linda Cai, S. Matthew Weinberg, Evan Wildenhain, and Shirley Zhang. Selling to Multiple No-Regret Buyers. In Web and Internet Economics, pp. 113 129, Cham, 2024. Springer Nature Switzerland. URL https://link.springer.com/10.1007/ 978-3-031-48974-7_7.

Modibo K. Camara, Jason D. Hartline, and Aleck Johnsen. Mechanisms for a No-Regret Agent: Beyond the Common Prior. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 259 270, Durham, NC, USA, November 2020. IEEE. URL https:// ieeexplore.ieee.org/document/9317992/.

Colin Camerer. Bounded rationality in individual decision making. Experimental economics, 1: 163 183, 1998.

Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online Bayesian Persuasion. In Advances in Neural Information Processing Systems, pp. 16188 16198. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ ba5451d3c91a0f982f103cdbe249bc78-Paper.pdf.

Vincent P. Crawford and Joel Sobel. Strategic Information Transmission. Econometrica, pp. 1431, November 1982. URL https://www.jstor.org/stable/1913390?origin= crossref.

Published as a conference paper at ICLR 2025

Geoffroy de Clippel and Xu Zhang. Non-Bayesian Persuasion. Journal of Political Economy, pp. 2594 2642, October 2022. URL https://www.journals.uchicago.edu/doi/10. 1086/720464.

Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against No-regret Learners. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 8b6dd7db9af49e67306feb59a8bdc52c-Paper.pdf.

Piotr Dworczak and Alessandro Pavan. Preparing for the Worst but Hoping for the Best: Robust (Bayesian) Persuasion. Econometrica, pp. 2017 2051, 2022. URL https://www. econometricsociety.org/doi/10.3982/ECTA19107.

Yiding Feng, Wei Tang, and Haifeng Xu. Online Bayesian Recommendation with No Regret. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 818 819, Boulder CO USA, July 2022. ACM. URL https://dl.acm.org/doi/10.1145/3490486. 3538327.

Yiding Feng, Chien-Ju Ho, and Wei Tang. Rationality-robust information design: Bayesian persuasion under quantal response. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 501 546. SIAM, 2024.

Drew Fudenberg and David K. Levine. The theory of learning in games. MIT Press series on economic learning and social evolution. MIT Press, Cambridge, Mass, 1998.

Jiarui Gan, Minbiao Han, Jibang Wu, and Haifeng Xu. Robust Stackelberg Equilibria. In Proceedings of the 24th ACM Conference on Economics and Computation, pp. 735 735, London United Kingdom, July 2023. ACM. URL https://dl.acm.org/doi/10.1145/3580507. 3597680.

Jiarui Gan, Minbiao Han, Jibang Wu, and Haifeng Xu. Generalized principal-agency: Contracts, information, games and beyond. In Proceedings of the 20th Conference on Web and Internet Economics, WINE 24, 2024.

Guru Guruganesh, Yoav Kolumbus, Jon Schneider, Inbal Talgam-Cohen, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Joshua Ruizhi Wang, and S. Matthew Weinberg. Contracting with a Learning Agent. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=Bh0LLUp8OA.

Keegan Harris, Nicole Immorlica, Brendan Lucier, and Aleksandrs Slivkins. Algorithmic Persuasion Through Simulation: Information Design in the Age of Generative AI. ar Xiv preprint ar Xiv:2311.18138, 2023.

Sergiu Hart and Andreu Mas-Colell. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica, pp. 1127 1150, September 2000. URL http://doi.wiley.com/10. 1111/1468-0262.00153.

Shinji Ito. A Tight Lower Bound and Efficient Reduction for Swap Regret. In Advances in Neural Information Processing Systems, pp. 18550 18559. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/d79c8788088c2193f0244d8f1f36d2db-Paper.pdf.

Atulya Jain and Vianney Perchet. Calibrated Forecasting and Persuasion. In Proceedings of the 25th ACM Conference on Economics and Computation, pp. 489 489, New Haven CT USA, July 2024. ACM. URL https://dl.acm.org/doi/10.1145/3670865.3673455.

Emir Kamenica and Matthew Gentzkow. Bayesian Persuasion. American Economic Review, pp. 2590 2615, October 2011. URL https://pubs.aeaweb.org/doi/10.1257/aer. 101.6.2590.

B. H. Korte and Jens Vygen. Combinatorial optimization: theory and algorithms. Algorithms and combinatorics. Springer, Heidelberg ; New York, 5th ed edition, 2012.

Published as a conference paper at ICLR 2025

Svetlana Kosterina. Persuasion with unknown beliefs. Theoretical Economics, pp. 1075 1107, 2022. URL https://econtheory.org/ojs/index.php/te/article/view/4742.

Rachitesh Kumar, Jon Schneider, and Balasubramanian Sivan. Strategically-Robust Learning Algorithms for Bidding in First-Price Auctions. In Proceedings of the 25th ACM Conference on Economics and Computation, pp. 893 893, New Haven CT USA, July 2024. ACM. URL https://dl.acm.org/doi/10.1145/3670865.3673514.

Tao Lin and Ce Li. Information Design with Unknown Prior. In Proceedings of Innovations in Theoretical Computer Science, 2025.

Yue Lin, Wenhao Li, Hongyuan Zha, and Baoxiang Wang. Information design in multi-agent reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Ny Qw Btt Tn G.

Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against Learners in Bayesian Games. In Proceedings of Thirty Fifth Conference on Learning Theory, Proceedings of Machine Learning Research, pp. 5221 5252. PMLR, July 2022. URL https: //proceedings.mlr.press/v178/mansour22a.html.

Richard D. Mc Kelvey and Thomas R. Palfrey. Quantal Response Equilibria for Normal Form Games. Games and Economic Behavior, pp. 6 38, July 1995. URL https://linkinghub. elsevier.com/retrieve/pii/S0899825685710238.

Roger B Myerson. Optimal coordination mechanisms in generalized principal agent problems. Journal of Mathematical Economics, pp. 67 81, June 1982. URL https://linkinghub. elsevier.com/retrieve/pii/0304406882900064.

Denis Nekipelov, Vasilis Syrgkanis, and Eva Tardos. Econometrics for Learning Agents. In Proceedings of the Sixteenth ACM Conference on Economics and Computation - EC 15, pp. 1 18, Portland, Oregon, USA, 2015. ACM Press. URL http://dl.acm.org/citation.cfm? doid=2764468.2764522.

James Renegar. Some perturbation theory for linear programming. Mathematical Programming, pp. 73 91, February 1994. URL http://link.springer.com/10.1007/BF01581690.

Aviad Rubinstein and Junyao Zhao. Strategizing against No-Regret Learners in First-Price Auctions. In Proceedings of the 25th ACM Conference on Economics and Computation, pp. 894 921, New Haven CT USA, July 2024. ACM. URL https://dl.acm.org/doi/10.1145/ 3670865.3673613.

Antoine Scheid, Aymeric Capitaine, Etienne Boursier, Eric Moulines, Michael Jordan, and Alain Oliviero Durmus. Learning to mitigate externalities: the coase theorem with hindsight rationality. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=omyzrkacme.

Tyler Lu, David Pal, and Martin Pal. Contextual Multi-Armed Bandits. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 485 492. PMLR, March 2010. URL https://proceedings.mlr.press/v9/lu10a.html.

Bernhard Von Stengel and Shmuel Zamir. Leadership with commitment to mixed strategies. Technical report, Citeseer, 2004.

Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan, and Haifeng Xu. Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 471 472, Boulder CO USA, July 2022. ACM. URL https://dl.acm.org/doi/10.1145/ 3490486.3538313.

Kunhe Yang and Hanrui Zhang. Computational Aspects of Bayesian Persuasion under Approximate Best Response. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=9B0i Okn3UP.

Published as a conference paper at ICLR 2025

Gabriel Ziegler. Adversarial bilateral information design. Technical report, 2020.

You Zu, Krishnamurthy Iyer, and Haifeng Xu. Learning to Persuade on the Fly: Robustness Against Ignorance. In Proceedings of the 22nd ACM Conference on Economics and Computation, pp. 927 928, Budapest Hungary, July 2021. ACM. URL https://dl.acm.org/doi/10. 1145/3465456.3467593.

Published as a conference paper at ICLR 2025

A ADDITIONAL RELATED WORKS

Learning agents have been studied in principal-agent problems like auctions (Braverman et al., 2018; Cai et al., 2024; Rubinstein & Zhao, 2024; Kumar et al., 2024), bimatrix Stackelberg games (Deng et al., 2019; Mansour et al., 2022; Arunachaleswaran et al., 2024), contract design (Guruganesh et al., 2024; Scheid et al., 2024), and Bayesian persuasion (Lin et al., 2023; Jain & Perchet, 2024). These problems belong to the class of generalized principal-agent problems (Myerson, 1982; Gan et al., 2024). We thus propose a general framework of generalized principal-agent problem with a learning agent, which encompasses several previous models, refines previous results, and provides new results.

Camara et al. (2020) also propose a general framework of principal-agent problems with learning players, but has two key differences with ours: (1) They drop the common prior assumption while we still keep it. This assumption allows us to compare the principal s utility in the learning model with the classic model with common prior. (2) Their principal has commitment power, which is reasonable in, e.g., auction design, but less realistic in information design where the principal s strategy is a signaling scheme. Our principal does not commit.

Deng et al. (2019) show that the follower s no-swap-regret learning can cap the leader s utility at U + o(1) in Stackelberg games. We find that this conclusion holds for all generalized principalagent problems where the agent does not have private information. This conclusion does not hold when the agent is privately informed, as shown by Mansour et al. (2022) in Bayesian Stackelberg games. We view our work as characterizing the largest class of games under which this conclusion holds.

The literature on information design (Bayesian persuasion) has investigated various relaxations of the strong rationality assumptions in the classic models. For the sender, known prior (Camara et al., 2020; Ziegler, 2020; Zu et al., 2021; Kosterina, 2022; Wu et al., 2022; Dworczak & Pavan, 2022; Harris et al., 2023; Lin & Li, 2025) and known utility (Babichenko et al., 2021; Castiglioni et al., 2020; Feng et al., 2022; Bacchiocchi et al., 2024) are relaxed. For the receiver, the receiver may make mistakes in Bayesian updates (de Clippel & Zhang, 2022), be risk-conscious (Anunrojwong et al., 2023), do quantal response (Feng et al., 2024) or approximate best response (Yang & Zhang, 2024). Independently and concurrently of us, Jain & Perchet (2024) also study Bayesian persuasion with a learning agent. Their work has a few differences with us: First, their model is a general Bayesian persuasion model with imperfect and non-stationary dynamics for the state of the world. Our model generalizes Bayesian persuasion in another direction (namely, generalized principalagent problems), while still assuming a perfect and stationary environment. Second, their results are qualitatively similar to our Result 1 and Result 4, while our results are more quantitative and precise. Third, we additionally show that no-swap-regret learning can cap the sender s utility (Result 2).

As our problem reduces to generalized principal-agent problems with approximate best response, our work is also related to recent works on approximately-best-responding agents in Stackelberg games (Gan et al., 2023) and Bayesian persuasion (Yang & Zhang, 2024). We focus on the range of payoff that can be obtained by a computationally-unbounded principal, ignoring the computational aspect considered by Gan et al. (2023); Yang & Zhang (2024). Besides the maxmin/robust objective, we also study the maxmax objective where the agent approximately best responds in favor of the principal, which is usually not studied in the literature.

B DETAILS ABOUT CONTEXTUAL NO-(SWAP-)REGRET ALGORITHMS

Contextual no-(swap-)regret algorithms can be constructed by running an ordinary no-(swap-)regret algorithm for each context independently. Since algorithms with O(

T) (swap-)regret exist under bandit feedback (Audibert & Bubeck, 2010; Ito, 2020), they lead to algorithms with O( p

|S|T) contextual (swap-)regret. This is formalized by the following proposition:

Proposition B.1. There exist learning algorithms with contextual regret CReg(T) = O( p

|A||S|T) and contextual swap-regret CSReg(T) = O(|A| p

|S|T). They can be constructed by running an ordinary no-(swap-)regret multi-armed bandit algorithm for each context independently.

We prove Proposition B.1 in the rest of this section.

Published as a conference paper at ICLR 2025

Let A be an arbitrary no-regret (no-swap-regret) learning algorithm for a multi-armed bandit (MAB) problem with |A| arms. There exist such algorithms with regret O( p

T|A| log |A|) (variants of Exp3 (Auer et al., 2002)) and even O( p

T|A|) (doubling trick + poly INF (Audibert & Bubeck, 2010)) for any time horizon T > 0. By swap-to-external regret reductions, they can be converted to multi-armed bandit algorithms with swap regret O( p

T|A|3 log |A|) (Blum & Mansour, 2007) and O(|A|

T) (Ito, 2020). We then convert A into a contextual no-regret (contextual no-swap-regret) algorithm, in the following way:

Algorithm 1: Convert any MAB algorithm to a contextual MAB algorithm Input: MAB algortihm A. Arm set A. Context set S. Instantiate |S| copies A1, . . . , A|S| of A, and initialize their round number by t1 = = t|S| = 0. for round t = 1, 2, . . . do

Receive context st. Call Ast to obtain an action at. Play at and obtain feedback (which includes the reward vt(at) of action at). Feed the feedback to Ast. Increase its round number tst by 1. end

Proposition B.2. The contextual regret of Algorithm 1 is at most

CReg(T) max n |S| X

s=1 Reg(Ts) T1 + + T|S| = T o ,

where Reg(Ts) is the regret of A for time horizon Ts.

The contextual swap-regret of Algorithm 1 is at most

CSReg(T) max n |S| X

s=1 SReg(Ts) T1 + + T|S| = T o ,

where SReg(Ts) is the swap-regret of A for time horizon Ts.

When plugging in Reg(Ts) = O( p

|A|Ts), we obtain CReg(T) O( p

When plugging in SReg(Ts) = O(|A| Ts), we obtain CSReg(T) O(|A| p

Proof. The contextual regret of Algorithm 1 is

CReg(T) = max d:S A E h T X

vt(d(st)) vt(at) i

= max d:S A E h |S| X

vt(d(s)) vt(at) i

s=1 max a A E h X

vt(a ) vt(at) i

s=1 ETs Reg(Ts) where Ts is the number of rounds where st = s

max n |S| X

s=1 Reg(Ts) T1 + + T|S| = T o .

When Reg(Ts) = O( p

|A|Ts), by Jensen s inequality we obtain

|A|Ts) O( p

s=1 Ts = O( p

Published as a conference paper at ICLR 2025

The argument for contextual swap-regret is similar:

CSReg(T) = max d:S A A E h T X

vt(d(st, at)) vt(at) i

= max d:S A A E h |S| X

vt(d(s, at)) vt(at) i

s=1 max d :A A E h X

vt(d (at)) vt(at) i

s=1 ETs SReg(Ts) where Ts is the number of rounds where st = s

max n |S| X

s=1 SReg(Ts) T1 + + T|S| = T o .

When SReg(Ts) = O(|A| Ts), by Jensen s inequality we obtain

s=1 O(|A| p

Ts) O(|A|) p

s=1 Ts = O(|A| p

C EXAMPLE OF APPROXIMATELY BEST RESPONDING AGENTS

Our model of approximately-best-responding agent (Section 3.1) includes, for example, two other models in the Bayesian persuasion literature that also relax the agent s Bayesian rationality assumption: the quantal response model (proposed by (Mc Kelvey & Palfrey, 1995) in normal-form games and studied by (Feng et al., 2024) in Bayesian persuasion) and a model where the agent makes mistakes in Bayesian update (de Clippel & Zhang, 2022): Example C.1. Assume that the receiver s utility is in [0, 1]. In Bayesian persuasion, the following receiver strategies are δ-best-responding:

Quantal response: given signal s S, the agent chooses action a A with probability exp(λv(µs,a)) P

a A exp(λv(µs,a )), with λ > 0. This strategy belongs to Rδ(π) with δ = 1+log(|A|λ)

Inaccurate belief: given signal s S, the agent forms some posterior µ s that is different yet close to the true posterior µs in total variation distance d TV(µ s, µs) ε. The agent picks an optimal action for µ s. This strategy belongs to D2ε(π).

Proof. Consider the quantal response model. Let γ = log(|A|λ)

λ . Given signal s, with posterior µs, we say an action a A is not γ-optimal for posterior µs if

v(µs, a s) v(µs, a) γ

where a s is an optimal action for µs. The probability that the receiver chooses not γ-optimal action a is at most: exp(λv(µs, a)) P

a A exp(λv(µs, a)) exp(λv(µs, a))

exp(λv(µs, a s)) = exp λ v(µs, a s) v(µs, a) exp( λγ) = 1 |A|λ.

By a union bound, the probability that the receiver chooses any not γ-approxiamtely optimal action is at most 1

λ. So, the expected loss of utility of the receiver due to not taking the optimal action is at most

λ 1 log(|A|λ) + 1

Published as a conference paper at ICLR 2025

This means that the quantal response strategy is a log(|A|λ)+1

λ -best-responding randomized strategy.

Consider inaccurate belief. Given signal s, the receiver has belief µ s with total variation distance d TV(µ s, µs) ε to the true posterior µs. For any action a A, the difference of expected utility of action a under beliefs µ s and µs is at most ε: Eω µ s[v(ω, a)] Eω µs[v(ω, a)] d TV(µ s, µs) ε.

So, the optimal action for µ s is a 2ε-optimal action for µs. This means that the receiver strategy is a deterministic 2ε-best-responding strategy.

D ADDITIONAL DETAILS ON APPLICATIONS TO SPECIFIC PRINCIPAL-AGENT PROBLEMS

D.1 STACKELBERG GAMES

In a Stackelberg game, the principal (leader), having a finite action set B, first commits to a mixed strategy x = (x(b))b B (B), which is a distribution over actions. So the principal s decision space X is (B). The agent (follower) then takes an action a A in response to x. The (expected) utilities for the two players are u(x, a) = P

b B x(b)u(b, a) and v(x, a) = P

b B x(b)u(b, a). The signal s can (but not necessarily) be an action that the principal recommends the agent to take.

Assume bounded utility |u(b, a)| B. Then, the principal s utility function u(x, a) is bounded in [ B, B] and (L = B)-Lipschitz in x. The diameter diam(X) = maxx1,x2 (B) x1 x2 1 2. Applying the theorem for unconstrained generalized principal-agent problems (Theorem 4.1) and the theorems for learning agent (Theorem 3.1 and 3.4), we obtain:

Corollary D.1 (Stackelberg game with a learning agent). Suppose T is sufficiently large such that CReg(T )

T < G and CSReg(T )

T < G, then:

with a contextual no-regret learning agent, the principal can obtain utility

1 T E PT t=1 u(xt, at) OBJR CReg(T )

T using a fixed strategy in all rounds.

with a contextual no-swap-regret learning agent, the principal cannot obtain utility more than 1 T E PT t=1 u(xt, at) OBJ D CSReg(T )

G CSReg(T )

T even using adaptive strategies.

The conclusion that the principal can obtain utility at least U o(1) against a no-regret learning agent and no more than U + o(1) against a no-swap-regret agent in Stackelberg games was proved by (Deng et al., 2019). Our Corollary D.1 reproduces this conclusion and moreover provides bounds

on the o(1) terms, namely, U O( q

T ) and U + O( CSReg(T )

T ). This demonstrates the generality and usefulness of our framework.

D.2 CONTRACT DESIGN

In contract design, there is a finite outcome space O = {r1, . . . , rd} where each ri R is a monetary reward to the principal. When the agent takes action a A, outcome ri will happen with probability pai 0, Pd i=1 pai = 1. The principal cannot observe the action taken by the agent but can observe the realized outcome. The principal s decision space X is the set of contracts, where a contract x = (x(i))d i=1 [0, + ]d is a vector that specifies the payment to the agent for each possible outcome. So, if the agent takes action a under contract x, the principal obtains expected utility

i=1 pai(ri x(i))

and the agent obtains v(x, a) = Pd i=1 paix(i) ca, where ca 0 is the cost of action a A for the agent. The signal s can (but not necessarily) be an action that the principal recommends the agent to take. The principal s decision space X [0, + ]d in contract design, however, may be unbounded

Published as a conference paper at ICLR 2025

and violate the requirement of bounded diameter diam(X) that we need. We have two remedies for this.

The first remedy is to require the principal s payment to the agent be upper bounded by some constant P < + , so 0 x(i) P and X = [0, P]d. Under this requirement and the assumption of bounded reward |ri| R, the principal s utility becomes bounded by |u(x, a)| Pd i=1 pai(R + P) = R + P = B and (L = 1)-Lipschitz under ℓ -norm:

|u(x1, a) u(x2, a)| = d X

i=1 pai(x1(i) x2(i)) d max i=1 |x1(i) x2(i)|

i=1 pai = x1 x2 . (10)

And the diameter of X is bounded by (under ℓ -norm)

diam(X; ℓ ) = max x1,x2 X x1 x2 = max x1,x2 [0,P ]d

d max i=1 |x1(i) x2(i)| P. (11)

Now, we can apply the theorem for unconstrained generalized principal-agent problems (Theorem 4.1) and the theorems for learning agent (Theorem 3.1 and Theorem 3.4) to obtain:

Corollary D.2 (Contract design (with bounded payment) with a learning agent). Suppose T is sufficiently large such that CReg(T )

T < P G 2(R+P ) and CSReg(T )

T < G, then:

with a contextual no-regret learning agent, the principal can obtain utility at least

1 T E PT t=1 u(xt, at) OBJR CReg(T )

T using a fixed contract in all rounds.

with contextual a no-swap-regret learning agent, the principal cannot obtain utility more than 1 T E PT t=1 u(xt, at) OBJ D CSReg(T )

G CSReg(T )

T even using adaptive contracts.

The second remedy is to write contract design as a generalized principal-agent problem in another way. Let x = ( x(a))a A [0, + ]|A| be a vector recording the expected payment from the principal to the agent for each action a A:

i=1 paix(i). (12)

And let r(a) be the expected reward of action a, r(a) = Pd i=1 pairi. Then, the principal and the agent s utility can be rewritten as functions of x and a:

u( x, a) = r(a) x(a), v( x, a) = x(a) ca, (13)

which are linear (strictly speaking, affine) in x X. Assuming bounded reward | r(a)| R, we can without loss of generality assume that the expected payment x(a) is bounded by R as well, because otherwise the principal will get negative utility. So, the principal s decision space can be restricted to

X = x | x [0, + ]d such that x(a) =

i=1 paix(i) for every a A [0, R]|A|, (14)

which is convex and has bounded diameter (under ℓ norm)

diam( X; ℓ ) diam([0, R]|A|; ℓ ) = R. (15)

The utility function u( x, a) is bounded by 2R and (L = 1)-Lipschitz (under ℓ norm):

|u( x1, a) u( x2, a)| = | x1(a) x2(a)| max a A | x1(a) x2(a)| = x1 x2 . (16)

Thus, we can apply the theorem for unconstrained generalized principal-agent problems (Theorem 4.1) and the theorems for learning agent (Theorem 3.1 and Theorem 3.4) to obtain:

Corollary D.3 (Contract design with a learning agent). Suppose T is sufficiently large such that CReg(T )

2 and CSReg(T )

T < G, then:

Published as a conference paper at ICLR 2025

with a contextual no-regret learning agent, the principal can obtain utility at least

1 T E PT t=1 u(xt, at) OBJR CReg(T )

T using a fixed contract in all rounds.

with a contextual no-swap-regret learning agent, the principal cannot obtain utility more than 1 T E PT t=1 u(xt, at) OBJ D CSReg(T )

G CSReg(T )

T even using adaptive contracts.

Providing the quantitative lower and upper bounds, the above results refine the result in (Guruganesh et al., 2024) that the principal can obtain utility at least U o(1) against a no-regret learning agent and no more than U + o(1) against a no-swap-regret agent. This again demonstrates the versatility of our general framework.

E MISSING PROOFS FROM SECTION 3

E.1 PROOF OF LEMMA 3.2

Since πt = π is fixed, we have πt s = πs and xt s = xs, s S. The regret of the receiver not deviating according to d is:

1 T E h T X

v(xt, d(st)) v(xt, at) i = 1

a A pt a|s v(xt s, d(s)) v(xt s, a)

PT t=1 pt a|s T v(xs, d(s)) v(xs, a)

s S πsv(xs, d(s)) X

a A ρ(a|s)v(xs, a) = V (π, d) V (π, ρ).

Here, d is interpreted as an agent strategy that deterministically takes action d(s) for signal s.

By a similar derivation, we see that the principal s expected utility is equal to

1 T E h PT t=1 u(xt, at) i = P

PT t=1 pt a|s T u(xs, a) = U(π, ρ), which proves the lemma.

E.2 PROOF OF THEOREM 3.3

We prove this theorem for no-swap-regret learning algorithms. The argument for no-regret learning algorithms is analogous.

Fix the principal s adaptive strategy σ = (σt)T t=1 for the T rounds, where each σt is a mapping from the history ht 1 = (si, ai)t 1 i=1 (including past signals and actions) to a single-round strategy πt for round t. Given any function CSReg(T), let δ = CSReg(T )

2T . We construct the following algorithm A for the agent: at each round t, given history ht 1 = (si, ai)t 1 i=1,

If the single-round strategy chosen by the principal at round t is equal to πt = σt(ht 1), then the agent plays a strategy ρt arg minρ Rδ(πt) U(πt, ρ), namely, a randomized δ-bestresponding strategy that minimizes the principal s utility.

If the single-round strategy chosen by the principal at round t is not equal to πt = σt(ht 1), then the agent switches to any existing contextual no-swap-regret algorithm with swap regret at most CSReg(T )

2 (see Proposition B.1 for examples of such algorithms).

We show that the agent s algorithm has swap regret at most CSReg(T) no matter what strategy the principal uses:

- If the principal keeps using strategy σ, namely, at each round t the principal uses single-round strategy πt = σt(ht 1), denoted by πt = {(πt s, xt s)}s S, then the agent will respond by strategy ρt. For any deviation function d : S A A, the expected regret of the agent not deviating according

Published as a conference paper at ICLR 2025

to d in this round is

E[v(xt, d(st, at)) v(xt, at)] = Eht 1 h X

a A ρt(a|s) v(xt s, d(s, a)) v(xt s, a) i

= Eht 1 h X

a A ρt(a|s)v(xt s, d(s, a)) X

a A ρt(a|s)v(xt s, a) i

s S πt s max a A v(xt s, a) X

a A ρt(a|s)v(xt s, a) i

= Eht 1 h X

s S πt s max a A v(xt s, a) V (πt, ρt) i

Eht 1 h δ i because by definition, ρt Rδ(πt) V (πt, ρt) X

s S πt s max a A v(xt s, a) δ.

Summing over all T rounds, we obtain:

t=1 E[v(xt, d(st, at)) v(xt, at)] Tδ = CSReg(T)

- If the principal does not play according to σ in any round, then the agent will switch to an algorithm with swap regret at most CSReg(T )

2 , so the total swap regret of the agent is at most:

Tδ + CSReg(T)

2 CSReg(T),

which proves that the agent s learning algorithm has swap regret at most CSReg(T).

The principal s average utility, when the principal uses strategy σ and the agent uses the above no-swap-regret algorithm, is

t=1 E[u(xt, at)] = 1

t=1 Eht 1 h X

a A ρt(a|s)u(xt s, a) i

t=1 Eht 1 h U(πt, ρt) i

t=1 Eht 1 h sup π min ρ Rδ(π) U(π, ρ) i because ρt arg min ρ Rδ(πt) U(πt, ρ)

= sup π min ρ Rδ(π) U(π, ρ)

= OBJR CSReg(T)

E.3 PROOF OF THEOREM 3.4

Proof. Let pt s = Pr[st = s] = E 1[st = s] = E[πt s] be the probability that signal s S is sent in round t. Let pt a|s = Pr[at = a | st = s] be the probability that the agent takes action a conditioning on signal st = s being sent in round t. Let d : S A A be any deviation function for the agent.

Published as a conference paper at ICLR 2025

The utility gain by deviation for the agent is upper bounded by the contextual swap-regret:

v(xt, d(st, at)) v(xt, at) i (17)

a A pt a|s Exts|st=s h v(xt s, d(s, a)) v(xt s, a) i

a A pt a|s v(E[xt s|st = s], d(s, a)) v(E[xt s|st = s], a) by linearity of v( , a)

PT j=1 pj spj a|s T 1 PT j=1 pj spj a|s

t=1 pt spt a|s v(E[xt s|st = s], d(s, a)) v(E[xt s|st = s], a)

PT j=1 pj spj a|s T

v PT t=1 pt spt a|s E[xt s|st=s] PT j=1 pj spj a|s , d(s, a) v PT t=1 pt spt a|s E[xt s|st=s] PT j=1 pj spj a|s , a .

Define qs,a =

PT j=1 pj spj a|s T and ys,a =

PT t=1 pt spt a|s E[xt s|st=s] PT j=1 pj spj a|s X. Then the above is equal to

a A qs,a h v(ys,a, d(s, a)) v(ys,a, a) i . (18)

We note that P

a A pj spj a|s T = 1, so q is a probability distribution over S A. And note that

s,a S A qs,ays,a = X

t=1 pt spt a|s E[xt s|st = s] = 1

s S pt s E[xt s|st = s]

s S E 1[st = s]xt s = 1

s S 1[st = s]xt s = 1

s S πt sxt s C because X

s S πt sxt s C.

This means that π = {(qs,a, ys,a)}(s,a) S A defines a valid principal strategy with the larger signal space S A. Then, we note that (18) is the difference between the agent s expected utility under principal strategy π when responding using strategy d : S A A and using the strategy that maps signal (s, a) to action a. And (18) is upper bounded by CSReg(T )

(18) = V (π , d) V (π , (s, a) 7 a) CSReg(T )

T , d : S A A. (19)

In particular, this holds when d is the agent s best-responding strategy. This means that the agent strategy (s, a) 7 a is a ( CSReg(T )

T )-best-response to π . So, the principal s expected utility is upper bounded by the utility in the approximate-best-response model:

1 T E h T X

t=1 u(xt, at) i = 1

a A pt a|sv(E[xt s|st = s], a)

a A qs,au(ys,a, a) = U(π , (s, a) a) OBJ R CSReg(T )

E.4 PROOF OF THEOREM 3.5

The instance has 2 states (A, B), 3 actions (L, M, R), uniform prior µ0(A) = µ0(B) = 0.5, with the following utility matrices (left for sender s, right for receiver s):

Published as a conference paper at ICLR 2025

u(ω, a) L M R v(ω, a) L M R A 0 2 2 A γ 1 0 B 0 0 2 B 1 1 0

Claim 2. In this instance, the optimal sender utility U in the classic BP model is 0, and the approximate-best-response objective OBJ R(γ) = O(γ).

Proof. Recall that any signaling scheme decomposes the prior µ0 into multiple posteriors {µs}s S. If a posterior µs puts probability > 0.5 to state B, then the receiver will take action M, which gives the sender a utility 0; if the posterior µs puts probability 0.5 to state B, then no matter what action the receiver takes, the sender s expected utility on µs cannot be greater than 0. So, the sender s expected utility is 0 under any signaling scheme. An optimal signaling scheme is to reveal no information (keep µs = µ0); the receiver takes R and the sender gets utility 0.

This instance satisfies the assumptions of Theorem 4.2, so OBJ R(γ) U + O(γ) = O(γ).

Claim 3. By doing the following, the sender can obtain utility 1 2 O( γ) if the receiver is γ-mean-based learning:

in the first T/2 rounds: if the state is A, send signal 1; if the state is B, send 2.

in the remaining T/2 rounds, switch the scheme: if the state is A, send 2; if state is B, send 1.

Proof. In the first T/2 rounds, the receiver finds that signal 1 corresponds to state A so he will take action L with high probability when signal 1 is sent; signal 2 corresponds to B so he will take action M with high probability. In this phase, the sender obtains utility 0 per round. At the end of this phase, for signal 1, the receiver accumulates utility T

2 1 2 γ = T

4 γ for action L. For signal 2, the receiver accumulates utility T

2 1 2 1 = T

4 for action M.

In the remaining T/2 rounds, the following will happen:

For signal 1, the receiver finds that the state is now B, so the utility of action L decreases by 1 every time signal 1 is sent. Because the utility of L accumulated in the first phase was T

4 γ, after T

4 γ rounds in second phase the utility of L should decrease to below 0, and the receiver will no longer play L (with high probability) at signal 1. The receiver will not play M at signal 1 in most of the second phase either, because there are more A states than B states at signal 1 historically. So, the receiver will play action R most times, roughly T

4 γ rounds. This gives the sender a total utility of ( T

For signal 2, the state is now A. But the receiver will continue to play action M in most times. This because: R has utility 0; L accumulated T

4 utility in the first phase, and only increases by γ per round in the second phase, so its accumulated utility is always negative; instead, M has accumulated T

4 utility in the first phase, and decreases by 1 every time signal 2 is sent in the second phase, so its utility is positive until near the end. So, the receiver will play M. This gives the sender utility 0.

Summing up, the sender obtains total utility T

2 O(T γ) in these two phases, which is 1

2 O( γ) > 0 per round in average.

The above two claims together prove the theorem.

Published as a conference paper at ICLR 2025

F MISSING PROOFS FROM SECTION 4

F.1 PROOF OF CLAIM 1

If no G > 0 satisfies the claim, then there must exist an a0 A such that for all x X, v(a0, µ) v(a , µ) 0 for some a A \ {a0}. Namely,

max x X min a A\{a0}

v(x, a0) v(x, a ) 0.

Then, by the minimax theorem, we have

min α (A\{a0}) max x X v(x, a0) v(x, α ) = max x X min a A\{a0}

v(x, a0) v(x, a ) 0.

This means that a0 is weakly dominated by some mixed action α (A \ {a0}), violating Assumption 1.

F.2 PROOF OF EXAMPLE 4.1

We use the probability µ [0, 1] of the Good state to represent a belief (so the probability of Bad state is 1 µ).

First, the sender s optimal utility when the receiver exactly best responds is 2µ0:

This is achieved by decomposing the prior µ0 into two posteriors µa = 1

2 and µb = 0 with probability 2µ0 and 1 2µ0 respectively, with the receiver taking action a under posterior µa and b under µb.

Then, consider any signaling scheme of the sender, π = {(πs, µs)}s S, which is a decomposition of the prior µ0 into |S| posteriors µs [0, 1] such that P

s S πsµs = µ0. Let ρ : S (A) be a randomized strategy of the receiver, where ρ(a|s) (and ρ(b|s)) denotes the probability that the receiver takes action a (and b) under signal s. The sender s expected utility under π and ρ is:

U(π, ρ) = X

s S πs ρ(a|s) 1 + ρ(b|s) 0 = X

s S πsρ(a|s). (20)

The receiver s utility when taking action a at posterior µs is µs 1 + (1 µs) ( 1) = 2µs 1. So, the receiver s expected utility under π and ρ is

V (π, ρ) = X

s S πs ρ(a|s) (2µs 1) + ρ(b|s) 0 = X

s S πsρ(a|s)(2µs 1). (21)

Clearly, the receiver s best response ρ is to take action a with certainty if and only if µs > 1

2, with expected utility V (π, ρ ) = X

πs(2µs 1). (22)

To find OBJR(δ) = supπ minρ Rδ(π) U(π, ρ), we fix any π and solve the inner optimization problem (minimizing the sender s utility) regarding ρ:

min ρ U(π, ρ) = X

s S πsρ(a|s)

s.t. ρ Rδ(π) δ V (π, ρ ) V (π, ρ)

πs(2µs 1) X

s S πsρ(a|s)(2µs 1).

Without loss of generality, we can assume that the solution ρ satisfies ρ(a|s) = 0 whenever µs 1

2 (if ρ(a|s) > 0 for some µs 1

2, then making ρ(a|s) to be 0 can decrease the objective

Published as a conference paper at ICLR 2025

s S πsρ(a|s) while still satisfying the constraint). So, the optimization problem can be simplified to:

min ρ U(π, ρ) = X

πs(2µs 1) X

πsρ(a|s)(2µs 1)

πs(2µs 1)(1 ρ(a|s)),

ρ(a|s) [0, 1], s S : µs > 1

We note that this is a fractional knapsack linear program, which has a greedy solution (e.g., (Korte & Vygen, 2012)): sort the signals with µs > 1

2 in increasing order of 2µs 1 (equivalently, increasing order of µs); label those signals by s = 1, . . . , n; find the first position k for which Pk s=1 πs(2µs 1) > δ:

k = min j :

s=1 πs(2µs 1) > δ ;

then, an optimal solution ρ is given by:

ρ(a|s) = 0 for s = 1, . . . , k 1;

ρ(a|k) = 1 δ Pk 1 s=1 πs(2µs 1) πk(2µk 1) for s = k; ρ(a|s) = 1 for s = k + 1, . . . , n.

The objective value (sender s expected utility) of the above solution ρ is

U(π, ρ) = X

= πk 1 δ Pk 1 s=1 πs(2µs 1) πk(2µk 1)

s=k πs δ 2µk 1 +

Since the signaling scheme π must satisfy P

s S πsµs = µ0, we have

s=k πs µ0 Pk 1 s=1 πsµs µk .

U(π, ρ) µ0 Pk 1 s=1 πsµs µk δ 2µk 1 +

µk δ 2µk 1 +

s=1 πs 2µs 1

Since 2µs 1

µk = µs µk (2µs 1)µk 0 for any s k 1, we get

µk δ 2µk 1 = f(µk).

Published as a conference paper at ICLR 2025

We find the maximal value of f(µk) = µ0

µk δ 2µk 1. Take its derivative:

f (µk) = µ0

µ2 k + 2δ (2µk 1)2 =

2δ + 2 µ0)µk µ0 (

2δ 2 µ0)µk + µ0

µ2 k(2µk 1)2 ,

which has two roots µ0

2δ+2 µ0 < 1 2 and µ0 2 µ0

2, 1) when 0 < δ < µ0

2 . So, f(x) is

increasing in [ 1

2δ) and decreasing in ( µ0 2 µ0

2δ, 1]. Since µk > 1

2, f(µk) is maximized

at µk = µ0 2 µ0

2δ. This implies

U(π, ρ) f µ0 2 µ0

2δ = µ0 µ0 (2 µ0

2δ 1 = 2µ0 2 p

This holds for any π. So, OBJR(δ) = supπ minρ Rδ(π) U(π, ρ) U 2 2µ0δ + δ = U Ω(

F.3 PROOF OF THEOREMS 4.1 AND 4.2

Lower bounds on OBJD(δ) and upper bounds on OBJ R(δ). First, we prove the lower bounds on OBJD(δ) and the upper bounds on OBJ R(δ) in Theorems 4.1 and 4.2, given by the following two lemmas:

Lemma F.1. In an unconstrained generalized principal-agent problem, OBJD(δ) U diam(X)L δ

With the constraint P

s S πsxs C, OBJD(δ) U diam(X)L + 2B diam(X)

dist(C, X) δ

Lemma F.2. In an unconstrained generalized principal-agent problem, OBJ R(δ) U + diam(X)L δ

With the constraint P

s S πsxs C, OBJ R(δ) U + diam(X)L + 2B diam(X)

dist(C, X) δ

The proofs of Lemmas F.1 and F.2 are similar and given in Appendix F.4 and F.5. The main idea to prove Lemma F.2 is the following. Let (π, ρ) be any pair of principal s strategy and agent s δbest-responding strategy. We perturb the principal s strategy π slightly to be a strategy π for which ρ is exactly best-responding (such a perturbation is possible due to Assumption 1). Since ρ is bestresponding to π , the pair (π , ρ) cannot give the principal a higher utility than U (which is the optimal principal utility under the best-response model). This means that the original pair (π, ρ) cannot give the principal a utility much higher than U , implying an upper bound on OBJ R(δ).

Upper bounds on OBJ R(δ) imply upper bounds on OBJ D(δ). Then, because OBJ D(δ) OBJ R(δ), we immediately obtain the upper bounds on OBJ D(δ) in the two theorems.

Lower bounds for OBJD(δ) imply lower bounds for OBJR(δ) Finally, we show that the lower bounds for OBJD(δ) imply the lower bounds for OBJR(δ), using the following lemma:

Lemma F.3. For any δ 0, > 0, OBJR(δ) OBJD( ) 2Bδ

The proof of this lemma is in Appendix F.6.

Using Lemma F.3 with = q

2BGδ diam(X)L and the lower bound for OBJD( ) in Lemma F.1 for the unconstrained case, we obtain:

OBJR(δ) OBJD( ) 2Bδ

G diam(X)δ,

which gives the lower bound for OBJR(δ) in Theorem 4.1.

Published as a conference paper at ICLR 2025

Using Lemma F.3 with = r

Ldiam(X)+2B diam(X)

dist(C, X) and the lower bound for OBJD( ) in

Lemma F.1 for the constrained case, we obtain:

OBJR(δ) OBJD( ) 2Bδ

U diam(X)L + 2B diam(X)

G diam(X)L + 2B diam(X)

dist(C, X) δ.

This proves the lower bound for OBJR(δ) in Theorem 4.2.

F.4 PROOF OF LEMMA F.1

Let (π, ρ) be a pair of principal strategy and agent strategy that achieves the optimal principal utility with an exactly-best-responding agent, namely, U(π, ρ) = U . Without loss of generality ρ can be assumed to be deterministic, ρ : S A. The strategy π consists of pairs {(πs, xs)}s S that satisfy X

s S πsxs =: µ0 C, (23)

and the action a = ρ(s) is optimal for the agent with respect to xs. We will construct another principal strategy π such that, even if the agent chooses the worst δ-best-responding strategy to π , the principal can still obtain utility arbitrarily close to U Ldiam(X; ℓ1) + 2B diam(X)

dist(C, X) δ

To construct π we do the following: For each signal s S, with corresponding action a = ρ(s), by Claim 1 there exists ya X such that v(ya, a) v(ya, a ) G for any a = a. Let θ = δ

G+ε [0, 1] for arbitrarily small ε > 0, and let xs be the convex combination of xs and yρ(s) with weights 1 θ, θ:

xs = (1 θ)xs + θyρ(s). (24)

We note that a = ρ(s) is the agent s optimal action for xs and moreover it is better than any other action a = a by more than δ:

v( xs, a) v( xs, a ) = (1 θ) v(xs, a) v(xs, a ) | {z } 0 because a = ρ(s) is optimal for xs

+ θ v(ya, a) v(ya, a ) | {z } G by our choice of ya

GG = δ. (25)

Let µ be the convex combination of { xs}s S with weights {πs}s S:

s S πs xs. (26)

Note that µ might not satisfy the constraint µ C. So, we want to find another vector z X and a coefficient η [0, 1] such that (1 η)µ + ηz C. (27) (If µ already satisfies µ C, then let η = 0.) To do this, we consider the ray starting from µ pointing towards µ0: {µ + t(µ0 µ ) | t 0}. Let z be the intersection of the ray with the boundary of X:

z = µ + t (µ0 µ ), t = arg max{t 0 | µ + t(µ0 µ ) X}.

Then, rearranging z = µ + t (µ0 µ ), we get

1 t (z µ ) = µ0 µ (1 1

t z = µ0 C,

which satisfies (27) with η = 1

t . We then give an upper bound on η = 1

Claim 4. η = 1

t diam(X) dist(C, X)θ.

Proof. On the one hand,

s S πs xs = X

s S πsθ(yρ(s) xs)

s S πs yρ(s) xs θ X

s S πs diam(X) = θ diam(X).

Published as a conference paper at ICLR 2025

On the other hand, because z µ and µ0 µ are in the same direction, we have

z µ = z µ0 + µ0 µ z µ0 dist(C, X)

because µ0 is in C and z is on the boundary of X. Therefore, η = 1

z µ diam(X) dist(C, X)θ.

The convex combinations (27) (26) define a new principal strategy π with |S|+1 signals, consisting of xs with probability (1 η)πs and z with probability η, satisfying P

s S(1 η)πs xs + ηz = µ0 C. Consider the agent s worst (for the principal) δ-best-responding strategies ρ to π :

ρ arg min ρ Dδ(π ) U(π , ρ).

We note that ρ ( xs) must be equal to ρ(s) for each s S. This is because a = ρ(s) is strictly better than any other action a = a by a margin of δ (25), so a is the only δ-optimal action for xs.

Then, the principal s expected utility under π and ρ is

U(π , ρ ) (27),(26) = (1 η) X

s S πsu( xs, ρ ( xs)) + ηu(z, ρ (z))

s S πsu( xs, ρ(s)) ηB

s S πs u(xs, ρ(s)) L xs xs | {z } = θ(yρ(s) xs) θdiam(X)

(1 η)U(π, ρ) Lθdiam(X) ηB U(π, ρ) Lθdiam(X) 2ηB

(Claim 4) U(π, ρ) Lθdiam(X) 2B diam(X)

dist(C, X)θ

= U(π, ρ) Ldiam(X) + 2B diam(X)

dist(C, X) ( δ

= U Ldiam(X) + 2B diam(X)

dist(C, X) δ

So, we conclude that

OBJD(δ) = sup π min ρ Dδ(π) U(π, ρ) min ρ Dδ(π ) U(π , ρ)

= U(π , ρ ) U Ldiam(X) + 2B diam(X)

dist(C, X) δ

Letting ε 0 finishes the proof for the case with the constraint P

s S πsxs C.

The case without P

s S πsxs C is proved by letting η = 0 in the above argument.

F.5 PROOF OF LEMMA F.2

Let π be a principal strategy and ρ Rδ(π) be a δ-best-responding randomized strategy of the agent. The principal strategy π consists of pairs {(πs, xs)}s S with X

s S πsxs =: µ0 C. (28)

At signal s, the agent takes action a with probability ρ(a|s). Let δs,a be the suboptimality of action a with respect to xs:

δs,a = max a A v(xs, a ) v(xs, a) . (29)

By Claim 1, for action a there exists ya X such that v(ya, a) v(ya, a ) G for any a = a. Let θs,a = δs,a G+δs,a [0, 1] and let xs,a be the convex combination of xs and ya with weights 1 θs,a and θs,a: xs,a = (1 θs,a)xs + θs,aya. (30)

Published as a conference paper at ICLR 2025

Claim 5. We have two useful claims regarding xs,a and θs,a:

(1) a is an optimal action for the agent with respect to xs,a: v( xs,a, a) v( xs,a, a ) 0, a A.

a A πsρ(a|s)θs,a δ

Proof. (1) For any a = a, by the definition of xs,a and θs,a,

v( xs,a, a) v( xs,a, a ) = (1 θs,a) v(xs, a) v(xs, a ) + θs,a v(ya, a) v(ya, a )

(1 θs,a)( δs,a) + θs,a G = G G+δs,a ( δs,a) + δs,a G+δs,a G = 0.

(2) By the condition that ρ is a δ-best-response to π, we have

δ max ρ :S A V (π, ρ ) V (π, ρ) = X

s S πs max a A v(xs, a ) X

a A ρ(a|s)v(xs, a)

a A πsρ(a|s) max a A v(xs, a ) v(xs, a) = X

a A πsρ(a|s)δs,a.

a A πsρ(a|s)θs,a = P

a A πsρ(a|s) δs,a G+δs,a P s S P a A πsρ(a|s) δs,a

We let µ be the convex combination of { xs,a}s,a S A with weights {πsρ(a|s)}s,a S A:

s,a S A πsρ(a|s) xs,a. (31)

Note that µ might not satisfy the constraint µ C. So, we want to find another vector z X and a coefficient η [0, 1] such that (1 η)µ + ηz C. (32)

(If µ already satisfies µ C, then let η = 0.) To do this, we consider the ray pointing from µ to µ0: {µ + t(µ0 µ ) | t 0}. Let z be the intersection of the ray with the boundary of X:

z = µ + t (µ0 µ ), t = arg max{t 0 | µ + t(µ0 µ ) X}.

Then, rearranging z = µ + t (µ0 µ ), we get

1 t (z µ ) = µ0 µ (1 1

t z = µ0 C,

which satisfies (32) with η = 1

t . We then give an upper bound on η = 1

Claim 6. η = 1

t diam(X) dist(C, X) δ G.

Proof. On the one hand,

a A πsρ(a|s) xs,a = X

a A πsρ(a|s)θs,a(ya xs)

a A πsρ(a|s)θs,a ya xs X

a A πsρ(a|s)θs,adiam(X) Claim 5 diam(X) δ

On the other hand, because z µ and µ0 µ are in the same direction, we have

z µ = z µ0 + µ0 µ z µ0 dist(C, X)

because µ0 is in C and z is on the boundary of X. Therefore, η = 1

z µ diam(X) dist(C, X) δ G.

Published as a conference paper at ICLR 2025

The convex combinations (32) (31) define a new principal strategy π (with |S| |A| + 1 signals) consisting of xs,a with probability (1 η)πsρ(a|s) and z with probability η. Consider the following deterministic agent strategy ρ in response to π : for xs,a, take action ρ ( xs,a) = a; for z, take any action that is optimal for z. We note that ρ is a best-response to π , ρ R0(π ), because, according to Claim 5, a is an optimal action with respect to xs,a.

Then, consider the principal s utility under π and ρ :

U(π , ρ ) (32),(31) = (1 η) X

a A πsρ(a|s)u( xs,a, ρ ( xs,a)) + ηu(z, ρ (z))

a A πsρ(a|s)u( xs,a, a) ηB

a A πsρ(a|s) u(xs, a) L xs xs | {z } = θs,a(ya xs) θs,adiam(X)

(1 η)U(π, ρ) Ldiam(X) X

a A πsρ(a|s)θs,a ηB

(Claim 5) U(π, ρ) Ldiam(X) δ

(Claim 6) U(π, ρ) Ldiam(X) + 2B diam(X)

dist(C, X) δ

Rearranging, U(π, ρ) U(π , ρ ) + Ldiam(X) + 2B diam(X)

dist(C, X) δ

G. Note that this argument holds for any pair (π, ρ) that satisfies ρ Rδ(π). And recall that ρ R0(π ). So, we conclude that

OBJ R(δ) = max π max ρ Rδ(π) U(π, ρ) max π max ρ R0(π) U(π , ρ ) + Ldiam(X; ℓ1) + 2B diam(X)

dist(C, X) δ

= U + Ldiam(X; ℓ1) + 2B diam(X)

dist(C, X) δ

This proves the case with the constraint P

s S πsxs C.

The case without P s S πsxs C is proved by letting η = 0 in the above argument.

F.6 PROOF OF LEMMA F.3

Let A (x) = a A | v(x, a) v(x, a ) , a A be the set of -optimal actions of the agent in response to principal decision x X. The proof of Lemma F.3 uses another lemma that relates the principal utility under a randomized δ-best-responding agent strategy ρ Rδ(π) and that under an agent strategy ρ that only randomizes over A (xs). Lemma F.4. Let π = {(πs, xs)}s S be a principal strategy and ρ Rδ(π) be a randomized δ-bestresponse to π. For any > 0, there exists an agent strategy ρ : s 7 (A (xs)) that randomizes over -optimal actions only for each xs, such that the principal s utility under ρ and ρ satisfies: U(π, ρ ) U(π, ρ) 2Bδ

Proof. Let a s = maxa A v(xs, a) be the agent s optimal action for xs. Let A (xs) = A \ A (xs) be the set of actions that are not -optimal for xs. By the definition that ρ Rδ(π) is a δ-bestresponse to π, we have

s S πs h v(xs, a s) X

a A ρ(a|s)v(xs, a) i

a A (xs) ρ(a|s) v(xs, a s) v(xs, a) | {z } 0

a A (xs) ρ(a|s) v(xs, a s) v(xs, a) | {z } >

a A (xs) ρ(a|s)

s S πsρ(A (xs) | s).

Published as a conference paper at ICLR 2025

Rearranging, X

s S πsρ(A (xs) | s) δ

Then, we consider the randomized strategy ρ that, for each s, chooses each action a A (xs) with the conditional probability that ρ chooses a given a A (xs):

ρ (a | s) = ρ(a | s) ρ(A (xs) | s).

The sender s utility under ρ is:

U(π, ρ ) = X

ρ(a | s) ρ(A (xs) | s)u(xs, a).

The sender s utility under ρ is

U(π, ρ) = X

a A (xs) ρ(a | s)u(xs, a) + X

a A (xs) ρ(a | s)u(xs, a)

Taking the difference between the two utilities, we get U(π, ρ ) U(π, ρ)

s S πs 1 ρ(A (xs) | s) 1 X

a A (xs) ρ(a | s)u(xs, a) + X

a A (xs) ρ(a | s)u(xs, a)

s S πs 1 ρ(A (xs) | s)

ρ(A (xs) | s)

a A (xs) ρ(a | s)u(xs, a) + X

a A (xs) ρ(a | s)u(xs, a)

s S πs 1 ρ(A (xs) | s)

ρ(A (xs) | s)

a A (xs) ρ(a | s) B + X

a A (µs) ρ(a | s) B

s S πs ρ(A (xs) | s) ρ(A (xs) | s)ρ(A (xs) | s) + B X

s S πsρ(A (xs) | s)

s S πsρ(A (xs) | s) (33) 2Bδ

This proves the lemma.

We now prove Lemma F.3.

Proof of Lemma F.3. Consider the objective OBJR(δ) = supπ minρ Rδ(π) U(π, ρ). By Lemma F.4, for any (π, ρ) there exists an agent strategy ρ : s 7 (A (xs)) that only randomizes over -optimal actions such that U(π, ρ ) U(π, ρ) 2Bδ

. Because minimizing over (A (xs)) is equivalent to minimizing over A (xs), which corresponds to deterministic -bestresponding strategies, we get:

OBJR(δ) = sup π min ρ Rδ(π) U(π, ρ) sup π min ρ :s7 (A (xs)) U(π, ρ ) 2Bδ

= sup π min ρ :s7 A (xs) U(π, ρ ) 2Bδ

= OBJD( ) 2Bδ