# contract_design_under_approximate_best_responses__809654d8.pdf Contract Design Under Approximate Best Responses Francesco Bacchiocchi 1 Jiarui Gan 2 Matteo Castiglioni 1 Alberto Marchesi 1 Nicola Gatti 1 Principal-agent problems model scenarios where a principal incentivizes an agent to take costly, unobservable actions through the provision of payments. Such problems are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too much suboptimal given the principal s payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This positive result is perhaps surprising, since, in Stackelberg games, computing an optimal commitment under approximate best responses is computationally intractable. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal has no prior knowledge about the environment. 1. Introduction In hidden-action principal-agent problems, a principal tries to steer the behavior of a self-interested agent toward favorable outcomes. The agent has to take a costly action that stochastically determines an outcome resulting in a reward for the principal. The main challenge is that the agent s action is hidden to the principal, who can only observe the realized outcome. Thus, the principal influences the agent s behavior by committing to a contract, which is an outcomedependent payment scheme whose aim is to induce the agent to take a high-cost action leading to high principal s rewards. The principal s goal is to design an optimal contract, namely one maximizing their utility, i.e., rewards minus payments. 1Politecnico di Milano 2University of Oxford. Correspondence to: Francesco Bacchiocchi . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). Nowadays, principal-agent problems find application in a terrific number of real-world settings, such as, e.g., crowdsourcing (Ho et al., 2014), online labor platforms (Kaynar & Siddiq, 2023), blockchain (Cong & He, 2019), delegation of machine learning tasks (Cai et al., 2015), and pay-forperformance healthcare (Bastani et al., 2016; 2019). Moreover, algorithmic contract design is playing a crucial role in today s world, which increasingly relies on AI agents to perform complex tasks (see, e.g., (Hadfield-Menell & Hadfield, 2019; Saig et al., 2024)). Previous works on algorithmic contract design assume that the agent always plays a best-response action to the principal s contract. However, if the agent actually responds (even slightly) suboptimally to the principal, then the principal s utility may substantially deteriorate. This may be the case in most of the real-world applications of interest, for several different reasons. For instance, the principal may not perfectly know agent s features and account for the wrong best response, the agent may not be powerful enough to compute an (exact) best-response action, or they may inaccurately interpret the principal s contract. In this paper, we initiate the study of hidden-action principalagent problems under approximate agent s best responses. Specifically, we consider settings in which the agent may take actions that are up to δ (0, 1) suboptimal under the principal s contract. We do not make any assumption on the specific δ-best response selected by the agent, but we allow them to take any of such actions. Thus, we take a worst-case approach and consider the problem of designing contracts under the assumption that the agent selects the worst δ-best response for the principal. Contracts designed in such a way are said to be robust, as they guard the principal against any possible (conceivable) suboptimal behavior of the agent. 1.1. Results and Techniques In this paper, we provide an extensive treatment of the computational and learnability aspects of the design of robust contracts. The results presented in the paper are organized into three main parts as summarized below. The Price of Robustness In the first part of the paper, as a preliminary analysis, we provide a characterization of the maximum utility that the principal can achieve by means of Contract Design Under Approximate Best Responses robust contracts. Specifically, we provide upper and lower bounds on this utility, as a function of a parameter δ (0, 1) quantifying the agent s best response suboptimality. These bounds give insights on the price (in terms of utility) that the principal incurs for being robust, as the parameter δ can be seen as a measure of the robustness level of the principal s contracts. Interestingly, our results show that, differently from what happens in general Stackelberg games (see, e.g., (Gan et al., 2023)), upper/lower bounds do not depend on the inducibility gap > 0 characterizing the problem instance. In order to derive the bounds, we prove that it is possible to convert any non-robust contract into a robust one by properly moving its payments in the direction of principal s rewards, and that, in an optimal robust contract, which provides the principal with a utility greater than zero, the agent s utility should be at least δ. Computing Robust Contracts The second part of the paper addresses the problem of computing a robust contract that is optimal for the principal. Our main contribution is a polynomial-time algorithm for this problem. This is perhaps surprising, since analogous problems in Stackelberg and Bayesian persuasion settings are known to be computationally intractable (Gan et al., 2023; Yang & Zhang, 2024), despite having more amenable solution spaces ( m in these problems vs. Rm + in contract design). At a high level, our algorithm cleverly exploits a particular structure that we discover in the robustness constraints (which ensure that the agent plays the worst approximate best response for the principal). Intuitively, we show that, when the agent s best response and their (worst) approximate best response are fixed to two arbitrarily-selected actions, the problem of computing a utility-maximizing robust contract can be formulated by means of a union of n + 1 different linear programs (LPs). Therefore, by taking the maximum over all these LPs, one can compute a utility-maximizing robust contract once the agent s best response and their (worst) approximate best response are fixed. By further iterating this procedure for all the possible choices of the action pair, we then determine an optimal robust contract. Learning Robust Contracts In the third and last part of the paper, we investigate the learnability of robust contracts. Specifically, we study an online learning framework similar to the one analyzed by Zhu et al. (2023), in which the features of agent s actions, i.e., costs and probabilities over outcomes, depend on an agent s type that is sampled at each round from some (fixed) unknown probability distribution. At each round, after committing to some contract, the principal only observes the outcome (and its associated reward) realized as an effect of the approximate best response played by the agent. Our main result within this online learning framework is the design of an algorithm that achieves sublinear regret with respect to always playing an optimal robust contract. Additionally, we show that when the parameter δ (0, 1) measuring the suboptimality of agent s actions is sufficiently small with respect to the time horizon, our algorithm also achieves sublinear regret with respect to always playing an optimal non-robust contract. Our approach presents some advantages compared to the state-of-the-art proposed by Zhu et al. (2023) for the non-robust version of the problem, while achieving similar regret guarantees. Indeed, our algorithm employs a simpler discretization of the set of contracts employed by the principal, which does not require that the principal knows its rewards beforehand. This is made possible by the fact that we rely on a novel continuity argument for the principal s expected utility (see Lemma 3), which is different from the one originally proposed by Zhu et al. (2023). 1.2. Related Works Robustness to Approximate Best Responses Two works that are closely related to ours are (Gan et al., 2023; Yang & Zhang, 2024), which consider robustness notions analogous to ours, though in different settings. Specifically, Gan et al. (2023) initiate this research line, by studying th problem of computing robust leader s commitments in Stackelberg games where the follower plays an approximate best response. They show that it is NP-hard to approximate an optimal robust commitment of the leader and, in accordance to this hardness result, they provide a quasi-polynomial-time approximation scheme (QPTAS). Yang & Zhang (2024) extend the study initiated by Gan et al. (2023) to Bayesian persuasion, with the goal of computing robust signaling schemes under approximate best responses of the receiver. Similarly to Gan et al. (2023), they show that computing an approximately-optimal robust signaling scheme is NP-hard and provide a QPTAS. In sharp contrast with these works, we show that, in hidden-action principal-agent problems, an optimal robust contract can be computed efficiently. Contract Design Contract theory has been extensively studied in economics (Holmstrom & Milgrom, 1991). However, the interest in its computational aspects is more recent. D utting et al. (2019) analyze linear contracts, proving approximation bounds. Guruganesh et al. (2021); Castiglioni et al. (2022a;b) study the computational aspects of contract design in Bayesian settings. Alon et al. (2023) study Bayesian linear contracts, proving near-optimality under sufficient uncertainty. Other works explore combinatorial principal-agent problems (Babaioff et al., 2006; 2009) and multi-agent extensions of hidden-action problems (Castiglioni et al., 2023; Duetting et al., 2024). Other Forms of Robustness in Contract Design Our work is also related to other research lines addressing different concepts of robustness in contract design. For instance, Contract Design Under Approximate Best Responses Carroll (2015) studies settings where the principal only knows a superset of agents actions, while D utting et al. (2019) introduce a different notion of uncertainty in which the principal has partial knowledge of the distributions over outcomes associated with agent s actions. Both these works show that linear contracts are a sufficient class of contracts to determine the min-max robust optimal contract. Notice that these two frameworks differ from ours, as within our framework, when the robustness parameter δ (0, 1) is arbitrarily small, the problem becomes very close to the classical version of the hidden-action principal-agent problem, in which it is known that linear contracts are not generally optimal (see, e.g., (D utting et al., 2019)). Recently, Bernasconi et al. (2024) study settings where uncertainty lies in the costs of agent s actions. In this framework, the principal only knows a set containing the true cost vectors, and computing an optimal min-max robust contract is APX-hard. Learning in Principal-Agent Problems Our work is also related to online learning problems in hidden-action principal-agent problems. Zhu et al. (2023) study general hidden-action principal-agent problem instances in which the principal faces multiple agent s types. They show that it is possible to design an algorithm that achieves a regret bound of the order of e O( m T 1 1/(2m+1)) when the principal selects contracts from the hypercube [0, 1]m, where m is the number of outcomes. In our work, we show that it is possible to design an algorithm achieving similar regret guarantees even when the different agent s types select approximate best responses. Our algorithm presents some advantages compared to the one proposed by Zhu et al. (2023). Specifically, our approach employs a simpler discretization of the hypercube used during the execution of the algorithm compared to (Zhu et al., 2023), and it does not require prior knowledge of the principal s rewards. Furthermore, Zhu et al. (2023) provide an (almost-matching) lower bound of Ω(T 1 1/(m+2)), which holds even with a single agent s type. Some recent works have introduced additional hypothesis to overcome this negative result (see, e.g., (Bacchiocchi et al., 2024; Chen et al., 2024)). 2. Preliminaries In this section, we first introduce the classical hidden-action principal-agent problems (Section 2.1), and then the variation studied in this paper, in which the agent plays an approximate best response (Section 2.2). 2.1. Hidden-Action Principal-Agent Problems An instance of hidden-action principal-agent problem is characterized by a tuple (A, Ω, F, r, c), where A is a finite set of n := |A| actions available to the agent, Ωis a finite set of m := |Ω| possible outcomes, F [0, 1]m n is a matrix representing the effects of agent s action, r [0, 1]m is a reward vector for the principal, and c [0, 1]n is a vector of agent s costs. Each agent s action a A determines a probability distribution over outcomes, encoded by a column Fa Ωof the matrix F, and it results in a cost for the agent, encoded by a component ca [0, 1] of vector c.1 We denote by Fa,ω [0, 1] the probability with which action a results in outcome ω Ω, as prescribed by Fa. Thus, it must be the case that P ω ΩFa,ω = 1 for all a A. Each outcome ω Ωis associated with a reward for the principal, encoded by a component rω [0, 1] of the vector r. Thus, whenever the agent selects an action a A, the principal s expected reward can be computed as Ra := Fa r.2 The principal commits to an outcome-dependent payment scheme called contract, which is a vector p Rm + defining a payment pω 0 from the principal to the agent for every outcome ω Ω.3 Given a contract p Rm +, the agent plays a best-response action that is: (i) incentive compatible (IC), which means that it maximizes their expected utility; and (ii) individually rational (IR), meaning that it has non-negative expected utility. We assume w.l.o.g. that there always exists at least one opt-out action a A with null cost, i.e., ca = 0, and for which the principal s expected reward is equal to zero, i.e., Fa r = 0. Since the agent s utility is at least zero when playing the opt-out action, any IC action is also IR, allowing us to focus on incentive compatibility only. Whenever the principal commits to a contract p Rm + and the agent responds by playing an action a A, the agent s and the principal s expected utilities are, respectively, u A(p, a) := Fa p ca, and u P(p, a) := Fa (r p). The set A(p) A of agent s best responses in a contract p Rm + is defined as follows: A(p) := arg max a A {Fa p ca} . In classical (non-robust) hidden-action principal-agent problems, the agent breaks ties in favor of the principal when having multiple best responses available (see, e.g., (D utting et al., 2019)). We denote by a(p) A(p) the action played by the agent in a given contract p Rm +. This is an action a A(p) that maximizes the principal s utility Fa (r p). Formally, a(p) arg maxa A(p) Fa (r p). Then, the goal of the principal is to design a contract p Rm + that maximizes their expected utility u P(p, a(p)). We say that a contract p Rm + is a (non-robust) optimal contract if it holds p arg maxp Rm + u P(p, a(p)). In the following, we define 1We denote by X the set of all probability distributions over the set X. Given n N>0, we write [n] := {1, . . . , n}. 2We denote by x y the dot product of two vectors x, y Rd. 3As customary in contract theory (Carroll, 2015), we assume that the agent has limited liability, meaning that the payments can only be from the principal to the agent, and not viceversa. Contract Design Under Approximate Best Responses the principal s utility in an optimal (non-robust) contract as OPT := maxp Rm + u P(p, a(p)), while we let the value of the social welfare be SW := maxa A {Fa r ca}. 2.2. Robust Contracts and Approximate Best Responses In this paper, we study a variation of the classical hiddenaction principal-agent problem, where the agent plays an action that is an approximate best response. Given δ (0, 1), we define the set Aδ(p) A of agent s δ-best responses in a given contract p Rm + as follows: Aδ(p):= a A | Fa p ca >max a A {Fa p ca } δ . We adopt an adversarial robust approach, in the sense that, whenever the principal commits to a contract p Rm +, the agent selects a δ-best response that minimizes principal s expected utility, namely an action aδ(p) Aδ(p) such that aδ(p) arg mina Aδ(p) Fa (r p). We refer to the utility of a δ-robust contract p as u P(p, aδ(p)). Given δ (0, 1), the principal s goal is to design an optimal δ-robust contract p Rm +, which is formally defined as: p arg max p Rm + min a Aδ(p) u P(p, a) = arg max p Rm + Ψ(p), (1) where Ψ(p) := mina Aδ(p) u P(p, a) denotes the principal s expected utility in a δ-robust contract p Rm +. Notice that analogous δ-robust solution concepts have been already introduced in similar settings, namely Stackelberg games (Gan et al., 2023) and Bayesian persuasion (Yang & Zhang, 2024). Our δ-robust contracts are their analogous in the context of hidden-action principal-agent problems. In the following, given δ (0, 1), we denote the expected utility of the principal in an optimal δ-robust contract p as OPT(δ) := u P(p , aδ(p )). We remark that OPT(δ) is always well defined, as an optimal δ-robust contract, according to the definition in Eq. (1), always exists. Intuitively, this is due to the strict inequality in the definition of the set Aδ(p), as observed by Gan et al. (2023) for Stackelberg games. Thus, in order to prove the existence of an optimal δ-robust contract, it is possible to employ the same argument used to prove Proposition 1 in (Gan et al., 2023). 3. The Price of Robustness We start by providing a characterization of how the value OPT(δ) of an optimal δ-robust contract varies as a function of the parameter δ (0, 1), which controls the suboptimality of the agent s best response. The goal of our analysis is to quantify how much is the price (in terms of utility) that the principal incurs for being robust to agent s approximate best responses. Indeed, the parameter δ can be interpreted as a measure of the robustness level of the principal s contract, with higher δ values indicating higher levels of robustness. We first establish upper and lower bounds that identify a suitable region in which the values of OPT(δ) are contained. Such bounds only depend on the parameter δ (0, 1), the value of an optimal non-robust contract OPT, and the social welfare SW achievable with non-robust contracts. Proposition 1 (Upper and lower bounds). Given an instance of hidden-action principal-agent problem: 1. For every δ (0, 1), it holds: OPT(δ) OPTLB(δ) := OPT 2 2. For every δ (0, 1), it holds: OPT(δ) OPTUB(δ) := max {0, SW δ} . In order to prove the first point in Proposition 1, we show that it is always possible to convert a non-robust contract into a robust one by suitably moving it towards the direction of the principal s reward vector. In order to prove the second point of Proposition 1, we observe that, if an optimal δrobust contract p provides the principal with an expected utility strictly larger than zero, then the opt-out action does not belong to the set of δ-best responses for such a contract. Therefore, in an optimal δ-robust contract that provides the principal with an expected utility strictly larger than zero, the agent s utility is at least δ > 0, and thus the principal s expected utility is at most SW δ. The following proposition shows that the upper and lower bounds in Proposition 1 are tight. Formally: Proposition 2. Given δ (0, 1), for every integer n > n(δ), there is an instance of hidden-action principal-agent problem (parametrized by δ) with 2n + 1 actions, where: OPT(δ) OPTLB(δ) O 1 Furthermore, for any δ (0, 1), there exists an instance of hidden-action principal-agent problem in which it holds: OPT(δ) = OPTUB(δ). Proposition 2 shows that our upper bound OPTUB(δ) is (strictly) tight, while our lower bound OPTLB(δ) is tight up to an additive term that linearly goes to zero as the number of agent s actions n increases. Finally, thanks to the previous propositions, we can show the following property about the value of an optimal δ-robust contracts as a function of δ (0, 1). Proposition 3. The function (0, 1) δ 7 OPT(δ) is continuous and non-increasing in δ. Moreover, limδ 0+ OPT(δ) = OPT and limδ 1 OPT(δ) = 0. Contract Design Under Approximate Best Responses Figure 1. The blue area corresponds to the region in which OPT(δ) is bounded as a function of δ (0, 1) in an instance of hiddenaction principal-agent problem with SW = 0.9 and OPT = 0.7. Figure 1 shows an example of the region in which the value of OPT(δ) is bounded, as defined by the lower and upper bounds in Propositions 1 and 3 as functions of δ. Comparison With Stackelberg Games The characterization results derived in this section exhibit some crucial differences compared to analogous results derived for Stackelberg games by Gan et al. (2023). Indeed, in such settings, Gan et al. (2023) show that the value of an optimal δ-robust commitment crucially depends on a parameter > 0 that represents the inducibility gap of the problem instance. Intuitively, the inducibility gap encodes how easy it is for the leader to induce the follower to play any action; see (Gan et al., 2023) for a formal definition. Specifically, in Stackelberg games, the leader s expected utility OPT(δ) in an optimal δ-robust commitment is a Lipschitz function lower bounded by OPT δ/ if δ < , whereas OPT(δ) may not be even a continuous function if δ > . In contrast, in hidden-action principal-agent problems, the value of an optimal δ-robust contract is a continuous function with respect to δ (0, 1), regardless of the inducibility gap of the instance. Furthermore, the value of OPT(δ) is either zero or it is upper bounded by SW δ, showing that for large values of δ, the maximum utility that the principal can achieve may be particularly small. Intuitively, this is because the principal must provide the agent with a large expected payment to induce them to take desirable actions rather than the opt-out one. This upper bound does not generally hold in Stackelberg games in which the principal may achieve a large utility even for large values of δ. 4. Computing Optimal Robust Contracts Now, we present our main result: a polynomial-time algorithm to compute an optimal δ-robust contract. 4.1. Characterizing an Optimal Contract We begin by presenting an optimization problem that characterizes an optimal δ-robust contract. Consider an arbitrary optimal δ-robust contract p , as well as two arbitrary actions a A(p ) and aδ arg mina Aδ(p ) u P(p , a). By fixing a and aδ, we show that the following optimization problem, over the variable p Rm +, characterizes an optimal δ-robust contract p . max p Rm + u P(p, aδ) (2) subject to the following disjunctive constraints, which must hold for every agent s action a A: u A(p, a) u A(p, a ) δ u P(p, a) u P(p, aδ) . (2a) Intuitively, Eq. (2a) requires that each action a is either not a δ-best response (first inequality) or no worse than aδ for the principal (second inequality). This ensures that the objective function u P(p, aδ) captures the principal s utility in a δ-robust contract p. More formally, we establish the following lemma (recall that Ψ(p) = mina Aδ(p) u P(p, a) denotes the principal s utility in a δ-robust contract p). Lemma 1. Every optimal solution p Rm + to Problem (2) is an optimal δ-robust contract, i.e., Ψ(p) = Ψ(p ). Proof. First, observe that p is a feasible solution to Problem (2). Indeed, if an action a is not a δ-best response to p , then by definition this means u A(p , a) u A(p , a ) δ; otherwise, it must be that u P(p , a) u P(p , aδ) as aδ is by definition the worst δ-best response in p . As a result, Eq. (2a) holds for p for every a A. Furthermore, notice that, according to the definition of aδ: Ψ(p ) = min a Aδ(p ) u P(p , a) = u P(p , aδ). In order to complete the proof, it is sufficient to show that u P(p, aδ) Ψ(p) for every feasible solution p Rm + to Problem (2). Indeed, if the condition above holds, for any arbitrary optimal solution p Rm + to Problem (2), we have: Ψ(p ) = u P(p , aδ) u P(p , aδ) Ψ(p ) Ψ(p ), where u P(p , aδ) u P(p , aδ) since p is a feasible solution to Problem (2) and p is optimal, while Ψ(p ) Ψ(p ) since p is an optimal δ-robust contract by definition. Then, it must be the case that Ψ(p ) = Ψ(p ). Now, to complete the proof, consider any feasible solution p Rm + to Problem (2). We show that u P(p, aδ) Ψ(p). Pick an arbitrary best-response action a A(p) and consider any a Aδ(p). By definition, u A(p, a) > u A(p, a ) δ = max a A u A(p, a) δ u A(p, a ) δ. Contract Design Under Approximate Best Responses Hence, for Eq. (2a) to hold for action a, it must be that u P(p, a) u P(p, aδ). Since the choice of a is arbitrary, this holds for every a Aδ(p). Consequently, Ψ(p) = min a Aδ(p) u P(p, a) u P(p, aδ). The above lemma implies that we can effectively guess a and aδ, fixing these actions in Problem (2) and solving the optimization problem to obtain p (or possibly a different, but still optimal, δ-robust contract). Since there are only O(n2) possible combinations of the values of a and aδ, the approach is efficient as long as Problem (2) can be solved efficiently. A correct guess yields a contract p such that Ψ(p) = Ψ(p ), whereas Ψ(p) Ψ(p ) for incorrect guesses. Thus, by comparing the Ψ values, we can identify a correct guess and a corresponding optimal δ-robust contract. It remains to show how to efficiently solve Problem (2). 4.2. Solving Problem (2) Problem (2) does not directly admit any efficient solution algorithm due to the non-convex constraint in Eq. (2a). To deal with this issue, we rewrite Eq. (2a) as follows: Fa p ca + u A(p, a ) δ Fa p Fa r u P(p, aδ) , (3) by expanding the utilities as u A(p, a) = Fa p ca and u P(p, a) = Fa (r p) and rearranging the terms. Hence, Eq. (2a) is satisfied for action a A if and only if Fa p is smaller than the maximum of the right-hand sides of the two inequalities in Eq. (3). The constraint effectively reduces to the second inequality for all p Rm + such that ca + u A(p, a ) δ Fa r u P(p, aδ), (4) while for the other p Rm +, it reduces to the first inequality. Consequently, we can partition the contract space based on the satisfiability of Eq. (4), considering all a A. Within each subspace in the partition, only one inequality in Eq. (3) is active for every a A. So, effectively, Eq. (3) reduces to a linear constraint for each action a, and the optimization problem to an LP. It then suffices to solve an LP for every subspace, each generating an optimal contract within its corresponding subspace. Among these contracts, the one providing the highest utility is an optimal solution to Problem (2). Now, if the linear inequalities in Eq. (4) (one for each action a A) were n arbitrary inequalities, the above partition may consist of exponentially many subspaces, making the approach inefficient. Fortunately, the partition is much more well-structured, since the hyperplanes corresponding to the inequalities are parallel, as the coefficients of p in Eq. (4) are invariant with respect to a. As a result, they partition the space into only O(n) subspaces. Next, we show how to exploit the above observation to solve Problem (2), by solving O(n) suitable subproblems instead. 4.3. Formulating the Subproblems Let us first rearrange Eq. (4) as follows: u A(p, a ) + u P(p, aδ) δ νa := Fa r ca, (5) where νa is exactly the social welfare generated by action a (which is independent of the specific contract adopted). Then, we can re-order agent s actions a1, . . . , an in such a way that νa1 νa2 νan. For simplicity, we write νj = νaj, and we let ν0 = and νn+1 = + . Then, the following lemma is straightforward. Lemma 2. For every contract p Rm +, if it holds that νj 1 u A(p, a ) + u P(p, aδ) δ νj, then: Fa p Fa r u P(p, aδ) Eq. (3) holds for all actions a {aℓ| ℓ j 1}; and Fa p ca + u A(p, a ) δ Eq. (3) holds for all actions a {aℓ| j ℓ}. In other words, the condition in the lemma defines a suitable subspace of contracts Pj for each j {1, . . . , n + 1}. The following LP solves for a p Rm + that is optimal within Pj. max p Rm + u P(p, aδ) (6) subject to the following constraints: νj 1 u A(p, a ) + u P(p, aδ) δ νj (6a) Fa p Fa r u P(p, aδ) a {aℓ| ℓ j 1} (6b) Fa p ca + u A(p, a ) δ a {aℓ| j ℓ}. (6c) Since Sn+1 j=1 Pj = Rm +, solving the LP in Problem (6) for all j {1, . . . , n+1} and picking the best among the obtained solutions gives an optimal solution to Problem (2). Remark 1. The left-hand side of Eq. (5) is roughly (up to a δ difference) the social welfare of an optimal contract. Thus, Lemma 2 can be interpreted as follows. For lowsocial-welfare actions a1, . . . , aj 1, yielding a sufficiently high utility for the principal automatically provides a low utility for the agent. This fulfills the first inequality in Eq. (3) (and Eq. (2a)). Conversely, for high-social-welfare actions aj, . . . , an, a sufficiently low utility for the agent automatically gives a high utility for the principal. This fulfills the second inequality in Eq. (3) (and Eq. (2a)). Contract Design Under Approximate Best Responses Algorithm 1 Compute an optimal δ-robust contract 1: p null, ψ 2: for all (a , aδ) A A do 3: for all j = 1, . . . , n + 1 do 4: Solve Problem (6) instantiated with (a , aδ) s.t. Eqs. (6a) to (6c), and let an optimal solution be p 5: if Ψ(p ) > ψ then 6: p p 7: ψ Ψ(p ) 8: return p 4.4. Putting All Together We summarize our results in this section into Algorithm 1 and the following main theorem. Theorem 1. Algorithm 1 computes an optimal δ-robust contract in polynomial time. Proof Sketch. The polynomial runtime of Algorithm 1 is obvious as it enumerates O(n3) value combinations and solves an LP for each of them. To see the correctness of the algorithm, note that the inner for-loop of Algorithm 1 effectively solves Problem (2), for the pair (a , aδ) enumerated in the outer for-loop. Now, if a and aδ happen to be the agent s exact and δ-best responses, respectively, under some optimal contract, then according to Lemma 1, the outer loop produces an optimal contract p , with Ψ(p ) Ψ(p) for all p Rm +. By comparing the Ψ values, the algorithm identifies and outputs such an optimal contract. 5. Learning Robust Contracts The algorithm in Section 4 works under the assumption that the principal has full knowledge of all the payoff-relevant information about the agent and, thus, they can compute an optimal δ-robust contract. In this section, we address the case in which the principal has no such knowledge, and, thus, they have to learn an optimal δ-robust contract in an online fashion, by repeatedly interacting with the agent. 5.1. Learning Interaction We consider an online learning framework similar to the one studied by Zhu et al. (2023), in which the features of agent s actions, i.e., costs and probabilities over outcomes, depend on an agent s type that is sampled at each round from some (fixed) unknown probability distribution. Before formally defining the online learning framework, we introduce some notation that is needed to deal with hidden-action principalagent problems in which the agent can be of different types. Hidden-Action Principal-Agent Problems With Types We let Θ be the finite set of possible agent s types, while Fθ,a and cθ,a denote the probability distribution over outcomes and the cost, respectively, of action a A, when the agent has type θ Θ. The agent s type is drawn from an unknown probability distribution λ Θ, with λθ [0, 1] being the probability of type θ Θ. We extend all the notation introduced in Section 2 to account for agent s types. Specifically, given δ (0, 1), we let Aθ,δ(p) A be the set of δ-best responses to a contract p Rm + for type θ Θ. Moreover, we denote by aθ,δ(p) the δ-best response that is played by the agent under a δ-robust contract, namely the worst one for the principal. Formally, aθ,δ(p) arg mina Aδ(p) Fθ,a (r p). Similarly, we use Aθ(p) and aθ(p) for (exact) best responses. Finally, the principal s expected utility when committing to a contract p Rm + against an agent of type θ Θ playing an action a A is u P(p, a, θ) := Fθ,a (r p). We study an online learning framework in which the principal and the agent repeatedly interact over T > 0 rounds. Each round involves a repetition of the same instance of hidden-action principal-agent problem, with only the agent s type changing from round to round. Specifically, at each round t [T], the principal-agent interaction is as follows: 1. The agent s type θt Θ is sampled from the distribution λ. Notice that θt is not observed by the principal. 2. The principal commits to a contract pt C := [0, 1]m. 3. After observing the contract pt, the agent plays a δ-best response at := aθt,δ(pt). Notice that the action at is not observed by the principal. 4. The principal observes the outcome ωt Fat that is realized as an effect of the agent s action at. We remark that, in the interaction described above, the principal s contract design space is assumed to be limited to the hypercube C := [0, 1]m. Restricting the contract space to a bounded set is standard in the literature on online learning in contract design (see, e.g., (Ho et al., 2014; Zhu et al., 2023; Chen et al., 2024)) and it is motivated by the negative result proved in Theorem 1 in (Bacchiocchi et al., 2024). Therefore, we need to introduce a formal definition of the principal s expected utility when committing to an optimal δ-robust contract restricted to such a space. Formally: OPT(C, δ) := max p C θ Θ λθu P(p, aθ,δ(p), θ). The goal of the principal is to minimize the (cumulative) pseudo-regret, or simply regret, which can be defined as: RT (C, δ):=T OPT(C, δ) E t [T ],θ Θ λθu P(pt, aθ,δ(pt), θ) Contract Design Under Approximate Best Responses where the expectation is over the randomness of the algorithm and the environment that generates the feedback received by the learner at each round. Our goal is to design a no-regret algorithm for the principal, namely one achieving RT (C, δ) = o(T). 5.2. A No-Regret Algorithm Before introducing our no-regret learning algorithm (Algorithm 2), we need to prove a key lemma about the continuity of the principal s expected utility over the hypercube. In particular, given a δ-robust contract p Rm +, we show that it is possible to build another contract p Rm + that is (δ + ϵ)-robust and provides the principal with utility at most 2 ϵ worse than the one of contract p. Formally: Lemma 3. Given any δ, ϵ (0, 1) and a contract p Rm +, the following holds for every θ Θ: u P(p , aθ,δ+ϵ(p ), θ) u P(p, aθ,δ(p), θ) 2 ϵ, where p := (1 ϵ)p + ϵr. We observe that the idea of shifting payments towards the principal s reward vector, as in Lemma 3, was first adopted by Dutting et al. (2021) in non-robust settings, in order to deal with approximately incentive-compatible contracts. Thanks to Lemma 3, it is possible to show that there always exists a contract p Bϵ that satisfies the following condition, where Bϵ C is a uniform grid of the hypercube C, built with step size ϵ > 0 (see also Algorithm 2). X θ Θ λθu P(p, aθ,δ(p), θ) OPT(C, δ) O( ϵ). Consequently, by suitably choosing ϵ > 0 as a function of the time horizon T, and by instantiating a no-regret algorithm with set of arms Bϵ, we can upper bound the regret suffered by Algorithm 2 as follows. Theorem 2. The regret suffered by Algorithm 2 can be upper bounded as follows: RT (C, δ) e O T 1 1 2(m+1) . We remark that the regret lower bound introduced by Zhu et al. (2023) also holds in our setting. This is because, when the parameter δ is arbitrarily small, if we consider the same instances used by Zhu et al. (2023) in their lower bound, the principal is still required to enumerate an exponential number of regions, thus suffering Ω(T 1 1/(m+2)) regret. This confirms that an exponential dependence on the number of outcomes is unavoidable in the regret suffered by any algorithm. Furthermore, Theorem 2 shows regret guarantees Algorithm 2 Regret minimizer for δ-robust contracts Require: T > 0 1: Set ϵ T 1 m+1 2: Bϵ {p [0, 1]m | pω {0, ϵ, 2ϵ, ..., 1} ω Ω} 3: Run UCB1 with Bϵ as set of arms similar to those obtained by Zhu et al. (2023) in the nonrobust version of the problem. Indeed, they achieve an upper bound of the order of e O( m T 1 1/(2m+1)) on the regret. Finally, we also show that, when the parameter δ (0, 1) is sufficiently small (with respect to the time horizon T > 0), Algorithm 2 achieves sublinear regret with respect to an optimal non-robust contract within C. Formally, we define the value of such a contract as follows: OPT(C) := max p C θ Θ λθu P(p, aθ(p), θ). Furthermore, when OPT(C) is chosen as baseline, the regret definition becomes the following: RT (C):=T OPT(C) E t [T ],θ Θ λθu P(pt, aθ,δ(pt), θ) where the expectation is over the randomness of the algorithm and the environment that generates the feedback received by the learner at each round. We observe that the baseline in the regret definition above coincides with the one introduced by Zhu et al. (2023). Then, we can prove that the following corollary of Theorem 2 holds. Corollary 1. The regret suffered by Algorithm 2 can be upper bounded as follows: RT (C) e O T 1 1 2(m+1) + 2 We remark that, if we set δ = 1/T α with α > 0 in Corollary 1, then the regret with respect to an optimal non-robust contract within the hypercube C is sublinear. Remark 2. Our algorithm can be extended to deal with settings in which the agent can play any δ-best response within the set Aθt,δ(pt). In such a setting, the principal s utility is not fully stochastic. However, our Algorithm 2 can be easily extended by instantiating an adversarial no-regret algorithm instead of UCB1 (Auer et al., 2002). Comparison With (Zhu et al., 2023) We observe that our algorithm provides some advantages compared to the stateof-the-art algorithm proposed by Zhu et al. (2023). First, our approach employs as set of contracts a simple discretization of the hypercube, while the approach by Zhu et al. (2023) requires determining the minimum set of contracts Vϵ(C) for some suitable ϵ > 0 such that, for every p C, there Contract Design Under Approximate Best Responses exists a p Vϵ(C) that satisfies (r p) p cos(ϵ). However, the need of designing such a set of contracts makes the approach by Zhu et al. (2023) challenging to be employed in practice compared to ours. Second, our algorithm does not require apriori knowledge of the principal s reward, which is instead required by the algorithm of Zhu et al. (2023). Acknowledgments This work was supported by the Italian MIUR PRIN 2022 Project Targeted Learning Dynamics: Computing Efficient and Fair Equilibria through No-Regret Algorithms , by the FAIR (Future Artificial Intelligence Research) project, funded by the Next Generation EU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence), and by the EU Horizon project ELIAS (European Lighthouse of AI for Sustainability, No. 101120237). Impact Statement This paper presents theoretical results and has the goal of advancing the field of Machine Learning. There are no potential societal consequences of our work which we feel must be highlighted here. Alon, T., Duetting, P., Li, Y., and Talgam-Cohen, I. Bayesian analysis of linear contracts. In Proceedings of the 24th ACM Conference on Economics and Computation, EC 23, pp. 66, 2023. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235 256, 2002. Babaioff, M., Feldman, M., and Nisan, N. Combinatorial agency. In Proceedings of the 7th ACM Conference on Electronic Commerce, pp. 18 28, 2006. Babaioff, M., Feldman, M., and Nisan, N. Free-riding and free-labor in combinatorial agency. In International Symposium on Algorithmic Game Theory, pp. 109 121. Springer, 2009. Bacchiocchi, F., Castiglioni, M., Marchesi, A., and Gatti, N. Learning optimal contracts: How to exploit small action spaces. In The Twelfth International Conference on Learning Representations, 2024. Bastani, H., Bayati, M., Braverman, M., Gummadi, R., and Johari, R. Analysis of medicare pay-for-performance contracts. Available at SSRN 2839143, 2016. Bastani, H., Goh, J., and Bayati, M. Evidence of upcoding in pay-for-performance programs. Management Science, 65(3):1042 1060, 2019. Bernasconi, M., Castiglioni, M., and Marchesi, A. Regretminimizing contracts: Agency under uncertainty, 2024. URL https://arxiv.org/abs/2402.13156. Cai, Y., Daskalakis, C., and Papadimitriou, C. Optimum statistical estimation with strategic data sources. In Conference on Learning Theory, pp. 280 296. PMLR, 2015. Carroll, G. Robustness and linear contracts. American Economic Review, 105(2):536 563, 2015. Castiglioni, M., Marchesi, A., and Gatti, N. Bayesian agency: Linear versus tractable contracts. Artificial Intelligence, 307:103684, 2022a. Castiglioni, M., Marchesi, A., and Gatti, N. Designing menus of contracts efficiently: The power of randomization. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 705 735, 2022b. Castiglioni, M., Marchesi, A., and Gatti, N. Multi-agent contract design: How to commission multiple agents with individual outcomes. In Proceedings of the 24th ACM Conference on Economics and Computation, pp. 412 448, 2023. Chen, Y., Chen, Z., Deng, X., and Huang, Z. Are bounded contracts learnable and approximately optimal? ar Xiv preprint ar Xiv:2402.14486, 2024. Cong, L. W. and He, Z. Blockchain disruption and smart contracts. The Review of Financial Studies, 32(5):1754 1797, 2019. Duetting, P., Ezra, T., Feldman, M., and Kesselheim, T. Multi-agent combinatorial contracts, 2024. D utting, P., Roughgarden, T., and Talgam-Cohen, I. Simple versus optimal contracts. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 369 387, 2019. Dutting, P., Roughgarden, T., and Talgam-Cohen, I. The complexity of contracts. SIAM Journal on Computing, 50(1):211 254, 2021. D utting, P., Feldman, M., Talgam-Cohen, I., et al. Algorithmic contract theory: A survey. Foundations and Trends in Theoretical Computer Science, 16(3-4):211 412, 2024. Gan, J., Han, M., Wu, J., and Xu, H. Robust stackelberg equilibria. In Proceedings of the 24th ACM Conference on Economics and Computation, EC 23, pp. 735, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701047. doi: 10. 1145/3580507.3597680. URL https://doi.org/ 10.1145/3580507.3597680. Contract Design Under Approximate Best Responses Guruganesh, G., Schneider, J., and Wang, J. R. Contracts under moral hazard and adverse selection. In Proceedings of the 22nd ACM Conference on Economics and Computation, pp. 563 582, 2021. Hadfield-Menell, D. and Hadfield, G. K. Incomplete contracting and ai alignment. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 417 422, 2019. Ho, C.-J., Slivkins, A., and Vaughan, J. W. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC 14, pp. 359 376, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450325653. doi: 10.1145/2600057.2602880. Holmstrom, B. and Milgrom, P. Multitask principal agent analyses: Incentive contracts, asset ownership, and job design. The Journal of Law, Economics, and Organization, 7(special issue):24 52, 1991. Kaynar, N. and Siddiq, A. Estimating effects of incentive contracts in online labor platforms. Management Science, 69(4):2106 2126, 2023. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Saig, E., Einav, O., and Talgam-Cohen, I. Incentivizing quality text generation via statistical contracts. ar Xiv preprint ar Xiv:2406.11118, 2024. Yang, K. and Zhang, H. Computational aspects of bayesian persuasion under approximate best response. ar Xiv preprint ar Xiv:2402.07426, 2024. Zhu, B., Bates, S., Yang, Z., Wang, Y., Jiao, J., and Jordan, M. I. The sample complexity of online contract design. In Proceedings of the 24th ACM Conference on Economics and Computation, EC 23, pp. 1188, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701047. doi: 10. 1145/3580507.3597673. URL https://doi.org/ 10.1145/3580507.3597673. Contract Design Under Approximate Best Responses Proposition 1 (Upper and lower bounds). Given an instance of hidden-action principal-agent problem: 1. For every δ (0, 1), it holds: OPT(δ) OPTLB(δ) := OPT 2 2. For every δ (0, 1), it holds: OPT(δ) OPTUB(δ) := max {0, SW δ} . Proof. We prove the two points separately. 1. Let p = (1 δr for any contract p Rm +. We start by observing that the following inequality holds: X ω Ω Fa(p),ωpω ca(p) X ω Ω Faδ(p ),ωpω caδ(p ), because a(p) A(p). Furthermore, we have: X ω Ω Faδ(p ),ωp ω caδ(p ) > max a A ω Ω Fa ,ωpω ca δ X ω Ω Fa(p),ωp ω ca(p) δ, thanks to the definition of Aδ(p ). Then, by employing the two above inequalities and the definition of p , we have: ω Ω Fa(p),ωp ω ca(p) ω Ω Faδ(p ),ωp ω caδ(p ) ω Ω Fa(p),ωpω ca(p) ω Ω Faδ(p ),ωpω caδ(p ) Fa(p),ω Faδ(p),ω (rω pω) Fa(p),ω Faδ(p),ω (rω pω) Thus, by rearranging the latter inequality we get: Fa(p),ω Faδ(p),ω (rω pω) Finally, we can show that: u P(p, a(p)) u P(p , aδ(p )) = X ω Ω Fa(p),ω(rω pω) ω Ω Faδ(p ),ω(rω p ω) ω Ω Fa(p),ω(rω pω) 1 ω Ω Faδ(p ),ω(rω pω) ω Ω (Fa(p),ω Faδ(p ),ω)(rω pω) where the second equality holds thanks to the definition of p , the first inequality because u P(p, a(p)) 1 for each p Rm + and the second inequality because of Eq. (7). Finally, let p R be an optimal (non-robust) contract, then we have: δ + δ u P(p , aδ(p )) OPT(δ), concluding the first part of the proof. Contract Design Under Approximate Best Responses 2. We split the proof into two parts. (a) if OPT(δ) = 0. Then, we trivially have : 0 = OPT(δ) max ω Ω Fa,ωrω ca δ = max (0, SW δ) (b) if OPT(δ) > 0. We define p Rm + as an optimal δ-robust contract. Then, we notice that a1 Aδ(p ), where a1 is the opt-out action. Indeed, if a1 Aδ(p ), then the agent may select the action a1 as a δ-robust best-response, which provides zero or negative utility to the principal and contradicts the fact that OPT(δ) > 0. Therefore, in an optimal robust contract p , we must have: X ω Ω Fa ,ωp ω ca X ω Ω Fa1,ωp ω ca1 + δ X ω Ω Fa1,ωp ω + δ δ, for some a Aδ(p ). Thus, we have: ω Ω Faδ(p),ω(rω p ω) ω Ω Fa ,ω(rω p ω) ω Ω Fa ,ωrω ca δ ω Ω Fa,ωrω ca δ ω Ω Fa,ωrω ca δ = max (0, SW δ) . The first inequality above follows from the fact that a Aδ(p ), while the second inequality holds due to the previous observation. The two above points conclude the proof. Proposition 2. Given δ (0, 1), for every integer n > n(δ), there is an instance of hidden-action principal-agent problem (parametrized by δ) with 2n + 1 actions, where: OPT(δ) OPTLB(δ) O 1 Furthermore, for any δ (0, 1), there exists an instance of hidden-action principal-agent problem in which it holds: OPT(δ) = OPTUB(δ). Proof. We prove the two points separately. 1. We consider an instance parametrized by δ > 0, with |Ω| = 2 and |A| = 2n + 1 for some n N, to be defined below. We let: Furthermore, we introduce the following decreasing sequence: ( i i 1 i = κ, . . . , n 2n+1 i 2n+2 i i = n + 2, . . . , 2n, Contract Design Under Approximate Best Responses with γn+1 = 1 and γ2n+1 = 0, where n N is such that n > κ. The distributions over the set of outcomes for the different actions and their corresponding costs are given by: Fai,ω1 = 0 cai = 0, i = 1, . . . , κ 1. Fai,ω1 = 1 γi δ cai = 0, i = κ, . . . , n. Fa2n+1,ω1 = 1 ca2n+1 = 0. and the principal s reward is r = (1, 0). Notice that the distribution over outcomes of the different actions are always well defined because of the definition of κ > 0. Furthermore, since all the agent s actions ai with i < k are coincident, we assume for the sake of presentation that the agent always selects a1. It is easy to verify that the value of an optimal (non-robust) contract is OPT = 1 since the agent breaks ties optimistically and c2n+1 = 0. With a similar argument to the one proposed by (D utting et al., 2024) in Proposition 3.9, an optimal δ-robust contract is such that p = (0, α) for some α R+. Consequently, in the rest of the proof, we focus on determining the maximum utility achievable in a linear, δ-robust contract. For every ai A, with κ < i 2n + 1, if α R+ is such that aδ(αr) = ai, then the two following conditions hold. All the actions aj Aδ(αr) for each j < i. This is because u P(α, aj) u P(α, ai) for each α R+ since {γi}i [2n+1] is a decreasing sequence. Thus, the latter observation implies that: u A(α, aj) u A(α, ai 1) = αRai 1 cai 1 max i [2n+1] u A(α, ai) δ = u A(α, a2n+1) δ = αRa2n+1 ca2n+1 δ Therefore, we have that: and, thus, α δ/γi 1. The action ai Aδ(αr). Thus, αRai cai > max i [2n+1] u A(α, ai) δ = α δ, which implies that α < The two observation above shows that ai, with i > κ, is a δ-best response for all the values of α such that: Thus, the smallest value of α R+ such that aδ(αr) = ai is αi := δ/γi 1, when i > κ. With the same argument above, it is possible to show that a1 = aδ(αr) for all α [α1, ακ) and aκ = aδ(αr) for all α [ακ, ακ+1), with α1 = 0 and ακ = δ. We also observe that the principal s utility in each αi, with i {κ + 1, . . . , 2n + 1}, is such that: u P(αi, ai) = (1 αi)Rai = = 1 γi + 1 γi 1 δ + γi γi 1 δ. Therefore, using the latter formula, we have that the following holds. Contract Design Under Approximate Best Responses If i = κ + 1, . . . , n, we have: δ + γi γi 1 δ 1 2 since {γi}i κ is a decreasing sequence. If i = n + 1, n + 2, we have: u(αi) 1 1 + n 1 If i = n + 3, . . . , 2n + 1, we have: δ + γi γi 1 δ 1 2 since {γi}i κ is a decreasing sequence. We also observe that the principal s utility in ακ is such that: u(ακ) = (1 δ)(1 γκ since γk > 1 and δ > 0. Thanks to the definition of κ, with a simple calculation, it possible to show that: δ + 1 if 1 1 Thus, when 1/(1 δ) N, we have: u(ακ) = (1 δ)(1 γκ Similarly, it is possible to show that the same relation holds even when 1/(1 δ) N. Finally, by putting all together, we have that: OPT(δ) OPTLB(δ) max i [2n+1] u(αi) (1 2 = u(αn+1) (1 2 concluding the first part of the proof. 2. We consider an instance with |Ω| = |A| = 2 and r = (0, 1). The distributions over the set of outcomes for the different actions and their corresponding costs are given by: ( Fa1 = (1, 0) ca1 = 0 Fa2 = (0, 1) ca2 = 0. It is easy to verify that SW = Ra2 ca2 = 1. Furthermore, with a similar argument to the one proposed by (D utting et al., 2024) in Proposition 3.9, an optimal δ-robust contract is such that p = (0, α) for some α R+. We show that OPT(δ) = 1 δ. Let a1 A be the opt-out action. We observe that a1 Aδ(αr) for all the α R+ that satisfy: αRa2 ca2 = α αRa1 ca1 + δ = δ. Thus, a1 Aδ(αr) for all α δ. Therefore, the smallest α R+ such that the agent selects the action a2 coincides with δ > 0. Thus, the largest utility the principal can achieve satisfies OPT(δ) = 1 δ = SW δ. Contract Design Under Approximate Best Responses The two points above conclude the proof. Proposition 3. The function (0, 1) δ 7 OPT(δ) is continuous and non-increasing in δ. Moreover, limδ 0+ OPT(δ) = OPT and limδ 1 OPT(δ) = 0. Proof. We prove the claims separately. We first show that OPT(δ) is a continuous function. Let p Rm + be a δ-robust and p = (1 ϵ)p + ϵr. Then, by Lemma 3, we have: OPT(δ + ϵ) u P(p , aδ+ϵ(p )) u P(p , aδ(p )) 2 ϵ + ϵ = OPT(δ) 2 ϵ + ϵ. Thus, rewriting the latter inequality and taking the limit, we have that: 0 = lim ϵ 0(2 ϵ ϵ) lim ϵ 0 OPT(δ) OPT(δ + ϵ). Let p Rm + be a δ -robust and p = (1 ϵ)p + ϵr with δ = δ ϵ. Then, by Lemma 3, we have: OPT(δ) = OPT(δ + ϵ) u P(p , aδ +ϵ(p )) u P(p , aδ (p )) 2 ϵ + ϵ = OPT(δ ϵ) 2 ϵ + ϵ. Thus, by taking the limit, we have that: 0 = lim ϵ 0(2 ϵ ϵ) lim ϵ 0 OPT(δ ϵ) OPT(δ). Since the latter considerations hold for every δ (0, 1), we can easily show that OPT(δ) is continuous in δ (0, 1). We now prove that OPT(δ) is non-increasing. Let δ, δ (0, 1) such that δ < δ. Furthermore, let p Rm + be an optimal δ-robust contract. Therefore, we have: OPT(δ) = u(aδ(p ), p ) u(aδ (p ), p ) OPT(δ ), since, for each p Rm +, we have Aδ (p) Aδ(p). By Proposition 1, we have: lim δ 0+ OPT(δ) lim δ 0+ OPT 2 δ + δ = OPT, and, lim δ 1 OPT(δ) lim δ 1 1 δ = 0. The two points above conclude the proof. Lemma 3. Given any δ, ϵ (0, 1) and a contract p Rm +, the following holds for every θ Θ: u P(p , aθ,δ+ϵ(p ), θ) u P(p, aθ,δ(p), θ) 2 ϵ, where p := (1 ϵ)p + ϵr. Proof. For the sake of the presentation, we avoid the dependence on the type θ Θ. We split the proof in two cases: 1. If aδ+ϵ(p ) Aδ(p), then aδ+ϵ(p ) is not a δ-best-response in p. Therefore, we have that: X ω Ω Faδ(p),ωpω caδ(p) X ω Ω Faδ+ϵ(p ),ωpω caδ+ϵ(p ) + δ, ω Ω Faδ+ϵ(p ),ωp ω caδ+ϵ(p ) X ω Ω Faδ(p),ωp ω caδ(p) δ ϵ. Contract Design Under Approximate Best Responses Then, by taking the summation of the above quantities and employing the definition of p we get: ϵ X ω Ω (Faδ(p),ω Faδ+ϵ(p ),ω)(rω pω). Therefore, we can prove that the following holds: u P(p, aδ(p)) u P(p , aδ+ϵ(p )) = X ω Ω Faδ(p),ω(rω pω) ω Ω Faδ+ϵ(p ),ω(rω p ω) ω Ω Faδ(p),ω(rω pω) 1 ϵ X ω Ω Faδ+ϵ(p ),ω(rω pω) ω Ω (Faδ(p),ω Faδ+ϵ(p ),ω)(rω pω) 2. if aδ+ϵ(p ) Aδ(p), then one of the following hold. (a) if aδ(p) Aδ+ϵ(p ), then either aδ(p) aδ+ϵ(p ) or aδ(p) provides the same principal s utility as aδ+ϵ(p ) in p . Indeed, suppose by contradiction that the following holds: X ω Ω Faδ+ϵ(p ),ω(rω p ω) < X ω Ω Faδ(p),ω(rω p ω). Then, we have: X ω Ω Faδ+ϵ(p ),ω(rω p ω) = (1 ϵ) X ω Ω Faδ+ϵ(p ),ω(rω pω) ω Ω Faδ(p),ω(rω pω) ω Ω Faδ(p),ω(rω p ω). where the inequality is a consequence of the fact that both aδ(p) and aδ+ϵ(p ) belongs to Aδ(p). This, reaches a contradiction with the initial assumption. Thus, we have that: u P(p , aδ+ϵ(p )) = X ω Ω Faδ+ϵ(p ),ω(rω p ω) ω Ω Faδ(p),ω(rω p ω) ω Ω Faδ(p),ω(rω pω) ω Ω Faδ(p),ω(rω pω) ϵ u P(p, aδ(p)) ϵ. (b) if aδ(p) Aδ+ϵ(p ), then the following holds: X ω Ω Faδ(p ),ωpω caδ(p) X ω Ω Faδ+ϵ(p ),ωpω caδ+ϵ(p ) δ, ω Ω Faδ+ϵ(p ),ωp ω caδ+ϵ(p ) X ω Ω Faδ(p),ωp ω caδ(p) + δ + ϵ. Contract Design Under Approximate Best Responses Then, by taking the summation of the above quantities and employing the definition of p we get: ω Ω (Faδ(p),ω Faδ+ϵ(p ),ω)(rω pω). Then, using the same argument employed at point (a) we can show that: u P(p, aδ(p)) u P(p , aδ+ϵ(p )) ϵ, concluding the proof. Theorem 2. The regret suffered by Algorithm 2 can be upper bounded as follows: RT (C, δ) e O T 1 1 2(m+1) . Proof. In the following, we let: 1. p be such that P θ Θ λθu P(p , aθ,δ(p ), θ) = OPT(C, δ). 2ϵr. Notice that p [0, 1]m since both p, r C and C is convex. 3. p Bϵ be such that p p ϵ. Notice that there always exists at least once contract p satisfying the latter condition because of the definition of Bϵ and the fact that p [0, 1]m. We start by observing that, for each θ Θ, it holds aθ,δ(p) Aθ,δ+2ϵ(p ). Indeed, we have: X ω Ω Fθ,aθ,δ(p),ωp ω cθ,aθ,δ(p) X ω Ω Fθ,aθ,δ(p),ωpω cθ,aθ,δ(p) ϵ ω Ω Fθ,a,ωpω cθ,a ϵ δ ω Ω Fθ,a,ωp ω cθ,a 2ϵ δ, for every actions a A. The above inequalities follows since: X ω Ω Fθ,a,ω(pω p ω) Fθ,a 1 p p ϵ, for every actions a A and θ Θ. Therefore, we can prove that the following holds: X θ Θ λθu P(p, aθ,δ(p), θ) X θ Θ λθu P(p , aθ,δ(p), θ) ϵ θ Θ λθu P(p , aθ,δ+2ϵ(p ), θ) ϵ θ Θ λθu P(p , aθ,δ(p ), θ) 2 = OPT(C, δ) 2 where the first inequality above holds because P ω ΩFθ,a,ω(pω p ω) Fθ,a 1 p p ϵ for every actions a A and type θ Θ. The second inequality holds because of the definition of δ-best response, while the third inequality follows from Lemma 3. At this point, we can decompose the cumulative regret as follows: RT (C, δ) = T OPT(C, δ) E h T X θ Θ λθu P(pt, aθ,δ(pt), θ) i Contract Design Under Approximate Best Responses T OPT(C, δ) T max p Bϵ θ Θ λθu P(p, aθ,δ(p), θ) + T max p Bϵ θ Θ λθu P(p, aθ,δ(p), θ) E h X t [T ],θ Θ λθu P(pt, aθ,δ(pt), θ) i We focus on bounding term (a) in Eq. (9). Let p [0, 1]m be the contract defined at point 3. Then, we have: T OPT(C, δ) T max p Bϵ θ Θ λθu P(pt, aθ,δ(pt), θ) T OPT(C, δ) T X θ Θ λθu P( p, aθ,δ( p), θ) where the last inequality holds because of Eq. (8). We focus on bounding term (b) in Eq. (9). By employing the same analysis to prove the regret bound in UCB1 (see, e.g., (Lattimore & Szepesv ari, 2020)), we obtain: θ Θ λθu P(p, aθ,δ(p), θ) E h X θ Θ λθu P(pt, aθ,δ(pt), θ) i O p T|Bϵ| log(T) . (10) Finally, observing that |Bϵ| O((1/ϵ)m), by putting (a) and (b) in Eq. (9) together and setting ϵ = T 1 m+1 , we have: RT (C, δ) e O T 1 1 2(m+1) , concluding the proof. Corollary 1. The regret suffered by Algorithm 2 can be upper bounded as follows: RT (C) e O T 1 1 2(m+1) + 2 Proof. Let p be an optimal non-robust contract inside C. Then we have: RT (C) = T OPT(C) E h T X θ Θ λθu P(pt, aθ,δ(pt), θ) i = T (OPT(C) OPT(C, δ)) + T OPT(C, δ) E h T X θ Θ λθu P(pt, aθ,δ(pt), θ) i δT + T OPT(C, δ) E h T X θ Θ λθu P(pt, aθ,δ(pt), θ) i δT + e O T 1 1 2(m+1) , where the first inequality holds by employing the same argument needed to prove Proposition 1, and the second inequality holds thanks to Theorem 2, concluding the proof.