# Delegated Classification

Eden Saig, Inbal Talgam-Cohen, Nir Rosenfeld
Technion - Israel Institute of Technology, Haifa, Israel
{edens,italgam,nirr}@cs.technion.ac.il

Abstract

When machine learning is outsourced to a rational agent, conflicts of interest might arise and severely impact predictive performance. In this work, we propose a theoretical framework for incentive-aware delegation of machine learning tasks. We model delegation as a principal-agent game, in which accurate learning can be incentivized by the principal using performance-based contracts. Adapting the economic theory of contract design to this setting, we define budget-optimal contracts and prove they take a simple threshold form under reasonable assumptions. In the binary-action case, the optimality of such contracts is shown to be equivalent to the classic Neyman-Pearson lemma, establishing a formal connection between contract design and statistical hypothesis testing. Empirically, we demonstrate that budget-optimal contracts can be constructed using small-scale data, leveraging recent advances in the study of learning curves and scaling laws. Performance and economic outcomes are evaluated using synthetic and real-world classification tasks.

1 Introduction

The acclaimed success of machine learning at effectively solving difficult prediction tasks across diverse problem domains has made it highly appealing for firms, institutions, and individual practitioners. But machine learning has also become increasingly complex, cumbersome, and difficult to operate, and not all those who seek to learn have access to the necessary expertise, infrastructure, and designated resources required for learning effectively. This gap has created a new market for outsourced machine learning, in which a client interested in obtaining an accurate predictive model can hire the services of a specialized provider which, for a price, trains the model on their behalf. Consider for example a hospital purchasing a classifier for deciding between hospitalization and outpatient treatment when triaging patients. The provider invests in curating, cleaning and annotating training data, and delivers a trained model in return for payment from the hospital. Having a budget to expend on outsourced learning [51], we model the client as aiming to obtain the best possible predictive model. At first glance, it is tempting to assume that the optimal strategy is simply to pay the provider the maximal feasible amount and hope to get a high-end model in return. After all, if the client were to spend the budget directly on learning, investing the maximal available sum would yield the best possible results. But this neglects to account for the incentives of the provider, who is interested in maximizing profit. Since the actions of the provider remain private, it is in his best interest to (secretly) minimize efforts, which in turn can result in his delivering a suboptimally-trained model. In our example, the provider can cut costs by annotating only a subset of the data, obtaining cheaper low-quality annotations, or neglecting to meticulously remove all outliers. Outsourced learning is hence susceptible to moral hazard, an economic situation which might occur under information asymmetry, and to the detriment of the client.
Motivated by this observation, in this paper we initiate the study of delegated learning, and aim to explore the economic, algorithmic, and statistical implications that arise when learning is delegated to a specialized provider. Our key novelty is in instantiating delegated learning as a problem of optimal contract design [10, 42, 47, 48]. Broadly, contracts are an important monetary device that allows the client to establish a payment scheme which, if properly set, serves to align incentives and guarantee that both parties are well-off. Our main challenge is to design effective contracts specialized to the task of delegated learning on a budget.

Figure 1: Delegated classification interaction sequence. The principal examines initial information, and designs a contract $t : \{0, \dots, m\} \to \mathbb{R}_{\geq 0}$. Having observed $t$, the agent strategically selects a dataset size $n$ that will maximize his expected utility. He samples a training set $S \sim D^n$, incurs cost $c_n$, then trains the classifier $h \in H$ and sends it to the principal. Upon receiving $h$, the principal evaluates its accuracy on a random validation set $V \sim D^m$, and pays the agent $t(j)$ monetary units according to the contract $t$.

Towards this, we begin with a conventional supervised classification setup, and impose economic structure by assuming that acquiring training examples is costly. We then conceptually split the conventional self-sufficient learner into two rational entities: a principal, who controls the budget and is interested in maximizing predictive accuracy; and an agent, who controls learning (in particular the training set) and is interested in maximizing profit. This allows us to model principal-agent relations as a Stackelberg game, in which the principal commits to a contract $t$, determining a priori the amount to be paid for every possible (stochastic) level of obtained accuracy. The agent best-responds to the contract by choosing the profit-maximizing number of samples $n$, and training the predictive model. Under this setting, we study the algorithmic problem of designing an optimal contract. As is standard in economic analysis, we begin with the assumption that the principal has full information on the distribution of possible outcomes for each of the agent's possible actions. In our setting, actions correspond to the number of training samples, and outcomes to the empirical classifier accuracy; thus, the main object of interest for contract design in delegated learning settings is the learning curve, which describes the (stochastic) performance of learning per sample size. Under certain plausible conditions on the learning curve, namely MLRP and a certain notion of concavity, our main result here is that optimal contracts are simple, and in particular, take on the form of simple threshold functions. Simple contracts are appealing because they are straightforward to understand and communicate; in our setting, they are also easy to compute, and we give a closed-form solution for the optimal threshold contract.
Providing an interpretation of the closed-form solution, our findings establish a new connection between contracts and the renowned Neyman-Pearson lemma [45], which we consider one of our central theoretical contributions. We then switch gears and turn to empirically studying the construction of contracts from partial information. In particular, we consider a setting where the principal can only estimate the learning curve from small available data (e.g., by bootstrapping on small $n$ and extrapolating). Using the recent LCDB dataset of learning curves [43], we show that threshold contracts generally perform well on estimated curves despite the inherent uncertainty. We also explore the role of different parameters of our setup, consider various tradeoffs in curve-fitting and contract design, and discuss limitations by pointing out certain failure modes to which contracts may be susceptible. Taken together, our results shed light on why and how simple contracts for delegated learning work to correctly balance between the incentives of both delegator and delegatee in outsourced learning.

1.1 Related work

Previous works have considered delegation of ML-related tasks that differ from our task of training a classifier: labeling of data points in [14], gathering information in [16], and computing a costly high-dimensional function in [5]. In the delegated task of [24], the agent provides a classifier and the principal verifies its near-optimality within $H$. A fundamental difference is that their agent is assumed to be adversarial rather than rational, and so interactive proofs are used instead of economic incentives. The Neyman-Pearson lemma has recently been connected to economic design by [8], in the context of adverse selection rather than moral hazard: the agent has a hidden type (e.g., whether a new drug is effective), and the optimal menu to offer this agent is designed based on the theory of e-values. When the hidden type has binary support, the method is equivalent to Neyman-Pearson. A link between the design of contracts (for a non-budgeted principal) and statistical inference (in particular likelihood ratios) was observed already in [25], but no connection was made to the power of hypothesis tests. Other intersections of ML and contracts that do not involve delegation of learning-related tasks include strategic classification [e.g., 36, 37, 3] and online learning of optimal contracts [29, 17, 52]. Contract settings with a binary action and/or outcome space have been studied in [e.g., 20, 7, 22]. Not to be confused with our notion of delegation, there is a growing computational literature on delegation without monetary payments [e.g., 35]. To extrapolate from partial data, our work builds upon recent advances in the study of learning curves, which characterize the expected generalization of learning as a function of dataset size and other exogenous factors [49]. There is growing empirical evidence that the performance of modern neural networks can be predicted using simple scaling laws [e.g., 34, 41, 50, 23, 2, 46, 30], and theoretical results that back these findings in simplified settings [11, 12, 33, 9]. Perhaps closest to ours is the concurrent work of [4], which analyzes a similar setting under different assumptions: rather than assuming budget constraints, they assume that the principal's utility is linear in accuracy and payout, and that the learning curve takes a specific functional form.
Based on these assumptions, they derive approximately optimal linear contracts which are robust to adverse selection. In contrast, we obtain globally optimal simple contracts based on hypothesis testing, and present data-driven methods to handle partial information. Together, the two studies demonstrate the important role of simple contracts in the growing ecosystem of machine learning delegation.

2 Problem Setup

The core of our setting is based on a standard supervised classification task. Let $x \in X$ be features and $y \in Y$ be labels, and assume there is some unknown joint distribution $D$ over $(x, y)$ pairs. Given a sample set $S = \{(x_i, y_i)\}_{i=1}^n \sim D^n$, the goal in learning is to use $S$ to find a classifier $h : X \to Y$ from a class $H$ that maximizes expected accuracy, $\mathrm{acc}_D(h) = \mathbb{P}_{(x,y) \sim D}[h(x) = y]$. Because the underlying distribution $D$ is unknown, expected performance is estimated by the empirical average on an additional held-out validation set $V \sim D^m$ of size $m$, as $\mathrm{acc}_V(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}[h(x_i) = y_i]$, which is a consistent and unbiased estimator of $\mathrm{acc}_D(h)$. We will assume throughout that both the learning algorithm and the validation set size $m$ are known and fixed.

Learning on a budget. We will be interested in studying learning when certain resources are limited or costly. Our main focus will be on the setting where the main cost of learning is the number of labeled examples $n$, but we note that our approach can in principle extend to other forms of learning effort.¹ We assume the learner has a monetary budget $B$ to spend on samples, and is interested in maximizing accuracy under budget constraints. Let $c_n \geq 0$ be the cost of $n$ samples (assumed to be increasing in $n$); the learner then aims to solve:

$$n^\star = \mathrm{argmax}_n\; \mathbb{E}_{h_n}[\mathrm{acc}_D(h_n)] \quad \text{s.t.} \quad c_n \leq B \tag{1}$$

where $h_n$ is a classifier learned from a random dataset of size $|S| = n$. Note that $h_n$ is a random variable whose distribution depends on $n$. We denote the out-of-sample accuracy of $h_n$ by $\alpha_n = \mathrm{acc}_D(h_n)$. When the learner is a self-sufficient entity, and when the expected $\alpha_n$ improves monotonically in $n$, then $n^\star$ in Eq. (1) is naturally the largest affordable $n$ (see Sec. 3). However, as we will see, when learning is delegated this seemingly straightforward observation can break.

¹For example, [34] argue that not only training-set size, but also computation time and model size are related to accuracy through scaling laws, which have tight connections to the assumptions we discuss in Sec. 3.
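For illustration, here is a minimal sketch (with a hypothetical cost model; not the code released with the paper) of the self-sufficient learner's problem in Eq. (1) when the expected learning curve is monotone:

```python
import numpy as np

def self_sufficient_n(costs: np.ndarray, budget: float) -> int:
    """Solve Eq. (1) when E[acc_D(h_n)] is monotone increasing in n:
    the optimum is simply the largest affordable training-set size."""
    affordable = np.flatnonzero(costs <= budget)
    if affordable.size == 0:
        raise ValueError("budget cannot afford any sample size")
    return int(affordable.max())  # index of the chosen action

# Example with a hypothetical linear per-sample cost c_n = 0.01 * n.
sizes = np.array([16, 32, 64, 128, 256, 512])
costs = 0.01 * sizes
print(sizes[self_sufficient_n(costs, budget=3.0)])  # -> 256
```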
Delegation. We model the delegation of learning as a conceptual partition of the learner into two distinct entities: an agent, who controls learning; and a principal, who controls the validation process. The principal outsources the learning task to the agent, who in turn uses the training set $S$ to train the classifier $h$; once delivered, the principal validates the performance of $h$ using the validation set $V$. Whereas the classifier's accuracy benefits the principal alone, the cost of learning (i.e., the cost of acquiring $S$) is borne exclusively by the agent. Importantly, the amount of invested effort remains private to the agent; in our example, the principal cannot know how many examples received quality labeling. Because the agent seeks to maximize profit, the principal can use her budget as a source of monetary payment to incentivize the agent to invest in larger $|S| = n$. Intuitively, one could expect larger payments to entail larger $n$, and therefore higher-accuracy $h$. However, as we will see, this is not always the case, and careful planning is required in order to fully utilize a given budget.

Figure 2: A delegated classification setting (data from Sec. 4). (Left) Each costly action taken by the agent (training set size $n$) induces a distribution $f_n$ of possible outcomes (classifier accuracy). The principal seeks to construct a contract $t$ that incentivizes a profit-maximizing agent to take actions entailing favorable outcomes. Note the $f_n$ exhibit increasing expectation, but decreasing variance, in $n$. (Center) Three contracts for a given budget $B$, mapping outcomes to payments. (Top-right) Agent's utilities $u_n(t)$ and best responses $n(t)$ (stars) for each contract $t$. (Bottom-right) Expected accuracies for the principal resulting from each contract; here the threshold contract is optimal (see Sec. 3).

2.1 Delegation as contract design

As the training set remains private to the agent, there is an information gap between the two parties. This creates a conflict of interest for the agent known as moral hazard [10], in which the agent may be tempted to invest sub-par effort, while claiming that efforts were in fact his honest best. In economics, the celebrated solution to moral hazard is contracts [31]: pay-per-performance rules that a-priori determine future payments for every possible outcome, which we formally describe next.

Contract design. A contract setting is defined by a set of actions $A = \{a_1, \dots, a_N\}$ that can be taken by the agent, and a set of possible outcomes $j \in \{0, \dots, m\}$. Each action $a_i$ is associated with a cost $c_i$, and w.l.o.g. we assume $c_1 \leq \dots \leq c_N$ so that actions correspond to increasing effort levels. The agent's choice to perform action $a \in A$ yields a random outcome $j \sim f_a$ for the principal, where $f_a$ describes a distribution over the possible outcomes associated with action $a$. The principal, who does not observe the agent's chosen action, can incentivize the agent through a contract, $t : \{0, \dots, m\} \to \mathbb{R}_{\geq 0}$, according to which she pays the agent $t(j) \geq 0$ when the materialized outcome is $j$. Given contract $t$, let $u_a(t)$ be the agent's expected utility from taking action $a \in A$ at cost $c_a$ (via stochastic outcomes $j \sim f_a$), and let $a(t)$ be the agent's best response, an action that maximizes his expected utility (following standard tie-breaking assumptions as in [18]). Then:

$$u_a(t) = \mathbb{E}_{j \sim f_a}[t(j)] - c_a, \qquad a(t) \in \mathrm{argmax}_{a \in A}\, u_a(t). \tag{2}$$

Every action $a$ that is the best response $a = a(t)$ to some contract $t$ is called implementable. In economic terms, the principal and agent are playing a Stackelberg game, in which the principal commits to a contract $t$ and the agent best-responds by choosing the action $a(t)$ that maximizes his expected utility $u_a(t)$. The goal of the principal is to design a contract $t$ which incentivizes the agent to take best-response actions yielding favorable outcomes for the principal.
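For illustration, Eq. (2) can be computed directly when the outcome distributions $f_a$ are collected in a matrix $F$ with $F[a, j] = f_a(j)$. The sketch below is hypothetical and breaks ties toward the lowest-index action, whereas [18] assumes ties are broken in favor of the principal:

```python
import numpy as np

def agent_utility(F: np.ndarray, costs: np.ndarray, t: np.ndarray, a: int) -> float:
    """Expected utility u_a(t) = E_{j~f_a}[t(j)] - c_a of action a (Eq. (2))."""
    return float(F[a] @ t - costs[a])

def best_response(F: np.ndarray, costs: np.ndarray, t: np.ndarray) -> int:
    """Agent's best response: an action maximizing expected utility.
    F[a, j] = f_a(j); ties are broken toward the lowest index here."""
    return int(np.argmax(F @ t - costs))
```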
Contracts for delegated learning. We propose to formulate delegated learning as a problem of optimal contract design, instantiated as follows. First, we relate agent actions $a$ with the number of samples $n$, and denote by $A = \{n_1, \dots, n_N\}$ the possible sizes of $S$ that the learning agent can work with. The cost of acquiring samples naturally maps as $c_a = c_n$, and the agent's best response is $a(t) = n(t)$. Next, we associate outcomes $j$ with accuracy for the principal by defining $j$ as the number of validation samples (out of the possible $m$) on which $h$ is correct; note this implies $\mathrm{acc}_V(h) = j/m$, and we will therefore use $j$ and $\mathrm{acc}_V(h)$ as outcomes interchangeably. Finally, for an action $n$, we set $f_n$ to be the distribution over possible accuracies obtained when learning with $n$ samples, namely $f_n(j) = \mathbb{P}_{h_n, V}[\mathrm{acc}_V(h_n) = j/m]$ for all $j$. We will also use the matrix form $F_{nj} = f_n(j)$, where $F \in [0, 1]^{N \times (m+1)}$. Note that $F$ admits two sources of variation: (i) a-priori variation in $h_n$ due to stochasticity in $S \sim D^n$; and (ii) a-posteriori variation in $j$ for any fixed $h_n$ due to stochasticity in $V \sim D^m$. When $h_n$ is fixed, the outcome distribution admits a simple binomial form, namely $j \sim \mathrm{Binomial}(m, \alpha_n)$. Empirically, we observe this to be the dominant component.

2.2 Delegation as an optimization problem

Budget-optimal contracts. Recall that the principal seeks to maximize accuracy under budget constraints (Eq. (1)). Once learning is delegated to an agent and framed as a contract design problem, the principal's objective becomes:

$$t^\star = \mathrm{argmax}_{t \in [0,B]^{m+1}}\; \mathbb{E}_{h_{n(t)}}[\mathrm{acc}_D(h_{n(t)})] \tag{3}$$

Contract $t$ is chosen to incentivize the agent to invest effort $n(t)$ (via Eq. (2)) such that the training of $h_{n(t)}$ yields high dividends for the principal in terms of expected accuracy. We will refer to $t^\star$ as a budget-optimal contract, and to the general task of finding $t^\star$ as budget-optimal contract design.

Information structure. Delegated learning settings have actions $n$ and costs $c_n$ known to both sides, and the outcome distribution $F_{nj}$ known to the agent. For the principal, we explore varying levels of knowledge: In Sec. 3, we assume (as in the classic contract design literature) that the principal has full information of $F$ (i.e., knows the learning curve), and focus on characterizing the optimal contract. In Sec. 4 we relax this assumption, and explore a partial-information setting in which the principal relies instead on an empirically-estimated curve $\hat F$.
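For illustration, a hypothetical sketch of how the stochastic learning curve $F$ might be instantiated when only the a-posteriori binomial variation is modeled (which, as noted above, we empirically observe to be the dominant component); the accuracy values here are arbitrary placeholders:

```python
import numpy as np
from scipy.stats import binom

def outcome_matrix(alphas: np.ndarray, m: int) -> np.ndarray:
    """Outcome matrix F in [0,1]^{N x (m+1)}: row i is the Binomial(m, alpha_i)
    pmf over outcomes j = 0..m. A-priori variation in h_n could be added by
    mixing such binomials over draws of alpha_n."""
    j = np.arange(m + 1)
    return np.vstack([binom.pmf(j, m, a) for a in alphas])

# Example: accuracies from a hypothetical power-law-like curve, m = 50.
sizes = np.array([16, 64, 256, 1024])
alphas = 0.9 - 0.8 * sizes ** -0.4
F = outcome_matrix(alphas, m=50)
assert np.allclose(F.sum(axis=1), 1.0)  # each row is a distribution
```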
3 Budget-Optimal Contract Design

3.1 The problem

Why agents cut corners. The conceptual challenge in designing contracts lies in that agents cannot reliably report what they did. For example, consider a principal who, after delegation, received a classifier attaining 0.74 (validation) accuracy. Should she be happy? The crux is that there are two ways this could have happened: (i) the agent invested high effort in learning (large $n$), but received an uninformative $S$ by chance, and delivered a low-quality $h$ as a result; or (ii) the agent invested low effort (small $n$). Since the agent's actions are private, and because outcomes are stochastic, the principal can never know for certain which is the true underlying cause. In other words, a lazy (or rather strategic) agent can hide behind the uncertainty that is inherent in learning outcomes.²

²The agent can also hide in the uncertainty due to $V$, particularly when $m$ is small; see the experiment in Sec. 4.2.

Contract types. To overcome this informational gap, the principal must devise a contract to align incentives and encourage the agent to prefer certain actions over others. But not all contracts are equally effective. Fig. 2 illustrates, for a budget $B$, three contract types and their economic implications:

Constant contract ($t(j) = B$): The agent is paid $B$ regardless of the outcome. His best response in this case is to choose the least-costly action, to the detriment of the principal.

Linear contract ($t(j) = Bj/m$): The agent is paid a fraction of $B$, linear in the resulting accuracy. Linear contracts are a popular and extensively-studied class of contracts [e.g., 32, 15]. Nonetheless, and though seemingly sensible, linear contracts turn out to be sub-optimal for our setting.

Threshold contract ($t(j) = B \cdot \mathbb{1}[j \geq j_0]$ for some $j_0$): The agent is paid $B$ provided the empirical accuracy surpasses a threshold $j_0$. In the example in Fig. 2, the threshold contract is optimal.

Rather than committing a-priori to some type of contract, we seek the budget-optimal contract by solving Eq. (3). For this it is useful to have structure.

Stochastic learning curves (and where to find them). Our approach uses the observation that there is a tight connection between the set of distributions $\{f_n\}$ encoded in $F$, and learning curves, which describe the anticipated accuracy of a classifier as a function of the size of its training set. Learning curves typically depict only expected accuracy, but there is also inherent variation in outcomes. We will therefore broadly use the term stochastic learning curve to describe both mean trend and variation in accuracy as a function of $n$; formally, a stochastic learning curve is defined precisely by $F$. This connection is useful because learning curves have structure: First, expected learning curves are typically monotone [34, 11]; when not [43, 49], they can be monotonized [12]. Second, stochastic learning curves are likely to satisfy the monotone likelihood ratio property (MLRP), which states that the better the performance of a classifier, the more likely it was trained on more data (see Def. 1).

Table 1: Characterization of simple min-budget contracts in different settings. Simple contract forms include all-or-nothing contracts and their subclass of threshold contracts. The table specifies for each configuration either the simple form that is optimal (in one case through equivalence to the Neyman-Pearson lemma), or that the simple form is non-optimal or intractable.

Actions    Outcomes     No assumptions           MLRP                     Concave-MLRP
|A| = 2    any size     All-or-nothing (T1)      Threshold (B.7.1)        Threshold (B.7.1)
|A| > 2    m + 1 = 2    All-or-nothing (B.5.1)   Threshold (B.7.2)        Threshold (B.7.2)
|A| > 2    m + 1 > 2    Simple is NP-hard (T3)   non-threshold (B.7.3)    Threshold (T4)

3.2 Optimization via min-budget contracts

Our main technique for solving Eq. (3) relies on a reduction to what we refer to as min-budget contracts. Given an (implementable) target action $n^\star \in A$, a min-budget contract for $n^\star$ is a contract $t$ that incentivizes the agent to employ precisely the action $n^\star$, while minimizing the maximum payment by the principal, $\|t\|_\infty = \max_{j \in \{0,\dots,m\}} t(j)$; i.e., $t$ implements $n^\star$ at minimum budget. Formally:

$$t^\star = \mathrm{argmin}_t\, \|t\|_\infty \quad \text{s.t.} \quad n(t) = n^\star \tag{4}$$

Our reduction relies on the following claim (proof in Appendix B.1):

Proposition 1. Every budget-optimal contract design problem has an optimal solution which is also min-budget.

Using Prop. 1, a solution to Eq. (3) can be obtained by iteratively solving Eq. (4): For all $n_i \in A$, solve Eq. (4) with target action $n^\star = n_i$, and return $t^\star(n_i)$ for the best implementable $n_i$ whose budget does not exceed $B$.
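For illustration, a sketch of this reduction; `min_budget_solver` is a placeholder for any oracle solving Eq. (4) (e.g., the MIN-BUDGET LP of Appx. B.2, sketched at the end of this document, or the SBA algorithm of Sec. 3.4), and `exp_acc[i]` stands for $\mathbb{E}[\mathrm{acc}_D(h_{n_i})]$:

```python
def budget_optimal(F, costs, exp_acc, budget, min_budget_solver):
    """Reduction from Eq. (3) to Eq. (4): solve a min-budget problem per
    target action and keep the best affordable one. The oracle returns
    (contract t, required budget) or None if the action is not implementable."""
    best = None
    for i in range(len(costs)):
        result = min_budget_solver(F, costs, i)
        if result is None:
            continue
        t, required = result
        if required <= budget and (best is None or exp_acc[i] > exp_acc[best[0]]):
            best = (i, t)
    return best  # (incentivized action, its min-budget contract) or None
```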
The budget-optimal problem thus reduces to solving multiple min-budget problems. To compute each $t^\star(n_i)$ in Eq. (4), we formulate a novel MIN-BUDGET linear program (LP),³ detailed in Appx. B.2. One way to solve this LP is with generic solvers, an approach which is valid, but can be costly. One of our contributions is in identifying natural cases where min-budget contracts take on simple forms, which are easier to optimize, and have practical merit. In particular, we show that binary-action contracts have an all-or-nothing structure ($t(j) \in \{0, B\}$), and that plausible structural assumptions on the learning curve give rise to threshold contracts ($t(j) = B \cdot \mathbb{1}[j \geq j_0]$ for some $j_0$). Our theoretical results are summarized in Table 1, and detailed in the rest of the section.

3.3 All-or-nothing contracts: Binary action space and the statistical connection

We begin with a simple delegated learning setting in which the agent can choose one of two actions, $A = \{n_1, n_2\}$, e.g., a small vs. large training set, and the principal seeks to incentivize training with more data.⁴ This reduced case will be useful as a building block for the general case (which we return to in Sec. 3.4), and for making a precise connection between contract design and hypothesis tests.

Simple min-budget contracts for binary action space. Our first result shows that optimal binary-action contracts are all-or-nothing contracts whose budget is determined by the total variation distance between the outcome distributions $f_2$ and $f_1$, namely $\|f_2 - f_1\|_{TV} = \frac{1}{2}\sum_{j=0}^{m} |f_{2,j} - f_{1,j}|$.

Theorem 1 (Optimal binary-action contract). In a binary-action contract setting with outcome distributions $f_1, f_2$ and costs $c_1, c_2$, the min-budget contract is an all-or-nothing contract, given by:

$$t^\star(j) = B^\star \mathbb{1}[f_2(j) \geq f_1(j)] \;\; \forall j \in \{0, \dots, m\}, \quad \text{where } B^\star = (c_2 - c_1)/\|f_2 - f_1\|_{TV}. \tag{5}$$

The proof (in Appendix B.4.2) is by LP duality. Intuitively, the optimal contract pays the agent for outcomes that are more likely to come from $f_2$ than from $f_1$. Moreover, it requires a higher budget the smaller the distance is between the two distributions $f_1, f_2$. At the extremes, if their distance is 1 (i.e., no overlap between their supports), the required budget for incentivizing $n_2$ is $c_2 - c_1$, whereas if their distance is 0 (i.e., $f_1 = f_2$) it becomes impossible to incentivize $n_2$.

³The MIN-BUDGET LP is closely related to the well-known MIN-PAY LP from non-budgeted contract design [e.g., 19], but with a different objective, and hence very different optimal solutions (see Appx. B.3).

⁴When $n_2$ is not the target, or when it is not implementable, the solution is immediate: always pay $c_1$.
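For illustration, Theorem 1 yields a direct computation; a hypothetical sketch (not the code released with the paper):

```python
import numpy as np
from scipy.stats import binom

def binary_min_budget(f1, f2, c1, c2):
    """Optimal binary-action contract (Theorem 1): pay B* on outcomes that are
    at least as likely under f2 as under f1, with B* = (c2 - c1) / TV(f2, f1)."""
    tv = 0.5 * np.abs(f2 - f1).sum()   # total variation distance
    if tv == 0:
        raise ValueError("f1 == f2: the costly action cannot be incentivized")
    B = (c2 - c1) / tv
    return B * (f2 >= f1)              # all-or-nothing contract t*

# Example with binomial outcome distributions (cf. Fig. 5 in Appx. B.3).
m, j = 15, np.arange(16)
t = binary_min_budget(binom.pmf(j, m, 0.5), binom.pmf(j, m, 0.8), c1=0.0, c2=1.0)
```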
Formal connection to optimal hypothesis testing. Theorem 1 also uncovers a direct correspondence between optimal contracts and optimal hypothesis tests. Intuitively, given the outcome distributions $\{f_1, f_2\}$, and in order to incentivize $n_2$, the principal wishes to pay the agent if the observed outcome $j \in \{0, \dots, m\}$ is more likely to have originated from $f_2$. The principal can attempt to identify whether the outcome $j$ is drawn from distribution $f_1$ or $f_2$ through hypothesis testing, where a hypothesis test $\psi : \{0, \dots, m\} \to \{0, 1\}$ maps a sample $j$ to either $f_2$ (indicated by 1) or to the null hypothesis $f_1$ (indicated by 0). For our purpose it is convenient to allow tests to be non-integral, in which case $\psi : \{0, \dots, m\} \to [0, 1]$ maps $j$ to the probability with which it originates from $f_2$. The quality of a hypothesis test is measured by summing its type-1 and type-2 errors: $\sum_{j=0}^{m} f_{1,j}\psi_j + \sum_{j=0}^{m} f_{2,j}(1 - \psi_j)$. The test that minimizes this sum is known as the most powerful hypothesis test, and has been characterized by Neyman and Pearson [45, 4.3]. We now turn to formally establishing the connection. For a fixed $B$, observe that every contract with budget $B$ can be mapped to a hypothesis test via the bijection $\psi(j) = t(j)/B$. Then:

Theorem 2 (Optimal contract vs. test). Consider binary-action contract design with distributions $f_1, f_2$ and costs $c_1, c_2$. A contract $t$ with budget $B$ is optimal if and only if its corresponding hypothesis test $\psi = t/B$ is maximum-power, with type-1 and type-2 errors summing to $1 - \frac{c_2 - c_1}{B}$.

The proof (Appendix B.4.2) is by a non-linear variable transformation of the MIN-BUDGET LP. Theorem 2 implies that the optimal contract for the binary-action case (Theorem 1) is equivalent to the well-known Neyman-Pearson lemma characterizing the most powerful hypothesis test:

Lemma 1 (Neyman-Pearson [e.g., 45]). Let $f_1, f_2$ be two discrete probability distributions. Then the most powerful hypothesis test for $f_1, f_2$ is the likelihood ratio test $\psi(j) = \mathbb{1}[f_2(j) \geq f_1(j)]$, which attains the optimal bound $1 - \|f_2 - f_1\|_{TV}$ on the sum of type-1 and type-2 errors.

Theorem 2 establishes a new formal connection between the two domains of contract design and hypothesis testing. In the context of contract design, it provides a statistical interpretation: A min-budget contract can be interpreted as an optimal hypothesis test, and the ability to distinguish between the two hypotheses determines the required budget. In the converse direction, it enables a new proof of the Neyman-Pearson lemma using the min-budget contract given by Theorem 1 (see Appendix B.4.2).

3.4 All-or-nothing contracts: Beyond binary action

For general action spaces, optimal contracts are not guaranteed to be all-or-nothing. In fact, we show that determining whether there exists an all-or-nothing contract that is optimal is NP-hard:

Theorem 3 (Hardness). Finding a min-budget all-or-nothing contract is NP-hard.

The proof is by reduction from 3SAT, and appears in Appx. B.5.2. Nonetheless, there are special but important cases, notably the binary-outcome case,⁵ in which the results from Sec. 3.3 hold, suggesting that the ideas from Thm. 1 apply more broadly. The following algorithm makes use of these ideas, showing good empirical performance, and provable performance under an MLRP condition (Sec. 3.5).

Single binding action algorithm. Revisiting the closed-form all-or-nothing contract in Eq. (5), we observe that the result is based on the fact that binary action spaces have only one alternative action. Building upon this observation, we propose the single binding action (SBA) algorithm, which computes a solution to Eq. (4) in the general (many-actions) case: Given target action $n^\star \in A$, loop over all actions $n' \neq n^\star$, and apply the closed-form formula in Eq. (5) to obtain an (all-or-nothing) contract $t^\star(n^\star, n')$. If the agent's best response (Eq. (2)) satisfies $n(t^\star(n^\star, n')) = n^\star$, return $t^\star$. If the loop ends without returning a contract, return "fail". The following claim shows the algorithm is sound:

Proposition 2 (Soundness of SBA). When the single binding action algorithm terminates successfully, it returns an optimal contract which is an all-or-nothing contract.

Proof in Appendix B.6. Prop. 2 ensures that if SBA succeeds, then the returned all-or-nothing contract $t^\star$ is optimal. Moreover, failure does not preclude the existence of an optimal all-or-nothing contract. If SBA fails, we solve Eq. (4) with a generic LP solver.
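For illustration, a sketch of SBA reusing the closed form of Eq. (5); as an assumption of this sketch, only alternatives cheaper than the target are paired via Eq. (5) (so that the resulting budget is positive), and `best_response` is the Eq. (2) oracle sketched in Sec. 2.1:

```python
import numpy as np

def single_binding_action(F, costs, target, best_response):
    """Single binding action (SBA): for each alternative action, build the
    closed-form binary-action contract of Eq. (5), and return it as soon as
    the target becomes the agent's best response. Returns None on failure,
    in which case Eq. (4) is solved with a generic LP solver instead."""
    for alt in range(len(costs)):
        if alt == target or costs[alt] >= costs[target]:
            continue  # pair the target only with cheaper alternatives
        tv = 0.5 * np.abs(F[target] - F[alt]).sum()
        if tv == 0:
            continue  # indistinguishable outcome distributions
        B = (costs[target] - costs[alt]) / tv
        t = B * (F[target] >= F[alt])  # all-or-nothing contract
        if best_response(F, costs, t) == target:
            return t  # optimal by Prop. 2
    return None
```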
Empirically, SBA was successful in more than 85% of cases, and is about $10^3$ times faster than the LP solver (Appendix C.3). Thus, this optimistic "try SBA first" approach typically succeeds, adds negligible overhead if not, and guarantees correctness.

⁵In this case there are binary outcomes ($m = 1$), such as when the agent's efforts can result in either success or failure [6, 28, 21]. For this case, we show that the optimal contract is also all-or-nothing (Appx. B.5.1).

3.5 Threshold contracts: MLRP assumption

In this section we provide sufficient conditions, in the form of natural structural properties of (stochastic) learning curves, that guarantee the optimality of even simpler contracts, namely threshold contracts, and the success of SBA. With inspiration from hypothesis testing [40] and contract theory [26, 19], it is natural to consider the monotone likelihood ratio property (MLRP):

Definition 1 (MLRP [e.g., 26]). A contract design setting satisfies MLRP if for every pair of actions $a, a'$ such that $c_a < c_{a'}$, the likelihood ratio $f_{a'}(j)/f_a(j)$ is monotonically increasing in $j$.

In our context, MLRP states that the better the validation-set performance of a classifier, the more likely it was trained on more data. This holds in particular for monotone learning curves with binomial outcome distributions [19, B.1]. For a binary action space, MLRP ensures that $n_2$ is always implementable,⁶ and that the optimal contract is a threshold contract: $t^\star(j) = B^\star \mathbb{1}[j \geq j_0]$. This is by Theorem 1, and by the fact that there exists $j_0$ such that $f_2(j)/f_1(j) \geq 1$ iff $j \geq j_0$ (see also Appendix B.7.1).⁷ Interestingly, this is similar to the relation between the Neyman-Pearson lemma and the Karlin-Rubin theorem, which characterizes the most powerful hypothesis test under a monotone likelihood ratio [40].

MLRP for general action spaces. MLRP does not guarantee threshold contracts in general: In Appendix B.7.3, we give a constructive counterexample satisfying MLRP, but for which the optimal contract is not a threshold contract. However, refining MLRP to also capture diminishing returns turns out to be sufficient for recovering guarantees generally. For target action $a_N$, denote the survival probability of an action $a_i$ by $s_i = \mathbb{P}_{j \sim f_i}[j \geq j^\star]$, where $j^\star$ is the minimal outcome at which action $a_N$ is more likely than $a_{N-1}$. These will serve as formal means for capturing the concavity of learning curves.

Definition 2 (C-MLRP). A contract design setting satisfies Concave-MLRP (C-MLRP) if it satisfies MLRP, and additionally the actions' survival probabilities $s_i$ are concave as a function of the actions' costs $c_i$.

Our final result shows that C-MLRP guarantees the optimality of threshold contracts, and the success of SBA:

Theorem 4 (Sufficiency for threshold). Consider a contract design setting satisfying C-MLRP. Then the optimal contract is a threshold contract, and is recovered by the SBA algorithm.

We prove this claim by showing that concavity implies that only one alternative action is binding in the linear program equivalent to Eq. (4), reducing the problem to the two-action case. By applying Theorem 1, we obtain optimality of threshold contracts in this case as well (proof in Appendix B.7.3). This also implies that SBA always terminates successfully on inputs that satisfy C-MLRP.
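As an aside, Definition 1 is straightforward to verify numerically; a hypothetical sketch:

```python
import numpy as np

def satisfies_mlrp(F: np.ndarray, costs: np.ndarray, tol: float = 1e-12) -> bool:
    """Check Definition 1. It suffices to check consecutive actions in cost
    order: if the ratio is non-decreasing for each adjacent pair, it is
    non-decreasing for every pair (products of non-negative non-decreasing
    functions are non-decreasing). Ratios are compared via cross-products
    to sidestep division by zero."""
    order = np.argsort(costs)
    for lo, hi in zip(order[:-1], order[1:]):
        f, g = F[lo], F[hi]  # cheaper action, costlier action
        # g/f non-decreasing in j  <=>  g[j] * f[j+1] <= g[j+1] * f[j]
        if np.any(g[:-1] * f[1:] > g[1:] * f[:-1] + tol):
            return False
    return True
```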
In practice, we believe that C-MLRP is a reasonable assumption for realistic learning curves: in Appendix B.8.1, we prove it is satisfied by a standard theoretical model of learning curves, and in Appendix C.3, we empirically demonstrate that it is (approximately) satisfied in settings where threshold contracts are optimal. Interestingly, in our empirical study, threshold contracts were often optimal even when C-MLRP did not hold, suggesting the condition is sufficient, but not necessary (see Sec. 4.1).

4 Experiments

We now turn to our empirical investigation of delegated learning under full and partial information. We base our experiments on the recently curated Learning Curves Database (LCDB) [43], which includes a large collection of stochastic learning curves for multiple classification datasets and methods. For each dataset and method, the database includes held-out accuracy measurements obtained for increasing sample sizes $n \in \{2^4, 2^{4.5}, \dots, 2^{15}\}$, with multiple repetitions per $n$; these provide us with stochastic learning curves. Here we focus primarily on the popular MNIST dataset [39] as our case study, and on MLP and GBDT as representative classifiers, but we refer the reader to Appendix C for further experiments on additional datasets and methods. Code is available at: https://github.com/edensaig/delegated-classification.

⁶As a corollary of a similar result for min-pay contracts [19, Lemma 7].

⁷This also holds for binary outcomes in an arbitrary action space, see Appendix B.7.2.

4.1 Full information

We begin with the full-information setting to explore, in a clean environment, how different parameters of the learning setting and environment affect predictive performance and economic outcomes.

Figure 3: Delegating with full information. (Left) Typical learning curves for two learning algorithms on MNIST. (Center) Required budget for a target accuracy of 0.85, per validation set size $m$. (Right) Different cost regimes, indicating per accuracy region which of the two methods is cheaper to delegate.

Validation set size. Fig. 3 (left) presents typical stochastic learning curves for two learning algorithms: Multi-Layered Perceptron (MLP) and Gradient-Boosted Decision Trees (GBDT). We take an arbitrary accuracy point on the curve at $\mathrm{acc}(n) = 0.85$ (dotted line) to examine the effects of validation set size $m$ on min-budget contracts. Notice that MLP requires larger $n$ to obtain 0.85; Fig. 3 (center) shows how this translates to a larger required budget $B^\star$, which holds for all $m$. As $m$ increases, the required budgets and the difference between them both decrease. Larger validation sets are therefore useful for reducing the required budget. Nonetheless, even for reasonable $m$, the obtained budgets still remain higher than their theoretical lower bounds (the target action costs $c_{n^\star}$).

Budget regimes. Fig. 3 (left) also indicates two points in which the learning curves cross (dashed lines), at 0.74 and 0.94 accuracy. These correspond to sample sizes $n$ for which both methods obtain matching accuracies (in expectation). For a self-sufficient learner, the implication is that at each of these points, both methods are equally costly, i.e., both cost $c_n$.
Interestingly, and in contrast, delegation can entail different required budgets despite equal accuracies. Fig. 3 (right) shows for each target accuracy the gap in required budgets between both methods, $\Delta B^\star = B^\star_{\mathrm{GBDT}} - B^\star_{\mathrm{MLP}}$. As can be seen, each method is comparatively more (or less) costly in different accuracy regimes (up to 0.6; between 0.6 and 0.92; and above 0.92). Crucially, the budget gap can be large even when accuracies match (dashed lines). For example, even though both MLP and GBDT require $n = 362 \approx 2^{17/2}$ samples to obtain 0.74 accuracy, GBDT is cheaper ($\Delta B^\star = -10^2$); for 0.94, which requires $n = 23170 \approx 2^{29/2}$ from both, GBDT is significantly more expensive ($\Delta B^\star = 10^5$). The reason for this is that optimal budgets are determined by the ability to distinguish between distributions (Sec. 3.3).

Prevalence of simple contracts. To understand the applicability of our theoretical findings, in Appendix C.3 we conduct an empirical prevalence evaluation on additional learning algorithms, and across target actions. We observed that min-budget contracts assume a threshold form, and the SBA algorithm returns correct results, in more than 85% of cases overall. When restricting optimization to simple contracts, budget requirements were generally less than 1% higher than those of a min-budget contract, suggesting that simple contracts may provide a good approximation even when min-budget contracts do not assume a simple form.

4.2 Partial information

We now turn to consider delegation under partial information, in which the principal must rely on an estimated learning curve. We instantiate this idea by assuming that the principal has access to a small pilot dataset of size $k$, where $k$ is considered small. Using this set, the principal creates an estimated learning curve $\hat F$ by fitting a curve to accuracies obtained for sizes up to some $n_0 \leq k$, and extrapolating to larger $n > n_0$. In particular, we experiment with fitting parametric power-law curves of the form $\mathbb{E}[\alpha_n] = a - b \cdot n^{-c}$, which have been shown to provide good fit in various scenarios, both empirically and theoretically [49, 34, 11]. Since power-law curves are monotone, composition with binomial distributions increasing in $p$ provably results in MLRP stochastic curves [19, B.1].

Figure 4: Delegating with partial information. (Left) Extrapolated learning curves for different $r$. (Center-left) Accuracy obtained via delegation per pilot set size $k$. (Center-right) Multiplicative gain in effective number of samples due to delegation. (Right) Implications of over- vs. under-estimation.

Bias-variance tradeoff. Given $k$ pilot examples, there are different ways in which the principal can use them to construct an estimated curve. Here we consider a simple tradeoff: setting $n_0$ to be small but with more samples per $n < n_0$ (low variance), or setting $n_0$ to be large but with few samples per $n < n_0$ (low bias). We define $r$ as the number of samples per $n$ (so low $r$ means larger $n_0$). Then, for a given $r$, we set $n_0$ such that $\sum_{n \leq n_0} r \cdot n \leq k$ (i.e., such that the total number of used samples does not exceed $k$). Fig. 4 (left) shows different curve fits for $r \in \{1, 3, 5\}$, and the corresponding $n_0$.
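For illustration, a sketch of fitting such a power-law curve to pilot observations; the data and initialization here are hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_power_law(ns, accs):
    """Fit the parametric learning curve E[alpha_n] = a - b * n**(-c)
    to (sample size, accuracy) observations from the pilot set."""
    def curve(n, a, b, c):
        return a - b * n ** (-c)
    p0 = (accs.max(), 1.0, 0.5)  # crude initialization
    (a, b, c), _ = curve_fit(curve, ns, accs, p0=p0, maxfev=10_000)
    return a, b, c

# Example: noisy observations from a ground-truth curve 0.9 - 0.8 * n**-0.4.
rng = np.random.default_rng(0)
ns = np.repeat([16, 32, 64, 128], 3)  # r = 3 repetitions per size n
accs = 0.9 - 0.8 * ns ** -0.4 + rng.normal(0, 0.01, ns.size)
a, b, c = fit_power_law(ns, accs)
```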
Then, Fig. 4 (center-left) shows, for a certain fixed budget, the accuracy level that can be attained for increasing $k$, and as a function of $r$. As can be seen, having sufficient points $k$ for constructing $\hat F$ is important, but performance grows quickly with $k$ (note the log-scale x-axis). It is also apparent in our example that low bias (via larger $n_0$) is much more important than low variance for constructing a useful $\hat F$.

Cost-efficiency tradeoff. Because the pilot set provides the principal a basic means for obtaining minimal accuracy, we can ask: given $k$ examples, and for a fixed budget $B$, what is the added benefit of delegating learning? For this, we define $\mu(k) = n(\hat t)/k$ to be the sample-size multiplier, i.e., the multiplicative gain in the effective number of samples due to delegation. Fig. 4 (center-right) shows $\mu(k)$ for increasing $k$ and across $r$. For $r = 1$ (which is superior in terms of performance and outcomes), $\mu$ begins at 10, increases to 30 at around $k = 190$, and slowly decreases back to 10 towards $k = 1{,}000$. For $r > 1$, we observe that $\mu \approx 1$, i.e., there is effectively no gain from delegation, until around $k = 100$, only after which some gain is restored. This highlights the importance of obtaining an accurate estimate $\hat F$ in terms of the economic consequences of delegation.

Over- vs. under-estimation. Typically in curve-fitting, over- and under-estimation are treated equally, since both types of error can negatively affect goodness of fit and extrapolation quality. However, for delegation, the implications of over- vs. under-estimation on contract outcomes are highly asymmetric. Fig. 4 (right) shows, for a target incentivized number of samples $n(t^\star)$, the relation between the (theoretical) signed extrapolation error at $n(t^\star)$ (i.e., over- or under-estimate, measured in accuracy points) and the eventual loss in accuracy obtained through delegation, relative to perfect estimation. Each point in the plot corresponds to one curve-fitting instance, with points shown for varying $k$, $n_0$, and $r$, and with multiple independent repetitions. Results show that in the under-estimation regime (i.e., negative extrapolation error), the loss in accuracy degrades gracefully with the estimation error. In stark contrast, even minimal over-estimation (positive extrapolation error) causes accuracy to plummet dramatically, as the agent's rational response in those cases was to use the smallest dataset possible. We interpret this as a consequence of setting the bar too high: a rational decision to minimize effort in response to unrealistic expectations. This has important implications for the choice of how to fit and extrapolate learning curves, suggesting that contracts can be tolerant to under-estimation, while over-estimation should be avoided at all costs.

5 Discussion

Motivated by the increasingly-common practice of outsourcing learning tasks, this paper sets out to introduce and study the novel problem of delegated classification. Our findings suggest that conflicts of interest should not be overlooked, and that contracts hold potential as a means for aligning them. Our analysis relies on a set of assumptions, which should be carefully considered by practitioners and empiricists alike; we also believe that there are likely further fruitful connections to explore between contracts and statistical hypothesis testing. As a problem of contract design, and when the learning task is reasonably well-behaved, delegated learning manifests in the form of simple threshold contracts.
A natural question for future work is whether simplicity also implies robustness to partial knowledge, as is often the case [19].

Acknowledgements. The authors would like to thank Ruth Heller, Shafi Goldwasser, Jonathan Shafer, Ohad Einav, and the anonymous reviewers for their insightful remarks and valuable suggestions. Nir Rosenfeld is supported by the Israel Science Foundation grant no. 278/22. Eden Saig is supported by the Israel Council for Higher Education PBC scholarship for Ph.D. students in data science. Funded by the European Union (ERC, ALGOCONTRACT, 101077862, PI: Inbal Talgam-Cohen).

References

[1] Milton Abramowitz, Irene A. Stegun, and Robert H. Romer. Handbook of mathematical functions with formulas, graphs, and mathematical tables, 1988.
[2] Ibrahim Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[3] Tal Alon, Magdalen Dobson, Ariel D. Procaccia, Inbal Talgam-Cohen, and Jamie Tucker-Foltz. Multiagent evaluation mechanisms. In AAAI 2020, pages 1774–1781, 2020.
[4] Nivasini Ananthakrishnan, Stephen Bates, Michael I. Jordan, and Nika Haghtalab. Delegating data collection in decentralized machine learning. arXiv preprint arXiv:2309.01837, 2023.
[5] Pablo D. Azar and Silvio Micali. Computational principal agent problems. Theoretical Economics, 13(2):553–578, 2018.
[6] Moshe Babaioff, Michal Feldman, and Noam Nisan. Combinatorial agency. In Proceedings of the 7th ACM Conference on Electronic Commerce, pages 18–28, 2006.
[7] Moshe Babaioff, Michal Feldman, Noam Nisan, and Eyal Winter. Combinatorial agency. Journal of Economic Theory, 147(3):999–1034, 2012.
[8] Stephen Bates, Michael I. Jordan, Michael Sklar, and Jake A. Soloff. Principal-agent hypothesis testing. arXiv preprint arXiv:2205.06812, 2022.
[9] Devansh Bisla, Apoorva Nandini Saridena, and Anna Choromanska. A theoretical-empirical approach to estimating sample complexity of DNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3270–3280, 2021.
[10] Patrick Bolton and Mathias Dewatripont. Contract Theory. MIT Press, 2004.
[11] Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, and Amir Yehudayoff. A theory of universal learning. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 532–541, 2021.
[12] Olivier J. Bousquet, Amit Daniely, Haim Kaplan, Yishay Mansour, Shay Moran, and Uri Stemmer. Monotone learning. In Conference on Learning Theory, pages 842–866. PMLR, 2022.
[13] Michael L. Bynum, Gabriel A. Hackebeil, William E. Hart, Carl D. Laird, Bethany L. Nicholson, John D. Siirola, Jean-Paul Watson, and David L. Woodruff. Pyomo: optimization modeling in Python, volume 67. Springer Science & Business Media, third edition, 2021.
[14] Yang Cai, Constantinos Daskalakis, and Christos H. Papadimitriou. Optimum statistical estimation with strategic data sources. In COLT 2015, pages 280–296, 2015.
[15] Gabriel Carroll. Robustness and linear contracts. American Economic Review, 105(2):536–563, 2015.
[16] Junjie Chen, Minming Li, and Haifeng Xu. Selling data to a machine learner: Pricing via costly signaling. In International Conference on Machine Learning, pages 3336–3359. PMLR, 2022.
[17] Alon Cohen, Argyrios Deligkas, and Moran Koren. Learning approximately optimal contracts. In SAGT 2022, pages 331–346, 2022.
[18] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen.
Simple versus optimal contracts. In EC 2019, pages 369–387, 2019.
[19] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 369–387, 2019.
[20] Paul Dütting, Tomer Ezra, Michal Feldman, and Thomas Kesselheim. Combinatorial contracts. In FOCS 2021, pages 815–826, 2021.
[21] Paul Dütting, Tomer Ezra, Michal Feldman, and Thomas Kesselheim. Combinatorial contracts. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 815–826. IEEE, 2022.
[22] Paul Dütting, Tomer Ezra, Michal Feldman, and Thomas Kesselheim. Multi-agent contracts. In STOC 2023, 2023. To appear.
[23] Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. Scaling laws for neural machine translation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=hR_SMu8cxCV.
[24] Shafi Goldwasser, Guy N. Rothblum, Jonathan Shafer, and Amir Yehudayoff. Interactive proofs for verifying machine learning. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
[25] Sanford J. Grossman and Oliver D. Hart. An analysis of the principal-agent problem. Econometrica, 51(1):7–45, 1983.
[26] Sanford J. Grossman and Oliver D. Hart. An analysis of the principal-agent problem. In Foundations of Insurance Economics, pages 302–340. Springer, 1992.
[27] William E. Hart, Jean-Paul Watson, and David L. Woodruff. Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011.
[28] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 359–376, 2014.
[29] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. Journal of Artificial Intelligence Research, 55:317–359, 2016.
[30] Derek Hoiem, Tanmay Gupta, Zhizhong Li, and Michal Shlapentokh-Rothman. Learning curves for analysis of deep networks. In International Conference on Machine Learning, pages 4287–4296. PMLR, 2021.
[31] Bengt Holmström. Moral hazard and observability. The Bell Journal of Economics, 10(1):74–91, 1979.
[32] Bengt Holmstrom and Paul Milgrom. Aggregation and linearity in the provision of intertemporal incentives. Econometrica: Journal of the Econometric Society, pages 303–328, 1987.
[33] Marcus Hutter. Learning curve theory. arXiv preprint arXiv:2102.04074, 2021.
[34] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[35] Jon Kleinberg and Robert Kleinberg. Delegated search approximates efficient search. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 287–302, 2018.
[36] Jon Kleinberg and Manish Raghavan. How do classifiers induce agents to invest effort strategically? In EC 2019, pages 825–844, 2019.
[37] Jon M. Kleinberg and Manish Raghavan. Algorithmic classification and strategic effort. SIGecom Exchanges, 18(2):45–52, 2020.
[38] Sébastien Lahaie. How to take the dual of a linear program. Columbia University, New York, 2008.
[39] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[40] Erich Leo Lehmann, Joseph P. Romano, and George Casella. Testing Statistical Hypotheses, volume 3. Springer, 1986.
[41] Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, and Marc T. Law. How much more data do I need? Estimating requirements for downstream tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 275–284, 2022.
[42] David Martimort and Jean-Jacques Laffont. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, 2009.
[43] Felix Mohr, Tom J. Viering, Marco Loog, and Jan N. van Rijn. LCDB 1.0: An extensive learning curves database for classification tasks. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Lecture Notes in Computer Science. Springer, 2022.
[44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[45] Philippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18.S997, 2015.
[46] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryenvpEKDr.
[47] Royal Swedish Academy of Sciences. Scientific background on the 2016 Nobel Prize in Economic Sciences, 2016.
[48] Bernard Salanie. The Economics of Contracts: A Primer. MIT Press, 2017.
[49] Tom Viering and Marco Loog. The shape of learning curves: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[50] Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Veselin Stoyanov. Training trajectories of language models across scales. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13711–13738. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.767.
[51] Xinyi Zhao, Weixin Liang, and James Zou. Data budgeting for machine learning. arXiv preprint arXiv:2210.00987, 2022.
[52] Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I. Jordan. The sample complexity of online contract design. In Proceedings of the 24th ACM Conference on Economics and Computation, EC 2023, London, United Kingdom, July 9-12, 2023, page 1188. ACM, 2023. doi: 10.1145/3580507.3597673.

A Broader implications

In this paper, we set out to formalize and study the task of delegating classification through the lens of contract design. Given that our work is largely motivated by the increasingly common practice of outsourcing learning tasks to specialized service providers, we believe our algorithm, analysis, and empirical observations carry meaningful implications for practitioners and decision-makers alike.
At the same time, it is important to remember that our work, like others considering economic aspects of learning, studies delegation in a simplified setting and under certain assumptions. As such, and since devising and agreeing to legally-binding contracts can have concrete implications on real-world outcomes, care should be taken when applying ideas or conclusions that derive from our work in practice. For example, consider that our formalism relies on the assumption that the learning agent is all-knowing and rational. Yet in reality, agents must act under partial information, face irreducible (and often unquantifiable) uncertainty, and, being human, are subject to common behavioral biases. It is unclear a-priori if and how our statements and conclusions carry over to this setting. As another example, notice that our formalism considers a one-shot setting, where a single contract between a single principal-agent pair is instantiated once. But in reality, competition and long-term reputation may play a significant role in determining the agent's incentive structure, and consequently, his behavior. In such cases, naïvely applying our framework without careful inspection of the appropriate incentives on both ends can result in sub-optimal contracts, possibly to the detriment of all involved parties. Nonetheless, given that our work aims to take an initial step towards establishing delegated learning, we view its extension to non-rational agents and to temporal and competitive settings as intriguing future work.

One message that our work conveys is that in delegation, simplicity, in the form of threshold contracts, has merit. This draws connections to other works that similarly argue for simplicity as an important and useful property of effective delegation mechanisms. Our work shows that simple contracts are, under reasonable conditions, computationally feasible and theoretically optimal. Economically, threshold contracts are practically appealing since they are easy to understand, communicate, and regulate. Given that contracts are in essence social constructs, we believe these properties are key for establishing threshold contracts as effective building blocks for machine learning markets.

B Min-budget contract design: deferred proofs

Notation. To align with traditional contract design notation, in this section we denote the action space by $A = [n] = \{1, \dots, n\}$, and the outcome space by $\Omega = \{0, \dots, m\}$. We denote by $\Delta(X)$ the set of distributions over a given set $X$. For contract design problems, we denote the outcome probabilities by $F_{i,j}$, such that $F_i \in \Delta(\Omega)$ is the outcome distribution associated with action $i$, and also denote the cost of each action by $c_i$. Given $x \in \mathbb{R}$, we denote $x_+ = \mathrm{ReLU}(x) = \max\{0, x\}$. We denote the indicator function by $\mathbb{1}[\cdot]$, the total variation distance between distributions $P, Q$ by $\|P - Q\|_{TV}$ (see Definition 4), and the survival function of $P \in \Delta(\Omega)$ by $S_P(\cdot)$ (see Definition 14).

A note on individual rationality. In the contract design literature, a contract is said to be incentive compatible (IC) with respect to some action $a^\star$ if it satisfies $u_{a^\star}(t) \geq u_a(t)$ for all actions $a \in A$. In the delegated classification setting, this corresponds to the constraint $n(t) = n^\star$ in Eq. (4). As an additional constraint, contracts in which the agent's expected utility is always non-negative ($u_{n^\star}(t) \geq 0$) are said to be individually rational (IR).
In the case of delegated classification, it is natural to assume that there always exists a valid action that the agent can take at zero cost: for example, returning a dummy classifier that always abstains from prediction (thus having zero accuracy), or a classifier which makes a prediction at random. As such, contracts in the delegated classification setting can be assumed to be individually rational without loss of generality, as any action $n$ chosen by the agent has utility which is weakly larger than the utility of the zero-cost action, which is always non-negative. Moreover, even in cases where a zero-cost action does not exist, we show that individual rationality (IR) can be attained in a straightforward manner, by adding the minimal cost $c_1$ to each entry of an incentive compatible (IC) contract ($t_j \mapsto t_j + c_1$):

Claim 1. Given an IC contract $t$ with $c_1 > 0$, the contract $t + c_1$ (add $c_1$ to every coordinate of the contract) is both IC and IR.

Proof. An IC contract implementing action $a^*$ satisfies $u_{a^*}(t) \ge u_a(t)$ for all actions $a \in A$. In particular, this inequality holds in relation to the least-costly action, denoted by $1 \in A$. Plugging in the definition of the agent's expected utility (Eq. (2)), we obtain $u_{a^*}(t) \ge \mathbb{E}_{j \sim f_1}[t] - c_1$. Adding a constant to every coordinate shifts the expected utility of every action by the same amount, so incentive compatibility is preserved. Thus, under the mapping $t'_j = t_j + c_1$, it holds that:
$$u_{a^*}(t') = u_{a^*}(t + c_1) \ge \mathbb{E}_{j \sim f_1}[t + c_1] - c_1 = \mathbb{E}_{j \sim f_1}[t] \ge 0,$$
and therefore the contract $t' = t + c_1$ is individually rational and still implements action $a^*$.

B.1 Relation between budget-optimal and min-budget contracts

Proof of Proposition 1. Given budget $B > 0$, denote by $t^{BO}$ the budget-optimal contract, and denote the action it implements by $n^*$. By definition, $t^{BO}$ is a feasible solution to the min-budget contract design problem implementing action $n^*$. Denote by $t^{MB}$ the corresponding optimal solution to the min-budget problem implementing action $n^*$ (Eq. (4)). $t^{MB}$ implements the same action $n^*$ by definition, and satisfies $\|t^{MB}\|_\infty \le \|t^{BO}\|_\infty \le B$ due to the optimization objective. Hence, $t^{MB}$ is a budget-optimal contract for the given budget $B$, which is also min-budget.

Iterative min-budget. To find the budget-optimal contract using iterative applications of Eq. (4), we observe that any budget-limited contract $t : \Omega \to [0, B]$ has bounded expected pay $\mathbb{E}_{f_n}[t] \le B$ for any distribution $f_n$. Hence, actions $n$ with cost $c_n > B$ cannot be implemented using budget $B$, as the agent's utility $u_n = \mathbb{E}_{f_n}[t] - c_n < 0$ would be smaller than the utility of the zero-cost action, $u_0 = \mathbb{E}_{f_0}[t] - 0 \ge 0$. Define the reduced action set:
$$A' = \{n \in A \mid c_n \le B \text{ and } n \text{ is implementable}\}.$$
$A'$ is finite when $A$ is finite, or when the data cost is unbounded and the learning curve is monotone. To find $n^*$ within this space, go over all $n \in A'$, calculate the minimal budget $B_n$ required for implementation, and take $\operatorname{argmin}_{n \in A'} B_n$. This is possible within a finite number of steps.

B.2 The min-budget linear program and equivalent forms

The min-budget contract (Eq. (4)) implementing action $i^* \in A$ is given by the MIN-BUDGET linear program:
$$\min_{t \in \mathbb{R}^{|\Omega|}_{\ge 0},\, B \in \mathbb{R}_{\ge 0}} B \quad \text{s.t.} \quad \forall j \in \Omega:\ t_j \le B \ \text{(BUDGET)}; \quad \forall i \ne i^*:\ \sum_{j \in \Omega} F_{i^*,j} t_j - c_{i^*} \ge \sum_{j \in \Omega} F_{i,j} t_j - c_i \ \text{(IC)}. \quad (6)$$

The dual of the min-budget LP is given by:

Claim 2. The dual linear program of Eq. (6) is given by:
$$\max_{\lambda \in \mathbb{R}^{n-1}_{\ge 0},\, \mu \in \mathbb{R}^{|\Omega|}_{\ge 0}} \sum_{i' \ne i^*} (c_{i^*} - c_{i'}) \lambda_{i'} \quad \text{s.t.} \quad \forall j \in \Omega:\ \sum_{i' \ne i^*} (F_{i^*,j} - F_{i',j}) \lambda_{i'} \le \mu_j; \quad \sum_{j \in \Omega} \mu_j \le 1. \quad (7)$$

Proof. We take the dual by translating the optimization problem into canonical form. The canonical form we target is:
$$\min_{x \ge 0} c^T x \quad \text{s.t.} \quad Cx \ge d,$$
and its dual, as given by Lahaie [38], is:
$$\max_{y \ge 0} d^T y \quad \text{s.t.} \quad C^T y \le c.$$
To translate Eq. (6), we take $x^T = (t_0, \dots, t_m, B)$ and note that in our case the components $c, C, d$ are given by:
$$c^T = (0, \dots, 0, 1) \in \mathbb{R}^{|\Omega|+1},$$
$$C = \begin{pmatrix} \left(F_{i^*,j} - F_{i',j}\right)_{i' \ne i^*,\, j \in \Omega} & \mathbf{0} \\ -I_{|\Omega|} & \mathbf{1} \end{pmatrix} \in \mathbb{R}^{(n-1+|\Omega|) \times (|\Omega|+1)},$$
where the first $n-1$ rows (one per action $i' \ne i^*$) encode the (IC) constraints and the last $|\Omega|$ rows encode the (BUDGET) constraints $B - t_j \ge 0$, and
$$d^T = (c_{i^*} - c_1, \dots, c_{i^*} - c_n, 0, \dots, 0) \in \mathbb{R}^{n-1+|\Omega|},$$
where the entry corresponding to $i' = i^*$ is omitted. To simplify the formulation, we denote the dual optimization variable $y$ as follows:
$$y^T = (\lambda_1, \dots, \lambda_{i^*-1}, \lambda_{i^*+1}, \dots, \lambda_n, \mu_0, \dots, \mu_m) \in \mathbb{R}^{n-1+|\Omega|}.$$
Under this formulation, the dual's objective is:
$$\sum_{i' \ne i^*} (c_{i^*} - c_{i'}) \lambda_{i'}. \quad (8)$$
The constraints given by the first $|\Omega|$ rows of $C^T$ are:
$$\sum_{i' \ne i^*} (F_{i',j} - F_{i^*,j}) \lambda_{i'} + \mu_j \ge 0,$$
and equivalently:
$$\sum_{i' \ne i^*} (F_{i^*,j} - F_{i',j}) \lambda_{i'} \le \mu_j. \quad (9)$$
The constraint corresponding to the last row of $C^T$ is given by:
$$\sum_{j \in \Omega} \mu_j \le 1. \quad (10)$$
Combining equations (8, 9, 10) yields the linear program given by Eq. (7).

To map between contracts and hypothesis tests, we introduce the following variable transformation:

Definition 3 (Statistical representation of contracts). For a given contract $t : \Omega \to \mathbb{R}_{\ge 0}$, denote $B = \max_j t_j$. The statistical representation of $t$ is given by:
$$\phi_j = t_j / B, \qquad \beta = B^{-1},$$
where $\phi_j \in [0, 1]$. Note that the transformation $(t, B) \mapsto (\phi, \beta)$ is non-linear, and well-defined for all $B > 0$.

Under this variable transformation, MIN-BUDGET transforms into an equivalent linear program:

Lemma 2. For a feasible design problem, the min-budget linear program (Eq. (6)) is equivalent to:
$$\max_{\phi \in [0,1]^{|\Omega|},\, \beta \in \mathbb{R}_{\ge 0}} \beta \quad \text{s.t.} \quad \forall i' \ne i^*:\ \sum_{j \in \Omega} F_{i^*,j}(1 - \phi_j) + \sum_{j \in \Omega} F_{i',j}\phi_j \le 1 - (c_{i^*} - c_{i'})\beta. \quad (11)$$

Proof. Given Eq. (6), define $\phi \in [0,1]^{|\Omega|}$ and $\beta \ge 0$ according to Definition 3:
$$t_j = \phi_j B = \phi_j \beta^{-1}. \quad (12)$$
Under the transformation defined by Eq. (12), the (BUDGET) constraint in Eq. (6) transforms as follows:
$$t_j \le B \iff \phi_j \beta^{-1} \le \beta^{-1} \iff \phi_j \le 1, \quad (13)$$
and the (IC) constraint in Eq. (6) transforms as:
$$\sum_j F_{i^*,j} t_j - c_{i^*} \ge \sum_j F_{i',j} t_j - c_{i'} \iff \sum_j (F_{i^*,j} - F_{i',j}) \phi_j \beta^{-1} \ge c_{i^*} - c_{i'} \iff \sum_j F_{i^*,j}\phi_j - \sum_j F_{i',j}\phi_j \ge (c_{i^*} - c_{i'})\beta \iff \sum_j F_{i^*,j}(1 - \phi_j) + \sum_j F_{i',j}\phi_j \le 1 - (c_{i^*} - c_{i'})\beta, \quad (14)$$
where the first equivalence is by Eq. (12), and the third equivalence is valid as $\sum_j F_{i^*,j} = 1$. Finally, the objective of Eq. (6) transforms as:
$$\min B \iff \max \beta. \quad (15)$$
Combining equations (13, 14, 15) yields the linear program in Eq. (11).

B.3 Relation to min-pay contract design

Min-pay contract design aims to design a contract which minimizes the expected pay under the implemented action $n^*$:
$$t^* = \operatorname{argmin}_t\, \mathbb{E}_{j \sim f_{n^*}}[t_j] \quad \text{s.t.} \quad n(t) = n^*. \quad (16)$$
In contrast to the min-budget contract (Eq. (4)), the $\|t\|_\infty$ objective measuring maximal pay is replaced with the $\mathbb{E}_{j \sim f_{n^*}}[t_j]$ objective measuring expected pay. Eq. (16) is equivalent to the MIN-PAY linear program:
$$\min_{t \in \mathbb{R}^{|\Omega|}_{\ge 0}} \sum_{j \in \Omega} F_{i^*,j} t_j \quad \text{s.t.} \quad \forall i \ne i^*:\ \sum_{j \in \Omega} F_{i^*,j} t_j - c_{i^*} \ge \sum_{j \in \Omega} F_{i,j} t_j - c_i \ \text{(IC)}. \quad (17)$$

B.3.1 Equivalence of implementability

The implementability of min-pay contracts is characterized in Dütting et al. [19]. We cite the main result for completeness:

Proposition 3 (Min-pay implementability; [19], A.2). An action $i \in A$ is implementable (up to tie-breaking) if and only if there is no convex combination of the other actions that results in the same distribution, $f_i = \sum_{i' \ne i} \alpha_{i'} f_{i'}$, but lower cost, $c_i > \sum_{i' \ne i} \alpha_{i'} c_{i'}$.

Figure 5: Qualitative comparison of min-pay and min-budget contracts under MLRP. (Left) The contract design setting, representing two possible actions with binomial outcome distributions ($p_1 = 0.5$, $p_2 = 0.8$, $m = 15$). Action 1 has zero cost $c_1 = 0$, and action 2 has unit cost $c_2 = 1$. (Right) The resulting min-pay (red) and min-budget (purple) contracts, obtained by solving Eq. (17) and Eq. (6), respectively. The min-pay contract awards payment only for the highest outcome, in alignment with Proposition 4; the min-budget contract is a threshold contract, in alignment with Claim 12.
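The two programs can also be compared directly in code. The following sketch solves MIN-BUDGET (Eq. (6)) and MIN-PAY (Eq. (17)) for the two-action binomial setting of Fig. 5; it uses scipy.optimize.linprog as a stand-in solver for illustration (an assumption; our actual implementation uses Pyomo and GLPK, see Appendix C.2).

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import binom

# Two-action setting as in the caption of Fig. 5:
m = 15
F1 = binom.pmf(np.arange(m + 1), m, 0.5)   # outcome distribution of action 1
F2 = binom.pmf(np.arange(m + 1), m, 0.8)   # outcome distribution of action 2 (target)
dc = 1.0                                   # cost difference c2 - c1

k = m + 1  # number of outcomes |Omega|

# MIN-BUDGET (Eq. (6)): variables x = (t_0, ..., t_m, B); minimize B.
c_obj = np.r_[np.zeros(k), 1.0]
A_budget = np.c_[np.eye(k), -np.ones(k)]    # t_j - B <= 0   (BUDGET)
A_ic = np.r_[-(F2 - F1), 0.0][None, :]      # -(F2-F1)^T t <= -(c2-c1)   (IC)
res_mb = linprog(c_obj, A_ub=np.vstack([A_budget, A_ic]),
                 b_ub=np.r_[np.zeros(k), -dc], bounds=(0, None))
t_min_budget = res_mb.x[:k]

# MIN-PAY (Eq. (17)): variables t; minimize expected pay under F2.
res_mp = linprog(F2, A_ub=-(F2 - F1)[None, :], b_ub=[-dc], bounds=(0, None))
t_min_pay = res_mp.x

print("min-budget contract:", np.round(t_min_budget, 3))
print("min-pay contract:   ", np.round(t_min_pay, 3))
```

Consistent with Fig. 5, the min-pay solution concentrates all payment on the highest outcome, while the min-budget solution spreads a lower maximal payment over the upper tail.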
We leverage Proposition 3 to characterize the implementability of min-budget contracts. This is possible due to the following connection:

Claim 3 (Implementability equivalence). A contract $t$ is a feasible solution of MIN-BUDGET if and only if it is a feasible solution of MIN-PAY.

Proof. A contract $t$ which satisfies MIN-BUDGET (Eq. (6)) satisfies the (IC) constraint, and thus also satisfies MIN-PAY (Eq. (17)). Conversely, a contract $t$ which satisfies MIN-PAY satisfies the (IC) constraint. Setting $B = \max_j t_j$ yields a feasible solution to MIN-BUDGET.

B.3.2 Functional form of min-pay contracts under MLRP

The optimal min-pay contracts under the MLRP assumption (Definition 1) are characterized in [19]:

Proposition 4 (Optimal min-pay contract under MLRP; [19], Lemma 7). Consider a contract design setting for which MLRP holds. If the highest-cost action $n$ is implementable, then there is a min-pay contract with a single nonzero payment, which is awarded for the highest outcome $m$.

See Fig. 5 for a qualitative comparison of min-pay and min-budget contracts.

Min-pay contracts in delegated classification settings with MLRP. While simple, the min-pay contract given by Proposition 4 would be impractical in many realistic scenarios of delegated classification. In a delegated classification setting, the highest outcome corresponds to 100% validation set accuracy (i.e., all validation set samples classified correctly). As $m$ grows, the highest outcome becomes exponentially less likely, and the min-pay contract awards an increasingly high payment with increasingly low probability (see Fig. 6). In such settings, even a slight degree of risk aversion is likely to affect the agent's decisions: from the agent's perspective, the probability of receiving any payment from a min-pay contract may become small enough for even mild risk aversion to dominate the decision-making process. From the principal's perspective, even though min-pay contracts guarantee a small payment in expectation, the payment amount in the case of the rare event would quickly become infeasible.

B.4 Min-budget contracts with two actions

In this section, we explore min-budget settings with two actions, and prove Theorem 2. When $n = 2$, assume without loss of generality that the contract implements action $i^* = 2$, and denote $c = c_2 - c_1 > 0$. We recall the definition of total variation distance:

Definition 4 (Total variation distance). Given two distributions $P, Q \in \Delta(\Omega)$, the total variation distance between $P$ and $Q$ is:
$$\|P - Q\|_{TV} = \frac{1}{2}\|P - Q\|_1 = \frac{1}{2}\sum_{j \in \Omega} |P_j - Q_j|.$$

Figure 6: Comparison of min-pay and min-budget contracts under MLRP for varying validation set size $m$. The delegation setting is similar to the one depicted in Fig. 5: two binomial-outcome actions ($p_1 = 0.5$, $p_2 = 0.8$) and varying $m$. Action 1 has zero cost $c_1 = 0$, and action 2 has unit cost $c_2 = 1$. In the left and center plots, the cost of action 2 is represented by an orange line ($c_2 = 1$), representing the lower bound as in Fig. 3 (Center). (Left) Expected pay $\mathbb{E}_{j \sim f_2}[t_j]$. (Center) Required budget $\|t\|_\infty$. (Right) Probability of getting any payment $\mathbb{P}_{j \sim f_2}[t_j > 0]$.

The following equivalent definition is useful:

Claim 4. Let $P, Q \in \Delta(\Omega)$. It holds that:
$$\|P - Q\|_{TV} = \sum_{j \in \Omega} (Q_j - P_j)_+,$$
where $x_+ = \max\{x, 0\}$.

Proof.
By definition, $\|P - Q\|_{TV} = \frac{1}{2}\|P - Q\|_1$. Decomposing the $L_1$ norm:
$$\frac{1}{2}\|P - Q\|_1 = \frac{1}{2}\sum_{j \in \Omega} \left[ (Q_j - P_j)_+ + (P_j - Q_j)_+ \right].$$
Since $\sum_{j \in \Omega} P_j = \sum_{j \in \Omega} Q_j = 1$, it holds that:
$$\sum_{j \in \Omega} (Q_j - P_j)_+ = \sum_{j \in \Omega} (P_j - Q_j)_+,$$
and therefore $\|P - Q\|_{TV} = \sum_{j \in \Omega} (Q_j - P_j)_+$ as required.

B.4.1 Optimal two-action contract

Theorem 5 (Two-action min-budget contract; formal statement of Theorem 1). When $n = 2$, the optimal min-budget contract $t^*$ is given by:
$$t^*_j = \frac{c}{\|F_2 - F_1\|_{TV}} \mathbb{1}\left[F_{2,j} \ge F_{1,j}\right]. \quad (18)$$

Proof. We prove this claim using LP duality, by showing that the optimal primal objective corresponding to $t^*$ is identical to that of a feasible solution of the dual LP. The primal LP (Eq. (6)) for two actions is given by:
$$\min_{t \in \mathbb{R}^{|\Omega|}_{\ge 0},\, B \in \mathbb{R}_{\ge 0}} B \quad \text{s.t.} \quad \forall j \in \Omega:\ t_j \le B \ \text{(BUDGET)}; \quad \sum_{j \in \Omega} (F_{2,j} - F_{1,j})\, t_j \ge c \ \text{(IC)}. \quad (19)$$
As $t^*$ is bounded, the optimal objective $B^*$ corresponds to the maximal possible payout of $t^*$:
$$B^* = \max_{j \in \Omega} t^*_j = \frac{c}{\|F_2 - F_1\|_{TV}}, \quad (20)$$
and therefore the (BUDGET) constraint is satisfied. For the (IC) constraint, denote by $\Omega'$ the following set:
$$\Omega' = \{j \in \Omega \mid F_{2,j} \ge F_{1,j}\}.$$
Using this notation, we note that $t^*_j > 0$ if and only if $j \in \Omega'$. Plugging into the constraint, we obtain:
$$\sum_{j \in \Omega} (F_{2,j} - F_{1,j})\, t^*_j = \sum_{j \in \Omega'} (F_{2,j} - F_{1,j})\, B^* = \frac{\sum_{j \in \Omega'} (F_{2,j} - F_{1,j})}{\|F_2 - F_1\|_{TV}} \cdot c = c,$$
where $\|F_2 - F_1\|_{TV} = \sum_{j \in \Omega'} (F_{2,j} - F_{1,j})$ by Claim 4. This shows that $t^*$ is a feasible solution for the primal LP. To prove optimality, we show that the dual linear program attains an identical objective. By Claim 2, the dual LP for two actions is given by:
$$\max_{\lambda \in \mathbb{R}_{\ge 0},\, \mu \in \mathbb{R}^{|\Omega|}_{\ge 0}} c\lambda \quad \text{s.t.} \quad \forall j \in \Omega:\ (F_{2,j} - F_{1,j})\lambda \le \mu_j; \quad \sum_{j \in \Omega} \mu_j \le 1. \quad (21)$$
Denote the vector $v(\lambda) \in \mathbb{R}^{|\Omega|}_{\ge 0}$ with entries:
$$v_j(\lambda) = \lambda (F_{2,j} - F_{1,j})_+.$$
We note that outcomes $j$ for which $F_{2,j} \le F_{1,j}$ (formally, $j \in \Omega \setminus \Omega'$) correspond to constraints which are satisfied for any $\lambda \ge 0$. Otherwise $j \in \Omega'$, and in these cases $\lambda$ can be increased until $v(\lambda)$ saturates the simplex constraint $\sum_{j \in \Omega} \mu_j \le 1$. The simplex constraint is binding for a value $\lambda^* > 0$ satisfying:
$$\sum_{j \in \Omega} \lambda^* (F_{2,j} - F_{1,j})_+ = 1,$$
and therefore the optimal value of the dual LP (Eq. (21)) is:
$$c\lambda^* = \frac{c}{\sum_{j \in \Omega} (F_{2,j} - F_{1,j})_+} = \frac{c}{\|F_2 - F_1\|_{TV}}, \quad (22)$$
where the second equality is given by Claim 4. The dual objective in Eq. (22) is identical to the primal objective attained by the contract $t^*$ in Eq. (20), and therefore $t^*$ is an optimal contract by strong LP duality.

B.4.2 Contracts and hypothesis tests

We recall the formal statement of the Neyman-Pearson lemma:

Lemma 3 (Neyman-Pearson, [e.g., 45, 4.3]). Let $P, Q \in \Delta(X)$ be two probability measures over an arbitrary set $X$. Then for any hypothesis test $\psi : X \to \{0, 1\}$, it holds that:
$$P(\psi(x) = 1) + Q(\psi(x) = 0) \ge 1 - \|P - Q\|_{TV}. \quad (23)$$
Moreover, equality holds for the likelihood ratio test $\psi^*(x) = \mathbb{1}[q(x) \ge p(x)]$.

In our analysis, we will assume that the space $X$ is finite. We also note that Lemma 3 is stated for decision rules with binary output, but its domain can be extended to fractional decision functions $\psi : X \to [0, 1]$ without loss of generality: as the sum of errors is linear in $\psi$, the optimal fractional decision rule is a solution to the linear program $\min_{\psi \in [0,1]^X} \left( P(\psi = 1) + Q(\psi = 0) \right)$, where $P(\psi = 1) = \sum_{x \in X} p_x \psi(x)$ and $Q(\psi = 0) = \sum_{x \in X} q_x (1 - \psi(x))$. The feasible region of this linear program is the hypercube $[0, 1]^X$, and its vertices are the set of binary decision rules $\{0, 1\}^X$, which also includes the optimal binary rule $\psi^*$ given by Lemma 3. As every feasible bounded linear program attains its optimum on a vertex, the binary $\psi^*$ is also optimal among fractional rules.
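Before turning to the formal proof of Theorem 2, the correspondence can be checked numerically. The sketch below computes the closed-form contract of Eq. (18) for hypothetical distributions P and Q, applies the transformation of Definition 3, and verifies that the induced decision rule attains the Neyman-Pearson bound of Eq. (23) with equality.

```python
import numpy as np

# Hypothetical two-action setting with cost difference c = c2 - c1 = 1.
P = np.array([0.5, 0.3, 0.2])   # F_1: outcome distribution of the cheap action
Q = np.array([0.2, 0.3, 0.5])   # F_2: outcome distribution of the target action
dc = 1.0

tv = np.sum(np.clip(Q - P, 0.0, None))   # ||P - Q||_TV via Claim 4
t_star = (dc / tv) * (Q >= P)            # optimal contract, Eq. (18)

B = t_star.max()                         # required budget B* = dc / tv
phi = t_star / B                         # critical function: 1[Q_j >= P_j]
beta = 1.0 / B                           # transformed budget variable (Def. 3)

# Sum of type-I and type-II errors of the induced likelihood-ratio test:
errors = np.sum(P * phi) + np.sum(Q * (1 - phi))
assert np.isclose(errors, 1 - tv)        # Neyman-Pearson bound, with equality
print(t_star, B, errors)
```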
Proof of Theorem 2. For a given two-action contract, denote $P_j = F_{1,j}$, $Q_j = F_{2,j}$, and $c = c_2 - c_1$. We recall that the statistical min-budget LP for two actions is given by Lemma 2:
$$\max_{\phi \in [0,1]^{|\Omega|},\, \beta \in \mathbb{R}_{\ge 0}} \beta \quad \text{s.t.} \quad \sum_{j \in \Omega} P_j \phi_j + \sum_{j \in \Omega} Q_j (1 - \phi_j) \le 1 - c\beta, \quad (24)$$
and the Neyman-Pearson lemma is given by Eq. (23). Given a min-budget contract design problem, apply Theorem 1 to obtain its optimal solution, as given by Eq. (18):
$$B^* = \frac{c}{\|P - Q\|_{TV}}, \qquad t^*_j = \frac{c}{\|P - Q\|_{TV}} \mathbb{1}[Q_j \ge P_j].$$
Applying the transformation from Definition 3 on the optimal contract, we obtain equivalently:
$$\beta^* = (B^*)^{-1} = \frac{\|P - Q\|_{TV}}{c}, \qquad \phi^*_j = t^*_j / B^* = \mathbb{1}[Q_j \ge P_j].$$
Note that $\phi^*_j$ is a maximum-likelihood decision rule, similar to the optimal critical function in the Neyman-Pearson lemma. Additionally, by optimality of the contract-design optimization variable $\beta^*$, any feasible solution $\phi$ of Eq. (24) satisfies:
$$\sum_{j \in \Omega} P_j \phi_j + \sum_{j \in \Omega} Q_j (1 - \phi_j) \ge 1 - c\beta^* = 1 - \|P - Q\|_{TV},$$
with equality satisfied by the maximum likelihood rule $\phi^*$. Therefore, the min-budget optimality of the contract $t^*$ implies the power optimality of the hypothesis test $\phi^*$.

Conversely, let $\phi : \Omega \to [0, 1]$ be a maximal-power hypothesis test. The Neyman-Pearson lemma provides a lower bound on the constraint in Eq. (24):
$$\sum_{j \in \Omega} P_j \phi_j + \sum_{j \in \Omega} Q_j (1 - \phi_j) \ge 1 - \|P - Q\|_{TV}.$$
By the Neyman-Pearson lemma, the bound is tight for the maximum likelihood rule $\phi^* = \mathbb{1}[Q \ge P]$, and therefore the optimal objective $\beta^*$ satisfies:
$$1 - c\beta^* = 1 - \|P - Q\|_{TV},$$
and hence $\beta^* = \|P - Q\|_{TV} / c$. Applying the transformation in Definition 3 yields the optimal contract $t^*_j = \frac{c}{\|P - Q\|_{TV}} \mathbb{1}[Q_j \ge P_j]$, showing that the corresponding contract is min-budget optimal if the corresponding hypothesis test has optimal statistical power.

Remark 1. Since the proof of Theorem 1 is independent of the Neyman-Pearson lemma, the argument proving the optimality of $\phi^*$ in the proof above implies the Neyman-Pearson lemma for finite $X$.

B.5 More than two actions

Definition 5 (All-or-nothing contract). Let $B > 0$. An all-or-nothing contract $t : \Omega \to \mathbb{R}_{\ge 0}$ satisfies $t_j \in \{0, B\}$ for all $j \in \Omega$.

B.5.1 Binary outcomes

In this section, we show that every binary-outcome min-budget contract is all-or-nothing. This is a corollary of a more general lemma:

Lemma 4. For any feasible min-budget contract $t^*$, there exists $j_0 \in \Omega$ such that $t^*_{j_0} = 0$.

Proof. By contradiction, assume that $t^*_j > 0$ for all $j$, and denote $j_0 = \operatorname{argmin}_j t^*_j$. Denote $a = \min_j t^*_j$, and note that $a > 0$. Define the contract $\tilde{t}$ as follows:
$$\forall j:\ \tilde{t}_j = t^*_j - a.$$
By definition, $\tilde{t}_j \ge 0$ for all $j$, and $\tilde{t}_{j_0} = 0$. Since $t^*$ is a feasible min-budget contract, it satisfies the min-budget LP in Eq. (6), and in particular the (IC) constraints:
$$\forall i \in [n-1]:\ \sum_j F_{n,j} t^*_j - c_n \ge \sum_j F_{i,j} t^*_j - c_i.$$
Plugging in the definition $\tilde{t} = t^* - a$ and using the fact that $\sum_j F_{i,j} = 1$ for all $i \in [n]$, we obtain:
$$\forall i \in [n-1]:\ \sum_j F_{n,j} \tilde{t}_j - c_n \ge \sum_j F_{i,j} \tilde{t}_j - c_i,$$
and therefore $\tilde{t}$ satisfies the (IC) constraints as well. This is a contradiction, since $\tilde{t}$ is a feasible contract with a lower required budget. From this we conclude that the optimal contract must satisfy $t^*_j = 0$ for some $j$.

Corollary 1. In any min-budget contract design setting with two outcomes, the optimal contract is all-or-nothing.

B.5.2 Hardness: Basic definitions and construction

In this section, we show that finding optimal all-or-nothing contracts is NP-hard in the general case, by reduction from 3SAT.

Definition 6 (Maximin design matrix). For a contract design setting with action set $A$, outcome space $\Omega$, target action $a^* \in A$, and $c_a < c_{a^*}$ for all $a \in A \setminus \{a^*\}$, the maximin design matrix $A$ is defined as:
$$A_{a\omega} = \frac{F_{a^*,\omega} - F_{a,\omega}}{c_{a^*} - c_a}.$$

Claim 5. For a contract design problem, denote the maximin design matrix by $A$. An all-or-nothing min-budget contract $t^*$ can be written as $t^* = \phi^* / \beta^*$, where:
$$\phi^* = \operatorname{argmax}_{\phi \in \{0,1\}^{|\Omega|}} \min_{a \in A \setminus \{a^*\}} (A\phi)_a, \qquad \beta^* = \min_{a \in A \setminus \{a^*\}} (A\phi^*)_a. \quad (26)$$

Proof. By Lemma 2, Eq. (11), the min-budget contract design LP is equivalent to:
$$\max_{\phi \in [0,1]^{|\Omega|},\, \beta \in \mathbb{R}_{\ge 0}} \beta \quad \text{s.t.} \quad \forall a \in A \setminus \{a^*\}:\ \sum_{\omega \in \Omega} \frac{F_{a^*,\omega} - F_{a,\omega}}{c_{a^*} - c_a}\, \phi_\omega \ge \beta,$$
where $\beta = B^{-1}$.
As $A_{a,\omega} = \frac{F_{a^*,\omega} - F_{a,\omega}}{c_{a^*} - c_a}$, it holds that $\sum_{\omega \in \Omega} \frac{F_{a^*,\omega} - F_{a,\omega}}{c_{a^*} - c_a}\, \phi_\omega = (A\phi)_a$. The LP above is therefore equivalent to:
$$\max_{\phi \in [0,1]^{|\Omega|},\, \beta \in \mathbb{R}_{\ge 0}} \beta \quad \text{s.t.} \quad \forall a \in A \setminus \{a^*\}:\ (A\phi)_a \ge \beta.$$
As $\beta$ is a lower bound for every constraint, its optimal value is the maximal minimum attainable through variation of $\phi$:
$$\max_{\phi \in [0,1]^{|\Omega|}} \min_{a \in A \setminus \{a^*\}} (A\phi)_a,$$
and restricting the optimization space of $\phi$ to $\{0, 1\}^{|\Omega|}$ yields the desired result.

Definition 7 (3-CNF; conjunctive normal form). A 3-CNF formula over $m$ variables and $n$ clauses is a boolean-valued function $\psi : \{0, 1\}^m \to \{0, 1\}$ of the form:
$$\psi(x_1, \dots, x_m) = \bigwedge_{i=1}^n (z_{i1} \lor z_{i2} \lor z_{i3}),$$
where $z_{ik} \in \{x_1, \dots, x_m, \bar{x}_1, \dots, \bar{x}_m\}$. We assume that the variables in each clause are distinct. $\psi$ is satisfiable if and only if there exists $x \in \{0, 1\}^m$ such that $\psi(x) = 1$.

Definition 8 (Number of positives in clause $i$). Given a 3-CNF $\psi$ and an assignment $x \in \{0, 1\}^m$, denote by $\sigma_i(\psi, x) \in \{0, \dots, 3\}$ the number of literals $z_{ik}$ in clause $i$ which evaluate to 1 under assignment $x$.

Claim 6. A formula $\psi$ is satisfiable if and only if there exists an assignment $x \in \{0, 1\}^m$ such that $\min_{i \in [n]} \sigma_i(\psi, x) \ge 1$.

Proof. If $\psi$ is satisfiable then there exists $x \in \{0, 1\}^m$ such that $\psi(x) = 1$. Since $\psi$ is a 3-CNF, every clause $i$ must evaluate to 1, and therefore $\sigma_i(\psi, x) \ge 1$ for all $i \in [n]$. Conversely, if there exists an assignment $x$ such that $\min_{i \in [n]} \sigma_i(\psi, x) \ge 1$, then by definition $x$ satisfies every clause, and therefore the whole formula $\psi$.

B.5.3 Hardness: Reduction from 3SAT

Given a 3-CNF $\psi$ with $n$ clauses and $m$ variables, we define a min-budget contract design problem over actions $A = [n + 1]$ and outcome space $\Omega = [m] \cup \{\mathrm{pos}, \mathrm{neg}, \mathrm{const}\}$. The target action is $a^* = n + 1$, and the cost associated with action $i$ is:
$$c_i = \begin{cases} 1 & i = n + 1 \\ 0 & \text{otherwise.} \end{cases} \quad (27)$$
For simplicity of notation, let $\varepsilon > 0$, which will be assigned a suitable value later in the construction. The outcome distribution of the target action is denoted by $Q$, and defined as:
$$\forall j \in [m]:\ Q_j = \frac{\varepsilon}{m}, \qquad Q_{\mathrm{pos}} = 1 - \varepsilon\left(1 + \frac{3}{m}\right), \qquad Q_{\mathrm{neg}} = 0, \qquad Q_{\mathrm{const}} = \frac{3\varepsilon}{m}. \quad (28)$$
Note that $Q$ is well-defined for $m \ge 3$ and $\varepsilon \le \frac{1}{2}$. For each $i \in [n]$, denote the number of negated variables in clause $i$ by $k_i \in \{0, \dots, 3\}$. The distribution corresponding to the $i$-th action is denoted by $P^{(i)}$, and defined as:
$$\forall j \in [m]:\ P^{(i)}_j = \begin{cases} 0 & x_j \text{ appears in clause } i \\ \frac{2\varepsilon}{m} & \bar{x}_j \text{ appears in clause } i \\ \frac{\varepsilon}{m} & \text{otherwise,} \end{cases} \qquad P^{(i)}_{\mathrm{pos}} = 0, \qquad P^{(i)}_{\mathrm{neg}} = 1 - \varepsilon\left(1 + \frac{k_i}{m}\right), \qquad P^{(i)}_{\mathrm{const}} = (3 - k_i)\frac{\varepsilon}{m}. \quad (29)$$
The distributions $P^{(i)}$ are well-defined when $m \ge 3$ and $\varepsilon \le \frac{1}{2}$. For concreteness, $\varepsilon$ is set to a fixed constant value (Eq. (30)), and we note that $Q_{\mathrm{pos}}$ and $P^{(i)}_{\mathrm{neg}}$ are both strictly positive for $m \ge 3$ and this value of $\varepsilon$.

Plugging equations (27, 28, 29) into Definition 6, the maximin design matrix $A^\psi \in \mathbb{R}^{n \times (m+3)}$ corresponding to the contract design problem above is given by:
$$\forall j \in [m]:\ A^\psi_{i,j} = \begin{cases} \frac{\varepsilon}{m} & x_j \text{ appears in clause } i \\ -\frac{\varepsilon}{m} & \bar{x}_j \text{ appears in clause } i \\ 0 & \text{otherwise,} \end{cases} \qquad A^\psi_{i,\mathrm{pos}} = Q_{\mathrm{pos}}, \qquad A^\psi_{i,\mathrm{neg}} = -P^{(i)}_{\mathrm{neg}}, \qquad A^\psi_{i,\mathrm{const}} = k_i \frac{\varepsilon}{m}. \quad (31)$$

Definition 9 (Assignment-normalized contract). For an assignment $x \in \{0, 1\}^m$, the corresponding normalized contract $\phi^x \in \{0, 1\}^{m+3}$ is:
$$\forall j \in [m]:\ \phi^x_j = x_j, \qquad \phi^x_{\mathrm{pos}} = 1, \qquad \phi^x_{\mathrm{neg}} = 0, \qquad \phi^x_{\mathrm{const}} = 1.$$

Claim 7. Let $\psi$ be a 3-CNF, and $x \in \{0, 1\}^m$ an assignment. For all $i \in [n]$:
$$(A^\psi \phi^x)_i = \frac{\varepsilon}{m}\sigma_i(\psi, x) + Q_{\mathrm{pos}}.$$

Proof. To prove this claim, plug $\phi^x$ from Definition 9 into $A^\psi$ defined in Eq. (31). To simplify notation, here we denote $A = A^\psi$, $\phi = \phi^x$, and $z_i = \{z_{i1}, z_{i2}, z_{i3}\}$. We obtain:
$$(A\phi)_i = \sum_{\omega \in [m] \cup \{\mathrm{pos}, \mathrm{neg}, \mathrm{const}\}} A_{i,\omega}\phi_\omega = \sum_{j \in [m]} A_{i,j}\phi_j + \underbrace{A_{i,\mathrm{pos}}}_{=Q_{\mathrm{pos}}} \underbrace{\phi_{\mathrm{pos}}}_{=1} + A_{i,\mathrm{neg}} \underbrace{\phi_{\mathrm{neg}}}_{=0} + \underbrace{A_{i,\mathrm{const}}}_{=k_i \varepsilon/m} \underbrace{\phi_{\mathrm{const}}}_{=1} = \frac{\varepsilon}{m}\left( \sum_{x_j \in z_i} \phi_j + \sum_{\bar{x}_j \in z_i} (1 - \phi_j) \right) + Q_{\mathrm{pos}} = \frac{\varepsilon}{m}\sigma_i(\psi, x) + Q_{\mathrm{pos}}.$$
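The construction above is mechanical, and can be written out directly. The following sketch builds $A^\psi$ from a clause list according to Eq. (31) and evaluates $(A^\psi \phi^x)_i$, matching Claim 7; the clause encoding is a hypothetical convention, and eps is an illustrative placeholder for the constant fixed in Eq. (30).

```python
import numpy as np

def maximin_matrix(clauses, m, eps):
    """Build A^psi (Eq. (31)). clauses: list of 3-tuples of nonzero ints,
    where +j encodes x_j and -j encodes its negation. Columns are the m
    variable outcomes followed by [pos, neg, const]."""
    n = len(clauses)
    A = np.zeros((n, m + 3))
    q_pos = 1 - eps * (1 + 3 / m)                  # Q_pos from Eq. (28)
    for i, clause in enumerate(clauses):
        k_i = sum(1 for z in clause if z < 0)      # number of negated literals
        for z in clause:
            A[i, abs(z) - 1] = eps / m if z > 0 else -eps / m
        A[i, m] = q_pos                            # pos coordinate
        A[i, m + 1] = -(1 - eps * (1 + k_i / m))   # neg coordinate: -P^(i)_neg
        A[i, m + 2] = k_i * eps / m                # const coordinate
    return A

def phi_of(x, m):
    """Assignment-normalized contract phi^x (Definition 9): [x, 1, 0, 1]."""
    return np.r_[np.asarray(x, dtype=float), 1.0, 0.0, 1.0]

# Example: psi = (x1 v x2 v -x3) ^ (-x1 v x2 v x3), assignment x = (1, 0, 1).
A = maximin_matrix([(1, 2, -3), (-1, 2, 3)], m=3, eps=0.25)
print(A @ phi_of([1, 0, 1], 3))   # equals (eps/m)*sigma_i + Q_pos, per Claim 7
```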
Claim 8. For a given 3-CNF $\psi$, denote by $\phi^*$ the optimal solution of Eq. (26) with $A = A^\psi$. Then there exists an assignment $x^*$ such that $\phi^* = \phi^{x^*}$.

Proof. For the maximin matrix $A^\psi$ defined in Eq. (31), the optimization objective in Eq. (26) is $\max_{\phi \in \{0,1\}^{m+3}} \min_{i \in [n]} (A^\psi \phi)_i$. By the choice of $\varepsilon$ in Eq. (30), $A^\psi$ satisfies the following for all $i \in [n]$:
$$A_{i,\mathrm{pos}} = Q_{\mathrm{pos}} > 0, \qquad A_{i,\mathrm{neg}} = -P^{(i)}_{\mathrm{neg}} < 0, \qquad A_{i,\mathrm{const}} = k_i \frac{\varepsilon}{m} \ge 0.$$
Thus, any optimal solution $\phi^*$ must satisfy:
$$\phi^*_{\mathrm{pos}} = 1, \qquad \phi^*_{\mathrm{neg}} = 0, \qquad \phi^*_{\mathrm{const}} = 1.$$
Otherwise, the value of every $(A^\psi \phi)_i$ would weakly increase (strictly so in the pos and neg coordinates) by changing the value of $\phi$ at the coordinates $\{\mathrm{pos}, \mathrm{neg}, \mathrm{const}\}$ to the values specified above. This would increase the optimization objective $\min_{a \in A \setminus \{a^*\}} (A^\psi \phi)_a$, in contradiction to the optimality of $\phi^*$ (for the const coordinate, the choice is without loss of generality). Hence, denoting $x^*_j = \phi^*_j$ for all $j \in [m]$, we obtain $\phi^* = \phi^{x^*}$ as required.

Claim 9. A 3-CNF $\psi$ is satisfiable if and only if the optimization problem in Eq. (26) with $A = A^\psi$ satisfies $\beta^* \ge Q_{\mathrm{pos}} + \frac{\varepsilon}{m}$.

Proof. If $\psi$ is satisfiable by an assignment $x \in \{0, 1\}^m$, then $\sigma_i(\psi, x) \ge 1$ for all $i \in [n]$ by Claim 6. Let $\phi^x$ denote the vector corresponding to the satisfying assignment, according to Definition 9, and observe that $\beta^* \ge \min_{i \in [n]} (A^\psi \phi^x)_i$ by construction. Applying Claim 7, we obtain:
$$\beta^* \ge \min_{i \in [n]} \left[ \frac{\varepsilon}{m}\sigma_i(\psi, x) + Q_{\mathrm{pos}} \right] \ge Q_{\mathrm{pos}} + \frac{\varepsilon}{m}.$$
Conversely, assume the optimal solution to Eq. (26) satisfies $\beta^* \ge Q_{\mathrm{pos}} + \frac{\varepsilon}{m}$, and denote the vector attaining the optimal solution by $\phi^*$. By Claim 8, there exists an assignment $x^*$ such that $\phi^* = \phi^{x^*}$. Combining the lower bound on the value of the optimal solution with the result of Claim 7, we obtain that the following holds for all $i \in [n]$:
$$\frac{\varepsilon}{m}\sigma_i(\psi, x^*) + Q_{\mathrm{pos}} = (A^\psi \phi^{x^*})_i \ge \beta^* \ge Q_{\mathrm{pos}} + \frac{\varepsilon}{m}.$$
Therefore, it must hold that $\sigma_i(\psi, x^*) \ge 1$ for all $i$, and thus $x^*$ satisfies $\psi$ according to Claim 6.

B.5.4 Proof of hardness

Proof of Theorem 3. By reduction from 3SAT. Given a 3-CNF $\psi$ with $n$ clauses and $m$ variables, construct in polynomial time the corresponding matrix $A^\psi$ as defined by Eq. (31), and apply an all-or-nothing min-budget contract solver according to Eq. (26) to obtain the optimal solution. The correspondence between the solver's output and the LP is given by Eq. (26). By Claim 9, the formula $\psi$ is satisfiable if and only if the value attained in the optimization is at least $Q_{\mathrm{pos}} + \frac{\varepsilon}{m}$, where $m$ is the number of variables in $x$, and $\varepsilon$, $Q_{\mathrm{pos}}$ are constants defined in Eq. (30) and Eq. (28), respectively.

B.6 The single binding action algorithm

To prove the soundness of SBA, we first make the following definition:

Definition 10 (($i'$-IC) relaxation). Consider a MIN-BUDGET LP with target action $i^* \in A$, as given by Eq. (6). For any action $i' \ne i^*$, the ($i'$-IC) relaxation of the min-budget problem is given by eliminating all (IC) constraints except for the one corresponding to action $i'$.

Proof of Proposition 2. Denote the target action by $i^* \in A$. On each iteration of the loop, the algorithm considers some action $i' \ne i^*$, and applies Theorem 1 to the ($i'$-IC) relaxation of the original MIN-BUDGET LP. Since the relaxed LP has only one (IC) constraint, its optimal solution, denoted by $t^*(i', i^*)$, is given by Eq. (5). By Definition 10, the feasible region of the original LP lies within the feasible region of its ($i'$-IC) relaxation. Thus, if an optimal solution of the relaxed LP lies within the feasible region of the original LP, then it is also a global optimum of the original LP. $t^*(i', i^*)$ lies within the feasible region of the original LP if it satisfies the remaining (IC) constraints, a condition equivalent to $a^*(t^*(i', i^*)) = i^*$ by Eq. (2). The SBA algorithm terminates successfully only in such cases, and therefore it is sound.
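For reference, a compact sketch of the procedure just analyzed is given below: each ($i'$-IC) relaxation is solved in closed form via Theorem 1, and the candidate is accepted only if it passes the remaining (IC) constraints. The inputs F and c are hypothetical, and the sketch is illustrative rather than our actual implementation (see Appendix C.2.1).

```python
import numpy as np

def sba(F, c, target):
    """Single Binding Action: for each candidate binding action i', solve the
    (i'-IC) relaxation in closed form (Theorem 1), and return the solution if
    it satisfies all remaining (IC) constraints (Proposition 2)."""
    for i in range(len(c)):
        if i == target or c[i] >= c[target]:
            continue
        tv = np.sum(np.clip(F[target] - F[i], 0.0, None))  # ||F_target - F_i||_TV
        if tv == 0:
            continue
        t = ((c[target] - c[i]) / tv) * (F[target] >= F[i])  # relaxed optimum
        u = F @ t - c                                        # agent utilities
        if np.all(u[target] >= u - 1e-12):                   # t implements target?
            return t       # feasible for the full LP, hence globally optimal
    return None            # SBA fails; fall back to the full LP solver
```

By the soundness argument above, the first candidate that passes the feasibility check is already a global optimum of the full MIN-BUDGET LP, so the loop may return immediately.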
A note on ties in SBA. Denote the target action by $i^*$. To simplify presentation, the algorithm presentation in Section 3.4 implicitly assumes that the required budgets $\|t^*(i', i^*)\|_\infty$ are distinct for every pair of actions $(i', i^*)$, so that the return value of SBA is well-defined. In case of ties, the return value is not well-defined (as the iteration order is not well-defined), but a small modification to the algorithm allows ties to be broken explicitly without affecting its other properties: ties in the optimal required budget affect neither the soundness of the algorithm nor the all-or-nothing property of the returned contracts; however, the exact functional form of the returned contract may depend on the iteration order. In case this issue becomes relevant, the SBA algorithm can be modified to first collect a set of optimal contracts (instead of returning immediately when an optimal feasible contract is found), and then select one of the optimal contracts based on the desired criteria. As the proof of Proposition 2 does not depend on iteration order, soundness is not affected, and worst-case time complexity remains the same.

B.7 MLRP

In this section, we explore contract design under the MLRP assumption, and prove Theorem 4. We first recall the statistical notion of monotone likelihood ratio:

Definition 11 (Monotone likelihood ratio; MLR). The distributions $P, Q \in \Delta(\Omega)$ have monotone likelihood ratio when the ratio $Q_j / P_j$ is monotonically increasing in $j \in \Omega = \{0, \dots, m\}$. We denote this by $P \preceq Q$.

An introduction to MLR and its relation to statistical hypothesis testing is provided in Lehmann et al. [40, Section 3.4]. Using this notation, we can reformulate Definition 1:

Definition 12 (MLRP; reformulation of Def. 1 using MLR notation). A principal-agent problem satisfies the monotone likelihood ratio property if $F_i \preceq F_{i'}$ for every pair of actions $i, i'$ such that $c_i < c_{i'}$.

Claim 10. Let $P, Q \in \Delta(\Omega)$ such that $P \preceq Q$. Then $P_0 > Q_0$ and $P_m < Q_m$.

Proof. For the first inequality, assume by contradiction that $Q_0 \ge P_0$. Combined with the MLR property, this assumption implies that $Q_j \ge P_j$ for all $j \in \{0, \dots, m\}$. As the outcome distributions $P, Q$ are distinct, there exists at least one $j$ for which $Q_j > P_j$. Summing over $j$ yields $\sum_{j \in \Omega} Q_j > \sum_{j \in \Omega} P_j$, which is a contradiction, as the inequality is strict while both sides sum to 1. The proof of the second inequality follows in the same way.

Definition 13 (MLR crossing point $j^*_{P,Q}$). Let $P, Q \in \Delta(\Omega)$ such that $P \preceq Q$. The crossing point of $Q$ over $P$ is the outcome $j^* \in \{0, \dots, m\}$ such that $Q_{j^*-1} < P_{j^*-1}$ and $Q_{j^*} \ge P_{j^*}$. When the distributions are not clear from context, we denote $j^* = j^*_{P,Q}$.

In words, the crossing point $j^*$ is the outcome such that for every lower outcome, $P$ is strictly more likely than $Q$, and starting with this outcome, $Q$ is weakly more likely. By Claim 10, the MLR crossing point $j^*$ is uniquely defined for any $P \preceq Q$, and it holds that $j^* > 0$.

Definition 14 (Survival function). Given $P \in \Delta(\Omega)$, the survival function $S_P(\cdot) : \Omega \to [0, 1]$ is defined as:
$$S_P(j) = \mathbb{P}_{j' \sim P}[j' > j] = \sum_{j' = j+1}^m P_{j'}.$$
Informally, the survival function is 1 minus the distribution's CDF. This function is useful in our context, since the total variation distance between two distributions (as appearing in Theorem 5) can be represented as the difference between their survival functions when they satisfy MLR. The intuitive reason for this is that MLR means the distributions are "single-crossing".
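These objects are straightforward to compute. The sketch below finds the crossing point of two hypothetical binomial distributions (binomials with a larger success probability dominate in the MLR order) and checks the survival-function identity established in Claim 11 next.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical MLR pair: P (p=0.5) and Q (p=0.8) satisfy P <= Q.
m = 10
P = binom.pmf(np.arange(m + 1), m, 0.5)
Q = binom.pmf(np.arange(m + 1), m, 0.8)

j_star = int(np.argmax(Q >= P))          # crossing point j* (Definition 13)
S = lambda D, j: float(D[j + 1:].sum())  # survival function S_D(j) (Definition 14)

tv = float(np.sum(np.clip(Q - P, 0.0, None)))               # ||P - Q||_TV (Claim 4)
assert np.isclose(tv, S(Q, j_star - 1) - S(P, j_star - 1))  # Claim 11
print(j_star, tv)
```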
Claim 11 (Total variation under MLR). Let $P, Q \in \Delta(\Omega)$ such that $P \preceq Q$. It holds that:
$$\|P - Q\|_{TV} = S_Q(j^*_{P,Q} - 1) - S_P(j^*_{P,Q} - 1).$$

Proof. By Claim 4, it holds that:
$$\|P - Q\|_{TV} = \sum_{j \in \{0, \dots, m\}} (Q_j - P_j)_+.$$
Using the $j^*$ notation (see Definition 13), we obtain:
$$\sum_{j \in \{0, \dots, m\}} (Q_j - P_j)_+ = \sum_{j = j^*_{P,Q}}^m (Q_j - P_j),$$
and the result is obtained by applying the definition of the survival function (see Definition 14).

B.7.1 Two-action contracts with MLRP

Combining Theorem 5 with Claim 10 leads to a characterization of threshold contracts in the case of two actions:

Claim 12. For any two-action contract design problem satisfying MLRP, the optimal contract is a threshold contract:
$$t^*_j = \frac{c}{\|F_2 - F_1\|_{TV}} \mathbb{1}\left[j \ge j^*_{F_1,F_2}\right],$$
where $j^*_{F_1,F_2} = \min\{j \in \{0, \dots, m\} \mid F_{2,j} \ge F_{1,j}\}$ as in Definition 13.

Proof. Combining the result of Claim 10 with the monotonicity assumption, the likelihood ratio $F_{2,j}/F_{1,j}$ crosses 1 exactly once. Denote the crossing point by $j^*$, and apply Theorem 5 to obtain the optimal contract.

B.7.2 Two-outcome contracts with MLRP

When there are more than two actions, assume without loss of generality that the contract aims to implement the last action $n$. The following claim establishes the existence of threshold contracts in the two-outcome setting $|\Omega| = 2$. In contrast to Corollary 1, the proof is constructive, and yields a concrete contract:

Claim 13. For any contract design problem satisfying MLRP with $n > 2$ actions and $m = 2$ outcomes, the optimal min-budget contract is a threshold contract.

Proof. By Claim 2, the dual LP for two outcomes is given by:
$$\max_{\lambda \in \mathbb{R}^{n-1}_{\ge 0},\, \mu \in \mathbb{R}^2_{\ge 0}} \sum_{i=1}^{n-1} (c_n - c_i)\lambda_i \quad \text{s.t.} \quad \forall j \in \{1, 2\}:\ \sum_{i=1}^{n-1} (F_{n,j} - F_{i,j})\lambda_i \le \mu_j; \quad \sum_{j \in \{1,2\}} \mu_j \le 1. \quad (33)$$
From Claim 10, we obtain that $F_{n,1} - F_{i,1} < 0$ and $F_{n,2} - F_{i,2} > 0$ for all $i \in [n-1]$, and therefore the first constraint in Eq. (33) is always satisfied for $j = 1$. Simplifying Eq. (33), we obtain:
$$\max_{\lambda \in \mathbb{R}^{n-1}_{\ge 0}} \sum_{i=1}^{n-1} (c_n - c_i)\lambda_i \quad \text{s.t.} \quad \sum_{i=1}^{n-1} (F_{n,2} - F_{i,2})\lambda_i \le 1,$$
which is maximized by allocating all budget to the $\lambda_i$ maximizing the "bang for buck". The dual objective is therefore given by:
$$B^* = \max_{i \in [n-1]} \frac{c_n - c_i}{F_{n,2} - F_{i,2}}.$$
To see that a threshold contract is optimal, let $t^*_j = B^* \mathbb{1}[j = 2]$. The primal objective is $B^*$, and the contract is optimal if it is feasible for the primal LP. The (IC) constraint in Eq. (6) can be written as $\sum_{j=1}^2 (F_{n,j} - F_{i,j})\, t_j \ge c_n - c_i$, and indeed, for every action $i \in [n-1]$:
$$\frac{\sum_{j=1}^2 (F_{n,j} - F_{i,j})\, t^*_j}{c_n - c_i} = \frac{F_{n,2} - F_{i,2}}{c_n - c_i} \cdot B^* = \frac{F_{n,2} - F_{i,2}}{c_n - c_i} \cdot \max_{i' \in [n-1]} \frac{c_n - c_{i'}}{F_{n,2} - F_{i',2}} \ge 1.$$
Therefore $t^*$ is feasible in the primal problem, and also optimal by strong LP duality. Also note that the resulting contract coincides exactly with the optimal min-pay contract as attained by Dütting et al. [19, Lemma 7].

B.7.3 General contracts with MLRP

In this section, we explore min-budget contracts in MLRP settings with more than two actions and more than two outcomes. We start with a negative example, showing that the MLRP assumption is not sufficient for establishing the optimality of threshold contracts:

Claim 14. For $|\Omega| > 2$, there exists a design problem satisfying MLRP for which the optimal contract is not a threshold.

Proof. Consider the following contract design problem:
$$F_1 \sim \mathrm{Binomial}(10, 0.5), \quad F_2 \sim \mathrm{Binomial}(10, 0.65), \quad F_3 \sim \mathrm{Binomial}(10, 0.8), \qquad c_1 = 0, \quad c_2 = 0.45, \quad c_3 = 1.$$

Figure 7: Graphical illustration of Claim 14. (Left) Outcome distributions $F_1, F_2, F_3$ and MLR crossing point $j^*_{F_2,F_3}$. (Center) The min-budget contract $t^*_j$, given by numerically solving Eq. (11) (purple), and by numerically solving Eq. (11) while imposing integer constraints $\phi_j \in \{0, 1\}$ (cyan). Note that the fractional LP solution achieves a lower max payout. (Right) Crossing-point survival $s_i = S_{F_i}(j^* - 1)$ as a function of cost $c_i$ (see Definition 15). Note that the function is not concave, thus the sufficient condition given in Theorem 4 does not hold in this case.
The distributions are binomial, and therefore the setting satisfies MLRP. Numerically solving Eq. (11) yields the following fractional contract:
$$t^*_{LP} = (0, 0, 0, 0, 0, 0, 0, 1.1, 1.46, 1.46, 1.46).$$
Numerically solving Eq. (11) while imposing integer constraints $\phi_j \in \{0, 1\}$ yields the following contract, which has a higher max payout:
$$t^*_{IP} = (0, 0, 0, 0, 0, 0, 0, 1.51, 1.51, 1.51, 1.51).$$
Thus, for this case, any threshold contract is min-budget suboptimal. A graphical illustration of the proof is provided in Fig. 7.

Remark 2. The proof of Claim 14 is provided with 10 outcomes ($|\Omega| = 10$) for ease of graphical interpretation; however, a minimal counterexample may also be constructed using only 3 outcomes. For example, consider the following design problem:
$$F_1 = (0.5, 0.3, 0.2), \quad F_2 = (0.3, 0.4, 0.3), \quad F_3 = (0.1, 0.35, 0.55), \qquad c_1 = 0, \quad c_2 = 0.45, \quad c_3 = 1.$$
The proof for this case follows in the same way, where numerical computation yields the contracts:
$$t^*_{LP} = (0, 1.9, 2.6), \qquad t^*_{IP} = (0, 2.7, 2.7).$$

Remark 3. We also note that a counterexample with 2 outcomes is not possible, due to Claim 13.

The negative example shows that even with MLRP, the optimal contract incentivizing the highest implementable action needs to rule out deviations of the agent to all other actions, and not just to the second-highest one. This makes the contract complex. We identify a natural economic condition that is sufficient for considering only a single deviation (to the second-highest action), resulting in a simple contract. Note that considering a single such deviation is equivalent to restricting the action space to actions $\{n - 1, n\}$.

B.8 Sufficiency condition for more than two actions

For the following definition, recall the definitions of the MLR crossing point $j^*$ (Def. 13) and the survival function $S$ (Def. 14):

Definition 15 (Concave-MLRP; formal statement of Definition 2). For a contract design setting satisfying MLRP, denote by $j^* = j^*_{F_{n-1},F_n}$ the crossing point between the outcome distribution $F_n$ corresponding to the highest action and the outcome distribution $F_{n-1}$ corresponding to the second-highest action. For any action $i \in [n]$, the crossing-point survival $s_i$ is defined as:
$$s_i = S_{F_i}(j^* - 1).$$
In words, $s_i$ is the probability of obtaining an outcome at or above the crossing point $j^*$ according to outcome distribution $F_i$. From an economic perspective, for every action $i \in [n]$, $s_i$ is the agent's probability of receiving any nonzero payment from choosing action $i$, if the contract is designed by restricting the action space to actions $\{n - 1, n\}$. Note, however, that the definition of $s_i$ only depends on the outcome distributions $F_{i,j}$, and does not require solving the contract design problem.

B.8.1 Binomial power-law curves satisfy Concave-MLRP

Recent theoretical work on learning curves has focused mainly on power-law expected learning curves of the form $a_n = 1 - \beta n^{-\gamma}$ [11, 33]. In addition, these curves were also found to provide a good fit for certain practical applications [49, POW2]. In this section, we show that a stochastic generalization of these curves satisfies the Concave-MLRP property:

Definition 16 (Power-law stochastic learning curve). Let $\beta, \gamma \in \mathbb{R}_{\ge 0}$, and $m \in \mathbb{N}$. A delegated learning setting with actions $A$ has a realizable power-law stochastic learning curve with parameters $\beta, \gamma$ if $F_i = \mathrm{Binomial}(1 - \beta n_i^{-\gamma}, m)$ for all $n_i \in A$.
For the proof, we also recall the definition of the regularized incomplete beta function:

Definition 17 (Regularized incomplete beta function). For $k_1, k_2 \ge 1$, the regularized incomplete beta function is defined as:
$$I_p(k_1, k_2) = \frac{\int_0^p t^{k_1 - 1}(1 - t)^{k_2 - 1}\,dt}{\int_0^1 t^{k_1 - 1}(1 - t)^{k_2 - 1}\,dt}.$$
We also recall that $I_p$ is related to the survival function of the binomial distribution [e.g., 1, 6.6.4]:
$$I_p(k + 1, n - k) = \mathbb{P}_{x \sim \mathrm{Binomial}(p, n)}[x > k].$$

Claim 15. Let $\beta, \gamma \in \mathbb{R}_{\ge 0}$ and $m \in \mathbb{N}$. A delegated learning setting with a power-law stochastic learning curve $F_i = \mathrm{Binomial}(1 - \beta n_i^{-\gamma}, m)$, action costs $c_i = n_i$, and $\min_i n_i \ge \left( \frac{\beta(m + \gamma^{-1})}{1 + \gamma^{-1}} \right)^{1/\gamma}$ satisfies the Concave-MLRP assumption.

Proof. Denote the expected accuracy of each action by $a_n = 1 - \beta n^{-\gamma}$. The survival function of the binomial distribution gives:
$$s_n = I_{a_n}(j^*, m + 1 - j^*),$$
where $I_{a_n}$ is the regularized incomplete beta function (Definition 17). With a slight abuse of notation, we treat $a_n$ and $s_n$ as continuous functions of $n$. Taking the derivative with respect to $n$ and ignoring the constant denominator, we obtain:
$$\frac{ds_n}{dn} \propto a_n^{j^* - 1}(1 - a_n)^{m - j^*}\, \frac{da_n}{dn}.$$
Plugging in the functional form of $a_n$:
$$\frac{ds_n}{dn} \propto \left(1 - \beta n^{-\gamma}\right)^{j^* - 1} \left(\beta n^{-\gamma}\right)^{m - j^*} \beta\gamma n^{-(\gamma + 1)} \propto \left(\beta n^{-\gamma}\right)^{m - j^* + 1 + \gamma^{-1}} \left(1 - \beta n^{-\gamma}\right)^{j^* - 1}.$$
As a function of $\beta n^{-\gamma}$, the derivative $\frac{ds_n}{dn}$ attains its extremum at:
$$\beta \bar{n}^{-\gamma} = \frac{m + 1 - j^* + \gamma^{-1}}{m + \gamma^{-1}} \qquad \Longleftrightarrow \qquad \bar{n} = \left( \frac{\beta(m + \gamma^{-1})}{m + 1 - j^* + \gamma^{-1}} \right)^{1/\gamma},$$
and therefore the function $s_n$ has an inflection point at $\bar{n}$ (see Fig. 8). From the upper bound $j^* \le m$, we obtain:
$$\bar{n} \le \left( \frac{\beta(m + \gamma^{-1})}{1 + \gamma^{-1}} \right)^{1/\gamma} =: n_0.$$
As $\min_i n_i \ge n_0$, the function $s_n$ is concave as a function of $c_n$ for all $n \in A$. The proof is illustrated in Fig. 8.

Figure 8: Graphical intuition for the sufficiency condition in Claim 15. (Left) Expected accuracy curve for MNIST-784 MLP. Blue dots represent empirical data from the LCDB dataset; the cyan curve represents a power-law fit ($a_n = 1 - \beta n^{-\gamma}$, with $\hat{\beta} = 1.89$, $\hat{\gamma} = 0.33$). (Center) Crossing-point survival $s_i$ as a function of cost $c_i$ (see Definition 15), for $m = 30$. The cyan dot represents the inflection point $\bar{n} \approx 2022$; the red vertical line represents the inflection point bound $n_0 \approx 3957$ suggested by the proof. It can be observed that the curve is concave for all $n > \bar{n}$. (Right) Second derivative of the survival function $s_n$, illustrating the position of the bound $n_0$ in relation to the inflection point $\bar{n}$.
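The quantities appearing in Claim 15 are easy to evaluate numerically. The following sketch computes the crossing-point survival $s_n$ for a binomial power-law learning curve and checks concavity beyond the bound $n_0$; the parameters (including the fixed crossing point j_star) are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.special import betainc

# Hypothetical power-law curve parameters and crossing point.
beta_, gamma_, m, j_star = 1.0, 0.5, 30, 25

def s(n):
    a_n = 1 - beta_ * n ** (-gamma_)             # expected accuracy a_n
    return betainc(j_star, m + 1 - j_star, a_n)  # s_n = I_{a_n}(j*, m+1-j*)

# Inflection-point bound n_0 from Claim 15.
n0 = (beta_ * (m + 1 / gamma_) / (1 + 1 / gamma_)) ** (1 / gamma_)

ns = np.arange(int(np.ceil(n0)), 10 * int(np.ceil(n0))).astype(float)
second_diff = np.diff(s(ns), 2)                  # discrete second derivative
print(f"n0 = {n0:.1f}, concave beyond n0: {bool(np.all(second_diff <= 1e-12))}")
```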
B.8.2 Proof of sufficiency theorem

Theorem 6 (Concave-MLRP implies threshold contracts; formal statement of Theorem 4). For a contract design problem satisfying MLRP, consider $s_i$ as a function of cost $c_i$. If this function is concave, then the optimal contract is a threshold contract. Furthermore, the contract is successfully recovered by the SBA algorithm.

Proof. To prove this claim, we construct a relaxed version of the min-budget LP (Eq. (6)), and apply Theorem 5 in order to solve it. We then show that this solution is also feasible for the original LP. Given the min-budget LP, construct a relaxed LP by eliminating the (IC) constraints for all $i \le n - 2$:
$$\min_{t \in \mathbb{R}^{|\Omega|}_{\ge 0},\, B \in \mathbb{R}_{\ge 0}} B \quad \text{s.t.} \quad \forall j \in \Omega:\ t_j \le B; \quad \sum_{j \in \Omega} F_{n,j} t_j - c_n \ge \sum_{j \in \Omega} F_{n-1,j} t_j - c_{n-1}. \quad (36)$$
Eq. (36) only depends on $F_n$ and $F_{n-1}$, and therefore Claim 12 can be applied to obtain the optimal min-budget contract:
$$t^*_j = \frac{c_n - c_{n-1}}{\|F_n - F_{n-1}\|_{TV}} \mathbb{1}\left[F_{n,j} \ge F_{n-1,j}\right]. \quad (37)$$
Let $j^* = \min\{j \in \{0, \dots, m\} \mid F_{n,j} \ge F_{n-1,j}\}$ as in Definition 13, and denote $s_i = S_{F_i}(j^* - 1)$. By Claim 11, we obtain:
$$\|F_n - F_{n-1}\|_{TV} = s_n - s_{n-1}.$$
For all $i < n$, denote:
$$b_i = \frac{c_n - c_i}{s_n - s_i}.$$
Using the definition of $b_i$, Eq. (37) can be rewritten as:
$$t^*_j = b_{n-1} \mathbb{1}[j \ge j^*].$$
By assumption, $s_i$ is a concave function of $c_i$, and therefore the chord slope from $(c_i, s_i)$ to $(c_n, s_n)$ is non-increasing in $c_i$; equivalently, $b_i$ is monotonically non-decreasing in $i$ for all $i < n$:
$$b_{n-1} \ge b_i.$$
Dividing both sides by $b_i$, we obtain $b_{n-1} / b_i \ge 1$. Plugging in the definition of $b_i$, and multiplying both sides by $(c_n - c_i)$:
$$b_{n-1}(s_n - s_i) \ge c_n - c_i. \quad (38)$$
By the definition of $t^*_j$, the expected payout of the contract $t^*$ under action $i' \in [n]$ is given by:
$$\sum_{j=0}^m t^*_j F_{i',j} = b_{n-1} \sum_{j=j^*}^m F_{i',j} = b_{n-1} s_{i'}. \quad (39)$$
Plugging Eq. (38) into Eq. (39) yields:
$$\sum_{j=0}^m t^*_j F_{n,j} - \sum_{j=0}^m t^*_j F_{i,j} = b_{n-1}(s_n - s_i) \ge c_n - c_i. \quad (40)$$
As Eq. (40) is identical to the (IC) constraint in Eq. (6), we obtain that $t^*$ satisfies all the (IC) constraints in the original LP. From this we conclude that $t^*$, which is a threshold contract, is also an optimal contract with respect to the full min-budget LP. For the second part of the claim, note that the SBA algorithm also attempts to solve Eq. (36), on the iteration corresponding to action $n - 1$. As the solution of Eq. (36) is guaranteed to be feasible for the original LP under Concave-MLRP, SBA successfully recovers it.

Figure 9: Graphical intuition for the sufficiency condition in Theorem 4 (restated in Theorem 6). (Left) Outcome distributions in a contract design setting satisfying MLRP, generated by a series of binomial distributions with power-law expectation $\mathrm{acc}_D(h_n) = 0.9 - 0.4\,(n/100)^{-0.3}$. The gray dotted line represents the MLR crossing point $j^*_{F_4,F_5}$ (see Definition 13). (Center) Crossing-point survival $s_i$ as a function of cost $c_i$ (see Definition 15). It can be observed that the curve is concave. (Right) The min-budget contract implementing action 5. It is a threshold contract, as guaranteed by Theorem 4. Compare to Fig. 7, where the sufficiency condition does not hold.

B.9 Min-budget contracts with guaranteed minimum payout

Our problem setting assumes that the agent selects its action rationally, as described in Eq. (2). However, in some practical settings the agent may be risk-averse to some extent, and require a guaranteed minimum payment from the contract. In this section, we show that the rationality assumption is made without loss of generality in such cases: Lemma 5, which we prove below, shows that any optimal contract with guaranteed minimum payout can be represented as the sum of a min-budget contract without guaranteed payout and a constant representing the payout guarantee.

Definition 18 (Guaranteed minimum payout). Let $\delta \ge 0$. A contract $t \in \mathbb{R}^{|\Omega|}_{\ge 0}$ has a guaranteed payout of size $\delta$ if $t_j \ge \delta$ for all $j \in \Omega$.

Lemma 5. A contract $t$ is a min-budget contract with guaranteed payout $\delta$ requiring budget $B$ if and only if there exists a min-budget contract $t'$ (without guaranteed payout) requiring budget $B'$ such that $t = t' + \delta$ and $B = B' + \delta$.

Proof. When agents require a guaranteed minimum payout $\delta \ge 0$, we add a constraint to the MIN-BUDGET linear program defined in Eq. (6), such that an optimal min-budget contract $t$ with guaranteed payout $\delta$ is an optimal solution to the following linear program:
$$\min_{t \in \mathbb{R}^{|\Omega|}_{\ge 0},\, B \in \mathbb{R}_{\ge 0}} B \quad \text{s.t.} \quad \forall j \in \Omega:\ t_j \le B \ \text{(BUDGET)}; \quad \forall i \ne i^*:\ \sum_{j \in \Omega} F_{i^*,j} t_j - c_{i^*} \ge \sum_{j \in \Omega} F_{i,j} t_j - c_i \ \text{(IC)}; \quad \forall j \in \Omega:\ t_j \ge \delta \ \text{(MINWAGE)}. \quad (41)$$
To prove the first direction of the equivalence, assume that $t$ is a min-budget contract with guaranteed payout $\delta$ and budget $B$, and thus an optimal solution of Eq. (41). Define the variable transformation:
$$t'_j = t_j - \delta, \qquad B' = B - \delta. \quad (42)$$
Under Eq. (42), the (BUDGET) constraint in Eq. (41) transforms into:
$$t_j \le B \iff t'_j + \delta \le B' + \delta \iff t'_j \le B'. \quad (43)$$
The (IC) constraint transforms into:
$$\sum_{j \in \Omega} F_{i^*,j} t_j - c_{i^*} \ge \sum_{j \in \Omega} F_{i,j} t_j - c_i \iff \sum_{j \in \Omega} F_{i^*,j}(t'_j + \delta) - c_{i^*} \ge \sum_{j \in \Omega} F_{i,j}(t'_j + \delta) - c_i \iff \sum_{j \in \Omega} F_{i^*,j} t'_j - c_{i^*} + \delta \ge \sum_{j \in \Omega} F_{i,j} t'_j - c_i + \delta \iff \sum_{j \in \Omega} F_{i^*,j} t'_j - c_{i^*} \ge \sum_{j \in \Omega} F_{i,j} t'_j - c_i. \quad (44)$$
And the (MINWAGE) constraint transforms into:
$$t_j \ge \delta \iff t'_j \ge 0. \quad (45)$$
Plugging the transformed constraints (43, 44, 45) back into Eq. (41), we obtain that $(t', B')$ is an optimal solution of the original MIN-BUDGET linear program (Eq. (6)), and therefore $t'$ is an optimal min-budget contract without guaranteed minimum payout. Conversely, assume that $t'$ is an optimal min-budget contract (without guaranteed minimum payout), satisfying the MIN-BUDGET linear program in Eq. (6) with budget $B'$. Apply the inverse transformation $t = t' + \delta$ and $B = B' + \delta$, and the equivalence relations (43, 44, 45) in the inverse direction, to obtain that $(t, B)$ is an optimal solution to Eq. (41).

C Experimental details

C.1 Data

LCDB. We base our main experimental environment on the LCDB dataset [43], which includes a large collection of stochastic learning curves for multiple learning algorithms and classification benchmark datasets. For each learning method and benchmark dataset, the database includes held-out accuracy measurements, obtained for exponentially-increasing training set sizes $n \in \{2^4 = 16, \dots, \mathrm{round}(2^{k/2}), \dots, 2^{15} = 32{,}768\}$, with multiple repetitions per $n$ obtained by cross-validation. For all learning curves we consider in our analysis, and for each $n$, the number of repetitions provided in the dataset is in the range $R \in \{5, \dots, 125\}$, where the specific number of repetitions in LCDB depends on the algorithm and benchmark dataset (see the dataset documentation for additional details). For each trained classifier, each accuracy point on the learning curve is estimated using 5,000 held-out samples. Formally, and using our notation, for each learning algorithm Alg (e.g., MLP), dataset $D$ (e.g., MNIST), and training set size $n$, LCDB provides a set of accuracy estimates $a_n^{1,\mathrm{Alg},D}, \dots, a_n^{R,\mathrm{Alg},D}$, such that each $a_n^{r,\mathrm{Alg},D}$ is distributed according to $a_n \sim \mathrm{acc}_D(h_n)$, where $h_n$ is a classifier trained using training set $S \sim D^n$ and learning algorithm Alg (see Section 2 for the definition of $\mathrm{acc}_D(h_n)$).

Benchmark dataset and algorithms. In our main analysis, we focus on the popular MNIST dataset [39, OpenML ID 554], and on multilayer perceptrons (MLP) and gradient-boosted decision trees (GBDT) as representative classifiers. For all classifiers, results are obtained for the respective default scikit-learn [44] implementations (e.g., MLPClassifier and GradientBoostingClassifier for MLP and GBDT, respectively).

Delegated learning settings from empirical data. For a given validation set size $m$, where $m = |V|$, we instantiate a contract design task $(A, c, \Omega, F)$ as follows: (i) the set of actions is identified with the set of training set sizes provided by LCDB ($A = \{2^4, 2^{4.5}, \dots, 2^{15}\}$); (ii) action costs are set to a fixed per-unit cost, i.e., $c_n = n$; and (iii) the distribution $F$ over outcomes $\Omega$ is associated with a binomial mixture distribution, resulting from applying bootstrap sampling to the empirical accuracy measurements:
$$f_n = \frac{1}{R}\sum_{r=1}^R \mathrm{Binomial}\left(m, a_n^{r,\mathrm{Alg},D}\right),$$
where $a_n^{r,\mathrm{Alg},D}$ are the accuracy estimates defined above. Fig. 2 (Left) shows an example of such a setting, obtained by applying the above procedure to data describing learning curves for the MLP algorithm trained on the MNIST benchmark dataset.
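The mixture construction is a short computation. The sketch below turns a handful of accuracy estimates into an outcome distribution $f_n$ over $\Omega = \{0, \dots, m\}$; the values in acc_samples are hypothetical stand-ins for LCDB estimates, not actual dataset entries.

```python
import numpy as np
from scipy.stats import binom

def outcome_distribution(acc_samples, m):
    """f_n[j] = (1/R) * sum_r Binomial(m, a_r).pmf(j): a binomial mixture
    over bootstrap accuracy estimates, as described above."""
    js = np.arange(m + 1)
    pmfs = np.array([binom.pmf(js, m, a) for a in acc_samples])
    return pmfs.mean(axis=0)

acc_samples = [0.91, 0.93, 0.92, 0.94, 0.90]   # hypothetical repetitions for one n
f_n = outcome_distribution(acc_samples, m=50)
print(f_n.sum())                               # sums to 1: a valid distribution
```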
C.2 Implementation details

Code. We implement our code in Python. Our code relies on Pyomo [13, 27] and GLPK for solving linear and mixed-integer programs. Code is available at: https://github.com/edensaig/delegated-classification.

Hardware. All experiments were run on a single laptop with 16GB of RAM and an M1 Pro processor, with no GPU support.

Runtime. A single run consisting of the entire pipeline takes roughly 5 minutes. The main bottleneck is running the LP solvers, which accounts for roughly 70% of the runtime.

C.2.1 Contract design solvers

To find the optimal contracts, we implement and compare several different solvers:

LP solver: To find min-budget contracts given outcome distributions $\{f_n\}$ and costs $\{c_n\}$, we solve the MIN-BUDGET LP (Eq. (6)) directly using Pyomo and GLPK. Given budget $B$, budget-optimal contracts are obtained by iterating through the action set $A$, invoking the min-budget solver on each target action $n \in A$, enforcing incentive compatibility against all actions $n'$ which satisfy $\alpha_{n'} < \alpha_n$, and taking the action yielding the maximal expected accuracy within budget (see Appendix B.1). In our code, the LP solver is implemented within the MinBudgetContract class. Typical running time for a single problem instance: $10^{-1}$s.

SBA solver: Implementation of the single binding action algorithm presented in Section 3.4. The local solver is implemented within the MinBudgetSingleBindingConstraintContract class. Typical running time for a single problem instance: $10^{-4}$s.

Hybrid solver: A meta-solver implementing the "try SBA first" computational approach. The solver starts by invoking the SBA solver, and applies the LP solver if the former fails. In our code, the hybrid solver is implemented within the MinBudgetHybridContract class. Running time depends on whether SBA is applicable.

MIP solver: To find optimal all-or-nothing contracts, we solve the MIN-BUDGET LP in its statistical formulation (Eq. (11)) while restricting $\phi_j \in \{0, 1\}$. This turns the LP into a mixed-integer program, which can also be solved using GLPK. In our code, the MIP solver is implemented within the MinBudgetStatisticalContract class. Typical running time for a single problem instance: $10^{-1}$s.

Min-pay LP solver: To compare between min-budget and min-pay contracts (Appendix B.3), min-pay contracts were obtained by solving the MIN-PAY LP (Eq. (17)) using Pyomo and GLPK, similarly to the full min-budget LP solver. In our code, the min-pay solver is implemented within the MinPayContract class. Running time is similar to the other LP-based methods.

Full enumeration solver: To find all-or-nothing contracts for low-dimensional problems and avoid numerical instabilities, we implement a solver which performs full enumeration of all-or-nothing contracts. By the statistical variable transformation in Definition 3, the (IC) constraint in the MIN-BUDGET LP translates to $\sum_{j \in \Omega}(F_{i^*,j} - F_{i',j})\phi_j \ge (c_{i^*} - c_{i'})\beta$, where $i^*$ is the target action and $i' \in A \setminus \{i^*\}$, and the objective is to maximize $\beta$ (see Eq. (11)). Thus, when $\phi \in \{0, 1\}^\Omega$ is fixed, the optimal $\beta$ is given by:
$$\beta^*(\phi) = \min_{i' \in A \setminus \{i^*\}} \frac{\sum_{j \in \Omega}(F_{i^*,j} - F_{i',j})\phi_j}{c_{i^*} - c_{i'}},$$
and optimizing $\beta^*(\phi)$ over $\phi \in \{0, 1\}^\Omega$ yields an optimal all-or-nothing contract. The full enumeration solver is implemented within the FullEnumerationContract class.
For enumeration of all-or-nothing contracts, this solver is mostly applicable to small problem instances ($m < 20$) due to its exponential complexity. In contrast, for threshold contracts the number of possible $\phi$ configurations is linear in $m$, and therefore enumeration is more efficient; a sketch of this scan follows the solver list below.

Fixed-budget solver: To find budget-optimal threshold contracts given a fixed budget $B$, we implement a simple solver which iterates through all possible threshold contracts $t_{j_0}(j) = B \mathbb{1}[j \ge j_0]$ for all $j_0 \in \{0, \dots, m\}$. The solver then simulates the agent's rational choice (Eq. (2)) and selects the value of $j_0$ which incentivizes the best action. In case of ties between possible values of $j_0$, they are broken in favor of larger values, as this was shown to lead to greater numerical stability. The fixed-budget solver is implemented within the FixedBudgetThresholdContract class.
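As referenced above, threshold enumeration is linear in $m$. The sketch below scans all thresholds using the closed-form $\beta^*(\phi)$ of the full-enumeration solver; the interface and inputs F, c are hypothetical simplifications (in particular, the target is assumed to be the costliest action), and it is not the FixedBudgetThresholdContract API.

```python
import numpy as np

def best_threshold_contract(F, c, target):
    """Enumerate threshold contracts phi_j = 1[j >= j0] and pick the one
    maximizing beta*(phi), i.e., minimizing the required budget B = 1/beta.
    Assumes c[target] > c[i] for all other actions i (hypothetical setup)."""
    n, k = F.shape
    best = None
    for j0 in range(k):
        phi = (np.arange(k) >= j0).astype(float)
        # Closed-form beta*(phi), as in the full enumeration solver above:
        betas = [(F[target] - F[i]) @ phi / (c[target] - c[i])
                 for i in range(n) if i != target]
        beta = min(betas)
        if beta > 0 and (best is None or beta > best[0]):
            best = (beta, phi / beta)     # contract t = phi / beta (Definition 3)
    return None if best is None else best[1]
```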
C.3 Empirical prevalence estimation

Since our theoretical analysis relies on MLRP assumptions, we would like to understand whether the results are applicable to real-world learning curves. Towards this, we run an empirical prevalence evaluation on the MNIST dataset. For each learning algorithm Alg and training set size $n$, we construct a delegated learning setting with $m = 10$ and collect the following statistics:

Structural properties:
- Is MLRP? (% MLRP): Check whether all pairs $n_1, n_2 \le n$ such that $c_{n_1} < c_{n_2}$ satisfy the monotone likelihood ratio property (see Definition 1).

Computational properties:
- SBA successful? (% SBA): Check whether the SBA algorithm was successful on the given instance.

Functional form of the optimal contract:
- Is monotone? (% M): Check whether the resulting min-budget contract satisfies $t_j \ge t_{j-1}$ for all $j \in [m]$.
- Is all-or-nothing? (% AoN): Check whether the resulting min-budget contract is an all-or-nothing contract (i.e., whether there exists $B > 0$ such that $t_j \in \{0, B\}$ for all $j \in \Omega$).
- Is threshold? (% T): Check whether the resulting min-budget contract is a threshold contract (i.e., whether there exist $j_0 \in \Omega$ and $B > 0$ such that $t_j = 0$ for all $j < j_0$ and $t_j = B$ otherwise).

Excess cost:
- Min-budget: Optimal objective $B_{LP}$ of the MIN-BUDGET LP without additional restrictions, implemented using the full LP solver.
- All-or-nothing budget: Optimal objective of the MIN-BUDGET LP when the resulting contract is restricted to all-or-nothing form $t_j \in \{0, B_{AoN}\}$. Implemented using the full enumeration solver.
- Threshold budget: Optimal objective of the MIN-BUDGET LP when the resulting contract is restricted to threshold form $t_j = B_{Thr}\mathbb{1}[j \ge j_0]$. Implemented using the full enumeration solver.

Table 2: Empirical robustness estimation on the MNIST dataset. Each row corresponds to a learning algorithm; columns are specified in Appendix C.3. Results are averaged across all implementable actions.

              Structure | Compute | Functional form of opt. contract | Excess cost
              % MLRP    | % SBA   | % M     % AoN    % T             | AoN     T
  MLP         100%      | 94.4%   | 100%    94.4%    94.4%           | 0.04%   0.04%
  GBDT        100%      | 76.5%   | 100%    82.4%    82.4%           | 0.71%   0.71%
  Logistic    93.8%     | 81.2%   | 93.8%   93.8%    93.8%           | 0.00%   0.00%
  Perceptron  68.8%     | 75%     | 93.8%   93.8%    93.8%           | 0.00%   0.00%
  Linear SVM  71.4%     | 78.6%   | 85.7%   85.7%    85.7%           | 0.95%   1.45%
  Poly SVM    100%      | 94.4%   | 100%    94.4%    94.4%           | 0.24%   0.24%
  RBF SVM     100%      | 94.4%   | 100%    94.4%    94.4%           | 0.04%   0.04%
  KNN         100%      | 100%    | 100%    100%     100%            | 0.00%   0.00%
  Overall     92.6%     | 87.4%   | 97%     92.6%    92.6%           | 0.22%   0.26%

The excess cost columns in Table 2 indicate the relative excess cost incurred by restricting the min-budget optimization space to simple contracts, $(B_{\{AoN,Thr\}} - B_{LP}) / B_{LP}$. As all-or-nothing contracts are a superset of threshold contracts, we expect the all-or-nothing excess cost to be smaller than the one associated with threshold contracts. The averaging in Table 2 is performed on problem instances where both all-or-nothing and threshold contracts are feasible. Averaging over all implementable actions, we obtain the results in Table 2. The results show that threshold contracts are relatively common in this dataset (more than 85%), and that the excess cost of simple contracts is relatively low (around 1%), indicating that threshold contracts may provide a good approximation to the optimal min-budget contracts in some cases. All simple optimal contracts in our dataset had a threshold functional form, and some simple optimal contracts were not recovered by SBA, indicating that a tighter characterization of simple contracts may be possible.

Fig. 10 provides a qualitative interpretation of the analysis for a selected subset of learners. While the survival functions are not perfectly concave (and thus do not satisfy Concave-MLRP), the SBA algorithm successfully terminates in three cases (MLP, Logistic, KNN). The binding action is $n - 1$ for two of the classifiers (MLP, KNN), similar to the condition implied by Theorem 4. Note that SBA also terminated successfully on the logistic regression data, despite the learning curve having a distinctively non-concave shape. In the case of GBDT, the survival function is not concave, and the optimal contract does not assume a threshold form.

Figure 10: Qualitative comparison of optimal contract computation on LCDB MNIST data ($n = 16384$, $m = 25$). Each row represents a learning algorithm. (Left column) Expected accuracy curves $\alpha_n$. (Center column) Survival function at the transition, $s_i$; whenever SBA is successful, the binding actions are marked with triangles. (Right column) Optimal contract $t^*_j$.