# planning_with_participation_constraints__eed9f12e.pdf

Planning with Participation Constraints

Hanrui Zhang1, Yu Cheng2, Vincent Conitzer3

1 Carnegie Mellon University 2 University of Illinois at Chicago 3 Duke University hanruiz1@cs.cmu.edu, yucheng2@uic.edu, conitzer@cs.duke.edu

We pose and study the problem of planning in Markov decision processes (MDPs), subject to participation constraints as studied in mechanism design. In this problem, a planner must work with a self-interested agent on a given MDP. Each action in the MDP provides an immediate reward to the planner and a (possibly different) reward to the agent. The agent has no control in choosing the actions, but has the option to end the entire process at any time. The goal of the planner is to ﬁnd a policy that maximizes her cumulative reward, taking into consideration the agent s ability to terminate. We give a fully polynomial-time approximation scheme for this problem. En route, we present polynomial-time algorithms for computing (exact) optimal policies for important special cases of this problem, including when the time horizon is constant, or when the MDP exhibits a deﬁnitive decisions property. We illustrate our algorithms with two different game-theoretic applications: the problem of assigning rides in ride-sharing and the problem of designing screening policies. Our results imply efﬁcient algorithms for computing (approximately) optimal policies in both applications.

1 Introduction In this paper, we consider the general problem of planning in a Markov decision process (MDP), when the actions are chosen by the planner, but must be taken by an agent that is separate from the planner. The agent cannot choose the actions, but can refuse to participate at any time. To illustrate the setting, consider the following scenario. A ride-hailing company is designing a system to assign tasks to drivers. The company monitors the location (or state) of each driver in real-time via a mobile app. The company then has to choose which ride to assign to a driver (the company s choice of action), with each possible assignment resulting in some value to the company. If drivers had no incentives of their own (e.g., were self-driving vehicles belonging to the company), this could be modeled straightforwardly as an MDP. However, the driver also obtains rewards from the rides, which may be different. For example, drivers generally prefer not to take rides that will lead them to isolated areas,1 or that are

Copyright 2022, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved. 1Sometimes drivers can see the destination; sometimes they cannot but may still suspect the destination based on the origin.

likely to involve troublesome passengers (e.g., pick-up at a bar at closing time). The company still wants to assign these rides, but does not want the driver to become sufﬁciently unhappy that they sign off (stop participating) altogether. It is therefore crucial to take into consideration drivers interests in planning task assignments, and provide enough incentives for drivers to remain in the system even when they receive less proﬁtable or even unproﬁtable tasks. One intuitive solution is to promise drivers who receive undesirable tasks high priority for receiving more proﬁtable tasks (or other perks) when they ﬁnish the current task. However, implementing this solution in a way that is most beneﬁcial to the company can be challenging incentivizing drivers and maximizing proﬁt can be in conﬂict with each other, and it is not immediately clear how to balance them optimally. In fact, the company faces a planning problem with participation constraints: the company as the planner (henceforth the principal) wants to compute and commit to a policy that maximizes her cumulative reward, subject to the constraint that each driver (henceforth the agent) must be motivated to participate in the policy, and not just at the beginning but also from any point onward.2

In addition to ride-hailing, many other real-world scenarios can be modeled as planning with participation constraints. We provide a few such examples:

Financial services, such as automated investment platforms ( robo-advisors ) or traditional mutual funds. In the case of robo-advisors, when a customer puts in money, the platform automatically creates and maintains a portfolio for the customer, and trades on behalf of the customer without asking for conﬁrmation. The customer can see the trades and has the option to withdraw their money (i.e., to quit) at any time. The customer wants to make as much money as possible in a given amount of time, while the platform wants to keep the customer and optimize their own objective (which partially aligns with

2Restricting the focus to policies under which no agent ever drops out is without loss of generality by a revelation principle: if the optimal policy involves an agent quitting at some point, since the principal knows when the agent would quit, she can use an equivalent policy where the agent is told that he will not have to do anything else (e.g., take any further rides) at that point. (Throughout the paper, we use she for the principal and he for the agent.)

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

the customer s goal, but may also include other considerations, such as charging an advisor fee). Distributed computing platforms (e.g., MQL 5, Load Team, or Golem Network) where an agent can sell CPU hours. Typically, such platforms assign tasks to machines based on how much task providers are willing to pay and how much computing power is required. Agents are paid an amount of money that depends on the task when a task is ﬁnished. The agents want to make more money, while the platform may want to ﬁnish as many tasks as possible. If the agent is receiving too little payment, they may decide to quit the platform and make use of their CPU hours in other ways (e.g., mining cryptocurrencies). Crowdsourcing marketplaces (e.g., Amazon Mechanical Turk). Workers are paid after they ﬁnish a task (e.g., completing surveys, labeling images). Workers can quit the platform or switch to another platform if they are not making enough money relatively. The platform wants to assign tasks (or a list of possible tasks to choose from) to workers and decide how much to pay to retain the workers, while optimizing their own objectives at the same time (e.g., charging a commission fee).

In this paper, we pose and study the computational problem of ﬁnding an (approximately) optimal policy in MDPs subject to participation constraints. We focus on the setting where agents have no private information, because in all the aforementioned applications, we have a very good ﬁrst-order approximation of the agent s reward function. For example, ride-sharing drivers want to make more money in a given amount of time, and applicants going through a screening process would like to pass. (See Section 1.2 for more discussion on related work that considers private information.)

1.1 Our Results

Our main result is a fully polynomial-time approximation scheme (FPTAS) for planning with participation constraints.

We can compute a policy satisfying participation constraints and losing at most an additive ε > 0 in the principal s utility compared to the optimal policy. Moreover, we can compute such a policy in time poly(m, n, 1/ε), where m is the total number of actions and n is the number of states.

Theorem 4 is a formal version of the above claim. We obtain this FPTAS by ﬁrst designing an exponential-time algorithm (Algorithm 1 in Section 3.1) which computes an exact optimal policy, and then carefully discretizing the algorithm without violating participation constraints (Section 3.2). When the MDP has some additional structures, our exact algorithm (Algorithm 1) in fact runs in polynomial time.

There is an exact algorithm (Algorithm 1) for planning with participation constraints that runs in time poly(mn 0), where m0 is the maximum number of actions available at a state and n is the number of states. When the time horizon is H, the exact algorithm runs in time poly(m H 0 , n H).

When the MDP has a deﬁnitive decisions property (see Deﬁnition 5 in Section 3.3), the exact algorithm runs in time poly(m, n), where m is total number of actions.

The formal statements of these results appear as Lemma 3 in Section 3.1 and Lemma 6 in Section 3.3. Some applications, such as the ride-hailing model that we gave above, are straightforward to ﬁt into our framework. Consequently, the above results immediately imply efﬁcient approximation algorithms for the ride assignment problem in ride-hailing, as well as efﬁcient exact algorithms when the instance admits certain additional structures. Our algorithmic framework in fact allows more and somewhat surprising applications. To illustrate this, we consider the design of optimal screening policies (Section 4), where the goal is to separate good candidates from bad ones by observing their performance in a series of tests. The more outcomes we observe, the more conﬁdent we are but candidates may drop out if we ask them to take so many tests that the cost of taking tests outweighs the expected utility from passing the screening. By casting the problem as planning with participation constraints and applying our algorithms, we obtain the following results (formal statements in Theorem 8) for this problem.

We can compute an optimal screening policy that uses at most N tests in time poly(N). Moreover, an approximately optimal policy losing at most an additive ε > 0 in the principal s utility can be computed in time poly(1/ε).

1.2 Related Work Our results are along the line of research on planning in Markov decision processes (MDPs). The basic problem of computing optimal policies in MDPs has been extensively studied, and numerous efﬁcient methods have been proposed, e.g., (Bellman 1957; Howard 1960; Puterman and Shin 1978). The problem studied in this paper is closely related to constrained MDPs (CMDPs) (see (Altman 1999) for a comprehensive survey), where the goal is to ﬁnd a rewardmaximizing policy subject to overall constraints, such as that the expected cumulative cost of the policy must not exceed a predeﬁned threshold. It is known that CMDPs can be solved using linear programming (Altman and Spieksma 1995; Altman 1996, 1998). The key difference between our model and CMDPs is that we consider uniform participation constraints, which must hold in all states throughout the execution of the policy, rather than just in the initial state. In particular, such participation constraints may require any (nearly) optimal policy to be history-dependent, whereas in CMDPs, one can focus on Markovian (i.e., historyindependent) policies without loss of generality (Altman 1999). More recently, reinforcement learning in CMDPs has received increasing attention (Achiam et al. 2017; Tessler, Mankowitz, and Mannor 2018; Cheung 2019; Le, Voloshin, and Yue 2019; Brantley et al. 2020; Efroni, Mannor, and Pirotta 2020; Singh, Gupta, and Shroff 2020; Ding et al. 2021). These results are quite different from ours, because they tend to focus on exploration rather than the planning

problem itself, and the model we study in this paper is different from CMDPs. A related model is multi-objective MDPs (MOMDPs) (see (Roijers et al. 2013) for a comprehensive survey), where the planner cares about multiple objectives simultaneously. Similar to CMDPs, the work on MOMDPs focuses on the overall cumulative reward vector, whereas the participation constraints in our model must hold throughout the process. Another line of research considers multi-agent (partially observable) MDPs, where individual agents act in a common environment based only on local information and/or beliefs about each other (Gmytrasiewicz and Doshi 2005; Oliehoek 2012; Hoang and Low 2013). One key difference between our setting and the above is that we consider an asymmetric environment where the principal has the exclusive power to commit to a policy. From an economic point of view, our results can be interpreted as dynamic mechanism design (see (Bergemann and V alim aki 2010; Athey and Segal 2013; Pavan, Segal, and Toikka 2014; Pavan 2017; Bergemann and V alim aki 2019)) under only individual-rationality (i.e., participation) constraints. Roughly speaking, in dynamic mechanism design, the principal controls which actions to play, but relies on self-interested agents to report the current state. Since the agents reward functions may not align perfectly with the principal s, the main challenge in dynamic mechanism design is to encourage agents to truthfully report the state, while maximizing the planner s cumulative reward as much as possible. In contrast, in our model, the principal always observes the true state3 but does not have the power to force participation. From a computational perspective, the fact that the principal can observe the true state enables efﬁcient algorithms for long-horizon environments, which are known to be hard when the state must be reported by an agent (Zhang and Conitzer 2021).

2 Preliminaries

In this section, we give a formal deﬁnition of the problem of planning with participation constraints, which generalizes planning in MDPs. Recall that in standard (ﬁnite-time) MDPs, the state of the world evolves stochastically depending on the actions taken. Each state-action pair has a corresponding reward, and the goal of the planner is to choose actions in each state to maximize the total reward collected before reaching a terminal state. In our model, there are two entities, a principal (which plays a similar role as the planner in standard MDPs) and an agent. Each state-action pair generally gives different rewards to the principal and the agent. The principal controls which action to choose in each state, but the agent has the power to end the entire process at any moment. So, in order to gain more rewards, the principal has to commit to taking actions that incentivize the agent to remain in the process. Below we formally state the problem.

3In our model, it is without loss of generality to have the principal choose only one action, rather than several actions for the agent to choose from. This is because the principal knows which (if any) action the agent would choose since the agent has no private information and hence the principal may as well give this action as the only option (mindful that the agent can refuse).

States, actions, rewards, and transitions. There is a state space S, which contains an initial state sinit and a terminal state sterm. 4 For each state s S, there is an associated set of actions As, where Asterm = , i.e., no actions can be played at the terminal state. The principal s and the agent s rewards are captured by two reward functions r P : S A R and r A : S A R. That is, in state s S, taking action a As gives the principal a reward of r P (s, a) and the agent a reward of r A(s, a). When action a As is played in state s S, the probabilistic transition to the next state is given by a distribution P(s, a) (S) over the state space, where P(s, a, s ) is the probability that the next state is s S. Throughout the paper, we assume the transition operator P is acyclic,5 meaning that the probability of reaching the same state twice is 0 no matter what actions are played. In other words, if a state s can reach another state s with positive probability, then s cannot reach s. Consequently, we can label the states in topological order. That is, S = [n] 6 where sinit = 1 and sterm = n, such that all transitions go from a state with smaller index to one with larger index. Histories and policies. A history of length t is a sequence of interleaved states and actions (s1, a1, s2, a2, . . . , st, at), where si S and ai Asi for each i [t], and si < si+1 for each i [t 1]. Such a history corresponds to a (partial) run of the MDP, in which the initial state is s1 and action a1 is played, and the second state is s2 and action a2 is played, etc. Let Ht be the set of all histories of length t, and H0 = { } where denotes the empty history. Let H = S t [n 1] Ht H0 be the set of all histories, and let

H<s = {(s1, a1, . . . , st, at) H | st < s}

be the set of histories visiting only states before s. For a history h = (s1, a1, . . . , st, at) and a state-action pair (s, a), let h + (s, a) be the history given by appending (s, a) to h:

h + (s, a) = (s1, a1, . . . , st, at, s, a).

We say a history h extends another history h, if h is a preﬁx of h , with the notation h h, A policy π is a (probabilistic) mapping from history-state pairs to distributions over actions. For any h H, s S, and a As, π(h, s, a) is the probability that action a will be chosen when state s is reached with history h. Note that we allow policies to be randomized and history-dependent, which is necessary when we later consider participation constraints. We always require that π(h, s, a) = 0 if a / As. Fixing a policy π, the utility (or value) function of the principal uπ P : H S R can be deﬁned recursively.

uπ P (h, s) = X

a As π(h, s, a)

s S P(s, a, s ) uπ P (h + (s, a), s )

4We note that this is without loss of generality, i.e., equivalent to having a distribution of initial states and multiple terminal states. 5In particular, this subsumes all ﬁnite-horizon environments. 6Throughout the paper, we use [n] to denote the set {1, . . . , n}.

and uπ P (h, s) = 0 if s = sterm. Note that uπ P (h, s) is well-deﬁned given that transitions are acyclic. In particular, uπ P ( , sinit) is the principal s expected cumulative reward under policy π. 7 Similarly, the agent s utility function uπ A under policy π can be deﬁned as

uπ A(h, s) = X

a As π(h, s, a)

s S P(s, a, s ) uπ A(h + (s, a), s )

and uπ A(h, s) = 0 if s = sterm. The planning problem. We take the principal s point of view and aim to ﬁnd a policy π that maximizes the principal s expected total utility uπ P ( , sinit). Unlike planning in standard MDPs, we have the additional constraint that the policy should always incentivize the agent to remain in the process. That is, we only consider policies π that satisfy:

uπ A(h, s) 0, for all h H and s S.

This condition is commonly known as individual-rationality (IR) (or voluntary participation, or just participation) in economic theory, and captures the intuition that the agent should not be worse off by participating in the process. It is worth noting that, while our model as deﬁned above only allows the agent to quit before an action is chosen (i.e., ex-ante IR), it is easy to extend the model to allow the agent to quit after seeing the randomly selected action (i.e., ex-post IR). This can be done by inserting a dummy state to represent ready to take action a in state s . We remark that it is sometimes better for the principal to let the agent quit, if the cost of keeping the agent in the system is too high. We capture this by adding a quit action for the principal in each non-terminal state, which gives both parties a reward of 0 and transitions the state of the world to sterm. Having added this action, imposing the IR constraint is without loss of generality: if our solution relied on the agent at some point refusing to participate further, we could instead just select the quit action for the agent at that point.

3 Computing Optimal Policies with Participation Constraints In this section we present our main result: an algorithm that computes an ε-approximate optimal policy with participation constraints in time poly(n, m, 1/ε), where n = |S| is the number of states, and m = P s |As| is the total number of actions. We ﬁrst present an exact algorithm, which runs in exponential time in the worst case, and then discretize the algorithm to allow for efﬁcient approximation. Moreover, we show that the exact algorithm itself is strongly polynomial time if the input MDP satisﬁes a deﬁnitive decisions property (Deﬁnition 5).

7Because we consider ﬁnite-horizon environments, no discount factor is required; if a discount factor γ is nevertheless desired, we can add this into our model by introducing an additional probability of 1 γ of immediately transitioning to the terminal state.

Algorithm 1: Algorithm for computing the Pareto frontier of the principal/agent s utility for all states.

Input: State space S, sets of actions {A}s, reward functions r P and r A, transition operator P. Output: Pareto frontier functions {fs}s for all states.

1 let fn = {(0, 0)}; /* we use {(x, f(x))} to represent f; all these functions are piecewise linear, so we only store the turning points. */

2 for s = n 1, . . . , 1 do

3 initialize fs = ;

4 for a As do

5 call Algorithm 2 and let gs,a subcurve(s, a); /* gs,a is the Pareto frontier after taking action a in state s */

6 let gs,a {(x + r A(s, a), y + r P (s, a)) | (x, y) gs,a}; /* shift gs,a by adding (r A(s, a), r P (s, a)) to all its turning points */

7 let ˆfs concenv({gs,a}a); /* call concenv to compute the upper concave envelope of {gs,a}a; concenv can be any algorithm for concave envelope */

8 let fs ˆfs {(x, y) | x 0, y R}; /* truncate ˆfs at x = 0 and keep the nonnegative part as fs */

9 return {fs}s; /* Algorithm 3 requires additional bookkeeping information from Algorithms 1 and 2; for ease of presentation, we do not explicitly return it */

3.1 An Exact Algorithm

Overview of Algorithm 1. Algorithm 1 has the same highlevel structure as backward induction in standard (ﬁnitehorizon) MDPs. Recall that states are labeled in topological order. Without participation constraints, we can simply process the states backward, and recursively compute the optimal onward utility and the corresponding sub-policy in every state. However, in our problem, the algorithm has to keep track of the cumulative reward of the principal and the agent, and enforce the participation constraints. We compute the Pareto frontier of the principal s and the agent s utility at every state. The Pareto frontier at s is a function fs such that fs(u) is the principal s maximum achievable onward utility, subject to the agent s onward utility is exactly u (while satisfying onward participation constraints). We show that the functions {fs}s can be computed efﬁciently in a recursive way, and consequently, the principal s optimal overall utility can be read off from the Pareto frontier in the initial state, and is equal to maxu 0 fsinit(u). Algorithmic insights. Algorithm 1 heavily exploits the fact that feasible policies (i.e., those ensuring participation) are closed under convex combination, and as a result, the Pareto frontier function at each state is concave. In each state s, we

Algorithm 2: subcurve(s, a): algorithm for computing the subcurve for a state-action pair.

Input: State s S, action a As, transition operator P, Pareto frontier functions {fs }s >s for downstream states s . Output: Subcurve for state-action pair (s, a). /* we represent each fs and the subcurves by their turning points; let fs = (fs ,0, . . . , fs ,ks ) where ks is the number of pieces of fs and each fs ,i = (xs ,i, ys ,i) is a point in R2 */

1 let p0 P s >s P(s, a, s ) fs ,0;

2 let the source point of p0 from each onward state s be fs ,0; /* the source points are for bookkeeping and are used in Algorithm 3 to execute the policy */

3 let k P s >s ks , and cs 0 for all s > s; /* cs is the number of pieces of fs that we have already used */

4 for i = 1, . . . , k do

5 let Li {s | s < s n, cs < ks }; /* L1 is the set of states left to consider */

6 let si argmaxs Li(ys ,cs +1 ys ,cs )/(xs ,cs +1 xs ,cs );

7 let pi pi 1 + P(s, a, si) (xs ,cs +1 xs ,cs , ys ,cs +1 ys ,cs );

8 let csi csi + 1;

9 let the source point of pi from each onward state s be fs ,cs ;

10 return (p0, p1, . . . , pk);

ﬁrst compute, for each action a that can be taken at that state, the Pareto frontier gs,a after taking that action, and then shift each gs,a by the principal/agent s reward for the state-action pair (s, a). Then, since the policy can randomize among actions, we simply take the upper concave closure of all these shifted curves and then truncate it to ensure participation, which gives the Pareto frontier of this state. A key subroutine of Algorithm 1 is Algorithm 2, which computes the Pareto frontier function g(s, a) after taking action a in state s we call this the subcurve for state-action pair (s, a). Intuitively, to evaluate this subcurve at some u, one needs to determine the optimal way of distributing the agent s utility u into onward states, such that the resulting principal s utility is maximized. Because the Pareto frontier functions of all onward states are concave, one can construct the entire subcurve using a water-ﬁlling procedure, where the construction always follows the Pareto frontier function that has the largest slope. In other words, if we want to increase the principal s utility by an inﬁnitesimal amount after (s, a), we should increase the principal s utility (and decrease the agent s utility) in exactly one of the onward states, namely the onward state that provides the best tradeoff.

Implementing the optimal policy. Algorithm 1 solves only half of the problem it computes the principal s optimal utility, but not the corresponding policy. In fact, it could be

the case that any optimal policy requires exponential space to explicitly describe (even approximately), so it is impossible to output an optimal policy in polynomial time. Nevertheless, given the Pareto frontier functions constructed in Algorithm 1, we present an algorithm (Algorithm 3) that can efﬁciently execute an optimal policy on the ﬂy. At a high level, while there are exponentially many possible runs under the optimal policy, to execute the policy we need to follow only the one that actually happened. More speciﬁcally, the optimal policy chooses a point p1 on the Pareto frontier of sinit to realize. We can backtrack how Algorithm 1 achieved p1, which tells us to take an action (say a1) and gives us a point {qs} on the Pareto frontier of each onward state s reachable from (sinit, a1). We then play action a1 and transition to one of these states (say s2) at random, and we can continue to execute the point qs2 on the Pareto frontier of s2 using the same process as above.

Correctness of Algorithms 1 and 3. Now we are ready to prove the correctness of our algorithms.

Lemma 1. Algorithm 1 computes the Pareto frontier functions for all states, i.e., for all s S, h H<s, and u 0,8

fs(u) = max{uπ P (h, s) | π : uπ A(h, s) = u and

a As, s > s, h H<s with

h h + (s, a), uπ A(h , s ) 0}.

Lemma 2. Let (s1 = sinit, a1, s2, a2, . . . , at 1, st = sterm) be any (random) output sequence of Algorithm 3. Then for each i [t 1], ai π (hi, si),

where π is an optimal policy subject to participation constraints, and hi = (s1, a1, . . . , si 1, ai 1). In particular, π guarantees that

uπ A (hi, si) = ui and uπ P (hi, si) = fsi(ui),

where ui is computed in line 1 or line 12 in Algorithm 3.

Runtime of Algorithm 1. We remark that Algorithm 1 may take exponential time in the worst case. In fact, let |fs| be the number of pieces in fs (recall that all fs are piecewise linear). Then for each s S, |fs| can be as large as Θ |As| P s >s |fs | , so |fs| may grow exponentially as s decreases. Formally, we have the following result regarding the time complexity of Algorithm 1.

Lemma 3. Algorithm 1 runs in time n (2m0)n, where n = |S| is the number of states and m0 = maxs =sterm |As| is the maximum number of actions available in a single state. Moreover, suppose the state space can be partitioned into H sets {S1, . . . , SH}, such that for any two states s Si and s Sj, for all a As, P(s, a, s ) > 0 only if j = i + 1. Then Algorithm 1 runs in time O((m0n)H log n). In particular, Algorithm 1 is polynomial-time if H is constant.

8Note that the equation of fs(u) implies that the policy satisﬁes IR constraints at histories it would never reach; this is without loss of generality as the quit action could always be chosen.

Algorithm 3: Algorithm for executing the optimal policy on the ﬂy (i.e., choosing actions on the ﬂy)

Input: Problem instance (S, {A}s, r P , r A, P), Pareto frontier functions {fs}s and other auxiliary variables computed by Algorithm 1. Output: A random trajectory (s1 = sinit, a1, . . . , at 1, st = sterm) of the optimal policy subject to participation constraints. /* we require the following bookkeeping information: the source points (lines 2 and 9 of Algorithm 2), the Pareto frontier functions before truncation { ˆfs}s (line 7 of Algorithm 1), and which turning points come from which subcurve (lines 7 and 8 of Algorithm 1) */

1 let i 1, u1 argmaxu fsinit(u ), s1 sinit;

2 while si = sterm do

3 let p i , p+ i ˆfsi and αi [0, 1] be such that (ui, ˆfsi(ui)) = αi p i + (1 αi) p+ i ; /* we view fsi and ˆfsi interchangeably as mappings and sets of turning points; note that p i may have a negative x-coordinate, but pi deﬁned below always has a nonnegative x-coordinate, and is therefore on the curve of fsi+1 */

4 let a i , a+ i Ai be such that p i and p+ i come from the (shifted) subcurves for (si, a i ) and (si, a+ i ) respectively;

5 let randi be a uniformly random real number between 0 and 1;

6 if randi αi then

7 let ai a i , pi p i (r A(si, ai), r P (si, ai));

9 let ai a+ i , pi p+ i (r A(si, ai), r P (si, ai));

10 play action ai and transition to next state si+1 P(si, ai);

11 let qi = (xi, yi) be the source point of pi from onward state si+1 in the subcurve (before shifting) for (si, ai);

12 let ui+1 xi, i i + 1;

13 let t i;

14 return (s1 = sinit, a1, . . . , st 1, at 1, st = sterm);

3.2 Achieving Efﬁciency by Approximation Below we show how to make Algorithm 1 polynomial-time by sacriﬁcing an arbitrarily small additive error ε. The idea is simple: since each Pareto frontier function is concave, when the number of pieces in each function is large, there must be many pieces that are essentially the same and therefore can be merged at little cost. To make this concrete, instead of the exact Pareto frontier curves, we maintain their intersections with equally

spaced horizontal lines, and use the upper concave closures of these intersections as proxy curves to approximate the actual Pareto frontier functions. Since the Pareto frontier curves are concave, each curve can intersect each horizontal line at most twice once in the increasing phase, and once in the decreasing phase. Moreover, the error from this approximation is uniformly upper bounded by the distance between two neighboring horizontal lines. Formally, for any desired error ε > 0, we insert the following operation after line 9 of Algorithm 1:

let fs concenv fs {(x, y) | x R, y { n, n + ε

n , . . . , n 2ε

The approximation guarantee and the runtime of the modiﬁed algorithms are stated in the following theorem.

Theorem 4. Suppose the principal s reward function is bounded, i.e., for all s S and a As,

1 r P (s, a) 1.

Then, the modiﬁed Algorithm 1 with parameter ε > 0 together with Algorithm 3 implicitly computes a policy πε guaranteeing participation, which satisﬁes

uπε P ( , sinit) uπ P ( , sinit) ε,

where π is the optimal policy guaranteeing participation. Moreover, the modiﬁed Algorithm 1 with parameter ε > 0 takes time O((mn3 log n)/ε).

3.3 Achieving Efﬁciency with Deﬁnitive Decisions

Although Algorithm 1 can take exponential time in general, we show that it is in fact strongly polynomial-time for a special case of the problem, where all decisions are deﬁnitive. Formally, we consider the model with the following restriction.

Deﬁnition 5 (Deﬁnitive decisions). We say an MDP has the deﬁnitive decisions property if, for each state s S, the set of available actions As at s includes at most one decide-later action, for which there are no restrictions on its transition probabilities; all other actions in As are decision actions, which immediately terminate the process by taking the state deterministically to sterm.

Notice that there are no restrictions on the rewards, so each action can induce arbitrary rewards for the principal and the agent. MDPs with deﬁnitive decisions property can capture many real-world scenarios, such as candidate screening and training programs (see Section 4). We can show that our exact algorithm (Algorithm 1) runs in polynomial time if the input MDP satisﬁes the deﬁnitive decisions property.

Lemma 6. In the case of deﬁnitive decisions, Algorithm 1 computes the (exact) Pareto frontier functions for all states in time O(m2 log n).

4 Application: Screening Policy Design

In this section, we illustrate the wide applicability of our results by examining a natural application scenario: the design of screening policies. The setting we consider is the following. Suppose we run a crowdsourcing task that requires very speciﬁc qualiﬁcations, so we would like all crowd workers to be screened before entering the pool. Whether a worker actually qualiﬁes can only be observed during the job (even workers themselves do not know beforehand), but we know that workers who qualify tend to perform better in certain simple online tests. In order to estimate workers qualiﬁcations, we provide these tests to all prospective workers, so they can potentially take as many as they wish. Ideally, we would like to admit only workers who take a large number of tests and perform well on average, because in that case we are very conﬁdent that they qualify. However, if the time cost of repeatedly taking tests outweighs the beneﬁt of getting the job, then workers will simply not try at all. Even if a worker starts trying, he may quickly give up after seeing a few negative results, deciding that his chance of eventually qualifying does not justify the time cost of keep trying. Taking this into consideration, our goal is to design a policy of admission that in expectation maximizes our gain (i.e., the amount of useful job done minus the total payment) by encouraging workers to take more tests. The setup. Formally, there are two types of workers: good and bad . Before taking tests, neither we nor the worker know whether the worker is good or bad, but there are commonly known prior probabilities for both types, pg and pb. A test has two possible results: pass (P) and fail (F). A good worker passes each test independently with probability qg, and a bad worker passes with probability qb < qg. Without loss of generality, suppose our utility for accepting a good agent is ug > 0, and that for accepting a bad one is ub < 0; each worker receives utility 1 for being accepted, and c for the time cost of taking a test. A policy maps each sequence of test outcomes e.g., PFPPF to a distribution over three possible actions: accept, reject, and decide later. Our goal is to ﬁnd an optimal policy, taking into consideration that workers may drop out at any moment if their onward expected utility is negative. We remark that our formulation and algorithms can in fact handle much more general settings, but we choose to present this speciﬁc application, because it admits nice structures and can demonstrate several intriguing aspects of our results.

4.1 Screening Policy Design as Planning with Participation Constraints

We show how to cast the screening problem as planning with participation constraints. First observe that at any moment in the above process, our belief on the worker s type (i.e., the posterior probability that a worker is good) only depends on the sequence of test outcomes so far. In fact, this probability depends only on the numbers of Ps and Fs that a worker gets. Therefore, without loss of generality, we can consider an MDP where the states are S = {(p, f) | p, f N} {sterm} = N2 {sterm}, where sinit = (0, 0). The posterior probability of being good in a state (p, f) is given by:

Pr[good | (p, f)] = pg qp g(1 qg)f

pg qp g(1 qg)f + pb qp b(1 qb)f .

The set of actions associated with each state s has 3 elements: As = {accept, reject, decide-later}. Among these actions, accept and reject are decision actions, in the sense that they terminate the process immediately. The rewards they induce for the agent and the principal are: (1, Pr[good | (p, f)] (ug ub) + ub) for accept, and (0, 0) for reject. The decide-later action induces rewards ( c, 0) and transition probabilities: Pr[(p + 1, f) | (p, f)] = Pr[good | (p, f)] qg + (1 Pr[good | (p, f)]) qb, Pr[(p, f + 1) | (p, f)] = 1 Pr[(p + 1, f) | (p, f)].

Computing optimal policies. With the above formulation, we can apply our algorithms to ﬁnd optimal policies. First consider the case where there is an upper limit N on the number of tests each worker can take. In such cases, the state space becomes S = {(p, f) | p + f N} {sterm}, where |S| = O(N 2). Since the formulation has deﬁnitive decisions, we can directly apply Algorithm 1 to compute the optimal policy in time O(N 4 log N). Even in the idealized setting where the number of tests a worker can take is arbitrarily large (which is possible because workers may not consider sunk cost when making decisions), we can still approximate the optimal policy to arbitrary precision. More speciﬁcally, the following lemma states that there exists a policy that uses at most N tests and achieves a principal s utility at most O(1/N) worse than the optimal policy. Lemma 7. Let OPT be the principal s optimal utility in screening policy design with no limit on the number of tests. For any N, there is a policy guaranteeing participation which plays reject for all (p, f) where p + f > N, and achieves expected utility OPT O(1/N) for the principal. Combining Lemma 6 with Lemma 7, we can compute an ε-approximately optimal policy in time O(log(1/ε)/ε8) for any ε > 0. To summarize, we have the following theorem. Theorem 8. We can compute an optimal screening policy in time O(N 4 log N) when the maximum number of tests a worker is willing to take is N. Moreover, when workers may take arbitrarily many tests, we can compute an approximately optimal policy up to any additive error ε > 0 in time O(log(1/ε)/ε8).

5 Future Directions We leave a number of open questions as interesting avenues for future work: Is there a poly-time algorithm (or a hardness result) for exactly solving MDP with participation constraints? What if the policy must be deterministic? What additional applications can ﬁt in the MDP model in this paper? Another future direction is the case with incomplete information, e.g., when the agent s reward function is only approximately known. One potential solution is to run a robust version of our algorithm that allows some slack in the IR constraints, where the slack corresponds to a conﬁdence interval for our estimate of the agent s reward function.

Acknowledgments

Hanrui Zhang and Vincent Conitzer are supported by NSF award IIS-1814056. Yu Cheng is supported in part by NSF award CCF-2122628. The authors thank anonymous reviewers for their helpful feedback.

References Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In International Conference on Machine Learning, 22 31. PMLR. Altman, E. 1996. Constrained Markov decision processes with total cost criteria: Occupation measures and primal LP. Mathematical methods of operations research, 43(1): 45 72. Altman, E. 1998. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical methods of operations research, 48(3): 387 417. Altman, E. 1999. Constrained Markov decision processes, volume 7. CRC Press. Altman, E.; and Spieksma, F. 1995. The linear program approach in multi-chain Markov decision processes revisited. Zeitschrift f ur Operations Research, 42(2): 169 188. Athey, S.; and Segal, I. 2013. An efﬁcient dynamic mechanism. Econometrica, 81(6): 2463 2485. Bellman, R. 1957. A Markovian decision process. Journal of mathematics and mechanics, 6(5): 679 684. Bergemann, D.; and V alim aki, J. 2010. The dynamic pivot mechanism. Econometrica, 78(2): 771 789. Bergemann, D.; and V alim aki, J. 2019. Dynamic mechanism design: An introduction. Journal of Economic Literature, 57(2): 235 74. Brantley, K.; Dudik, M.; Lykouris, T.; Miryooseﬁ, S.; Simchowitz, M.; Slivkins, A.; and Sun, W. 2020. Constrained episodic reinforcement learning in concave-convex and knapsack settings. ar Xiv preprint ar Xiv:2006.05051. Cheung, W. C. 2019. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. Advances in Neural Information Processing Systems, 32: 726 736. Ding, D.; Wei, X.; Yang, Z.; Wang, Z.; and Jovanovic, M. 2021. Provably efﬁcient safe exploration via primal-dual policy optimization. In International Conference on Artiﬁcial Intelligence and Statistics, 3304 3312. PMLR. Efroni, Y.; Mannor, S.; and Pirotta, M. 2020. Explorationexploitation in constrained MDPs. ar Xiv preprint ar Xiv:2003.02189. Gmytrasiewicz, P. J.; and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. Journal of Artiﬁcial Intelligence Research, 24: 49 79. Hoang, T. N.; and Low, K. H. 2013. Interactive POMDP Lite: Towards practical planning to predict and exploit intentions for interacting with self-interested agents. In Twenty Third International Joint Conference on Artiﬁcial Intelligence.

Howard, R. A. 1960. Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press. Le, H.; Voloshin, C.; and Yue, Y. 2019. Batch policy learning under constraints. In International Conference on Machine Learning, 3703 3712. PMLR. Oliehoek, F. A. 2012. Decentralized pomdps. In Reinforcement Learning, 471 503. Springer. Pavan, A. 2017. Dynamic mechanism design: Robustness and endogenous types. In Advances in Economics and Econometrics: Eleventh World Congress, 1 62. Pavan, A.; Segal, I.; and Toikka, J. 2014. Dynamic mechanism design: A myersonian approach. Econometrica, 82(2): 601 653. Puterman, M. L.; and Shin, M. C. 1978. Modiﬁed policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11): 1127 1137. Roijers, D. M.; Vamplew, P.; Whiteson, S.; and Dazeley, R. 2013. A survey of multi-objective sequential decisionmaking. Journal of Artiﬁcial Intelligence Research, 48: 67 113. Singh, R.; Gupta, A.; and Shroff, N. B. 2020. Learning in Markov decision processes under constraints. ar Xiv preprint ar Xiv:2002.12435. Tessler, C.; Mankowitz, D. J.; and Mannor, S. 2018. Reward constrained policy optimization. ar Xiv preprint ar Xiv:1805.11074. Zhang, H.; and Conitzer, V. 2021. Automated Dynamic Mechanism Design. ar Xiv preprint ar Xiv:2105.06008.