Sequential Blocked Matching

Nicholas Bishop¹, Hau Chan², Debmalya Mandal³, Long Tran-Thanh⁴
¹University of Southampton, ²University of Nebraska-Lincoln, ³Max Planck Institute, ⁴University of Warwick
nb8g13@soton.ac.uk, hchan3@unl.edu, dmandal@mpi-sws.org, long.tran-thanh@warwick.ac.uk

We consider a sequential blocked matching (SBM) model where strategic agents repeatedly report ordinal preferences over a set of services to a central planner. The planner's goal is to elicit agents' true preferences and design a policy that matches services to agents in order to maximize the expected social welfare, with the added constraint that each matched service can be blocked, or unavailable, for a number of time periods. Naturally, SBM models the repeated allocation of reusable services to a set of agents where each allocated service becomes unavailable for a fixed duration. We first consider the offline SBM setting, where the strategic agents are aware of their true preferences. We measure the performance of any policy by distortion, the worst-case multiplicative approximation guaranteed by any policy. For the setting with s services, we establish lower bounds of Ω(s) and Ω(√s) on the distortions of any deterministic and randomised mechanisms, respectively. We complement these results by providing approximately truthful (measured by incentive ratio) deterministic and randomised policies based on random serial dictatorship which match our lower bounds. Our results show that there is a significant improvement if one considers the class of randomised policies. Finally, we consider the online SBM setting with bandit feedback, where each agent is initially unaware of her true preferences, and the planner must facilitate each agent in the learning of their preferences through the matching of services over time. We design an approximately truthful mechanism based on the explore-then-commit paradigm, which achieves logarithmic dynamic approximate regret.

1 Introduction

In recent years, machine learning algorithms have been extremely successful in various domains, from playing games to screening cancer. However, despite such success, most learning algorithms cannot be deployed directly in practice to make decisions under uncertainty. The main reason is that most real-world applications involve multiple agents, and learning algorithms are often constrained due to the unavailability of resources. Motivated by such constraints in multi-agent systems, we consider the problem of repeated matching with blocking constraints, a scenario where multiple agents simultaneously learn their preferences under repeated blocking or unavailability of resources. In particular, we are interested in the repeated one-sided matching problem where strategic agents report their ordinal preferences, based on their expected rewards, over a set of alternatives, or services. The agents are matched to the services given their reported preferences in each time period or round.
It is well-known that one-sided matching can be used to model various real-world situations such as matching patients to kidneys across health institutions (Durlauf and Blume 2008; Roth, Sönmez, and Ünver 2004), assigning students to rooms in residential halls (Durlauf and Blume 2008), allocating workers to tasks in crowdsourcing (Difallah, Demartini, and Cudré-Mauroux 2013; Aldhahri, Shandilya, and Shiva 2015), and recommending users to activities in recommender systems (Satzger, Endres, and Kießling 2006; Ansari, Essegaier, and Kohli 2000; Isinkaye, Folajimi, and Ojokoh 2015).

In many of these situations, there are several obstacles. First, for a setting with reusable services, a major caveat is that an agent-alternative match within a round can result in the blocking of some services, meaning those services may not be available again until a later time. For example, a recommended activity (e.g., a special promotion offer from a restaurant) that is matched to (or used by) a user may not be available to all users again until a later time. Or, in cloud computing, where tasks are matched to resources (e.g., GPUs), once a task is assigned to a resource, that resource is blocked for a certain number of rounds. Second, the agents are often unaware of their exact preferences, and the planner must coordinate their exploration without incurring a significant loss. This is often true for recommending restaurants to customers, as restaurants have limited capacity and people are rarely informed of all possible choices (Waldfogel 2008). Note that, even when the agents are themselves using learning algorithms over time, coordination by the planner becomes necessary to avoid different agents exploring the same service at the same time, a problem which is exacerbated by the blocking of services. Finally, in several settings, the agents are aware of their preferences, but they might report them strategically in order to get matched to better services. This is particularly prominent in assigning rooms to students. Rooms can be blocked due to occupancy and become available again once the students leave. As a result, there is a potential for the students to misreport their private preferences to manipulate matching outcomes.

1.1 Main Contributions

In order to capture the notion of one-sided matching with blocking, we introduce a sequential blocked matching (SBM) model, in which a set of n strategic agents are matched to a set of s services repeatedly over rounds, and where matched services are blocked for a deterministic number of time steps. Each agent reports her ordinal preferences over the services every round, based on her current estimates of the expected rewards of the services. As is standard in the matching literature, we focus on the setting where agents report only ordinal preferences over the services. The planner's goal is to derive a matching policy that, at each round, elicits true preferences from the agents and matches them to services in order to maximize the expected social welfare, which is the sum of the expected utilities of the agents from the matchings over time, whilst accounting for the blocking of services. To the best of our knowledge, SBM models have not been studied before and can be applied to a wide range of real-world matching scenarios. We investigate the offline and online variations of the SBM model.
For both variations, we are interested in deriving deterministic and randomized policies that are approximately truthful and efficient. We measure truthfulness by the incentive ratio (Chen et al. 2012), which measures how much a single agent can gain via misreporting preferences. We measure efficiency through the notion of distortion from social choice theory (Procaccia and Rosenschein 2006), which measures the loss in social welfare due to having access only to preferences, and not utility functions and rewards. We formally define these concepts in Section 2.

Offline SBM Benchmarks. In the offline setting of SBM, each agent knows their own preferences and rewards over the services, but the planner does not. In addition, each agent reports their preferences only once to the planner, before matching begins. Essentially, the offline benchmarks establish what we can achieve in terms of distortion if the agents don't have to learn. Table 1 summarizes our results.

Table 1: Lower and Upper Bound Results for Offline SBM Models

| Mechanism | Distortion | Incentive Ratio |
| --- | --- | --- |
| Any deterministic mechanism (lower bound) | Ω(s) | (0, 1] |
| Derandomized Repeated Random Serial Dictatorship (upper bound) | O(s) | 1 − 1/e |
| Any randomised mechanism (lower bound) | Ω(√s) | (0, 1] |
| Repeated Random Serial Dictatorship (upper bound) | O(√s) | 1 − 1/e |

More specifically, we derive lower bounds on the distortion of any deterministic and randomised mechanism. The main ingredient of our proof is the careful construction of reward profiles that are consistent with the reported preferences and that guarantee poor social welfare for the planner. We then focus on the upper bound and provide approximately truthful mechanisms with bounded incentive ratios that match the distortion lower bounds. In short, both the deterministic and randomised mechanisms we provide are based on the repeated random serial dictatorship (RSD) mechanism for one-shot one-sided matching problems. Our randomised mechanism, repeated RSD (RRSD), iterates randomly over all agents, greedily choosing the current agent's preferred service at each time step. Our deterministic mechanism, derandomised RRSD (DRRSD), is a derandomised version of this algorithm and matches the corresponding lower bound. Interestingly, we find that there is a strict separation of √s between the distortion achievable by a deterministic and a randomized mechanism.

Online SBM Algorithms. For the online setting of SBM, the agents do not know their preferences or rewards and must learn their preferences via repeated matching to services. After each matching, the agents update their preferences and strategically report them to the planner. We design an approximately truthful mechanism, bandit RRSD (BRRSD), based on the Explore-then-Commit (ETC) paradigm, which achieves sublinear dynamic approximate regret. In particular, BRRSD has two phases. In the first phase, it allows the participating agents to learn their preferences via a uniform allocation of services. Using the learnt estimates from this phase, the mechanism then runs RRSD in the second phase.

1.2 Related Work

We provide a brief discussion of the related work in the matching and bandit literature and highlight major differences compared to our SBM models, which have not been considered previously.

Ordinal Matching and Distortion. We consider the objective of maximizing expected rewards as our offline benchmark. Since we do not observe the exact utilities of the agents, but rather their ordinal preferences over items, we use the notion of distortion (Procaccia and Rosenschein 2006) from voting to quantify such a benchmark. In the context of voting, distortion measures the loss of performance due to the limited availability of reward profiles (Boutilier et al. 2015; Mandal et al. 2019; Anshelevich et al. 2018; Kempe 2020; Anshelevich and Postl 2017).
Our offline benchmark is related to the literature on the distortion of matching (Amanatidis et al. 2021; Filos-Ratsikas, Frederiksen, and Zhang 2014; Anshelevich and Sekar 2016). However, our offline benchmark needs to consider repeated matching over time, and because of the blocking of services, it has a very different distortion than that of a single-round matching.

Online Matching. There are existing online notions of weighted bipartite matching (e.g., Karp, Vazirani, and Vazirani 1990; Kalyanasundaram and Pruhs 1993; Karande, Mehta, and Tripathi 2011) and stable matching (e.g., Khuller, Mitchell, and Vazirani 1994) where the matching entities (i.e., agents or services) arrive dynamically over time and the relevant information (e.g., weights of the matched pairs or agents' ordinal preferences) is publicly known. These online settings are different from our repeated matching setting, where the entities do not arrive dynamically and our objective is to maximize the expected rewards of the repeated matching given the agents' ordinal preferences. Other recent works explore dynamic agent preferences that can change over time (e.g., Parkes and Procaccia 2013; Hosseini, Larson, and Cohen 2015a,b). However, they do not consider the problem of maximizing expected rewards under blocking.

Blocking Bandits. Our work in the online SBM model is closely related to the recent literature on blocking bandit models (Basu et al. 2021, 2019; Bishop et al. 2020), where each pulled arm (i.e., service) can be blocked for a fixed number of rounds. Our work is also related to bandits with different types of arm-availability constraints (Neu and Valko 2014; Kleinberg and Immorlica 2018; Kleinberg, Niculescu-Mizil, and Sharma 2010). However, these models do not consider the sequential matching setting where multiple strategic agents have (possibly unknown) ordinal preferences over arms and report ordinal preferences to a planner in order to be matched to some arm at each round.

Multi-agent multi-armed bandits. The online setting in our work is broadly related to the growing literature on multi-agent multi-armed bandits (Liu, Mania, and Jordan 2020; Sankararaman, Basu, and Sankararaman 2021; Bistritz et al. 2020). Liu, Mania, and Jordan (2020) consider a matching setting where strategic agents learn their preferences over time, and the planner outputs a matching every round based on their reported preferences. However, our setting is more challenging, as we need to compete against a dynamic offline benchmark because of the blocking of services, whereas the existing works compete against a fixed benchmark, e.g., repeated applications of Gale–Shapley matching in each round (Liu, Mania, and Jordan 2020).
2 Preliminaries

In this section, we introduce our model for sequential blocked matching. We start by describing how the preferences of each agent are modeled and describe formally how agents can be matched to services in a single time step. After this, we introduce the first of the two sequential settings that we study in this paper, which features the blocking of services when they are assigned to agents.

In our model, we have a set of agents, N = {1, …, n}, who hold cardinal preferences over a set of services, S = {1, …, s}, where s ≥ n.¹ We use µ_{i,j} ∈ ℝ_+ to describe the cardinal reward agent i receives for being assigned service j. Similarly, we denote by µ_i = (µ_{i,j})_{j=1}^s the vector of rewards associated with agent i. In what follows, we will also refer to µ_i as the reward profile associated with agent i. Moreover, we restrict ourselves to reward profiles which lie in the probability simplex. That is, we assume µ_i ∈ Δ^{s−1} for all i ∈ N. In other words, we make a unit-sum assumption about the reward profile of each agent. Bounding constraints on reward profiles are common in the ordinal one-sided matching literature (Filos-Ratsikas, Frederiksen, and Zhang 2014), and are typically required in order to prove lower bounds for truthful algorithms such as RSD. Moreover, the unit-sum assumption is prevalent in social choice theory (Boutilier et al. 2015). Lastly, we denote by µ the n × s matrix of rewards.

¹Note that this is without loss of generality, as we may always add dummy services corresponding to a null assignment.

We say that agent i (weakly) prefers service a to service b if agent i receives greater reward from being assigned service a than service b. That is, agent i prefers service a over service b if and only if µ_{i,a} ≥ µ_{i,b}. We use the standard notation a ≻_i b to say that agent i prefers service a to service b. Additionally, we use the notation ≻_i(j) to indicate the service in the jth position of the preference ordering ≻_i. Note that every reward profile induces a linear preference ordering of services.² We use the notation µ_i ▷ ≻_i to denote that ≻_i is a preference ordering induced by agent i's reward profile. We let P(S), or P for short, denote the class of all linear preferences over S. We write ≻_i to denote the preferences induced by agent i's reward profile. Furthermore, we let ≻ = (≻_i)_{i=1}^n ∈ P^n denote the preference profile of the agents. As is standard, we write ≻_{−i} to denote (≻_1, …, ≻_{i−1}, ≻_{i+1}, …, ≻_n). As a result, we may denote ≻ by (≻_i, ≻_{−i}).

²One reward profile may induce many linear orderings. However, the linear preference ordering induced by a reward profile can be made unique via tie-breaking rules.

A matching m : N → S ∪ {0} is a mapping from agents to services. We let m(i) denote the service allocated to agent i by the matching m. We use 0 to denote the null assignment. That is, agent i is assigned no service in a matching if m(i) = 0. We let ∅ denote the null matching, in which no agent is assigned a service. We say a matching is feasible if no two agents are mapped to the same service. We let M denote the set of all feasible matchings.

In this paper, we consider discrete-time sequential decision problems, in which a planner selects a sequence of (feasible) matchings over T time steps. We let m_t denote the matching chosen by the planner at time step t, and denote by M = (m_t)_{t=1}^T a sequence of T matchings. We denote by M(t, i) = m_t(i) the service matched to agent i at time t. Furthermore, we assume that, when a service is assigned, it may be blocked for a time period depending on the agent it was assigned to. More specifically, when agent i is matched with service j, we assume that service j cannot be matched to any agent for the next D_{i,j} − 1 time steps. We refer to D_{i,j} as the blocking delay associated with the agent-service pair i and j. Additionally, we let D̄ denote the maximal blocking delay possible, and let D denote the n × s matrix of blocking delays. From now on, we assume that all blocking delays are known a priori by both the planner and all agents.
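To make the preliminaries concrete, the following minimal Python sketch (our own illustration, not code from the paper; the names `rewards`, `blocking_delays`, and `induced_preference` are ours) builds a unit-sum reward matrix µ, derives the preference ordering each reward profile induces, and records a blocking-delay matrix D, with agents and services indexed from 0.

```python
import numpy as np

# Toy instance: n = 3 agents, s = 3 services (s >= n).
# Each row of `rewards` is an agent's reward profile mu_i, normalised to the
# probability simplex (the unit-sum assumption).
rewards = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
])
assert np.allclose(rewards.sum(axis=1), 1.0)

# Blocking delays D[i, j]: after agent i is matched to service j, service j
# is unavailable for the next D[i, j] - 1 time steps.
blocking_delays = np.array([
    [2, 1, 3],
    [1, 2, 2],
    [3, 1, 1],
])

def induced_preference(mu_i):
    """Return the linear preference ordering induced by a reward profile,
    breaking ties by service index (cf. footnote 2)."""
    return sorted(range(len(mu_i)), key=lambda j: (-mu_i[j], j))

for i, mu_i in enumerate(rewards):
    print(f"agent {i} prefers services in order {induced_preference(mu_i)}")
```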
We say that a matching sequence M is feasible with respect to the delay matrix D if no service is matched to an agent at a time step at which it has been blocked by a previous matching.

Definition 1. For a given blocking delay matrix D, the set of feasible matching sequences of length T, M_T^D ⊆ M_T, is the set of all matching sequences M ∈ M_T such that, for all t ∈ {1, …, T}, i ∈ N, and j ∈ S, if M(t, i) = j then M(t′, i′) ≠ j for all i′ ∈ N and for all t′ such that t < t′ ≤ t + D_{i,j} − 1.

In other words, a matching sequence is feasible if there is no matching in the sequence which assigns an agent a service that has been blocked by a previous matching. Note that the blocking of services is a common phenomenon in real-world scenarios. For example, consider a setting in which each service corresponds to a freelance contractor, and each agent corresponds to an employer. The matching of services and agents then corresponds to employers contracting freelancers. For the duration of the contract, which may differ from employer to employer, the matched freelancer is unavailable, before returning to the pool of available services once their contract ends.

We define the utility, W_i(M, µ_i), agent i receives from a matching sequence M as the sum of rewards it receives from each matching in the sequence. That is, W_i(M, µ_i) = Σ_{t=1}^T µ_{i,M(t,i)}. Similarly, we define the social welfare, SW(M, µ), of a matching sequence M as the sum of the utilities of all agents. More specifically, SW(M, µ) = Σ_{i=1}^n W_i(M, µ_i).

Next, we describe the first sequential matching setting we consider in this paper, which we call the offline SBM setting. In this setting, the planner must produce a feasible matching sequence of length T. Prior to the selection of a matching sequence, each agent submits a linear preference ordering to the planner. We denote by ≻̃_i the preference ordering, or report, submitted by agent i. Analogously, we define ≻̃ as the preference profile submitted cumulatively by the agents, and call it the report profile. A matching policy π(M | ≻̃, D) assigns a probability of returning a matching sequence M given a submitted report profile ≻̃ and blocking delay matrix D. When it is clear from context, we will abuse notation and use π(≻̃, D) to refer to the (random) matching sequence prescribed by a policy π given a report profile ≻̃ and blocking delay matrix D. We say that a matching policy is admissible if, for all possible report profiles and blocking delay matrices, the matching sequence returned by the policy is always feasible. The goal of the planner is to adopt an admissible matching policy which achieves high social welfare in expectation relative to the best feasible matching sequence in hindsight, M*(µ, D) = argmax_{M ∈ M_T^D} SW(M, µ).

We assume that each agent, with full knowledge of the matching policy employed by the planner, submits a linear preference ordering with the intention of maximising their own utility, and may therefore try to manipulate the planner by submitting a preference ordering which is not induced by their underlying cardinal preferences. We say that an agent is truthful if they submit a preference ordering induced by their underlying cardinal preferences. That is, an agent is truthful if µ_i ▷ ≻̃_i. We denote by ≻*_i the report by agent i which maximises agent i's utility in expectation, under the assumption that all other agents are truthful.
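The feasibility condition of Definition 1 and the welfare definitions above can be checked mechanically. Below is a small illustrative Python sketch, under our own conventions (function names are ours, services are indexed from 0, and agents receiving the null assignment are simply omitted from each per-step dictionary).

```python
def is_feasible(M, D):
    """Check Definition 1: M is a list of matchings, each a dict {agent: service};
    D[i][j] is the blocking delay of the agent-service pair (i, j)."""
    blocked_until = {}                  # service -> first time step at which it is free again
    for t in range(len(M)):
        services = list(M[t].values())
        if len(services) != len(set(services)):
            return False                # two agents mapped to the same service
        for i, j in M[t].items():
            if t < blocked_until.get(j, 0):
                return False            # service j is still blocked at time t
            blocked_until[j] = t + D[i][j]   # j blocked for the next D[i][j] - 1 steps
    return True

def utility(M, mu_i, i):
    """W_i(M, mu_i) = sum_t mu_{i, M(t, i)}; the null assignment contributes 0."""
    return sum(mu_i[M[t][i]] for t in range(len(M)) if i in M[t])

def social_welfare(M, mu):
    """SW(M, mu) = sum_i W_i(M, mu_i)."""
    return sum(utility(M, mu[i], i) for i in range(len(mu)))
```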
We say that a policy is truthful if, for all possible µ and D, it is optimal for each agent to be truthful whenever all other agents are truthful. In other words, a policy is truthful if for all µ and D we have µ_i ▷ ≻*_i for all i ∈ N.

To evaluate the efficiency of a given policy we use distortion, a standard notion of approximation for settings with ordinal preferences.

Definition 2. The distortion of a matching policy π is the worst-case ratio between the social welfare of the optimal matching sequence, M*(µ, D), and the expected social welfare of the matching sequence, π(≻, D), returned by the policy under the assumption that all agents are truthful:

sup_{µ, D} SW(M*(µ, D), µ) / E[SW(π(≻, D), µ)].

Note that distortion is the approximation ratio of the policy π with respect to the best matching sequence. In addition, note that distortion is only a useful measure of a matching policy's efficiency if said policy encourages truthful reporting. For example, for truthful policies, distortion completely characterises a policy's expected performance. As a result, we not only seek policies which have low distortion, but also policies which incentivise agents to submit their reports truthfully. To this end, we introduce the notion of incentive ratio, which measures the relative improvement in utility an agent can achieve by lying about their preferences.

Definition 3. The incentive ratio ζ(π) ∈ ℝ_+ of a matching policy π is given by:

ζ(π) = min_{D, ≻̃_{−i}, µ_i ▷ ≻_i} E[W_i(π((≻_i, ≻̃_{−i}), D), µ_i)] / E[W_i(π((≻*_i, ≻̃_{−i}), D), µ_i)].

If a policy has an incentive ratio of 1, then it is truthful. There are many reasons why we may expect a policy with bounded incentive ratio to do well. A bounded incentive ratio implies that truth-telling is a good approximation to the optimal report. If computing the optimal report is computationally intractable for the agent, being truthful is therefore an attractive alternative, especially if the approximation ratio implied by the incentive ratio is tight. In summary, we seek matching policies with good guarantees when it comes to both incentive ratio and distortion. This topic is treated in detail in the forthcoming sections.
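As a worked illustration of Definition 2, the following self-contained Python sketch brute-forces the benchmark M*(µ, D) on a toy two-agent, two-service instance and compares it with an arbitrary greedy rule that matches each agent, in index order, to its favourite available service. The instance and the greedy rule are purely illustrative and are not the paper's mechanisms; the point is only how the ratio in Definition 2 is formed.

```python
from itertools import combinations, permutations, product

def all_matchings(n, s):
    """All feasible one-shot matchings: injective partial maps agents -> services."""
    out = [{}]
    for k in range(1, min(n, s) + 1):
        for agents in combinations(range(n), k):
            for servs in permutations(range(s), k):
                out.append(dict(zip(agents, servs)))
    return out

def is_feasible(M, D):
    blocked_until = {}
    for t, m in enumerate(M):
        for i, j in m.items():
            if t < blocked_until.get(j, 0):
                return False
            blocked_until[j] = t + D[i][j]
    return True

def social_welfare(M, mu):
    return sum(mu[i][j] for m in M for i, j in m.items())

mu = [[0.9, 0.1], [0.6, 0.4]]   # toy unit-sum reward profiles
D  = [[3, 1], [2, 1]]           # blocking delays
T  = 3

# Benchmark M*(mu, D): exhaustive search over feasible matching sequences.
best = max((M for M in product(all_matchings(2, 2), repeat=T) if is_feasible(M, D)),
           key=lambda M: social_welfare(M, mu))

# An arbitrary greedy rule (not one of the paper's mechanisms): each step, give
# every agent in index order its top-ranked service that is free and unblocked.
naive, blocked_until = [], {}
for t in range(T):
    m, used = {}, set()
    for i in range(2):
        for j in sorted(range(2), key=lambda j: -mu[i][j]):
            if j not in used and t >= blocked_until.get(j, 0):
                m[i] = j
                used.add(j)
                blocked_until[j] = t + D[i][j]
                break
    naive.append(m)

print("optimal SW:", social_welfare(best, mu))   # roughly 2.4 on this instance
print("greedy SW :", social_welfare(naive, mu))  # roughly 1.5 on this instance
print("ratio     :", social_welfare(best, mu) / social_welfare(naive, mu))
```

On this instance the optimum delays the high-blocking assignment of the popular service, and the ratio comes out to 1.6; distortion is the worst case of this ratio over all instances consistent with the reported preferences.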
3 The Offline SBM Setting

In this section, we present our analysis of the offline SBM setting. We first provide lower bounds on the distortion achievable by both randomised and deterministic policies. Then, we discuss why trivial extensions of truthful one-shot matching algorithms do not result in truthful policies. Instead, we focus on designing policies which use truthful one-shot matching mechanisms as a basis and which have bounded incentive ratio. More precisely, we present the RRSD algorithm. We show that the incentive ratio of RRSD is bounded below by 1 − 1/e, and provide upper bounds on the distortion achieved by RRSD, which match our previously established lower bounds on the best distortion achievable by any randomised algorithm.

3.1 Lower Bounds on the Distortion of Deterministic and Randomised Policies

First, we prove that the distortion of any deterministic policy is Ω(s). That is, the distortion of any deterministic policy scales at least linearly with the number of services. In the proof, we first carefully construct a set of ordinal preferences. Then, given any matching sequence M, we show that there exists a set of reward profiles which induces the aforementioned ordinal preferences and on which M incurs distortion of order Ω(s).³

³All missing proofs are deferred to the full version.

Theorem 1. The distortion of any deterministic policy is Ω(s).

Next, we prove that the distortion incurred by any randomised policy is Ω(√s). To prove this, we first show that it is sufficient to consider only anonymous policies, that is, policies that assign each service to each agent the same number of times in expectation for all possible preference profiles. Then, we construct a set of reward profiles which yields the desired distortion for all anonymous truthful policies.

Theorem 2. The distortion of the best randomised policy is Ω(√s).

3.2 Constructing Truthful Algorithms for the Offline SBM Setting

As previously mentioned, we assume that agents submit reports with the intention of maximising their own utility. As a result, the distortion incurred by a policy may not reflect its performance in practice, as agents may be incentivised to misreport their preferences in order to increase their utility. Note that in standard one-shot one-sided matching problems, this issue is sidestepped via the employment of truthful policies, like RSD. In addition, the restriction to truthful policies is well justified by the revelation principle. In a similar way, we would like to develop truthful algorithms for the offline SBM setting.

One may be tempted to apply such truthful one-shot policies to our setting directly, that is, to apply an algorithm such as RSD repeatedly at every time step in order to devise a matching sequence. This intuition is correct when there is no blocking, as the matching problems for each time step are then independent of each other. However, with blocking, the matchings from previous time steps have a substantial effect on the set of matchings which preserve the feasibility of the matching sequence in future rounds. As a result, immediately obvious approaches, such as matching according to RSD repeatedly, do not result in truthful policies.

One simple way of generating truthful policies is to run a truthful one-shot one-sided matching policy once every D̄ time steps and simply return the empty matching in the remaining time steps. Such an approach decouples each time step from the next, resulting in truthfulness, but comes at the cost of matching in at most T/D̄ rounds. Instead, we construct an algorithm for the offline SBM setting from truthful one-shot matching algorithms in a different manner. More specifically, we propose the repeated random serial dictatorship (RRSD) algorithm, which uses RSD as a basis. Whilst RRSD is not truthful, it does have bounded incentive ratio.

3.3 A Greedy Algorithm for the Offline SBM Setting

The RRSD algorithm slowly builds up a matching sequence M over time by iterating through agents and services. In other words, RRSD begins with the empty matching sequence, where M(t, i) = 0 for all t and all i ∈ N. To begin, RRSD samples a permutation of agents σ uniformly at random. Next, RRSD iterates through the agents in the order given by the sampled permutation. For each agent i, RRSD iterates through services in the order specified by the preference ordering ≻̃_i reported by agent i. For a given service j, RRSD repeatedly assigns service j to agent i at the earliest time step which does not cause the matching sequence to become infeasible. When no such time step exists, RRSD moves on to the next service in agent i's preference ordering. Once RRSD has iterated through the entire preference ordering of agent i, RRSD moves on to the next agent in the permutation σ and repeats this process until the end of the permutation is reached. The pseudocode for RRSD is given in the full version.
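The description above translates almost directly into code. The sketch below is our reading of RRSD (the authoritative pseudocode is in the full version), with agents, services, and time steps indexed from 0 and the earliest feasible time step always chosen.

```python
import random
from collections import defaultdict

def rrsd(reports, D, T, rng=random):
    """Sketch of repeated random serial dictatorship (RRSD).

    reports[i] is agent i's reported preference ordering (most preferred first);
    D[i][j] is the blocking delay of pair (i, j).
    Returns a matching sequence as a list of T dicts {agent: service}.
    """
    n = len(reports)
    M = [dict() for _ in range(T)]
    service_slots = defaultdict(list)   # service -> list of (time, agent) already using it
    agent_busy = defaultdict(set)       # agent -> time steps already filled

    def compatible(i, j, t):
        """Can (i, j) be scheduled at time t without breaking feasibility?"""
        if t in agent_busy[i]:
            return False
        for t_other, i_other in service_slots[j]:
            if t_other <= t <= t_other + D[i_other][j] - 1:
                return False            # t falls inside an existing blocking window of j
            if t <= t_other <= t + D[i][j] - 1:
                return False            # our blocking window would cover an existing use of j
        return True

    order = list(range(n))
    rng.shuffle(order)                  # sample a permutation sigma uniformly at random
    for i in order:                     # serial dictatorship over agents
        for j in reports[i]:            # greedy pass over agent i's reported ordering
            while True:
                t = next((t for t in range(T) if compatible(i, j, t)), None)
                if t is None:
                    break               # service j can no longer be scheduled; next service
                M[t][i] = j
                service_slots[j].append((t, i))
                agent_busy[i].add(t)
    return M

# Example use on the toy instance from Section 2:
# rrsd([[0, 1, 2], [1, 0, 2], [2, 1, 0]], D=[[2, 1, 3], [1, 2, 2], [3, 1, 1]], T=5)
```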
We will now briefly give the intuition behind RRSD. In essence, RRSD attempts to mimic the RSD algorithm for one-shot matching problems by allowing each agent to sequentially choose, via its reported ordering, a feasible assignment of services over the entire time horizon (whilst respecting the assignments chosen by previous agents). In the case of RSD, given an agent's preference ordering, the same assignment is always optimal no matter the underlying reward profile of the agent. That is, it is optimal for the agent to be assigned its most preferred available service, no matter its cardinal preferences. As a result, RSD is trivially truthful in the one-shot matching setting. In contrast, in the offline SBM setting, the optimal assignment of services can be different for two reward profiles which induce the same preference ordering. Hence, there is no trivial assignment, based on the preference ordering submitted by the agent, which guarantees that agents are truthful. Instead, given an agent's preference ordering, we attempt to find an assignment which performs reasonably well no matter the underlying reward profile of the agent. RRSD uses a greedy algorithm to compute the assignment given to an agent. As long as this greedy algorithm is a good approximation of the optimal assignment, no matter the agent's underlying reward profile, RRSD will have a bounded incentive ratio. The next theorem formalises this argument.

Theorem 3. The incentive ratio of RRSD is asymptotically bounded below by 1 − 1/e.

Remark. It is an open question whether we can achieve incentive ratios better than 1 − 1/e when RRSD is used. In particular, one can show that many scheduling problems, such as generic job interval scheduling and (dense) pinwheel scheduling, can be reduced to the optimal manipulation problem each agent faces in RRSD. Whilst it is known that generic job interval scheduling problems are MAXSNP-hard (Chuzhoy, Ostrovsky, and Rabani 2006), it is still not known whether there exists a scheduling algorithm with approximation ratio better than 1 − 1/e.

We now provide an upper bound on the distortion achieved by RRSD, which matches the lower bound for randomised policies established in Theorem 2.

Theorem 4. The distortion of RRSD is at most O(√s).

Finally, we show that it is possible to match the previously established lower bound on the distortion of deterministic algorithms. More specifically, we show that a derandomised version of RRSD incurs a distortion of at most O(s). The main idea is that we can select O(n² log n) permutations of agents so that the number of times an agent i is selected in the jth position is Θ(n log n). We can then run through these permutations one by one instead of selecting a single permutation uniformly at random as in the RRSD algorithm.

Theorem 5. There is an admissible deterministic policy with distortion at most O(s) for any T = Ω(n² log(n)).
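The derandomisation only needs some deterministic collection of O(n² log n) permutations in which every agent occupies every position Θ(n log n) times. One simple schedule with this counting property, shown below, repeats the n cyclic shifts ⌈n log n⌉ times each; this is only an illustration of the balance requirement stated above, not necessarily the construction used by DRRSD.

```python
import math

def balanced_permutation_schedule(n):
    """Return n * ceil(n log n) = O(n^2 log n) permutations of {0, ..., n-1} such
    that every agent appears in every position exactly ceil(n log n) times
    (each cyclic shift repeated ceil(n log n) times). Illustrative only."""
    repeats = math.ceil(n * math.log(max(n, 2)))
    shifts = [[(pos + shift) % n for pos in range(n)] for shift in range(n)]
    return [perm for perm in shifts for _ in range(repeats)]

schedule = balanced_permutation_schedule(4)
# Agent 0 sits in position 0 exactly ceil(4 ln 4) = 6 times across the schedule.
assert sum(1 for perm in schedule if perm[0] == 0) == math.ceil(4 * math.log(4))
```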
4 SBM with Bandit Feedback

Note that, in order for the guarantees above to hold in practice, we must assume that agents are fully aware of their ordinal preferences before matching begins. However, in many real-world scenarios, agents may be initially unaware of their preferences and learn them over time by matching with services. In addition, the reward an agent receives for being matched with a service may be inherently stochastic, depending on unobservable aspects of the underlying environment.

With these concerns in mind, we present a new sequential blocked matching setting, which we call the online SBM setting with bandit feedback, or online SBM for short. In the online SBM setting, matching occurs time step by time step. At the beginning of each time step, every agent must submit a report, ≻̃_i^t, to the planner. The planner is then tasked with returning a matching of agents to services which obeys the blocking constraints imposed by the matchings from previous time steps. At the end of each time step, agent i receives a reward, r_{i,t} ∈ [0, 1], sampled from a distribution with mean µ_{i,j}, where j is the service agent i was assigned in the matching returned by the planner. Additionally, we assume that each agent maintains an internal estimate, ≻̂_i^t, of its own preference ordering at every time step, based on the rewards received thus far.

We use H_t^≻ = (≻̃^1, …, ≻̃^t) to denote the report history up to time step t. Furthermore, we use H_t^m = (m_1, …, m_t) to describe the matching history at the end of time step t. We say that a matching history is feasible if its matchings form a feasible matching sequence. Similarly, we use H_t^r = (r_1, …, r_t) to denote the reward history, that is, the tuple of reward vectors, r_t, observed by the agents at every time step.

An (online) matching policy π = (π_1, …, π_T) is a tuple of functions π_t(m | H_t^≻, H_t^m, D) which assign a probability of returning the matching m given a report history H_t^≻, a feasible matching history H_t^m, and a blocking delay matrix D. Similarly to the offline setting, we say that a matching policy is admissible if it always returns a feasible matching sequence. Likewise, an (online) report policy for agent i, ψ_i = (ψ_i^1, …, ψ_i^T), is a tuple of functions ψ_i^t(≻̃_i^t | H_t^r, H_t^m, D) which assign a probability of agent i reporting ≻̃_i^t at time step t given a reward history H_t^r, a matching history H_t^m, and a blocking delay matrix D. We denote by ψ = (ψ_1, …, ψ_n) the tuple of report policies used by the agents. As before, we use the notation ψ_{−i} to denote the report policies of all agents bar agent i, and use ψ̃ to denote the tuple of report policies in which each agent reports its internal estimate ≻̂_i^t at every time step. We say that an agent is truthful if it employs the report policy ψ̃_i.
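The interaction protocol just described can be summarised as a simulation loop: agents report their internal estimates, the planner returns a matching that respects the blocking constraints, and matched agents observe stochastic rewards. The Python sketch below is our own schematic of this loop; the planner shown is a placeholder, the names are ours, and the agents rank services by empirical mean, anticipating the mean-based assumption introduced in the next subsection.

```python
import random
from collections import defaultdict

def simulate(mu, D, T, planner, seed=0):
    """One run of the online SBM protocol. `planner(reports, t, blocked)` must
    return a matching {agent: service} avoiding the currently blocked services."""
    rng = random.Random(seed)
    n, s = len(mu), len(mu[0])
    sums, pulls = defaultdict(float), defaultdict(int)
    blocked_until = defaultdict(int)            # service -> first time step it is free again
    matching_history = []

    def report(i):
        # Internal estimate: unsampled services first, then by empirical mean.
        def key(j):
            return (-float("inf"), j) if pulls[i, j] == 0 else (-sums[i, j] / pulls[i, j], j)
        return sorted(range(s), key=key)

    for t in range(T):
        reports = [report(i) for i in range(n)]                  # agents submit reports
        blocked = {j for j in range(s) if t < blocked_until[j]}  # blocking constraints
        m = planner(reports, t, blocked)                         # planner returns a matching
        for i, j in m.items():
            reward = 1.0 if rng.random() < mu[i][j] else 0.0     # Bernoulli reward, mean mu[i][j]
            sums[i, j] += reward
            pulls[i, j] += 1
            blocked_until[j] = t + D[i][j]                       # j blocked for D[i][j] - 1 steps
        matching_history.append(m)
    return matching_history

def naive_planner(reports, t, blocked):
    """Toy placeholder planner: give each agent its top reported service that is still free."""
    m, used = {}, set(blocked)
    for i, prefs in enumerate(reports):
        for j in prefs:
            if j not in used:
                m[i] = j
                used.add(j)
                break
    return m

history = simulate(mu=[[0.8, 0.2], [0.4, 0.6]], D=[[2, 1], [1, 2]], T=10, planner=naive_planner)
```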
The goal of each agent is to employ a report policy that maximises the sum of their own rewards across the time horizon. In contrast, the goal of the planner is to employ a matching policy which maximises the sum of rewards across all agents and across all time steps. In the bandit literature, a performance metric typically used to measure the efficiency of a policy is regret, defined as the expected difference between the rewards accumulated by a matching policy and the expected reward accumulated by the best fixed matching policy in hindsight, that is, the best policy which repeatedly selects the same matching in as many time steps as possible. Such a benchmark policy may have very poor performance relative to the optimal matching sequence in expectation, and as such, the classical notion of regret is an unsuitable performance measure in the online SBM setting. To resolve this issue, we propose the following regret definition:

Definition 4. The dynamic α-regret of a policy π is:

R_π^α(D, µ, T) = α · SW(M*(µ, D), µ) − E_{ψ̃, π}[SW(M, µ)],

where M is the matching sequence generated when every agent reports truthfully under π.

In other words, we compare the expected performance of a matching policy against a dynamic oracle which returns a 1/α-optimal solution to the corresponding offline SBM problem, under the assumption that agents truthfully report their internal estimates of their preferences at each time step. Recall that, in the offline SBM setting, the distortion incurred by any randomised policy is at least Ω(√s). As a result, we cannot expect to construct algorithms with vanishing 1/α-regret for α < √s. In addition, one would not expect any matching policy to have low dynamic regret if the internal estimates computed by each agent are inaccurate. For example, if an agent's internal estimator simply returns a random preference ordering, then we cannot hope to learn anything about that agent's preferences. As a result, we need to make reasonable assumptions regarding the internal estimator of each agent.

Similarly to distortion in the offline SBM setting, dynamic α-regret is only a meaningful performance measure for policies which motivate agents to adopt truthful reporting policies. Inspired by the concept of incentive ratio for the offline SBM setting, we define a new notion of regret which, given a matching policy π, captures the expected gain in cumulative reward an agent can achieve by misreporting.

Definition 5. For a given matching policy π, we define agent i's α-IC regret (or α-incentive-compatible regret) as follows:

I_π^α(D, µ, T) = α · max_{ψ_i} E_{(ψ̃_{−i}, ψ_i), π}[W_i(M, µ_i)] − E_{ψ̃, π}[W_i(M, µ_i)],

where M is the matching sequence generated by π over the T time steps.

Note that for some matching policies, computing the optimal reporting policy may be computationally intractable. If agents have vanishing α-IC regret for such a policy, then truthful reporting forms a good approximation of each agent's optimal reporting policy. If this approximation is better than what can be computed by the agent, then we can expect each agent to adopt their truthful reporting policy. Thus, we seek matching policies with good guarantees with respect to both dynamic α-regret and α-IC regret.

4.1 Algorithms for Online SBM

Next, we present a matching policy which achieves meaningful guarantees with respect to both dynamic α-regret and α-IC regret. More precisely, we present the bandit repeated random serial dictatorship (BRRSD) algorithm. Before we describe BRRSD formally, we first state our assumptions regarding the internal estimator used by each agent. Let µ̂_{i,j} denote the empirical mean of the reward samples agent i receives from being assigned service j. We say that an agent i is mean-based if service a is preferred to service b in ≻̂_i^t if and only if µ̂_{i,a} ≥ µ̂_{i,b}. That is, a mean-based agent prefers services with higher empirical mean reward. From here on, we assume that all agents are mean-based. Additionally, we use Δ_min to denote the smallest gap in mean rewards between two services for the same agent. That is, Δ_min = min_{i, a ≠ b} |µ_{i,a} − µ_{i,b}|. Note that Δ_min is analogous to common complexity measures used in bandit exploration problems. Intuitively, if the mean rewards received from being assigned two services are similar, it will take more samples for a mean-based agent to decide which service they prefer.
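The mean-based assumption and the role of Δ_min can be made concrete with a short sketch (the class and function names are ours): a mean-based agent's internal estimate is simply the ranking of empirical means, and the per-pair exploration budget 2 log(2Tsn)/Δ²_min used by BRRSD's first phase, described next, shrinks as the gap Δ_min grows.

```python
import math
from collections import defaultdict

class MeanBasedAgent:
    """Maintains empirical means and reports the induced ranking (mean-based)."""
    def __init__(self, s):
        self.sums = defaultdict(float)
        self.pulls = defaultdict(int)
        self.s = s

    def observe(self, service, reward):
        self.sums[service] += reward
        self.pulls[service] += 1

    def internal_estimate(self):
        # Rank services by empirical mean; ties (and unsampled services) broken by index.
        def mean(j):
            return self.sums[j] / self.pulls[j] if self.pulls[j] else 0.0
        return sorted(range(self.s), key=lambda j: (-mean(j), j))

def exploration_budget(T, s, n, delta_min):
    """Samples per (agent, service) pair used by BRRSD's first phase:
    2 log(2Tsn) / delta_min**2, as stated in the text."""
    return math.ceil(2 * math.log(2 * T * s * n) / delta_min ** 2)

# Example: T = 10_000, s = 5 services, n = 4 agents, smallest mean gap 0.1
# -> roughly 2 * log(400000) / 0.01, i.e. about 2580 samples per pair.
print(exploration_budget(10_000, 5, 4, 0.1))
```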
We are now ready to describe BRRSD. BRRSD is split into two phases. In the first phase, BRRSD assigns each agent each service exactly 2 log(2Tsn)/Δ²_min times. BRRSD performs these assignments in a greedy manner. At each time step, BRRSD iterates through the agent-service pairs that still need to be assigned, in an arbitrary order. If an agent-service pair does not violate the blocking constraints, then it is added to the current matching. Once this iteration is completed, or all agents have been assigned services, the matching is returned and BRRSD moves on to the next time step. Once all required assignments have been completed, BRRSD waits until all services are available, matching no agents to services in the meantime. Note that this takes at most D̄ rounds. Then, BRRSD begins its second phase. At the beginning of the next time step, BRRSD observes the report profile ≻̃^t and selects matchings according to RRSD using this report profile for the remainder of the time horizon. The full pseudocode for BRRSD is deferred to the full version.

BRRSD falls in the class of explore-then-commit (ETC) algorithms common in the bandit literature. The first phase of BRRSD serves as an exploration phase in which agents learn their preference orderings. Meanwhile, the second phase of BRRSD serves as an exploitation phase in which agents have the opportunity to disclose their accumulated knowledge to the planner in the form of ordinal preferences. Observe that this decoupling of exploration and exploitation avoids complicated incentive issues that may arise for sequential algorithms which make no such clear separation. The exploration phase of BRRSD is simple relative to typical approaches in the bandit exploration literature. One may hope to apply a more sophisticated exploration scheme; however, approaches with better performance guarantees typically depend directly on the reward samples observed, to which the planner does not have access. The next theorem describes the guarantees of BRRSD.

Theorem 6. Under the assumption that agents are mean-based, the following is true for all µ and D: (i) The dynamic (1/√s)-regret of BRRSD is O(D̄ s log(Tsn)/Δ²_min). (ii) The (1 − 1/e)-IC regret of every agent under BRRSD is O(D̄ s log(Tsn)/Δ²_min). (iii) The greedy algorithm used by BRRSD in the exploration phase uses at most twice as many time steps as the shortest feasible matching sequence which completes the required assignments.

5 Conclusions and Future Work

In this paper, we introduced the sequential blocked matching (SBM) model to capture repeated one-sided matching with blocking constraints. For the offline setting, we lower bounded the performance of both deterministic and randomised policies, and presented algorithms with matching performance guarantees and bounded incentive ratios. Then, we analysed an online SBM setting, in which agents are initially unaware of their preferences and must learn them. For this setting, we presented an algorithm with sublinear regret with respect to an offline approximation oracle.

There are many interesting directions for future work. A natural generalisation would be to consider a two-sided matching setting (Roth and Sotomayor 1992) where services also hold preferences over agents. Additionally, our algorithms for both the offline and online settings are centralised. It is worth investigating whether similar performance guarantees can be achieved by a decentralised approach. Furthermore, we assumed that the preferences are static over time. It remains to be seen whether our approach generalises to settings where agents' preferences are dynamic and change over time (Bergemann and Välimäki 2019).

Acknowledgements

Nicholas Bishop was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership grant.
References

Aldhahri, E.; Shandilya, V.; and Shiva, S. 2015. Towards an Effective Crowdsourcing Recommendation System: A Survey of the State-of-the-Art. In 2015 IEEE Symposium on Service-Oriented System Engineering, 372–377.
Amanatidis, G.; Birmpas, G.; Filos-Ratsikas, A.; and Voudouris, A. A. 2021. Peeking behind the ordinal curtain: Improving distortion via cardinal queries. Artificial Intelligence, 296: 103488.
Ansari, A.; Essegaier, S.; and Kohli, R. 2000. Internet Recommendation Systems. Journal of Marketing Research, 37(3): 363–375.
Anshelevich, E.; Bhardwaj, O.; Elkind, E.; Postl, J.; and Skowron, P. 2018. Approximating optimal social choice under metric preferences. Artificial Intelligence, 264: 27–51.
Anshelevich, E.; and Postl, J. 2017. Randomized social choice functions under metric preferences. Journal of Artificial Intelligence Research, 58: 797–827.
Anshelevich, E.; and Sekar, S. 2016. Blind, greedy, and random: Algorithms for matching and clustering using only ordinal information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Basu, S.; Papadigenopoulos, O.; Caramanis, C.; and Shakkottai, S. 2021. Contextual Blocking Bandits. In Banerjee, A.; and Fukumizu, K., eds., Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, 271–279.
Basu, S.; Sen, R.; Sanghavi, S.; and Shakkottai, S. 2019. Blocking Bandits. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32.
Bergemann, D.; and Välimäki, J. 2019. Dynamic mechanism design: An introduction. Journal of Economic Literature, 57(2): 235–274.
Bishop, N.; Chan, H.; Mandal, D.; and Tran-Thanh, L. 2020. Adversarial Blocking Bandits. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 8139–8149.
Bistritz, I.; Baharav, T.; Leshem, A.; and Bambos, N. 2020. My fair bandit: Distributed learning of max-min fairness with multi-player bandits. In International Conference on Machine Learning, 930–940. PMLR.
Boutilier, C.; Caragiannis, I.; Haber, S.; Lu, T.; Procaccia, A. D.; and Sheffet, O. 2015. Optimal social choice functions: A utilitarian view. Artificial Intelligence, 227: 190–213.
Chen, N.; Deng, X.; Zhang, H.; and Zhang, J. 2012. Incentive Ratios of Fisher Markets. In Czumaj, A.; Mehlhorn, K.; Pitts, A.; and Wattenhofer, R., eds., Automata, Languages, and Programming, 464–475. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-642-31585-5.
Chuzhoy, J.; Ostrovsky, R.; and Rabani, Y. 2006. Approximation algorithms for the job interval selection problem and related scheduling problems. Mathematics of Operations Research, 31(4): 730–738.
Difallah, D. E.; Demartini, G.; and Cudré-Mauroux, P. 2013. Pick-a-Crowd: Tell Me What You Like, and I'll Tell You What to Do. In Proceedings of the 22nd International Conference on World Wide Web, 367–374.
Durlauf, S. N.; and Blume, L. 2008. The New Palgrave Dictionary of Economics. Palgrave Macmillan.
Filos-Ratsikas, A.; Frederiksen, S. K. S.; and Zhang, J. 2014. Social welfare in one-sided matchings: Random priority and beyond. In International Symposium on Algorithmic Game Theory, 1–12. Springer.
Hosseini, H.; Larson, K.; and Cohen, R. 2015a. Matching with Dynamic Ordinal Preferences. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1).
Hosseini, H.; Larson, K.; and Cohen, R. 2015b. On Manipulability of Random Serial Dictatorship in Sequential Matching with Dynamic Preferences. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1).
Isinkaye, F.; Folajimi, Y.; and Ojokoh, B. 2015. Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal, 16(3): 261–273.
Kalyanasundaram, B.; and Pruhs, K. 1993. Online Weighted Matching. Journal of Algorithms, 14(3): 478–488.
Karande, C.; Mehta, A.; and Tripathi, P. 2011. Online Bipartite Matching with Unknown Distributions. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, 587–596.
Karp, R. M.; Vazirani, U. V.; and Vazirani, V. V. 1990. An Optimal Algorithm for On-Line Bipartite Matching. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, 352–358.
Kempe, D. 2020. Communication, distortion, and randomness in metric voting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2087–2094.
Khuller, S.; Mitchell, S. G.; and Vazirani, V. V. 1994. Online algorithms for weighted bipartite matching and stable marriages. Theoretical Computer Science, 127(2): 255–267.
Kleinberg, R.; and Immorlica, N. 2018. Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 309–319. IEEE.
Kleinberg, R.; Niculescu-Mizil, A.; and Sharma, Y. 2010. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2): 245–272.
Liu, L. T.; Mania, H.; and Jordan, M. 2020. Competing bandits in matching markets. In International Conference on Artificial Intelligence and Statistics, 1618–1628. PMLR.
Mandal, D.; Procaccia, A. D.; Shah, N.; and Woodruff, D. P. 2019. Efficient and thrifty voting by any means necessary. Advances in Neural Information Processing Systems.
Neu, G.; and Valko, M. 2014. Online combinatorial optimization with stochastic decision sets and adversarial losses. In Neural Information Processing Systems.
Parkes, D. C.; and Procaccia, A. D. 2013. Dynamic Social Choice with Evolving Preferences. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 767–773.
Procaccia, A. D.; and Rosenschein, J. S. 2006. The distortion of cardinal preferences in voting. In International Workshop on Cooperative Information Agents, 317–331. Springer.
Roth, A. E.; Sönmez, T.; and Ünver, M. U. 2004. Kidney Exchange. The Quarterly Journal of Economics, 119(2): 457–488.
Roth, A. E.; and Sotomayor, M. 1992. Two-sided matching. Handbook of Game Theory with Economic Applications, 1: 485–541.
Sankararaman, A.; Basu, S.; and Sankararaman, K. A. 2021. Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship. In International Conference on Artificial Intelligence and Statistics, 1252–1260. PMLR.
Satzger, B.; Endres, M.; and Kießling, W. 2006. A Preference-Based Recommender System. In Bauknecht, K.; Pröll, B.; and Werthner, H., eds., E-Commerce and Web Technologies, 31–40.
Waldfogel, J. 2008. The median voter and the median consumer: Local private goods and population composition. Journal of Urban Economics, 63(2): 567–582.