# Group-Fair Online Allocation in Continuous Time

Semih Cayci (scayci@illinois.edu), Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801
Swati Gupta (swatig@gatech.edu), H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332
Atilla Eryilmaz (eryilmaz.2@osu.edu), Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract: The theory of discrete-time online learning has been successfully applied to many problems that involve sequential decision-making under uncertainty. However, in many applications, including contractual hiring in online freelancing platforms and server allocation in cloud computing systems, the outcome of each action is observed only after a random, action-dependent time. Furthermore, as a consequence of ethical and economic concerns, the controller may impose deadlines on the completion of each task and require fairness across different groups in the allocation of the total time budget $B$. To address these applications, we consider a continuous-time online learning problem with fairness considerations, and present a novel framework based on continuous-time utility maximization. We show that this formulation recovers reward-maximizing, max-min fair, and proportionally fair allocation rules across different groups as special cases. We characterize the optimal offline policy, which allocates the total time among the different actions in an optimally fair way (as defined by the utility function) and imposes deadlines to maximize time-efficiency. In the absence of any statistical knowledge, we propose a novel online learning algorithm based on dual ascent optimization for time averages, and prove that it achieves an $\widetilde{O}(B^{-1/2})$ regret bound.

## 1 Introduction

With the prevalence of automated decision methods and machine learning, it is important to evaluate models not only with respect to traditional objectives such as reward or model accuracy, but also to account for their impact on the individuals that interact with the system. Indeed, many studies highlight algorithmic discrimination arising at different points of the machine learning pipeline: imbalance in data [1], learnt representations [2, 3], choice of model proxies [4], and demographic group-dependent differences in the error rates of learned models [5, 6, 7], to name a few. With rising ethical and legal concerns, addressing such issues has become urgent, especially as they affect critical societal decisions involving job opportunities and hiring.

In 2014, it was estimated that 25% of the total workforce in the US was involved in some form of freelancing, and this number was predicted to grow to 40% by 2020 [8]. In reality, this percentage might be much higher, due to COVID-19 restrictions leading to increased work-from-home and changes in job opportunities [9, 10]. On online platforms, however, there is strong evidence of bias in the number of user reviews and user ratings[^4] received upon completing jobs, with significant correlations
with race, gender, location of work, and length of profiles[^5] [11]. Motivated by these problems in online contractual hiring, we study a theoretical framework for the sequential allocation of tasks to workers, where the controller (decision maker) can enforce a deadline for each task's completion. Our key contribution is to quantify the impact of reward maximization on equality of opportunity for jobs, and to develop algorithms that achieve a meaningful trade-off between the two via online utility maximization. The challenge is to maximize the total reward within a given time budget, while accounting for the random completion times of workers from different groups and for fairness in allocation.

[^4]: The mean (median) normalized rating score for White workers was 0.98 (1), while it is 0.97 (1) for Black workers on TASKRABBIT. The mean (median) rating of White workers was found to be 3.3 (4.8), 3.0 (4.6) for Black workers, 3.3 (4.8) for Asian workers, 3.6 (4.8) for workers with a picture that does not depict a person, and 1.7 (0.0) for workers with no image on FIVERR [11].
[^5]: Mean (median) number of reviews on TASKRABBIT: 33 (11) for women, 59 (15) for men. Mean (median) number of reviews on FIVERR: 65 (4) for Black workers, 104 (6) for White workers, 101 (8) for Asian workers, 94 (10) for non-human pictures, and 18 (0) for users with no image [11].

Figure 1: Freelancer profiles on UPWORK with their past performance and corresponding reviews for "fixed-price" contracts. Contractors can access these profiles and allocate fixed-time contracts with deadlines.

Formally, we consider $K$ groups of individuals who can be hired sequentially for each task, i.e., at any point exactly one individual is hired. If an individual from group $k \in [K]$ is chosen for the $n$-th task and given a contractual deadline $t$ by the controller, he/she generates a random reward of $R_{k,n}$ if the task is completed by the (random) time $X_{k,n}$ within the deadline $t$. If the task is not completed by the deadline, the controller obtains zero reward and the time until the deadline is wasted. Completion time and reward distributions are assumed to be group-dependent and i.i.d. across tasks. The objective of the controller is to maximize utility (a trade-off between total reward and fair allocation) in the offline (known distributions) and online (unknown distributions) settings under a budget constraint on time. As we will show in this paper, controlled deadlines are essential for optimal time-efficiency under the budget constraint.

The ethical problems we are concerned with involve the rate of jobs allocated to different demographic groups and the deadlines imposed on them under reward-maximization regimes [11]. Our sequential framework also applies to other settings, e.g., comparative clinical trials with varying follow-up durations, as well as server allocation in cloud computing, where jobs are drawn from different application groups and computational resources must be committed for a specified amount of time due to service-level agreements (Section 2). We will often focus on the first application, online contractual hiring, since fairness concerns are most naturally motivated in this domain.

Given a time budget $B$ and the diverse random nature of completion time and reward pairs, the main question we consider is how to distribute tasks and deadlines among different groups of people. Two potential extreme allocations are: (i) Reward-maximizing task allocation: the controller assigns all tasks to the most rewarding group to maximize the total reward within the given time budget; the other groups never receive tasks. (ii) Proportional task allocation: the controller completely ignores the reward distributions and attempts to give an equal time share to each group.
In other words, each group receives a fraction of the tasks inversely proportional to its mean completion time. There is clearly a trade-off between reward maximization and equal time-sharing in continuous-time sequential task allocation, and well-chosen utility functions [12] can model this trade-off in a unified way. In this paper, we consider a very general class of utility functions, which recovers broadly used fairness criteria such as proportional fairness, max-min fairness, and reward maximization, among many others [13, 14, 12]. The controller can determine her priorities in terms of notions of fairness and model the task allocation problem by choosing the utility function accordingly.

The main contributions of this paper are summarized as follows:

1. Incorporation of random completion time dynamics and fairness in allocation: In discrete-time online learning models, each action is assumed to take a unit completion time, so the random and diverse nature of task completion times, which is essential in many real-life applications, is ignored. In this work, we incorporate this aspect and develop a sequential learning framework in continuous time using tools from the theory of renewal processes and stochastic control. We show how controlled deadlines improve time-efficiency in continuous-time decision processes. Moreover, to the best of our knowledge, this is the first work that analyzes fair distribution policies in online contractual hiring.

2. Characterization of approximately optimal offline policies: As a consequence of the random and controlled task completion times, computing the optimal policy for fair resource allocation is PSPACE-hard, akin to unbounded stochastic knapsack problems. For tractability in design and analysis, we propose an approximation to the optimal offline policy based on Lagrange duality and renewal theory, and prove that it is asymptotically optimal. These approximate policies allocate tasks independently according to a fixed probability distribution.

3. Online learning for utility maximization: For utility maximization in an online setting with full-information feedback, we develop a novel, low-complexity online learning algorithm based on dynamic stochastic optimization methods for time averages, and show that it achieves $\widetilde{O}(B^{-1/2})$ regret for a time budget $B$. Unlike reward-maximization problems in discrete time, the optimal offline control policy in this paper is time-dependent, randomized, and optimizes time averages. Despite this, the online learning algorithm we develop adapts to the randomness in completion time-reward pairs, and achieves optimal performance with regret vanishing at a fast rate.

Related Work: The problem of fair resource allocation via utility maximization has been widely studied in economics and network management [15, 16, 17, 18]. The utility maximization approach to fair resource allocation in these papers predominantly deals with discrete-time systems, so the randomness and diversity in task completion times is ignored.
Furthermore, these works either assume perfect knowledge of rewards and completion times prior to decision-making, or assume knowledge of their statistics; hence they do not incorporate online learning. The only continuous-time utility maximization approach to fair resource allocation is [19], which assumes knowledge of first-order statistics and thus considers an offline optimization setting. Our work builds on the (offline) Lyapunov optimization methods proposed in [19] to develop online learning algorithms based on the notion of approximate Lyapunov drift. Online learning under budget constraints has been considered under the scope of bandits with knapsacks [20, 21, 22]. In the classical bandits-with-knapsacks model, the objective is to maximize the expected total reward under knapsack constraints in a stochastic setting. In [23], an interrupt mechanism is employed to incorporate continuous-time dynamics into the budget-constrained online learning model. These works focus solely on reward maximization and therefore do not address fair resource allocation. The bandits-with-knapsacks setting was extended to concave rewards and convex constraints in [24], which assumes bounded costs and rewards and does not involve a deadline mechanism in decision-making, so optimal time-efficiency in continuous time is not achieved. Our paper deviates from this line of work in that it proposes a versatile and comprehensive framework for fairness, and incorporates continuous-time dynamics into decision-making for time-efficiency. We include an extended discussion of related work in Appendix A.

## 2 Online Learning Framework for Group Fairness

We consider the sequential and fair allocation of tasks to individuals from different groups, whose completion times and rewards vary randomly. This goal differs significantly from that of traditional online learning models, which aim to maximize the expected total reward with unit completion times. In that traditional setting, the controller's goal is to find and persistently select the reward-maximizing group. As a consequence, the reward-maximization objective leads to the starvation of suboptimal groups, which causes unfairness among groups with different statistical characteristics. Next, we provide a few motivating examples with group fairness requirements.

Contractual Hiring in Online Freelancing Platforms: Online freelancing sites like UPWORK host contractual workers (freelancers) that can be hired by "contractors" who require specific tasks to be completed. Each freelancer has a profile, and performance on past tasks can be learned by the contractors via ratings and reviews (see the typical profile in Figure 1). Fixed-time contracts are popular on UPWORK, wherein contractors enforce a deadline by which the task must be completed; otherwise the contract is terminated (i.e., there is no payment). Contractors can browse profiles and post a job with a deadline to a selected set of freelancers. However, there is a large literature documenting bias in online rating systems, which in turn affects job opportunities disparately [11, 25, 26], making it critical to develop a theory of online learning for such settings.

Server Allocation in Cloud Computing: Another important application of our framework is online learning for fair resource allocation in cloud computing systems.
In a very basic setting, a single server is sequentially allocated to tasks from one of $K$ user groups, where tasks within a group exhibit similar execution-time statistics and priority levels. In many practical scenarios, the execution time of a task is unknown at the time of decision [27, 28] and exhibits power-law behavior [29], which necessitates a deadline mechanism for optimal time-efficiency [23]. In this setting, the objective of the controller is to allocate the server in an optimally fair way across the groups within a given time interval $[0, B]$, depending on the completion-time statistics and priority levels. Our work proposes a versatile framework to model fairness for this problem based on continuous-time utility maximization, and develops online learning algorithms that achieve the optimal performance with low regret in the absence of any statistical knowledge.

More examples can be found in other domains, including multi-user wireless communication over fading channels (e.g., see [23]) and comparative clinical trials with optimal follow-up duration (e.g., see [30, 31]), where the goal is to fairly share limited resources between groups of users.

Motivated by these examples, we next introduce an online learning framework that substantially expands the traditional setting to incorporate group fairness into its formulation. Suppose that there are $K \geq 1$ groups of individuals available for serving tasks, and the controller assigns tasks sequentially among these $K$ groups. If an individual from group $k$ is chosen for the $n$-th task, he/she takes $X_{k,n}$ units of completion time, and a reward of $R_{k,n}$ is obtained upon successful completion, where $X_{k,n}$ and $R_{k,n}$ are positive random variables. We assume that individuals within a group exhibit statistical similarities as a consequence of their common background; therefore the process $(X_{k,n}, R_{k,n})$ is independent and identically distributed (i.i.d.) over $n$. In order to model the possibility of highly different skill sets for individuals within a group, we allow the completion time $X_{k,n}$ to be random and potentially heavy-tailed. The completion time $X_{k,n}$ and reward $R_{k,n}$ can be correlated; e.g., in the server allocation example, the completion time $X_{k,n}$ and the size $R_{k,n}$ of a task are positively correlated [32].

Before the $n$-th task begins, the controller makes two decisions: the group $G_n \in [K]$ of the individual that will be assigned the task, and a deadline $T_n \in \mathcal{T}$, where $\mathcal{T} \subset \mathbb{R}_+$ is the set of all possible deadlines. Since $(X_{k,n}, R_{k,n})$ is i.i.d. over $n$ and unknown at the time of decision, i.e., individuals within a group are statistically symmetric, we do not consider the assignment of tasks to specific individuals within a group; the controller only chooses the group. If the task is not completed by the selected deadline $t \in \mathcal{T}$, the task is interrupted without collecting any reward; therefore we denote the reward obtained $t$ time units after initiation by $R_{k,n}(t) = R_{k,n}\,\mathbb{I}\{X_{k,n} \leq t\} \leq R_{\max}$ for some constant $R_{\max} < \infty$. In many applications, deadlines are chosen from a discrete set (e.g., days or months in contractual hiring, time slots in server allocation), so we assume a finite decision set $\mathcal{T} = \{t_1, t_2, \ldots, t_L\}$ with $t_l < \infty$ for all $l$. The sequential task allocation continues until a given time budget $B > 0$ is exceeded; therefore, the completion time of a task is as important as its reward.
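As a concrete illustration of these per-task dynamics, the short Python sketch below draws one completion time-reward pair from a Pareto model (the distribution family used in the simulations of Section 5) and applies the deadline truncation $R_{k,n}(t) = R_{k,n}\,\mathbb{I}\{X_{k,n} \leq t\}$. The function name `pareto_task` and the specific parameter values are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_task(gamma: float, beta: float):
    """Draw one (completion time, reward) pair (X_{k,n}, R_{k,n}) for a group whose
    completion time is Pareto(1, gamma) and whose reward grows as X**beta; the two
    are then positively correlated, as in the server-allocation example."""
    x = rng.pareto(gamma) + 1.0     # X_{k,n} ~ Pareto(1, gamma), possibly heavy-tailed
    return x, x ** beta             # (X_{k,n}, R_{k,n})

# Under a contractual deadline t, the controller is charged min{X, t} units of budget,
# and collects the reward only if the task finishes on time: R(t) = R * I{X <= t}.
x, r = pareto_task(1.2, 0.6)
t = 5.0
time_spent, collected = min(x, t), r * (x <= t)
```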
To describe this process mathematically, let $\mathcal{H}_{k,n-1}$ denote the feedback available for group $k$, and let $\mathcal{H}_{n-1} = \bigcup_{k \in [K]} \mathcal{H}_{k,n-1}$ denote the history before making a decision for task $n$. For a given time budget $B > 0$, a causal policy $\pi = \{\pi_1, \pi_2, \ldots\}$ sequentially makes two decisions $\pi_n = (G_n, T_n) \in [K] \times \mathcal{T}$ for each task $n$ based on the history $\mathcal{H}_{n-1}$, where $G_n$ is the chosen group and $T_n$ is the assigned deadline. Under a policy $\pi$, the number of initiated tasks is the following first-passage time:
$$N^{\pi}(B) = \inf\Big\{ n : \sum_{i=1}^{n} \min\{X_{G_i,i}, T_i\} > B \Big\}, \qquad (1)$$
which is a random and controlled stopping time. Moreover, the reward rate of any group $k$ under policy $\pi$ is:
$$r_k^{\pi}(B) = \mathbb{E}\Big[ \frac{1}{B} \sum_{n=1}^{N^{\pi}(B)} \mathbb{I}\{G_n = k\}\, R_{k,n}(T_n) \Big]. \qquad (2)$$
If $R_{k,n}(t) = \mathbb{I}\{X_{k,n} \leq t\}$, i.e., each task completion yields a unit reward, then $r_k^{\pi}(B)$ simply denotes the task completion rate (i.e., throughput) of group-$k$ individuals over the time interval $[0, B]$. Note that strategies that aim to maximize the total reward rate in (2) lead to the persistent selection of the group with the highest reward rate, at the cost of starving all the others (see [23]).

In order to address group fairness considerations, we propose a continuous-time online learning framework based on the utility maximization concept used effectively in the fair resource allocation literature (e.g., see [16]). Specifically, for a given continuously differentiable, concave, and monotonically increasing utility function $U_k: \mathbb{R} \rightarrow \mathbb{R}$, we let the utility of group $k$ under a policy $\pi$ be $U_k\big(r_k^{\pi}(B)\big)$. Then, the total utility under a policy $\pi$ over the time interval $[0, B]$ is defined as
$$U^{\pi}(B) = \sum_{k \in [K]} U_k\big(r_k^{\pi}(B)\big).$$
For a given time budget $B > 0$, the optimum total utility over a class of policies $\Pi$, and the regret of a given policy $\pi \in \Pi$, are, respectively:
$$\mathrm{OPT}_{\Pi}(B) = \max_{\pi \in \Pi} U^{\pi}(B) \quad \text{and} \quad \mathrm{REG}^{\pi}_{\Pi}(B) = \mathrm{OPT}_{\Pi}(B) - U^{\pi}(B). \qquad (3)$$
Note that, due to the monotonically increasing and concave nature of the utility functions, always allocating tasks to the most rewarding group is not a good choice: by the diminishing-returns property of concave functions, the same amount of time could yield a higher utility for another group. Thus, each given set of utility functions $\{U_k : k \in [K]\}$ defines a fairness concept. A particularly important class of utility functions is the $\alpha$-fair class, given next.

Definition 1 ($\alpha$-Fair Allocation). For any given $\alpha > 0$ and weights $w_k > 0$, let $U_k(x) = w_k \frac{x^{1-\alpha}}{1-\alpha}$ for all $k$. Resource allocation using these utility functions is called $\alpha$-fair resource allocation.

This class is attractive since it includes as special cases proportional fairness, minimum potential delay fairness, reward maximization, and max-min fairness [12].
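To make the definitions in (1)-(3) concrete, the following sketch estimates the reward rates $r_k^{\pi}(B)$ and the total utility of an arbitrary causal policy by a single Monte-Carlo rollout, so it only roughly approximates the expectation in (2). The `policy`, `samplers`, and `utilities` interfaces are our own illustrative conventions, not part of the paper; for example, the hypothetical `pareto_task` sampler above could be passed in via a closure.

```python
def run_policy(policy, samplers, budget, utilities):
    """One Monte-Carlo rollout of a causal policy over a budget period [0, B].
    Returns the empirical per-group reward rates (cf. Eq. (2)) and the total utility.

    policy    : callable(history) -> (group k, deadline t); `history` is the list of
                past tuples (k, t, X, R), so the decision is causal.
    samplers  : dict mapping group k -> callable() returning a fresh (X_{k,n}, R_{k,n}).
    utilities : dict mapping group k -> concave increasing utility function U_k.
    """
    history, elapsed = [], 0.0
    total_reward = {k: 0.0 for k in samplers}
    while elapsed <= budget:          # tasks are initiated until the budget is exceeded
        k, t = policy(history)
        x, r = samplers[k]()
        elapsed += min(x, t)          # the controller is charged min{X_{k,n}, t} either way
        if x <= t:                    # reward is collected only if completed by the deadline
            total_reward[k] += r
        history.append((k, t, x, r))
    rates = {k: v / budget for k, v in total_reward.items()}   # ~ r_k^pi(B) for one rollout
    utility = sum(utilities[k](rates[k]) for k in rates)       # sum_k U_k(r_k^pi(B))
    return rates, utility
```

Averaging the returned rates over many independent rollouts would approximate the expectation in (2).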
## 3 Approximation of the Optimal Offline Policy

Note that a simpler version of the sequential maximization problem in (3), with linear utility functions over all causal policies, is an unbounded knapsack problem, which is PSPACE-hard even with known statistics [33, 20]. Therefore, the optimal causal policy for the problem in (3) has very high computational complexity even in the offline setting, which makes it intractable for online learning. For tractability in design and analysis, we consider a class of simple policies that allocate tasks in an i.i.d. randomized way according to a fixed probability distribution over groups, and show its efficiency in this section.

Definition 2 (Stationary Randomized Policies). Let $P$ be a fixed probability distribution over $[K] \times \mathcal{T}$. A stationary randomized policy (SRP) $\pi = \pi(P)$ makes a randomized decision independently according to $P$ for every task until the time budget $B$ is depleted. In other words, under the SRP $\pi(P)$, we have $\mathbb{P}\big(\pi_n = (k, t)\big) = P(k, t)$ for all $n \leq N^{\pi}(B)$ and $(k, t) \in [K] \times \mathcal{T}$. We denote the class of all stationary randomized policies by $\Pi_S$.

Proposition 1 (Asymptotic optimality of SRPs). There exists a probability distribution $P^{\star}$ such that the stationary randomized policy $\pi(P^{\star})$ is asymptotically optimal over all causal policies as $B \rightarrow \infty$.

The proof of Proposition 1 can be found in Appendix B. In the following, we characterize the total utility under $\pi(P)$ by providing tight bounds.

Proposition 2. Let $P$ be any probability distribution over $[K] \times \mathcal{T}$. Then, the reward per unit time for group $k$ under the stationary randomized policy $\pi(P)$ is
$$\rho_k(P) = \frac{\sum_{t \in \mathcal{T}} P(k, t)\, \mathbb{E}[R_{k,1}(t)]}{\sum_{(i,t) \in [K] \times \mathcal{T}} P(i, t)\, \mathbb{E}[\min\{X_{i,1}, t\}]}, \qquad k \in [K].$$
Consequently, the total utility under the stationary randomized policy $\pi(P)$ is bounded as
$$\sum_{k \in [K]} U_k\big(\rho_k(P)\big) \;\leq\; \sum_{k \in [K]} U_k\big(r_k^{\pi(P)}(B)\big) \;\leq\; \sum_{k \in [K]} U_k\big(\rho_k(P)\big) + O\Big(\frac{1}{B}\Big).$$

We include the complete proof of Proposition 2 in Appendix B. The key idea is that under an SRP, the total reward of a group $k$ is a regenerative process. Then, by the theory of stopped random walks for regenerative processes, the reward per unit time under $\pi(P)$ is found to be $\rho_k(P)$, and the upper bound on the total utility follows from Lorden's inequality [34] and the concavity of $U_k$. Proposition 2 emphasizes the significance of the reward per unit time $\rho_k(P)$. In conjunction with Proposition 1, this suggests that using a probability distribution that maximizes the limiting total utility is an effective offline approximation.

Definition 3 (Optimal Stationary Randomized Policy). Let $P^{\star}$ be a probability distribution defined as
$$P^{\star} \in \arg\max_{P} \sum_{k \in [K]} U_k\big(\rho_k(P)\big).$$
Then, the optimal SRP $\pi^{\star}$ makes a selection independently for every task according to $P^{\star}$: $\mathbb{P}\big(\pi^{\star}_n = (k, t)\big) = P^{\star}(k, t)$ for all $(k, t) \in [K] \times \mathcal{T}$ and $n \leq N^{\pi^{\star}}(B)$.

An interesting question regarding $P^{\star}$ is the choice of deadline for each group. The following proposition characterizes the optimal deadline policy under $\pi^{\star}$, and yields a significant simplification in finding the optimal policy by reducing the size of the search space.

Proposition 3 (Optimal Deadline Policy). For any $k$, the optimal probability distribution $P^{\star}$ makes a deterministic deadline decision for group $k$, that is, $|\{t \in \mathcal{T} : P^{\star}(k, t) > 0\}| \leq 1$. For any $k$, we denote by $t_k^{\star} \in \mathcal{T}$ the (unique) optimal deadline for group $k$ such that $P^{\star}(k, t_k^{\star}) > 0$.

The detailed proof of Proposition 3 can be found in Appendix C. As we will see later, we can explicitly characterize the optimal deadline for the broad class of utility functions used for so-called $\alpha$-fair allocations. In the following, we use Proposition 2 to characterize the performance of the optimal SRP.

Proposition 4 (Optimal Total Utility). For any group $k$, let $t_k^{\star} \in \mathcal{T}$ be the (unique) optimal deadline from Proposition 3, let $r_k^{\star} = \mathbb{E}[R_{k,1}(t_k^{\star})] / \mathbb{E}[\min\{X_{k,1}, t_k^{\star}\}]$ be the reward per processing time for group $k$, and let
$$\phi_k = \frac{P^{\star}(k, t_k^{\star})\, \mathbb{E}[\min\{X_{k,1}, t_k^{\star}\}]}{\sum_{j \in [K]} P^{\star}(j, t_j^{\star})\, \mathbb{E}[\min\{X_{j,1}, t_j^{\star}\}]} \qquad (4)$$
be the fraction of the time budget allocated to group $k$ under $\pi(P^{\star})$. Then, for any SRP $\pi(P)$, the total utility is bounded as
$$\sum_{k} U_k\big(\rho_k(P)\big) \leq \sum_{k} U_k\Big( (U_k')^{-1}\big(\lambda^{\star}/r_k^{\star}\big) \Big),$$
where the upper bound is achieved by the probability distribution that satisfies $\phi_k = \frac{1}{r_k^{\star}} (U_k')^{-1}\big(\lambda^{\star}/r_k^{\star}\big)$, with $\lambda^{\star}$ chosen such that $\sum_k \phi_k = 1$.

The proof of Proposition 4 follows from Lagrange duality and Proposition 3, and can be found in Appendix D.
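Proposition 4 reduces the offline design to a one-dimensional search over the dual variable $\lambda^{\star}$. The sketch below carries out that search by bisection, using only the fact that $(U_k')^{-1}$ is decreasing; the function names are ours, and the proportional-fairness example at the end is included only as a sanity check against Corollary 1(i) below.

```python
import numpy as np

def optimal_time_fractions(r_star, u_prime_inv, lam_hi=1e6, tol=1e-10):
    """Solve for the time-budget fractions phi_k of Proposition 4 by bisection on the
    dual variable lambda: phi_k(lambda) = (U_k')^{-1}(lambda / r_k*) / r_k* is
    decreasing in lambda, so we search for the lambda with sum_k phi_k = 1.

    r_star      : array of reward-per-processing-time values r_k*
    u_prime_inv : list of callables, (U_k')^{-1} for each group
    """
    def phis(lam):
        return np.array([u_prime_inv[k](lam / r) / r for k, r in enumerate(r_star)])

    lam_lo = 1e-12
    for _ in range(200):                  # bisection on the scalar dual variable
        lam = 0.5 * (lam_lo + lam_hi)
        if phis(lam).sum() > 1.0:
            lam_lo = lam                  # fractions too large -> increase lambda
        else:
            lam_hi = lam
        if lam_hi - lam_lo < tol:
            break
    return phis(0.5 * (lam_lo + lam_hi))

# Example: proportionally fair utilities U_k(x) = w_k log x, so (U_k')^{-1}(y) = w_k / y.
w = np.array([1.0, 2.0])
r = np.array([0.8, 0.5])
phi = optimal_time_fractions(r, [lambda y, wk=wk: wk / y for wk in w])
# phi is close to w / w.sum() = [1/3, 2/3], as predicted by Corollary 1(i) below.
```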
Note that the above analysis is quite general, in the sense that it holds for any set of continuously differentiable and concave utility functions $\{U_k : k \in [K]\}$. In the following, we apply the results to the class of $\alpha$-fair allocations (cf. Definition 1) and discuss their implications.

Proposition 5 ($\alpha$-Fair Resource Allocation in Continuous Time). For any group $k$, the optimal deadline maximizes the reward per processing time:
$$t_k^{\star} = \arg\max_{t \in \mathcal{T}} \frac{\mathbb{E}[R_{k,1}(t)]}{\mathbb{E}[\min\{X_{k,1}, t\}]}.$$
Let $r_k^{\star} = \max_{t \in \mathcal{T}} \frac{\mathbb{E}[R_{k,1}(t)]}{\mathbb{E}[\min\{X_{k,1}, t\}]}$ be the reward per processing time and $\mu_k = \mathbb{E}[\min\{X_{k,1}, t_k^{\star}\}]$ be the mean processing time for group $k$ under the optimal deadline policy. Then, for any $\alpha > 0$, we have the following results for $\alpha$-fair utility functions:
$$\max_{P} U^{\pi(P)}(B) = \frac{1}{1 - \alpha} \Big( \sum_{k \in [K]} (r_k^{\star})^{\frac{1}{\alpha} - 1} w_k^{\frac{1}{\alpha}} \Big)^{\alpha}, \qquad (5)$$
where the optimum probability distribution $P^{\star}$ and the optimum fraction of the time budget $\phi_k$ allocated to group $k$ are, respectively,
$$P^{\star}(k, t) = \mathbb{I}\{t = t_k^{\star}\}\, \frac{w_k^{\frac{1}{\alpha}} (r_k^{\star})^{\frac{1}{\alpha} - 1} / \mu_k}{\sum_{j \in [K]} w_j^{\frac{1}{\alpha}} (r_j^{\star})^{\frac{1}{\alpha} - 1} / \mu_j}, \qquad \phi_k = \frac{(r_k^{\star})^{\frac{1}{\alpha} - 1} w_k^{\frac{1}{\alpha}}}{\sum_{j \in [K]} (r_j^{\star})^{\frac{1}{\alpha} - 1} w_j^{\frac{1}{\alpha}}},$$
for all $(k, t) \in [K] \times \mathcal{T}$.

Proposition 5 characterizes the optimal stationary randomized policy for a wide class of utility functions, and shows that in this setting the optimal deadline for a group is chosen to maximize that group's reward per processing time. To gain a clearer understanding of the notion of $\alpha$-fairness, we consider the following special cases.

Corollary 1. For any given set of weights $\{w_k > 0 : k \in [K]\}$, we have the following results for the continuous-time $\alpha$-fair resource allocation problem for various values of $\alpha > 0$.

(i) Proportional fairness: In this case, we have $\lim_{\alpha \rightarrow 1} U_k(x) = w_k \log(x)$ for all $k$. Let $\mu_k = \mathbb{E}[\min\{X_{k,1}, t_k^{\star}\}]$ be the mean processing time for group $k$. Then, the optimum utility is achieved by the probability distribution
$$P^{\star}(k, t) = \mathbb{I}\{t = t_k^{\star}\}\, \frac{w_k / \mu_k}{\sum_{j \in [K]} w_j / \mu_j}, \qquad (k, t) \in [K] \times \mathcal{T},$$
thus $\phi_k = \frac{w_k}{\sum_{j \in [K]} w_j}$ for all $k$, which implies that the time budget $B$ is allocated to each group $k$ in proportion to its weight $w_k$, and the optimal total utility is $\mathrm{OPT}_{\Pi_S}(B) = \sum_{k} w_k \log\Big( \frac{r_k^{\star} w_k}{\sum_{j \in [K]} w_j} \Big) + O\big(\frac{1}{B}\big)$.

(ii) Reward maximization: If $\alpha = 0$, we have $U_k(x) = w_k x$ for all $k$. Let $k^{\star} = \arg\max_{k \in [K]} w_k r_k^{\star}$ be the group with the highest weighted reward rate. Then, the optimal probability distribution is $P^{\star}(k, t) = \mathbb{I}\{k = k^{\star}, t = t_k^{\star}\}$ for all $(k, t)$. Thus, $\mathrm{OPT}_{\Pi_S}(B) = \max_{k \in [K]} w_k r_k^{\star} + O(1/B)$.

Remark 1. Note that the optimal deadline $t_k^{\star}$ for any group $k$ is chosen so as to maximize the reward per processing time of group $k$. Under proportional fairness ($\alpha \rightarrow 1$), the controller distributes the time budget in proportion to the group weights, i.e., $\phi_k = w_k / \sum_j w_j$, which reduces to equal time-sharing under uniform weights. To achieve this, the controller allocates tasks with probability inversely proportional to the mean processing time $\mu_k$. Under reward maximization ($\alpha = 0$), the controller allocates the entire time budget $B$ to the single group that yields the highest (weighted) reward per processing time, i.e., $\phi_k = \mathbb{I}\{k = k^{\star}\}$, so as to maximize the expected total reward. As such, $\alpha$-fairness models the trade-off between reward maximization and equal (i.e., reward-insensitive) time-sharing for any $\alpha \in [0, 1)$. Furthermore, the $\alpha$-fair utility maximization framework includes max-min fairness ($\alpha \rightarrow \infty$) and minimum potential delay fairness ($\alpha = 2$) as special cases.
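For $\alpha$-fair utilities the optimal allocation is available in closed form, so no dual search is needed. The following sketch evaluates the expressions of Proposition 5 (with the $\alpha = 0$ case handled separately as in Corollary 1(ii)); it assumes the per-group quantities $r_k^{\star}$ and $\mu_k$ are already known or estimated, and the function name is ours.

```python
import numpy as np

def alpha_fair_srp(r_star, mu, w, alpha):
    """Closed-form optimal stationary randomized policy for alpha-fair utilities
    (Proposition 5): returns the time fractions phi_k and the task-selection
    probabilities P*(k, t_k*) over groups, each group using its own optimal deadline.
    alpha = 0 recovers (weighted) reward maximization; alpha -> 1 proportional fairness."""
    r_star = np.asarray(r_star, dtype=float)
    mu = np.asarray(mu, dtype=float)
    w = np.asarray(w, dtype=float)
    if alpha == 0.0:                                   # reward maximization: all budget to k*
        phi = np.zeros_like(r_star)
        phi[np.argmax(w * r_star)] = 1.0
    else:
        score = (w ** (1.0 / alpha)) * (r_star ** (1.0 / alpha - 1.0))
        phi = score / score.sum()                      # phi_k from Proposition 5
    p = (phi / mu) / (phi / mu).sum()                  # P*(k, t_k*) proportional to phi_k / mu_k
    return phi, p

# Example: two groups, r* = (0.8, 0.5), mu = (2.0, 3.0), uniform weights,
# alpha = 2 (minimum potential delay fairness).
phi, p = alpha_fair_srp([0.8, 0.5], [2.0, 3.0], [1.0, 1.0], alpha=2.0)
```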
## 4 Online Learning for Utility Maximization (OLUM)

In the previous section, we provided key results on asymptotically optimal approximations to the offline utility maximization problem. In this section, we build on these results to attack the online learning problem for continuous-time fair allocation. In particular, we propose a novel online learning algorithm for the fair resource allocation problem based on renewal theory, Lyapunov optimization (see [19, 35]), and PAC-bandit techniques, and show that it achieves $\widetilde{O}(B^{-1/2})$ regret.

Feedback model: We assume a delayed full-information feedback model in which the completion time and reward of all groups for task $n$ are revealed to the controller at stage $n + \tau$ for some delay $\tau \geq 1$. This assumption holds approximately in our target applications. In freelancing platforms, there are often multiple contractors hiring freelancers for various tasks; it is often possible to obtain information on freelancers employed by other companies, and their reviews can serve as feedback for the controller. Competition-hosting websites like TOPCODER have also recently been catering to businesses that need fast prototyping using freelancers. In that business model, a controller might invest in a few top coders at a time, yet she can potentially access updated rankings (quality and time to complete tasks) via TOPCODER competitions over time. In server applications such as Amazon AWS and Microsoft Azure, although a controller might be optimizing operations on a local set of servers, it can request task performance data from a centralized server or scheduler after a time delay [36]. Even this feedback model presents technical challenges due to random completion times, as we discuss next.

In order to design the online learning algorithm, we define, for any $(k, t) \in [K] \times \mathcal{T}$, the empirical estimates of the mean completion time and reward after $n$ stages as
$$\widehat{\mu}_{k,n}(t) = \frac{1}{n} \sum_{i=1}^{n} \min\{t, X_{k,i}\}, \qquad \widehat{\theta}_{k,n}(t) = \frac{1}{n} \sum_{i=1}^{n} R_{k,i}(t).$$

Definition 4 (OLUM Algorithm). For any $k$, let $Q_{k,0} = 1$ and let $Q_{k,i}$ be defined recursively as
$$Q_{k,i+1} = \Big[ Q_{k,i} + \gamma_k(i)\, \min\{X_{G_i,i}, T_i\} - R_{k,i}(T_i)\, \mathbb{I}\{G_i = k\} \Big]^{+}, \qquad i > 0, \qquad (6)$$
where the auxiliary variable is $\gamma_k(i) = (U_k')^{-1}\big(Q_{k,i} / V\big)$ and $V > 0$ is a design parameter. Then, for task $n$, the OLUM Algorithm, denoted by $\pi_{\mathrm{OLUM}}$, makes the decision
$$(G_n, T_n) \in \arg\max_{(k,t) \in [K] \times \mathcal{T}} \frac{\widehat{\theta}_{k,n-\tau}(t)\, Q_{k,n}}{\widehat{\mu}_{k,n-\tau}(t)}.$$
Upon observing the corresponding feedback, the controller updates $Q_{k,n+1}$ via (6).

Interpretation: The OLUM Algorithm maximizes, at each round, the empirical reward per unit time weighted by $Q_{k,n}$. For any $k \in [K]$, if the sequence $Q_{k,n}$ grows very large, then group $k$'s reward rate is much smaller than the optimal value, and the controller tends to select that group. In other words, the magnitude of $Q_{k,n}$ measures the unfairness that group $k$ has endured up to stage $n$. The algorithm is designed to balance the weights $Q_{k,n}$ so as to maximize the total utility. In the following theorem, we prove regret bounds for the OLUM Algorithm.

Theorem 1 (Regret bounds for OLUM). For any $V > 0$ and constant delay $\tau$, the regret under $\pi_{\mathrm{OLUM}}$ is bounded as
$$\mathrm{REG}^{\pi_{\mathrm{OLUM}}}_{\Pi_S}(B) = O\Big( \sqrt{\tfrac{\log(B)}{B}} + \tfrac{1}{V} \Big).$$
By choosing $V = \Theta\big(\sqrt{B/\log(B)}\big)$, we obtain $\mathrm{REG}^{\pi_{\mathrm{OLUM}}}_{\Pi_S}(B) = O\big(\sqrt{\log(B)/B}\big) = \widetilde{O}(1/\sqrt{B})$.

The proof is based on renewal theory, Lyapunov optimization theory, and PAC-type bounds, and can be found in Appendix E.
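A compact sketch of the OLUM Algorithm is given below, specialized to $\alpha$-fair utilities (for which $(U_k')^{-1}(y) = (w_k / y)^{1/\alpha}$) and to feedback delay $\tau = 1$. The class name, the small numerical floor on the virtual queues (which stands in for the projection $[\cdot]^{+}$ in (6) so that $\gamma_k(i)$ stays finite), and the arbitrary choice in the first round are our implementation choices, not prescribed by the paper.

```python
import numpy as np

class OLUM:
    """Sketch of the OLUM Algorithm (Definition 4) under full-information feedback with
    delay tau = 1, for alpha-fair utilities U_k(x) = w_k x^(1-alpha)/(1-alpha), alpha > 0,
    so that gamma_k(i) = (U_k')^{-1}(Q_{k,i} / V) = (w_k V / Q_{k,i})^(1/alpha)."""

    def __init__(self, num_groups, deadlines, w, alpha, V):
        self.deadlines = list(deadlines)
        self.w, self.alpha, self.V = np.asarray(w, dtype=float), alpha, V
        self.Q = np.ones(num_groups)                              # virtual queues, Q_{k,0} = 1
        self.time_sum = np.zeros((num_groups, len(deadlines)))    # running sums for mu-hat
        self.reward_sum = np.zeros((num_groups, len(deadlines)))  # running sums for theta-hat
        self.n = 0                                                # number of observed tasks

    def select(self):
        """Decision rule: (G_n, T_n) in argmax_{k,t} theta-hat_{k}(t) * Q_k / mu-hat_{k}(t)."""
        if self.n == 0:                               # no feedback yet: arbitrary first choice
            return 0, len(self.deadlines) - 1
        score = (self.reward_sum / self.n) * self.Q[:, None] / (self.time_sum / self.n + 1e-12)
        k, l = np.unravel_index(np.argmax(score), score.shape)
        return int(k), int(l)                         # group index and deadline index

    def update(self, k_sel, l_sel, X_all, R_all):
        """Full-information feedback for the current task: X_all[k], R_all[k] for every group.
        Updates the empirical estimates and the virtual queues via Eq. (6)."""
        X_all, R_all = np.asarray(X_all, dtype=float), np.asarray(R_all, dtype=float)
        t_vec = np.asarray(self.deadlines, dtype=float)
        self.time_sum += np.minimum(X_all[:, None], t_vec[None, :])
        self.reward_sum += R_all[:, None] * (X_all[:, None] <= t_vec[None, :])
        frame = min(X_all[k_sel], self.deadlines[l_sel])            # min{X_{G_n,n}, T_n}
        gamma = (self.w * self.V / self.Q) ** (1.0 / self.alpha)    # (U_k')^{-1}(Q_{k,n} / V)
        collected = np.zeros_like(self.Q)
        if X_all[k_sel] <= self.deadlines[l_sel]:
            collected[k_sel] = R_all[k_sel]                         # R_{k,n}(T_n) I{G_n = k}
        self.Q = np.maximum(self.Q + gamma * frame - collected, 1e-12)   # Eq. (6), floored
        self.n += 1
```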
## 5 Simulations

We implemented the OLUM Algorithm on a fair resource allocation problem with $K = 2$ groups. In the application domains considered in Section 2, task completion times naturally follow a power-law distribution. In the contractual online hiring setting, the creativity of individuals has been shown to follow a Pareto$(1, \gamma)$ distribution with exponent $\gamma > 1$, where $\gamma$ depends on the field of expertise [37]. Motivated by this application, we consider the following group statistics:

- Group 1: $X_{k,n} \sim \text{Pareto}(1, 1.2)$ and $R_{k,n}(t) = X_{k,n}^{0.6}\, \mathbb{I}\{X_{k,n} \leq t\}$
- Group 2: $X_{k,n} \sim \text{Pareto}(1, 1.4)$ and $R_{k,n}(t) = X_{k,n}^{0.2}\, \mathbb{I}\{X_{k,n} \leq t\}$

The reward per processing time as a function of the deadline is shown in Figure 2 (left). For this setting, we implemented the OLUM Algorithm with parameter $V = 20$, and considered $\alpha$-fair resource allocation problems with various values of $\alpha$. In Figure 2 (right), we present the simulation results for $\phi_2$, i.e., the average fraction of the time budget $B$ allocated to Group-2 individuals under the OLUM Algorithm. For these experiments, we chose $w_k = 1$ for $k = 1, 2$ and ran the OLUM Algorithm for 1000 trials per setting. Note that the optimal reward per processing time of Group-1 individuals is higher than that of Group-2 individuals, so Group 1 is chosen under reward maximization. Under proportional fairness, the time budget is distributed equally between Group-1 and Group-2 individuals. We observe from Figure 2 that the OLUM Algorithm converges to the optimal operating points very quickly, which verifies the theoretical results presented above.

Figure 2: (Left) Reward per processing time for each group as a function of the deadline $t$. (Right) Fraction of the time budget assigned to Group-2 individuals under the OLUM Algorithm for various fairness criteria: $\alpha = 0$ (reward maximization), $\alpha = 0.5$, $\alpha = 1$ (proportional fairness), $\alpha = 2$ (minimum potential delay fairness), and $\alpha \rightarrow \infty$ (max-min fairness).

In the second example, we consider a fair server allocation problem with $K = 5$ user groups. In processing systems with shared servers, empirical studies indicate that the distribution of job execution times (i.e., file sizes) can be accurately approximated by a Pareto$(1, \gamma)$ distribution with $\gamma \in (0, 2)$ [38, 39, 40]. Thus, we consider Pareto-distributed completion times for each group with varying exponents: $X_{k,n} \sim \text{Pareto}(1, \alpha_k)$ for $\alpha = (1.25, 1.50, 1.75, 2.00, 2.25)$. The reward of group $k$ under a deadline $t \in \mathcal{T}$ is $R_{k,n}(t) = \rho_k X_{k,n}^{\beta_k}\, \mathbb{I}\{X_{k,n} \leq t\} \leq \rho_k t_L^{\beta_k}$, where $\beta_k$ measures the correlation between completion time and reward. For this example, we take $\beta = (0, 0.125, 0.25, 0.375, 0.5)$ and $\rho = (3.0, 2.0, 3.0, 1.0, 1.44)$. The reward per processing time as a function of the deadline is shown in Figure 3 (left), where the deadline choices $t \in \mathcal{T}$ are marked by squares. We observe that the optimal deadline depends critically on the tail exponent and on the correlation between completion time and reward [23]. The performance of the OLUM Algorithm for proportionally fair allocation of the server (i.e., $U_k(x) = \log(x)$ for all $k \in [K]$) is shown in Figure 3 (right) for various choices of the parameter $V$. In Theorem 1, we showed that $\widetilde{O}(1/\sqrt{B})$ regret is achieved when the parameter $V$ is chosen as $V = \Theta\big(\sqrt{B/\log(B)}\big)$ for a time budget $B$. Figure 3 verifies that this scaling of $V$ is also necessary for good practical performance.

Figure 3: (Left) Reward per processing time for each group. (Right) Total utility for various time budgets $B$ and choices of the design parameter $V$.
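The left panel of Figure 2 can be approximated with a few lines of Monte-Carlo, shown below for the two group models above; the deadline grid and sample size are our guesses for illustration, not the exact values used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward_per_processing_time(gamma, beta, deadlines, num_samples=200_000):
    """Monte-Carlo estimate of E[R_{k,1}(t)] / E[min{X_{k,1}, t}] as a function of the
    deadline t, for Pareto(1, gamma) completion times and reward X**beta (the setting
    of the first simulation example)."""
    x = rng.pareto(gamma, size=num_samples) + 1.0       # X ~ Pareto(1, gamma)
    out = []
    for t in deadlines:
        reward = np.where(x <= t, x ** beta, 0.0).mean()   # E[R(t)] = E[X^beta I{X <= t}]
        cost = np.minimum(x, t).mean()                      # E[min{X, t}]
        out.append(reward / cost)
    return np.array(out)

deadlines = np.arange(1, 26)
rpt_group1 = reward_per_processing_time(1.2, 0.6, deadlines)   # Group 1
rpt_group2 = reward_per_processing_time(1.4, 0.2, deadlines)   # Group 2
# The deadline maximizing each curve approximates t_k^* in Proposition 5.
```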
## 6 Conclusion

In this paper, we proposed a versatile and comprehensive framework for continuous-time online resource allocation with fairness considerations, and developed a no-regret learning algorithm for this problem under a delayed full-information feedback model. Although full-information feedback is available in many application scenarios, there are cases in which the controller does not have access to full feedback, and a mechanism that incorporates bandit feedback is then required. The online learning framework introduced in this paper can be extended to bandit feedback; one way to achieve this might be to replace the empirical estimates with upper confidence bounds in the OLUM Algorithm, although this complicates the analysis further. We leave the design and analysis of bandit algorithms in this setting as future work.

## Broader Impact

Our work develops the theory of fair online learning, specifically analyzing the impact of reward-maximizing allocation policies on opportunities for different groups of people. Our proposal analyzes the trade-offs across various allocation policies (ranging from profit-maximizing to equal opportunity for all), thus highlighting the choice of objectives that controllers should carefully consider. This work does not have any foreseeable negative ethical or societal impact.

## Acknowledgments and Disclosure of Funding

This work was completed when S. Cayci was a Ph.D. candidate and research assistant at The Ohio State University, Department of Electrical and Computer Engineering. This research was supported in part by NSF grants CNS-NeTS-1717045, CMMI-SMOR-1562065, CNS-ICN-WEN-1719371, CNS-SpecEES-1824337, and CNS-NeTS-2007231; ONR grant N00014-19-1-2621; and DTRA grant HDTRA1-18-1-0050. S. Gupta gratefully acknowledges support from NSF grant CRII 1850182.

## References

[1] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, Men also like shopping: Reducing gender bias amplification using corpus-level constraints, arXiv preprint arXiv:1707.09457, 2017.
[2] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, in Advances in Neural Information Processing Systems, 2016, pp. 4349-4357.
[3] A. Caliskan, J. J. Bryson, and A. Narayanan, Semantics derived automatically from language corpora contain human-like biases, Science, vol. 356, no. 6334, pp. 183-186, 2017.
[4] K. Lum and W. Isaac, To predict and serve? Significance, vol. 13, no. 5, pp. 14-19, 2016.
[5] A. L. Washington, How to argue with an algorithm: Lessons from the COMPAS-ProPublica debate, Colo. Tech. LJ, vol. 17, p. 131, 2018.
[6] J. Kleinberg, Inherent trade-offs in algorithmic fairness, in Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, 2018, pp. 40-40.
[7] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data, vol. 5, no. 2, pp. 153-163, 2017.
[8] G. Laumeister, The next big thing in e-commerce: Online labor marketplaces, Forbes (Online), 2014.
[9] H. Torry, Coronavirus pandemic deepens labor divide between online, offline workers, Wall Street Journal, 2020.
[10] A. Teeley, There are 57 million U.S. independent professionals; Upwork wants them all to succeed, Built In Chicago, 2020.
[11] A. Hannák, C. Wagner, D. Garcia, A. Mislove, M. Strohmaier, and C. Wilson, Bias in online freelance marketplaces: Evidence from TaskRabbit and Fiverr, in Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 2017, pp. 1914-1933.
[12] R. Srikant and L. Ying, Communication Networks: An Optimization, Control, and Stochastic Networks Perspective. Cambridge University Press, 2013.
[13] D. Bertsimas, V. F. Farias, and N. Trichakis, On the efficiency-fairness trade-off, Management Science, vol. 58, no. 12, pp. 2234-2250, 2012.
[14] K. Jain and V. V. Vazirani, Eisenberg-Gale markets: Algorithms and structural properties, in Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 2007, pp. 364-373.
[15] D. P. Palomar and M. Chiang, A tutorial on decomposition methods for network utility maximization, IEEE Journal on Selected Areas in Communications, vol. 24, no. 8, pp. 1439-1451, 2006.
[16] A. Eryilmaz and R. Srikant, Fair resource allocation in wireless networks using queue-length-based scheduling and congestion control, IEEE/ACM Transactions on Networking, vol. 15, no. 6, pp. 1333-1344, 2007.
[17] H. J. Kushner and P. A. Whiting, Convergence of proportional-fair sharing algorithms under general conditions, IEEE Transactions on Wireless Communications, vol. 3, no. 4, pp. 1250-1259, 2004.
[18] D. Kahneman and R. H. Thaler, Anomalies: Utility maximization and experienced utility, Journal of Economic Perspectives, vol. 20, no. 1, pp. 221-234, 2006.
[19] M. J. Neely, Dynamic optimization and learning for renewal systems, IEEE Transactions on Automatic Control, vol. 58, no. 1, pp. 32-46, 2012.
[20] A. Badanidiyuru, R. Kleinberg, and A. Slivkins, Bandits with knapsacks, Journal of the ACM (JACM), vol. 65, no. 3, pp. 1-55, 2018.
[21] L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings, Knapsack based optimal policies for budget-limited multi-armed bandits, in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[22] A. Slivkins, Introduction to multi-armed bandits, arXiv preprint arXiv:1904.07272, 2019.
[23] S. Cayci, A. Eryilmaz, and R. Srikant, Learning to control renewal processes with bandit feedback, Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 3, no. 2, pp. 1-32, 2019.
[24] S. Agrawal and N. R. Devanur, Bandits with concave rewards and convex knapsacks, in Proceedings of the Fifteenth ACM Conference on Economics and Computation, 2014, pp. 989-1006.
[25] A. Rosenblat, K. E. Levy, S. Barocas, and T. Hwang, Discriminating tastes: Customer ratings as vehicles for bias, Available at SSRN 2858946, 2016.
[26] A. Chakraborty, A. Hannak, A. J. Biega, and K. P. Gummadi, Fair sharing for sharing economy platforms, 2017.
[27] M. Harchol-Balter, Task assignment with unknown duration, in Proceedings of the 20th IEEE International Conference on Distributed Computing Systems. IEEE, 2000, pp. 214-224.
[28] R. Motwani, S. Phillips, and E. Torng, Nonclairvoyant scheduling, Theoretical Computer Science, vol. 130, no. 1, pp. 17-47, 1994.
[29] M. Harchol-Balter and A. B. Downey, Exploiting process lifetime distributions for dynamic load balancing, ACM Transactions on Computer Systems (TOCS), vol. 15, no. 3, pp. 253-285, 1997.
[30] K. Kim and A. A. Tsiatis, Study duration for clinical trials with survival response and early stopping rule, Biometrics, pp. 81-92, 1990.
[31] P. F. Thall, R. Simon, and S. S. Ellenberg, Two-stage selection and testing designs for comparative clinical trials, Biometrika, vol. 75, no. 2, pp. 303-310, 1988.
[32] P. R. Jelenković and J. Tan, Characterizing heavy-tailed distributions induced by retransmissions, Advances in Applied Probability, vol. 45, no. 1, pp. 106-138, 2013.
[33] C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of optimal queuing network control, Mathematics of Operations Research, vol. 24, no. 2, pp. 293-305, 1999.
[34] S. Asmussen, Applied Probability and Queues. Springer Science & Business Media, 2008, vol. 51.
[35] M. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool Publishers, 2010.
[36] R. Zabolotnyi, P. Leitner, and S. Dustdar, Profiling-based task scheduling for factory-worker applications in infrastructure-as-a-service clouds, in 2014 40th EUROMICRO Conference on Software Engineering and Advanced Applications. IEEE, 2014, pp. 119-126.
[37] J. Kleinberg and M. Raghavan, Selection problems in the presence of implicit bias, arXiv preprint arXiv:1801.03533, 2018.
[38] M. Harchol-Balter, The effect of heavy-tailed job size distributions on computer system design, in Proc. of ASA-IMS Conf. on Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics, 1999.
[39] W. J. Reed and B. D. Hughes, From gene families and genera to incomes and internet file sizes: Why power laws are so common in nature, Physical Review E, vol. 66, no. 6, p. 067103, 2002.
[40] W. Gong, Y. Liu, V. Misra, and D. Towsley, On the tails of web file size distributions, in Proceedings of the Annual Allerton Conference on Communication, Control and Computing, vol. 39, no. 1, 2001, pp. 192-201.
[41] J. F. Nash Jr., The bargaining problem, Econometrica: Journal of the Econometric Society, pp. 155-162, 1950.
[42] J. W. Pratt, Risk aversion in the small and in the large, in Uncertainty in Economics. Elsevier, 1978, pp. 59-79.
[43] N. Nisan and A. Ronen, Computationally feasible VCG mechanisms, Journal of Artificial Intelligence Research, vol. 29, pp. 19-47, 2007.
[44] J. Mo and J. Walrand, Fair end-to-end window-based congestion control, IEEE/ACM Transactions on Networking, vol. 8, no. 5, pp. 556-567, 2000.
[45] F. Kelly, Charging and rate control for elastic traffic, European Transactions on Telecommunications, vol. 8, no. 1, pp. 33-37, 1997.
[46] L. Tassiulas and A. Ephremides, Jointly optimal routing and scheduling in packet radio networks, IEEE Transactions on Information Theory, vol. 38, no. 1, pp. 165-168, 1992.
[47] L. Tassiulas and A. Ephremides, Dynamic server allocation to parallel queues with randomly varying connectivity, IEEE Transactions on Information Theory, vol. 39, no. 2, pp. 466-478, 1993.
[48] M. J. Neely, A Lyapunov optimization approach to repeated stochastic games, in 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2013, pp. 1082-1089.
[49] S. Agrawal and N. Devanur, Linear contextual bandits with knapsacks, in Advances in Neural Information Processing Systems, 2016, pp. 3450-3458.
[50] K. A. Sankararaman and A. Slivkins, Combinatorial semi-bandits with knapsacks, arXiv preprint arXiv:1705.08110, 2017.
[51] A. Gut, Stopped Random Walks. Springer, 2009.
[52] S. Cayci, A. Eryilmaz, and R. Srikant, Budget-constrained bandits over general cost and reward distributions, arXiv preprint arXiv:2003.00365, 2020.
[53] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019, vol. 48.
[54] B. Hajek, Hitting-time and occupation-time bounds implied by drift analysis with applications, Advances in Applied Probability, vol. 14, no. 3, pp.
502-525, 1982.