# Fairness of Exposure in Stochastic Bandits

Lequn Wang, Yiwei Bai, Wen Sun, Thorsten Joachims (Department of Computer Science, Cornell University, Ithaca, NY, USA)

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

## Abstract

Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get exposed to the users. This raises questions of fairness to the items and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward regret in this setting, and present algorithms for both stochastic multi-armed bandits and stochastic linear bandits. We prove that the algorithms achieve sub-linear fairness regret and reward regret. Beyond the theoretical analysis, we also provide empirical evidence that these algorithms can fairly allocate exposure to different arms effectively.

## 1. Introduction

Bandit algorithms (Thompson, 1933; Robbins, 1952; Bubeck & Cesa-Bianchi, 2012; Slivkins, 2019; Lattimore & Szepesvári, 2020) provide an attractive model of learning for online platforms, and they are now widely used to optimize retail, media streaming, and news-feeds. Each round of bandit learning corresponds to an interaction with a user, where the algorithm selects an arm (e.g. product, song, article), observes the user's response (e.g. purchase, stream, read), and then updates its policy. Over time, the bandit algorithm thus learns to maximize the user responses, which are often well aligned with the objective of the online platform (e.g. profit maximization, engagement maximization).

While maximizing user responses may arguably be in the interest of the platform and its users at least in the short term, there is now a growing understanding that it can also be problematic in multiple respects. In this paper, we focus on the fact that this objective ignores the interests of the items (i.e. arms), which also derive utility from the interactions. In particular, sellers, artists, and writers have a strong interest in the exposure their items receive, as it affects their chance to get purchased, streamed, or read. It is well understood that algorithms that maximize user responses can be unfair in how they allocate exposure to the items (Singh & Joachims, 2018). For example, two items with very similar merit (e.g. click-through rate) may receive substantially different amounts of exposure, which is not only objectionable in itself but can also degrade the long-term objectives of the platform (e.g. seller retention (Mehrotra et al., 2018), anti-discrimination (Noble, 2018), anti-polarization (Epstein & Robertson, 2015)).

To illustrate the problem, consider a conventional (non-personalized) stochastic multi-armed bandit algorithm that is used to promote new music albums on the front page of a website. The bandit algorithm will quickly learn which album draws the largest click-through rate and keep displaying this album, even if other albums are almost equally good.
This promotes a winner-takes-all dynamic that creates superstars (Mehrotra et al., 2018) and may drive many deserving artists out of business. Analogously, a (personalized) contextual bandit for news-feed recommendation can polarize a user by quickly learning which type of articles the user is most likely to read, and then exclusively recommending such articles instead of a portfolio that is more reflective of the user's true interest distribution.

To overcome these problems of the conventional bandit objective, we propose a new formulation of the bandit problem that implements the principle of merit-based fairness of exposure (Singh & Joachims, 2018; Biega et al., 2018). For brevity, we call this the FairX bandit problem. It incorporates the additional fairness requirement that each item/arm receives a share of exposure that is proportional to its merit. We define the merit of an arm as an increasing function of its mean reward, and the exposure as the probability of being selected by the bandit policy at each round. Based on these quantities, we then formulate the reward regret and the fairness regret so that minimizing these two regrets corresponds to maximizing responses while minimizing unfairness to the items.

For the FairX bandit problem, we present a fair upper confidence bound (UCB) algorithm and a fair Thompson sampling (TS) algorithm in the stochastic multi-armed bandits (MAB) setting, as well as a fair linear UCB algorithm and a fair linear TS algorithm in the stochastic linear bandits setting. We prove that all algorithms achieve fairness regret and reward regret with sub-linear dependence on the number of rounds, while the TS-based algorithms have computational advantages. The fairness regret of these algorithms also depends on the minimum merit of the arms and a bounded Lipschitz constant of the merit function, and we provide fairness regret lower bounds based on these quantities. Beyond the theoretical analysis, we also conduct an empirical evaluation that compares these algorithms with conventional bandit algorithms and more naive baselines, finding that the fairness-aware algorithms can fairly allocate exposure to different arms effectively while maximizing user responses.

## 2. Related Work

The bandit problem was first introduced by Thompson (1933) to efficiently conduct medical trials. Since then, it has been extensively studied in many variants, and we refer to the books (Bubeck & Cesa-Bianchi, 2012; Slivkins, 2019; Lattimore & Szepesvári, 2020) for a comprehensive survey. We focus on the classic stochastic MAB setting, where each arm has a fixed but unknown reward distribution, as well as on the stochastic linear bandits problem, where each arm is represented as a $d$-dimensional vector and its expected reward is a linear function of its vector representation. In both stochastic MAB and stochastic linear bandits, some of the algorithms we design leverage the idea of optimism in the face of uncertainty behind the UCB algorithm (Lai & Robbins, 1985), while other algorithms leverage the idea of posterior sampling behind the TS algorithm (Thompson, 1933). The theoretical results for the proposed FairX-UCB and FairX-LinUCB algorithms borrow ideas from several prior finite-time analyses of the conventional UCB and LinUCB algorithms (Auer, 2002; Dani et al., 2008; Abbasi-Yadkori et al., 2011).
We adopt the Bayesian regret framework (Russo & Van Roy, 2014) for our theoretical analysis of the FairX-TS and FairX-LinTS algorithms.

Algorithmic fairness has been extensively studied in binary classification (Hardt et al., 2016; Chouldechova, 2017; Kleinberg et al., 2017; Agarwal et al., 2018). These works propose statistical criteria to test algorithmic fairness that often operationalize definitions of fairness from political philosophy and sociology. Several prior works (Blum et al., 2018; Blum & Lykouris, 2019; Bechavod et al., 2019) study how to achieve these fairness criteria in online learning. These algorithms achieve fairness to the incoming users. We, in contrast, achieve fairness to the arms. Joseph et al. (2016b;a; 2018) study fairness in bandits that ensures a better arm is always selected with no less probability than a worse arm. Different from our definition of fairness, their optimal policy is still the one that deterministically selects the arm with the largest expected reward while giving zero exposure to all the other arms. Another type of fairness definition in bandits is to ensure a minimum and/or maximum amount of exposure to each arm or group of arms (Heidari & Krause, 2018; Wen et al., 2019; Schumann et al., 2019; Li et al., 2019; Celis et al., 2018; Claure et al., 2020; Patil et al., 2020; Chen et al., 2020). However, these works do not take the merit of the items into consideration. Gillen et al. (2018) propose to optimize individual fairness as defined in (Dwork et al., 2012) in the adversarial linear bandits setting, where the difference between the probabilities that any two arms are selected is bounded by the distance between their context vectors. They require additional feedback on fairness-constraint violations. We work in the stochastic bandits setting, and we do not require any additional feedback beyond the reward. We also ensure that similar items obtain similar exposure, but we focus on similarity of merit, which corresponds to closeness in mean reward conditioned on context.

The most relevant work is arguably (Liu et al., 2017), which considers fairness in stochastic MAB problems where the reward distribution is Bernoulli. They aim to achieve calibrated fairness, where each arm is selected with probability equal to that of its reward being the largest, while satisfying a smoothness constraint under which arms with similar merit should receive similar exposure. They propose a TS-based algorithm that achieves fairness regret with a $T^{2/3}$ dependence on the time horizon $T$. Our formulation is more general in the sense that we consider arbitrary reward distributions and merit functions, with their formulation as a special case. Moreover, our proposed algorithms achieve fairness regret with a $\sqrt{T}$ dependence on $T$. In addition, we further study the more general setting of stochastic linear bandits.

Our definition of fairness has connections to the fair division problem (Steinhaus, 1948; Brams & Taylor, 1996; Procaccia, 2013), where the goal is to allocate a resource to different agents in a fair way. In our problem, we aim to allocate the users' attention among the items in a fair way. Our definition of fairness ensures proportionality, one of the key desiderata in the fair division literature to ensure each agent receives its fair share of the resource. Recently, merit-based fairness of exposure has been studied for ranking problems in the statistical batch learning framework (Singh & Joachims, 2018; 2019).
We build upon this work and extend merit-based fairness of exposure to the online-learning setting.

## 3. Stochastic Multi-Armed Bandits in the FairX Setting

We begin by introducing the FairX setting for stochastic MAB, including our new formulation of fairness and reward regret. We then develop two algorithms, called FairX-UCB and FairX-TS, and bound their fairness and reward regret. In the subsequent section, we extend this approach to stochastic linear bandits.

### 3.1. FairX Setting for Stochastic MAB

A stochastic MAB instance can be represented as a collection of reward distributions $v = (P_a : a \in [K])$, where $P_a$ is the reward distribution of arm $a$ with mean $\mu^*_a = \mathbb{E}_{r \sim P_a}[r]$. The learner interacts with the environment sequentially over $T$ rounds. In each round $t \in [T]$, the learner has to choose a policy $\pi_t$ over the $K$ arms based on the interaction history before round $t$. The learner then samples an arm $a_t \sim \pi_t$. In response to the selected arm $a_t$, the environment samples a reward $r_{t,a_t} \sim P_{a_t}$ and reveals it to the learner. The history $H_t = (\pi_1, a_1, r_{1,a_1}, \ldots, \pi_{t-1}, a_{t-1}, r_{t-1,a_{t-1}})$ consists of all the deployed policies, chosen arms, and their associated rewards.

Conventionally, the goal of learning is to maximize the cumulative expected reward $\sum_{t=1}^T \mathbb{E}_{a_t \sim \pi_t}[\mu^*_{a_t}]$. Thus conventional bandit algorithms converge to a policy that deterministically selects the arm with the largest expected reward. As many have pointed out in other contexts (Singh & Joachims, 2018; Mehrotra et al., 2018; Biega et al., 2018; Beutel et al., 2019; Geyik et al., 2019; Abdollahpouri et al., 2020), such winner-takes-all allocations can be considered unfair to the items in many applications and can lead to undesirable long-term dynamics. Bringing this insight to the task of bandit learning, we propose to incorporate merit-based fairness-of-exposure constraints (Singh & Joachims, 2018) into the bandit objective. Specifically, we aim to learn a policy $\pi^*$ which ensures that each arm receives an amount of exposure proportional to its merit, where merit is quantified through an application-dependent merit function $f(\cdot) > 0$ that maps the expected reward of an arm to a positive merit value:

$$\frac{\pi^*(a)}{f(\mu^*_a)} = \frac{\pi^*(a')}{f(\mu^*_{a'})} \quad \forall a, a' \in [K].$$

The merit function $f$ is an input to the bandit algorithm, and it provides a design choice that permits tailoring the fairness criterion to different applications. The following theorem shows that there is a unique policy that satisfies the above fairness constraints.

**Theorem 3.1.1 (Optimal Fair Policy).** For any mean reward parameter $\mu^*$ and any choice of merit function $f(\cdot) > 0$, there exists a unique policy of the form

$$\pi^*(a) = \frac{f(\mu^*_a)}{\sum_{a'} f(\mu^*_{a'})} \quad \forall a \in [K]$$

that fulfills the merit-based fairness of exposure constraints. We refer to $\pi^*$ as the optimal fair policy.

All proofs of the theorems are in the appendix. When the bandit converges to this optimal fair policy $\pi^*$, the expected reward also converges to the expected reward of the optimal fair policy. We thus define the reward regret $RR_T$ at round $T$ as the gap between the expected reward of the deployed policy and the expected reward of the optimal fair policy:

$$RR_T = \sum_{t=1}^T \Big( \sum_a \pi^*(a)\,\mu^*_a - \sum_a \pi_t(a)\,\mu^*_a \Big). \qquad (1)$$

While this reward regret quantifies how quickly the reward is optimized, we also need to quantify how effectively the algorithm learns to enforce fairness. We thus define the following fairness regret $FR_T$, which measures the cumulative $\ell_1$ distance between the deployed policy and the optimal fair policy up to round $T$:

$$FR_T = \sum_{t=1}^T \sum_a |\pi^*(a) - \pi_t(a)|. \qquad (2)$$

The fairness regret and the reward regret depend both on the randomly sampled rewards and on the arms randomly sampled from the policy. They are thus random variables, and we aim to minimize the regrets with high probability.
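To make these definitions concrete, here is a minimal Python sketch (not from the paper) of how the optimal fair policy of Theorem 3.1.1 and the per-round contributions to the two regrets can be computed when the true mean rewards are known, e.g. for evaluation purposes. The specific merit function and all variable names are illustrative assumptions.

```python
import numpy as np

def optimal_fair_policy(mu_star, f):
    """Optimal fair policy pi*(a) = f(mu*_a) / sum_a' f(mu*_a') (Theorem 3.1.1)."""
    merits = f(mu_star)                       # merits must be positive
    return merits / merits.sum()

def regret_increments(pi_t, mu_star, f):
    """Per-round contributions to reward regret (Eq. 1) and fairness regret (Eq. 2)."""
    pi_star = optimal_fair_policy(mu_star, f)
    reward_regret = pi_star @ mu_star - pi_t @ mu_star
    fairness_regret = np.abs(pi_star - pi_t).sum()        # l1 distance
    return reward_regret, fairness_regret

# Example: arms with similar merit receive similar exposure.
mu_star = np.array([0.50, 0.48, 0.10])
f = lambda mu: np.exp(2.0 * mu)               # an example merit function f(mu) = exp(c * mu)
print(optimal_fair_policy(mu_star, f))        # near-even split between the first two arms
print(regret_increments(np.array([1.0, 0.0, 0.0]), mu_star, f))
```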
To prepare for the theoretical analysis, we introduce the following two conditions on the merit function $f$ to suitably characterize a FairX bandit problem.

**Condition 3.1.2 (Minimum Merit).** The merit of each arm is bounded away from zero, i.e. $\min_\mu f(\mu) \geq \gamma$ for some positive constant $\gamma > 0$.

**Condition 3.1.3 (Lipschitz Continuity).** The merit function $f$ is $L$-Lipschitz continuous, i.e. $\forall \mu_1, \mu_2: |f(\mu_1) - f(\mu_2)| \leq L\,|\mu_1 - \mu_2|$ for some positive constant $L > 0$.

The following two theorems show that neither of the two conditions can be dropped if we want to obtain bandit algorithms with fairness regret that is sub-linear in the number of rounds $T$.

**Theorem 3.1.4 (Lower Bound on Fairness Regret is Linear without Minimum-Merit Condition).** For time horizon $T > 0$, there exists a 1-Lipschitz continuous merit function $f$ with $\min_\mu f(\mu) = 1/\sqrt{T}$, such that for any bandit algorithm there must exist a MAB instance on which the expected fairness regret is at least $\mathbb{E}[FR_T] \geq 0.015\,T$.

**Theorem 3.1.5 (Lower Bound on Fairness Regret is Linear without Bounded Lipschitz-Continuity Condition).** For time horizon $T > 0$, there exists a $\sqrt{T}$-Lipschitz continuous merit function $f$ with minimum merit 1, such that for any bandit algorithm there must exist a MAB instance on which the expected fairness regret is at least $\mathbb{E}[FR_T] \geq 0.015\,T$.

### 3.2. FairX-UCB Algorithm

**Algorithm 1** FairX-UCB
1: input: $K, T, f, w_0$
2: for $t = 1$ to $T$ do
3: $\quad \forall a:\; N_{t,a} = \sum_{\tau=1}^{t-1} \mathbb{1}\{a_\tau = a\}$
4: $\quad \forall a:\; \hat{\mu}_{t,a} = \sum_{\tau=1}^{t-1} \mathbb{1}\{a_\tau = a\}\, r_{\tau,a_\tau} / N_{t,a}$
5: $\quad \forall a:\; w_{t,a} = w_0 / \sqrt{N_{t,a}}$
6: $\quad CR_t = \{\mu : \forall a\;\; \mu_a \in [\hat{\mu}_{t,a} - w_{t,a},\, \hat{\mu}_{t,a} + w_{t,a}]\}$
7: $\quad \mu_t = \arg\max_{\mu \in CR_t} \sum_a \frac{f(\mu_a)}{\sum_{a'} f(\mu_{a'})}\, \mu_a$
8: $\quad$ Construct policy $\pi_t(a) = \frac{f(\mu_{t,a})}{\sum_{a'} f(\mu_{t,a'})}$
9: $\quad$ Sample arm $a_t \sim \pi_t$
10: $\quad$ Observe reward $r_{t,a_t}$
11: end for

The first algorithm we introduce is called FairX-UCB, and it is detailed in Algorithm 1. It utilizes the idea of optimism in the face of uncertainty. At each round $t$, the algorithm constructs a confidence region $CR_t$ which contains the true parameter $\mu^*$ with high probability. The algorithm then optimistically selects a parameter $\mu_t \in \mathbb{R}^K$ within the confidence region $CR_t$ that maximizes the estimated expected reward, subject to the constraint that we construct a fair policy as if the selected parameter were the true parameter. Compared to the conventional UCB algorithm, which deterministically selects the arm with the largest UCB in each round, the proposed FairX-UCB algorithm selects arms stochastically to ensure fairness. Finally, we deploy the constructed policy $\pi_t$, observe the feedback, and update the confidence region. The following two theorems characterize the fairness and reward regret upper bounds of the FairX-UCB algorithm (throughout, $\tilde{O}$ ignores logarithmic factors in $O$).

**Theorem 3.2.1 (FairX-UCB Fairness Regret).** Under Conditions 3.1.2 and 3.1.3, suppose $\forall t, a: r_{t,a} \in [-1, 1]$. When $T > K$, for any $\delta \in (0, 1)$, setting $w_0 = \sqrt{2 \ln(4TK/\delta)}$, the fairness regret of the FairX-UCB algorithm is $FR_T = \tilde{O}\big(L\sqrt{KT}/\gamma\big)$ with probability at least $1 - \delta$.

**Theorem 3.2.2 (FairX-UCB Reward Regret).** Suppose $\forall t, a: r_{t,a} \in [-1, 1]$. When $T > K$, for any $\delta \in (0, 1)$, setting $w_0 = \sqrt{2 \ln(4TK/\delta)}$, the reward regret of the FairX-UCB algorithm is $RR_T = \tilde{O}\big(\sqrt{KT}\big)$ with probability at least $1 - \delta$.
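As a concrete illustration of Algorithm 1, the sketch below implements the FairX-UCB loop in Python. The optimistic step (line 7) is a non-convex constrained problem; following the approximation the paper uses in its experiments, this sketch replaces it with a few steps of projected gradient ascent inside the box-shaped confidence region. The environment interface, the forced initial pulls, the initialization at the empirical means, and the finite-difference gradient are illustrative choices, not the authors' implementation.

```python
import numpy as np

def fair_policy(mu, f):
    """pi(a) proportional to f(mu_a)."""
    merits = f(mu)
    return merits / merits.sum()

def numerical_grad(g, x, eps=1e-5):
    """Finite-difference gradient; a real implementation could use the analytic gradient."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        grad[i] = (g(x + e) - g(x - e)) / (2 * eps)
    return grad

def fairx_ucb(env, K, T, f, w0, n_pg_steps=10, lr=0.01, seed=0):
    """Sketch of FairX-UCB (Algorithm 1). `env(a)` returns a reward in [-1, 1] for arm a."""
    rng = np.random.default_rng(seed)
    counts, sums = np.zeros(K), np.zeros(K)
    policies, rewards = [], []
    for t in range(1, T + 1):
        if t <= K:                       # play each arm once so the confidence region is defined
            a = t - 1
            pi = np.eye(K)[a]
        else:
            mu_hat = sums / counts
            w = w0 / np.sqrt(counts)
            lo, hi = mu_hat - w, mu_hat + w              # box confidence region CR_t
            mu_t = mu_hat.copy()
            for _ in range(n_pg_steps):                  # projected gradient ascent on the
                obj = lambda m: fair_policy(m, f) @ m    # estimated reward of the fair policy
                mu_t = np.clip(mu_t + lr * numerical_grad(obj, mu_t), lo, hi)
            pi = fair_policy(mu_t, f)
            a = rng.choice(K, p=pi)
        r = env(a)
        counts[a] += 1
        sums[a] += r
        policies.append(pi)
        rewards.append(r)
    return policies, rewards
```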
Note that the well-known $\sqrt{KT}$ reward regret lower bound (Auer et al., 2002) developed for the conventional bandit problem also holds in the FairX setting, because the conventional stochastic MAB problem that only minimizes the reward regret is a special case of the FairX setting in which the merit function $f$ is an infinitely steep increasing function. Since the reward regret upper bound we proved for FairX-UCB does not depend on Conditions 3.1.2 and 3.1.3 on the merit function $f$, our reward regret upper bound for the FairX-UCB algorithm is tight up to logarithmic factors. The fairness regret has the same dependence on the number of arms $K$ and the number of rounds $T$ as the reward regret. It further depends on the minimum merit constant $\gamma$ and the Lipschitz constant $L$, which we treat as absolute constants in light of Theorem 3.1.4 and Theorem 3.1.5.

Compared to the Fair SD TS algorithm proposed in (Liu et al., 2017), our FairX-UCB algorithm targets fairness and reward regret across rounds instead of achieving a smooth fairness constraint in each round. This allows FairX-UCB to achieve improved fairness and reward regret ($\sqrt{KT}$ compared to $(KT)^{2/3}$). In addition, FairX-UCB works for general reward distributions and merit functions, while Fair SD TS only works for Bernoulli reward distributions and the identity merit function.

One challenge in implementing Algorithm 1 lies in Step 7, since finding the most optimistic parameter is a non-convex constrained optimization problem. We solve this optimization problem approximately with projected gradient descent in our empirical evaluation. In the next subsection, we introduce the FairX-TS algorithm, which avoids this optimization problem.

### 3.3. FairX-TS Algorithm

**Algorithm 2** FairX-TS
1: input: $f, V_1$
2: for $t = 1, 2, \ldots$ do
3: $\quad$ Sample parameter from the posterior: $\mu_t \sim V_t$
4: $\quad$ Construct policy $\pi_t(a) = \frac{f(\mu_{t,a})}{\sum_{a'} f(\mu_{t,a'})}$
5: $\quad$ Sample arm $a_t \sim \pi_t$
6: $\quad$ Observe reward $r_{t,a_t}$
7: $\quad$ Update posterior $V_{t+1} = \text{Update}(V_1, H_{t+1})$
8: end for

Another approach to designing stochastic bandit algorithms that has proven successful both empirically and theoretically is Thompson sampling (TS). We find that this approach can also be applied to the FairX setting. In particular, our FairX-TS algorithm, shown in Algorithm 2, uses posterior sampling similar to a conventional TS bandit. The algorithm puts a prior distribution $V_1$ on the expected reward $\mu^*$ of each arm. For each round $t$, the algorithm samples a parameter $\mu_t$ from the posterior $V_t$ and constructs a fair policy $\pi_t$ from the sampled parameter to deploy. Finally, the algorithm observes the feedback and updates the posterior distribution of the true parameter.
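For a concrete instantiation, the sketch below implements FairX-TS with independent Gaussian posteriors per arm, matching the normal prior and Gaussian reward noise assumed in the Bayesian analysis that follows. This is a minimal sketch under those assumptions rather than the authors' code; the environment interface is hypothetical.

```python
import numpy as np

def fairx_ts_gaussian(env, K, T, f, prior_var=1.0, noise_var=1.0, seed=0):
    """Sketch of FairX-TS (Algorithm 2) with a N(0, prior_var) prior per arm and
    N(mu_a, noise_var) rewards, so the per-arm posterior stays Gaussian."""
    rng = np.random.default_rng(seed)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(T):
        # Posterior for each arm: N(post_mean, post_var) by standard Gaussian conjugacy.
        post_var = 1.0 / (1.0 / prior_var + counts / noise_var)
        post_mean = post_var * (sums / noise_var)
        mu_t = rng.normal(post_mean, np.sqrt(post_var))    # sample mu_t ~ V_t
        merits = f(mu_t)
        pi_t = merits / merits.sum()                       # fair policy from sampled parameter
        a_t = rng.choice(K, p=pi_t)
        r = env(a_t)                                       # observe reward
        counts[a_t] += 1
        sums[a_t] += r
    return counts / counts.sum()                           # empirical exposure share per arm
```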
Following (Russo & Van Roy, 2014), we analyze the Bayesian reward and fairness regret of the algorithm. The Bayesian regret framework assumes that the true parameter $\mu^*$ is sampled from the prior, and the Bayesian regret is the expected regret taken over the prior distribution:

$$\text{BayesRR}_T = \mathbb{E}_{\mu^*}\big[\mathbb{E}[RR_T \mid \mu^*]\big] \qquad (3)$$
$$\text{BayesFR}_T = \mathbb{E}_{\mu^*}\big[\mathbb{E}[FR_T \mid \mu^*]\big]. \qquad (4)$$

In the following two theorems we provide bounds on both the Bayesian reward regret and the Bayesian fairness regret of the FairX-TS algorithm.

**Theorem 3.3.1 (FairX-TS Fairness Regret).** Under Conditions 3.1.2 and 3.1.3, suppose the mean reward $\mu^*_a$ of each arm $a$ is independently sampled from the standard normal distribution $N(0, 1)$, and $\forall t, a:\; r_{t,a} \sim N(\mu^*_a, 1)$. Then the Bayesian fairness regret of the FairX-TS algorithm at any round $T$ is $\text{BayesFR}_T = \tilde{O}\big(L\sqrt{KT}/\gamma\big)$.

**Theorem 3.3.2 (FairX-TS Reward Regret).** Suppose the mean reward $\mu^*_a$ of each arm $a$ is independently sampled from the standard normal distribution $N(0, 1)$, and $\forall t, a:\; r_{t,a} \sim N(\mu^*_a, 1)$. Then the Bayesian reward regret of the FairX-TS algorithm at any round $T$ is $\text{BayesRR}_T = \tilde{O}\big(\sqrt{KT}\big)$.

Note that these regret bounds are of the same order as the fairness and reward regret of the FairX-UCB algorithm. However, FairX-TS relies on sampling from the posterior and thus avoids the non-convex optimization problem that makes the use of FairX-UCB more challenging.

## 4. Stochastic Linear Bandits in the FairX Setting

In this section, we extend the two algorithms introduced in the MAB setting to the more general stochastic linear bandits setting, where the learner is provided with contextual information for making decisions. We discuss how the two algorithms can be adapted to this setting to achieve both sub-linear fairness and reward regret.

### 4.1. FairX Setting for Stochastic Linear Bandits

In stochastic linear bandits, each arm $a$ at round $t$ comes with a context vector $x_{t,a} \in \mathbb{R}^d$. A stochastic linear bandits instance $v = (P_x : x \in \mathbb{R}^d)$ is a collection of reward distributions, one for each context vector. The key assumption of stochastic linear bandits is that there exists a true parameter $\mu^*$ such that, regardless of the interaction history $H_t$, the mean reward of arm $a$ at round $t$ is the inner product between the context vector and the true parameter: $\mathbb{E}_{r \sim P_{x_{t,a}}}[r \mid H_t] = \mu^{*\top} x_{t,a}$ for all $t, a$. The noise sequence $\eta_t = r_{t,a_t} - \mu^{*\top} x_{t,a_t}$ is thus a martingale difference sequence, since $\mathbb{E}[\eta_t \mid H_t] = \mathbb{E}_{a_t \sim \pi_t}\big[\mathbb{E}_{r \sim P_{x_{t,a_t}}}[r \mid H_t] - \mu^{*\top} x_{t,a_t}\big] = 0$.

At each round $t$, the learner is given a set of context vectors $D_t \subset \mathbb{R}^d$ representing the arms, and it has to choose a policy $\pi_t$ over these $K$ arms based on the interaction history $H_t = (D_1, \pi_1, a_1, r_{1,a_1}, \ldots, D_{t-1}, \pi_{t-1}, a_{t-1}, r_{t-1,a_{t-1}})$. We focus on problems where the number of available arms is finite, $\forall t: |D_t| = K$, but where $K$ could be large. Again, we want to ensure that the policy provides each arm with an amount of exposure proportional to its merit:

$$\frac{\pi^*_t(a)}{f(\mu^{*\top} x_{t,a})} = \frac{\pi^*_t(a')}{f(\mu^{*\top} x_{t,a'})} \quad \forall t,\; \forall x_{t,a}, x_{t,a'} \in D_t,$$

where $f$ is the merit function that maps the mean reward of an arm to a positive merit value. Since the set of arms changes over time, the optimal fair policy $\pi^*_t$ at round $t$ is time-dependent:

$$\pi^*_t(a) = \frac{f(\mu^{*\top} x_{t,a})}{\sum_{a'} f(\mu^{*\top} x_{t,a'})} \quad \forall t, a.$$

Analogous to the MAB setting, we define the reward regret as the cumulative expected reward difference between the optimal fair policy and the deployed policy,

$$RR_T = \sum_{t=1}^T \Big( \sum_a \pi^*_t(a)\, \mu^{*\top} x_{t,a} - \sum_a \pi_t(a)\, \mu^{*\top} x_{t,a} \Big), \qquad (5)$$

and the fairness regret as the cumulative $\ell_1$ distance between the optimal fair policy and the deployed policy,

$$FR_T = \sum_{t=1}^T \sum_a |\pi^*_t(a) - \pi_t(a)|. \qquad (6)$$

The lower bounds on the fairness regret derived in Theorem 3.1.4 and Theorem 3.1.5 in the MAB setting also apply to the stochastic linear bandits setting, since we can easily convert a MAB instance into a stochastic linear bandits instance by constructing $K$ $K$-dimensional basis vectors, each representing one arm. Thus we again employ Conditions 3.1.2 and 3.1.3 to design algorithms whose fairness regret has a sub-linear dependence on the horizon $T$.
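The reduction mentioned above, together with the time-dependent optimal fair policy $\pi^*_t$, can be sketched in a few lines of Python. This is an illustrative sketch; the function and variable names are not from the paper.

```python
import numpy as np

def mab_as_linear_instance(K):
    """A MAB with K arms becomes a linear bandit with one-hot (basis-vector) contexts,
    so the MAB fairness-regret lower bounds carry over to the linear setting."""
    return np.eye(K)                 # D_t: row a is the context vector of arm a

def optimal_fair_policy_linear(mu_star, contexts, f):
    """pi*_t(a) = f(mu*^T x_{t,a}) / sum_a' f(mu*^T x_{t,a'})."""
    mean_rewards = contexts @ mu_star
    merits = f(mean_rewards)
    return merits / merits.sum()

# With one-hot contexts and mu_star equal to the MAB means, this recovers
# the optimal fair policy of Theorem 3.1.1.
mu_star = np.array([0.5, 0.48, 0.1])
f = lambda m: np.exp(2.0 * m)
print(optimal_fair_policy_linear(mu_star, mab_as_linear_instance(3), f))
```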
### 4.2. FairX-LinUCB Algorithm

**Algorithm 3** FairX-LinUCB
1: input: $\beta_t, f, \lambda$
2: initialization: $\Sigma_1 = \lambda I_d$, $b_1 = 0_d$
3: for $t = 1, 2, \ldots$ do
4: $\quad$ Observe contexts $D_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,K})$
5: $\quad \hat{\mu}_t = \Sigma_t^{-1} b_t$ {the ridge regression solution}
6: $\quad CR_t = \{\mu : \|\mu - \hat{\mu}_t\|_{\Sigma_t} \leq \beta_t\}$
7: $\quad \mu_t = \arg\max_{\mu \in CR_t} \sum_a \frac{f(\mu^\top x_{t,a})}{\sum_{a'} f(\mu^\top x_{t,a'})}\, \mu^\top x_{t,a}$
8: $\quad$ Construct policy $\pi_t(a) = \frac{f(\mu_t^\top x_{t,a})}{\sum_{a'} f(\mu_t^\top x_{t,a'})}$
9: $\quad$ Sample arm $a_t \sim \pi_t$
10: $\quad$ Observe reward $r_{t,a_t}$
11: $\quad \Sigma_{t+1} = \Sigma_t + x_{t,a_t} x_{t,a_t}^\top$
12: $\quad b_{t+1} = b_t + x_{t,a_t} r_{t,a_t}$
13: end for

Similar to the FairX-UCB algorithm, the FairX-LinUCB algorithm constructs a confidence region $CR_t$ of the true parameter $\mu^*$ at each round $t$. The center of the confidence region, $\hat{\mu}_t$, is the solution of a ridge regression over the existing data, which can be updated incrementally. The radius of the confidence ball, $\beta_t$, is an input to the algorithm. The algorithm proceeds by repeatedly selecting a parameter $\mu_t$ within the confidence region that is optimistic about the expected reward, subject to the constraint that we construct a fair policy from the parameter. We prove the following upper bounds on the fairness regret and reward regret of the FairX-LinUCB algorithm.

**Theorem 4.2.1 (FairX-LinUCB Fairness Regret).** Under Conditions 3.1.2 and 3.1.3, suppose $\forall t, a: \|x_{t,a}\|_2 \leq 1$, $\eta_t$ is 1-sub-Gaussian, and $\|\mu^*\|_2 \leq 1$. Setting $\lambda = 1$, with a proper choice of $\beta_t$, the fairness regret at any round $T > 0$ is $FR_T = \tilde{O}\big(L d \sqrt{T}/\gamma\big)$ with high probability.

**Theorem 4.2.2 (FairX-LinUCB Reward Regret).** Suppose $\forall t, a: \|x_{t,a}\|_2 \leq 1$, $\eta_t$ is 1-sub-Gaussian, and $\|\mu^*\|_2 \leq 1$. Setting $\lambda = 1$, with a proper choice of $\beta_t$, the reward regret at any round $T > 0$ is $RR_T = \tilde{O}\big(d\sqrt{T}\big)$ with high probability.

Both the fairness and the reward regret have a square-root dependence on the horizon $T$ and a linear dependence on the feature dimension $d$, and the fairness regret depends on the absolute constants $L$ and $\gamma$. Note that the reward regret is not tight in terms of $d$: there exist algorithms (Chu et al., 2011; Lattimore & Szepesvári, 2020) that achieve reward regret $\tilde{O}(\sqrt{dT})$. However, these algorithms are based on the idea of arm elimination and thus will likely not achieve low fairness regret. LinUCB is also a much more practical option than the algorithms based on arm elimination (Chu et al., 2011).

The optimization in Step 7 of Algorithm 3, where we need to find an optimistic parameter $\mu_t$ that maximizes the estimated expected reward within the confidence region $CR_t$ subject to the fairness constraint, is again a non-convex constrained optimization problem. We use projected gradient descent to find approximate solutions in our empirical evaluation.

### 4.3. FairX-LinTS Algorithm

**Algorithm 4** FairX-LinTS
1: input: $f, V_1$
2: for $t = 1, 2, \ldots$ do
3: $\quad$ Observe contexts $D_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,K})$
4: $\quad$ Sample parameter from the posterior: $\mu_t \sim V_t$
5: $\quad$ Construct policy $\pi_t(a) = \frac{f(\mu_t^\top x_{t,a})}{\sum_{a'} f(\mu_t^\top x_{t,a'})}$
6: $\quad$ Sample arm $a_t \sim \pi_t$
7: $\quad$ Observe reward $r_{t,a_t}$
8: $\quad$ Update posterior $V_{t+1} = \text{Update}(V_1, H_{t+1})$
9: end for

To avoid the difficult optimization problem of FairX-LinUCB, we again explore the use of Thompson sampling. Algorithm 4 shows our proposed FairX-LinTS. At each round $t$, the algorithm samples a parameter $\mu_t$ from the posterior distribution $V_t$ of the true parameter $\mu^*$ and derives a fair policy $\pi_t$ from the sampled parameter. Then the algorithm deploys the policy and observes the feedback for the selected arm. Finally, the algorithm updates the posterior distribution of the true parameter given the observed data.
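To make the posterior update concrete, the sketch below instantiates FairX-LinTS with a Gaussian prior on $\mu^*$ and Gaussian reward noise, in which case the posterior is the Bayesian ridge-regression posterior; with unit prior and noise variance it equals $N(\Sigma_t^{-1} b_t, \Sigma_t^{-1})$, using the same $\Sigma_t, b_t$ statistics as FairX-LinUCB with $\lambda = 1$. This is a minimal sketch under those assumptions rather than the authors' implementation; the environment interface is hypothetical.

```python
import numpy as np

def fairx_lints_gaussian(env, d, T, f, prior_var=1.0, noise_var=1.0, seed=0):
    """Sketch of FairX-LinTS (Algorithm 4) with a N(0, prior_var * I) prior on mu*
    and Gaussian reward noise, so the posterior over mu* stays Gaussian."""
    rng = np.random.default_rng(seed)
    Sigma = np.eye(d) / prior_var                      # posterior precision matrix
    b = np.zeros(d)                                    # scaled cumulative x * r statistics
    for t in range(T):
        contexts = env.observe_contexts()              # K x d matrix D_t
        cov = np.linalg.inv(Sigma)
        mean = cov @ b
        mu_t = rng.multivariate_normal(mean, cov)      # sample mu_t ~ V_t
        merits = f(contexts @ mu_t)
        pi_t = merits / merits.sum()                   # fair policy from sampled parameter
        a_t = rng.choice(len(contexts), p=pi_t)
        r = env.pull(a_t)                              # observe reward r_{t, a_t}
        x = contexts[a_t]
        Sigma += np.outer(x, x) / noise_var            # conjugate Gaussian posterior update
        b += x * r / noise_var
    return np.linalg.inv(Sigma) @ b                    # posterior mean estimate of mu*
```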
Note that sampling from the posterior is efficient for a variety of models (e.g. the normal distribution), as opposed to solving the non-convex optimization problem in FairX-LinUCB. Appropriately extending our definitions of Bayesian reward regret and fairness regret,

$$\text{BayesRR}_T = \mathbb{E}_{\mu^*}\big[\mathbb{E}[RR_T \mid \mu^*]\big] \qquad (7)$$
$$\text{BayesFR}_T = \mathbb{E}_{\mu^*}\big[\mathbb{E}[FR_T \mid \mu^*]\big], \qquad (8)$$

we can prove the following regret bounds for the FairX-LinTS algorithm.

**Theorem 4.3.1 (FairX-LinTS Fairness Regret).** Under Conditions 3.1.2 and 3.1.3, suppose each dimension of the true parameter $\mu^*$ is independently sampled from the standard normal distribution $N(0, 1)$, $\forall t, a: \|x_{t,a}\|_2 \leq 1$, and $\eta_t$ is sampled from the standard normal distribution $N(0, 1)$. Then the Bayesian fairness regret of the FairX-LinTS algorithm is $\text{BayesFR}_T = \tilde{O}\big(L d^{3/2}\sqrt{T}/\gamma\big)$.

**Theorem 4.3.2 (FairX-LinTS Reward Regret).** Suppose each dimension of the true parameter $\mu^*$ is independently sampled from the standard normal distribution $N(0, 1)$, $\forall t, a: \|x_{t,a}\|_2 \leq 1$, and $\eta_t$ is sampled from the standard normal distribution $N(0, 1)$. Then the Bayesian reward regret of the FairX-LinTS algorithm is $\text{BayesRR}_T = \tilde{O}\big(d^{3/2}\sqrt{T}\big)$.

Similar to the FairX-TS algorithm in the MAB setting, the Bayesian fairness regret of FairX-LinTS assumes a normal prior. Note that the Bayesian fairness regret of FairX-LinTS differs by a factor on the order of $\sqrt{d}$ from the non-Bayesian fairness regret of the FairX-LinUCB algorithm. The Bayesian setting and the normal prior assumption enable us to explicitly bound the total variation distance between our policy and the optimal fair policy, which allows us to avoid going through the UCB-based analysis of the LinUCB algorithm, as in the conventional way of proving Bayesian regret bounds (Russo & Van Roy, 2014).

## 5. Experiments

While the theoretical analysis provides worst-case guarantees for the algorithms, we now evaluate empirically how the algorithms perform on a range of tasks. We perform this evaluation both on synthetic and on real-world data. The synthetic data allows us to control properties of the learning problem for internal validity, and the real-world data provides a data point for external validity of the analysis.

### 5.1. Experiment Setup

For the experiments where we control the properties of the synthetic data, we derive bandit problems from the multi-label datasets yeast (Horton & Nakai, 1996) and mediamill (Snoek et al., 2006). The yeast dataset consists of 2,417 examples. Each example has 103 features and belongs to one or multiple of the 14 classes. We randomly split the dataset into two sets, 20% as the validation set to tune hyper-parameters and 80% as the test set to test the performance of different algorithms. For space reasons, the details and the results for the mediamill dataset are in the appendix. To simulate the bandit environment, we treat classes as arms and their labels (0 or 1) as rewards. For each round $t$, the bandit environment randomly samples an example from the dataset. Then the bandit algorithm selects an arm (class), and its reward (class label) is revealed to the algorithm. To construct context vectors for the arms, we generate 50-dimensional random Fourier features (Rahimi et al., 2007) from the outer product between the features of the example and the one-hot representation of the arms.
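The following sketch shows one way to build such a simulated bandit environment from a multi-label dataset: classes act as arms, their labels as rewards, and the per-(example, arm) context is a random Fourier feature map of the flattened outer product between the example features and the one-hot arm indicator. The class interface, feature dimensionality default, and kernel bandwidth are illustrative assumptions.

```python
import numpy as np

class MultiLabelBanditEnv:
    """Simulated linear-bandit environment from a multi-label dataset (classes = arms)."""

    def __init__(self, X, Y, n_rff=50, bandwidth=1.0, seed=0):
        self.X, self.Y = X, Y                          # X: n x p features, Y: n x K labels in {0, 1}
        self.rng = np.random.default_rng(seed)
        n_raw = X.shape[1] * Y.shape[1]                # dimension of the flattened outer product
        self.W = self.rng.normal(0.0, 1.0 / bandwidth, size=(n_raw, n_rff))  # RFF projection
        self.b = self.rng.uniform(0.0, 2 * np.pi, size=n_rff)
        self.n_rff = n_rff

    def _contexts(self, x, K):
        """Random Fourier features of outer(example features, one-hot arm), for every arm."""
        raw = np.zeros((K, x.size * K))
        for a in range(K):
            raw[a] = np.outer(x, np.eye(K)[a]).ravel()
        return np.sqrt(2.0 / self.n_rff) * np.cos(raw @ self.W + self.b)

    def new_round(self):
        """Sample a user (example) uniformly and return the K context vectors for this round."""
        self._i = self.rng.integers(len(self.X))
        return self._contexts(self.X[self._i], self.Y.shape[1])

    def pull(self, a):
        """Reward of the selected arm is the class label of the sampled example."""
        return float(self.Y[self._i, a])
```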
For the experiments on real-world data, we use data from the Yahoo! Today Module (Li et al., 2010), which contains user click logs from a news-article recommender system that was fielded for 10 days. Each day logged around 4.6 million events from a bandit that selected articles uniformly at random, which allows the use of the replay methodology (Li et al., 2010) for unbiased offline evaluation of new bandit algorithms. We use the data from the first day for hyperparameter selection and report the results on the data from the second day. The results using all the data are presented in the appendix. Each article and each user is represented by a 6-dimensional feature vector. Following (Li et al., 2010), we use the outer product between the user features and the article features as the context vector.

To calculate the fairness and reward regret, we determine the optimal fair policy as follows. For the MAB experiments, we use the empirical mean reward of each arm as the mean parameter of each arm. For the linear bandit experiments, we fit a linear least-squares model that maps the context vector of each arm to its reward. Note that the linearity assumption does not necessarily hold for any of the datasets, and that rewards are known to change over time for the Yahoo! data. This realism adds a robustness component to the evaluation.

We also add straightforward FairX variants of the $\epsilon$-greedy algorithm to the empirical analysis, which we call FairX-EG and FairX-LinEG. These algorithms are identical to their conventional $\epsilon$-greedy counterparts, except that they construct their policies according to $\pi_t(a) = \frac{f(\hat{\mu}_{t,a})}{\sum_{a'} f(\hat{\mu}_{t,a'})}$ or $\pi_t(a) = \frac{f(\hat{\mu}_t^\top x_{t,a})}{\sum_{a'} f(\hat{\mu}_t^\top x_{t,a'})}$, where $\hat{\mu}_t$ is the estimated parameter at round $t$. While $\epsilon$-greedy has weaker guarantees already in the conventional bandit setting, it is well known that it often performs well empirically, and we thus add it as a reference for the more sophisticated algorithms. In addition to the FairX algorithms, we also include the fairness regret of the conventional UCB, TS, LinUCB, and LinTS bandit algorithms.

We use merit functions of the form $f(\mu) = \exp(c\mu)$, since the choice of the constant $c$ provides a straightforward way to explore how the algorithms perform for steeper vs. flatter merit functions. In particular, the choice of $c$ varies the value of $L/\gamma$. For both FairX-UCB and FairX-LinUCB, we use projected gradient descent to solve the non-convex optimization problem in each round. We set the learning rate to 0.01 and the number of steps to 10. For FairX-LinUCB, we use a fixed $\beta_t = \beta$ for all rounds. In general, we use grid search to tune hyper-parameters to minimize fairness regret on the validation set and report the performance on the test set. We grid search $w_0$ for FairX-UCB and UCB; the prior variance and reward variance for FairX-TS, TS, FairX-LinTS, and LinTS; $\lambda$ and $\beta$ for FairX-LinUCB and LinUCB; $\epsilon$ for FairX-EG; and $\epsilon$ and the regularization parameter of the ridge regression for FairX-LinEG. We run each experiment 10 times and report the mean and the standard deviation.
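As an illustration of the FairX-EG baseline described above, here is a minimal sketch with the exponential merit function $f(\mu) = \exp(c\mu)$. Mixing uniform exploration with the merit-proportional policy is one plausible reading of "identical to the conventional $\epsilon$-greedy counterpart"; the exact variant used in the paper's experiments may differ.

```python
import numpy as np

def fairx_eg(env, K, T, c=10.0, epsilon=0.1, seed=0):
    """Sketch of FairX-EG: epsilon-greedy estimation with a merit-proportional policy,
    using the merit function f(mu) = exp(c * mu) as in the experiments."""
    rng = np.random.default_rng(seed)
    f = lambda mu: np.exp(c * mu)
    counts, sums = np.zeros(K), np.zeros(K)
    exposure = np.zeros(K)
    for t in range(T):
        mu_hat = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        merits = f(mu_hat)
        fair = merits / merits.sum()
        # Mix uniform exploration with the merit-proportional policy
        # (one plausible epsilon-greedy variant; a hypothetical detail).
        pi_t = epsilon / K + (1.0 - epsilon) * fair
        a_t = rng.choice(K, p=pi_t)
        r = env(a_t)
        counts[a_t] += 1
        sums[a_t] += r
        exposure += pi_t
    return exposure / T                 # average exposure per arm across rounds
```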
### 5.2. How Unfair are Conventional Bandit Algorithms?

We first verify that conventional bandit algorithms indeed violate merit-based fairness of exposure, and that our FairX algorithms, which are specifically designed to ensure fairness, indeed perform better. Figure 1 shows the average exposure that each arm received across rounds under the conventional UCB and TS algorithms for a typical run, and it compares this to the exposure allocation under the FairX-UCB and FairX-TS algorithms. The plots show the average exposure after 2,000 (left) and 2,000,000 (right) rounds, and they also include the optimally fair exposure allocation.

Figure 1. The average exposure distribution of different algorithms on the yeast dataset in the MAB setting after 2,000 rounds (left) and 2,000,000 rounds (right). (c = 4)

Already after 2,000 rounds, the conventional algorithms under-expose many of the arms. After 2,000,000 rounds, they focus virtually all exposure on arm 11, even though arm 12 has only slightly lower merit. Both FairX-UCB and FairX-TS track the optimal exposure allocation substantially better, and they converge to the optimally fair solution. This verifies that FairX algorithms like FairX-UCB and FairX-TS are indeed necessary to enforce merit-based fairness of exposure. The following sections further show that conventional bandit algorithms consistently suffer from much larger fairness regret than FairX algorithms across different datasets and merit functions, in both the MAB and the linear bandits setting.

### 5.3. How Do the FairX Algorithms Compare in the MAB Setting?

Figure 2. Fairness regret and reward regret of different MAB algorithms on the yeast dataset. (c = 10)

Figure 2 compares the performance of the bandit algorithms on the yeast dataset. The fairness regret converges roughly at the rate predicted by the bounds for FairX-UCB and FairX-TS, and FairX-EG shows similar behavior as well. In terms of reward regret, all FairX algorithms perform substantially better than their worst-case bounds suggest. Note that FairX-UCB does particularly well in terms of reward regret, but also note that part of this is due to violating fairness more than FairX-TS does. Specifically, in the FairX setting, an unfair policy can obtain better reward than the optimal fair policy, making a negative reward regret possible. While FairX-EG wins neither on fairness regret nor on reward regret, it nevertheless does surprisingly well given the simplicity of its exploration scheme. We conjecture that FairX-EG benefits from the implicit exploration that results from the stochasticity of the fair policies. Results for other merit functions are given in the appendix, and we find that the algorithms perform more similarly the flatter the merit function is.

### 5.4. How Do the FairX Algorithms Compare in the Linear Bandits Setting?

Figure 3. Fairness regret and reward regret of different linear bandit algorithms on the yeast dataset. (c = 3)

We show the fairness regret and the reward regret of the bandit algorithms on the yeast dataset in Figure 3. Results for other merit functions are in the appendix. Similar to the MAB setting, there is no clear winner among the three FairX algorithms. Again we see some trade-offs between reward regret and fairness regret, but all three FairX algorithms show qualitatively similar behavior. One difference is that the fairness regret no longer seems to converge. This can be explained by the misspecification of the linear model, as the estimated optimal policy that we use in our computation of regret may differ from the policy learned by the algorithms due to selection biases. Nevertheless, we conclude that the fairness achieved by the FairX algorithms is still highly preferable to that of the conventional bandit algorithms.

Figure 4. Experiment results on the Yahoo! dataset for both the MAB and the linear bandits setting (c = 10 for both settings).

### 5.5. How Do the FairX Algorithms Compare on the Real-World Data?
To validate the algorithms on a real-world application, Figure 4 provides the fairness and reward regret on the Yahoo! dataset for both the MAB and the linear bandits setting. Again, all three types of FairX algorithms perform comparably and have reward regret that converges quickly. Note that even the MAB setting now includes some misspecification of the model, since the reward distribution changes over time. This explains the behavior of the fairness regret. However, all FairX algorithms perform robustly in both settings, even though the real data does not exactly match the model assumptions.

## 6. Conclusions

We introduced a new bandit setting that formalizes merit-based fairness of exposure for both the stochastic MAB and the linear bandits setting. In particular, we define fairness regret and reward regret with respect to the optimal fair policy that fulfills merit-based fairness of exposure, develop UCB and Thompson sampling algorithms for both settings, and prove bounds on their fairness and reward regret. An empirical evaluation shows that these algorithms provide substantially better fairness of exposure to the items, and that they are effective across a range of settings.

## Acknowledgements

This research was supported in part by NSF Awards IIS-1901168 and IIS-2008139. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

## References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.

Abdollahpouri, H., Adomavicius, G., Burke, R., Guy, I., Jannach, D., Kamishima, T., Krasnodebski, J., and Pizzato, L. Multistakeholder recommendation: Survey and research directions. User Modeling and User-Adapted Interaction, 30(1):127–158, 2020.

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60–69. PMLR, 2018.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Bechavod, Y., Ligett, K., Roth, A., Waggoner, B., and Wu, S. Z. Equal opportunity in online classification with partial feedback. In Advances in Neural Information Processing Systems, pp. 8974–8984, 2019.

Beutel, A., Chen, J., Doshi, T., Qian, H., Wei, L., Wu, Y., Heldt, L., Zhao, Z., Hong, L., Chi, E. H., et al. Fairness in recommendation ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2212–2220, 2019.

Biega, A. J., Gummadi, K. P., and Weikum, G. Equity of attention: Amortizing individual fairness in rankings. In SIGIR, pp. 405–414. ACM, 2018. URL http://dl.acm.org/citation.cfm?id=3209978.

Blum, A. and Lykouris, T. Advancing subgroup fairness via sleeping experts. arXiv preprint arXiv:1909.08375, 2019.

Blum, A., Gunasekar, S., Lykouris, T., and Srebro, N. On preserving non-discrimination when combining expert advice. In Advances in Neural Information Processing Systems, pp. 8376–8387, 2018.

Brams, S. J. and Taylor, A. D. Fair Division: From Cake-Cutting to Dispute Resolution.
Cambridge University Press, 1996.

Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

Celis, L. E., Kapoor, S., Salehi, F., and Vishnoi, N. K. An algorithmic framework to control bias in bandit-based personalization. arXiv preprint arXiv:1802.08674, 2018.

Chen, Y., Cuellar, A., Luo, H., Modi, J., Nemlekar, H., and Nikolaidis, S. Fair contextual multi-armed bandits: Theory and experiments. In Conference on Uncertainty in Artificial Intelligence, pp. 181–190. PMLR, 2020.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

Claure, H., Chen, Y., Modi, J., Jung, M., and Nikolaidis, S. Multi-armed bandits with fairness constraints for distributing resources to human teammates. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 299–308, 2020.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Annual Conference on Learning Theory (COLT), 2008.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Epstein, R. and Robertson, R. E. The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences, 112(33):E4512–E4521, 2015.

Geyik, S. C., Ambler, S., and Kenthapadi, K. Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2221–2231, 2019.

Gillen, S., Jung, C., Kearns, M., and Roth, A. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems, pp. 2600–2609, 2018.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323, 2016.

Heidari, H. and Krause, A. Preventing disparate treatment in sequential decision making. In IJCAI, pp. 2248–2254, 2018.

Horton, P. and Nakai, K. A probabilistic classification system for predicting the cellular localization sites of proteins. In ISMB, volume 4, pp. 109–115, 1996.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Fair algorithms for infinite and contextual bandits. arXiv preprint arXiv:1610.09559, 2016a.

Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems 29, pp. 325–333, 2016b.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Meritocratic fairness for infinite and contextual bandits. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 158–163, 2018.

Kleinberg, J. M., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference, ITCS, volume 67 of LIPIcs, pp. 43:1–43:23, 2017.

Lai, T. L. and Robbins, H.
Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Li, F., Liu, J., and Ji, B. Combinatorial sleeping bandits with fairness constraints. IEEE Transactions on Network Science and Engineering, 2019.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

Liu, Y., Radanovic, G., Dimitrakakis, C., Mandal, D., and Parkes, D. C. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875, 2017.

Mehrotra, R., McInerney, J., Bouchard, H., Lalmas, M., and Diaz, F. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2243–2251, 2018.

Noble, S. U. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Patil, V., Ghalme, G., Nair, V., and Narahari, Y. Achieving fairness in the stochastic multi-armed bandit problem. In AAAI, pp. 5379–5386, 2020.

Procaccia, A. D. Cake cutting: Not just child's play. Communications of the ACM, 56(7):78–87, 2013.

Rahimi, A., Recht, B., et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 3, pp. 5, 2007.

Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Schumann, C., Lang, Z., Mattei, N., and Dickerson, J. P. Group fairness in bandit arm selection. arXiv preprint arXiv:1912.03802, 2019.

Singh, A. and Joachims, T. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2219–2228, 2018.

Singh, A. and Joachims, T. Policy learning for fairness in ranking. In Advances in Neural Information Processing Systems, pp. 5426–5436, 2019.

Slivkins, A. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.

Snoek, C. G., Worring, M., Van Gemert, J. C., Geusebroek, J.-M., and Smeulders, A. W. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th ACM International Conference on Multimedia, pp. 421–430, 2006.

Steinhaus, H. The problem of fair division. Econometrica, 16:101–104, 1948.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Wen, M., Bastani, O., and Topcu, U. Fairness with dynamics. arXiv preprint arXiv:1901.08568, 2019.