# Achieving Counterfactual Fairness for Causal Bandit

Wen Huang, Lu Zhang, Xintao Wu
University of Arkansas
{wenhuang, lz006, xintaowu}@uark.edu

## Abstract

In online recommendation, customers arrive in a sequential and stochastic manner from an underlying distribution, and the online decision model recommends a chosen item for each arriving individual based on some strategy. We study how to recommend an item at each step to maximize the expected reward while achieving user-side fairness for customers, i.e., customers who share similar profiles will receive a similar reward regardless of their sensitive attributes and the items being recommended. By incorporating causal inference into bandits and adopting soft intervention to model the arm selection strategy, we first propose the d-separation based UCB algorithm (D-UCB) to explore how the d-separation set can reduce the amount of exploration needed to achieve low cumulative regret. Based on that, we then propose the fair causal bandit (F-UCB) for achieving counterfactual individual fairness. Both theoretical analysis and empirical evaluation demonstrate the effectiveness of our algorithms.

## Introduction

Fairness in machine learning has been a research subject with rapid growth recently. Although there are many works focusing on fairness in personalized recommendation (Celis et al. 2018; Liu et al. 2017; Zhu, Hu, and Caverlee 2018), how to achieve individual fairness in bandit recommendation remains a challenging task. We focus on online recommendation, e.g., customers being recommended items, and consider the setting where customers arrive in a sequential and stochastic manner from an underlying distribution and the online decision model recommends a chosen item for each arriving individual based on some strategy. The challenge is how to choose the arm at each step to maximize the expected reward while achieving user-side fairness for customers, i.e., customers who share similar profiles will receive similar rewards regardless of their sensitive attributes and the items being recommended.

Recently, researchers have started taking fairness and discrimination into consideration in the design of personalized recommendation algorithms (Celis et al. 2018; Liu et al. 2017; Zhu, Hu, and Caverlee 2018; Joseph et al. 2016, 2018; Jabbari et al. 2017; Burke 2017; Burke, Sonboli, and Ordonez-Gauger 2018; Ekstrand et al. 2018). Among them, Joseph et al. (2016) was the first to study fairness in classic and contextual bandits. It defined fairness with respect to one-step rewards and introduced a notion of meritocratic fairness, i.e., the algorithm should never place higher selection probability on a less qualified arm (e.g., job applicant) than on a more qualified arm. Follow-up works along this direction include (Joseph et al. 2018) for infinite and contextual bandits, (Jabbari et al. 2017) for reinforcement learning, and (Liu et al. 2017) for the simple stochastic bandit setting with calibration based fairness. However, all existing works require some fairness constraint on arms at every round of the learning process, which is different from our user-side fairness setting. One recent work (Huang et al.
2020) focused on achieving user-side fairness in the bandit setting, but it only proposed a heuristic way to achieve correlation-based group-level fairness and did not incorporate causal inference and counterfactual fairness into bandits.

By incorporating causal inference into bandits, we first propose the d-separation based upper confidence bound bandit algorithm (D-UCB), based on which we then propose the fair causal bandit (F-UCB) for achieving counterfactual individual fairness. Our work is inspired by recent research on causal bandits (Lattimore, Lattimore, and Reid 2016; Sen et al. 2017; Lee and Bareinboim 2018, 2019; Lu et al. 2020), which studied how to learn optimal interventions sequentially by representing the relationship between interventions and outcomes as a causal graph along with associated conditional distributions. For example, Lu et al. (2020) developed the causal UCB (C-UCB) that exploits the causal relationship between the reward and its direct parents. Different from previous works, our algorithms adopt soft intervention (Correa and Bareinboim 2020) to model the arm selection strategy and leverage a d-separation set identified from the underlying causal graph, thus greatly reducing the amount of exploration needed to achieve low cumulative regret. We show that our D-UCB achieves an $O(\sqrt{|\mathbf{W}|T})$ regret bound, where $T$ is the number of iterations and $\mathbf{W}$ is a set that d-separates the arm/user features and the reward $R$ in the causal graph ($|\mathbf{W}|$ denotes the size of $\mathbf{W}$'s domain space). As a comparison, the C-UCB achieves $O(\sqrt{|\mathrm{Pa}(R)|T})$, where $\mathrm{Pa}(R)$, the set of parent variables of $R$, is a trivial d-separation set. In our F-UCB, we further achieve counterfactual fairness in each round of exploration. Counterfactual fairness requires that the expected reward an individual would receive remain the same if the individual's sensitive attribute were changed to its counterpart. The introduced counterfactual reward combines two interventions: a soft intervention on the arm selection and a hard intervention on the sensitive attribute. F-UCB achieves counterfactual fairness in online recommendation by picking arms at each round from a subset of arms in which all the arms satisfy the counterfactual fairness constraint. Our theoretical analysis shows that F-UCB achieves an $O\!\big(\frac{\sqrt{|\mathbf{W}|T}}{\tau-\Delta_{\pi_0}}\big)$ cumulative regret bound, where $\tau$ is the fairness threshold and $\Delta_{\pi_0}$ denotes the maximum fairness discrepancy of a safe policy $\pi_0$, i.e., a policy that is fair across all rounds.

We conduct experiments on the Email Campaign data (Lu et al. 2020), whose results show the benefit of using the d-separation set from the causal graph. Our D-UCB incurs less regret than two baselines: the classic UCB, which does not leverage any causal information, and the C-UCB. In addition, we validate numerically that our F-UCB maintains good performance while satisfying counterfactual individual fairness in each round. In contrast, the baselines fail to achieve fairness, with significant percentages of recommendations violating the fairness constraint. We further conduct experiments on the Adult-Video dataset and compare our F-UCB with another user-side fair bandit algorithm, Fair-LinUCB (Huang et al. 2020). The results demonstrate the advantage of our causality-based fair bandit algorithm in achieving individual-level fairness in online recommendation.

## Background

Our work is based on structural causal models (Pearl 2009), which describe the causal mechanisms of a system as a set of structural equations.
**Definition 1 (Structural Causal Model (SCM) (Pearl 2009)).** A causal model $\mathcal{M}$ is a triple $\mathcal{M} = \langle \mathbf{U}, \mathbf{V}, \mathbf{F} \rangle$ where 1) $\mathbf{U}$ is a set of hidden contextual variables that are determined by factors outside the model; 2) $\mathbf{V}$ is a set of observed variables that are determined by variables in $\mathbf{U} \cup \mathbf{V}$; 3) $\mathbf{F}$ is a set of equations mapping from $\mathbf{U} \cup \mathbf{V}$ to $\mathbf{V}$. Specifically, for each $V \in \mathbf{V}$, there is an equation $f_V \in \mathbf{F}$ mapping from $\mathbf{U} \cup (\mathbf{V} \setminus V)$ to $V$, i.e., $v = f_V(\mathrm{Pa}(V), u_V)$, where $\mathrm{Pa}(V)$ is a realization of a set of observed variables called the parents of $V$, and $u_V$ is a realization of a set of hidden variables.

Quantitatively measuring causal effects is facilitated by the do-operator (Pearl 2009), which simulates physical interventions that force some variable to take certain values. Formally, the intervention that sets the value of $X$ to $x$ is denoted by $do(x)$. In an SCM, the intervention $do(x)$ is defined as the substitution of equation $x = f_X(\mathrm{Pa}(X), u_X)$ with the constant $X = x$. For an observed variable $Y$ other than $X$, its variant under intervention $do(x)$ is denoted by $Y(x)$. The distribution of $Y(x)$, also referred to as the post-intervention distribution of $Y$, is denoted by $P(Y(x))$.

The soft intervention (also known as the conditional action or policy intervention) extends the hard intervention such that it forces variable $X$ to take a new functional relationship in response to some other variables (Correa and Bareinboim 2020). Denoting the soft intervention by $\pi$, the post-interventional distribution of $X$ given its parents is denoted by $P_\pi(X \mid \mathrm{Pa}(X))$. More generally, the new function could receive as inputs variables other than the original parents $\mathrm{Pa}(X)$, as long as they are not descendants of $X$. The distribution of $Y$ after performing the soft intervention is denoted by $P(Y(\pi))$.

With interventions, the counterfactual effect measures the causal effect when the intervention is performed conditioning only on certain individuals or groups specified by a subset of observed variables $\mathbf{O} = \mathbf{o}$. Given a context $\mathbf{O} = \mathbf{o}$, the counterfactual effect of the value change of $X$ from $x_1$ to $x_2$ on $Y$ is given by $E[Y(x_2) \mid \mathbf{o}] - E[Y(x_1) \mid \mathbf{o}]$.

Each causal model $\mathcal{M}$ is associated with a causal graph $\mathcal{G} = \langle \mathbf{V}, \mathbf{E} \rangle$, where $\mathbf{V}$ is a set of nodes and $\mathbf{E}$ is a set of directed edges. Each node in $\mathcal{G}$ corresponds to a variable $V$ in $\mathcal{M}$. Each edge, denoted by an arrow $\rightarrow$, points from each member of $\mathrm{Pa}(V)$ toward $V$ to represent the direct causal relationship specified by equation $f_V(\cdot)$. The well-known d-separation criterion (Spirtes, Glymour, and Scheines 2000) connects the causal graph with conditional independence.

**Definition 2 (d-Separation (Spirtes, Glymour, and Scheines 2000)).** Consider a causal graph $\mathcal{G}$. $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{W}$ are disjoint sets of attributes. $\mathbf{X}$ and $\mathbf{Y}$ are d-separated by $\mathbf{W}$ in $\mathcal{G}$ if and only if $\mathbf{W}$ blocks all paths from every node in $\mathbf{X}$ to every node in $\mathbf{Y}$. A path $p$ is said to be blocked by $\mathbf{W}$ if and only if: 1) $p$ contains a chain $i \rightarrow m \rightarrow j$ or a fork $i \leftarrow m \rightarrow j$ such that the middle node $m$ is in $\mathbf{W}$, or 2) $p$ contains a collider $i \rightarrow m \leftarrow j$ such that the middle node $m$ is not in $\mathbf{W}$ and no descendant of $m$ is in $\mathbf{W}$.
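To make the d-separation criterion concrete, the sketch below checks whether two node sets are d-separated in a DAG using the standard ancestral-moral-graph reduction, which is equivalent to the path-blocking statement in Definition 2. The graph encoding and the toy variable names are illustrative assumptions and are not code from the paper.

```python
# Minimal d-separation check via the ancestral moral graph reduction
# (equivalent to Definition 2). Graphs are dicts: node -> list of parents.
from collections import deque

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, X, Y, W):
    """True iff X and Y are d-separated by W in the DAG given by `parents`."""
    keep = ancestors(parents, set(X) | set(Y) | set(W))
    # Moralize the ancestral subgraph: drop edge directions and marry co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # Remove W and test reachability from X to Y with BFS.
    blocked = set(W)
    seen = set(X) - blocked
    frontier = deque(seen)
    while frontier:
        v = frontier.popleft()
        if v in Y:
            return False
        for u in adj[v] - blocked - seen:
            seen.add(u); frontier.append(u)
    return True

# Toy graph in the spirit of Figure 1: X -> A -> I -> R and X -> R.
parents = {"A": ["X"], "I": ["A"], "R": ["I", "X"]}
print(d_separated(parents, {"A"}, {"R"}, {"I", "X"}))  # True: {I, X} d-separates A from R
```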
## Achieving Counterfactual Fairness in Bandit

In this section, we present our D-UCB and F-UCB bandit algorithms. Online recommendation is commonly modeled as a contextual multi-armed bandit problem, where each customer is a bandit player, each potential item $a$ has a feature vector $\mathbf{a} \in \mathcal{A}$, and there are a total of $k$ items (we use $\mathbf{a}$ to represent the feature vector of item/arm $a$; the two may be used interchangeably when the context is unambiguous). For each customer arriving at time $t \in [T]$ with feature vector $\mathbf{x}_t \in \mathcal{X}$, the algorithm recommends an item with features $\mathbf{a}$ based on the vector $\mathbf{x}_{t,a}$, which represents the concatenation of the user and item feature vectors $(\mathbf{x}_t, \mathbf{a})$, observes the reward $r_t$ (e.g., purchase), and then updates its recommendation strategy with the new observation. There may also exist some intermediate features (denoted by $\mathbf{I}$) that are affected by the recommended item and influence the reward, such as user feedback about relevance and quality.

### Modeling Arm Selection via Soft Intervention

In bandit algorithms, we often choose an arm that maximizes the expected conditional reward, $a_t = \arg\max_a E[R \mid \mathbf{x}_{t,a}]$. The arm selection strategy can be implemented as a functional mapping from $\mathcal{X}$ to $\mathcal{A}$, and after each round the parameters of the function are updated with the newest observation tuple. We advocate the use of the causal graph and soft interventions as a general representation of any bandit algorithm. We consider the causal graph $\mathcal{G}$, e.g., as shown in Figure 1, where $\mathbf{A}$ represents the arm features, $\mathbf{X}$ represents the user features, $R$ represents the reward, and $\mathbf{I}$ represents some intermediate features between $\mathbf{A}$ and $R$.

*Figure 1: Graph structure for contextual bandit. Node $\pi$ denotes the soft intervention conducted on arm selection.*

Since the arm selection process can be regarded as the structural equation from $\mathbf{X}$ to $\mathbf{A}$, we treat $\mathbf{X}$ as $\mathbf{A}$'s parents. The reward $R$ is influenced by the arm selection, the contextual user features, and the intermediate features, so all three factors are parents of $R$. In this setting, it is natural to treat an update of the arm selection policy as a soft intervention $\pi$ performed on the arm features $\mathbf{A}$. Each time an arm selection strategy is learned, the corresponding soft intervention is considered to be conducted on $\mathbf{A}$, while the user features $\mathbf{X}$ and all other relationships in the causal graph remain unchanged.

There are several advantages of modeling arm selection learning as a soft intervention. First, it can capture the complex causal relationships between context and reward without introducing strong assumptions, e.g., a linear reward function or Gaussian/Bernoulli prior distributions, which often do not hold in practice. Second, it is flexible in terms of functional form. For example, it can be of any function type, it can be independent of or dependent upon the target variable's existing parents, and it can also include new variables that are not the target variable's parents. Third, the soft intervention can be either deterministic, i.e., fixing the target variable to a particular constant, or stochastic, i.e., assigning to the target variable a distribution over multiple states. As a result, most existing and predominant bandit algorithms can be described using this framework. Moreover, based on this framework we can propose new bandit algorithms by adopting different soft interventions.
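As a concrete illustration of this modeling view, the sketch below builds a toy SCM matching the topology of Figure 1 and replaces the arm-selection equation with a policy function (a soft intervention), then estimates the post-interventional reward by Monte Carlo. All structural equations and distributions here are invented for illustration and are not the paper's simulation setup.

```python
# Toy SCM in the spirit of Figure 1: X -> A -> I -> R and X -> R.
# A soft intervention replaces the structural equation of A with a policy pi(x).
# All equations and distributions below are illustrative assumptions.
import random

def estimate_reward(policy, x=None, n=10_000, seed=0):
    """Monte Carlo estimate of E[R(pi) | x]; if x is None, X is sampled."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x_val = rng.randint(0, 1) if x is None else x        # user feature X
        a_val = policy(x_val)                                  # soft intervention on A
        i_val = a_val ^ (rng.random() < 0.1)                   # intermediate I = f_I(A, U_I)
        r_val = 0.6 * i_val + 0.3 * x_val + 0.1 * rng.random() # reward R = f_R(I, X, U_R)
        total += r_val
    return total / n

# Two candidate deterministic policies (deterministic soft interventions).
always_arm_1 = lambda x: 1
match_user = lambda x: x

print("E[R(pi)] under always_arm_1:", round(estimate_reward(always_arm_1), 3))
print("E[R(pi)] under match_user:  ", round(estimate_reward(match_user), 3))
```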
Formally, let $\Pi_t$ be the arm selection policy space at time $t \in [T]$, and $\pi \in \Pi_t$ a specific policy. The implementation of policy $\pi$ is modeled as a soft intervention. Denoting by $R(\pi)$ the post-interventional value of the reward after performing the intervention, the expected reward under policy $\pi$, denoted by $\mu_\pi$, is given by $E[R(\pi) \mid \mathbf{x}_t]$. According to the $\sigma$-calculus (Correa and Bareinboim 2020), it can be further decomposed as follows:

$$\mu_\pi = E[R(\pi) \mid \mathbf{x}_t] = \sum_{a} P_\pi(a \mid \mathbf{x}_t)\, E[R(a) \mid \mathbf{x}_t] = E_{a \sim \pi}\big[E[R(a) \mid \mathbf{x}_t]\big] \quad (1)$$

where $P_\pi(a \mid \mathbf{x}_t)$ is the distribution defined by policy $\pi$. As can be seen, once a policy is given, the estimation of $\mu_\pi$ depends on the estimation of $E[R(a) \mid \mathbf{x}_t]$ (denoted by $\mu_a$). Note that $\mu_a$ represents the expected reward when selecting arm $a$, which is still a post-intervention quantity and needs to be expressed using observational distributions in order to be computable. In the following, we propose a d-separation based estimation method, based on which we develop our D-UCB algorithm. For ease of presentation, our discussions in the following subsections assume deterministic policies; in principle the above framework applies to stochastic policies as well.

### D-UCB Algorithm

Let $\mathbf{W} \subseteq \mathbf{A} \cup \mathbf{X} \cup \mathbf{I}$ be a subset of nodes that d-separates the reward $R$ from the features $(\mathbf{A} \cup \mathbf{X}) \setminus \mathbf{W}$ in the causal graph. Such a set always exists since $\mathbf{A} \cup \mathbf{X}$ and $\mathrm{Pa}(R)$ are trivial solutions. Let $\mathbf{Z} = \mathbf{W} \setminus (\mathbf{A} \cup \mathbf{X})$. Using the do-calculus (Pearl 2009), we can decompose $\mu_a$ as follows:

$$\begin{aligned} \mu_a &= E[R \mid do(a), \mathbf{x}_t] = \sum_{\mathbf{z}} E[R \mid \mathbf{z}, do(a), \mathbf{x}_t]\, P(\mathbf{z} \mid do(a), \mathbf{x}_t) \\ &= \sum_{\mathbf{z}} E[R \mid \mathbf{z}, a, \mathbf{x}_t]\, P(\mathbf{z} \mid a, \mathbf{x}_t) = \sum_{\mathbf{z}} E[R \mid \mathbf{z}, a, \mathbf{x}_t]\, P(\mathbf{z} \mid \mathbf{x}_{t,a}) \\ &= \sum_{\mathbf{z}} E[R \mid \mathbf{w}]\, P(\mathbf{z} \mid \mathbf{x}_{t,a}) \end{aligned} \quad (2)$$

where the last step is due to d-separation. Similar to (Lu et al. 2020), we assume that the distribution $P(\mathbf{z} \mid \mathbf{x}_{t,a})$ is known from the prior knowledge used to build the causal graph. Then, by using a sample mean estimator (denoted by $\hat{\mu}_{\mathbf{w}}(t)$) to estimate $E[R \mid \mathbf{w}]$ based on the observational data up to time $t$, the estimated reward mean is given by

$$\hat{\mu}_\pi(t) = E_{a \sim \pi}\Big[\sum_{\mathbf{z}} \hat{\mu}_{\mathbf{w}}(t)\, P(\mathbf{z} \mid \mathbf{x}_{t,a})\Big] \quad (3)$$

Subsequently, we propose a causal bandit algorithm based on d-separation, called D-UCB. Since there is always uncertainty about the reward under a specific policy, in order to balance exploration and exploitation we follow the rule of optimism in the face of uncertainty (OFU) in the D-UCB algorithm. The policy taken at time $t$ is the one with the highest upper confidence bound on the expected reward, which is given by

$$\pi_t = \arg\max_{\pi \in \Pi_t} E_{a \sim \pi}[\mathrm{UCB}_a(t)] \quad (4)$$

$$\mathrm{UCB}_a(t) = \sum_{\mathbf{z}} \mathrm{UCB}_{\mathbf{w}}(t)\, P(\mathbf{z} \mid \mathbf{x}_{t,a}) \quad (5)$$

Since $\hat{\mu}_{\mathbf{w}}(t)$ is an unbiased estimator and the error term of the reward is assumed to be sub-Gaussian, the $1 - \delta$ upper confidence bound of $\mu_{\mathbf{w}}(t)$ is given by

$$\mathrm{UCB}_{\mathbf{w}}(t) = \hat{\mu}_{\mathbf{w}}(t) + \sqrt{\frac{2 \log(1/\delta)}{1 \vee N_{\mathbf{w}}(t)}} \quad (6)$$

After taking the policy, we obtain new observations $r_t$ and $\mathbf{w}_t$. The sample mean estimator is then updated accordingly:

$$\hat{\mu}_{\mathbf{w}}(t) = \frac{1}{T_{\mathbf{w}}(t)} \sum_{k=1}^{t} r_k \mathbf{1}_{\mathbf{w}_k = \mathbf{w}}, \quad \text{where } T_{\mathbf{w}}(t) = \sum_{k=1}^{t} \mathbf{1}_{\mathbf{w}_k = \mathbf{w}} \quad (7)$$

**Algorithm 1: D-UCB (Causal Bandit based on d-separation)**
1. Input: policy space $\Pi$, confidence level parameter $\delta$, original causal graph $\mathcal{G}$ with domain knowledge.
2. Find the d-separation set $\mathbf{W}$ with the minimum subset $\mathbf{Z}$ in terms of domain space.
3. for $t = 1, 2, 3, \ldots, T$ do
4. Obtain the optimal policy $\pi_t$ following Eq. (4).
5. Take action $a_t \sim \pi_t$ and observe a real-valued payoff $r_t$ and a d-separation set value $\mathbf{w}_t$.
6. Update $\hat{\mu}_{\mathbf{w}}(t)$ for all $\mathbf{w} \in \mathbf{W}$ following Eq. (7).
7. end for
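To make Eqs. (4)–(7) concrete for deterministic policies (where each policy plays a single arm), the sketch below tracks a sample mean and count per configuration $\mathbf{w}$ of the d-separation set, scores each arm by Eqs. (5)–(6), and picks the maximizer as in Eq. (4). The class name, the `w_of` projection, and the dictionary of known conditionals $P(\mathbf{z} \mid \mathbf{x}_{t,a})$ are illustrative assumptions, not the authors' implementation.

```python
# Illustrative D-UCB step for deterministic policies.
import math
from collections import defaultdict

class DUCB:
    def __init__(self, p_z_given_xa, w_of, delta=0.01):
        # p_z_given_xa: dict (x, a) -> {z: P(z | x_{t,a})}, assumed known from the causal model.
        # w_of(x, a, z): projection of an observation onto the variables in the d-separation set W.
        self.p, self.w_of, self.delta = p_z_given_xa, w_of, delta
        self.sum_r = defaultdict(float)   # running reward sums per configuration w
        self.count = defaultdict(int)     # N_w(t)

    def _ucb_w(self, w):                  # Eq. (6)
        n = max(1, self.count[w])         # the 1 v N_w(t) guard for unvisited configurations
        return self.sum_r[w] / n + math.sqrt(2 * math.log(1 / self.delta) / n)

    def select(self, x, arms):            # Eqs. (4)-(5); deterministic policies = single arms
        def ucb_a(a):
            return sum(p_z * self._ucb_w(self.w_of(x, a, z))
                       for z, p_z in self.p[(x, a)].items())
        return max(arms, key=ucb_a)

    def update(self, x, a, z, r):         # Eq. (7): record the observed w_t and reward r_t
        w = self.w_of(x, a, z)
        self.sum_r[w] += r
        self.count[w] += 1

# Example wiring with W = {Z}, so w_of ignores x and a; the probabilities are made up.
p = {("x0", "a0"): {"z0": 0.8, "z1": 0.2}, ("x0", "a1"): {"z0": 0.3, "z1": 0.7}}
agent = DUCB(p, w_of=lambda x, a, z: z)
arm = agent.select("x0", ["a0", "a1"])
agent.update("x0", arm, "z1", r=1.0)
```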
We hypothesize that the choice of the d-separation set $\mathbf{W}$ significantly affects the regret of D-UCB. To this end, we analyze the upper bound of the cumulative regret $R_T$. The following theorem shows that the regret upper bound depends on the domain size of the d-separation set $\mathbf{W}$.

**Theorem 1 (Regret bound of D-UCB).** Given a causal graph $\mathcal{G}$, with probability at least $1 - 2\delta T|\mathbf{W}| \exp\!\big(-\frac{|\mathbf{W}| \log^3(T)}{32 \log(1/\delta)}\big)$, the regret of D-UCB is bounded by

$$R_T \le \sqrt{|\mathbf{W}|\, T \log(T)}\, \log(T) + \sqrt{32\, |\mathbf{W}|\, T \log(1/\delta)}$$

where $|\mathbf{W}|$ is the size of the domain space of the set $\mathbf{W}$.

*Proof Sketch.* The proof of Theorem 1 follows the general regret analysis framework of the UCB algorithm (Auer, Cesa-Bianchi, and Fischer 2002). By leveraging the d-separation decomposition of the expected reward, we split the cumulative regret into two terms and bound them separately. Since there are fewer terms to traverse when summing up and bounding the uncertainty caused by the exploration-exploitation strategy, D-UCB is expected to incur lower regret than the original UCB algorithm and the C-UCB algorithm. By setting $\delta = 1/T^2$, it is easy to show that the D-UCB algorithm achieves an $O(\sqrt{|\mathbf{W}| T})$ regret bound. (Due to space limits, we only include proof sketches; refer to the appendix in (Huang, Zhang, and Wu 2021) for proof details of all theorems.)

Algorithm 1 shows the pseudo code of D-UCB. In Line 2, according to Theorem 1, we first determine the d-separation set $\mathbf{W}$ with the minimum domain space. In Line 4, we leverage the causal graph and the observational data up to time $t$ to find the optimal policy $\pi_t = \arg\max_{\pi \in \Pi_t} E_{a \sim \pi}[\mathrm{UCB}_a(t)]$. In Line 5, we take action $a_t \sim \pi_t$ and observe a real-valued payoff $r_t$, and in Line 6, we update the observational data with $a_t$ and $r_t$.

**Remark.** Determining the minimum d-separation set has been well studied in causal inference (Geiger, Verma, and Pearl 1990). We leverage the algorithm for finding a minimum cost separator (Tian, Paz, and Pearl 1998) to identify $\mathbf{W}$. The discovery procedure usually requires complete knowledge of the causal graph. However, in the situation where the d-separation set to be used as well as the associated conditional distributions $P(\mathbf{z} \mid \mathbf{x}_{t,a})$ are given, the remaining part of the algorithm works without the causal graph information. Moreover, the assumption of knowing $P(\mathbf{z} \mid \mathbf{x}_{t,a})$ follows recent research on causal bandits. Generalizing the causal bandit framework to a partially or completely unknown causal graph setting is a much more challenging but important task. A recent work (Lu, Meisami, and Tewari 2021) generalizes causal bandit algorithms based on causal tree/forest structures.

To better illustrate the long-term regret of the causal bandit algorithm, suppose the set $\mathbf{A} \cup \mathbf{X} \cup \mathbf{I}$ includes $N$ variables that are related to the reward and the d-separation set $\mathbf{W}$ includes $n$ variables. If each variable takes on 2 distinct values, the number of deterministic policies can be as large as $2^N$ for a traditional bandit algorithm, leading to an $O(\sqrt{2^N T})$ regret bound. On the other hand, our proposed causal algorithms exploit the knowledge of the d-separation set $\mathbf{W}$ and achieve $O(\sqrt{2^n T})$ regret, which implies a significant reduction in the regret bound when $n \ll N$. If the number of arm candidates is much smaller than the domain space of $\mathbf{W}$, our bound analysis can be easily adjusted to this case by using the subspace of $\mathbf{W}$ that corresponds to the arm candidates.
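To illustrate Line 2 of Algorithm 1 (choosing a d-separation set with a small domain), the brute-force search below enumerates candidate subsets and keeps the one with the smallest domain size, reusing the `d_separated` helper sketched earlier. The exhaustive enumeration and the cost definition are illustrative simplifications, not the minimum-cost separator algorithm of Tian, Paz, and Pearl (1998).

```python
# Brute-force illustration of picking a d-separation set W with the smallest
# domain size |W| (product of variable cardinalities). Exponential in the number
# of candidate variables, so only suitable as a conceptual check on small graphs.
from itertools import combinations
from math import prod

def min_domain_dsep(parents, cardinality, features, reward, candidates):
    """Return (W, |W|) minimizing domain size over subsets that d-separate features from reward."""
    best, best_size = None, float("inf")
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            w = set(subset)
            # W must d-separate the reward from the remaining arm/user features;
            # note W = A u X always qualifies vacuously (the trivial solution).
            if d_separated(parents, set(features) - w, {reward}, w):
                size = prod(cardinality[v] for v in w)
                if size < best_size:
                    best, best_size = w, size
    return best, best_size

# Toy graph: X -> A -> I -> R and X -> R; all variables binary except I (3 values).
parents = {"A": ["X"], "I": ["A"], "R": ["I", "X"]}
card = {"X": 2, "A": 2, "I": 3, "R": 2}
print(min_domain_dsep(parents, card, features={"A", "X"}, reward="R",
                      candidates=["A", "X", "I"]))
```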
### Counterfactual Fairness

Now we are ready to present our fair UCB algorithm. Rather than focusing on the fairness of the items being recommended (e.g., items produced by small companies having similar chances of being recommended as those from big companies), we focus on user-side fairness in terms of reward, i.e., individual users who share similar profiles will receive similar rewards regardless of their sensitive attributes and the items being recommended, so that they benefit from the recommendations equally. To this end, we adopt counterfactual fairness as our fairness notion. Consider a sensitive attribute $S \in \mathbf{X}$ in the user's profile. Counterfactual fairness concerns the expected reward an individual would receive assuming that this individual were in a different sensitive group. In our context, this can be formulated as the counterfactual reward $E[R(\pi, s') \mid \mathbf{x}_t]$, where two interventions are performed simultaneously: the soft intervention $\pi$ on the arm selection and the hard intervention $do(s')$ on the sensitive attribute $S$, while conditioning on the individual features $\mathbf{x}_t$. Denoting by $\Delta_\pi = E[R(\pi, s^+) \mid \mathbf{x}_t] - E[R(\pi, s^-) \mid \mathbf{x}_t]$ the counterfactual effect of $S$ on the reward, a counterfactually fair policy is defined as follows.

**Definition 3.** A policy $\pi$ is counterfactually fair for an arriving individual if $\Delta_\pi = 0$. The policy is $\tau$-counterfactually fair if $|\Delta_\pi| \le \tau$, where $\tau$ is a predefined fairness threshold.

To achieve counterfactual fairness in online recommendation, at round $t$ we can only pick arms from a subset of arms for the customer (with features $\mathbf{x}_t$) in which all the arms satisfy the counterfactual fairness constraint. The fair policy subspace $\Phi_t \subseteq \Pi_t$ is thus given by $\Phi_t = \{\pi : |\Delta_\pi| \le \tau\}$. However, the counterfactual fairness measure is a causal quantity that is not necessarily identifiable from observational data without knowledge of the structural equations (Shpitser and Pearl 2008). Wu, Zhang, and Wu (2019) studied the criterion for identification of counterfactual fairness given a causal graph and provided bounds for unidentifiable counterfactual fairness. According to Proposition 1 in (Wu, Zhang, and Wu 2019), our counterfactual fairness measure is identifiable if $\mathbf{X} \setminus \{S\}$ are not descendants of $S$. In this case, similar to Eq. (1), we have $E[R(\pi, s') \mid \mathbf{x}_t] = E_{a \sim \pi}[E[R(a, s') \mid \mathbf{x}_t]]$ where $s' \in \{s^+, s^-\}$. Similar to Eq. (2), we denote $\mu_{a,s'} = E[R(a, s') \mid \mathbf{x}_t]$, which can be decomposed using the do-calculus as

$$\mu_{a,s'} = E[R(a, s') \mid \mathbf{x}_t] = \sum_{\mathbf{z}} E[R \mid s', \mathbf{w} \setminus s_t]\, P(\mathbf{z} \mid s', \mathbf{x}_{t,a} \setminus s_t) \quad (8)$$

where $\mathbf{w} \setminus s_t$ and $\mathbf{x}_{t,a} \setminus s_t$ represent all values in $\mathbf{w}$ and $\mathbf{x}_{t,a}$ except $s_t$, respectively. Note that $s'$ is the sensitive attribute value in the counterfactual world, which could be different from the observed value $s_t$. The estimated counterfactual reward can be calculated as

$$\hat{\mu}_{a,s'}(t) = \sum_{\mathbf{z}} \hat{\mu}_{\mathbf{w}'}(t)\, P(\mathbf{z} \mid s', \mathbf{x}_{t,a} \setminus s_t)$$

where $\mathbf{w}' = \{s', \mathbf{w} \setminus s_t\}$ and $\hat{\mu}_{\mathbf{w}'}(t)$ is again the sample mean estimator based on the observational data up to time $t$. The estimated counterfactual discrepancy of a policy is

$$\hat{\Delta}_\pi(t) = E_{a \sim \pi}[\hat{\mu}_{a,s^+}(t)] - E_{a \sim \pi}[\hat{\mu}_{a,s^-}(t)] \quad (9)$$

In the case where $\mu_{a,s'}$ is not identifiable, based on Proposition 2 in (Wu, Zhang, and Wu 2019) we derive the lower and upper bounds of $\mu_{a,s'}$ as presented in the following theorem.

**Theorem 2.** Given a causal graph as shown in Figure 1, if there exists a non-empty set $\mathbf{B} \subseteq \mathbf{X} \setminus \{S\}$ whose members are descendants of $S$, then $\mu_{a,s'} = E[R(a, s') \mid \mathbf{x}_t]$ is bounded by

$$\sum_{\mathbf{z}} \min_{\mathbf{b}}\big\{E[R \mid s', \mathbf{w} \setminus s_t]\big\}\, P(\mathbf{z} \mid \mathbf{x}_{t,a}) \;\le\; \mu_{a,s'} \;\le\; \sum_{\mathbf{z}} \max_{\mathbf{b}}\big\{E[R \mid s', \mathbf{w} \setminus s_t]\big\}\, P(\mathbf{z} \mid \mathbf{x}_{t,a})$$

### F-UCB Algorithm

Taking the estimation error of the counterfactual discrepancy into consideration, we use the high-probability upper confidence bound of the counterfactual effect to build the conservative fair policy subspace $\Phi_t = \{\pi : \mathrm{UCB}_{\Delta_\pi}(t) \le \tau\}$, where

$$\mathrm{UCB}_{\Delta_\pi}(t) = \hat{\Delta}_\pi(t) + \sum_{\mathbf{z}} \sqrt{\frac{8 \log(1/\delta)}{1 \vee N_{\mathbf{w}}(t)}}\, P(\mathbf{z} \mid \mathbf{x}_{t,a}) \quad (10)$$

which is derived from the fact that the sum of two independent sub-Gaussian random variables is still sub-Gaussian. Thus, the learning problem can be formulated as the following constrained optimization problem:

$$\min_{\{\pi_t\}} \sum_{t=1}^{T} \Big( E_{a \sim \pi_t^*}[\mu_a] - E_{a \sim \pi_t}[\mu_a] \Big) \quad \text{s.t.} \quad \forall t,\; \pi_t \in \Phi_t$$

where $\pi_t^*$ is defined as the optimal policy in the policy space $\Pi_t$ at each round, the same as in the D-UCB setting.
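A sketch of how Eqs. (8)–(10) could be computed on top of the D-UCB statistics for a deterministic policy (a single arm): the sensitive value in $\mathbf{w}$ is swapped to each counterfactual value, the per-configuration sample means are reweighted by the known conditionals, and the confidence width from Eq. (10) is added before comparing with $\tau$. The helper names, the data layout, and the exact bookkeeping of the counts are assumptions for illustration, not the authors' code.

```python
# Illustrative check of the conservative fairness constraint (Eq. 10) for a single arm,
# reusing per-configuration sums/counts as in the D-UCB sketch.
import math

def mu_hat_cf(stats, p_z_given, x_minus_s, a, s_value):
    """Estimated counterfactual reward for arm a had the sensitive attribute been s_value."""
    total = 0.0
    for z, p_z in p_z_given[(x_minus_s, a, s_value)].items():   # P(z | s', x_{t,a} \ s_t)
        w = (s_value, x_minus_s, a, z)                           # w' = {s', w \ s_t}
        n = stats["count"].get(w, 0)
        total += p_z * (stats["sum_r"].get(w, 0.0) / max(1, n))
    return total

def conservatively_fair(stats, p_z_given, x_minus_s, a, tau, delta):
    """Arm-level version of Phi_t: the UCB of the counterfactual discrepancy must stay within tau."""
    d_hat = (mu_hat_cf(stats, p_z_given, x_minus_s, a, "s+")
             - mu_hat_cf(stats, p_z_given, x_minus_s, a, "s-"))
    width = sum(p_z * math.sqrt(8 * math.log(1 / delta)
                                / max(1, stats["count"].get(("s+", x_minus_s, a, z), 0)))
                for z, p_z in p_z_given[(x_minus_s, a, "s+")].items())
    # Following Eq. (10), which bounds the discrepancy from above;
    # a symmetric check would use abs(d_hat) instead.
    return d_hat + width <= tau
```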
Assumption 3 in the Appendix gives the definition of a safe policy $\pi_0$, which refers to a feasible solution in the fair policy subspace at each round, i.e., $\pi_0 \in \Pi_t$ such that $\Delta_{\pi_0} \le \tau$ for each $t \in [T]$. This optimization problem can be solved similarly by following the rule of OFU. Algorithm 2 depicts our fair bandit algorithm, called F-UCB.

**Algorithm 2: F-UCB (Fair Causal Bandit)**
1. Input: policy space $\Pi$, fairness threshold $\tau$, confidence level parameter $\delta$, original causal graph $\mathcal{G}$ with domain knowledge.
2. Find the d-separation set $\mathbf{W}$ with the minimum subset $\mathbf{Z}$ in terms of domain space.
3. for $t = 1, 2, 3, \ldots, T$ do
4. for $\pi \in \Pi_t$ do
5. Compute the estimated reward mean using Eq. (3) and the estimated fairness discrepancy using Eq. (9).
6. end for
7. Determine the conservative fair policy subspace $\Phi_t$.
8. Find the optimal policy following Eq. (4) within $\Phi_t$.
9. Take action $a_t \sim \pi_t$ and observe a real-valued payoff $r_t$ and a d-separation set value $\mathbf{w}_t$.
10. Update $\hat{\mu}_{\mathbf{w}}(t)$ for all $\mathbf{w} \in \mathbf{W}$.
11. end for

Different from the D-UCB algorithm, F-UCB only picks arms from $\Phi_t$ at each time $t$. In Line 5, we compute the estimated reward mean and the estimated fairness discrepancy. In Line 7, we determine the fair policy subspace $\Phi_t$, and in Line 8, we find the optimal policy $\pi_t = \arg\max_{\pi \in \Phi_t} E_{a \sim \pi}[\mathrm{UCB}_a(t)]$. The following regret analysis shows that the regret bound of F-UCB is larger than that of D-UCB, as expected, and it is still governed by the domain size of the set $\mathbf{W}$.

**Theorem 3 (Regret bound of fair causal bandit).** Given a causal graph $\mathcal{G}$, let $\delta_E = 4|\mathbf{W}|T\delta$ and let $\Delta_{\pi_0}$ denote the maximum fairness discrepancy of a safe policy $\pi_0$ across all rounds. Setting $\alpha_c = 1$ and $\alpha_r = \frac{2}{\tau - \Delta_{\pi_0}}$, with probability at least $1 - \delta_E$, the cumulative regret of F-UCB is bounded by

$$R_T \le \Big(\frac{2}{\tau - \Delta_{\pi_0}} + 1\Big)\, 2\sqrt{2T|\mathbf{W}| \log(1/\delta_E)} + 4\sqrt{T \log(2/\delta_E)}\, \log(1/\delta_E)$$

*Proof Sketch.* Our derivation of the regret upper bound of F-UCB follows the proof idea of bandits with linear constraints (Pacchiano et al. 2021), where we treat counterfactual fairness as a linear constraint. By leveraging the knowledge of a feasible fair policy at each round and properly designing the numerical relation between the scale parameters $\alpha_c$ and $\alpha_r$, we are able to synchronously bound the cumulative regret of the reward and the fairness discrepancy term. Merging these two parts of the regret analysis leads to a unified bound for the F-UCB algorithm. By setting $\delta_E$ to $1/T^2$, we can show that F-UCB achieves $O\!\big(\frac{\sqrt{|\mathbf{W}|T}}{\tau - \Delta_{\pi_0}}\big)$ long-term regret.

**Remark.** In Theorem 3, $\alpha_c$ and $\alpha_r$ are scale parameters that control the magnitude of the confidence intervals for the sample mean estimators related to the reward and the fairness term, respectively. In the appendix of (Huang, Zhang, and Wu 2021) we show the numerical relation that $\alpha_c$ and $\alpha_r$ should satisfy in order to synchronously bound the uncertainty caused by the error terms. The values taken in Theorem 3 are one feasible solution, with $\alpha_c$ taking the minimum value under the constrained domain space.

The general framework we propose (Eq. (1)) can be applied to any policy/function class. However, the D-UCB and F-UCB algorithms still adopt deterministic policies following the classic UCB algorithm. Thus, the construction of $\Phi_t = \{\pi : \mathrm{UCB}_{\Delta_\pi}(t) \le \tau\}$ can be easily achieved as the total number of policies is finite. In this paper we also assume discrete variables, but in principle the proposed algorithms can be extended to continuous variables by employing certain approximation approaches, e.g., neural networks for estimating probabilities and sampling approaches for estimating integrals. However, the regret bound analysis may not apply, as $|\mathbf{W}|$ becomes infinite in the continuous space.
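Putting the pieces together, the sketch below shows what one F-UCB round could look like for deterministic policies: arms whose conservative fairness UCB exceeds $\tau$ are filtered out (Lines 4–7 of Algorithm 2), and the remaining arm with the highest reward UCB is played (Line 8). It reuses the hypothetical `DUCB` and `conservatively_fair` helpers from the earlier sketches and only illustrates the control flow, not the authors' implementation.

```python
# One F-UCB round (illustrative control flow): filter arms by the conservative
# fairness check of Eq. (10), then apply the optimistic D-UCB scoring within Phi_t.
def f_ucb_round(agent, stats, p_z_given, x, x_minus_s, arms, tau, delta):
    # Lines 4-7 of Algorithm 2: build the conservative fair policy subspace Phi_t.
    fair_arms = [a for a in arms
                 if conservatively_fair(stats, p_z_given, x_minus_s, a, tau, delta)]
    if not fair_arms:
        raise RuntimeError("Phi_t is empty; the analysis assumes a safe policy pi_0 exists")
    # Line 8 of Algorithm 2: optimistic selection (Eq. 4) restricted to Phi_t.
    return agent.select(x, fair_arms)
```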
## Experiment

In this section, we conduct experiments on two datasets and compare the performance of D-UCB and F-UCB with UCB, C-UCB and Fair-LinUCB in terms of cumulative regret. We also demonstrate the fairness conformance of F-UCB and the violations of the other algorithms.

### Email Campaign Dataset

We adopt the Email Campaign data used in previous work (Lu et al. 2020). The dataset is constructed based on the online advertising process. Its goal is to determine the best advertisement recommendation strategy for diverse user groups to improve their click-through ratio (CTR) and thus optimize the revenue generated through advertisements. Figure 2 shows the topology of the causal graph. We use $X_1, X_2, X_3$ to denote three user profile attributes (gender, age, and occupation); $A_1, A_2, A_3$ to denote three arm features (product, purpose, and send-time) that can be intervened on; $I_1, I_2, I_3, I_4$ to denote Email body template, fitness, subject length, and user query; and $R$ to denote the reward indicating whether the user clicks the advertisement. The reward function is $R = \frac{1}{12}(I_1 + I_2 + I_3 + A_3) + \mathcal{N}(0, \sigma^2)$, where $\sigma = 0.1$. In our experiment, we set $\delta = 1/t^2$ for each $t \in [T]$. In the appendix of (Huang, Zhang, and Wu 2021) we show the domain values of all 11 attributes and their conditional probability tables.

*Figure 2: Graph structure under Email Campaign data. Nodes with a blue frame denote the variables that can be intervened on. The node with a red frame is the sensitive attribute. Lightly shaded nodes denote the minimal d-separation set.*

*Figure 3: Comparison of bandit algorithms ($\tau = 0.3$ for F-UCB).*

| $\tau$ | Cumulative Regret of F-UCB | Unfair Decisions (UCB) | Unfair Decisions (C-UCB) | Unfair Decisions (D-UCB) | Unfair Decisions (F-UCB) |
|---|---|---|---|---|---|
| 0.1 | 392.12 | 3030 | 3176 | 3473 | 0 |
| 0.2 | 363.55 | 1383 | 1487 | 1818 | 0 |
| 0.3 | 355.21 | 482 | 594 | 739 | 0 |
| 0.4 | 317.80 | 141 | 185 | 234 | 0 |
| 0.5 | 313.89 | 18 | 27 | 47 | 0 |

*Table 1: Comparison results for the Email Campaign data.*

Figure 3 plots the cumulative regrets of the different bandit algorithms along $T$. For each bandit algorithm, the online learning process starts from initialization with no previous observations. Figure 3 shows clearly that all three causal bandit algorithms perform better than UCB, demonstrating the advantage of applying causal inference in bandits. Moreover, our D-UCB and F-UCB outperform C-UCB, showing the advantage of using the d-separation set in our algorithms. The identified d-separation set $\mathbf{W}$ (send-time, fitness, and template) and the small domain space of $\mathbf{Z}$ (fitness and template) significantly reduce the exploration cost in D-UCB and F-UCB.

**Remark.** Note that in Figure 3, for the first 2000 rounds, F-UCB has lower cumulative regret than D-UCB. A possible explanation is that the fairness constraint may lead to a policy subspace that contains many policies with high reward. As the number of explorations increases, D-UCB gains more accurate reward estimates for each policy in the whole policy space and eventually outperforms F-UCB.

Table 1 shows how the cumulative regret of F-UCB ($T = 5000$ rounds) varies with the fairness threshold $\tau$. The values in Table 1 (and Table 2) are obtained by averaging the results over 5 trials. The larger the $\tau$, the smaller the cumulative regret. In the right block of Table 1, we further report the number of fairness violations of the other three algorithms during the exploration of $T = 5000$ rounds, which demonstrates the need for fairness-aware bandits. In comparison, our F-UCB achieves strict counterfactual fairness in every round.
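To give a feel for the stated reward mechanism $R = \frac{1}{12}(I_1 + I_2 + I_3 + A_3) + \mathcal{N}(0, 0.1^2)$, the toy run below accumulates pseudo-regret under a uniformly random send-time arm. The variable domains and the distributions of $I_1, I_2, I_3$ are placeholders (the real conditional probability tables are in the paper's appendix), so its numbers are not comparable to Table 1.

```python
# Toy stand-in for the Email Campaign reward mechanism, with made-up domains and
# intermediate-feature distributions (the real CPTs are in the paper's appendix).
import random

def email_campaign_reward(a3, rng):
    # Placeholder intermediate features I1, I2, I3, each drawn uniformly from {0,...,3}.
    i1, i2, i3 = (rng.randint(0, 3) for _ in range(3))
    return (i1 + i2 + i3 + a3) / 12.0 + rng.gauss(0, 0.1)

rng = random.Random(42)
T, regret = 5000, 0.0
for t in range(T):
    a3 = rng.randint(0, 3)                 # a uniformly random send-time arm
    _ = email_campaign_reward(a3, rng)     # observed (noisy) reward
    regret += (3 - a3) / 12.0              # expected-reward gap vs. the best arm (a3 = 3) in this toy
print(f"toy cumulative pseudo-regret of random play over {T} rounds: {regret:.1f}")
```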
### Adult-Video Dataset

We further compare the performance of the F-UCB algorithm with Fair-LinUCB (Huang et al. 2020) on the Adult-Video dataset. We follow the settings of (Huang et al. 2020) by combining two publicly available datasets: the Adult dataset and the YouTube video dataset. We include detailed information about the datasets and the experiment in the appendix of (Huang, Zhang, and Wu 2021). We select 10,000 instances, use half of the data as offline data to construct the causal graph, and use the other half as the user sequence and arm candidates for online recommendation. The causal graph constructed from the training data is shown in Figure 4, where $\mathbf{X}$ = {age, sex, race, income} denotes the user features and $\mathbf{A}$ = {length, ratings, views, comments} denotes the video features. Bold nodes denote the direct parents of the reward and the red node denotes the sensitive attribute. The minimum d-separation set for this graph topology is $\mathbf{W}$ = {age, income, ratings, views}. The reward function is set as $R = \frac{1}{5}(\text{age} + \text{income} + \text{ratings} + \text{views}) + \mathcal{N}(0, \sigma^2)$, where $\sigma = 0.1$. We set $\delta = 1/t^2$ for each $t \in [T]$. The cumulative regret is accumulated over 5000 rounds.

*Figure 4: Graph structure for the Adult-Video data.*

| $\tau$ | Regret (F-UCB) | Unfair Decisions (F-UCB) | Unfair Decisions (Fair-LinUCB) |
|---|---|---|---|
| 0.1 | 361.43 | 0 | 2053 |
| 0.2 | 332.10 | 0 | 1221 |
| 0.3 | 323.12 | 0 | 602 |
| 0.4 | 303.32 | 0 | 82 |
| 0.5 | 296.19 | 0 | 6 |

*Table 2: Comparison results for the Adult-Video data.*

We observe from Table 2 a high volume of unfair decisions made by Fair-LinUCB under strict fairness thresholds (nearly forty percent of the users are treated unfairly when $\tau = 0.1$). This implies that the Fair-LinUCB algorithm cannot achieve individual-level fairness in online recommendation, in contrast to F-UCB. On the other hand, the cumulative regret of Fair-LinUCB is around 250 over 5000 rounds, which is slightly better than F-UCB. This is because we use the same linear reward setting as (Huang et al. 2020) in our experiment, and a LinUCB-based algorithm better captures the reward distribution under this setting.

## Related Work

**Causal Bandits.** There have been a few works studying how to learn optimal interventions sequentially by representing the relationship between interventions and outcomes as a causal graph along with associated conditional distributions. Lattimore, Lattimore, and Reid (2016) introduced causal bandit problems in which interventions are treated as arms in a bandit problem but their influence on the reward, along with any other observations, is assumed to conform to a known causal graph. Specifically, they focus on the setting where observations are only revealed after selecting an intervention (and hence the observed features cannot be used as context) and the distribution of the parents of the reward is known under those interventions. Lee and Bareinboim (2018) developed a way to choose an intervention subset based on the causal graph structure, as a brute-force application of standard bandit algorithms to all interventions can suffer huge regret. Lee and Bareinboim (2019) studied a relaxed version of the structural causal bandit problem where not all variables are manipulable. Sen et al. (2017) considered best intervention identification via importance sampling. Instead of forcing a node to take a specific value, they adopted soft interventions that change the conditional distribution of a node given its parent nodes. Lu et al.
(2020) proposed two algorithms, causal upper confidence bound (C-UCB) and causal Thompson Sampling (C-TS), and showed that they have improved cumulative regret bounds compared with algorithms that do not use causal information. They focus on causal relations among interventions and use causal graphs to capture the dependence among the reward distributions of these interventions.

**Fair Machine Learning.** Fairness in machine learning has been a research subject with rapid growth and attention recently. Related to but different from our work are long-term fairness (e.g., (Liu et al. 2018)), which concerns how decisions affect the long-term well-being of disadvantaged groups measured in terms of a temporal variable of interest; fair pipeline or multi-stage learning (Bower et al. 2017; Emelianov et al. 2019; Dwork and Ilvento 2019; Dwork, Ilvento, and Jagadeesan 2020), which primarily considers the combination of multiple non-adaptive sequential decisions and evaluates fairness at the end of the pipeline; and fair sequential learning (Joseph et al. 2016), which sequentially considers each individual and makes decisions for them. Liu et al. (2018) proposed the study of the delayed impact of fair machine learning and introduced a one-step feedback model of decision-making to quantify the long-term impact of classification on different groups in the population. Hu and Rangwala (2020) developed a metric-free individual fairness notion and a cooperative contextual bandits (CCB) algorithm. The CCB algorithm utilizes fairness as a reward and attempts to maximize it. It tries to achieve individual fairness without relying on problem-specific similarity metrics by using multiple gradient contextual bandits.

## Conclusions

In this paper, we studied how to learn optimal interventions sequentially by incorporating causal inference into bandits. We developed the D-UCB and F-UCB algorithms, which leverage the d-separation set identified from the underlying causal graph and adopt soft intervention to model the arm selection strategy. Our F-UCB further achieves counterfactual individual fairness in each round of exploration by choosing arms from a subset of arms satisfying the counterfactual fairness constraint. Our theoretical analysis and empirical evaluation show the effectiveness of our algorithms against baselines.

## Acknowledgments

This work was supported in part by NSF 1910284, 1920920, 1940093, and 1946391.

## Ethics Statement

Our research could benefit online recommendation system providers, so that they can make fair recommendations while still achieving low regret. Our research could also benefit users of online recommendation systems, as we achieve counterfactual fairness, a causality-based user-side fairness notion. Our research could prevent users from receiving biased recommendations, especially those from disadvantaged groups.

## References

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3): 235-256.

Bower, A.; Kitchen, S. N.; Niss, L.; Strauss, M. J.; Vargas, A.; and Venkatasubramanian, S. 2017. Fair Pipelines. CoRR, abs/1707.00391.

Burke, R. 2017. Multisided fairness for recommendation. arXiv preprint arXiv:1707.00093.

Burke, R.; Sonboli, N.; and Ordonez-Gauger, A. 2018. Balanced neighborhoods for multi-sided fairness in recommendation. In FAccT '18.

Celis, L. E.; Kapoor, S.; Salehi, F.; and Vishnoi, N. K. 2018. An algorithmic framework to control bias in bandit-based personalization. arXiv preprint arXiv:1802.08674.

Correa, J.
D.; and Bareinboim, E. 2020. A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments. In AAAI '20, 10093-10100.

Dwork, C.; and Ilvento, C. 2019. Fairness Under Composition. In ITCS '19, volume 124 of LIPIcs, 33:1-33:20.

Dwork, C.; Ilvento, C.; and Jagadeesan, M. 2020. Individual Fairness in Pipelines. In FORC '20, 7:1-7:22.

Ekstrand, M. D.; Tian, M.; Kazi, M. R. I.; Mehrpouyan, H.; and Kluver, D. 2018. Exploring author gender in book rating and recommendation. In RecSys '18, 242-250.

Emelianov, V.; Arvanitakis, G.; Gast, N.; Gummadi, K. P.; and Loiseau, P. 2019. The Price of Local Fairness in Multistage Selection. In IJCAI '19, 5836-5842.

Geiger, D.; Verma, T.; and Pearl, J. 1990. d-separation: From theorems to algorithms. In Machine Intelligence and Pattern Recognition.

Hu, Q.; and Rangwala, H. 2020. Metric-Free Individual Fairness with Cooperative Contextual Bandits. CoRR, abs/2011.06738.

Huang, W.; Labille, K.; Wu, X.; Lee, D.; and Heffernan, N. 2020. Achieving User-Side Fairness in Contextual Bandits. arXiv preprint arXiv:2010.12102.

Huang, W.; Zhang, L.; and Wu, X. 2021. Achieving Counterfactual Fairness for Causal Bandit. CoRR, abs/2109.10458.

Jabbari, S.; Joseph, M.; Kearns, M. J.; Morgenstern, J.; and Roth, A. 2017. Fairness in Reinforcement Learning. In ICML '17.

Joseph, M.; Kearns, M. J.; Morgenstern, J.; Neel, S.; and Roth, A. 2018. Meritocratic Fairness for Infinite and Contextual Bandits. In AIES '18, 158-163. ACM.

Joseph, M.; Kearns, M. J.; Morgenstern, J. H.; and Roth, A. 2016. Fairness in Learning: Classic and Contextual Bandits. In NeurIPS.

Lattimore, F.; Lattimore, T.; and Reid, M. D. 2016. Causal Bandits: Learning Good Interventions via Causal Inference. In NeurIPS '16.

Lee, S.; and Bareinboim, E. 2018. Structural Causal Bandits: Where to Intervene? In NeurIPS '18, 2573-2583.

Lee, S.; and Bareinboim, E. 2019. Structural Causal Bandits with Non-Manipulable Variables. In AAAI '19.

Liu, L. T.; Dean, S.; Rolf, E.; Simchowitz, M.; and Hardt, M. 2018. Delayed Impact of Fair Machine Learning. In ICML '18.

Liu, Y.; Radanovic, G.; Dimitrakakis, C.; Mandal, D.; and Parkes, D. C. 2017. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875.

Lu, Y.; Meisami, A.; and Tewari, A. 2021. Causal Bandits with Unknown Graph Structure. arXiv preprint arXiv:2106.02988.

Lu, Y.; Meisami, A.; Tewari, A.; and Yan, W. 2020. Regret Analysis of Bandit Problems with Causal Background Knowledge. In UAI '20, 141-150.

Pacchiano, A.; Ghavamzadeh, M.; Bartlett, P.; and Jiang, H. 2021. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, 2827-2835. PMLR.

Pearl, J. 2009. Causality. Cambridge University Press.

Sen, R.; Shanmugam, K.; Dimakis, A. G.; and Shakkottai, S. 2017. Identifying Best Interventions through Online Importance Sampling. In ICML '17, 3057-3066.

Shpitser, I.; and Pearl, J. 2008. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9: 1941-1979.

Spirtes, P.; Glymour, C. N.; and Scheines, R. 2000. Causation, Prediction, and Search, volume 81. MIT Press.

Tian, J.; Paz, A.; and Pearl, J. 1998. Finding minimal d-separators. Citeseer.

Wu, Y.; Zhang, L.; and Wu, X. 2019. Counterfactual Fairness: Unidentification, Bound and Algorithm. In IJCAI '19, 1438-1444.

Zhu, Z.; Hu, X.; and Caverlee, J. 2018. Fairness-aware tensor-based recommendation. In CIKM '18, 1153-1162.