Causal Logistic Bandits with Counterfactual Fairness Constraints

Jiajun Chen 1  Jin Tian 2  Christopher John Quinn 1

Artificial intelligence will play a significant role in decision making in numerous aspects of society. Numerous fairness criteria have been proposed in the machine learning community, but there remains limited investigation into fairness, as defined through specified attributes, in a sequential decision-making framework. In this paper, we focus on causal logistic bandit problems where the learner seeks to make fair decisions under a notion of fairness that accounts for counterfactual reasoning. We propose and analyze an algorithm, leveraging primal-dual optimization, for constrained causal logistic bandits where the non-linear constraints are a priori unknown and must be learned in time. We obtain sub-linear regret guarantees with a leading term similar to that for unconstrained logistic bandits (Lee et al., 2024), while guaranteeing sub-linear constraint violations. We show how to achieve zero cumulative constraint violations with a small increase in the regret bound.

1. Introduction

Artificial intelligence (AI) models, using techniques from statistics and machine learning, are increasingly being used to make decisions that affect people's lives. In light of this, a plethora of formal fairness criteria have been proposed (Darlington, 1971; Dwork et al., 2012; Hardt et al., 2016; Zhang et al., 2016; Kusner et al., 2017; Zafar et al., 2017; Nabi & Shpitser, 2018; Chiappa, 2019; Chouldechova & Roth, 2020; Imai & Jiang, 2023; Plecko & Bareinboim, 2024).
There has been growing interest in the sequential decision-making community in accounting for fairness, including in settings such as classic and contextual bandits (Joseph et al., 2018), combinatorial bandits (Xu et al., 2020), bandits with long-term constraints (Liu et al., 2022), and reinforcement learning (Jabbari et al., 2017), just to name a few. Notably, rather than addressing fairness through the lens of specified attributes, these studies typically operationalize fairness in a different manner, defining it with respect to one-step rewards and introducing a notion of meritocratic fairness (Joseph et al., 2018): an algorithm should never assign a higher selection probability to a less qualified decision than to a more qualified one, i.e., arms with higher empirical rewards should be picked more frequently than those with lower empirical rewards. This is distinct from fairness criteria based on specified attributes.

In this paper, we focus on a problem structure wherein arms arrive in a sequential and stochastic manner from an underlying fixed distribution and decisions are made in an online fashion by the agent. The objective of the agent is to optimize cumulative rewards while achieving fairness counterfactually with respect to specified attributes, i.e., the outcome would not have been substantially different if the specified attributes had taken different values. In general, this type of task belongs to the setting of dynamic treatment regimes (Murphy, 2003; Lavori & Dawson, 2008; Zhang, 2020) for finding a sequence of decisions over a finite set of treatments, which appears across a broad range of applications.

1Department of Computer Science, Iowa State University, Ames, IA, USA. 2Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE. Correspondence to: Christopher John Quinn.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
1.1. Contributions

In light of the above, the goal of this paper is to analyze the foundations of online causal fair decision-making. More specifically, our contributions are as follows:

We formulate a constrained causal logistic bandits problem in which the online decision-making process is characterized within a causal structure. We formalize a (non-linear) fairness constraint based on the counterfactual outcome effect that is a priori unknown and must be learned in time. To the best of our knowledge, this is the first work to study constrained logistic bandits without a known safe decision subset (see Footnote 1).

We provide a unified analysis of the confidence set construction, algorithm design, and performance guarantees, i.e., sublinear reward regret and sublinear cumulative constraint violations, by leveraging the regret-to-confidence-set conversion and primal-dual online optimization.

We show that the leading term of our regret, O(ρnS√T), is significantly better than the regret bounds of related works handling constraints and is similar to the bound for the unconstrained problem (Lee et al., 2024) (see Remark 3). Furthermore, by introducing a user-chosen parameter, one can trade off the regret slightly to achieve zero cumulative constraint violations.

Table 1: Comparison of reward model, constraint types, frequentist regret guarantees, and cumulative violation upper bounds for select related works. Notation: horizon T, arm feature dimension n, bounded bandit parameter S, truncated parameter ρ (see Section 3.2), Slater's constant δ (see Assumption 3), decision-set-dependent term R_Z(T), generalized linear model (GLM). Problem-dependent constants κ⋆, κ_Z, κ, where √(1/κ⋆) < 1 ≤ κ_Z ≤ κ when compared within the same decision and parameter spaces. "Yes" in the Safe column indicates that the algorithm requires prior knowledge (see Footnote 1) of a safe action or policy (per-round zero constraint violations with high probability).

| Algorithm | Model | Constraint | Regret | Violation | Safe |
|---|---|---|---|---|---|
| Safe UCB-GLM (Amani et al., 2020) | GLM | GLM | Õ(κ n √T) | 0 | Yes |
| OFULog-r (Abeille et al., 2021) | Logistic | No | Õ(n S^{5/2} √(T/κ⋆(T)) + min{n² S⁴ κ_Z, R_Z(T)}) | — | — |
| F-UCB (Huang et al., 2022b) | Causal MAB | Linear | O(√(|W| T)) | 0 | Yes |
| OFULog+ (Lee et al., 2024) | Logistic | No | Õ(n S √(T/κ⋆(T)) + min{n² S² κ_Z, R_Z(T)}) | — | — |
| CCLB (Theorem 1) | Causal logistic | Mixture of logistics | Õ(ρ n S √T + ρ n² S² κ_Z + ρ √T) | Õ(n S √T + n² S² κ_Z + √T) | No |
| CCLB (Proposition 4) | Causal logistic | Mixture of logistics | Õ(ρ n S √T + ρ n² S² κ_Z + ρ √T) | Zero | No |

1.2. Related works

We next briefly discuss two lines of literature closely related to our work; see Appendix B for discussion of additional works.

First, in terms of formulating causal fairness within a multi-armed bandit setting, the closest related work is (Huang et al., 2022b). Like us, they considered a stochastic, contextual MAB problem with a (known) causal graph governing relationships between the stochastic contexts (seen by the learner before making decisions) and the rewards. They assume all variables are discrete. Like us, they proposed characterizing fairness with counterfactual fairness (Kusner et al., 2017; Wu et al., 2019; Chiappa, 2019) w.r.t. specified attributes in the context (e.g., specified user features in an online recommendation system). They make an assumption1 about a fair policy; they provide high-probability guarantees that all actions are fair. We model fairness as a long-term constraint, for which we seek to bound cumulative violations, as it is unclear whether it is possible to certify policies as fair (feasible) before collecting data to estimate the reward parameter upon which the (non-linear) constraint depends. Unlike our work, they considered all variables except the reward to be discrete-valued with non-parametric (thus more flexible) distributions. They proposed simpler empirical estimation methods for rewards, for which the counterfactual constraints become linear. While their structural causal model was discrete-valued but non-parametric, their regret bound in turn depends on |W|, the number of realizations of the set of parent variables of the reward, which is exponential in the size of the parent set (see Table 1). In contrast, we model rewards parametrically (using a logistic model), depending on feature maps of the context and decision variables. This dramatically improves the dimensional dependence, though the fairness constraint becomes a mixture of logistic functions, for which estimating confidence bounds (to estimate the region of fair actions) is more challenging.

1Pacchiano et al. (2021) (on which Huang et al. (2022b)'s analysis is based) requires explicit a priori knowledge of a feasible action/policy (Assumption 5) and states that it is absolutely necessary to do so for the problem they study (Remark 1). Huang et al. (2022b)'s Assumption 3 only requires the existence of a safe policy π0; π0 is not explicitly used in estimating rewards or in estimating a set of feasible policies in the main paper. However, to the best of our knowledge, it is unclear how the conservatively estimated sets of policies Φt (shown w.h.p. to be feasible for all rounds) that are used to select actions could be guaranteed to be non-empty in early rounds without a known safe policy π0 or additional assumptions.

Among works on MAB with parametric rewards and unknown (stochastic) constraints, there are numerous works on logistic rewards without constraints and on linear rewards with linear constraints (see Appendix B for discussion of those works). The only prior work that, like us, considered a non-linear (parametric) reward model with non-linear unknown (stochastic) constraints is (Amani et al., 2020). They considered generalized linear rewards, where the link function is assumed to be twice-differentiable and Lipschitz, of which logistic rewards are a sub-class. We note that, in terms of the regret bound alone, their bound specialized to logistic rewards depends linearly on the worst-case parameter κ (see Table 1), which can be arbitrarily large. They considered generalized linear constraints; as in our work, the constraints depend on the unknown parameter vector θ⋆ in the reward function. However, they assume a priori knowledge of some feasible actions. At a high level, they explore the environment (improving their estimate of θ⋆) using those feasible actions and are able to obtain high-probability guarantees of per-round feasibility. We do not assume such prior knowledge; we instead bound long-term constraint violations.

1.3. Preliminaries

In this section, we introduce the basic notation and definitions used throughout the paper. For a twice-differentiable function g, the notations ġ and g̈ denote the first and second derivatives of g, respectively. For a random variable Z, let Z represent the domain of Z and |Z| its cardinality. For two real-valued symmetric matrices A and B, the notation A ⪰ B indicates that A − B is positive semidefinite, and when A is positive definite, we denote the A-norm of a vector z as ‖z‖_A = √(z⊤Az). Finally, for two univariate real-valued functions f and g, we write f = Õ(g) to indicate that g dominates f up to logarithmic factors; and for an event E ⊆ Ω, we write 1{E} for the indicator function of E. We adopt the language of Structural Causal Models (SCMs) (Pearl, 2009, Ch. 7).
An SCM M is a tuple ⟨U, V, F, P(u)⟩, where U is a set of exogenous (unobserved or latent) variables, V is a set of endogenous (observed) variables, F is a set of structural functions, and P(u) is a distribution over the latent variables. For the set of structural functions F, each f_i ∈ F decides the value of an endogenous variable i ∈ V, taking as arguments a combination of other variables. That is, i ← f_i(Pa_i, U_i), with Pa_i ⊆ V and U_i ⊆ U, where Pa_i denotes the parent set (explained below) of i. Realizations of the set of latent variables U ∼ P(u) induce an observational distribution P(v) over V. An intervention on a variable i ∈ V, denoted by do(i = c),2 is an operation where the value of i is set to a constant c regardless of the structural function f_i. Each SCM is associated with a directed acyclic graph (DAG) G (e.g., see Figure 1), called the causal diagram, whose nodes correspond to endogenous variables i ∈ V and whose solid arrows represent the arguments of each function f_i. A bi-directed arrow between nodes i and j indicates an unobserved confounder affecting both i and j, i.e., U_i ∩ U_j ≠ ∅. We use the standard graph-theoretic family abbreviations, e.g., Pa(V)_G stands for the set of parents of V in G; we omit the subscript G when it is obvious. A path from a node X to a node Y in G is a sequence of edges which does not include a particular node more than once. Two nodes X and Y are said to be d-separated by a third set Z in a DAG G, denoted by (X ⊥ Y | Z)_G, if and only if Z blocks all paths from every node in X to every node in Y. The criterion of blockage follows (Pearl, 2009, Def. 1.2.3) and is included in Appendix A with formal definitions for completeness.

2When the variable being intervened on is clear from context, we write do(c) for short.

2. A Theoretical Framework for Constrained Causal Logistic Bandits

In this section, we formalize the theoretical framework of constrained causal logistic bandits in the semantics of SCMs and MABs.
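As a concrete illustration of the SCM semantics just defined, the sketch below simulates a two-variable SCM with an unobserved confounder and contrasts the observational distribution with the do-intervened one. All structural functions and coefficients here are illustrative assumptions, not any model from this paper.

```python
import random

def sample(rng, do_x=None):
    """One draw from a toy SCM: latent U confounds X and Y (a bi-directed
    edge X <-> Y) alongside a direct edge X -> Y.  The structural
    functions are assumed for illustration.  Passing `do_x` replaces
    f_X by the constant intervention do(X = do_x)."""
    u = rng.gauss(0.0, 1.0)                   # latent U ~ P(u)
    x = int(u > 0) if do_x is None else do_x  # f_X(U), or the intervention
    y = int(u + 0.8 * x + rng.gauss(0.0, 1.0) > 0)  # f_Y(X, U)
    return x, y

rng = random.Random(1)
obs = [sample(rng) for _ in range(2000)]           # observational P(x, y)
intv = [sample(rng, do_x=1) for _ in range(2000)]  # interventional, do(X = 1)
p_obs = sum(y for x, y in obs if x == 1) / max(1, sum(x for x, _ in obs))
p_do = sum(y for _, y in intv) / len(intv)
# Because U raises both X and Y, conditioning on X = 1 selects high U,
# so P(Y=1 | X=1) exceeds P(Y=1 | do(X=1)).
```

The gap between `p_obs` and `p_do` is exactly the confounding that the do-operator removes, which is why the interventional quantities in (2)–(3) below cannot in general be read off observational conditionals.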
We start by considering a recruitment example (see Appendix C for more motivating examples), where applicants arrive sequentially and stochastically from an underlying fixed distribution, and the hiring decisions made by an employer are based on decision-making strategies that optimize organizational productivity and efficiency while achieving individual fairness for each applicant. The decision-making process is characterized by the extended Standard Fairness Model (SFM) (Zhang & Bareinboim, 2018; Plecko & Bareinboim, 2024); see Figure 1 for a graphical model of the SFM. The variable A represents the specified attribute, W is a set of confounded features, M is a set of intermediate features, and D and Y represent the decision and outcome reward. The contextual information X_t = {w_t, m_t, a_t} is accessible to the learner before making decisions.

2.1. Logistic bandits with structural causal model

At every round t, the learner observes the contextual features X_t = {w_t, m_t, a_t}, which are drawn from a stochastic distribution, and is then presented a set of decisions D that depends on the candidate's context. The learner chooses a decision d_t ∈ D and receives an outcome reward y_{t+1} ∈ {0, 1}. The learner's decision is based on the previous rounds' knowledge F_t = (F_0, {w_τ, m_τ, a_τ, d_τ, y_{τ+1}}_{τ=1}^{t−1}) and causal information, where F_0 represents any prior knowledge. In our problem, we assume that the outcome variable Y has a generalized linear relationship (Filippi et al., 2010; Li et al., 2017) with the feature Z; specifically,

E[Y | Z] = g(f(Z)⊤θ⋆),  (1)

where the fixed but unknown parameter θ⋆ belongs to R^n, g(x) = (1 + e^{−x})^{−1} is the standard logistic function, f is a mapping function known ahead of time to the learner, and the encoded feature vector f(Z) is in R^n.
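The reward model (1) can be sketched in a few lines; the 3-dimensional parameter below is a hypothetical stand-in for θ⋆, used only to show how bandit feedback is generated.

```python
import math
import random

def g(x):
    """Standard logistic function g(x) = 1 / (1 + e^{-x}) from Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-x))

def expected_reward(f_z, theta):
    """E[Y | Z] = g(f(Z)^T theta); `f_z` is the encoded feature f(Z)."""
    return g(sum(a * b for a, b in zip(f_z, theta)))

def sample_reward(f_z, theta, rng):
    """Bandit feedback: y_{t+1} ~ Bernoulli(g(f(Z_{d_t})^T theta))."""
    return int(rng.random() < expected_reward(f_z, theta))

# toy usage with a hypothetical 3-dimensional parameter
theta_star = [0.5, -0.2, 0.1]
p = expected_reward([1.0, 1.0, 1.0], theta_star)  # g(0.4) ≈ 0.599
y = sample_reward([1.0, 1.0, 1.0], theta_star, random.Random(0))
```

Only the binary draw `y` is observed by the learner; the probability `p` depends on the unknown θ⋆ and must be estimated.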
Then the interventional distribution for the expected reward under do(d_t) and do(a_t), given the observed contextual features m_t and w_t, is represented as (Plecko & Bareinboim, 2024):

E[Y | do(d_t), do(a_t), w_t, m_t] = g(f(Z_{d_t})⊤θ⋆),  (2)

where Z_{d_t} is the feature consisting of the decision d_t and the contexts X_t = {w_t, m_t, a_t}. In this paper, we consider one specified attribute variable A, which in general is a parent of the decision and outcome variables in the causal graph (see Figure 1). The specified attribute value at round t is a_t, and we denote the counterfactual value by a′_t. Both the decision and the (hypothetical) counterfactual intervention on the specified attribute are atomic interventions (Correa & Bareinboim, 2020). Thus, the expected reward under the counterfactual attribute a′_t for any decision d_t ∈ D is

E[Y | do(d_t), do(a′_t), w_t, m_t] = g(f(Z′_{d_t})⊤θ⋆),  (3)

where Z′_{d_t} is the counterfactual feature for the interventions do(a′_t) and do(d_t). Notice that, in our problem, the factual feature Z_{d_t} and the counterfactual feature Z′_{d_t} both lie in the feature space Z. The identification of (2) and (3) follows from the do-calculus rules (Pearl, 1995), which express the do-interventions as functions of the observed data; readers can refer to Appendix C for a more detailed analysis. Therefore, the counterfactual fairness effect for decision d_t is represented as:

Δ_{d_t}(X_t) = g(f(Z_{d_t})⊤θ⋆) − g(f(Z′_{d_t})⊤θ⋆).  (4)

Remark 1. The above counterfactual fairness measure is the conditional average treatment effect (CATE) (Pearl, 2009; Imbens & Rubin, 2015), as compared to the total effect, direct effect, or indirect effect (Pearl, 2001).
The causal fairness notion proposed in (Kusner et al., 2017) is the average effect of treatment on the treated (ETT), while both CATE and ETT can be decomposed into (conditional) path-specific effects, i.e., (conditional) counterfactual direct, indirect, and spurious effects (Zhang & Bareinboim, 2018; Plecko & Bareinboim, 2024).

2.2. Counterfactual fairness modeling via soft constraint

In this section, we discuss modeling fairness as part of the learner's decision-making problem. Consider a stochastic bandit optimization with a soft constraint for our decision-making problem. In particular, at every round t ∈ [T], the learner selects a decision d_t to maximize the expected reward E[Y | do(d_t), X_t] subject to a constraint on (violations of) the counterfactual fairness effect (4),

|Δ_{d_t}(X_t)| ≤ τ,  (5)

where τ is a predefined fairness threshold. The counterfactual fairness constraint (5) requires that the expected reward would have been similar regardless of whether the value of the specified attribute had been different. In addition, the learner only receives bandit feedback (the reward); the learner does not observe feedback on constraint violations.

Figure 1: Extended Standard Fairness Model (SFM). Variable A denotes the specified attribute, W is a set of confounded features, M is a set of intermediate features, D is the decision, and Y is the outcome. Let X = {W, M, A} denote the contextual information the learner has available to make the decision, and let Z = {W, M, A, D} denote the variables the reward distribution may depend on.

Huang et al. (2022b) were the first to propose a counterfactual fairness constraint in a bandit framework. We note that their setting confines the learner to decisions from a safe action set (see Footnote 1). To the best of our knowledge, that setting requires strong assumptions on prior knowledge of a subset of safe actions that can be used even before rewards are estimated (the constraint (4) depends on the unknown reward parameter vector θ⋆).
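The effect (4) and the soft constraint (5) can be sketched directly, assuming (for illustration only) oracle access to θ⋆; in the algorithm this parameter is replaced by an estimate.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fairness_effect(f_zd, f_zd_prime, theta):
    """Counterfactual fairness effect, Eq. (4):
    Delta_d(X) = g(f(Z_d)^T theta) - g(f(Z'_d)^T theta), where Z'_d
    swaps in the counterfactual attribute value a'_t."""
    return g(dot(f_zd, theta)) - g(dot(f_zd_prime, theta))

def is_fair(f_zd, f_zd_prime, theta, tau):
    """Soft constraint, Eq. (5): |Delta_d(X)| <= tau."""
    return abs(fairness_effect(f_zd, f_zd_prime, theta)) <= tau

# a decision whose encoded feature ignores the attribute is trivially fair
theta = [1.0, 0.5]
assert is_fair([0.2, 0.3], [0.2, 0.3], theta, tau=0.0)
```

Note the effect is a difference of two logistic functions of the same θ, which is why the feasible region is non-linear in the reward parameter.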
Prior knowledge of safe actions can be mild in some settings; however, we argue that counterfactual fairness (a convex combination of logistic functions that depend on the unknown reward parameter vector θ⋆) is more complex, so it is less obvious how to construct an a priori safe decision without knowing any information about the reward distribution. Therefore, in the setting we consider, since no safe actions may be known a priori, we allow instantaneous violations but bound the cumulative violations. The goal of the learner is to maximize the cumulative expected outcome reward while minimizing the cumulative expected counterfactual fairness constraint violations throughout the learning process. Define the cumulative expected regret and cumulative expected counterfactual fairness constraint violations as

R(T) = Σ_{t=1}^{T} [ g(f(Z_{d⋆_t})⊤θ⋆) − g(f(Z_{d_t})⊤θ⋆) ],  (6)

V(T) = Σ_{t=1}^{T} [ |Δ_{d_t}(X_t)| − τ ]_+,  (7)

where d⋆_t = argmax_{d∈D : |Δ_d(X_t)| ≤ τ} g(f(Z_d)⊤θ⋆) and [·]_+ = max{·, 0}. In this paper, we establish a stronger version of regret (Liu et al., 2021; Zhou & Ji, 2022). Specifically, let π ∈ Π be a probability distribution over the set of decisions D, and let ⟨π, g(f(Z_d)⊤θ⋆)⟩ = Σ_{d∈D} g(f(Z_d)⊤θ⋆) π(d) and ⟨π, |Δ_d(X_t)| − τ⟩ = Σ_{d∈D} [ |Δ_d(X_t)| − τ ] π(d). We compare the received reward with the (baseline) expected reward of π⋆, where π⋆ is the optimal solution of the optimization problem max_π { ⟨π, g(f(Z_d)⊤θ⋆)⟩ : ⟨π, |Δ_d(X_t)| − τ⟩ ≤ 0 }. Thus, the stronger regret is defined as:

R^+(T) = Σ_{t=1}^{T} ⟨π⋆, g(f(Z_d)⊤θ⋆)⟩ − g(f(Z_{d_t})⊤θ⋆).  (8)

Note that the probability distribution π⋆ may place mass on some decisions that violate the constraint, as long as the constraint is satisfied on average, whereas the single decision d⋆_t in (6) must be feasible; therefore, R^+(T) ≥ R(T).

2.3. Model assumptions and definitions

To study the constrained causal logistic bandits problem, we make the following standard assumptions (Yu et al., 2017; Efroni et al., 2020; Zhou & Ji, 2022). Let Θ denote a compact set in R^n, and let Z denote the feature space domain.
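Before stating the assumptions, note that the baseline π⋆ in (8) maximizes a linear objective over the probability simplex under a single linear constraint, so an optimal policy mixes at most two decisions. The hypothetical helper below (an illustration, assuming the reward and violation vectors are known) exploits that fact to compute π⋆ exactly for a finite decision set.

```python
def optimal_fair_policy(rewards, violations):
    """Solve max_pi <pi, r> s.t. <pi, c> <= 0 over the probability
    simplex, the baseline defining pi* in Eq. (8).  Here rewards[d]
    plays the role of g(f(Z_d)^T theta*) and violations[d] the role of
    |Delta_d(X_t)| - tau.  With one linear constraint an optimal
    policy mixes at most two decisions, so searching singletons and
    pairs is exact."""
    n = len(rewards)
    best_val, best_pi = float("-inf"), None
    for i in range(n):                      # feasible single decisions
        if violations[i] <= 0 and rewards[i] > best_val:
            best_val, best_pi = rewards[i], {i: 1.0}
    for i in range(n):                      # strictly feasible i mixed
        for j in range(n):                  # with an infeasible j
            if violations[i] < 0 <= violations[j]:
                w = violations[j] / (violations[j] - violations[i])
                val = w * rewards[i] + (1.0 - w) * rewards[j]
                if val > best_val:
                    best_val, best_pi = val, {i: w, j: 1.0 - w}
    return best_val, best_pi

# decision 0 is lucrative but unfair; pi* mixes it with fair decision 1
val, pi = optimal_fair_policy([1.0, 0.5], [0.5, -0.5])
```

In this toy instance the mixture earns 0.75, strictly more than the best single feasible decision (0.5), which is precisely why R^+(T) ≥ R(T).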
Assumption 1 (Bounded Bandit Parameter). There is a known bound S on the norm of the (unknown) reward parameter vector θ: ‖θ‖₂ ≤ S for all θ ∈ Θ.

Assumption 2. The feature mapping function f : Z → R^n lies in a reproducing kernel Hilbert space (RKHS) with a bounded norm (i.e., a measure of smoothness), such that ‖f(Z)‖₂ ≤ 1 for all Z ∈ Z.

Assumption 3 (Slater's Constraint Qualification). There is a constant δ > 0 such that there exists a feasible probability distribution π₀ over the decision set D satisfying ⟨π₀, |Δ_d(X_t)| − τ⟩ ≤ −δ for all t ∈ [T]. Without loss of generality, we assume δ ≤ 1.

Notice that this is a mild assumption, since it only requires that one could find a stochastic policy π₀ under which the expected constraint violation is strictly negative. In contrast, for hard constraints (Amani et al., 2019; Khezeli & Bitar, 2020; Pacchiano et al., 2021), one typically assumes a non-empty known safe decision set, which is a stronger assumption about the learner's knowledge than the existence of a Slater constant δ. We next define a problem-dependent quantity that impacts learnability.

Definition 1 (Problem-Dependent Constant3).

κ_Z(θ⋆) = max_{Z∈Z} 1/ġ(f(Z)⊤θ⋆).  (9)

3We will drop the dependency on θ⋆ when there is no ambiguity.

We recall the problem-dependent constants discussed in Table 1: κ⋆ = 1/ġ(f(Z⋆)⊤θ⋆), κ_Z = max_{Z∈Z} 1/ġ(f(Z)⊤θ⋆), and κ = max_{Z∈Z, θ∈Θ} 1/ġ(f(Z)⊤θ); clearly, √(1/κ⋆) < 1 ≤ κ_Z ≤ κ. Notice that these problem-dependent constants are defined through the first derivative of the logistic function, which quantifies the level of non-linearity of plausible expected reward signals at different scales. In particular, κ can be significantly large even for reasonable logistic bandit problems; readers can refer to Section 2 of (Faury et al., 2020) for a detailed discussion of the importance of this quantity.

3. Methods for Constrained Causal Logistic Bandits

We next design an online algorithm for the constrained causal logistic bandits problem.
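As a quick numerical illustration of Definition 1 (a sketch over an assumed finite feature set; the feature vectors and parameter below are hypothetical):

```python
import math

def g_dot(x):
    """Derivative of the logistic function: g'(x) = g(x)(1 - g(x))."""
    gx = 1.0 / (1.0 + math.exp(-x))
    return gx * (1.0 - gx)

def kappa_Z(features, theta):
    """Definition 1, specialized to a finite feature set:
    kappa_Z(theta) = max_Z 1 / g'(f(Z)^T theta)."""
    return max(1.0 / g_dot(sum(a * b for a, b in zip(f_z, theta)))
               for f_z in features)

# g'(0) = 1/4 is the maximum of g', so kappa_Z >= 4 always; it blows
# up as |f(Z)^T theta| grows (the saturated regions of the logistic).
k_small = kappa_Z([[0.0]], [1.0])  # = 4.0
k_big = kappa_Z([[6.0]], [1.0])    # ≈ 405
```

The explosion of `k_big` for a modest margin of 6 illustrates why a regret bound scaling linearly in the worst-case κ can be vacuous.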
We will then develop a unified analysis of regret and constraint violations with rigorous performance guarantees for our decision-making strategy. Before proposing the algorithm, we first construct a convex confidence set for the reward parameter θ⋆ using a regret-to-confidence-set conversion (Foster et al., 2018; Lee et al., 2024).

3.1. Convex confidence set

For logistic bandit problems, a natural way to estimate the reward parameter θ⋆ given F_t is maximum-likelihood estimation. We build on the works for the unconstrained problem (Abeille et al., 2021; Lee et al., 2024). At every round t, a reward value y_{t+1} is sampled from a Bernoulli distribution with expected value (success probability) g(f(Z_{d_t})⊤θ⋆). The unregularized cumulative logistic loss can be written as:

L_t(θ) = − Σ_{τ=1}^{t−1} [ y_{τ+1} log g(f(Z_{d_τ})⊤θ) + (1 − y_{τ+1}) log(1 − g(f(Z_{d_τ})⊤θ)) ].  (10)

The loss L_t(θ) is a strongly convex function of θ (Abeille et al., 2021; Lee et al., 2024). The reward parameter is estimated by maximum likelihood estimation (MLE), defined as θ̂_t = argmin_{‖θ‖₂≤S} L_t(θ). For α ∈ (0, 1], we use the confidence set:

C_t(α) = { θ ∈ Θ : L_t(θ) − L_t(θ̂_t) ≤ β_t(α)² },  (11)

where β_t(α) = √( 10n log(St/(2n) + e) + 2((e − 2) + S) log(1/α) ). The following proposition ensures that C_t(α) is a confidence set for θ⋆ with high probability:

Proposition 1 (Theorem 1 in (Lee et al., 2024)). P( ∀t ≥ 1, θ⋆ ∈ C_t(α) ) ≥ 1 − α.

The proof is provided in Appendix D. It uses the approach from the online logistic regression regret guarantee of (Foster et al., 2018), without running the online learning algorithm explicitly. We note that the radius of the convex confidence set in (Abeille et al., 2021, Lemma 1) is of order β_t(α) = O(√(nS³ log t)), while the above tightened loss-based confidence set yields O(√(n log(St)) + S), leading to an overall improvement by a factor of S, which is significant when S is large, e.g., S ≈ |D|.
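The loss (10) and the membership test for (11) can be sketched as follows. This is a minimal illustration: the norm-constrained MLE solver is abstracted away (`theta_hat` is simply assumed given, not computed), and the toy data are hypothetical.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_loss(theta, feats, ys):
    """Unregularized cumulative logistic loss L_t(theta), Eq. (10),
    over past encoded features f(Z_{d_tau}) and rewards y_{tau+1}."""
    total = 0.0
    for f_z, y in zip(feats, ys):
        p = g(sum(a * b for a, b in zip(f_z, theta)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def in_confidence_set(theta, theta_hat, feats, ys, beta):
    """Membership test for C_t(alpha), Eq. (11):
    L_t(theta) - L_t(theta_hat) <= beta^2.  `theta_hat` stands in for
    the norm-constrained MLE (solver not shown)."""
    return (logistic_loss(theta, feats, ys)
            - logistic_loss(theta_hat, feats, ys)) <= beta ** 2

feats, ys = [[1.0], [1.0], [-1.0]], [1, 0, 0]
theta_hat = [0.0]          # stand-in estimate, not the true minimizer
ok = in_confidence_set([0.1], theta_hat, feats, ys, beta=1.0)   # near
far = in_confidence_set([5.0], theta_hat, feats, ys, beta=1.0)  # far
```

Parameters close to `theta_hat` incur little extra loss and stay in the set, while distant ones are excluded; shrinking `beta` over rounds is what tightens the set.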
3.2. Online learning algorithm

We consider a constrained stochastic causal logistic bandit over horizon T, as described in Section 2.2. The objective of the learner is to maximize the cumulative reward while minimizing the cumulative counterfactual fairness violations over the horizon T. To address the challenges of the unknown reward and the unknown counterfactual fairness constraint, we develop a Constrained Causal Logistic Bandits (CCLB) algorithm leveraging primal-dual optimization techniques. The pseudocode for our CCLB algorithm is given in Algorithm 1.

At every round t, let the Lagrangian of the primal problem max_π { ⟨π, g(f(Z_d)⊤θ⋆)⟩ : ⟨π, |Δ_d(X_t)| − τ⟩ ≤ 0 } be L(π, ϕ) = ⟨π, g(f(Z_d)⊤θ⋆)⟩ − ϕ⟨π, |Δ_d(X_t)| − τ⟩, and let the associated dual function be q(ϕ) = max_π L(π, ϕ). Since both the reward and the counterfactual fairness constraint depend on the unknown parameter θ⋆, we first estimate it through maximum likelihood estimation and construct a confidence set C_t(α) from the observed history, i.e., feature vectors and rewards. The greedy procedure is based on the principle of optimism in the face of uncertainty (OFU) (Auer et al., 2002; 2008; Osband & Van Roy, 2014): the optimistic estimate θ̃_t is obtained by maximizing the expected reward over the confidence set C_t(α); however, we penalize the expected reward by the constraint violation when the greedy decision d_t is picked by the learner over the decision domain D. The dual update, which minimizes q(ϕ) with respect to ϕ, is a projected gradient step with 1/η as the step size. Note that the truncated parameter ρ is chosen to be larger than the optimal dual variable ϕ⋆, and ϕ⋆ is bounded under Slater's constraint qualification; specifically, ϕ⋆ ≤ [⟨π⋆, g(f(Z_d)⊤θ⋆)⟩ − ⟨π₀, g(f(Z_d)⊤θ⋆)⟩]/δ by Theorem 8.42 in (Beck, 2017).

Algorithm 1 CCLB Algorithm
1: Input: Horizon T, truncated parameter ρ, step size η = T/ρ, and initial dual value ϕ₁ = 0.
2: for t = 1, 2, 3, . . . , T do
3:   The learner observes the contextual information X_t = {a_t, w_t, m_t}.
4:   Update the estimated reward parameter θ̂_t.
5:   Build the confidence set C_t(α) from (11): C_t(α) = { θ ∈ Θ : L_t(θ) − L_t(θ̂_t) ≤ β_t(α)² }.
6:   Greedy procedure. Choose the optimistic reward parameter θ̃_t and select the greedy decision d_t:
       θ̃_t = argmax_{θ ∈ C_t(α)} max_{d ∈ D} g(f(Z_d)⊤θ),  (12)
       d_t = argmax_{d ∈ D} g(f(Z_d)⊤θ̃_t) − ϕ_t(|Δ̂_d(X_t)| − τ).  (13)
7:   Update the estimate of the dual variable:
       ϕ_{t+1} = Proj_{[0,ρ]}( ϕ_t + (1/η)(|Δ̂_{d_t}(X_t)| − τ) ).  (14)
8:   Receive a random reward y_{t+1}.
9: end for

Remark 2. The per-round computational complexity (O(t)) of Algorithm 1 is the same as that of standard algorithms for unconstrained logistic bandit problems (Abeille et al., 2021; Lee et al., 2024), since the dual update is executed via a single-step projection and the primal optimization retains the character of the unconstrained case, without constructing a prior safe subset designed for hard constraints as in (Amani et al., 2020). Additionally, the reward and the counterfactual fairness constraint in our algorithm share the same unknown parameter θ⋆.

3.3. Regret and constraint violations bounds

In this section, we provide theoretical upper bounds on both the regret and the constraint violations of Algorithm 1 and explain the main ideas behind the proof of Theorem 1.

Theorem 1. Suppose ρ ≥ 2/δ and η = T/ρ. For 0 ≤ τ < 1, under Slater's constraint qualification (Assumption 3) and the regularity assumptions (Assumptions 1 and 2), the CCLB algorithm achieves the following bounds simultaneously with probability at least 1 − α for any α ∈ (0, 1]:

R^+(T) = O( ρnS√T + ρn²S²κ_Z + ρ√T ),
V(T) = O( nS√T + n²S²κ_Z + √T ).

Both R^+(T) and V(T) hide dependencies on log(1/α).
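The greedy step (13) and the dual update (14) of Algorithm 1 can be sketched with the confidence-set maximization (12) abstracted behind callbacks. The interface below is hypothetical (`reward_est` and `effect_est` are assumed estimator callbacks, not part of the paper's pseudocode), and the toy numbers are illustrative.

```python
def cclb_round(decisions, phi, eta, rho, tau, reward_est, effect_est):
    """One round of the CCLB loop.  `reward_est(d)` stands in for the
    optimistic reward from (12) and `effect_est(d)` for the estimated
    effect |Delta_hat_d(X_t)|.
    Greedy step (13): reward penalized by the dual-weighted constraint.
    Dual step (14): projected gradient update clipped to [0, rho]."""
    d_t = max(decisions,
              key=lambda d: reward_est(d) - phi * (effect_est(d) - tau))
    phi_next = min(rho, max(0.0, phi + (effect_est(d_t) - tau) / eta))
    return d_t, phi_next

# toy run: decision 1 earns more but is unfair; the growing dual price
# phi eventually steers the learner toward the fair decision 0
rewards = {0: 0.6, 1: 0.9}
effects = {0: 0.05, 1: 0.5}
phi, picks = 0.0, []
for _ in range(50):
    d, phi = cclb_round([0, 1], phi, eta=10.0, rho=4.0, tau=0.1,
                        reward_est=rewards.__getitem__,
                        effect_est=effects.__getitem__)
    picks.append(d)
```

The run shows the soft-constraint mechanism at work: the unfair decision is chosen while the dual price is low, then the accumulated violations raise ϕ until the fair decision becomes greedy-optimal, so violations are transient rather than prevented per round.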
Remark 3. We remark that: (1) The leading term of our regret, O(ρnS√T), is similar to the bound O(nS√(T/κ⋆)) established in (Lee et al., 2024) up to logarithmic growth in T; it improves upon (Abeille et al., 2021) (OFULog-r) by a factor of S^{3/2} and upon (Zhang & Sugiyama, 2024) by at least a factor of S. Though it acquires a multiplicative factor ρ, note that, in an extreme case, when the Slater constant δ is optimized to 1/√(log T), the leading term scales as O(n√T). (2) Compared to the unconstrained case (Abeille et al., 2021; Zhang & Sugiyama, 2024; Lee et al., 2024), the regret bound R^+(T) exhibits an additional term ρ√T, which roughly captures the impact of the unknown counterfactual fairness constraint, i.e., a convex combination of logistic functions, which is no longer a logistic function. More specifically, the non-convex nature of the logistic mixture introduces a non-linear relationship between the constraint and the reward parameter, resulting in a more complex estimated feasible region of safe decisions at every round. (3) Compared to constrained generalized linear bandits (Amani et al., 2020), our regret bound shows a large improvement with respect to the worst-case constant κ (see Table 1). (4) If τ ≥ 1, the constraint violations bound V(T) is zero, since the counterfactual fairness constraint is satisfied for all decisions (see (5)) and our problem reduces to logistic bandits without constraints.

See Appendix F for the full proof. We next highlight a few key parts of the proof.

Proof Sketch of Theorem 1. We first derive the following key decomposition of the total regret and constraint violations, which holds for any ϕ ∈ [0, ρ]:

R^+(T) + ϕV(T) ≤ R₁ + R₂ + Σ_{t=1}^{T} (ϕ_t − 1) ⟨π⋆, g(f(Z_d)⊤θ̃_t) − g(f(Z_d)⊤θ⋆)⟩,

where

R₁ = Σ_{t=1}^{T} [ g(f(Z_{d_t})⊤θ̃_t) − g(f(Z_{d_t})⊤θ⋆) ],  R₂ = ϕ Σ_{t=1}^{T} [ g(f(Z′_{d_t})⊤θ̃_t) − g(f(Z′_{d_t})⊤θ⋆) ].

This is obtained by employing the dual variable update and the necessary algebraic manipulations.
Note that this bound serves as the cornerstone of the subsequent analysis of both the regret and the constraint violations. To further bound R₁ and R₂, we apply the following proposition for the logistic bandits regret:

Proposition 2. With probability at least 1 − α for any α ∈ (0, 1], under the CCLB algorithm, we have:

Σ_{t=1}^{T} g(f(Z_{d_t})⊤θ̃_t) − g(f(Z_{d_t})⊤θ⋆) = O( nS√T + n²S²κ_Z ).

The central idea behind Proposition 2 is to apply a Taylor expansion that tightly links estimation errors (e.g., between θ̃_t and θ⋆) to prediction errors (e.g., between g(f(Z_{d_t})⊤θ̃_t) and g(f(Z_{d_t})⊤θ⋆)); readers can refer to Appendix E for more technical details. As for the logistic bandits regret based on the counterfactual feature vectors, i.e., Σ_{t=1}^{T} g(f(Z′_{d_t})⊤θ̃_t) − g(f(Z′_{d_t})⊤θ⋆), we observe that the counterfactual feature vector Z′_{d_t} and the factual feature vector Z_{d_t} both lie in the same feature space Z in our problem. Thus, Σ_{t=1}^{T} g(f(Z′_{d_t})⊤θ̃_t) − g(f(Z′_{d_t})⊤θ⋆) admits, up to logarithmic factors, the same asymptotic upper bound as Σ_{t=1}^{T} g(f(Z_{d_t})⊤θ̃_t) − g(f(Z_{d_t})⊤θ⋆), and hence R₁ and R₂ are bounded. The regret upper bound on R^+(T) is then obtained by choosing ϕ = 0.

Inspired by (Beck, 2017), we apply tools from constrained convex optimization to obtain the bound on the constraint violations V(T). First, we define the probability distribution π̄ by ⟨π̄, g(f(Z_d)⊤θ⋆)⟩ = g(f(Z_{d_t})⊤θ⋆) and ⟨π̄, |Δ̂_d(X_t)| − τ⟩ = |Δ̂_{d_t}(X_t)| − τ, where the policy π̄ puts all its probability mass (equal to 1) on the decision d_t chosen by the learner after observing the contextual information at round t. Then we have

R^+(T) + ϕV(T) = Σ_{t=1}^{T} [ ⟨π⋆, g(f(Z_d)⊤θ⋆)⟩ − ⟨π̄, g(f(Z_d)⊤θ⋆)⟩ + ϕ⟨π̄, |Δ̂_d(X_t)| − τ⟩ ].

Since ⟨π, g(f(Z_d)⊤θ⋆)⟩ and ⟨π, |Δ̂_d(X_t)| − τ⟩ are both convex in π, by utilizing (Beck, 2017, Theorem 3.60) we obtain the upper bound on the cumulative constraint violations:

V(T) = Σ_{t=1}^{T} [ |Δ_{d_t}(X_t)| − τ ]_+ ≤ 2( R₁ + R₂ + √T ),

which finishes the proof.
3.4. Improved regret and constraint violations bounds

In Section 3.3, our analysis demonstrates that the proposed CCLB algorithm (Algorithm 1) achieves both sublinear regret and sublinear constraint violation upper bounds. A natural question is whether the constraint violations bound can be further improved. It turns out that, by introducing a tightness parameter ϵ < δ into the dual update of Algorithm 1,

ϕ_{t+1} = Proj_{[0,ρ]}( ϕ_t + (1/η)(|Δ̂_{d_t}(X_t)| − τ + ϵ) ),  (15)

one can achieve bounded, and in some cases even zero, constraint violations by trading off the regret slightly, while still preserving the same asymptotic order of regret as before. Intuitively, with a tightness parameter ϵ > 0 in the constraint, the learner is more cautious in selecting decisions, effectively working with a stricter constraint (e.g., with fairness threshold τ − ϵ instead of τ). Under this new, hypothetically pessimistic constraint function, the primal problem becomes max_π { ⟨π, g(f(Z_d)⊤θ⋆)⟩ : ⟨π, |Δ̂_d(X_t)| − τ + ϵ⟩ ≤ 0 }. Let π⋆_ϵ be the optimal solution of this new constrained optimization problem; we then have the following relationship between the policies π⋆_ϵ and π⋆:

Proposition 3. Let the policies π⋆ and π⋆_ϵ be the optimal solutions of the constrained problems max_π { ⟨π, g(f(Z_d)⊤θ⋆)⟩ : ⟨π, |Δ̂_d(X_t)| − τ⟩ ≤ 0 } and max_{π_ϵ} { ⟨π_ϵ, g(f(Z_d)⊤θ⋆)⟩ : ⟨π_ϵ, |Δ̂_d(X_t)| − τ + ϵ⟩ ≤ 0 }, respectively. For ϵ < δ, we have

Σ_{t=1}^{T} ⟨π⋆, g(f(Z_d)⊤θ⋆)⟩ − Σ_{t=1}^{T} ⟨π⋆_ϵ, g(f(Z_d)⊤θ⋆)⟩ ≤ ϵT.

To further investigate how the user-chosen parameter ϵ impacts the regret and constraint violation upper bounds, we define the regret associated with the policy π⋆_ϵ as R^ϵ_+(T) = Σ_{t=1}^{T} ⟨π⋆_ϵ, g(f(Z_d)⊤θ⋆)⟩ − g(f(Z_{d_t})⊤θ⋆), while the constraint violations remain defined by V(T) = Σ_{t=1}^{T} [ |Δ_{d_t}(X_t)| − τ ]_+. We then state the following theoretical results for R^ϵ_+(T) and V(T):

Theorem 2. Suppose ρ ≥ 2/δ and η = T/ρ.
For 0 ≤ τ < 1 and a user-chosen parameter ϵ ∈ [0, δ), under Slater's constraint qualification in Assumption 3 and the regularity assumptions in Assumptions 1 and 2, the CCLB algorithm with the refined constraint condition (see Equation (15)) attains the following upper bounds with probability at least 1 − α, for any α ∈ (0, 1]:
$$R^\epsilon_+(T) = O\big( \rho n S \sqrt{T} + \rho n^2 S^2 \kappa_Z + \rho \sqrt{T}(1 + \epsilon)^2 \big),$$
$$V(T) = O\big( n S \sqrt{T} + n^2 S^2 \kappa_Z + \sqrt{T}(1 + \epsilon)^2 - \epsilon T \big).$$
As before, both $R^\epsilon_+(T)$ and $V(T)$ hide dependencies on $\log\frac{1}{\alpha}$.

Remark 4. Notice that (1) by introducing a tightness parameter ϵ in the dual update, the associated regret $R^\epsilon_+(T)$ still achieves an asymptotic upper bound comparable to that of $R^+(T)$ in Theorem 1; nevertheless, the constraint violations upper bound $V(T)$ exhibits an ϵT reduction compared to the result in Theorem 1. Consequently, by selecting ϵ appropriately, one can offset the other terms through the subtraction of ϵT, thereby obtaining a constant upper bound (with respect to the horizon T) on the constraint violations. (2) The difference in regret bounds between Theorem 2 with a user-selected ϵ ∈ [0, δ) and Theorem 1 is $\rho \sqrt{T}(2\epsilon + \epsilon^2)$. For a problem-dependent (fixed) Slater's constraint qualification constant δ > 0, increasing ϵ only worsens the regret bound, as the learner is increasingly cautious (increasingly so in ϵ), selecting from a smaller set of actions than it would with ϵ = 0. If δ is large, ρ shrinks towards 2, and so for a fixed tightness ϵ the regret bound shrinks. A larger δ also allows a bigger range of ϵ, and thus more room for caution (and regret).

Proposition 4. Under the conditions stated in Theorem 2, for the user-selected parameter
$$\epsilon^\star = \Big( \sqrt{T} + C_1 n \log(T) + (C_2 + C_3 \kappa_Z)\, n^2 (\log(T))^2 \Big)^{1/2} \Big/ (2 C_4) \; - \; 1,$$
where $C_1, C_2, C_3, C_4$ are universal constants independent of $n, S, T, \kappa_Z$, if $n \ge 2$ and $\epsilon^\star < \delta$, then one can achieve a zero upper bound on the cumulative constraint violations by selecting $\epsilon \in (\epsilon^\star, \delta)$.
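To make the primal-dual mechanics concrete, the following is a minimal Python sketch of one round under the tightened dual update of Equation (15): the learner scores each decision by its Lagrangian-penalized optimistic reward, then performs the projected dual ascent step. All variable and function names here are ours; in the actual CCLB algorithm the optimistic parameter comes from the confidence set $C_t(\alpha)$, which is elided in this sketch.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g."""
    return 1.0 / (1.0 + np.exp(-z))

def primal_dual_round(features, b_hat, theta, phi, tau, eta, rho, eps=0.0):
    """One round of a primal-dual constrained logistic bandit (simplified sketch).

    features : (K, n) array, one mapped feature vector f(Z_d) per decision
    b_hat    : (K,) array; |b_hat[d]| - tau is the (estimated) fairness constraint
    theta    : (n,) current (optimistic) reward parameter estimate
    phi      : current dual variable; tau : fairness threshold
    eps      : tightness parameter (eps > 0 enforces the stricter threshold tau - eps)
    """
    # Primal step: maximize optimistic reward minus the dual penalty on the constraint.
    scores = sigmoid(features @ theta) - phi * (np.abs(b_hat) - tau)
    d = int(np.argmax(scores))
    # Dual step (Equation (15)): projected ascent on the tightened constraint.
    phi_next = float(np.clip(phi + (abs(b_hat[d]) - tau + eps) / eta, 0.0, rho))
    return d, phi_next
```

With phi = 0 the learner greedily maximizes the optimistic reward; as violations accumulate, phi grows (capped at ρ) and unfair decisions are penalized more heavily, which is the mechanism behind the regret/violation trade-off discussed above.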
Note that this user-chosen parameter ϵ trades off between the upper bounds on the regret and on the constraint violations (Jenatton et al., 2016). Minimizing regret often encourages exploration and adaptability to changing environments, which may lead to occasional violations of the constraints. Conversely, strictly adhering to the constraints may limit the algorithm's ability to adapt, potentially increasing regret.

4. Numerical Experiments

We next evaluate the empirical performance of our proposed methods on a synthetic data set. See Appendix H for additional experiments with different values of the constraint threshold τ and tightness parameter ϵ.

Data set description:⁴ We generated the synthetic dataset from a structural causal model (modifying an example from (Plecko & Bareinboim, 2024)):
$$A \leftarrow U_{AW}, \qquad W \sim N(0,\, 1 - U_{AW}),$$
$$M \sim \begin{cases} N(0,\, |W|/2 + |U_M|/3) & \text{if } A = 1, \\ N(0,\, |W|/3 + |U_M|/2) & \text{if } A = 0, \end{cases}$$
$$D_i \sim N(0,\, \max\{|W|, |M|\}),\quad i = 1, \ldots, 20, \qquad D \in \{D_1, D_2, \ldots, D_{20}\},$$
$$Y \leftarrow \mathbb{1}(U_Y + \cdots),$$
$$P(U) = \{\, U_{AW} \sim \mathrm{Bern}(0.5),\ U_M, U_Y \sim N(0, 1) \,\}.$$
As defined in Figure 1, A denotes the sensitive attribute (binary valued), W is the confounded feature, M represents the intermediate feature, D ∈ 𝒟 is the agent's decision, and Y is the outcome. At every round, we generate a set of 20 feature vectors $\{[A, W, M, D_i]\}_{i=1}^{20}$ along with their corresponding counterfactual feature vectors. We use rejection sampling over the sets to ensure that at least twelve of the feature vectors are feasible.

Algorithms:⁵ We evaluate four algorithms: GLM-UCB (Filippi et al., 2010) (unconstrained generalized linear bandits), OFULog+ (Lee et al., 2024) (unconstrained logistic bandits), CCLB (our method, causal logistic bandits

⁴The source code is available at https://github.com/jchen-research/CCLB.
⁵Another potential baseline is (Huang et al., 2022b), which also studied counterfactual fairness in the causal bandits framework, though for a different causal graph. Their code was not available at the time of this work.
with counterfactual fairness constraints, Algorithm 1), and ϵ-CCLB (our method with a user-chosen tightness parameter ϵ, Algorithm 2).

Figure 2: Plots for the algorithms GLM-UCB, OFULog+, CCLB (τ = 0.2), and ϵ-CCLB (ϵ = 0.14, τ = 0.2) on (a) cumulative regret; (b) cumulative constraint violations; (c) penalized cumulative regret.

Metrics: We evaluate the algorithms using cumulative regret (6), cumulative constraint violations (7), and a penalized form of cumulative regret, for different horizons. For the penalized cumulative regret, when the action picked by the learner violates the counterfactual fairness constraint, the learner still observes the reward value (i.e., the learner can improve the reward parameter estimate $\hat\theta$), but we count the reward earned as 0. In this way, constraint violations are allowed but are not (directly) profitable. This penalized form combines the two primary metrics for simpler analysis.

Results: The results are plotted in Figure 2. Beginning with penalized cumulative regret (Figure 2(c)), where rewards are only received for fair actions, there is a large gap between our method (CCLB) and both OFULog+ and GLM-UCB, with the gap growing for longer horizons. This is expected, since GLM-UCB and OFULog+ do not account for constraints. Both baselines have nearly linear penalized cumulative regret across the horizons used. OFULog+ does because it frequently violates constraints, and thus accumulates large cumulative constraint violations, despite learning good actions (for the unconstrained problem); indeed, for (unpenalized) cumulative regret, OFULog+ performs better than our method (which seeks to satisfy the constraint). GLM-UCB performs poorly at identifying good actions within the horizons (Figure 2(a)). GLM-UCB's regret bound has a linear dependence on κ (see Table 1), and GLM-UCB is also designed for a more general class of reward functions.
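The penalized metric described above can be computed as in the following short sketch (array and function names are ours): rewards earned on violating rounds are zeroed out before the regret is accumulated, while the unpenalized regret would use them as-is.

```python
import numpy as np

def penalized_cumulative_regret(opt_rewards, earned_rewards, violated):
    """Cumulative regret where rewards earned on rounds that violate the
    fairness constraint are counted as 0 (the learner still observes them
    for estimating theta, but they are not credited)."""
    credited = np.where(violated, 0.0, earned_rewards)
    return np.cumsum(np.asarray(opt_rewards) - credited)

# Example: three rounds, the second of which violates the constraint.
curve = penalized_cumulative_regret([1.0, 1.0, 1.0], [0.9, 1.0, 0.9], [False, True, False])
```

Here `curve` grows by the full optimal reward on the violating round, which is why frequent violators such as an unconstrained baseline exhibit nearly linear penalized regret.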
Though ϵ-CCLB has a larger cumulative regret and (slightly) larger cumulative constraint violations than CCLB (but smaller than GLM-UCB), the penalized regret of ϵ-CCLB is smaller than that of CCLB as T grows (Figure 2(c)); in particular, the growth rate for ϵ-CCLB is nearly 0 from horizon T = 6,000 to horizon T = 10,000, as it rarely violates the constraints.

5. Conclusion

This paper introduced a framework for logistic bandits with counterfactual fairness constraints built within a causal structure. The proposed approach attains satisfactory results, demonstrating sublinear growth in both regret and constraint violations by effectively balancing exploration and exploitation via primal-dual optimization. By introducing a user-chosen parameter, one can trade off the upper bounds between regret and constraint violations to achieve zero cumulative constraint violations. Several promising directions emerge for future research. (1) One important direction is to extend our method to work with unobserved confounders (i.e., W would be unobserved). (2) Another interesting direction is to extend our model to handle distribution shifts over time. (3) A third direction would be to extend our work to handle budget constraints and consider a fairness notion defined by the resource assignment, potentially building on existing work in bandits with knapsacks (Tran-Thanh et al., 2012; Badanidiyuru et al., 2018; Nie et al., 2024).

Acknowledgments

We thank Wen Huang for discussion. This material is based upon work supported by the National Science Foundation under Award No. 2321786.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. The results in this paper may have implications bringing together strong practical results in causality and decision-making. Such integration is anticipated to contribute to the development of trustworthy AI.
References

Abeille, M., Faury, L., and Calauzènes, C. Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pp. 3691–3699. PMLR, 2021.

Agrawal, S. and Devanur, N. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the MNL-bandit. In Conference on Learning Theory, pp. 76–78. PMLR, 2017.

Amani, S., Alizadeh, M., and Thrampoulidis, C. Linear stochastic bandits under safety constraints. Advances in Neural Information Processing Systems, 32, 2019.

Amani, S., Alizadeh, M., and Thrampoulidis, C. Generalized linear bandits with safety constraints. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3562–3566. IEEE, 2020.

Amari, S.-i. Information Geometry and Its Applications. Springer Publishing Company, Incorporated, 1st edition, 2016. ISBN 4431559779.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, May 2002. ISSN 0885-6125. doi: 10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. Journal of the ACM, 65(3), March 2018. ISSN 0004-5411. doi: 10.1145/3164539. URL https://doi.org/10.1145/3164539.

Beck, A. First-Order Methods in Optimization. SIAM (Society for Industrial and Applied Mathematics), Philadelphia, PA, USA, 2017. ISBN 1611974984.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees.
In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26. JMLR Workshop and Conference Proceedings, 2011.

Brekelmans, R., Masrani, V., Wood, F., Steeg, G. V., and Galstyan, A. All in the exponential family: Bregman duality in thermodynamic variational inference. arXiv preprint arXiv:2007.00642, 2020.

Chiappa, S. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 7801–7808, 2019.

Chouldechova, A. and Roth, A. A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82–89, 2020.

Correa, J. and Bareinboim, E. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10093–10100, 2020.

Darlington, R. B. Another look at "cultural fairness". Journal of Educational Measurement, 8(2):71–82, 1971. ISSN 0022-0655, 1745-3984. URL http://www.jstor.org/stable/1433960.

Ding, D., Wei, X., Yang, Z., Wang, Z., and Jovanovic, M. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pp. 3304–3312. PMLR, 2021.

Dong, S., Ma, T., and Van Roy, B. On the performance of Thompson sampling on logistic bandits. In Conference on Learning Theory, pp. 1158–1160. PMLR, 2019.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Efroni, Y., Mannor, S., and Pirotta, M. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.

Faury, L., Abeille, M., Calauzènes, C., and Fercoq, O. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp. 3052–3060. PMLR, 2020.

Faury, L., Abeille, M., Jun, K.-S., and Calauzènes, C.
Jointly efficient and optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pp. 546–580. PMLR, 2022.

Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.

Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Logistic regression: The importance of being improper. In Conference on Learning Theory, pp. 167–208. PMLR, 2018.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.

Hu, Y. and Zhang, L. Achieving long-term fairness in sequential decision making. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 9549–9557, 2022.

Hu, Y., Wu, Y., and Zhang, L. Long-term fair decision making through deep generative models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 22114–22122, 2024.

Huang, W., Labille, K., Wu, X., Lee, D., and Heffernan, N. Achieving user-side fairness in contextual bandits. Human-Centric Intelligent Systems, 2(3):81–94, 2022a.

Huang, W., Zhang, L., and Wu, X. Achieving counterfactual fairness for causal bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6952–6959, 2022b.

Imai, K. and Jiang, Z. Principal fairness for human and algorithmic decision-making. Statistical Science, 38(2):317–328, 2023. doi: 10.1214/22-STS872. URL https://doi.org/10.1214/22-STS872.

Imbens, G. W. and Rubin, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In International Conference on Machine Learning, pp. 1617–1626. PMLR, 2017.

Jenatton, R., Huang, J., and Archambeau, C.
Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pp. 402–411. PMLR, 2016.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Meritocratic fairness for infinite and contextual bandits. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 158–163, 2018.

Khezeli, K. and Bitar, E. Safe linear stochastic bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10202–10209, 2020.

Krause, A. and Guestrin, C. Near-optimal observation selection using submodular functions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 7, pp. 1650–1654, 2007.

Kusner, M. J., Loftus, J., Russell, C., and Silva, R. Counterfactual fairness. Advances in Neural Information Processing Systems, 30, 2017.

Lavori, P. W. and Dawson, R. Adaptive treatment strategies in chronic disease. Annual Review of Medicine, 59(1):443–453, 2008.

Lee, J., Yun, S.-Y., and Jun, K.-S. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. In International Conference on Artificial Intelligence and Statistics, pp. 4474–4482. PMLR, 2024.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071–2080. PMLR, 2017.

Liu, Q., Xu, W., Wang, S., and Fang, Z. Combinatorial bandits with linear constraints: Beyond knapsacks and fairness. Advances in Neural Information Processing Systems, 35:2997–3010, 2022.

Liu, X., Li, B., Shi, P., and Ying, L. An efficient pessimistic-optimistic algorithm for stochastic linear bandits with general constraints. Advances in Neural Information Processing Systems, 34:24075–24086, 2021.
Murphy, S. A. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331–355, 2003.

Nabi, R. and Shpitser, I. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Nie, G., Zhu, Y., Nadew, Y. Y., Basu, S., Pavan, A., and Quinn, C. J. Size-constrained k-submodular maximization in near-linear time. In Uncertainty in Artificial Intelligence, pp. 1545–1554. PMLR, 2023.

Nie, G., Aggarwal, V., and Quinn, C. J. Gradient methods for online DR-submodular maximization with stochastic long-term constraints. Advances in Neural Information Processing Systems, 2024.

Oh, M.-h. and Iyengar, G. Thompson sampling for multinomial logit contextual bandits. Advances in Neural Information Processing Systems, 32, 2019.

Osband, I. and Van Roy, B. Near-optimal reinforcement learning in factored MDPs. Advances in Neural Information Processing Systems, 27, 2014.

Pacchiano, A., Ghavamzadeh, M., Bartlett, P., and Jiang, H. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pp. 2827–2835. PMLR, 2021.

Pearl, J. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995. ISSN 0006-3444, 1464-3510. URL http://www.jstor.org/stable/2337329.

Pearl, J. Direct and indirect effects. In UAI '01, pp. 411–420, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 052189560X.

Plecko, D. and Bareinboim, E. Causal fairness for outcome control. Advances in Neural Information Processing Systems, 36:47575–47597, 2023.

Plecko, D. and Bareinboim, E. Causal fairness analysis. Foundations and Trends in Machine Learning, 17(3):1–238, 2024. doi: 10.1561/2200000106.

Plecko, D. and Bareinboim, E.
Fairness-accuracy trade-offs: A causal perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 26344–26353, 2025.

Tran-Thanh, L., Chapman, A., Rogers, A., and Jennings, N. Knapsack-based optimal policies for budget-limited multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pp. 1134–1140, 2012.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. ISBN 978-1-108-41519-4. doi: 10.1017/9781108231596.

Wu, H., Srikant, R., Liu, X., and Jiang, C. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. Advances in Neural Information Processing Systems, 28, 2015.

Wu, Y., Zhang, L., and Wu, X. Counterfactual fairness: Unidentification, bound and algorithm. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019.

Xu, H., Liu, Y., Lau, W. C., and Li, R. Combinatorial multi-armed bandits with concave rewards and fairness constraints. In IJCAI, pp. 2554–2560, 2020.

Yu, H., Neely, M., and Wei, X. Online convex optimization with stochastic constraints. Advances in Neural Information Processing Systems, 30, 2017.

Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180, 2017.

Zhang, J. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In International Conference on Machine Learning, pp. 11012–11022. PMLR, 2020.

Zhang, J. and Bareinboim, E. Fairness in decision-making: The causal explanation formula. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Zhang, L., Wu, Y., and Wu, X. A causal framework for discovering and removing direct and indirect discrimination.
arXiv preprint arXiv:1611.07509, 2016.

Zhang, Y.-J. and Sugiyama, M. Online (multinomial) logistic bandit: Improved regret and constant computation cost. Advances in Neural Information Processing Systems, 36, 2024.

Zhou, X. and Ji, B. On kernelized multi-armed bandits with constraints. Advances in Neural Information Processing Systems, 35:14–26, 2022.

Contents

1 Introduction
  1.1 Contributions
  1.2 Related works
  1.3 Preliminaries
2 A Theoretical Framework for Constrained Causal Logistic Bandits
  2.1 Logistic bandits with structural causal model
  2.2 Counterfactual fairness modeling via soft constraint
  2.3 Model assumptions and definition
3 Methods for Constrained Causal Logistic Bandits
  3.1 Convex confidence set
  3.2 Online learning algorithm
  3.3 Regret and constraint violations bounds
  3.4 Improved regret and constraint violations bounds
4 Numerical Experiments
5 Conclusion
A Preliminaries
B Additional Related Works
C Causal Logistic Bandits Framework
D Confidence Sets
E Logistic Error Upper Bound
F Regret and Constraint Violations Upper Bounds
  F.1 Proof of Theorem 1
  F.2 Proof of Proposition 3
  F.3 Proof of Theorem 2
  F.4 Proof of Proposition 4
G Technical Lemmas
H Additional Experiments

A. Preliminaries

We first provide a formal definition of d-separation, discussed in Section 1.3.

Definition 2 (d-separation (Pearl, 2009)). A path p is said to be d-separated (or blocked) by a set of nodes Z if and only if (1) p contains a chain i → m → j or a fork i ← m → j such that the middle node m is in Z, or (2) p contains an inverted fork (collider) i → m ← j such that the middle node m is not in Z and no descendant of m is in Z.

We next detail some notation used throughout the paper.

𝒟: the set of all decisions.
θ*: true reward parameter vector (in ℝⁿ).
Y: reward variable, taking values in {0, 1}.
X_t: context vector, including the specified attribute, the confounded features, and the intermediate features.
f(Z_{d_t}): mapped feature vector (in ℝⁿ).
λ_t: regularization parameter.
ϕ_t: dual variable.
ρ: truncation parameter.
ϵ: user-chosen tightness parameter.
δ: Slater's constant.
α: failure probability.
C_t(α): confidence set.
B_p^n(1): n-dimensional ball of radius 1 under the ℓ_p norm.
‖·‖: ℓ₂ norm.
‖·‖_H: H-weighted Euclidean norm (H a symmetric positive definite matrix).
Bern(p): Bernoulli distribution with parameter p.
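Definition 2 can be mechanized for a single path as a short sketch (the data layout is ours, and the collider clause includes the descendant condition from Pearl's definition): a path is blocked if some interior node satisfies clause (1) or clause (2).

```python
def path_blocked(path, Z, descendants=None):
    """Decide whether a single path is d-separated (blocked) by the set Z.

    path: list of (middle_node, kind) pairs, one per interior node, where
          kind is 'chain' (i -> m -> j), 'fork' (i <- m -> j), or
          'collider' (i -> m <- j).  This data layout is ours.
    descendants: optional dict mapping a node to the set of its descendants
          (used in the collider clause of Pearl's definition).
    """
    descendants = descendants or {}
    for m, kind in path:
        if kind in ('chain', 'fork') and m in Z:
            return True   # clause (1): a non-collider in Z blocks the path
        if kind == 'collider' and m not in Z and not (descendants.get(m, set()) & Z):
            return True   # clause (2): a collider outside Z, with no descendant in Z
    return False

# The chain A -> B -> C is blocked by conditioning on B; the collider
# A -> B <- C is blocked by the empty set but unblocked by conditioning on B.
assert path_blocked([('B', 'chain')], Z={'B'})
assert path_blocked([('B', 'collider')], Z=set())
assert not path_blocked([('B', 'collider')], Z={'B'})
```

Two nodes are then d-separated by Z when every path between them is blocked in this sense.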
We further recall and introduce the following functions, which are used in the subsequent analysis:
$$\Delta_{d_t}(X_t) = g(f(Z_{d_t})^\top \theta^\star) - g(f(Z'_{d_t})^\top \theta^\star) \quad (16)$$
$$V_t = \sum_{\tau=1}^{t} f(Z_{d_\tau}) f(Z_{d_\tau})^\top + \lambda_t I_n \quad (17)$$
$$H_t(\theta^\star) = \sum_{\tau=1}^{t} \dot{g}(f(Z_{d_\tau})^\top \theta^\star)\, f(Z_{d_\tau}) f(Z_{d_\tau})^\top + \lambda_t I_n \quad (18)$$
$$G_t(\theta, \theta^\star) = \sum_{\tau=1}^{t} \int_{v=0}^{1} (1 - v)\, \dot{g}\big(f(Z_{d_\tau})^\top \theta^\star + v\, f(Z_{d_\tau})^\top (\theta - \theta^\star)\big)\, dv\; f(Z_{d_\tau}) f(Z_{d_\tau})^\top + \lambda_t I_n \quad (19)$$
$$\omega(f(Z_{d_\tau}), \theta_t, \theta) = \int_{v=0}^{1} \dot{g}\big(f(Z_{d_\tau})^\top \theta_t + v\, f(Z_{d_\tau})^\top (\theta - \theta_t)\big)\, dv \quad (20)$$
where the regularized design matrices $V_t$, $H_t(\theta^\star)$, $G_t(\theta, \theta^\star)$, and the weight function $\omega(f(Z_{d_\tau}), \theta_t, \theta)$ are defined for the proof of the logistic bandits regret upper bound in Appendix E. In particular, $H_t(\theta^\star)$ measures the local behavior of the logistic function through $\dot{g}(f(Z_{d_\tau})^\top \theta^\star)$.

B. Additional Related Works

Logistic bandits. The logistic bandits model represents a sequential decision-making framework that has attracted substantial attention within the parametric bandits literature (Li et al., 2010; Filippi et al., 2010; Li et al., 2017; Dong et al., 2019). In a recent work, Faury et al. (2020) proposed an optimistic algorithm based on a finer examination of the non-linearities of the reward function to address the prohibitive linear dependence on κ in the regret upper bound. Abeille et al. (2021) proved a minimax-optimal rate by deriving an $\Omega\big(n\sqrt{T/\kappa_\star(T)}\big)$ problem-dependent lower bound, which implies that the non-linearity in logistic bandits can ease the exploration-exploitation trade-off in the long-term regime, i.e., $\kappa_\star(T) > 1$. Faury et al. (2022) addressed the issue of computational tractability while preserving statistical efficiency by designing a new convex confidence set. Additionally, another line of research is the multinomial logit contextual bandit problem (Agrawal et al., 2017; Oh & Iyengar, 2019; Zhang & Sugiyama, 2024; Lee et al., 2024), which generalizes the binary logistic bandit by allowing the learner to select a subset of arms.
In particular, (Zhang & Sugiyama, 2024) and (Lee et al., 2024) also improve logistic bandits in terms of computational complexity and the regret guarantee (with respect to S), respectively.

Fairness. The body of research in fair machine learning is expanding and encompasses a variety of contexts. Within this field, three distinct tasks can be identified: (1) the detection and quantification of biases in currently deployed policies; (2) the development of fair predictive models for outcomes; and (3) the formulation of fair decision-making policies. Our work falls under the setting of online outcome control (task (3)), which explores fairness through a causal lens (Huang et al., 2022a;b; Plecko & Bareinboim, 2023; 2025). Unlike us, Plecko & Bareinboim (2023; 2025) explored fairness through path-specific counterfactual effects in an offline setting with a budget constraint. As for the online setting, Hu & Zhang (2022) studied achieving long-term fairness within a Markov decision process (MDP) framework, quantifying long-term fairness by evaluating path-specific effects in a causal graph under interventions on the sensitive attributes and predicted decisions. More recently, Hu et al. (2024) studied long-term fair decision making through deep generative models.

Constrained MABs. There is a large body of work on bandits with different types of constraints, including knapsack bandits (Wu et al., 2015; Agrawal & Devanur, 2016), submodular maximization (Krause & Guestrin, 2007; Nie et al., 2023), bandits with hard safety constraints (Amani et al., 2019; Pacchiano et al., 2021), and bandits with cumulative soft constraints (Liu et al., 2021; Zhou & Ji, 2022). Among them, the bandit setting with cumulative soft constraints is most closely related to ours, in that the goal is also to minimize the cumulative constraint violation.
In particular, Zhou & Ji (2022) considered a general unknown reward function and a general unknown constraint function in kernelized bandits via primal-dual optimization. More broadly, this type of constrained problem has also been studied in the reinforcement learning (RL) setting (Efroni et al., 2020; Ding et al., 2021), where constraints are managed through convex optimization methods.

Figure 3: (a) A causal diagram representing $G_{\underline{D}}$; (b) another causal diagram representing $G_{\underline{D},\underline{A}}$.

C. Causal Logistic Bandits Framework

Here, we provide another example to motivate our constrained causal logistic bandits problem.

Online recommendation system (Huang et al., 2022b). Customers arrive sequentially according to an underlying stochastic distribution, and an online decision-making model selects and recommends a specific item to each incoming individual based on a predefined strategy. In this context, each arm represents a distinct item or piece of content available for recommendation to a user. The reward is determined by the user's interaction with the recommended item, such as whether or not the user clicks on it. The fairness constraint mandates that customers with similar profiles receive similar rewards, irrespective of their specific attributes and the particular items being recommended.

Next, we provide proofs for (2) and (3), which follow by the do-calculus rules (Pearl, 1995):
$$E[Y \mid do(d_t), do(a_t), w_t, m_t] = E[Y \mid d_t, a_t, w_t, m_t] \quad (21)$$
$$= E[Y \mid Z_{d_t}] \quad (22)$$
$$= g(f(Z_{d_t})^\top \theta^\star), \quad (23)$$
where (21) follows from $(D, A \perp\!\!\!\perp Y \mid W, M)$ in $G_{\underline{D},\underline{A}}$ (see Figure 3b); (22) follows by denoting $Z_{d_t}$ as the features formed from $d_t$, $w_t$, $m_t$, and $a_t$; and (23) follows by the logistic reward assumption (Equation (1)).
As for (3),
$$E[Y \mid do(d_t), do(a'_t), w_t, m_t] = E[Y \mid d_t, a'_t, w_t, m_t] \quad (24)$$
$$= E[Y \mid Z'_{d_t}] \quad (25)$$
$$= g(f(Z'_{d_t})^\top \theta^\star), \quad (26)$$
where (24) follows from $(D, A \perp\!\!\!\perp Y \mid W, M)$ in $G_{\underline{D},\underline{A}}$ (see Figure 3b); (25) follows by denoting $Z'_{d_t}$ as the features formed from $d_t$, $w_t$, $m_t$, and $a'_t$; and (26) follows by the logistic reward assumption, as above.

D. Confidence Sets

In this section, we provide proofs for the construction of the improved convex confidence set for the estimated bandit parameter presented in Section 3.1. We borrow techniques from (Lee et al., 2024, Section 3) to obtain the results. Recall the convex confidence set definition:
$$C_t(\alpha) = \big\{ \theta \in \Theta : L_t(\theta) - L_t(\hat\theta_t) \le \beta_t(\alpha)^2 \big\}, \qquad \beta_t(\alpha)^2 = 10n \log\Big( \frac{St}{2n} + e \Big) + 2\big( (e - 2) + S \big) \log\frac{1}{\alpha}.$$

Proposition 1. Let α ∈ (0, 1]. Then $P\big(\forall t \ge 1,\ \theta^\star \in C_t(\alpha)\big) \ge 1 - \alpha$.

Proof. The proof unfolds through three principal technical components, similar to (Lee et al., 2024). First, we invoke decomposition identities for the logistic loss, expressing $L_t(\theta^\star) - L_t(\hat\theta_t)$ as the sum of (i) the regret of an online learning algorithm, (ii) a martingale difference sum, and (iii) a collection of KL-divergence terms. Second, to control the martingale sum, we derive and apply an anytime variant of Freedman's inequality tailored to martingales. Third, to bound the KL-divergence contribution, we fuse the self-concordant analysis of Abeille et al. (2021) with an information-geometric interpretation of the KL divergence.

First, we let $\xi_\tau$ denote real-valued martingale difference noise with $y_\tau = g(f(Z_{d_\tau})^\top \theta^\star) + \xi_\tau$. Then, for the logistic loss $\ell_\tau(\theta) = -y_\tau \log g(f(Z_{d_\tau})^\top \theta) - (1 - y_\tau) \log\big(1 - g(f(Z_{d_\tau})^\top \theta)\big)$, the following equality holds for any θ:
$$\ell_\tau(\theta^\star) = \ell_\tau(\theta) + \xi_\tau \langle f(Z_{d_\tau}), \theta - \theta^\star \rangle - \mathrm{KL}\big( \mathrm{Bern}(g(f(Z_{d_\tau})^\top \theta^\star)),\ \mathrm{Bern}(g(f(Z_{d_\tau})^\top \theta)) \big).$$
The equality follows from the first-order Taylor expansion with an integral remainder (see (Lee et al., 2024, Appendix C.4.1) for more details).
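The KL-divergence terms in this decomposition admit the information-geometric reading used in the third step of the proof: a KL divergence between Bernoulli distributions equals a Bregman divergence of the log-partition function $m(z) = \log(1 + e^z)$. A quick numeric sanity check of that identity in Python (function names are ours):

```python
import math

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def m(z):
    """Log-partition function of the Bernoulli distribution in its natural parameter."""
    return math.log1p(math.exp(z))

def kl_bern(p, q):
    """KL(Bern(p) || Bern(q))."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def bregman_m(z1, z2):
    """Bregman divergence D_m(z1, z2) = m(z1) - m(z2) - m'(z2)(z1 - z2), with m'(z) = g(z)."""
    return m(z1) - m(z2) - g(z2) * (z1 - z2)

# The identity KL(Bern(g(z2)), Bern(g(z1))) = D_m(z1, z2) holds for any z1, z2:
for z1, z2 in [(0.7, -0.4), (2.0, 1.5), (-1.0, 3.0)]:
    assert abs(kl_bern(g(z2), g(z1)) - bregman_m(z1, z2)) < 1e-12
```

This is the exponential-family duality (Amari, 2016; Brekelmans et al., 2020) that lets the KL terms be lower bounded through the second derivative of $m$, i.e., through $\dot{g}$.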
Setting θ to be the optimistic estimate $\tilde\theta_\tau$ and summing over time steps τ,
$$\sum_{\tau=1}^{t} \big[ \ell_\tau(\theta^\star) - \ell_\tau(\tilde\theta_\tau) \big] = \sum_{\tau=1}^{t} \Big[ \xi_\tau \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle - \mathrm{KL}\big( \mathrm{Bern}(g(f(Z_{d_\tau})^\top \theta^\star)), \mathrm{Bern}(g(f(Z_{d_\tau})^\top \tilde\theta_\tau)) \big) \Big].$$
Adding and subtracting $\ell_\tau(\hat\theta_t)$ (27) and rearranging terms (28), we get
$$\sum_{\tau=1}^{t} \big[ \ell_\tau(\theta^\star) - \ell_\tau(\hat\theta_t) \big] = \sum_{\tau=1}^{t} \Big[ \xi_\tau \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle - \mathrm{KL}\big( \mathrm{Bern}(g(f(Z_{d_\tau})^\top \theta^\star)), \mathrm{Bern}(g(f(Z_{d_\tau})^\top \tilde\theta_\tau)) \big) \Big] + \sum_{\tau=1}^{t} \big[ \ell_\tau(\tilde\theta_\tau) - \ell_\tau(\hat\theta_t) \big]. \quad (28)$$
We further define $\zeta_1(t) = \sum_{\tau=1}^{t} \xi_\tau \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle$, $\zeta_2(t) = \sum_{\tau=1}^{t} \mathrm{KL}\big( \mathrm{Bern}(g(f(Z_{d_\tau})^\top \theta^\star)), \mathrm{Bern}(g(f(Z_{d_\tau})^\top \tilde\theta_\tau)) \big)$, and $\zeta_3(t) = \sum_{\tau=1}^{t} \big[ \ell_\tau(\tilde\theta_\tau) - \ell_\tau(\hat\theta_t) \big]$. Using (28), we then have
$$L_t(\theta^\star) - L_t(\hat\theta_t) = \sum_{\tau=1}^{t} \big[ \ell_\tau(\theta^\star) - \ell_\tau(\hat\theta_t) \big] = \zeta_1(t) - \zeta_2(t) + \zeta_3(t). \quad (29)$$

Upper bounding ζ1(t). Recall that $F_\tau = \sigma\{ f(Z_{d_1}), y_1, \ldots, f(Z_{d_\tau}), y_\tau, f(Z_{d_{\tau+1}}) \}$ is the filtration for our bandit model, $f(Z_{d_\tau})$ and $\tilde\theta_\tau$ are $F_{\tau-1}$-measurable, and $\xi_\tau$ is a martingale difference sequence w.r.t. $F_{\tau-1}$. Thus we have
$$E\big[ \xi_\tau^2 \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle^2 \mid F_{\tau-1} \big] = \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle^2\, E[\xi_\tau^2 \mid F_{\tau-1}] \quad (30)$$
$$= \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle^2\, g(f(Z_{d_\tau})^\top \theta^\star)\big( 1 - g(f(Z_{d_\tau})^\top \theta^\star) \big) \quad (31)$$
$$= \dot{g}(f(Z_{d_\tau})^\top \theta^\star)\, \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle^2, \quad (32)$$
where (30) follows by the measurability of $\langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle$ w.r.t. $F_{\tau-1}$, and (31) holds because $E[\xi_\tau^2 \mid F_{\tau-1}] = \mathrm{Var}(y_\tau) = g(f(Z_{d_\tau})^\top \theta^\star)(1 - g(f(Z_{d_\tau})^\top \theta^\star))$. Also,
$$| \xi_\tau \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle | \le | \xi_\tau |\, | \langle f(Z_{d_\tau}), \tilde\theta_\tau - \theta^\star \rangle | \le | \langle f(Z_{d_\tau}), \tilde\theta_\tau \rangle | + | \langle f(Z_{d_\tau}), \theta^\star \rangle | \le 2S, \quad (33)$$
where (33) follows by the triangle inequality and $|\xi_\tau| \le 1$. From (Beygelzimer et al., 2011, Theorem 1), we can apply Freedman's inequality to obtain the following result.

Lemma 1 ((Lee et al., 2024, Lemma 3)). Let $Z_1, \ldots, Z_t$ be a martingale difference sequence satisfying $\max_\tau |Z_\tau| \le R$ a.s., and let $F_\tau$ be the σ-field generated by $Z_1, \ldots, Z_\tau$.
Then for any α (0, 1) and any η [0, 1/R], the following holds with probability at least 1 α: τ=1 Zτ (e 2)η τ=1 E[Z2 τ |Fτ 1] + 1 Thus, for η [0, 1 2S ] to be chosen later, by invoking Lemma 1 for the martingale difference sequence Z1, ..., Zt, the following holds with probability at least 1 α, t 1: ζ1(t) (e 2)η τ=1 g(f(Zdτ ) θ ) f(Zdτ ), θτ θ 2 + 1 Lower Bounding ζ2(t). From the standard result in information geometry (Amari, 2016; Brekelmans et al., 2020), we have the following result: Lemma 2. (Lee et al., 2024, Lemma 4). Let m(z) := log(1 + ez) be the log-partition function for the Bernoulli distribution and g(z) = 1 1+e z . Then, we have that KL Bern(g(z2)), Bern(g(z1)) = Dm(z1, z2), where Dm(z1, z2) is the Bregman Divergence defined as Dm(z1, z2) = R z1 z2 m(z)(z1 z) dz. Notice that Dm(z1, z2) = Z z1 z2 m(z)(z1 z) dz = Z z1 log(1 + ez) (z1 z) dz = Z z1 z2 g(z)(z1 z) dz. (36) Thus, we have the following lower bound on ζ2(t), τ=1 KL Bern(g(z2)), Bern(g(z1)) (37) τ=1 Dm f(Zdτ ) θτ, f(Zdτ ) θ (38) Z f(Zdτ ) θτ f(Zdτ ) θ g(z)(f(Zdτ ) θτ z) dz (39) Causal Logistic Bandits with Counterfactual Fairness Constraint τ=1 f(Zdτ ), θ θτ 2 Z 1 0 (1 v) g f(Zdτ ) (v θτ + (1 v)θ dv (40) τ=1 f(Zdτ ), θ θτ 2 g(f(Zdτ ) θ ) 2 + |f(Zdτ ) (θ θτ)| (41) τ=1 f(Zdτ ), θ θτ 2 g(f(Zdτ ) θ ) 2 + 2S . (42) Where (37) follows by definition of ζ2; (38) uses Lemma 2; (39) uses (36); (40) follows by change of variables; (41) follows by Lemma 6; and (42) follows by Assumptions 1 and 2. Upper Bounding ζ3(t) (Lee et al., 2024, Theorem 2). 
From (Foster et al., 2018, Theorem 3), there exists an (improper learning) algorithm for online logistic regression with the following regret: ζ3(t) 10n log e + St Though our selected decisions are more conservative than those of (Lee et al., 2024) (we add a penalty term when selecting decisions to account for constraint violations), the estimation method to obtain ˆθt (i.e., MLE) and the way to compute the optimistic estimate θt (i.e., θt = arg maxθ Ct maxd D g(f(Zd) θ)) are the same as (Lee et al., 2024). See more justifications for using the improper learning algorithm in (Lee et al., 2024, Appendix B.2). Combining Equation (29), (35), (42), and (43), Lt(θ) Lt(ˆθt) = ζ1(t) ζ2(t) + ζ3(t) τ=1 g(f(Zdτ ) θ ) f(Zdτ ), θτ θ 2 + 1 τ=1 f(Zdτ ), θ θτ 2 g(f(Zdτ ) θ ) + 10n log e + St 10n log( St 2n + e) + 2((e 2) + S) log 1 Where (44) follows by η = 1 2(e 2)+2S < 1 2S , which finishes the proof. E. Logistic Error Upper Bound In this section, we provide the proofs for error in mean reward from an action dt based on optimistic θt instead of θ . Some of the details follow from (Faury et al., 2020, Appendix B) and (Abeille et al., 2021, Appendix C). h E[b Y |do(dt), do(at), wt, mt] E[Y |do(dt), do(at), wt, mt] i h g(f(Zdt) θt) g(f(Zdt) θ ) i (45) h g(f(Zdt) θ ) f(Zdt) ( θt θ ) i | {z } term (a) h Z f(Zdt) θt f(Zdt) θ g(u)(f(Zdt) θt u) du i where the (45) comes from the expected reward in (2); the (46) is by performing the Taylor Series Expansion of g(f(Zdt) θt) on f(Zdt) θ with a first order integral remainder. Then we rewrite the logistic regret Rlog as term (a) and term (b), where, h g(f(Zdt) θ ) f(Zdt) ( θt θ ) i Causal Logistic Bandits with Counterfactual Fairness Constraint h Z f(Zdt) θt f(Zdt) θ g(u)(f(Zdt) θt u) du i . We separately upper bound both terms. Firstly, we prove the following Lemma (Lee et al., 2024, Lemma 6) used throughout this section. Lemma 3. 
With λt = 1 4S2(2+2S), η = 1 2(e 2)(2+2S), and ϵt = n t , for any θ Ct(α), the following holds with probability at least 1 α: θ θ 2 Ht(θ ) γt(α)2, where γt(α)2 = 2(2 + 2S) h βt(α)2 + 2(2 + 2S)(e 2) log 1 α + log 5St 4 4Sη + ηϵt + 1 n i + 2. Proof. By Proposition 1, we have that with probability at least 1 α, Lt(θ ) Lt(ˆθt) βt(α)2, we assume this event is true throughout this proof. Then, Lt(θ) = Lt(θ ) + Lt(θ ) (θ θ ) + Z 1 0 (1 v)(θ θ ) 2Lt θ + v(θ θ ) (θ θ )dv (47) = Lt(θ ) + Lt(θ ) (θ θ ) + θ θ 2 Gt(θ,θ ) λt In (48) = Lt(θ ) + Lt(θ ) (θ θ ) + θ θ 2 Gt(θ,θ ) λt θ θ 2 2. Where (47) is by the second-order Taylor expansion of Lt(θ) around θ ; (48) comes from R 1 0 (1 v) 2Lt θ + v(θ θ ) dv = R 1 0 (1 v) Pt τ=1 g f(Zdτ ) (θ + v(θ θ )) f(Zdτ )f(Zdτ ) dv = Gt(θ, θ ) λt In. As for the relationship between Gt(θ, θ ) (19) and Ht(θ ) (18), we have the following result: Gt(θ, θ ) = v=0 (1 v) g f(Zdτ ) θ + v f(Zdτ ) (θ θ) dv f(Zdτ ) f(Zdτ ) + λt In g f(Zdτ ) θ 2 + |f(Zdτ ) θ f(Zdτ ) θ | f(Zdτ ) f(Zdτ ) + λt In (49) τ=1 g f(Zdτ ) θ f(Zdτ ) f(Zdτ ) + λt In (50) 1 2 + 2S Ht(θ ). Where (49) follows by Lemma 6; (50) follows by Assumptions 1 and 2. Thus, we have that, θ θ 2 Ht(θ ) (2 + 2S) θ θ 2 Gt(θ,θ ) (51) = (2 + 2S) Lt(θ) Lt(θ ) + Lt(θ ) (θ θ) + λt θ θ 2 2 (2 + 2S) Lt(θ) Lt(ˆθt) + Lt(θ ) (θ θ) + λt θ θ 2 2 (52) (2 + 2S)βt(α)2 + (2 + 2S) Lt(θ ) (θ θ) + 1. (53) Where (51) is by semi-definite order monotonicity; (53) follows by λt = 1 4S2(2+2S); and (52) comes from ˆθt = arg min||θ||2 S Lt(θ). Since βt(α) = q 10n log( St 2n + e) + 2((e 2) + S) log 1 α, we then go to bound Lt(θ ) (θ θ), which is done via a new concentration-type argument. Let Bn(2S) be a n dimensional ball of radius 2S and an arbitrary v Bn(2S). Firstly, g(f(Zdτ ) θ ) yτ f(Zdτ ) v = τ=1 ζτf(Zdτ ) v. 
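The conditional second-moment computation that Freedman's inequality is then applied to can be sanity-checked in isolation. A small sketch (toy numbers; the logit `z_star` stands in for $f(Z_{d_\tau})^\top \theta^*$ and `v_dot_f` for $f(Z_{d_\tau})^\top v$) confirming that the noise $\zeta_\tau = g(f(Z_{d_\tau})^\top \theta^*) - y_\tau$ has conditional variance $\dot{g}(f(Z_{d_\tau})^\top \theta^*)$:

```python
import math

def g(z):  # logistic function
    return 1.0 / (1.0 + math.exp(-z))

z_star, v_dot_f = 0.9, -0.7  # toy logit and projection onto the test direction v
p = g(z_star)
# y ~ Bern(p) and zeta = g(z_star) - y, so
# E[(zeta * f^T v)^2 | F] = Var(y) (f^T v)^2 = p(1-p)(f^T v)^2 = g'(z_star)(f^T v)^2
cond_var = p * (1.0 - p) * v_dot_f ** 2
g_dot_val = math.exp(-z_star) / (1.0 + math.exp(-z_star)) ** 2 * v_dot_f ** 2
assert abs(cond_var - g_dot_val) < 1e-12
```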
Causal Logistic Bandits with Counterfactual Fairness Constraint As |ζτf(Zdτ ) v| < 2S and E[(ζτf(Zdτ ) v)2|Fτ 1] = g(f(Zdτ ) θ )(f(Zdτ ) v)2 from (34) and (32), by Freedman s inequality (35), for any η [0, 1 2BS ], the following holds: τ=1 ζτf(Zdτ ) v (e 2)η τ=1 g(f(Zdτ ) θ )(f(Zdτ ) v)2 + 1 By (Vershynin, Corollary 4.2.13) and (Lee et al., 2024, Appendix C.4.4) the following holds with probability at least 1 α: Lt(θ ) (θ θ) (e 2)η||θ θ||2 Ht(θ ) + 1 4 4Sη + ηϵt + 1 ϵtt. By (53), we finally have: ||θ θ ||2 Ht(θ ) 2 + 2S 1 (2 + 2S)(e 2)η h βt(α)2 + 1 4 4Sη + ηϵt + 1 ϵtt i + 1 2(2 + 2S) h βt(α)2 + 2(2 + 2S)(e 2) log 1 α + log 5St 4 4Sη + ηϵt + 1 n i + 2. Where (54) follows by η = 1 2(e 2)(2+2S) < 1 2S , ϵt = n t . Which finishes the proof. We start by examining term (a) and show the following upper bounds: h g(f(Zdt) θ ) f(Zdt) ( θt θ ) i (55) t=1 g(f(Zdt) θ ) ||f(Zdt)||H 1 t (θ ) θt θ Ht(θ ) (56) t=1 g(f(Zdt) θ ) ||f(Zdt)||H 1 t (θ ) γt(α) (57) t=1 g(f(Zdt) θ ) ||f(Zdt)||H 1 t (θ ) (58) t=1 g(f(Zdt) θ ) t=1 g(f(Zdt) θ ) ||f(Zdt)||2 H 1 t (θ ) (59) t=1 g(f(Zdt) θ ) t=1 ||ut||2 e V 1 t (60) t=1 g(f(Zdt) θ ) n log λT + T Where (56) and (59) is by the Cauchy-Schwarz inequality ( g f(Zdt) θ is non-negative); (57) comes from Lemma 3 and (58) is because γT (α) = maxt [T ] γt(α); in (60), we define vector ut = p g(f(Zdt) θ ) f(Zdt) and matrix e Vt = Pt 1 τ=1 ut u t + λt In, and obtain: g(f(Zdt) θ ) ||f(Zdt)||2 H 1 t (θ ) = || p g(f(Zdt) θ ) f(Zdt)||2 H 1 t (θ ) = ||ut||2 e V 1 t ; and (61) follows by Lemma 8. 
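Step (61) invokes the elliptical potential lemma (Lemma 8). A simplified numerical check of the determinant-growth mechanism behind that bound, using a fixed regularizer $\lambda$ and the standard truncated form $\sum_t \min(1, \|u_t\|^2_{V_t^{-1}}) \le 2\log\left(\det V_{T+1} / \det V_1\right)$ (a variant of the paper's statement, not the exact constant used there):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, lam, B = 3, 500, 1.0, 2.0
# random feature vectors with norm at most B
U = rng.normal(size=(T, n))
U = B * U / np.linalg.norm(U, axis=1, keepdims=True) * rng.uniform(0, 1, (T, 1))

V = lam * np.eye(n)
potential = 0.0
for u in U:
    potential += min(1.0, float(u @ np.linalg.solve(V, u)))  # ||u_t||^2 in the V_t^{-1} norm
    V += np.outer(u, u)

# elliptical potential: truncated quadratic forms are bounded by log-det growth
bound = 2.0 * (np.linalg.slogdet(V)[1] - n * np.log(lam))
assert potential <= bound
```

The assertion holds because $\det V_{t+1} = \det V_t \, (1 + \|u_t\|^2_{V_t^{-1}})$ and $\min(1, x) \le 2\log(1+x)$ for $x \ge 0$.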
Causal Logistic Bandits with Counterfactual Fairness Constraint We then take a look at the first order of the logistic function g(f(Zdt) θ ) and derive a upper bound for it by a first-order Taylor expansion: t=1 g(f(Zdt) θ ) = t=1 g(f(Zdt) θt) + f(Zdt) θt g(u) du (62) t=1 g(f(Zdt) θt) + v=0 g f(Zdt) θt + vf(Zdt) (θ θt) dv i f(Zdt) (θ θt) (63) t=1 g(f(Zdt) θt) + v=0 g f(Zdt) θt + vf(Zdt) (θ θt) dv i f(Zdt) (θ θt) t=1 g(f(Zdt) θt) + v=0 g f(Zdt) θt + vf(Zdt) (θ θt) dv i f(Zdt) ( θt θ ) t=1 g(f(Zdt) θt) + g f(Zdt) θt + vf(Zdt) (θ θt) dv i f(Zdt) ( θt θ ) t=1 g(f(Zdt) θt) + v=0 g f(Zdt) θt + vf(Zdt) (θ θt) dv i f(Zdt) ( θt θ ) (67) t=1 g(f(Zdt) θt) + h g f(Zdt) θt g f(Zdt) θ i (68) t=1 g(f(Zdt) θt) + Rlog (69) T + Rlog. (70) Where (62) comes from the Taylor Expansion; (63) follows by changing variables; (64) is by taking the absolute value; (65) is because the optimistic estimate at step t, hence, f(Zdt) θt f(Zdt) θ ; (67) follows by the self-concordance property of logistic function | g| | g| and g > 0; and (68) is from the fundamental theorem of calculus; and (70) follows by g(f(Zdt) θt) 1. Therefore, by (61) and (70), we intermediately obtain the following upper bound on term (a): term (a) 2γT (α) n log λT + T n log λT + T Rlog , (71) where (71) is because p Rlog for T > 0, Rlog > 0. 
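Two facts drive the display above: the first-order Taylor expansion of $g$ with an integral remainder used at (62), and the self-concordance property $|\ddot{g}| \le \dot{g}$ used at (67). Both can be checked directly (toy logits; midpoint-rule quadrature for the remainder):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def g_dot(z):
    return g(z) * (1.0 - g(z))

def g_ddot(z):
    return g_dot(z) * (1.0 - 2.0 * g(z))

# self-concordance of the logistic function: |g''(z)| <= g'(z) everywhere
for k in range(-50, 51):
    assert abs(g_ddot(k / 5.0)) <= g_dot(k / 5.0)

# Taylor with integral remainder: g(b) = g(a) + g'(a)(b-a) + int_a^b g''(u)(b-u) du
a, b = -0.3, 1.1
N = 100_000
h = (b - a) / N
remainder = sum(g_ddot(a + (k + 0.5) * h) * (b - (a + (k + 0.5) * h))
                for k in range(N)) * h
assert abs(g(b) - g(a) - g_dot(a) * (b - a) - remainder) < 1e-9
```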
In order to upper bound the logistic bandits regret Rlog in (46), we still need to upper bound term (b) that includes the second order of logistic function: h Z f(Zdt) θt f(Zdt) θ g(u)(f(Zdt) θt u) du i (72) v=0 (1 v) g f(Zdt) θ + v f(Zdt) ( θt θ ) dv i f(Zdt) ( θt θ ) 2 (73) f(Zdt) ( θt θ ) 2 (74) Causal Logistic Bandits with Counterfactual Fairness Constraint 1 2 f(Zdt) 2 H 1 t (θ ) θt θ 2 Ht(θ ) (75) 1 2 f(Zdt) 2 H 1 t (θ )γ2 t (α) (76) t=1 f(Zdt) 2 H 1 t (θ ) (77) 2γ2 T (α) κZ t=1 f(Zdt) 2 V t 1(θ ) (78) 2n γ2 T (α) κZ log λT + T Where (73) follows by changing variables; (74) is because g 1; (75) follows by Cauchy-Schwarz inequality; (76) comes from Lemma 3; (77) is by CT (α) = maxt [T ] Ct(α); as for (78), we note that τ=1 g f(Zdτ ) θ f(Zdτ ) f(Zdτ ) + λt In τ=1 f(Zdτ ) f(Zdτ ) + λt In i (80) = 1 κZ Vt(θ ), where (80) follows by Definition 1, thus, H 1 t (θ ) κZV 1 t (θ ); and the last inequality (79) follows by Lemma 8. Then by the upper bounds on term (a) in (71) and term (b) in (79), we then finally upper bound the logistic bandits regret Rlog in (46): Rlog = term (a) + term (b) n log λT + T Rlog + 2n γ2 T (α) κZ log λT + T This is a second-order polynomial equation in p Rlog, by Lemma 5, we have Rlog 2γT (α) n log λT + T n log λT + T + 2n γ2 T (α) κZ log λT + T Using (x + y)2 2x2 + 2y2, we obtain: Rlog 8nγ2 T (α) log λT + T n log λT + T T + 4nγ2 T (α) κZ log λT + T To further simplify the logistic bandits regret Rlog, we write γt(α) as: γt(α) = O S n log e + St 2n + log 5St Therefore, as for the Rlog, we obtain the following bounds: Rlog 8nγ2 T (α) log λT + T n log λT + T T + 4nγ2 T (α) κZ log λT + T = O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 , (81) which finishes the proof. Causal Logistic Bandits with Counterfactual Fairness Constraint F. Regret and Constraint Violations Upper Bounds In this section, we provide proofs for upper bounds of both reward regret and constraint violations. 
Our proofs build on the greedy procedure in Algorithm 1 and standard convex optimization analysis. F.1. Proof of Theorem 1 We first prove the regret upper bound. Under Slater s constraint qualification in Assumption 3, we have the boundedness of the optimal dual solution by standard convex optimization analysis from (Beck, 2017, Theorem 8.42), where, 0 ϕ π , g(f(Zd) θ ) π0, g(f(Zd) θ ) the r.h.s. is because the logistic function is less than 1. Now, we turn to establish a bound over R+(T) + ϕV(T). Firstly, R+(T) + ϕV(T) = h π , g(f(Zd) θ ) g(f(Zdt) θ ) + ϕ(| dt(Xt)| τ) i (82) h π , g(f(Zd) θ ) g(f(Zdt) θ ) + ϕ(| dt(Xt)| τ) ϕt π , | d(Xt)| τ i (83) π , g(f(Zd) θ ) ϕt π , | d(Xt)| τ g(f(Zd) θt) ϕt(|b dt(Xt)| τ) + | {z } term (1) h g(f(Zd) θt) g(f(Zdt) θ ) + ϕ (| dt(Xt)| τ) (|b dt(Xt)| τ) i + | {z } term (2) h ϕ(|b dt(Xt)| τ) ϕt(|b dt(Xt)| τ) i | {z } term (3) π , g(f(Zd) θ ) ϕt π , | d(Xt)| τ g(f(Zd) θt) ϕt(|b dt(Xt)| τ) + | {z } term (1) t=1 g(f(Zd) θt) g(f(Zdt) θ ) + ϕ (| dt(Xt)| τ) (|b dt(Xt)| τ) | {z } term (2) where (83) holds since ϕt 0 and π , | dt(Xt)| τ 0; (84) holds by adding and subtracting PT t=1 g(f(Zd) θt), PT t=1 ϕt(|b dt(Xt)| τ), PT t=1 ϕ(|b dt(Xt)| τ); and (85) follows by upper bounding term (3) from Lemma 4. We are then going to bound term (1): π , g(f(Zdt) θ ) ϕt π , | d(Xt)| τ g(f(Zd) θt) ϕt(|b dt(Xt)| τ) t=1 π , g(f(Zd) θ ) g(f(Zd) θt) + ϕt π , (|b d(Xt)| τ) (| d(Xt)| τ) + π , g(f(Zd) θt) ϕt (|b d(Xt)| τ) g(f(Zdt) θt) + ϕt (|b dt(Xt)| τ) (86) Causal Logistic Bandits with Counterfactual Fairness Constraint t=1 π , g(f(Zd) θ ) g(f(Zd) θt) + t=1 ϕt π , (|b d(Xt)| τ) (| d(Xt)| τ) (87) t=1 ϕt π , (|b d(Xt)| τ) (| d(Xt)| τ) (88) t=1 π , (|b d(Xt)| τ) (| d(Xt)| τ) (89) = ρ O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 . 
(90) Where (86) follows by adding and subtracting terms; (87) comes from the greedy action dt (see (13)) chosen at step t in Algorithm 1; (88) comes from the optimistic estimate θt (see (12)) in Algorithm 1; and (89) is because ϕt ρ; as for the result in (90), we note that if b d(Xt) 0 then d(Xt) 0, and by the counterfactual fairness effect (see (4)), we have, (|b d(Xt)| τ) (| d(Xt)| τ) = g(f(Zd) θt) g(f(Z d) θt) g(f(Zd) θ ) + g(f(Z d) θ ) g(f(Zd) θt) g(f(Zd) θ ). (91) where (91) follows by selecting optimistic reward parameter at every round (see (12)) in Algorithm 1, thus f(Z d) θ f(Z d) θt. On the another hand, if b d(Xt) < 0 then d(Xt) < 0, (|b d(Xt)| τ) (| d(Xt)| τ) = g(f(Z d) θt) g(f(Zd) θt) + g(f(Zd) θ ) g(f(Z d) θ ) g(f(Z d) θt) g(f(Z d) θ ). (92) Again, (92) follows by selecting optimistic reward parameter at every round (see (12)) in Algorithm 1, thus f(Zd) θ f(Zd) θt. Here, we notice that, the factual feature Zdt and the counterfactual feature Z dt reside within the feature space Z, in which the boundness assumption (Assumption 1) and problem dependent constant (Definition 1) are defined by. Therefore, the logistic bandits regret of g(f(Zdt) θt) g(f(Zdt)θ ) has the same asymptotic upper bound up to logarithmic factors as g(f(Z dt) θt) g(f(Z dt)θ ) (see (81)). We further bound term (2). When b d(Xt) 0: g(f(Zdt) θt) g(f(Zdt) θ ) + ϕ (| dt(Xt)| τ) (|b dt(Xt)| τ) t=1 g(f(Zdt) θt) g(f(Zdt) θ ) + ϕ g(f(Zdt) θ ) g(f(Z dt) θ ) ϕ g(f(Zdt) θt) g(f(Z dt) θt) g(f(Zdt) θt) g(f(Zdt) θ ) + ϕ g(f(Z dt) θt) g(f(Z dt) θ ) . (94) Where (93) follows by the counterfactual fairness effect (see (4)); and (94) follows by (12) in Algorithm 1. When b d(Xt) < 0, we have that: g(f(Zdt) θt) g(f(Zdt) θ ) + ϕ (| dt(Xt)| τ) (|b dt(Xt)| τ) t=1 g(f(Zdt) θt) g(f(Zdt) θ ) + ϕ g(f(Z dt) θ ) g(f(Zdt) θ ) ϕ g(f(Z dt) θt) g(f(Zdt) θt) g(f(Zdt) θt) g(f(Zdt) θ ) + ϕ g(f(Zdt) θt) g(f(Zdt) θ ) . 
(96) Causal Logistic Bandits with Counterfactual Fairness Constraint Similarly with above, (95) follows by the counterfactual fairness effect (see (4)); and (96) follows by (12) in Algorithm 1. Since the logistic bandits regret of g(f(Zdt) θt) g(f(Zdt)θ ) has the same asymptotic upper bound up to logarithmic factors as g(f(Z dt) θt) g(f(Z dt)θ ) (see discussions on (90)), thus combining (94) and (96), term (2) (1 + ϕ) O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 . (97) Thus, by (90) and (97), the upper bound on R+(T) + ϕV(T) for any ϕ [0, ρ] is the following: R+(T) + ϕV(T) = (1 + ϕ) O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 + ρ O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 + O(ρ Regret R+(T). By setting ϕ = 0, then we obtain the upper bounds on the total regret guarantee with high probability: R+(T) = (ρ + 1) O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 + O(ρ = O ρ n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 + ρ = e O ρ(n S T + n2S2κZ) + ρ where (98) is because the truncated parameter ρ 2/δ; and (99) is to write the regret upper bound in a logarithmic asymptotic notation. Constraint violations. Next, to obtain a bound on V(T), we employ tools from constrained convex optimization. First, we define probability distribution π t by π , g(f(Zd) θ ) = g(f(Zdt) θ ) and π , | d(Xt)| τ = | dt(Xt)| τ, Thus, we could rewrite R+(T) + ϕV(T) in (82) as: R+(T) + ϕV(T) = h π , g(f(Zd) θ ) π , g(f(Zd) θ ) + ϕEπ t(| d(Xt)| τ) i . (100) Then we apply the following theorem from Theorem 3.60 in (Beck, 2017). Theorem 3. Consider the following convex constrained problem f(π ) = maxπ Π {f(π) : g(π) 0}, where both f and g are convex over the convex set Π in a vector space. Suppose f(π ) is finite and there exists a slater point π0 such that g(π0) δ, and a constant ρ 2ϕ where ϕ is the optimal dual variable, i.e., ϕ = argminϕ 0(maxπ f(π) ϕg(π)). 
Assume that π Π satisfies f(π ) f(π ) + ρ[g(π )]+ ϵ, for some ϵ > 0, then we have [g(π )]+ 2ϵ Since PT t=1 π , g(f(Zd) θ ) is convex over {π t }T t=1, PT t=1 π , g(f(Zd) θ ) and PT t=1 π , | d(Xt)| τ are convex over {π t}T t=1. Then (100) satisfies the conditions in Theorem 3 and we have that: t=1 (| dt(Xt)| τ) = O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 + T + n2S2κZ + Where (101) follows by ϕ [0, ρ] and 1/ρ < 1; and (102) writes the bound up to polylogarithmic factors. Causal Logistic Bandits with Counterfactual Fairness Constraint Lemma 4. Under the dual update of ϕt in Algorithm 1, we have the following for any ϕ [0, ρ]: t=1 (ϕ ϕt)(|b dt(Xt)| τ) 1 2η (ϕ1 ϕ)2 + η 2((|b dt(Xt)| τ))2. Proof. By the dual update of ϕt in Algorithm 1: (ϕt+1 ϕ)2 = (ϕt + 1 η (|b dt(Xt)| τ) ϕ)2 (103) = (ϕt ϕ)2 + (1 η (|b dt(Xt)| τ))2 + 2(ϕt ϕ)(1 η [|b dt(Xt)| τ]) (104) = (ϕt ϕ)2 + 2 η (ϕt ϕ)(|b dt(Xt)| τ) + (1 η (|b dt(Xt)| τ))2. (105) Where (103) follows by (14); both (104) and (105) write the technical terms. Summing over T steps and multiplying both sides by η η 2(ϕt+1 ϕ)2 = η 2(ϕt ϕ)2 + t=1 (ϕt ϕ)(|b dt(Xt)| τ) + 1 2η ((|b dt(Xt)| τ))2. (106) t=1 (ϕ ϕt)(|b dt(Xt)| τ) = η 2(ϕt+1 ϕ)2 + 1 2η ((|b dt(Xt)| τ))2 (107) 2(ϕT +1 ϕ)2 + 1 2η ((|b dt(Xt)| τ))2 (108) 1 2η ((|b dt(Xt)| τ))2 (109) T 2ρ ϕ2 + ρ T 2 ((|b dt(Xt)| τ))2 (110) Where (107) follows by (106); (108) comes from telescopic sum; (109) is because η 2(ϕT +1 ϕ)2 0; (110) follows by ϕ1 = 0 and η = T/ρ initialized in Algorithm 1; and (111) is because |b dt(Xt)| [0, 1], ϕ [0, ρ], and 0 τ 1. F.2. Proof of Proposition 3 Proposition 3 states the relationship between policy π and π ϵ for the regret upper bounds, we provide a proof in the following. Proposition 3. Let policies π and π ϵ be the optimal solution for constrained problem maxπ{ π, g(f(Zd) θ ) : π, | d(Xt)| τ 0} and maxπ{ π, g(f(Zd) θ ) : π, | d(Xt)| τ + ϵ 0}, we have, t=1 π , g(f(Zd) θ ) t=1 π ϵ , g(f(Zd) θ ) ϵT Proof. 
The policies $\pi^*$ and $\pi^*_\epsilon$ are defined as:

$\pi^* = \arg\max_\pi \{ \langle \pi, g(f(Z_d)^\top \theta^*) \rangle : \langle \pi, |\Delta_d(X_t)| - \tau \rangle \le 0 \}$
$\pi^*_\epsilon = \arg\max_\pi \{ \langle \pi, g(f(Z_d)^\top \theta^*) \rangle : \langle \pi, |\Delta_d(X_t)| - \tau \rangle + \epsilon \le 0 \}$.

Consider the policy $\pi_\epsilon = (1 - \frac{\epsilon}{\delta}) \pi^* + \frac{\epsilon}{\delta} \pi_0$, where $\pi_0$ is the policy satisfying Slater's constraint qualification, i.e., $\langle \pi_0, |\Delta_d(X_t)| - \tau \rangle \le -\delta$ for all $t \in [T]$. Note that

$\langle \pi_\epsilon, |\Delta_d(X_t)| - \tau \rangle = (1 - \tfrac{\epsilon}{\delta}) \langle \pi^*, |\Delta_d(X_t)| - \tau \rangle + \tfrac{\epsilon}{\delta} \langle \pi_0, |\Delta_d(X_t)| - \tau \rangle \le 0 - \epsilon = -\epsilon$.

Therefore, $\pi_\epsilon$ is a feasible solution of the tightened problem $\max_\pi \{ \langle \pi, g(f(Z_d)^\top \theta^*) \rangle : \langle \pi, |\Delta_d(X_t)| - \tau \rangle + \epsilon \le 0 \}$. Thus,

$\sum_{t=1}^T \langle \pi^*, g(f(Z_d)^\top \theta^*) \rangle - \sum_{t=1}^T \langle \pi^*_\epsilon, g(f(Z_d)^\top \theta^*) \rangle \le \sum_{t=1}^T \langle \pi^*, g(f(Z_d)^\top \theta^*) \rangle - \sum_{t=1}^T \langle \pi_\epsilon, g(f(Z_d)^\top \theta^*) \rangle$ (112)
$= \sum_{t=1}^T \left[ \langle \pi^*, g(f(Z_d)^\top \theta^*) \rangle - (1 - \tfrac{\epsilon}{\delta}) \langle \pi^*, g(f(Z_d)^\top \theta^*) \rangle - \tfrac{\epsilon}{\delta} \langle \pi_0, g(f(Z_d)^\top \theta^*) \rangle \right]$ (113)
$= \tfrac{\epsilon}{\delta} \sum_{t=1}^T \left[ \langle \pi^*, g(f(Z_d)^\top \theta^*) \rangle - \langle \pi_0, g(f(Z_d)^\top \theta^*) \rangle \right] \le \frac{\epsilon T}{\delta}$, (114)

where (112) follows since $\pi^*_\epsilon$ is the optimal solution while $\pi_\epsilon$ is only a feasible solution of the tightened problem; (113) is from $\pi_\epsilon = (1 - \frac{\epsilon}{\delta}) \pi^* + \frac{\epsilon}{\delta} \pi_0$; and (114) comes from $g(f(Z_d)^\top \theta^*) \in [0, 1]$.

Algorithm 2 ϵ-CCLB Algorithm
1: Input: Horizon $T$, truncation interval $\rho$, step size $\eta = \sqrt{T}/\rho$, initial dual value $\phi_1 = 0$, and user-selected parameter $\epsilon \in [0, \delta)$.
2: for $t = 1, 2, 3, \ldots, T$ do
3: The learner observes the contextual information $X_t = \{a_t, w_t, m_t\}$.
4: Update the estimated reward parameter $\hat{\theta}_t$.
5: Build a confidence set $\mathcal{C}_t(\alpha)$ from (11): $\mathcal{C}_t(\alpha) = \{ \theta \in \Theta : \mathcal{L}_t(\theta) - \mathcal{L}_t(\hat{\theta}_t) \le \beta_t(\alpha)^2 \}$. (115)
6: Greedy procedure. Choose the optimistic reward parameter and select the greedy action:
$\tilde{\theta}_t = \arg\max_{\theta \in \mathcal{C}_t} \max_{d \in D} g(f(Z_d)^\top \theta)$ (116)
$d_t = \arg\max_{d \in D}\ g(f(Z_d)^\top \tilde{\theta}_t) - \phi_t (|\hat{\Delta}_d(X_t)| - \tau + \epsilon)$. (117)
7: Update the dual variable: $\phi_{t+1} = \mathrm{Proj}_{[0,\rho]}\left( \phi_t + \tfrac{1}{\eta} (|\hat{\Delta}_{d_t}(X_t)| - \tau + \epsilon) \right)$. (118)
8: Receive a random reward $y_{t+1}$.
9: end for

F.3. Proof of Theorem 2

In this section, we establish upper bounds on both regret and constraint violation for the revised constraint condition (see Algorithm 2). This is achieved by introducing a slackness variable $\epsilon$, which serves to tighten the constraint. First, we decompose $R^\epsilon_+(T) + \phi V^\epsilon(T)$ as follows, where $V^\epsilon(T) = \sum_{t=1}^T \left(|\Delta_{d_t}(X_t)| - \tau + \epsilon\right)$.
Rϵ +(T) + ϕVϵ(T) = h π ϵ , g(f(Zd) θ ) g(f(Zdt) θ ) + ϕ[| dt(Xt)| τ + ϵ] i (119) Causal Logistic Bandits with Counterfactual Fairness Constraint t=1 π ϵ , g(f(Zd) θ ) g(f(Zdt) θ ) + ϕ[| dt(Xt)| τ + ϵ] ϕt π ϵ , | d(Xt)| τ + ϵ t=1 π ϵ , g(f(Zd) θ ) ϕt π ϵ , | d(Xt)| τ + ϵ g(f(Zd) θt) + ϕt[|b dt(Xt)| τ + ϵ] + | {z } term (1-ϵ) h g(f(Zd) θt) g(f(Zdt) θ ) + ϕ [| dt(Xt)| τ + ϵ] [|b dt(Xt)| τ + ϵ] i + | {z } term (2-ϵ) h ϕ[|b dt(Xt)| τ + ϵ] ϕt[|b dt(Xt)| τ + ϵ] i . | {z } term (3-ϵ) Where (119) holds since ϕt 0 and π , | dt(Xt)| τ + ϵ 0; and (120) follows by adding and subtracting technical terms. Similar to the techniques in Section F.1, we can upper bound term (1-ϵ), term (2-ϵ), term (3-ϵ) according to (90), (97), and Lemma 4 as following: term (1-ϵ) = t=1 π ϵ , g(f(Zd) θ ) ϕt π ϵ , | d(Xt)| τ + ϵ g(f(Zd) θt) + ϕt[|b dt(Xt)| τ + ϵ] = ρ O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 , term (2-ϵ) = h g(f(Zd) θt) g(f(Zdt) θ ) + ϕ [| dt(Xt)| τ + ϵ] [|b dt(Xt)| τ + ϵ] i = (1 + ϕ) O n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 , term (3-ϵ) = h ϕ[|b dt(Xt)| τ + ϵ] ϕt[|b dt(Xt)| τ + ϵ] i = O(ρ(1 + ϵ)2 Thus, we have R+(T)+ϕV(T) = (ρ+ϕ) O n S log(T) T +n2S2(log(T))2 +n2S2κZ(log(T))2 +O(ρ(1+ϵ)2 Regret Rϵ +(T). By setting ϕ = 0, we have: R+(T) = O ρ n S log(T) T + n2S2(log(T))2 + n2S2κZ(log(T))2 T + ρn2S2κZ + ρ T(1 + ϵ)2 . Constraint violations. By applying (Beck, 2017, Theorem 3.60), we have Vϵ(T) = e O n S T + n2S2κZ + (1 + ϵ)2 to obtain a bound V(T), we notice that: V(T) = Vϵ(T) t=1 ϵ = e O n S T + n2S2κZ + (1 + ϵ)2 Which finishes the proof. F.4. Proof of Proposition 4 Proposition 4. By conditions stated in Theorem 2, for the user-selected parameter ϵ = T + C1n log(T) + (C2 + C3κZ)n2((log(T))2p 1/T /2C4 1, where C1, C2, C3, C4 are the universal con- Causal Logistic Bandits with Counterfactual Fairness Constraint stants independent of n, S, T, κZ, if n 2 and ϵ < δ, then one could achieve a zero upper bound on the constraint violations when select ϵ [ϵ , δ). Proof. 
To show the result of zero cumulative constraint violations, we write the bound as follows, where $C_1, C_2, C_3, C_4$ are universal constants independent of $n, S, T, \kappa_Z$, and $\epsilon \in [0, \delta)$:

$V(T) \le C_1 n S \log(T)\sqrt{T} + C_2 n^2 S^2 (\log(T))^2 + C_3 \kappa_Z n^2 S^2 (\log(T))^2 + C_4 (1+\epsilon)^2 \sqrt{T} - \epsilon T$.

We solve for the values of $\epsilon$ at which the right-hand side is at most 0. Expanding $(1+\epsilon)^2$ and collecting terms, this is the quadratic condition $C_4 \sqrt{T}\, \epsilon^2 - (T - 2C_4\sqrt{T})\, \epsilon + \Gamma \le 0$, where $\Gamma = C_1 n S \log(T)\sqrt{T} + C_2 n^2 S^2 (\log(T))^2 + C_3 \kappa_Z n^2 S^2 (\log(T))^2 + C_4 \sqrt{T}$. The admissible values of $\epsilon$ therefore lie between the two roots of this quadratic, which are symmetric about

$\frac{T - 2C_4\sqrt{T}}{2 C_4 \sqrt{T}} = \frac{\sqrt{T}}{2C_4} - 1 \ge 0$, (121)

where (121) holds because $C_4 = \rho$ while $\sqrt{T} \ge 2\rho$ (typically we choose $\rho < 100$ for the numerical experiments, so this holds for moderately large $T$). Now we look at the lower root. The quadratic formula gives

$\epsilon^* = \left(\frac{\sqrt{T}}{2C_4} - 1\right) - \sqrt{\left(\frac{\sqrt{T}}{2C_4} - 1\right)^2 - \frac{\Gamma}{C_4\sqrt{T}}}$,

which, for large $T$, is of order $\Gamma/T = \left(C_4 + C_1 n S \log(T) + (C_2 + C_3\kappa_Z) n^2 S^2 (\log(T))^2 \sqrt{1/T}\right)\sqrt{1/T}$. If the lower root $\epsilon^*$ is less than the Slater constant $\delta$, then when the learner chooses $\epsilon \in [\epsilon^*, \delta)$, zero cumulative constraint violations are achieved.

G. Technical Lemmas

Lemma 5 ((Abeille et al., 2021), Lemma 7). Let $b, c \in \mathbb{R}_+$ and $u \in \mathbb{R}$. The following implication holds: $u^2 \le bu + c \implies u \le b + \sqrt{c}$.

Lemma 6 ((Abeille et al., 2021), Lemma 8). Let $g$ be a strictly increasing function such that $|\ddot{g}| \le \dot{g}$, and let $\mathcal{Z}$ be any bounded interval of $\mathbb{R}$. Then, for all $z_1, z_2 \in \mathcal{Z}$:

$\int_{v=0}^{1} (1 - v)\, \dot{g}(z_1 + v(z_2 - z_1))\, dv \ge \frac{\dot{g}(z_1)}{2 + |z_1 - z_2|}$.

Lemma 7 ((Abeille et al., 2021), Lemma 11). Let $\{u_\tau\}_{\tau=1}^\infty$ be a sequence in $\mathbb{R}^n$ such that $\|u_\tau\| \le B$ for all $\tau \in \mathbb{N}$, and let $\lambda$ be a non-negative scalar. For $t \ge 1$ define $V_t = \sum_{\tau=1}^{t-1} u_\tau u_\tau^\top + \lambda I_n$. The following inequality holds: $\det(V_t) \le \left(\mathrm{tr}(V_t)/n\right)^n \le \left(\lambda + (t-1)B^2/n\right)^n$.

Lemma 8 ((Abeille et al., 2021), Lemma 12). Let $\{u_\tau\}_{\tau=1}^\infty$ be a sequence in $\mathbb{R}^n$ such that $\|u_\tau\| \le B$ for all $\tau \in \mathbb{N}$. Further let $\{\lambda_\tau\}_{\tau=1}^\infty$ be a non-decreasing sequence in $\mathbb{R}_+$ s.t. $\lambda_1 = 1$. For $t \ge 1$ define $V_t = \sum_{\tau=1}^{t-1} u_\tau u_\tau^\top + \lambda_t I_n$. Then: $\sum_{t=1}^T \|u_t\|^2_{V_t^{-1}} \le 2n(1 + B^2) \log\left(\lambda_T + \frac{TB^2}{n}\right)$.

H.
Additional Experiments

In this section, we provide additional evaluations of the ϵ-CCLB and CCLB algorithms when selecting different values of the tightness parameter ϵ (Figure 4) and of the counterfactual fairness threshold τ (Figure 5), respectively.

Figure 4: Plots for the ϵ-CCLB (τ = 0.16) algorithm when selecting different ϵ: (a) cumulative regret; (b) cumulative constraint violations; (c) penalized cumulative regret. As the user-chosen tightness parameter ϵ increases, the cumulative regret increases, but the cumulative constraint violations decrease (the learner becomes more conservative). When ϵ equals τ (both 0.16), we observe high cumulative constraint violations (Figure 4(b)): since $\langle \pi, |\Delta_d(X_t)| - \tau \rangle + \epsilon > 0$ for every policy $\pi$, no feasible decisions exist (notice that we require $\epsilon < \delta \le \tau$).

Figure 5: Plots for the CCLB algorithm when selecting different counterfactual fairness thresholds τ: (a) cumulative regret; (b) cumulative constraint violations; (c) penalized cumulative regret. As τ increases, the feasible region grows (more of the actions are feasible), which means the fixed comparator can be better while it is simultaneously easier to avoid violating constraints (since more actions are feasible in the first place when |D| is fixed), thus reducing the cumulative constraint violations (see Figure 5(b): the cumulative constraint violations are nearly 0 when τ = 0.86). On the other hand, when τ is small, $|\Delta_{d_t}(X_t)| \ge \tau$ in almost every round. The dual variable $\phi_t$ then increases, which makes the learner penalize the counterfactual fairness constraint more heavily (and thus act more conservatively), which also decreases the cumulative constraint violations (see Figure 5(b): the cumulative constraint violations are relatively small when τ = 0.06).
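The dual dynamics described above, with $\phi_t$ rising while the constraint is violated and shrinking otherwise, come from the projected ascent update (14). A minimal simulation (synthetic bounded signals standing in for $|\hat{\Delta}_{d_t}(X_t)| - \tau$; step-size convention $1/\eta$ with $\eta = \sqrt{T}/\rho$, as in Algorithm 1) also checks the online-gradient bound behind Lemma 4:

```python
import random

random.seed(1)
rho, T = 5.0, 2000
eta = T ** 0.5 / rho  # Algorithm 1 uses step size 1/eta with eta = sqrt(T)/rho
g = [random.uniform(-1, 1) for _ in range(T)]  # synthetic constraint signals

phi = [0.0]  # dual variable, phi_1 = 0
for t in range(T):
    # projected dual ascent onto [0, rho]
    phi.append(min(rho, max(0.0, phi[t] + g[t] / eta)))

# bound behind Lemma 4: for any comparator phi_star in [0, rho],
# sum_t (phi_star - phi_t) g_t <= (eta/2)(phi_1 - phi_star)^2 + (1/(2 eta)) sum_t g_t^2
gaps = []
for phi_star in [0.0, 1.7, rho]:
    lhs = sum((phi_star - phi[t]) * g[t] for t in range(T))
    rhs = eta / 2 * (phi[0] - phi_star) ** 2 + sum(x * x for x in g) / (2 * eta)
    gaps.append(rhs - lhs)
assert min(gaps) >= -1e-9
```

The bound holds deterministically for any signal sequence, by the telescoping argument in the proof of Lemma 4 together with the non-expansiveness of the projection onto $[0, \rho]$.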