# Linear Contextual Bandits With Interference

Yang Xu¹, Wenbin Lu¹, Rui Song¹

¹ Department of Statistics, North Carolina State University, USA. Correspondence to: Rui Song.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

**Abstract.** Interference, a key concept in causal inference, extends the reward modeling process by accounting for the impact of one unit's actions on the rewards of others. In contextual bandit (CB) settings where multiple units are present in the same round, interference can significantly affect the estimation of expected rewards for different arms, thereby influencing the decision-making process. Although some prior work has explored multi-agent and adversarial bandits in interference-aware settings, how to model interference in CB remains significantly underexplored. In this paper, we introduce a systematic framework to address interference in linear CB (LinCB), bridging the gap between causal inference and online decision-making. We propose a series of algorithms that explicitly quantify the interference effect in the reward modeling process and provide comprehensive theoretical guarantees, including sublinear regret bounds, finite-sample upper bounds, and asymptotic properties. The effectiveness of our approach is demonstrated through simulations and synthetic data generated from the MovieLens dataset.

## 1. Introduction

Decision making is ubiquitous in real-world settings such as advertising (Zhou et al., 2023), clinical trials (Tsiatis et al., 2008), and transportation (Wang et al., 2024). The core of a decision-making problem lies in selecting the optimal action from a set of possible options to maximize an outcome, or reward, that is valuable to the decision-maker. To simplify decision-making involving multiple individuals (or agents/units), it is often assumed that an individual's outcome depends only on the action applied to them and is unaffected by actions taken on others. This assumption, part of the Stable Unit Treatment Value Assumption (SUTVA), is common in the causal inference literature. However, in many real-world situations, this assumption is frequently violated due to interference, where an individual's outcome is influenced by the actions of others (Rosenbaum, 2007).

During the COVID-19 pandemic, local governments sought optimal personalized quarantine policies over extended periods. Here, each local government acts as an agent capable of making decisions over time. The observed reward, such as the number of infected individuals, reflects the community's health status following the application of a specific policy. Over time, governments adjust their policies based on the latest health status to mitigate virus spread. In this case, one community's health outcomes can be influenced by the quarantine policies of neighboring communities due to population movement and the nature of virus spread, illustrating the presence of interference.

Another example is in advertising markets, where advertisers manage multiple ad lines targeting overlapping audiences through different channels. The overlap in target consumers and the similarity between ad lines can cause the return on investment (ROI) of one ad to be influenced by the impression status of another. For instance, if one ad targets students interested in Nike shoes and another targets those interested in Adidas shoes, the ROI of the Nike ad might be affected by whether customers have already seen the Adidas ad, due to the shared audience and similar products.

In the classical causal inference literature, the issue of interference has been extensively explored in both experimental design (Aronow & Samii, 2017; Viviano, 2020; Leung, 2022a;b; Viviano, 2024) and observational studies (Su et al., 2019; Bargagli-Stoffi et al., 2020; Forastiere et al., 2021; Qu et al., 2021).
However, interference within online sequential decision-making frameworks, such as bandits, remains relatively underexplored. When multiple agents, each capable of sequential decision-making, are involved, the problem is often known as multi-agent multi-armed bandits (MAMAB). Existing work in MAMAB, however, faces limitations in incorporating agent-specific information that can enhance personalized decision-making. For example, in the COVID-19 spread scenario, communities (treated as agents) are not static; there may be population flows, new communities joining, and existing communities dissolving, each with unique characteristics. Similarly, in the advertising market, ad lines (also treated as agents) evolve quickly across campaigns, each with distinct characteristics such as size, content, and target audiences. Therefore, it is essential to consider a bandit framework under interference that accounts for agents or individuals evolving over time. This problem, which incorporates agent-specific information, is typically addressed within the framework of contextual bandits (CB).

Addressing interference in the CB framework presents two primary challenges. First, the structure-wise challenge: since the actions of multiple units influence each other's reward modeling, the resulting action space becomes high-dimensional without a precise understanding of the interference structure (Agarwal et al., 2024). Estimating heterogeneous treatment effects under interference in single-stage settings is already a complex task (Viviano, 2020; Leung, 2022b), and extending this to sequential decision-making scenarios such as bandits (where the balance between exploration and exploitation is crucial) further compounds the challenge.
Second, the theory-wise challenge: classical statistical inference tools, such as the central limit theorem (CLT), become inapplicable because each action and corresponding reward may depend on all previously collected data. This presents a dual difficulty: dependencies arise both across rounds due to online updates and within rounds due to interference between units. These dependencies violate the standard assumption of sample independence, complicating the application of traditional inference techniques.

While some recent works have addressed interference in MAMAB (Verstraeten et al., 2020; Bargiacchi et al., 2018; Dubey et al., 2020; Agarwal et al., 2024) and adversarial bandits (Jia et al., 2024), to the best of our knowledge, no existing work addresses interference within the contextual bandits framework. To bridge this gap, we introduce a comprehensive framework for linear contextual bandits in interference-aware scenarios, providing theoretical guarantees via both statistical inference and regret analysis. Our contributions are as follows. First, we are the first to address the interference issue in contextual bandits with multiple units involved in each round, bridging the gap between SUTVA violations in causal inference and online decision making. Second, we propose a systematic framework that extends classical LinCB to interference-aware scenarios, offering comprehensive theoretical guarantees, including finite-sample upper bounds and sublinear regret. Remarkably, the regret bound achieves the minimax optimal rate in terms of the number of observed samples in the classical contextual bandits literature, even in the presence of interference. Third, we are also the first to establish the asymptotic properties of the regression coefficients and the optimal value function in an online setting with interference.
We introduce a probability of exploration and a small clipping rate to ensure both estimation consistency and asymptotic properties. This clipping rate also balances the trade-off between statistical efficiency and regret minimization. The performance of our estimator is validated through simulation studies and synthetic data based on MovieLens.

The structure of this paper is as follows. Section 2 reviews recent work on interference and its integration into bandit frameworks. Section 3 introduces our framework for reward modeling and action selection, covering both offline optimization and online learning. In Section 4, we establish the theoretical foundations of the proposed estimators and the optimal value function, along with a regret analysis. Sections 5 and 6 demonstrate the effectiveness of our proposed algorithm through a simulation study and a synthetic dataset based on MovieLens. Finally, we provide a brief summary and discuss potential extensions in Section 7.

## 2. Related Work

**Interference in a Single Stage.** In the single-stage setting, the existing literature varies significantly in defining interference, often assuming different structures to simplify reward modeling. One body of work focuses on partial interference and exposure mappings to quantify interference (Sobel, 2006; Qu et al., 2021; Hudgens & Halloran, 2008; Forastiere et al., 2021; Aronow & Samii, 2017; Bargagli-Stoffi et al., 2020). While this approach typically requires fewer assumptions during the reward modeling stage, it often relies on additional requirements, such as knowing the form of the exposure mapping function (Manski, 2013; Aronow & Samii, 2017; Bargagli-Stoffi et al., 2020) or assuming i.i.d. clusters (Qu et al., 2021). These assumptions can be overly restrictive in later stages and may not clearly explain direct and spillover effects. In contrast, another body of work, such as Su et al.
(2019), considers the reward as a linear function of neighbors' covariates and treatments, which makes more interpretable assumptions in interference modeling and allows for a comprehensive theoretical study. This formulation is similar to Cliff & Ord (1981) and Getis (2009), where an autoregressive model captures spatial interference under similar parametric modeling assumptions. In our work, we consider linear CB as the first step toward handling interference while maintaining interpretability, and we study both theory and algorithms under this framework.

**Cooperative Multi-Agent Bandits.** Multi-agent bandits typically assume a fixed set of N agents making decisions over time. Some existing works, such as Martínez-Rubio et al. (2019) and Landgren et al. (2016), focus on information sharing between agents in a distributed system. While sharing historical data can enhance the reward learning process among agents, these studies still assume that each agent's reward depends solely on its own actions, excluding the possibility of interference. Another line of research, including Verstraeten et al. (2020), Bargiacchi et al. (2018), Dubey et al. (2020), Agarwal et al. (2024), and Jia et al. (2024), considers more general reward models where the actions of other agents can affect an individual agent's reward. Specifically, Bargiacchi et al. (2018) and Verstraeten et al. (2020) extended the upper confidence bound (UCB) algorithm and Thompson sampling (TS) from the classical MAB setting to multi-agent scenarios. Dubey et al. (2020) introduced a kernelized UCB algorithm, where interference is mediated through network contexts. Although Jia et al. (2024) also considered interference in bandits, their focus was on adversarial bandits with homogeneous actions, which is less flexible than our approach, where heterogeneous actions are allowed for units within the same round.
However, all of the aforementioned literature considers only an MAB setting with a fixed and finite number of agents, which differs from our setting, where agents or units can vary over time with evolving contextual information.

## 3. Problem Formulation

In each round $t \in \{1, \dots, T\}$, we assume there are $N_t$ units with contextual information $X_{ti} \in \mathbb{R}^d$ in a network making sequential decisions simultaneously. At each time step $t$, unit $i \in \{1, \dots, N_t\}$ chooses an action $A_{ti} \in \mathcal{A}$ and collects a reward $R_{ti}$. Define $\bar{N}_t = \sum_{s=1}^{t} N_s$ as the total number of units up to round $t$. Due to interference, the potential outcome of unit $i$ at round $t$ is defined as $R_{ti}(a_t)$, where $a_t = (a_{t1}, \dots, a_{tN_t})^\top$ is the action assignment vector for all units at round $t$. To quantify the interference level between any two units in the same round, we suppose there exists a (normalized) adjacency matrix $W_t = \{W_{t,ij}\}_{1 \le i,j \le N_t} \in \mathbb{R}^{N_t \times N_t}$ at round $t$, such that $W_{t,ij}$ denotes the causal interference strength from unit $i$ to unit $j$. Note that in our setup, $W_t$ is not required to be symmetric; that is, the causal interference strength from unit $i$ to unit $j$ can differ from that from $j$ to $i$. By default, we assume $W_{t,ij} \in [-1, 1]$ and $W_{t,ii} = 1$ for any $t$ and $1 \le i, j \le N_t$.

Defining an interference matrix $W_t$ is both intuitive and flexible enough to model various real-world scenarios. For instance, in the special case where $W_t$ is symmetric and takes values in $\{0, 1\}$, it can represent a neighborhood structure or friend network, where $W_{t,ij} = 1$ indicates that units $i$ and $j$ are connected, and $W_{t,ij} = 0$ means they are not. Since this information is derived from societal interactions (for example, during COVID-19, it was often known which communities were connected by their geographical locations), we first consider the case where $W_t$ is known.¹
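To make the known-$W_t$ setting concrete, here is a minimal NumPy sketch that builds a toy friend-network interference matrix obeying the conventions above ($W_{t,ii} = 1$, entries in $[-1, 1]$). The helper name and the common spillover `strength` are our own illustrative assumptions, not part of the paper's model.

```python
import numpy as np

def build_interference_matrix(n_units, edges, strength=0.5):
    """Toy symmetric 0/1 friend network turned into a normalized
    interference matrix: W[i, i] = 1, W[i, j] = strength for connected
    pairs, 0 otherwise. `edges` is a list of (i, j) pairs; `strength`
    in [-1, 1] is a hypothetical common spillover level."""
    W = np.eye(n_units)
    for i, j in edges:
        W[i, j] = W[j, i] = strength
    return W

W_t = build_interference_matrix(4, edges=[(0, 1), (1, 2)])
# Column sums aggregate, for each unit, the total interference weight
# it carries into the round's cumulative reward.
col_sums = W_t.sum(axis=0)
```

In the symmetric 0/1 special case discussed above, the column sum for unit $i$ simply counts its connections (plus its own self-weight).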
We assume the reward of unit $i$ at round $t$ follows

$$R_{ti} = \sum_{j=1}^{N_t} W_{t,ij}\, f(X_{tj}, A_{tj}) + \eta_{ti}, \qquad (1)$$

where the reward $R_{ti}$ is a linear combination of some payoff function $f$, and $\eta_{ti} \sim N(0, \sigma^2)$ is a noise term satisfying $\eta_{ti} \perp (X_t, W_t) \mid A_t$.² Throughout this paper, we assume that $|E[R_{ti}]| \le U$; that is, the expected reward of each unit $i$ at round $t$ is uniformly bounded by a large constant $U$.

Assuming the reward generation process follows Equation (1) is both intuitive and beneficial. With some simple algebra, we can show that

$$\sum_{i=1}^{N_t} E[R_{ti}] = \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} W_{t,ij}\, f(X_{tj}, A_{tj}) = \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} W_{t,ji}\, f(X_{ti}, A_{ti}) = \sum_{i=1}^{N_t} \omega_{ti}\, f(X_{ti}, A_{ti}), \qquad (2)$$

where we define $\omega_{ti} := \sum_{j=1}^{N_t} W_{t,ji}$ as the interference weight of unit $i$ at round $t$. The last term indicates that the optimal action depends solely on the covariates of each individual unit, with interference influencing the direction of optimality through the sign of the weight $\omega_{ti}$. In the special case where $W_t \in \{0, 1\}^{N_t \times N_t}$, $\omega_{ti}$ can be interpreted as the number of friends or connections of unit $i$ at round $t$. This measure is straightforward to obtain; for example, in a COVID-19 context, it could represent the number of geographically adjacent communities. Since the optimal action that maximizes $\sum_{i=1}^{N_t} E[R_{ti}]$ is determined by $X_{ti}$ and $\omega_{ti}$, we do not need to account for the covariate information and actions of all units to achieve the globally optimal action. This simplifies the decision-making process and makes it more practical for real-world applications.

### 3.1. Offline Optimization

Let us consider a linear payoff function $f$ to establish the entire framework and the theory behind it. Suppose there exists a coefficient vector $\beta_a \in \mathbb{R}^d$ for each action $a \in \mathcal{A}$ to quantify the effect of each covariate in $X_{ti}$. Similar to classical linear CB, we assume

$$f(X_{tj}, a_{tj}) = \sum_{a \in \mathcal{A}} X_{tj}^\top \beta_a\, \mathbb{1}\{a_{tj} = a\}. \qquad (3)$$

By incorporating Equation (2), the optimal action maximizing the cumulative reward for each round-unit pair $(t, i)$

¹ Extensions to unknown $W_t$ will be discussed in Section 7.
² Here we assume the noise term $\eta_{ti}$ is conditionally independent of the contextual information and the interference matrix $W_t$, given the actions taken at round $t$. This is a more relaxed assumption than i.i.d. random noise.

(without considering exploration) is given by

$$A_{ti}^{*} = \arg\max_{a \in \mathcal{A}}\ \omega_{ti}\, X_{ti}^\top \beta_a, \qquad (4)$$

where the sign of $\omega_{ti}$ determines whether the interference weight flips the action that maximizes the cumulative reward.

To simplify notation, we introduce the following vector representations. Denote $\beta = (\beta_1^\top, \dots, \beta_K^\top)^\top \in \mathbb{R}^{dK}$, $A_t = (A_{t1}, \dots, A_{tN_t})^\top \in \mathbb{R}^{N_t}$, $R_t = (R_{t1}, \dots, R_{tN_t})^\top$ as the vector of rewards collected at round $t$, $X_t = (X_{t1}, \dots, X_{tN_t})^\top \in \mathbb{R}^{N_t \times d}$ as the covariate information matrix for all units at round $t$, and $W_{ti} = \mathrm{diag}(\{W_{t,ij}\}_{1 \le j \le N_t})$ as an $N_t \times N_t$ diagonal matrix. Furthermore, define $I_{A_t = a} := \mathrm{diag}(\mathbb{1}\{A_t = a\})$ as an $N_t \times N_t$ diagonal matrix whose $i$th diagonal element is the indicator $\mathbb{1}\{A_{ti} = a\}$. We then define a $dK$-dimensional transformed covariate vector for each $(t, i)$ as

$$\widetilde{X}_{ti} = \left(\mathbf{1}_{N_t}^\top W_{ti} I_{A_t = 1} X_t, \ \dots, \ \mathbf{1}_{N_t}^\top W_{ti} I_{A_t = K} X_t\right)^\top. \qquad (5)$$

With some straightforward algebra, the expected reward can be expressed linearly as $E[R_{ti}] = \widetilde{X}_{ti}^\top \beta$. Denote $\widetilde{X}_t = (\widetilde{X}_{t1}, \dots, \widetilde{X}_{tN_t})^\top \in \mathbb{R}^{N_t \times dK}$ as the transformed covariate information matrix at round $t$. Furthermore, define $\widetilde{X}_{1:t} = (\widetilde{X}_1^\top, \dots, \widetilde{X}_t^\top)^\top \in \mathbb{R}^{\bar{N}_t \times dK}$ and $R_{1:t} = (R_1^\top, \dots, R_t^\top)^\top \in \mathbb{R}^{\bar{N}_t}$. Without exploration, the offline ordinary least squares (OLS) estimator is given by

$$\hat{\beta}_t = \left(\widetilde{X}_{1:t}^\top \widetilde{X}_{1:t}\right)^{-1} \widetilde{X}_{1:t}^\top R_{1:t} \in \mathbb{R}^{dK}. \qquad (6)$$

In an offline optimization setting, one can replace the true value of $\beta$ in Equation (4) with $\hat{\beta}_t$ to obtain an estimate of the optimal individualized treatment rule.

### 3.2. Online Algorithms

In the context of online bandits with interference, we extend three algorithms from classical contextual bandits to account for the presence of interference: Linear Epsilon Greedy With Interference (LinEGWI), Linear Upper Confidence Bound With Interference (LinUCBWI), and Linear Thompson Sampling With Interference (LinTSWI). These algorithms, summarized in Algorithm 1, differ primarily in Line 13 through their respective exploration strategies.

**Algorithm 1** (Linear Contextual Bandits with Interference). Input: number of units $N_t$; burn-in period $T_0$; interference structure $\{W_t\}_{1 \le t \le T}$; clipping rate $p_t > O(\bar{N}_t^{-1/2})$.
1: **for** time $t = 1, \dots, T_0$ **do**
2: choose $a_{ti} \sim \mathrm{DU}(1, K)$, $1 \le i \le N_t$;
3: **end for**
4: $A \leftarrow \widetilde{X}_{1:T_0}^\top \widetilde{X}_{1:T_0}$, $b \leftarrow \widetilde{X}_{1:T_0}^\top R_{1:T_0}$;
5: **for** time $t = T_0 + 1, \dots, T$ **do**
6: observe $N_t$ units with features $\{X_{ti}\}_{1 \le i \le N_t}$;
7: update $\hat{\beta}_{t-1} \leftarrow A^{-1} b$;
8: **for** unit $i = 1, 2, \dots, N_t$ **do**
9: estimate the optimal arm $\hat{\pi}_{ti} = \arg\max_{a \in \mathcal{A}} \omega_{ti} X_{ti}^\top \hat{\beta}_{t-1,a}$;
10: **if** $\lambda_{\min}\big(\bar{N}_{t-1}^{-1} \sum_{s=1}^{t-1} \sum_{i=1}^{N_s} \widetilde{X}_{si} \widetilde{X}_{si}^\top\big) < p_{t-1}\, \lambda_{\min}\big(\bar{N}_{t-1}^{-1} \sum_{s=1}^{t-1} \sum_{i=1}^{N_s} X_{si} X_{si}^\top\big)$ **then**
11: choose $a_{ti} \sim \mathrm{DU}(1, K)$;
12: **else**
13: choose arm $a_{ti}$ by Equation (7), (9), or (10);
14: **end if**
15: receive reward $R_{ti}$;
16: **end for**
17: update $\widetilde{X}_{ti}$ by Equation (5), $i \in \{1, \dots, N_t\}$;
18: update $A \leftarrow A + \widetilde{X}_t^\top \widetilde{X}_t$, $b \leftarrow b + \widetilde{X}_t^\top R_t$;
19: **end for**

**LinEGWI:** For EG-based exploration, we select action

$$a_{ti} = (1 - Z_{ti}) \arg\max_{a}\ \omega_{ti}\, X_{ti}^\top \hat{\beta}_{t-1,a} + Z_{ti}\, \mathrm{DU}(1, K), \qquad (7)$$

where $Z_{ti} \sim \mathrm{Ber}(\epsilon_{ti})$, and $\mathrm{DU}(1, K)$ denotes the discrete uniform distribution with $P(A = a) = 1/K$ for any $a \in [K]$.

**LinUCBWI:** Define $\widetilde{\Sigma}_t := \widetilde{X}_{1:t}^\top \widetilde{X}_{1:t}$. The UCB under $A_{ti} = a \in \{1, \dots, K\}$ can be derived as

$$\mathrm{UCB}_{ti,a} = \omega_{ti}\, X_{ti}^\top \hat{\beta}_{t-1,a} + \alpha\, |\omega_{ti}| \sqrt{X_{ti}^\top (\widetilde{\Sigma}_{t-1})^{-1}_a X_{ti}}, \qquad (8)$$

where $\alpha$ is a hyperparameter that controls the exploration-exploitation tradeoff, and $(\widetilde{\Sigma}_{t-1})^{-1}_a$ denotes the $d \times d$ block-diagonal submatrix of $(\widetilde{\Sigma}_{t-1})^{-1}$ corresponding to the variance of $\hat{\beta}_{t-1,a}$. LinUCBWI thus selects the arm

$$a_{ti} = \arg\max_{a}\ \mathrm{UCB}_{ti,a}. \qquad (9)$$

**LinTSWI:** Define $v$ as a hyperparameter denoting the level of exploration.
For unit $i$ at round $t$, LinTSWI samples $\tilde{\beta}_{ti} \sim N(\beta_{t,\mathrm{post}}, v^2 \Sigma_{t,\mathrm{post}}^{-1})$ and then chooses arm $a_{ti}$ such that

$$a_{ti} = \arg\max_{a}\ \omega_{ti}\, X_{ti}^\top \tilde{\beta}_{ti,a}. \qquad (10)$$

Due to space limits, we defer the expressions for the posteriors, as well as other derivation details, to Appendix C.

There are two main differences between Algorithm 1 and classical LinCB algorithms. First, due to the presence of interference, $\beta$ is estimated using the transformed covariate information $\widetilde{X}_{ti}$. This transformation depends not only on the covariates but also on the interference matrix and the actions of all units in round $t$. Second, we incorporate an additional clipping step in Line 10 to ensure that the probability of exploration does not decay faster than $O(\bar{N}_t^{-1/2})$ (see Assumption A.2 in Appendix A). This clipping step is crucial for maintaining sufficient exploration, which is necessary for estimation consistency and valid inference on $\beta$, as detailed in the theory section. Note that when $W_t \equiv I$ for all $t$, our method reduces to the classical linear CB algorithms, aside from the added clipping step that ensures valid statistical inference.

**Remark.** The computational complexity of LinCBWI is approximately $O(d^2 K^2 \bar{N}_T)$. Specifically, for each unit $i$ in round $t$, the primary computational costs arise from matrix inversion (Line 7 of Algorithm 1) and the smallest-eigenvalue computation (Line 10 of Algorithm 1), both requiring $O(d^3 K^3)$ in standard implementations. Optimization techniques such as iterative eigenvalue decomposition and the Sherman-Morrison update can reduce this to $O(d^2 K^2)$. Multiplying by the total number of units $\bar{N}_T$ gives the overall complexity of Algorithm 1. In classical LinCB, the complexity is $O(d^2 K \bar{N}_T)$. LinCBWI introduces an additional factor of $K$ due to interference considerations, requiring a joint update of $\beta_a$ for all $a \in \mathcal{A}$.
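To make the transformed covariates of Equation (5) and the sign-aware arm choice concrete, here is a minimal NumPy sketch of one round of an Algorithm 1-style update. Function and variable names are our own, and only the EG-style rule (Equation (7)) is shown; the clipping check and the UCB/TS variants are omitted for brevity.

```python
import numpy as np

def transformed_covariates(X, W, actions, K):
    """Eq. (5): row i stacks, for each arm a, the d-vector
    sum_j W[i, j] * 1{A_j = a} * X[j]."""
    N, d = X.shape
    Xt = np.zeros((N, d * K))
    for i in range(N):
        for a in range(K):
            mask = (actions == a).astype(float) * W[i]  # W[i, j] * 1{A_j = a}
            Xt[i, a * d:(a + 1) * d] = mask @ X
    return Xt

def eg_actions(X, omega, beta_hat, eps, rng):
    """Eq. (7)-style epsilon-greedy on omega_i * x_i' beta_a; a negative
    omega_i can flip which arm is greedy-optimal."""
    K = beta_hat.shape[1]
    scores = omega[:, None] * (X @ beta_hat)            # N x K
    greedy = scores.argmax(axis=1)
    explore = rng.random(len(omega)) < eps
    return np.where(explore, rng.integers(0, K, len(omega)), greedy)

# Toy round: N = 5 units, d = 3 features, K = 2 arms.
rng = np.random.default_rng(0)
N, d, K = 5, 3, 2
X = rng.normal(size=(N, d))
W = 0.3 * (rng.random((N, N)) < 0.4)                    # assumed sparse spillovers
np.fill_diagonal(W, 1.0)                                # W_{t,ii} = 1
omega = W.sum(axis=0)                                   # omega_i = sum_j W[j, i]
beta_hat = rng.normal(size=(d, K))                      # column a holds beta_a
A = eg_actions(X, omega, beta_hat, eps=0.1, rng=rng)
Xtilde = transformed_covariates(X, W, A, K)             # feeds the OLS update
```

Stacking $\beta_a$ column-wise means the flattened coefficient vector `beta_hat.T.reshape(-1)` plays the role of $\beta \in \mathbb{R}^{dK}$, so `Xtilde @ beta_hat.T.reshape(-1)` reproduces the mean rewards implied by Equation (1).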
A detailed comparison of computational time is provided in Section 5.3, where we show that although LinCBWI incurs a slightly higher runtime, it remains computationally efficient in practice.

## 4. Theoretical Results

In this section, we provide comprehensive theoretical guarantees for linear CB under interference. Due to space limits, we defer all assumptions to Appendix A.

### 4.1. Tail Bound of the Online OLS Estimator

**Theorem 4.1** (Tail Bound of the Online OLS Estimator). Suppose Assumptions A.1-A.2 hold. In LinUCBWI, LinTSWI, or LinEGWI, for any $h > 0$, we have

$$P\left(\|\hat{\beta}_t - \beta\|_1 > h\right) \le dK \exp\left(-\frac{h^2 \bar{N}_t p_t^2 \lambda^2}{2 d^3 K^3 \sigma^2 L_w^2 L_x^2}\right),$$

where $L_w$ and $L_x$ are boundedness constants (see Assumption A.1), and $p_t$ controls the clipping rate in Algorithm 1.

**Remark.** Given that $d$, $\sigma$, $L_w$, and $L_x$ are positive constants, the tail bound for the online OLS estimator simplifies to $P(\|\hat{\beta}_t - \beta\|_1 > h) = O(\exp(-h^2 \bar{N}_t p_t^2))$. As detailed in Assumption A.2, $p_t$ is a non-increasing sequence that controls the level of exploration in Line 10 of Algorithm 1. As long as $\bar{N}_t p_t^2 \to \infty$, $\hat{\beta}_t$ converges in probability to $\beta$. Therefore, in Algorithm 1, we set the clipping rate at round $t$ to $p_t > O(\bar{N}_t^{-1/2})$ to ensure sufficient exploration and thus the convergence of the online OLS estimator.

### 4.2. The Probability of Exploration

Define $\kappa_{ti}(\omega_{ti}, X_{ti}) = P(a_{ti} \ne \hat{\pi}_{t-1}(X_{ti}))$, where the probability operator $P$ is taken with respect to $a_{ti}$ and all historical data collected before round $t$. The term $\kappa_{ti}(\omega_{ti}, X_{ti})$ represents the probability of exploration for unit $i$ at round $t$. We define the limit of $\kappa_{ti}(\omega_{ti}, X_{ti})$ as $\kappa_\infty(\omega, x) = \lim_{\bar{N}_t \to \infty} P(a_{ti} \ne \pi^*(x))$. Since $\kappa_{ti}$ is nonnegative by definition, it follows immediately from the sandwich theorem that $\kappa_\infty$ exists for both UCB and TS. For EG, $\kappa_\infty(\omega, X) = \lim_{\bar{N}_t \to \infty} \kappa_{ti}(\omega, X) = \lim_{\bar{N}_t \to \infty} \epsilon_{ti}(K-1)/K$. In the following theorem, we establish the exploration upper bounds for LinUCBWI, LinTSWI, and LinEGWI, which are crucial for understanding the consistency conditions of the online OLS estimator and the necessity of clipping.
**Theorem 4.2.** Suppose Assumptions A.1-A.3 hold. In LinUCBWI, LinTSWI, or LinEGWI, for any $0 < \xi < |\zeta_{ti}|/2$ with $\zeta_{ti} = \omega_{ti} X_{ti}^\top (\beta_1 - \beta_0)$, we have:

(1) In UCB, there exists a constant $C > 0$ such that

$$\kappa_{ti}(\omega_{ti}, X_{ti}) \le C K^2 \left(\frac{2\alpha L_w L_x}{\sqrt{\bar{N}_{t-1}\, p_{t-1}}\, \lambda} + \xi\right)^{\gamma} + d K^3 \exp\left(-\frac{\xi^2 \bar{N}_{t-1} p_{t-1}^2 \lambda^2}{8 d^3 K^3 \sigma^2 L_w^4 L_x^4}\right).$$

(2) In TS, we have

$$\kappa_{ti}(\omega_{ti}, X_{ti}) \le K^2 \exp\left(-\bar{N}_{t-1} p_{t-1} \lambda\, (|\zeta_{ti}| - \xi)^2\right) + 2 d K^3 \exp\left(-\frac{\xi^2 \bar{N}_{t-1} p_{t-1}^2 \lambda^2}{8 d^3 K^3 \sigma^2 L_w^4 L_x^4}\right).$$

(3) In EG, we have $\kappa_{ti}(\omega_{ti}, X_{ti}) = \epsilon_{ti}(K-1)/K$.

**Remark.** This theorem extends the results of Ye et al. (2023) to scenarios with interference and $K > 2$. As $\bar{N}_t p_t \to \infty$, the exploration probability in both UCB and TS converges to 0. Specifically, in UCB, the exploration upper bound consists of two components: the first term arises from the margin condition, and the second from the tail bound of $\hat{\beta}_t$. When $\bar{N}_t$ is large, the second term, which decays at a rate of $O(\exp\{-\bar{N}_{t-1} p_{t-1}^2\})$, is dominated by the first term, which decays at a rate of $O((\bar{N}_{t-1} p_{t-1})^{-\gamma/2})$ if we set $\xi = O((\bar{N}_{t-1} p_{t-1}^2)^{-1/2})$. In TS, the upper bound is dominated by the second term, which converges to 0 at a rate of $O(\exp\{-\bar{N}_{t-1} p_{t-1}^2\})$. $L_w$ serves as an upper bound controlling the overall level of interference in Assumption A.1.c. A larger $L_w$ increases the exploration upper bound for both UCB and TS. This is reasonable, as a higher level of interference typically results in a looser upper bound on the probability of exploration. However, $L_w$ has no effect on EG, where the probability of exploration is pre-specified.

### 4.3. Statistical Inference on the Online OLS Estimator $\hat{\beta}_t$

Accurate estimation and inference of $\beta$ are crucial for effective decision-making. A consistent and stable estimate of $\beta_a$ plays a key role in the algorithm update (Line 13 of Algorithm 1), and valid inference provides deeper insights into feature importance and selection.
Therefore, this section focuses on establishing the asymptotic properties of the online OLS estimator of $\beta$.

**Theorem 4.3.** Suppose Assumptions A.1-A.3 hold and $\bar{N}_t p_t \to \infty$ as $\bar{N}_t \to \infty$. We have

$$\sqrt{\bar{N}_t}\,(\hat{\beta}_t - \beta) \xrightarrow{D} N(0_{dK}, \sigma^4 G^{-1}), \qquad (11)$$

where, for simplicity, the expression of $G$ is given in Equation (53) of Appendix H.

**Remark.** Theorem 4.3 establishes the asymptotic normality of the online OLS estimator, providing an explicit form for its asymptotic variance. This result holds for the EG, UCB, and TS exploration algorithms. Despite the presence of interference, the asymptotic normality of the estimator only requires the total number of units $\bar{N}_t$ to approach infinity. In other words, bidirectional asymptotic normality is achieved as long as either $t \to \infty$ or the number of units at some stage $N_t \to \infty$.

The asymptotic normality of the online OLS estimator under interference requires careful treatment of dependencies between units both across time (due to the nature of online decision-making) and within each round (due to the existence of interference). To address this, we construct a specialized martingale difference sequence (see the proof in Appendix H) that effectively accounts for the $\bar{N}_t$ units collected up to time $t$. This decomposition, together with the linear structure of interference, allows us to circumvent the independence requirement of the classical CLT, which is central to establishing the asymptotic properties discussed in this and the following subsection.

### 4.4. Statistical Inference on the Optimal Value $V^{\pi^*}$

In real applications, researchers may want to understand how closely the value function approaches its optimal level under the optimal policy $\pi^*$ at the end of round $t$, as well as the magnitude of the potential impact of the optimal policy. This further motivates us to conduct estimation and inference on the optimal value function in online learning. Suppose that the contextual information $X_{ti} \sim P_X$ and the interference weight $\omega_{ti} \sim P_W$.
Define the oracle policy as $\pi^*(X_{ti}) = \arg\max_a \omega_{ti} X_{ti}^\top \beta_a$, and let the optimal value function $V^{\pi^*}$ denote the expected reward under the oracle policy $\pi^*$. Specifically,

$$V^{\pi^*} = E\left[\sum_{a \in \mathcal{A}} \mathbb{1}\{\pi^*(X) = a\}\, \omega\, \beta_a^\top X\right],$$

where the expectation is taken with respect to $X$ and $\omega$. Define $\hat{\mu}^{(t,i)}_{t-1}(X_t, A_t) = \widetilde{X}_{ti}^\top \hat{\beta}_{t-1}$ as the estimated expected reward that unit $i$ obtains given the covariate information and actions of all units at round $t$. We propose a doubly robust (DR) estimator of $V^{\pi^*}$:

$$\hat{V}^{DR}_t = \frac{1}{\bar{N}_t} \sum_{s=1}^{t} \sum_{i=1}^{N_s} \left[\frac{\mathbb{1}\{a_{si} = \hat{\pi}_{s-1}(X_{si})\}}{1 - \hat{\kappa}_{s-1}(\omega_{si}, X_{si})} \left\{r_{si} - \hat{\mu}^{(i,s)}_{s-1}(X_s, \hat{\pi}_{s-1}(X_s))\right\} + \hat{\mu}^{(i,s)}_{s-1}(X_s, \hat{\pi}_{s-1}(X_s))\right]. \qquad (12)$$

As a combination of inverse probability weighting (IPW) and the direct method (DM), our proposed DR estimator offers double protection: it remains consistent for $V^{\pi^*}$ under model misspecification in either $\hat{\kappa}_{t-1}$ or $\hat{\mu}^{(i,t)}_{t-1}$. For brevity, some derivation details are deferred to Appendix D.

**Theorem 4.4.** Suppose Assumptions A.1-A.4 hold. We have

$$\sqrt{\bar{N}_t}\,(\hat{V}^{DR}_t - V^{\pi^*}) \xrightarrow{D} N(0, \sigma^2_V), \qquad (13)$$

where $\sigma^2_V$ is given by

$$\sigma^2_V = E\left[\frac{\sigma^2}{1 - \kappa_\infty(\omega, x)}\right] + \mathrm{Var}\left(\sum_{a \in \mathcal{A}} \omega\, x^\top \beta_a\, \mathbb{1}\{\pi^*(x) = a\}\right). \qquad (14)$$

**Remark.** The asymptotic variance of the optimal value function comprises two components. The first term arises from the IPW estimator and accounts for the variance of the random noise $\eta_{ti}$. The second term originates from the DM estimator and captures the variance due to uncertainty in the context $x$ and the interference weight $\omega$. Notably, our theorem extends the results of Ye et al. (2023) by establishing the asymptotic properties of the estimated optimal value function under interference. In the special case where $\omega \equiv 1$ in Equation (14), our results reduce to theirs.

### 4.5. Regret Bound

We now establish the regret bound for Algorithm 1. Define the regret at the end of round $T$ as

$$R_T = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[\omega_{ti}\, X_{ti}^\top \beta_{\pi^*(X_{ti})} - \omega_{ti}\, X_{ti}^\top \beta_{a_{ti}}\right].$$

**Theorem 4.5.**
For LinEGWI, LinUCBWI, and LinTSWI in Algorithm 1, the general regret bound under interference can be derived as

$$R_T = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E[R^*_{ti} - R_{ti}] = O(\bar{N}_T^{1/2} \log \bar{N}_T),$$

which is sublinear in $\bar{N}_T$.

*Figure 1. The coverage plot of $\beta$ (left) and $V^{\pi^*}$ (right), with the nominal 0.95 level, under EG, UCB, and TS.*

**Remark.** The regret upper bound, even in the presence of interference, remains proportional to the square root of the total number of units (up to a logarithmic term). This aligns with the best achievable regret in the literature without interference. The upper bound can be decomposed into two components: (1) the regret due to estimation accuracy (exploitation), denoted by $R^{(1)}_T$, and (2) the regret due to exploration, denoted by $R^{(2)}_T$. For the EG, UCB, and TS algorithms, the regret from exploitation is proven to be $o(\bar{N}_T^{1/2})$ and is thus negligible. However, the regret due to exploration varies across algorithms. Specifically, in UCB and TS, $R^{(2)}_T$ also depends on the interference level $L_w$, increasing as $L_w$ becomes larger. In contrast, for EG, the probability of exploration is user-specified and independent of the interference weight $\omega$; as a result, $R^{(2)}_T$ is $O(\bar{N}_T^{1/2} \log \bar{N}_T)$ when $\epsilon_{ti}$ is set properly. For detailed expressions of the upper bounds for each algorithm and the orders of the hyperparameters, please refer to Appendix J.

Before concluding this section, we summarize the main findings of Sections 4.3-4.5. Sections 4.3-4.4 and 4.5 are connected through the clipping rate $p_t$. Ideally, setting $p_t$ to 0 minimizes regret, but it may undermine the consistency of $\hat{\beta}$ due to insufficient sampling of certain arms. On the other hand, a larger $p_t$ leads to over-exploration and increases the overall regret.
This trade-off between Sections 4.3-4.4 and 4.5 is a novel perspective, motivating us to examine both aspects to balance statistical efficiency with regret minimization.

## 5. Simulation

### 5.1. Coverage Probability

To demonstrate the asymptotic normality of $\hat{\beta}$ and $\hat{V}^{\pi^*}$ established in Theorems 4.3-4.4, we estimate the asymptotic variance and verify whether the true values of $\beta$ and $V^{\pi^*}$ fall within the estimated confidence regions with the nominal coverage probability over $B = 1000$ replications. By Equations (11) and (13), $\beta$ falls into the confidence region if and only if

$$\frac{\bar{N}_t}{\sigma^4}\, (\hat{\beta} - \beta)^\top G\, (\hat{\beta} - \beta) \le \chi^2_\alpha(dK),$$

where $df = dK$ is the degrees of freedom of the chi-square distribution. Similarly, $V^{\pi^*}$ falls into the confidence interval if and only if $\sqrt{\bar{N}_t}\, |\hat{V}^{DR}_t - V^{\pi^*}| \le z_{\alpha/2}\, \sigma_V$. Additional details of the simulation setup are summarized in Appendix B.1. Coverage probabilities of the OLS estimator $\hat{\beta}$ and the optimal value function $V^{\pi^*}$ under the three exploration algorithms (LinEGWI, LinUCBWI, and LinTSWI) are shown in Figure 1. The coverage consistently hovers around 95%, with the estimated confidence band almost always covering the nominal level. This result supports the validity of the statistical inference presented in Theorems 4.3 and 4.4.

### 5.2. Comparison with Baseline Approaches

First, we compare our proposed method with the classical linear CB algorithms to illustrate the importance of taking interference into consideration. The results, based on $B = 100$ replications, are shown in Figure 2. Our approaches (LinEGWI, LinUCBWI, and LinTSWI) achieve significantly lower average regret at a faster rate than the classical LinCB algorithms, showcasing the effectiveness of LinCBWI in addressing interference.
Without interference (Figure 6 in Appendix B.4), our algorithm reduces to the classical LinCB method and delivers similar results. Further details can be found in Appendix B.2.

Figure 2. Comparison of average regret in the presence of interference with K = 3 arms.

5.3. Computational Time of LinCBWI

At the end of Section 3.2, we briefly discussed the theoretical computational complexity of running Algorithm 1. To empirically validate this analysis, we compare the runtime of LinCBWI to that of classical LinCB under the same setting as in Section 5.2. The results are presented in Figure 3. As expected, LinCBWI incurs slightly higher computational time due to its joint updates, but the overall runtime remains within a few seconds. Therefore, computational complexity is unlikely to pose a practical bottleneck.

Figure 3. Computational time comparison between our proposed algorithm (LinCBWI) and the baselines (LinCB).

5.4. Sensitivity Analysis When $W_t$ Is Misspecified

In this subsection, we conduct a sensitivity analysis to evaluate the impact of misspecifying the interference matrix. The simulation setup follows that of Section 5.2, except that we manually introduce a perturbed matrix for LinCBWI: $\widetilde W_t = W_t + \Xi_t$, where $\Xi_t = \{\xi_{ij}\}_{1\le i,j\le N_t}$ with $\xi_{ij} \sim \mathrm{Unif}(-b, b)$. Any resulting values outside the range $[-1, 1]$ are clipped. Intuitively, a larger $b$ indicates a higher degree of misspecification. We consider $b \in \{0, 0.5, 1, 2\}$ to assess LinCBWI's robustness. Due to space constraints, full results are deferred to Appendix E; the main paper presents results for $b = 1$ as a representative case. Notably, $b = 1$ corresponds to a substantial level of misspecification, given that $W_t$ is normalized to lie within $[-1, 1]$.
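The perturbation scheme just described takes only a few lines; the sketch below is ours, not the paper's code, and `perturb_interference` is a hypothetical helper name.

```python
import numpy as np

def perturb_interference(W, b, rng):
    """Add entrywise Unif(-b, b) noise to the interference matrix W,
    then clip the result back to the normalized range [-1, 1]."""
    Xi = rng.uniform(-b, b, size=W.shape)
    return np.clip(W + Xi, -1.0, 1.0)

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(5, 5))        # a stand-in "true" matrix
W_tilde = perturb_interference(W, b=1.0, rng=rng)
print(W_tilde.min() >= -1.0 and W_tilde.max() <= 1.0)  # True
```

Setting `b=0` recovers the unperturbed matrix, matching the no-misspecification baseline.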
Figure 4. Average regret comparison when b = 1.

Figure 4 shows that LinCBWI remains robust, achieving near-zero average regret despite the severe perturbation. This suggests that incorporating interference structures, even when estimated imprecisely, enhances decision-making performance and provides deeper modeling flexibility. The robustness of LinCBWI alleviates practical concerns about unknown or misspecified interference.

6. Synthetic Data Based on MovieLens

The MovieLens 1M dataset (Harper & Konstan, 2015) contains over 1 million movie ratings from 6k users, aiding movie recommendations based on historical ratings. At each round $t$, user $i$ with context $X_{ti}$ is recommended a movie genre ($A_{ti}$) and provides a rating ($R_{ti}$). Here, we define $A = 1$ for "Comedy" and $A = 0$ for "Drama". There are two types of interference affecting the reward modeling for $R_{ti}$ that are overlooked in classical bandits: (1) Users often rate multiple movies within a short time (a round), indicating that one recommendation made to a user may influence their ratings for others in the same round. (2) Contextual connections (e.g., occupation, ZIP code, age) between users in the same round can cause one recommendation made to a user to affect the ratings of other users. We consider two different pseudo-true reward generating processes (RGPs), RGP1 and RGP2, to further evaluate the performance of our algorithms.

Figure 5. Average rating comparison under reward generating model I (RGP1, left) and II (RGP2, right).
Specifically, RGP1 assumes only the presence of interference, while RGP2 additionally assumes that the true reward model satisfies linearity. Detailed setup descriptions are summarized in Appendix B.3. The comparison results for both RGPs are shown in Figure 5. In both figures, our algorithms consistently outperform classical contextual bandit approaches. Notably, in the case of RGP1, there is a gap in average reward between our algorithms and the oracle model, which disappears in RGP2. This gap likely arises because the true reward model may not be linear. Despite this, the effectiveness of our approach is validated in both scenarios, as it consistently outperforms the baselines thanks to its ability to account for the potential interference structure. All supplementary code is available at our GitHub repository.

7. Summary and Future Directions

In this paper, we propose a series of algorithms to address interference in LinCB, supported by thorough analyses of regret bounds, finite-sample upper bounds, and the asymptotic properties of the online estimators. Before concluding, we outline two practical extensions of this work:

Our algorithm can naturally extend to cases where the interference matrix $W_t$ is unknown. Inspired by low-rank factorization (Shi et al., 2019; Agarwal et al., 2020), we can assume that $W_t$ exhibits a learnable low-rank structure informed by agents' contextual information, i.e., $W_t = X_t\Phi X_t^\top$, where $\Phi \in \mathbb{R}^{d\times d}$ represents a universal weighting matrix. This matrix can be learned alongside the bandit algorithm to generalize the interference structure. Extending the algorithm in this way requires jointly estimating $\Phi$ and $\beta$, which can be formulated as the optimization problem
$$\arg\min_{\beta,\Phi}\ \sum_{t=1}^{T}\big\|R_t - \widetilde X_t\beta\big\|_2^2 + \lambda\|\Phi\|_1,$$
where $\lambda$ controls the sparsity of $\Phi$. While promising, this approach requires rigorous justification to ensure identifiability, as well as further computational and theoretical development.
Given the scope of this extension, it presents an intriguing direction for future research.

It is also possible to generalize the reward model in Equations (2) and (3). One plausible extension involves replacing the linearity assumption with a neural network, as in neural contextual bandits (Zhou et al., 2020; Zhang et al., 2020). While this enhances modeling flexibility and practical applicability, it may introduce extra complexity for statistical inference in the theoretical analysis.

Another promising direction is to investigate the regret lower bound in the presence of interference. In this paper, we have shown that, despite interference, extensions of classical algorithms such as EG, UCB, and TS maintain the standard optimal regret rate proportional to the square root of the sample size. This is the best achievable regret bound within these exploration algorithms. Parallel research efforts, though at the cost of reduced generalizability, have explored methods to decrease exploration probabilities and achieve smaller regret bounds, as discussed in Goldenshluger & Zeevi (2013), Han et al. (2020), and He et al. (2022). Extending such approaches to settings with interference remains an open and exciting direction for future research.

Acknowledgments

This research was conducted under the support of the National Science Foundation through Grant DMS-2113637.

Impact Statement

This paper is the first to address the interference problem in contextual bandits, with broad applicability to real-world domains such as recommendation systems, personalization, clinical applications, and more. We believe that this work, along with the potential extensions discussed in the summary, will inspire future research that advances decision-making across various machine learning applications.

References

Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. FLAMBE: Structural complexity and representation learning of low rank MDPs.
Advances in Neural Information Processing Systems, 33:20095-20107, 2020.
Agarwal, A., Agarwal, A., Masoero, L., and Whitehouse, J. Multi-armed bandits with network interference. arXiv preprint arXiv:2405.18621, 2024.
Aronow, P. M. and Samii, C. Estimating average causal effects under general interference, with application to a social network experiment. 2017.
Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. 2007.
Bargagli-Stoffi, F. J., Tortù, C., and Forastiere, L. Heterogeneous treatment and spillover effects under clustered network interference. arXiv preprint arXiv:2008.00707, 2020.
Bargiacchi, E., Verstraeten, T., Roijers, D., Nowé, A., and Hasselt, H. Learning to coordinate with coordination graphs in repeated single-stage multi-agent decision problems. In International Conference on Machine Learning, pp. 482-490. PMLR, 2018.
Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294, 2020.
Chen, H., Lu, W., and Song, R. Statistical inference for online decision making: In a contextual bandit setting. Journal of the American Statistical Association, 116(533):240-255, 2021.
Cliff, A. D. and Ord, J. K. Spatial Processes: Models & Applications. 1981.
Dedecker, J. and Louhichi, S. Maximal inequalities and empirical central limit theorems. In Empirical Process Techniques for Dependent Data, pp. 137-159. Springer, 2002.
Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. Accurate inference for adaptive linear models. In International Conference on Machine Learning, pp. 1194-1203. PMLR, 2018.
Dubey, A. et al. Kernel methods for cooperative multi-agent contextual bandits. In International Conference on Machine Learning, pp. 2740-2750. PMLR, 2020.
Feller, W. An Introduction to Probability Theory and Its Applications, Volume 2, volume 81. John Wiley & Sons, 1991.
Forastiere, L., Airoldi, E. M., and Mealli, F.
Identification and estimation of treatment and interference effects in observational studies on networks. Journal of the American Statistical Association, 116(534):901-918, 2021.
Getis, A. Spatial autocorrelation. In Handbook of Applied Spatial Analysis: Software Tools, Methods and Applications, pp. 255-278. Springer, 2009.
Goldenshluger, A. and Zeevi, A. A linear response bandit problem. Stochastic Systems, 3(1):230-261, 2013.
Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., and Athey, S. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences, 118(15):e2014602118, 2021.
Hall, P. and Heyde, C. C. Martingale Limit Theory and Its Application. Academic Press, 2014.
Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1-19, 2015.
He, J., Zhang, J., and Zhang, R. Q. A reduction from linear contextual bandits lower bounds to estimations lower bounds. In International Conference on Machine Learning, pp. 8660-8677. PMLR, 2022.
Hudgens, M. G. and Halloran, M. E. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832-842, 2008.
Jia, S., Frazier, P., and Kallus, N. Multi-armed bandits with interference. arXiv preprint arXiv:2402.01845, 2024.
Kennedy, E. H. Semiparametric doubly robust targeted double machine learning: a review. arXiv preprint arXiv:2203.06469, 2022.
Landgren, P., Srivastava, V., and Leonard, N. E. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pp. 243-248. IEEE, 2016.
Leung, M. P. Causal inference under approximate neighborhood interference.
Econometrica, 90(1):267-293, 2022a.
Leung, M. P. Rate-optimal cluster-randomized designs for spatial interference. The Annals of Statistics, 50(5):3064-3087, 2022b.
Luedtke, A. R. and Van Der Laan, M. J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2):713, 2016.
Manski, C. F. Identification of treatment response with social interactions. The Econometrics Journal, 16(1):S1-S23, 2013.
Martínez-Rubio, D., Kanade, V., and Rebeschini, P. Decentralized cooperative stochastic bandits. Advances in Neural Information Processing Systems, 32, 2019.
Qu, Z., Xiong, R., Liu, J., and Imbens, G. Efficient treatment effect estimation in observational studies under heterogeneous partial interference. arXiv preprint arXiv:2107.12420, 2021.
Rosenbaum, P. R. Interference between units in randomized experiments. Journal of the American Statistical Association, 102(477):191-200, 2007.
Shi, C., Lu, W., and Song, R. Determining the number of latent factors in statistical multi-relational learning. Journal of Machine Learning Research, 20(23):1-38, 2019.
Sobel, M. E. What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476):1398-1407, 2006.
Su, L., Lu, W., and Song, R. Modelling and estimation for optimal treatment decision with interference. Stat, 8(1):e219, 2019.
Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine, 27(23):4658-4677, 2008.
Verstraeten, T., Bargiacchi, E., Libin, P. J., Helsen, J., Roijers, D. M., and Nowé, A. Multi-agent Thompson sampling for bandit applications with sparse neighbourhood structures. Scientific Reports, 10(1):6728, 2020.
Viviano, D. Experimental design under network interference. arXiv preprint arXiv:2003.08421, 2020.
Viviano, D. Policy targeting under network interference. Review of Economic Studies, pp. rdae041, 2024.
Wang, H., Fan, J., Chen, Z., Li, H., Liu, W., Liu, T., Dai, Q., Wang, Y., Dong, Z., and Tang, R. Optimal transport for treatment effect estimation. Advances in Neural Information Processing Systems, 36, 2024.
Ye, S., Cai, H., and Song, R. Doubly robust interval estimation for optimal policy evaluation in online learning. Journal of the American Statistical Association, (just-accepted):1-20, 2023.
Zhang, K., Janson, L., and Murphy, S. Inference for batched bandits. Advances in Neural Information Processing Systems, 33:9818-9829, 2020.
Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pp. 11492-11502. PMLR, 2020.
Zhou, H., Li, S., Jiang, G., Zheng, J., and Wang, D. Direct heterogeneous causal learning for resource allocation problems in marketing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5446-5454, 2023.

A. Assumptions

Assumption A.1. (Boundedness)
a. Define $\Sigma := E[xx^\top]$ as the covariance of the contextual information. There exists a constant $\lambda > 0$ such that $\lambda_{\min}(\Sigma) > \lambda$.
b. For all $X_{ti} \in \mathcal{X}$, there exists a constant $L_x$ such that $\|X_{ti}\|_2 \le L_x$ for any $t \in [T]$ and $i \in [N_t]$.
c. For all $W_t \in \mathcal{W}$, there exists a constant $L_w$ such that $\sum_j |W_{t,ij}| \le L_w$ and $\sum_j |W_{t,ji}| \le L_w$ for any $t \in [T]$ and $i \in [N_t]$.

Assumption A.2. (Clipping) For any action $a$ and round $t$, there exists a positive and non-increasing sequence $p_t$ such that
$$\lambda_{\min}\bigg(\frac{1}{N_{t-1}}\sum_{s=1}^{t-1}\sum_{i=1}^{N_s}\widetilde X_{si}\widetilde X_{si}^\top\bigg) > p_{t-1}\,\lambda_{\min}(\Sigma).$$

Assumption A.3. (Margin Condition) For any $\epsilon > 0$, there exists a positive constant $\gamma > 0$ such that for any two different arms $a$ and $a'$, we have $P(0 < |f(X, a) - f(X, a')| < \epsilon) = O(\epsilon^\gamma)$.

Assumption A.1 includes several boundedness conditions.
Assumption A.1.a ensures that there is no strong collinearity between different features, which is necessary for a stable OLS estimator. This condition is commonly assumed in bandit-related inference papers (Zhang et al., 2020; Chen et al., 2021; Ye et al., 2023). Assumptions A.1.b and A.1.c ensure that the contextual information and the interference level for each individual unit are bounded. Assumption A.2 is a technical requirement that guarantees the bandit algorithm explores all actions sufficiently at a rate of $p_t$, enabling consistent estimation of the OLS estimator. This exploration procedure is widely assumed in the bandit inference literature (Deshpande et al., 2018; Hadad et al., 2021; Ye et al., 2023). Notice that Assumption A.2 is actively enforced in our algorithm. Specifically, Line 10 of Algorithm 1 ensures that if an arm is explored insufficiently (determined via a smallest-eigenvalue comparison), the algorithm enforces additional exploration, preventing extreme imbalance and ensuring statistical consistency. In our simulation studies, we simply set $p_t \equiv 0.01$ and observed that Line 10 is rarely triggered. Therefore, this assumption is unlikely to pose significant practical concerns. Assumption A.3, known as the margin condition, is a standard assumption in the contextual bandits literature (Audibert & Tsybakov, 2007; Luedtke & Van Der Laan, 2016). It ensures that the expected rewards of different arms are sufficiently separated, thereby making the bandit learning problem well-posed and meaningful.

Assumption A.4. (Rate Double Robustness) Define the $L_2$ norm as $\|z\|_{2,N_T} = \sqrt{\frac{1}{N_T}\sum_{t=1}^{T}\sum_{i=1}^{N_t} z_{ti}^2}$. We assume that the convergence rate of the propensity score model satisfies $\|\hat\kappa_{t-1}(\omega_{ti}, X_{ti}) - \kappa_{t-1}(\omega_{ti}, X_{ti})\|_{2,N_T} = O_p(N_T^{-\alpha_1})$ and that of the outcome regression model satisfies $\|\hat\mu^{(t,i)}_{t-1}(X_t, \hat\pi_{t-1}(X_t)) - \mu^{(t,i)}(X_t, \hat\pi_{t-1}(X_t))\|_{2,N_T} = O_p(N_T^{-\alpha_2})$, with $\alpha_1 + \alpha_2 > 1/2$.
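The smallest-eigenvalue check behind this enforced exploration can be sketched as follows; the function name, the flattened design-matrix input, and the thresholds are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def needs_forced_exploration(X_tilde, p_t, lam):
    """Return True when the minimum eigenvalue of the sample design
    matrix (1/n) * X~^T X~ falls below the clipping target p_t * lam,
    i.e., when some direction has been explored too little (cf. A.2)."""
    n = X_tilde.shape[0]
    Sigma_hat = X_tilde.T @ X_tilde / n
    # eigvalsh returns eigenvalues in ascending order for symmetric matrices
    return float(np.linalg.eigvalsh(Sigma_hat)[0]) <= p_t * lam

rng = np.random.default_rng(0)
X_ok = rng.normal(size=(200, 4))                  # well-spread design
X_bad = np.zeros((200, 4)); X_bad[:, 0] = 1.0     # one direction never sampled
print(needs_forced_exploration(X_ok, 0.01, 1.0))   # False
print(needs_forced_exploration(X_bad, 0.01, 1.0))  # True
```

With the small value $p_t \equiv 0.01$ used in the simulations, such a check fires only under severe imbalance, consistent with the observation that Line 10 is rarely triggered.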
Assumption A.4 requires that the convergence rates of the conditional mean function and the estimated probability of exploration satisfy certain conditions. This is a standard assumption in the causal inference literature, as noted in Luedtke & Van Der Laan (2016); Kennedy (2022). In our setting, this assumption is almost always satisfied, given that $\|\hat\mu^{(t,i)}_{t-1}(X_t, \hat\pi_{t-1}(X_t)) - \mu^{(t,i)}(X_t, \hat\pi_{t-1}(X_t))\|_{2,N_T} = O_p(N_T^{-1/2})$ follows directly from Theorem 4.3. Therefore, it suffices to ensure that $\|\hat\kappa_{t-1}(\omega_{ti}, X_{ti}) - \kappa_{t-1}(\omega_{ti}, X_{ti})\|_{2,N_T} = o_p(1)$ for Assumption A.4 to hold. This can easily be achieved by using a sample-based exploration estimand. In practice, as $\hat\kappa_{t-1}(\omega_{ti}, X_{ti})$ tends to be small as $t$ increases, we set $\hat\kappa_{t-1}(\omega_{ti}, X_{ti}) = \sum_{s\le t-1,\, i\in[N_s]}\mathbb{1}\{A_{si} \ne \hat\pi(X_{si})\}/N_{t-1}$, which proves to be sufficient in the simulation and real data analysis.

B. Simulation Setup and a Supplementary Plot Without Interference

B.1. Simulation Setup in Section 5.1

The simulation setup for testing the coverage probability is as follows. In the estimation of $\beta$, the entire process is replicated $B = 1000$ times to calculate the empirical coverage. For each replication, we assume there are a total of $T = 500$ rounds, and we randomly sample the true $\beta$ from $\beta_0 = (2, 3, 1)^\top$ and $\beta_1 = (1, 1, 3)^\top$. In the estimation of $\beta$, we assume $N_t \sim \mathrm{Poisson}(5)$ units are interacting with the environment. $X_{ti} = (X_{ti,1}, \ldots, X_{ti,3})^\top \in \mathbb{R}^3$ denotes the feature information of unit $i$ at round $t$, where $X_{ti,1} \equiv 1$, $X_{ti,2} \sim N(4, 1)$, $X_{ti,3} \sim \mathrm{Unif}(0, 3)$, and all of the samples are i.i.d. over $(t, i)$. At each round, we generate the interference matrix $W_t \in \mathbb{R}^{N_t\times N_t}$ as follows. Suppose the diagonal elements $W_{ii} = 1$. For each $i > j$, we generate $W_{ij} \sim \alpha\,\mathrm{Unif}(-0.6, -0.3) + (1-\alpha)\,\mathrm{Unif}(0.1, 0.4)$, where $\alpha \sim \mathrm{DU}(1, K)$. In the estimation of $V^{\pi^*}$, for a more balanced variance composition in Equation (14), we set up the data generating process for $W_t$ and $X_{ti}$ as follows.
For each $i \ne j$, we generate $W_{t,ij} \sim \alpha\,\mathrm{Unif}(-0.2, -0.1) + (1-\alpha)\,\mathrm{Unif}(0.05, 0.2)$, where $\alpha \sim \mathrm{DU}(1, K)$. For the contextual information, we generate $X_{ti} = (X_{ti,1}, \ldots, X_{ti,3})^\top \in \mathbb{R}^3$ for unit $i$ at round $t$, where $X_{ti,1} \equiv 0.2$, $X_{ti,2} \sim N(0.8, 0.04)$, $X_{ti,3} \sim \mathrm{Unif}(0, 0.6)$, and all of the samples are i.i.d. over $(t, i)$. To establish the asymptotic normality of the optimal value function, we need to posit another mild assumption regarding the convergence rates of the two estimation models (the propensity score and outcome regression models) presented in the main paper.

B.2. Simulation Setup in Section 5.2

In the reward comparison, we set $K = 3$ arms and a total of $T = 500$ rounds. For each round $t$, a total of $N_t \sim \mathrm{Poisson}(\lambda)$ units interact with the environment. We generate the interference matrix $W_t \in \mathbb{R}^{N_t\times N_t}$ as follows: Suppose the diagonal elements $W_{ii} = 1$. For each $i > j$, we generate $W_{ij} \sim \alpha\,\mathrm{Unif}(-0.9, -0.6) + (1-\alpha)\,\mathrm{Unif}(0.1, 0.4)$, where $\alpha \sim \mathrm{DU}(1, K)$. The lower triangular elements are set equal to the upper triangular ones. Define $X_{ti} = (X_{ti,1}, \ldots, X_{ti,d})^\top \in \mathbb{R}^d$ as the feature information of unit $i$ at round $t$. Here, we let $d = 5$, where the first column is the intercept 1, $(X_{ti,2}, X_{ti,3}) \sim \mathrm{MVN}(\mu, \Sigma)$, and $(X_{ti,4}, X_{ti,5})$ follows a uniform distribution. Following the reward-generating process described in Equation (2) with a linear payoff function (or equivalently, Equation (15)), we uniformly sample $\beta_0 \sim \mathrm{Unif}(-5, -1)$, $\beta_1 \sim \mathrm{Unif}(-1, 5)$, $\beta_2 \sim \mathrm{Unif}(-3, 3)$, and replicate this process $S = 100$ times to test the robustness of the different approaches w.r.t. changes in the environment. All experiments were conducted on a local computer with 16 GB of memory.

B.3. Real Data Setup in Section 6

Based on the timestamps of each rating and the relative user density, we divided the dataset into $T = 200$ rounds. For each round-unit pair $(t, i)$, $X_{ti} \in \mathbb{R}^d$ is a $d = 7$ dimensional vector that includes an intercept term, age, gender, and 4 dummy variables representing the top 4 most popular occupation types.
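The matrix-generation recipe in B.2 can be sketched as below. Because the role of the DU(1, K) draw as a mixing weight is not fully spelled out here, we hedge by using a plain random indicator to pick between the two uniform components; the function name is our own.

```python
import numpy as np

def make_interference_matrix(N, rng, neg_range=(-0.9, -0.6), pos_range=(0.1, 0.4)):
    """Unit diagonal; each strictly-lower entry is drawn from one of two
    uniform components chosen by a coin flip (our stand-in for the
    DU(1, K)-based mixing), then mirrored so W is symmetric."""
    W = np.eye(N)
    for i in range(N):
        for j in range(i):
            lo, hi = neg_range if rng.random() < 0.5 else pos_range
            W[i, j] = rng.uniform(lo, hi)
            W[j, i] = W[i, j]  # lower triangle mirrored to the upper
    return W

W = make_interference_matrix(6, np.random.default_rng(0))
print(np.allclose(W, W.T), np.allclose(np.diag(W), 1.0))  # True True
```

The two component ranges default to the Section 5.2 values and can be swapped for the B.1 ranges used in the coverage study.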
We construct an interference matrix $W_t$ based on the contextual information of users in the same round using normalized Jaccard similarity. Note that if a user provides multiple ratings in the same round (which is highly likely according to our observations), we treat them as different users with the same contextual information, and thus the corresponding element in $W_t$ is set to 1. We proceed with two different pseudo-true reward generating processes. I: For each user $j$, we calculate $\bar R_j(a)$ as the average rating of user $j$ under movie type $A = a$. Then the true reward of user $i$ at round $t$ is given by $R_{ti} = \sum_{j=1}^{N_t} W_{t,ij}\,\bar R_j(a)$. II: Following Equation (2) with linear payoff functions, we fit a linear regression model to $R_{ti}$, with
$$R_{ti} = \sum_{j=1}^{N_t} W_{t,ij}\, X_{tj}^\top\beta_a\,\mathbb{1}\{A_{tj} = a\} + \epsilon_{ti}, \qquad (15)$$
to estimate $\beta_0$, $\beta_1$, and $\sigma$ as specified above. We then use these estimated values to regenerate $\widetilde R_{ti}$, which we assume represents the true reward of user $i$ at round $t$.

B.4. Results Comparison Without Interference

Using the same simulation setup as in Section 5.2, but with the interference matrix $W_t$ replaced by an identity matrix, we compare the results of the classical linear CB algorithms with our proposed methods. The results, shown in Figure 6, indicate that all methods yield comparable performance over time, with average regrets converging to zero at a rapid rate. This illustrates that our method, LinCBWI, reduces to the classical LinCB algorithms when interference does not exist.

Figure 6. Comparison of average regret in the absence of interference.

C. The Derivation of Three Exploration Strategies

In this section, we provide a detailed derivation of the exploration strategies for LinEGWI, LinUCBWI, and LinTSWI, which were briefly introduced in Section 3.2.

C.1.
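The Jaccard-based construction in B.3 can be sketched as follows, assuming each user's categorical attributes sit in a Python set; the helper name is hypothetical and the paper's exact normalization is not reproduced here.

```python
import numpy as np

def jaccard_interference(attr_sets, user_ids):
    """Build W_t from pairwise Jaccard similarity of users' attribute
    sets; repeated appearances of the same user get weight 1."""
    N = len(attr_sets)
    W = np.eye(N)
    for i in range(N):
        for j in range(i):
            if user_ids[i] == user_ids[j]:
                W[i, j] = 1.0  # same user rating multiple movies in the round
            else:
                union = attr_sets[i] | attr_sets[j]
                W[i, j] = len(attr_sets[i] & attr_sets[j]) / len(union) if union else 0.0
            W[j, i] = W[i, j]
    return W

attrs = [{"student", "ny"}, {"student", "la"}, {"student", "ny"}]
W = jaccard_interference(attrs, user_ids=[1, 2, 1])
print(W[0, 1], W[0, 2])  # 0.3333333333333333 1.0
```

Here units 0 and 2 are the same user (weight 1), while units 0 and 1 share one of three pooled attributes, giving Jaccard similarity 1/3.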
LinEGWI

First, to generalize the classical EG algorithm in the presence of interference, we explore different arms with probability $\epsilon_{ti}$ and select the estimated optimal arm with probability $1 - \epsilon_{ti}$. That is,
$$a_{ti} = (1 - Z_{ti})\,\arg\max_a \omega_{ti} X_{ti}^\top\hat\beta_{ti,a} + Z_{ti}\cdot U, \quad U \sim \mathrm{DU}(1, K), \qquad (16)$$
where $Z_{ti} \sim \mathrm{Ber}(\epsilon_{ti})$, and $\mathrm{DU}(1, K)$ denotes the discrete uniform distribution s.t. $P(A = a) = \frac{1}{K}$ for any $a \in [K]$.

C.2. LinUCBWI

We next consider the extension of linear UCB to interference-existing scenarios. The key idea behind UCB is to use the variance of the parameter estimates, specifically the upper confidence bound, to guide exploration. In the presence of interference, this process is equivalent to comparing the UCBs of $\omega_{ti} X_{ti}^\top\hat\beta_a$ and selecting the action that maximizes the value. Define $\widetilde\Sigma_t := \widetilde X_{1:t}^\top\widetilde X_{1:t}$. Since $\mathrm{Var}(\hat\beta_t) = \sigma^2\widetilde\Sigma_t^{-1}$, the UCB under $A_{ti} = a \in \{1, \ldots, K\}$ can be derived as follows:
$$\mathrm{UCB}_{ti,a} = \omega_{ti} X_{ti}^\top\hat\beta_{t-1,a} + \alpha|\omega_{ti}|\sqrt{X_{ti}^\top\big(\widetilde\Sigma_{t-1}^{-1}\big)_a X_{ti}}, \qquad (17)$$
where $\alpha$ is a hyperparameter that controls the exploration-exploitation tradeoff, and $\big(\widetilde\Sigma_{t-1}^{-1}\big)_a$ denotes the $d\times d$ block-diagonal submatrix of $\widetilde\Sigma_{t-1}^{-1}$ corresponding to the variance of $\hat\beta_{t-1,a}$. Thus, the LinUCBWI algorithm selects the arm $a_{ti}$ according to
$$a_{ti} = \arg\max_a \mathrm{UCB}_{ti,a}. \qquad (18)$$

C.3. LinTSWI

In linear Thompson sampling, the prior of $\beta$ is often pre-specified. At each round, the units with transformed covariate matrix $\widetilde X_{ti}$ are used to update the posterior of $\beta$ after collecting the reward $R_{ti}$. Here, we adopt a normal prior for $\beta$, which follows
$$\text{(Prior)}\quad \beta \sim \pi(\beta) := N(\mu_0, \Sigma_0), \qquad \text{(Update)}\quad R_{ti} = \widetilde X_{ti}^\top\beta + \eta_{ti},\ \ i \in \{1, \ldots, N_t\}. \qquad (19)$$
The posterior distribution of $\beta$ given $\{X_{1:t}, A_{1:t}, R_{1:t}\}$ can be derived as $f\big(\beta\,|\,X_{1:t}, A_{1:t}, R_{1:t}\big) \propto f\big(R_{1:t}\,|\,\beta, \widetilde X_{1:t}\big)\,\pi(\beta)$. After simple calculations, the posterior mean and variance of $\beta$ can be obtained by
$$\Sigma_{t,\mathrm{post}}^{-1} = \Sigma_0^{-1} + \sum_{i,t}\widetilde X_{ti}\widetilde X_{ti}^\top/\sigma^2, \qquad \beta_{t,\mathrm{post}} = \Sigma_{t,\mathrm{post}}\Big\{\Sigma_0^{-1}\mu_0 + \sum_{i,t} R_{ti}\widetilde X_{ti}/\sigma^2\Big\}. \qquad (20)$$
Suppose $v$ is a hyperparameter deciding the level of exploration in TS.
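The conjugate update in Equation (20) is the standard Bayesian linear regression step; a minimal sketch (not the paper's implementation) is:

```python
import numpy as np

def ts_posterior(mu0, Sigma0, X_tilde, r, sigma2):
    """Normal-prior conjugate update as in Eq. (20): the posterior
    precision adds the scaled Gram matrix, and the posterior mean
    solves the resulting ridge-like system."""
    prec0 = np.linalg.inv(Sigma0)
    prec_post = prec0 + X_tilde.T @ X_tilde / sigma2
    Sigma_post = np.linalg.inv(prec_post)
    beta_post = Sigma_post @ (prec0 @ mu0 + X_tilde.T @ r / sigma2)
    return beta_post, Sigma_post

# Sanity check on simulated data: with a vague prior and many samples,
# the posterior mean approaches the data-generating coefficients.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0])
X = rng.normal(size=(2000, 2))
r = X @ beta_true + 0.1 * rng.normal(size=2000)
beta_post, Sigma_post = ts_posterior(np.zeros(2), 10.0 * np.eye(2), X, r, sigma2=0.01)
print(np.round(beta_post, 1))  # approximately [ 1. -2.]
```

Thompson sampling then draws a perturbed coefficient vector around `beta_post` with covariance scaled by the exploration hyperparameter $v^2$ and plays the arm maximizing the interference-weighted score.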
For unit $i$ at round $t$, LinTSWI samples $\widetilde\beta_{ti} \sim N(\beta_{t,\mathrm{post}}, v^2\Sigma_{t,\mathrm{post}})$ and then chooses the arm $a_{ti}$ such that
$$a_{ti} = \arg\max_a \omega_{ti} X_{ti}^\top\widetilde\beta_{ti,a}. \qquad (21)$$

D. Optimal Value Function Estimators: A Detailed Introduction

In Section 4.4 of the main paper, we present the final doubly robust (DR) estimator $\hat V^{\mathrm{DR}}_t$ and establish its asymptotic normality in Theorem 4.4. Here, we offer a detailed introduction to three commonly used estimators in causal inference and elaborate on the construction of the DR estimator. As detailed in the main paper, our goal lies in providing different estimates of $V^{\pi^*}$, where
$$V^{\pi^*} = E\Big[\pi^*(X)\,\omega\beta_1^\top X + \big(1 - \pi^*(X)\big)\,\omega\beta_0^\top X\Big].$$
The first estimator we propose to estimate $V^{\pi^*}$ is the Inverse Probability Weighting (IPW) estimator, also known as the Importance Sampling (IS) estimator in reinforcement learning. The core idea is to use the propensity ratio, $\frac{\mathbb{1}\{a_{ti} = \pi^*(X_{ti})\}}{P\{a_{ti} = \pi^*(X_{ti})\}}$, to adjust for distribution shifts caused by exploration. However, since the true values of $\pi^*$ and $P\{a_{ti} = \pi^*(X_{ti})\}$ are unknown, we replace them with their corresponding sample estimates. Therefore,
$$\hat V^{\mathrm{IPW}}_t = \frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\frac{\mathbb{1}\{a_{si} = \hat\pi_{s-1}(X_{si})\}}{1 - \hat\kappa_{s-1}(\omega_{si}, X_{si})}\,r_{si},$$
where $\hat\kappa_{t-1}(\omega_{ti}, X_{ti}) = \sum_{s\le t-1,\, i\in[N_s]}\mathbb{1}\{A_{si} \ne \hat\pi(X_{si})\}/N_{t-1}$, and $\hat\pi_{s-1}(X_{si})$ is obtained from Line 9 of Algorithm 1. The second estimator we propose is the Direct Method (DM). The concept is straightforward: we substitute the unknown true values, such as $\pi^*$ and $\beta$, with their sample estimates in the optimal value function $V^{\pi^*}$ to directly estimate the optimal reward. Thus,
$$\hat V^{\mathrm{DM}}_t = \frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\sum_{a\in\mathcal A}\mathbb{1}\{\hat\pi_{s-1}(X_{si}) = a\}\,\omega_{si} X_{si}^\top\hat\beta_{s-1,a}.$$
Following the same logic used to derive Equation (2), the above estimator can be rewritten as
$$\hat V^{\mathrm{DM}}_t = \frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\hat\mu^{(i,s)}_{s-1}\big(X_s, \hat\pi_{s-1}(X_s)\big),$$
where $\hat\mu^{(t,i)}_{t-1}(X_t, A_t) = \widetilde X_{ti}^\top\hat\beta_{t-1}$ denotes the expected reward that unit $i$ obtains given the covariate information and actions of all units at round $t$.
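The IPW estimator above, and the doubly robust combination derived next in this section, can both be sketched in vectorized form; the helper names and the flattened per-unit inputs are our own simplifications, with the exploration rate $\hat\kappa$ entering through $P(A = \hat\pi) = 1 - \hat\kappa$.

```python
import numpy as np

def ipw_value(r, a, pi_hat, kappa_hat):
    """IPW: reweight the rewards of units whose realized action matched
    the estimated optimal policy by 1 / (1 - kappa_hat)."""
    match = (a == pi_hat).astype(float)
    return np.mean(match / (1.0 - kappa_hat) * r)

def dr_value(r, a, pi_hat, kappa_hat, mu_hat):
    """DR: direct-method prediction mu_hat plus an IPW-weighted residual
    correction; consistent if either nuisance model is correct."""
    match = (a == pi_hat).astype(float)
    return np.mean(match / (1.0 - kappa_hat) * (r - mu_hat) + mu_hat)

r = np.array([1.0, 2.0, 3.0, 4.0])
a = np.array([0, 1, 0, 1])
pi_hat = np.array([0, 1, 1, 1])                       # estimated optimal arms
print(ipw_value(r, a, pi_hat, kappa_hat=0.5))          # 3.5
print(dr_value(r, a, pi_hat, kappa_hat=0.5, mu_hat=r)) # 2.5
```

When the outcome model is exact (`mu_hat = r`), the residual correction vanishes and the DR estimate reduces to the direct-method average, illustrating the double-robustness mechanism.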
By combining the two estimators mentioned above, we derive the doubly robust (DR) estimator, where
$$\hat V^{\mathrm{DR}}_t = \frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\bigg[\frac{\mathbb{1}\{a_{si} = \hat\pi_{s-1}(X_{si})\}}{1 - \hat\kappa_{s-1}(\omega_{si}, X_{si})}\Big\{r_{si} - \hat\mu^{(i,s)}_{s-1}\big(X_s, \hat\pi_{s-1}(X_s)\big)\Big\} + \hat\mu^{(i,s)}_{s-1}\big(X_s, \hat\pi_{s-1}(X_s)\big)\bigg].$$
In $\hat V^{\mathrm{DR}}_t$, the second term, i.e., $\hat\mu^{(i,s)}_{s-1}(X_s, \hat\pi_{s-1}(X_s))$, corresponds to the direct estimator. The first term, involving the propensity ratio, is an augmentation term derived from the IPW estimator, which provides additional protection against model misspecification, thereby ensuring double robustness. Specifically, as long as either the propensity score model $\hat\kappa_{t-1}$ or the outcome regression model $\hat\mu^{(t,i)}_{t-1}$ is correctly specified, $\hat V^{\mathrm{DR}}_t$ is a consistent estimator of the optimal value function $V^{\pi^*}$. Furthermore, under Assumption A.4 (which is demonstrated to be quite mild in our setting in Appendix A), the DR estimator achieves asymptotic normality as the total number of units, $N_t$, approaches infinity. Details are summarized in Theorem 4.4.

E. Sensitivity Analysis on Misspecified Interference Matrix

Figure 7. Average regret comparison when the perturbation parameter b = 0 (i.e., no misspecification, left) and b = 0.5 (right).

Figure 8. Average regret comparison when the perturbation parameter b = 1 (left) and b = 2 (right).

In this section, we provide the full set of sensitivity analysis results omitted from Section 5.4 due to space constraints.
As introduced in the main paper, we evaluate the impact of interference misspecification by introducing a perturbed matrix for LinCBWI, defined as $\widetilde W_t = W_t + \Xi_t$. Here, $\Xi_t = \{\xi_{ij}\}_{1\le i,j\le N_t}$ denotes a perturbation matrix with $\xi_{ij} \sim \mathrm{Unif}(-b, b)$. A larger value of $b$ indicates a more severe level of misspecification. We consider $b \in \{0, 0.5, 1, 2\}$ to assess LinCBWI's robustness to such perturbations. The results are presented in Figures 7 and 8. When $b = 0$, i.e., no misspecification, the result exactly matches Figure 2. As $b$ increases, we observe a modest degradation in LinCBWI's performance: the average regret moves further from zero, as seen in the right panel of Figure 8 for $b = 2$. Nonetheless, even under this severe level of perturbation, LinCBWI consistently outperforms classical LinCB. These results suggest that incorporating interference structures, even when estimated with considerable error, improves decision-making performance and provides valuable modeling flexibility in the bandit framework. Lastly, we consider an extreme scenario in which the interference matrix used by LinCBWI is entirely uninformative. Specifically, we generate $\widetilde W_t \sim \mathrm{Unif}(-1, 1)$, meaning each entry of $\widetilde W_t$ is independently drawn from a uniform distribution and bears no relation to the true interference matrix $W_t$. In this case, LinCBWI performs comparably to classical LinCB, and even slightly better, as shown in Figure 9. We attribute this marginal improvement to potential overfitting introduced by the more complex modeling structure of LinCBWI. Importantly, this finding demonstrates that even when the interference structure is completely misspecified, LinCBWI does not suffer performance degradation relative to LinCB. Taken together with the results in Figures 7 and 8, this highlights the robustness and stability of our proposed method in the presence of misspecified or uninformative interference information.
Figure 9. Average regret comparison when the interference matrix used in LinCBWI is completely uninformative, i.e., $\widetilde W_t \sim \mathrm{Unif}(-1, 1)$.

F. Proof of Theorem 4.1: the Upper Bound of the Online OLS Estimator

The proof of this theorem was originally presented in Bastani & Bayati (2020). Here, we provide a slightly modified version to fit our specific context with interference. Define $\widetilde\Sigma_t = \frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\widetilde X_{si}\widetilde X_{si}^\top$. According to the definition of $\hat\beta_t$,
$$\hat\beta_t - \beta = \widetilde\Sigma_t^{-1}\,\frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\widetilde X_{si}\eta_{si}.$$
Since $\widetilde\Sigma_t$ is a symmetric positive semi-definite matrix, we have
$$\big\|\widetilde\Sigma_t^{-1}\big\|_2 = \lambda_{\max}\big(\widetilde\Sigma_t^{-1}\big) = \big\{\lambda_{\min}(\widetilde\Sigma_t)\big\}^{-1},$$
where, by Assumptions A.1-A.2, $\lambda_{\min}(\widetilde\Sigma_t)$ is lower bounded by $p_t\lambda$. Therefore,
$$\big\|\hat\beta_t - \beta\big\|_2 \le \big\|\widetilde\Sigma_t^{-1}\big\|_2\,\bigg\|\frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\widetilde X_{si}\eta_{si}\bigg\|_2 = \big\{\lambda_{\min}(\widetilde\Sigma_t)\big\}^{-1}\bigg\|\frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\widetilde X_{si}\eta_{si}\bigg\|_2.$$
Define the $l$th element of $\widetilde X_{ti}$ as $\widetilde X_{ti,l}$, where $l = 1, \ldots, dK$, with $d$ denoting the dimension of the covariates and $K$ denoting the number of arms. For any $h > 0$,
$$P\big(\|\hat\beta_t - \beta\|_2 \ge h\big) \le P\bigg(\bigg\|\frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\widetilde X_{si}\eta_{si}\bigg\|_2 \ge h\,p_t\lambda\bigg) \le \sum_{l=1}^{dK} P\bigg(\bigg|\frac{1}{N_t}\sum_{s=1}^{t}\sum_{i=1}^{N_s}\widetilde X_{si,l}\eta_{si}\bigg| \ge \frac{h\,p_t\lambda}{\sqrt{dK}}\bigg). \qquad (22)$$
To bound the right-hand side, we will utilize Lemma 1 from Chen et al. (2021). As this lemma is directly applicable to our context, we state it here and refer readers to the original paper for the proof.

Lemma F.1. Suppose $\{\mathcal F_q : q = 1, \ldots, N_T\}$ is an increasing filtration of $\sigma$-fields. Let $\{Z_q : q = 1, \ldots, N_T\}$ be a sequence of random variables such that $Z_q$ is $\mathcal F_{q-1}$-measurable and $|Z_q| \le L$. Let $\{\eta_q : q = 1, \ldots, N_T\}$ be independent $\sigma$-Gaussian noise, with $\eta_q$ independent of $\mathcal F_{q-1}$ for all $q$. Let $S = \{s_1, \ldots, s_{|S|}\} \subseteq \{1, \ldots, N_T\}$ be an index set, where $|S|$ is the number of elements in $S$. Then for any $h > 0$,
$$P\bigg(\bigg|\sum_{q\in S} Z_q\eta_q\bigg| \ge h\bigg) \le \exp\bigg(-\frac{h^2}{2\sigma^2L^2|S|}\bigg).$$
In our context, we flatten the units $\{t, i\}_{1\le t\le T,\,1\le i\le N_t}$ into a unit queue $Q(t, i) = \sum_{s=1}^{t-1} N_s + i$, such that all of the units are measured in chronological order.
Notice that the specific order of units within a round does not matter, as estimation and parameter updates occur only after all units in the round have been observed. As such, we also use $\widetilde{X}_{q,l}$ to denote the $l$th element of $\widetilde{X}_{ti}$. To use Lemma F.1, we define a filtration $\mathcal{F}_q = \sigma(\widetilde{X}_{1,l}\eta_1, \ldots, \widetilde{X}_{q,l}\eta_q)$, which satisfies $\eta_q \perp \mathcal{F}_{q-1}$ for any $q \in \{1, \ldots, \mathcal{N}_T\}$. Let $Z_q = \widetilde{X}_{q,l}$. Then by Assumption A.1.b-c,
$$|\widetilde{X}_{q,l}|^2 \le \|\widetilde{X}_q\|_2^2 \le K \Big\| \sum_{j=1}^{N_t} W_{t,ij} X_{tj} \Big\|_2^2 \le K L_x^2 d \Big( \sum_{j} |W_{t,ij}| \Big)^2 \le dK L_w^2 L_x^2.$$
Define $L := \sqrt{dK}\, L_w L_x$, so that $|\widetilde{X}_{q,l}| \le L$. According to the conclusion of Lemma F.1, we have
$$P\Big( \Big| \frac{1}{\mathcal{N}_t} \sum_{s=1}^{t} \sum_{i=1}^{N_s} \widetilde{X}_{si,l}\, \eta_{si} \Big| \ge \frac{p_t \lambda h}{\sqrt{dK}} \Big) \le \exp\Big( -\frac{h^2 \mathcal{N}_t p_t^2 \lambda^2}{2 d^2 K^2 \sigma^2 L_w^2 L_x^2} \Big). \quad (24)$$
Combining the results of Equations (22) and (24), we have
$$P\big( \|\hat{\beta}_t - \beta\|_2 \le h \big) \ge 1 - \sum_{l=1}^{dK} P\Big( \Big| \frac{1}{\mathcal{N}_t} \sum_{s=1}^{t} \sum_{i=1}^{N_s} \widetilde{X}_{si,l}\, \eta_{si} \Big| \ge \frac{p_t \lambda h}{\sqrt{dK}} \Big) \ge 1 - dK \exp\Big( -\frac{h^2 \mathcal{N}_t p_t^2 \lambda^2}{2 d^2 K^2 \sigma^2 L_w^2 L_x^2} \Big).$$
Lastly, since $\|v\|_1 \le \sqrt{dK}\, \|v\|_2$ for any $v \in \mathbb{R}^{dK}$, we have
$$P\big( \|\hat{\beta}_t - \beta\|_1 \le h \big) \ge P\big( \|\hat{\beta}_t - \beta\|_2 \le h / \sqrt{dK} \big) \ge 1 - dK \exp\Big( -\frac{h^2 \mathcal{N}_t p_t^2 \lambda^2}{2 d^3 K^3 \sigma^2 L_w^2 L_x^2} \Big).$$
The proof of Theorem 4.1 is thus complete.

G. Proof of Theorem 4.2: the Upper Bound of Exploration

G.1. Probability of exploration for UCB

In this subsection, we aim to prove that the upper bound of $\kappa_{ti}$ for the UCB algorithm satisfies
$$\kappa_{ti}(\omega_{ti}, X_{ti}) \le \frac{CK(K-1)}{2} \Big( \frac{2\alpha L_w L_x}{\sqrt{\mathcal{N}_{t-1} p_{t-1} \lambda}} + \xi \Big)^{\gamma} + dK^2 (K-1) \exp\Big( -\frac{\xi^2 \mathcal{N}_{t-1} p_{t-1}^2 \lambda^2}{8 d^3 K^3 \sigma^2 L_w^4 L_x^4} \Big).$$
We split the proof into three steps.

Step 1: Rewrite
$$\kappa_{ti}(\omega_{ti}, X_{ti}) = \sum_{\substack{1 \le a, a' \le K \\ a \ne a'}} \frac{1}{2} P\Big( \big| \omega_{ti} X_{ti}^\top (\hat{\beta}_{t-1,a} - \hat{\beta}_{t-1,a'}) \big| < \alpha \{ \hat{\sigma}_{t-1,a'}(X_{ti}) - \hat{\sigma}_{t-1,a}(X_{ti}) \} \Big).$$
For clarity, we first consider the case where $K = 2$ and then extend the formula to the general case of $K$. When $K = 2$,
$$\kappa_{ti}(\omega_{ti}, X_{ti}) = P(A_{ti} \ne \hat{\pi}_{t-1}(X_{ti})) = \underbrace{P(A_{ti} \ne \hat{\pi}_{t-1}(X_{ti}),\, \hat{\pi}_{t-1}(X_{ti}) = 1)}_{\delta_1} + \underbrace{P(A_{ti} \ne \hat{\pi}_{t-1}(X_{ti}),\, \hat{\pi}_{t-1}(X_{ti}) = 0)}_{\delta_0}.$$
We first consider $\delta_1$.
Based on the exploration criteria of UCB algorithm, we have Ati = 1 n ωti X ti bβt 1,1 + αbσt 1,1(Xti) > ωti X ti bβt 1,0 + αbσt 1,0(Xti) o , where bσt 1,a(Xti) = |ωti| X ti f X 1:(t 1)f X1:(t 1) 1 aa Xti, with f X 1:(t 1)f X1:(t 1) 1 aa denoting the a-th block- diagonal matrix of dimension d d that quantifies the inverse covariance associated with arm a. Thus, given that bπt 1(Xti) = 1, i.e., ωti X ti(bβt 1,1 bβt 1,0) > 0, we have δ1 =P(Ati = bπt 1(Xti), bπt 1(Xti) = 1) =P(ωti X ti bβt 1,1 + αbσt 1,1(Xti) < ωti X ti bβt 1,0 + αbσt 1,0(Xti), ωti X ti bβt 1,1 > ωti X ti bβt 1,0) =P 0 < ωti X ti(bβt 1,1 bβt 1,0) < α{bσt 1,0(Xti) bσt 1,1(Xti)} . Similarly, we have δ0 =P(Ati = bπt 1(Xti), bπt 1(Xti) = 0) =P(ωti X ti bβt 1,1 + αbσt 1,1(Xti) > ωti X ti bβt 1,0 + αbσt 1,0(Xti), ωti X ti bβt 1,1 < ωti X ti bβt 1,0) =P α{bσt 1,0(Xti) bσt 1,1(Xti)} < ωti X ti(bβt 1,1 bβt 1,0) < 0 . Combining the result of Equation (26) and (27), we have that when K = 2, κti(ωti, Xti) = δ1 + δ0 = P |ωti X ti(bβt 1,1 bβt 1,0)| < α{bσt 1,0(Xti) bσt 1,1(Xti)} . (28) Now let s consider a general K. Similarly, κti(ωti, Xti) = P(Ati = bπt 1(Xti)) = a =1 P(Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a ). Linear Contextual Bandits With Interference (Lin CBWI) Notice that when a = a , the probability P(Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a ) equals 0. For any a = a , we can treat them as arms 1 and 0 and repeat the proof for K = 2, as shown above. Consequently, for a general K, Equation (28) can be extended to κti(ωti, Xti) = X 1 a,a K a =a 1 2P |ωti X ti(bβt 1,a bβt 1,a )| < α{bσt 1,a (Xti) bσt 1,a(Xti)} . (29) Step 2: Bound the variance term bσt 1,a(Xti). 
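The UCB selection criterion analyzed in this subsection can be sketched as follows. This is a minimal illustration under our own naming (`ucb_arm`, toy inputs): each arm's score is the estimated mean $\omega\, x^\top \hat\beta_a$ plus $\alpha$ times a width computed from the arm's block of the pooled inverse covariance.

```python
import numpy as np

def ucb_arm(x, omega, beta_hat, inv_cov_blocks, alpha):
    """Choose the arm maximizing omega * x' beta_hat_a + alpha * sigma_hat_a(x),
    where sigma_hat_a(x) = |omega| * sqrt(x' [inv_cov]_aa x) uses the a-th
    d x d block-diagonal piece of the pooled inverse covariance."""
    scores = []
    for beta_a, V_a in zip(beta_hat, inv_cov_blocks):
        width = abs(omega) * np.sqrt(x @ V_a @ x)
        scores.append(omega * (x @ beta_a) + alpha * width)
    return int(np.argmax(scores))

d = 3
x = np.ones(d)
beta_hat = [np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.0, 0.0])]
blocks = [np.eye(d) * 0.01] * 2  # equal uncertainty: the bonus terms cancel
assert ucb_arm(x, 1.0, beta_hat, blocks, alpha=0.1) == 0  # larger mean wins
```

Exploration (choosing an arm other than the greedy one) happens exactly when the width terms overturn the gap between estimated means, which is the event bounded in this proof.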
For any a [K], notice that bσt 1,a(Xti)2 = ω2 ti X ti f X 1:(t 1)f X1:(t 1) 1 aa Xti ω2 ti Xti 2 2 max v 2=1 v T f X 1:(t 1)f X1:(t 1) 1 ω2 ti Xti 2 2 λmax f X 1:(t 1)f X1:(t 1) 1 L2 w L2 x λmax f X 1:(t 1)f X1:(t 1) 1 = L2 w L2 x λmin n f X 1:(t 1)f X1:(t 1) o, where the first inequality holds by the definition of eigenvalues, and the last inequality holds by Assumption A.1.b-c. Furthermore, by Assumption A.1 and A.2, λmin n f X 1:(t 1)f X1:(t 1) o = Nt 1 λmin i=1 f Xsif X si > Nt 1 pt 1 λmin(Σ) Nt 1 pt 1λ. Combining the result above to the expression of bσt 1,a(Xti), we can further derive bσt 1,a(Xti) Lw Lx r λmin n f X 1:(t 1)f X1:(t 1) o Lw Lx p Nt 1pt 1λ . (30) Therefore, for any 1 a, a K and a = a , α bσt 1,a(Xti) bσt 1,a (Xti) α n bσt 1,a(Xti) + bσt 1,a (Xti) o 2αLw Lx p Nt 1pt 1λ . Combining the result above and Equation (29), we have κti(ωti, Xti) K(K 1) 2 P |ωti X ti(bβt 1,a bβt 1,a )| < α bσt 1,a (Xti) bσt 1,a(Xti) |ωti X ti(bβt 1,a bβt 1,a )| < 2αLw Lx p Nt 1pt 1λ Step 3: Further bound the RHS of Equation (31). For brevity, we denote bζti = ωti X ti(bβt 1,a bβt 1,a ) and ζti = ωti X ti(βa βa ). Note that in bζti and ζti, we omit the dependence on the arms (a, a ) when there is no ambiguity, as they simply represent any two different arms in [K]. For any ξ > 0, define a event E := |bζti ζti| ξ . By Holder s inequality and v v 2, we have |ωti X ti bβt 1,a ωti X tiβa| ωti Xti bβt 1,a βa 1 Lw Xti 2 bβt 1,a βa 1 Lw Lx bβt 1,a βa 1 . Linear Contextual Bandits With Interference (Lin CBWI) According to Theorem 4.1, P |ωti X ti bβt 1,a ωti X tiβa| > ξ P n Lw Lx bβt 1,a βa 1 > ξ o = P n bβt 1,a βa 1 > ξ Lw Lx P bβt 1 β 1 > ξ Lw Lx d K exp ξ2 Nt 1p2 t 1λ2 2d3K3σ2L4w L4x By the triangle inequality, |bζti ζti| = {ωti X ti bβt 1,a ωti X tiβa} {ωti X ti bβt 1,a ωti X tiβa } ωti X ti bβt 1,a ωti X tiβa + ωti X ti bβt 1,a ωti X tiβa . 
Thus, for |bζti ζti|, we have P(|bζti ζti| > ξ) P ωti X ti bβt 1,a ωti X tiβa + ωti X ti bβt 1,a ωti X tiβa > ξ P ωti X ti bβt 1,a ωti X tiβa > ξ/2 + P ωti X ti bβt 1,a ωti X tiβa > ξ/2 d K exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x + d K exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x = 2d K exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x Therefore, event E satisfies P(E) 1 2d K exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x On event E, we have |bζti| |ζti| |bζti ζti| |ζti| ξ. Then going back to Equation (31), we further have κti(ωti, Xti) K(K 1) |bζti| < 2αLw Lx p Nt 1pt 1λ |bζti| < 2αLw Lx p Nt 1pt 1λ |ζti| ξ < 2αLw Lx p Nt 1pt 1λ + d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x |ζti| < 2αLw Lx p Nt 1pt 1λ + ξ + d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x (33) By definition, |ζti| = |ωti X ti(βa βa )| = |ωti| f(Xti, a) f(Xti, a ) . Since Wt,ii = 1, we always have |ωti| 1 for any round-unit pair (t, i). Therefore, |ζti| f(Xti, a) f(Xti, a ) . According to Assumption A.3, there exists some constant γ such that P n |ζti| < 2αLw Lx Nt 1pt 1λ+ξ o P n f(Xti, a) f(Xti, a ) < 2αLw Lx Nt 1pt 1λ+ξ o O n 2αLw Lx Nt 1pt 1λ+ By taking this result back to Equation (33), we are able to show that there exists a constant C, such that κti(ωti, Xti) K(K 1) |ζti| < 2αLw Lx p Nt 1pt 1λ + ξ + d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x 2αLw Lx p Nt 1pt 1λ + ξ + d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x where C is a large positive constant. The proof is thus complete. Linear Contextual Bandits With Interference (Lin CBWI) G.2. Probability of exploration for TS In this subsection, we aim to prove that the upper bound of κti for TS algorithm satisfies κti(ωti, Xti) K(K 1) exp Nt 1pt 1λ(|ζti| ξ)2 + 2d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x The proof can be split into three steps. Step 1: Decompose κti(ωti, Xti) and bound it by P(E). By definition, κti(ωti, Xti) = P(Ati = bπt 1(Xti)) = a =1 P(Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a ), where a should be different from a , otherwise the corresponding probability equals 0. 
Similar to Step 3 of Section G.1, we define a event E := |bζti ζti| ξ for any ξ (0, |ζti|/2), where bζti = ωti X ti(bβt 1,a bβt 1,a ), and ζti = ωti X ti(βa βa ). According to the result of Equation (32), we have P(E) 1 2d K exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x Taking the result above back to the definition of κti(ωti, Xti), we have κti(ωti, Xti) = P(Ati = bπt 1(Xti)) X 1 a,a K a =a P(Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | E) + P(Ec) 1 a,a K a =a P(Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | E) + 2d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x Next, we focus on bounding the first term P(Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | E) for any arm pair (a, a ), which is equivalent to E E h 1 Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | bζti i | E . Step 2: Bound the probability of E h 1 Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | bζti i on event E. Recall that in TS, we have Ati = arg maxa{ωti X ti eβt 1,a}, where eβt 1 N(bβt 1, v2A 1 t 1) with bβt = f X 1:tf X1:t 1 f X 1:t R1:t and At = f X 1:tf X1:t. After simple transformations, for any arm pair (a, a ), ωti X ti eβt 1,a ωti X ti eβt 1,a also follows a normal distribution with ωti X ti eβt 1,a ωti X ti eβt 1,a N ωti X ti(bβt 1,a bβt 1,a ), v2ω2 ti X ti Dt 1Xti , (37) where Dt 1 = Var(bβt 1,a bβt 1,a ) = A 1 t 1 aa + A 1 t 1 a a 2 A 1 t 1 aa , with A 1 t 1 aa denoting the d d dimensional block-diagonal matrix of A 1 t 1 that corresponds to the variance of βt 1,a, and A 1 t 1 aa denoting the sub- matrix in A 1 t 1 representing the covariance between bβt 1,a and bβt 1,a . Since Dt 1 is symmetric and positive semi-definite, Dt 1 2 A 1 t 1 aa + 2 A 1 t 1 a a , with X Y denoting that (Y X) is positive semi-definite. For the simplicity of implementation, we exclude the interaction term A 1 t aa in Algorithm 1, i.e., assuming independence between bβa,t and bβa ,t while making decisions. 
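The Thompson sampling step just described, with the cross-arm covariance block dropped as in Algorithm 1, can be sketched as follows; the helper name and toy inputs are our illustrative assumptions.

```python
import numpy as np

def ts_arm(x, omega, beta_hat, inv_cov_blocks, v, rng):
    """Thompson sampling: draw beta_tilde_a ~ N(beta_hat_a, v^2 [A^{-1}]_aa)
    independently per arm (cross-arm covariance ignored), then pick
    argmax_a omega * x' beta_tilde_a."""
    scores = []
    for beta_a, V_a in zip(beta_hat, inv_cov_blocks):
        beta_tilde = rng.multivariate_normal(beta_a, v ** 2 * V_a)
        scores.append(omega * (x @ beta_tilde))
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
d = 3
x = np.ones(d)
beta_hat = [np.array([2.0, 0.0, 0.0]), np.array([-2.0, 0.0, 0.0])]
blocks = [np.eye(d) * 1e-4] * 2
# posterior noise is tiny relative to the mean gap, so arm 0 is chosen
assert ts_arm(x, 1.0, beta_hat, blocks, v=1.0, rng=rng) == 0
```

As the pooled sample grows, the blocks of $A_{t}^{-1}$ shrink, the posterior concentrates, and the probability of drawing a non-greedy arm decays, which is the quantity bounded in this subsection.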
Linear Contextual Bandits With Interference (Lin CBWI) According to the distribution derived in Equation (37), we have E h 1 Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | bζti i P ωti X ti(eβt 1,a eβt 1,a ) > 0, ωti X ti(bβt 1,a bβt 1,a ) < 0 | ωti X ti(bβt 1,a bβt 1,a ) = 1 Φ h ωti X ti(bβt 1,a bβt 1,a) q v2ω2 ti X ti Dt 1Xti i , where Φ( ) is the cumulative distribution function of N(0, 1). Denote bzti = ωti X ti(bβt 1,a bβt 1,a) q v2ω2 ti X ti Dt 1Xti. According to the tail bound established for standard normal distribution in Section 7.1 of Feller (1991), we have E h 1 Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | bζti i E exp( bz2 ti/2) = E exp ω2 ti X ti(bβt 1,a bβt 1,a) 2 2v2ω2 ti X ti Dt 1Xti Define bσt 1,a(Xti) = |ωti| X ti f X 1:(t 1)f X1:(t 1) 1 aa Xti. According to the upper bound derived in Equation (30), ω2 ti X ti Dt 1Xti 2bσt 1,a(Xti)2 + 2bσt 1,a (Xti)2 4L2 w L2 x Nt 1pt 1λ. Combining the result above to Equation (38), we can further derive E h 1 Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | bζti i E exp ω2 ti X ti(bβt 1,a bβt 1,a ) 2 2v2ω2 ti X ti Dt 1Xti E exp Nt 1pt 1λω2 ti X ti(bβt 1,a bβt 1,a ) 2 Note that on event E, for any 0 < ξ < |ζti|/2, bζ2 ti = ω2 ti X ti(bβt 1,a bβt 1,a ) 2 (|ζti| ξ)2. Therefore, E E h 1 Ati = bπt 1(Xti), Ati = a, bπt 1(Xti) = a | bζti i | E E exp Nt 1pt 1λ(|ζti| ξ)2 exp Nt 1pt 1λ(|ζti| ξ)2 Step 3: Summary. Combining the results of Equation (36) and (39), we finally have κti(ωti, Xti) K(K 1) exp Nt 1pt 1λ(|ζti| ξ)2 + 2d K2(K 1) exp ξ2 Nt 1p2 t 1λ2 8d3K3σ2L4w L4x The proof is thus complete. H. Proof of Theorem 4.3: the Asymptotic Normality of bβt Recall that E[Rti] = f X tiβ, where f Xti = (1 Nt W ti IAt=1Xt, . . . , 1 Nt W ti IAt=KXt) Rd K. Define the number of samples collected till the end of round t as Nt = Pt s=1 Nt and further define N := NT . 
Therefore, we estimate bβt at round t by bβt = f X 1:tf X1:t 1 f X 1:t R1:t = i=1 f Xsif X si i=1 f Xsi Rsi Linear Contextual Bandits With Interference (Lin CBWI) Since Rsi = f X tiβ + ηsi, we can write p Nt(bβt β) = i=1 f Xsif X si i=1 f Xsiηsi Here, with a slight abuse of notation, we denote the two terms in the above equation as η1 and η2, which are to be estimated separately in later steps. Notice that this differs from the sub-Gaussian error of the reward-generating function in Equation (1). Step 1: Show that η1 = 1 Nt Pt s=1 PNs i=1 f Xsiηsi D N(0d K, G). According to Cramer-Wold device, it suffices to show that for any v Rd K, η1(v) = 1 p Nt i=1 v f Xsiηsi D N(0d K, v Gv). (42) Before proceeding, let s flatten the round-unit pairs {(t, i)}1 t T,1 i Nt to an unit queue Q(t, i) = Pt 1 s=1 Ns + i, such that all of the units are measured in a chronological order. Notice that the order of units in the same round does not matter, since the action decisions for all units in round t are made at the end of that round. For any flattened unit index q0 = Q(i0, t0) {1, . . . , N}, we define Hq0 as the σ-algebra containing the information up to unit q0. That is, Hq0 = σ(v f X1η1, . . . , v f Xq0ηq0). For different indices q, there is a jump in information gathering for Hq whenever q = Q(i = 1, t) for some t. Since f Xq Hq, all of the action assignment information collected at round t, i.e., At, is contained in Hq at the beginning of this round. With a slight abuse of notation, in the following proof, we will also use Ht to denote all historical data collected up to round t. The tricky part of establishing asymptotic properties for bβt lies in the data dependence. Specifically, the transformed covariate vector f Xsi is a function of (W t, As, Xs), thus depending on all of the actions and original covariates information collected at round t. 
As such, f Xsiηsi f Xi s ηi s for any (s, s ), since (1) if s = s , units in the same round s are correlated by W s; (2) if s < s , unit are dependent since the later decisions made on As will depend on (W s, Xs, As). Now let s use Martingale CLT to establish the asymptotic properties. We will prove shortly that {v f Xsiηsi}, or equivalently {v f Xqηq} after flattening, is a Martingale difference sequence. That is, we would like to show E[v f Xqηq|Hq 1] = 0, q {1, . . . , N}. (43) Suppose q = Q(t, i) for some (t, i) pair. According to our assumption on the noise term in the main paper, ηti (Xt, W t)|At ηti f Xt|At as f Xt is a function of (W t, Xt, At). E[v f Xqηq|Hq 1] = E h E[v f Xqηq|Hq 1, At]|Hq 1 i (A1) = E E[v f Xq|Hq 1, At] E[ηq|Hq 1, At] | {z } =0 Now it suffice to (1) check the Lindeberg condition, and (2) derive the limit of conditional variance. (1) We first check the Lindeberg condition. For any δ > 0, we define " 1 Nq (v f Xq)2η2 q 1 1 p Nq v f Xqηq > δ Hq 1 Linear Contextual Bandits With Interference (Lin CBWI) According to Assumption A.1.b-c, (v f Xq)2 v 2 2 f Xq 2 2 K v 2 2 j=1 Wt,ij Xtj 2 2 K v 2 2 L2 xd X j |Wt,ij| 2 d KL2 w L2 x v 2 2. (45) 1 1 p Nq v f Xqηq > δ 1 d KL2 w L2 x v 2 2η2 q > Nqδ2 = 1 η2 q > Nqδ2 d KL2w L2x v 2 2 ψ d KL2 w L2 x v 2 2 Nq q=1 E η2 q 1 η2 q > Nqδ2 d KL2w L2x v 2 2 Define f Nq = d KL2 w L2 x v 2 2 Nq P Nq q=1 η2 q 1 η2 q > Nqδ2 d KL2w L2x v 2 2 , and g Nq = d KL2 w L2 x v 2 2 Nq P Nq q=1 η2 q. It is obvious that |f Nq| g Nq a.s. and for all q. Since E[η2 q|Hq 1] = E E[η2 q|At, Hq 1]|Hq 1 = σ2 < , E[g Nq|Hq 1]d KL2 w L2 x v 2 2 Nq q=1 σ2 d KL2 w L2 x v 2 2σ2 < , thus g Nq is integrable for all q. For each realization of random variable sequence {ηq} q=1, lim Nq f Nq = 0 as 1 η2 q > Nqδ2 d KL2w L2x v 2 2 = 0 when Nq is large enough. Therefore, by Generalized Dominated Convergence Theorem (GDCT), it follows from Equation (46) that ψ E[f Nq|Hq 1] 0 as q . The Lindeberg condition is thus verified. 
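The estimator $\hat{\beta}_t$ whose limit is being characterized is the pooled OLS fit over all observed (round, unit) pairs. A minimal numerical sketch, with toy data and a helper name of our choosing:

```python
import numpy as np

def pooled_ols(X_tilde, R):
    """beta_hat_t = (X~' X~)^{-1} X~' R, pooled over all (round, unit) pairs.
    X_tilde: (N, dK) stacked transformed covariates; R: (N,) rewards."""
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ R)

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p))
beta = np.array([1.0, -0.5, 0.3, 0.0])
R = X @ beta + 0.1 * rng.normal(size=n)  # linear rewards with small noise
beta_hat = pooled_ols(X, R)
assert np.allclose(beta_hat, beta, atol=0.05)  # concentrates around the truth
```

In the bandit setting the rows of `X_tilde` are not i.i.d. (they depend on past actions through the policy), which is precisely why the martingale CLT argument above is needed instead of the classical i.i.d. one.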
(2) We next derive the limit of conditional variance. i=1 E (v f Xq)2η2 q|Hq 1 = 1 i=1 E (v f Xq)2E[η2 q|Ai]|Hq 1 i=1 σ2E (v f Xq)2|Hq 1 (47) where the last equality holds since η2 q|Ai i.i.d. follows N(0, σ2). Recall that for any unit index q = Q(t, i), f Xq = f Xti = (1 Nt W tidiag(1{Ati = 0}1 i Nt)Xt, 1 Nt W tidiag(1{Ati = 1}1 i Nt)Xt) . After some manipulations, we have f Xqf X q = f Xtif X ti := M1,1 M1,2 M1,K M2,1 M2,2 M2,K ... ... ... ... MK,1 MK,2 MK,K Rd K d K, (48) where Ma,a := PNt k=1 PNt l=1 Wt,ik Wt,il Xtk X tl 1{Atk = a}1{Atl = a } is a K by K sub-matrix. Since we assume Xti X and W t W are known to us, the main challenge of deriving the conditional variance lies in estimating the conditional expectation of E[1{Atk = a}1{Atl = a }|W t, Xt, Hq 1] for any q {1, . . . , Nq}. Linear Contextual Bandits With Interference (Lin CBWI) Recall that Hq 1 is defined as the σ-algebra containing the information up to unit q 1. That is, Hq 1 = σ(v f X1η1, . . . , v f Xq 1ηq 1). For different indices q, there is a jump in information gathering for Hq whenever q = Q(i = 1, t) for some t. Since f Xq Hq, all of the action assignment information collected at round t, i.e., At, is contained in Hq at the beginning of this round. Thanks to the property of ηq that E[ηq|At] = 0, the conditional variance E (v f Xq)2ϵ2|Hq 1 = E (v f Xq)2ϵ2 = E (v f Xq)2ϵ2|Ht 1 for any q = Q(t, i). Still, we take the d d sub-matrix Ma,a as an example to calculate the asymptotic variance. E[Ma,a ] = E Nt X l=1 Wt,ik Wt,il Xtk X tl 1{Atk = a}1{Atl = a } l=1 Wt,ik Wt,il Xtk X tl 1{Atk = a}1{Atl = a } W t, Xt, Ht 1 l=1 Wt,ik Wt,il Xtk X tl E 1{Atk = a}1{Atl = a } W t, Xt, Ht 1 # Notice that Atk Atl|W t, Xt, Ht 1 for any (k, l) {1, . . . , Nt} and k = l. This independence arises because the actions assigned to units Q(k, t) and Q(l, t) are determined by two factors: exploitation and exploration. 1. 
Exploitation: The action Atk is partially determined by ˆπt 1(Xtk), where ˆπt 1 is a function of Ht 1 and is obtained by fitting a model to data from the first t 1 rounds. Therefore, given (W t, Xt, Ht 1), ˆπt 1(Xtk) and ˆπt 1(Xtl)|Ht 1 are both constants and thus independent from each other. 2. Exploration: The action Atk is also influenced by a specific exploration method based on the optimal action identified during exploitation. In ϵ-greedy, the level of exploration is determined by ϵti, which is independently assigned to each unit. For UCB and TS, the exploration level for each unit is a function of Ht 1, making them mutually independent given Ht 1. E 1{Atk = a}1{Atl = a } W t, Xt, Ht 1 = E 1{Atk = a} W t, Xt, Ht 1 E 1{Atl = a } W t, Xt, Ht 1 . (50) For the simplicity of notation, we define νti(ωti, Xti, Ht 1) = P(Ati = π (Xti)|W t, Xti, Ht 1) = P(Ati = π (Xti)|ωti, Xti, Ht 1). Since 1{Ati = a} = 1{Ati = π (Xti)} 1{π (Xti) = a} + 1{Ati = π (Xti)} 1{π (Xti) = a}, E 1{Ati = a} W t, Xt, Ht 1 = E 1{Ati = π (Xti)} 1{π (Xti) = a} + 1{Ati = π (Xti)} 1{π (Xti) = a} W t, Xt, Ht 1 = P(Ati = π (Xti)|W t, Xt, Ht 1)1{π (Xti) = a} + P(Ati = π (Xti)|W t, Xt, Ht 1)1{π (Xti) = a} = (1 νti(ωti, Xti, Ht 1))1{a = arg max b ωti Xtiβb} + νti(ωti, Xti, Ht 1)1{a = arg max b ωti Xtiβb}. Plugging in the result of Equation (50), (51) to Equation (49), one can obtain Linear Contextual Bandits With Interference (Lin CBWI) E[Ma,a ] = EW t,Xt l=1 Wt,ik Wt,il Xtk X tl E 1{Atk = a}1{Atl = a } W t, Xt, Ht 1 # 1 k,l Nt Wt,ik Wt,il Xtk X tl (1 νtk(ωtk, Xtk, Ht 1))1{a = arg max b ωtk Xtkβb} + νtk(ωtk, Xtk, Ht 1)1{a = arg max b ωtk Xtkβb} (1 νtl(ωtl, Xtl, Ht 1))1{a = arg max b ωtl Xtlβb} + νtl(ωtl, Xtl, Ht 1)1{a = arg max b ωtl Xtlβb} + 1{a = a } 1 k,l Nt W 2 t,ik Xtk X tk (1 νtk(ωtk, Xtk, Ht 1))1{a = arg max b ωtk Xtkβb} + νtk(ωtk, Xtk, Ht 1)1{a = arg max b ωtk Xtkβb} # Define κ (ω, x) = limq P(Ati = π (x)). Following similar procedure as page 19-20 and Lemma B.1 in Ye et al. 
(2023), we can also derive that νtk(ω, x, Ht 1) p κ (ω, x), where the limit is free of historical data Ht 1. Therefore, q=1 σ2E[Ma,a ] 1 q=1 σ2EW t,Xt 1 k,l Nt Wt,ik Wt,il Xtk X tl J (ωtk, Xtk, β, a)J (ωtl, Xtl, β, a ) + 1{a = a } 1 k,l Nt W 2 t,ik Xtk X tk J (ωtk, Xtk, β, a) where J (ω, X, β, a) = n (1 κ (ω, X))1{a = arg maxb ωX βb} + κ (ω, X)1{a = arg maxb ωX βb} o . Define v = (v 1, v 2) where v1 and v2 are both d-dimensional vector. Then η1(v) = 1 Nq P Nq q=1 v f Xqηq D N(0d K, v Gv) with 1 Nq P Nq q=1 σ2E[M1,1] 1 Nq P Nq q=1 σ2E[M1,K] ... ... ... 1 Nq P Nq q=1 σ2E[MK,1] 1 Nq P Nq q=1 σ2E[MK,K] where the detailed expression of each submatrix in G is given in Equation (52). Step 2: Show that η2 = n 1 Nt Pt s=1 PNs i=1 f Xsif X si o 1 p σ2G 1, where σ2 = E[η2 ti|At]. Based on Lemma 6 of Chen et al. (2021), it suffice to find the limit of 1 Nq P Nq q=1 v f Xqf X q v. According to Equation (45), we have P(|v f Xqf X q v| > h) P(d KL2 w L2 x v 2 2 > h). Therefore, by Theorem 2.19 in Hall & Heyde (2014), we have h v f Xqf X q v E (v f Xq)2|Hq 1 i p 0 as q . Linear Contextual Bandits With Interference (Lin CBWI) Based on the results in Step 1, one can easily derive E (v f Xq)2|Hq 1 = v Gv/σ2. Combining the results above and by Continuous Mapping Theorem, we have i=1 f Xsif X si ) 1 p (G/σ2) 1 = σ2G 1, (54) which finishes the proof of this step. Step 3: Summary. According to the results of Step 1-2 and Slutsky s Theorem, we can conclude that p Nt(bβt β) = η2η1 D N(0d K, σ4G 1), (55) where G is specified in Equation (53). In the special case when Nt = 1 for all t, i.e., there is no interference, the asymptotic variance would degenerate to q=1 σ2 K (β) 0 0 e K (β) with K (β) = Z x xx n (1 κ (x))1{x (β0 β1) 0} + κ (x)1{x (β0 β1) < 0} o d Px e K (β) = Z x xx n (1 κ (x))1{x (β1 β0) 0} + κ (x)1{x (β1 β0) < 0} o d Px. which align perfectly with Ye et al. (2023) in the cases without interference. The proof of this theorem is complete. I. 
Proof of Theorem 4.4: the Asymptotic Normality of V π Recall that the DR optimal value function estimator we derived is given by b V DR T = 1 NT 1{ati = bπt 1(Xti)} 1 ˆκt 1(Xti) rsi bµ(t,i) t 1 (Xt, bπt 1(Xt)) + bµ(t,i) t 1 (Xt, bπt 1(Xt)) . (57) For the brevity of notation, we will omit the superscript in b V DR t in the following proof. Now we defined two related value functions e VT and VT as below: e VT = 1 NT 1{ati = bπt 1(Xti)} 1 κt 1(Xti) rti µ(t,i)(Xt, bπt 1(Xt)) + µ(t,i)(Xt, bπt 1(Xt)) , 1{ati = π (Xti)} P(ati = π (Xti)) rti µ(t,i)(Xt, π (Xt)) + µ(t,i)(Xt, π (Xt)) . The proof of this theorem can be decomposed into three steps. In step 1, we aim to prove b VT = e VT + op( N 1/2 T ). In step 2, we show that b VT = e VT + op( N 1/2 T ). In step 3, we show p NT ( VT V π ) D N(0, σ2 V ), where the variance term is given by σ2 V = Z σ2 1 κ (x)d Px + i,t ω2 ti NT Var π (x) x β1 + {1 π (x)} x β0 . Combining the above three steps, the proof of theorem 4.4 is thus complete. Now, let s detail the proof of step 1-3. Linear Contextual Bandits With Interference (Lin CBWI) Step 1: Prove that b VT = e VT + op( N 1/2 T ). Notice that the different between b VT and e VT lies in the estimation accuracy of (1) the propensity score function ˆκt 1 and (2) outcome estimation function bµ(t,i) t 1 . To simplify this problem, we introduce another intermediate value function VT as 1{ati = bπt 1(Xti)} 1 κt 1(Xti) rti bµ(t,i) t 1 (Xt, bπt 1(Xti)) + bµ(t,i) t 1 (Xt, bπt 1(Xti)) . Now the problem becomes proving (1) b VT = VT + op( N 1/2 T ), and (2) VT = e VT + op( N 1/2 T ). First, let s prove (1) b VT = VT + op( N 1/2 T ). 
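A minimal sketch of the DR value estimate defined in Equation (57), assuming the nuisance quantities (greedy actions, exploration probabilities, outcome-model predictions) are supplied as arrays; all names here are illustrative, not the paper's implementation.

```python
import numpy as np

def dr_value(actions, pi_hat, kappa_hat, rewards, mu_hat):
    """Doubly robust value estimate: the average over units of
    1{a = pi_hat(x)} / (1 - kappa_hat(x)) * (r - mu_hat) + mu_hat,
    where kappa_hat is the estimated exploration probability (so 1 - kappa_hat
    is the propensity of following the greedy policy) and mu_hat is the
    estimated outcome under pi_hat."""
    match = (actions == pi_hat).astype(float)
    return np.mean(match / (1.0 - kappa_hat) * (rewards - mu_hat) + mu_hat)

# sanity check: with a perfect outcome model and no exploration, the DR
# estimate reduces to the sample mean of the observed rewards
r = np.array([1.0, 2.0, 3.0])
a = np.zeros(3, dtype=int)
assert np.isclose(dr_value(a, a, np.zeros(3), r, r.copy()), r.mean())
```

The augmentation term keeps the estimator consistent when either the propensity or the outcome model (but not both) is misestimated, which is what the decomposition into $\widehat{V}_T$, $\widetilde{V}_T$, and $\bar{V}_T$ in this proof exploits.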
Notice that b VT VT = 1 NT 1{ati = bπt 1(Xti)} 1 ˆκt 1(Xti) 1{ati = bπt 1(Xti)} 1 κt 1(Xti) n rti bµ(t,i) t 1 (Xt, bπt 1(Xt)) o ˆκt 1(Xti) κt 1(Xti) 1{ati = bπt 1(Xti)} n rti bµ(t,i) t 1 (Xt, bπt 1(Xt)) o {1 ˆκt 1(Xti)}{1 κt 1(Xti)} This can be further decomposed to two parts: ˆκt 1(Xti) κt 1(Xti) 1{ati = bπt 1(Xti)} n rti µ(t,i)(Xt, bπt 1(Xt)) o {1 ˆκt 1(Xti)}{1 κt 1(Xti)} ˆκt 1(Xti) κt 1(Xti) 1{ati = bπt 1(Xti)} n µ(t,i)(Xt, bπt 1(Xt)) bµ(t,i) t 1 (Xt, bπt 1(Xt)) o {1 ˆκt 1(Xti)}{1 κt 1(Xti)} We first show that 1 is op(T 1/2). Similar to the proof of Theorem 3 in Ye et al. (2023), we define a class of measurable functions F(Xt, ati, rti) = ( ˆκt 1(Xti) κt 1(Xti) 1{ati = bπt 1(Xti)} n rti µ(t,i)(Xt, bπt 1(Xt)) o {1 ˆκt 1(Xti)}{1 κt 1(Xti)} : ˆκt, κt Λ, bπt Π where Λ and Π are two classes of functions mapping context Xti to a probability in [0, 1]. Denote the empirical measure Gn = n Pn(f Pf). Here, n = Q(t, i) is the sample index, which is determined by reordering the units i {1, . . . , Nt} according to time t. Denote z F = supf F |z(f)|. Therefore, Gn F := sup π Π F(Xt, ati, rti) E F(Xt, ati, rti) | Ht 1 . (58) Since µ(t,i) in F is correctly specified, we always have E F(Xt, ati, rti) | Ht 1 " ˆκt 1(Xti) κt 1(Xti) ( 1{ati = bπt 1(Xti)} rti µ(t,i)(Xt, bπt 1(Xt)) {1 ˆκt 1(Xti)}{1 κt 1(Xti)} " ˆκt 1(Xti) κt 1(Xti) ( 1{ati = bπt 1(Xti)} eti {1 ˆκt 1(Xti)}{1 κt 1(Xti)} Linear Contextual Bandits With Interference (Lin CBWI) According to the iteration of expectation, the equality above can be further derived as E F(Xt, ati, rti) | Ht 1 " ˆκt 1(Xti) κt 1(Xti) ( E[1{ati = bπt 1(Xti)}|Xti] {1 ˆκt 1(Xti)}{1 κt 1(Xti)} E[eti|Xti, At] where the last equality holds by E[ηti|Xti, At] = 0 according to the definition of noise ηti. Therefore, Equation (58) can be simplified as Gn F = sup π Π q Q(t,i) F(Xt, ati, rti) . Following Section 4.2 of Dedecker & Louhichi (2002), we define d1(f) := E |f(X1, a11, r11)| H0 , d2(f) := E (f(X1, a11, r11))2 H0 1/2 . 
First, we show that both d1(f) and d2(f) are finite numbers. For the brevity of content, we will take d2(f) as an example and d1(f) < can be proved similarly. In a valid bandits algorithm, the probability of exploration κt(Xti) is bounded away from 1. That is, there exists a constant C1 < 1, such that κt(Xti) = P(ati = bπt 1(Xti)) C1 < 1, and ˆκt(Xti) C1 < 1. Therefore, for any t {1, . . . , T}, E (f(Xt, ati, rti))2 Ht 1 "( ˆκt 1(Xti) κt 1(Xti) 1{ati = bπt 1(Xti)} rti µ(t,i)(Xt, bπt 1(Xt)) {1 ˆκt 1(Xti)}{1 κt 1(Xti)} "( ˆκt 1(Xti) κt 1(Xti) {1 ˆκt 1(Xti)}{1 κt 1(Xti)} 1{ati = bπt 1(Xti)} rti µ(t,i)(Xt, bπt 1(Xt)) 2 Ht 1 "( ˆκt 1(Xti) κt 1(Xti) {1 ˆκt 1(Xti)}{1 κt 1(Xti)} 1{ati = bπt 1(Xti)} E[η2 ti|Xt, At] Ht 1 Therefore, by Rosenthal s inequality derived for Martingale [see Dedecker & Louhichi (2002) for details], we have E [ Gn F] K d2(f) + 1 p NT max q Q(t,i) F(Xt, ati, rti) E F(Xt, ati, rti) | Ht 1 1 Since the right hand side is Op(T 1/2), we have q F(Xt, ati, rti) 1 p NT E [ Gn F] = Op(T 1) = op(T 1/2). (60) Linear Contextual Bandits With Interference (Lin CBWI) Now let s derive the order for 2. ˆκt 1(Xti) κt 1(Xti) 1{ati = bπt 1(Xti)} n µ(t,i)(Xt, bπt 1(Xt)) bµ(t,i) t 1 (Xt, bπt 1(Xt)) o {1 ˆκt 1(Xti)}{1 κt 1(Xti)} i=1 C2 ˆκt 1(Xti) κt 1(Xti) bµ(t,i) t 1 (Xt, bπt 1(Xt)) µ(t,i)(Xt, bπt 1(Xt)) ˆκt 1(Xti) κt 1(Xti) 2 1 NT bµ(t,i) t 1 (Xt, bπt 1(Xt)) µ(t,i)(Xt, bπt 1(Xt)) 2 = op( N 1/2 T ), (61) where the last line holds by Cauchy-Schwartz inequality, and the last line holds by Assumption A.4. Combining the results of Equation (60) and (61), we have b VT VT = 1 + 2 = op( N 1/2 T ) + op( N 1/2 T ) = op( N 1/2 T ). (62) Now the question becomes proving (2) VT = e VT + op( N 1/2 T ). VT e VT = 1 NT 1 1{ati = bπt 1(Xti)} 1 κt 1(Xti) n bµ(t,i) t 1 (Xt, bπt 1(Xt)) µ(t,i)(Xt, bπt 1(Xt)) o . 
Following similar structure as we prove 1 = op( N 1/2 T ), one can define a new class of functions F (Xt, ati, rti) = ( 1 1{ati = bπt 1(Xti)} 1 κt 1(Xti) n bµ(t,i) t 1 (Xt, bπt 1(Xt)) µ(t,i)(Xt, bπt 1(Xt)) o : bµ(t,i) t 1 , µ Λ, bπt Π and using Rosenthal s inequality for Martingale to prove that VT e VT = op( N 1/2 T ). Step 2: Prove that e VT = VT + op( N 1/2 T ). By definition of e VT and VT , we have p NT (e VT VT ) = 1 p NT 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1 n µ(t,i)(Xt, π (Xt)) µ(t,i)(Xt, bπt 1(Xt)) o 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1{ati = π (Xti)} P(ati = π (Xti)) n rti µ(t,i)(Xt, π (Xt)) o Step 2.1: We start from proving 3 = op(1). Since κt(Xti) C1 < 1, 1{ati=bπt 1(Xti)} 1 κt 1(Xti) 1 is upper bounded by a constant. Therefore, to prove 3 = op(1), it suffice to show that h µ(t,i)(Xt, π (Xt)) µ(t,i)(Xt, bπt 1(Xt)) i = op(1). (64) Linear Contextual Bandits With Interference (Lin CBWI) Before proceeding, let s break down this term to do some transformation. Notice that i=1 µ(t,i)(Xt, bπt 1(Xt)) = a A Wt,ij X tjβa 1{bπt 1(Xtj) = a} a A Wt,ji X tiβa 1{bπt 1(Xti) = a} j=1 Wt,ji o X tiβa 1{bπt 1(Xti) = a} a A ωti X tiβa 1{bπt 1(Xti) = a} where the second equality holds by switching the index of (i, j) to (j, i), and the third equality holds by Fubini s theorem. Going back to the previous equation, we have h µ(t,i)(Xt, π (Xt)) µ(t,i)(Xt, bπt 1(Xt)) i h ωti X tiβa 1{π (Xti) = a} ωti X tiβa 1{bπt 1(Xti) = a} i 1 a,a K ωti X ti(βa βa ) 1{π (Xti) = a, bπt 1(Xti) = a , a = a } 1 a,a K a =a ωti X ti(βa βa ) 1 ωti X ti(bβt 1,a bβt 1,a ) 0 Again, for the brevity of notation, we denote bζti = ωti X ti(bβt 1,a bβt 1,a ), and ζti = ωti X ti(βa βa ). Let s first consider the case where ζti > 0. The opposite scenario can be derived in a similar manner. When ζti > 0, the RHS of the above equation is thus equivalent to 1 a,a K a =a 1 bζti 0 ζti = 1 p NT 1 a,a K a =a 1 bζti 0 ζti. 
Since 1 bζti 0 bζti 0, we have µ(t,i)(Xt, π (Xt)) µ(t,i)(Xt, bπt 1(Xt)) 1 bζti 0 ζti 1 bζti 0 ζti bζti . To show that N 1/2 T PT t=1 PNt i=1 µ(t,i)(Xt, π (Xt)) µ(t,i)(Xt, bπt 1(Xt)) = op(1), it suffice to prove ζ := N 1/2 T i=1 1 bζti 0 ζti bζti = op(1). Linear Contextual Bandits With Interference (Lin CBWI) For any α (0, 1/2), we can further decompose ζ = P(0 < ζti < N α T ) 1 p NT i=1 1 bζti 0 ζti bζti 1{0 < ζti < N α T } + P(ζti N α T ) 1 p NT i=1 1 bζti 0 ζti bζti 1{ζti N α T } First, we show ζ1 = op(1). According to Theorem 4.3, bζti ζti = Op( N 1/2 t ) = op( N (1/2 αγ) t ) for any αγ > 0. Therefore, 1 p NT 1 bζti 0 ζti bζti 1{0 < ζti < N α T } 1 p NT ζti bζti = p NT op( N (1/2 αγ) T ) = op( N αγ T ), where the second last equality holds by Lemma 6 in Luedtke & Van Der Laan (2016). Since |ω| 1, by setting ϵ = N α T in Assumption A.3, we have P 0 < |ωf(X, a) ωf(X, a )| < N α T P 0 < |f(X, a) f(X, a )| < N α T = O( N αγ T ). Therefore, ζ1 = P(0 < ζti < N α T ) 1 p NT i=1 1 bζti 0 ζti bζti 1{0 < ζti < N α T } P(0 < ζti < N α T ) 1 p NT 1 bζti 0 ζti bζti 1{0 < ζti < N α T } O( N αγ T ) op( N αγ T ) = op(1). Next, we show ζ2 = op(1). Since 1{bζti 0} = 1{bζti ζti ζti} = 1{|bζti ζti| > ζti}, we have 1{bζti 0}(bζti ζti) = 1 |bζti ζti| > ζti bζti ζti| |bζti ζti| ζti bζti ζti = |bζti ζti|2 Since we assumed that ζti > 0, based on the result of Equation (67), we further have 1{bζti 0}(ζti bζti) |bζti ζti|2 as ζti bζti 0 always holds. Additionally, notice that 1{ζti N α T } ζti N α T . Therefore, ζ2 = P(ζti N α T ) 1 p NT i=1 1 bζti 0 ζti bζti 1{ζti N α T } |bζti ζti|2 ζti ζti N α T = N 1/2+α T 1 NT i=1 |bζti ζti|2. By Theorem 4.3, |bζti ζti| = Op( N 1/2 t ), which implies |bζti ζti|2 = Op( N 1 t ). According to Lemma 6 of Luedtke & Van Der Laan (2016), N 1 T PT t=1 PNt i=1 |bζti ζti|2 = Op( N 1 T ). Therefore, ζ2 N 1/2+α T 1 NT i=1 |bζti ζti|2 N 1/2+α T Op( N 1 T ) = op(1) (68) Linear Contextual Bandits With Interference (Lin CBWI) for any α < 1/2. 
Combining the result of Equation (66) and Equation (68), we have ζ = ζ1 + ζ2 = op(1) + op(1) = op(1). (69) Therefore, N 1/2 T PT t=1 PNt i=1 µ(t,i)(Xt, π (Xt)) µ(t,i)(Xt, bπt 1(Xt)) = op(1), and thus 3 = op(1). The proof of first part is done. Step 2.2: Next, we show that 4 = op(1) as well. Recall that 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1{ati = π (Xti)} P(ati = π (Xti)) n rti µ(t,i)(Xt, π (Xt)) o 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1{ati = π (Xti)} P(ati = π (Xti)) n rti µ(t,i)(Xt, π (Xt)) o 1{ati = π (Xti)} 1{ati = bπt 1(Xti)} 1 κt 1(Xti) n rti µ(t,i)(Xt, π (Xt)) o 1{ati = π (Xti)} 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1{ati = π (Xti)} P(ati = π (Xti)) ηti 1{ati = π (Xti)} 1{ati = bπt 1(Xti)} 1 κt 1(Xti) n µ(t,i)(Xt, bπt 1(Xt)) µ(t,i)(Xt, π (Xt)) o 1{ati = π (Xti)} 1{ati = bπt 1(Xti)} 1 κt 1(Xti) ηti 1{ati = π (Xti)} (70) We only need to show ζ3, ζ4 and ζ5 are all op(1). The proof for ζ5 is similar to that for ζ3 using Rosenthal s inequality for Martingales. Therefore, we will focus on proving ζ3 and ζ4, and omit the details for ζ5 for brevity. To prove ζ3 = op(1), we define a function class F(Xt, ati, ηti) = ( 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1{ati = π (Xti)} P(ati = π (Xti)) ηti 1{ati = π (Xti)} : κt Λ, bπt Π where Λ and Π are two classes of functions mapping context Xti to a probability in [0, 1]. Define the supremum of the empirical process indexed by F as Gn F := sup π Π F(Xt, ati, ηti) E F(Xt, ati, ηti) | Ht 1 . (71) Since E[ηti|Xti, At] = 0, according to the iteration of expectation, the second term in the above equation can be derived as E F(Xt, ati, ηti) | Ht 1 " 1{ati = bπt 1(Xti)} 1 κt 1(Xti) 1{ati = π (Xti)} P(ati = π (Xti)) 1{ati = π (Xti)} E[ηti|Xti, At] Ht 1 Linear Contextual Bandits With Interference (Lin CBWI) Therefore, Equation (71) can be simplified as Gn F = sup π Π q Q(t,i) F(Xt, ati, ηti) . 
Following a similar derivation structure as that used between Equation (58) and Equation (59) in Step 1, we have ζ3 E [ Gn F] K d2(f) + 1 p NT max q Q(t,i) F(Xt, ati, ηti) E F(Xt, ati, ηti) | Ht 1 1 = op(1). (72) Next, let s prove ζ4 = op(1). Since both 1{ati=bπt 1(Xti)} 1 κt 1(Xti) and 1{ati = π (Xti)} can be upper bounded, it suffice to prove that 1 p NT n µ(t,i)(Xt, bπt 1(Xt)) µ(t,i)(Xt, π (Xt)) o = op(1), (73) which has already been established in Equation (64) in Step 2.1. Therefore, ζ4 = op(1). Combining the results above, we have 4 = ζ3 + ζ4 + ζ5 = op(1). (74) The proof of Step 2 is thus complete. Step 3: Prove that p NT ( VT V π ) D N(0, σ2 V ) and derive the asymptotic variance σ2 V . Recall that 1{ati = π (Xti)} P(ati = π (Xti)) rti µ(t,i)(Xt, π (Xt)) + µ(t,i)(Xt, π (Xt)) 1{ati = π (Xti)} P(ati = π (Xti)) ηti + µ(t,i)(Xt, π (Xt)) . Given the derivation of Equation (65), we have i=1 µ(t,i)(Xt, π (Xt)) = a A ωti X tiβa 1{π (Xti) = a}. (75) Combining the above term with the expression of VT , we have ( 1{ati = π (Xti)} P(ati = π (Xti)) ηti + X a A ωti X tiβa 1{π (Xti) = a} To decompose, we define ξq := 1{aq = π (Xq)} P(aq = π (Xq)) ηq | {z } ξ1q a A ωti X tiβa 1{π (Xti) = a} V π i where q denotes an unit in a flattened unit queue Q(t, i) = Pt 1 s=1 Ns + i. Similar to the the proof of Theorem 4.3, we define Hq as the σ algebra containing the information up to unit q where Hq0 = σ(v f X1η1, . . . , v f Xq0ηq0). Since E[ξ2q] = E h X a A ωti X tiβa 1{π (Xti) = a} i V π = 0, Linear Contextual Bandits With Interference (Lin CBWI) it holds that E[ξ2q|Hq 1] = 0. Additionally, notice that E[ξ1q|Hq 1] = E 1{aq = π (Xq)} P(aq = π (Xq)) ηq Hq 1 = E 1{aq = π (Xq)} P(aq = π (Xq)) E[ηq|Hq 1, At] Hq 1 Thus, E[ξq|Hq 1] = 0, and {ξq}1 q NT is a Martingale difference sequence. To show that p NT ( VT V π ) D N(0, σ2 V ) and derive the asymptotic variance σ2 V , it suffice to check the Lindeberg condition and use Martingale CLT to establish asymptotic normality. 
(1) First, let us check the Lindeberg condition. For any $\delta > 0$,
$$\frac{1}{N_T} \sum_{q=1}^{N_T} E\left[ \xi_q^2\, \mathbb{1}\Big\{ \tfrac{1}{\sqrt{N_T}} |\xi_q| \ge \delta \Big\} \,\Big|\, H_{q-1} \right] = \frac{1}{N_T} \sum_{q=1}^{N_T} E\left[ \xi_q^2\, \mathbb{1}\{ \xi_q^2 \ge N_T \delta^2 \} \,\Big|\, H_{q-1} \right].$$
Notice that $\xi_q^2\, \mathbb{1}\{\xi_q^2 \ge N_T \delta^2\}$ converges to $0$ as $N_T$ goes to infinity and is bounded by $\xi_q^2$ given $H_{q-1}$. Therefore, we only need to check the integrability of $\xi_q^2$ given $H_{q-1}$; then by the Dominated Convergence Theorem (DCT), the Lindeberg condition is verified. Since the derivation of $E[\xi_q^2 \mid H_{q-1}]$ yields exactly the asymptotic variance $\sigma_V^2$, we leave the details to part (2).

(2) Next, we derive the limit of the conditional variance $\sigma_V^2 = \lim \frac{1}{N_T} \sum_{q=1}^{N_T} E[\xi_q^2 \mid H_{q-1}]$. First,
$$E[\xi_{1q}^2 \mid H_{q-1}] = E\left[ \frac{\mathbb{1}\{a_q = \pi^*(X_q)\}}{[P\{a_q = \pi^*(X_q)\}]^2} \eta_q^2 \,\Big|\, H_{q-1} \right] = E\left[ \frac{\mathbb{1}\{a_q = \pi^*(X_q)\}}{[P\{a_q = \pi^*(X_q)\}]^2} E[\eta_q^2 \mid H_{q-1}, A_t] \,\Big|\, H_{q-1} \right] = E\left[ \frac{\mathbb{1}\{a_q = \pi^*(X_q)\}}{[P\{a_q = \pi^*(X_q)\}]^2} \sigma^2 \,\Big|\, H_{q-1} \right],$$
so that
$$\frac{1}{N_T} \sum_{q=1}^{N_T} E[\xi_{1q}^2 \mid H_{q-1}] = \sigma^2\, E\left[ \frac{\nu_q(X_q, H_{t-1})}{[P\{a_q = \pi^*(X_q)\}]^2} \right],$$
where $\nu_q(X_q, H_{t-1}) = \nu_{ti}(X_{ti}, H_{t-1}) = P(A_{ti} = \pi^*(X_{ti}) \mid X_{ti}, H_{t-1})$. Following a similar proof structure to Ye et al. (2023), Appendix pages 34–35, we can establish that
$$\frac{1}{N_T} \sum_{q=1}^{N_T} E[\xi_{1q}^2 \mid H_{q-1}] = \sigma^2\, E\left[ \frac{\nu_q(X_q, H_{t-1})}{[P\{a_q = \pi^*(X_q)\}]^2} \right] \to \sigma^2 \int \frac{1}{\kappa^*(x)}\, d\mathbb{P}_x,$$
where $\kappa^*(x) = \lim_{q \to \infty} P(A_{ti} = \pi^*(x))$. Since $E[\xi_{2q}] = 0$ and the randomness in $\xi_{2q}$ comes only from $(X, \omega)$, we have
$$E[\xi_{2q}^2 \mid H_{q-1}] = \mathrm{Var}(\xi_{2q}) = \mathrm{Var}\Big( \sum_{a \in \mathcal{A}} \omega_{ti}\, X_{ti}^\top \beta_a\, \mathbb{1}\{\pi^*(X_{ti}) = a\} \Big).$$
Furthermore, the cross term vanishes:
$$E[\xi_{1q} \xi_{2q} \mid H_{q-1}] = E\left[ \xi_{2q} \frac{\mathbb{1}\{a_q = \pi^*(X_q)\}}{P(a_q = \pi^*(X_q))} \eta_q \,\Big|\, H_{q-1} \right] = E\left[ \xi_{2q} \frac{\mathbb{1}\{a_q = \pi^*(X_q)\}}{P(a_q = \pi^*(X_q))} E[\eta_q \mid H_{q-1}, A_t] \,\Big|\, H_{q-1} \right] = 0.$$
Therefore,
$$\frac{1}{N_T} \sum_{q=1}^{N_T} E[\xi_q^2 \mid H_{q-1}] = \frac{1}{N_T} \sum_{q=1}^{N_T} E[(\xi_{1q} + \xi_{2q})^2 \mid H_{q-1}] = \frac{1}{N_T} \sum_{q=1}^{N_T} E[\xi_{1q}^2 \mid H_{q-1}] + \frac{1}{N_T} \sum_{q=1}^{N_T} E[\xi_{2q}^2 \mid H_{q-1}] + \frac{2}{N_T} \sum_{q=1}^{N_T} E[\xi_{1q} \xi_{2q} \mid H_{q-1}]$$
$$\to \sigma^2 \int \frac{1}{\kappa^*(x)}\, d\mathbb{P}_x + \mathrm{Var}\Big( \sum_{a \in \mathcal{A}} \omega_{ti}\, x^\top \beta_a\, \mathbb{1}\{\pi^*(x) = a\} \Big).$$
That is,
$$\sigma_V^2 = \int \frac{\sigma^2}{\kappa^*(x)}\, d\mathbb{P}_x + \mathrm{Var}\Big( \sum_{a \in \mathcal{A}} \omega\, x^\top \beta_a\, \mathbb{1}\{\pi^*(x) = a\} \Big). \tag{78}$$
Finally, by combining the results of Steps 1–3, we show that $\sqrt{N_T}(\hat{V}_T^{DR} - V^{\pi^*}) \xrightarrow{D} N(0, \sigma_V^2)$, which concludes the proof of this theorem.

J. Proof of Regret Bound

Step 1: Decompose $R_T = R_T^{(1)} + R_T^{(2)}$, which accounts for the regret of exploitation and exploration, respectively.
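The two-part structure of $\sigma_V^2$ — an inverse-propensity noise term plus the variance of the plug-in value — can be verified with a Monte-Carlo sketch. The simulation below is a simplified toy model (the linear reward, the coefficients `beta`, and the constant optimal-arm probability `kappa_star` are all hypothetical): it repeatedly forms an IPW-plus-plug-in value estimate and compares the empirical variance of $\sqrt{n}(\hat V - V^{\pi^*})$ against $\sigma^2/\kappa^* + \mathrm{Var}(X^\top\beta)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5_000, 2_000
kappa_star = 0.6                 # hypothetical limiting prob. of pulling the optimal arm
sigma = 0.5                      # reward noise s.d.
beta = np.array([1.0, -0.5])     # hypothetical reward coefficients under pi*

def one_estimate():
    X = rng.normal(size=(n, 2))
    mu = X @ beta                              # plug-in mean reward under pi*
    a_match = rng.random(n) < kappa_star       # 1{a_q = pi*(X_q)}
    eta = rng.normal(0.0, sigma, size=n)
    # IPW residual plus plug-in mean, averaged over units
    return np.mean(a_match / kappa_star * eta + mu)

v_true = 0.0                                   # E[X] = 0, so V^{pi*} = 0 in this toy model
est = np.array([one_estimate() for _ in range(reps)])
emp_var = n * np.mean((est - v_true) ** 2)     # empirical Var of sqrt(n)(V_hat - V)
theory_var = sigma**2 / kappa_star + beta @ beta   # sigma^2/kappa* + Var(X beta)
print(emp_var, theory_var)
```

The close agreement illustrates why both terms of (78) are needed: dropping either the weighting term or the plug-in variance would understate the asymptotic variance of the value estimator.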
Recall that the regret $R_T$ is defined as
$$R_T = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \mu_{(t,i)}(X_t, \pi^*(X_t)) - \mu_{(t,i)}(X_t, A_t) \right] = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \mu_{(t,i)}(X_t, \pi^*(X_t)) - \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) \right] + \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) - \mu_{(t,i)}(X_t, A_t) \right],$$
which can be decomposed into
$$R_T^{(1)} = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \mu_{(t,i)}(X_t, \pi^*(X_t)) - \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) \right], \qquad R_T^{(2)} = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \left\{ \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) - \mu_{(t,i)}(X_t, A_t) \right\} \mathbb{1}\{A_{ti} \ne \hat{\pi}_{t-1}(X_{ti})\} \right].$$
By definition, $R_T^{(1)}$ is nonzero only when $\pi^*(X_{ti}) \ne \hat{\pi}_{t-1}(X_{ti})$, which accounts for the regret caused by estimation accuracy, i.e., exploitation. $R_T^{(2)}$ is nonzero only when $A_{ti} \ne \hat{\pi}_{t-1}(X_{ti})$, which accounts for the regret caused by exploration. In Steps 2–3, we derive the regret bounds of $R_T^{(1)}$ and $R_T^{(2)}$ separately to prove the sublinearity of $R_T$.

Step 2: Prove that $R_T^{(1)} = o(N_T^{1/2})$. Notice that in the proof of Theorem 4.4, Step 2.1, we have proved in Equation (64) that
$$\frac{1}{\sqrt{N_T}} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \left[ \mu_{(t,i)}(X_t, \pi^*(X_t)) - \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) \right] = o_p(1).$$
Therefore,
$$R_T^{(1)} = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \mu_{(t,i)}(X_t, \pi^*(X_t)) - \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) \right] = o(N_T^{1/2}).$$

Step 3: Prove that $R_T^{(2)} = O(N_T^{1/2} \log N_T)$. According to the upper bound derived in Theorem 4.2,
$$R_T^{(2)} = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \left\{ \mu_{(t,i)}(X_t, \hat{\pi}_{t-1}(X_t)) - \mu_{(t,i)}(X_t, A_t) \right\} \mathbb{1}\{A_{ti} \ne \hat{\pi}_{t-1}(X_{ti})\} \right] \le 2U \sum_{t=1}^{T} \sum_{i=1}^{N_t} E\left[ \mathbb{1}\{A_{ti} \ne \hat{\pi}_{t-1}(X_{ti})\} \right] = 2U \sum_{t=1}^{T} \sum_{i=1}^{N_t} \kappa_{ti}(\omega_{ti}, X_{ti}).$$
Now let us bound this quantity for the different exploration algorithms. For simplicity of notation, we continue with the flattened unit queue $q = Q(t,i) = \sum_{s=1}^{t-1} N_s + i$, as in the proof of Theorem 4.1. As such, we can extend the definition of $p_t$ to $p_q$ by simply setting $p_q = p_t$ for any unit $q$ in round $t$; $p_q$ is then still a non-increasing sequence w.r.t. $q$. By Theorem 4.2, we have:

1. In UCB, $\kappa_{ti}(\omega_{ti}, X_{ti})$ is upper bounded by $O(K^2 L_w^{\gamma} (N_{q-1} p_{q-1})^{-\gamma/2})$. Let $p_q = N_q^{u/\gamma - 1}$ with $u > 0$; for $\gamma > u$, $p_q$ is decreasing. Then
$$\sum_{t=1}^{T} \sum_{i=1}^{N_t} \kappa_{ti}(\omega_{ti}, X_{ti}) \lesssim L_w^{\gamma} K^2 \sum_{q=1}^{N_T} q^{-u/2} = O(L_w^{\gamma} K^2 N_T^{1 - u/2}).$$
Taking $u = 1$ gives $R_T^{(2)} = O(L_w^{\gamma} K^2 N_T^{1/2}) = O(N_T^{1/2})$. Note that when the interference constraint $L_w$ or the number of arms $K$ is large, the regret bound $R_T^{(2)}$ tends to be larger.
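The UCB step above reduces the exploration regret to the partial sum $\sum_{q=1}^{N_T} q^{-u/2}$, which with $u = 1$ grows like $2\sqrt{N_T}$. This is elementary calculus, but the rate is easy to confirm numerically; the sketch below evaluates the sum at a few horizons and checks that the ratio to $\sqrt{N_T}$ approaches 2 (the integral constant of $\int_0^{N} q^{-1/2} dq$).

```python
import numpy as np

# With p_q = N_q^{u/gamma - 1}, the per-unit exploration bound scales as q^{-u/2};
# for u = 1 the cumulative exploration regret behaves like sum q^{-1/2} ~ 2*sqrt(N_T).
u = 1.0
ratios = []
for n_total in (10_000, 100_000, 1_000_000):
    q = np.arange(1, n_total + 1, dtype=float)
    s = np.sum(q ** (-u / 2))
    ratios.append(s / np.sqrt(n_total))   # ratio approaches 2 as n_total grows
print(ratios)
```

Since the sum is $O(N_T^{1/2})$ rather than $O(N_T)$, the exploration schedule is aggressive enough to keep the regret sublinear while still sampling every arm infinitely often.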
In TS, $\kappa_{ti}(\omega_{ti}, X_{ti})$ is upper bounded by $O(d K^3 \exp\{-N_{q-1} p_{q-1}^2 / L_w^4\})$. Let $p_q = \sqrt{\alpha \log q / N_q}$. Then
$$\kappa_{ti}(\omega_{ti}, X_{ti}) \le O(d K^3 \exp\{-\alpha \log(q-1)/L_w^4\}) = O(d K^3 q^{-\alpha/L_w^4}).$$
Thus,
$$\sum_{t=1}^{T} \sum_{i=1}^{N_t} \kappa_{ti}(\omega_{ti}, X_{ti}) \lesssim d K^3 \sum_{q=1}^{N_T} q^{-\alpha/L_w^4} = O(d K^3 N_T^{1 - \alpha/L_w^4}).$$
When the interference constraint $L_w$ is large, the regret bound $R_T^{(2)}$ tends to be larger. By taking $\alpha = L_w^4/2$, we have $R_T^{(2)} = O(d K^3 N_T^{1/2}) = O(N_T^{1/2})$.

3. In EG, $\kappa_{ti}(\omega_{ti}, X_{ti}) = \epsilon_q/2$. If we set $\epsilon_q = O(q^{-m})$ with any $m < 1/2$, then
$$\sum_{t=1}^{T} \sum_{i=1}^{N_t} \kappa_{ti}(\omega_{ti}, X_{ti}) \lesssim \sum_{q=1}^{N_T} q^{-m} = O(N_T^{1-m}).$$
By setting $\epsilon_q = O(\log q / \sqrt{q})$, we have $R_T^{(2)} = O(N_T^{1/2} \log N_T)$.

Thus, in UCB, TS, and EG, the regret caused by exploration can be controlled at $R_T^{(2)} = O(N_T^{1/2} \log N_T)$. Therefore, by combining the results of Steps 1–3, we show that
$$R_T = \sum_{t=1}^{T} \sum_{i=1}^{N_t} E[R_{ti}^* - R_{ti}] = O(N_T^{1/2} \log N_T),$$
which is sublinear in $N_T$.
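The $\epsilon$-greedy case can likewise be checked numerically. The sketch below (schedule choices are illustrative, not tuned) sums $\epsilon_q/2$ with $\epsilon_q = \log q / \sqrt{q}$ and verifies that the total exploration probability stays within a constant factor of $\sqrt{N_T}\log N_T$, matching the stated $O(N_T^{1/2}\log N_T)$ bound.

```python
import numpy as np

# Epsilon-greedy with eps_q = log(q)/sqrt(q): the cumulative exploration
# probability sum grows like sqrt(N_T) * log(N_T).
ratios = []
for n_total in (10_000, 100_000, 1_000_000):
    q = np.arange(2, n_total + 1, dtype=float)   # start at 2 so log(q) > 0
    s = np.sum(np.log(q) / np.sqrt(q)) / 2.0     # kappa_q = eps_q / 2
    ratios.append(s / (np.sqrt(n_total) * np.log(n_total)))
print(ratios)   # bounded ratios: the sum is O(sqrt(N_T) log N_T)
```

The bounded, slowly varying ratio confirms that the extra $\log N_T$ factor in the EG bound is real but mild, which is why all three exploration schemes share the common envelope $O(N_T^{1/2}\log N_T)$.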