# mutliarmed_bandits_with_network_interference__cca2ac20.pdf

Multi-Armed Bandits with Network Interference

Abhineet Agarwal Department of Statistics UC Berkeley aa3797@berkeley.edu

Anish Agarwal Department of IEOR Columbia University aa5194@columbia.edu

Lorenzo Masoero Amazon masoerl@amazon.com

Justin Whitehouse Computer Science Department Carnegie Mellon University jwhiteho@andrew.cmu.edu

Online experimentation with interference is a common challenge in modern applications such as e-commerce and adaptive clinical trials in medicine. For example, in online marketplaces, the revenue of a good depends on discounts applied to competing goods. Statistical inference with interference is widely studied in the offline setting, but far less is known about how to adaptively assign treatments to minimize regret. We address this gap by studying a multi-armed bandit (MAB) problem where a learner (e-commerce platform) sequentially assigns one of possible A actions (discounts) to N units (goods) over T rounds to minimize regret (maximize revenue). Unlike traditional MAB problems, the reward of each unit depends on the treatments assigned to other units, i.e., there is interference across the underlying network of units. With A actions and N units, minimizing regret is combinatorially difficult since the action space grows as AN. To overcome this issue, we study a sparse network interference model, where the reward of a unit is only affected by the treatments assigned to s neighboring units. We use tools from discrete Fourier analysis to develop a sparse linear representation of the unit-specific reward rn : [A]N R, and propose simple, linear regression-based algorithms to minimize regret. Importantly, our algorithms achieve provably low regret both when the learner observes the interference neighborhood for all units and when it is unknown. This significantly generalizes other works on this topic which impose strict conditions on the strength of interference on a known network, and also compare regret to a markedly weaker optimal action. Empirically, we corroborate our theoretical findings via numerical simulations.

1 Introduction

Online experimentation is an indispensable tool for modern decision-makers in settings ranging from e-commerce marketplaces [Li et al., 2016] to adaptive clinical trials in medicine [Durand et al., 2018]. Despite the wide-spread use of online experimentation to assign treatments to units (e.g., individuals, subgroups, or goods), a significant challenge in these settings is that outcomes of one unit are often affected by treatments assigned to other units. That is, there is interference across the underlying network of units. For example, in e-commerce, the revenue for a given good depends on discounts applied to related or competing goods. In medicine, an individual s risk of disease depends not only on their own vaccination status but also on that of others in their network.

The research presented in this paper was conducted independently and is entirely unrelated to the author s current appointment at Amazon.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

Figure 1: A visual representation of sparse network interference. In this toy example, we have N = 9 units, and visualize the interference pattern. For unit 2 (orange), its outcomes are affected by the treatments of its neighbours (blue) N(2) = {1, 2, 3, 6, 7}.

Network interference often invalidates standard tools and algorithms for the design and analysis of experiments. While there has been significant work done to develop tools for statistical inference in the offline setting (see Section 2), this problem has mostly been unaddressed in the online learning setting. In this paper, we address this gap by studying the multi-armed bandit (MAB) problem with network interference. We consider the setting where a learner (online marketplace) assigns one of possible A N actions (varying discounts) to N units (goods) over T rounds to minimize average regret. In our setting, the reward of a unit n [N] := {1, . . . , N} depends on the actions assigned to other units.2 With N units and A actions, achieving low regret is difficult since there are AN possible treatment assignments. Naively applying typical MAB methods such as the upper confidence bound (UCB) algorithm [Auer et al., 2002] leads to regret that scales as O(

ANT), which can be prohibitively large due to the exponential dependence on N. Further, without any assumptions on the interference pattern, regret scaling as eΩ(

ANT) is unavoidable due to lower bounds from the MAB literature [Lattimore and Szepesvári, 2020].

To overcome this issue, we consider a natural and widely-studied model of sparse network interference, where the reward rn : [A]N R for unit n is affected by the treatment assignment of at most s other units, i.e., neighbours. See Figure 1 for a visualization. Under this model, we provide algorithms that provably achieve low regret both when the learner observes the network (i.e., the learner knows the s neighbors for all units n), and when it is unknown. Our results allow for more general interference patterns and define regret with respect to a significantly stronger comparator policy than existing results in the literature. Contributions.

(i) For each unit n [N], we use the Fourier analysis of discrete functions to re-express its reward rn : [A]N : R as a linear function in the Fourier basis with coefficients θn RAN . We show sparse network interference implies θn is As sparse for all n [N]. This sparse linear representation motivates a simple explore-then-commit style algorithm that uniformly explores actions, then fits a linear model to estimate unit-specific rewards (i.e., θn). (ii) With known interference (i.e., the learner knows the s neighbors for all n [N]), our algorithm exploits this knowledge to estimate rn by performing ordinary least squares (OLS) locally (i.e., per unit) on the Fourier basis elements where θn is non-zero. Our analysis establishes regret O((As T)2/3) for this algorithm. (iii) With unknown interference, we use the Lasso instead of OLS locally which adapts to sparsity of θn and establish regret O(N 1/3(As T)2/3). We argue this T 2/3 scaling cannot be improved. (iv) Numerical simulations with network interference show our method outperforms baselines.

2 Related Work

Causal inference and bandits with interference. The problem of learning causal effects in the presence of cross-unit interference has received significant study from the causal inference com-

2For any positive integer x, we let [x] := {1, . . . x}.

munity (see [Bajari et al., 2023] for a thorough overview). Cross-unit interference violates basic assumptions for causal identifiability, invalidating standard designs and analyses.3 As a result, authors have developed methodologies for estimating causal effects under several models of interference such as intra-group interference [Hudgens and Halloran, 2008, Rosenbaum, 2007], interference neighborhoods [Gao and Ding, 2023, Ugander et al., 2013, Bhattacharya et al., 2020, Yu et al., 2022, Cen et al., 2022], in bipartite graphs representative of modern online markets [Pouget-Abadie et al., 2019, Bajari et al., 2021, 2023], in panel data settings [Agarwal et al., 2022] as well as under a general model of interference, generally encoded via exposure mappings [Aronow, 2012, Aronow and Samii, 2017]. Despite this large literature, there is much less work on learning with interference in online settings. Jia et al. [2024] take an important step towards addressing this gap by studying MABs with network interference, but assume a known, grid-like interference pattern, where the strength of the interference decays as the ℓ1 distance between units grows. Moreover, their focus unlike ours is on establishing regret rates with respect to the best constant policy, i.e. the best policy that assigns each unit the same treatment. We also note that the authors consider a setting more closely aligned with the adversarial bandit literature, whereas the results in this paper are closer to those in the stochastic bandit literature. See Section 3 for a detailed description of these differences.

Bandits with high-dimensional action spaces. In MAB problems, regret is typically lower bounded by eΩ( #Actions T), where #Actions = AN in our setting. Typically, this curse of dimensionality is addressed by sparsity constraints on the rewards, where only a small fraction of actions have non-zero rewards [Kwon et al., 2017, Abbasi-Yadkori et al., 2012, Hao et al., 2020]. Particularly relevant to this paper is the work of Hao et al. [2020] who consider sparse linear bandits. The authors utilize a explore-then-commit style algorithm to uniformly explore actions before using the Lasso to estimate the sparse linear parameter. We utilize a similar algorithm but allow for arbitrary interaction between neighboring units, instead using discrete Fourier analysis to linearly represent rewards [Negahban and Shah, 2012, O Donnell, 2014, Agarwal et al., 2023]. This is similar to kernel bandits [Srinivas et al., 2009, Chowdhury and Gopalan, 2017, Whitehouse et al., 2024], which assume there exists a feature map such that the rewards can be linearly represented (non-sparsely) in a high-dimensional reproducing kernel Hilbert space. Also related are stochastic combinatorial bandits [Chen et al., 2013, Cesa-Bianchi and Lugosi, 2012], in which the action space is assumed to be a subset of {0, 1}N but rewards are typically inherently assumed to be linear in treatment assignments. That is, these works typically assume the reward r = θ, a for a {0, 1}N, with valid actions a often having at most s non-zero components. Our work (with A = 2), considers an arbitrary function r : {0, 1}N R, but explicitly constructs a feature map via discrete Fourier analysis such that rewards can be represented linearly.

3 Model & Background

In this section, we first describe the problem setting, and our notion of regret. Then, we introduce the requisite background on discrete Fourier analysis that we will use to motivate our algorithm and theoretical analysis. Last, we introduce the model that we study in this paper. Throughout this paper, we use boldface to represent vectors and matrices.

3.1 Problem Set-up

We consider an agent that sequentially interacts with an environment consisting of N individual units over a series of T rounds. We index units n [N], and rounds t [T]. At each time step t, the agent simultaneously administers each unit n action (or treatment) a [A]. Let ant denote the treatment received by unit n at time step t, and let at = (a1t, . . . , a Nt) [A]N denote the entire treatment vector. Each unit n possesses an unknown reward mapping rn : [A]N [0, 1]. Note that we allow the reward for a given unit n to depend on the treatments assigned to all other units, i.e., we allow for cross-unit interference. After assigning a treatment to all units in round t, the agent then observes the noisy reward for unit n as Rnt = rn(at) + ϵnt. Denote the vector of observed rewards as Rt := (R1t . . . RNt). We assume the following standard condition on the noise ϵnt.

Assumption 1. (ϵnt : n [N], t [T]) is a collection of mutually independent 1-sub-Gaussian random variables.

3Specifically, it violates the stable unit treatment value assumption (SUTVA) [Rubin, 1978].

Regret. To measure the performance of the learning agent, we define the average reward function r : [A]N [0, 1] by r(a) = 1

N PN n=1 rn(a). Then, for a sequence of (potentially random) treatment assignments a1 . . . a T , the regret at the horizon time T is defined as the quantity

t=1 r(at), (1)

where a arg maxa [A]N r(a). In Sections 4 and 5, we provide and analyse algorithms that achieve small regret with high probability.

Comparison to other works. Previous works studying network bandits such as Jia et al. [2024] measure regret with respect to the best constant action a := arg maxa [A] r(a1) where 1 RN denotes the all 1 vector of dimension N. We compare regret to the optimal action a arg maxa [A]N r(a), which is combinatorially more difficult to minimize since the policy space is exponentially larger (AN vs A). Our setup is also different than the traditional MAB setting since the agent in this problem does not observe a single scalar reward, but one for each unit (similar to semi-bandit feedback in the combinatorial bandits literature [Cesa-Bianchi and Lugosi, 2012]). As we show later, this crucially allows us to exploit local, unit-specific information that allow for better regret rates.

3.2 Background on Discrete Fourier Analysis

In this section, we provide background on discrete Fourier analysis, which we heavily employ in both our algorithm and analysis. Specifically, these Fourier-analytic tools provide a linear representation of the discrete unit-specific rewards rn : [A]N [0, 1], which will allow us to leverage well-studied linear bandit algorithms. For the rest of paper, assume A is a power of 2. If instead, if 2ℓ< A < 2ℓ+1 for some ℓ 0, we can redundantly encode actions to obtain A = 2ℓ+1 total treatments. As seen later, this encoding does not affect the overall regret.

Boolean encoding of action space. Since by assumption A is a power of 2, every action a [A] can be uniquely represented as a binary number using log2(A) bits. Explicitly, let v(a) = ( v1(a), . . . vlog2(A)(a)) {0, 1}log2(A) denote this vectorized binary representation. For ease of analysis, we use the Boolean representation instead v(a) = 2 v(a) 1 { 1, 1}log2 A. For a [A]N, define v(a) = (v(a1), . . . , v(a N)) { 1, 1}N log2(A). Note each action a [A]N corresponds to a unique Boolean vector v(a).

Boolean representation of discrete functions. Let F = {f : [A]N R} and FBool = { f : { 1, 1}N log2(A) R} be the collection of all real-values functions defined on the set [A]N

and { 1, 1}N log2(A) respectively. Since every a [A]N has a uniquely Boolean representation v(a) { 1, 1}N log2(A), the set of functions F can be naturally identified within FBool. Specifically, any f F can be identified with the function f FBool by f( ) = f(v( )).

Fourier series of Boolean functions. This identification is key for our use since the space of Boolean functions admits a number of attractive properties.

(1) Hilbert space. FBool forms a Hilbert space defined by the following inner product: for any h, g FBool, h, g B = A N P

x { 1,1}N log2(A) h(x)g(x). This inner product induces the norm h, h B := h 2 B = A N P

x { 1,1}p h2(x).

(2) Simple orthonormal basis. For each subset S [N log2(A)], define a basis function χS(x) = Q

i S xi where xi is the ith coefficient of x { 1, 1}N log2(A). One can verify that for any S [N log2(A)], χS B = 1, and that χS, χS B = 0 for any S = S. Since |{χS : S [N log2(A)]}| = AN, the functions χS are an orthonormal basis of Fbool. We refer to χS as the Fourier character for the subset S.

(3) Linear Fourier expansion of FBool. Any h Fbool can be expanded via the following Fourier decomposition: h(x) = P

S [N log2(A)] θSχS(x), where the Fourier coefficient θS is given by θS =

h, χS B. For h FBool, we refer to θh = (θS : S [N log2(A)]) RAN as the vector of Fourier coefficients associated with it. For x { 1, 1}N log2(A), let χ(x) = (χS(x) : S [N log2(A))])

{ 1, 1}AN be the vector of associated Fourier character outputs. For a [A]N, abbreviate χS(v(a)) and χ(v(a)) as χS(a) and χ(a) respectively.

3.3 Model: Sparse Network Interference

The unit-specific reward function rn : [A]N R can be equivalently viewed as a real-valued Boolean function over the hypercube { 1, 1}N log2(A). That is, rn takes as input a vector of actions a [A]N, converts it to a Boolean vector v(a) { 1, 1}N log2(A), and outputs a reward rn(a). From the discussion in Section 3.2, we can represent unit n s reward as rn(a) = P

S [N log2(A)] θn,SχS(a) =

θn, χ(a) , where θn = (θn,S : S N log2(A)) RAN is a vector of Fourier coefficients.

Without any assumptions on the nature of the interference pattern, achieving low regret is impossible since it requires estimating AN Fourier coefficients per unit. To overcome this fundamental challenge, we impose a natural structure on the interference pattern which assumes that the reward rn only depends on the the treatment assignment of a subset of s units. This assumption is often observed in practice, e.g., the revenue of a good does not depend on discounts applied to all other goods, but only those applied to similar or related ones. Assumption 2. (Sparse Network Interference) For any unit n [N], there exists a neighborhood N(n) [N] of size |N(n)| s such that rn(a) = rn(b) for all a, b { 1, 1}N log2 A satisfying (am : m N(n)) = (bm : m N(n)).

We typically assume that n N(n), i.e. unit n s reward depends on its own treatment. This model allows for completely arbitrary interference between these s units, generalizing the results of Jia et al. [2024] who allow for interaction between all N units but assume the strength of interference decays with a particular notion of distance between units. Next, we show using our Fourier analytic tools, that Assumption 2 implies that the reward can be re-expressed as a sparse linear model. We prove the following in Appendix A. Proposition 3.1. Let Assumption 2 hold. Then, for any unit n, and action a [A]N, we have the following representation of the reward rn(a) = θn, χ(a) , where θn 0 As.4

Proposition 3.1 shows sparse network interference implies θn is As sparse with non-zero coordinates corresponding to the interactions of treatments between units in N(n). Indeed, the Boolean encoding v(a) can be represented as blocks of log2(A) dimensional Boolean vectors:

v(a) = (v(a)1:log2(A) | {z } Unit 1 s treatment

, . . . , v(a)(i 1) log2(A)+1:i log2(A) | {z } Unit i s treatment

, . . . , v(a)(N 1) log2(A)+1:N log2(A) | {z } Unit N s treatment

Unit n s reward depends on a small collection of these blocks, those indexed by its neighbors. Define

B(n) := n i [N log2(A)] : i [(m 1) log2(A) + 1 : m log2(A)] for some m N(n) o .

B(n) contains the indices of v(a) corresponding to treatments of units m N(n) and the non-zero entries of θn are indexed by subsets S B(n). E.g., consider N = 3, A = 2, with N(1) = {1, 2}. Then B(1) = {1, 2} and S B(n) = { , {1}, {2}, {1, 2}}, where is the empty set.

Graphical interpretation. Assumption 2 can be interpreted graphically as follows. Let G = ([N], E) denote a directed graph over the N units, where E [N] [N] denotes the edges of G. For unit n, we add to the edge set E a directed edge (n, m) for each m N(n), thus justifying calling N(n) the neighborhood of n. That is, unit n s reward is affected by the treatment of another unit m only if there is a directed edge from n to m. See Figure 1 for an example of a network graph G.

4 Network Multi-Armed Bandits with Known Interference

We now present our algorithms and regret bounds when the interference pattern is known, i.e. the learner observes G and knows N(n) for each unit n. The unknown case is analysed in Section 5. Assuming knowledge of G is reasonable in e-commerce, where the platform (learner) assigning discounts (treatments) to goods (units) understands the underlying similarity between goods.

4For a vector x Rd, we define x 0 := Pd i=1 1(xi = 0)

Our algorithm requires the following additional notation: for a [A]N, let χa(Bn) = (χS(a) : S B(n)) { 1, 1}As, where χS(a) are the Fourier characteristics corresponding to subsets of B(n). Further, let U([A]N) denote the uniform discrete distribution on the action space [A]N.

Algorithm 1 Network Explore-Then-Commit with Known Interference

1: Input: Time horizon T, exploration steps E, interference graph G = ([N], E). 2: Sample a1, . . . , a E i.i.d. U [A]N

3: Observe reward vectors Rt = (R1t, , RNt) for t [E], where Rnt = θn, χ(at) + ϵnt. 4: for n [N] do 5: Let Xn = (χai(Bn) : i [E]) { 1, 1}E As // χa(Bn) = (χS(a) : S B(n)) 6: Let Yn = (Rn1, . . . , Rn E). 7: Set eθn := arg minθ RAs Yn Xnθ 2 2 8: Define bθn by bθn S = eθn S if S B(n) else set bθn S = 0. // Coordinates of eθn indexed by subsets of B(n) 9: Set bθ := N 1 PN n=1 bθn.

10: Play ba := arg maxa [A]N bθ, χ(a) for the T E remaining rounds.

Algorithm 1 is a explore-then-commit style which operates in two phases. First, the learner assigns units treatments uniformly at random for E rounds, and observes rewards for each unit. In the second phase, the algorithm performs least squares regressions of the observed rewards against χa(Bn) for each unit n. This is because when G is known, the learner knows the positions of the non-zero elements of θn which are precisely the subsets of B(n), Once the estimates ˆθn are obtained for each unit, they are aggregated to estimate the average reward for each action a [A]N. In the remaining T E rounds, the learner greedily plays the action with the highest estimated average reward.

Determining exploration length E. Theoretically, we detail the length of E below to achieve low regret in Theorem 4.1. Practically, the learner can continue to explore and assess the error of the learnt bθn via cross-validation (CV). Once the CV error for all units falls below a (user-specified) threshold, commit to the action with highest average reward. We use this approach for selecting E in our simulations in Section 6.

4.1 Regret Analysis

Here, we establish high-probability regret bounds of Algorithm 1 using O( ) notation. We prove the following in Appendix B. Theorem 4.1. Suppose Assumptions 1 and 2 hold. For T = Ω A2s[log(2N/δ) + s log(A)] and

any failure probability δ (0, 1), Algorithm 1 run with E := (TAs)2/3 log N

δ + s log (A) 1/3

satisfies Reg T = O [s log(A/δ)]1/3 (TAs)2/3 ,

with probability at least 1 δ.

Establishing Theorem 4.1 requires trading-off the exploration time E to accurately estimate θn with the exploitation time. It also requires T to be large enough such that we can accurately estimate θn. Next, we compare regret of Algorithm 1 to other methods, ignoring any dependencies on logarithmic factors to ease the discussion.

Comparison to other approaches.

(a) Naïve MAB learner. A naïve learner who treats the entire network of units as a single multiarmed bandit system with AN actions will obtain regret e O(

TAN). For sparse networks with s N and T AN, our regret bound is significantly tighter. (b) Global estimation. An alternate algorithm would be to estimate Fourier coefficients θ := 1/N PN i=1 θn of r directly rather than estimate each θn (i.e., rn) individually. That is, perform

the least squares regression by compressing the observed, unit-specific rewards into Rt := N 1 PN n=1 Rtn. An analysis similar to the one presented in Appendix B would yield rate of e O(s1/3(TNAs)2/3), which suffers an additional N 2/3 cost as compared to Theorem 4.1. (c) Jia et al. [2024]. Comparing regret to this work is difficult because they assume decaying interference strength on a grid-like network structure and establish regret only with respect to the best constant action, i.e., a := arg maxa [A] r(a1). We also note that the framework of Jia et al. [2024] is closer to that of adversarial bandits, whereas our framework is closer to that of stochastic bandits.

5 Network Multi-Armed Bandits with Unknown Interference

Next, we consider the case in which the underlying network G governing interference is not known. We present Algorithm 2, which extends Algorithm 1 to account for the fact that the learner does not observe the network graph G and thus does not know N(n) for all n. Unknown network interference is common in medical trials, e.g., vaccine roll-outs where an individual s social network (i.e., G) is unavailable to the learner.

Algorithm 2 Network Explore-Then-Commit with Unknown Interference

1: Input: Time horizon T, exploration steps E, regularization parameter λ > 0 2: Sample a1, . . . , a E i.i.d. U [A]N

3: Observe reward vectors Rt = (R1t, , RNt) for t [E], where Rnt = θn, χ(at) + ϵnt.

4: Let X = (χ(ai) : i [E]) { 1, 1}E AN

5: for n [N] do 6: Let Yn := (Rn1, . . . , Rn E). 7: Set bθn := arg minθ RAN 1

2E Xθ Yn 2 2 + λ θ 1

8: Set bθ := N 1 PN n=1 bθn.

9: Play ba := arg maxa [A]N bθ, χ(a) for the T E remaining rounds.

Algorithm 2 is similar to Algorithm 1, but differs in how it learns θn. Since G is unknown, the learner cannot identify the Fourier characteristics which correspond to the non-zero elements of θn. Therefore, we regress against the entire Fourier characteristic χ(a), using Lasso instead of ordinary least squares to adapt to the underlying sparsity of θn. A similar CV approach, as discussed after Algorithm 1, can be used to determine both the exploration length E, and regularization parameter λ.

Low-order interactions. When AN is very large, the computational cost of running the Lasso can be large. Further, if the underlying network is indeed believed to be sparse, one can regress against all characteristics χS where |S| d. A similar approach is explored in Yu et al. [2022]. In practice, one can choose degree d via CV.

Partially observed network graph G. In many settings, network interference graphs G are partially observed. For example, on e-commerce platforms, interference patterns between established classes of goods is well-understood, but might be less so for newer products. Our framework can naturally be adapted to this setting by running Algorithm 1 on the observed portion of G, and Algorithm 2 on the unobserved graph. Specifically, if N(n) is observed for unit n, replace the Lasso in line 7 of Algorithm 2 with OLS (i.e., line 8) in Algorithm 1.

5.1 Regret Analysis

We now establish high-probability bounds on the regret for Algorithm 2 in Theorem 5.1. We prove the following in Appendix C. Theorem 5.1. Suppose Assumptions 1 and 2 hold, and assume T = Ω(A2s [log(N/δ) + N log(A)]). Then, with failure probability δ (0, 1), Algorithm 2 run with λ = 4 p

E 1 log(2AN) +

δ where E := (TAs)2/3 log N

δ + N log(A) 1/3 satisfies

Reg T = O [N log(A/δ)]1/3 (TAs)2/3

with probability at least 1 δ.

We note the regret bound requires the horizon T to be sufficiently large in order to learn the network graph G a necessary detail in order to ensure Lasso convergence. This is because the proof of Theorem 5.1 requires establishing that the matrix of Fourier coefficients for the sampled actions (i.e., design matrix X) satisfies the the necessary regularity conditions to learn θn accurately. Specifically, we show that X is incoherent, i.e., approximately orthogonal, with high probability. See Appendix C for a formal definition of incoherence, and Rigollet and Hütter [2023], Wainwright [2019] for a detailed study of the Lasso.

Comparison to other approaches. Algorithm 2 achieves the same dependence in A, s, T as in the known interference case, but pays a factor of N 1/3 as compared to s1/3. This additional cost which is logarithmic in the ambient dimension AN is typical in sparse online learning. This regret rate is still significantly lower than naïve approaches that scale as O(

ANT) when one assumes T is much smaller then AN. Further, as argued before, estimating per-unit rewards (i.e., θn) results in lower regret as compared to directly estimating r by a factor of N 2/3.

Dependence on horizon T. Generally, the dependence on T cannot be improved. Hao et al. [2020] lower bound regret for sparse linear bandits as eΩ((sparsity T)2/3), i.e., eΩ((As T)2/3) in our setting. They show improved dependence on T can only be achieved under stronger assumptions on the size of non-zero coefficients of θn.

6 Simulations

In this section, we perform simulations to empirically validate our algorithms and theoretical findings. We compare Algorithms 1 and 2 to UCB. We could not compare to Jia et al. [2024] since we did not find a public implementation. For our Algorithms, we choose all hyper-parameters via 3-fold CV, and use the scikit-learn implementation of the Lasso. Code for our methods and experiments can be found at https://github.com/aagarwal1996/Network MAB. Our experimental setup and results are described below.

Data Generating Process. We generate interference patterns with varying number of units N {5, . . . , 10}, and A = 2. For each N, we use s = 4. We generate rewards rn = θn, χ(a) , where the non-zero elements of θn (i.e., θn,S for S Bn) are drawn uniform from [0, 1]. We normalize rewards so that they are contained in [0, 1], and add 1 sub-gaussian noise to sampled rewards. We measure regret as we vary T, and set a max horizon of Tmax = 10 2N for each N. Classical MAB algorithms need the horizon T to satisfy T > 2N since they first explore by pulling all 2N arms. We emphasize that these time horizons scaling as T = C AN are often unreasonable in practice, as even for A = 2 and N = 100 there would already be 1.27e30 actions to explore. We include such large time horizons for the sake of making a complete comparison. Our methods circumvent the need for exponentially large exploration times by effectively exploiting sparsity.

Results. We plot the regret at the maximum horizon time as a function of N, and the cumulative regret as we vary T for N = 13 in Figure 2 below. Our results are averaged over 5 repetitions, with shaded regions representing 1 standard deviation measured across repetitions. Algorithms 1 and 2 are denoted by Network MAB (Known) and Network MAB (Unknown) respectively. We discuss both sets of plot separately below.

Regret Scaling with N. We plot the cumulative regret when T = Tmax for N = 9 in Figure 2 (a). Classical MAB algorithms such as UCB see an exponential growth in the regret as N increases. Both Algorithm 1 and Algorithm 2 have much milder scaling with N. Algorithm 1 uses G to reduce the ambient dimension of the regression, hence suffering less dependence on N as compared to Algorithm 2.

Regret Scaling with T. We plot the cumulative regret for N = 9 in Figure 2 (b). Despite the poorer scaling of our regret bounds with T, our algorithms lead to significantly better regret than UCB which takes a large horizon to converge. Algorithm 1 is able to end its exploration phase earlier than algorithm 2 since it does not need additional samples to learn the sparsity unlike the Lasso.

(a) Cumulative regret vs number of units N.

(b) Cumulative regret scaling vs horizon T.

Figure 2: We simulate rewards via a sparse network interference pattern, and plot the cumulative regret as a function of N and T. Our Network MAB algorithms out-perform UCB, irrespective of knowledge of G, and does not suffer exponential dependence in number of units N. The results also confirm our theoretical results that knowledge of G leads Algorithm 1 to have milder dependence in N and better regret than Algorithm 2.

7 Conclusion

This paper introduces a framework for regret minimization in MABs with network interference, a ubiquitous problem in practice. We study this problem under a natural sparsity assumption on the interference pattern and provide simple algorithms both when the network graph is known and unknown. Our analysis establishes low regret for these algorithms and numerical simulations corroborate our theoretical findings. The results in this paper also significantly generalize previous works on MABs with network interference by allowing for arbitrary and unknown (neighbourhood) interference, as well as comparing to a combinatorially more difficult optimal policy. This paper also suggests future directions for research such as designing algorithms that achieve better dependence on T in the known graph setting. Establishing lower bounds to understand optimal algorithms will also be valuable future work. Further extensions could also include considering interference in contextual bandits or reinforcement learning problems. We also hope this work serves as a bridge between online learning and discrete Fourier analysis.

Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1 9. PMLR, 2012.

Abhineet Agarwal, Anish Agarwal, and Suhas Vijaykumar. Synthetic combinations: A causal inference framework for combinatorial interventions. Advances in Neural Information Processing Systems, 36:19195 19216, 2023.

Anish Agarwal, Sarah H Cen, Devavrat Shah, and Christina Lee Yu. Network synthetic interventions: A causal framework for panel data under network interference. ar Xiv preprint ar Xiv:2210.11355, 2022.

Peter M Aronow. A general method for detecting interference between units in randomized experiments. Sociological Methods & Research, 41(1):3 16, 2012.

Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912 1947, 2017. doi: 10.1214/16-AOAS1005. URL https://doi.org/10.1214/16-AOAS1005.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002.

Patrick Bajari, Brian Burdick, Guido W Imbens, Lorenzo Masoero, James Mc Queen, Thomas Richardson, and Ido M Rosen. Multiple randomization designs. ar Xiv preprint ar Xiv:2112.13495, 2021.

Patrick Bajari, Brian Burdick, Guido W Imbens, Lorenzo Masoero, James Mc Queen, Thomas S Richardson, and Ido M Rosen. Experimental design in marketplaces. Statistical Science, 38(3): 458 476, 2023.

Rohit Bhattacharya, Daniel Malinsky, and Ilya Shpitser. Causal inference under interference and network uncertainty. In Uncertainty in Artificial Intelligence, pages 1028 1038. PMLR, 2020.

Sarah Huiyi Cen, Anish Agarwal, Christina Yu, and Devavrat Shah. A causal inference framework for network interference with panel data. In Neur IPS 2022 Workshop on Causality for Real-world Impact, 2022.

Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404 1422, 2012.

Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International conference on machine learning, pages 151 159. PMLR, 2013.

Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844 853. PMLR, 2017.

Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67 82. PMLR, 2018.

Mengsi Gao and Peng Ding. Causal inference in network experiments: regression-based analysis and design-based properties, 2023.

Botao Hao, Tor Lattimore, and Mengdi Wang. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753 10763, 2020.

Michael G Hudgens and M Elizabeth Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832 842, 2008.

Su Jia, Peter Frazier, and Nathan Kallus. Multi-armed bandits with interference. ar Xiv preprint ar Xiv:2402.01845, 2024.

Joon Kwon, Vianney Perchet, and Claire Vernade. Sparse stochastic bandits. ar Xiv preprint ar Xiv:1706.01383, 2017.

Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.

Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539 548, 2016.

Sahand Negahban and Devavrat Shah. Learning sparse boolean polynomials. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 2032 2036. IEEE, 2012.

Ryan O Donnell. Analysis of boolean functions. Cambridge University Press, 2014.

Jean Pouget-Abadie, Kevin Aydin, Warren Schudy, Kay Brodersen, and Vahab Mirrokni. Variance reduction in bipartite experiments through correlation clustering. Advances in Neural Information Processing Systems, 32, 2019.

Philippe Rigollet and Jan-Christian Hütter. High-dimensional statistics. ar Xiv preprint ar Xiv:2310.19244, 2023.

Paul R Rosenbaum. Interference between units in randomized experiments. Journal of the American Statistical Association, 102(477):191 200, 2007.

Donald B Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of statistics, pages 34 58, 1978.

Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. ar Xiv preprint ar Xiv:0912.3995, 2009.

Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. Graph cluster randomization: Network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 329 337, 2013.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.

Justin Whitehouse, Aaditya Ramdas, and Steven Z Wu. On the sublinear regret of GP-UCB. Advances in Neural Information Processing Systems, 36, 2024.

Christina Lee Yu, Edoardo M Airoldi, Christian Borgs, and Jennifer T Chayes. Estimating the total treatment effect in randomized experiments with unknown network structure. Proceedings of the National Academy of Sciences, 119(44):e2208975119, 2022.

A Proof of Proposition 3.1

By the discussion in Section 3, recall that for any action a [A]N and unit n, the reward can be expressed as rn = θn, χ(a) . To establish the proof, it suffices to show that for any S [N log2(A)] satisfying S \ B(n) = , χS, rn B = 0. Let i S \ B(n) be an arbitrary index, then, we have,

χS, rn B = A N X

x { 1,1}N log2(A) rn(x)χS(x)

x { 1,1}N log2(A) xi=1

rn(x)χS(x) + X

x { 1,1}N log2(A) xi= 1

x { 1,1}N log2(A) xi=1

xirn(x)χS\{i}(x) + X

x { 1,1}N log2(A) xi= 1

xirn(x)χS\{i}(x)

x { 1,1}N log2(A) xi=1

rn(x)χS\{i}(x) X

x { 1,1}N log2(A) xi= 1

rn(x)χS\{i}(x)

where the final equality follows from the fact that, by Assumption 2, rn(x) = rn(x ) when x and x differ only in positions indexed by i / B(n). Thus, the only subsets S [N log2(A)] where we can have rn, χS B = 0 are those satisfying S B(n), which proves the desired result.

B Proofs for for Known Interference

In this section, we prove Theorem 4.1. We establish helper lemmas before proving Theorem 4.1.

B.1 Helper Lemmas

Recall the following notation before establishing our results. We defined B(n) := {i [N log2(A)] : i [(m 1) log2(A) + 1 : m log2(A)] for m N(n)} as the set of indices of the treatment vector v(a) { 1, 1}N log2(A) belonging to neighbors m N(n). Additionally, Xn = (χai(Bn) : i [E]) { 1, 1}E As, where χa(Bn) = (χS(a) : S B(n)) { 1, 1}As. For a matrix A RN d, let σmin(A) denote its minimum singular value. To proceed, we quote the following theorem. Lemma B.1 (Theorem 5.41 in Vershynin [2018]). Let A RN d such that its rows Ai are independent isotropic random vectors in Rd. If Ai 2 m almost surely for all i [N], then, with probability at least 1 δ, one has

cm log(2d/δ)

for universal constant c > 0. Lemma B.2 (Minimum Eigenvalue of Fourier Characteristics). There exists a positive constant C4 > 0 such that if E C4As log(2As/δ), then,

with probability at least 1 δ.

Proof. We begin by showing the conditions for Lemma B.1 are satisfied. First, we prove Xn is isotropic, i.e., E[χa(Bn) (χa(Bn))T ] = IAs, where the expectation is taken over uniformly

sampling actions a uniformly and random from [A]N. This follows since for any two subsets S, S [N log2(A)],

E[χS(a)χS (a)] = 1 AN X

a AN χS(a)χS (a)

= χS, χS B = 1[S = S ]

Since, χa(Bn) { 1, 1}As for all actions a [A]N, χa(Bn) 2

As. Hence, by Lemma B.1, σmin(Xn)

c log(2As/δ)As. Next, using the fact that σmin(XT n Xn) = σ2 min(Xn), we get that

c EAs log(2As/δ)

Finally, plugging in As log(2As/δ) E/C for an appropriate C gives us the claimed result.

We quote the following theorem regarding the 2 error of bθn. Lemma B.3. [Theorem 2.2 in Rigollet and Hütter [2023]] Assume that Y = Xθ + ϵ, where ϵ is 1 sub-Gaussian, where X RE d. If d E, and covariance matrix ΣX = (XT X)/E has rank d, then we have with probability at least 1 δ,

Xθ Xbθ 2 C1

d + log(1/δ)

where bθ = arg minθ Rd Y Xθ 2 2 is the least squares estimator, and C1 > 0 is a positive universal constant.

While the above lemma bounds the mean-squared error the least-squares estimate, in our applications we can about bounding the ℓ2 distance between θ and bθ. Simple rearrangement on the above implies that, with probability at least 1 δ, we actually have

d + log(1/δ) E σmin(ΣX).

If, in particular, σmin X X

E 1/2, the above can be simplified to

d + log(1/δ)

E with probability at least 1 δ for some new, appropriate universal constant C2 > 0.

B.2 Proof of Theorem 4.1

Proof. Recall the notation bθ = N 1 PN n=1 bθn, and ba = arg maxa [A]N bθ, χ(a) . The average reward r(ba) can be bounded using the definition of ba and Holder s inequality as follows,

r(a ) r(ba) = θ, χ(a ) χ(ba)

= θ bθ, χ(a ) χ(ba) + bθ, χ(a ) χ(ba) | {z } 0

θ bθ, χ(a ) χ(ba)

i=1 θn bθn, χ(a ) χ(ba)

i=1 θn bθn, χa (Bn) χba(Bn)

i=1 θn bθn 2 χa (Bn) χˆa(Bn) 2

Using χa(Bn) 2

As then gives us

r(a ) r(ba)

i=1 θn bθn 2 (2)

Next, define good events for any unit n [N] as

Gn1 := σmin

bθn θn 2 C2

E 1 As + log 4NAs

where C2 > 0 is as stated above. Notice that there exists a sufficiently large universal constant C3 > 0, such that T C3 A2s[log(2N/δ) + s log(A)] implies E =

(TAs)2/3 log N

δ + s log (A) 1/3 C4As log(4NAs/δ). Hence, for any given n [N], we have via Lemma B.2 that Gn1 holds with probability 1 δ 2N Conditioned on Gn1, we get that P(Gn2|G1) 1 δ 2N . Summarizing, we get that for any n [N], the following holds

bθn θn 2 C2

E 1 As + log 4NAs

E 1As log 4NAs

with probability at least 1 δ 2N 2 1 δ/N, where C5 > 0 is an appropriate constant. Taking a union bound over all N units, and then substituting into (2) gives us

r(a ) r(ba) C5As s

E 1 log 4NAs

Finally, using this, the cumulative regret can be upper bounded with probability 1 δ as follows:

t=1 ( r(a ) r(ˆa))

t=1 ( r(a ) r(ˆa)) +

t=E+1 ( r(a ) r(ˆa))

E + C5TAs s

E 1 log 4NAs

Substituting E as in the theorem statement completes the proof.

C Proofs for Unknown Interference

In this appendix, we prove Theorem 5.1. Our proof requires the following lemmas.

C.1 Helper Lemmas for Theorem 5.1

The first lemma we prove details the (high-probability) incoherence guarantees of the uniformly random design matrix under the Fourier basis. Recall the following notation before stating and proving our results. We denote E as our exploration length, and χ(at) as the Fourier characteristic associated with action at [A]N. Let X = (χ(at) : t [E]) { 1, 1}E AN Additionally, we require the following definition of incoherence. Definition C.1. We say a matrix A RE d is s-incoherent if A A Id 1 32s, where Id is the identity matrix of dimension d.

Lemma C.2 (Incoherence of Fourier Characteristics). For E 1, suppose a1, . . . , a E iid U({ 1, +1}N log2(A)). Then,

v u u t2 log 2A2N

where A = maxi,j |Ai,j| denotes the maximum coordinates of a matrix. Thus, if E 4096A2s log 2

δ + 2N log (A) , X is As-incoherent with probability at least 1 δ.

Proof. Recall that, for any a [A]N, χ(a) := (χS1(a), . . . , χSAN (a)), where S1, . . . , SAN is some fixed enumeration of subsets S [N log2(A)]. Thus, each entry of (X X)/E can be viewed as being indexed by subsets S, S [N log2(A)].

To establish (C.2), we first examine diagonal elements of (X X)/E. For S [N log2(A)], we have

t=1 χ(at)(χ(at)) !

t=1 χS(at)χS(at) = 1, (3)

where the last equality follows from the fact that (χS(at))2 = 1.

Next, we consider off-diagonal elements, and bound their magnitude. Before doing so, we require the following. For subsets S, S [N log2(A)], let S S denote the symmetric difference of two subsets. For any two subsets S, S [N log2(A)], the product of their Fourier characteristics is,

χS(at)χS (at) =

i S S v(at)i.

Using this, for any distinct subsets S, S , we have X X

t=1 χS(at)χS (at) = 1

i S S v(at)i.

Since S = S , and at U({ 1, 1}N log2(A)), the set of random variables {v(at)i : i S S } are independent Rademacher random variables. Applying Hoeffding s inequality for ϵ > 0 gives us

Applying the inequality above and taking a union bound over all A2N elements of (X X)/E

max S,S [N log2(A)] S =S

2A2N exp Eϵ2

Choosing ϵ = r

2E 1 log 2A2N

max S,S [N log2(A)] S =S

v u u t2 log 2A2N

To complete the proof, observe that (3) implies that X X

E IAN = max S,S [N log2(A)] S =S

Substituting this observation into (4) above completes the proof.

In addition to the above lemma, we leverage the following Lasso convergence result. We state a version that can be found in the book on high-dimensional probability due to Rigollet and Hütter [2023].

Lemma C.3 (Theorem 2.18 in [Rigollet and Hütter, 2023] ). Suppose that Y = Xθ + ϵ, where X RE d, θ Rd is s-sparse, and ϵ has independent 1-sub-Gaussian coordinates. Further, suppose X X

T is s-incoherent. Then, for any δ (0, 1) and for λ = 4 p

E 1 log(2d) + 4 p

E 1 log(δ 1), we have, with probability at least 1 δ

s E 1 log 2d

where bθ denotes the solution to the Lasso and C > 0 is some absolute constant.

Using standard arguments (see the proof of Theorem 2.18 in Rigollet and Hütter [2023] or the statement of Theorem 7.3 in Wainwright [2019]), it can be further deduced that θ bθ 1 4 s θ bθ 2, so we actually have that, for any δ (0, 1), with probability at least 1 δ,

where C > 0 is again some absolute constant.

C.2 Proof of Theorem 5.1

Proof. Define θ = N 1 PN n=1 θn. Recall the notation bθ = N 1 PN n=1 bθn, and ba = arg maxa [A]N bθ, χ(a) . For any round t {E + 1, . . . , T}, we greedily play the action ba. The average reward r(ba) can be bounded using the definition of ba and Holder s inequality as follows,

r(a ) r(ba) = θ, χ(a ) χ(ba)

= θ bθ, χ(a ) χ(ba) + bθ, χ(a ) χ(ba) | {z } 0

θ bθ, χ(a ) χ(ba)

θ bθ 1 χ(a ) χ(ba) .

Next, substituting the definition of θ, bθ, χ(a) = 1, and using the triangle inequality into the equation above gives us,

r(a ) r(ba) 2 θ bθ 1 2

θn bθn 1 . (5)

Let us define the good events by

G1 := {X is As-incoherent} and G2 :=

n [N], bθn θn 1 CAs s

E 1 log 4NAN

where C > 0 is the constant following the discussion of Lemma C.3. Let us define the global good event by G := G1 G2. We show P(G) 1 δ.

First, there is a universal constant C > 0 such that E 4096A2s [log(4/δ) + 2N log(A)] when T C A2s[log(N/δ)+N log(A)]. Thus, by Lemma C.2, we know the matrix X with χa1, . . . , χa E as its rows is As-incoherent least 1 δ

2, i.e. P(G1) 1 δ

Next, conditioning on G1 and applying Lemma C.3 alongside a union bound over the N units yields

bθn θn 1 CAs s

E 1 log 4NAN

for all n [N] with probability at least 1 δ

2, i.e. P(G2 | G1) 1 δ

2. Thus, in total, we have P(G) = P(G1)P(G2 | G1) (1 δ/2)2 1 δ. We assume we are operating on the good event G going forward.

Plugging the per-unit ℓ1 norms into (5), the cumulative regret can be upper bounded with probability 1 δ as follows:

t=1 ( r(a ) r(ˆa))

t=1 ( r(a ) r(ˆa)) +

t=E+1 ( r(a ) r(ˆa))

E + 2CT As s

E 1 log 4NAN

Clearly, we should select E to roughly balance terms (up to multiplicative constants). In particular, using the choice of E as

E := (TAs)2/3 log N

+ N log(A) 1/3

and substituting E into (6) gives us

Reg T O (N log(AN/δ))1/3(TAs)2/3

with probability at least 1 δ, precisely the claimed result.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: in the abstract/introduction, we claim that we contribute (a) a framework for studying bandits with sparse interference, (b) algorithms for obtaining low regret under this framework, and (c) simulations to empirically back up our theoretical findings. We present these, respectively, in Section 3, Sections 4 and 5, and Section 6 of our paper. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss many shortcomings of our contributions. For instance, we note that in the known interference setting, our main algorithm obtains a dependence on the time horizon that grows as T 2/3, whereas one would ideally hope for T 1/2 dependence. We also address computational aspects of the Lasso in Section 5, providing heuristic approaches for speedup. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: We detail all of our assumptions either in Section 3 or in the statements of theorems. We also provide full, rigorous proof of our results in the appendices.

Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We fully describe our theoretical framework, and painstakingly detail all algorithms, including how to choose free parameters. In our experimentation section, we provide full details on our setup.

Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We have attached our code as a supplement so it is viewable by reviewers. We have redacted written where the link to the repo will be included upon acceptance.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We fully describe our simulations in Section 6 in our work.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We provide error bars for our simulation results and we describe the methodology by which we produce our error bars.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [No]

Justification: Our experiments are extremely lightweight and can be run on any modern laptop.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: We have read the code of ethics and have found no violations in our work.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: There are no negative societal impacts of our work. We have mentioned as positive societal impacts applicability of our results to tasks such as medical trials. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: We do not produce any models or release any data that may be misused. Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: We do not use existing assets in our work. Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL.

The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: We do not introduce any new assets in our work.

Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: We do not use crowdsourcing nor do we conduct research with human subjects.

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: Same as the justification for the above bullet point.

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.