# Practical Contextual Bandits with Feedback Graphs

Mengxiao Zhang (University of Southern California, mengxiao.zhang@usc.edu), Yuheng Zhang (University of Illinois Urbana-Champaign, yuhengz2@illinois.edu), Olga Vrousgou (Microsoft Research, Olga.Vrousgou@microsoft.com), Haipeng Luo (University of Southern California, haipengl@usc.edu), Paul Mineiro (Microsoft Research, pmineiro@microsoft.com)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

While contextual bandits have a mature theory, effectively leveraging different feedback patterns to enhance the pace of learning remains unclear. Bandits with feedback graphs, which interpolate between the full-information and bandit regimes, provide a promising framework to mitigate the statistical complexity of learning. In this paper, we propose and analyze an approach to contextual bandits with feedback graphs based upon reduction to regression. The resulting algorithms are computationally practical and achieve established minimax rates, thereby reducing the statistical complexity in real-world applications.

1 Introduction

This paper is primarily concerned with increasing the pace of learning for contextual bandits [Auer et al., 2002, Langford and Zhang, 2007]. While contextual bandits have enjoyed broad applicability [Bouneffouf et al., 2020], the statistical complexity of learning with bandit feedback imposes a data lower bound for application scenarios [Agarwal et al., 2012]. This has inspired various mitigation strategies, including exploiting function class structure for improved experimental design [Zhu and Mineiro, 2022], and composing with memory for learning with fewer samples [Rucker et al., 2022]. In this paper we exploit alternative graph feedback patterns to accelerate learning: intuitively, there is no need to explore a potentially suboptimal action if a presumed better action, when exploited, yields the necessary information.

The framework of bandits with feedback graphs is mature and provides a solid theoretical foundation for incorporating additional feedback into an exploration strategy [Mannor and Shamir, 2011, Alon et al., 2015, 2017]. Succinctly, in this framework, the observation of the learner is decided by a directed feedback graph G: when an action is played, the learner observes the loss of every action to which the chosen action is connected. When the graph only contains self-loops, this problem reduces to the classic bandit case. For non-contextual bandits with feedback graphs, Alon et al. [2015] provide a full characterization of the minimax regret in terms of different graph-theoretic quantities associated with G, according to the type of the feedback graph. However, contextual bandits with feedback graphs have received less attention [Singh et al., 2020, Wang et al., 2021]. Specifically, there is no prior work offering a solution for general feedback graphs and function classes. In this work, we take an important step in this direction by adopting recently developed minimax algorithm design principles in contextual bandits, which leverage realizability and reduction to regression to construct practical algorithms with strong statistical guarantees [Foster et al., 2018, Foster and Rakhlin, 2020, Foster et al., 2020, Foster and Krishnamurthy, 2021, Foster et al., 2021, Zhu and Mineiro, 2022].
Using this strategy, we construct a practical algorithm for contextual bandits with feedback graphs that achieves the optimal regret bound. Moreover, although our primary concern is accelerating learning when the available feedback is more informative than bandit feedback, our techniques also succeed when the available feedback is less informative than bandit feedback, e.g., in spam filtering where some actions generate no feedback. More specifically, our contributions are as follows.

Contributions. In this paper, we extend the minimax framework proposed in [Foster et al., 2021] to contextual bandits with general feedback graphs, aiming to promote the utilization of different feedback patterns in practical applications. Following [Foster and Rakhlin, 2020, Foster et al., 2021, Zhu and Mineiro, 2022], we assume that there is an online regression oracle for supervised learning on the loss. Based on this oracle, we propose SquareCB.G, the first algorithm for contextual bandits with feedback graphs that operates via reduction to regression (Algorithm 1). Eliding regression regret factors, our algorithm achieves the matching optimal regret bounds for deterministic feedback graphs, with $\widetilde{O}(\sqrt{\alpha T})$ regret for strongly observable graphs and $\widetilde{O}(d^{1/3} T^{2/3})$ regret for weakly observable graphs, where $\alpha$ and $d$ are respectively the independence number and weak domination number of the feedback graph (see Section 3.2 for definitions). Notably, SquareCB.G is computationally tractable, requiring the solution to a convex program (Theorem 3.6), which can be readily solved with off-the-shelf convex solvers (Appendix A.3). In addition, we provide closed-form solutions for specific cases of interest (Section 4), leading to a more efficient implementation of our algorithm. Empirical results further showcase the effectiveness of our approach (Section 5).

2 Problem Setting and Preliminaries

Throughout this paper, we let $[n]$ denote the set $\{1, 2, \ldots, n\}$ for any positive integer $n$. We consider the following contextual bandits problem with informed feedback graphs. The learning process proceeds over $T$ rounds. At each round $t \in [T]$, an environment selects a context $x_t \in \mathcal{X}$, a (stochastic) directed feedback graph $G_t \in [0,1]^{\mathcal{A} \times \mathcal{A}}$, and a loss distribution $P_t : \mathcal{X} \to \Delta([-1,1]^{\mathcal{A}})$, where $\mathcal{A}$ is the action set with finite cardinality $K$. For convenience, we use $\mathcal{A}$ and $[K]$ interchangeably for denoting the action set. Both $G_t$ and $x_t$ are revealed to the learner at the beginning of each round $t$. Then the learner selects one of the actions $a_t \in \mathcal{A}$, while at the same time, the environment samples a loss vector $\ell_t \in [-1,1]^{\mathcal{A}}$ from $P_t(\cdot \mid x_t)$. The learner then observes some information about $\ell_t$ according to the feedback graph $G_t$. Specifically, for each action $j$, she observes the loss of action $j$ with probability $G_t(a_t, j)$, resulting in a realization $A_t$, which is the set of actions whose losses are observed. With a slight abuse of notation, denote by $G_t(\cdot \mid a)$ the distribution of $A_t$ when action $a$ is picked. We allow the context $x_t$, the (stochastic) feedback graph $G_t$, and the loss distribution $P_t(\cdot \mid x_t)$ to be selected by an adaptive adversary. When convenient, we will consider $G$ to be a $K$-by-$K$ matrix and utilize matrix notation.

Other Notations. Let $\Delta([K])$ denote the set of all Radon probability measures over the set $[K]$. $\mathrm{conv}(\mathcal{S})$ represents the convex hull of a set $\mathcal{S}$. Denote by $I$ the identity matrix of appropriate dimension. For a $K$-dimensional vector $v$, $\mathrm{diag}(v)$ denotes the $K$-by-$K$ matrix with $i$-th diagonal entry $v_i$ and all other entries $0$. We use $\mathbb{R}^K_{\geq 0}$ to denote the set of $K$-dimensional vectors with non-negative entries. For a positive definite matrix $M \in \mathbb{R}^{K \times K}$, we define the norm $\|z\|_M = \sqrt{z^\top M z}$ for any vector $z \in \mathbb{R}^K$. We use the $\widetilde{O}(\cdot)$ notation to hide factors that are polylogarithmic in $K$ and $T$.
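Before stating our assumptions, here is a minimal sketch (not from the paper) of the observation model above, for a stochastic graph represented as a $K \times K$ array; the helper name is hypothetical, and, purely for illustration, observations are drawn independently across actions.

```python
import numpy as np

def sample_observed_set(G_t, a_t, rng):
    """Sample the observed action set A_t given feedback graph G_t and played action a_t.

    G_t is a K-by-K array whose (a, j) entry is the probability of observing the
    loss of action j when action a is played; for a deterministic graph the entries
    are 0/1 and this reduces to reading off row a_t.
    """
    K = G_t.shape[0]
    observe = rng.random(K) < G_t[a_t]   # one Bernoulli draw per action (independence assumed here)
    return np.flatnonzero(observe)       # indices of actions whose loss is revealed

# Example: bandit feedback (identity graph) vs. full information (all-ones graph).
rng = np.random.default_rng(0)
K = 4
print(sample_observed_set(np.eye(K), a_t=2, rng=rng))       # -> [2]
print(sample_observed_set(np.ones((K, K)), a_t=2, rng=rng)) # -> [0 1 2 3]
```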
Realizability. We assume that the learner has access to a known function class $\mathcal{F} \subseteq (\mathcal{X} \times \mathcal{A} \to [-1,1])$ which characterizes the mean of the loss for a given context-action pair, and we make the following standard realizability assumption studied in the contextual bandit literature [Agarwal et al., 2012, Foster et al., 2018, Foster and Rakhlin, 2020, Simchi-Levi and Xu, 2021].

Assumption 1 (Realizability). There exists a regression function $f^\star \in \mathcal{F}$ such that $\mathbb{E}[\ell_{t,a} \mid x_t] = f^\star(x_t, a)$ for any $a \in \mathcal{A}$ and all $t \in [T]$.

Two comments are in order. First, we remark that, similar to [Foster et al., 2020], misspecification can be incorporated while maintaining computational efficiency, but we do not complicate the exposition here. Second, Assumption 1 induces a semi-adversarial setting, wherein nature is completely free to determine the context and graph sequences, and has considerable latitude in determining the loss distribution subject to a mean constraint.

Regret. For each regression function $f \in \mathcal{F}$, let $\pi_f(x_t) := \operatorname{argmin}_{a \in \mathcal{A}} f(x_t, a)$ denote the induced policy, which chooses the action with the least loss with respect to $f$. Define $\pi^\star := \pi_{f^\star}$ as the optimal policy. We measure the performance of the learner via regret to $\pi^\star$: $\mathrm{Reg}_{\mathrm{CB}} := \sum_{t=1}^T \ell_{t,a_t} - \sum_{t=1}^T \ell_{t,\pi^\star(x_t)}$, which is the difference between the loss suffered by the learner and the loss the learner would suffer by applying policy $\pi^\star$.

Regression Oracle. We assume access to an online regression oracle $\mathrm{Alg}_{\mathrm{Sq}}$ for function class $\mathcal{F}$, which is an algorithm for online learning with squared loss. We consider the following protocol. At each round $t \in [T]$, the algorithm produces an estimator $\hat{f}_t \in \mathrm{conv}(\mathcal{F})$, then receives a set of context-action-loss tuples $\{(x_t, a, \ell_{t,a})\}_{a \in A_t}$ where $A_t \subseteq \mathcal{A}$. The goal of the oracle is to accurately predict the loss as a function of the context and action, and we evaluate its performance via the square loss $\sum_{a \in A_t} (\hat{f}_t(x_t, a) - \ell_{t,a})^2$. We measure the oracle's cumulative performance via the following square-loss regret to the best function in $\mathcal{F}$.

Assumption 2 (Bounded square-loss regret). The regression oracle $\mathrm{Alg}_{\mathrm{Sq}}$ guarantees that for any (potentially adaptively chosen) sequence $\{(x_t, a, \ell_{t,a})\}_{a \in A_t, t \in [T]}$ in which $A_t \subseteq \mathcal{A}$,
$$\sum_{t=1}^T \sum_{a \in A_t} \big(\hat{f}_t(x_t, a) - \ell_{t,a}\big)^2 - \inf_{f \in \mathcal{F}} \sum_{t=1}^T \sum_{a \in A_t} \big(f(x_t, a) - \ell_{t,a}\big)^2 \leq \mathrm{Reg}_{\mathrm{Sq}}.$$

For finite $\mathcal{F}$, Vovk's aggregation algorithm yields $\mathrm{Reg}_{\mathrm{Sq}} = O(\log|\mathcal{F}|)$ [Vovk, 1995]. This regret depends upon the scale of the loss, which need not grow linearly with the size of $A_t$; e.g., the loss scale can be bounded by 2 in classification problems. See Foster and Krishnamurthy [2021] for additional examples of online regression algorithms.

3 Algorithms and Regret Bounds

In this section, we provide our main algorithms and results.

3.1 Algorithms via Minimax Reduction Design

Our approach is to adapt the minimax formulation of [Foster et al., 2021] to contextual bandits with feedback graphs. In the standard contextual bandits setting (that is, $G_t = I$ for all $t$), Foster et al. [2021] define the Decision-Estimation Coefficient (DEC) for a parameter $\gamma > 0$ as $\mathrm{dec}_\gamma(\mathcal{F}) := \sup_{\hat{f} \in \mathrm{conv}(\mathcal{F}), x \in \mathcal{X}} \mathrm{dec}_\gamma(\mathcal{F}; \hat{f}, x)$, where
$$\mathrm{dec}_\gamma(\mathcal{F}; \hat{f}, x) := \inf_{p \in \Delta([K])} \mathrm{dec}_\gamma(p, \mathcal{F}; \hat{f}, x) := \inf_{p \in \Delta([K])} \sup_{a^\star \in [K],\, f \in \mathcal{F}} \mathbb{E}_{a \sim p}\Big[f(x, a) - f(x, a^\star) - \frac{\gamma}{4}\big(\hat{f}(x, a) - f(x, a)\big)^2\Big]. \tag{1}$$
Their proposed algorithm is as follows.
At each round $t$, after receiving the context $x_t$, the algorithm first computes $\hat{f}_t$ by calling the regression oracle. Then, it solves for the distribution $p_t$ attaining the minimax value in Eq. (1) with $\hat{f}$ and $x$ replaced by $\hat{f}_t$ and $x_t$. Finally, the algorithm samples an action $a_t$ from the distribution $p_t$ and feeds the observation $(x_t, a_t, \ell_{t,a_t})$ to the oracle. Foster et al. [2021] show that for any value $\gamma$, the algorithm above guarantees that
$$\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \leq T \cdot \mathrm{dec}_\gamma(\mathcal{F}) + \frac{\gamma}{4} \mathrm{Reg}_{\mathrm{Sq}}. \tag{2}$$
However, the minimax problem Eq. (1) may not be solved efficiently in many cases. Therefore, instead of obtaining the distribution $p_t$ which achieves the exact minimax value of Eq. (1), Foster et al. [2021] show that any distribution that gives an upper bound $C_\gamma$ on $\mathrm{dec}_\gamma(p, \mathcal{F}; \hat{f}, x)$ also works and enjoys a regret bound with $\mathrm{dec}_\gamma(\mathcal{F})$ replaced by $C_\gamma$ in Eq. (2).

Algorithm 1 SquareCB.G. (Theorem 3.6 provides an efficient implementation of Eq. (4).)
Input: parameter $\gamma \geq 4$, a regression oracle $\mathrm{Alg}_{\mathrm{Sq}}$.
For $t = 1, 2, \ldots, T$ do:
  Receive context $x_t$ and directed feedback graph $G_t$.
  Obtain an estimator $\hat{f}_t$ from the oracle $\mathrm{Alg}_{\mathrm{Sq}}$.
  Compute the distribution $p_t \in \Delta([K])$ such that $p_t = \operatorname{argmin}_{p \in \Delta([K])} \mathrm{dec}_\gamma(p; \hat{f}_t, x_t, G_t)$, where
  $$\mathrm{dec}_\gamma(p; \hat{f}_t, x_t, G_t) := \sup_{a^\star \in [K],\, f \in \Phi} \mathbb{E}_{a \sim p}\Big[f(x_t, a) - f(x_t, a^\star) - \frac{\gamma}{4}\, \mathbb{E}_{A \sim G_t(\cdot \mid a)} \sum_{a' \in A} \big(\hat{f}_t(x_t, a') - f(x_t, a')\big)^2\Big] \tag{4}$$
  and $\Phi := (\mathcal{X} \times [K] \to \mathbb{R})$.
  Sample $a_t$ from $p_t$ and observe $\{\ell_{t,j}\}_{j \in A_t}$ where $A_t \sim G_t(\cdot \mid a_t)$.
  Feed the tuples $\{(x_t, j, \ell_{t,j})\}_{j \in A_t}$ to the oracle $\mathrm{Alg}_{\mathrm{Sq}}$.
End for

To extend this framework to the setting with feedback graph $G$, we define $\mathrm{dec}_\gamma(\mathcal{F}; \hat{f}, x, G)$ as follows:
$$\mathrm{dec}_\gamma(\mathcal{F}; \hat{f}, x, G) := \inf_{p \in \Delta([K])} \mathrm{dec}_\gamma(p, \mathcal{F}; \hat{f}, x, G) := \inf_{p \in \Delta([K])} \sup_{a^\star \in [K],\, f \in \mathcal{F}} \mathbb{E}_{a \sim p}\Big[f(x, a) - f(x, a^\star) - \frac{\gamma}{4}\, \mathbb{E}_{A \sim G(\cdot \mid a)} \sum_{a' \in A} \big(\hat{f}(x, a') - f(x, a')\big)^2\Big]. \tag{3}$$
Compared with Eq. (1), the difference is that we replace the squared estimation error on action $a$ by the expected one over the observed set $A \sim G(\cdot \mid a)$, which intuitively utilizes more feedback from the graph structure. When the feedback graph is the identity matrix, we recover Eq. (1). Based on $\mathrm{dec}_\gamma(\mathcal{F}; \hat{f}, x, G)$, our algorithm SquareCB.G is shown in Algorithm 1. As in [Foster et al., 2021], in order to derive an efficient algorithm, instead of solving for the distribution $p_t$ with respect to the supremum over $f \in \mathcal{F}$, we solve for the $p_t$ that minimizes $\mathrm{dec}_\gamma(p; \hat{f}_t, x_t, G_t)$ (Eq. (4)), which takes the supremum over $f \in (\mathcal{X} \times [K] \to \mathbb{R})$, leading to an upper bound on $\mathrm{dec}_\gamma(\mathcal{F}; \hat{f}_t, x_t, G_t)$. Then, we receive the losses $\{\ell_{t,j}\}_{j \in A_t}$ and feed the tuples $\{(x_t, j, \ell_{t,j})\}_{j \in A_t}$ to the regression oracle $\mathrm{Alg}_{\mathrm{Sq}}$. Following a similar analysis to [Foster et al., 2021], we show that to bound the regret $\mathrm{Reg}_{\mathrm{CB}}$, we only need to bound $\mathrm{dec}_\gamma(p_t; \hat{f}_t, x_t, G_t)$.

Theorem 3.1. Suppose $\mathrm{dec}_\gamma(p_t; \hat{f}_t, x_t, G_t) \leq C\gamma^{-\beta}$ for all $t \in [T]$ and some $\beta > 0$. Then Algorithm 1 with $\gamma = \max\{4, (CT/\mathrm{Reg}_{\mathrm{Sq}})^{\frac{1}{\beta+1}}\}$ guarantees that $\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \leq O\big(C^{\frac{1}{\beta+1}} T^{\frac{1}{\beta+1}} \mathrm{Reg}_{\mathrm{Sq}}^{\frac{\beta}{\beta+1}}\big)$.

The proof is deferred to Appendix A. In Section 3.3, we give an efficient implementation for solving Eq. (4) via reduction to convex programming.

3.2 Regret Bounds

In this section, we derive regret bounds for Algorithm 1 when the graphs $G_t$ are specialized to deterministic graphs, i.e., $G_t \in \{0,1\}^{\mathcal{A} \times \mathcal{A}}$. We utilize discrete graph notation $G = ([K], E)$, where $E \subseteq [K] \times [K]$, and define $N^{\mathrm{in}}(G, j) := \{i \in \mathcal{A} : (i, j) \in E\}$ as the set of nodes that can observe node $j$. In this case, at each round $t$, the observed node set $A_t$ is a deterministic set which contains every node $i$ satisfying $a_t \in N^{\mathrm{in}}(G_t, i)$. In the following, we introduce several graph-theoretic concepts for deterministic feedback graphs [Alon et al., 2015].

Strongly/Weakly Observable Graphs.
For a directed graph $G = ([K], E)$, a node $i$ is observable if $N^{\mathrm{in}}(G, i) \neq \emptyset$. An observable node is strongly observable if either $i \in N^{\mathrm{in}}(G, i)$ or $N^{\mathrm{in}}(G, i) \supseteq [K] \setminus \{i\}$, and weakly observable otherwise. Similarly, a graph is observable if all its nodes are observable. An observable graph is strongly observable if all nodes are strongly observable, and weakly observable otherwise. Self-aware graphs are a special type of strongly observable graphs in which $i \in N^{\mathrm{in}}(G, i)$ for all $i \in [K]$.

Independent Set and Weakly Dominating Set. An independent set of a directed graph is a subset of nodes in which no two distinct nodes are connected. The size of the largest independent set of a graph is called its independence number. For a weakly observable graph $G = ([K], E)$, a weakly dominating set is a subset of nodes $D \subseteq [K]$ such that for any node $j$ in $G$ without a self-loop, there exists $i \in D$ such that the directed edge $(i, j) \in E$. The size of the smallest weakly dominating set of a graph is called its weak domination number.

Alon et al. [2015] show that in non-contextual bandits with a fixed feedback graph $G$, the minimax regret is $\widetilde{\Theta}(\sqrt{\alpha T})$ when $G$ is strongly observable and $\widetilde{\Theta}(d^{1/3} T^{2/3})$ when $G$ is weakly observable, where $\alpha$ and $d$ are the independence number and the weak domination number of $G$, respectively.

3.2.1 Strongly Observable Graphs

In the following theorem, we show the regret bound of Algorithm 1 for strongly observable graphs.

Theorem 3.2 (Strongly observable graphs). Suppose that the feedback graph $G_t$ is deterministic and strongly observable with independence number no more than $\alpha$. Then Algorithm 1 guarantees that $\mathrm{dec}_\gamma(p_t; \hat{f}_t, x_t, G_t) \leq O\big(\frac{\alpha \log(K\gamma)}{\gamma}\big)$.

In contrast to existing works that derive a closed-form solution of $p_t$ in order to show how large the DEC can be [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021], in our case we prove the upper bound on $\mathrm{dec}_\gamma(p_t; \hat{f}_t, x_t, G_t)$ by using Sion's minimax theorem and a graph-theoretic lemma proven in [Alon et al., 2015]. The proof is deferred to Appendix A.1. Combining Theorem 3.2 and Theorem 3.1, we directly have the following corollary:

Corollary 3.3. Suppose that $G_t$ is deterministic, strongly observable, and has independence number no more than $\alpha$ for all $t \in [T]$. Algorithm 1 with the choice $\gamma = \max\{4, \sqrt{\alpha T / \mathrm{Reg}_{\mathrm{Sq}}}\}$ guarantees that $\mathbb{E}[\mathrm{Reg}_{\mathrm{CB}}] \leq \widetilde{O}\big(\sqrt{\alpha T \cdot \mathrm{Reg}_{\mathrm{Sq}}}\big)$.

For conciseness, we show in Corollary 3.3 that the regret guarantee for Algorithm 1 depends on the largest independence number of $G_t$ over $t \in [T]$. However, we are in fact able to achieve a more adaptive regret bound of order $\widetilde{O}\big(\sqrt{\sum_{t=1}^T \alpha_t \cdot \mathrm{Reg}_{\mathrm{Sq}}}\big)$, where $\alpha_t$ is the independence number of $G_t$. It is straightforward to achieve this by applying a standard doubling trick on the quantity $\sum_{t=1}^T \alpha_t$, assuming we can compute $\alpha_t$ given $G_t$, but we take one step further and show that it is in fact unnecessary to compute $\alpha_t$ (which, after all, is NP-hard [Karp, 1972]): we provide an adaptive tuning strategy for $\gamma$ by keeping track of the cumulative value of the quantity $\min_{p \in \Delta([K])} \mathrm{dec}_\gamma(p; \hat{f}_t, x_t, G_t)$ and show that this efficient method also achieves the adaptive $\widetilde{O}\big(\sqrt{\sum_{t=1}^T \alpha_t \cdot \mathrm{Reg}_{\mathrm{Sq}}}\big)$ regret guarantee; see Appendix D for details.

3.2.2 Weakly Observable Graphs

For weakly observable graphs, we have the following theorem.

Theorem 3.4 (Weakly observable graphs). Suppose that the feedback graph $G_t$ is deterministic and weakly observable with weak domination number no more than $d$.
Then Algorithm 1 with γ 16d guarantees that decγ(pt; bft, xt, Gt) O d γ + eα log(Kγ) where eα is the independence number of the subgraph induced by nodes with self-loops in Gt. The proof is deferred to Appendix A.2. Similar to Theorem 3.2, we do not derive a closed-form solution to the strategy pt but prove this upper bound using the minimax theorem. Combining Theorem 3.4 and Theorem 3.1, we are able to obtain the following regret bound for weakly observable graphs, whose proof is deferred to Appendix A.2. Corollary 3.5. Suppose that Gt is deterministic, weakly observable, and has weak domination number no more than d for all t [T]. In addition, suppose that the independence number of the subgraph induced by nodes with self-loops in Gt is no more than eα for all t [T]. Then, Algorithm 1 with γ = max{16d, q eαT/Reg Sq, d 1 3 T 2 3 Reg 2 3 Sq } guarantees that E[Reg CB] e O d 1 3 T 2 3 Reg eαTReg Sq . Similarly to the strongly observable graph case, we also derive an adaptive tuning strategy for γ to achieve a more refined regret bound e O q PT t=1 eαt Reg Sq + PT t=1 dt 2 is the independence number of the subgraph induced by nodes with self-loops in Gt and dt is the weakly domination number of Gt. This is again achieved without explicitly computing eαt and dt; see Appendix D for details. 3.3 Implementation In this section, we show that solving argminp (K) decγ(p; bf, x, G) in Algorithm 1 is equivalent to solving a convex program, which can be easily and efficiently implemented in practice. Theorem 3.6. Solving argminp (K) decγ(p; bf, x, G) is equivalent to solving the following convex optimization problem. min p (K),z p bf + z (5) subject to a [K] : 1 γ p ea 2 diag(G p) 1 bf(x, a) + z, where bf in the objective is a shorthand for bf(x, ) RK, ea is the a-th standard basis vector, and means element-wise greater. The proof is deferred to Appendix A.4. Note that this implementation is not restricted to the deterministic feedback graphs but applies to the general stochastic feedback graph case. In Appendix A.3, we provide the 20 lines of Python code that solves Eq. (5). 4 Examples with Closed-Form Solutions In this section, we present examples and corresponding closed-form solutions of p that make the value decγ(p; bf, x, G) upper bounded by at most a constant factor of minp decγ(p; bf, x, G). This offers an alternative to solving the convex program defined in Theorem 3.6 for special (and practically relevant) cases, thereby enhancing the efficiency of our algorithm. All the proofs are deferred to Appendix B. Cops-and-Robbers Graph. The cops-and-robbers feedback graph GCR = 11 I, also known as the loopless clique, is the full feedback graph removing self-loops. Therefore, GCR is strongly observable with independence number α = 1. Let a1 be the node with the smallest value of bf and a2 be the node with the second smallest value of bf. Our proposed closed-form distribution p is only supported on {a1, a2} and defined as follows: 2 + γ( bfa2 bfa1) , pa2 = 1 2 + γ( bfa2 bfa1) . (6) In the following proposition, we show that with the construction of p in Eq. (6), decγ(p; bf, x, GCR) is upper bounded by O(1/γ), which matches the order of minp decγ(p; bf, x, G) based on Theorem 3.2 since α = 1. Proposition 1. When G = GCR, given any bf, context x, the closed-form distribution p in Eq. (6) guarantees that decγ(p; bf, x, GCR) O 1 γ . Apple Tasting Graph. The apple tasting feedback graph GAT = 1 1 0 0 consists of two nodes, where the first node reveals all and the second node reveals nothing. 
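Whether or not such a closed form is available for a particular graph, the general route is the convex program of Theorem 3.6. As a concrete illustration, here is a minimal CVXPY sketch of Eq. (5): minimize $p^\top \hat{f} + z$ over $p \in \Delta([K])$ and $z$, subject to $\frac{1}{\gamma}\|p - e_a\|^2_{\mathrm{diag}(G^\top p)^{-1}} \leq \hat{f}(x, a) + z$ for every $a$. This is a sketch only: the function name is ours, and unlike the authors' Appendix A.3 listing it does not apply the $\gamma$-rescaling recommended there for numerical stability at large $\gamma$.

```python
import cvxpy as cp
import numpy as np

def solve_dec_program(fhat, G, gamma):
    """Sketch of the convex program in Eq. (5); returns an action distribution p.

    fhat : length-K array of oracle predictions fhat(x, .)
    G    : K-by-K feedback graph; (G.T @ p)[i] is the probability that action i's
           loss is observed when the played action is drawn from p.
    """
    K = len(fhat)
    p = cp.Variable(K, nonneg=True)
    z = cp.Variable()
    W = G.T @ p                                   # per-action observation probabilities under p
    constraints = [cp.sum(p) == 1]
    for a in range(K):
        e_a = np.zeros(K); e_a[a] = 1.0
        # (1/gamma) * sum_i (p_i - e_{a,i})^2 / W_i  <=  fhat[a] + z
        lhs = cp.sum([cp.quad_over_lin(p[i] - e_a[i], W[i]) for i in range(K)]) / gamma
        constraints.append(lhs <= fhat[a] + z)
    prob = cp.Problem(cp.Minimize(fhat @ p + z), constraints)
    prob.solve()
    return np.asarray(p.value)

# Example: bandit feedback (identity graph) over 3 actions.
p = solve_dec_program(np.array([0.1, 0.5, 0.9]), np.eye(3), gamma=10.0)
print(p.round(3), p.sum())
```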
This scenario was originally proposed by Helmbold et al. [2000] and recently denoted the spam filtering graph [van der Hoeven et al., 2021]. The independence number of GAT is 1. Let bf1 be the oracle prediction for the first node and let bf2 be the prediction for the second node. We present a closed-form solution p for Eq. (4) as follows: ( 1 bf1 bf2 2 4+γ( b f1 b f2) bf1 > bf2 , p2 = 1 p1. (7) We show that this distribution p satisfies that decγ(p; bf, x, GAT) is upper bounded by O(1/γ) in the following proposition. We remark that directly applying results from [Foster et al., 2021] cannot lead to a valid upper bound since the second node does not have a self-loop. Proposition 2. When G = GAT, given any bf, context x, the closed-form distribution p in Eq. (7) guarantees that decγ(p; bf, x, GAT) O( 1 Inventory Graph. In this application, the algorithm needs to decide the inventory level in order to fulfill the realized demand arriving at each round. Specifically, there are K possible chosen inventory levels a1 < a2 < . . . < a K and the feedback graph Ginv has entries G(i, j) = 1 for all 1 j i K and G(i, j) = 0 otherwise, meaning that picking the inventory level ai informs about all actions aj i. This is because items are consumed until either the demand or the inventory is exhausted. The independence number of Ginv is 1. Therefore, (very) large K is statistically tractable, but naively solving the convex program Eq. (5) requires superlinear in K computational cost. We show in the following proposition that there exists an analytic form of p, which guarantees that decγ(p; bf, x, Ginv) can be bounded by O(1/γ). Proposition 3. When G = Ginv, given any bf, context x, there exists a closed-form distribution p (K) guaranteeing that decγ(p; bf, x, Ginv) O( 1 γ ), where p is defined as follows: pj = max{ 1 1+γ( b fj mini b fi) P j >j pj , 0} for all j [K]. Undirected Self-Aware Graph. For the undirected and self-aware feedback graph G, which means that G is symmetric and has diagonal entries all 1, we also show that a certain closed-form solution of p satisfies that decγ(p; bf, x, G) is bounded by O( α Proposition 4. When G is an undirected self-aware graph, given any bf, context x, there exists a closed-form distribution p (K) guaranteeing that decγ(p; bf, x, G) O α γ . 5 Experiments In this section, we use empirical results to demonstrate the significant benefits of Square CB.G in leveraging the graph feedback structure and its superior effectiveness compared to Square CB. Following Foster and Krishnamurthy [2021], we use progressive validation (PV) loss as the evaluation metric, defined as Lpv(T) = 1 T PT t=1 ℓt,at. All the feedback graphs used in the experiments are deterministic. We run experiments on CPU Intel Xeon Gold 6240R 2.4G and the convex program solver is implemented via Vowpal Wabbit [Langford et al., 2007]. Figure 1: Left figure: Performance of Square CB.G on RCV1 dataset under three different feedback graphs. Right figure: Performance comparison between Square CB.G and Square CB under random directed self-aware feedback graphs. 5.1 Square CB.G under Different Feedback Graphs In this subsection, we show that our Square CB.G benefits from considering the graph structure by evaluating the performance of Square CB.G under three different feedback graphs. We conduct experiments on RCV1 dataset and leave the implementation details in Appendix C.1. 
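The inventory experiments in Section 5.2.2 below use the closed form of Proposition 3 rather than the convex program (see Appendix C.2.2), since it needs only O(K) work per round. A minimal sketch of that closed form, with a function name of our choosing, fills the distribution from the largest inventory level downward:

```python
import numpy as np

def inventory_closed_form(fhat, gamma):
    """Closed-form distribution from Proposition 3 for the inventory feedback graph.

    The cumulative mass placed on levels >= j is capped at
    1 / (1 + gamma * (fhat[j] - fhat.min())), so a single backward pass suffices.
    """
    K = len(fhat)
    shifted = fhat - fhat.min()
    p = np.zeros(K)
    tail = 0.0                       # sum of p[j'] for j' > j
    for j in range(K - 1, -1, -1):   # j = K-1, ..., 0
        p[j] = max(1.0 / (1.0 + gamma * shifted[j]) - tail, 0.0)
        tail += p[j]
    return p

# Example with 5 inventory levels: mass concentrates on the level with the smallest predicted loss.
p = inventory_closed_form(np.array([0.9, 0.2, 0.4, 0.7, 0.8]), gamma=20.0)
print(p, p.sum())
```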
The performances of Square CB.G under bandit graph, full information graph and cops-and-robbers graph are shown in the left part of Figure 1. We observe that Square CB.G performs the best under full information graph and performs worst under bandit graph. Under the cops-and-robbers graph, much of the gap between bandit and full information is eliminated. This improvement demonstrates the benefit of utilizing graph feedback for accelerating learning. 5.2 Comparison between Square CB.G and Square CB In this subsection, we compare the effectiveness of Square CB.G with the Square CB algorithm. To ensure a fair comparison, both algorithms update the regressor using the same feedbacks based on the graph. The only distinction lies in how they calculate the action probability distribution. We summarize the main results here and leave the implementation details in Appendix C.2. 5.2.1 Results on Random Directed Self-aware Graphs We conduct experiments on RCV1 dataset using random directed self-aware feedback graphs. Specifically, at round t, the deterministic feedback graph Gt is generated as follows. The diagonal elements of Gt are all 1, and each off-diagonal entry is drawn from a Bernoulli(3/4) distribution. The results are presented in the right part of Figure 1. Our Square CB.G consistently outperforms Square CB and demonstrates lower variance, particularly when the number of iterations was small. This is because when there are fewer samples available to train the regressor, it is more crucial to design an effective algorithm that can leverage the graph feedback information. 5.2.2 Results on Synthetic Inventory Dataset In the inventory graph experiments, we create a synthetic inventory dataset and design a loss function for each inventory level at [0, 1] with demand dt [0, 1]. Since the action set [0, 1] is continuous, we discretize the action set in two different ways to apply the algorithms. Fixed discretized action set. In this setting, we discretize the action set using fixed grid size ε { 1 100, 1 300, 1 500}, which leads to a finite action set A of size 1 ε + 1. Note that according to Theorem 3.2, our regret does not scale with the size of the action set (to within polylog factors), as the independence number is always 1. The results are shown in the left part of Figure 2. Figure 2: Performance comparison between Square CB.G and Square CB on synthetic inventory dataset. Left figure: Results under fixed discretized action set. Right figure: Results under adaptive discretization of the action set. Both figures show the superiority of Square CB.G compared with Square CB. We remark several observations from the results. First, our algorithm Square CB.G outperforms Square CB for all choices K {101, 301, 501}. This indicates that Square CB.G utilizes a better exploration scheme and effectively leverages the structure of Ginv. Second, we observe that Square CB.G indeed does not scale with the size of the discretized action set A, since under different discretization scales, Square CB.G has similar performances and the slight differences are from the improved approximation error with finer discretization. This matches the theoretical guarantee that we prove in Theorem 3.2. On the other hand, Square CB does perform worse when the size of the action set increases, matching its theoretical guarantee which scales with the square root of the size of the action set. Adaptively changing action set. 
In this setting, we adaptively discretize the action set [0, 1] according to the index of the current round. Specifically, for Square CB.G, we uniformly discretize the action set [0, 1] with size t , whose total discretization error is O( T) due to the Lipschitzness of the loss function. For Square CB, to optimally balance the dependency on the size of the action set and the discretization error, we uniformly discretize the action set [0, 1] into t 1 3 actions. The results are illustrated in the right part of Figure 2. We can observe that Square CB.G consistently outperforms Square CB by a clear margin. 6 Related Work Multi-armed bandits with feedback graphs have been extensively studied. An early example is the apple tasting problem of Helmbold et al. [2000]. The general formulation was introduced by Mannor and Shamir [2011]. Alon et al. [2015] characterized the minimax rates in terms of graphtheoretic quantities. Follow-on work includes relaxing the assumption that the graph is observed prior to decision [Cohen et al., 2016]; analyzing the distinction between the stochastic and adversarial settings [Alon et al., 2017]; considering stochastic feedback graphs [Li et al., 2020, Esposito et al., 2022]; instance-adaptivity [Ito et al., 2022]; data-dependent regret bound [Lykouris et al., 2018, Lee et al., 2020]; and high-probability regret under adaptive adversary [Neu, 2015, Luo et al., 2023]. The contextual bandit problem with feedback graphs has received relatively less attention. Wang et al. [2021] provide algorithms for adversarial linear bandits with uninformed graphs and stochastic contexts. However, this work assumes several unrealistic assumptions on both the policy class and the context space and is not comparable to our setting, since we consider the informed graph setting with adversarial context. Singh et al. [2020] study a stochastic linear bandits with informed feedback graphs and are able to improve over the instance-optimal regret bound for bandits derived in [Lattimore and Szepesvari, 2017] by utilizing the additional graph-based feedbacks. Our work is also closely related to the recent progress in designing efficient algorithms for classic contextual bandits. Starting from [Langford and Zhang, 2007], numerous works have been done to the development of practically efficient algorithms, which are based on reduction to either cost-sensitive classification oracles [Dudik et al., 2011, Agarwal et al., 2014] or online regression oracles [Foster and Rakhlin, 2020, Foster et al., 2020, 2021, Zhu and Mineiro, 2022]. Following the latter trend, our work assumes access to an online regression oracle and extends the classic bandit problems to the bandits with general feedback graphs. 7 Discussion In this paper, we consider the design of practical contextual bandits algorithm with provable guarantees. Specifically, we propose the first efficient algorithm that achieves near-optimal regret bound for contextual bandits with general directed feedback graphs with an online regression oracle. While we study the informed graph feedback setting, where the entire feedback graph is exposed to the algorithm prior to each decision, many practical problems of interest are possibly uninformed graph feedback problems, where the graph is unknown at the decision time. It is unclear how to formulate an analogous minimax problem to Eq. (1) under the uninformed setting. 
One idea is to consume the additional feedback in the online regressor and adjust the prediction loss to reflect this additional structure, e.g., using the more general version of the E2D framework which incorporates arbitrary side observations [Foster et al., 2021]. Cohen et al. [2016] consider this uninformed setting in the non-contextual case and prove a sharp distinction between the adversarial and stochastic settings: even if the graphs are all strongly observable with bounded independence number, in the adversarial setting the minimax regret is Θ(T) whereas in the stochastic setting the minimax regret is Θ( αT). Intriguingly, our setting is semi-adversarial due to realizability of the mean loss, and therefore it is apriori unclear whether the negative adversarial result applies. In addition, bandits with graph feedback problems often present with associated policy constraints, e.g., for the apple tasting problem, it is natural to rate limit the informative action. Therefore, another interesting direction is to combine our algorithm with the recent progress in contextual bandits with knapsack [Slivkins and Foster, 2022], leading to more practical algorithms. Acknowledgments HL and MZ are supported by NSF Awards IIS-1943607. Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert Schapire. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pages 19 26. PMLR, 2012. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638 1646. PMLR, 2014. Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems, 26, 2013. Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23 35. PMLR, 2015. Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785 1826, 2017. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1 8. IEEE, 2020. Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811 819. PMLR, 2016. Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169 178, 2011. Emmanuel Esposito, Federico Fusco, Dirk van der Hoeven, and Nicolò Cesa-Bianchi. Learning on the edge: Online learning with stochastic feedback graphs. In Advances in Neural Information Processing Systems, 2022. Dylan Foster and Alexander Rakhlin. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199 3210. PMLR, 2020. Dylan Foster, Alekh Agarwal, Miroslav Dudik, Haipeng Luo, and Robert Schapire. 
Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539 1548. PMLR, 2018. Dylan J Foster and Akshay Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Advances in Neural Information Processing Systems, 34: 18907 18919, 2021. Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33:11478 11489, 2020. Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021. David P Helmbold, Nicholas Littlestone, and Philip M Long. Apple tasting. Information and Computation, 161(2):85 139, 2000. Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs. In Advances in Neural Information Processing Systems, 2022. Richard M Karp. Reducibility among combinatorial problems. page 85 103, 1972. John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96 1, 2007. John Langford, Lihong Li, and Alex Strehl. Vowpal wabbit online learning project, 2007. Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728 737. PMLR, 2017. Chung-Wei Lee, Haipeng Luo, and Mengxiao Zhang. A closer look at small-loss bounds for bandits with graph feedback. In Conference on Learning Theory, pages 2516 2564. PMLR, 2020. David D Lewis, Yiming Yang, Tony Russell-Rose, and Fan Li. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361 397, 2004. Shuai Li, Wei Chen, Zheng Wen, and Kwong-Sak Leung. Stochastic online learning with probabilistic graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4675 4682, 2020. Haipeng Luo, Hanghang Tong, Mengxiao Zhang, and Yuheng Zhang. Improved high-probability regret for adversarial bandits with time-varying feedback graphs. In International Conference on Algorithmic Learning Theory, pages 1074 1100. PMLR, 2023. Thodoris Lykouris, Karthik Sridharan, and Éva Tardos. Small-loss bounds for online learning with partial information. In Conference on Learning Theory, pages 979 986. PMLR, 2018. Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24, 2011. Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 28, 2015. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. Mark Rucker, Joran T Ash, John Langford, Paul Mineiro, and Ida Momennejad. Eigen memory tree. ar Xiv preprint ar Xiv:2210.14077, 2022. David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021. Rahul Singh, Fang Liu, Xin Liu, and Ness Shroff. Contextual bandits with side-observations. 
ar Xiv preprint ar Xiv:2006.03951, 2020. Aleksandrs Slivkins and Dylan Foster. Efficient contextual bandits with knapsacks via regression. ar Xiv preprint ar Xiv:2211.07484, 2022. Dirk van der Hoeven, Federico Fusco, and Nicolò Cesa-Bianchi. Beyond bandit feedback in online multiclass classification. Advances in Neural Information Processing Systems, 34:13280 13291, 2021. Vladimir G Vovk. A game of prediction with expert advice. In Proceedings of the eighth annual conference on Computational learning theory, pages 51 60, 1995. Lingda Wang, Bingcong Li, Huozhi Zhou, Georgios B Giannakis, Lav R Varshney, and Zhizhen Zhao. Adversarial linear contextual bandits with graph-structured side observations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10156 10164, 2021. Yinglun Zhu and Paul Mineiro. Contextual bandits with smooth regret: Efficient learning in continuous action spaces. In International Conference on Machine Learning, pages 27574 27590. PMLR, 2022. A Omitted Details in Section 3 Theorem 3.1. Suppose decγ(pt; bft, xt, Gt) Cγ β for all t [T] and some β > 0, Algorithm 1 with γ = max{4, (CT) 1 β+1 Reg 1 β+1 Sq } guarantees that E [Reg CB] O C 1 β+1 T 1 β+1 Reg Proof. Following [Foster et al., 2020], we decompose Reg CB as follows: t=1 f (xt, at) t=1 f (xt, π (xt)) f (xt, at) f (xt, π (xt)) γ 4 EA Gt( |at) bft(xt, a) f (xt, a) 2 #!# t=1 EA Gt( |at) bft(xt, a) f (xt, a) 2 ## t=1 max a [K] f (X [K]7 R) f(xt, at) f(xt, a ) γ 4 EA Gt( |at) bft(xt, a) f(xt, a) 2 ## t=1 EA Gt( |at) bft(xt, a) f (xt, a) 2 ## decγ(pt; bft, xt, Gt) t=1 EA Gt( |at) bft(xt, a) f (xt, a) 2 ## t=1 EA Gt( |at) bft(xt, a) f (xt, a) 2 ## Next, since E[ℓt,a | xt] = f (xt, a) for all t [T] and a A, we know that t=1 EA Gt( |at) bft(xt, a) f (xt, a) 2 ## t=1 EA Gt( |at) bft(xt, a) ℓt,a 2 X a A (f (xt, a) ℓt,a)2 ## Reg Sq, (9) where the final inequality is due to Assumption 2. Therefore, we have E[Reg CB] CTγ β + γ Picking γ = max 4, CT Reg Sq 1 β+1 , we obtain that E [Reg CB] O C 1 β+1 T 1 β+1 Reg A.1 Proof of Theorem 3.2 Before proving Theorem 3.2, we first show the following key lemma, which is useful in proving that decγ(p; bf, x, G) is convex for both strongly and weakly observable feedback graphs G. We highlight that the convexity of decγ(p; bf, x, G) is crucial for both proving the upper bound of minp (K) decγ(p; bf, x, G) and showing the efficiency of Algorithm 1. Lemma A.1. Suppose u, v, x Rd with v, x > 0. Then both g(x) = u,x 2 v,x and h(x) = (1 u,x )2 v,x are convex in x. Proof. The function f(x, y) = x2/y is convex for y > 0 due to 2f(x, y) = 2 By composition with affine functions, both g(x) = f( u, x , v, x ) and h(x) = f(1 u, x , v, x ) are convex. Theorem 3.2 (Strongly observable graphs). Suppose that the feedback graph Gt is deterministic and strongly observable with independence number no more than α. Then Algorithm 1 guarantees that decγ(pt; bft, xt, Gt) O α log(Kγ) Proof. For conciseness, we omit the subscript t. Direct calculation shows that for all a [K], f (x, a) f (x, a ) γ a Nin(G,a) ( bf(x, a ) f (x, a ))2 a=1 paf (x, a) f (x, a ) γ a=1 Wa bf(x, a) f (x, a) 2 , where Wa = P a Nin(G,a) pa . Therefore, taking the gradient over f (x, ) and we know that sup f (X [K] 7 R) a=1 paf (x, a) f (x, a ) γ a=1 Wa bf(x, a) f (x, a) 2 # a=1 pa bf(x, a) bf(x, a ) + 1 γ p ea 2 diag(W ) 1. 
Then, denote bf RK to be bf(x, ) and consider the following minimax form: inf p (K) sup a A a=1 pa bf(x, a) bf(x, a ) + 1 γ p ea 2 diag(W ) 1 = min p (K) max a A a=1 pa bf(x, a) bf(x, a ) + 1 p2 a Wa + 1 = min p K max q K a=1 (pa qa) bfa + 1 = max q K min p K a=1 (pa qa) bfa + 1 where the last equality is due to Sion s minimax theorem and the fact that Eq. (10) is convex in p (K) by applying Lemma A.1 with u = ea and v = ga for each a [K], where ga {0, 1}K is defined as ga,i = 1{(i, a) E}, G = ([K], E), i [K]. Choose pa = (1 1 γ )qa + 1 γK for all a [K]. Let S be the set of nodes in [K] that have a self-loop. Then we can upper bound the value above as follows: max q (K) min p (K) a=1 (pa qa) bfa + 1 γ )qa + 1 γK 2 (1 qa) + qa 1 (1 1 γ )qa 1 γK 2 γ )2q2 a + 1 γ2K2 (1 qa) + qa 1 (1 1 2q2 a(1 qa) + 2qa (1 qa)2 + 2q3 a γ2 Wa j N in(G,a) pj 1 γK for all a [K]) = max q (K) q3 a Wa + 2 a S q2 a + 2 (if a / S, Wa = 1 pa K 1 Next we bound 2qa(1 qa) Wa for each a [K]. If a [K]\S, we have Wa = 1 pa and Wa 2qa(1 qa) 1 (1 1 γ )qa 1 γK 2qa(1 qa) (1 1 γ )(1 qa) + K 1 γ qa 4qa. (14) If a S, we know that 2qa(1 qa) P j Nin(G,a)((1 1 γ )qj + 1 γK ) γ )qa + 1 γK )(1 qa) P j Nin(G,a)((1 1 γ )qj + 1 γK ) γ )qa + 1 γK ) P j N in(G,a)((1 1 γ )qj + 1 γK ) O(α log(Kγ)), (15) where the last inequality is due to Lemma 5 in Alon et al. [2015]. We include this lemma (Lemma E.1) for completeness. Combining all the above inequalities, we obtain that inf p (K) sup a A a=1 pa bf(x, a) bf(x, a ) + 1 γ p ea 2 diag(W ) 1 = max q (K) min p (K) a=1 (pa qa) bfa + 1 O α log(Kγ) A.2 Proof of Theorem 3.4 Theorem 3.4 (Weakly observable graphs). Suppose that the feedback graph Gt is deterministic and weakly observable with weak domination number no more than d. Then Algorithm 1 with γ 16d guarantees that decγ(pt; bft, xt, Gt) O d γ + eα log(Kγ) where eα is the independence number of the subgraph induced by nodes with self-loops in Gt. Proof. Similar to the strongly observable graphs setting, for weakly observable graphs, we know that decγ(p; bf, x, G) = max q K min p K a=1 (pa qa) bfa + 1 Choose pa = (1 1 γ ηd)qa + 1 γK + η1{a D} where D with |D| = d is the minimum weak dominating set of G and 0 < η 1 4d is some parameter to be chosen later. Substituting the form of p to Eq. (16) and using the fact that | bfa| 1 for all a [K], we can obtain that decγ(p; bf, x, G) ( 2 γ + ηd + 1 Then we can upper bound the value above as follows: decγ(p; bf, x, G) 2 γ + ηd + 1 γ ηd)qa + 1 γK + η1{a D} 2 (1 qa) 2 γ + ηd + 1 qa + 1 γK 2 (1 qa) + qa (1 qa) + 1 γ qa + ηdqa 2 qa + 1 γK + η 2 (1 qa) + qa (1 qa) + 1 γ qa + ηdqa 2 2 γ + ηd + 1 2 q2 a + 1 γ2K2 (1 qa) + 3qa (1 qa)2 + q2 a γ2 + η2d2q2 a 3 q2 a + 1 γ2K2 + η2 (1 qa) + 3qa (1 qa)2 + q2 a γ2 + η2d2q2 a Now consider a D. If a S, then we know that Wa η; Otherwise, we know that this node can be observed by at least one node in D, meaning that Wa η. Combining the two cases above, we know that 3 q2 a + 1 γ2K2 + η2 (1 qa) + 3qa (1 qa)2 + 1 γ2 q2 a + η2d2q2 a q2 a + 1 γ2K2 + η2 (1 qa) + qa (1 qa)2 + 1 γ2 q2 a + η2d2q2 a qa q2 a + 1 γ2 q3 a + η2d2q3 a + 1 γ2K2 + η2 (η 1 4d and γ 16d) where the last inequality is because η 1 4d and γ 16d. Consider a / D. Let S0 be the set of nodes which either have a self loop or can be observed by all the other node. Recall that S represents the set of nodes with a self-loop. Then similar to the derivation of Eq. 
(13), we know that for a S0, 2 q2 a + 1 γ2K2 (1 qa) + 3qa (1 qa)2 + q2 a γ2 + η2d2q2 a 2q2 a(1 qa) + 3qa (1 qa)2 + q2 a γ2 + η2d2q2 a γ2 + 1 ηγ3K (Wa 1 γK if a S and Wa η if a [K]\S) a S,a/ D q2 a + 1 a S0,a/ D S γ2 + 1 ηγ3K a S,a/ D η2d2q2 a + 1 a S0,a/ D S (for a S0, a / S, Wa max{ K 1 For a / S0, we know that Wa η. Therefore, 2 q2 a + 1 γ2K2 (1 qa) + 3qa (1 qa)2 + q2 a γ2 + η2d2q2 a 2 q2 a + 1 γ2K2 (1 qa) + 3qa (1 qa)2 + q2 a γ2 + 1 2qa(1 qa) + 1 γ2K2 + 2q3 a γ2 + 3 Plugging Eq. (18), Eq. (19), and Eq. (20) to Eq. (17), we obtain that decγ(p; bf, x, G) O Consider the last term. If a S0\S, similar to Eq. (14), we know that Wa qa(1 qa) 1 (1 1 γ dη)qa 1 γK qa(1 qa) (1 1 γ ηd)(1 qa) 1 1 1 γ ηdqa O(qa), where the last inequality is due to γ 16d and η 1 4d. If a S, similar to Eq. (15), we know that X j Nin(G,a)((1 1 γ ηd)qj + 1 γK ) γ ηd)qa + 1 γK )(1 qa) P j Nin(G,a)((1 1 γ ηd)qj + 1 γK ) γ ηd)qa + 1 γK j N in(G,a) (1 1 γ ηd)qj + 1 γK (γ 4, η 1 4d) O(eα log(Kγ)), (22) where the last inequality is again due to Lemma 5 in [Alon et al., 2015] and eα is the independence number of the subgraph induced by nodes with self-loops in G. Plugging Eq. (22) to Eq. (21) gives decγ(p; bf, x, G) O ηd + 1 γη + eα log(Kγ) Picking η = q 1 γd 1 4d proves the result. Next, we prove Corollary 3.5 by combining Theorem 3.4 and Theorem 3.1. Corollary 3.5. Suppose that Gt is deterministic, weakly observable, and has weak domination number no more than d for all t [T]. In addition, suppose that the independence number of the subgraph induced by nodes with self-loops in Gt is no more than eα for all t [T]. Then, Algorithm 1 with γ = max{16d, q eαT/Reg Sq, d 1 3 T 2 3 Reg 2 3 Sq } guarantees that E[Reg CB] e O d 1 3 T 2 3 Reg eαTReg Sq . Proof. Combining Eq. (8), Eq. (9) and Theorem 3.4, we can bound Reg CB as follows: E[Reg CB] O d γ T + eαT log(Kγ) γ + γReg CB Picking γ = max n 16d, q eαT/Reg Sq, d 1 3 T 2 3 Reg 2 3 Sq o finishes the proof. A.3 Python Solution to Eq. (5) def make Problem(nactions): import cvxpy as cp sqrtgamma G = cp.Parameter((nactions, nactions), nonneg=True) sqrtgammafhat = cp.Parameter(nactions) p = cp.Variable(nactions, nonneg=True) sqrtgammaz = cp.Variable() objective = cp.Minimize(sqrtgammafhat @ p + sqrtgammaz) constraints = [ cp.sum(p) == 1 ] + [ cp.sum([ cp.quad_over_lin(eai - pi, vi) for i, (pi, vi) in enumerate(zip(p, v)) for eai in (1 if i == a else 0,) ]) <= sqrtgammafhata + sqrtgammaz for v in (sqrtgamma G @ p,) for a, sqrtgammafhata in enumerate(sqrtgammafhat) ] problem = cp.Problem(objective, constraints) assert problem.is_dcp(dpp=True) # proof of convexity return problem, sqrtgamma G, sqrtgammafhat, p, sqrtgammaz This particular formulation multiplies both sides of the constraint in Eq. (5) by γ while scaling the objective by γ. While mathematically equivalent to Eq. (5), empirically it has superior numerical stability for large γ. For additional stability, when using this routine we recommend subtracting off the minimum value from bf, which is equivalent to making the substitutions γ bf γ bf γ mina bfa and z z + γ mina bfa and then exploiting the 1 p = 1 constraint. A.4 Proof of Theorem 3.6 Theorem 3.6. Solving argminp (K) decγ(p; bf, x, G) is equivalent to solving the following convex optimization problem. min p (K),z p bf + z (5) subject to a [K] : 1 γ p ea 2 diag(G p) 1 bf(x, a) + z, where bf in the objective is a shorthand for bf(x, ) RK, ea is the a-th standard basis vector, and means element-wise greater. Proof. Denote f = f (x, ) RK. 
Note that according to the definition of G, we know that (G p)i denotes the probability that action i s loss is revealed when the selected action a is sampled from distribution p. Then, we know that decγ(p; bf, x, G) = sup a [K] f RK 4 EA G( |at) bfa f a 2 ## = sup a [K] f RK (p ea ) f γ a [K] bf f 2 diag(G p) = sup a [K] (p ea ) bf + 1 γ p ea 2 diag(G p) 1 G p 0 = p bf + max a [K] γ p ea 2 diag(G p) 1 e a bf , where the third equality is by picking f to be the maximizer and introduces a constraint. Therefore, the minimization problem minp (K) decγ(p; bf, x, G) can be written as the following constrained optimization by variable substitution: min p (K),z p bf + z subject to a [K] : 1 γ p ea 2 diag(G p) 1 e a bf + z, The convexity of the constraints follows from Lemma A.1. B Omitted Details in Section 4 In this section, we provide proofs for Section 4. We define Wa := P a Nin(G,a) pa to be the probability that the loss of action a is revealed when selecting an action from distribution p. Let bf = bf(x, ) RK and f = f(x, ) RK. Direct calculation shows that for any a [K], f = argmax f RK Ea p f(x, a) f(x, a ) γ a Nin(G,a) ( bft(x, a ) f(x, a ))2 γ diag(W) 1(p ea ) + bf. Therefore, substituting f into Eq. (4), we obtain that decγ(p; bf, x, G) = max a [K] p2 a Wa + 1 2pa + D p ea , bf E) Without loss of generality, we assume the mini [K] bfi = 0. This is because shifting bf by mini [K] bfi does not change the value of D p ea , bf E . In the following sections, we provide proofs showing that a certain closed-form of p leads to optimal decγ(p; bf, x, G) up to constant factors for several specific types of feedback graphs, respectively. B.1 Cops-and-Robbers Graph Proposition 1. When G = GCR, given any bf, context x, the closed-form distribution p in Eq. (6) guarantees that decγ(p; bf, x, GCR) O 1 γ . Proof. We use the following notation for convenience: p1 := pa1, p2 := pa2, bf1 := bfa1 = 0, bf2 := bfa2. For the cops-and-robbers graph and closed-form solution p in Eq. (6), Eq. (23) becomes: decγ(p; bf, x, GCR) = max a [K] p2 1 1 p1 + (1 p1)2 + D p ea , bf E . If a = a1 and a = a2, we know that p2 1 1 p1 + (1 p1)2 + D p ea , bf E p2 1 1 p1 + (1 p1)2 p1 + 1 + p1 bf1 + p2 bf2 bfa p2 1 1 p1 + (1 p1)2 p1 + 1 p1 bf2 ( bfa bf2 bf1 = 0) 1 1 p1 + 1 + 1 p1 bf2 (p1 [ 1 2, 1], p1 p2 [0, 1 4 + γ bf2 1 1 If a = a2, we can obtain that p2 1 1 p1 + (1 p1)2 + D p ea , bf E p2 1 1 p1 + (1 p1)2 + p1 bf1 + p2 bf2 bf2 p2 1 1 p1 + (1 p1)2 p1 + 1 2(1 p1) p1 bf2 ( bf1 = 0) 1 1 p1 + 1 + 2 1 p1 bf2 (p1 [ 1 2, 1], p2 [0, 1 5 + γ bf2 1 1 bf2 (p1 = 1 2+γ b f2 ) If a = a1, we have 1 γ p2 1 1 p1 + (1 p1)2 + D p ea , bf E p2 1 1 p1 + (1 p1)2 + (1 p1) bf2 1 p1 + (1 p1)2 + (1 p1) bf2 + bf2 2 + γ bf2 (p1 [ 1 Putting everything together, we prove that decγ(p; bf, x, GCR) 6 B.2 Apple Tasting Graph Proposition 2. When G = GAT, given any bf, context x, the closed-form distribution p in Eq. (7) guarantees that decγ(p; bf, x, GAT) O( 1 Proof. For the apple tasting graph and closed-form solution p in Eq. (7), Eq. (23) becomes: decγ(p; bf, x, G) = max a [K] p1 + (1 p1)2 + D p ea , bf E . Suppose bf1 = 0, we know that p1 = 1, p2 = 0 and 1. If a = 1, we have 1 γ p1 + (1 p1)2 + D p ea , bf E = 0. 2. If a = 2, direct calculation shows that 1 γ p1 + (1 p1)2 + D p ea , bf E 2 Suppose bf2 = 0, we know that p1 = 2 4+γ b f1 , p2 = 1 p1 and 1. If a = 1, we have 1 γ p1 + (1 p1)2 + D p ea , bf E p1 + (1 p1)2 γp1 (1 p1) bf1 = (2 + γ bf1)2 γ(4 + γ bf1) (1 p1) bf1 γ + 2 bf1 4 + γ bf1 bf1 6 2. 
If a = 2, direct calculation shows that p1 + (1 p1)2 + D p ea , bf E = 2p1 γ + p1 bf1 1 γ + 2 bf1 4 + γ bf1 3 Putting everything together, we prove that decγ(p; bf, x, GAT) 6 B.3 Inventory Graph Proposition 3. When G = Ginv, given any bf, context x, there exists a closed-form distribution p (K) guaranteeing that decγ(p; bf, x, Ginv) O( 1 γ ), where p is defined as follows: pj = max{ 1 1+γ( b fj mini b fi) P j >j pj , 0} for all j [K]. Proof. Based on the distribution defined above, define A [K] to be the set such that for all i A, pi > 0 and denote N = |A|. We index each action in A by k1 < k2 < < k N = K. According to the definition of pi, we know that pi is strictly positive only when bfi < bfj for all j > i and specifically, when pi > 0, we know that Wi = P j i pj = 1 1+γ b fi (recall that mini bfi = 0 since we shift bf). Therefore, define Wk N+1 = 0 and we know that decγ(p; bf, x, Ginv) i=1 pki bfki + 1 p2 a Wa + max a [K] Wki Wki+1 bfki + 1 γ + max a [K] 1 + γ bfki 1 1 + γ bfki+1 ! bfki + max a [K] bfki bfki 1 1 + γ bfki + max a [K] According to Lemma 9 of [Alon et al., 2013] (included as Lemma E.2 for completeness), we know that bfki bfki 1 1 + γ bfki = 1 bfki bfki 1 1 γ + bfki mas(GA) where GA is the subgraph of G restricted to node set A and mas(G) is the size of the maximum acyclic subgraphs of G. It is direct to see that any subgraph G of Ginv has mas(G) = 1. Next, consider the value of a [K] that maximizes 1 2pa γWa bfa . If a k1, then we know that Wa = 1 and 1 2pa γ . Otherwise, suppose that ki < a ki+1 for some i [N 1]. According to the definition of p, if a = ki+1 we know that pa = 0 and 1 + γ bfa X j >a pj = Wki+1 = Wa . γWa bfa = 1 γWa bfa 1 Otherwise, Wa = Wki+1 and 1 2pa γWa bfa 1 γWki+1 bfki+1 = 1 γ . Combining the two cases above and Eq. (24), we obtain that decγ(p; bf, x, Ginv) 3 B.4 Undirected and Self-Aware Graphs Proposition 4. When G is an undirected self-aware graph, given any bf, context x, there exists a closed-form distribution p (K) guaranteeing that decγ(p; bf, x, G) O α γ . Proof. We first introduce the closed-form of p and then show that decγ(p; bf, x, G) O( α γ ). Specif- ically, we first sort bfa in an increasing order and choose a maximal independent set by choosing the nodes in a greedy way. Specifically, we pick k1 = argmini [K] bfi. Then, we ignore all the nodes that are connected to k1 and select the node a with the smallest bfa in the remaining node set. This forms a maximal independent set I [K], which has size no more than α and is also a dominating set. Set pa = 1 α+γ b fa for a I\{k1} and pk1 = 1 P a =k1,a I pa. This is a valid distribution as we only choose at most α nodes and pa 1/α for all a I\{k1}. Now we show that decγ(p; bf, x, G) O( α γ ). Specifically, we only need to show that with this choice of p, for any a [K], a=1 pa bfa bfa + 1 p2 a Wa + 1 2pa Plugging in the form of p, we know that a=1 pa bfa bfa + 1 p2 a Wa + 1 2pa bfa α + γ bfa bfa + 1 2pa γ (pa Wa for all a [K]) γ bfa + 1 2pa γWa . (|I| α) If a = k1, then we can obtain that 1 2pa γWa 1 γWk1 α α according to the definition of p. Otherwise, note that according to the choice of the maximal independent set I, Wa 1 α+γ b fa for some a I such that bfa bfa . Therefore, bfa + 1 2pa γWa bfa + 1 γWa bfa + α + γ bfa Combining the two inequalities above together proves the bound. C Implementation Details in Experiments C.1 Implementation Details in Section 5.1 We conduct experiments on RCV1 [Lewis et al., 2004], which is a multilabel text-categorization dataset. 
We use a subset of RCV1 containing 50000 samples and K = 50 sub-classes. Therefore, the feedback graph in our experiment has K = 50 nodes. We use the bag-of-words vector of each sample as the context with dimension d = 47236 and treat the text categories as the arms. In each round t, the learner receives the bag-of-words vector xt and makes a prediction at [K] as the text category. The loss is set to be ℓt,at = 0 if the sample belongs to the predicted category at and ℓt,at = 1 otherwise. The function class we consider is the following linear function class: F = {f : f(x, a) = Sigmoid((Mx)a), M RK d}, where Sigmoid(u) = 1 1+e u for any u R. The oracle is implemented by applying online gradient descent with learning rate η searched over {0.1, 0.2, 0.5, 1, 2, 4}. As suggested by [Foster and Krishnamurthy, 2021], we use a time-varying exploration parameter γt = c αt, where t is the index of the iteration, c is searched over {8, 16, 32, 64, 128}, and α is the independence number of the corresponding feedback graph. Our code is built on Py Torch framework [Paszke et al., 2019]. We run 5 independent experiments with different random seeds and plot the mean and standard deviation value of PV loss. C.2 Implementation Details in Section 5.2 C.2.1 Details for Results on Random Directed Self-aware Graphs We conduct experiments on a subset of RCV1 containing 10000 samples with K = 10 sub-classes. Our code is built on Vowpal Wabbit [Langford and Zhang, 2007]. For Sqaure CB, the exploration parameter γt at round t is set to be γt = c Kt, where t is the index of the round and c is the hyperparameter searched over set {8, 16, 32, 64, 128}. The remaining details are the same as described in Appendix C.1. C.2.2 Details for Results on Synthetic Inventory Dataset In this subsection, we introduce more details in the synthetic inventory data construction, loss function constructions, oracle implementation, and computation of the strategy at each round. Dataset. In this experiment, we create a synthetic inventory dataset constructed as follows. The dataset includes T data points, the t-th of which is represented as (xt, dt) where xt Rm is the context and dt is the realized demand given context xt. Specifically, in the experiment, we choose m = 100 and xt s are drawn i.i.d from Gaussian distribution with mean 0 and standard deviation 0.1. The demand dt is defined as dt = 1 mx t θ + εt, where θ Rm is an arbitrary vector and εt is a one-dimensional Gaussian random variable with mean 0.3 and standard deviation 0.1. After all the data points {(xt, dt)}T t=1 are constructed, we normalize dt to [0, 1] by setting dt dt mint [T ] dt maxt [T ] dt mint [T ] dt . In all our experiments, we set T = 10000. Loss construction. Next, we define the loss at round t when picking the inventory level at with demand dt, which is defined as follows: ℓt,at = h max{at dt, 0} + b max{dt at, 0}, (25) where h > 0 is the holding cost per remaining items and b > 0 is the backorder cost per remaining items. In the experiment, we set h = 0.25 and b = 1. Regression oracle. The function class we use in this experiment is as follows: F = {f : f(x, a) = h max{a (x θ + β), 0} + b max{x θ + β a, 0}, θ Rm, β R}. This ensures the realizability assumption according to the definition of our loss function shown in Eq. (25). The oracle uses online gradient descent with learning rate η searched over {0.01, 0.05, 0.1, 0.5, 1}. Calculation of pt. To make Square CB.G more efficient, instead of solving the convex program defined in Eq. 
(5), we use the closed-form of pt derived in Proposition 3, which only requires O(K) computational cost and has the same theoretical guarantee (up to a constant factor) as the one enjoyed by the solution solved by Eq. (5). Similar to the case in Appendix C.1, at each round t, we pick γt = c t with c searched over the set {0.25, 0.5, 1, 2, 3, 4}. Note again that the independence number for inventory graph is 1. We run 8 independent experiments with different random seeds and plot the mean and standard deviation value of PV loss. D Adaptive Tuning of γ without the Knowledge of Graph-Theoretic Numbers In this section, we show how to adaptively tune the parameter γ in order to achieve e O q PT t=1 αt Reg Sq regret in the strongly observable graphs case and e O PT t=1 dt 2 1 3 Sq + q PT t=1 eαt Reg Sq in the weakly observable graphs case. D.1 Strongly Observable Graphs In order to achieve e O PT t=1 αt Reg Sq regret guarantee without the knowledge of αt, we apply a doubling trick on γ based on the value of minp (K) decγ(p; bft, xt, Gt). Specifically, our algorithm goes in epochs with the parameter γ being γs in the s-th epoch. We initialize γ1 = q T Reg Sq . As proven in Theorem 3.2, we know that γ min p (K) decγ(p; bft, xt, Gt) e O(αt). Therefore, within each epoch s (with start round bs), at round t, we calculate the value τ=bs min p (K) decγs(p; bft, xt, Gt), (26) which is bounded by e O 1 γs Pt τ=bs ατ and is in fact obtainable by solving the convex program. Then, we check whether Qt γs Reg Sq. If this holds, we continue our algorithm using γs; otherwise, we set γs+1 = 2γs and restart the algorithm. Now we analyze the performance of the above described algorithm. First, note that for any t, within epoch s, meaning that the number of epoch S is bounded by S = log2 C1 + log4 T for certain C1 > 0 which only contains constant and log terms. Next, consider the regret in epoch s with Is = [bs, es]. According to Eq. (8), we know that the regret within epoch s is bounded as follows: t Is f (xt, at) X t Is f (xt, π (xt)) decγs(pt; bft, xt, Gt) t [bs,es 1] decγs(pt; bft, xt, Gt) e O(γs Reg Sq), (27) where the last inequality is because at round t = es 1, Qt γs Reg Sq is satisfied. Taking summation over all S epochs, we know that the overall regret is bounded as s=1 e O(γs Reg Sq) = s=1 e O 2s 1q TReg Sq = e O t=1 αt Reg Sq which finishes the proof. D.2 Weakly Observable Graphs For the weakly observable graphs case, to achieve the target regret without the knowledge of eαt and dt, which are the independence number of the subgraph induced by nodes with self-loops in Gt and the weak domination number of Gt, we apply the same approach as the one applied in the strongly observable graph case. Note that according to Theorem 3.4, within epoch s, we Pt τ=bs dτ γs + Pt τ=bs eατ for certain C2 > 1 only containing constants and log factors. In the weakly observable graphs case, we know that the number of epoch is bounded by S = 2 + log2 C2 + max (PT t=1 dt) 2 3 T Reg 1 6 Sq since we have 1 Reg Sq max t=1 eαt, (PT t=1 dt) 2 3 and at round t in epoch S, γS + PT τ=1 dτ γS meaning that epoch S will never end. Therefore, following Eq. (27) and Eq. (28), we can obtain that E[Reg CB] e O 2Sq TReg Sq = e O t=1 eαt Reg Sq + which finishes the proof. E Auxiliary Lemmas Lemma E.1 (Lemma 5 in [Alon et al., 2015]). Let G = (V, E) be a directed graph with |V | = K, in which i N in(G, i) for all vertices i [K]. 
Assign each i V with a positive weight wi such that Pn i=1 wi 1 and wi ε for all i V for some constant 0 < ε < 1 j Nin(G,i) wj 4α(G) log 4K α(G)ε, where α(G) is the independence number of G. Lemma E.2 (Lemma 9 in [Alon et al., 2013]). Let G = (V, E) be a directed graph with vertex set |V | = K, in which i N in(G, i) for all i [K]. Let p be an arbitrary distribution over [K]. Then, we have j Nin(G,i) pj mas(G), where mas(G) is the size of the maximum acyclic subgraphs of G.
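To summarize the adaptive tuning scheme of Appendix D.1 in code, the sketch below (names are ours) tracks the cumulative optimal DEC value within the current epoch, as in Eq. (26), and doubles γ whenever it exceeds the current γ times Reg_Sq; the per-round work of playing an action and updating the oracle, and what exactly is reset on a restart beyond γ and the tracker, are left abstract here.

```python
def adaptive_gamma_schedule(T, reg_sq, solve_dec, play_round):
    """Sketch of the epoch-based doubling trick for gamma from Appendix D.1.

    solve_dec(gamma, t) is assumed to return (p_t, dec_value), where dec_value is
    min_p dec_gamma(p; fhat_t, x_t, G_t), e.g., the optimum of the convex program.
    play_round(t, p_t) is assumed to sample a_t, observe losses, and update the oracle.
    """
    gamma = (T / reg_sq) ** 0.5        # gamma_1 = sqrt(T / Reg_Sq)
    Q = 0.0                            # cumulative optimal DEC value within the current epoch
    for t in range(1, T + 1):
        p_t, dec_value = solve_dec(gamma, t)
        play_round(t, p_t)
        Q += dec_value
        if Q > gamma * reg_sq:         # epoch ends: double gamma and reset the tracker
            gamma *= 2.0
            Q = 0.0
    return gamma
```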