# Best Arm Identification in Graphical Bilinear Bandits

**Geovani Rizk¹ ², Albert Thomas², Igor Colin², Rida Laraki¹ ³, Yann Chevaleyre¹**

**Abstract.** We introduce a new graphical bilinear bandit problem where a learner (or a central entity) allocates arms to the nodes of a graph and observes for each edge a noisy bilinear reward representing the interaction between the two end nodes. We study the best arm identification problem in which the learner wants to find the graph allocation maximizing the sum of the bilinear rewards. By efficiently exploiting the geometry of this bandit problem, we propose a decentralized allocation strategy based on random sampling with theoretical guarantees. In particular, we characterize the influence of the graph structure (e.g., star, complete or circle) on the convergence rate and present empirical experiments that confirm this dependency.

## 1. Introduction

In many multi-agent systems, the contribution of an agent to a common team objective is impacted by the behavior of the other agents. The agents must coordinate (or be coordinated) to achieve the best team performance. Consider, for instance, the problem of configuring the antennas of a wireless cellular network to obtain the best signal quality over the whole network (Siomina et al., 2006). The signal quality of the region covered by a given antenna might be degraded by the behavior of its neighboring antennas, due to increased interference or bad user handovers. Another example is the adjustment of the turbine blades of a wind farm, where the best adjustment for one turbine may generate turbulence for its neighboring turbines and thus be suboptimal for the global wind farm objective (Bargiacchi et al., 2018).
These real-life problems can be viewed as instances of a stochastic multi-agent multi-armed bandit problem (Robbins, 1952; Bargiacchi et al., 2018) where a learner (or a central entity) sequentially pulls a joint arm, one arm for each agent (e.g., all the configuration parameters of the antennas), and receives an associated global noisy reward (e.g., the signal quality over the whole network). The goal of the learner can either be to maximize the accumulated reward, implying a trade-off between exploration and exploitation, or to find the joint arm maximizing the reward, a setting known as pure exploration or best arm identification (Bubeck et al., 2009; Audibert and Bubeck, 2010).

¹PSL, Université Paris-Dauphine, CNRS, LAMSADE, Paris, France. ²Huawei Noah's Ark Lab. ³Liverpool University. Correspondence to: Geovani Rizk. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

In this paper we focus on the best arm identification problem in a multi-agent system for which we assume the knowledge of a coordination graph $G = (V, E)$ representing the agent interactions (Guestrin et al., 2002). At each round $t$, a learner

1. chooses for each node $i \in V$ an arm $x_t^{(i)}$ in a finite arm set $\mathcal{X} \subset \mathbb{R}^d$,
2. observes for each edge $(i, j) \in E$ a bilinear reward $r_t^{(i,j)} = x_t^{(i)\top} M_\star x_t^{(j)} + \eta_t^{(i,j)}$.

Here, $M_\star \in \mathbb{R}^{d \times d}$ denotes the unknown parameter matrix, and $\eta_t^{(i,j)}$ a zero-mean $\sigma$-sub-Gaussian random variable, for all edges $(i, j) \in E$ and rounds $t$. The goal of the central entity is to find, within a minimum number of rounds, the joint arm $(x_\star^{(1)}, \ldots, x_\star^{(|V|)})$ such that the expected global reward $\sum_{(i,j) \in E} x^{(i)\top} M_\star x^{(j)}$ is maximized. The reward $r_t^{(i,j)}$ reflects the quality of the interaction between the neighboring nodes $i$ and $j$ when pulling the arms $x_t^{(i)}$ and $x_t^{(j)}$ respectively at time $t$.
For instance, when configuring handover parameters of a wireless network, $r_t^{(i,j)}$ can be any criterion assessing the handover quality between antenna $i$ and antenna $j$, the parameters selected by each antenna both impacting this quantity. The bilinear setting appears as a natural extension of the commonly studied linear setting to model the interaction between two agents. Furthermore, instead of the global reward being a sum of independent linear agent rewards, the global reward is now the result of the interactions between neighboring agents.

As exposed in Jun et al. (2019), the bilinear reward can be written as a linear reward in a higher-dimensional space:

$$r_t^{(i,j)} = \left\langle \operatorname{vec}\left(x_t^{(i)} x_t^{(j)\top}\right), \operatorname{vec}(M_\star) \right\rangle + \eta_t^{(i,j)}, \tag{1}$$

where for any matrix $A \in \mathbb{R}^{d \times d}$, $\operatorname{vec}(A)$ denotes the vector in $\mathbb{R}^{d^2}$ obtained by concatenating the columns of $A$. Since the unknown parameter $M_\star$ is common to all the edges $(i, j)$ of the graph, the expected global reward at time $t$ can also be written as the scalar product $\left\langle \sum_{(i,j) \in E} \operatorname{vec}\left(x_t^{(i)} x_t^{(j)\top}\right), \operatorname{vec}(M_\star) \right\rangle$. Hence, solving the best arm identification problem in the described graphical bilinear bandit boils down to solving the same problem in a global linear bandit. Although this trick allows the use of classical linear bandit algorithms, the number of joint arms grows exponentially with the number of nodes, making such methods impractical. Another possible way to address this problem, based on equation (1), is to consider one linear bandit per edge, with constraints between edges. For clarity, let us define the arm set $\mathcal{Z} = \{\operatorname{vec}(x x'^\top) \mid (x, x') \in \mathcal{X}^2\}$, and let us refer to any $z \in \mathcal{Z}$ as an edge-arm and to any $x \in \mathcal{X}$ as a node-arm.
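The linearization in equation (1) is simply the identity $x^\top M x' = \langle \operatorname{vec}(x x'^\top), \operatorname{vec}(M) \rangle$; a few lines of NumPy can sanity-check it (helper names are ours, and the only detail to get right is the column-stacking convention for $\operatorname{vec}$):

```python
import numpy as np

def vec(A):
    """Stack the columns of A into a single vector (column-major flattening)."""
    return A.flatten(order="F")

def edge_arm(x, xp):
    """Edge-arm z = vec(x x'^T) associated with the node-arms x and x'."""
    return vec(np.outer(x, xp))

# The bilinear reward x^T M x' equals the linear reward <z, vec(M)> in R^(d^2):
rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
x, xp = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(edge_arm(x, xp) @ vec(M), x @ M @ xp)
```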
At each round $t$, the learner chooses for each edge $(i, j)$ an edge-arm in $\mathcal{Z}$, with the constraint that for any pair of edges $(i, j)$ and $(i, k)$, if the edge-arm $\operatorname{vec}(x x'^\top)$ is assigned to the edge $(i, j)$ and the edge-arm $\operatorname{vec}(\tilde{x} \tilde{x}'^\top)$ is assigned to the edge $(i, k)$, then it must be that $x = \tilde{x}$. Given this constraint, how do we choose the appropriate sequence of edge-arms in order to build a good estimate of $\operatorname{vec}(M_\star)$? Moreover, assuming we have built such a good estimator, is there a tractable algorithm to identify the best joint arm, or at least to find a joint arm yielding a high expected reward? In this paper, we answer these questions and provide algorithms and theoretical guarantees. We show that even with a perfect estimator $\operatorname{vec}(\hat{M}) = \operatorname{vec}(M_\star)$, identifying the best joint arm is NP-hard. To address this issue, we design a polynomial-time twofold algorithm. Given $\operatorname{vec}(\hat{M})$, it first identifies the best edge-arm $z_\star \in \mathcal{Z}$ maximizing $\langle z, \operatorname{vec}(\hat{M}) \rangle$. Then, it allocates $z_\star$ to a carefully chosen subset of edges. We show that this yields a good approximation ratio in Section 4. To build our estimator $\operatorname{vec}(\hat{M})$, we rely on the G-allocation strategy, as in Soare et al. (2014). We show that there exists a sampling procedure over the node-arms such that the associated edge-arms follow the optimal G-allocation strategy developed in the linear bandit literature. This procedure allows us to avoid the difficulty of having to satisfy the edge-arm constraints explicitly. Furthermore, we analyze the sample complexity of this method. This is detailed in Section 5. In addition, we highlight the impact of the graph structure in Section 6 and make the repercussion on the convergence rate of the algorithm explicit for different graph types: star, complete, circle and matching graphs. In particular, we show that for favorable graph structures (e.g., circles), our convergence rate matches that of standard linear bandits. Finally, Section 7 illustrates the theoretical findings with numerical experiments.
## 2. Related Work

**Best arm identification in linear bandits.** There exists a vast literature on the problem of best arm identification in linear bandits (Soare et al., 2014; Xu et al., 2018; Degenne et al., 2020; Kazerouni and Wein, 2019; Zaki et al., 2020; Jedra and Proutiere, 2020), be it via greedy strategies (Soare et al., 2014), rounding procedures (Fiez et al., 2019) or random sampling (Tao et al., 2018). Although our problem can be formulated as a linear bandit problem, none of the existing methods would scale with the number of agents. Nevertheless, we rely on classical techniques, and more specifically on those developed in Soare et al. (2014).

**Bilinear bandits.** Bandits with bilinear rewards have been studied in Jun et al. (2019). The authors derived a no-regret algorithm based on Optimism in the Face of Uncertainty for Linear bandits (OFUL) (Abbasi-Yadkori et al., 2011), using the fact that a bilinear reward can be expressed as a linear reward in higher dimension. Our work extends their setting by considering a set of dependent bilinear bandits. Besides, our goal here is to find the best arm rather than to minimize the regret.

**Bandits and graphs.** Graphs are often used to bring structure to a bandit problem. In Valko et al. (2014) and Mannor and Shamir (2011), the arms are the nodes of a graph and pulling an arm gives information on the rewards of the neighboring arms. The reader can also refer to Valko (2020) for an account of such problems. In Cesa-Bianchi et al. (2013), each node is an instance of a linear bandit and neighboring nodes are assumed to have similar unknown regression coefficients. The main difference with our setting is that the rewards of the nodes are independent.

**Combinatorial and multi-agent bandits.** Allocating arms to each node of a graph and then observing a global reward is a combinatorial bandit problem (Cesa-Bianchi and Lugosi, 2012), the number of joint arms scaling exponentially with the number of nodes.
This has been extensively studied both in the regret-based (Chen et al., 2013; Perrault et al., 2020) and the pure exploration context (Chen et al., 2014; Cao and Krishnamurthy, 2019; Jourdan et al., 2021; Du et al., 2020). Our problem is closer to the one presented in Amin et al. (2011) and Bargiacchi et al. (2018), where several agents want to maximize a global team reward that can be decomposed into a sum of observable local rewards, as in a semi-bandit game (Audibert et al., 2011; Chen et al., 2013). However, we study a more structured context, as we assume observable bilinear rewards for each edge of the graph. Furthermore, note that our problem can be solved by the algorithm presented in Du et al. (2020), but with a sample complexity increasing with the number of nodes. On the contrary, we propose in this paper an algorithm whose sample complexity decreases with the number of nodes, exploiting the structure of the bilinear reward and of the graph. Finally, most of the algorithms developed for combinatorial bandits assume the availability of an oracle solving the combinatorial optimization problem that returns the arm to play or the final best arm recommendation. We make no such assumption.

## 3. Preliminaries and Notations

Let $G = (V, E)$ be a directed graph with $V$ the set of nodes and $E$ the set of edges, where we assume that if $(i, j) \in E$ then $(j, i) \in E$, and let $N(i)$ denote the set of neighbors of a node $i \in V$. We denote by $n = |V|$ the number of nodes and by $m = |E|$ the number of edges. We define the graphical bilinear bandit on the graph $G$ as the setting where a learner sequentially pulls at each round $t$ a joint arm $(x_t^{(1)}, \ldots, x_t^{(n)}) \in \mathcal{X}^n$, also called a graph allocation or simply an allocation when it is clear from the context, and then receives a bilinear reward $r_t^{(i,j)}$ for each edge $(i, j) \in E$.
At each round, the joint arm can be constructed simultaneously or sequentially; however, all the bilinear rewards are only revealed after the joint arm has been pulled. We denote by $K = |\mathcal{X}|$ the number of node-arms, and we assume that $\mathcal{X}$ spans $\mathbb{R}^d$. For each round $t$ of the learning procedure and each node $i \in V$, $x_t^{(i)} \in \mathcal{X}$ represents the node-arm allocated to node $i$. For each edge $(i, j) \in E$, we denote by $z_t^{(i,j)} = \operatorname{vec}\left(x_t^{(i)} x_t^{(j)\top}\right) \in \mathcal{Z}$ the associated chosen edge-arm. The goal is to derive an algorithm that minimizes the number of pulled joint arms required to find the one maximizing the sum of the associated expected bilinear rewards, for a given confidence level. For the sake of simplicity, we assume that the unknown parameter matrix $M_\star$ in the bilinear reward is symmetric. We provide an analysis of the non-symmetric case in Appendix E.

For any finite set $\mathcal{X}$, $\mathcal{S}_{\mathcal{X}} \triangleq \{\lambda \in [0, 1]^{|\mathcal{X}|}, \sum_{x \in \mathcal{X}} \lambda_x = 1\}$ denotes the simplex in $\mathbb{R}^{|\mathcal{X}|}$. For any vector $x \in \mathbb{R}^d$, $\|x\|$ denotes the $\ell_2$-norm of $x$. For any square matrix $A \in \mathbb{R}^{d \times d}$, we denote by $\|A\| \triangleq \sup_{x : \|x\| = 1} \|Ax\|$ the spectral norm of $A$. Finally, for any vector $x \in \mathbb{R}^d$ and any symmetric positive-definite matrix $A \in \mathbb{R}^{d \times d}$, we define $\|x\|_A \triangleq \sqrt{x^\top A x}$.

## 4. An NP-Hard Problem

In this section, we address the problem of finding the best joint arm given $M_\star$ or a good estimator $\hat{M}$. If the best edge-arm $z_\star$ is composed of a single node-arm $x_\star$, that is $z_\star = \operatorname{vec}(x_\star x_\star^\top)$, then finding the best joint arm is trivial: the solution is to assign $x_\star$ to all nodes. Conversely, if $z_\star$ is composed of two distinct node-arms $(x_\star, x'_\star)$, the problem is harder. The following theorem states that, even with the knowledge of the true parameter $M_\star$, identifying the best joint arm is NP-hard with respect to the number of nodes $n$.

**Theorem 4.1.** *Consider a given matrix $M \in \mathbb{R}^{d \times d}$ and a finite arm set $\mathcal{X} \subset \mathbb{R}^d$. Unless P=NP, there is no polynomial-time algorithm guaranteed to find the optimal solution of*

$$\max_{(x^{(1)}, \ldots, x^{(n)}) \in \mathcal{X}^n} \sum_{(i,j) \in E} x^{(i)\top} M x^{(j)}.$$
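To make the hardness statement concrete: the only generic exact approach is exhaustive search over the $K^n$ joint arms, which is exponential in the number of nodes. A toy sketch (function and variable names are ours):

```python
from itertools import product

import numpy as np

def brute_force_best_joint_arm(X, M, edges, n):
    """Exact maximization of sum_{(i,j) in E} x^(i)T M x^(j) by enumerating
    all K^n joint arms -- only feasible for toy sizes, per Theorem 4.1."""
    best_val, best = -np.inf, None
    for assign in product(range(len(X)), repeat=n):
        val = sum(X[assign[i]] @ M @ X[assign[j]] for i, j in edges)
        if val > best_val:
            best_val, best = val, assign
    return best, best_val
```

On a directed triangle with canonical node-arms and a matrix rewarding only the first basis vector, the optimum assigns that arm everywhere, but the enumeration already costs $K^n$ evaluations.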
The proof of this theorem is in Appendix A and relies on a reduction from the Max-Cut problem. Hence, no matter which estimate $\hat{M}$ of $M_\star$ one can build, the learner is not guaranteed to find in polynomial time the joint arm $(x_\star^{(1)}, \ldots, x_\star^{(n)})$ maximizing the expected global reward. However, one can notice that, given the matrix $M_\star$ or even a good enough estimate $\hat{M}$, identifying the edge-arm $z_\star = \operatorname{vec}(x_\star x_\star'^\top) \in \mathcal{Z}$ that maximizes the reward $z^\top \operatorname{vec}(M_\star)$ requires only $K^2$ reward estimations (we simply estimate the linear reward associated with each edge-arm in $\mathcal{Z}$). Thus, instead of looking for the best joint arm explicitly, we will first identify the best edge-arm $z_\star$, and then allocate $z_\star$ to the largest possible number of edges in the graph. We will also show that this approach comes with a guarantee on its associated global reward.

Let us consider the graph allocation that places the maximum number of edge-arms $z_\star$ in $G$. It is easy to show that the subgraph containing only the edges where $z_\star$ has been pulled is the largest bipartite subgraph included in $G$. Recall that a graph $G' = (V', E')$ is bipartite if and only if one can partition the node set $V' = (V'_1, V'_2)$ such that $(i, j) \in E' \Rightarrow (i, j) \in V'_1 \times V'_2$ or $(j, i) \in V'_1 \times V'_2$. Notice that if $G'$ is the largest bipartite subgraph in $G$, the number of edges in $E'$ is the maximal number of edge-arms $z_\star$ that can be allocated with a single graph allocation. Hence, finding the joint arm with the largest number of edge-arms $z_\star$ allocated in the graph is equivalent to finding the largest bipartite subgraph $G' = (V', E')$ in $G$. Once that subgraph is determined, we just need to allocate the node-arm $x_\star$ to all the nodes in $V'_1$ and the node-arm $x'_\star$ to all the nodes in $V'_2$ (which is equivalent to allocating the edge-arm $z_\star$ to all the edges in $E'$). Furthermore, we know that every $m$-edge graph contains a bipartite subgraph of at least $m/2$ edges (Erdos, 1975).
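The greedy bipartition behind Algorithm 1 can be transcribed directly in Python, assuming the graph is given as an undirected edge list (helper names are ours):

```python
import numpy as np

def best_edge_arm(X, M_hat):
    """Step 1: scan the K^2 node-arm pairs for argmax x^T M_hat x'."""
    vals = X @ M_hat @ X.T                     # vals[a, b] = x_a^T M_hat x_b
    return np.unravel_index(np.argmax(vals), vals.shape)

def greedy_bipartite_allocation(n, edges, a, b):
    """Step 2: greedily 2-color the nodes; each node joins the side opposite
    to the majority of its already-placed neighbors."""
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    V1, V2, alloc = set(), set(), [None] * n
    for i in range(n):
        if len(adj[i] & V1) > len(adj[i] & V2):
            alloc[i] = b                       # play x'_star, join V2
            V2.add(i)
        else:
            alloc[i] = a                       # play x_star, join V1
            V1.add(i)
    return alloc, V1, V2
```

By construction, every node lands on the side that cuts at least half of its edges toward already-placed nodes, so the cut $(V_1, V_2)$ contains at least $m/2$ edges: this is the combinatorial fact behind the guarantee of Theorem 4.2.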
Therefore, we propose Algorithm 1, which iteratively constructs a bipartite subgraph and allocates the nodes accordingly to create at least $m/2$ edge-arms $z_\star$.

```text
Algorithm 1: Bipartite graph algorithm
Input: G = (V, E), 𝒳, M⋆
Find (x⋆, x′⋆) ∈ argmax_{(x, x′) ∈ 𝒳²} x⊤ M⋆ x′
Set V₁ = ∅, V₂ = ∅
for i in V do
    Set n₁ the number of neighbors of i in V₁
    Set n₂ the number of neighbors of i in V₂
    if n₁ > n₂ then
        x⁽ⁱ⁾ = x′⋆ ;  V₂ ← V₂ ∪ {i}
    else
        x⁽ⁱ⁾ = x⋆ ;  V₁ ← V₁ ∪ {i}
    end
end
return x̃ = (x⁽¹⁾, …, x⁽ⁿ⁾)
```

The following result gives the guarantee on the global reward associated with the joint arm returned by Algorithm 1. We refer the reader to Appendix A for the proof.

**Theorem 4.2.** *Let us consider the graph $G = (V, E)$, a finite arm set $\mathcal{X} \subset \mathbb{R}^d$ and the matrix $M_\star$ given as input to Algorithm 1. Then, the expected global reward $r = \sum_{(i,j) \in E} x^{(i)\top} M_\star x^{(j)}$ associated with the returned allocation $\tilde{x} = (x^{(1)}, \ldots, x^{(n)}) \in \mathcal{X}^n$ verifies:*

$$\frac{r - r_{\min}}{r_\star - r_{\min}} \geq \frac{1}{2},$$

*where $r_\star$ and $r_{\min}$ are respectively the highest and lowest global rewards one can obtain with the appropriate joint arm. Finally, the complexity of the algorithm is in $O(K^2 + n)$.*

This type of approximation result is sometimes referred to as a differential approximation or z-approximation, and is often viewed as a more subtle analysis than a standard approximation ratio. We emphasize that finding a better ratio than $1/2$ is a very hard task: such a finding would immediately yield an improved differential approximation ratio for the Max-Cut problem, which has been an open problem since 2001 (Hassin and Khuller, 2001).

## 5. Construction of the Estimate $\hat{M}$

In the previous section, we designed a polynomial-time method that computes a 1/2-approximation to the NP-hard problem of finding the best joint arm given $M_\star$. Notice that $M_\star$ is only used to identify the best edge-arm $z_\star$.
Thus, using an estimate $\hat{M}$ of $M_\star$ with the following property:

$$\arg\max_{z \in \mathcal{Z}} z^\top \operatorname{vec}(\hat{M}) = \arg\max_{z \in \mathcal{Z}} z^\top \operatorname{vec}(M_\star), \tag{2}$$

would still allow us to identify $z_\star$, and would thus give us the same guarantees. In this section, we tackle the problem of pulling the edge-arms during the learning procedure such that the estimated unknown parameter verifies (2) in as few iterations as possible. To do so, we first formalize the objective related to the linearized version of the problem. Then, we propose an algorithm reaching the given objective with high probability while satisfying the edge-arm constraints. We denote by $\theta_\star = \operatorname{vec}(M_\star)$ the parameter of the linearized problem and by $\hat{\theta}_t$ the ordinary least squares (OLS) estimate of $\theta_\star$ computed with all the data collected up to round $t$. The empirical gap between two edge-arms $z$ and $z'$ in $\mathcal{Z}$ is denoted by $\hat{\Delta}_t(z, z') \triangleq (z - z')^\top \hat{\theta}_t$.

### 5.1. A Constrained G-Allocation

The goal here is to define the optimal sequence $(z_1, \ldots, z_{mt}) \in \mathcal{Z}^{mt}$ that should be pulled in the first $t$ rounds so that (2) is reached as soon as possible. A natural approach is to rely on classical strategies developed for best arm identification in linear bandits. Most of the known strategies (see e.g., Soare et al. (2014); Xu et al. (2018); Fiez et al. (2019)) are based on a bound of the gap error $|(\theta_\star - \hat{\theta}_t)^\top (z - z')|$ for all $z, z' \in \mathcal{Z}$. This bound is then used to derive a stopping condition, indicating a sufficient number of rounds $t$ after which the OLS estimate $\hat{\theta}_t$ is precise enough to ensure the identification of the best edge-arm with high probability. Let $\delta \in (0, 1)$ and let $A_t = \sum_{i=1}^{mt} z_i z_i^\top$ be the matrix computed with the $mt$ edge-arms constructed during $t$ rounds. Following the steps of Soare et al. (2014), we can show that if there exists $z \in \mathcal{Z}$ such that for all $z' \in \mathcal{Z}$ the following holds:

$$\|z - z'\|_{A_t^{-1}} \sqrt{8\sigma^2 \log\left(\frac{6 m^2 t^2 K^4}{\delta}\right)} \leq \hat{\Delta}_t(z, z'), \tag{3}$$

then with probability at least $1 - \delta$, the OLS estimate $\hat{\theta}_t$ identifies the best edge-arm. Details of the derivation are given in Appendix B. As mentioned in Soare et al.
(2014), by noticing that $\max_{(z, z') \in \mathcal{Z}^2} \|z - z'\|_{A_t^{-1}} \leq 2 \max_{z \in \mathcal{Z}} \|z\|_{A_t^{-1}}$, an admissible strategy is to pull edge-arms minimizing $\max_{z \in \mathcal{Z}} \|z\|_{A_t^{-1}}$ in order to satisfy the stopping condition as soon as possible. More formally, one wants to find the sequence of edge-arms $z_{mt}^\star = (z_1^\star, \ldots, z_{mt}^\star)$ such that:

$$z_{mt}^\star \in \arg\min_{(z_1, \ldots, z_{mt})} \max_{z \in \mathcal{Z}} \; z^\top \left( \sum_{i=1}^{mt} z_i z_i^\top \right)^{-1} z. \tag{G-opt-$\mathcal{Z}$}$$

This is known as G-allocation (see e.g., Pukelsheim (2006); Soare et al. (2014)) and is NP-hard to compute (Çivril and Magdon-Ismail, 2009; Welch, 1982). One way to find an approximate solution is to rely on a convex relaxation of the optimization problem (G-opt-$\mathcal{Z}$) and first compute a real-valued allocation $\lambda \in \mathcal{S}_{\mathcal{Z}}$ such that

$$\lambda^\star \in \arg\min_{\lambda \in \mathcal{S}_{\mathcal{Z}}} \max_{z \in \mathcal{Z}} \; z^\top \left( \sum_{z' \in \mathcal{Z}} \lambda_{z'} z' z'^\top \right)^{-1} z. \tag{G-relaxed-$\mathcal{Z}$}$$

One could then either use random sampling to draw edge-arms as i.i.d. samples from the $\lambda^\star$ distribution, or use rounding procedures to efficiently convert each $\lambda_z^\star$ into an integer. However, these methods do not take into account the graphical structure of the problem, and at a given round, the $m$ chosen edge-arms may result in two different assignments for the same node. Therefore, random sampling or rounding procedures cannot be straightforwardly used to select edge-arms in $\mathcal{Z}$. Nevertheless, (G-relaxed-$\mathcal{Z}$) still gives valuable information on the proportion of times each edge-arm $z \in \mathcal{Z}$ must be allocated to the graph. In the next section, we present an algorithm satisfying both the proportion requirements and the graphical constraints.

### 5.2. Random Allocation over the Nodes

Our algorithm is based on a randomized method that directly allocates node-arms to the nodes, thus avoiding the difficult task of choosing edge-arms and trying to allocate them to the graph while ensuring that every node has a unique assignment.
The validity of this random allocation is based on Theorem 5.1 below, showing that one can draw node-arms in $\mathcal{X}$ and allocate them to the graph such that the associated edge-arms follow the probability distribution $\lambda^\star$ solution of (G-relaxed-$\mathcal{Z}$).

**Theorem 5.1.** *Let $\mu^\star$ be a solution of the following optimization problem:*

$$\min_{\mu \in \mathcal{S}_{\mathcal{X}}} \max_{x \in \mathcal{X}} \; x^\top \left( \sum_{x' \in \mathcal{X}} \mu_{x'} x' x'^\top \right)^{-1} x. \tag{G-relaxed-$\mathcal{X}$}$$

*Let $\lambda^\star \in \mathcal{S}_{\mathcal{Z}}$ be defined for all $z = \operatorname{vec}(x x'^\top) \in \mathcal{Z}$ by $\lambda_z^\star = \mu_x^\star \mu_{x'}^\star$. Then, $\lambda^\star$ is a solution of (G-relaxed-$\mathcal{Z}$).*

*Sketch of proof.* The objectives at the optimum in (G-relaxed-$\mathcal{X}$) and (G-relaxed-$\mathcal{Z}$) are respectively equal to $d$ and $d^2$, the dimensions of the respective problems, a result known as the Kiefer–Wolfowitz equivalence theorem (Kiefer and Wolfowitz, 1960). Thus, by multiplying the optimal value of (G-relaxed-$\mathcal{X}$) by itself, we can show that for all $z = \operatorname{vec}(x x'^\top) \in \mathcal{Z}$ with $(x, x') \in \mathcal{X}^2$, $\lambda_z^\star$ can be written as the product $\mu_x^\star \mu_{x'}^\star$. We refer to Appendix C for the detailed proof.

This theorem implies that, at each round $t > 0$ and for each node $i \in V$, if $x_t^{(i)}$ is drawn from $\mu^\star$, then for all pairs of neighbors $(i, j) \in E$ the associated edge-arm $z_t^{(i,j)}$ follows the distribution $\lambda^\star$. Moreover, as $\mu^\star$ is a distribution over the node-arm set $\mathcal{X}$, $\lambda^\star$ is a joint (product) probability distribution on $\mathcal{X}^2$ with marginal $\mu^\star$. We apply the Frank-Wolfe algorithm (Frank and Wolfe, 1956) to compute the solution $\mu^\star$ of (G-relaxed-$\mathcal{X}$), as it is better suited to optimization tasks on the simplex than projected gradient descent. Although we face a min-max optimization problem, we note that the function $h(\mu) = \max_{x \in \mathcal{X}} x^\top \left( \sum_{x' \in \mathcal{X}} \mu_{x'} x' x'^\top \right)^{-1} x$ is convex. We refer the reader to Appendix F and references therein for a proof of the convexity of $h$ and a discussion about using Frank-Wolfe for solving (G-relaxed-$\mathcal{X}$). Given the characterization in Theorem 5.1 and our objective to verify the stopping condition (3), we present our sampling procedure in Algorithm 2. We also note that at each round the sampling of the node-arms can be done in parallel.
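A minimal Frank-Wolfe sketch for (G-relaxed-$\mathcal{X}$): at each step, the linear minimization over the simplex selects the vertex (node-arm) with the largest predicted variance $x^\top A(\mu)^{-1} x$, and the iterate moves toward it with the standard $1/(t+2)$ step size. Function names and the fixed iteration budget are our choices, not the paper's:

```python
import numpy as np

def g_optimal_design(X, n_iter=500):
    """Frank-Wolfe on the simplex for min_mu max_x x^T (sum mu_x' x'x'^T)^{-1} x.
    X: (K, d) array of node-arms spanning R^d. Returns a distribution mu."""
    K, _ = X.shape
    mu = np.full(K, 1.0 / K)                  # uniform initialization
    for t in range(n_iter):
        A = X.T @ (mu[:, None] * X)           # design matrix sum_x mu_x x x^T
        A_inv = np.linalg.inv(A)
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)   # predicted variances
        k = int(np.argmax(g))                 # Frank-Wolfe vertex
        gamma = 1.0 / (t + 2)                 # standard step size
        mu *= 1.0 - gamma
        mu[k] += gamma
    return mu
```

By the Kiefer–Wolfowitz equivalence theorem, the optimal value of the inner maximum is exactly $d$, which gives a convenient convergence check for the sketch.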
```text
Algorithm 2: Randomized G-Allocation strategy for Graphical Bilinear Bandits
Input: graph G = (V, E), arm set 𝒳
Set A₀ = I ; b₀ = 0 ; t = 1
Apply the Frank-Wolfe algorithm to find µ⋆, solution of (G-relaxed-𝒳)
while stopping condition (3) is not verified do
    // Sampling the node-arms
    Draw x_t⁽¹⁾, …, x_t⁽ⁿ⁾ ~ i.i.d. µ⋆ and obtain, for all (i, j) ∈ E, the rewards r_t⁽ⁱ'ʲ⁾
    // Estimating θ̂_t with the associated edge-arms
    A_t = A_{t−1} + Σ_{(i,j)∈E} z_t⁽ⁱ'ʲ⁾ z_t⁽ⁱ'ʲ⁾⊤
    b_t = b_{t−1} + Σ_{(i,j)∈E} z_t⁽ⁱ'ʲ⁾ r_t⁽ⁱ'ʲ⁾
    θ̂_t = A_t⁻¹ b_t
    t ← t + 1
end
return θ̂_t
```

This sampling procedure implies that each edge-arm follows the optimal distribution $\lambda^\star$. However, if we look at the number of times each $z \in \mathcal{Z}$ appears among the $m$ edge-arms pulled at a given round, the observed proportion may not be close to $\lambda_z^\star$, regardless of the size of $m$. This is due to the fact that the $m$ edge-arms within a round are not independent, because of the graph structure (cf. Section 6). Conversely, since each group of $m$ edge-arms is independent from one round to another, the proportion of each $z \in \mathcal{Z}$ observed among the $mt$ edge-arms pulled throughout $t$ rounds is close to $\lambda_z^\star$. One may wonder whether deterministic rounding procedures could be used instead of random sampling from $\mu^\star$, as is done in many standard linear bandit algorithms (Soare et al., 2014; Fiez et al., 2019). Applying a rounding procedure to $\mu^\star$ gives the number of times each node-arm $x \in \mathcal{X}$ should be allocated to the graph. However, it does not provide the actual allocations that the learner must choose over the $t$ rounds to optimally pull the associated edge-arms (i.e., to pull edge-arms following $\lambda^\star$). Thus, although rounding procedures give a more precise number of times each node-arm should be pulled, the problem of allocating them to the graph remains open, whereas by concentration of measure, randomized sampling methods imply that the associated edge-arms follow the optimal probability distribution $\lambda^\star$.
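The loop of Algorithm 2 is easy to simulate end-to-end; the sketch below uses a fixed horizon instead of stopping rule (3) for brevity, and all names are illustrative:

```python
import numpy as np

def vec(A):
    return A.flatten(order="F")               # stack the columns

def randomized_g_allocation(X, M_star, edges, mu, n_rounds=200, sigma=1.0, seed=0):
    """Draw one node-arm per node i.i.d. from mu at each round, observe a noisy
    bilinear reward per edge, and return the OLS estimate of vec(M_star)."""
    rng = np.random.default_rng(seed)
    K, d = X.shape
    n = 1 + max(max(i, j) for i, j in edges)
    A = np.eye(d * d)                         # A_0 = I (regularized Gram matrix)
    b = np.zeros(d * d)
    for _ in range(n_rounds):
        draws = rng.choice(K, size=n, p=mu)   # node-arm index for every node
        for i, j in edges:
            x, xp = X[draws[i]], X[draws[j]]
            z = vec(np.outer(x, xp))          # associated edge-arm
            r = x @ M_star @ xp + sigma * rng.standard_normal()
            A += np.outer(z, z)
            b += z * r
    return np.linalg.solve(A, b)              # theta_hat = A^{-1} b
```

Because every node is sampled independently from the same distribution, each edge-arm marginally follows the product distribution of Theorem 5.1, even though edge-arms within a round are correlated.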
In this paper, we present a simple and standard randomized G-allocation strategy, but other more elaborate methods could be considered, as long as they include the necessary randomness.

**On the choice of the G-allocation problem.** We have considered the G-allocation optimization problem (G-opt-$\mathcal{Z}$); however, one could want to directly minimize $\max_{(z, z') \in \mathcal{Z}^2} \|z - z'\|_{A_t^{-1}}$, known as the XY-allocation (Soare et al., 2014; Fiez et al., 2019). Hence, one may want to construct edge-arms that follow the distribution $\lambda_{\mathcal{XY}}^\star$ solution of the relaxed XY-allocation problem:

$$\min_{\lambda \in \mathcal{S}_{\mathcal{Z}}} \max_{(z, z') \in \mathcal{Z}^2} (z - z')^\top \left( \sum_{z'' \in \mathcal{Z}} \lambda_{z''} z'' z''^\top \right)^{-1} (z - z').$$

Although efficient in the linear case, this approach outputs a distribution $\lambda_{\mathcal{XY}}^\star$ which is not a joint probability distribution of two independent random variables, and which therefore cannot be decomposed as the product of its marginals. Hence, there is no algorithm that allocates the nodes of the graph identically and independently to create edge-arms following $\lambda_{\mathcal{XY}}^\star$. Thus, we rather deal with the upper bound given by the G-allocation, as it allows sampling over the nodes.

**Static design versus adaptive design.** Adaptive designs, as proposed for example in Soare et al. (2014) and Fiez et al. (2019), provide a strong improvement over static designs in the case of linear bandits. In our particular setting, however, it is crucial to be able to adapt the edge-arm sampling rule to the node-arms, which is possible thanks to Theorem 5.1. This result requires a set of edge-arms $\mathcal{Z}$ expressed as a product of the node-arm set $\mathcal{X}$. Extending the adaptive design of Fiez et al. (2019) to our setting would eliminate edge-arms from $\mathcal{Z}$ at each phase, without trivial guarantees that the newly obtained edge-arm set $\mathcal{Z}' \subset \mathcal{Z}$ could still be derived from another node-arm set $\mathcal{X}' \subset \mathcal{X}$. An adaptive approach is definitely a natural and promising extension of our method, and is left for future work.
### 5.3. Convergence Analysis

We now prove the validity of the random sampling procedure detailed in Algorithm 2 by controlling the quality of the approximation $\max_{z \in \mathcal{Z}} z^\top A_t^{-1} z$ with respect to the optimum of the G-allocation optimization problem $\max_{z \in \mathcal{Z}} z^\top \left( \sum_{i=1}^{mt} z_i^\star z_i^{\star\top} \right)^{-1} z$ described in (G-opt-$\mathcal{Z}$). As is usually done in the optimal design literature (see e.g., Pukelsheim (2006); Soare et al. (2014); Sagnol (2010)), we bound the relative error $\alpha$:

$$\max_{z \in \mathcal{Z}} z^\top A_t^{-1} z \leq (1 + \alpha) \max_{z \in \mathcal{Z}} z^\top \left( \sum_{i=1}^{mt} z_i^\star z_i^{\star\top} \right)^{-1} z.$$

Our analysis relies on several results from matrix concentration theory. One may refer for instance to Tropp et al. (2015) and references therein for an extended introduction to that matter. We first introduce a few additional notations. Let $f_{\mathcal{Z}}$ be the function such that for any non-singular matrix $Q \in \mathbb{R}^{d^2 \times d^2}$, $f_{\mathcal{Z}}(Q) = \max_{z \in \mathcal{Z}} z^\top Q^{-1} z$, and for any distribution $\lambda \in \mathcal{S}_{\mathcal{Z}}$ let $\Sigma_{\mathcal{Z}}(\lambda) \triangleq \sum_{z \in \mathcal{Z}} \lambda_z z z^\top$ be the associated covariance matrix. Finally, let $A_t^\star = \sum_{i=1}^{mt} z_i^\star z_i^{\star\top}$ be the G-optimal design matrix constructed during $t$ rounds. For $i \in \{1, \ldots, n\}$ and $s \in \{1, \ldots, t\}$, let $X_s^{(i)}$ be i.i.d. random vectors in $\mathcal{X}$ such that for all $x \in \mathcal{X}$, $\mathbb{P}\left(X_1^{(1)} = x\right) = \mu_x^\star$. Each $X_s^{(i)}$ is to be viewed as the random arm pulled at round $s$ for node $i$. Using this notation, the random design matrix $A_t$ can be written as

$$A_t = \sum_{s=1}^{t} \sum_{(i,j) \in E} \operatorname{vec}\left(X_s^{(i)} X_s^{(j)\top}\right) \operatorname{vec}\left(X_s^{(i)} X_s^{(j)\top}\right)^\top.$$

One can first observe that $f_{\mathcal{Z}}(A_t)$ can be bounded as follows:

$$\begin{aligned} f_{\mathcal{Z}}(A_t) &= \max_{z \in \mathcal{Z}} z^\top \left( A_t^{-1} - (\mathbb{E}A_t)^{-1} + (\mathbb{E}A_t)^{-1} \right) z \\ &\leq \max_{z \in \mathcal{Z}} z^\top \left( A_t^{-1} - (\mathbb{E}A_t)^{-1} \right) z + f_{\mathcal{Z}}\left( mt \, \Sigma_{\mathcal{Z}}(\lambda^\star) \right) \\ &\leq \max_{z \in \mathcal{Z}} \|z\|^2 \left\| A_t^{-1} - (\mathbb{E}A_t)^{-1} \right\| + f_{\mathcal{Z}}(A_t^\star). \end{aligned}$$

Hence, one needs a bound on the maximum eigenvalue of $A_t^{-1} - (\mathbb{E}A_t)^{-1}$. Simple linear algebra leads to:

$$A_t^{-1} - (\mathbb{E}A_t)^{-1} = A_t^{-1} (\mathbb{E}A_t - A_t)(\mathbb{E}A_t)^{-1}.$$

Thus, in addition to bounding the maximum eigenvalue of $A_t^{-1}$, which amounts to lower-bounding the minimum eigenvalue of $A_t$, we need a bound on $\|A_t - \mathbb{E}A_t\|$. It may be derived from the concentration results on sums of random matrices given in Tropp et al. (2015).
We now state the result controlling the relative error obtained with our randomized sampling allocation. The proof can be found in Appendix C.

**Theorem 5.2.** *Let $\lambda^\star$ be a solution of the optimization problem (G-relaxed-$\mathcal{Z}$). Let $0 < \delta \leq 1$ and let $t_0 = 2Ld^2 \log(2d^2/\delta)/\nu_{\min}$, where $L = \max_{z \in \mathcal{Z}} \|z\|^2$ and $\nu_{\min}$ is the smallest eigenvalue of the covariance matrix $\frac{1}{|\mathcal{Z}|} \sum_{z \in \mathcal{Z}} z z^\top$. Then, at each round $t \geq t_0$, with probability at least $1 - \delta$, the randomized G-allocation strategy for the graphical bilinear bandit in Algorithm 2 produces a matrix $A_t$ such that:*

$$f_{\mathcal{Z}}(A_t) \leq (1 + \alpha) f_{\mathcal{Z}}(A_t^\star), \quad \text{with } \alpha = O\left( \frac{\sqrt{v}}{m\sqrt{t}} \right) \text{ and } v = \left\| \mathbb{E}\left[ (A_1 - \mathbb{E}A_1)^2 \right] \right\|.$$

We have just shown that the approximation value $\max_{z \in \mathcal{Z}} z^\top A_t^{-1} z$ converges to the optimal value at a rate of $O\left( \sqrt{v}/(m\sqrt{t}) \right)$. In Section 6, we show that the best-case graph implies $v = O(m)$, matching the convergence rate $O\left( 1/\sqrt{mt} \right)$ of a linear bandit algorithm using randomized sampling to pull $mt$ edge-arms without (graphical) constraints. Moreover, we will see that the worst-case graph implies $v = O(m^2)$. Since we have filled the gap between our constrained objective and the problem of best arm identification in linear bandits, thanks to Theorems 5.1 and 5.2 we can extend known results for best arm identification in linear bandits on the sample complexity and its associated lower bound.

**Corollary 5.3** (Soare et al. (2014), Theorem 1). *If the G-allocation is implemented with the random strategy of Algorithm 2, resulting in an $\alpha$-approximation, then with probability at least $1 - \delta$, the best arm obtained with $\hat{\theta}_t$ is $z_\star$ and*

$$t \leq \frac{128 \sigma^2 d^2 (1 + \alpha)}{m \Delta_{\min}^2} \log\left( \frac{6 m^2 t^2 K^4}{\delta} \right),$$

*where $\Delta_{\min} = \min_{z \in \mathcal{Z} \setminus \{z_\star\}} (z_\star - z)^\top \theta_\star$.*

Moreover, let $\tau$ be the number of rounds sufficient for any algorithm to determine the best arm with probability at least $1 - \delta$. A lower bound on the expectation of $\tau$ can be obtained from the one derived for the problem of best arm identification in linear bandits (see e.g., Theorem 1 in Fiez et al. (2019)):

$$\mathbb{E}[\tau] \geq \min_{\lambda \in \mathcal{S}_{\mathcal{Z}}} \max_{z \in \mathcal{Z} \setminus \{z_\star\}} \log\left( \frac{1}{2.4\delta} \right) \frac{2\sigma^2 \|z_\star - z\|_{\Sigma_{\mathcal{Z}}(\lambda)^{-1}}^2}{m \left( (z_\star - z)^\top \theta_\star \right)^2}.$$

As observed in Soare et al.
(2014), this lower bound can itself be upper bounded, in the worst case, by $4\sigma^2 d^2 / (m \Delta_{\min}^2)$, which matches our bound up to log terms and the relative error $\alpha$.

## 6. Influence of the Graph Structure on $v$

The convergence bound in Theorem 5.2 depends on $v = \left\| \mathbb{E}\left[ (A_1 - \mathbb{E}A_1)^2 \right] \right\|$. In this section, we characterize the impact of the graph structure on this quantity and, by extension, on the convergence rate. First of all, recall that

$$A_1 = \sum_{(i,j) \in E} \operatorname{vec}\left(X_1^{(i)} X_1^{(j)\top}\right) \operatorname{vec}\left(X_1^{(i)} X_1^{(j)\top}\right)^\top.$$

Let us denote $A_1^{(i,j)} = \operatorname{vec}\left(X_1^{(i)} X_1^{(j)\top}\right) \operatorname{vec}\left(X_1^{(i)} X_1^{(j)\top}\right)^\top$, so that $A_1 = \sum_{(i,j) \in E} A_1^{(i,j)}$, and let us define for any random matrices $A$ and $B$ the operators $\operatorname{Var}(A) \triangleq \mathbb{E}\left[ (A - \mathbb{E}[A])^2 \right]$ and $\operatorname{Cov}(A, B) \triangleq \mathbb{E}\left[ (A - \mathbb{E}[A])(B - \mathbb{E}[B]) \right]$. We can decompose the variance of $A_1$ as follows:

$$\operatorname{Var}(A_1) = \sum_{(i,j) \in E} \Bigg[ \operatorname{Var}\left( A_1^{(i,j)} \right) + \sum_{\substack{(k,l) \in E \\ (k,l) \neq (i,j)}} \operatorname{Cov}\left( A_1^{(i,j)}, A_1^{(k,l)} \right) \Bigg].$$

One can split the sum of the covariances into three groups: a first group where $k \neq i, j$ and $l \neq i, j$, which means that the two edges do not share any node and $\operatorname{Cov}(A_1^{(i,j)}, A_1^{(k,l)}) = 0$, and two other groups where the edges share at least one node. For all edges $(i, j) \in E$, we consider either the edges $(i, k) \in E$ with $k \neq j$, yielding $\operatorname{Cov}(A_1^{(i,j)}, A_1^{(i,k)})$, or the edges $(j, k) \in E$, yielding $\operatorname{Cov}(A_1^{(i,j)}, A_1^{(j,k)})$. Hence, one has

$$\operatorname{Var}(A_1) = \sum_{(i,j) \in E} \Bigg[ \operatorname{Var}\left( A_1^{(i,j)} \right) + \sum_{\substack{k \in N(i) \\ k \neq j}} \operatorname{Cov}\left( A_1^{(i,j)}, A_1^{(i,k)} \right) + \sum_{k \in N(j)} \operatorname{Cov}\left( A_1^{(i,j)}, A_1^{(j,k)} \right) \Bigg].$$

*Figure 1. Number of rounds $t$ needed to verify the stopping condition (3) with respect to (left) the number of edges $m$, with the dimension of the edge-arm space $\mathcal{Z}$ fixed to 25, and (right) the dimension of the edge-arm space $\mathcal{Z}$, with the number of edges fixed to 156. Curves are shown for matching, star, complete and circle graphs, in the best case ($\omega = \pi/2$) and the worst case ($\omega = 0.1$).*
For both experiments, we perform 100 runs and plot the average number of rounds needed to verify the stopping condition.

Let P ≥ 0 be such that, for all (i, j) ∈ E, Var(A_1^{(i,j)}) ⪯ P·I, and let M, N ≥ 0 be such that, for all (i, j) ∈ E:

∀k ∈ N(i), Cov(A_1^{(i,j)}, A_1^{(i,k)}) ⪯ M·I,
∀k ∈ N(j), Cov(A_1^{(i,j)}, A_1^{(j,k)}) ⪯ N·I.

We want to compare the quantity Var(A_1) for different types of graphs: star, complete, circle and matching. For a fair comparison, we consider graphs that reveal the same number of rewards at each round of the learning procedure. Hence, denoting respectively by n_S, n_Co, n_Ci and n_M the number of nodes in a star, complete, circle and matching graph with m edges, we get:

Star graph: Var(A_1) ⪯ (mP + n_S²(M + N))·I.
Complete graph: Var(A_1) ⪯ (mP + n_Co³(M + N))·I.
Circle graph: Var(A_1) ⪯ (mP + n_Ci(2M + 4N))·I.
Matching graph: Var(A_1) ⪯ (mP + n_M·N)·I.

We refer the reader to Appendix D for more details on these upper bounds. Since the star (respectively complete, circle and matching) graph with m edges has n_S = m/2 + 1 nodes (respectively n_Co = (1 + √(4m + 1))/2, n_Ci = m/2 and n_M = m), we obtain the bounds stated in Table 1.

Graph      | Upper bound on Var(A_1)   | α
Star       | mP + (M + N)·O(m²)        | O(1/√t)
Complete   | mP + (M + N)·O(m√m)       | O(1/(m^{1/4}√t))
Circle     | mP + (M + N)·O(m)         | O(1/√(mt))
Matching   | mP + m·N                  | O(1/√(mt))

Table 1. Upper bound on the variance and convergence rate of Algorithm 2 for the star, complete, circle and matching graphs, with respect to the number of edges m and the number of rounds t.

These four examples evidence the strong dependence of the variance on the structure of the graph: the more independent the edges (i.e., the fewer nodes they share), the smaller the quantity Var(A_1). For a fixed number of edges m, the best case is the matching graph, where no two edges share a node, and the worst case is the star graph, where all edges share a central node.

7. Experiments

In this section, we consider a modified version of a standard experiment introduced by Soare et al.
(2014) and used in most papers on best arm identification in linear bandits (Xu et al., 2018; Tao et al., 2018; Fiez et al., 2019; Zaki et al., 2019) to evaluate the sample complexity of our algorithm on different graphs. We consider d + 1 node-arms in X ⊂ R^d, where d ≥ 2. This node-arm set consists of the d vectors (e_1, …, e_d) forming the canonical basis of R^d and one additional arm x_{d+1} = (cos(ω), sin(ω), 0, …, 0) with ω ∈ (0, π/2]. Note that, by construction, the edge-arm set Z contains the canonical basis (e′_1, …, e′_{d²}) of R^{d²}. The parameter matrix M* has its first coordinate equal to 2 and all others equal to 0, so that θ* = vec(M*) = (2, 0, …, 0) ∈ R^{d²}. The best edge-arm is thus z* = z^{(1,1)} = e′_1. One can note that as ω tends to 0, it becomes harder to differentiate this arm from z^{(d+1,d+1)} = vec(x_{d+1} x_{d+1}^⊤) than from the other arms. We set η^{(i,j)}_t ~ N(0, 1) for all edges (i, j) ∈ E and rounds t. We consider the two cases ω = 0.1, which makes the edge-arms z^{(1,1)} and z^{(d+1,d+1)} difficult to differentiate, and ω = π/2, which makes the edge-arm z^{(1,1)} easily identifiable as the optimal edge-arm. For each of these two cases, we evaluate the influence of the graph structure, the number of edges m and the edge-arm space dimension d² on the sample complexity. Results are shown in Figure 1. When ω = 0.1, the type of graph does not impact the number of rounds needed to verify the stopping condition. This is mainly because the magnitude of the associated variance is negligible with respect to the number of rounds. Hence, even when we vary the number of edges or the dimension, we get the same performance for any type of graph, including the matching graph. This implies that our algorithm performs as well as a linear bandit that draws m edge-arms in parallel at each round.
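The construction above can be sketched numerically. The following is a minimal numpy sketch (ours, not the authors' code; the helper names node_arms and edge_arm are our own) of the node-arm set X, the edge-arm set Z and θ* = vec(M*). It illustrates that z^{(1,1)} stays optimal for any ω ∈ (0, π/2], while its margin over the runner-up edge-arm shrinks as ω tends to 0:

```python
import numpy as np

def node_arms(d, omega):
    """Node-arm set X: the canonical basis e_1..e_d of R^d, plus
    x_{d+1} = (cos(omega), sin(omega), 0, ..., 0)."""
    arms = [np.eye(d)[i] for i in range(d)]
    x_extra = np.zeros(d)
    x_extra[0], x_extra[1] = np.cos(omega), np.sin(omega)
    return arms + [x_extra]

def edge_arm(x, x_prime):
    """Edge-arm z^{(i,j)} = vec(x x'^T) in R^{d^2}."""
    return np.outer(x, x_prime).reshape(-1)

d = 3
M_star = np.zeros((d, d))
M_star[0, 0] = 2.0                  # first coordinate 2, all others 0
theta_star = M_star.reshape(-1)     # theta* = vec(M*) = (2, 0, ..., 0)

for omega in (0.1, np.pi / 2):
    X = node_arms(d, omega)
    Z = [edge_arm(x, xp) for x in X for xp in X]
    rewards = np.array([z @ theta_star for z in Z])
    best = int(np.argmax(rewards))  # index 0 corresponds to z^{(1,1)} = e'_1
    runner_up = np.sort(rewards)[-2]
    print(f"omega={omega:.2f}: best edge-arm index {best}, "
          f"margin over runner-up {rewards[best] - runner_up:.4f}")
```

With ω = 0.1 the optimal edge-arm's margin over the runner-up is on the order of 10^−2, whereas with ω = π/2 it equals 2, which is why the first case is the hard instance in the experiments above.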
When ω = π/2, the number of rounds needed to verify the stopping condition is smaller and the magnitude of the variance is no longer negligible. Indeed, when the number of edges or the dimension increases, the star graph takes more time to satisfy the stopping condition. Moreover, note that the sample complexities obtained for the circle and the matching graph are similar. These observations are in line with the dependence on the variance shown in Table 1.

8. Conclusion

We introduced a new graphical bilinear bandit setting and studied the best arm identification problem with fixed confidence. This problem being NP-hard even with knowledge of the true parameter matrix M*, we first proposed an algorithm that provides a 1/2-approximation. We then provided a second algorithm, based on the G-allocation strategy, that uses randomized sampling over the nodes to return a good estimate M̂ that can be used in place of M*. Finally, we highlighted the impact of the graph structure on the convergence rate of our algorithm and validated our theoretical results with experiments. Promising extensions of the model include considering unknown parameters M^{(i,j)}, different for each edge (i, j) of the graph, and investigating XY-allocation strategies.

Acknowledgements

We thank the anonymous reviewers, whose comments helped us improve the paper significantly.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.

Amin, K., Kearns, M., and Syed, U. (2011). Graphical models for bandit problems. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 1–10.

Audibert, J.-Y. and Bubeck, S. (2010). Best arm identification in multi-armed bandits. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 41–53.

Audibert, J.-Y., Bubeck, S., and Lugosi, G. (2011).
Minimax policies for combinatorial prediction games. In Proceedings of the 24th Annual Conference on Learning Theory, pages 107–132.

Bargiacchi, E., Verstraeten, T., Roijers, D., Nowé, A., and Hasselt, H. (2018). Learning to coordinate with coordination graphs in repeated single-stage multi-agent decision problems. In International Conference on Machine Learning, pages 482–490.

Bubeck, S., Munos, R., and Stoltz, G. (2009). Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37. Springer.

Cao, T. and Krishnamurthy, A. (2019). Disagreement-based combinatorial pure exploration: Sample complexity bounds and an efficient algorithm. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 558–588.

Cesa-Bianchi, N., Gentile, C., and Zappella, G. (2013). A gang of bandits. In Advances in Neural Information Processing Systems, pages 737–745.

Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422.

Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. (2014). Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, volume 27, pages 379–387.

Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159.

Degenne, R., Ménard, P., Shang, X., and Valko, M. (2020). Gamification of pure exploration for linear bandits. arXiv preprint arXiv:2007.00953.

Du, Y., Kuroki, Y., and Chen, W. (2020). Combinatorial pure exploration with full-bandit or partial linear feedback. arXiv e-prints, arXiv:2006.

Erdős, P. (1975). Problems and results on finite and infinite graphs. In Recent Advances in Graph Theory (Proc. Second Czechoslovak Sympos., Prague, 1974), pages 183–192.

Fiez, T., Jain, L., Jamieson, K.
G., and Ratliff, L. (2019). Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10667–10677.

Frank, M., Wolfe, P., et al. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110.

Guestrin, C., Lagoudakis, M. G., and Parr, R. (2002). Coordinated reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 227–234.

Hassin, R. and Khuller, S. (2001). z-Approximations. Journal of Algorithms, 41(2):429–442.

Jedra, Y. and Proutiere, A. (2020). Optimal best-arm identification in linear bandits. arXiv preprint arXiv:2006.16073.

Jourdan, M., Mutný, M., Kirschner, J., and Krause, A. (2021). Efficient pure exploration for combinatorial bandits with semi-bandit feedback. In Proceedings of the 31st International Conference on Algorithmic Learning Theory.

Jun, K.-S., Willett, R., Wright, S., and Nowak, R. (2019). Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pages 3163–3172.

Kazerouni, A. and Wein, L. M. (2019). Best arm identification in generalized linear bandits. arXiv preprint arXiv:1905.08224.

Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366.

Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692.

Perrault, P., Boursier, E., Valko, M., and Perchet, V. (2020). Statistical efficiency of Thompson sampling for combinatorial semi-bandits. In Advances in Neural Information Processing Systems.

Pukelsheim, F. (2006). Optimal Design of Experiments. Society for Industrial and Applied Mathematics.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Sagnol, G. (2010).
Optimal design of experiments with application to the inference of traffic matrices in large networks: second order cone programming and submodularity. PhD thesis, École Nationale Supérieure des Mines de Paris.

Siomina, I., Varbrand, P., and Yuan, D. (2006). Automated optimization of service coverage and base station antenna configuration in UMTS networks. IEEE Wireless Communications, 13(6):16–25.

Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836.

Tao, C., Blanco, S., and Zhou, Y. (2018). Best arm identification in linear bandits with linear dimension dependency. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4877–4886.

Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230.

Valko, M. (2020). Bandits on graphs and structures.

Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth graph functions. In Xing, E. P. and Jebara, T., editors, International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 46–54.

Welch, W. (1982). Algorithmic complexity: Three NP-hard problems in computational statistics. Journal of Statistical Computation and Simulation, 15:17–25.

Xu, L., Honda, J., and Sugiyama, M. (2018). A fully adaptive algorithm for pure exploration in linear bandits. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 843–851.

Zaki, M., Mohan, A., and Gopalan, A. (2019). Towards optimal and efficient best arm identification in linear bandits. arXiv preprint arXiv:1911.01695.

Zaki, M., Mohan, A., and Gopalan, A. (2020). Explicit best arm identification in linear bandits using no-regret learners. arXiv preprint arXiv:2006.07562.

Çivril, A. and Magdon-Ismail, M. (2009).
On selecting a maximum volume sub-matrix of a matrix and related problems. Theoretical Computer Science, 410(47):4801–4811.