# matroid_semibandits_in_sublinear_time__d25088b9.pdf Matroid Semi-Bandits in Sublinear Time Ruo-Chun Tzeng 1 Naoto Ohsaka 2 Kaito Ariu 2 We study the matroid semi-bandits problem, where at each round the learner plays a subset of K arms from a feasible set, and the goal is to maximize the expected cumulative linear rewards. Existing algorithms have per-round time complexity at least Ω(K), which becomes expensive when K is large. To address this computational issue, we propose Faster CUCB whose sampling rule takes time sublinear in K for common classes of matroids: O(D polylog (K) polylog (T)) for uniform matroids, partition matroids, and graphical matroids, and O(D Kpolylog (T)) for transversal matroids. Here, D is the maximum number of elements in any feasible subset of arms, and T is the horizon. Our technique is based on dynamic maintenance of an approximate maximum-weight basis over inner-product weights. Although the introduction of an approximate maximum-weight basis presents a challenge in regret analysis, we can still guarantee an upper bound on regret as tight as CUCB in the sense that it matches the gap-dependent lower bound by Kveton et al. (2014a) asymptotically. 1. Introduction Matroid semi-bandits model many real-world tasks. An instance of matroid semi-bandit is described by ([K], X, µ), where [K] {1, , K} is the ground set, each k [K] is associated with a probability distribution νk with mean µk, and X {0, 1}K is the set of bases of a given matroid M = ([K], I) of rank D. At each round t [T], the learner pulls an action x(t) X and observes a semi-bandit feedback, i.e., yk(t) νk iff xk(t) = 1. This formulation can be used to model online advertisting and news selection (Kale et al., 2010) with M as a uniform matroid. Ad placement Work done during an internship at Cyber Agent. 1EECS, KTH Royal Institue of Technology, Sweden 2AI Lab, Cyber Agent, Japan. Correspondence to: Ruo-Chun Tzeng . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). (Bubeck et al., 2013; Streeter et al., 2009) and diversified recommendation (Abbassi et al., 2013) can be modeled with M as a partition matroid. Network routing (Kveton et al., 2014a) can be modeled with M as a graphical matroid. Task assignment (Chen et al., 2016) can be modeled with M as a transversal matroid. Popular algorithms include Combinatorial Upper Confidence Bound (CUCB) (Gai et al., 2012; Chen et al., 2013; Kveton et al., 2014a; 2015), Combinatorial Thompson Sampling (CTS) (Wang and Chen, 2018; Kong et al., 2021; Perrault, 2022), and the instance-specifically optimal algorithm KL-based Efficient Sampling for Matroids (KL-OSM) (Talebi and Proutiere, 2016). All of these algorithms rely on a greedy algorithm (see Algorithm 1) to determine the action to be pulled. The greedy algorithm takes time at least Ω(K) and at most O(K(log K + Tmember)), where Tmember is the time taken to determine whether x + ek I for some (x, k) I [K], and ek is the k-th canonical unit vector. However, when the number K of arms is large, performing the greedy algorithm at each round can become expensive. There is a need to develop a matroid semi-bandit algorithm with per-round time complexity sublinear in K. In this work, we present Faster CUCB (Algorithm 5), the first sublinear-time algorithm for matroid semi-bandit. The design of Faster CUCB is based on CUCB, but with a much faster sampling rule which takes time sublinear in K for many classes of matroids. For uniform matroids, partition matroids, and graphical matroids, it has per-round time complexity of O(D polylog (K) polylog (T)), which is optimal up to a polylogarithmic factor as compared to the trivial lower bound of Ω(D). For transversal matroids, the perround time complexity is O(D K polylog (T)), which is still sublinear in K when D = O(K 1 2 ϵ) for any ϵ > 0. Faster CUCB trades the accuracy for computational efficiency. In other words, the action computed by the sampling rule of Faster CUCB is an approximation to the optimal solution computed by the sampling rule of CUCB. This introduces difficulty in the regret analysis because prior analysis of CUCB (Kveton et al., 2014a) requires the exact solution. What is interesting is that we can still guarantee the same regret upper bound asymptotically as prior analysis of CUCB. To develop a sublinear-time sampling rule, we present a dynamic algorithm for maintaining maximum-weight base Matroid Semi-Bandits in Sublinear Time CUCB Faster CUCB Per-round Time Complexity O(K(log K + Tmember)) O(D polylog (T) Tupdate(A)) Uniform Matroid O(K log K) O(D log K polylog (T)) Partition Matroid O(K log K) O(D log K polylog (T)) Graphical Matroid O(K log K) O(D polylog (K) polylog (T)) Transversal Matroid O(K(log K + DK)) O(D K polylog (T)) Table 1. Per-round time complexity of CUCB (Kveton et al., 2014a) and Faster CUCB (Algorithm 5) for different classes of matroids. K is the number of arms and D is the maximum number of elements in any action in X. Tmember for different matroids is discussed in Appendix C. Tupdate(A) for different matroids is discussed in Section 3. over inner product weights (Section 4). There have been many sublinear-time algorithms for dynamic maximumweight base maintenance (see Section 3), which, however, may not be directly used in Faster CUCB because all arm weights representing the UCB index can change simultaneously at each round. Our insight for addressing this issue is that the UCB index of each arm k at round t can be decomposed into an inner product of the following twodimensional vectors: (1) a feature, which depends on k and is supposedly a pair of the empirical reward estimate and radius of confidence interval, and (2) a query, which depends only on round t. Our proposed dynamic algorithm consists of two speeding-up techniques. One is feature rounding, which rounds each feature into a few bins so as to reduce the number of distinct features to consider. The other is the minimum hitting set technique, which allows us to compute a small number of queries in advance and correctly identify an (approximate) maximum-weight base for any query. Sections are organized as follows. We introduce matroid semi-bandits and basic concepts in Section 2. We review relevant literature in Section 3. We develop a dynamic algorithm for maintaining a maximum-weight base over inner product weights in Section 4. We propose Faster CUCB based on the algorithms developed in Section 4 and analyzed its regret and time complexity in Section 5. 2. Preliminaries We use [n] to denote the set {1, , n}. We use i (µ) to denote any element in argmaxx X µ, x , and when it is clear from the context, we drop µ from i (µ) and write i . We use supp ( ) to denote the support set of a given vector. We use ek to denote the vector with 1 only on the k-th row and 0 s elsewhere, and use 0K to denote a K-dimensional vector with 0 on every row. We use log with base e. See Appendix A for a table of notation. Matroid. A matroid is described by M ([K], I), where [K] is called the ground set and I {0, 1}K is the set of independent sets satisfying (i) hereditary property, i.e., if supp (y) supp (x) and x I, then y I; and (ii) augmentation property, i.e., if x, y I and supp (y) supp (x), then there exists j supp (x) \ supp (y) such that y + ej I. We said x I is a basis if supp (x) is maximal, i.e., there does not exist y I such that supp (x) supp (y). All bases have the same cardinality, which is called the rank of the matroid. For v RK + , a maximum-weight basis i (v) argmaxx X PK k=1 vkxk can be found by a greedy algorithm (Algorithm 1) in O(K(log K + Tmember)) time, where Tmember is the time taken for the membership oracle to determine whether x + ek I for some x I and some k [K]\ supp (x). Algorithm 1 A greedy maximum-weight basis algorithm Input: v RK and the bases X {0, 1}K. Sort v in non-increasing order: vγ(1) vγ(K); x = 0K; i = 1; while x 0 < D do if x + eγ(i) I then x x + eγ(i) i i + 1; end Matroid semi-bandits. An instance of matroid semibandit is described by ([K], X, µ), where [K] is the ground set, X {0, 1}K is the set of bases of the given matroid M ([K], I) of rank D, and I {0, 1}K is the set of independent sets. Each k [K] is associated with a distribution νk with mean µk. The learner knows the matroid M, and aims to learn the best action i (µ) argmaxx X µ, x by playing a game with the environment: At each round t N, the learner plays an action x(t), and the environment draws a noisy reward yk(t) νk for each arm k [K] and reveals yk(t) to the learner iff k supp (x(t)). We assume arms rewards are bounded: Assumption 2.1. Assume the support of each arm νk is a subset of [a, b] and 0 < a < b < . The performance is measured by expected regret: R(T) T µ, i (µ) E t=1 µ, x(t) Matroid Semi-Bandits in Sublinear Time which is the difference between the expected cumulative reward of the learner and that of an algorithm who knows µ and always selects the best action i (µ). Common classes of matroids. Refer to Chapter 39 (Schrijver, 2003) or Chapter 1 (Oxley, 2011) for more details. A uniform matroid ([K], I) of rank D has independent sets I = {S [K] : |S| D} and the bases X consist of subsets whose cardinalities are exactly D. A partition matroid ([K], I) of rank D is given a partition S1, , SD of the ground set [K], the independent sets I = {S : |S Si| 1, i [D]}, and the bases X are subsets that choose exactly one element from each of the D sets. A graphical matroid is given a graph G = (V, E) with K edges, the bases X consist of all spanning forests in G, and the rank D is |V | minus the number of connected components in G. A transversal matroid is given a bipartite graph G = ([K] V, E) with a bipartition ([K], V ), |V | K, the independent sets I consist of S [K] such that there is a matching of S to |S| vertices in V , and X is the set of endpoints in [K] of all maximum matchings in G. We discuss the query time Tmember of membership oracle in Appendix C. For more examples on semi-bandits under different matroid constraints, we refer the readers to (Kveton et al., 2014a) for linear matroids, and (Kveton et al., 2014b) for polymatroid semi-bandits. CUCB. The action selected by CUCB (Gai et al., 2012; Chen et al., 2013; Kveton et al., 2014b) at round t N is: x(t) argmax x X ˆµk(t 1) + λt p (1) where ˆµk(t) 1 Nk(t) Pt s=1 yk(s)1{xk(s) = 1} is the em- pirical reward estimate, Nk(t) Pt s=1 1{xk(s) = 1} is the number of pulls of arm k, and λt > 0 controls the confidence width. The value ˆµk(t 1) + λt Nk(t 1) is called the UCB index of arm k. In (Kveton et al., 2014a), Eq. (1) is solved by a O(K(log K + Tmember))-time greedy algorithm shown in Algorithm 1. In Section 4, we will develop a faster algorithm for solving Eq. (1) with the following reformulation: x(t) argmax x X k=1 f k, q xk, where f k = (ˆµk(t 1), 1 Nk(t 1)) and q = (1, λt). 3. Related Works Semi-bandits and sublinear-time bandits. We provide an extensive survey on related bandit literature in Appendix B. To summarize here, for semi-bandit algorithms, CUCB (Gai et al., 2012; Chen et al., 2013; Kveton et al., 2014a), CTS (Wang and Chen, 2018; Kong et al., 2021; Perrault, 2022) and KL-OSM (Talebi and Proutiere, 2016) all rely on a O(K(log K + Tmember))-time greedy algorithm to compute the action to be pulled. In contrast, our Faster CUCB, as far as we know, is the first semi-bandit algorithm having per-round time complexity of o(K). For linear bandits, there exist several works (Jun et al., 2017; Yang et al., 2022) on reducing the per-round complexity to be sublinear in the number of arms. But, their results transferred to our setting are worse than what we have obtained both in terms of regret bound and the time complexity (see the discussion in Appendix B). Dynamic maintenance of maximum-weight base of a matroid Here, we review existing dynamic algorithms for maintaining a maximum-weight base of a matroid. In a standard (fully-)dynamic setting, we are given a weighted matroid M = ([K], I), where each arm s weight dynamically changes over time in an online manner. The objective is to maintain any (exact or approximate) maximum-weight basis of M over up-to-date arm weights as efficiently as possible. We use Tupdate(A) to denote the time complexity of a dynamic algorithm A required for updating an (approximate) maximum-weight base according to the change of a single arm weight. The best-known bound of Tupdate(A) for each matroid class is summarized as follows: For graphic matroids, a maximum-weight basis can be updated in O( K) worst-case time (Frederickson, 1985; Eppstein et al., 1997) and in O(polylog (K)) amortized time (Holm et al., 2001). For laminar matroids (which include uniform and partition matroids as special cases), the worst-case time complexity for exact dynamic algorithms is bounded by O(log K) (Henzinger et al., 2023). For transversal matroids, a O(K1.407)- time exact dynamic algorithm is known (van den Brand et al., 2019), while a (1 + η)-approximation dynamic algorithm runs in O(η 2 K) time (Gupta and Peng, 2013). We can safely assume that, after updating multiple arm weights, A returns an (approximate) maximum-weight basis in O(D) time, where D is the rank of a matroid. 4. Dynamic Maintenance of Maximum-weight Base over Inner Product Weight In this section, we develop a sublinear-time sampling rule, which is used as a subroutine in Faster CUCB. Recall that any static algorithm that solves the linear maximization of Eq. (1) from scratch requires at least Ω(K) time, which is computationally expensive. To circumvent this issue, Matroid Semi-Bandits in Sublinear Time we present a dynamic algorithm for maintaining an (approximate) maximum-weight base of a matroid where arm weights change over time. The next subsection begins with formalizing the problem setting. 4.1. Problem Setting and Technical Result Consider the following problem setting: Let M = ([K], I) be a matroid of rank D over K arms, and X be the set of its bases. Each arm k [K] has a (nonnegative) twodimensional vector f k = (αk, βk) R2 + referred to as a feature, which may change as time goes by. Given a (nonnegative) two-dimensional vector q R2 + as a query, we are required to find any (approximate) maximum-weight base of M, where arm k s weight is given by projecting its feature onto q, i.e., f k, q . Observe that in the matroid semi-bandit setting, each arm k s feature f k = (αk, βk) corresponds to a pair of the empirical reward estimate αk = ˆµk(t 1) and radius βk = 1 Nk(t 1) of confidence interval, and a query is q = (1, λt) at round t, both of which change over rounds. Hereafter, we make the following two assumptions. Assumption 4.1. Lower and upper bounds, denoted by αlb and αub (resp. βlb and βub), on the possible positive values of αk s (resp. βk s) at anytime are known; namely, it always holds that αk {0} [αlb, αub] and βk {0} [βlb, βub]. The precise values of these bounds will be discussed in Section 5. Assumption 4.2. There exists a dynamic algorithm A for maintaining a (1 + η)-approximate maximum-weight base of M with arm weights changing over time, where parameter η (0, 1) specifies the approximation guarantee. Denote by Tinit(A; η) and Tupdate(A; η) the time complexity of A required for initializing the data structure and updating a single arm s weight, respectively. We can safely assume that after updating multiple arm weights, A returns a maximumweight base in O(D) time. See Section 3 for existing implementations. Our dynamic algorithm is parameterized by a precision parameter ϵ (0, 1), and consists of the following three procedures: INITIALIZE: Given lower and upper bounds [αlb, αub] and [βlb, βub] as in Assumption 4.1, K features (αk, βk)k [K], a matroid M = ([K], I), a dynamic algorithm A for maximum-weight base maintenance as in Assumption 4.2, and a precision parameter ϵ, this procedure initializes the data structure used in the remaining two procedures. FIND-BASE: Given a query q, this procedure is supposed to return a (1 + ϵ)-approximate maximum-weight base of M, where arm k s weight is defined as f k, q for the up-to-date k s feature f k. UPDATE-FEATURE: Given an arm k and a new feature f k, this procedure reflects the change of arm k s feature on the data structure. Remark 4.3. Our problem setting is different from a canonical setting of dynamic maximum-weight base maintenance in a sense that the arm weights are revealed when a query is issued in FIND-BASE. Consequently, existing dynamic algorithms may not be used directly. The technical result in this section is stated below. Theorem 4.4 ( ). There exist implementations of INI- TIALIZE, FIND-BASE, and UPDATE-FEATURE such that the following are satisfied: FIND-BASE always returns a (1 + ϵ)-approximate maximum-weight base of a matroid M with arm k s weight defined as f k, q for an up-to-date k s feature f k and a query q. Moreover, INITIALIZE runs in O(K + poly (W) Tinit(A; ϵ 3)) time, FIND-BASE runs in O( poly (W) + D) time, and UPDATE-FEATURE runs in O( poly (W) Tupdate(A; ϵ 3)) time, where W = O ϵ 1 log αub Remark 4.5. The proof of Theorem 4.4 can be easily adapted to the case when (the update procedure of) dynamic algorithm A has only amortized complexity. In such case, Theorem 4.4 holds in the amortized sense rather than the worst-case sense. The remainder of this section is organized as follows: In Section 4.2, we apply a rounding technique to arm features to reduce the number of distinct features to consider, in Section 4.3, we investigate the representability of permutations induced by inner product weights to deal with multiple queries efficiently, and Section 4.4 finally develops our dynamic algorithm for maximum-weight base maintenance. All proofs of the lemmas appearing in this section are deferred to Appendix D. 4.2. Rounding Arm Features Here, we apply a rounding technique to arm features so as to reduce the number of distinct features to consider. Hereafter, let η ϵ 3, so that (1 + η)2 1 + ϵ for ϵ (0, 1). Define W max log1+η W { } [W] = { , 1, 2, 3, . . . , W}. (3) Since any features are within ({0} [αlb, αub]) ({0} [βlb, βub]) as guaranteed by Assumption 4.2, we shall partition the possible region of the features into |W|2 bins. For Matroid Semi-Bandits in Sublinear Time each q, r W, define BINq,r R2 + as BINq,r (αlb(1 + η)q 1, αlb(1 + η)q] (βlb(1 + η)r 1, βlb(1 + η)r], (4) where (αlb(1 + η) , αlb(1 + η) ] {0}, (5) (βlb(1 + η) , βlb(1 + η) ] {0}. (6) Note that these bins are pairwise disjoint, and that q,r WBINq,r covers ({0} [αlb, αub]) ({0} [βlb, βub]); i.e., any possible feature belongs to a unique BINq,r. For each q, r W, let domq,r R2 + denote the unique dominating point of BINq,r; namely, domq,r (αlb(1 + η)q, βlb(1 + η)r). (7) For any feature f k = (αk, βk), we will use dom(f k) = dom(αk, βk) to denote the dominating point domq,r such that f k BINq,r. See Figure 1 in Appendix D for illustration of BINq,r s, domq,r s, and dom(f k) s. Observe easily that for any feature f k R2 + and query q R2 +, 1 1 + η dom(f k), q < f k, q dom(f k), q . (8) By Eq. (8), we can replace each arm s feature by its dominating point without much deteriorating the quality of the (approximate) maximum-weight base, as shown below. Lemma 4.6 ( ). Let f 1, . . . , f K ({0} [αlb, αub]) ({0} [βlb, βub]) be K features, q R2 + be a query, and x dom be a (1 + η)-approximate maximum-weight base of matroid M with arm k s weight defined as dom(f k), q . Then, for any base x of M, it holds that k supp(x dom) f k, q 1 1 + ϵ X k supp(x) f k, q . (9) In particular, x dom is a (1 + ϵ)-approximate maximumweight base with arm k s weight defined as f k, q . 4.3. Handling Multiple Queries From weighting to permutation. Now we deal with multiple queries. Our idea is that, if two queries q1, q2 R2 + are very close to each other, then they should derive the common maximum-weight base (provided that features are fixed). This intuition can be justified with respect to orderings of arms. For two total orders and over [K], we say that is consistent with if k k implies k k for any k = k . The following fact is easy to confirm: Lemma 4.7 ( ). Let w = (w1, . . . , w K) RK + be K arm weights and be a total order over [K] such that k k if and only if wk wk . Let be a total order over [K] that is consistent with . If x is a base of matroid M obtained by running the greedy algorithm over any ordering of [K] consistent with , it is a maximum-weight base of M over arm k s weight wk; namely, for any base x of M, x , w x, w . (10) Lemma 4.7 implies that any maximum-weight base can be obtained by running the greedy algorithm over some total order ; moreover, we can safely assume that is strict (i.e., k k or k k for all k = k ), or equivalently, a permutation over [K]. Our strategy for dealing with multiple queries is: (1) we enumerate all possible permutations in advance, and (2) we guess a permutation consistent with the arm weights determined based on a query. To this end, the following question arises: What kind of and how many permutations are representable given a fixed set of features? Characterizing representable permutations. To answer the above question, we characterize representable permutations. Hereafter, let SK denote the set of all permutations over [K], and f 1, . . . , f K R2 + be any fixed, distinct K features. We say that a query q R2 over f 1, . . . , f K represents a permutation π SK if f π(i), q > f π(j), q for all 1 i < j K,1 and that π is representable if such q exists. For a permutation π SK to be representable, we wish for some query q R2 to ensure that for any i < j, arm π(i) s weight is (strictly) higher than arm π(j) s weight. This requirement is equivalent to f π(i) f π(j), q > 0; thus, if the following system of linear inequalities is feasible, any of its solutions q represents π: f π(i) f π(j), q > 0 for all 1 i < j K. (11) Observe now that the above system is feasible if and only if the intersection of Pi,j for all i < j is nonempty, where Pi,j is an open half-plane defined as Pi,j {q R2 : f π(i) f π(j), q > 0}. (12) Because each Pi,j is obtained by dividing R2 by a unique line that is orthogonal to line f π(i)f π(j) and intersects the origin 0, the set of feasible solutions for Eq. (11) is equal to (the interior of) a polyhedral cone defined by the boundaries of a particular pair of Pi,j s. Here, we characterize the representable permutations by the concept of arrangement of lines. Let L {l1, . . . , l( K 2)} denote K 2 lines, each of which is orthogonal to line f kf k for some k = k and intersects 0. Given such L, a cell C 1This definition does not allow ties ; i.e., no pair k = k satisfies f k, q = f k , q . Matroid Semi-Bandits in Sublinear Time in arrangement of L is defined as a maximum connected region of R2 that does not intersect with L (which is the interior of a polyhedral cone). Then, for each cell C, every query q in C represents the same permutation πC SK depending only on C; namely, there is a bijection between the representable permutations and the cells in arrangement of L. See Figure 2 in Appendix D for illustration. With this connection in mind, we demonstrate that reserving a single vector for each cell suffices to cover all representable permutations. A minimum hitting set of the cells in arrangement of L is defined as any minimum set H of vectors in R2 such that H and each cell have a non-empty intersection. Lemma 4.8 ( ). Let H be a minimum hitting set of the cells in arrangement of L. Then, for any query q R2, there exists a vector h H such that for any k = k , f k, h > f k , h = f k, q f k , q . (13) Lemma 4.8 along with Lemma 4.7 ensure that for any query q R2, there is a vector h in H such that a maximumweight base with arm k s weight f k, h is a maximumweight base with arm k s weight f k, q . Generating a minimum hitting set. Subsequently, we generate a minimum hitting set. One may think that it requires exponentially long time because the number of permutations in SK is K!. However, it turns out that the number of cells in arrangement of L is O(K2) (i.e., so is the number of representable permutations), and a minimum hitting set can be constructed in poly (K) time. Lemma 4.9 ( ). The number of cells in arrangement of L is at most O(K2). Moreover, we can generate a minimum hitting set in poly (K) time (by using Algorithm 6 in Appendix D). 4.4. Putting It All Together: Algorithm Description and Complexity We are now ready to implement the three procedures. We here stress that applying either of the feature rounding or minimum hitting set technique separately does not make sense: On one hand, if we only apply feature rounding, we would have to recompute each arm s weight every time a query is issued, which is expensive. On the other hand, if the minimum hitting set technique is only applied (to raw features f k s), then a minimum hitting set H cannot be constructed in advance due to a dynamic nature of features, and its size would be O(K2), which is prohibitive. By applying both techniques, (1) we know a priori the set of possible dominating points, whose size O(W 2) depends only on W; moreover, (2) we can create a minimum hitting set H of size O(W 4) beforehand at initialization. Pseudocodes of INITIALIZE, FIND-BASE, and UPDATEFEATURE are described in Algorithms 2 to 4, respectively. The proof of Theorem 4.4 follows from Lemmas 4.6 to 4.9, whose details are deferred to Appendix D. In INITIALIZE, we construct a hitting set H of domq,r s and 1 1+η domq,r s rather than solely of domq,r s, which incurs a constant-factor blowup in the time complexity. Though this change is not needed in the proof of Theorem 4.4, the following immediate corollary of Lemma 4.8 is crucial in the regret analysis of Section 5. Corollary 4.10 ( ). Let H be a minimum hitting set constructed in Algorithm 2. Then, for any query q R2, there exists a vector h H such that for any dom = domq,r and dom = domq ,r with q, r, q , r W, dom, h > dom , h = dom, q dom , q , dom, h > dom , h 1 + η = dom, q dom , q Algorithm 2 INITIALIZE. Input: lower and upper bounds [αlb, αub] and [βlb, βub]; K features (f k)k [K]; precision parameter ϵ (0, 1). Define W by Eq. (3); for each q, r W do Define BINq,r and domq,r by Eqs. (4) and (7); end Define η ϵ 3; Construct a minimum hitting set H of size O(W 4) for domq,r s and 1 1+η domq,r s by Lemma 4.9; for each h H do Create an instance Ah of dynamic maximum-weight base algorithm with precision parameter η ϵ 3, M, and arm k s weight dom(f k), h ; end Algorithm 3 FIND-BASE. Input: query q R2 +. Find h H s.t. q and h belong to (the closure of) the same cell in arrangement of V; Call Ah and return the maximum-weight base x of M with arm k s weight dom(f k), h ; Algorithm 4 UPDATE-FEATURE. Input: arm k [K]; new feature f k R2 +. for each h H do Change arm k s weight stored in Ah to dom(f k), h ; end 5. Our Proposed Algorithm: Faster CUCB In this section, we present Faster CUCB in Algorithm 5, which uses procedures introduced in Section 4. Matroid Semi-Bandits in Sublinear Time The purpose of initialization procedure is to ensure each arm is pulled at least once. It takes at most K rounds, and in each round, the computation of i (ek) takes O(K Tmember) time because the permutation γ in Algorithm 1 can be specified explicitly as γ(j) = k if j = 1, γ(j) = j 1 if 2 j < k, and γ(j) = j + 1 if j k. So, it only require to compute at most K membership tests. After the initialization, the computation of each round t consists of one call to FIND-BASE for computing the action x(t), the update of ˆµk(t) and Nk(t) for each k supp (x(t)), and one call to UPDATE-FEATURE for updating the feature of each arm k supp (x(t)) stored in the instances of the dynamic maxium-weight base algorithm. Algorithm 5 Faster CUCB Input: the total number of rounds T, λt, and m N Initialization: t = 0; while k [K] such that Nk(t) = 0 do Pull i (ek); t = t + 1; end INITIALIZE a, b, 1 T , 1, (ˆµk(t), Nk(t))k [K] , 1 logm T while t < T do x(t) FIND-BASE((1, λt)); Pull x(t) and receive yk(t) νk for each k supp (x(t)); for k supp (x(t)) do Nk(t) Nk(t 1) + 1; ˆµk(t) t 1 t ˆµk(t 1) + 1 UPDATE-FEATURE k, µk(t), 1 end t = t + 1; end 5.1. Per-round Time Complexity By Theorem 4.4, one call to FIND-BASE takes O ( poly (W) + D) and D calls to UPDATE-FEATURE take O(D poly (W) Tupdate(A; ϵ W = O logm T log b T = O logm+1 T , the per-round time complexity of Algorithm 5 is O D polylog (T) Tupdate A; ϵ . Here, we will set ϵ = 1 logm T for the regret analysis. 5.2. Regret Upper Bound Notation. Fix µ Λ and i argmaxx X µ, x . We introduce a few notation. Let {j}D j=1 be the permutation of supp (i ) such that µ1 µD. Define j,k µj µk and dk max{j [D] : j,k > 0} for j supp (i ) and k / supp (i ), and min mink/ supp(i ) dk,k. Theorem 5.1. Let λt = p 1.5(b a)2 log t and m N. Define T0 max{K, exp(( b min ) 1 m )}. For T N, the expected regret of Algorithm 5 is upper bounded by k/ supp(i ) j=1 j,k T0 + 12 dk,k(b a)2 log T µdk 1+log m T µk 2 k/ supp(i ) As a consequence of Theorem 5.1, setting T yields: lim T R(T) log T X k/ supp(i ) which matches Theorem 4 in (Kveton et al., 2014a), lim inf T R(T ) log T = Ω( K D min ), asymptotically up to a constant factor. Note that Faster CUCB is faster than CUCB when min = Ω( 1 polylog(K)) and when T = poly (K). Also, similar to (Cuvelier et al., 2021b), our per-round time complexity also goes to infinity as T , one way to address this issue is to use CUCB when the per-round time complexity of ours is larger than that of CUCB. Useful lemmas. Here we present two lemmas that will be used to show Theorem 5.1 in Section 5.3. First, inspired by Kveton et al. (2014a), we define a bijection gt from supp (i ) to supp (x(t)) with the following properties: Lemma 5.2. There exists a bijection gt : supp (i ) supp (x(t)) such that (i) gt(j) = j for j supp (i ) supp (x(t)); (ii) for any j supp (i ) \ supp (x(t)), xgt(j)(t) = 1 = D dom(f gt(j)), h E dom(f j), h 1 + 1 3 logm T . The proof of Lemma 5.2 is in Appendix E.1, where an explicit construction of gt is provided. Property (i) allows us to decompose the instantaneous regret i x(t), µ = P k/ supp(i ) P j supp(i ) j,k1{gt(j) = k}, and Property (ii), Lemma 5.2, allows us to derive a bound of PT t=1 1{gt(j) = k} that relates with UCB values. Second, for technical reasons, we need the precision parameter ϵ = log m T to be small enough so that i,j and µi (1+ϵ)µj have the same sign. The below lemma (proved in Appendix E.2) gives the threshold to make it happen: Lemma 5.3. Let ϵ < min b . Then, for any i supp (i ) and any j / supp (i ), µi µj > 0 = µi 1+ϵ µj > 0. 5.3. Proof of Theorem 5.1 For T T0, R(T) is trivially bounded by T µ, i Db T0. In the following, we assume T > T0. Matroid Semi-Bandits in Sublinear Time As gt is a bijection from supp (i ) to supp (x(t)) and gt(j) = j for j supp (i ) supp (x(t)), we can rewrite E[ i x(t), µ ] = X k/ supp(i ) j supp(i ) j,k E[1{gt(j) = k}] k/ supp(i ) j=1 j,k E 1 gt(j) = k so that the expected regret is bounded from the above by: k/ supp(i ) t=1 1 gt(j) = k # k/ supp(i ) j=1 j,k (I)j,k + (II)j,k , (I)j,k = PT t=1 E h 1 n gt(j) = k, Nk(t) nj,k oi (II)j,k = PT t=1 E h 1 n gt(j) = k, Nk(t) > nj,k oi and nj,k = max 6(b a)2 log T ( µj 1+log m T µk)2 , T0 . The proof is completed by bounding related terms of (I)j,k and (II)j,k by Lemma 5.4 (proved in Appendix E.2) and Lemma 5.5. Lemma 5.4. Let k / supp (i ) and j [dk]. For T > T0, j=1 j,k(I)j,k j=1 j,k T0 + 12(b a)2 dk,k log T ( µdk 1+log m T µk)2 . Lemma 5.5. Let k / supp (i ) and j [dk]. For T > T0, Proof sketch: See Appendix E.2 for the entire proof. Let ϵ 1 logm T . First, we claim: gt(j) = k = uk(Nk(t 1), T) mins log t and Nj(t 1) [t 1], uk(Nk(t 1), T) uj(Nj(t 1), t) 1 + ϵ mins n j,k}. From Eq. (14), (II)j,k is upper bounded by t Tj,k 1 uk(Nk(t 1), T) mins µj 1+ϵ See Appendix E.2 for the derivation of Eq. (16). Observe when t Tj,k, 1{A3,t,s} 1 µk + 2λT pnj,k + 1 > µj 1 + ϵ where the inequality is because Nk(t 1) > nj,k, and the equality is because nj,k 4λ2 T ( µj 1+ϵ µk)2 = 4λ2 T nj,k + 1 < µj and also µj 1+ϵ µk > 0 which is ensured by Lemma 5.3 as T > T0. Finally, from Eq. (16) and using Hoeffding s inequality, we get s 0 µk(t) the average 1 t Pt s=1 yk(s) of rewards of arm k in the first t rounds uk(s, t) the UCB value of µk(s) + λt s under s samples of arm k and with confidence parameter λt Table 2. Table of notation. Matroid Semi-Bandits in Sublinear Time B. Further Related Works In this section, we review relevant literatures on combinatorial semi-bandits and sublinear-time bandits. We focus on the stochastic setting. For ease of comparison, we assume the best action i argmaxx X x, µ is unique, and define min minj,k [K]:i j =1,i k=0,µj µk>0(µj µk), and minx =i : i x,µ >0 i x, µ . Matroid semi-bandits. Kveton et al. (2014a) showed an instance such that any uniformly good algorithm2 suffer R(T) = Ω (K D) log T . They also showed that CUCB (Gai et al., 2012; Chen et al., 2013) have a regret bound scaling as O (K D) log T . Talebi and Proutiere (2016) showed an instance-specific lower bound lim inf T R(T ) log T c(µ) for uniformly good algorithms, where c(µ) is the optimum of a semi-infinite linear program (Graves and Lai, 1997; Combes et al., 2015), and proposed KL-OSM whose regret upper bound matches this lower bound. The per-round compleixty of KL-OSM is K line search for generating the indices plus the time for solving a linear maximization. Both CUCB and KL-OSM rely on the greedy algorithm (Algorithm 1) to solve the linear maximization for determing the action to be pulled. The time complexity of the greedy algorithm is upper bounded by O(K(log K + Tmember)) time and lower bounded by Ω(K). (Perrault et al., 2019) showed that the sampling rule of many combinatorial semi-bandit algorithms is a maximization problem over a summation of a linear function and a submodular function, and proposed two efficient algorithms for matroid semi-bandits: One is based on local search and the other is a greedy algorithm. Both have per-round time complexity at least Ω(KD). In contrast, our Faster CUCB is the first matroid semi-bandit algorithm with per-round time complexity sublinear in K for many classes of matroids. Combinatorial semi-bandits. Here, we review works that focus on the standard setting of stochastic combinatorial semi-bandits. These consider a linear reward function and any action sets X, where linear maximization maxx X x, v for any v RK can be solved in time polynomial in K. We omit the discussion on works that focus on a specific action set (Chowdhury et al., 2023), with additional structural assumptions on the rewards (Wen et al., 2015; Perrault et al., 2020b), or with a different reward function (Papadigenopoulos and Caramanis, 2021). Perrault et al. (2020a) showed that CTS has a regret bound of O( K log2 D log T ) for mutually independent gaussian rewards and a regret bound of O( KD log2 D log T ) for correlated gaussian rewards. Perrault (2022) sharpen the regret bound of CTS for the case of mutually independent gaussian rewards to be O( K log D log T ). The per-round time complexity of CTS is at least Ω(K) due to sampling from the posterior distributions. Degenne and Perchet (2016) showed that ESCB2 has regret bound of O( K log2 D log T ) for independent subgaussian rewards, but its sampling rule is NP-hard (Atamt urk and G omez, 2017) to optimize. Cuvelier et al. (2021b) proposed AESCB that approximates ESCB2 with per-round time complexity of O(KD log3 K poly (log T)) while maintaining the same regret bound. Their technique is based on rounding and budgeted-linear maximization. OSSB (Combes et al., 2017) is an asymptotically instance-specifically optimal algorithm for general structured bandits, including combinatorial semi-bandits, but at each round, it requires to solve a semi-infinite linear program (Graves and Lai, 1997). Cuvelier et al. (2021a) developed a method that runs in time polynomial in K to solve the semi-infinite linear program for Gaussian rewards. They managed to maintain OSSB s asymptotic optimality for m-sets, but not for spanning trees and bipartite matchings. Ito (2021) and Tsuchiya et al. (2023) proposed algorithms based on the optimistic FTRL framework that achieve O( KD log T ) regret in the stochastic setting and O( KDT log T) in the adversarial setting. At each round, the proposed algorithms first use FTRL rule to obtain a vector a(t) in the convex hull of X and then sample an action x(t) based on a(t). Tsuchiya et al. (2023) mentioned that the computational efficiency of the sampling step has long been a problem in semi-bandits using the optimistic FTRL framework. Sublinear-time linear bandits. Several works (Jun et al., 2017; Yang et al., 2022) focusing on making per-round complexity of linear bandits sublinear in the number of arms. Maximum Inner Product Search (MIPS) is the primary tool used to design such algorithms. For N arms in Rd, Q-GLOC (Jun et al., 2017) achieves a high-probability regret bound of O(d 5 4 T) and per-round time complexity of O(d2N ρ log N) for some ρ (0, 1), where O hides polylogarithmic factors in T and d. Yang et al. (2022) considered the setting with arms addition (resp. addition and deletion), and proposed Sub-Elim (resp. Sub-TS), which has a high-probability regret bound of O(d T) (resp. O(d 3 2 T)) and per-round time complexity of N 1 Θ( 1 T 2 log2 T ) (resp. N 1 Θ( 1 T )). These results are applicable to our setting with d = K and N = |X|. Q-GLOC (Jun et al., 2017) applied to our setting has regret bound of O(K 5 4 T) and per-round time compleixty O(K2|X|ρ). 2A uniformly good algorithm has the expected regret R(T) = o(T α) hold for any α > 0. Matroid Semi-Bandits in Sublinear Time Sub-Elim (resp. Sub-TS) Yang et al. (2022) applied to our setting has regret bound of O(K T) (resp. O(K 3 2 has per-round time complexity of |X|1 Θ( 1 T 2 log T ) (resp. |X|1 Θ( 1 T )). These results have worse regret bounds than what we have obtained, and their per-round time complexity can be exponential in K. Matroid Semi-Bandits in Sublinear Time C. Membership Oracles for Different Matroids In this section, we discuss Tmember for the matroids shown in Section 2. For uniform matroid, the membership oracle is given x I and k [K]\ supp (x), and has to check whether | supp (x + ek)| D. Suppose the number n = | supp (x)| is maintained. Then, it takes O(1) time to check whether n + 1 D, and hence Tmember = O(1). For partition matroids, the membership oracle is given x I and k [K]\ supp (x), and has to check whether | supp (x + ek) Si| 1 for all i [D]. Suppose there is an integer array A of size K such that j SA[j] for each j [K], and suppose there is an integer array B of size D such that B[i] = P j supp(x) 1{j Si} for each i [D]. Then, to decide whether whether x + ek I, it only requires to check whether B[A[k]] + 1 1. This can be implemented in O(1) time, and thus Tmember = O(1). For graphical matroids, the membership oracle has to detect if there is a cycle. Using the union-find data structure, whether supp (x) {k} has a cycle can be detected in O(log K) time, so we have Tmember = O(log K). Refer to Section 4.6 in (Kleinberg and Tardos, 2006) for more detailed explanation. For transversal matroids, there is little discussion about its membership oracle. Here we present an implementation to answer a query (x, k) about whether x + ek I, where x I and k [K]\ supp (x). Suppose a maximum matching M on supp (x) V is maintained. Then, answering whether x + ek I is equivalent to checking whether an augmenting path on supp (x + ek) V from M can be found. Finding a augmentation path can be done by a breadth-first search (BFS) starting from k (see Section 17.2 in (Schrijver, 2003)), and it takes O(DK) time because there are at most K leaves in the BFS tree and the length of the path from k to each leaf is at most 2D. Thus, we have Tmember = O(DK). Matroid Semi-Bandits in Sublinear Time BIN1,𝑊 BIN2,𝑊 BIN𝑊 1,𝑊BIN𝑊,𝑊 BIN1,𝑊 1 BIN2,𝑊 1 BIN𝑊 1,𝑊 1 BIN𝑊,𝑊 1 BIN1,2 BIN2,2 BIN𝑊 1,2 BIN𝑊,2 BIN1,1 BIN2,1 BIN𝑊 1,1 BIN𝑊,1 𝛽lb 1 + 𝜂𝑊 1 Figure 1. Illustration of feature rounding. There are |W|2 bins, and features are assumed not to be in (the interior of) the shaded area. Each feature f k is rounded to its dominating point dom(f k), which is specified by a curved arrow. D. Omitted Proofs in Section 4 Proof of Lemma 4.6. By Eq. (8) and the optimality of x dom, for any base x, we have k supp(x dom) f k, q > 1 1 + η X k supp(x dom) dom(f k), q (by Eq. (8)) 1 (1 + η)2 X k supp(x) dom(f k), q (by optimality of x dom) 1 (1 + η)2 X k supp(x) f k, q (by Eq. (8)) k supp(x) f k, q . (as (1 + η)2 1 + ϵ) Proof of Lemma 4.7. The proof follows from the uniqueness of the maximum-weight base in the case of distinct weights; see, e.g., (Edmonds, 1971). Proof of Lemma 4.8. For a query q R2, let C R2 be a cell in arrangement of L whose closure contains q (which may not be uniquely determined). Then, there is a permutation π SK such that for any vector h H C, we have f π(i), h > f π(j), h whenever i < j. Since q is in the closure of C, it holds that f π(i), q f π(j), q for any i < j, implying the proof. Proof of Lemma 4.9. Since each cell in arrangement of L is a polyhedral cone generated by two lines of L that does not contain any other lines of L, there are only O(K2) cells, and each of their internal points can be found by using Algorithm 6, as desired. Matroid Semi-Bandits in Sublinear Time Figure 2. Illustration of characterization of representable permutations. There are three features f 1, f 2, f 3 on R2. Each dashed line denotes f if j for some i = j; each black bold line is orthogonal to some dashed line and intersects the origin. Such black bold lines generate six regions, each corresponding to a distinct permutation. For example, for any query q in the hatched area, it holds that f 1, q > f 2, q > f 3, q ; i.e., q represents a permutation π such that (π(1), π(2), π(3)) = (1, 2, 3). Algorithm 6 GENERATE-HITTING-SET. Input: K distinct features (f k)k [K]. let Θ ; for all k = k do let L be a unique line that is orthogonal to line f kf k and intersects 0; add the angle θ of L and θ to Θ; end let H ; for all neighboring (but distinct) θ1 and θ2 in Θ do let h cos( θ1+θ2 2 ), sin( θ1+θ2 2 ) be an internal point of a polyhedral cone generated by two half-lines whose angles are θ1 and θ2; add h to H; end return H; Proof of Theorem 4.4. The correctness of FIND-BASE is shown first. Given a query q R2 +, Algorithm 3 finds h H such that f k, h > f k , h implies f k, q f k , q due to Lemma 4.8. Calling Ah finds a (1 + η)-approximate maximum-weight base x of M with arm k s weight defined as dom(f k), h . Since a total order over [K] induced by arm weights dom(f k), h is consistent with that induced by arm weights dom(f k), q , by Lemma 4.7, x is also a (1 + η)-approximate maximum-weight base of M with arm k s weight defined as dom(f k), q . By Lemma 4.6, k supp(x ) f k, q 1 1 + ϵ X k supp(x) f k, q , (17) for any base x of M; namely, x is a (1 + ϵ)-approximate maximum-weight base of M with arm k s weight defined as f k, q , completing the correctness of FIND-BASE. Subsequently, we bound the time complexity of each subroutine as follows. INITIALIZE: Construction of BINq,r and domq,r for all q, r W completes in O(K + W 2) time. Then, a hitting set H for domq,r s and 1 1+η domq,r s of size |H| = O(W 4) can be constructed in poly (W) time due to Lemma 4.9. There will be |H| instances of algorithm A (with different arm weights), creating which takes O(W 4 Tinit(A; η)) time. FIND-BASE: Checking whether each h H and query q R2 + belong to (the closure of) the same cell in arrangement of Matroid Semi-Bandits in Sublinear Time V can be done in O(W 2) time by comparing the induced total orders. By brute-force search, a desired h can be found in O(W 6) time. Since calling Ah requires O(D) time, the entire time complexity is bounded by O( poly (W) + D). UPDATE-FEATURE: For |H| instances of A, a single arm s weight would be changed, each of which runs in Tupdate(A; η) time. Observe finally that W = max log1+η αlb ) + log( βub = O η 1 log αub = O ϵ 1 log αub where we used the fact that 1 log(1+η) < 1 η when η (0, 1), completing the proof. Matroid Semi-Bandits in Sublinear Time E. Proofs Related to Regret Analysis E.1. Proofs Related to the Bijection gt Lemma 5.2. There exists a bijection gt : supp (i ) supp (x(t)) such that (i) gt(j) = j for j supp (i ) supp (x(t)); (ii) for any j supp (i ) \ supp (x(t)), xgt(j)(t) = 1 = D dom(f gt(j)), h E dom(f j), h 1 + 1 3 logm T . Proof: Let η 1 3 logm T . The proof is inspired by Section 4.2 in (Kveton et al., 2014a), and several changes are made to deal with the usage of the dynamic(1 + η)-approximate maximum-weight basis algorithm in the FIND-BASE procedure. Let ξt : [D] supp (x(t)) be the ordering such that ξt(i) s arm weight D dom(f ξt(i)), h E is the i-th largest, where h H lies in the same cell as the query q = (1, λt) when invoking FIND-BASE procedure. Explicit construction of gt: We define gt(j) = ξt(π 1 t (j)), j supp (i ) , where the function πt : [D] supp (i ) is a bijection such that the following hold: (i) Pk 1 i=1 eξt(i) + eπt(k) I for all k [D] (ii) πt(k) = ξt(k) if ξt(k) supp (i ) supp (x(t)) The existence of πt is proved in Lemma E.1 and also by Lemma 1 of (Kveton et al., 2014a). Show (i) gt(j) = j for j supp (i ) supp (x(t)): Fix any j supp (i ) supp (x(t)). From the definition of πt, we have πt(j) = ξt(j) and hence gt(j) = ξt(π 1 t (j)) = ξt(ξ 1 t (j)) = j. Show (ii) xgt(j)(t) = 1 = D dom(f gt(j)), h E dom(f j),h 1+η : Fix any j supp (i ) \ supp (x). Let k = π 1 t (j). Observe that the bijection πt captures the situation that: The algorithm can choose πt(k) supp (i ) as the k-th element but instead chooses ξt(k) supp (x(t)). By the procedure of Algorithm 3 and Assumption 4.2, this happens when D dom(f ξt(k)), h E 1 1 + η D dom(f πt(k)), h E , and replacing k = π 1 t (j) completes the proof. Lemma E.1. Let x, i X, and ξ : [D] supp (x) be an arbitrary bijection. There exists a bijection π : [D] supp (i ) such that Pk 1 i=1 eξ(i) + eπ(k) I for all k [D]. Proof: This lemma is equivalent to Lemma 1 of (Kveton et al., 2014a). For reader s convenience, we provide a proof here. For k = D, consider PD 1 i=1 eξ(i) I (due to hereditary property), and i I. As the former has D 1 element while the latter has D elements, by augmentation property, there exists π(D) supp (i ) such that PD 1 i=1 eξ(i) + eπ(D) I. For the case when ξ(D) supp (i ) supp (x), we set π(D) = ξ(D). The proof is completed by repeating the following process for k = D 1, , 1. As Pk 1 i=1 eξ(i) I (due to hereditary property) has k 1 elements, and i PD i=k+1 eπ(i) I (due to hereditary property) has k elements, by augmentation property, there exists π(k) such that Pk 1 i=1 eξ(i) + eπ(k) I. If ξ(k) supp (i ) supp (x), we set π(k) = ξ(k). E.2. Lemmas Related to Regret Analysis of Algorithm 5 In this section, we fix a best action i argmaxx X µ, x and define j,k µj µk. Let {j}D j=1 be the permutation of supp (i ) such that µ1 µD. Define dk max{j supp (i ) : j,k > 0} and min mink/ supp(i ) dk,k. Matroid Semi-Bandits in Sublinear Time Lemma 5.3. Let ϵ < min b . Then, for any i supp (i ) and any j / supp (i ), µi µj > 0 = µi 1 + ϵ µj > 0. Proof: Fix i supp (i ) and j / supp (i ) such that µi µj > 0. We want the following to hold: µi 1 + ϵ µj > 0 µi (1 + ϵ)µj > 0 µi µj > ϵµj. As µi µj > ϵµj must hold for all such i and j, taking the minimum over all possible i and j on the left-hand side, and use the fact that µj b for all j on the right-hand side, we derive is the condition on ϵ to ensure µi µj > 0 = µi 1+ϵ µj > 0 holds for all i and j. Lemma 5.4. Let k / supp (i ) and j [dk]. For T > T0, j=1 j,k(I)j,k j=1 j,k T0 + 12(b a)2 dk,k log T ( µdk 1+log m T µk)2 . Proof: Recall (I)j,k = PT t=1 E h 1 n gt(j) = k, Nk(t) nj,k oi , where nj,k = max 6(b a)2 log T ( µj 1+log m T µk)2 , T0 First, we claim that: for any {aj}dk j=1 with a1 adk 0, j=1 aj(I)j,k a1n1,k + j=2 aj(nj,k nj 1,k). (19) Show Eq. (19): We show by induction. For the base case, we have a1(I)1,k + a2(I)2,k a1n1,k + a2(n2,k n1,k). (20) Eq. (20) is derived as follows. Since a1, a2 0 and (I)1,k + (I)2,k = t=1 E 1 gt(1) = k, Nk(t) n1,k + 1 gt(2) = k, Nk(t) n2,k max{n1,k, n2,k} = n2,k, therefore we can bound (I)2,k as (I)2,k n2,k (I)1,k, yields that: a1(I)1,k + a2(I)2,k (a1 a2)(I)1,k + 2,kn2,k. Then, since a1 a2 and (I)1,k n1,k, we derive a1(I)1,k + a2(I)2,k (a1 a2)n1,k + a2n2,k, which shows Eq. (20). Now, assume for any {bj}ℓ j=1 with b1 bℓ 0, the following j=1 bj(I)j,k b1n1,k + j=2 bj(nj,k nj 1,k) (21) holds for ℓ< dk. Matroid Semi-Bandits in Sublinear Time Fix any {aj}ℓ+1 j=1 with a1 aℓ+1 0. Consider Pℓ+1 j=1 aj(I)j,k. Since aj 0 for all j [ℓ+ 1] and j=1 (I)j,k = j=1 1 gt(j) = k, Nk(t) n1,k max j [ℓ+1] nj,k = nℓ+1,k, we can bound (I)ℓ+1,k as (I)ℓ+1,k nℓ+1,k Pℓ j=1(I)j,k, which results in: j=1 aj(I)j,k j=1 (aj aℓ+1)(I)j,k + aℓ+1nℓ+1,k. Since a1 aℓ+1 aℓ aℓ+1 0, using inductive hypothesis Eq. (21) with bj = aj aℓ+1 for all j [ℓ], we get j=1 aj(I)j,k (a1 aℓ+1)n1,k + j=2 (aj aℓ+1)(nj,k nj 1,k) + aℓ+1nℓ+1,k. j=2 aj(nj,k nj 1,k) aℓ+1 j=2 (nj,k nj 1,k) nℓ+1,k j=2 aj(nj,k nj 1,k) + aℓ+1(nℓ+1,k nℓ,k) j=2 aj(nj,k nj 1,k). Thus, Eq. (19) is proved by induction. Define ϵ log m T and j,k(ϵ) µj 1+ϵ µk. Using Eq. (19) with aj = j,k for j [dk] and recalling nj,k max n 6(b a)2 log T j,k(ϵ)2 , T0 o , we have j=1 j,k(I)j,k 1,kn1,k + j=2 j,k(nj,k nj 1,k) j=1 j,k T0 + 6(b a)2 log T 1,k 1,k(ϵ)2 + 1 j,k(ϵ)2 1 j 1,k(ϵ)2 We upper bound the last term by: 1,k 1,k(ϵ)2 + 1 j,k(ϵ)2 1 j 1,k(ϵ)2 j,k(ϵ) j+1,k(ϵ) ( j,k(ϵ))2 + dk,k dk,k(ϵ)2 j,k(ϵ) j+1,k(ϵ) j,k(ϵ) j+1,k(ϵ) + dk,k dk,k(ϵ)2 1 j+1,k(ϵ) 1 j,k(ϵ) + dk,k dk,k(ϵ)2 2 dk,k dk,k(ϵ)2 , where the first inequality is due to j,k(ϵ) j+1,k(ϵ), and the second upperbounds the telescoping series: 1 j+1,k(ϵ) 1 j,k(ϵ) = 1 dk,k(ϵ) 1 1,k(ϵ) 1 dk,k(ϵ) Matroid Semi-Bandits in Sublinear Time Hence, we derive an upper bound for the first part relevant to (I)j,k: j=1 j,k(I)j,k j=1 j,k T0 + 12(b a)2 dk,k log T Lemma 5.5. Let k / supp (i ) and j [dk]. For T > T0, Proof of Lemma 5.5: Let ϵ = 1 logm T . Recall t=1 E h 1 n gt(j) = k, Nk(t) > nj,k oi , where nj,k = max 6(b a)2 log T ( µj 1+log m T µk)2 , T0 First, we claim that gt(j) = k = uk(Nk(t 1), T) mins log t and Nj(t 1) [t 1], we further derive uk(Nk(t 1), T) uj(Nj(t 1), t) 1 + ϵ mins nj,k}. From Eq. (14), we derive t=nj,k+1 P h gt(j) = k, Nk(t 1) > nj,k i t=nj,k+1 P uk(Nk(t 1), T) mins B and t Tj,k} µk + 2λT Nk(t 1) > µj 1+ϵ and t Tj,k . The inclusion is because under the event Et,s {At < A t and Bt,s > B and t Tj,k}, we have Nk(t 1) = A t > At Bt,s > B = µj 1 + ϵ, where the first and last inequalities are due to the event {At < A t and Bt,s > B and t Tj,k}, and the second inequality is due to the event Et,s = {At Bt,s and t Tj,k}. Hence, we have the following inclusion: {At A t and t Tj,k} {Bt,s B and t Tj,k} {At < A t and Bt,s > B and t Tj,k} = {t Tj,k} {At Bt,s and t Tj,k} = Et,s. From union bound, P[Et,s] P h {At A t and t Tj,k} Et,s i + P h {Bt,s B and t Tj,k} Et,s i + P h {At < A t and Bt,s > B and t Tj,k} Et,s i Nk(t 1) µk(Nk(t 1)) and t Tj,k + P µj µj(s) + λt s and t Tj,k Nk(t 1) > µj 1 + ϵ and t Tj,k In Eq. (23), recall λt = p 1.5(b a)2 log t and observe that the last term 1.5(b a)2 log T Nk(t 1) µj 1 + ϵ and t Tj,k 1.5(b a)2 log T nj,k + 1 µj 1 + ϵ where the inequality is because t Tj,k implies Nk(t 1) nj,k + 1, and the equality is because nj,k 6(b a)2 log T ( µj 1+ϵ µk)2 = 6(b a)2 log T nj,k + 1 < µj Matroid Semi-Bandits in Sublinear Time and also we have µj 1+ϵ µk > 0 which is ensured by Lemma 5.3 as T > T0. Finally, from Eq. (22) and Eq. (23), µk(Nk(t 1)) µk + 1.5(b a)2 log T Nk(t 1) and t Tj,k 1.5(b a)2 log t s and t Tj,k µk(t 1) µk + 1.5(b a)2 log T 1.5(b a)2 log t e 3 log T + e 3 log t , where the second inequality is because {Nk(t 1)}t Tj,k is strictly increasing (as Nk(t) = Nk(t 1) + 1 when gt( j) = k) and thus is a subsequence of {nj,k + 1, , T}, and the last inequality is due to an application of Hoeffding s inequality (Lemma E.2) with s = p 1.5(t 1)(b a)2 log T and n = t 1 to bound the first term and with s = p 1.5s(b a)2 log t and n = s to bound the second term. The proof is completed by evaluating s 0, i=1 (Xi E[Xi]) s